DPX: What Not To Do
Thank You, Next
We got our first set of dpx files at the Hirshhorn back in December. I was really excited about developing a workflow for processing the files, especially in light of MediaArea's new RAWcooked tool. Sure, it was a lot of data, but I was ready. How bad could it be? And so began a two month process full of frustration and disappointment.
The impetus for creating these film scans was to make an access copy of a 16mm anamorphic film in our collection. Admittedly, this is a little bit of an excuse. I did not have any experience processing a film scan, and the museum did not have an established workflow for processing film scans, so this seemed like a good opportunity to prioritize some workflow development. With this in mind, we opted to scan a second work in our collection as well: a 35mm anamorphic color positive print.
We brought a release print of the 16mm film and the 35mm anamorphic film to the fabulous Colorlab to have them scanned on their Lasergraphics ScanStation. Colorlab offers film processing, printing, and scanning, as well as videotape reformatting. We had a great experience there and we will definitely be back.
We chose to “overscan” the film to expose the soundtrack and perforations of the film in the digital copies. I have an affinity for scans that include this information, as I mentioned in my home movie day blog post, but for the museum’s specific use case, I was unsure. I tweeted about this before making the decision, and I was surprised by the strong support overscans received.
I'll admit that my initial tweet lacks nuance (not a quality twitter is known for), and that lack of nuance was a little embarrassing to me as the replies flooded in. But in a way, I think the ambivalence that the tweet suggests got people riled up and maybe inspired more engagement than I would have received otherwise. Obviously the information outside of a film projector's gate has "real use." I encourage people to check out the replies to that tweet; there's lots of good insight from leaders in the field of moving image archiving and home movies, and from filmmakers as well.
Ultimately, I have mixed feelings about our decision to overscan these specific films. One of the films we had scanned, the 16mm print, is exhibited on film - displayed on a film projector in the gallery, with the exhibition print fed to the projector through a looper. That 16mm film will be preserved filmically. We have the internegative and a color match print, which we would use to create future 16mm exhibition prints. Since this film was scanned to create an access copy, an overscan makes perfect sense: the access copy allows us to assess the object in the collection while exposing the actual object to as little risk as possible. However, the other film we had scanned, a 35mm anamorphic positive, is more commonly exhibited as digital video. The film has little evidentiary value as an artifact. It was made by a vendor hired by the artist and laser printed - just one output from a digital video editing project, delivered along with an HDCAM-SR videotape. Because the work is essentially born digital, with 35mm serving only as a potential exhibition format, scanning the soundtrack and perforations spends valuable resolution on information that has little value to us. Were we to use the film scan to create a new print (we won't), we would need to rescan, and if we were to make a digital video exhibition copy from the 4k scan (we might), we would need to crop and resize the video significantly. I'm beginning to feel that the transfer of the HDCAM-SR that we created is our best preservation element of the work. The film scans might have been, had we chosen not to overscan this print.
Mistake #1 - Look Before You Leap
We received the 4k resolution dpx files from Colorlab on two separate hard drives, just under 8 TB of data. Because both films are anamorphic, we received two sequences for each film: a “raw” scan, showing the image just as it was on the print, and a “desqueezed” sequence, a digitally altered version of the scan mimicking the stretching effect of an anamorphic lens.
Fueled by anxiety, I was growing increasingly worried about only having one copy of these four dpx sequences stored on external hard drives. "What if something were to happen? I must get these files backed up!" I was also very excited to start using Kieran O'Leary's IFI scripts, a suite of python scripts for processing archival video adhering to digital preservation standards (Kieran's outstanding work for the Irish Film Institute was recently recognized by the Digital Preservation Coalition). I have been aware of Kieran's work for a few years, but, not knowing much of anything about python, I had been intimidated to dive in. Thankfully, developer-archivist Joanna White took the time to write up her experience deploying the IFI scripts. Joanna's detailed and unimposing blog posts lowered the barrier to entry for me and helped me identify a script I could start with: copyit.py. If you're hoping to try out the IFI scripts, I suggest you do what I did and read Joanna's post, which explains everything from how to install python using homebrew to what you should expect to see on your command prompt when the scripts are working right. So concludes the "things that I did, which you should also do" section of this post.
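For anyone curious, the basic invocation is refreshingly simple. This is just a sketch with made-up paths, and I'm going from memory on the arguments (check the ifiscripts documentation), but as I understand it copyit.py takes a source and a destination, hashes the source, copies it, and verifies the copy at the destination:

python3 copyit.py /Volumes/source_drive/dpx_scans /Volumes/drobo/dpx_scans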
Here's where I messed up - I should have started small. After a few quick tests to make sure I had copyit.py working correctly (I did need to install one dependency, a python module called lxml, easily retrieved using pip), I dove in and started moving huge amounts of data. Instead of restricting my first use of copyit.py to a single dpx sequence (or less), I tried to copy almost 6 TB at a time. The script worked well: it created a manifest of all of the data overnight, and got to copying it to our DroBo RAID. The script continued to run successfully all day while I was at work. I left it to run over the weekend, but when I came back on Monday the DroBo had turned itself off! This had never happened before. Flipping the rocker switch on the back of the drive wouldn't turn the drive back on, but unplugging the RAID and plugging it back in allowed me to reboot it. There are power fluctuations in our building, so I assumed that was all it was and decided to move on.
I already had a checksum manifest of the files at their source. To determine if any of the files were corrupted during transfer to the DroBo, I created a checksum manifest of the files that had made it to their destination. Fortunately, only the last four dpx files on the destination drive had checksum mismatches. My assumption is that those files were simply incomplete, maybe in the process of being copied when the drive failed? Regardless, it was a relief that the other files, that had been transferring for hours and hours, were not corrupted or useless.
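If you ever need to make that kind of comparison yourself, it can be as simple as two manifests and a diff. A rough sketch, with placeholder paths:

# manifest of the source (made before the transfer) and of the destination
md5deep -r -b /Volumes/source_drive/dpx_sequence > source_manifest.txt
md5deep -r -b /Volumes/drobo/dpx_sequence > destination_manifest.txt
# sort both so the lines are in the same order, then diff to surface any mismatches
diff <(sort source_manifest.txt) <(sort destination_manifest.txt)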
I had an off-ramp here, where I could have reconsidered my approach, and maybe even tried something else. But no! I just plunged ahead. The drive failing over the weekend only increased my anxiety, and giving in to that anxiety, I pushed forward without a second thought. I copied the remaining files over to the DroBo RAID manually, ran checksums on those files, and compared them to the originals, ensuring that they were a bit-for-bit match with the dpx sequences stored on the external drive. Six days have now passed since I began copying files off of the external drive to the RAID.
Mistake #2 - Space is Finite
I have now copied around 6 TB of data from an external hard drive to our DroBo RAID. We now have two copies of this data. Hooray? But now I'm running out of room on the DroBo. After digging around, I found some files that were already stored in our Digital Asset Management System (DAMS) and backed up onto our backup RAID, just in case, and cleared them off the DroBo. Now I have about 4.5 TB free on the DroBo RAID.
I was excited to try out the new and exciting tool from MediaArea, RAWcooked. RAWcooked takes raw audiovisual data like dpx sequences and wav files, and losslessly encodes them into a MKV video file. This free, open source command line tool is very easy to use, has straightforward syntax, and works great - when, y'know, you're doing it right. Plus, the team that put RAWcooked together was really helpful (and patient) when I asked for help troubleshooting. There was, sadly, a lot of troubleshooting, but, as I'll discuss, this has to do with me, and the system I'm using, and not RAWcooked.
I ran RAWcooked, using the command "rawcooked --check 1", on a directory containing one of the dpx sequences created from the 16mm anamorphic film print and the WAV file created from the print's optical soundtrack. The tool begins with an "analysis" stage before moving on to an ffmpeg process encoding the files. I ran the command around 11:00 AM that morning, and the first stage, "analysis," finished around 4:00 PM. This is longer than it should take on a well resourced machine like ours, even for a 45 minute 4k dpx sequence, but I didn't know that at the time. The encoding finally finished the next day around 11:00 AM.
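For the record, the command really is that bare. Something like the line below, pointed at the folder holding the image sequence and its WAV (the path is illustrative, and --check 1 is the flag as I ran it at the time; check the current RAWcooked documentation for the latest syntax):

rawcooked --check 1 /Volumes/drobo/16mm_raw_scan/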
I knew from Joanna’s blog post that a surefire way to demonstrate that the RAWcooked created MKV was losslessly encoded was to run RAWcooked on the MKV and reverse the process. This logic is built into the tool, you simply run RAWcooked on the MKV, and it outputs a dpx sequence.
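So the reversibility test is just pointing RAWcooked back at the MKV, roughly like this (file name illustrative; where the restored dpx files land may depend on your version and settings):

# decode the MKV back into a dpx sequence and wav, ready to be checksummed
rawcooked 16mm_raw_scan.mkv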
Uh-oh. Running out of room again. I now need space to store the original dpx sequences, the MKV I just made, and the dpx sequence I’m making from the MKV to prove reversibility (a word, it turns out, I have a great deal of trouble spelling). Maybe I should pause, take a step back, and reassess? Nah!
I left RAWcooked running on the MKV overnight, creating the reversibility dpx sequence. When I came back in, the DroBo had turned itself off again! Same story, had to unplug the RAID and plug it back in again to get it to turn on. This issue did not present again, and still hasn’t several months later, so I’m guessing the load on the RAID was a factor. Let me know if you have any insight?
To add insult to injury, the process had been running right up until I came in that morning, only failing around 8:45 AM. I was curious whether the process had proved reversible thus far, and so I created a checksum manifest of the successfully created dpx files. I was even toying with the idea that, if all of the files created were a match, maybe I could forgo a full reversibility test. Most of the checksums did match, but the match was not perfect; I think many of the dpx files were incomplete when the drive failed. Regardless, I wanted a full reversibility test, so I deleted that set of dpx files and ran RAWcooked on the MKV again. This process completed successfully after about 24 hours. I again created a checksum manifest, and it was an exact match. Including weekends and holidays, it had now been 21 days since I started moving files off of the external drive. I have three more dpx sequences to go.
I deleted the dpx sequence I had losslessly encoded in to a MKV file, and deleted the dpx files that I had created to prove reversibility. With 3.5 TB of room available, I set out to start in on the next dpx sequence.
Mistake #3 - Ignoring Problems
After the hard fought creation of an MKV video file, I decided I wanted to try another of the IFI scripts. Again, I leaned on Joanna White's documentation for guidance, and confidence. Kieran O'Leary's seq2ffv1.py contains more or less the same workflow I was employing on the first dpx sequence. This script automates the process of losslessly encoding a dpx sequence to an ffv1 encoded MKV, then reversing the process, and verifying the result using md5 checksums. The whole process is documented through log files, which the script generates as it goes.
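I won't reproduce Kieran's documentation here, but the invocation is in the same spirit as the rest of the suite. A minimal sketch, assuming the script just takes the directory holding the dpx sequence as its argument (check the ifiscripts documentation for the real arguments and options):

# runs the whole encode / decode / verify workflow and writes log files as it goes
python3 seq2ffv1.py /Volumes/drobo/16mm_desqueezed_scan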
I ran the python script on the other dpx sequence, created from the same anamorphic 16mm film print, and the same WAV file (the image information is different - stretched - but the audio is the same). I did at least think to run a test with this script, but it turns out the script has a built-in test that it runs on the first few files in the dpx sequence before diving into the process (good thinking, Kieran).
This ran for 7 days. Yeah - might have been a clue something was wrong. To be fair, there are a lot of time consuming processes contained within the script. The script needs to create checksums of thousands of dpx files, then run RAWcooked, create a checksum of the RAWcooked created MKV, then run mediainfo and mediatrace on the MKV, then run RAWcooked on the MKV to create a reversibility dpx sequence, then create a checksum manifest of those thousands of dpx files… But yeah, probably shouldn't have taken 7 days. Red flag.
Sadly, the script did not recognize the original dpx manifest and the reversibility dpx manifest as an exact match. It did not retain the reversibility dpx sequence, or the manifest, so I'm not exactly sure what happened here, but there was some sort of error. A quick test, running RAWcooked on the seq2ffv1.py created MKV for a few minutes, demonstrated that it wasn't a catastrophic error: the dpx files RAWcooked was creating did match the checksums of the original dpx sequence. I decided to retain the seq2ffv1.py created MKV and try to create a reversibility dpx sequence from the MKV using RAWcooked.
I ran RAWcooked on the MKV around 4:00 PM on a Tuesday, and for the first hour it seemed to be doing fine, reporting speeds of 16-20 MiB/s and ~0.07x. Unfortunately, by the next morning it had slowed to 1.1 MiB/s and a speed of 0.00x. The following day it had slowed even more, going from 46.45% at 9:25 AM to only 49.21% at 4:09 PM. Since some progress, even a small amount, was made, I left it running over a long weekend. It was only at 65.54% at 5:50 PM on Friday evening. Surprisingly, it had finished by Tuesday morning. I created a checksum manifest from the reversibility dpx sequence, and I remember being suspicious about how quickly it came back complete.
Comparing the md5 checksums of the original dpx files with the ones created by RAWcooked to demonstrate reversibility showed that many of the reversibility dpx files contained no data. They all had the same checksum. Much of the sequence does match, but large chunks contain these "broken" dpx files, which I assume don't hold any data at all. Looking at this dpx sequence in DaVinci Resolve confirmed this: the segment made up of files with erroneous checksums is unrenderable by the software, displaying "Media Offline" in the playback window.
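If you ever need to spot that kind of damage quickly, look for checksums that repeat across the manifest. Something like this, assuming an md5deep-style manifest with the hash as the first field:

# count how many times each hash appears; a healthy dpx sequence should have no repeats
cut -d' ' -f1 reversibility_manifest.txt | sort | uniq -c | sort -rn | head

Hundreds of identical hashes at the top of that list is a pretty unambiguous sign that those files are empty or truncated.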
Success #1 - Troubleshooting With a Little Help From My Friends
At this point I’m pretty disheartened. Over a month has gone by since I started working on this, and I only have one fully completed MKV file to show for it, and a second which hasn’t completed the reversibility QA (Quality Assurance) process. I turned to twitter and asked if anyone else had encountered similar issues or knew of a solution.
Obviously, this is where things start to turn around. Twitter is bad at showing threading, and the various offshoots a single tweet can have, so I’ll try to highlight a few of the very helpful solutions that came out of my dialog with the #digipres community.
Archivist, developer, and technologist Ashley Blewer is right (as per usual). Raising this issue on GitHub yielded a positive exchange with Digital Media Analysis Specialist Jérôme Martinez and Stephen McConnachie, Head of Data and Digital Preservation at the BFI National Archive. While the feature I was hoping for does not exist yet, it may in the future. A great example of how working with open source software encourages dialog and helps to build community among users.
MediaArea, the team developing the tool, does offer licenses to users for additional features (less common color spaces, for instance). Funding RAWcooked comes with other benefits as well, such as the ability to request features. I’m currently advocating for the museum to purchase a license. My proposal was positively received, but the federal government moves slowly. Hopefully we will be helping to fund RAWcooked soon. If you are using RAWcooked, please consider purchasing a license, and/or sponsoring one of the open tickets on GitHub.
Speed comparison! Why didn’t I think of this earlier! Big shout out to Lead Technical Analyst at the Met Museum, Milo Thiesen, for helping me establish a baseline for how fast the RAWcooked process should be running. I expected the process to take a long time - it’s a lot of computing - but I didn’t have anything to compare it to.
Nicole Martin, Human Rights Watch Media Archives Manager and all around awesome person, helped me ID some good troubleshooting techniques, and it was cathartic to share some frustrations about this with her and Milo.
Ultimately, Jérôme's suggestion of testing files on my local hard drive yielded results. It turns out the DroBo was the culprit! After running RAWcooked on a set of files on the DroBo, my local hard drive, and a G-Technology RAID, the difference was pretty obvious:
Metrics | Analysis CPU | Analysis Time | FFmpeg CPU | FFmpeg rate | FFmpeg time | Total encode time | Decode CPU | Decode rate | Total decode time |
Internal Drive | 30% CPU | 7 minutes | 750%, 32 threads | 3,000,000 kbit/s, speed=0.135x | 11 minutes | 18 minutes | 500%, 259 threads | 30 MiB/s - 60 MiB/s, 0.12x-0.08x | 16 minutes |
DroBo | 3%-5% CPU | 22 minutes | 160%-336%, 32 threads | 2,675,317.5 kbit/s, speed=0.0417x | 32 minutes | 54 minutes | 112%-200%, 259 threads | 10 MiB/s - 26 MiB/s, 0.02x-0.08x | 58 minutes |
G-Technology RAID | 25%-45% CPU | 4 minutes | 670%-750% CPU, 32 threads | 2,608,529.5 kbit/s, speed=0.137x | 11 minutes | 15 minutes | 439%-701% CPU, 259 threads | 38 MiB/s - 71 MiB/s, 0.12x-0.17x | 12 minutes |
There’s a clear outlier here. The DroBo was taking significantly longer at every stage of the process, and taking less advantage of the CPU.
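If you want to run this kind of comparison yourself, it doesn't need to be elaborate. The CPU and thread figures above are the sort of thing Activity Monitor reports, the rates come straight from the RAWcooked/ffmpeg console output, and a rough per-storage baseline is as simple as timing the same short test sequence in each location (paths illustrative):

# same small test sequence, three storage locations, compare the wall-clock times
time rawcooked --check 1 ~/Desktop/test_sequence/
time rawcooked --check 1 /Volumes/drobo/test_sequence/
time rawcooked --check 1 /Volumes/gtech_raid/test_sequence/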
Mistake #4 - Race to the Finish
This was certainly a turning point in the project. Running RAWcooked on the DroBo was untenable, and following that realization, solutions seemed to flow freely. The G-Technology RAID mentioned in the tests above was purchased as a backup drive, storing backups from multiple devices on corresponding partitions. After pausing our regular backup schedule and moving some things around, I was able to create a new partition that could hold one of the dpx sequences, the MKV, and the reversibility dpx sequence, plus a little headroom to be safe (the partition was approximately 5 TB). Once I moved my process to the G-Technology RAID, I was able to create a dpx sequence proving lossless reversibility from my RAWcooked-created MKV video file within one day. That process had previously taken seven days on the DroBo, and ultimately failed.
With a clear path forward, I had another opportunity to pause and reflect, which I wish I had taken advantage of. Again, my anxiety was in the driver’s seat, and I couldn’t be bothered to pull over and switch places. Now that I had repurposed the RAID, I couldn’t run backups regularly, which felt like a security blanket had been taken away from me. Not to mention, at this point I felt so sick of this process! I had begun to turn that resentment inward. Instead of focusing on the positive accomplishment of finding the solution, I passed judgement on myself for not finding it sooner. All of these pressures drove me forward to attempt to complete the project, without much pause for contemplation.
Why was the DroBo performing so poorly? After learning that the RAID configuration of the DroBo was not fixed, but instead always changing when new data was added, I jumped to a conclusion: The extra processing the DroBo would perform in order to shift around redundant data and free space must be slowing everything down. My guess is that, when new data is added to the DroBo, the RAID needs to calculate how much redundancy it can afford. Maybe this extra calculation was a drag on the system? I still have no idea. I did not take the time to test my hypothesis, because I couldn’t wait to get these dpx files the hell off my desk, and into Smithsonian DAMS, where I don’t have to worry about them anymore.
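Had I wanted to test it, even a crude benchmark on each drive would have been a start. A sketch, nothing rigorous - this writes and then reads back a roughly 1 GB temporary file and reports the throughput (macOS's dd uses lowercase size suffixes):

# crude sequential write test on the DroBo
dd if=/dev/zero of=/Volumes/drobo/ddtest.tmp bs=1m count=1024
# crude sequential read test, then clean up
dd if=/Volumes/drobo/ddtest.tmp of=/dev/null bs=1m
rm /Volumes/drobo/ddtest.tmp

It wouldn't tell the whole story - a dpx workload is tens of thousands of smallish files, not one big stream - but it would at least show whether raw throughput was the bottleneck.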
My rush to be rid of this task meant I forgot to test out the Library of Congress' new dpx metadata editing software! I saw Kate Murray, Digital Projects Coordinator at the Library of Congress, and Chris Lacinak, AVP Founder, present on embARC (Metadata Embedded for Archival Content) at the Association of Moving Image Archivists (AMIA) Annual Conference in Baltimore in 2019, and I was looking forward to giving it a test drive. The open source software, which is currently still in beta, allows the user to batch edit metadata in the dpx header, streamlining the process of implementing FADGI's Guidelines for Embedded Metadata within DPX File Headers for Digitized Motion Picture Film as well as required SMPTE 268 metadata rules. With a clear processing workflow now ironed out, I could have taken this chance to test out the tool, but by then I had the blinders on and wouldn't allow myself another diversion.
I don't want to be too hard on myself here, though. My desire to hear the sweet sound of the dpx files being permanently erased from our local storage did motivate me to learn a new tool, one that allowed me to automate a sequence of tasks. Enter: cron.
Success #2 - Master of Time
To review, I have four dpx sequences total. A “raw” scan of the 16mm anamorphic film, a “desqueezed” scan of the same 16mm film, a “raw” scan of the 35mm anamorphic film, and a corresponding “desqueezed” version of the 35mm print. At this point in the project I had losslessly compressed the “desqueezed” 16mm anamorphic dpx sequence into a MKV video file using RAWcooked, and successfully converted that MKV back into a dpx sequence to confirm lossless compression. I had begun to process the next dpx sequence in my queue, the “raw” scan of the 16mm anamorphic print, by creating a MKV from that dpx sequence. I got stuck trying to “decode” that MKV back into a dpx sequence. Now that I have identified the issue with the DroBo and found a new storage location to process the files, I just need to run RAWcooked on the MKV to create a dpx sequence that, if the MKV was encoded correctly, will be a bit-for-bit match of the original dpx sequence.
I started running RAWcooked on the MKV, and everything was looking good; the terminal output displayed speeds of 35-70 MiB/s and 0.11x-0.15x. Activity Monitor was showing the CPU at 505%-743%, using 259 threads. The next step in my workflow was to calculate a checksum for each of these "reversibility" dpx files, and then compare them to the checksums of the original dpx sequence. If the checksums matched, then I could delete both dpx sequences and just keep the MKV, since the conversion process was proven to be mathematically lossless and completely reversible. Makes sense? I'm confused too.
Anyway, the important part is: I’m running RAWcooked on this MKV, and it is making lots of dpx files, which, once created, I’m going to want to checksum. I can’t start on that second step till the first one is finished, but the first step is taking forever. I don’t want to sleep under my desk (my coworkers don’t want that either), and I don’t want to wait till I get back the next day to start in on the second step. I can schedule the second task to start in the middle of the night, when I know the first task will be done, but still hours before I’ll be at my desk.
I used this tutorial by Ole Michelsen to learn how to schedule tasks with crontab: https://ole.michelsen.dk/blog/schedule-jobs-with-crontab-on-mac-osx.html
I started small, scheduling a single-line bash script containing one md5deep command to run in the middle of the night. This wasn’t too tricky. The bash script I made looked something like this:
#!/bin/sh
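# cron runs jobs with a very minimal environment, so spell out a PATH that
# includes /usr/local/bin, where homebrew puts tools like md5deep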
PATH=/usr/local/bin:/usr/local/sbin:~/bin:/usr/bin:/bin:/usr/sbin:/sbin
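# -r recurses through the directory; -b "bare" mode strips the leading paths from the output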
md5deep -r -b [directory_full_of_dpx_files] > checksum_manifest.txt
After saving this text into a file with a .sh file extension, then modifying the permissions by running "chmod +x" on the .sh file, I scheduled it for 1:00 AM following the directions in the tutorial mentioned above. Full disclosure, I didn't have the "PATH=" stuff in there when I first tested the bash script, and just hastily googled it. It's super messy? Might be needlessly long. As stated earlier, by this point I was full steam ahead; if it works, it works.
This worked! Went off without a hitch. I was emboldened by success and decided to double down. I started running RAWcooked on my next dpx sequence (three out of four). It looked like it would be done well before 8:00 PM that night, but I had no interest in staying that late. So, I set up crontab to run three scripts sequentially by placing two ampersands between each ".sh" file. Starting at eight o'clock that night, the computer would create a checksum of the RAWcooked MKV; once finished, the computer would move on to create an mp4 from the MKV, and finally create a "reversibility" dpx sequence from the MKV. It looked something like this:
0 20 * * * ./md5deep_on_MKV.sh && ./MKV_into_mp4.sh && ./rawcooked_on_MKV.sh
This worked too! When I came in the next day the “reversibility” dpx sequence was done. Now I’m mad with power. I decide to try to sequence the whole workflow in scripts for the last set of dpx files.
First, a shell script would create a checksum manifest of the dpx sequence, then a second script would rsync the files to the RAID, a third would create another checksum manifest to verify successful transfer, script number four would run RAWcooked on the dpx sequence, script five would create a checksum for the MKV, six would run ffmpeg to create an mp4 file, the seventh script in the set would run RAWcooked on the MKV to create a "reversibility" dpx sequence, and Shell Script 8: Fate of the Furious would create a checksum manifest of the reversibility dpx sequence. Very ambitious.
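Spelled out as a single crontab entry, the chain looked something like this - the script names and start time are stand-ins, and mine were nowhere near this tidy:

# kick off the whole chain in the evening; each script only runs if the previous one succeeded
0 20 * * * ./01_md5_dpx.sh && ./02_rsync_to_raid.sh && ./03_md5_raid_copy.sh && ./04_rawcooked_dpx.sh && ./05_md5_mkv.sh && ./06_ffmpeg_mp4.sh && ./07_rawcooked_mkv.sh && ./08_md5_reversibility.sh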
Yeah, it broke on the second script, haha. I included a "/" on the end of the directory I was rsyncing, which tells rsync to copy the contents of a directory instead of the directory itself. All of the other scripts were coded to run on the directory, which wasn't there, so they failed. Not the end of the world, but an inconvenience nonetheless. Certainly a pitfall of trying to automate a large sequence of tasks as a one-off.
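For anyone who hasn't been bitten by this before, the difference is just the trailing slash (paths illustrative):

# copies the directory itself, so you end up with /Volumes/raid/dpx_sequence/
rsync -av /Volumes/drobo/dpx_sequence /Volumes/raid/
# copies only the *contents* of the directory, dumping the files straight into /Volumes/raid/
rsync -av /Volumes/drobo/dpx_sequence/ /Volumes/raid/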
When I came in the next day, I cleaned up the directory, confirmed that the file transfer was successful by creating a checksum manifest, and started RAWcooked that afternoon. I left the last four scripts to run overnight; they finished successfully around 5:00 AM the next morning. Phew.
Conclusions
I wish people would talk about failure more.
It feels great to present success. And it should! We should be proud of our accomplishments, and proud of our colleagues’ accomplishments, as well. Self-promotion carries all kinds of negative connotations, which I think are unwarranted. The completion of a project is a great occasion for celebration, reflection and rest. I think a conference presentation sometimes takes on this role, and that is one of my favorite “flavors” of conference presentation.
But I think it would be nice to see more occasions for celebration, reflection, and rest following failures, the way there are with successes. The stalled-out project, the ineffective treatment, the too-ambitious policy - previous efforts we have to work around and maybe avoid talking about.
I did not write up a list of mistakes I made during this project to beat myself up, or to publicly shame myself. I hope this insight into potential pitfalls will be helpful for other practitioners, and that some troubleshooting techniques can be extrapolated from sharing my (not always ideal) practice. But my main objective in writing this post has been to emphasize the need for patience.
It would be silly for me to expect that things will always go smoothly, or that I will somehow respond exactly as I ought to once mistakes are made. Yet, I often fall into the trap of leveling this expectation on myself. Then, upon my realization that, contrary to my expectations, things have not magically worked out, I freak out. My reactions are fueled by anxiety and stress. Stress certainly inspires additional exertion, but few problems I encounter in my work are efficiently solved through “brute force.” In this way, I think more reasonable expectations can help to reduce anxiety.
We can all help others to expect mistakes by sharing our own. My hope is that the more we can do to normalize mistakes, the less frightening it will be to share them. The embarrassment we feel around “bad” choices is unfounded. We learn from reviewing our process, and from interrogating our decisions. We are all figuring this out as we go along, and there is no need to pretend otherwise. Through sharing our mistakes we can learn what not to do, but hopefully, we can learn that sharing our mistakes is what we should do, too.
P.S: I've seen some great examples of talking about failure, and I was inspired by those examples to try to highlight challenges I grapple with regularly too. Shout out to Rebecca Fraimow, Julia Kim, Kristin MacDonough, and Shira Peltzman for their 2015 AMIA conference panel presentation, "Mistakes Were Made: Lessons in Trial and Error from NDSR" (five years ago! Time is moving too fast!), and to Walter Forsberg, Julia Kim, Blake McDowell, and Crystal Sanchez for their AMIA presentation the following year, "FAIL: Learning from Past Mistakes in Ingest Workflows."
Let me know of your favorite failure sharing platforms in the comments below.