A few days ago, I drafted (but had not yet published) a post about using MD5 for validating or authenticating evidence in digital forensics. MD5 has had known security problems for twenty years, but it is still used in forensics, although the trend has been toward SHA-1 (which has some problems of its own) and SHA-2.
After drafting the post, I discovered that the Scientific Working Group on Digital Evidence has released a draft endorsing the use of MD5 and SHA-1. I wrote in to share my concerns, but I also reached out to some cryptographers via Twitter. Dr. Marc Stevens, a cryptographer known for his expertise in attacking MD5 and other hash functions, released a series of tweets that was even more critical of MD5 than I anticipated and that was incredibly damning for any forensic expert who continues to rely on MD5.
First, I'll share my original thoughts in abbreviated form. Then I'll share some highlights from Dr. Stevens' tweets. If you're interested in Dr. Stevens' views, consider reading all of what he had to say on Twitter and in his scientific work. If I have misrepresented or misunderstood his views in any way, I apologize.
When we image and process digital evidence, we use a hash function to fingerprint that data so that we can compare it to other known files and so that, later on, we can verify that the evidence hasn't changed. SHA-1 is probably the most common hash function used in forensics and there is some support for SHA-256, which is what we should be moving toward.
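To make the fingerprinting step concrete, here is a minimal sketch in Python using the standard `hashlib` module. The function names `hash_file` and `verify_file` are my own, chosen for illustration; real forensic suites do this for you, and this is not meant to mirror any particular tool.

```python
import hashlib


def hash_file(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of the file at `path`, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a large disk image never has to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_hex, algorithm="sha256"):
    """Return True if the file's current digest matches the recorded one."""
    return hash_file(path, algorithm) == expected_hex.strip().lower()
```

The idea is to record the output of `hash_file` when the evidence is acquired and to call `verify_file` with that recorded value any time you need to show the data is unchanged. Defaulting to SHA-256 is a choice I made for the sketch; `hashlib.new` accepts other algorithm names as well.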
In order to be considered secure, a hash function should be strong against two attacks: collisions and preimages. A collision occurs when we find two "messages" (files, strings, whatever) that have the same hash value. To be secure, it should be hard to find two files that have the same hash. Note that in this scenario we are allowed to pick both messages. If we can find any two that match, we have a collision. A preimage is a little different because one of the messages has already been picked. To find a preimage, we have to find a second message that has the same hash value. The distinction is like the difference between trying to find two people in a room with the same birthday (anybody can match anybody) versus trying to find somebody in a room with your birthday.
MD5 is considered a weak hash function because there are practical attacks for finding collisions. There aren't any practical attacks for finding preimages for MD5.
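The birthday analogy can be put into numbers with a short Python sketch. The two function names are hypothetical, and the model assumes 365 equally likely birthdays, which is a simplification.

```python
def p_any_shared_birthday(people, days=365):
    """Chance that at least two of `people` share a birthday (collision-style)."""
    p_all_distinct = 1.0
    for i in range(people):
        # The i-th person must avoid the i birthdays already taken.
        p_all_distinct *= max(days - i, 0) / days
    return 1.0 - p_all_distinct


def p_someone_matches_yours(people, days=365):
    """Chance that at least one of `people` has your birthday (preimage-style)."""
    return 1.0 - ((days - 1) / days) ** people


if __name__ == "__main__":
    for n in (23, 50, 253):
        print(n, round(p_any_shared_birthday(n), 3), round(p_someone_matches_yours(n), 3))
```

With 23 people the chance that some pair matches is already about one half, while the chance that anyone matches your particular birthday is only around 6%. The same gap shows up in the generic brute-force cost for an n-bit hash: roughly 2^(n/2) attempts for a collision versus roughly 2^n for a preimage, which for MD5's 128-bit output is about 2^64 versus 2^128. The practical collision attacks on MD5 are far cheaper than that generic bound.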
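If someone hands you two files and claims they are an MD5 collision, the claim is easy to check for yourself. The sketch below is one way to do it in Python; `check_collision_claim` and `is_md5_collision` are names I made up, and it reads both files fully into memory, so it assumes they are small (pictures, documents) rather than disk images.

```python
import hashlib


def read_bytes(path):
    """Return the full contents of the file at `path`."""
    with open(path, "rb") as handle:
        return handle.read()


def check_collision_claim(path_a, path_b):
    """Compare two files byte-for-byte and by their MD5 and SHA-256 digests."""
    data_a, data_b = read_bytes(path_a), read_bytes(path_b)
    return {
        "contents_differ": data_a != data_b,
        "md5_match": hashlib.md5(data_a).digest() == hashlib.md5(data_b).digest(),
        "sha256_match": hashlib.sha256(data_a).digest() == hashlib.sha256(data_b).digest(),
    }


def is_md5_collision(path_a, path_b):
    """True only when the files differ in content but share an MD5 digest."""
    result = check_collision_claim(path_a, path_b)
    return result["contents_differ"] and result["md5_match"]
```

A genuine collision pair would report `contents_differ` and `md5_match` as both true. I would expect `sha256_match` to be false for such a pair, since the known attacks target MD5 specifically, but that is my expectation rather than something this sketch proves.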
If we need to verify that a file hasn't changed, MD5 is plenty good enough to detect accidental modification. If the file was corrupted or inadvertently modified by a careless examiner, there is an infinitesimally small chance that the hash will come out the same. If we're worried that someone has intentionally altered the data, they would have to be able to execute an attack (find a preimage) that is beyond what anyone is currently able to do using publicly-known attacks. Hell, even if the file wasn't hashed, a court would probably not allow someone to assert that the evidence had been altered without some evidence suggesting it had.
So, we can use MD5, right?
I think you do so at your own peril. The problem is that cryptographers, the people who are experts in making hashes and ciphers, have been saying not to use MD5 for 20 years and the attacks against MD5 have gotten much, much better since then. When a forensic examiner goes into court, he or she serves the court as an "expert". I feel like I could offer a reasonable defense/explanation for using MD5. I've read books on cryptography and took a grad-level class in it. I'm knowledgeable (enough to be dangerous). I think I understand it well enough to say that despite the warnings it's okay to use it in certain circumstances. But I'm not an expert in cryptography so why would I try to weigh in as one? [Note: Dr. Stevens' tweets indicate that he disagrees with my contention that MD5 would be acceptable in some circumstances. But, that's my point. Any situation where I think it might be okay to use MD5 is based on my amateur understanding of cryptography, not the expert-level understanding that he or his colleagues would have.]
There's an added complication. Even if MD5 is okay to use in these scenarios, trying to justify it without a good understanding of why could lead you into some murky waters. Simply not being careful about how you answer questions could get you trapped by a well-prepared attorney.
Imagine this: You go into court and explain how you verified the images in your case using MD5. The defense attorney asks you some very innocent questions about it: "What's MD5?", "can two files have the same hash?".
You give the best explanation that you remember from your training: "the odds of two files having the same hash are like 1 in 80 bajillion."
"So", he says "I couldn't just change the file and tweak it so the hash would be the same?"
"No way", you say. "It's like winning the lottery five times in a row."
The defense attorney smiles back at you and grabs a stack of papers off of his table. He has an article about how some researchers forged digital certificates that used MD5. He'd like you to read the highlighted portion. He has another about how the Flame malware hijacked Windows Update because of MD5. Would you please read the paragraph he highlighted there as well? He picks up a USB drive and tells you he has pictures of Jack Black, James Brown, and Barry White and they all have the same hash. He has a picture of a ship and a plane and those two have the same hash. He'd like you to hash these files to demonstrate.
"So", he says again. "What you told us a few minutes ago about the hashes. It wasn't true, was it?"
I disagree: cryptography is notoriously hard to get right. You should rely on expert cryptographic advice. And the prevailing expert opinion is: do not use MD5 for security. ― Marc Stevens (@realhashbreaker) December 16, 2018
And nowhere MD5 actually helps you in court, and can only hurt, since any cryptographic expert would say it should not be used for that. While SHA2 would help you in court. So what would be the best advice? ― Marc Stevens (@realhashbreaker) December 16, 2018
I think these tweets are key because they argue (from his expert perspective) that we should not use MD5 and also point out that this is the prevailing opinion among cryptographers. That second point matters because the methods that we use in a legal case are supposed to meet a standard, namely the Daubert standard, which considers five factors:
1. Whether a theory or technique can be and has been tested
2. Whether the theory or technique has been subject to both peer review and publication
3. The known or potential error rate of the method
4. The existence and maintenance of standards controlling its operations; and
5. Whether it has attracted widespread acce