US District Court Says Calculating a Hash Value = Search
bfwebster writes "Orin Kerr over at The Volokh Conspiracy (a great legal blog, BTW) reports on a US District Court ruling issued just last week which finds that doing hash calculations on a hard drive is a form of search and thus subject to 4th Amendment limitations. In this particular case, the US District Court suppressed evidence of child pornography on a hard drive because proper warrants were not obtained before imaging the hard drive and calculating MD5 hash values for the individual files on the drive, some of which ended up matching known MD5 hash values for known child pornography image and video files. More details at Kerr's posting." Update: 10/28 16:23 GMT by T : Headline updated to reflect that this is a Federal District Court located in Pennsylvania, rather than a court of the Commonwealth itself.
The courts are finally getting up to speed on technology.
You can't generate MD5 hashes without actually reading all of the data in the file.
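To make that point concrete, here is a minimal sketch of how a per-file MD5 is typically computed with Python's standard hashlib. The function name md5_of_file and the chunk size are just illustrative choices, not anything taken from the case or from any forensic tool. It returns the number of bytes fed to the hash alongside the digest, which shows that every byte of the file has to be read.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return (hex digest, bytes read) for the file at `path`.

    Hypothetical helper for illustration: the digest depends on every
    byte, so the whole file must be read to compute it.
    """
    digest = hashlib.md5()
    bytes_read = 0
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            bytes_read += len(chunk)
    return digest.hexdigest(), bytes_read
```

The byte count returned always equals the file size, which is the commenter's point: there is no way to get the hash without examining the full contents.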
When I submitted this story, I gave it the headline "US Court:...". Someone changed that to "PA Court Says...". That's wrong. This is a ruling from a US District (Federal) court, not a Pennsylvania state court, and so carries much more weight. ..bruce..
Bruce F. Webster (brucefwebster.com)
What evidence? Some md5 hashes that happen to match hashes from a select number of images? Odds are if we hash out every file on your hard drive we will also find matches to that same list.
Actually, odds are the hashes will not match...
Odds yes.
But no guarantee.
A better check is hash plus file size, since it is even less likely for two files of the same size to have the same hash by chance. That is especially true for compressed formats, where images or videos of the same dimensions compress to different sizes.
Hash and file size checks are useful for checking if a file is intact and possibly not altered. They are great for lookups.
But, in the end, you still need the file to validate the correct item is found. Hashmaps store both the key and hash for this very reason. The hash is a quick lookup, but the key is needed to verify the right element has been found.
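A rough sketch of that idea, assuming Python and its standard hashlib: the (MD5, size) pair is used only as a fast lookup key, and a match is confirmed by comparing the actual bytes. The names fingerprint and KnownFileIndex are made up for this example and do not correspond to any real tool or database.

```python
import hashlib

def fingerprint(data: bytes):
    """Cheap lookup key: (MD5 hex digest, size in bytes)."""
    return (hashlib.md5(data).hexdigest(), len(data))

class KnownFileIndex:
    """Hypothetical index that keeps the full content next to its fingerprint."""

    def __init__(self):
        self._by_fingerprint = {}

    def add(self, data: bytes):
        # Several distinct contents could, in principle, share a fingerprint,
        # so each bucket holds a list.
        self._by_fingerprint.setdefault(fingerprint(data), []).append(data)

    def fingerprint_matches(self, data: bytes) -> bool:
        """Fast check only: says a candidate exists, not that it is identical."""
        return fingerprint(data) in self._by_fingerprint

    def contains(self, data: bytes) -> bool:
        """Fast check followed by a byte-for-byte comparison."""
        bucket = self._by_fingerprint.get(fingerprint(data), [])
        return any(stored == data for stored in bucket)
```

This mirrors what a hash map does internally: the hash narrows the search to a bucket, and the stored key settles whether the right element was found.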
Unless the hash is the same size as the key.....
Not only did they search the drive without a warrant, but they also got the defendant to confess to putting the files there by questioning him without reading him his rights and telling him that he didn't need an attorney. Genius.
Even dumber: Based on the testimony of the guy who originally found the child porn, they could have gone to a magistrate and gotten a warrant. Then there would have been no issue of a warrantless search.
BTW, for those considering the abandoned-property angle -- the court goes into that. It wasn't a legal eviction and the defendant hadn't abandoned his stuff; he merely hadn't removed it all yet.
Yes, that's the birthday paradox. I'm not sure offhand how big the NCMEC database is, which is usually what they're comparing against, but let's try some math.
Let's say your hard drive has N files and the database has M items (so, comparing a list of N hashes to another list of M hashes). Your hard drive doesn't actually contain any of the files used to generate the "bad" hash list. Each of the N*M cross-list pairs matches by chance with probability 2^-128, so the probability of at least one false match is approximately P = 1 - exp( -N*M / 2^128 ). Assuming the value in the exponent is small, this is approximately P = N*M/2^128. 2^128 is roughly 3.4 x 10^38. In order for you to have a one in a billion (10^-9) chance of a false positive, the product N*M would have to be ~3.4 x 10^29. If the hash list has a billion items (I think it's smaller than that, by quite a lot), you'd need about 3 x 10^20 files on your disk -- well beyond the capacity of readily-available desktop storage.
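For anyone who wants to check the arithmetic, here is a small sketch in Python. It assumes MD5 behaves like an ideal 128-bit hash on unrelated files, so each pairing of one file on the disk with one database entry matches by chance with probability 2^-128. The function names and the example values of N and M are illustrative assumptions, not figures from the case.

```python
import math

HASH_BITS = 128  # MD5 digest length in bits

def false_positive_probability(n_files, m_hashes, bits=HASH_BITS):
    """P(at least one of n_files unrelated files matches one of m_hashes digests)."""
    expected_matches = n_files * m_hashes / 2**bits
    # 1 - exp(-x), computed with expm1 so tiny values do not round to zero.
    return -math.expm1(-expected_matches)

def files_needed(target_probability, m_hashes, bits=HASH_BITS):
    """Number of files at which the false-positive probability reaches the target."""
    return -math.log1p(-target_probability) * 2**bits / m_hashes

if __name__ == "__main__":
    # Assumed example: a million files checked against a billion-entry list.
    print(false_positive_probability(10**6, 10**9))
    # Files needed for a one-in-a-billion chance against a billion-entry list.
    print(files_needed(1e-9, 10**9))
```

With those assumed inputs the first value comes out on the order of 10^-24 and the second on the order of 10^20 files, so accidental matches are not a realistic concern at any plausible disk size.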
MD5 hashes are useful because they're resilient to accidental, birthday-style collisions. What they're not resilient to, it turns out, is someone intentionally creating two files with the same MD5 hash; the published collision attacks even produce pairs of the same size. (What remains infeasible is starting from an existing file and generating a different file that matches its MD5 hash.)