MD5 To Be Considered Harmful Someday
Effugas writes "I've completed an applied security analysis (pdf) of MD5 given Xiaoyun Wang et al's collision attack (covered here and here). From an applied perspective, the attack itself is pretty limited -- essentially, we can create 'doppelganger' blocks (my term) anywhere inside a file that may be swapped out, one for another, without altering the final MD5 hash. This lets us create any number of binary-inequal files with the same md5sum. But MD5 uses an appendable cascade construction -- in other words, if you happen to find yourself with two files that MD5 to the same hash, an arbitrary payload can be applied to both files and they'll still have the same hash. Wang released the two files needed (but not the collision finder itself). A tool, Stripwire, demonstrates the use of colliding datasets to create two executable packages with wildly different behavior but the same MD5 hash. The faults discovered are problematic but not yet fatal; developers (particularly of P2P software) who claim they'd like advance notice that their systems will fail should take note."
I guess I know what I'll be coding now - SHA-1 just got a lot more important in my priority list... :(
So does this mean that it's possible to find a useful MD5-equivalent file for any file? Just because someone alters a file does not mean they have done anything destructive. Would one be able to take a binary, make a change of some sort, and then run a tool to determine the block of data to add to the binary to both allow the change to take effect and cancel out the MD5 change? How complex would it be to construct this tool?
You are not the customer.
Other weaknesses were reported in various other secure digests, including MD4, RIPEMD, HAVAL-128, SHA-0.
SHA-256 is a good replacement. SHA-1 should be fine but if you are going through the trouble of an upgrade, why not make it sufficiently future proof?
Slicing and dicing the file into many different overlaping blocks and then computing a md5 sum for each block would be much much stronger.
....
a. MD5 sum entire file
b. MD5 sum each 10K block in the file (overlap blocks by ZZZ bytes)
c. break file into different blocks such as every 40th byte in the file
I was a bit suprised recently when I tried to
sort 15,000 jpg image files to remove duplicates.
Since the names varied, and it is not uncommon
to have many images with the same length, it
seemed like a good idea to use md5 hashes.
So I coded it up in python to do the md5 hash,
and then stuffed the file name into a table
keyed by the md5 hash. Big suprise when multiple
different files landed in the same hash.
Some property of jpgs probably reduces the
randomness of the files and increases the
probability of hash collisions.
I haven't had time to think this out, so I'm throwing it out for you guys:
Is doing a 2nd pass helpful?
In other words, if
ABCfireDEF
and
ABCiceeDEF
hash the same, does
ABCfireDEFABCfireDEF
necessarily, or even frequently, hash to
ABCiceeABCiceeDEF?
Even if it were to provide protection, in practical terms,
1) other hashing althorithms are likely faster than "MD5 times two"
2) there may be some - hopefully a lot fewer - places in a long file where where
ABCfireDEFABCfireDEF
can be replaced with
ABCiceeDEFABCmeltDEF
I'm beginning to like the "pick two algorithms and call me in the morning" approach.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
Let's say I have a system that downloads files over the internet. To prevent man-in-the-middle attacks, I digitally sign the files. This prevents me from having to vet all of the code that deals with local files for buffer overflows. I implement the digital signatures by simply encrypting a hash of the file with an RSA private key when I create the file, and decrypting and verifying the hash on the receiving side.
If I had decided to use MD5 for the hash in the digital signature, my scheme is now vulnerable. It's not too far-fetched to imagine that somebody could come up with a way to embed an exploit in one of the files while staying within the limitations imposed by this collision technique. Then if he can accomplish a man-in-the-middle attack, he's successfully used my automatic downloader to compromise somebody's machine. Not fun.
This may not be completely feasible currently, but the technique shows that it may be possible in the future. If you're currently designing a system that you plan to have function for several years, you should not assume that MD5 is cryptographically secure.
Mod down posts with a "Free Mac Mini/iPod" sig, they're spam!
True, you'd spend longer validating and decrypting the unholy mess generated than you'd spend downloading it, but I think you'd be fairly safe in assuming that the file was what it claimed to be.
In other words, it isn't impossible to be so close to 100% secure as to make no odds. It is merely a question of whether it's worth it. Is the expense of protecting something greater than the value of what you're protecting? If so, nobody is going to bother, no matter how simple the task of protecting is.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
I was thinking the same thing.
See the long list of cryptographic hash functions on Wikipedia. (Bottom of page.)
If you used three or four different hash functions, to produce a long combined message digest (concatenated from the several message digests), it would seem to be much more difficult to produce files that collide. An attack resulting in a collision on one algorithm, is not likely to work for another algorithm. Finding two files that simultaneously fool two or more algorithms seems a much more difficult task than just fooling a single algorithm like MD5.
This is of utmost importance. We can't have the RIAA infiltrating garbage files ("what the fiddlesticks do you think you're doing?") into our p2p networks.
Those who would give up liberty in exchange for security and DRM should switch to Microsoft Palladium!
You are correct that digital signatures w/ certificates is where the real risk lies for a flaw in MD5. Specifically, certificate chains are at risk, since you cannot truly validate a chain. This will have no affect on:
3 1994&cid=11024443)
"In particular, I expect that the first type of crack could cause issues with SSH keys, both user keys (used for authentication) and host keys (to prevent man-in-the-middle attacks)."
The server has a store of acceptable public keys, third paty observable random data from both the client and server are combined and hashed using SHA-1, that data is then signed by the client's private key. Being able to come up with substitute md5 or sha-1 data will not affect the outcome of the signed data algorithm.
For server authentication a certificate chain could be at risk. Although most SSH client implementations simply cache the public key, so they are vulnerable to first time active attacks.
An algorithm like the following would be helpful for certificate chains. See original zero score post here (http://developers.slashdot.org/comments.pl?sid=1
H = H1(m + 0) + H2(m + 0)
for counter = 1 to 999
H = H1(m + counter + H) + H2(m + counter + H)
return H;
H1 and H2 are crypto hash algorithms. This approach is very strong, since both hashing algorithms are interleaved together. If one hashing algorithm is found to be weak, and other is strong, attacks will fail. Additionally, even if both algorithms are found to be weak interlacing and the number of rounds make it much more computationally difficult to attack. In comparison to DSA or RSA signature verification algorithms this additional work for generating the hash is very CPU light weight.
Just finishing up the patent application. Done.
Hey Moderator, this has to be worth more than a zero. Sheesh.
Where time isn't critical (eg: creating and validating checksums for files), I'd say use both. The overhead isn't great, and you'd get much more security.
Where time is critical AND you don't have to be concerned with computers not under your control, use Whirlpool. Rijndael is fast, SHA-1 is slow. Whirlpool also offers a longer hash string than SHA-1.
In any other situation, use SHA-1. Whirlpool might turn out to be the greatest algorithm out there, but that doesn't help if you're trying to talk to a remote computer that doesn't support it.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)