MD5 To Be Considered Harmful Someday

← Back to Stories (view on slashdot.org)

MD5 To Be Considered Harmful Someday

Posted by michael on Tuesday December 7, 2004 @09:06AM from the house-built-on-sand dept.

Effugas writes "I've completed an applied security analysis (pdf) of MD5 given Xiaoyun Wang et al's collision attack (covered here and here). From an applied perspective, the attack itself is pretty limited -- essentially, we can create 'doppelganger' blocks (my term) anywhere inside a file that may be swapped out, one for another, without altering the final MD5 hash. This lets us create any number of binary-inequal files with the same md5sum. But MD5 uses an appendable cascade construction -- in other words, if you happen to find yourself with two files that MD5 to the same hash, an arbitrary payload can be applied to both files and they'll still have the same hash. Wang released the two files needed (but not the collision finder itself). A tool, Stripwire, demonstrates the use of colliding datasets to create two executable packages with wildly different behavior but the same MD5 hash. The faults discovered are problematic but not yet fatal; developers (particularly of P2P software) who claim they'd like advance notice that their systems will fail should take note."

16 of 401 comments (clear)

Min score:

Reason:

Sort:

damn by YetAnotherDave · 2004-12-07 09:08 · Score: 1, Interesting

I guess I know what I'll be coding now - SHA-1 just got a lot more important in my priority list... :(
1. Re:damn by WolfWithoutAClause · 2004-12-07 09:22 · Score: 5, Interesting
  
  There will ALWAYS be collisions with any kind of hashing algorythm.
  Yes, but a good hash makes it *extremely* difficult to find them. MD5 is looking pretty mediocre right now.
  
  --
  -WolfWithoutAClause
  "Gravity is only a theory, not a fact!"
2. Re:damn by uncqual · 2004-12-07 11:34 · Score: 2, Interesting
  
  Agreed that folding A with B after A has been compromised is a nonstarter.
  However, isn't there merit to the general idea of combining two 'dissimilar' (i.e., the theoretical basis for the 'security' of each share as few attributes as feasible) hash functions X and Y where both X and Y are currently though to be 'secure'?
  In those very common cases where the hash is protecting something with decreasing time value (such as software binaries which become obsolete in a few years) or where the original source is secure and it is only alterations of copies that are a concern (perhaps historical legal documents), this buys time if X or Y (but not both) is compromised (or seem more likely to be compromised - as MD5 is now)?
  For example, if X=MD5, the concerns about MD5 would result in an effort to discontinue its use and replace it with a new hash function Z and start using Y+Z. While the community switches to a new combination, everything is still pretty safe (as safe as Y in this case) during this transition period.
  It seems that in the private sector (spooks aside), it's quite unlikely that X and Y would both be effectively compromised simultaneously and with little warning unless the notion of dissimilarity is also compromised. Fortunately, the skills needed to compromise such functions seem to be concentrated in the guys with the white hats so they tend to publish their preliminary results.
  Disclaimer: I'm not a cryptomaniac so my observations are worth substantially less than you paid for them.
  
  --
  Why is there an "insightful" mod and why isn't it "-1"? If I wanted insight, I wouldn't be reading /.
3. Re:damn by grumbel · 2004-12-07 15:01 · Score: 2, Interesting
  
  ### If it's a hash, the message CANNOT be reconstructable from it
  
  Depends on the hash function, if the input length is equal or smaller then output length reconstructing the message might be possible, ie:
  
  $ echo "hello" | md5sum
  b1946ac92492d2347c6235b4d2611184 -
  
  Should be easily reversible with a dictonary attack. However in the more common case a hash maps a large input domain, to a much smaller output domain, so yep, hashes are not reversible unless input is somehow smaller then the hash itself.
4. Re:damn by Anonymous Coward · 2004-12-07 20:08 · Score: 1, Interesting
  
  $ dd if=/dev/urandom bs=1M count=300 > yoink
  300+0 records in
  300+0 records out
  314572800 bytes transferred in 63.270583 seconds (4971865 bytes/sec)
  $ time md5sum yoink
  a4a846995ccf65be3bfcd9e69325e69a yoink
  
  real 0m2.482s
  user 0m2.342s
  sys 0m0.131s
  
  This is about 121MB/s.
  Northwood P4 at 2.8 800FSB, DDR400
  linux 2.6.8
  
  Same test, SHA1: 77.3MB/s
  cksum: 246MB/s
Exploit? by Limburgher · 2004-12-07 09:11 · Score: 4, Interesting

So does this mean that it's possible to find a useful MD5-equivalent file for any file? Just because someone alters a file does not mean they have done anything destructive. Would one be able to take a binary, make a change of some sort, and then run a tool to determine the block of data to add to the binary to both allow the change to take effect and cancel out the MD5 change? How complex would it be to construct this tool?

--
You are not the customer.
Not just MD5 by PureFiction · 2004-12-07 09:20 · Score: 2, Interesting

Other weaknesses were reported in various other secure digests, including MD4, RIPEMD, HAVAL-128, SHA-0.

SHA-256 is a good replacement. SHA-1 should be fine but if you are going through the trouble of an upgrade, why not make it sufficiently future proof?
break it down by Anonymous Coward · 2004-12-07 09:27 · Score: 0, Interesting

Slicing and dicing the file into many different overlaping blocks and then computing a md5 sum for each block would be much much stronger.

a. MD5 sum entire file
b. MD5 sum each 10K block in the file (overlap blocks by ZZZ bytes)
c. break file into different blocks such as every 40th byte in the file ....
It actually easy to see this by Anonymous Coward · 2004-12-07 09:32 · Score: 2, Interesting

I was a bit suprised recently when I tried to
sort 15,000 jpg image files to remove duplicates.
Since the names varied, and it is not uncommon
to have many images with the same length, it
seemed like a good idea to use md5 hashes.
So I coded it up in python to do the md5 hash,
and then stuffed the file name into a table
keyed by the md5 hash. Big suprise when multiple
different files landed in the same hash.
Some property of jpgs probably reduces the
randomness of the files and increases the
probability of hash collisions.
Is a two-pass just as vulnerable? by davidwr · 2004-12-07 09:48 · Score: 3, Interesting

I haven't had time to think this out, so I'm throwing it out for you guys:

Is doing a 2nd pass helpful?

In other words, if
ABCfireDEF
and
ABCiceeDEF
hash the same, does
ABCfireDEFABCfireDEF
necessarily, or even frequently, hash to
ABCiceeABCiceeDEF?

Even if it were to provide protection, in practical terms,
1) other hashing althorithms are likely faster than "MD5 times two"
2) there may be some - hopefully a lot fewer - places in a long file where where
ABCfireDEFABCfireDEF
can be replaced with
ABCiceeDEFABCmeltDEF

I'm beginning to like the "pick two algorithms and call me in the morning" approach.

--
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
Re:Correct me if I'm wrong, but... by HeghmoH · 2004-12-07 09:54 · Score: 2, Interesting

Let's say I have a system that downloads files over the internet. To prevent man-in-the-middle attacks, I digitally sign the files. This prevents me from having to vet all of the code that deals with local files for buffer overflows. I implement the digital signatures by simply encrypting a hash of the file with an RSA private key when I create the file, and decrypting and verifying the hash on the receiving side.

If I had decided to use MD5 for the hash in the digital signature, my scheme is now vulnerable. It's not too far-fetched to imagine that somebody could come up with a way to embed an exploit in one of the files while staying within the limitations imposed by this collision technique. Then if he can accomplish a man-in-the-middle attack, he's successfully used my automatic downloader to compromise somebody's machine. Not fun.

This may not be completely feasible currently, but the technique shows that it may be possible in the future. If you're currently designing a system that you plan to have function for several years, you should not assume that MD5 is cryptographically secure.

--
Mod down posts with a "Free Mac Mini/iPod" sig, they're spam!
Re:Let's face it by jd · 2004-12-07 10:04 · Score: 2, Interesting

Let's say you break a file into blocks, encrypt those blocks with Rijndael or Serpent using a chaining method that authenticates the prior block, digitally sign the result using (seperately generated) RSA, DSA and ECC signatures in turn, and generate SHA-1 and Whirlpool checksums of both the encrypted and unencrypted file.

True, you'd spend longer validating and decrypting the unholy mess generated than you'd spend downloading it, but I think you'd be fairly safe in assuming that the file was what it claimed to be.

In other words, it isn't impossible to be so close to 100% secure as to make no odds. It is merely a question of whether it's worth it. Is the expense of protecting something greater than the value of what you're protecting? If so, nobody is going to bother, no matter how simple the task of protecting is.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Re:Correct me if I'm wrong, but... by ReelOddeeo · 2004-12-07 10:06 · Score: 2, Interesting

I was thinking the same thing.

See the long list of cryptographic hash functions on Wikipedia. (Bottom of page.)

If you used three or four different hash functions, to produce a long combined message digest (concatenated from the several message digests), it would seem to be much more difficult to produce files that collide. An attack resulting in a collision on one algorithm, is not likely to work for another algorithm. Finding two files that simultaneously fool two or more algorithms seems a much more difficult task than just fooling a single algorithm like MD5.

This is of utmost importance. We can't have the RIAA infiltrating garbage files ("what the fiddlesticks do you think you're doing?") into our p2p networks.

--

Those who would give up liberty in exchange for security and DRM should switch to Microsoft Palladium!
Re:You're wrong. by Anonymous Coward · 2004-12-07 10:41 · Score: 1, Interesting

You are correct that digital signatures w/ certificates is where the real risk lies for a flaw in MD5. Specifically, certificate chains are at risk, since you cannot truly validate a chain. This will have no affect on:

"In particular, I expect that the first type of crack could cause issues with SSH keys, both user keys (used for authentication) and host keys (to prevent man-in-the-middle attacks)."

The server has a store of acceptable public keys, third paty observable random data from both the client and server are combined and hashed using SHA-1, that data is then signed by the client's private key. Being able to come up with substitute md5 or sha-1 data will not affect the outcome of the signed data algorithm.

For server authentication a certificate chain could be at risk. Although most SSH client implementations simply cache the public key, so they are vulnerable to first time active attacks.

An algorithm like the following would be helpful for certificate chains. See original zero score post here (http://developers.slashdot.org/comments.pl?sid=13 1994&cid=11024443)

H = H1(m + 0) + H2(m + 0)

for counter = 1 to 999
H = H1(m + counter + H) + H2(m + counter + H)

return H;

H1 and H2 are crypto hash algorithms. This approach is very strong, since both hashing algorithms are interleaved together. If one hashing algorithm is found to be weak, and other is strong, attacks will fail. Additionally, even if both algorithms are found to be weak interlacing and the number of rounds make it much more computationally difficult to attack. In comparison to DSA or RSA signature verification algorithms this additional work for generating the hash is very CPU light weight.

Just finishing up the patent application. Done.

Hey Moderator, this has to be worth more than a zero. Sheesh.
Almost forgot by jd · 2004-12-07 10:42 · Score: 3, Interesting

Whirlpool is a 256-bit hashing algorithm, derived from the Rijndael encryption algorithm. Rijndael is known to be strong, and has been approved by NIST, but the conversion to a hash function has not been sufficiently tested.

Where time isn't critical (eg: creating and validating checksums for files), I'd say use both. The overhead isn't great, and you'd get much more security.

Where time is critical AND you don't have to be concerned with computers not under your control, use Whirlpool. Rijndael is fast, SHA-1 is slow. Whirlpool also offers a longer hash string than SHA-1.

In any other situation, use SHA-1. Whirlpool might turn out to be the greatest algorithm out there, but that doesn't help if you're trying to talk to a remote computer that doesn't support it.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Stop-gap solution: hash offset files by AsciiNaut · 2004-12-08 01:55 · Score: 2, Interesting

Here's a potential stop-gap solution: provide two md5sums per file, one of the whole file as normal, and one of the file offset by one byte. Let's look at the two hash-equivalent files cited by parent:
$ cmp file1.dat file2.dat file1.dat file2.dat differ: char 20, line 1 $ md5sum file1.dat file2.dat a4c0d35c95a63a805915367dcfe6b751 file1.dat a4c0d35c95a63a805915367dcfe6b751 file2.dat
Whoops! Now examine hashes of the same files, omitting the first (identical) bytes:
$ xxd -s 1 -ps file1.dat | xxd -r -p | md5sum 84e6e0a21e2c4c9ef53f3762fdc90bc8 - $ xxd -s 1 -ps file2.dat | xxd -r -p | md5sum a63151008e5f8fc116ba947fd8af8c5a -
Clearly this would not work with all collisions (nothing useful would), but it might hugely limit their frequency. And it would be relatively easy to tag on tables of 1-byte offset md5sums to existing md5sum tables out there.