MD5 To Be Considered Harmful Someday

← Back to Stories (view on slashdot.org)

MD5 To Be Considered Harmful Someday

Posted by michael on Tuesday December 7, 2004 @09:06AM from the house-built-on-sand dept.

Effugas writes "I've completed an applied security analysis (pdf) of MD5 given Xiaoyun Wang et al's collision attack (covered here and here). From an applied perspective, the attack itself is pretty limited -- essentially, we can create 'doppelganger' blocks (my term) anywhere inside a file that may be swapped out, one for another, without altering the final MD5 hash. This lets us create any number of binary-inequal files with the same md5sum. But MD5 uses an appendable cascade construction -- in other words, if you happen to find yourself with two files that MD5 to the same hash, an arbitrary payload can be applied to both files and they'll still have the same hash. Wang released the two files needed (but not the collision finder itself). A tool, Stripwire, demonstrates the use of colliding datasets to create two executable packages with wildly different behavior but the same MD5 hash. The faults discovered are problematic but not yet fatal; developers (particularly of P2P software) who claim they'd like advance notice that their systems will fail should take note."

27 of 401 comments (clear)

Min score:

Reason:

Sort:

Re:damn by LiquidCoooled · 2004-12-07 09:14 · Score: 2, Informative

There will ALWAYS be collisions with any kind of hashing algorythm.

Thats the nature of the game we play.

The only hashing system without collisions is sending the original file itself.

--
liqbase :: faster than paper
md5 vs sha1 vs ? by alexandre · 2004-12-07 09:17 · Score: 4, Informative

http://en.wikipedia.org/wiki/Md5

here is a very good link about the algo... :)
The "Detailed Summary" by Effugas · 2004-12-07 09:22 · Score: 5, Informative

[This is the author]

I've been doing some analysis on MD5 collision announced by Wang et al. Short version: Yes, Virginia, there is no such thing as a safe hash collision -- at least in a function that's specified to be cryptographically secure. The full details may be acquired at the following link:

http://www.doxpara.com/md5_someday.pdf

A tool, Stripwire, has been assembled to demonstrate some of the attacks described in the paper. It may be acquired at the following address:

http://www.doxpara.com/stripwire-1.1.tar.gz

Incidentally, the expectations management is by no means accidental -- the paper's titled "MD5 To Be Considered Harmful Someday" for a reason. Some people have said there's no applied implications to Joux and Wang's research. They're wrong; arbitrary payloads can be successfully integrated into a hash collision. But the attacks are not wildly practical, and in most cases exposure remains thankfully limited, for now. But the risks are real enough that responsible engineers should take note: This is not merely an academic threat, systems designed with MD5 now need to take far more care than they would if they were employing an unbroken hashing algorithm, and the problems are only going to get worse.

Some highlights from the paper:

* The attack itself is pretty limited -- essentially, we can create "doppelganger" blocks (my term) anywhere inside a file that may be swapped out, one for another, without altering the final MD5 hash. This lets us create any number of binary-inequal files with the same md5sum.

* MD5 uses an appendable cascade construction -- in other words, if you happen to find yourself with two files that MD5 to the same hash, an arbitrary payload can be applied to both files and they'll still have the same hash. This leads to...

* Attacks are possible using only the proof of concept test vectors released by Wang -- the actual attack is not necessary.

* Stripwire emits two binary packages. They both contain an arbitrary payload, but the payload is encrypted with AES. Only one of the packages ("Fire") is decryptable and thus dangerous; the other ("Ice") shields its data behind AES. Both files share the same MD5 hash.

* Digital Signature systems are vulnerable, as they almost always sign a hashed representation of data rather than the data itself.

* This is an excellent vector for malicious developers to get unsafe code past a group of auditors, perhaps to acquire a required third party signature. Alternatively, build tools themselves could be compromised to embed safe versions of dangerous payloads in each build. At some later point, the embedded payload could be safely "activated", without the MD5 changing. This has implications for Tripwire, DRM, and several package management architectures.

* HMAC's invulnerability has been slightly overstated. It's definitely possible, given the key, to create two datasets with the same HMAC. Attacker possession of the key violates MAC presumptions, so the impact of this is particularly questionable.

* Very interesting possibilities open up once the full attack is made available -- among other things, we can create self-decrypting executables (fire.exe and ice.exe) that exhibit differential behavior based on their internal colliding payloads. They'll still have the same MD5 hash.

* Several doppelgangers may (relatively quickly, as per Joux) be computed within a single multicollision-friendly block. As such, the particular selection of doppelganger sets within a file can itself be made to represent data. It's relatively straightforward to embed a 128 bit signature inside an arbitrary file, in such a way that no matter the value of the signature, a constant MD5 hash is maintained. This is curiously steganographic.

* Many popular P2P networks (and innumerable distributed content databases) use MD5 hashes as both a reliable search handle and a mechanism to ensure file integrity. This makes them blind to any sign
1. Re:The "Detailed Summary" by Effugas · 2004-12-07 10:51 · Score: 2, Informative
  
  SHA-1 also uses a Merkle-Damgard construction, so yes, if we find a SHA-1 collision, it'll be appendable.
  
  SHA-1 has a much stronger design. It's starting to show cracks, though, so I don't recommend anything. Something based on AES will come, though -- maybe AES-OMAC, maybe Whirlpool. At the core of almost every hashing algorithm is just a block cipher anyway...
  
  --Dan
Re:damn by networkBoy · 2004-12-07 09:24 · Score: 1, Informative

"The only hashing system without collisions is sending the original file itself."

That's not hashing:
Producing hash values for accessing data or for security. A hash value (or simply hash), also called a message digest, is a number generated from a string of text. The hash is substantially smaller than the text itself, and is generated by a formula in such a way that it is extremely unlikely that some other text will produce the same hash value.

--
whois gawk date unzip strip find touch finger mount join nice man top fsck grep eject more yes exit umount sleep dump
Solution: Use more than one hash algorithm by NZheretic · 2004-12-07 09:27 · Score: 2, Informative

This kind of attack can be mitigated into non-existance by just using two dissimilar hash algorithms.
Using MD5 with SHA1, or even the older MD2 or MD4 will reduce the probability of creating a compatable binary with the same checksum to virtually zero.
If only one checksum is required then just XOR the resulting checksums from each algorithm.
Re:damn by networkBoy · 2004-12-07 09:31 · Score: 2, Informative

"You shouldn't be able to because it simply isn't true."

I have to disagree with you here.
If I have algo A and algo B:
I hash with algo A and get a value which I store.
I hash with algo B and get a value which I store.
While the security does not add up to A^B it does ammount to > A+B, which is still better than A or B only. (I really wish I had my Crypto reference books handy)

Other posters mentioned it was more work and equiv to one secure algo, both those statements are true; as I pointed out this was an alternative to writing a new SHA-1 algo.
-nB

--
whois gawk date unzip strip find touch finger mount join nice man top fsck grep eject more yes exit umount sleep dump
Re:In english by Anonymous+Brave+Guy · 2004-12-07 09:32 · Score: 4, Informative

Short version: A common technology for verifying that a file you've downloaded is legitimate and untampered-with, known as MD5, isn't as secure as people thought.

Slightly longer version: MD5 is a way of generating a checksum -- a single, comparable value -- from a file. Ideally it is supposed to give you different numbers for different files, so if a web site advertises the checksum a file should have, you can compare that with one generated from the file you actually got to see whether the file you've downloaded has been modified, potentially maliciously.

The research shows that it is possible for someone to construct a drop-in replacement for the file you thought you had that generates the same MD5 checksum as the original, so anyone attempting to validate the file this way would think they had the real thing. If it turns out that you can construct a damaging replacement for a common file -- perhaps an installer for a popular application like Firefox or OpenOffice that's usually downloaded from a public server -- then this could open a loophole for viruses, worms, etc. that would slip through the security net often used by cautious people when downloading such programs.

--
If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
Re:Already Happening by menscher · 2004-12-07 09:33 · Score: 3, Informative

Bullshit.
I'll give you $50 if you can back that claim up. I want to see two video files. They must start out the same, but have a difference about half-way through. And they have to have the same md5 sum. Just post where I can download the two files, and your paypal address.
The way I see it, you've got a 1/2^64 chance of being right. So I'm risking $50/184467440737095516, which isn't a whole lot.
Re:Cash Money? by Meostro · 2004-12-07 09:36 · Score: 2, Informative

No, but I believe this one is:

XiaoyunWang, Dengguo Feng, Xuejia Lai, and Hongbo Yu
"Collisions for hash functions md4, md5,haval-128 and ripemd"
OT yet informative: Spock and Try by Anonymous Coward · 2004-12-07 09:37 · Score: 1, Informative

"Try not. Do or do not, there is no try.
-- Dr. Spock, stardate 2822.3."

I think you mean Yoda.
Besides, Dr. Spock is the baby doctor.
Re:Exploit? by anthony_dipierro · 2004-12-07 09:53 · Score: 2, Informative

As the length of the file is sent in addition to the MD5, in the vast majority of cases it's going to be impossible to find a file which gives you the same length and MD5. I guess as the size of media files increases this gets more and more likely, but if it ever starts affecting more than 0.0000000001% of files you can just increase the length of the hash.
Re:MD5 is obviously less secure by pclminion · 2004-12-07 09:56 · Score: 2, Informative

While this is technically true (there are an infinite number of passwords that hash to the same value), it's not of much practical use. There are only 256^8 possible 8-character passwords (less, if you don't allow un-typeable characters). There are 2^128 possible MD5 hashes. 256^8 is much, much, MUCH less than 2^128, thus the probability of discovering another valid passwords of 8 characters or less is negligible.
Sure, there are an infinite number of possible passwords, but they're all impossibly huge. I can come up with a password which is one trillion characters long and which hashes to the right result, but that's not practical.
Re:Exploit? by owlstead · 2004-12-07 10:00 · Score: 3, Informative

Yep, you are still wrong. I've tried some code and the (precalculated) different blocks MUST be at the start of the code. So it's more like

file1: xxxxccccccc....
file2: yyyyccccccc....

%file1 = %file2

Which is the example given in the article.

However, Wang said she could get to a collision from any intermediate hash code within the hour (according to the article). That would mean:

file1: ccccxxxxccccccccc......
file2: ccccyyyyccccccccc......

%file1 = %file2

Where xxxx and yyyy are (pre?)calculated and cccc.... is the payload.

If _I_ am not mistaken.
Re:Not just MD5 by chialea · 2004-12-07 10:00 · Score: 2, Informative

SHA-1 has an attack that's somewhat troubling. I'd look to next year's crypto and eurocrypt conferences to see starts shaking out as the new standard.

Still... I would switch out MD5 if you have a target that's worth pretty much anything at all. After a break like this, I'd expect MD5 to become basically useless pretty fast. Of course, I don't work in hash collisions, I work mostly in protocols...

Lea
You're wrong. by Piquan · 2004-12-07 10:04 · Score: 4, Informative
He can create a file that MD5sum's to the same result as a legitimate file, but does not have full control over the content or size of the result (making this a mostly useless avenue of exploitation except for people who want to spread trash on P2P networks -- I.E. it shouldn't particularly bother anyone except people who already don't care about security).
Suppose you're storing passwords as encrypted hashes, so that intercepting the hashes doesn't tell you what the password is. But if you can generate a password to match that MD5...
You know that GPG keys are identified and signed by their MD5 hashes? Suppose that I can generate a GPG key that would be identified as yours, and distributed it.
Or he can create two files that MD5sum to the same result. But he has to have control over both files, which offers effectively no advantage to someone who is trying to spread malware or tamper with existing archives that have been MD5summed.
There's a coin-flipping protocol that goes as follows. Suppose that Alice and Bob want to flip a coin (over the Internet), but they don't trust each other.
1. Alice generates a file with random data.
2. Alice sends Bob the MD5 hash of the file.
3. Bob picks a bit in the file, and whether he thinks it's a 0 or a 1.
4. Bob wins if and only if he picked right.
5. To verify, Alice sends Bob the file she generated at the beginning.
Now, suppose that Alice generated multiple files in step 1. When Bob makes his guess, she tries to pick a file that will make her win. If she generated only two files, completely randomly, this would let Alice win 75% of the time.
These are just the first ideas I thought of. If I were looking for other problems, I'd think about undeniable signatures, keysigning (which as GPG and X.509 SSL are heavily based on) and other specialized signature systems. In particular, I expect that the first type of crack could cause issues with SSH keys, both user keys (used for authentication) and host keys (to prevent man-in-the-middle attacks).
Digital signatures are used for much more than just testing for file tampering.
1. Re:You're wrong. by Gemini · 2004-12-07 16:17 · Score: 2, Informative
  
  You know that GPG keys are identified and signed by their MD5 hashes?
  
  They're not. The old PGP 2.6 keys are, but GnuPG generates OpenPGP keys that use SHA-1.
  
  GnuPG will use an already generated PGP 2.6 key, but will not make more of them.
Re:Correct me if I'm wrong, but... by kiltedtaco · 2004-12-07 10:07 · Score: 3, Informative

The Wang et al attack does not apply to passwords. Their attack applied to situations where the md5 input plaintext was known. Collisions are nowhere near common enough when using less than 16 character inputs to md5 to provide a feasible means of cracking passwords. Nobody has ever found a collision with under 128 bits of input, and the attacks in the article take considerably more than that.
Re:Correct me if I'm wrong, but... by jd · 2004-12-07 10:20 · Score: 2, Informative

Very true, or it could be used to poison the BGP routing tables. The problem with such an attack is that, once one router believes in the poisoned routes, the poison will spread very rapidly.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Re:Very true by jd · 2004-12-07 10:34 · Score: 2, Informative

NESSIE is an attempt to do for the European crypto standards what NIST does for the American crypto standards. Whirlpool was approved by NESSIE and is part of the ISO/IEC 10118-3:2003(E) standard. As such, it should be OK for use in Europe for secure systems, including military and Government applications.

The ISO approval also carries some weight in industry, although after some rather disasterous specifications (such as ISO 9000), they have lost some of their image. However, there are plenty of organizations that would consider an ISO standard an absolute must.

I don't know of anyone using Whirlpool for highly secure systems. It certainly wouldn't be ok in the US, as it's not a FIPS standard. France or Germany would be better bets.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Re:What does this mean by Anonymous Coward · 2004-12-07 10:38 · Score: 1, Informative

Yes, that's eaxctly what that means.

Just ask yourself : how many values can a MD5-hash (checksum) have, and how many (different) files are there on this world of ours.

If the number of files exeedes (is more than) the number of checksum-values, logic dictates that there must be several files that generate the same MD5-hash (checksum) ...

But Having multiple files with the same MD5-hash is not the problem. The problem is that someone could choose which MD5-hash his (program-)file should generate. And that would mean he could replace any file he wants with another one ...

And that's just what the MD5-hash should make impossible :-(
Re:Correct me if I'm wrong, but... by thing12 · 2004-12-07 10:39 · Score: 2, Informative

Not really... the article is talking about relatively huge files which when you swap out blocks in order to result in the same MD5. Passwords on the other hand are much shorter than the MD5 in the first place and no two strings shorter than the MD5 length (128 bits) will give you the same MD5. The MD5 space is many orders of magnitude larger than the typical password space - even a 16 character password that uses mixed case, digits and punctuation (basically all of base64), that contains about 96 bits of information -- that's 4 billion times smaller than the space of MD5.
What I'm getting at is that you'll probably always have to either know the password + the salt (if there is one), use brute force, or use a database of MD5's for all possible passwords in order to decrypt an MD5'd password. But since there are already MD5 databases, we're kind of past this part anyway.
As for access tokens, MD5's are chosen simply because they're large and seemingly random which makes them "unguessable". Since they're just temporary anyway, guessability is all you're trying to prevent - you'll get a new one next time and the attacker will have to start over.
Re:Two files with the same md5 hash? by Phleg · 2004-12-07 10:52 · Score: 4, Informative

See you in the next life.

--
No comment.
Re: Exploit? by Omniscient+Ferret · 2004-12-07 13:40 · Score: 3, Informative

Somebody could use these fake files to "poison" popular torrents, making it very unlikely that anybody on them will get uncorrupted files.

This would worry me, except that BT uses SHA1, not MD5, so this is irrelevant. MD5 has seemed suspect for years, & Bram's the sort to pay attention to that sort of thing.

I checked; Edonkey is based on MD4. Gnutella variants might use MD5.
Re:Exploit? by wdr1 · 2004-12-07 16:11 · Score: 2, Informative

I thought the same at first, but they have a pretty clear example showing that's not the case. From here:
file1.dat: 00000000 d1 31 dd 02 c5 e6 ee c4 69 3d 9a 06 98 af f9 5c 00000010 2f ca b5 87 12 46 7e ab 40 04 58 3e b8 fb 7f 89 00000020 55 ad 34 06 09 f4 b3 02 83 e4 88 83 25 71 41 5a 00000030 08 51 25 e8 f7 cd c9 9f d9 1d bd f2 80 37 3c 5b 00000040 96 0b 1d d1 dc 41 7b 9c e4 d8 97 f4 5a 65 55 d5 00000050 35 73 9a c7 f0 eb fd 0c 30 29 f1 66 d1 09 b1 8f 00000060 75 27 7f 79 30 d5 5c eb 22 e8 ad ba 79 cc 15 5c 00000070 ed 74 cb dd 5f c5 d3 6d b1 9b 0a d8 35 cc a7 e3 MD5(file1.dat) = a4c0d35c95a63a805915367dcfe6b751

file2.dat: 00000000 d1 31 dd 02 c5 e6 ee c4 69 3d 9a 06 98 af f9 5c 00000010 2f ca b5 07 12 46 7e ab 40 04 58 3e b8 fb 7f 89 00000020 55 ad 34 06 09 f4 b3 02 83 e4 88 83 25 f1 41 5a 00000030 08 51 25 e8 f7 cd c9 9f d9 1d bd 72 80 37 3c 5b 00000040 96 0b 1d d1 dc 41 7b 9c e4 d8 97 f4 5a 65 55 d5 00000050 35 73 9a 47 f0 eb fd 0c 30 29 f1 66 d1 09 b1 8f 00000060 75 27 7f 79 30 d5 5c eb 22 e8 ad ba 79 4c 15 5c 00000070 ed 74 cb dd 5f c5 d3 6d b1 9b 0a 58 35 cc a7 e3 MD5(file2.dat) = a4c0d35c95a63a805915367dcfe6b751

--
SlashSig Karma: Excellent (mostly affected by moderatio
Re:Exploit? by CryoPenguin · 2004-12-07 18:18 · Score: 2, Informative

That's only a problem because Kazaa uses one checksum for the whole file. BitTorrent checksums each chunk (typically 256KB - 1MB), so it can detect corrupt data, throw out only little collateral real data, and blacklist whoever sent you the corrupt bits. Then it redownloads the chunk from someone else, and the user never needs to know.
Re:I have a novel solution by Dan+Farina · 2004-12-07 19:06 · Score: 2, Informative

I'm not sure if you are being sarcastic or something, but as said before, Bittorrent uses SHA-1, not MD5....

So you are safe downloading linux for now via bittorrent. Besides, the chances of MD5 collisions happening from sheer luck/unluck are very slim. (after all, we've been using it for ages with no reports)

The most dangerous factor to continued use of MD5 are malicious individuals.