Meaningful MD5 Collisions

← Back to Stories (view on slashdot.org)

Posted by Zonk on Friday June 10, 2005 @07:38AM from the bam-crash-boom dept.

mrogers writes "Researchers at Ruhr-Universität Bochum have found a way to produce MD5 collisions between human-meaningful documents. This could be used to obtain a digital signature on one document and then transfer it to another. The same technique is theoretically applicable to other hash functions based on the Merkle-Damgård structure, such as SHA-1." From the article: "Recently, the world of cryptographic hash functions has turned into a mess. A lot of researchers announced algorithms ("attacks") to find collisions for common hash functions such as MD5 and SHA-1 (see [B+, WFLY, WY, WYY-a, WYY-b]). For cryptographers, these results are exciting - but many so-called 'practitioners' turned them down as 'practically irrelevant'."

22 of 312 comments (clear)

Min score:

Reason:

Sort:

Common sense by gtrubetskoy · 2005-06-10 07:40 · Score: 2, Insightful

Unless I'm missing something, all these guys are doing is using a format that can contain an infinite amount of extraneous information that has no effect on how it's ultimately rendered, so same thing can be done with a .doc, or an HTML file, and this isn't really a cryptographic discovery of hash function weakness of any kind, just common sense for most of us - the secure hash algorithms should be applied to the English (or whatever language) textual contents of the document, no the source code of it, such as PostScript used in the article, PDF, HTML or whatever. I guess the most important lesson here is that this technique can be applied to binaries pretty easily as well.
1. Re:Common sense by loose_cannon_gamer · 2005-06-10 08:09 · Score: 2, Insightful
  
  Your common sense seems a little ridiculous. Are we saying that all documents have to be reduced to text before applying our encryption? What about nontextual documents, like, say, process flowcharts, spreadsheets, powerpoint slides, multimedia?
  
  There are a lot of formats out there that allow additions of random undisplayed information, and so I presume that many of these formats are vulnerable to these attacks.
  
  I wonder how long it will take before there are exploits out there that take advantage of these techniques... Of course, I also wonder how many there already are...
  
  Oh well. The key concept behind security is, has been, and always will be trust. You should always ask yourself when you receive something from someone else how much you trust the source, and act accordingly.
  
  --
  In Soviet Russia, us are belong to all your base.
2. Re:Common sense by Anonymous Coward · 2005-06-10 08:19 · Score: 1, Insightful
  
  Your common sense seems a little ridiculous. Are we saying that all documents have to be reduced to text before applying our encryption? What about nontextual documents, like, say, process flowcharts, spreadsheets, powerpoint slides, multimedia?
  What's a little ridiculous is your lack of security knowledge. Yes, signatures MUST be applied to meaningful parts of the document files, and exactly how to do it right varies from format to format - simply running md5sum on a file is rediculously risky!
These are important attacks.. by Ckwop · 2005-06-10 07:42 · Score: 5, Insightful

As an amateur cryptographer, I must say that labeling these attacks as 'practically irrelevant'
is at the very least misguided and at worst a shocking display of incompetence.

Stop the fixation with plain-text messages, most messages are not plain-text. Your average word document
contains loads of invisible data that doesn't get rendered. Pdf's contain "junk" data that doesn't get rendered either. Would
you notice a single bit difference in an MP3? Or a single pixel colour change in a jpeg? Hell, you can even do it in HTML <div style="visibility:hidden">Junk goes here</div>.
Mark my words, people will find in the next couple of months find two meaningful computer
documents that hash to the same value but are different byte-wise.

People undervalue these attacks because the attacker has to generate the collision before hand to use it.
To properly appreciate the power of these attacks consider the following senario.

Imagine we're agreeing a contract of employement and I'm your employer.
I give you the first word document that includes all the standard terms, however, I've also drafted
a Word document that contains a load of draconian clauses like banning you from working in any IT position five years
after leaving the company. By adding junk that doesn't render to both documents, I've managed to find to make the hash
of the two documents collide. Thinking I'm a nice employer, you sign the first document, which you do by signing the hash of
document. However, I now have your signature on BOTH documents. I now make sure the company IT system "forget" the first document
and I've successfully screwed you.

This is a human example, but there are other examples that apply in computer systems. The problem is that in many situations
the attacker can choose when you encrypt. Say you encrypt your e-mail conversation with your friend using S/MIME, many people click
"Reply" and the message body of the other persons method appears in the new message. Because of these attacks,
It's now no certainty that an attacker couldn't use this fact to construct collisions that an attacker could use.

As another security researcher said (paraphrased) It's like you're in building and you've just heard the fire alarm go off.
You can't see smoke but it's time to make your way calmly to the exit. That sums up the position with SHA-1 and MD5. Swap out the primitives
before you start seeing smoke.

It's not like we don't have alternatives anyway. Whirlpool uses the same wide-trail design principles has AES. It's slower than MD-5 or SHA-1 but it's much better designed. And beside, people would do well to realise you have to spend CPU cycles to get security.

Simon.
1. Re:These are important attacks.. by Pyrrus · 2005-06-10 08:31 · Score: 2, Insightful
  
  If you are using the md5 sum of a digital document, I imagine you would keep a digital copy of the document as well. You'd then go to court with your signed physical document, md5 signed digital document, and the 2nd digital document with the same md5 sum and junk data that appears to be inserted to cause a collision. The employer would have a bit of explaining to do.
2. Re:These are important attacks.. by Basje · 2005-06-10 08:33 · Score: 4, Insightful
  
  The one that makes the claim. In the above example you go to work for another employer. The malicous employer slaps you with the fraudulent contract _claiming_ you cannot do that.
  
  Then you whip out the original document. Presto: both documents have the same signature. Now it's up to the party making the claim to prove which contract was signed. Impossible.
  
  The reverse is also true: the programmer works on a super secret project. He signs a secrecy declaration. If he can apply the exact same signature to his lunch agreement, and walks out with the sourcecode, his employer is screwed, as per the example above.
  
  Etc. An agreement is legally binding in any form. The only reason to have a signed copy of any agreement (digitally or paper) is to have proof of the agreement. If the signature is no longer proof, no agreement rises above verbal agreements, evidence wise.
  
  --
  the pun is mightier than the sword
3. Re:These are important attacks.. by hritcu · 2005-06-10 09:13 · Score: 2, Insightful
  
  I think that the most important type of attack involves binaries. For example I could surpass tripwire and enable an exploit that was never visible in my binary code. Almost none of you would be able to tell the difference (a couple of bits), not even tripwire (md5), between the exploitable and non-exploitable version of the program. But the change in behavior could be different so that the exploit is possible.
  
  Steps:
  1. Write a very useful tool
  2. Compile it
  3. Tamper the file to get two almost identical versions, one exploitable, the other "secure"
  4. Release the source and the secure binary
  5. Wait for the tool to achieve widespread adoption based on its "security"
  6. Overwrite the secure version with the exploitable one on your server and log the IPs of the people downloading it
  7. Wait a little more
  8. Use the exploit and gain access to the exploitable machines
  9. Once you are in charge overwrite the exploitable version of the tool with the "secure" one - tripwire should not detect this
  10. Party
  
  --
  If you don't fail at least 90 percent of the time, you're not aiming high enough. (Alan Kay)
4. Re:These are important attacks.. by snorklewacker · 2005-06-10 09:40 · Score: 2, Insightful
  
  > Who'd have to explain things then?
  
  The "reasonable person" standard prevails. Remember, they have to explain the existence of that extra padding in the document. The jury, assuming it's composed of reasonable people ...
  
  sorry, had to take a moment. my co-workers are no doubt wondering what the hysterical laughter was about ...
  
  the jury is not likely to believe that you signed away your rights. Now if the employer's document specified a million-dollar signing bonus, then he pulled a switch on you to simply remove it, then the jury might be less inclined to believe you.
  
  Doubtful this sort of thing could happen with employment. A government contract on the other hand ...
  
  --
  I am no longer wasting my time with slashdot
no help by LiquidCoooled · 2005-06-10 07:45 · Score: 2, Insightful

Anything you hash together will ALWAYS result in collisions.

Extracting the formatting and code from the document will just make it EASIER to create a duplication.

"Hello World!" might match with "Hello World!!!!!! this is extra stuff"

at least leaving the exact formatting instructions in gives you a chance that the collision which leaves a compatible file is much more difficult than the hash of the simple raw text.

--
liqbase :: faster than paper
What are the alternatives? by CyricZ · 2005-06-10 07:45 · Score: 2, Insightful

If MD5 is found to be insecure, what are the alternatives we can use when signing our open-source packages? Is there any other alternative that is even approaching the widespread use of md5sum?

--
Cyric Zndovzny at your service.
do you know how big 2^128 is? by frovingslosh · 2005-06-10 07:52 · Score: 4, Insightful

Collisions are NOT and accidential everyday occurance. What is being discussed here is a deliberate md5 match, created by making just the right changes to a document to intentionally get an md5 match.
2^128 is huge. It's larger by far than the number of all the files in all of the computers in the world. It larger than the number of stars in the universe. Chance collisions will not become an everyday occurance. No accidental collision has ever been found yet. Switching to larger keys will not change anything. Sure, they might make it slightly harder to make a deliberate collision (although I don't know for a fact that they make it harder at all, there were some reports of someone in Japan being able to create a collision by hand with only pencil and paper), but just wait 2 months and the computing power will catch up with that. It's not a matter of the size of the hash function.

--
I'm an American. I love this country and the freedoms that we used to have.
1. Re:do you know how big 2^128 is? by QuestorTapes · 2005-06-10 10:23 · Score: 2, Insightful
  
  > Collisions are NOT and accidential everyday occurance.
  
  Probably.
  
  Playing the Devil's advocate here. Collisions would be impacted, possibly -very- strongly, by exactly how the GUIDs are generated. Just because there are 2^128 bits to play with doesn't mean we are using them all. There may be significant gaps; a large number of values may never be generated by the algorithm used.
  
  Combine with human errors in coding the algorithm and the odds of a collision may be much greater than expected.
Kids in Finland don't agree by StreetFire.net · 2005-06-10 07:52 · Score: 5, Insightful

Regarding being "practically irrelevant"

"every time [some software engineer] says, 'nobody will go to the trouble of doing that,' there's some kid in Finland who will go to the trouble."
Taken from Kevin' Mitnik's "The Art of intrusion"
http://www.amazon.com/exec/obidos/tg/sim-explorer/ explore-items/-/0764569597/0/101/1/none/purchase/r ef%3Dpd_sxp_r0/104-8074733-7395136
Possible solution by orb_fan · 2005-06-10 07:55 · Score: 2, Insightful

Is the answer then to create a hash that is in fact the sum of two distinct hashing alogrithms? For example, use MD5 and SHA1. Since the alogrithms use different methods, the set of collisions from one would not overlap the set from the second (or rather the overlap would be vanishingly small). And if the overlap was larger than you wanted - just keep adding different hash alogrithms until you are satisfied.

I realise that the computations involved aren't cheap, but it becomes a trade-off between security and speed - the more sure, the more time it will take.
Re:Wow...this is nerdy even for /. by Ann+Elk · 2005-06-10 08:13 · Score: 2, Insightful
- Download something critical, say, the current Linux kernel from kernel.org.
- Insert a trojan/backdoor/whatever.
- Manipulate the tar archive so the hashes match. This is the subject of TFA.
- Somehow upload the trojaned kernel back to kernel.org.
- Since the hashes for the original kernel and the trojaned kernel are identical, they both appear valid when checked against the signature.
- ????
- Profit!
Before someone starts bitching about "lack of trust" in open source, please replace kernel with security update and kernel.org with microsoft.com in the above text.
On the contrary... by Chris+Pimlott · 2005-06-10 08:14 · Score: 5, Insightful

This attack shows us all once again that there is that the procedures for using cryptography are as important as the mathematical theories and proofs on which cryptography is based. People like to believe that it's just the algorithm that's important, and once you have such an algorithm it's equally applicable to messages of all sorts and formats. As this shows, it's clearly not the case.

You may believe it's common sense, but to the average user, encrypting a simple letter like the memos used in the article expressed as a Word document is no different than encrypting a simple text email. Heck, many of these users probably have no idea that much of the plain-looking email they send and recieve is actually HTML, which is capable of hiding beneath its rendered surface all sorts of additional information.

When's the last time you saw an email program that read in a Word document, extracted just the plain text content, signed or encrypted it and then repackaged it into some new format in a cryptographically sound way that would automatically be reconstituted as a Word document on the other side? Most just have a handy "Sign" or "Encrypt" button that will happy accept .ps or .doc just as readily as a simple text file.
The provided exploit documents can be edited! by rar · 2005-06-10 08:19 · Score: 3, Insightful

What I haven't seen mentioned yet, and people perhaps haven't realized, is that in providing these two postscript files, they have essentially provided you with an postscript signature exploit kit!

All you need to do is download the two postscript documents and do *exactly corresponding edits* in both of them, and you get two documents saying different things and still have the same md5sums!

I just tried exchanging Alice's name for my own, and surely it did work.

Now, if they released a pdf-file hack, I would be genuinely worried :)...

--
Open Materials Database
Not exactly useful for fraud... by argent · 2005-06-10 08:22 · Score: 4, Insightful

Clever, but it means the attack is not a general way to forge an MD5-signed document... you couldn't use this (for example) to seed a P2P network with malicious files that look like safe ones. It only works if you generate both documents, and it can only be used maliciously if it's never examined by an expert: the signer can't retain a copy of the signed document or obtain a copy through discovery.
Re:Wow...this is nerdy even for /. by Beryllium+Sphere(tm) · 2005-06-10 09:01 · Score: 2, Insightful

>teoretically, someone could alter a text and produce a new one that seems (from the md5 hashes) identical to the first one

Not exactly. Not unless the attacker could choose the first text. The new attacks allow you to create two documents that collide, but don't (yet) allow you to take an arbitrary document and make another that collides with it.

That word "yet" is important. A lot of bright crypto people will now start working on "preimage" attacks and they've got some new tools to work with. Be afraid. In a calm constructive way of course.
That won't solve it by gotr00t · 2005-06-10 09:14 · Score: 3, Insightful

Think of it this way: You can hash _any_ file of _any_ size using either MD5 or SHA1, and these algorithms will hive you an alphanumeric hash, which has a limited number of permutations. Thare are an infinate number of unique files (assuming adequate storage space), yet there are finite unique hashes.
Therefore, no matter how many algorithms you sum up using your described method, the number of collisions is still infinite in amount. It is not the algorithms that are flawed, rather, it is the fundamential concept of hashing that allows collisions to happen.

I would assume that the way to reduce the number of collisions is by increasing the length of the hash itself so as to increase the number of unique hashes.
Re:Misuse of MD5? by thelittlestbuddy · 2005-06-10 12:18 · Score: 2, Insightful

I hope this post was a troll.
But if it wasn't let's just clear a few things up. MD5 is a cryptographic hash function, and is intended for use in things like digital signatures, message authentication codes, and many, many other applications. The attack in the paper demonstrates that MD5 is not suitable for use with digital signatures.

By the pigeon hole principle, your "magical fingerprint" can never exist for compressing arbitrary files.
Re:well... by Dwonis · 2005-06-10 16:42 · Score: 2, Insightful
substitute "the last bit of each byte" or "padbyte=rand(DVD-byte)" where rand() is a random-# generator and it's random enough.
[Disclaimer: I'm not an expert at cryptography, but I like to think I understand it better than most non-mathematicians outside the field. It would be really nice if a crypto expert could clarify this, but I don't expect that to happen on Slashdot.]

You are correct that your scheme would add some security, but not nearly as much as our intuition might lead us to believe.

Let's say you are going to transmit an n-bit message. Even if you don't transmit any information, an attacker knows that there are 2^n possible messages that you could transmit. If we assume that your message is compressed as much as possible, then before you transmit the message, all 2^n messages are equally likely (from an attacker's point of view).

So, you have n bits of data. We'll call this message P (the plaintext). Now, let's say you generate another n-bit random message, called K (the key). Finally, you xor P and K together to produce C (the ciphertext), which is also an n-bit message.

The theory behind the one-time pad says that if and only if there are at least 2^n equally-likely possibilities for K, then someone who only knows C cannot learn anything about P.

We can express this a different way. Let's say you have an invertible function, C = f(P), and:
- P is 2^n bits long.
- C is 2^n bits long.
- There are at least 2^n equally-likely functions for the function f
Note that the function f is just a generic representation of the one-time pad algorithm and the key K, so similarly, we cam say that an attacker who knows nothing about the function f cannot learn anything about P from only C.

And that's the problem: every time you transmit a message (P) that isn't *completely* random, you give the attacker a little information about f, unless you completely change f every time you transmit a new message. This why you can never re-use a key in the one-time-pad system.

So let's say you have a key, K, that has fewer than 2^n equally-likely possibilities. Then, there are fewer than 2^n possible functions f. If there are still 2^n possible values for P, then an attacker can learn some information about P from C. So, if you don't want that to happen, you need to have 2^n possible functions for f.

So, you have 2^n equally-likely functions for f, and you need to use a different one for every message. In order to let the recipient know which function to use each time you transmit a message, you have to transmit at least n bits of information to the recipient. I think you can see where this is going...

If you were going to write an algorithm to implement the function f, the optimally-compressed description of the algorithm would have to be at least n bits long, and would need to be replaced for every new message that you send. It doesn't matter if f is an algorithm based on a DVD library, or a really complicated program. In order for the one-time pad to work (an attacker learns nothing about P from C), you need make sure that there are at least an additional n bits of information that the attacker knows nothing about.

So in your example, you'd still need to send a new 4GB (optimally-compressed) version of rand() for every 4GB message you send.

Nice try, though. Keep it up!