Meaningful MD5 Collisions
mrogers writes "Researchers at Ruhr-Universität Bochum have found a way to produce MD5 collisions between human-meaningful documents. This could be used to obtain a digital signature on one document and then transfer it to another. The same technique is theoretically applicable to other hash functions based on the Merkle-Damgård structure, such as SHA-1." From the article: "Recently, the world of cryptographic hash functions has turned into a mess. A lot of researchers announced algorithms ("attacks") to find collisions for common hash functions such as MD5 and SHA-1 (see [B+, WFLY, WY, WYY-a, WYY-b]). For cryptographers, these results are exciting - but many so-called 'practitioners' turned them down as 'practically irrelevant'."
Unless I'm missing something, all these guys are doing is using a format that can contain an infinite amount of extraneous information that has no effect on how it's ultimately rendered, so same thing can be done with a
As an amateur cryptographer, I must say that labeling these attacks as 'practically irrelevant'
is at the very least misguided and at worst a shocking display of incompetence.
Stop the fixation with plain-text messages, most messages are not plain-text. Your average word document
contains loads of invisible data that doesn't get rendered. Pdf's contain "junk" data that doesn't get rendered either. Would
you notice a single bit difference in an MP3? Or a single pixel colour change in a jpeg? Hell, you can even do it in HTML <div style="visibility:hidden">Junk goes here</div>.
Mark my words, people will find in the next couple of months find two meaningful computer
documents that hash to the same value but are different byte-wise.
People undervalue these attacks because the attacker has to generate the collision before hand to use it.
To properly appreciate the power of these attacks consider the following senario.
Imagine we're agreeing a contract of employement and I'm your employer.
I give you the first word document that includes all the standard terms, however, I've also drafted
a Word document that contains a load of draconian clauses like banning you from working in any IT position five years
after leaving the company. By adding junk that doesn't render to both documents, I've managed to find to make the hash
of the two documents collide. Thinking I'm a nice employer, you sign the first document, which you do by signing the hash of
document. However, I now have your signature on BOTH documents. I now make sure the company IT system "forget" the first document
and I've successfully screwed you.
This is a human example, but there are other examples that apply in computer systems. The problem is that in many situations
the attacker can choose when you encrypt. Say you encrypt your e-mail conversation with your friend using S/MIME, many people click
"Reply" and the message body of the other persons method appears in the new message. Because of these attacks,
It's now no certainty that an attacker couldn't use this fact to construct collisions that an attacker could use.
As another security researcher said (paraphrased) It's like you're in building and you've just heard the fire alarm go off.
You can't see smoke but it's time to make your way calmly to the exit. That sums up the position with SHA-1 and MD5. Swap out the primitives
before you start seeing smoke.
It's not like we don't have alternatives anyway. Whirlpool uses the same wide-trail design principles has AES. It's slower than MD-5 or SHA-1 but it's much better designed. And beside, people would do well to realise you have to spend CPU cycles to get security.
Simon.
Anything you hash together will ALWAYS result in collisions.
Extracting the formatting and code from the document will just make it EASIER to create a duplication.
"Hello World!" might match with "Hello World!!!!!! this is extra stuff"
at least leaving the exact formatting instructions in gives you a chance that the collision which leaves a compatible file is much more difficult than the hash of the simple raw text.
liqbase
If MD5 is found to be insecure, what are the alternatives we can use when signing our open-source packages? Is there any other alternative that is even approaching the widespread use of md5sum?
Cyric Zndovzny at your service.
2^128 is huge. It's larger by far than the number of all the files in all of the computers in the world. It larger than the number of stars in the universe. Chance collisions will not become an everyday occurance. No accidental collision has ever been found yet. Switching to larger keys will not change anything. Sure, they might make it slightly harder to make a deliberate collision (although I don't know for a fact that they make it harder at all, there were some reports of someone in Japan being able to create a collision by hand with only pencil and paper), but just wait 2 months and the computing power will catch up with that. It's not a matter of the size of the hash function.
I'm an American. I love this country and the freedoms that we used to have.
Regarding being "practically irrelevant"
/ explore-items/-/0764569597/0/101/1/none/purchase/r ef%3Dpd_sxp_r0/104-8074733-7395136
"every time [some software engineer] says, 'nobody will go to the trouble of doing that,' there's some kid in Finland who will go to the trouble."
Taken from Kevin' Mitnik's "The Art of intrusion"
http://www.amazon.com/exec/obidos/tg/sim-explorer
Is the answer then to create a hash that is in fact the sum of two distinct hashing alogrithms? For example, use MD5 and SHA1. Since the alogrithms use different methods, the set of collisions from one would not overlap the set from the second (or rather the overlap would be vanishingly small). And if the overlap was larger than you wanted - just keep adding different hash alogrithms until you are satisfied.
I realise that the computations involved aren't cheap, but it becomes a trade-off between security and speed - the more sure, the more time it will take.
Before someone starts bitching about "lack of trust" in open source, please replace kernel with security update and kernel.org with microsoft.com in the above text.
This attack shows us all once again that there is that the procedures for using cryptography are as important as the mathematical theories and proofs on which cryptography is based. People like to believe that it's just the algorithm that's important, and once you have such an algorithm it's equally applicable to messages of all sorts and formats. As this shows, it's clearly not the case.
.ps or .doc just as readily as a simple text file.
You may believe it's common sense, but to the average user, encrypting a simple letter like the memos used in the article expressed as a Word document is no different than encrypting a simple text email. Heck, many of these users probably have no idea that much of the plain-looking email they send and recieve is actually HTML, which is capable of hiding beneath its rendered surface all sorts of additional information.
When's the last time you saw an email program that read in a Word document, extracted just the plain text content, signed or encrypted it and then repackaged it into some new format in a cryptographically sound way that would automatically be reconstituted as a Word document on the other side? Most just have a handy "Sign" or "Encrypt" button that will happy accept
What I haven't seen mentioned yet, and people perhaps haven't realized, is that in providing these two postscript files, they have essentially provided you with an postscript signature exploit kit!
:)...
All you need to do is download the two postscript documents and do *exactly corresponding edits* in both of them, and you get two documents saying different things and still have the same md5sums!
I just tried exchanging Alice's name for my own, and surely it did work.
Now, if they released a pdf-file hack, I would be genuinely worried
Open Materials Database
Clever, but it means the attack is not a general way to forge an MD5-signed document... you couldn't use this (for example) to seed a P2P network with malicious files that look like safe ones. It only works if you generate both documents, and it can only be used maliciously if it's never examined by an expert: the signer can't retain a copy of the signed document or obtain a copy through discovery.
>teoretically, someone could alter a text and produce a new one that seems (from the md5 hashes) identical to the first one
Not exactly. Not unless the attacker could choose the first text. The new attacks allow you to create two documents that collide, but don't (yet) allow you to take an arbitrary document and make another that collides with it.
That word "yet" is important. A lot of bright crypto people will now start working on "preimage" attacks and they've got some new tools to work with. Be afraid. In a calm constructive way of course.
Therefore, no matter how many algorithms you sum up using your described method, the number of collisions is still infinite in amount. It is not the algorithms that are flawed, rather, it is the fundamential concept of hashing that allows collisions to happen.
I would assume that the way to reduce the number of collisions is by increasing the length of the hash itself so as to increase the number of unique hashes.
But if it wasn't let's just clear a few things up. MD5 is a cryptographic hash function, and is intended for use in things like digital signatures, message authentication codes, and many, many other applications. The attack in the paper demonstrates that MD5 is not suitable for use with digital signatures.
By the pigeon hole principle, your "magical fingerprint" can never exist for compressing arbitrary files.
[Disclaimer: I'm not an expert at cryptography, but I like to think I understand it better than most non-mathematicians outside the field. It would be really nice if a crypto expert could clarify this, but I don't expect that to happen on Slashdot.]
You are correct that your scheme would add some security, but not nearly as much as our intuition might lead us to believe.
Let's say you are going to transmit an n-bit message. Even if you don't transmit any information, an attacker knows that there are 2^n possible messages that you could transmit. If we assume that your message is compressed as much as possible, then before you transmit the message, all 2^n messages are equally likely (from an attacker's point of view).
So, you have n bits of data. We'll call this message P (the plaintext). Now, let's say you generate another n-bit random message, called K (the key). Finally, you xor P and K together to produce C (the ciphertext), which is also an n-bit message.
The theory behind the one-time pad says that if and only if there are at least 2^n equally-likely possibilities for K, then someone who only knows C cannot learn anything about P.
We can express this a different way. Let's say you have an invertible function, C = f(P), and:
Note that the function f is just a generic representation of the one-time pad algorithm and the key K, so similarly, we cam say that an attacker who knows nothing about the function f cannot learn anything about P from only C.
And that's the problem: every time you transmit a message (P) that isn't *completely* random, you give the attacker a little information about f, unless you completely change f every time you transmit a new message. This why you can never re-use a key in the one-time-pad system.
So let's say you have a key, K, that has fewer than 2^n equally-likely possibilities. Then, there are fewer than 2^n possible functions f. If there are still 2^n possible values for P, then an attacker can learn some information about P from C. So, if you don't want that to happen, you need to have 2^n possible functions for f.
So, you have 2^n equally-likely functions for f, and you need to use a different one for every message. In order to let the recipient know which function to use each time you transmit a message, you have to transmit at least n bits of information to the recipient. I think you can see where this is going...
If you were going to write an algorithm to implement the function f, the optimally-compressed description of the algorithm would have to be at least n bits long, and would need to be replaced for every new message that you send. It doesn't matter if f is an algorithm based on a DVD library, or a really complicated program. In order for the one-time pad to work (an attacker learns nothing about P from C), you need make sure that there are at least an additional n bits of information that the attacker knows nothing about.
So in your example, you'd still need to send a new 4GB (optimally-compressed) version of rand() for every 4GB message you send.
Nice try, though. Keep it up!