Meaningful MD5 Collisions
mrogers writes "Researchers at Ruhr-Universität Bochum have found a way to produce MD5 collisions between human-meaningful documents. This could be used to obtain a digital signature on one document and then transfer it to another. The same technique is theoretically applicable to other hash functions based on the Merkle-Damgård structure, such as SHA-1." From the article: "Recently, the world of cryptographic hash functions has turned into a mess. A lot of researchers announced algorithms ("attacks") to find collisions for common hash functions such as MD5 and SHA-1 (see [B+, WFLY, WY, WYY-a, WYY-b]). For cryptographers, these results are exciting - but many so-called 'practitioners' turned them down as 'practically irrelevant'."
As an amateur cryptographer, I must say that labeling these attacks as 'practically irrelevant'
is at the very least misguided and at worst a shocking display of incompetence.
Stop the fixation with plain-text messages; most messages are not plain text. Your average Word document
contains loads of invisible data that doesn't get rendered. PDFs contain "junk" data that doesn't get rendered either. Would
you notice a single-bit difference in an MP3? Or a single pixel colour change in a JPEG? Hell, you can even do it in HTML: <div style="visibility:hidden">Junk goes here</div>.
Mark my words, in the next couple of months people will find two meaningful computer
documents that hash to the same value but are different byte-wise.
People undervalue these attacks because the attacker has to generate the collision beforehand to use it.
To properly appreciate the power of these attacks, consider the following scenario.
Imagine we're agreeing a contract of employment and I'm your employer.
I give you a first Word document that includes all the standard terms. However, I've also drafted
a second Word document that contains a load of draconian clauses, like banning you from working in any IT position for five years
after leaving the company. By adding junk that doesn't render to both documents, I've managed to make the hashes
of the two documents collide. Thinking I'm a nice employer, you sign the first document, which you do by signing the hash of
the document. However, I now have your signature on BOTH documents. I make sure the company IT system "forgets" the first document,
and I've successfully screwed you.
This is a human example, but there are other examples that apply in computer systems. The problem is that in many situations
the attacker can choose what you encrypt. Say you encrypt your e-mail conversation with your friend using S/MIME. Many people click
"Reply", and the body of the other person's message appears in the new message. Because of these attacks,
it's no longer certain that an attacker couldn't use this fact to construct collisions.
As another security researcher said (paraphrased): it's like you're in a building and you've just heard the fire alarm go off.
You can't see smoke, but it's time to make your way calmly to the exit. That sums up the position with SHA-1 and MD5. Swap out the primitives
before you start seeing smoke.
It's not like we don't have alternatives anyway. Whirlpool uses the same wide-trail design principles as AES. It's slower than MD5 or SHA-1, but it's much better designed. And besides, people would do well to realise you have to spend CPU cycles to get security.
Simon.
For those who can't conveniently view PostScript files, the text of the two letters:
Julius. Caesar
Via Appia 1
Rome, The Roman Empire
Alice Falbala fulfilled all the requirements of the Roman Empire
intern position. She was excellent at translating roman into her gaul
native language, learned very rapidly, and worked with considerable
independence and confidence.
Her basic work habits such as punctuality, interpersonal deportment,
communication skills, and completing assigned and self-determined
goals were all excellent.
I recommend Alice for challenging positions in which creativity,
reliability, and language skills are required.
I highly recommend hiring her. If you'd like to discuss her attributes
in more detail, please don't hesitate to contact me.
Sincerely,
Julius Caesar
Julius. Caesar
Via Appia 1
Rome, The Roman Empire
May, 22, 2005
Order:
Alice Falbala is given full access to all confidential and secret
information about GAUL.
Sincerely,
Julius Caesar
What these researchers did was not to improve the known attacks on MD5, but to demonstrate a clever way of turning the known attack, generally considered to be of theoretical interest only, into an attack that could potentially really be used.
The way they did it was to create a postscript document that actually contains two documents, one that the sender would be willing to sign and one that he presumably would not. The full text of both is contained in the file, but near the beginning of the file is a bit of code that compares two blocks of random-appearing bits, call them A and B. If A == B, the postscript interpreter will select the innocuous message and display that. If A != B, the interpreter will display the other message.
The researchers then generated a pair of blocks with the same MD5 hash. In one copy of the postscript file, they used one of these blocks as both A and B. In the other copy, they used one block as A and the other as B. Because every bit of both documents before and after the two blocks is identical, and because those blocks hash to the same value, the documents hash to the same value.
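To make the construction above concrete, here is a minimal sketch in Python. It does not attack MD5. It assumes a made-up toy hash (`toy_hash`, with a 16-bit chaining value built from truncated SHA-256) so that a colliding pair of blocks can be found by brute force in well under a second. All names here (`compress`, `chain`, `toy_hash`, `render`, the document layout) are hypothetical stand-ins; `render` plays the role of the PostScript comparison of A and B.

```python
import hashlib
from itertools import count

BLOCK = 8           # toy block size in bytes (real MD5 uses 64-byte blocks)
IV = b"\x00\x00"    # toy 16-bit initial chaining value

def compress(state: bytes, block: bytes) -> bytes:
    """Toy compression function with a 16-bit output (an assumption, NOT MD5's)."""
    return hashlib.sha256(state + block).digest()[:2]

def chain(data: bytes, state: bytes = IV) -> bytes:
    """Feed whole blocks through the compression function, Merkle-Damgard style."""
    assert len(data) % BLOCK == 0
    for i in range(0, len(data), BLOCK):
        state = compress(state, data[i:i + BLOCK])
    return state

def toy_hash(data: bytes) -> bytes:
    """Zero-pad to a block boundary, append the length as a final block, then chain."""
    padded = data + b"\x00" * (-len(data) % BLOCK) + len(data).to_bytes(BLOCK, "big")
    return chain(padded)

def find_colliding_blocks(state: bytes) -> tuple[bytes, bytes]:
    """Search for two different blocks that compress to the same value from `state`."""
    seen = {}
    for i in count():
        block = i.to_bytes(BLOCK, "big")
        out = compress(state, block)
        if out in seen:
            return seen[out], block
        seen[out] = block

PREFIX = b"%!toy doc header"                      # exactly two toy blocks
SUFFIX = b"|harmless letter|order granting access"  # both texts live in every copy

def build(a: bytes, b: bytes) -> bytes:
    return PREFIX + a + b + SUFFIX

def render(doc: bytes) -> str:
    """Stand-in for the selector code: show one text if A == B, else the other."""
    a = doc[len(PREFIX):len(PREFIX) + BLOCK]
    b = doc[len(PREFIX) + BLOCK:len(PREFIX) + 2 * BLOCK]
    _, harmless, order = doc[len(PREFIX) + 2 * BLOCK:].decode().split("|")
    return harmless if a == b else order

# Two different blocks that collide when hashed right after PREFIX.
x, y = find_colliding_blocks(chain(PREFIX))
doc_signed = build(x, x)   # A == B, renders the harmless text
doc_forged = build(y, x)   # A != B, renders the other text

print(toy_hash(doc_signed) == toy_hash(doc_forged))
print(render(doc_signed), "/", render(doc_forged))
```

Because everything after the colliding block is byte-for-byte identical in both copies, the chaining values stay equal to the end, which is also why matching edits to the shared text keep the two hashes equal.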
It's an interesting attack. It only applies to documents that are also programs, in some sense, but we use lots of document formats that fit that description.
A simple countermeasure that makes such an attack more difficult is to compress the documents before signing.
The signing of open-source packages is usually meant to catch download corruption. If a download is corrupted, the data will be different, and hence the hash will be different. Most of these attacks are malicious, in that you have to go to great lengths to find a collision to use. If your connection corrupts the download in such a way as to produce a collision, your modem obviously hates you.
2^128 is huge. It's larger by far than the number of all the files in all of the computers in the world. It's larger than the number of stars in the universe. Chance collisions will not become an everyday occurrence. No accidental collision has ever been found. Switching to larger hashes will not change anything. Sure, they might make it slightly harder to make a deliberate collision (although I don't know for a fact that they make it harder at all; there were some reports of someone in Japan being able to create a collision by hand with only pencil and paper), but just wait 2 months and the computing power will catch up with that. It's not a matter of the size of the hash function.
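For a rough sense of scale, the chance of an accidental collision among n random inputs can be estimated with the standard birthday approximation, p ≈ 1 - exp(-n(n-1) / 2^(b+1)) for a b-bit hash. The sketch below assumes an idealised 128-bit hash with uniformly random outputs; the function name is made up for illustration, and the estimate says nothing about deliberate attacks.

```python
import math

def chance_collision_probability(n_items: int, hash_bits: int = 128) -> float:
    """Birthday approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = -n_items * (n_items - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)

for n in (10**9, 10**12, 10**15):
    print(f"{n:.0e} random inputs: p ~ {chance_collision_probability(n):.3e}")
```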
Regarding being "practically irrelevant"
"every time [some software engineer] says, 'nobody will go to the trouble of doing that,' there's some kid in Finland who will go to the trouble."
Taken from Kevin Mitnick's "The Art of Intrusion"
http://www.amazon.com/exec/obidos/tg/sim-explorer/explore-items/-/0764569597/0/101/1/none/purchase/ref%3Dpd_sxp_r0/104-8074733-7395136
We could just couple it with another widely used industry standard
Basically, when you compute an MD5 of a string, you transform a text of variable length into a fixed-length string. Now, imagine the variable text is 200 bytes long but the fixed string is 16 bytes long (MD5's output is 128 bits). You are obviously losing information, so there must be different 200-byte inputs that produce the same 16-byte sequence, but the number of possible 16-byte values makes it highly unlikely that you will stumble on a repeated sequence by chance. What these investigators found is a way to produce such repeats deliberately. The problem is that we usually check integrity with these MD5 hashes, so theoretically someone could alter a text and produce a new one that seems (from the MD5 hashes) identical to the first one. This would be especially nice for putting backdoors in source code downloaded from the net, as we often check it against an MD5 hash.
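A quick way to see the variable-length-in, fixed-length-out behaviour is Python's standard `hashlib` module. The sample messages are arbitrary; MD5's digest is always 16 bytes (128 bits) whatever the input length.

```python
import hashlib

# Inputs of very different lengths all map to a 16-byte digest.
for message in (b"", b"a", b"x" * 200):
    digest = hashlib.md5(message).digest()
    print(len(message), "bytes in ->", len(digest), "bytes out:", digest.hex())
```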
At first I thought: PostScript! Well, obviously. To find a collision, you've probably got to hide a clump of randomness in the document, and then rotate that clump until the hashes collide. If you tried to hide random data in a text file, it would be obvious to the person signing it. You need some format to hide the random bits from the viewer.
I bet the random parts are REALLY BIG! I mean, you'd probably need a lot of random data before you could find a collision...
Then I downloaded the files...
There's almost nothing to them! I can't read PS, so I'm not sure how many of that handful of bytes at the beginning might be tweakable... but it's a lot less than I expected.
Collisions must be very easy to find! I am now officially very worried about this.
...or maybe not.
in my next big project?
In all seriousness, I believe Schneier's right. We need a competition for a new hash function.
Nah, let's just wait for 24 to drop the words "MD5" before we know it's really bad.
- Relative expected values of gain vs. loss: The attacker thinks "I know I can gain a #BIG_NUM million dollars" and devotes their full effort to the attack. The defender thinks "I'm safe, there's a low probability, and I'm sure I'll catch the problem before it becomes real money," and does not devote effort to security because a Gartner report told him it was over-hyped. Thus, the attacker's perceived expected value is much higher than the defender's perceived expected loss, and each invests accordingly.
- Rising Complexity: As IT systems become more complex, they become less secure. Each new device, new networking protocol, new physical layer, new OS feature, new networked application provides new opportunities for the attacker and a dilution of security resources for defenders.
- Time: The attacker has the advantage of time. New algorithms, new mathematical theories, new exploits, and faster processors all favor the attacker. What once was supposed to take the age of the universe to crack can be decrypted with a quickly declining number of networked (even zombied) PCs.
- Curse of Compatibility: Because so much crypto and security is networking related, it is subject to implementation delays caused by the need to be compatible. Defenders continue to use old, vulnerable systems to maintain compatibility with key partners. Patches don't solve the problem because the patch itself can introduce incompatibilities, which makes defenders leery of applying a patch that has a very real chance of causing problems just to avoid a hypothetical security issue.
The bottom line is that the defender must protect all vulnerabilities while going about the day-to-day business of using the computer. In contrast, the attacker can devote full time to any weakness of their choice.
If you don't know what a hash function is, then turn in your nerd credentials now.
Basically, they provided an example case where one of these recent methods to generate hash function collisions can be turned into a "real world" attack.
It's a very simple example case, but it demonstrates the point effectively. The point is that these recently discovered methods to generate collisions quickly are a real threat to any software using them as a method for digital signatures and such.
The real world application here is that it is possible, probably in several good ways, to generate a couple of different files that have the same hash and also have meaningful data in them. The attacks found that generate seemingly random data with the same hashes can be used in ways that will let them apply to non-random, purposefully designed data.
The example they use is where some secretary gets her boss to sign a document, and then uses his signature on another document which gives her access she shouldn't have. It's a way to forge a digital signature on a document by having them sign another one that you specially crafted.
I forget where or when exactly, so please feel free to run a search if you care to... it was here on Slashdot though.
There was talk about someone being able to foil P2P networks by seeding bad stuff through random data formulated to fit the MD5/SHA1 code from legitimate files shared on those networks. The consensus was that it was BS and that even if it weren't BS there could be updates to make such attacks more difficult or impossible to perform.
Am I missing something or are these two stories relevant to each other?
Lenstra and others came up with a way to generate syntactically-correct X509 certificates that collide under MD5.
Here's a link to the paper: Lenstra et al.
We're talking about cryptographic hashes here, not encryption. Encryption is meant to be a reversible process, and is therefore one-to-one. In other words, there's no concern over collisions with encryption.
With cryptographic hashes, you're throwing away nearly all of the data to obtain a hash (a number) which represents the larger data set in such a way that (hopefully) the hash will never turn up again in practical usage. The article here indicates that there are ways being devised to force two data sets to have a hash collision while keeping the practical parts of the data sets the same.
As for accusing encryption of being "security through obscurity", you're misusing that term. If knowing the encryption algorithm allowed you instant access to all data encrypted with that algorithm, then yes, the only security present would be dependent upon the secrecy of the algorithm itself. But that's not the case here. Encryption typically works by public key exchange, meaning that a key (a number) used to encrypt messages is shared with the encrypting partner, while the key to decrypt and recover the data is kept private (is never transmitted). Recovering the private key through brute force is not a compromise of the algorithm itself - given enough time, any private key can be recovered, regardless of the algorithm, but by increasing the key size arbitrarily, the time taken to find that key can also be increased arbitrarily.
This attack shows us all once again that the procedures for using cryptography are as important as the mathematical theories and proofs on which cryptography is based. People like to believe that it's just the algorithm that's important, and once you have such an algorithm it's equally applicable to messages of all sorts and formats. As this shows, that's clearly not the case.
You may believe it's common sense, but to the average user, encrypting a simple letter like the memos used in the article expressed as a Word document is no different than encrypting a simple text email. Heck, many of these users probably have no idea that much of the plain-looking email they send and receive is actually HTML, which is capable of hiding beneath its rendered surface all sorts of additional information.
When's the last time you saw an email program that read in a Word document, extracted just the plain text content, signed or encrypted it and then repackaged it into some new format in a cryptographically sound way that would automatically be reconstituted as a Word document on the other side? Most just have a handy "Sign" or "Encrypt" button that will happily accept .ps or .doc just as readily as a simple text file.
What I haven't seen mentioned yet, and people perhaps haven't realized, is that in providing these two PostScript files, they have essentially provided you with a PostScript signature exploit kit! :)
All you need to do is download the two postscript documents and do *exactly corresponding edits* in both of them, and you get two documents saying different things and still have the same md5sums!
I just tried exchanging Alice's name for my own, and sure enough, it worked.
Now, if they released a PDF-file hack, I would be genuinely worried.
Clever, but it means the attack is not a general way to forge an MD5-signed document... you couldn't use this (for example) to seed a P2P network with malicious files that look like safe ones. It only works if you generate both documents, and it can only be used maliciously if it's never examined by an expert: the signer can't retain a copy of the signed document or obtain a copy through discovery.
I think that's a bit shortsighted... I agree that if we let history take a crack at it, any encryption put together by smart people will eventually be breakable by smart people.
However, most data that I deal with day-to-day is time relevant. Do I care if someone figures out my credit card number on an account I closed 5 years ago? Is it terrible if someone hacks an old email only to find out I was begging a professor for a passing grade in 1997?
Encryption is meant to hide things, and for many things, the need to hide is temporary. If the hidden thing stays hidden as long as it needs to stay hidden, there is nothing wrong with it.
Know the limitations of the technology you use, and know the parties with which you exchange information. Those two rules alone, if followed, will probably provide more than adequate real-world defense. Perfect? No. Good enough? Statistically, yes.
Actually, the two documents are almost identical. The difference is only one block in the whole file, which essentially acts as a selector for which of the two sets of content is displayed. MD5 (like most hash functions) works on fixed-size blocks smaller than the average file. To hash a complete file, you hash the first block, feed that into the hash with the second block, feed that into the hash with the third block, and so forth. So they have two files, and the first blocks are the same, the second blocks are different but hash the same, and the rest of the files are the same. Of course, the second blocks are junk, but the PostScript is expecting a block-sized arbitrary value at that point anyway, so it doesn't matter that there's junk there.
So they are actually using a format that can contain an exact quantity of extraneous information that doesn't get rendered but entirely changes what does get rendered.
The same thing could be done with PDF or doc, and executables, but not anything compressed (it won't decompress at all if a block is changed) and not HTML without javascript (there's no way to test which block of junk is included and show different results based on that).
Therefore, no matter how many algorithms you sum up using your described method, the number of collisions is still infinite. It is not the algorithms that are flawed; rather, it is the fundamental concept of hashing that allows collisions to happen.
I would assume that the way to reduce the number of collisions is by increasing the length of the hash itself so as to increase the number of unique hashes.
Isn't this obvious if you check the file sizes?
The files are the same size.
The cksum command (which uses a 32-bit CRC) spits out the checksum and a file size. Why doesn't md5sum do the same thing?
It does, in a sense: md5sum doesn't print the size, but the message length is part of the MD5 computation. Before the final block is processed, the algorithm pads the message and appends its length in bits, so the length gets hashed too. With plain zero padding and no length field, you could create an MD5 collision just by appending zeros to the file.
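As a sketch of how the length enters the computation, the helper below rebuilds the padding described in RFC 1321: a single 0x80 byte, zero bytes up to 56 mod 64, then the message length in bits as a 64-bit little-endian integer. The function name `md5_padding` is made up for illustration; `hashlib` does this internally and does not expose it.

```python
import hashlib

def md5_padding(message_len: int) -> bytes:
    """Padding per RFC 1321: 0x80, zero bytes, then the bit length (64-bit little-endian)."""
    zeros = (55 - message_len) % 64          # brings the total to 56 mod 64 before the length
    bit_len = (message_len * 8) % 2**64      # length is taken modulo 2^64
    return b"\x80" + b"\x00" * zeros + bit_len.to_bytes(8, "little")

# Appending zero bytes changes the encoded length, and therefore the digest.
for msg in (b"abc", b"abc\x00", b"abc\x00\x00"):
    pad = md5_padding(len(msg))
    print(len(msg), "bytes, length field", pad[-8:].hex(), "->", hashlib.md5(msg).hexdigest())
```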