Meaningful MD5 Collisions
mrogers writes "Researchers at Ruhr-Universität Bochum have found a way to produce MD5 collisions between human-meaningful documents. This could be used to obtain a digital signature on one document and then transfer it to another. The same technique is theoretically applicable to other hash functions based on the Merkle-Damgård structure, such as SHA-1." From the article: "Recently, the world of cryptographic hash functions has turned into a mess. A lot of researchers announced algorithms ("attacks") to find collisions for common hash functions such as MD5 and SHA-1 (see [B+, WFLY, WY, WYY-a, WYY-b]). For cryptographers, these results are exciting - but many so-called 'practitioners' turned them down as 'practically irrelevant'."
For those who can't convientently view PostScript files, the text of the two letters:
Julius. Caesar
Via Appia 1
Rome, The Roman Empire
Alice Falbala fulfilled all the requirements of the Roman Empire
intern position. She was excellent at translating roman into her gaul
native language, learned very rapidly, and worked with considerable
independence and confidence.
Her basic work habits such as punctuality, interpersonal deportment,
communication skills, and completing assigned and self-determined
goals were all excellent.
I recommend Alice for challenging positions in which creativity,
reliability, and language skills are required.
I highly recommend hiring her. If you'd like to discuss her attributes
in more detail, please don't hesitate to contact me.
Sincerely,
Julius Caesar
Julius. Caesar
Via Appia 1
Rome, The Roman Empire
May, 22, 2005
Order:
Alice Falbala is given full access to all confidential and secret
information about GAUL.
Sincerely,
Julius Caesar
What these researchers did was not to improve the known attacks on MD5, but to demonstrate a clever way of turning the known attack, generally considered to be of theoretical interest only, into an attack that could potentially really be used.
The way they did it was to create a postscript document that actually contains two documents, one that the sender would be willing to sign and one that he presumably would not. The full text of both is contained in the file, but near the beginning of the file is a bit of code that compares two blocks of random-appearing bits, call them A and B. If A == B, the postscript interpreter will select the innocuous message and display that. If A != B, the interpreter will display the other message.
The researchers then generated a pair of blocks with the same MD5 hash. In one copy of the postscript file, they used one of these blocks as both A and B. In the other copy, they used one block as A and the other as B. Because every bit of both documents before and after the two blocks is identical, and because those blocks hash to the same value, the documents hash to the same value.
It's an interesting attack. It only applies to documents that are also programs, in some sense, but we use lots of document formats that fit that description.
A simple countermeasure that makes such an attack more difficult is to compress the documents before signing.
Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
The signing of open-source packages are to prevent download corruption usually. If a download is corrupted, the data will be different, and hence the hash will be different. Most of these attacks are malicious in that you have to go great lengths to find a collision to use. If your connection corrupts the download in such a way to produce a collision, your modem obviously hates you.
Basically, when you do an md5 for a string, you transform an existing text with a variable length to a fixed length string. Now, imagine the variable text is 200bytes long, but the fixed string is 20 bytes long, you are obiously loosing information, and that there may be a combination of 200 bytes that produce the same 20 byte sequence, but the amount of combinations in 20 bytes (160 bits) make it highly unlikely that you will find a repeated sequence. What this investingators found is a way to replicate this sequences. The problem being that usually we check integrity with this md5 hashes, so teoretically, someone could alter a text and produce a new one that seems (from the md5 hashes) identical to the first one. This is specially nice for putting backdoors in source code downloaded from the net, as we often check it against an md5 hash.
If you don't know what a hash function is, then turn in your nerd credentials now.
Basically, they provided an example case where one of these recent methods to generate hash function collisions can be turned into a "real world" attack.
It's a very simple example case, but it demonstrates the point effectively. The point is that these recently discovered methods to generate collisions quickly are a real threat to any software using them as a method for digital signatures and such.
The real world application here is that it is possible, probably in several good ways, to generate a couple of different files that have the same hash and also have meaningful data in them. The attacks found that generate seemingly random data with the same hashes can be used in ways that will let them apply to non-random, purposefully designed data.
The example they use is where some secretary gets her boss to sign a document, and then uses his signature on another document which gives her access she shouldn't have. It's a way to forge a digital signature on a document by having them sign another one that you specially crafted.
- Give a man a fire and he's warm for a day, but set him on fire and he's warm for the rest of his life.
We're talking about cryptographic hashes here, not encryption. Encryption is meant to be a reversible process, and is therefore one-to-one. In other words, there's no concern over collisions with encryption.
With cryptographic hashes, you're throwing away nearly all of the data to obtain a hash (a number) which represents the larger data set in such a way that (hopefully) the hash will never turn up again in practical usage. The article here indicates that there are ways being devised to force two data sets to have a hash collision while keeping the practical parts of the data sets the same.
As for accusing encryption of being "security through obscurity", you're misusing that term. If knowing the encryption algorithm allowed you instant access to all data encrypted with that algorithm, then yes, the only security present would be dependent upon the secrecy of the algorithm itself. But that's not the case here. Encryption typically works by public key exchange, meaning that a key (a number) used to encrypt messages is shared with the encrypting partner, while the key to decrypt and recover the data is kept private (is never transmitted). Recovering the private key through brute force is not a compromise of the algorithm itself - given enough time, any private key can be recovered, regardless of the algorithm, but by increasing the key size arbitrarily, the time taken to find that key can also be increased arbitrarily.
I think that's a big shortsighted... I agree that if we let history take a crack at it, that any encryption put together by smart people will eventually be breakable by smart people.
However, most data that I deal with day-to-day is time relevant. Do I care if someone figures out my credit card number on an account I closed 5 years ago? Is it terrible if someone hacks an old email only to find out I was begging a professor for a passing grade in 1997?
Encryption is meant to hide things, and for many things, the need to hide is temporary. If the hidden thing stays hidden as long as it needs to stay hidden, there is nothing wrong with it.
Know the limitations on the technology you use, and know the parties with which you exchange information. Those two rules alone, if followed, will probably provide more than adequate real-world defense. Perfect? No. Good enough. Statistically, yes.
In Soviet Russia, us are belong to all your base.
If two documents exist with the same hash, then they were both produced by the same source, since there is no practical, known way of finding a collision without having control of the content of both documents. Therefore, your signed copy of the original document proves that the employer created both versions.
If 'the people' in Amendment 2 are 'the state' then Amendments 1, 2, 4, 9, and 10 benefit the state, not you.
To find a collision, you've probably got to hide a clump of randomness in the document, and then rotate that clump until the hashes collide.
That's actually not what they did. They generated two essentially random blobs of data that have the same hash. We'll call these X and Y. They then created a PostScript document containing BOTH messages, the one that Alice's boss would sign and the one he presumably would not sign. They inserted two copies of block X into one of these documents, and a one X and one Y into the other. The original document contained code that compared the two blocks, and if they were the same, caused one message to be rendered, or if they were different, caused the other message to be rendered. Thus both documents hash the same (since X hashes the same as Y), but you see different text when you view the files.
This sort of attack would only work on documents that can contain code of some sort. It would not work on text files.
Isn't this obivous if you check the file sizes?
The files are the same size.
The cksum comand (which uses a 32 bit CRC) spits out the checksum and a file size. Why doesn't md5sum do the same thing?
It does - The file size is used as part of the MD5 hash. The MD5 algorithm hashes the file, then appends the file size and hashes that too. If it didn't do this then you could create an MD5 collision just by appending zeros to the file.
It is the same document, just relying on differences in the document name (it appears) to generate the different pages.
No, you have missed the point. Go back and rtfa again. The attack still works if you rename the documents to the same filename.
The difference lies in a generated "binary cookie" in the beginning of the postscript documents. This "cookie" makes the postscript intepreter either select to show document 'A' or 'B'. The "thing" with the cookies are that they are carefully selected to be md5-colliding. Result: both documents have the same md5sum.
You can change the rest of the documents freely if you make the same changes in both documents. The md5sum will change, but it will still be the same for both documents.
So. No. It is indeed a md5 collission attack.
Open Materials Database