Meaningful MD5 Collisions
mrogers writes "Researchers at Ruhr-Universität Bochum have found a way to produce MD5 collisions between human-meaningful documents. This could be used to obtain a digital signature on one document and then transfer it to another. The same technique is theoretically applicable to other hash functions based on the Merkle-Damgård structure, such as SHA-1." From the article: "Recently, the world of cryptographic hash functions has turned into a mess. A lot of researchers announced algorithms ("attacks") to find collisions for common hash functions such as MD5 and SHA-1 (see [B+, WFLY, WY, WYY-a, WYY-b]). For cryptographers, these results are exciting - but many so-called 'practitioners' turned them down as 'practically irrelevant'."
Unless I'm missing something, all these guys are doing is using a format that can contain an infinite amount of extraneous information that has no effect on how it's ultimately rendered, so same thing can be done with a
What happens when databases using GUIDs finally become large enough that GUID collisions are everyday occurances? Will we have to switch to using 256 bit keys?
Does this mean that MD5/SHA1 should no longer be used? What about AES's hash's?
Anyone want to explain the real world applications of this to someone who is considering turning in his nerd credentials after being unable to get the gist of this from the write up...and please don't tell me to RTFA, this is after all /.!
As an amateur cryptographer, I must say that labeling these attacks as 'practically irrelevant'
is at the very least misguided and at worst a shocking display of incompetence.
Stop the fixation with plain-text messages, most messages are not plain-text. Your average word document
contains loads of invisible data that doesn't get rendered. Pdf's contain "junk" data that doesn't get rendered either. Would
you notice a single bit difference in an MP3? Or a single pixel colour change in a jpeg? Hell, you can even do it in HTML <div style="visibility:hidden">Junk goes here</div>.
Mark my words, people will find in the next couple of months find two meaningful computer
documents that hash to the same value but are different byte-wise.
People undervalue these attacks because the attacker has to generate the collision before hand to use it.
To properly appreciate the power of these attacks consider the following senario.
Imagine we're agreeing a contract of employement and I'm your employer.
I give you the first word document that includes all the standard terms, however, I've also drafted
a Word document that contains a load of draconian clauses like banning you from working in any IT position five years
after leaving the company. By adding junk that doesn't render to both documents, I've managed to find to make the hash
of the two documents collide. Thinking I'm a nice employer, you sign the first document, which you do by signing the hash of
document. However, I now have your signature on BOTH documents. I now make sure the company IT system "forget" the first document
and I've successfully screwed you.
This is a human example, but there are other examples that apply in computer systems. The problem is that in many situations
the attacker can choose when you encrypt. Say you encrypt your e-mail conversation with your friend using S/MIME, many people click
"Reply" and the message body of the other persons method appears in the new message. Because of these attacks,
It's now no certainty that an attacker couldn't use this fact to construct collisions that an attacker could use.
As another security researcher said (paraphrased) It's like you're in building and you've just heard the fire alarm go off.
You can't see smoke but it's time to make your way calmly to the exit. That sums up the position with SHA-1 and MD5. Swap out the primitives
before you start seeing smoke.
It's not like we don't have alternatives anyway. Whirlpool uses the same wide-trail design principles has AES. It's slower than MD-5 or SHA-1 but it's much better designed. And beside, people would do well to realise you have to spend CPU cycles to get security.
Simon.
- but many so-called 'practitioners' turned them down as 'practically irrelevant'."
"Just Smile and Nod." --Huck
http://www.cits.rub.de.nyud.net:8090/MD5Collisions /
Anything you hash together will ALWAYS result in collisions.
Extracting the formatting and code from the document will just make it EASIER to create a duplication.
"Hello World!" might match with "Hello World!!!!!! this is extra stuff"
at least leaving the exact formatting instructions in gives you a chance that the collision which leaves a compatible file is much more difficult than the hash of the simple raw text.
liqbase
See Schneier's Blog for more thoughts on the subject. I am sure it will get fleshed out more as more details emerge.
Throw the bums out!
"Encryption" is just another flawed method of concealing information behind obfuscated algorithms. History has proven time and again that such techniques are inevitably compromised, and therefore useless.
My suggestion: if you want your data to be protected that well, don't transfer it over electronic media.
Socialism: A feeling of discontent and resentment caused by a desire for the possessions or qualities of another.
If MD5 is found to be insecure, what are the alternatives we can use when signing our open-source packages? Is there any other alternative that is even approaching the widespread use of md5sum?
Cyric Zndovzny at your service.
For those who can't convientently view PostScript files, the text of the two letters:
Julius. Caesar
Via Appia 1
Rome, The Roman Empire
Alice Falbala fulfilled all the requirements of the Roman Empire
intern position. She was excellent at translating roman into her gaul
native language, learned very rapidly, and worked with considerable
independence and confidence.
Her basic work habits such as punctuality, interpersonal deportment,
communication skills, and completing assigned and self-determined
goals were all excellent.
I recommend Alice for challenging positions in which creativity,
reliability, and language skills are required.
I highly recommend hiring her. If you'd like to discuss her attributes
in more detail, please don't hesitate to contact me.
Sincerely,
Julius Caesar
Julius. Caesar
Via Appia 1
Rome, The Roman Empire
May, 22, 2005
Order:
Alice Falbala is given full access to all confidential and secret
information about GAUL.
Sincerely,
Julius Caesar
What these researchers did was not to improve the known attacks on MD5, but to demonstrate a clever way of turning the known attack, generally considered to be of theoretical interest only, into an attack that could potentially really be used.
The way they did it was to create a postscript document that actually contains two documents, one that the sender would be willing to sign and one that he presumably would not. The full text of both is contained in the file, but near the beginning of the file is a bit of code that compares two blocks of random-appearing bits, call them A and B. If A == B, the postscript interpreter will select the innocuous message and display that. If A != B, the interpreter will display the other message.
The researchers then generated a pair of blocks with the same MD5 hash. In one copy of the postscript file, they used one of these blocks as both A and B. In the other copy, they used one block as A and the other as B. Because every bit of both documents before and after the two blocks is identical, and because those blocks hash to the same value, the documents hash to the same value.
It's an interesting attack. It only applies to documents that are also programs, in some sense, but we use lots of document formats that fit that description.
A simple countermeasure that makes such an attack more difficult is to compress the documents before signing.
Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
Maybe now, people will finally start migrating to a different hash algorithm for signatures. MD5 has been around long enough that people have known this sort of thing is possible for some time. It's only now that they finally sat down and did the proof-of-concept.
Of course, MD5 passwords are probably still safe, for now (between maximum password lengths and the fact that this attack will have a hard time actually doing that), but it's only a matter of time.
-Amalcon
Are you kidding? ANY article with the word "hash" in it is going to grab the attention of even the weakest synapse. ôô
2^128 is huge. It's larger by far than the number of all the files in all of the computers in the world. It larger than the number of stars in the universe. Chance collisions will not become an everyday occurance. No accidental collision has ever been found yet. Switching to larger keys will not change anything. Sure, they might make it slightly harder to make a deliberate collision (although I don't know for a fact that they make it harder at all, there were some reports of someone in Japan being able to create a collision by hand with only pencil and paper), but just wait 2 months and the computing power will catch up with that. It's not a matter of the size of the hash function.
I'm an American. I love this country and the freedoms that we used to have.
Regarding being "practically irrelevant"
/ explore-items/-/0764569597/0/101/1/none/purchase/r ef%3Dpd_sxp_r0/104-8074733-7395136
"every time [some software engineer] says, 'nobody will go to the trouble of doing that,' there's some kid in Finland who will go to the trouble."
Taken from Kevin' Mitnik's "The Art of intrusion"
http://www.amazon.com/exec/obidos/tg/sim-explorer
Is the answer then to create a hash that is in fact the sum of two distinct hashing alogrithms? For example, use MD5 and SHA1. Since the alogrithms use different methods, the set of collisions from one would not overlap the set from the second (or rather the overlap would be vanishingly small). And if the overlap was larger than you wanted - just keep adding different hash alogrithms until you are satisfied.
I realise that the computations involved aren't cheap, but it becomes a trade-off between security and speed - the more sure, the more time it will take.
Compare the two files, and you see that there were only 7 bits that changed, all on the 5th most significant bit of the affected byte.
I guess it's obvious from looking at the ps file in text form, but if it's that easy to mangle postscript to display two different layers (or is it changing comments or pointers? I am not a binary postscript parser.) I still don't think it's time to throw md5 away yet. Six months.
What other hashing alternatives are there these days? SHA-1 apparently has the same kinds of weaknesses.
Doing the Right Thing should not be preempted by making a buck.
We could just couple it with another widely used industry standard
" Yesterday upon the stair I met a man who wasn't there. He wasn't there again today. I wish that man would go away."
Care to explain why?
The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F
There are 2 reasons people do this. 1) decryption takes longer than computing a checksum. and 2) this would force everyone to decrypt the file in order to use it. Only paranoid people and those who realize bits get corrupted bother to verify checksums now.
Reliable solutions are entirely available, people just need to start using them. BTW, I'm no expert but I think there are simpler ways than that suggested above.
I find it hard to believe that even Slashdot readers find this interesting.
This latest development should not only concern us nerds, but the rest of us as well. You see, this time they demonstrated that two messages can have the same hash, and neither of them need look like gibberish.
In other words, one might read "Bow before me and my enormous todger!" while the forgery could be "Warning: you need an electron microscope to view my penis". How can one tell which is real and which is fake?
Use ISO 8601 dates [YYYY-MM-DD]
At first I thought: Postscript! Well, obviously. To find a collision, you've probably got to hide a clump of randomness in the document, and then rotate that clump until the hashes collide. If you tried to hide random data in a text file, it would be obvious to the person signing it. You need some format to hide the random bits from the viwer.
I bet the random parts are REALLY BIG! I mean, you'd probably need a lot of random data before you could find a collision...
Then I downloaded the files...
There's almost nothing to them! I can't read PS, so I'm not sure how many of that handful of bytes at the beginning might be tweakable... but it's a lot less than I expected.
Collisions must be very easy to find! I am now offically very worried about this.
...or maybe not.
in my next big project?
In all seriousness, I believe Schneier's right. We need a competition for a new hash function.
Nah, let's just wait for 24 to drop the words "MD5" before we know it's really bad.
- Relative expected values of gian vs. loss: The attacker thinks "I know I can gain a #BIG_NUM million dollars" and devotes their full effort to the attack. The defender thinks "I'm safe, there's a low probability, and I'm sure I'll catch the problem before it becomes real money, " and does not not devote effort to security becuase a Gartner report told him it was over-hyped. Thus, the attacker's perceived expected value is much higher than the defenders perceived expected loss and each invests accordingly.
- Rising Complexity: As IT systems become more complex, they become less secure. Each new device, new networking protocol, new physical layer, new OS feature, new networked application provides new opportunities for the attacker and a dilution of security resources for defenders.
- Time: The attacker has the advantage of time. New algorithms, new mathematical theories, new exploits, and faster processors all favor the attacker. What once was supposed to take the age of the universe to crack can be decrypted with a quickly declining number of networked (even zombied) PCs.
- Curse of Compatibility: Because so much crypto and security is networking related, it is subject to implementation delays caused by the need to be compatible. Defenders continue to use old, vulnerable systems to maintain compatibility with key partners. Patches don't solve the problem because the patch itself can introduce incompatibilities that make defenders leary of applying the patch with a very real chance of causing problems to avoid a hypothetical security issue.
The bottom line is that the defender must protect all vulnerabilities while going about the day-to-day business of using the computer. In contrast, the attacker can devote full time to any weakness of their choice.Two wrongs don't make a right, but three lefts do.
If things went to court, it should be fairly easy to obtain a copy of the forged electronic document and look for added junk.
A novice might miss it, but a trained eye could easily see the garbage.
What worries me more is executeables where you're depending on the hash to match. If the file sizes are approximately the same, it would be very easy to trick someone into running something that is actually something else.
Well, you're of course a troll, but for people who might not be very informed about this... That unlikely event that you talk about has just been purposely produced. And they could have used any different collision they wanted for this effect, since if you append the same content (ANY content will do) to two files which are an MD5 collision, you end up with... two files which are an MD5 collision.
The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F
If you don't know what a hash function is, then turn in your nerd credentials now.
Basically, they provided an example case where one of these recent methods to generate hash function collisions can be turned into a "real world" attack.
It's a very simple example case, but it demonstrates the point effectively. The point is that these recently discovered methods to generate collisions quickly are a real threat to any software using them as a method for digital signatures and such.
The real world application here is that it is possible, probably in several good ways, to generate a couple of different files that have the same hash and also have meaningful data in them. The attacks found that generate seemingly random data with the same hashes can be used in ways that will let them apply to non-random, purposefully designed data.
The example they use is where some secretary gets her boss to sign a document, and then uses his signature on another document which gives her access she shouldn't have. It's a way to forge a digital signature on a document by having them sign another one that you specially crafted.
- Give a man a fire and he's warm for a day, but set him on fire and he's warm for the rest of his life.
I forget where or when exactly, so please feel free to run a search if you care to... it was here on Slashdot though.
There was talk about someone being able to foil P2P networks by seeding bad stuff through random data formulated to fit the MD5/SHA1 code from legitimate files shared on those networks. The consensus was that it was BS and that even if it weren't BS there could be updates to make such attacks more difficult or impossible to perform.
Am I missing something or are these two stories relevant to each other?
Lenstra and others came up with a way to generate syntactically-correct X509 certificates that collide under MD5.
Here's a link to the paper: Lenstra et al.
I hope all those who said the previous MD5 attacks were nothing to worry about will take it back. A theoretical break or partial break is almost always followed by a practical break. The lesson to learn from this is not to do the same thing with SHA1 and, I would suggest, AES. The vulnerabilities might be impractical now but they won't stay that way. Move now, while you're still at least partially secure, rather than once there's a practical break.
I am trolling
Is that the same device that was used against WTO protestors in the United States?
Go America! One day you will be as oppressive as China.
--- Slashdot can't count:
Slow Down Cowboy!
Slashdot requires you to wait 2 minutes between each successful posting of a comment to allow everyone a fair chance at posting a comment.
It's been 25 minutes since you last successfully posted a comment
The GUID (as called by Microsoft) has it's history in the UUID, which was first "invented" by OSF as part of it's big DCE specs (long before Microsoft adopted the format and gave it a spiffy new name).
The UUID and GUID look alike, but the UUID was constructed in a much different manner than GUIDs seem to be nowdays. GUIDs are basically 128-bits of random data (usually made by passing pseudo-random seeds through a hash). UUIDs on the other hand contained structured data too in addition to randomness, to try to prevent accidental reuse. For instance the rightmost 48 bits come from the Ethernet MAC address if available. Other bits contained date/time information, process/thread ids, etc.
There was an expired Draft RFC written in February 1998 which explained a lot of the technical details of GUIDs and UUIDs (surprisingly authored by Microsoft). You may still be able to find copies of it out on the net (the filename was "draft-leach-uuids-guids-01.txt")
It is an interesting attack, and IANAL, but I'd be curious about the legal ramifications. If I slip a carbon (ah... the way-back machine) in a stack of papers and ask someone to sign the top one without thus informing them, I think my stealth probably invalidates the additional document(s).
You could argue that there's a noticeable difference between pen and carbon -- making the copy hard to enforce -- but I'd argue the digital version is even easier: at least in the PS example, both "copies" of the document need to be present to preserve the hash.
In normal (pen/paper) signature situations, I get a copy of what I signed. The same ought to apply to digital sigs, resulting in a simple legal challenge to the validity of the document.
[1]a tes/ddl-full.pdf
[LdW] A. Lenstra, B. de Weger: On the possibility of constructing meaningful hash collisions for public keys
http://www.win.tue.nl/~bdeweger/CollidingCertific
[2]
Using the length extension property present in MD5 and all other hash functions following the (almost omnipresent) Merkle-Damgaard design principle, we constructed a pair of documents to display either the letter or the order.
MyHash(ValueToBeHashed) = ValueToBeHashed
I have done extensive testing on this algorithm and have not been able to produce any collisions yet... I'll keep you guys updated. (I also used this new function to hash this message for verification!)
--------
MobileMrX Hash:
I am currently working on a hash function that returns the exact value of the hashed value. The algorithm works a little something like this:
MyHash(ValueToBeHashed) = ValueToBeHashed
I have done extensive testing on this algorithm and have not been able to produce any collisions yet... I'll keep you guys updated. (I also used this new function to hash this message for verification!)
This attack shows us all once again that there is that the procedures for using cryptography are as important as the mathematical theories and proofs on which cryptography is based. People like to believe that it's just the algorithm that's important, and once you have such an algorithm it's equally applicable to messages of all sorts and formats. As this shows, it's clearly not the case.
.ps or .doc just as readily as a simple text file.
You may believe it's common sense, but to the average user, encrypting a simple letter like the memos used in the article expressed as a Word document is no different than encrypting a simple text email. Heck, many of these users probably have no idea that much of the plain-looking email they send and recieve is actually HTML, which is capable of hiding beneath its rendered surface all sorts of additional information.
When's the last time you saw an email program that read in a Word document, extracted just the plain text content, signed or encrypted it and then repackaged it into some new format in a cryptographically sound way that would automatically be reconstituted as a Word document on the other side? Most just have a handy "Sign" or "Encrypt" button that will happy accept
In this case, I think the lameness filter has been a little lenient...
MD5 seemed like a good way to check for duplicate jpg images, sometimes an image with the same data had a different name.
But equivelent MD5 hashes of different images made a mess of that scheme.
Oddly enough SHA had the same problem, though I didn't check to see if it was for the same set of images...
I thought it was maybe a problem with the python MD5 and SHA libs and abandonded the idea.
I wonder if the nature of jpg compression significanlty reduced the (mathmatical) image "space" and thus
raises the probability of hash equivelence for non identical images.
One should be able to see the effect with a many thousand jpg images.
As you rightly pointed out, collisions are a natural phenomenon linked to the limited space of the hash.
What is important -and the reason for moving away from md5 and sha1- is the fact that flaws in these algorithms can be used to derive collisions more easily than it should be possible to.
Problem is that it is extremely hard to prove that an algorithm is strong, it is much easier to disprove: just find an example. Science at its best!
TODO: 753) write sig.
Other attacks on MD5
What I haven't seen mentioned yet, and people perhaps haven't realized, is that in providing these two postscript files, they have essentially provided you with an postscript signature exploit kit!
:)...
All you need to do is download the two postscript documents and do *exactly corresponding edits* in both of them, and you get two documents saying different things and still have the same md5sums!
I just tried exchanging Alice's name for my own, and surely it did work.
Now, if they released a pdf-file hack, I would be genuinely worried
Open Materials Database
Clever, but it means the attack is not a general way to forge an MD5-signed document... you couldn't use this (for example) to seed a P2P network with malicious files that look like safe ones. It only works if you generate both documents, and it can only be used maliciously if it's never examined by an expert: the signer can't retain a copy of the signed document or obtain a copy through discovery.
They did point out that they were exploiting the fact that if the MD5 hashes of X1 and X2 collide, then the MD5 hashes of X1+S (read: X1 concatenated with S) and X2+s also collide.
As much as you didn't like it, Lucks and Daum had just turned that into a possibly-practical hash function weakness. If they can do this with Word...
Here's a short article Bruce Schneier wrote on how to become a cryptographer (as a civilian): http://www.schneier.com/crypto-gram-9910.html
obviously your "enourmous todger" is fake. It's easy to hide something tiny with something enourmous.
I am a leaf on the wind. Watch how I soar.
I would have thought that with a name like yours, the answer would be obvious.
Show me on the doll where his noodly appendage touched you.
last time we had this discussion , maybe 3-6 months ago.. and they were meaningless collissions.. i told everyone "give it another 6 months or so, and someone'll figure out how to spoof properly"
"Champagne for my real friends - and real pain for my sham friends!" http://ericblade.postalboard.com/
The combination of two weak hash functions is just another weak hash function. You might be able to buy yourself a small amount of time, but it will be broken. If there is a large set of operations that can modify the file without changing the first hash, and another large set of operations which can modify it without changing the second hash, you can construct the intersection of these two sets to find the modifications which keep both hashes (and therefore also their sum) the same.
Strong hash functions are difficult to design, and require a lot of research and testing.
I'll probably be modded down for this...
I found a copy of "draft-leach-uuids-guids-01.txt" here.
Or a single pixel colour change in a jpeg?
This is not possible to do, you know. Unless of course you store jpegs with zero compression, which sort of defeats the whole purpose of jpeg compression.
>Switching to larger keys will not change anything.
Switching to longer hash sizes (not keys) will help against the current generation of SHA attacks. The effect of the Wang attacks is to knock (if memory serves) 16 bits off the work factor to find a collision. In other words, finding a collision in SHA-1 is now 64K times easier than brute force. Make brute force harder and the attack gets harder.
Some numbers: SHA-1 produces a 160-bit hash. If you're trying to find a collision as oposed to generating one the birthday paradox works in your favor and you only need on average 2^80 trials for brute force. Wang has lowered that number to 2^64, which is uncomfortably small by today's standards and reckless by tomorrow's. SHA-256 produces a 256-bit hash: you do the math.
The big worry is that cryptographic attacks tend to inspire research that finds better attacks. There's an unquantifiable but real risk that SHA-1 could collapse altogether.
The crypto geeks are speulating that these weaknesses are inherent in any hash algorithm that can be calculated incrementally. Right now the output after block N of input depends only on blocks 1..N of input. If the intermediate result for block N depended on blocks N+1..EOF then the current attacks wouldn't work. Neither would hashing an input stream. Such a change would destroy the ability to hash a file in parallel with reading it.
I never thought MD5 was supposed to be used to authenticate the content of a file, only as a quick check for simple bit errors. I'm not buying any of this discussion about coming up with a new hashing algorithm. MD5 works great when it is used the way it is supposed to be. The trouble with coming up with something better is that you are trying to represent a large body of data with a smaller body of data. The ONLY situation in which your smaller body of data will absolutely be uniquely coupled with the original large body of data is in the case of lossless data compression such as LZW. Any other form of lossy data compression or tokenization will run the risk of duplicates matching the same fingerprint. Someone who comes up with a magical fingerprint that can be tied to one and only one large body of data will have discovered the Holy Grail of data compression.
Tripwire does this. They use CRC32 and MD5. You could use MD5 and SHA and CRC32 if you wanted. It is probably not possible to get a pair of files that will collide in all three of those hashes.
Therefore, no matter how many algorithms you sum up using your described method, the number of collisions is still infinite in amount. It is not the algorithms that are flawed, rather, it is the fundamential concept of hashing that allows collisions to happen.
I would assume that the way to reduce the number of collisions is by increasing the length of the hash itself so as to increase the number of unique hashes.
Apply digital signatures to X not H(X). So what if sizeof(signature)=toobigtohandle, it's better than validityof(signature)=uncertain.
Actually, for smallish documents where the equivalent of a notary signature is required, this may be the way to go.
Sigh.
In the meantime, expect courts to allow people to repudiate supposed signatures when the validity is in doubt, the same way banks allow individuals to swear under oath that a signature on a stolen or altered check is fake.
In the meantime, I'll sign my employment contracts the old fashioned way and keep a copy in my files.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
See http://en.wikipedia.org/wiki/Birthday_paradox for details.
Basically if you have enough documents you will be able to find two with the same hash. Not to be mistaken with the difficulty of finding a document with the same hash as a GIVEN document. This cannot be avoided regardless of the hash.
To foil a birthday attack, NEVER agree to sign a given document. One must ALWAYS add some random junk to the document (clear text or hidden). This should be implemented in sofetware that helps in signing, but if the option isn't given, you can add a clear text "random" string.
Yossi
but almost. MD5 it's obsolete now and SHA-1 have been phased out. While things like this demonstrate some "probable but yet to be seen" attacks and in theory are good to see the weak spot of this kind of hash functions, the fact is that by now every serious cryto soft must be using high order sha, like SHA-256 or SHA-512 or stronger has functions like Tiger or Whirpool. There is no real threat. Some people said here in /. that Schneier wants a new competion, well, that's good, but not really necessary as SHA was designed with this kind of problems in mind (every hash funcion has collisions, you can't avoid that) that's why there are high order SHA funtions.
"Only two things are infinite, the universe and human stupidity, and I'm not sure about the former" - Albert Einstein.
If you're signing hashes of documents without seeing the document itself you're a huge fool. And if you're seeing the document (presumably on your own machine, since security-conscious people don't go leaving their private keys around everywhere) you can make a digital copy of it, and keep it, easily.
I am trolling
Every hashsum necessarily has collisions. There are two reasons why cryptographic hashes work anyway: 1. The value distribution (a small change in the input leads to a completely different hashsum). 2. The large set of possible hashes. MD5 is 128 bits. That allows roughly for 3x10^38 values. That is huge number. SHA-256 is 256 bits which allows for roughly 10^77 hashes. That is only within orders of magnitude of the estimated number of atoms in the observable universe (between 4×10^78 and 6×10^79). So I think it is safe to say that nobody will ever see or be able to generate a collision with SHA-256 or higher (unless there is a flaw in the algorithm).
Well, not so much a secure scheme as making pad distribution a lot easier.
Suppose I'm a spy and I want to encrypt 4GB of data to send to my HQ. I need a 4GB pad.
We arranged ahead of time for me to memorize a list of movies and byte-offsets. My first pad is the DVD for that movie, starting at the prearranged byte-offset.
I rent the movie at the local video store, watch the movie, then burn a DVD with the encrypted data and find some way to smuggle it back to HQ. If it's compromised, no big loss assuming they don't trace it back to me.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
What if Microsoft was to create a hash collision in their Longhorn CD image? Then when someone makes a torrent of the released version, they flood the network with the bogus version instead.
The problem in this case isn't the hash weakness as much as the protocol weakness. If Ceasar, instead of just hashing the content that was handed to him, hashed the content + a random string he generated and then provided the random string and the signed hash output, then this attack would be impossible.
HTTP digest authentication does just this with the nonce and cnonce strings. Sure, MD5 may be weak, but digest authentication that uses MD5 isn't. So this criticism of implementors not taking this seriously is largely misplaced: most people only use MD5 in ways that aren't impacted by the attack.
Liberty you never use is liberty you lose.
At the risk of being modded insightful by proving myself wrong, it should be noted that PKCS#7 and, by inclusion, S/MIME is subject to this sort of attack. PKCS#7 signing with "unauthenticatedAttributes" hashes the document directly.
Liberty you never use is liberty you lose.
The only case I can think these techniques might be usable is if you wanted to change a decimal point to a comma or add a zero or something like that on a number. Changeing an entire text word in the document would be practically impossible. This might be somewhat useful for some forgeries but not many..
Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
Ahem. Rainbow tables?
Why is ROT13 mod'ed funny every time a crypto discussion comes up? (After the 10th time...it's not funny anymore.)
Signing such a non-plaintext file by looking at it in a viewer only (and (a bank) accepting such a signed document) is the mistake. You don't even need MD5 collisions to exploit that.
Suppose you write a HTML file with some java-script embedded that displays one text if the date is before X, and another if it's after. you get someone to sign it before X, and then use it afterwards.
Similar, in general, different viewers interprete (broken) files differently. So if you know which viewer the signer and the bank use, you could probably write a postscript file that displays different text in those viewers.
No checksum or cryptographic mechanism is going to prevent that. Nor is displaying files differently than other viewers a security problem that should be fixed in every viewer.
So the only way to prevent this, is not signing "binary" files (not fully plain text human parsable) written by someone else.since the first news about hash-weaknesses i turned to use "multi-hashes". meaning, hashing the whole file first then just hash some more fixed defined small parts of it with a "weak hash" like md5. then i pulled out a "hash of all hashes" which wont be too easy to reproduce or fake.
I must say it is a remarkable piece of work, creating two different documents that make sense after rendering. In case of these documents, some evidence of tampering might be easy to find upon closer examination of the files, but its still a very important thing.
This also makes me wonder on how widespread the consequences may be. Using MD5 and similar digests is common practice, and was until recently considered quite safe.
As the bittorrent protocol uses extensive hash checking, could this new discovery mean that MPAA and the likes will be able to insert corrupt data to torrent networks? I can imagine this would be a very real 'application' of this research, if it allows one to create blocks of data with hashes and sizes as desired.
Even worse, in torrent networks, these poisoned blocks will probably be re-distributed by all peers - or am I missing something?
There are (almost) fully random UUIDs, also. There's one UUID format in which a couple bits denote "the rest of this UUID is random" basically. Other types of UUIDs use the MAC address and a timestamp, or sometimes a randomized MAC section (in which case a bit in that section will be set which is never set in a real MAC address).
Liberty in your lifetime
This kind of information will wreak havoc with digital forensic departments. Right now we use various checksums, the most common being MD5, to verify the integrity and accuracy of a drive image/duplicate. Even if the chances of an accidental collision are so remote, the fact that people are researching and reporting on the fact that it can be done means proving this is a duplicate in court will be harder. Best case scenario, jury is shown reports like this, you then have reasonable doubt in the jury that the drive may NOT be an exact copy. Worst case scenario, the defence claims your lab doctored the results, and then they "prove" it with these algorithms and reports.
...a single hash algorithm. Calculate multiple hashes of the same document: MD5, SHA-1, CRC-32, and whatever else you can think of. Then sign all of them at once.
Now the attacker doesn't just have to generate a new document with the same MD5 sum, he has to generate a new document that has the same sum under all those hashes simultaneously. If someone manages to figure that out within the next 5 years, I won't just eat my hat, I'll switch on the Infinite Improbability Drive and turn into a hat.
Visual IRC: Fast. Powerful. Free.
i have always wondered why they wouldn't use someone's browsing history as well as the MAC address as part of the "random" data.there is lots of data there to play with like time,sitename,site I.P address,etc. Of course i've never really been "crypto guy", so there might be a reason,I just don't know about.
ACs don't waste your time replying, your posts are never seen by me.
http://en.wikipedia.org/wiki/Caesar_cipher
Hmmm.... when you say "content + a random string" are you talking about an appended string, like a nonce?
Because this attack is invulnerable to that - MD5 has a lovely property that if A and B collide, then appending the same random string to both A and B results in another pair of collisions. So it is not enough to append a random string to the document before signing.
"Go to CNN [for a] spell-checked, fact-checked summary" -- CmdrTaco
substitute "the last bit of each byte" or "padbyte=rand(DVD-byte)" where rand() is a random-# generator and it's random enough.
The keys to this scheme are:
1) your adversary doesn't know DVDs are the basis of the one-time pad. With one-time pads, security through obscurity IS security.
2) your adversary doesn't know the algorithm to turn the DVD into a pad. It could be something as simple as a bit-offset like I suggested, or as complex as you want it to be, as long as it's easy to impliment. Heck, it can even be 2 DVDs xor'd together.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
I can't believe that nobody, including Bruce Schneier, has pointed out the well understood principle that you *never* blindly sign something provided to you by someone else.
Secure schemes involving one party signing material provided by another party should always involve a signing nonce. This is a random bitstring appended to the material to be signed. In the example from the article, if Caesar had added a nonce before signing the "good" document, then Alice would not have been able to produce a "bad" document with a matching hash.
However, as someone pointed out, this kind of attack is possible without using hash collisions. You simply have to convince your victim to sign something that is code-like (such as the postscript files used in the example, or an HTML document with javascript, or a Word doc with Visual Basic macros, or an executable) without making an expert examination of the complete contents of the code. The code-like "document" can be written to present one set of behavior under one set of circumstances and a completely different behavior under other circumstances. If the signer is not competent to examine the program and be fully aware of what it does, then the signature can have no meaning.
The attack demonstrated in the article is *still* of no practical consequence in my opinion. It doesn't mean that signed code can't be trusted; trusted code is always signed by it's creator, meaning that if the signature is valid *and* you trust the creator, then you can trust the code. It is still impossible for someone who didn't create the signed code to create a different piece of code that has the same hash, so you don't have to worry about an attacker replacing signed trusted code with signed malicious code. As always, you still have to worry about misplaced trust in the code's creator.
What the attack in the article *should* be trying to make clear is that a signature is only as trustworthy as the signer's ability and desire to evaluate what is being signed. A third party auditor that signs code after reviewing it doesn't ensure that the code is trustworthy, it only ensures that the auditor *claims* the code is trustworthy. What the implications are for the actual trustworthiness of the code depends on the trust you place in the intentions and abilities of the auditor.
The random type of UUIDs, which aren't guaranteed unique, but are merely probabilisticly unique -- and which don't use MACs at all -- are meant to be fallbacks if a MAC is unavailable on your system, or if one has privacy concerns and doesn't wish to broadcast a UUID's location of generation.
Liberty in your lifetime
I'd be interested to see how this attack is actually useful in the real world. Even given that they're using a code-base format like postscript, under what strange scenario is an organization going to rely on important documents being digitally signed by people other than the original creator? Any kind of contract or important document is going to be hand signed with at least one witness.
This attack is easily revealed too. A simple examination of the postscript documents shows that it contains two versions--the neferious and non-nefarious versions. Not just that, but the person signing the document will most likely have a copy of the non-nefarious version stored, so if someone tries to pull some fraud, they just whip out the version _they_ received and show that the hashes match up.
The only other use I've seen usggested is for generating two certificates (or any equivalent) with different public keys. But what would be the use in that since it would have to be the same person generating both certificates? Congratulations Mr. Hacker, you can generate two certificates with different keys that both appear to come from Mr. Hacker.
Even without technical knowledge of all this, could you really get away with any of this? If a contract or official "command" document of some kind was forged, people would find out the instant someone tried to take advantage of what they have done that something fishy was going on.
Really, how dense do you have to be to not notice that suddenly a low level employee is going into high access areas despite no one being informed about it beyond a single electronic document? How dense do you have to be to a) not notice a big change in a contract and b) not keep a personal copy of what you signed? Frankly, it would be easier to just swap *printed* forms of contracts than it would be to do this electronci forgery since no one uses electronic contracts like that anyway.
Which part of the above is MD5-specific? It sounds like the above would be possible regardless of the hash algorithm used? Or is it that they've found a method to find X and Y that both hash to the same value quickly enough with MD5. From your description it sounds like X and Y can be very small though, in which case it shouldn't take too long to find X and Y "brute-force" for any hash algorithm?
Tell me how this applies?
-- Ender, Duke_of_URL
Prepending a string might work, though. Just a thought.
In my scenario, over the life of my spying, I have a very limited amount of data to send. Say it's 2KB per week for 10 years, that's about a MB.
There are thousands of commercial DVDs and tens of thousands of commercial CDs available in both my country and his.
For each CD or DVD, assume we use the key=gzip(/dev/dvdrom)+offset, using only bit #n out of each byte, and discard the rest, and wrapping around one time (and one time only) to get the full length of the gzip'd file.
In effect, I have probably 5% of the bytes on any CD or DVD as part of the pad, and the effective pad length is 5% times the combined size of all CDs and DVDs whose names I can commit to memory.
To make matters worse, my adversary doesn't know I'm using DVDs and CDs as a data source.
Now, of course, if my adversary can figure out that I'm using commercial DVDs as the basis of my data, then I've lost a huge amount of security, but not nearly as much as if he somehow intercepted an actual one-time pad - in that case it would be "game over, man."
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
Seems to me that A and B have to end at MD5 block boundaries otherwise md5(A+Z) != md5(B+Z).
Yeah, you're probably right - but if you're already engineering A and B so that the sums collide, it shouldn't be too difficult to ensure that they also end at a block boundary.
"Go to CNN [for a] spell-checked, fact-checked summary" -- CmdrTaco
If I was a troll I wouldn't have 5 mod points every week. You have to use your brain when reading what I say. I know the kinds of stuff I could post that would give me +5 insightful, or +5 funny. However, I choose to post my true feelings. When some douche mods it offtopic, it doesn't make me a troll.
Cover your eyes and click this link!
Maybe you're not a troll, but your post was a troll. Did you even read the article before posting it? If you had, I don't think you would say what you said...
The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F