Apache Subversion Fails SHA-1 Collision Test, Exploit Moves Into The Wild (arstechnica.com)
WebKit's bug-tracker now includes a comment from Friday noting "the bots all are red" on their git-svn mirror site, reporting an error message about a checksum mismatch for shattered-2.pdf. "In some cases, due to the corruption, further commits are blocked," reports the official "Shattered" web site. Slashdot reader Artem Tashkinov explains its significance:
A WebKit developer who tried to upload "bad" PDF files generated from the first successful SHA-1 attack broke WebKit's SVN repository because Subversion uses SHA-1 hash to differentiate commits. The reason to upload the files was to create a test for checking cache poisoning in WebKit.
Another news story is that based on the theoretical incomplete description of the SHA-1 collision attack published by Google just two days ago, people have managed to recreate the attack in practice and now you can download a Python script which can create a new PDF file with the same SHA-1 hashsum using your input PDF. The attack is also implemented as a website which can prepare two PDF files with different JPEG images which will result in the same hash sum.
Another news story is that based on the theoretical incomplete description of the SHA-1 collision attack published by Google just two days ago, people have managed to recreate the attack in practice and now you can download a Python script which can create a new PDF file with the same SHA-1 hashsum using your input PDF. The attack is also implemented as a website which can prepare two PDF files with different JPEG images which will result in the same hash sum.
I take it you've never seen a team that for some reason merged current files on a pull to effectively revert commits pushed to the remote with their few intended changes, over and over and over, across several people, for days, before noticing that there was a problem. You will bang your head on a desk shouting "why didn't they notice there was a problem..."
I can't believe though that question was upvoted 13,182 times. According to this SA "most upvoted" query, it's the currently the second most upvoted question they have. There are a *lot* of questions on SA. Sorry, but there is something wrong when something that should be so obvious is a problem that is so commonly faced by so many people when trying to use a "common" and "most fit for the purpose" tool.
Unbelievably, the third most upvoted question is also related to git. Five of the results in that query are related to git. That's 25% of the top 20 upvoted questions on SA. No questions in the top 20 are related to subversion. I believe that is valid circumstantial evidence that git is difficult to use and understand, especially the concepts related to it. That doesn't mean it's bad, but if you're working with a team that doesn't have much SCM experience (thankfully that's becoming less of an issue quickly), and you don't have time to commit to everyone on the team being able to pick up the nuances of git for a project, SVN is a great way to go as it's extremely simple, mostly due to significantly less feature coverage. Though, $deity help you if you aren't sitting next door to the server.
Does the Git usage of SHA-1 *really* cause silent problems? I'm not sure how Git works internally but I was under the impression that it hashes whole objects, like individual source files at least.
The individual objects inside git aren't file.
The individual objects are commits (i.e..: the content of a patchfile, and a few information like pointer to other past commits to which this patch applies).
To make things easier, a handy number designates this commit - this is currently generated by SHA-1.
(Git is a content-addressable platform. You don't access object by name, you access them depending on their content. But instead of using the whole content to access them, you use addresses generated by SHA-1 to access the various blocks.
So to say which are the parent commits to which the patch in a commit applies, you just mention them by using the SHA-1 sum of the content of these commits).
A theoretical attack would be:
- try to generate 2 commits.
one adds a clean piece of code. the other adds a backdoored piece of code.
but both commits hash to the same SHA-1 so they would be considered as "the same content" by git.
Then try to force your target to re-download the whole repo from scratch from your backdoored history (otherwise git will simply ignore the commits with sha-1 sum that it already has - it thinks that it has the same content already).
In practice it's currently not doable.
The only thing that google managed to generate is a pair of block series. Each series contain completely random junk. Both series end-up generating the exact same shasum even if the random junk is different.
- That is exploitable in a PDF (or any other binary format that supports scripting. You could even do it in an EXE) : using the embed scripting present 2 different contents depending on which random junk is present.
- That is not exploitable in a sourcecode commit : you would need a believable explanation for why the random junk is present in the patched source code.
AND you would need a piece of code which reacts differently (normal vs. backdoor) depending on which random junk is present - to be able to pull that unnoticed would require "Underhanded C Contest"-level of ingenuity.
That's it, you only have blocks of random garbage.
Google currently can't produce hashes colliding from arbitrary pieces of data ("Hey google: here's is legit script A, and that's malicious script B. Add a small nonce at the end so they both end-up having the same sha-1sum") ("Actually don't add a nonce, that would be too conspicuous, try to tweak the punctuation in the comments instead")
Also as you mention, further edits will be problematic :
if I edit script A and submit a patch, this patch will be valid, but will completely fail on top of script B.
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
This is why git is not vulnerable in this specific instance. In git all objects are prepended with their type, in this case "blob". Of course if you had $100k (-ish) to burn, you could repeat this attack on a file that does start with "blob" to break git.
However you don't need to do this. This attack depends on reaching an intermediate state with specific properties in order to massively reduce the search space. Any attempt to hash a file that reaches one of these states can be detected and rejected. If you swap to using https://github.com/cr-marcstevens/sha1collisiondetection for all SHA-1 calculations, every instance of this attack can be detected and rejected.
Also I mis-spoke slightly and spotted my error after checking the paper again. The first pair of blocks have half of the same bytes, but produce an internal state with only 6 bytes of differences. The second pair of blocks, again only differ in half of their bytes, and exactly cancel out those 6 bytes of differences. See Table One on page 3 for the actual byte values.
09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.