Apache Subversion Fails SHA-1 Collision Test, Exploit Moves Into The Wild (arstechnica.com)
WebKit's bug-tracker now includes a comment from Friday noting "the bots all are red" on their git-svn mirror site, reporting an error message about a checksum mismatch for shattered-2.pdf. "In some cases, due to the corruption, further commits are blocked," reports the official "Shattered" web site. Slashdot reader Artem Tashkinov explains its significance:
A WebKit developer who tried to upload "bad" PDF files generated from the first successful SHA-1 attack broke WebKit's SVN repository because Subversion uses SHA-1 hash to differentiate commits. The reason to upload the files was to create a test for checking cache poisoning in WebKit.
Another news story is that based on the theoretical incomplete description of the SHA-1 collision attack published by Google just two days ago, people have managed to recreate the attack in practice and now you can download a Python script which can create a new PDF file with the same SHA-1 hashsum using your input PDF. The attack is also implemented as a website which can prepare two PDF files with different JPEG images which will result in the same hash sum.
WebKit is apparently hosted in an SVN repository.
If you don't like it, don't use it. Personally, I love it.
Here's what it means: One major aspect of modern cryptography is "hash functions". A hash function essentially has the property that, in general, two inputs with very small differences will give radically different outputs. Ideally, a hash function also makes it hard to find "collisions", which are two inputs that have the same output. Hash schemes are used for a variety of purposes, including determining whether a file is what it claims to be (by checking that the file has the correct hash value).
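To make those two points concrete, here is a minimal Python sketch using only the standard `hashlib` module. The helper names (`differing_bits`, `matches_expected`) are made up for illustration; they are not part of any tool mentioned in the story.

```python
import hashlib

def sha1_bits(data: bytes) -> int:
    """Return the SHA-1 digest of data as a 160-bit integer."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big")

def differing_bits(a: bytes, b: bytes) -> int:
    """Count how many of the 160 output bits differ between two inputs."""
    return bin(sha1_bits(a) ^ sha1_bits(b)).count("1")

def matches_expected(data: bytes, expected_hex: str) -> bool:
    """Check that data is 'what it claims to be' by comparing its SHA-1."""
    return hashlib.sha1(data).hexdigest() == expected_hex.lower()

if __name__ == "__main__":
    # The two inputs differ in a single character, yet roughly half of
    # the 160 output bits are expected to flip.
    print(differing_bits(b"hello world", b"hello worle"))
    # SHA-1 of the empty input is a well-known constant.
    print(matches_expected(b"", "da39a3ee5e6b4b0d3255bfef95601890afd80709"))
```

The file check in `matches_expected` is only as trustworthy as the hash's collision resistance, which is exactly what the attack undermines.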
Every few years, an existing hash system gets broken and needs to be replaced. MD5 is an example of this; it was very popular and then got replaced.
One of the major currently used hash schemes is SHA-1. However, a few days ago, a group from Google described an attack that allowed them to easily find collisions in SHA-1 (easy here is comparative: the amount of computational resources needed was still pretty high). The group released evidence that they could do so but didn't describe how they did so in detail. They gave an example of two files with a SHA-1 collision and they also described some of the theory behind their attack. What TFS is talking about is how, based on this, others have since managed to duplicate the attack and some have made even more efficient variants of it; so effectively this attack is now in the wild.
I do not understand why many developers feel so strongly about version control systems. I wonder if carpenters feel the same way about hammers or if developers are just way too opinionated...
Not really: http://marc.info/?l=git&m=1156... .
Git hashes objects (commits, trees, blobs, tags) instead of individual tags. If you somehow managed to create, say, a commit with the same SHA1 as one already existing in a repository, attempts to push it to that repository would simply be ignored.
...instead of individual files...
Pretty sure they do.
Git and SVN work very, very differently under the hood. The fact that they rely on the same hash algorithm is irrelevant as they use it in very different ways.
Someone checked in PDFs that demonstrate the first engineered SHA-1 collision and this broke SVN. PDFs in question took 6500+ cpu years + 110 GPU years to generate. "In the wild" is a bit panicky & excessive.
What does this actually mean in terms of the integrity of repos and other things that rely on SHA-1? Does it merely break repos or does it open up injection attack vectors? How important is secure hashing in the guts of repos? What precisely is being secured? SHA-1 has been deprecated for SSL certs already, so you shouldn't be using certs with SHA1 sigs anymore. Myself, I'll keep an eye on how this develops and start thinking about using SHA-2, but I won't be replacing git or existing usage of SHA1 for password hashing anytime soon.
Since when is "Doing exactly what you want in an intuitive way is a basic function of any software" true? I thought that was the holy grail of software. I have still not used one source control system that I found hard to use, and in my experience git repos get messed up more often than others (might be because they are the most common). Some devs seem to have problems understanding remotes and rebase.
You are probably right :)
Because it was good once. Better than MD5. Changing it can break a lot of compatibility. So they don't change it.
If you are keeping software running over a long time, you need to balance compatibility, security and maintainable design. Otherwise such projects will take decades to develop and be out of date on release.
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
I wonder if carpenters feel the same way about hammers
Hahahah tip of the iceberg. I saw two carpenters on my house arguing about who had better screwdrivers. Yes people most definitely do.
I am trying to read their paper on the sha1 collisions over here: https://shattered.io/static/sh... and there's some unusual equation stuff.
mi = (mi3 mi8 mi14 mi16)1
Can anyone explain that to me in english?
Ah damn. My unicode got munged by the slashdot anti garbage filter. Should have hit preview first!
Anyway the symbol I was referencing is a circular arrow pointing in a clockwise direction that looks like the images on this page: https://en.wikipedia.org/wiki/... . I've never seen that in a paper. What does it mean when it's in an exponent?
It is a bitwise rotation. The direction and number specify if it is a right or left rotation and then how many bits to rotate.
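For anyone who wants to see it in code: in the SHA-1 standard, the message expansion step XORs four earlier 32-bit words and rotates the result left by one bit. Below is a small Python sketch of just that step, assuming the formula in the paper is this standard expansion. The function names `rotl32` and `expand_block` are made up for illustration.

```python
import struct

MASK32 = 0xFFFFFFFF

def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit word left by n bits (bits falling off the top wrap around)."""
    return ((x << n) | (x >> (32 - n))) & MASK32

def expand_block(block: bytes) -> list:
    """Expand one 64-byte block into the 80 words used by SHA-1's rounds."""
    if len(block) != 64:
        raise ValueError("SHA-1 processes 64-byte blocks")
    # The first 16 words are the block itself, read as big-endian integers.
    w = list(struct.unpack(">16I", block))
    # Every later word is the XOR of four earlier words, rotated left by 1.
    for i in range(16, 80):
        w.append(rotl32(w[i - 3] ^ w[i - 8] ^ w[i - 14] ^ w[i - 16], 1))
    return w

if __name__ == "__main__":
    block = b"\x80" + bytes(63)           # first word is 0x80000000
    words = expand_block(block)
    print(hex(words[0]), hex(words[16]))  # the top bit wraps around to bit 0
```

The wrap-around is the whole difference between a rotation and a plain shift: a shift would drop the top bit, the rotation moves it to the bottom.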
A cryptographic hash function has the properties you mention, plus two more: it must not be easily reversible, and it must distribute results uniformly over its entire output space.
The latter is a property which is not guaranteed by most common checksums.
Thus, when you need a hash function to give a number to use as a handy "nickname" for a collection of data (e.g. for a hash look-up table, or for a content-addressable store like git to create addresses for a given content and thus give a serial number to a commit, or apparently also in SVN to give a simple number to designate commits), it might be a good choice to pick a cryptographic hash like SHA-1, because it guarantees you this additional property, which a vanilla checksum could lack.
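One way to see how structured a vanilla checksum is: CRC-32 is affine over XOR, so for three equal-length inputs the checksum of their XOR can be predicted from the three individual checksums. The sketch below (standard `zlib` and `hashlib` only; the sample strings and the `xor_bytes` helper are made up for illustration) checks that relation for CRC-32 and shows it does not carry over to SHA-1.

```python
import hashlib
import zlib

def xor_bytes(*chunks: bytes) -> bytes:
    """XOR together several byte strings of identical length."""
    length = len(chunks[0])
    if any(len(c) != length for c in chunks):
        raise ValueError("all inputs must have the same length")
    out = bytearray(length)
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            out[i] ^= byte
    return bytes(out)

def sha1_int(data: bytes) -> int:
    return int.from_bytes(hashlib.sha1(data).digest(), "big")

a, b, c = b"commit-A", b"commit-B", b"commit-C"
combined = xor_bytes(a, b, c)

# CRC-32: the checksum of the XOR equals the XOR of the checksums.
crc_predicted = zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c)
crc_actual = zlib.crc32(combined)

# SHA-1: no such shortcut is expected to hold.
sha_predicted = sha1_int(a) ^ sha1_int(b) ^ sha1_int(c)
sha_actual = sha1_int(combined)

if __name__ == "__main__":
    print("CRC-32 predictable:", crc_predicted == crc_actual)
    print("SHA-1 predictable: ", sha_predicted == sha_actual)
```

That kind of algebraic predictability is what makes it easy to steer a checksum to a chosen value, and it is what a cryptographic hash is designed to rule out.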
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
Does the Git usage of SHA-1 *really* cause silent problems? I'm not sure how Git works internally but I was under the impression that it hashes whole objects, like individual source files at least.
The individual objects inside git aren't files.
The individual objects are commits (i.e. the content of a patch file, plus some extra information such as pointers to the past commits to which this patch applies).
To make things easier, a handy number designates each commit; it is currently generated by SHA-1.
(Git is a content-addressable platform. You don't access objects by name, you access them by their content. But instead of using the whole content to access them, you use addresses generated by SHA-1 to access the various blocks.
So to say which parent commits the patch in a commit applies to, you just refer to them by the SHA-1 sum of the content of those commits.)
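For what it's worth, the address computation is easy to sketch for the simplest object type, the blob, which an earlier comment lists alongside commits, trees and tags. Git hashes a short header (`blob`, a space, the content length in decimal, a NUL byte) followed by the content. The function name `git_blob_id` below is made up; the header layout is the part that matters.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Compute the ID git assigns to a blob holding exactly these bytes."""
    # Header: object type, a space, the content length in decimal, a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

if __name__ == "__main__":
    # The empty blob has the same well-known ID in every git repository.
    print(git_blob_id(b""))         # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
    # The raw SHA-1 of the same bytes is a different value.
    print(hashlib.sha1(b"").hexdigest())
```

Because the header is hashed before the content, two files that collide under plain SHA-1 would not necessarily collide as git blobs; an attacker would have to build the collision with git's header in mind.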
A theoretical attack would be:
- try to generate 2 commits.
one adds a clean piece of code. the other adds a backdoored piece of code.
but both commits hash to the same SHA-1 so they would be considered as "the same content" by git.
Then try to force your target to re-download the whole repo from scratch from your backdoored history (otherwise git will simply ignore the commits with sha-1 sum that it already has - it thinks that it has the same content already).
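The "git will simply ignore it" behaviour can be imitated with a toy content-addressable store. This is only a sketch under loud assumptions: `ToyObjectStore` is a made-up class, not git's implementation, and it truncates SHA-1 to a single byte purely so that a collision can be found in a fraction of a second.

```python
import hashlib

class ToyObjectStore:
    """A toy content-addressable store keyed by a (truncated) SHA-1."""

    def __init__(self, id_bytes: int = 1):
        self.id_bytes = id_bytes   # 1 byte = 256 possible IDs (deliberately weak)
        self.objects = {}

    def object_id(self, content: bytes) -> str:
        return hashlib.sha1(content).digest()[: self.id_bytes].hex()

    def put(self, content: bytes):
        """Store content; return (id, True) if stored, (id, False) if the ID already existed."""
        oid = self.object_id(content)
        if oid in self.objects:
            return oid, False      # "already have it": the new content is ignored
        self.objects[oid] = content
        return oid, True

def find_collision(store: ToyObjectStore, target: bytes, prefix: bytes = b"backdoor ") -> bytes:
    """Brute-force a different content with the same (weak) ID as target."""
    wanted = store.object_id(target)
    n = 0
    while True:
        candidate = prefix + str(n).encode("ascii")
        if candidate != target and store.object_id(candidate) == wanted:
            return candidate
        n += 1

if __name__ == "__main__":
    store = ToyObjectStore()
    oid, stored = store.put(b"clean code")
    evil = find_collision(store, b"clean code")
    print(store.put(evil))         # same ID, stored flag is False
    print(store.objects[oid])      # still the clean content
```

Whoever gets their object in first wins, which is why the attack described here needs the victim to fetch the whole history from the attacker's copy.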
In practice it's currently not doable.
The only thing that google managed to generate is a pair of block series. Each series contain completely random junk. Both series end-up generating the exact same shasum even if the random junk is different.
- That is exploitable in a PDF (or any other binary format that supports scripting; you could even do it in an EXE): use the embedded scripting to present 2 different contents depending on which random junk is present.
- That is not exploitable in a sourcecode commit : you would need a believable explanation for why the random junk is present in the patched source code.
AND you would need a piece of code which reacts differently (normal vs. backdoor) depending on which random junk is present - to be able to pull that unnoticed would require "Underhanded C Contest"-level of ingenuity.
That's it, you only have blocks of random garbage.
Google currently can't produce colliding hashes from arbitrary pieces of data ("Hey google: here is legit script A, and that's malicious script B. Add a small nonce at the end so they both end up having the same sha-1 sum") ("Actually don't add a nonce, that would be too conspicuous, try to tweak the punctuation in the comments instead")
Also as you mention, further edits will be problematic :
if I edit script A and submit a patch, this patch will be valid, but will completely fail on top of script B.
A hash function takes an arbitrary string of bits and outputs a string of bits of a fixed length.
A CRC is an example of a hash function and a long CRC would probably be good enough for GIT or most repositories.
First pre-image resistance - this is a test of the one-wayness of the function. Given a hash value, it is difficult to find a pre-image that hashes to that value: given y, a string of bits of the hash output length, finding X such that h(X) = y is hard. MD-5 and SHA-1 are still resilient against first pre-image attacks.
Second pre-image resistance - given a message X, finding a different message Y such that h(X) = h(Y) is difficult. MD-5 and SHA-1 are still resilient against second pre-image attacks.
Collision resistance - it is hard to find two different messages X and Y such that h(X) = h(Y). Note the attacker here is free to choose both X and Y. Both MD-5 and SHA-1 are no longer collision resistant.
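The reason collisions fall first is that the attacker gets to pick both messages, so even a generic "birthday" search needs only about the square root of the work of a pre-image search. Here is a rough Python sketch on a deliberately weakened hash: SHA-1 truncated to 24 bits, which is an assumption made purely so the search finishes instantly. It is a generic search, not the Google attack.

```python
import hashlib
from itertools import count

def truncated_sha1(data: bytes, nbytes: int = 3) -> bytes:
    """SHA-1 cut down to nbytes, so collisions become cheap to find."""
    return hashlib.sha1(data).digest()[:nbytes]

def find_collision(nbytes: int = 3):
    """Birthday search: hash distinct messages until two share a tag.

    Returns (first_message, second_message, number_of_messages_hashed).
    """
    seen = {}
    for n in count():
        msg = b"msg-%d" % n
        tag = truncated_sha1(msg, nbytes)
        if tag in seen:
            return seen[tag], msg, n + 1
        seen[tag] = msg

if __name__ == "__main__":
    x, y, tries = find_collision()
    # With a 24-bit tag a collision is expected after a few thousand
    # messages (on the order of 2**12), whereas matching one *given* tag
    # would take on the order of 2**24 attempts.
    print(x, y, tries)
```

For full 160-bit SHA-1 the same generic search would need on the order of 2^80 hashes; the published attack matters because it does considerably better than that.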
So far however the two messages X and Y have to be nearly identical. They have to start and end the same way and the blocks that are changed actually have to be changed and tested together to make sure the hash function internal state changes only in a specific way. I can't create a document that says the rent will be $3000 per month and another that says it will be $30000. (I might create one that says it is $3149.21 and the other $53210.63 per month, like in the PDF example they played with a colour field). Also because of the way the internal state of the hash function changes we now have a way of detecting if someone is feeding a "funny" stream of bits into our hash function and detect this attack with a very low probability of a false positive.
I wonder if carpenters feel the same way about hammers or if developers are just way too opinionated...
Yeah, typical carpenter hammer arguments:
*) Hammer weight (usually 16-24oz for house framing)
*) Handle type (wood? Fiberglass? (fiberglass hammers suck tbh))
*) Is the face of the hammer smooth or textured?
"First they came for the slanderers and i said nothing."
Considering that several years ago everyone was told to move away from SHA1 because it wasn't considered secure given the then-theoretical attacks, this shouldn't come as a surprise. NIST has been very open about its processes as of late, with the AES process and more recently the SHA3 process. Even though no known issues exist with the SHA2 suite of hashes, they were proactive in going forward with the SHA3 process because SHA2 is mathematically similar to SHA1, so related attacks against the various SHA2 hashes may be possible. I would question anything that is just dropped wholesale from the government, like the whole botched EC crypto. Then there were the long-standing questions about the DES S-boxes: it turned out they had been strengthened against differential attacks, but no explanation was given as to why at the time. Even now the full set of parameters used for them hasn't been provided.
Time to offend someone
It seems obvious to me that a small string could be produced identically from two different long original texts. Even if that happens, the hash is NOT the original message, and a collision could happen. It doesn't mean that the two original texts are the same.
Am I right?
Yes. A hash is nothing more than a function mapping data of arbitrary size to an output of fixed, smaller size, so by definition there are always two inputs which yield the same hash. What makes crypto hashes secure is that actually finding such inputs is normally very, very hard, as is the related problem of generating an input for a given hash.