Apache Subversion Fails SHA-1 Collision Test, Exploit Moves Into The Wild (arstechnica.com)

← Back to Stories (view on slashdot.org)

Apache Subversion Fails SHA-1 Collision Test, Exploit Moves Into The Wild (arstechnica.com)

Posted by EditorDavid on Sunday February 26, 2017 @08:37AM from the making-a-hash-of-it dept.

WebKit's bug-tracker now includes a comment from Friday noting "the bots all are red" on their git-svn mirror site, reporting an error message about a checksum mismatch for shattered-2.pdf. "In some cases, due to the corruption, further commits are blocked," reports the official "Shattered" web site. Slashdot reader Artem Tashkinov explains its significance: A WebKit developer who tried to upload "bad" PDF files generated from the first successful SHA-1 attack broke WebKit's SVN repository because Subversion uses SHA-1 hash to differentiate commits. The reason to upload the files was to create a test for checking cache poisoning in WebKit.

Another news story is that based on the theoretical incomplete description of the SHA-1 collision attack published by Google just two days ago, people have managed to recreate the attack in practice and now you can download a Python script which can create a new PDF file with the same SHA-1 hashsum using your input PDF. The attack is also implemented as a website which can prepare two PDF files with different JPEG images which will result in the same hash sum.

23 of 167 comments (clear)

Min score:

Reason:

Sort:

Re: In other news by Cesare+Ferrari · 2017-02-26 09:09 · Score: 3, Insightful

Actually, svn is just about the perfect source control system if you want something quick and dirty that you can understand. git (I presume that is what you would propose as an alternative) adds no features to many small software development teams. Fortunately the svn->git migration path is well trodden.
If you had mentioned cvs, or rcs, then i'd agree ;-)
Here's what it means by JoshuaZ · 2017-02-26 09:20 · Score: 4, Informative

Here's what it means: One major aspect of modern cryptography are "hash functions"- a hash function is a function which essentially has the property that in general two inputs with very small differences will give radically different outputs. Also, ideally a hash function will also make it hard to detect "collisions" which are two inputs which have the same output. In general, hash schemes are used for a variety of different purposes, including determining if a file is what it claims to be (by checking that the file has the correct hash value).
Every few years, an existing hash system gets broken and needs to be replaced. MD5 is an example of this; it was very popular and then got replaced.
One of the major currently used hash schemes is SHA-1. However, a few days ago, a group from Google described an attack that allowed them easily find collisions in SHA-1 (easy here is comparative- the amount of computational resources needed was still pretty high). The group released evidence that they could do so but didn't describe how they did so in detail. They gave an example of two files with a SHA-1 collisions and they also described some of the theory behind their attack. What TFS is talking about is how based on this, others have since managed to duplicate the attack and some make some even more efficient variants of it; so effectively this attack is now in the wild.
1. Re:Here's what it means by Lisandro · 2017-02-26 09:38 · Score: 5, Informative
  
  FWIW, you're correct, but "hash function" englobes much more than that. Technically, a CRC is, by definition, a hash function. So is bit parity.
  A cryptographic hash function has the properties you mention, plus the fact that it must not be easily reversible and uniformly distribute results over its entire output space.
2. Re:Here's what it means by complete+loony · 2017-02-26 11:34 · Score: 5, Informative
  
  Google produced two pdf's that differ in some binary data near the beginning of the file. The SHA-1 hash routine processes data one block at a time, updating its internal state. There are two consecutive blocks that differ between the pdf's. The first pair of blocks produce an internal state where half of the bytes are the same. The second pair of blocks then produce an identical state. The remainder of the pdf files is the same.
  So you can use these two pdf prefixes and append whatever data you want to them to produce your own pair of files. Pdf includes a programming language for rendering content. Within this language you can inspect the earlier bytes of the file to detect which version of the file you are rendering, and make some visual changes. So while there are only a few bytes that are different, you can make two pdfs that display different content.
  Nobody has invested the time to produce a new hash collision, but someone has already automated the production of duplicate pdf's based on this work.
  
  --
  09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
3. Re:Here's what it means by complete+loony · 2017-02-26 14:00 · Score: 5, Interesting
  
  This is why git is not vulnerable in this specific instance. In git all objects are prepended with their type, in this case "blob". Of course if you had $100k (-ish) to burn, you could repeat this attack on a file that does start with "blob" to break git.
  However you don't need to do this. This attack depends on reaching an intermediate state with specific properties in order to massively reduce the search space. Any attempt to hash a file that reaches one of these states can be detected and rejected. If you swap to using https://github.com/cr-marcstevens/sha1collisiondetection for all SHA-1 calculations, every instance of this attack can be detected and rejected.
  Also I mis-spoke slightly and spotted my error after checking the paper again. The first pair of blocks have half of the same bytes, but produce an internal state with only 6 bytes of differences. The second pair of blocks, again only differ in half of their bytes, and exactly cancel out those 6 bytes of differences. See Table One on page 3 for the actual byte values.
  
  --
  09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
Re: In other news by Puff_Of_Hot_Air · 2017-02-26 09:31 · Score: 4, Informative

The problem with git, and I don't see it as a major problem, is not that it's hard to get up and running, but rather that you can quite easily get into the kind of trouble that needs expert knowledge to get out of. If you don't happen to have a git expert handy; well, you are going to have a very bad day. In this git shares the same problem as most very powerful tools, for example C. So I agree with the original poster, if you need the facilities of git, then you'd be an idiot to not use it, but if you don't, then other tools are better for simpler uses cases. Much like using something like python makes a huge amount of sense for certain applications, and zero sense for writing kernels.
Re:FINALLY! by espenskaufel · 2017-02-26 09:32 · Score: 3, Funny

I do not understand why many developers feel so strongly about versions control systems. I wonder if carpenters feel the same way about hammers or if developers are just way to opinionated...
Re:FINALLY! by Lisandro · 2017-02-26 09:42 · Score: 2

Pretty sure they do.
"In the wild" - slight exaggeration by geekpowa · 2017-02-26 09:49 · Score: 2

Someone checked in PDFs that demonstrate the first engineered SHA-1 collision and this broke SVN. PDFs in question took 6500+ cpu years + 110 GPU years to generate. "In the wild" is a bit panicky & excessive.
What does this actually means in terms of integrity of repos and other things that rely on SHA-1? Does it merely break repos or does it facilitate injection attack vectors - how important is secure hashing in the guts of repos? What precisely is being secured? SHA-1 has been deprecated for SSL certs already so you shouldn't be using certs with SHA1 sigs anymore. Myself, keep an eye on how this develops and start thinking about using SHA-2 but won't be replaing git or existing usage of SHA1 for password hashing anytime soon.
1. Re:"In the wild" - slight exaggeration by serviscope_minor · 2017-02-26 10:15 · Score: 2
  
  "In the wild" is a bit panicky & excessive.
  No, it's really not. This demonstrates that SHA-1 is not only weak, but broken. One golden rule about security is that it never improves over time. It means that collisions are now possible, and are within reach of moderate sized organisations. Google can clearly manage, governments certainly can and any criminal organisation with a large enough botnet can manage too. This isn't just finding random data either: it's a practical attack whereby two valid PDFs both hash to the same value.
  The security will get worse over time, just like it did for MD-5. With MD-5 it took less than 3 years for someone to go from creating two valid documents with the same hash (poth PDF and PS support arbitrary data embedded for various purposes which makes them relatively easy targets) to a completely broken cryptographic certificate which broke the chain of trust entirely. Not only did it happen, but it took a scant 11 hours on a 30 node cluster, meaning practical, attacks were in range of a single, not well funded individual, only 3 years after the first collision was found. With SHA-0, it took about a year and a half to go from the first collision to fast collisions.
  It's hard enough migrating things and old systems tend to hang around for years or even decades, so you should be planning your migration right now.
  That is not to say that SHA-1 is unsuitable for content identification with non malicious inputs, it's fine for that, but so is MD-5.
  
  --
  SJW n. One who posts facts.
2. Re:"In the wild" - slight exaggeration by swillden · 2017-02-26 10:49 · Score: 2
  
  Umm, that is an uncited claim in the summary. Nothing of the sort is stated in any of the links. The summary links to a paper that provides more details of the attack. Very heavy and technical though a few inital takeaways from it is that implementations only take a few days to run on gear they have so does seem safe to assume that SHA-1 collisions are pretty much pwned.
  The Python script in question doesn't find new SHA-1 collisions. It takes two input PDFs and produces two output PDFs that hash to the same value. It uses some quirks of how PDFs work, plus that original SHAttered collision generated by the Google researchers. Finding another collision is a lot of work. Using a known collision to generate PDFs with the same hash value is not.
  https://github.com/nneonneo/sha1collider
  
  --
  Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
3. Re:"In the wild" - slight exaggeration by guruevi · 2017-02-26 10:49 · Score: 2
  
  If someone checked in, that means they have permissions to do so. It's not like Git just blindly accepts commits with the same hash but different contents. We know it's possible, it's even possible with SHA256 to create a collision, as long as you're making a hash, you can create a collision as you're mapping an infinite set of bits onto a finite set of bits, there will always be a second set of bits that creates a collision as the number of sets approaches infinity regardless of the hash function you use.
  The fact that it's "easier" for a certain definition of "easy" doesn't mean the thing is broken, it just means people should be more careful when accepting particular hashes (eg. if you're using a cloned repo of whatever software you want to use) but even then, a bit-by-bit comparison can easily weed them out.
  As far as mainstream repo's a) you would notice someone suddenly inserting a very oddly shaped document into your repo's b) that person would require permission to do so and c) you should never automate a repo to pull in and compile something into production. Not sure if that's what happened here, the summary is very unclear as to what actually happened besides someone intentionally pushing a broken thing and it broke other things.
  
  --
  Custom electronics and digital signage for your business: www.evcircuits.com
4. Re:"In the wild" - slight exaggeration by guruevi · 2017-02-26 11:01 · Score: 2
  
  Say it with me: Hashing is not Encryption. Hashing is not Encryption. Hashing is not Encryption.
  Very high level:
  Hashing is the irreversible mapping of a set of bits onto a (usually smaller) set of bits in order to obfuscate the original set of bits (one-way)
  Encryption is the mapping of a set of bits onto another equally sized set of bits where the mapping is reversible through some process (two-way)
  Hashing can be done with salts so that using rainbow tables is harder or impossible, but there will always be another set of bits that maps into the same set of bits. It's good enough for hiding a password or for reducing the complexity of finding matches. If you were writing a file system you could use it to do things like de-duplication but when you have a collision, you should ideally still do a bit-by-bit check when a collision occurs.
  Calculating a collision with actual useful content - if I want to insert a "return 0" on a particular line somewhere in the Linux kernel) is still as hard if not impossible to do as before without also inserting a load of weird, binary comments. We just know that these collisions can now be calculated faster, but it's not like adding an arbitrary string will break the calculation and produce a predefined hash.
  
  --
  Custom electronics and digital signage for your business: www.evcircuits.com
5. Re:"In the wild" - slight exaggeration by 0ptix · 2017-02-26 11:40 · Score: 2
  
  To be fair, any pair of distinct inputs to SHA1 that hash to the same value are a new collision. In general, being given one collision for a hash function doesnt make it automatically easy to find another. Its only because SHA1 is an iterated hash function (merkle-damgard) that this becomes true. (admittedly, almost all practical cryptographic hash functions are iterated constructions.)
  If SHA1(x0) = SHA1(x1) then for any z SHA1(x0Â¦Â¦z) = SHA1(x1Â¦Â¦z). I'm guessing the collision generated by the Google-CWI team is on a pair x0 and x1 where xb is the beginning of a pdf document that basically encodes "of the next two sections in this pdf file display section b". Given that its easy to extend them to any colliding pdf documents one wants.
Re: In other news by brantondaveperson · 2017-02-26 10:23 · Score: 4, Informative

I love arguing about git.
SVN has several huge advantages over git. It's far simpler. It doesn't have a thing called 'rebase', which rewrites your commits and occasionally messes them up. Its revision numbers are actually in order, which means you always know which revision came first, given two of them, something that's impossible with git's hashes (YES - I know why the hashes are used... but the reality of 99% of software development is that the repository is centralised, so git's solving an almost non-existent problem here). SVN supports real cherry-picking, and actually records in the repo that you took code from somewhere, as opposed to git's cut-n-paste approach.
SVN has branches, git has pointers into a tree. Thus in git, it is impossible after the fact to determine to which branch a change was committed, just in which branches it now currently resides. Branches don't really exist in git at all, they aren't versioned (who created a branch, and when?), and if you accidentally delete them you tend to lose the commits against them. Tags in git are even worse. Added to which is the fact that both are implemented in the filesystem as regular files, which means you're at the mercy of your filesystem's ideas regarding case and permitted characters, and good luck if someone tries to check it out into a filesystem with different ideas. Nice design decision there Linus, guess you were having an off day? And the line-endings stuff?... Oh. My. God.
Re: In other news by multi+io · 2017-02-26 11:56 · Score: 2

SVN has several huge advantages over git. It's far simpler.
Explain in one or two sentences what a "tree conflict" is and how to resolve it.
Re:Reading the paper. What is in an exponent?? by cryptizard · 2017-02-26 12:06 · Score: 3, Informative

It is a bitwise rotation. The direction and number specify if it is a right or left rotation and then how many bits to rotate.
Re: In other news by Anonymous Coward · 2017-02-26 12:09 · Score: 2, Interesting

I take it you've never seen a team that for some reason merged current files on a pull to effectively revert commits pushed to the remote with their few intended changes, over and over and over, across several people, for days, before noticing that there was a problem. You will bang your head on a desk shouting "why didn't they notice there was a problem..."
I can't believe though that question was upvoted 13,182 times. According to this SA "most upvoted" query, it's the currently the second most upvoted question they have. There are a *lot* of questions on SA. Sorry, but there is something wrong when something that should be so obvious is a problem that is so commonly faced by so many people when trying to use a "common" and "most fit for the purpose" tool.
Unbelievably, the third most upvoted question is also related to git. Five of the results in that query are related to git. That's 25% of the top 20 upvoted questions on SA. No questions in the top 20 are related to subversion. I believe that is valid circumstantial evidence that git is difficult to use and understand, especially the concepts related to it. That doesn't mean it's bad, but if you're working with a team that doesn't have much SCM experience (thankfully that's becoming less of an issue quickly), and you don't have time to commit to everyone on the team being able to pick up the nuances of git for a project, SVN is a great way to go as it's extremely simple, mostly due to significantly less feature coverage. Though, $deity help you if you aren't sitting next door to the server.
GIT by DrYak · 2017-02-26 12:41 · Score: 3, Interesting

Does the Git usage of SHA-1 *really* cause silent problems? I'm not sure how Git works internally but I was under the impression that it hashes whole objects, like individual source files at least.
The individual objects inside git aren't file.
The individual objects are commits (i.e..: the content of a patchfile, and a few information like pointer to other past commits to which this patch applies).
To make things easier, a handy number designates this commit - this is currently generated by SHA-1.
(Git is a content-addressable platform. You don't access object by name, you access them depending on their content. But instead of using the whole content to access them, you use addresses generated by SHA-1 to access the various blocks.
So to say which are the parent commits to which the patch in a commit applies, you just mention them by using the SHA-1 sum of the content of these commits).
A theoretical attack would be:
- try to generate 2 commits.
one adds a clean piece of code. the other adds a backdoored piece of code.
but both commits hash to the same SHA-1 so they would be considered as "the same content" by git.
Then try to force your target to re-download the whole repo from scratch from your backdoored history (otherwise git will simply ignore the commits with sha-1 sum that it already has - it thinks that it has the same content already).
In practice it's currently not doable.
The only thing that google managed to generate is a pair of block series. Each series contain completely random junk. Both series end-up generating the exact same shasum even if the random junk is different.
- That is exploitable in a PDF (or any other binary format that supports scripting. You could even do it in an EXE) : using the embed scripting present 2 different contents depending on which random junk is present.
- That is not exploitable in a sourcecode commit : you would need a believable explanation for why the random junk is present in the patched source code.
AND you would need a piece of code which reacts differently (normal vs. backdoor) depending on which random junk is present - to be able to pull that unnoticed would require "Underhanded C Contest"-level of ingenuity.
That's it, you only have blocks of random garbage.
Google currently can't produce hashes colliding from arbitrary pieces of data ("Hey google: here's is legit script A, and that's malicious script B. Add a small nonce at the end so they both end-up having the same sha-1sum") ("Actually don't add a nonce, that would be too conspicuous, try to tweak the punctuation in the comments instead")
Also as you mention, further edits will be problematic :
if I edit script A and submit a patch, this patch will be valid, but will completely fail on top of script B.

--
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
Hash Functions 101 by FeelGood314 · 2017-02-26 12:45 · Score: 4, Informative

A hash function takes an arbitrary string of bits and outputs a string of bits of a fixed length.
A CRC is an example of a hash function and a long CRC would probably be good enough for GIT or most repositories.
First Pre-image resistance - this is a test of the one wayness of the function. Given a hash value it is difficult to find a pre-image that hashes to that value. Given y a string of bits of length hash output length finding X such that h(X) = y is hard.MD-5 and SHA-1 are still resilient against first pre-image attacks
Second Pre-image resistance - given a message X finding a Y such that h(X)=h(Y) is difficult. MD-5 and SHA-1 are still resilient against second pre-image attacks
Collision resistant - It is hard to find two messages X and Y such that h(X) = h(Y). Note the attacker here is free to choose both X and Y. Both MD-5 and SHA-1 are no-longer collision resistant.

So far however the two messages X and Y have to be nearly identical. They have to start and end the same way and the blocks that are changed actually have to be changed and tested together to make sure the hash function internal state changes only in a specific way. I can't create a document that says the rent will be $3000 per month and another that says it will be $30000. (I might create one that says it is $3149.21 and the other $53210.63 per month, like in the PDF example they played with a colour field). Also because of the way the internal state of the hash function changes we now have a way of detecting if someone is feeding a "funny" stream of bits into our hash function and detect this attack with a very low probability of a false positive.
1. Re:Hash Functions 101 by FeelGood314 · 2017-02-26 14:14 · Score: 2
  
  This still can be weaponized. Even if I only have two bit streams that start the same and then only differ in a block that I couldn't control I can still create malicious executables. Once I have the two streams that collide as long as the bits I add to both streams are identical the hashes will remain identical. I then have code after the differing block(s) that checks a value of a field in the differing blocks and behaves differently based on this value. I now have a good executable that is well behaved that I can submit to be signed by Microsoft or some other trusted company and a bad piece of software that has the same hash value. I take the valid signature from the good software and append it to the bad software and the signature remains valid.
Re: In other news by complete+loony · 2017-02-26 14:15 · Score: 3, Funny

You want a revision number? Simple;
$ git rev-list HEAD | wc -l
Assuming everyone is on the same branch of course....
(Obligatory XKCD)

--
09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
Re: In other news by Anonymous Coward · 2017-02-26 19:44 · Score: 2, Informative

> I'd give alot to go back to revision numbers...
Why? All that matters is the logical ordering of commits, not the chronological ordering. I DGAF about when someone wrote a bit of code. All that I want is for each commit to more or less work, and for the history to be easily bisectable to aid in bug hunting.
> Git gives a conflict in the first case - naturally - and may give a conflict in the second if its magic content-tracking algorithm fails...
As an SVN -> Git convert, I've had git just Do The Right Thing(TM) when committing changes to a moved file more times than I can count. I lost _weeks_ of my life to doing that shit manually with SVN. I get that content tracking doesn't work 100% of the time, but it works infinity% of the time more often than it does in SVN.
> Branches don't really exist in git at all...
what?
> they aren't versioned (who created a branch, and when?),
wot?!
If you look at this https://github.com/inaka/shotgun/compare/dave.151.update.deps.to.2.0.0.pre It's obvious that euenlopez@gmail.com created that branch on June 02, 2016. This one-liner makes it doubly clear that this isn't github trickery:
git log `git log master..dave.151.update.deps.to.2.0.0.pre --oneline | tail -n 1 | awk '{ print $1 }'` | head -n 3
(The git log invocation inside the backticks gives you the commits that dave.151... has that master doesn't, putting the "oldest" at the bottom. The tail/awk pipe scrapes out the commit ID of the "oldest' commit. That commit ID is fed to the outer git log, and the first three lines of that log are scraped off by head.)
> ...and if you accidentally delete them you tend to lose the commits against them.
a) If you delete a branch, you don't want the commits in it.
b) But if you _did_ want those commits, git reflog saves you from your mistakes
c) man git-reflog , for $DEITY's sake
Don't get lost in the terminology of the man page. Just make a git repo, perform some changes to it (some small, some large) and play with git reflog.
Every change you make to a local repo is stored locally, for 90 days -by default-. git reflog lets you sort through that and restore the pointers that you carelessly blew away.