Slashdot Mirror


Google Talks About the Dangers of User Content

An anonymous reader writes "Here's an interesting article on the Google security blog about the dangers faced by modern web applications when hosting any user-supplied data. The surprising conclusion is that it's apparently almost impossible to host images or text files safely unless you use a completely separate domain. Is it really that bad?"

23 of 172 comments

  1. I don't know if the question should be... by Tastecicles · · Score: 2

    ...is it a server problem, with the way it interprets record data, or the browser (any browser) (maybe as instructions rather than markup)? I'm guessing server in this case, since if the stream is intercepted and there's a referrer URL that directly references an image or other blob on the same or another server on a subdomain, that could be used to pwn the account/whatever... I'm not up on that sort of hack (you can probably tell). I don't quite get how hosting blobs on an entirely different domain would mitigate against that hack, since you would require some sort of URI that the other domain would recognise to be able to serve up the correct file - which would be in the URL request! Someone want to try and make sense of what I'm trying to say here?

    --
    Operation Guillotine is in effect.
    1. Re:I don't know if the question should be... by Sarusa · · Score: 5, Informative

      It's fundamentally a problem with the browsers. Without getting too technical...

      Problem 1: Browsers try real hard to be clever and interpret maltagged/malformed content so people with defective markup or bad MIME content headers won't say 'My page doesn't work in Browser X, Browser X is defective!'. Or, if the site is just serving up user text in HTML, an attacker can stick some JavaScript tags in the text. Either way, someone malicious can upload some 'text' to a clipboard or document site, which the browser then executes when the malicious person shares the URL.

      Problem 2: There are a lot of checks in most browsers against 'cross site scripting', which is a page on site foobar.com (for instance) making data load requests to derp.com, or looking at derp.com's cookies, or even leaving a foobar.com cookie when derp.com is the main page. But if your script is running 'from' derp.com (as above) then permissions for derp.com are almost wide open, because it would just be too annoying for most users to manage permissions on the same site. Now they can grab all your docs, submit requests to email info, whatever is allowed. This is why just changing to another domain name helps.

      There's more nitpicky stuff in the second half of TFA, but I think that's the gist of it.
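
      Problem 2 can be made concrete with a rough sketch of the comparison browsers make (the same-origin check). This is a simplification: real browsers have extra rules, and the derp-usercontent.com host is made up for illustration.

```python
from urllib.parse import urlsplit

# Default ports, so that http://derp.com and http://derp.com:80 compare equal.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (scheme, host, port) triple used for the comparison."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname or ""
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return (scheme, host, port)


def same_origin(url_a, url_b):
    """True when both URLs share scheme, host, and port."""
    return origin(url_a) == origin(url_b)


if __name__ == "__main__":
    page = "http://derp.com/docs/private"
    # Uploaded 'text' served from the main site shares the page's origin.
    print(same_origin(page, "http://derp.com/uploads/evil.txt"))
    # The same upload served from a separate (hypothetical) domain does not.
    print(same_origin(page, "http://derp-usercontent.com/evil.txt"))
```

      A script that runs from the first upload URL is treated as part of derp.com; one that runs from the second is not, which is the whole point of moving user content to another domain.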

    2. Re:I don't know if the question should be... by TubeSteak · · Score: 5, Insightful

      It's fundamentally a problem with not validating inputs. Without getting too technical...

      Problem 1: Browsers try real hard to be clever and interpret maltagged/malformed content instead of validating inputs.

      Problem 2: There are a lot of checks in most browsers against 'cross site scripting', which is fundamentally a problem of not validating inputs.

      /don't forget to validate your outputs either.

      --
      [Fuck Beta]
      o0t!
    3. Re:I don't know if the question should be... by Sarusa · · Score: 3, Insightful

      This is true! You could even say it's a sooper-dooper-fundamental problem of HTTP/HTML not sufficiently separating the control channel from the data channel and/or not sufficiently encapsulating things (active code anywhere? noooo.)

      But since browsers have actively chosen to validate invalid inputs and nobody's going to bother securing HTTP/HTML against this kind of thing any time soon, or fix the problems with cookies, or, etc etc etc, I figured that was a good enough high level summary of where we're at realistically. Nobody's willing to fix the foundations or 'break' when looking at malformed pages.

    4. Re:I don't know if the question should be... by 19thNervousBreakdown · · Score: 5, Interesting

      I'm actually not a big fan of validating inputs. I find proper escaping is a much more effective tool, and validation typically leads to both arbitrary restrictions of what your fields can hold and a false sense of security. It's why you can't put a + sign in e-mail fields, or have an apostrophe in your description field.

      In short, if a data type can hold something, it should be able to read every possible value of that data type, and output every possible value of that data type. That means that if you have a Unicode string field, you should accept all valid Unicode characters, and be able to output the same. If you want to restrict it, don't use a string. Create a new data type. This makes escaping easy as well. You don't have a method that can output strings, at all. You have a method that can output HTMLString, and it escapes everything it outputs. If you want to output raw HTML, you have RawHTMLString. Makes it much harder to make a mistake when you're doing Response.Write(new RawHTMLString(userField)).

      A multi-pronged approach is best, and input validation certainly has its place (ensuring that the user-supplied data conforms to the data type's domain, not trying to protect your output), but the first and primary line of defense should be making it harder to do it wrong than it is to do it right.
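
      A minimal sketch of the HTMLString / RawHTMLString idea, in Python rather than the Response.Write style above. The class names come from this comment; the write() helper and the list used as a stand-in response are assumptions for illustration only.

```python
import html


class HTMLString:
    """Arbitrary text. It is always escaped when rendered."""

    def __init__(self, text):
        self.text = text

    def render(self):
        return html.escape(self.text, quote=True)


class RawHTMLString:
    """Markup the programmer has explicitly vouched for. Rendered as-is."""

    def __init__(self, markup):
        self.markup = markup

    def render(self):
        return self.markup


def write(response, fragment):
    """Append a fragment to the response; plain str is rejected on purpose."""
    if not isinstance(fragment, (HTMLString, RawHTMLString)):
        raise TypeError("write() takes HTMLString or RawHTMLString, not plain str")
    response.append(fragment.render())


if __name__ == "__main__":
    response = []
    user_field = "<script>alert('hi')</script>"
    write(response, HTMLString(user_field))  # escaped by default
    write(response, RawHTMLString("<br>"))   # the unsafe path has to be spelled out
    print("".join(response))
```

      There is no way to output a bare string, so the dangerous call, write(response, RawHTMLString(user_field)), stands out in code review.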

      --
      <xml><I><am><so><damn>Web 2.0</damn></so></am></I></xml>
    5. Re:I don't know if the question should be... by dzfoo · · Score: 3, Interesting

      I'm actually not a big fan of validating inputs. I find proper escaping is a much more effective tool, and validation typically leads to both arbitrary restrictions of what your fields can hold and a false sense of security.

      OK, fair point. How about if we expand the concept of "validating input" to include canonicalization and sanitization as well? Oh, it already does. Go figure.

      Reducing it to a mere reg-exp is missing the point. Proper canonicalization (and proper understanding of the underlying standards and protocols, but that's another argument) would allow you to use a plus-sign in an e-mail address field.

      But this won't happen as long as every kid fresh out of college wants to roll their own because they know The One True Way to fix it, this time For Real. As long as they keep ignoring everything learned before because, you know, it's old stuff and this is the new technology of The Web, where everything old does not count at all, nothing will change.

      A multi-pronged approach is best, and input validation certainly has its place (ensuring that the user-supplied data conforms to the data type's domain, not trying to protect your output), but the first and primary line of defense should be making it harder to do it wrong than it is to do it right.

      "MOAR TECH!!!1" and over-wrought protocols are no silver-bullet against ignorance, naivety, and hubris.

                  -dZ.

      --
      Carol vs. Ghost
      ...Can you save Christmas?
    6. Re:I don't know if the question should be... by 19thNervousBreakdown · · Score: 2

      Your solution appears to be, "Do exactly what we've been doing, just more." My rebuttal to that is the entire history of computer security. While it's true that proper understanding of underlying standards and protocols would go a long way toward mitigating the problems, a more complete solution is to make such detail-oriented understanding unnecessary. Compartmentalization of knowledge is, in my opinion anyway, the primary benefit of computers, and the rejection of providing that benefit to other programmers or utilizing it yourself while writing software smacks of programmers who don't want others invading their turf.

      I'll grant you, new does not necessarily mean better. Some new approaches work better, some work worse, but we already know exactly what the old approach accomplishes.

      --
      <xml><I><am><so><damn>Web 2.0</damn></so></am></I></xml>
    7. Re:I don't know if the question should be... by ais523 · · Score: 5, Informative

      After seeing a demonstration of a successful XSS attack on a plaintext file (IE7 was the offending browser, incidentally), I find it hard to see what sort of validation could possibly help. After all, the offending code was a perfectly valid ASCII plain text file that didn't even look particularly like HTML, but happened to contain a few HTML tags. (Incidentally, for this reason, Wikipedia refuses to serve user-entered content as text/plain; it uses text/css instead, because it happens to render the same on all major browsers and doesn't have bizarre security issues with IE.)

      --
      (1)DOCOMEFROM!2~.2'~#1WHILE:1<-"'?.1$.2'~'"':1/.1$.2'~#0"$#65535'"$"'"'&.1$.2'~'#0$#65535'"$#0'~#32767$#1"
    8. Re:I don't know if the question should be... by dzfoo · · Score: 2

      You misunderstood my point, and then went on to suggest that the "old way" won't work; inadvertently falling into the trap I was pointing out.

      My "solution" (which really, it wasn't a solution per se) is not "more of the same." It is the realization that previous knowledge or practices may not be obsolete, and that we shouldn't try to find new ways to do things for the mere sake of being new.

      A lot, though not all, of the security problems encountered in modern applications have been known and addressed in the past, to various degrees of success. We should embrace this experience and apply it, not shun it as antiquated.

      Whether you want to admit it or not, lack of input validation and of understanding of data encoding at the various transport layers is the source of most security issues. We should acknowledge this and address it directly.

      You are right, a lot can be done to build solutions into our tools to ease their implementation. However, technology itself won't solve the problem of developers not understanding the risks or why they happen.

      What does not help at all is to hand-wave or diminish this particular problem and blame the tools for not doing our due diligence. Or worse, ignore experience and history and mark it as a new problem, only solvable by more technology.

              dZ.

      --
      Carol vs. Ghost
      ...Can you save Christmas?
    9. Re:I don't know if the question should be... by Anonymous Coward · · Score: 2, Informative

      It doesn't "refuse" to serve text/plain, it just makes you ask for it specifically. (Use ?action=raw via index.php and/or format=txt via api.php)

    10. Re:I don't know if the question should be... by Cajun+Hell · · Score: 2

      You're assuming input comes from a browser using a page you made yourself. ... If you aren't validating input in your server code, what is?

      No he's not. If you do things right, then hostile input, honestly mistaken input, and perfectly valid input all get handled the same way. Instead of getting "validated," they get escaped for whatever context they're used within, as they get written to that context.

      If you're building a string for use in a SQL statement, then the string gets escaped for SQL, regardless of whether you trust it or not. You just always do it (unless some other part of the system is guaranteed to be doing it for you, later than your own handling of the data). So it's ok if the data has a single-quote character, because you're always going to be sending that to the database as '' or \'. If you're outputting it to be part of a text node in HTML, then the string gets escaped for HTML text -- always, regardless of whether you trust it or not. So it's ok if it has a < character, because you're always going to send that to web browsers as &lt;.

      Validation would impose needless restrictions (you can't have a quotation mark or a less-than sign) that are going to turn out to be useless anyway. You won't ever think of all the characters that might break something else that the data some day gets used for. I currently maintain a system where there's a rule that some data can't contain "weird characters" (it actually tells that to people as they enter it) and it's a decade too late to fix that, so it merely validates the strings and there's a shitload of code that trusts that validation to have happened, and because of that, there's an upper bound to how diversely this data can ever be used. All because someone back in the mists of time thought that input validation was the answer, rather than output escaping.

      OTOH, escaping at the last moment always fixes the problem, every time and in every context, whether we're talking about SQL, HTML, or something that hasn't been invented yet. Every format will always have some mechanism for escaping strings. Use it, as you're outputting to that format, not prematurely as you're storing the value somewhere. Do this, and you'll have no security problems related to data values, and there's nowhere your data can't go.

      BTW, I'm not totally anti-validation. Sometimes the actual value of a string matters, although usually when it does, it means some other part of the system is mis-designed. (But we all sometimes have to maintain mis-designed systems.) An invalid input should usually be expressed as a failed lookup (e.g. since I'm trying to store the foreign key for a car manufacturer named "Ferd", rather than validate that "Ferd" is the name of a manufacturer before I store that string) or a failed conversion (e.g. I wasn't able to translate 2012-08-32 into a Julian date) or something like that. If it's really raw text with no systemic meaning ("I L1ke ur b00bies in yer v1d30 and want to date u") then there's no reason it needs any sort of validation at all, regardless of whether a stupid human or a malicious robot wrote it. There is no conceivable Unicode character that you shouldn't allow in a string like that, no matter how it's going to be used, as long as you're escaping it for each context right as you use it.

      Most of the time, though, validation should be semantic. It's not that you entered an invalid name for something, it's that you entered that your movie will be in theaters in 3012 or that your thing which turned out to be a book had a blank author (whereas it would have been ok for a teacup to lack an author), or something like that.
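
      A small sketch of "store it raw, escape it at the point of use", using Python's sqlite3 and html modules. Note one swap: instead of escaping the string for SQL by hand, it uses placeholder binding, which hands the quoting to the database driver. The comments table and the function names are made up for illustration.

```python
import html
import sqlite3


def store_comment(conn, text):
    """Store the text exactly as entered; the ? placeholder handles SQL quoting."""
    conn.execute("INSERT INTO comments (body) VALUES (?)", (text,))


def render_comments(conn):
    """Escape for HTML only now, as the text is written into an HTML text node."""
    rows = conn.execute("SELECT body FROM comments ORDER BY rowid")
    return "".join("<p>" + html.escape(body) + "</p>" for (body,) in rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE comments (body TEXT)")
    # Single quotes, ampersands, and angle brackets are all accepted as data.
    store_comment(conn, "O'Brien says 1 < 2 & <script>alert(1)</script>")
    print(render_comments(conn))
```

      The stored value is never altered, so the same row can later be written to some other format with that format's own escaping.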

      --
      "Believe me!" -- Donald Trump
    11. Re:I don't know if the question should be... by ais523 · · Score: 2

      http://en.wikipedia.org/w/index.php?title=Main_Page&action=raw&ctype=text/plain
      "You have chosen to open index.php which is a: text/x-wiki from: http://en.wikipedia.org/"

      http://en.wikipedia.org/w/api.php?format=txt
      "You have chosen to open api.php which is a: text/text from: http://en.wikipedia.org/"

      It refuses to serve text/plain, even if you ask for it specifically. (Compare http://en.wikipedia.org/w/index.php?title=Main_Page&action=raw&ctype=text/css, which it'll serve quite happily.)

      --
      (1)DOCOMEFROM!2~.2'~#1WHILE:1<-"'?.1$.2'~'"':1/.1$.2'~#0"$#65535'"$"'"'&.1$.2'~'#0$#65535'"$#0'~#32767$#1"
  2. Re:It's called reprocessing by Anonymous Coward · · Score: 5, Informative

    As TFA points out, it is possible to create a Flash applet using nothing but alphanumeric characters. Good luck catching that in your reprocessing.

  3. Yes, it really is that bad. by VortexCortex · · Score: 5, Interesting

    This is what happens when you try to be lenient with markup instead of strict (note: compliant does not preclude extensible), and then proceed to use a horribly inefficient and inconsistent (by design) scripting language and a dysfunctional family of almost sane document display engines combined with a stateless protocol to produce a stateful application development platform by way of increasingly ridiculous hacks.

    When I first heard of "HTML5" I thought: Thank Fuck Almighty! They're finally going to start over and do shit right, but no, they're not. HTML5 is just taking the exact same cluster of fucks to even more dizzying degrees. HOW MANY YEARS have we been waiting for v5? I've HONESTLY lost count and any capacity to give a damn when we reached a decade -- Just looked it up, 12 years. For about one third the age of the Internet we've been stuck on v4.01... ugh. I don't, even -- no, bad. Wrong Universe! Get me out!

    In 20XX when HTML6 may be available I may reconsider "web development". As it stands web development is chin-deep in its own filth which it sprays with each mention onto passers-by, and they receive the horrid spittle joyously, not because it's good or even not-putrid, but because we've actually had worse! I can crank out a cross platform pixel perfect native application for Android, iOS, Linux, OSX, XP, Vista, Win7, and mother fucking BSD in one third the time it takes to make a web app work on the various flavours of IE, Firefox, Safari, Chrom(e|ium). The time goes from 1/3rd down to 1/6th when I cut out testing for BSD, Vista, W7 (runs on XP, likely runs on Vista & Win7. Runs on X11 + OpenGL + Linux, likely builds/runs on BSD & Mac).

    Long live the Internet and actual cross platform development toolchains, but fuck the web.

    1. Re:Yes, it really is that bad. by sgrover · · Score: 5, Funny

      +1, but tell us how you really feel

    2. Re:Yes, it really is that bad. by SuricouRaven · · Score: 5, Insightful

      Of course it's a mess. The combination of HTTP and HTML was designed for simple, static documents displaying predominantly text, a little formatting and a few images. By this point we're using extensions to extensions to extensions. It's a miracle it works at all.

    3. Re:Yes, it really is that bad. by adolf · · Score: 4, Funny

      It's a miracle it works at all.

      It works?

    4. Re:Yes, it really is that bad. by TheDarkMaster · · Score: 2

      I think the same thing. I currently work doing "web systems". And do they work? They work; I managed to make a web application that can use a card printer. But at what price? I spent twice the time that I would have spent doing compiled desktop applications, and lost count of the many horrible hacks I had to do to get similar desktop functionality using HTML.

      --
      Religion: The greatest weapon of mass destruction of all time
  4. Re:"user content" by Anonymous Coward · · Score: 2, Interesting

    Umm, what does your comment have to do with the subject in TFA? They used to host content on google.com, then they moved it to googleusercontent.com for security reasons. If anything they have made it clear that the user owns it, but not for that reason.

  5. HTML needs a sandbox tag by Hentes · · Score: 2

    The easiest way to secure embedded content would be a sandbox tag that lets you limit what kind of content can be inside of it.

  6. Problem can be solved, but users are the problem by gweihir · · Score: 2

    Images and text can be sanitized reliably. The problem is that this strips out all of the non-essential features. Users have a hard time understanding that, because users do not understand the trade-offs involved.

    But the process is easy: Map all images to metadata-free, uncompressed formats (e.g. pnm), then recompress with a trusted compressor. For text, accept plain ASCII, RTF and HTML 2.0. Everything else, convert either to images or to cleaned PDF/Postscript by "printing" and OCR'ing.

    --
    Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
  7. Novel Solution by Sentrion · · Score: 2, Interesting

    This was a real problem back in the 1980s. Every time I would connect to a BBS my computer would execute any code it came across, which made it very easy for viruses to infect my PC. But lucky for me, in the early 90's the world wide web came into being and I didn't have to run executable code just to view content that someone else posted. The PC was insulated from outside threats by viewing the web "pages" only through a "web browser" that only let you view the content, which could be innocuous text, graphics, images, sound, and even animation that was uploaded to the net by way of a non-executable markup language known as HTML. It was at this time that the whole world began to use their home computers to view content online because it was now safe for amateurs and noobs to connect their PCs to the internet without any worries of being inundated with viruses and other malware.

    Today I only surf the web with browsers like Erwise, Viola, Mosaic, and Cello. People today are accessing the internet with applications that run executable code, such as Internet Explorer and Firefox. Very dangerous for amateurs and noobs.

  8. My explanation of article by kent.dickey · · Score: 5, Informative

    The blog post was a bit terse, but I gather one of the main problems is the following:

    Google lets users upload profile photos. So when anyone views that user's page, they will see that photo. But malicious users were making their photo files contain Javascript/Java/Flash/HTML code. Browsers (I think it's always IE) are very lax and will try to interpret files how they please, regardless of what the web page says. So, the webpage says it's pointing to an IMG, but some browsers will interpret it as Javascript/Java/Flash/HTML anyway once they look at the file. So now a malicious user can serve up scripts that seem to be coming from Google.com, and so they are given a lot of access at Google.com and break their security (e.g., let you look at other people's private files).

    Their solution: user images are hosted at googleusercontent.com. Now, if a malicious user tries to put a script in there, it will only have the privileges of a script run from that domain--which is no privileges at all. Note this just protects Google's security...you're still running some other user's malicious script. Not google's problem.

    The article then discusses how trying to sanitize images can never work, since valid images can appear to have HTML/whatever in them, and their own internal team worked out how to get HTML to appear in images even after image manipulation was done.

    Shorter summary: Browsers suck.
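
    As a rough sketch of what the serving side of that solution might look like: put uploads on a separate host and tell the browser not to guess at the content. The host usercontent.example and both function names are hypothetical. The X-Content-Type-Options: nosniff and Content-Disposition: attachment headers are real, but they are not mentioned in this thread and older browsers ignore nosniff, so treat them as an addition to the separate domain, not a replacement for it.

```python
from urllib.parse import quote

# Hypothetical host that holds nothing but user uploads (no cookies, no app).
USER_CONTENT_HOST = "usercontent.example"


def user_content_url(file_id):
    """Build the URL for an upload on the separate user-content host."""
    return "https://" + USER_CONTENT_HOST + "/files/" + quote(file_id, safe="")


def user_content_headers(filename, declared_type="application/octet-stream"):
    """Response headers that ask the browser not to reinterpret the upload."""
    # Keep only a conservative set of ASCII characters in the download name.
    safe_name = "".join(
        ch for ch in filename if ch.isascii() and (ch.isalnum() or ch in "._-")
    ) or "download"
    return {
        "Content-Type": declared_type,
        # Ask the browser to trust the declared type instead of sniffing.
        "X-Content-Type-Options": "nosniff",
        # Offer the file as a download instead of rendering it inline.
        "Content-Disposition": 'attachment; filename="' + safe_name + '"',
    }


if __name__ == "__main__":
    print(user_content_url("profile photo.png"))
    for name, value in user_content_headers('evil".html', "text/plain").items():
        print(name + ": " + value)
```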