OS X Users: 13 Characters of Assyrian Can Crash Your Chrome Tab
abhishekmdb writes No browsers are safe, as proved yesterday at Pwn2Own, but crashing one of them with just one line of special code is slightly different. A developer has discovered a hack in Google Chrome which can crash the Chrome tab on a Mac PC. The code is a 13-character special string which appears to be written in Assyrian script. Matt C has reported the bug to Google, who have marked the report as duplicate. This means that Google are aware of the problem and are reportedly working on it.
The Assyrian came down like the wolf on the fold,
And his cohorts were gleaming in purple and gold;
And the sheen of their spears was like stars on the sea,
When the blue wave rolls nightly on deep Galilee.
Byron
Internet cat-youtuber-viewer, you could be attacked at any moment and lose all your newly discover-list at any moment.
Let us henceforth dub it the Snow Crash exploit.
Like a turd in a toilet..
Just flush it and get real.
Stop the presses: a bug found in a large, complex program.
Aaaaaaaaaahhhh....
Apparently, specially crafted input can expose bugs. It won't ever change. Anyone who thinks that computer software can be made foolproof either doesn't understand how it's made, or is in denial. This would have been news about 1985.
Exactly why is this front page news?
This exploit rang a bell, so I searched Bruce Schneier's website. And, sure enough, on July 15, 2000, he observed ``Unicode is just too complex to ever be secure.'' Doesn't exactly warm the cockles of the paranoid's heart.
I refuse to believe corporations are people until Texas executes one. -- desert rain on http://www.dailykos.com/user/
to ditch unicode support. They recognized that experimental technology like this shouldn't be rolled out to this many users. Thank you, Dice, for keeping Slashdot safe!
If I were looking for a language to scare a program into submission with, Assyrian would be a pretty plausible choice. Even by the rather high standards of the rough neighborhood that is the near and middle east, they cut quite a swath of blood-soaked mayhem through their neighbors; and put out lots of cuneiform inscriptions and rather morbid art gloating about their efficiency at this.
That script is the Syriac script not the Assyrian one: https://en.wikipedia.org/wiki/....
this report is a dupe: https://code.google.com/p/chro...
Google translate doesn't even do Assyrian!
I once had a small Notes web thing running for a bunch of people in Scandinavia. The thing crashed every time someone from Iceland worked with it. Turned out that the Icelandic character is not in some middle European character set (this was before UTF-8) and wasted Notes every time. That was a total bastard of a problem to find.
The dangers of excessive individualism are nothing compared to the oppressiveness of excessive collectivism
It might not be unicode. I once had a bug because I assumed a particular MacOSX/iOS API call was returning UTF8. It was actually returning old-school MacRoman by default. Worked for some locales, caused a crash on others.
Yeah, computers should only support good old-fashioned US-ASCII, there's no way any data using those characters could possibly cause anything to break.
How long do you think it's going to take for said characters to be posted (inadvertently, of course) in a comment on this post?
http://blogs.msdn.com/b/oldnew...
About every ten months, somebody new discovers the Notepad file encoding problem. Let's see what else there is to say about it.
First of all, can we change Notepad's detection algorithm? The problem is that there are a lot of different text files out there. Let's look just at the ones that Notepad supports.
8-bit ANSI (of which 7-bit ASCII is a subset). These have no BOM; they just dive right in with bytes of text. They are also probably the most common type of text file.
UTF-8. These usually begin with a BOM but not always.
Unicode big-endian (UTF-16BE). These usually begin with a BOM but not always.
Unicode little-endian (UTF-16LE). These usually begin with a BOM but not always.
If a BOM is found, then life is easy, since the BOM tells you what encoding the file uses. The problem is when there is no BOM. Now you have to guess, and when you guess, you can guess wrong. For example, consider this file:
D0 AE
Depending on which encoding you assume, you get very different results.
If you assume 8-bit ANSI (with code page 1252), then the file consists of the two characters U+00D0 U+00AE, or "Ð®". Sure this looks strange, but maybe it's part of the word VATNIÐ® which might be the name of an Icelandic hotel.
If you assume UTF-8, then the file consists of the single Cyrillic character U+042E, or "Ю".
If you assume Unicode big-endian, then the file consists of the Korean Hangul syllable U+D0AE, or "킮".
If you assume Unicode little-endian, then the file consists of the Korean Hangul syllable U+AED0, or "껐".
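To see the ambiguity concretely, here is a minimal Python sketch that decodes the same two bytes under each of the four assumptions. The function name `interpretations` is made up for illustration, and `cp1252` is used as a stand-in for "8-bit ANSI"; the codec names themselves are Python's standard ones.

```python
# The two-byte file from the example above.
RAW = bytes([0xD0, 0xAE])

def interpretations(raw: bytes) -> dict:
    """Decode the same bytes under each encoding the editor has to choose between."""
    codecs_to_try = ["cp1252", "utf-8", "utf-16-be", "utf-16-le"]
    return {name: raw.decode(name) for name in codecs_to_try}

if __name__ == "__main__":
    for name, text in interpretations(RAW).items():
        # Show the resulting code points rather than glyphs, so the output
        # does not depend on the terminal's font or encoding.
        code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)
        print(f"{name:10} {code_points}")
```

All four decodes succeed without error, which is exactly the problem: nothing in the bytes themselves says which reading was intended.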
Some people might say that the rule should be "All files without a BOM are 8-bit ANSI." In that case, you're going to misinterpret all the files that use UTF-8 or UTF-16 and don't have a BOM. Note that the Unicode standard even advises against using a BOM for UTF-8, so you're already throwing out everybody who follows the recommendation.
Okay, given that the Unicode folks recommend against using a BOM for UTF-8, maybe your rule is "All files without a BOM are UTF-8." Well, that messes up all 8-bit ANSI files that use characters above 127.
Maybe you're willing to accept that ambiguity, and use the rule, "If the file looks like valid UTF-8, then use UTF-8; otherwise use 8-bit ANSI, but under no circumstances should you treat the file as UTF-16LE or UTF-16BE." In other words, "never auto-detect UTF-16". First, you still have ambiguous cases, like the file above, which could be either 8-bit ANSI or UTF-8. And second, you are going to be flat-out wrong when you run into a Unicode file that lacks a BOM, since you're going to misinterpret it as either UTF-8 or (more likely) 8-bit ANSI. You might decide that programs that generate UTF-16 files without a BOM are broken, but that doesn't mean that they don't exist. For example,
cmd /u /c dir >results.txt
This generates a UTF-16LE file without a BOM. If you poke around your Windows directory, you'll probably find other Unicode files without a BOM. (For example, I found COM+.log.) These files still "worked" under the old IsTextUnicode algorithm, but now they are unreadable. Maybe you consider that an acceptable loss.
The point is that no matter how you decide to resolve the ambiguity, somebody will win and somebody else will lose. And then people can start experimenting with the "losers" to find one that makes your algorithm look stupid for choosing "incorrectly".
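The "BOM if present, else valid UTF-8, else 8-bit ANSI, never auto-detect UTF-16" rule discussed above can be sketched in a few lines of Python. This is only an illustration of that rule, not Notepad's actual algorithm; `guess_encoding` is a hypothetical name, and `cp1252` again stands in for "8-bit ANSI".

```python
import codecs

# BOM prefixes mapped to Python codecs that consume the BOM on decode.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_BE, "utf-16"),
    (codecs.BOM_UTF16_LE, "utf-16"),
]

def guess_encoding(raw: bytes) -> str:
    """Trust a BOM if there is one; otherwise UTF-8 if it validates, else cp1252."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "cp1252"

if __name__ == "__main__":
    print(guess_encoding(b"\xd0\xae"))                  # the ambiguous file
    print(guess_encoding("dir".encode("utf-16-le")))    # BOM-less UTF-16LE, ASCII text
    print(guess_encoding("café".encode("utf-16-le")))   # BOM-less UTF-16LE, non-ASCII
```

Under this rule the BOM-less UTF-16LE inputs are never reported as UTF-16: the first is taken for UTF-8 (its bytes happen to validate, NULs and all) and the second for cp1252, which is the "flat-out wrong" case the text describes.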
In related news, we don't need to worry about this bug being used by unscrupulous sorts of folks in the comments here. The one and only time a lack of unicode support has come in useful...
Google are?
I've had a delightful time explaining to my trainees that "EVERY SERVER SHOULD ONLY BE RUN IN A LANG=C ENVIRONMENT". Unicode is *bad*, *bad*, *bad* for systems work of any sort.
And in a related XKCD post:
https://xkcd.com/327/
That works, until your servers have to process any kind of foreign characters whatsoever. This is a fault that only affects OS X, only when using Google Chrome. It's not (to my knowledge) a weakness of Unicode.
"Set a man a fire, he'll be warm for the rest of the night. Set a man afire, he'll be warm for the rest of his life."
My conclusion is that the unicode guys are assholes.
Yeah, well, it's not too hard to escape from unicode hell...
And use what instead? Firefox, the browser with a UI just as fucking bad as Chrome's, but that's also much slower and so much more bloated than Chrome is? Or Safari, which is basically equivalent to Chrome, but a year or two outdated? Or Opera, the new version of which is literally Chrome, and the old version which is getting very outdated these days? Or IE, which doesn't even run on OS X? Don't even waste my time with Vivaldi, or Pale Moon, or any of those other half-assed attempts at a modern browser.
Look, Chrome is the best we have on OS X, or any other platform for that matter. Its UI is rubbish, but at least it's a fast, sleek browser, unlike so many of its competitors. I hate Chrome, but the alternatives are so much worse, or not even available on OS X!
Of course we'd have options if Opera hadn't killed their good browser and replaced it with a steaming pile of monkey shit. We'd also have options if the Firefox devs were more concerned with creating a good browser than with crucifying their former CEO because he dared hold an opinion about gay marriage that differed from theirs. But that's not how reality is. So we'll continue to use Chrome until some other browser vendor gets its shit together and releases a better browser.
mtbf - 15 mins.
Need Mercedes parts ?
hmm, ancient and dead language from the time of reported magic. Just typing the words will crash your Mac. Imagine if one spoke them!
... we know that Assyrian or more precisely Sumerian is tricky.
Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
Just tried it in Chrome on OS X. Out of date article??
The news header speaks of Assyrian script, but Slashdot provides an egyptian scarabeus bug icon to accompany it
hmm. unicode is fine, utf-8 is fine. only windows uses boms. so who's the asshole?
Unicode made three big mistakes.
1. Attempting to be backwards compatible with a subset of ASCII. A subset that breaks all the common encodings used outside the US.
2. Multiple encodings (8, 16 and 32 bit). Pick one, stick to it, don't make people try to guess with stupid BOMs etc.
3. CJK unification. Trying to merge three distinct languages in a way that makes it impossible to mix them in a pure Unicode document.
So yeah, those guys are assholes.
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
As for point 1: UTF-8 is backward compatible with the full ASCII set. The full ASCII set only contains 128 code points; the extensions for latin-* are beyond ASCII.
I agree the UTF-16 encodings were a mistake, especially the whole business of encoding the extended planes. Maybe they should even drop UTF-32 as an encoding, since UTF-8 can encode any character anyway.
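Both claims are easy to check. A small Python sketch (the helper name `utf8_length` is made up for illustration):

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 needs for a single code point."""
    return len(ch.encode("utf-8"))

# All 128 ASCII code points are byte-for-byte identical in ASCII and UTF-8.
ascii_text = "".join(chr(i) for i in range(128))
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Beyond ASCII, UTF-8 grows to 2, 3, or 4 bytes per code point.
samples = ["A", "\u00a3", "\u20ac", "\U0001F600"]  # A, pound sign, euro sign, an emoji
print([utf8_length(ch) for ch in samples])  # [1, 2, 3, 4]
```

The last sample lies outside the Basic Multilingual Plane, so it is one of the "extended plane" characters that UTF-16 needs a surrogate pair for, while UTF-8 simply uses a fourth byte.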
Yeah, because people who speak funny foreign languages don't deserve to use our linguistically pure English-speaking servers, right?
Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
Unicode and how it is represented in a file are two different things. Unicode is a good idea, it solves many problems and contains all the (to me) strange characters used by: Greeks, Chinese, etc.
How to represent it in a file is different. UTF-8 is the obvious answer today, but other encodings were tried by different organisations first. The big win of UTF-8 is that you can have characters from very different regions on the same web page (or in the same file) - something that you cannot do if you adopt a purely 8 bit code like iso-8859-1.
We are still in transition: there are files encoded in various ways out there; however I think that UTF-8 will eventually become the encoding mechanism that everyone uses - so files encoded in other ways will become increasingly rare. So: a bit of patience please.
hmm. unicode is fine, utf-8 is fine. only windows uses boms. so who's the asshole?
The byte order mark is part of the unicode standard, and is used all over the place besides windows. Your question answers itself.
For UTF-16. "Only Windows uses BOMs" is pretty much correct for UTF-8, where the Unicode standard discourages it.
4. Inconsistent policy for character inclusion. After years of opposing addition of symbols commonly used in typesetting or web pages (such as a common symbol for indicating external links consisting of a box with a curved arrow coming out of it) on the basis that they are "not plain text and best represented by graphic images", we get emoji added. And they still won't add many of these symbols they've opposed in the past (they recently added the standard triangular recycling mark, but this was long after the emoji were added, with several circular Japanese recycling marks clearly demonstrating the hypocrisy).
This is the same problem that killed Internet Explorer - to make things easy on the devs we allowed malformed pages. No need to follow the standard, the algorithm will try to figure out what you mean and try to do the right thing. End result: How many people use IE these days? How many devs want to code on that platform?
Heuristic approaches to solving crappy interpretations of standards do nothing good for the standard - eventually they muddy the standard to the point it becomes exploitable and utterly useless. In other words, stop catering to the stupid people. Wrong should be wrong - end of. Stop being clever about it.
I am of Assyrian origin and I can assure you that these letters form a strong black spell which could crash wizards' books, and it seems to have similar effects on today's computers.
I agree overall with your comment, but I think UTF-8's backwards compatibility with ASCII was genius and is the reason we have as much Unicode support as we do today. I consider UTF-8 to be one of the best hacks of all time. Without it, the software that existed at the time would have had to be thrown out or re-written. The fact that software can (often) process UTF-8 without even being aware that it isn't ASCII was exactly what was needed to get Unicode off the ground. UTF-8 allowed Unicode to be adopted incrementally (especially by Unixes, which were much slower to adopt any (universal) international character set than Windows was).
Sadly, not everyone is as brilliant as Ken Thompson, so the UTF-8 encoding didn't exist when Unicode and ISO 10646 were first created. If someone had thought of it just a few years earlier we probably would have used that for nearly everything, and your second point would be irrelevant.
But by the time Unicode was even a thing, a lot of the software industry was already invested in ISO 10646, specifically UCS-2 (notably Microsoft and IBM, but plenty of others) so unless you think excluding IBM and Microsoft (in 1990!) would have been good for the widespread adoption of Unicode, the designers had no choice but to have multiple encodings.
Ironically, Linux and Apple were able to choose the (arguably much better) UTF-8 encoding only because they got serious about adopting an international character set several years later than Microsoft and IBM did (call it second mover advantage.)
So I couldn't call those mistakes. More like "historical accidents", just like most other bad designs we have to live with.
Your third point is just a face-palm, I agree.
The problem is that ASCII is only useful for US English. Other forms of English need symbols like the pound (£) sign. Other Latin derived languages need accented characters. Non-Latin languages already use some subset of ASCII plus extensions. Any software that has to support more than just 7-bit US ASCII and UTF-8 has to guess, and usually gets it wrong.
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
I know, Syrian, but still. I always knew he was going to be the death of Apple.
Place something witty here
The big downside of UTF-8 is using it as an in-memory string. To find the nth character you have to start at the beginning of the string.
C# and Java use UTF16 internally for strings.
I agree completely. There is no reason that a program cannot read UTF-8 and store as UTF-32 internally. There is a trade-off between time and memory. Note that UTF-16 is also a variable length encoding scheme so you still need to start at the start of string to find the nth character.
UTF-16 has the exact same problem, not every codepoint fits in the original UCS2 encoding so they added surrogate pairs. Only UTF-32/UCS4 escapes this issue but you still have to count from the start because what a human calls a on-screen character can be composed of several codepoints.
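The three points made in this subthread (UTF-8 needs a scan, UTF-16 has surrogate pairs, and even whole code points are not "characters") can be sketched in Python. The helper names `utf8_offset_of_nth` and `utf16_units` are hypothetical, and the scan assumes well-formed UTF-8.

```python
import unicodedata

def utf8_offset_of_nth(raw: bytes, n: int) -> int:
    """Byte offset of the n-th code point (0-based) in well-formed UTF-8.

    There is no way to jump straight to it: we walk from the start and
    count every byte that is not a continuation byte (0b10xxxxxx).
    """
    seen = -1
    for offset, byte in enumerate(raw):
        if byte & 0xC0 != 0x80:
            seen += 1
            if seen == n:
                return offset
    raise IndexError(n)

def utf16_units(text: str) -> int:
    """Number of 16-bit code units the text occupies in UTF-16."""
    return len(text.encode("utf-16-le")) // 2

if __name__ == "__main__":
    raw = "a\u042e\u20acb".encode("utf-8")      # 1-, 2-, 3- and 1-byte code points
    print(utf8_offset_of_nth(raw, 3))           # 6
    print(utf16_units("\U0001F600"))            # 2: one code point, a surrogate pair
    decomposed = "e\u0301"                      # e + combining acute accent
    print(len(decomposed), len(unicodedata.normalize("NFC", decomposed)))  # 2 1
```

The last line is the UTF-32 caveat: one on-screen character can be two code points, so fixed-width code points still do not give constant-time indexing by what a human would call a character.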
Or, you know, we could all accept that Notepad was originally created for 7-bit ASCII (with quasi 8-bit ANSI support) and that either a specific override or a BOM should be required to get different behavior. The only time 'your algorithm look stupid for choosing "incorrectly"' is when you try to create a "smart" algorithm and it's made to look "dumb". Meanwhile, if you choose a "dumb" algorithm, you'll see the backlash against the "smart" people who think they're clever.
Unicode made one enormous mistake - existing in the first place.
If plain ascii was good enough for Virgil, Newton & Shakespeare it's good enough for you.
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
Maybe you're willing to accept that ambiguity, and use the rule, "If the file looks like valid UTF-8, then use UTF-8; otherwise use
Yay! You actually got the answer partially correct. However you then badly stumble when you follow this up with:
8-bit ANSI, but under no circumstances UTF-16
The correct answer is "after knowing it is not UTF-8, use your complicated and error-prone encoding detectors".
The problem is that a whole lot of stupid code, in particular from Windows programmers, basically tries all kinds of matching against various legacy encodings and UTF-16, and only tries UTF-8 if all of those return false. This is why Unicode support still sucks everywhere.
You try UTF-8 FIRST. This is for two reasons. The first is that UTF-8 is really popular and thus likely the correct solution (especially if you count all ASCII files as UTF-8, which they are). The second is that a random byte stream is INCREDIBLY unlikely to be valid UTF-8 (like 2.6% chance for a two-byte file, and geometrically lower for any longer ones). This means your decision of "is this UTF-8" is very very likely to be correct. Just moving this really reliable test to be the first one will improve your detection enormously.
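The two-byte figure can be checked by brute force. Here is a rough Python sketch that tries all 65,536 two-byte inputs against Python's strict UTF-8 decoder; the function names are made up for illustration, and exactly what to count (with or without pure-ASCII pairs) is my assumption about what the figure refers to.

```python
from itertools import product

def is_valid_utf8(raw: bytes) -> bool:
    """True if the bytes decode as strict UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def count_valid_two_byte_files():
    """Return (valid, valid_with_a_non_ascii_byte) over all two-byte inputs."""
    valid = non_ascii = 0
    for a, b in product(range(256), repeat=2):
        if is_valid_utf8(bytes([a, b])):
            valid += 1
            if a >= 0x80 or b >= 0x80:
                non_ascii += 1
    return valid, non_ascii

if __name__ == "__main__":
    valid, non_ascii = count_valid_two_byte_files()
    print(valid, non_ascii)                               # 18304 1920
    print(f"{non_ascii / 65536:.1%} {valid / 65536:.1%}")
```

Leaving out pure-ASCII pairs, 1,920 of the 65,536 inputs validate, about 2.9%, which is in the same ballpark as the number quoted. If ASCII pairs are counted as UTF-8 too, the share is about 28%, so the argument really rests on files that contain at least one high byte.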
The biggest help would be to check for UTF-8 first, not last. This would fix "Bush hid the facts" because it would be identified as UTF-8. But a variation on that bug would still exist if you stuck a non-ASCII byte in there, in which case it would still be useful (but much much less important) to not do stupid things in the detector. For instance, requiring UTF-16 to either start with a BOM or to have at least one word with either the high or low byte all zero would be a good idea and indicate you are not an idiot.
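As I read it, the order proposed here looks roughly like the Python sketch below. The names `looks_like_utf16` and `detect` are hypothetical, the "legacy-8-bit" label is a placeholder for whatever legacy-encoding detector would run last, and the zero-byte test is my interpretation of "at least one word with either the high or low byte all zero".

```python
import codecs

def looks_like_utf16(raw: bytes) -> bool:
    """Accept UTF-16 only with a BOM, or if some 16-bit unit has a zero byte."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return True
    if len(raw) < 2 or len(raw) % 2:
        return False
    return any(raw[i] == 0 or raw[i + 1] == 0 for i in range(0, len(raw), 2))

def detect(raw: bytes) -> str:
    """UTF-8 first, then the guarded UTF-16 check, then legacy 8-bit encodings."""
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    if looks_like_utf16(raw):
        return "utf-16"
    return "legacy-8-bit"

if __name__ == "__main__":
    print(detect(b"Bush hid the facts"))        # plain ASCII
    print(detect(b"Bush hid the fact\xe9"))     # same length, one high byte
    print(detect("é".encode("utf-16-le")))      # BOM-less UTF-16 with a zero byte
```

One caveat with this sketch: BOM-less UTF-16 that contains only ASCII text is also valid UTF-8 (with embedded NUL bytes), so a strict "UTF-8 first" order reports it as UTF-8 unless NULs are treated as a hint.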
The big downside of UTF-8 is using it as an in-memory string. To find the nth character you have to start at the beginning of the string.
And this is important, why? Can you come up with an example where you actually produce "n" by doing anything other than looking at the n-1 characters before it in the string? No, and therefore an offset in bytes can be used just as easily.
C# and Java use UTF16 internally for strings.
And you are aware that UTF-16 is variable-length as well, and therefore you can't "find the nth character" quickly either?
You might want to retake compsci 101.
Same thing happens when you type Bill fed the goats. It's a Unicode error in Notepad for XP. You want something fun? Type that into Chrome for a Mac in an Apple store. That's fun.
Since it deleted the word here is an image of it http://2.bp.blogspot.com/-_TfD...
Actually Plan 9 and UTF-8 encoding existed well before Microsoft started adding Unicode to Windows.
The reason for 16-bit Unicode was political correctness. It was considered wrong that Americans got the "better" shorter 1-byte encodings for their letters, therefore any solution that did not punish those evil Americans by making them rewrite their software was not going to be accepted. No programmer at that time (including ones that did not speak English) would ever argue for using anything other than a variable-length byte encoding for a system that still had to deal with existing software and data that was ASCII, this was a command from people who did not have to write and maintain the software.
The programmers, who knew damn well that variable-length was the correct solution, were unfortunately not bright enough to avoid making mistakes in their encodings (such as not making them self-synchronizing). UTF-8 fixed that, but these errors also led some of the less-knowledgeable to think there was a problem with variable length.
Unfortunately political correctness at Microsoft won, despite the fact that they had already added variable-length encoding support to Windows. It may also have been seen as a way to force incompatibility with NFS and other networked data so that Microsoft-only servers could be used.
One of the few good things to come out of the "Unix wars" was that commercial Unix development was stopped before the blight of 16-bit characters was introduced (it was well on its way and would have appeared at the same time Microsoft did it). Non-commercial Unix made the incredibly easy decision to ignore "wide characters".
The biggest problem now is that Windows convinced a lot of people who should know better that you need to use UTF-16 to open files by name (all that is really needed is to convert UTF-8 just before the API is called). This led UTF-16 to infect Python, Qt, Java, and a lot of other software and cause problems and headaches and bugs even on Linux. There is some hope that they are starting to realize they made a terrible mistake; Python in particular seems to be backing out by storing a UTF-8 version of the string alongside the UTF-32.
Unicode is a good idea, it solves many problems and contains all the (to me) strange characters used by: Greeks, Chinese, etc.
That's one of its biggest problems: it doesn't support all the characters in Chinese. In fact it doesn't really support any of them, because they tried to merge them with Japanese and Korean characters. The result is that Unicode contains a sort of amalgamation that can be used to approximate any of those three languages, but not represent them properly.
I listen to both Japanese and Chinese music. Unicode is broken for me. There is no way to tell if a character is a Chinese or a Japanese one. The character has the same Unicode code for both languages. The software is supposed to somehow magically know which language is in use and select a Japanese or Chinese font. When you have file names or metadata tags there is no simple way of determining language, you just have to guess. Humans are pretty good at guessing, machines not so much.
That problem has nothing to do with encoding, it's to do with the standard body trying to merge characters from different languages that shouldn't be merged.
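The "same code for both languages" point can be seen from Python's standard `unicodedata` module. The sample character U+76F4 is my own pick, on the assumption that it is one of the ideographs usually drawn differently in Chinese and Japanese fonts; `describe` is a made-up helper name.

```python
import unicodedata

def describe(ch: str) -> str:
    """Code point and Unicode character name for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# One code point, one name, and no language information attached to it.
print(describe("\u76f4"))  # U+76F4 CJK UNIFIED IDEOGRAPH-76F4
```

Nothing in the string says whether a Chinese or a Japanese glyph is wanted, so file names and metadata tags have to carry that information some other way, or the software has to guess.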
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC