OS X Users: 13 Characters of Assyrian Can Crash Your Chrome Tab
abhishekmdb writes: No browsers are safe, as proved yesterday at Pwn2Own, but crashing one of them with just one line of special code is slightly different. A developer has discovered a hack in Google Chrome which can crash the Chrome tab on a Mac PC. The code is a 13-character special string which appears to be written in Assyrian script. Matt C has reported the bug to Google, who have marked the report as duplicate. This means that Google are aware of the problem and are reportedly working on it.
Let us henceforth dub it the Snow Crash exploit.
This exploit rang a bell, so I searched Bruce Schneier's website. And, sure enough, on July 15, 2000, he observed ``Unicode is just too complex to ever be secure.'' Doesn't exactly warm the cockles of the paranoid's heart.
I refuse to believe corporations are people until Texas executes one. -- desert rain on http://www.dailykos.com/user/
to ditch unicode support. They recognized that experimental technology like this shouldn't be rolled out to this many users. Thank you dice for keeping slashdot safe!
That script is the Syriac script not the Assyrian one: https://en.wikipedia.org/wiki/....
Complex software should be banned! Like the stuff that flies all the commercial aeroplanes and runs the nuclear reactors.
Now then, this particular Assyrian, the one whose cohorts were gleaming in purple and gold,
Just what does the poet mean when he says he came down like a wolf on the fold?
In heaven and earth more than is dreamed of in our philosophy there are a great many things.
But I don't imagine that among them there is a wolf with purple and gold cohorts or purple and gold anythings.
Ogden Nash
http://blogs.msdn.com/b/oldnew...
About every ten months, somebody new discovers the Notepad file encoding problem. Let's see what else there is to say about it.
First of all, can we change Notepad's detection algorithm? The problem is that there are a lot of different text files out there. Let's look just at the ones that Notepad supports.
8-bit ANSI (of which 7-bit ASCII is a subset). These have no BOM; they just dive right in with bytes of text. They are also probably the most common type of text file.
UTF-8. These usually begin with a BOM but not always.
Unicode big-endian (UTF-16BE). These usually begin with a BOM but not always.
Unicode little-endian (UTF-16LE). These usually begin with a BOM but not always.
If a BOM is found, then life is easy, since the BOM tells you what encoding the file uses. The problem is when there is no BOM. Now you have to guess, and when you guess, you can guess wrong. For example, consider this file:
D0 AE
Depending on which encoding you assume, you get very different results.
If you assume 8-bit ANSI (with code page 1252), then the file consists of the two characters U+00D0 U+00AE, or "Ð®". Sure this looks strange, but maybe it's part of the word VATNIÐ® which might be the name of an Icelandic hotel.
If you assume UTF-8, then the file consists of the single Cyrillic character U+042E
If you assume Unicode big-endian, then the file consists of the Korean Hangul syllable U+D0AE
If you assume Unicode little-endian, then the file consists of the Korean Hangul syllable U+AED0
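The four readings above can be reproduced in a few lines of Python. This is only an illustrative sketch; the helper name `code_points` is made up for the example, and the codec names are Python's standard ones.

```python
# The two bytes from the example file, decoded under each candidate encoding.
data = bytes([0xD0, 0xAE])

def code_points(raw: bytes, encoding: str) -> list[str]:
    """Decode raw bytes and return each resulting character as a U+XXXX label."""
    return [f"U+{ord(ch):04X}" for ch in raw.decode(encoding)]

for encoding in ("cp1252", "utf-8", "utf-16-be", "utf-16-le"):
    print(f"{encoding:10} {code_points(data, encoding)}")
```

Every decode succeeds without error, which is exactly the problem: nothing in the bytes themselves says which reading was intended.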
Some people might say that the rule should be "All files without a BOM are 8-bit ANSI." In that case, you're going to misinterpret all the files that use UTF-8 or UTF-16 and don't have a BOM. Note that the Unicode standard even advises against using a BOM for UTF-8, so you're already throwing out everybody who follows the recommendation.
Okay, given that the Unicode folks recommend against using a BOM for UTF-8, maybe your rule is "All files without a BOM are UTF-8." Well, that messes up all 8-bit ANSI files that use characters above 127.
Maybe you're willing to accept that ambiguity, and use the rule, "If the file looks like valid UTF-8, then use UTF-8; otherwise use 8-bit ANSI, but under no circumstances should you treat the file as UTF-16LE or UTF-16BE." In other words, "never auto-detect UTF-16". First, you still have ambiguous cases, like the file above, which could be either 8-bit ANSI or UTF-8. And second, you are going to be flat-out wrong when you run into a Unicode file that lacks a BOM, since you're going to misinterpret it as either UTF-8 or (more likely) 8-bit ANSI. You might decide that programs that generate UTF-16 files without a BOM are broken, but that doesn't mean that they don't exist. For example,
cmd /u /c dir >results.txt
This generates a UTF-16LE file without a BOM. If you poke around your Windows directory, you'll probably find other Unicode files without a BOM. (For example, I found COM+.log.) These files still "worked" under the old IsTextUnicode algorithm, but now they are unreadable. Maybe you consider that an acceptable loss.
The point is that no matter how you decide to resolve the ambiguity, somebody will win and somebody else will lose. And then people can start experimenting with the "losers" to find one that makes your algorithm look stupid for choosing "incorrectly".
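As a rough sketch of the "BOM first, then valid UTF-8, otherwise 8-bit ANSI, never auto-detect UTF-16" rule discussed above (this is not Notepad's actual algorithm, and `guess_encoding` is a hypothetical name):

```python
import codecs

# BOM signatures mapped to Python codec names. The "utf-8-sig" and "utf-16"
# codecs strip the BOM when decoding; "utf-16" also reads the byte order from it.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def guess_encoding(raw: bytes) -> str:
    """Guess an encoding: trust a BOM if present, else try UTF-8, else cp1252."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumption: "8-bit ANSI" means code page 1252, as in the example above.
        return "cp1252"

print(guess_encoding(b"\xd0\xae"))                 # ambiguous file: the rule picks UTF-8
print(guess_encoding("dir".encode("utf-16-le")))   # BOM-less UTF-16LE is misread
```

The second call shows the "loser" under this rule: BOM-less UTF-16LE text made of ASCII characters is also valid UTF-8 (with embedded NUL bytes), so it gets the wrong label.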
In related news, we don't need to worry about this bug being used by unscrupulous sorts of folks in the comments here. The one and only time a lack of unicode support has come in useful...
Well, I don't know about *foolproof*, but most of the time when software does bad things because of specially crafted input, it's because someone didn't bother to do an input validation that they obviously ought to have done. This has been a leading cause of bugs since the 1974 edition of "The Elements of Programming Style", which devotes 2 out of 56 lessons to it:
#19 Test input for plausibility and validity.
#20 Make sure input doesn't violate the limits of the program.
If K&P were writing that today they'd probably have a rule "never hand a piece of non-literal data to an interpreter without escaping anything the interpreter might consider lexically significant."
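A small sketch of what that hypothetical rule looks like in practice, using two escaping helpers from the Python standard library (the sample strings are invented for illustration):

```python
import html
import shlex

untrusted = "results.txt; echo pwned"

# Shell: quote the value so ';' stays data instead of becoming a command separator.
command = "cat " + shlex.quote(untrusted)

# HTML: escape the characters the parser would otherwise treat as markup.
comment = "<b onmouseover=alert(1)>hi</b>"
safe_markup = html.escape(comment)

print(command)
print(safe_markup)
```

Each interpreter has its own idea of what is lexically significant, so the escaping function has to match the interpreter that will eventually read the data.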
But this is evidently a somewhat *different* kind of bug -- perfectly valid data that some part of the program (likely a library) craps out on. Invalid/malicious input handling is a non-functional requirement, but this appears to be a *functional* requirement the programmers failed to implement or test.
Perhaps there should be a rule "if you don't do what you're supposed to with certain input yet, reject that input in a sensible way."
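For illustration only, such a rule might look like the sketch below. Everything here is assumed: an imaginary renderer that has not yet been tested with the Syriac block (U+0700 to U+074F), plus the made-up names `UnsupportedTextError` and `check_renderable`. It says nothing about how Chrome handles text.

```python
# Hypothetical: Unicode ranges this imaginary renderer cannot handle yet.
UNSUPPORTED_RANGES = [(0x0700, 0x074F)]  # the Syriac block

class UnsupportedTextError(ValueError):
    """Raised when text contains characters the renderer cannot handle yet."""

def check_renderable(text: str) -> str:
    """Return text unchanged, or raise a clear error naming the offending character."""
    for index, ch in enumerate(text):
        for low, high in UNSUPPORTED_RANGES:
            if low <= ord(ch) <= high:
                raise UnsupportedTextError(
                    f"unsupported character U+{ord(ch):04X} at position {index}"
                )
    return text
```

The caller gets a specific, catchable error instead of a crash, and the range list shrinks as support is added.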
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
My conclusion is that the unicode guys are assholes.
(They spoke Aramaic long before they became Christian, of course.)
The people in question call themselves Assyrians at the present day; there are some Akkadian words preserved in their Aramaic language even now, although Akkadian itself probably died out in the earlier part of the first millennium BC.
The name "Syriac" is itself from a worn-down version of the same name; it was once used pretty much as the equivalent of "Aramaic" but is now generally used to describe only one particular version of Aramaic which was a major literary language of Western Asia in early Christian times, and is still used as a liturgical language by Nestorian Christians as far afield as India. The script is used to write several modern Aramaic languages spoken by Christians.
These ancient communities have suffered greatly in the Middle East wars of recent times, and a huge proportion have left as refugees.
Aberrations have appeared in my destiny prognostication engine!
Unicode made three big mistakes.
1. Attempting to be backwards compatible with a subset of ASCII. A subset that breaks all the common encodings used outside the US.
2. Multiple encodings (8, 16 and 32 bit). Pick one, stick to it, don't make people try to guess with stupid BOMs etc.
3. CJK unification. Trying to merge three distinct languages in a way that makes it impossible to mix them in a pure Unicode document.
So yeah, those guys are assholes.
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
Unicode and how it is represented in a file are two different things. Unicode is a good idea, it solves many problems and contains all the (to me) strange characters used by: Greeks, Chinese, etc.
How to represent it in a file is different. UTF-8 is the obvious answer today, but other encodings were tried by different organisations first. The big win of UTF-8 is that you can have characters from very different regions on the same web page (or in the same file) - something that you cannot do if you adopt a purely 8 bit code like iso-8859-1.
We are still in transition: there are files encoded in various ways out there; however I think that UTF-8 will eventually become the encoding mechanism that everyone uses - so files encoded in other ways will become increasingly rare. So: a bit of patience please.
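The point about mixing scripts in one file is easy to check. A minimal sketch in Python (the sample string is made up for illustration):

```python
# One string mixing Greek, Chinese and an Icelandic letter.
mixed = "Greek: αβγ, Chinese: 中文, Icelandic: Ð"

# UTF-8 can represent every character and round-trips cleanly.
utf8_bytes = mixed.encode("utf-8")
round_trip_ok = utf8_bytes.decode("utf-8") == mixed

# A purely 8-bit code such as iso-8859-1 has no slots for Greek or Chinese.
try:
    mixed.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(round_trip_ok, latin1_ok)
```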
For UTF-16. "Only Windows uses BOMs" is pretty much correct for UTF-8, where the Unicode standard discourages it.