OS X Users: 13 Characters of Assyrian Can Crash Your Chrome Tab

Related poem by Tim the Gecko · 2015-03-21 06:55 · Score: 1

The Assyrian came down like the wolf on the fold,

And his cohorts were gleaming in purple and gold;

And the sheen of their spears was like stars on the sea,

When the blue wave rolls nightly on deep Galilee.

Byron

Re:Related poem by TheGratefulNet · 2015-03-21 06:56 · Score: 1

what? no 'burma shave' ??

--
"It is now safe to switch off your computer."
Re:Related poem by BarbaraHudson · 2015-03-21 07:30 · Score: 1, Troll

It's not Tuesday :-)

--
"Transparent" is a shit show that trades on every stereotype going. A man in drag is NOT a transsexual.
Re:Related poem by Anonymous Coward · 2015-03-21 07:32 · Score: 0

Lord Byron never got the praise that he justly deserved for his work with the "Burma Shave Corporation".
Re:Related poem by Chris Mattern · 2015-03-21 07:41 · Score: 4, Funny

Now then, this particular Assyrian, the one whose cohorts were gleaming in purple and gold,
Just what does the poet mean when he says he came down like a wolf on the fold?
In heaven and earth more than is dreamed of in our philosophy there are a great many things.
But I don't imagine that among them there is a wolf with purple and gold cohorts or purple and gold anythings.
Ogden Nash
Re:Related poem by Anonymous Coward · 2015-03-21 09:46 · Score: 0

No need to worry, ISIL is busy eliminating all Assyrians and any trace of them.
Re:Related poem by Vlad_the_Inhaler · 2015-03-21 21:17 · Score: 1

So that is Google's way of fixing this problem?

--
Mielipiteet omiani - Opinions personal, facts suspect.

Be warned by Anonymous Coward · 2015-03-21 06:59 · Score: 0

Internet cat-youtuber-viewer, you could be attacked at any moment and lose all your newly discover-list at any moment.

Thank you, Neal Stephenson by Applehu Akbar · 2015-03-21 07:01 · Score: 5, Funny

Let us henceforth dub it the Snow Crash exploit.

Re:Thank you, Neal Stephenson by fuzzyfuzzyfungus · 2015-03-21 07:27 · Score: 1

Weren't the Snow-Crash-related fertile crescent dwellers Sumerians, the Xerox-PARC of Mesopotamian civilization, who invented more or less everything and then got massacred by their imitators?
Re:Thank you, Neal Stephenson by Whiteox · 2015-03-21 08:01 · Score: 1

It's the imitator language derivative that is still being used today in Old Persia. Those Iranians are fun guys!
It's the script to use when you don't want to write in Arabic.

--
Don't be apathetic. Procrastinate!
Re:Thank you, Neal Stephenson by Anonymous Coward · 2015-03-21 08:51 · Score: 0

nope. assyrian...
it shall be known as the ass-crash!
Re:Thank you, Neal Stephenson by fredgiblet · 2015-03-21 18:06 · Score: 1

That was my first thought as well.

Chrome on OSX by Anonymous Coward · 2015-03-21 07:05 · Score: 0, Troll

Like a turd in a toilet..

Just flush it and get real.

Man bites dog by Anonymous Coward · 2015-03-21 07:13 · Score: 1

Stop the presses a bug found in a large complex program.

Re:Man bites dog by gnupun · 2015-03-21 07:29 · Score: 1

... which millions of people use to connect to the internet... and there are dozens (thousands) of bugs still hidden where that bug came from. Do you still think browsers should be allowed for serious stuff like online banking, home automation and online elections?
Re:Man bites dog by viperidaenz · 2015-03-21 07:38 · Score: 2

Complex software should be banned! Like the stuff that flies all the commercial aeroplanes and runs the nuclear reactors.
Re:Man bites dog by Anonymous Coward · 2015-03-21 08:12 · Score: 0

And it was a dupe too, Slashdot reporting in!
Re:Man bites dog by mSparks43 · 2015-03-21 08:38 · Score: 1

Yeah, because no one was ever shot in a real bank.
Re:Man bites dog by disambiguated · 2015-03-21 14:05 · Score: 1

Stop the presses a bug found in a large complex program.
No Browser is safe : Chrome, Firefox, Internet Explorer, Safari all hacked at Pwn2Own contest
It's not "a bug" in "a program". It's every major browser. And it's pretty much like this every time they do pwn2own. If a group of hackers are able to bring down every major browser with previously unknown* exploits every year just for a chance to win a laptop, what can better motivated (financed) groups do?
* unknown to the browser developers anyway... 17 seconds to pwn IE, yeah right... like they say on the cooking shows "here's one I prepared earlier"

What is the capital of Assyria? by Anonymous Coward · 2015-03-21 07:14 · Score: 0

Aaaaaaaaaahhhh....

Re:What is the capital of Assyria? by ArcadeMan · 2015-03-21 07:27 · Score: 0

Bridgekeeper: Stop. Who would cross the Bridge of Death must answer me these questions three, ere the other side he see.
Sir Lancelot: Ask me the questions, bridgekeeper. I am not afraid.
Bridgekeeper: What... is your name?
Sir Lancelot: My name is Sir Lancelot of Camelot.
Bridgekeeper: What... is your quest?
Sir Lancelot: To seek the Holy Grail.
Bridgekeeper: What... is your favourite colour?
Sir Lancelot: Blue.
Bridgekeeper: Go on. Off you go.
Sir Lancelot: Oh, thank you. Thank you very much.
Sir Robin: That's easy.
Bridgekeeper: Stop. Who would cross the Bridge of Death must answer me these questions three, ere the other side he see.
Sir Robin: Ask me the questions, bridgekeeper. I'm not afraid.
Bridgekeeper: What... is your name?
Sir Robin: Sir Robin of Camelot.
Bridgekeeper: What... is your quest?
Sir Robin: To seek the Holy Grail.
Bridgekeeper: What... is the capital of Assyria?
[pause]
Sir Robin: I don't know that.
[he is thrown over the edge into the volcano]
Sir Robin: Auuuuuuuugh.

--
Get free satoshi (Bitcoin) and Dogecoins

So, A Bug Then. by Anonymous Coward · 2015-03-21 07:17 · Score: 0

Apparently, specially crafted input can expose bugs. It won't ever change. Anyone who thinks that computer software can be made foolproof either doesn't understand how it's made, or is in denial. This would have been news about 1985.

Exactly why is this front page news?

Re:So, A Bug Then. by hey! · 2015-03-21 08:12 · Score: 2

Well, I don't know about *foolproof*, but most of the time when software does bad things because of specially crafted input, it's because someone didn't bother to do an input validation that they obviously ought to have done. This has been a leading cause of bugs since the 1974 edition of "The Elements of Programming Style", which devotes 2 out of 56 lessons to it:

#19 Test input for plausibility and validity.
#20 Make sure input doesn't violate the limits of the program.
If K&P were writing that today they'd probably have a rule "never hand a piece of non-literal data to an interpreter without escaping anything the interpreter might consider lexically significant."
But this is evidently a somewhat *different* kind of bug -- perfectly valid data that some part of the program (likely a library) craps out on. Invalid/malicious input handling is a non-functional requirement, but this appears to be a *functional* requirement the programmers failed to implement or test.
Perhaps there should be a rule "if you don't do what you're supposed to with certain input yet, reject that input in a sensible way."

--
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.

Schneier got it right a decade and a half ago by Max Hyre · 2015-03-21 07:19 · Score: 4, Informative

This exploit rang a bell, so I searched Bruce Schneier's website. And, sure enough, on July 15, 2000, he observed ``Unicode is just too complex to ever be secure.'' Doesn't exactly warm the cockles of the paranoid's heart.

--
I refuse to believe corporations are people until Texas executes one. -- desert rain on http://www.dailykos.com/user/

Re:Schneier got it right a decade and a half ago by gweihir · 2015-03-21 07:39 · Score: 1

At that time, Schneier was just one of many that held this opinion. None of us is surprised by what is happening. If you want to be secure, stay away from Unicode or process UTF-8 as ASCII. As soon as you try to render, parse or even only compare anything besides standard ASCII, you are screwed.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:Schneier got it right a decade and a half ago by Antique Geekmeister · 2015-03-21 07:50 · Score: 1

Unfortunately, unicode is now woven into various Java string handling and database interactions, and it is far too complex to test all the possible input and storage scenarios. I've also noticed a strong tendency among current QA engineers to test only the new feature, and to avoid testing old components interacting with new features without _amazing_ pushback from their managers who want to keep testing costs very small. The result is a fairly predictable string of failure modes, and of production failures, that can be avoided by discarding such expensive, complicating software features as Unicode.
Re:Schneier got it right a decade and a half ago by lgw · 2015-03-21 08:20 · Score: 1

UTF8 has nothing to do with it.
The problem commonly is: people try to "clean" input with some stupid regex, rather than treating all user-provided strings as permanently dirty. You can do anything you need to, risk-free, with this attitude. You have to understand the encoding you use for storage/transmission (if your framework doesn't provide a way to safely, blindly store/transmit any string, then just encode the string in some way first), but that's a much, much smaller world than the universe of possible user strings.

As soon as you try to render, parse or even only compare anything besides standard ASCII, you are screwed.
Render? Displaying a glyph incorrectly is one thing, but crashing or leaving some exploit open is raw incompetence. Parse? If you need to parse user input, you likely have bigger issues (if you're running user scripts or whatever). Compare? Again, you might get the order wrong (is there even a defined order for pictographic languages?), but crashing is inexcusable.
They're just bytes for fuck's sake. What kind of moron can't process them safely in this day and age?
But then, this is Chrome we're talking about - the initial release would crash with a 2-character string (";=" was it?), due to an error that never should have made it past code review - subtracting 1 from an unsigned value, then using the result as a limit in a for loop IIRC. Might as well be checking your passwords into github.

--
Socialism: a lie told by totalitarians and believed by fools.
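The unsigned-subtraction pitfall mentioned above can be sketched as follows. This is only an illustration of the general bug class, not Chrome's actual code; the 32-bit width and both function names are assumptions made for the example.

```python
import ctypes

def loop_limit_unsigned(length):
    """Mimic C's `unsigned limit = length - 1;` with a 32-bit unsigned type."""
    # ctypes integer types do no overflow checking: -1 wraps to 2**32 - 1.
    return ctypes.c_uint32(length - 1).value

def loop_limit_checked(length):
    """Handle the empty case explicitly before subtracting."""
    return length - 1 if length > 0 else 0

# With an empty input the "limit" becomes enormous instead of negative.
print(loop_limit_unsigned(0))
print(loop_limit_checked(0))
```

A loop that runs up to the wrapped limit would walk far past the end of a zero-length buffer, which is why the empty case has to be tested before the subtraction.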
Re:Schneier got it right a decade and a half ago by Anonymous Coward · 2015-03-21 08:43 · Score: 0

It's not an exploit. It just causes the browser to crash. Get a little perspective here.
Re:Schneier got it right a decade and a half ago by Anonymous Coward · 2015-03-21 09:11 · Score: 0

It's a denial-of-service exploit. Causing the browser to crash does deny its services to the user, even if just temporarily.
Re:Schneier got it right a decade and a half ago by countach · 2015-03-21 11:24 · Score: 1

I don't know what your definition of "dirty" is, but there are going to be scenarios where you need your data cleaned.
Re:Schneier got it right a decade and a half ago by gnasher719 · 2015-03-21 11:55 · Score: 1

Well, Assyrian Unicode characters are in the range around U+12000. They require four bytes in UTF-8 and two 16-bit words in UTF-16.

In UTF-8 I'd be surprised if someone handled this wrong, because three byte characters are common, and there is no good reason to be able to process three byte but not four byte UTF-8.

If they are using UTF-16 on the other hand, I wouldn't be surprised if someone assumes that characters are a single UTF-16 word.
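A quick way to check the sizes discussed above is to encode a few sample characters and count code units. This is a minimal sketch; the sample characters are chosen for illustration. Note that U+12000 sits in the Cuneiform block, while Syriac letters such as U+0710 are in the U+0700 to U+074F range and need only two bytes in UTF-8.

```python
def code_units(ch):
    """Return (UTF-8 byte count, UTF-16 16-bit word count) for one character."""
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_words = len(ch.encode("utf-16-le")) // 2
    return utf8_bytes, utf16_words

samples = {
    "LATIN A": "A",
    "SYRIAC ALAPH": "\u0710",
    "EURO SIGN": "\u20ac",
    "CUNEIFORM SIGN A": "\U00012000",
}

for label, ch in samples.items():
    print(label, code_units(ch))
```

Only the last sample needs a surrogate pair in UTF-16, which is the case that breaks code assuming one 16-bit word per character.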
Re:Schneier got it right a decade and a half ago by gnasher719 · 2015-03-21 12:13 · Score: 1

I'd say the things that Schneier mentions in this article are not actual problems. The first step is avoiding UTF-16 because it is much too tempting to assume that one 16-bit word = one character; nobody will make that assumption with UTF-8. The next step is cleaning UTF-8 and accepting only valid UTF-8; simply removing anything that isn't valid will do fine. What _must_ happen is that after this cleaning step nobody ever again accesses the original data, only the cleaned data. At that point handling the characters is no problem.
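
The cleaning step described above might look like the sketch below, assuming Python's built-in strict UTF-8 decoder is an acceptable definition of "valid". The function names are hypothetical.

```python
def is_valid_utf8(raw):
    """True if raw decodes under Python's strict UTF-8 decoder."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

def clean_utf8(raw):
    """Drop every byte that is not part of a valid UTF-8 sequence."""
    # After this call, only the returned string should be used downstream.
    return raw.decode("utf-8", errors="ignore")

dirty = b"ab\xffcd"
print(is_valid_utf8(dirty))
print(clean_utf8(dirty))
```

Passing `errors="replace"` instead substitutes U+FFFD for each bad sequence, which avoids the case where deleting bytes joins two harmless fragments into something meaningful.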

There are other problems. Like the incredibly convoluted way to handle Unicode characters inside MIME headers. Well, MIME headers are awful anyway. I can certainly see bugs possible there. It _should_ be possible to write code that might fail or work not quite correctly but have no security problems.

The big problem is that with Unicode what you see is not what you have. Like using Cyrillic or Greek uppercase letters that look exactly like Latin ones, in order to get Unicode incorrectly handled not by the software, but by the user.
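The lookalike problem can be made visible with a rough check like the one below. It is a heuristic sketch only: it takes the first word of each character's Unicode name as the script, which is an assumption that holds for these samples but is not a real script-detection API.

```python
import unicodedata

def scripts_used(text):
    """Collect the first word of each letter's Unicode name, e.g. LATIN."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }

latin = "PAYPAL"
# Cyrillic ER, A, U, ER, A followed by a Latin L: looks similar on screen.
spoof = "\u0420\u0410\u0423\u0420\u0410L"

print(latin == spoof)
print(scripts_used(latin))
print(scripts_used(spoof))
```

A mixed-script result does not prove an attack, but it is a cheap signal that the string deserves a second look before it is shown to a user as a name or a domain.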
Re:Schneier got it right a decade and a half ago by Anonymous Coward · 2015-03-21 14:12 · Score: 0

Best advice if you want security: ignore all user input.
Re:Schneier got it right a decade and a half ago by Anonymous Coward · 2015-03-21 14:17 · Score: 0

I've had arguments with more than one developer who insisted that all UTF-16 characters are a single 16-bit word. It's just close enough to being true that many think it is. Four byte UTF-8 is a similar problem. You can see a lot of unicode in a lot of languages before you ever come across a four byte character. I'm sure many programmers (and therefore libraries) think three is the max, and they'd usually be right.
Re:Schneier got it right a decade and a half ago by disambiguated · 2015-03-21 14:28 · Score: 2

Unicode is sort of complicated, or at least it's more complicated than might be expected. But the problem with Schneier saying "Unicode is too complex to ever be secure" is that he might as well just say "programming is too complex to ever be secure." Sure, Unicode is a little complicated. But it's hardly the most complicated thing you'll ever have to deal with as a programmer. If we can't even get that right, we might as well just quit.
Re:Schneier got it right a decade and a half ago by Anonymous Coward · 2015-03-21 20:14 · Score: 0

Besides, plenty of exploits are just crashes with "side-effects" so to speak. The bug may very well be exploitable, even if right now all anyone can do with it is cause a crash.
Re:Schneier got it right a decade and a half ago by Anonymous Coward · 2015-03-21 21:59 · Score: 0

I'm not quite sure what your point is other than to try to look smart by misunderstanding the problem and saying other developers are dumb, and then bashing Chrome for having bugs.
Have you ever developed any system more complicated than a college project?
Re:Schneier got it right a decade and a half ago by AmiMoJo · 2015-03-22 00:07 · Score: 1

If they had just stuck with 24 or 32 bits per character, instead of going with multiple variable length character encodings, you might be right. When you can't be sure how many bytes any given character needs you can't use simple maths to work out how big buffers need to be, or even be sure that you won't end up with odd spare bytes at the end.
It looks like this is what has happened here. Even supposedly well debugged library code still has issues with it.

--
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
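The buffer-sizing concern above can be illustrated with a small sketch. It assumes the text is already valid and the only question is fitting it into a fixed number of UTF-8 bytes; the helper names are hypothetical.

```python
def max_buffer_bytes(code_points):
    """Worst-case UTF-8 size: every code point may need four bytes."""
    return 4 * code_points

def truncate_utf8(text, max_bytes):
    """Cut text to at most max_bytes of UTF-8 without leaving a partial character."""
    raw = text.encode("utf-8")[:max_bytes]
    # The input was valid, so the only possible error is a split final
    # sequence, which errors="ignore" discards.
    return raw.decode("utf-8", errors="ignore")

sample = "a\u20acb"  # 1 + 3 + 1 bytes in UTF-8
print(len(sample.encode("utf-8")))
print(truncate_utf8(sample, 3))
print(truncate_utf8(sample, 4))
```

Cutting at three bytes would split the euro sign in the middle, so the sketch drops it entirely rather than leaving stray bytes at the end.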
Re:Schneier got it right a decade and a half ago by lgw · 2015-03-22 01:27 · Score: 1

You might be right, but it's such an old problem - it was a big deal 10 years ago in the Windows world as UCS2 didn't handle it. C# was actually UTF from the start, like Java, of course.
Still, crashing because of, what, a null in the input? I could certainly understand truncation (just like other incorrect display problems), but a crash?

--
Socialism: a lie told by totalitarians and believed by fools.
Re:Schneier got it right a decade and a half ago by lgw · 2015-03-22 01:40 · Score: 1

Have you ever developed any system more complicated than a college project?
One or two; one or two. Somehow I've never managed to develop one that would crash due to malformed input, however.

--
Socialism: a lie told by totalitarians and believed by fools.
Re:Schneier got it right a decade and a half ago by gweihir · 2015-03-22 02:09 · Score: 1

Indeed. That is why I usually add to stay away from Java if you want/need security. Testing is pretty much a non-starter to get secure code though, unless the person doing the tests really understands the code, security and has a generous testing budget. In usual industrial practice, none of the three are the case.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:Schneier got it right a decade and a half ago by gweihir · 2015-03-22 02:13 · Score: 1

You miss my point: I basically said that as soon as you are interpreting the data as Unicode, you are screwed. As to treating input as permanently dirty, that would be effective if possible, but it is not. For much security-critical functionality, you just have to reject anything that is not 7-bit ASCII, because quite often you need to sanitize input and use it afterwards.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:Schneier got it right a decade and a half ago by gweihir · 2015-03-22 02:14 · Score: 1

No, actually the best advice is to not do any computations at all, i.e. pull the plug. Unfortunately, just like ignoring user input, that comes with the slight problem that your software cannot get any work done anymore.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:Schneier got it right a decade and a half ago by gweihir · 2015-03-22 02:16 · Score: 1

Indeed. The problem is that Unicode is far too complex to still be understandable to the average programmer (and the good ones have to waste far too much time on it). Of course, you should always make your assumptions explicit and do explicit rejection of anything you are not prepared to process. But that would be a sound coding practice, and we cannot have that, now can we?

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:Schneier got it right a decade and a half ago by gweihir · 2015-03-22 02:18 · Score: 1

My point is that my first impression when I heard about Unicode a long time ago was "this is really dumb and it will kill security".
As to your Ad Hominem: You are an anonymous coward and have no standing.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:Schneier got it right a decade and a half ago by Antique Geekmeister · 2015-03-22 03:05 · Score: 1

It's also aggravated by the "install the latest software, and build components, from arbitrary 3rd party repositories" habit. I'm afraid that I just had a long discussion with some Java developers who were accustomed to building their software on their desktops, pulling in arbitrary, unknown versions of components and their dependencies, and using the resulting components to build the next round. I'm afraid it's reminding me, forcibly, of Perl developers saying "just use cpan build!", and ruby developers saying "just install the gem".
If you don't pay attention to the components of your build environment, your qa environment, and your production environment, your testing cannot be reliable. That can be a very hard policy to teach, and to enforce.
Re:Schneier got it right a decade and a half ago by spitzak · 2015-03-22 12:19 · Score: 1

Yes, Java and Python (3) and Qt all are causing enormous difficulties as they followed Microsoft down the fantasy road and thought you had to convert strings on input to "unicode" or somehow it was impossible to use them. Since not all 8-bit strings can convert there must either be a lossy conversion or there must be an error, neither of which are expected, especially if the software is intended to copy data from one point to another without change.
The original poster is correct in saying "stay away from Unicode". This does not mean that Unicode is impossible. It means "treat it as a stream of bytes". Do not try to figure out what Unicode code points are there unless you really really have a reason to. And you will be surprised how little you need to figure this out. In particular you can search for arbitrary regexps (including sets of Unicode code points) with a byte-based regexp interpreter. And you can search for ASCII characters with trivial code.
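The byte-level searching described above relies on a property of UTF-8: ASCII byte values never occur inside a multi-byte sequence, and a valid encoded character can only match at a character boundary. A minimal sketch, with a hypothetical helper and made-up sample data:

```python
import re

def byte_pattern_for(chars):
    """Build a bytes regex matching any one of the given code points."""
    alternatives = b"|".join(re.escape(ch.encode("utf-8")) for ch in chars)
    return re.compile(b"(?:" + alternatives + b")")

# Sample record kept as raw bytes; it is never decoded for the searches below.
data = "key=\u0710\u0712;n=7".encode("utf-8")

# Searching for an ASCII delimiter is plain byte splitting.
fields = data.split(b";")

# Searching for a set of code points works on bytes as well.
match = byte_pattern_for("\u0712\u20ac").search(data)

print(fields[1])
print(match.start())
```

The match offset is a byte offset, so anything before it can still be decoded cleanly if the text is ever needed as characters.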
Re:Schneier got it right a decade and a half ago by lgw · 2015-03-23 06:56 · Score: 1

Maybe I'm still not getting your point. Sure, if you need to understand the details of Unicode character composition and such because you're the one rendering the output glyphs, or you want to sort or search across different encodings of the same word, that's rough, but there's no excuse for a security failure while doing those tasks.
On your other point: the notion of "sanitizing input" is fundamentally flawed to begin with. You can never know what future framework that user data will be interacting with, and what might be interpreted as an escape sequence in that mysterious future, but you can assume that the guy doing that future work will just assume "the input was sanitized", and you're screwed. Instead, don't go there. If e.g. you need to store a user string in a SQL DB, do it in such a way that there's no possible problematic string (perhaps the DB has a way of doing queries that's guaranteed safe, for example). If e.g. you need to send a user string inside an XML blob, just convert the user string to a hex/base64/whatever representation first - guaranteed safe.
What usecase were you thinking of that makes any of this hard at all?

--
Socialism: a lie told by totalitarians and believed by fools.
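The two approaches described above (a query mechanism that never interprets the string, and an opaque encoding for transport) can be sketched with Python's standard `sqlite3` and `base64` modules. The table name, helper names, and sample string are assumptions made for illustration.

```python
import base64
import sqlite3

def store_comment(conn, text):
    """Insert text through a placeholder so it is never parsed as SQL."""
    conn.execute("INSERT INTO comments (body) VALUES (?)", (text,))

def to_transport(text):
    """Encode arbitrary text as base64 so it cannot contain markup characters."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

def from_transport(token):
    """Reverse to_transport."""
    return base64.b64decode(token).decode("utf-8")

hostile = "x'); DROP TABLE comments;-- <b>&\u0710"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (body TEXT)")
store_comment(conn, hostile)
stored = conn.execute("SELECT body FROM comments").fetchone()[0]

token = to_transport(hostile)
print(stored == hostile)
print(from_transport(token) == hostile)
```

In both cases the user string is stored or sent unchanged and nothing tries to decide which characters are "dangerous", which is the point being argued.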

How well forethought of dice by NotInHere · 2015-03-21 07:19 · Score: 5, Funny

to ditch unicode support. They recognized that experimental technology like this shouldn't be rolled out to this many users. Thank you dice for keeping slashdot safe!

Re:How well forethought of dice by cdrudge · 2015-03-21 09:39 · Score: 1

Did Dice ditch unicode support? I thought the slash code always had issues/didn't support it, long before Dice acquired them.
Re:How well forethought of dice by wiredlogic · 2015-03-21 09:54 · Score: 1

Yeah. It's not like Slashdot.jp patched slashcode to support Unicode 10+ years ago.

--
I am becoming gerund, destroyer of verbs.
Re:How well forethought of dice by AmiMoJo · 2015-03-21 10:13 · Score: 2

Actually we are probably going to have to ditch Unicode at some point, at least in its current form. East Asian language support is badly broken. It could be fixed, but not in a non-breaking way.
CJK unification is one of the biggest screw-ups in the history of computing.

--
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
Re:How well forethought of dice by Anonymous Coward · 2015-03-21 10:13 · Score: 1

perhaps i can draw the situation in pictures
joke

0
\/ you /\
Re:How well forethought of dice by Anonymous Coward · 2015-03-21 10:20 · Score: 0

speaking of unicode support, it appears that /. has no *text* support either.
Re:How well forethought of dice by Anonymous Coward · 2015-03-21 11:31 · Score: 1

From what I understand, unicode abandoned CJK unification a long time ago; there are now separate planes for each language.
Of course the old planes still exist, so you need to transpose those when you find them in a string.
Re:How well forethought of dice by Megane · 2015-03-21 17:49 · Score: 1

The support is in there, it's just that it uses a whitelist, which happens to be very small, probably only to U+00FF if that much. There are also likely problems on the client side where the user's browser posts in the wrong encoding.

--
#naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }

Re:How well forethought of dice by PincushionMan · 2015-03-23 02:04 · Score: 1

No, it does, you just have to use the

 tags, like so...


      This is preformatted text 


Well, that stinks. Let me try the <tt> tags, then:


    This is preformatted text with     tt & /tt    tags.


Phooey.  It ate all my extra spaces.  I suppose you could use &nbsp; non-breaking spaces....



Nope.  I guess trolls abused these features too much in the distant past, so I sort of understand that.

I'm still confused about the lack of Unicode, though.  I thought Perl could handle it?

Re:How well forethought of dice by tlhIngan · 2015-03-23 03:39 · Score: 1

Did Dice ditch unicode support? I thought the slash code always had issues/didn't support it, long before Dice acquired them.
Slashcode always supported Unicode.
The reason it appears it doesn't is that thanks to a bunch of wankers who decided to abuse Unicode to no end, it ended up screwing the site layout up thanks to abuse of control codes.
So what was added was an input filter that limited what Unicode could come in - pretty much just ASCII at this point.
Unicode IS complex, and you really cannot blindly handle "all of it" because there will be odd edge cases you will NOT have thought of. And even more so as it's not a static character set - today you might think you handled all the edge cases, but tomorrow the new Unicode spec may introduce more and now you have more edge cases and combinations to test.
A couple of years ago it seems Slashcode implemented an output Unicode filter as well - because the old pages that were screwed up by the Unicode abuse no longer are screwed up. But their legacy lives on - Google for ":erocS" on /.
(Yes, abuse of the right-to-left override meant you could "fakemod" yourself by pretending you had a +5 mod)
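The right-to-left override trick and a filter against it can be sketched as below. This is not Slashcode's actual filter, which is not shown in the thread; the set of control characters, the function names, and the sample moderation label are assumptions for illustration.

```python
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE

# Bidirectional control characters: marks, embeddings/overrides, isolates.
BIDI_CONTROLS = {0x200E, 0x200F, *range(0x202A, 0x202F), *range(0x2066, 0x206A)}

def strip_bidi_controls(text):
    """Remove bidi control characters but keep everything else."""
    return "".join(ch for ch in text if ord(ch) not in BIDI_CONTROLS)

def ascii_only(text):
    """A much blunter filter: keep only 7-bit ASCII."""
    return "".join(ch for ch in text if ord(ch) < 128)

spoofed = RLO + "lanoitamrofnI ,5 :erocS"
# Roughly what a bidi-aware renderer shows: the overridden run reversed.
visible = spoofed[1:][::-1]

print(visible)
print(strip_bidi_controls(spoofed))
```

Once the override character is stripped, the stored text reads backwards in plain sight, which matches the ":erocS" residue mentioned above.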

In fairness... by fuzzyfuzzyfungus · 2015-03-21 07:25 · Score: 0

If I were looking for a language to scare a program into submission with, Assyrian would be a pretty plausible choice. Even by the rather high standards of the rough neighborhood that is the near and middle east, they cut quite a swath of blood-soaked mayhem through their neighbors; and put out lots of cuneiform inscriptions and rather morbid art gloating about their efficiency at this.

Re:In fairness... by bargainsale · 2015-03-21 08:16 · Score: 1

Wrong Assyrians. The ones you're thinking of spoke Akkadian and wrote cuneiform.

Eventually their (Christian) descendants ended up speaking Aramaic like practically everyone else in the Near East at the time (it was the official language in the Western part of the Persian Empire); the modern Assyrian language is one of the many forms of modern Aramaic (now split into several different languages, much as Latin evolved into several different languages over much the same period) and this script is properly called Syriac, specifically Estrangela.

--
Aberrations have appeared in my destiny prognostication engine!
Re:In fairness... by bargainsale · 2015-03-21 08:46 · Score: 3, Informative

(They spoke Aramaic long before they became Christian, of course.)

The people in question call themselves Assyrians at the present day; there are some Akkadian words preserved in their Aramaic language even now, although Akkadian itself probably died out in the earlier part of the first millennium BC.

The name "Syriac" is itself from a worn-down version of the same name; it was once used pretty much as the equivalent of "Aramaic" but is now generally used to describe only one particular version of Aramaic which was a major literary language of Western Asia in early Christian times, and is still used as a liturgical language by Nestorian Christians as far afield as India. The script is used to write several modern Aramaic languages spoken by Christians.

These ancient communities have suffered greatly in the Middle East wars of recent times, and a huge proportion have left as refugees.

--
Aberrations have appeared in my destiny prognostication engine!

Syriac not Assyrian by seyyah · 2015-03-21 07:26 · Score: 4, Informative

That script is the Syriac script not the Assyrian one: https://en.wikipedia.org/wiki/....

Re:Syriac not Assyrian by Anonymous Coward · 2015-03-21 07:33 · Score: 0

That is a messed up script.
https://www.youtube.com/watch?v=FavUpD_IjVY
Re:Syriac not Assyrian by Anonymous Coward · 2015-03-21 16:17 · Score: 0

Same difference. Modern day Assyrians call the language Assyrian or Syriac. I should know being an Assyrian myself.
Re:Syriac not Assyrian by PJ6 · 2015-03-22 11:37 · Score: 1

Yes, but what does it say?

Dupe by NotInHere · 2015-03-21 07:27 · Score: 1

this report is a dupe: https://code.google.com/p/chro...

I call BS.... by Anonymous Coward · 2015-03-21 07:28 · Score: 0

Google translate doesn't even do Assyrian!

Re:I call BS.... by bargainsale · 2015-03-21 08:29 · Score: 1

It says "John, house of Ephraim."

Who says the Internet isn't educational?

--
Aberrations have appeared in my destiny prognostication engine!

Lotus Notes was like this too by tigersha · 2015-03-21 07:33 · Score: 1

I once had a small Notes web thing running for a bunch of people in Scandinavia. The thing crashed every time when someone from Iceland worked with it. Turned out that the Icelandic character is not in some middle European character set (this was before UTF-8) and wasted Notes every time. That was a total bastard of a problem to find.

--
The dangers of excessive individualism are nothing compared to the oppressiveness of excessive collectivism

Re:Lotus Notes was like this too by tigersha · 2015-03-21 07:34 · Score: 1

Hah. Slashdot breaks too! It is the Icelandic 'thorn' character http://en.wikipedia.org/wiki/T...

--
The dangers of excessive individualism are nothing compared to the oppressiveness of excessive collectivism

Might not be unicode ... by perpenso · 2015-03-21 07:44 · Score: 1

It might not be unicode. I once had a bug because I assumed a particular MacOSX/iOS API call was returning UTF8. It was actually returning old-school MacRoman by default. Worked for some locales, caused a crash on others.

Re:Might not be unicode ... by gnasher719 · 2015-03-21 12:20 · Score: 1

I'd be curious to know which iOS call would return MacRoman.
Re:Might not be unicode ... by perpenso · 2015-03-21 12:54 · Score: 1

It was years ago (2010'ish). I was getting iOS to localize currency amounts and dates. Testing was done in English, French and German and things seemed fine -- yeah my bad for using such similar languages. The crash occurred with a Scandinavian user, I don't recall the particular language. The fix was simple, I believe I merely had to specify that I wanted UTF8 rather than the default.

I've changed version control systems since then so I don't have the check-in history handy.
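
The kind of mismatch described above can be reproduced in a few lines: text encoded as MacRoman but read as UTF-8. This sketch only models the encoding mismatch with Python's `mac_roman` codec; it does not reproduce any iOS API, and the sample words and helper name are assumptions.

```python
def decodes_as_utf8(raw):
    """True if raw happens to be valid UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# Pure ASCII text is byte-identical in MacRoman and UTF-8, so tests pass.
english = "Saturday".encode("mac_roman")

# A word with a non-ASCII letter is not.
norwegian = "L\u00f8rdag".encode("mac_roman")

print(decodes_as_utf8(english))
print(decodes_as_utf8(norwegian))
print(norwegian.decode("mac_roman"))
```

Test data that stays within ASCII cannot reveal the wrong assumption, which fits the report that the failure only showed up for a user in another locale.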

Re:This by Anonymous Coward · 2015-03-21 08:00 · Score: 0

Yeah, computers should only support good old-fashioned US-ASCII, there's no way any data using those characters could possibly cause anything to break.

So by Anonymous Coward · 2015-03-21 08:01 · Score: 0

How long do you think it's going to take for said characters to be posted (inadvertently, of course) in a comment on this post?

Re:So by EmeraldBot · 2015-03-21 08:21 · Score: 1

How long do you think it's going to take for said characters to be posted (inadvertently, of course) in a comment on this post?
Since Slashdot doesn't actually support Unicode, they wouldn't come in at all. They'd just disappear. Soviet Russia style.

--
"Set a man a fire, he'll be warm for the rest of the night. Set a man afire, he'll be warm for the rest of his life."

Re:Type "bush hid the facts" into Notepad. by rudy_wayne · 2015-03-21 08:01 · Score: 5, Informative

http://blogs.msdn.com/b/oldnew...

About every ten months, somebody new discovers the Notepad file encoding problem. Let's see what else there is to say about it.

First of all, can we change Notepad's detection algorithm? The problem is that there are a lot of different text files out there. Let's look just at the ones that Notepad supports.

8-bit ANSI (of which 7-bit ASCII is a subset). These have no BOM; they just dive right in with bytes of text. They are also probably the most common type of text file.
UTF-8. These usually begin with a BOM but not always.
Unicode big-endian (UTF-16BE). These usually begin with a BOM but not always.
Unicode little-endian (UTF-16LE). These usually begin with a BOM but not always.

If a BOM is found, then life is easy, since the BOM tells you what encoding the file uses. The problem is when there is no BOM. Now you have to guess, and when you guess, you can guess wrong. For example, consider this file:

D0 AE

Depending on which encoding you assume, you get very different results.

If you assume 8-bit ANSI (with code page 1252), then the file consists of the two characters U+00D0 U+00AE, or "Ð®". Sure this looks strange, but maybe it's part of the word VATNIÐ® which might be the name of an Icelandic hotel.
If you assume UTF-8, then the file consists of the single Cyrillic character U+042E, or "Ю".
If you assume Unicode big-endian, then the file consists of the Korean Hangul syllable U+D0AE, or "킮".
If you assume Unicode little-endian, then the file consists of the Korean Hangul syllable U+AED0, or "껐".
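The four readings above are easy to reproduce. A minimal Python sketch (the variable names are mine, not from the quoted post) that decodes the same two bytes under each assumption:

```python
data = bytes([0xD0, 0xAE])

# Same two bytes, four candidate encodings.
for encoding in ("cp1252", "utf-8", "utf-16-be", "utf-16-le"):
    text = data.decode(encoding)
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(f"{encoding:10} {code_points}")
```

All four decodes succeed without error, which is exactly why a BOM-less file forces the program to guess.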

Some people might say that the rule should be "All files without a BOM are 8-bit ANSI." In that case, you're going to misinterpret all the files that use UTF-8 or UTF-16 and don't have a BOM. Note that the Unicode standard even advises against using a BOM for UTF-8, so you're already throwing out everybody who follows the recommendation.

Okay, given that the Unicode folks recommend against using a BOM for UTF-8, maybe your rule is "All files without a BOM are UTF-8." Well, that messes up all 8-bit ANSI files that use characters above 127.

Maybe you're willing to accept that ambiguity, and use the rule, "If the file looks like valid UTF-8, then use UTF-8; otherwise use 8-bit ANSI, but under no circumstances should you treat the file as UTF-16LE or UTF-16BE." In other words, "never auto-detect UTF-16". First, you still have ambiguous cases, like the file above, which could be either 8-bit ANSI or UTF-8. And second, you are going to be flat-out wrong when you run into a Unicode file that lacks a BOM, since you're going to misinterpret it as either UTF-8 or (more likely) 8-bit ANSI. You might decide that programs that generate UTF-16 files without a BOM are broken, but that doesn't mean that they don't exist. For example,

cmd /u /c dir >results.txt

This generates a UTF-16LE file without a BOM. If you poke around your Windows directory, you'll probably find other Unicode files without a BOM. (For example, I found COM+.log.) These files still "worked" under the old IsTextUnicode algorithm, but now they are unreadable. Maybe you consider that an acceptable loss.

The point is that no matter how you decide to resolve the ambiguity, somebody will win and somebody else will lose. And then people can start experimenting with the "losers" to find one that makes your algorithm look stupid for choosing "incorrectly".

Good news! by Anubis+IV · 2015-03-21 08:03 · Score: 2

In related news, we don't need to worry about this bug being used by unscrupulous sorts of folks in the comments here. The one and only time a lack of unicode support has come in useful...

TFS by Anonymous Coward · 2015-03-21 08:15 · Score: 0

Google are?

Re:TFS by mister_playboy · 2015-03-21 11:42 · Score: 1

That's correct usage in British English, AC. Welcome to the Internet.

--
Do what thou wilt shall be the whole of the Law ::: Love is the law, love under will

Re:LANG=C, baby, LANG=C!!!! by EmeraldBot · 2015-03-21 08:15 · Score: 1

I've had a delightful time explaining to my trainees that *EVERY SERVER SHOULD ONLY BE RUN IN A LANG=C ENVIRONMENT*. Unicode is *bad*, *bad*, *bad* for systems work of any sort.

And in a related XKCD post:

https://xkcd.com/327/

That works, until your servers have to process any kind of foreign characters whatsoever. This is a fault that only affects OS X, only when using Google Chrome. It's not (to my knowledge) a weakness of Unicode.

--
"Set a man a fire, he'll be warm for the rest of the night. Set a man afire, he'll be warm for the rest of his life."

Re:Type "bush hid the facts" into Notepad. by Pinky's+Brain · 2015-03-21 08:18 · Score: 4, Funny

My conclusion is that the unicode guys are assholes.

Re:LANG=C, baby, LANG=C!!!! by l0ungeb0y · 2015-03-21 08:27 · Score: 1

Yeah, well, it's not too hard to escape from unicode hell...

And use what instead? by Anonymous Coward · 2015-03-21 08:50 · Score: 0

And use what instead? Firefox, the browser with a UI just as fucking bad as Chrome's, but that's also much slower and so much more bloated than Chrome is? Or Safari, which is basically equivalent to Chrome, but a year or two outdated? Or Opera, the new version of which is literally Chrome, and the old version which is getting very outdated these days? Or IE, which doesn't even run on OS X? Don't even waste my time with Vivaldi, or Pale Moon, or any of those other half-assed attempts at a modern browser.

Look, Chrome is the best we have on OS X, or any other platform for that matter. Its UI is rubbish, but at least it's a fast, sleek browser, unlike so many of its competitors. I hate Chrome, but the alternatives are so much worse, or not even available on OS X!

Of course we'd have options if Opera hadn't killed their good browser and replaced it with a steaming pile of monkey shit. We'd also have options if the Firefox devs were more concerned with creating a good browser than with crucifying their former CEO because he dared hold an opinion about gay marriage that differed from theirs. But that's not how reality is. So we'll continue to use Chrome until some other browser vendor gets its shit together and releases a better browser.

Re:And use what instead? by Anonymous Coward · 2015-03-21 13:28 · Score: 0

Well, I meant Spartan on Windows 10 but I didn't want to look like a troll ;)

so does imagur by rs79 · 2015-03-21 09:01 · Score: 1

mtbf - 15 mins.

--
Need Mercedes parts ?

Sounds occult by tanimislam · 2015-03-21 09:06 · Score: 1

hmm, ancient and dead language from the time of reported magic. Just typing the words will crash your Mac. Imagine if one spoke them!

Re:Sounds occult by Anonymous Coward · 2015-03-22 05:15 · Score: 0

and, 13 letters :)

Since Snowcrash ... by angel'o'sphere · 2015-03-21 09:33 · Score: 1

... we know that Assyrian or more precisely Sumerian is tricky.

--
Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.

Didn't work by Anonymous Coward · 2015-03-21 09:43 · Score: 0

Just tried it in Chrome on OS X. Out of date article??

Who built the pyramids? by Anonymous Coward · 2015-03-21 10:05 · Score: 0

The news header speaks of Assyrian script, but Slashdot provides an Egyptian scarab bug icon to accompany it.

Re:Type "bush hid the facts" into Notepad. by Anonymous Coward · 2015-03-21 10:10 · Score: 0

hmm. unicode is fine, utf-8 is fine. only windows uses boms. so who's the asshole?

Re:Type "bush hid the facts" into Notepad. by AmiMoJo · 2015-03-21 10:26 · Score: 2

Unicode made three big mistakes.

1. Attempting to be backwards compatible with a subset of ASCII. A subset that breaks all the common encodings used outside the US.

2. Multiple encodings (8, 16 and 32 bit). Pick one, stick to it, don't make us guess with stupid BOMs etc.

3. CJK unification. Trying to merge three distinct languages in a way that makes it impossible to mix them in a pure Unicode document.

So yeah, those guys are assholes.

--
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC

Re:Type "bush hid the facts" into Notepad. by Anonymous Coward · 2015-03-21 11:26 · Score: 0

As for point 1: UTF-8 is backward compatible with the full ASCII set; the full ASCII set only contains 128 code points. The extensions for latin-* are beyond ASCII.

I agree the UTF-16 encodings were a mistake, the whole thing with the encoding of extended planes. Maybe they should even drop UTF-32 as an encoding, UTF-8 can encode any character anyway.

Re:LANG=C, baby, LANG=C!!!! by Half-pint+HAL · 2015-03-21 11:35 · Score: 1

Yeah, because people who speak funny foreign languages don't deserve to use our linguistically pure English-speaking servers, right?

--
Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'

Re:Type "bush hid the facts" into Notepad. by Alain+Williams · 2015-03-21 12:34 · Score: 2

Unicode and how it is represented in a file are two different things. Unicode is a good idea, it solves many problems and contains all the (to me) strange characters used by: Greeks, Chinese, etc.

How to represent it in a file is different. UTF-8 is the obvious answer today, but other encodings were tried by different organisations first. The big win of UTF-8 is that you can have characters from very different regions on the same web page (or in the same file) - something that you cannot do if you adopt a purely 8 bit code like iso-8859-1.

We are still in transition: there are files encoded in various ways out there; however I think that UTF-8 will eventually become the encoding mechanism that everyone uses - so files encoded in other ways will become increasingly rare. So: a bit of patience please.

Re:Type "bush hid the facts" into Notepad. by Anonymous Coward · 2015-03-21 13:30 · Score: 0

hmm. unicode is fine, utf-8 is fine. only windows uses boms. so who's the asshole?

The byte order mark is part of the unicode standard, and is used all over the place besides windows. Your question answers itself.

Re:Type "bush hid the facts" into Notepad. by jrumney · 2015-03-21 14:46 · Score: 2

For UTF-16. "Only Windows uses BOMs" is pretty much correct for UTF-8, where the Unicode standard discourages it.

Re:Type "bush hid the facts" into Notepad. by jrumney · 2015-03-21 14:59 · Score: 1

4. Inconsistent policy for character inclusion. After years of opposing addition of symbols commonly used in typesetting or web pages (such as a common symbol for indicating external links consisting of a box with a curved arrow coming out of it) on the basis that they are "not plain text and best represented by graphic images", we get emoji added. And they still won't add many of these symbols they've opposed in the past (they recently added the standard triangular recycling mark, but this was long after the emoji were added with several circular Japanese recycling marks, clearly demonstrating the hypocrisy).

Re:Type "bush hid the facts" into Notepad. by Anonymous Coward · 2015-03-21 18:29 · Score: 0

This is the same problem that killed Internet Explorer - to make things easy on the devs we allowed malformed pages. No need to follow the standard, the algorithm will try to figure out what you mean and try to do the right thing. End result: How many people use IE these days? How many devs want to code on that platform?

Heuristic approaches to solve crappy interpretations of standards do nothing good for the standard - eventually they muddy the standard to the point it becomes exploitable and utterly useless. In other words stop catering to the stupid people. Wrong should be wrong - end of. Stop being clever about it.

Spell by Anonymous Coward · 2015-03-21 19:11 · Score: 0

I am of Assyrian origin and I can assure you that these letters form a strong black spell which could crash wizards' books, and it seems to have similar effects on today's computers.

Re:Type "bush hid the facts" into Notepad. by disambiguated · 2015-03-21 19:48 · Score: 1

I agree overall with your comment, but I think UTF-8's backwards compatibility with ASCII was genius and is the reason we have as much Unicode support as we do today. I consider UTF-8 to be one of the best hacks of all time. Without it, the software that existed at the time would have had to be thrown out or re-written. The fact that software can (often) process UTF-8 without even being aware that it isn't ASCII was exactly what was needed to get Unicode off the ground. UTF-8 allowed Unicode to be adopted incrementally (especially by Unixes, which were much slower to adopt any (universal) international character set than Windows was).
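A small Python sketch of why byte-oriented software can often handle UTF-8 unawares (the sample record is made up for illustration): ASCII text has the same bytes in both encodings, and no byte of a multi-byte UTF-8 sequence can be mistaken for an ASCII delimiter.

```python
# An ASCII-only string encodes to the same bytes either way.
assert "plain text".encode("ascii") == "plain text".encode("utf-8")

# Hypothetical record: fields separated by an ASCII comma.
record = "Jürgen,Zürich,日本".encode("utf-8")

# Code that knows nothing about UTF-8 can still split on the comma,
# because every byte of a multi-byte sequence is >= 0x80.
fields = record.split(b",")
print([field.decode("utf-8") for field in fields])
```

The same trick does not work with UTF-16, where half of the bytes of ASCII text are zero and a comma byte can appear inside an unrelated character.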

Sadly, not everyone is as brilliant as Ken Thompson, so the UTF-8 encoding didn't exist when Unicode and ISO 10646 were first created. If someone had thought of it just a few years earlier we probably would have used that for nearly everything, and your second point would be irrelevant.

But by the time Unicode was even a thing, a lot of the software industry was already invested in ISO 10646, specifically UCS-2 (notably Microsoft and IBM, but plenty of others) so unless you think excluding IBM and Microsoft (in 1990!) would have been good for the widespread adoption of Unicode, the designers had no choice but to have multiple encodings.

Ironically, Linux and Apple were able to choose the (arguably much better) UTF-8 encoding only because they got serious about adopting an international character set several years later than Microsoft and IBM did (call it second mover advantage.)

So I couldn't call those mistakes. More like "historical accidents", just like most other bad designs we have to live with.

Your third point is just a face-palm, I agree.

Re:Type "bush hid the facts" into Notepad. by AmiMoJo · 2015-03-21 23:52 · Score: 1

The problem is that ASCII is only useful for US English. Other forms of English need symbols like the pound (£) sign. Other Latin derived languages need accented characters. Non-Latin languages already use some subset of ASCII plus extensions. Any software that has to support more than just 7-bit US ASCII and UTF-8 has to guess, and usually gets it wrong.

--
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC

Didn't Steve Jobs have Assyrian Heritage? by nucrash · 2015-03-22 00:23 · Score: 1

I know, Syrian, but still. I always knew he was going to be the death of Apple.

--
Place something witty here

Re:Type "bush hid the facts" into Notepad. by nogginthenog · 2015-03-22 01:07 · Score: 1

The big downside of UTF-8 is using it as an in-memory string. To find the nth character you have to start at the beginning of the string.

C# and Java use UTF16 internally for strings.

Re:Type "bush hid the facts" into Notepad. by Alain+Williams · 2015-03-22 01:19 · Score: 1

I agree completely. There is no reason that a program cannot read UTF-8 and store as UTF-32 internally. There is a trade-off between time and memory. Note that UTF-16 is also a variable length encoding scheme so you still need to start at the start of string to find the nth character.

Re:Type "bush hid the facts" into Notepad. by Anonymous Coward · 2015-03-22 03:29 · Score: 0

UTF-16 has the exact same problem, not every codepoint fits in the original UCS2 encoding so they added surrogate pairs. Only UTF-32/UCS4 escapes this issue but you still have to count from the start because what a human calls a on-screen character can be composed of several codepoints.
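To make that concrete, a short Python sketch (helper names are mine) counting code units for a code point outside the BMP and for a letter built from a base character plus a combining mark:

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

def utf32_code_units(text: str) -> int:
    """Number of 32-bit units needed to store text as UTF-32."""
    return len(text.encode("utf-32-le")) // 4

emoji = "\U0001F600"   # one code point outside the BMP
e_acute = "e\u0301"    # 'e' + COMBINING ACUTE ACCENT, one visible character

print(utf16_code_units(emoji))    # needs a surrogate pair in UTF-16
print(utf32_code_units(emoji))    # a single unit in UTF-32
print(utf32_code_units(e_acute))  # still more than one unit, even in UTF-32
```

So fixed-width UTF-32 gives constant-time access to code points, not to what a reader would call characters.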

Re:Type "bush hid the facts" into Notepad. by Anonymous Coward · 2015-03-22 05:51 · Score: 0

The point is that no matter how you decide to resolve the ambiguity, somebody will win and somebody else will lose. And then people can start experimenting with the "losers" to find one that makes your algorithm look stupid for choosing "incorrectly".

Or, you know, we could all accept that Notepad was created originally for 7-bit ASCII (with quasi 8-bit ANSI support) and that either a specific override or a BOM should be required to get different behavior. Because the only reason 'your algorithm look stupid for choosing "incorrectly"' is when you try to create a "smart" algorithm and it's made to look "dumb". Meanwhile, if you choose a "dumb" algorithm, you'll see the backlash against the "smart" people who think they're clever.

Re:Type "bush hid the facts" into Notepad. by Hognoxious · 2015-03-22 07:04 · Score: 1

Unicode made one enormous mistake - existing in the first place.

If plain ascii was good enough for Virgil, Newton & Shakespeare it's good enough for you.

--
Confucius say, "Find worm in apple - bad. Find half a worm - worse."

Re:Type "bush hid the facts" into Notepad. by spitzak · 2015-03-22 11:37 · Score: 1

Maybe you're willing to accept that ambiguity, and use the rule, "If the file looks like valid UTF-8, then use UTF-8; otherwise use

Yay! You actually got the answer partially correct. However you then badly stumble when you follow this up with:

8-bit ANSI, but under no circumstances UTF-16

The correct answer is "after knowing it is not UTF-8, use your complicated and error-prone encoding detectors".

The problem is a whole lot of stupid code, in particular from Windows programmers, basically tries all kinds of matching against various legacy encodings and UTF-16, and only tries UTF-8 if all of those return false. This is why Unicode support still sucks everywhere.

You try UTF-8 FIRST. This is for two reasons: first because UTF-8 is really popular and thus likely the correct solution (especially if you count all ASCII files as UTF-8, which they are). But the second is that a random byte stream is INCREDIBLY unlikely to be valid UTF-8 (like 2.6% chance for a two-byte file, and geometrically lower for any longer ones), this means your decision of "is this UTF-8" is very very likely to be correct. Just moving this really reliable test to be the first one will improve your detection enormously.
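The two-byte case is small enough to check by brute force. The Python sketch below relies on Python's strict UTF-8 decoder and counts how many of the 65,536 possible byte pairs decode cleanly. By my count, pairs that are valid and contain at least one non-ASCII byte come out just under 3%, the same ballpark as the figure above; the exact percentage depends on whether pure-ASCII pairs are counted.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

total = 256 * 256
valid = 0            # every pair that decodes as UTF-8
valid_non_ascii = 0  # valid pairs that contain a byte >= 0x80

for first in range(256):
    for second in range(256):
        if is_valid_utf8(bytes([first, second])):
            valid += 1
            if first >= 0x80 or second >= 0x80:
                valid_non_ascii += 1

print(f"valid UTF-8:         {valid}/{total} = {valid / total:.1%}")
print(f"valid and not ASCII: {valid_non_ascii}/{total} = {valid_non_ascii / total:.1%}")
```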

The biggest help would be to check for UTF-8 first, not last. This would fix "Bush hid the facts" because it would be identified as UTF-8. But a variation on that bug would still exist if you stuck a non-ASCII byte in there, in which case it would still be useful (but much much less important) to not do stupid things in the detector; for instance, requiring UTF-16 to either start with a BOM or to have at least one word with either the high or low byte all zero would be a good idea and indicate you are not an idiot.
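A rough Python sketch of that ordering. The function name, the cp1252 fallback and the NUL-byte check are my assumptions, not anything Notepad does: the NUL check is there because BOM-less UTF-16 ASCII text is technically valid UTF-8, and UTF-32 BOMs are ignored to keep it short.

```python
import codecs

def guess_encoding(data: bytes) -> str:
    """Guess a text encoding, trying UTF-8 before any legacy heuristics."""
    # 1. A BOM settles it.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # 2. Strict UTF-8 first; pure ASCII passes here too.
    #    (Assumption: text files do not contain NUL bytes.)
    if b"\x00" not in data:
        try:
            data.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            pass
    # 3. BOM-less UTF-16 only if some 16-bit unit has a zero byte.
    #    Simplification: a zero in the odd position is read as little-endian.
    if len(data) >= 2 and len(data) % 2 == 0:
        if 0 in data[1::2]:
            return "utf-16-le"
        if 0 in data[0::2]:
            return "utf-16-be"
    # 4. Fall back to a legacy 8-bit code page (assumed cp1252 here).
    return "cp1252"

print(guess_encoding(b"Bush hid the facts"))
print(guess_encoding("dir".encode("utf-16-le")))
print(guess_encoding("café".encode("cp1252")))
```

Step 4 is where the complicated, error-prone legacy detectors would go in a real implementation; the point is only that they run after the cheap and reliable UTF-8 test.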

Re: novice programmer alert! by spitzak · 2015-03-22 11:42 · Score: 1

The big downside of UTF-8 is using it as an in-memory string. To find the nth character you have to start at the beginning of the string.

And this is important, why? Can you come up with an example where you actually produce "n" by doing anything other than looking at the n-1 characters before it in the string? No, and therefore an offset in bytes can be used just as easily.

C# and Java use UTF16 internally for strings.

And you are aware that UTF-16 is variable-length as well, and therefore you can't "find the nth character" quickly either?

You might want to retake compsci 101.

Re:Type "bush hid the facts" into Notepad. by thunderclap · 2015-03-22 11:53 · Score: 1

Same thing happens when you type Bill fed the goats. It's a Unicode error in Notepad for XP. You want something fun? Type that into Chrome for a Mac in an Apple store. That's fun.

Re:Type "bush hid the facts" into Notepad. by thunderclap · 2015-03-22 11:54 · Score: 1

Since it deleted the word here is an image of it http://2.bp.blogspot.com/-_TfD...

Re:Type "bush hid the facts" into Notepad. by spitzak · 2015-03-22 12:13 · Score: 1

Actually Plan 9 and UTF-8 encoding existed well before Microsoft started adding Unicode to Windows.

The reason for 16-bit Unicode was political correctness. It was considered wrong that Americans got the "better" shorter 1-byte encodings for their letters, therefore any solution that did not punish those evil Americans by making them rewrite their software was not going to be accepted. No programmer at that time (including ones that did not speak English) would ever argue for using anything other than a variable-length byte encoding for a system that still had to deal with existing software and data that was ASCII, this was a command from people who did not have to write and maintain the software.

The programmers, who knew damn well that variable-length was the correct solution, were unfortunately not bright enough to avoid making mistakes in their encodings (such as not making them self-synchronizing). UTF-8 fixed that, but these errors also led some of the less-knowledgeable to think there was a problem with variable length.
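What "self-synchronizing" buys you, as a minimal Python sketch (the helper name and sample text are mine): from any byte offset in UTF-8 you can find the start of the enclosing character by stepping back over continuation bytes, without scanning from the beginning.

```python
def char_start(data: bytes, offset: int) -> int:
    """Return the offset of the first byte of the character containing data[offset]."""
    # Continuation bytes look like 0b10xxxxxx; lead bytes and ASCII never do.
    while offset > 0 and (data[offset] & 0xC0) == 0x80:
        offset -= 1
    return offset

text = "naïve 日本語".encode("utf-8")

# Offset 11 lands in the middle of a three-byte character.
start = char_start(text, 11)
print(start, text[start:start + 3].decode("utf-8"))
```

Older multi-byte encodings where a trail byte can look like a lead byte or like ASCII do not allow this, which is the mistake the paragraph above refers to.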

Unfortunately political correctness at Microsoft won, despite the fact that they had already added variable-length encoding support to Windows. It may also have been seen as a way to force incompatibility with NFS and other networked data so that Microsoft-only servers could be used.

One of the few good things to come out of the "Unix wars" was that commercial Unix development was stopped before the blight of 16-bit characters was introduced (it was well on its way and would have appeared at the same time Microsoft did it). Non-commercial Unix made the incredibly easy decision to ignore "wide characters".

The biggest problem now is that Windows convinced a lot of people who should know better that you need to use UTF-16 to open files by name (all that is really needed is to convert UTF-8 just before the API is called). This led UTF-16 to infect Python, Qt, Java, and a lot of other software and cause problems and headaches and bugs even on Linux. There is some hope that they are starting to realize they made a terrible mistake; Python in particular seems to be backing out by storing a UTF-8 version of the string alongside the UTF-32.

Re:Type "bush hid the facts" into Notepad. by AmiMoJo · 2015-03-23 02:13 · Score: 1

Unicode is a good idea, it solves many problems and contains all the (to me) strange characters used by: Greeks, Chinese, etc.

That's one of its biggest problems: it doesn't support all the characters in Chinese. In fact it doesn't really support any of them, because they tried to merge them with Japanese and Korean characters. The result is that Unicode contains a sort of amalgamation that can be used to approximate any of those three languages, but not represent them properly.

I listen to both Japanese and Chinese music. Unicode is broken for me. There is no way to tell if a character is a Chinese or a Japanese one. The character has the same Unicode code for both languages. The software is supposed to somehow magically know which language is in use and select a Japanese or Chinese font. When you have file names or metadata tags there is no simple way of determining language, you just have to guess. Humans are pretty good at guessing, machines not so much.
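A tiny Python illustration of the point, using U+76F4, an often-cited ideograph that Chinese and Japanese fonts commonly draw differently. The two tag strings are made-up examples.

```python
import unicodedata

ch = "\u76f4"  # one unified Han code point
print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# The same code point appears in both strings; nothing in the text itself
# says which language (and therefore which font) was intended.
japanese_tag = "正直"  # hypothetical Japanese metadata tag
chinese_tag = "直接"   # hypothetical Chinese metadata tag
print(ch in japanese_tag, ch in chinese_tag)
```

Any language information has to come from outside the string, for example markup or per-file metadata, which plain file names and tags usually do not carry.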

That problem has nothing to do with encoding, it's to do with the standard body trying to merge characters from different languages that shouldn't be merged.

--
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC

119 comments