Unicode Consortium Releases Unicode 8.0.0

← Back to Stories (view on slashdot.org)

Unicode Consortium Releases Unicode 8.0.0

Posted by timothy on Friday June 19, 2015 @05:39PM from the hobo-symbols-and-the-black-tongue-of-mordor dept.

An anonymous reader writes: The newest version of the Unicode standard adds 7,716 new characters to the existing 21,499 – that's more than 35% growth! Most of them are Chinese, Japan and Korean ideographs, but among those changes Unicode adds support for new languages like Ik, used in Uganda.

4 of 164 comments (clear)

Min score:

Reason:

Sort:

CJK is Unicode's big failing by Anonymous Coward · 2015-06-19 18:04 · Score: 5, Interesting

CJK in Unicode really kills me. I once had to write an appointment that generated PDF documents with both Japanese and Chinese text. When you do this with, say, English and Russian, you just need to pick a font set that covers both alphabets and basta. Not Chinese/Japanese. There are a number of glyphs that share a common historic root in these languages, and the Unicode folks decided to consecrate this historical relationship by recycling the character codes between the languages. Yet, the glyphs are substantially different when rendered. So you don't know what the glyph really represents until you know what font set is being applied to the string.
What I ended up doing was processing each character individually and using a "look around" algorithm that would try to find clues in the context as to what language the glyph was in and render it with the right font. It never worked very well, but it worked well enough that the client decided not to redactor the controller that was generating the mixed language strings.
But I learned two valuable lessons that day: Unicode isn't that great after all and stay away from CJK contracts.
Unicode is badly designed by Anonymous Coward · 2015-06-19 18:14 · Score: 2, Interesting

Is Unicode supposed to separate characters that look the same but are semantically different?
Looks like the answer is yes...
'LATIN CAPITAL LETTER A' (U+0041)
'GREEK CAPITAL LETTER ALPHA' (U+0391)
Looks like the answer is no...
'RIGHT SINGLE QUOTATION MARK' (U+2019) -- this is the preferred character to use for apostrophe.
(An apostrophe and closing a quotation are two very different things.)
Re:I'm going back to ASCII by Dutch+Gun · 2015-06-19 20:33 · Score: 4, Interesting

Don't let the door hit you in the ass on the way out to the pasture of obsolescence. The rest of us will continue to use Unicode, which, despite some flaws, such as their mess-up with Han Unification, does a pretty good job at solving the problem of language intercommunication. If anyone thinks they can do a better job (not counting reverting back to English-only ASCII), have at it.
So go ahead and use ASCII, don't type in any of those dern foreign charcturs, and pretend you're back in the happy past where we had a mess of incompatible standards, and no way to easily discern which of the many possible encodings was actually used, resulting in the scrambled text we always used to see (notice you *don't* actually see that much anymore?). And most software just ignored the rest of the non-English-speaking world anyhow because of that mess.
Personally, I'm thankful people are willing to take on largely thankless (and mind-numbing to most of us) tasks such as these.

--
Irony: Agile development has too much intertia to be abandoned now.
Re:Ithought by tlhIngan · 2015-06-19 20:46 · Score: 4, Interesting

That slashdot didn't support unicode
It does. It's actually fully Unicode-compliant. It's just on the input and recently (as of a couple of years ago) the output side passes through a Unicode whitelist.
You see a Unicode codepoint is not necessarily a character.. It can be a character modifier. So you can be handling a string containing multiple codepoints, and yet on screen it only resolves to one character. Some of these include right-to-left overrides (which alter the flow of text on the screen so you can write a string and the display agent will reverse it). There are other modifiers that include flourishes, and Unicode 8 adds "skin type modifiers" as well for emoji. As in, if you display a face, the font should use a "non-human shading" (Apple chose a Simpsons-like yellow, Microsoft chose a pale zombie-ish hue). But with the addition of a skintone/diversity modifier, when combined with the emoji codepoint, can give you a variety of skin tones.
And it's also what screwed up iOS - the string you send is full of modifiers which makes it extremely hard to decide where to break the line. (Arabic is one where there are lots of modifiers because a character can appear differently based on the characters that appear before and after it).
And what does this have to do with /.? Easy - a lot of commenters abused the modifiers to screw with the website. And unless you know how to handle Unicode, it's really hard to properly reset the parser state. /. used to be able to display the screwed up the comments - if you Google for the oddball string n"5:erocS" it would show it (because Google ignores modifiers). If you wonder, that's the string "Score:5" as commenters use to fake-moderate their posts. But since /. strips unicode on display now, you get to see the messed up post as it was typed out