Unicode Consortium Releases Unicode 8.0.0


Posted by timothy on Friday June 19, 2015 @05:39PM from the hobo-symbols-and-the-black-tongue-of-mordor dept.

An anonymous reader writes: The newest version of the Unicode standard adds 7,716 new characters to the existing 21,499 – that's more than 35% growth! Most of them are Chinese, Japanese and Korean ideographs, but among those changes Unicode adds support for new languages like Ik, used in Uganda.


Already > 65K characters by divec · 2015-06-19 18:53 · Score: 4, Informative
"...adds 7,716 new characters to the existing 21,499 – that's more than 35% growth!"
There were already 113K characters in Unicode version 7.0, which is more than 2^16 characters, so remember:
- 1. UTF-16 is *not* two bytes per character
- 2. Therefore a "character" in Java, C#, or JavaScript sometimes holds only half a Unicode character
- 3. Even a whole Unicode character may be only part of a grapheme cluster, which means that taking arbitrary substrings may not result in readable text.
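A rough illustration of the three points, sketched in Python rather than Java, C# or JavaScript. The helper names `utf16_code_units` and `code_points` are made up for this example, and the combining-mark check only hints at grapheme clusters; it is not a full segmentation algorithm.

```python
import unicodedata


def utf16_code_units(text: str) -> int:
    """Count the 16-bit code units needed to encode text as UTF-16."""
    # Each UTF-16 code unit is two bytes; "utf-16-le" adds no byte-order mark.
    return len(text.encode("utf-16-le")) // 2


def code_points(text: str) -> int:
    """Count Unicode code points (a Python str is a sequence of code points)."""
    return len(text)


emoji = "\U0001F600"   # GRINNING FACE, outside the Basic Multilingual Plane
accented = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Points 1 and 2: one code point, but two UTF-16 code units (a surrogate pair).
# A 16-bit "char" in Java, C# or JavaScript would hold only one of the two.
print(code_points(emoji), utf16_code_units(emoji))        # 1 2

# Point 3: two code points that a reader sees as a single accented letter.
print(code_points(accented), utf16_code_units(accented))  # 2 2

# The second code point is a combining mark (non-zero combining class),
# so slicing after the first code point silently drops the accent.
second_is_combining = unicodedata.combining(accented[1]) != 0
first_half = accented[:1]
print(second_is_combining, first_half)                    # True e
```

So the number of "characters" depends on whether you count code units, code points, or what the reader actually sees.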
--
perl -e 'fork||print for split//,"hahahaha"'
Re:CJK is Unicode's big failing by gustygolf · 2015-06-19 20:45 · Score: 4, Informative

In short:
To render text properly in Japanese, you need a Japanese font. To render text properly in Chinese, you need a Chinese font. It's not just because of character coverage, but because of a thing called Han unification the consortium did.
The Unicode consortium decided to map similar characters to the same code-point. Personally, I'm not particularly bothered by this, but it leads to the technical problem that each text must be supplied with a language tag to select a correct font.
And this is problematic when there are two CJK languages mixed in the same document -- in the GP's case, Chinese and Japanese -- or when a program must automatically decide which font to render things in.
Take a web browser for example. It reaches a random Chinese web page, encoded in UTF-8. The page's author never bothered adding a language tag. Now the web browser must guess whether to render the page in a Chinese font or a Japanese one. And a "guess" is really all that it can do.
(Typically, software bases its guesses on the user's locale. It's pretty accurate -- Chinese users tend to view Chinese documents, Japanese users Japanese ones. But the problems start when someone tries viewing a 'foreign' document...)
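The guessing described above could be sketched roughly like this. It is a simplified illustration, not how any particular browser works: the font names, the `FONT_BY_LANGUAGE` table and the `pick_cjk_font` function are all hypothetical, and it ignores further distinctions such as Simplified versus Traditional Chinese.

```python
from typing import Optional

# Hypothetical font names, one per language that shares unified Han code points.
FONT_BY_LANGUAGE = {
    "zh": "ExampleSans-Chinese",
    "ja": "ExampleSans-Japanese",
    "ko": "ExampleSans-Korean",
}
DEFAULT_FONT = "ExampleSans-Fallback"


def primary_subtag(tag: Optional[str]) -> Optional[str]:
    """Reduce 'zh-Hant-TW' or 'ja_JP.UTF-8' to its primary language ('zh', 'ja')."""
    if not tag:
        return None
    return tag.replace("_", "-").split("-")[0].lower()


def pick_cjk_font(document_lang: Optional[str], user_locale: Optional[str]) -> str:
    """Prefer the document's own language tag; otherwise guess from the locale."""
    for tag in (document_lang, user_locale):
        lang = primary_subtag(tag)
        if lang in FONT_BY_LANGUAGE:
            return FONT_BY_LANGUAGE[lang]
    return DEFAULT_FONT


# A tagged Chinese page renders correctly even for a Japanese user.
print(pick_cjk_font("zh-CN", "ja_JP.UTF-8"))  # ExampleSans-Chinese

# The same page without a tag gets the Japanese font: the wrong guess.
print(pick_cjk_font(None, "ja_JP.UTF-8"))     # ExampleSans-Japanese
```

The second call is the 'foreign' document case: the bytes on the wire are identical, so without a language tag the locale is the only thing left to go on.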
It's really quite ironic that the consortium decided on codepoint unification for the three languages that would most benefit from Unicode.

--
"Slow Down Cowboy! It's been 58 minutes since you last successfully posted a comment" -- slashdot, driving users away.