Unicode Consortium Releases Unicode 8.0.0
An anonymous reader writes: The newest version of the Unicode standard adds 7,716 new characters to the existing 21,499 – that's more than 35% growth! Most of them are Chinese, Japanese and Korean ideographs, but among those changes Unicode adds support for new languages like Ik, used in Uganda.
That slashdot didn't support unicode
Beta creep sucks :(
I'm kind of sick of all of this nonsense. If you have something to say that isn't in ASCII, use a GIF or MP3. Even ASCII has a lot of garbage in it like @.
“Common sense is not so common.” — Voltaire
Adding a bunch of useless characters, especially from computer-illiterate regions... Most languages can be written with English characters (i.e. plain Latin). That would make things a lot simpler.
CJK in Unicode really kills me. I once had to write an application that generated PDF documents with both Japanese and Chinese text. When you do this with, say, English and Russian, you just need to pick a font set that covers both alphabets and basta. Not Chinese/Japanese. There are a number of glyphs that share a common historic root in these languages, and the Unicode folks decided to consecrate this historical relationship by recycling the character codes between the languages. Yet, the glyphs are substantially different when rendered. So you don't know what the glyph really represents until you know what font set is being applied to the string.
What I ended up doing was processing each character individually and using a "look around" algorithm that would try to find clues in the context as to what language the glyph was in and render it with the right font. It never worked very well, but it worked well enough that the client decided not to refactor the controller that was generating the mixed language strings.
But I learned two valuable lessons that day: Unicode isn't that great after all and stay away from CJK contracts.
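The "look around" approach described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the poster's actual code: the function names, the window size, the fallback to Chinese, and the simplified code point ranges are all hypothetical. The idea is that a Han character surrounded by kana is probably Japanese, one surrounded by Hangul is probably Korean, and anything else falls back to a default.

```python
def script_of(ch):
    """Classify one character by rough Unicode block (simplified ranges)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x30FF:   # Hiragana and Katakana
        return "kana"
    if 0xAC00 <= cp <= 0xD7A3:   # precomposed Hangul syllables
        return "hangul"
    if 0x4E00 <= cp <= 0x9FFF:   # CJK Unified Ideographs, main block only
        return "han"
    return "other"


def guess_language(text, index, window=8, default="zh"):
    """Guess the language of text[index] from the characters around it.

    Assumption: kana nearby means Japanese, Hangul nearby means Korean,
    otherwise fall back to `default` (Chinese here, chosen arbitrarily).
    """
    lo = max(0, index - window)
    hi = min(len(text), index + window + 1)
    nearby = {script_of(c) for c in text[lo:hi]}
    if "kana" in nearby:
        return "ja"
    if "hangul" in nearby:
        return "ko"
    return default


def tag_han_characters(text, window=8):
    """Return (character, guessed language) for every Han character in text."""
    return [
        (ch, guess_language(text, i, window))
        for i, ch in enumerate(text)
        if script_of(ch) == "han"
    ]
```

A heuristic like this fails in exactly the way the comment complains about: a Japanese string written entirely in kanji has no kana to give it away, so it gets the default font.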
According to "http://babelstone.blogspot.com.au/2005/11/how-many-unicode-characters-are-there.html", the last version has 113021 encoded characters.
Is Unicode supposed to separate characters that look the same but are semantically different?
Looks like the answer is yes...
'LATIN CAPITAL LETTER A' (U+0041)
'GREEK CAPITAL LETTER ALPHA' (U+0391)
Looks like the answer is no...
'RIGHT SINGLE QUOTATION MARK' (U+2019) -- this is the preferred character to use for apostrophe.
(An apostrophe and closing a quotation are two very different things.)
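The code points quoted above can be checked with Python's standard `unicodedata` module. This is a small sketch; the helper name `describe` is made up for illustration.

```python
import unicodedata


def describe(ch):
    """Format a character as 'U+XXXX NAME' using the Unicode character database."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Latin A and Greek Alpha usually render identically but are distinct code points.
latin_a = "A"
greek_alpha = "\u0391"

# The ASCII apostrophe and the right single quotation mark are also distinct,
# but U+2019 itself serves both as apostrophe and as closing quote.
ascii_apostrophe = "'"
right_quote = "\u2019"

for ch in (latin_a, greek_alpha, ascii_apostrophe, right_quote):
    print(describe(ch))
```

Compatibility normalization (NFKC) does not merge the Latin and Greek letters either, so telling such lookalikes apart or folding them together takes a separate confusables table.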
Slashdot Glyphs obscure the titles. Please fix.
Unicode now has a set for pre-Latin Hungarian runes!
Hanging out for the keyboard....
Don't be apathetic. Procrastinate!
The comment about growth is so wrong as to be mind-boggling. Where on earth did that figure come from? Unicode 1.0.1 had more than that in the early 90s. See here for a good table with all the gory details.
That slashdot didn't support unicode
You thought right, Slashdot does not support unicode, this story is just news for nerds that is reported by accident, as stuff that matters for G[r]eeks only!
note: i now continue my comment with a very interesting paragraph, but it is in Greek, so you can not read it, not even if you want to translate it:
Antisthenes: "Wisdom begins by examining the words/names." - excuse my English, i am (slightly...) better with my Greek!
Unicode adds support for new languages like Ik, used in Uganda
Now Uganda needs computers to see what Unicode looks like.
Slashdot, fix the reply notifications... You won't get away with it...
There were already 113K characters in Unicode version 7.0. Which is more than 2^16 characters, so remember:
perl -e 'fork||print for split//,"hahahaha"'
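On the "more than 2^16 characters" point: a practical consequence is that one character no longer fits in a single 16-bit unit, so UTF-16 has to use surrogate pairs for anything above U+FFFF. A minimal sketch (the helper name `utf16_units` is hypothetical):

```python
def utf16_units(ch):
    """Number of 16-bit code units needed to encode one character in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2


bmp_char = "\u4e2d"            # inside the Basic Multilingual Plane
astral_char = chr(0x10000)     # first code point beyond 2**16 - 1

print(utf16_units(bmp_char), utf16_units(astral_char))
```

Code that assumes "one UTF-16 unit equals one character" therefore miscounts or splits anything outside the Basic Multilingual Plane.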
I'm seeing this problem too.
Help build the anti-software-patent wiki
You can look at the size of the required but insufficient supporting libraries to get an indication (but only an indication, mind) of the cost of unicode. It's quite high. It even has a capturing effect for English, since lots of devs believe "it is the standard" or "it is the future" or somesuch nonsense, enabling the thing by default and adding even more code to "nicen up" any and all output even for text where pure ASCII would have been sufficient. This actually reduces interoperability for reasons of "modernity".
You know, something like "smart quotes" in IRC (which strictly is against the IRC standard, since they do define the character set in use and it typically isn't utf-8). Petty? Eh, I still use ASCII-only on English-written IRC channels and I get to see the fall-out, even if you don't. I like my client, why are you throwing crap at it? Because your software thinks that's a good default, that's why.
There are many more problems with unicode, including security problems, wilfully introduced interoperability problems, problems with having too many different encodings to do the same thing, and so on, and so forth. Usually subtle and hard-to-see problems, and there really isn't a good "universal" alternative, so people keep on using this one. Because it's "universal", see? Well, no, it's not, they're still working on that bit meaning that you get to keep upgrading all your programs to use newer and ever bigger libraries supporting more complex rules regularly. It's not stable.
In short, unicode is about as universal as USB, including the built-in crappiness. That means that while it is something of an enabler, there's quite a cost attached. We do the unicode thing because it seems universal, but in practice it is far less so than it promises. And most of the time you don't really need that universality.
Counterpoint: If we had a clear marker for encoding used, you could switch encodings and thereby switch rules on the fly, and use shorter encodings for the non-latin1-languages you use the most. Of course, you couldn't mix characters from fifty scripts at will, even mix and match accents among them. But again, the ability to do that is awesome expressive power that comes at a continuous cost but no practical gain.
So it's hard to use when it promises to be a single easy-to-use solve-all. And of course that's the fault of "shitty programmers".
I say the shittiness starts with the unicode committee.
They lost any and all respectability when they let the emoji cancer in. To hell with them.
Humans developed different languages in different regions. Now we have different languages with different features and different cultural ties. While it is often possible to translate the semantics of one language into an equivalent in another language, you have more trouble doing so with pragmatics. And in addition the result does not "taste" as good as the original. It is a little bit like food. You could just consume a nutritious supplement to sustain life. However, all the culture and tastes and emotions around food would be wasted. Even as a US-American you are aware that there are different feelings and moods attached to, let's say, porridge, a steak, a burger, a donut, a beer, Chinese take-out, pizza, corn etc.
Recent studies showed that we even have different personalities depending on what language we are using. So it would be great to be only able to speak, read, and listen to one single language. And if we should agree on one: are you willing to learn Chinese?
I would have said that even English requires diacritics to support some of its loanwords.
grep -E "[éóèâêûäöñç]" /usr/share/dict/words | grep -v "[A-Z]" | wc
174 174 1720
I know bits are cheap, but...really? Font designers have to actually implement the characters - specifying hundreds of clipart characters seems kind of ridiculous. Design by committee, where no one ever says "no".
Unicode is beginning to remind me too much of CSS3, where they let the specification blow up beyond all reason - making it essentially impossible for anyone to ever have a fully compliant implementation.
Enjoy life! This is not a dress rehearsal.
Is it better at CJK ?
...even more "dominoes" to show up on my screen because the OS/applications can't render/display Unicode properly to save their lives.
... for practical purposes over 9000 languages is a bit much ...
But for historical purposes it makes sense. Shouldn't we be digitizing as much of antiquity and vanishing cultures/languages as possible? Note that it's a pretty bad time for the physical preservation of antiquities in the cradle of civilization right now.
It would be helpful to academics to have such languages in a textual format not merely an image format.
Not a bad idea - abolishing every other language in the world - Chinese, Spanish, Arabic, Russian, Hindi, Urdu, Bengali, Swahili, Portuguese and the whole bunch of them. Just have ENGLISH - that too, the US one, and nothing else!!! Let everyone, including the Brits and Canucks, have to adjust - some more than others.
Which religions gave us WW1, WW2, Vietnam, the Cold War, the Korean war, and the Opium Wars again?
While those may be the biggest recent wars, they are by no means the only wars in history. There were the Muslim conquests of everything from Spain to India b/w the 7th to 10th centuries, which obliterated Christianity, Zoroastrianism, Animism, Buddhism and Hinduism from a lot of the territories they conquered. There were the Conquistadors, who overran the Aztec, Mayan & Inca empires and replaced them w/ the Spanish inquisition. There was the Thirty Years War, fought to determine whether Central Europe should be Catholic or Lutheran dominated. And today, there is the global Muslim campaign to destroy as much as possible of non-Muslim countries and subvert them until they become Islamic - that's the underpinnings of the campaigns of al Qaeda, ISIS, Hizbullah, Muslim Brotherhood and so on. Also, if one considers Communism a 'religion', which it is except that it substitutes some imaginary friends w/ dead friends, then you have the entire Soviet Purges, the Chinese Cultural Revolution and Pol Pot's holocaust in Cambodia to add to the mix.
Thank you for ninjaing me. I often chime in about this issue when someone complains about Slashdot's lack of support for Unicode. Most of the time, after I explain the code point whitelist and the reason for it, someone complains that a blacklist of dangerous code points would work better. My usual reply is that new versions of Unicode may insert new control code points that get activated before the Slashdot admins have the chance to add them to the blacklist. And besides, many characters outside the current whitelist are far more useful for what used to be called "ASCII art" than for readable text in the English language. For example, Oriya letter ii (U+0B08) looks to English speakers more like the head of a Smurf. And ASCII Goatse and ASCII Jack Off are why Slashdot had to add a lameness filter in the first place.
But apparently, Slashdot doesn't strip bad characters on display, only on post. This post, for example, still contains a bidirectionality override.
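The whitelist-versus-blacklist argument above can be illustrated with a small sketch. Nothing here reflects Slashdot's real filter: the allowed set (printable ASCII plus newline) and the blacklist (only the older bidirectional embedding and override controls, U+202A to U+202E) are assumptions picked to show the failure mode.

```python
# Hypothetical whitelist: printable ASCII plus newline.
def is_whitelisted(ch):
    return ch == "\n" or 0x20 <= ord(ch) <= 0x7E


def filter_whitelist(text):
    """Keep only characters that are explicitly allowed."""
    return "".join(ch for ch in text if is_whitelisted(ch))


# Hypothetical blacklist written before the isolate controls existed:
# LRE, RLE, PDF, LRO, RLO.
OLD_BIDI_BLACKLIST = {chr(cp) for cp in range(0x202A, 0x202F)}


def filter_blacklist(text):
    """Drop only characters that are explicitly forbidden."""
    return "".join(ch for ch in text if ch not in OLD_BIDI_BLACKLIST)


override = "abc\u202edef"   # RIGHT-TO-LEFT OVERRIDE, known to the blacklist
isolate = "abc\u2067def"    # RIGHT-TO-LEFT ISOLATE, added to Unicode later

print(repr(filter_blacklist(override)), repr(filter_blacklist(isolate)))
print(repr(filter_whitelist(override)), repr(filter_whitelist(isolate)))
```

The blacklist lets the newer control character through until someone updates it, while the whitelist rejects it by default. Applying the same filter at display time as well as at post time would cover old posts that already contain such characters.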
While Unicode lumps Korean together with Chinese and Japanese ("CJK"), Korean has an alphabet. It is not an ideographic language like Chinese. https://en.wikipedia.org/wiki/Hangul.
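The alphabetic nature of Hangul is visible in Unicode itself: canonical decomposition (NFD) splits a precomposed Hangul syllable into its individual letters (jamo), whereas a Han ideograph has no such decomposition. A minimal sketch using the standard `unicodedata` module (the helper name `decompose` is made up):

```python
import unicodedata


def decompose(ch):
    """Return the code points of the canonical decomposition (NFD) of ch."""
    return [ord(c) for c in unicodedata.normalize("NFD", ch)]


hangul_syllable = "\ud55c"   # the syllable "han"
han_ideograph = "\u4e2d"     # a CJK unified ideograph

for cp in decompose(hangul_syllable):
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")

print(len(decompose(han_ideograph)))
```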
Sorry, why do we need multiple languages again?
Originally, to punish ancient Babylonians for trying to build a dangerously tall ziggurat. Since then, to preserve access to oral tradition.
Both of this musician's names can be represented in ASCII: "Prince Rogers Nelson" and "O(+>".
So yet another major version number and they still haven't bothered to add the many arrow (and other directional) symbols that have been missing...