Unicode Consortium Releases Unicode 8.0.0
An anonymous reader writes: The newest version of the Unicode standard adds 7,716 new characters to the existing 21,499 – that's more than 35% growth! Most of them are Chinese, Japanese and Korean ideographs, but among those changes Unicode adds support for new languages like Ik, used in Uganda.
That slashdot didn't support unicode
CJK in Unicode really kills me. I once had to write an application that generated PDF documents with both Japanese and Chinese text. When you do this with, say, English and Russian, you just need to pick a font set that covers both alphabets and basta. Not so with Chinese/Japanese. There are a number of glyphs that share a common historic root in these languages, and the Unicode folks decided to consecrate this historical relationship by recycling the character codes between the languages. Yet, the glyphs are substantially different when rendered. So you don't know what the glyph really represents until you know what font set is being applied to the string.
What I ended up doing was processing each character individually and using a "look around" algorithm that would try to find clues in the context as to what language the glyph was in and render it with the right font. It never worked very well, but it worked well enough that the client decided not to refactor the controller that was generating the mixed language strings.
But I learned two valuable lessons that day: Unicode isn't that great after all and stay away from CJK contracts.
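For anyone curious what such a "look around" pass might look like, here is a minimal sketch. It is not the poster's actual code: the function names, the window size and the rule "kana nearby means Japanese, Hangul nearby means Korean, otherwise assume Chinese" are all assumptions made for illustration.

```python
import unicodedata

def script_of(ch):
    """Very rough script classification based on the Unicode character name."""
    name = unicodedata.name(ch, "")
    if name.startswith(("HIRAGANA", "KATAKANA")):
        return "kana"
    if name.startswith("HANGUL"):
        return "hangul"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "han"
    return "other"

def guess_languages(text, window=8):
    """For each Han character, guess 'ja', 'ko' or 'zh' from its neighbours.

    Hypothetical heuristic: the window size of 8 is an arbitrary choice.
    """
    scripts = [script_of(ch) for ch in text]
    guesses = []
    for i, ch in enumerate(text):
        if scripts[i] != "han":
            continue
        nearby = scripts[max(0, i - window): i + window + 1]
        if "kana" in nearby:
            lang = "ja"      # kana only occurs in Japanese text
        elif "hangul" in nearby:
            lang = "ko"
        else:
            lang = "zh"      # fallback guess, wrong for kana-free Japanese
        guesses.append((ch, lang))
    return guesses
```

The fallback branch is where it goes wrong: a short Japanese phrase written entirely in kanji has no kana nearby and gets guessed as Chinese, which fits the "never worked very well" experience.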
Is Unicode supposed to separate characters that look the same but are semantically different?
Looks like the answer is yes...
'LATIN CAPITAL LETTER A' (U+0041)
'GREEK CAPITAL LETTER ALPHA' (U+0391)
Looks like the answer is no...
'RIGHT SINGLE QUOTATION MARK' (U+2019) -- this is the preferred character to use for apostrophe.
(An apostrophe and closing a quotation are two very different things.)
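Both cases are easy to check with Python's standard unicodedata module. This is a small sketch; the describe helper is a name made up here.

```python
import unicodedata

def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Look-alike letters with separate code points, then the two apostrophe candidates.
for ch in ["A", "\u0391", "'", "\u2019"]:
    print(describe(ch))

# The Latin and Greek capitals never compare equal, however similar they look.
print("A" == "\u0391")
```

The last line prints False, and the name of U+2019 only mentions quotation, not apostrophe, which is exactly the inconsistency pointed out above.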
Most languages can be written with English characters (i.e. plain Latin).
Name a language written with Latin characters (other than English) that does not use any special characters or diacritical marks whatsoever.
(Even English requires extensions to the Latin character set, which originally had no "U", "J", or "W".)
Il n'y a pas de Planet B.
Unicode now has a set for pre-Latin Hungarian runes!
Hanging out for the keyboard....
Don't be apathetic. Procrastinate!
That slashdot didn't support unicode
You thought right, Slashdot does not support unicode, this story is just news for nerds that is reported by accident, as stuff that matters for G[r]eeks only!
note: i now continue my comment with a very interesting paragraph, but it is in Greek, so you can not read it, not even if you want to translate it:
Antisthenes: "Wisdom begins by examining the words/names." - excuse my English, i am (slightly...) better with my Greek!
I'd be happy with a reduced set of 64 characters. ,./?;":[]\ =-+)(*&^%$#@!~
A-Z
0-9
EOF
Return/Newline (we don't use typewriters, let's use a single character)
Drop ', `, _, <, >, {, }, tab, and |
Yes, they are all in use, but fuck it.
Sorry, why do we need multiple languages again?
“Common sense is not so common.” — Voltaire
Dutch has an extra character not in the Latin character set.
"Old man yells at systemd"
what are you using @ ~ ^ # for other than Rogue/Nethack ?
ITA2 without the letter/figure shift (so 6-bit instead of 5-bit) would be fine.
A single alphabet for every computer user seems like an efficient use of resources and enables wider communication. (and an alphabet that isn't constantly expanding in size at a seemingly exponential rate). Latin alphabet as used in English seems OK, Cyrillic is better in many ways.
You mean "ij"? The unified ij character isn't used by anyone. Not sure if it's even recommended by any body.
But Dutch does have accents (één, vóór, ...). News headlines this morning:
"Verstekeling valt boven Londen uit vliegtuig na 11u lange vlucht, één overleeft"
"Grieken demonstreren ook vóór de euro"
Help build the anti-software-patent wiki
Accents in Tagalog are optional and rarely used, but they are there.
I've never seen them used on websites, but they're used in most or all dictionaries.
There were already 113K characters in Unicode version 7.0. Which is more than 2^16 characters, so remember:
perl -e 'fork||print for split//,"hahahaha"'
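On the 2^16 remark above: one practical consequence, presumably the kind of thing the poster had in mind, is that a single 16-bit unit can no longer hold every character. A rough sketch follows; the helper name is made up, and the second sample is simply a code point above U+FFFF taken from the CJK Extension E block.

```python
def utf16_units(ch):
    """Number of 16-bit code units needed to encode one character in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2

bmp_char = "\u4e2d"          # U+4E2D, fits in one 16-bit unit
astral_char = "\U0002B820"   # U+2B820, above U+FFFF, needs a surrogate pair

print(utf16_units(bmp_char))     # 1
print(utf16_units(astral_char))  # 2
print(len(astral_char))          # 1, Python 3 counts code points, not units
```

Code that assumes "one 16-bit unit is one character" miscounts or splits the second kind of character.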
I'm seeing this problem too.
Don't do it then.
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
Tagalog and Bahasa Malaysia/Indonesia aren't "most languages". :)
That being said, I should have thought of the latter myself.
You can look at the size of the required but insufficient supporting libraries to get an indication (but only an indication, mind) of the cost of unicode. It's quite high. It even has a capturing effect for English, since lots of devs believe "it is the standard" or "it is the future" or somesuch nonsense, enabling the thing by default and adding even more code to "nicen up" any and all output even for text where pure ASCII would have been sufficient. This actually reduces interoperability for reasons of "modernity".
You know, something like "smart quotes" in IRC (which strictly is against the IRC standard, since they do define the character set in use and it typically isn't utf-8). Petty? Eh, I still use ASCII-only on English-written IRC channels and I get to see the fall-out, even if you don't. I like my client, why are you throwing crap at it? Because your software thinks that's a good default, that's why.
There are many more problems with unicode, including security problems, wilfully introduced interoperability problems, problems with having too many different encodings to do the same thing, and so on, and so forth. Usually subtle and hard-to-see problems, and there really isn't a good "universal" alternative, so people keep on using this one. Because it's "universal", see? Well, no, it's not, they're still working on that bit meaning that you get to keep upgrading all your programs to use newer and ever bigger libraries supporting more complex rules regularly. It's not stable.
In short, unicode is about as universal as USB, including the built-in crappiness. That means that while it is something of an enabler, there's quite a cost attached. We do the unicode thing because it seems universal, but in practice it is far less so than it promises. And most of the time you don't really need that universality.
Counterpoint: If we had a clear marker for encoding used, you could switch encodings and thereby switch rules on the fly, and use shorter encodings for the non-latin1-languages you use the most. Of course, you couldn't mix characters from fifty scripts at will, even mix and match accents among them. But again, the ability to do that is awesome expressive power that comes at a continuous cost but no practical gain.
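One reading of the "too many different encodings to do the same thing" complaint above is that the same visible text can be stored as different code point sequences. Here is a small sketch with Python's standard unicodedata module; same_text is a made-up helper, and picking NFC as the common form is just one reasonable choice.

```python
import unicodedata

composed = "\u00e9"      # "é" as one precomposed code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(composed == decomposed)           # False: different code point sequences
print(len(composed), len(decomposed))   # 1 2
print(same_text(composed, decomposed))  # True once both are normalized
```

So a naive comparison, search or password check can fail on strings that look identical on screen, which is where some of the subtle problems come from.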
They lost any and all respectability when they let the emoji cancer in. To hell with them.
Don't let the door hit you in the ass on the way out to the pasture of obsolescence. The rest of us will continue to use Unicode, which, despite some flaws, such as their mess-up with Han Unification, does a pretty good job at solving the problem of language intercommunication. If anyone thinks they can do a better job (not counting reverting back to English-only ASCII), have at it.
So go ahead and use ASCII, don't type in any of those dern foreign charcturs, and pretend you're back in the happy past where we had a mess of incompatible standards, and no way to easily discern which of the many possible encodings was actually used, resulting in the scrambled text we always used to see (notice you *don't* actually see that much anymore?). And most software just ignored the rest of the non-English-speaking world anyhow because of that mess.
Personally, I'm thankful people are willing to take on largely thankless (and mind-numbing to most of us) tasks such as these.
Irony: Agile development has too much inertia to be abandoned now.
We're shedding languages like crazy, what with lots of small languages going extinct and all. And as sad as extinction is, for practical purposes over 9000 languages is a bit much. Yet I don't think a single language would be a good thing either. Language shapes your thought-space, meaning that some languages make it easier, sometimes outright enable, thinking thoughts that in other languages aren't very accessible or even possible. So having at least a couple sufficiently different languages is the better idea. Otherwise you foster a monoculture of thought, and we already know what such things do to crops or even "operating system ecosystems": It fosters sickliness and unhealth.
Besides, without conflict we won't improve. We actually do need a bit of disagreement now and then. With but a single language we'd be stuck to a single thought-space, making it that much harder to shed new light on old conflicts. For all of you sad sacks with fluency in only one human language, go learn at least one other, preferably from a completely different language family. See what it does for your ability to think.
Humans developed different languages in different regions. Now we have different languages with different features and different cultural ties. While it is often possible to translate the semantics of one language into an equivalent in another language, you have more trouble doing so with pragmatics. And in addition the result does not "taste" as good as the original. It is a little bit like food. You could just consume a nutritious supplement to sustain life. However, all the culture and tastes and emotions around food would be wasted. Even as a US-American you are aware that there are different feelings and moods attached to, let's say, porridge, a steak, a burger, a donut, a beer, Chinese take-out, pizza, corn etc.
Recent studies showed that we even have different personalities depending on what language we are using. So would it really be great to be able to speak, read, and listen to only one single language? And if we had to agree on one: are you willing to learn Chinese?
No, we would not do well with only one language; we would lose a lot of culture. It would be like one standard food for everyone. Furthermore, your proposition is ludicrous, as language changes all the time. That's why new street languages pop up and then evolve into something different. Language is a reflection of culture. It is not like a programming language. If you want to communicate with other people you should learn additional languages. And while you are at it, also try to learn something about their culture. Otherwise could would not be able to understand them. So even with only one language, you would not be able to understand them.
BTW: Most problems between societies are not based on religion. Religion is only used as a vehicle to transport the hostility. It is about greed, ignorance, stupidity, and frustrations.
You don't have much to spare, it seems.
I would have said that even English requires diacritics to support some of its loanwords.
grep -E "[éóèâêûäöñç]" /usr/share/dict/words | grep -v "[A-Z]" | wc
174 174 1720
This is an orthographic mistake which may happen even to native speakers. So please forgive me when I make some mistakes. Hopefully you still got the message.
And literacy...
Not if you use "OK Google" Voice search.
Slashdot, fix the reply notifications... You won't get away with it...
It's a shame Unicode has become the standard. Its numerous flaws can't be overlooked, but we seem to be stuck with it now. Maybe the consortium will eventually fix those things, especially CJK support, but so far there is little sign that they care.
It's terrible that we screwed up something so important.
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
This is exactly why it took decades and crazy hacks for people to write their own language electronically.
Thank God the virtually failed (but ultimately victorious) Plan 9 (UNIX2) came along, with idealistic developers who respected other cultures and came up with UTF-8, and that companies like IBM/Microsoft/Adobe, along with Free software, supported Unicode.
Who knows if the software/hardware/network combination you use had a line coded by a person who is from those "computer illiterate" regions?
I know bits are cheap, but... really? Font designers have to actually implement the characters - specifying hundreds of clipart characters seems kind of ridiculous. Design by committee, where no one ever says "no".
Unicode is beginning to remind me too much of CSS3, where they let the specification blow up beyond all reason - making it essentially impossible for anyone to ever have a fully compliant implementation.
Enjoy life! This is not a dress rehearsal.
Not like a programming language? Then what's with all the different numbers after C++ standards and all the different variants of BASIC?
Not too bad, only about 112765 too many.
Does anybody else read this as the Librarian?
Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
What are you doing these days? When Doves Cry was part of my childhood.
Sorry, why do we need multiple languages again?
http://www.scientificamerican....
C isn't a good example. It doesn't even do a good job of handling strings of 1-byte characters.
You're confusing languages and alphabets. Ever heard of pinyin?
Replying to undo incorrect moderation.
Conservatism: (n.) love of the existing evils. Liberalism: (n.) desire to substitute new evils for the existing ones.
Which religions gave us WW1, WW2, Vietnam, the Cold War, the Korean war, and the Opium Wars again?
Monarchism (+Communism in Russia), Fascism+Capitalism+Communism, Communism+Capitalism, Communism+Capitalism, Communism+Capitalism, Mercantilism. You forgot: all the current Middle East wars (Fascism+Capitalism), all the current African wars (Tribalism+Communism+Capitalism), and all the many single-country revolutions of the 20th century (mostly Communism, with a few Fascism and Capitalism thrown in).
You may have to adjust your expectations as to the correct form of your name.
I would be OK defining a few standards for transcription. This is different than just using UTF-8 because the undecoded form is still human readable.
ps - people pronounce my name wrong half the time but it's only 4 letters long and is a common noun in most parts of the US.
Sorry, why do we need multiple languages again?
Have you read the latest C++ spec? That's what happens when a single language does everything.
The same effect happens in people languages too.
Don't tell HR. They'll be adding "fluent in Kernowek and Ainu" to every job description.
...even more "dominoes" to show up on my screen because the OS/applications can't render/display Unicode properly to save their lives.
Which religions gave us WW1, WW2, Vietnam, the Cold War, the Korean war, and the Opium Wars again?
Monarchism (+Communism in Russia), Fascism+Capitalism+Communism, Communism+Capitalism, Communism+Capitalism, Communism+Capitalism, Mercantilism. You forgot: all the current Middle East wars (Fascism+Capitalism), all the current African wars (Tribalism+Communism+Capitalism), and all the many single-country revolutions of the 20th century (mostly Communism, with a few Fascism and Capitalism thrown in).
You almost had a point until you suggested that current wars in the Middle East and Africa don't have a major religious component. Then you lost credibility. Then I started to rethink the earlier ones and began to note that religion existed there too. For example the Nazis in fact were trying to create an alternative religion with Naziism, complete with its own tenets of faith, saints, mythologies, etc. Imperial Japan's military justified everything done as service to the living-god emperor. So yeah, religion has helped give us many of the wars you claim otherwise.
... for practical purposes over 9000 languages is a bit much ...
But for historical purposes it makes sense. Shouldn't we be digitizing as much of antiquity and vanishing cultures/languages as possible? Note that it's a pretty bad time for the physical preservation of antiquities in the cradle of civilization right now.
It would be helpful to academics to have such languages in a textual format not merely an image format.
You're confusing causation with correlation. Religion is a good way to make people do things for you, but your actual reasons are different. Of the current conflicting parties, only ISIS is really driven by religion first and foremost, so I'll concede on that one. As for the others, nope, the driving impetus is non-religious even though religion is used for gluing purposes.
Not a bad idea - abolishing every other language in the world - Chinese, Spanish, Arabic, Russian, Hindi, Urdu, Bengali, Swahili, Portuguese and the whole bunch of them. Just have ENGLISH - that too, the US one, and nothing else!!! Let everyone, including the Brits and Canucks, have to adjust - some more than others.
Which religions gave us WW1, WW2, Vietnam, the Cold War, the Korean war, and the Opium Wars again?
While those may be the biggest recent wars, they are by no means the only wars in history. There were the Muslim conquests of everything from Spain to India b/w the 7th to 10th centuries, which obliterated Christianity, Zoroastrianism, Animism, Buddhism and Hinduism from a lot of the territories they conquered. There were the Conquistadors, who overran the Aztec, Mayan & Inca empires and replaced them w/ the Spanish inquisition. There was the Thirty Years War, fought to determine whether Central Europe should be Catholic or Lutheran dominated. And today, there is the global Muslim campaign to destroy as much as possible of non-Muslim countries and subvert them until they become Islamic - that's the underpinnings of the campaigns of al Qaeda, ISIS, Hizbullah, Muslim Brotherhood and so on. Also, if one considers Communism a 'religion', which it is except that it substitutes some imaginary friends w/ dead friends, then you have the entire Soviet Purges, the Chinese Cultural Revolution and Pol Pot's holocaust in Cambodia to add to the mix.
Back in days of yore, before Facebook messaging and Whatsapp but after bang paths, @ was an essential part of a quaint communications system called "e-mail". Some old fogies still use it. Now get off my lawn and go ask the person who (I presume) sold you that 6-digit uid for a history lesson.
Han unification.
Variable-width encodings.
CJK is supported, but the early Unicode committee unfortunately decided to "unify" the codepoints of characters with shared ancestry, even though they may be rendered differently in different languages.
Technically speaking, Unicode represents characters, not glyphs, so it's a matter of whether you consider the different language-specific visual representations of those characters to be distinct characters themselves, not simply visual differences of the same character (which most native speakers would, of course). Obviously, since most text would not bother to interject a Chinese version of a character into an otherwise Japanese text, this works out reasonably well in most situations, so long as you're indicating which language you're reading (something Unicode was not supposed to need, right?). However, this multiple mapping creates significant difficulty for scholarly work in particular, which may need to reference individual glyphs of various countries within a single piece of work - something that was not easily done thanks to the way the characters were unified. Or for instance, say you have a small section of Japanese text inside a larger Chinese document, or vice versa, as you might see on the web... oops, sorry.
In version 3.2, Unicode introduced the ability to select among visual variants to help solve this issue, but I think many people feel this is a bandaid (and apparently poorly supported) solution over a poorly thought-out decision in the first place, made by representatives from North American corporations (with Chinese advisers) who didn't appreciate or care about the problems they were creating with this decision for other countries like Japan and Korea. It also may be the case that there was pressure to fit the codes into a 16-bit space, which was an initial goal until it was recognized to be completely infeasible to do so, but that constraint is pointless with the large encoding space we have today of a million character code points. Note that separate encodings would take 120,000 points instead of 21,000, which while significant, could easily be accommodated even today.
I have no idea what other flaws AmiMoJo is talking about, but the gripes with the way the committee handled Han unification are legitimate.
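To make the variant-selection mechanism and the code-space arithmetic above a bit more concrete, here is a hedged sketch. The base character and selector are picked purely for illustration (ideographic variation sequences use the supplementary selectors starting at U+E0100). Whether a given sequence is registered, or honored by any font, is not something this code checks, and both helper names are made up.

```python
import unicodedata

base = "\u845b"            # a unified Han ideograph, chosen only as an example
selector = "\U000E0100"    # VARIATION SELECTOR-17
sequence = base + selector  # still one visible character, but two code points

def strip_variation_selectors(text):
    """Drop variation selectors, which is roughly what naive text handling sees."""
    return "".join(
        ch for ch in text
        if not unicodedata.name(ch, "").startswith("VARIATION SELECTOR")
    )

def share_of_code_space(count):
    """Percentage of the 0x110000 Unicode code points taken by `count` characters."""
    return 100 * count / 0x110000

print(len(sequence), unicodedata.name(selector))
print(strip_variation_selectors(sequence) == base)
print(f"{share_of_code_space(21_000):.1f}% vs {share_of_code_space(120_000):.1f}%")
```

On those numbers, un-unified encodings would use roughly 11% of the code space instead of about 2%, so "could easily be accommodated" seems fair. The variant information, on the other hand, lives in an extra code point that software is free to drop, which may be part of why people call it a bandaid.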
There is a difference between forcing everyone to speak a certain language at home, and having one language that we use online.
I'm also pretty skeptical of the value of "culture".
Language does impact your brain in profound ways, there is real science behind that. Unlike most(all?) claims to the value of culture. (often it implies one culture is more valuable than another, which smells a bit like stuff old racist white guys say)
Thank you for ninjaing me. I often chime in about this issue when someone complains about Slashdot's lack of support for Unicode. Most of the time, after I explain the code point whitelist and the reason for it, someone complains that a blacklist of dangerous code points would work better. My usual reply is that new versions of Unicode may insert new control code points that get activated before the Slashdot admins have the chance to add them to the blacklist. And besides, many characters outside the current whitelist are far more useful for what used to be called "ASCII art" than for readable text in the English language. For example, Oriya letter ii (U+0B08) looks to English speakers more like the head of a Smurf. And ASCII Goatse and ASCII Jack Off are why Slashdot had to add a lameness filter in the first place.
But apparently, Slashdot doesn't strip bad characters on display, only on post. This post, for example, still contains a bidirectionality override.
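A rough sketch of the whitelist-versus-blacklist argument, for those who want to see it in code. The real Slashdot whitelist is not given anywhere in this thread, so the allowed set below (printable ASCII plus newline) and both function names are assumptions for illustration only.

```python
# Hypothetical whitelist: printable ASCII plus newline (NOT Slashdot's real list).
ALLOWED = set(map(chr, range(0x20, 0x7F))) | {"\n"}

# A blacklist written when only the older bidi controls existed:
# LRE, RLE, PDF, LRO, RLO.
BIDI_CONTROLS = {"\u202a", "\u202b", "\u202c", "\u202d", "\u202e"}

def filter_whitelist(text):
    """Keep only characters that are explicitly allowed."""
    return "".join(ch for ch in text if ch in ALLOWED)

def filter_blacklist(text):
    """Remove only characters that are explicitly known to be dangerous."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)

# U+202E is an old override; U+2067 (RIGHT-TO-LEFT ISOLATE) was added later.
sample = "abc\u202edef\u2067ghi"
print(repr(filter_whitelist(sample)))   # both controls are gone
print(repr(filter_blacklist(sample)))   # the newer U+2067 slips through
```

The blacklist only knows about controls that existed when it was written, which is the point about new Unicode versions. Running the same filter at display time as well as at post time would also catch old posts like the one mentioned above.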
You need to concede a lot of the fighting going on in Africa too.
As for Imperial Japan, the emperor worship was far more than a sales technique. The leadership were true believers. The god status of their emperor, their head of state, made all other heads of state vastly inferior, all other peoples vastly inferior, the wishes of all others vastly inferior. Imperial Japan's "superiority" was firmly based in their religion, it inspired their "divine destiny" to rule vast parts of Asia.
As for Naziism, it manifested religious overtones and included efforts to create an alternative religious experience for the people that predated the start of the war. Early on they absolutely recognized the power of religion and were creating an alternative one to displace Christianity. It was far more than a simple method of selling the war. Its religious-like tenets, mythologies, religious knights were also part of their belief in their "superiority", in their "destiny" to rule Europe. It really was a "religion", not an established one, an emerging one and thankfully a failed one.
And now that you inspired further thought we have the communist states of Stalin's Soviet Union and Mao's Communist China. Here too we have a religious-like activity, a worship of the state. Again, not a sales method but a fervent belief system. I suppose you could counter with all sorts of euphemisms regarding Stalin's and Mao's states but at its heart we will also find a worship of the state, a faith-based belief system, numerous religious-like behaviors. The newness of such belief systems doesn't really undermine their religious-like nature.
In short you seem to be focused on established religions providing inspiration. I'm focused on a "religious" belief system being behind the motivations for conflict. I think the latter is a more valid basis for examining the influence of religion on conflict.
That's called ideology, which is a useful distinction. You can have religion as ideology, and a-religion as ideology. Conversely, you can have non-ideological religions and a-religions.
About Japan, not really. The Japanese religion was forcefully changed by the state for the purposes of ideological indoctrination. Temples were closed, split, merged, priests reallocated and replaced, official doctrines for the specific purpose of mass submission developed, non-related philosophies (such as Bushido) reinterpreted and inserted into the mix etc. It was a religion constructed top-down for reasons of state. Ditto for Nazism. So, in both cases the use comes first, and the religious formulation later, as a byproduct.
Whether for this they use preexisting cultural elements is a matter of ease of manipulation. Reworking something is easier than developing something from scratch.
Sorry, why do we need multiple languages again?
Originally, to punish ancient Babylonians for trying to build a dangerously tall ziggurat. Since then, to preserve access to oral tradition.
Both of this musician's names can be represented in ASCII: "Prince Rogers Nelson" and "O(+>".
Culture is the ways people live together, their music and art, the way they address problems in life etc. There is no better culture. Just because racists think of their culture (which is often only a subculture, as in a partial culture within a wider culture) as superior does not mean that it is that way or that we should use it in that way. I personally think culture is a dynamic, changing thing and it helps to learn from other cultures as it enriches me and my fellow humans around me.
> Otherwise could would not be able to understand them. So even with only one language, you would not be able to understand them.
Funny that, I can get by in 3 and am fluent in 2 more, and I still don't know what that first sentence was supposed to mean.
Wonderful for you. I cannot understand it either. Most likely I should stop using my smartphone when writing comments online.
So yet another major version number and they still haven't bothered to add the many arrow (and other directional) symbols that have been missing...