Slashdot Mirror


Unicode Consortium Releases Unicode 8.0.0

An anonymous reader writes: The newest version of the Unicode standard adds 7,716 new characters to the existing 21,499 – that's more than 35% growth! Most of them are Chinese, Japanese and Korean ideographs, but among those changes Unicode adds support for new languages like Ik, used in Uganda.

164 comments

  1. Ithought by rossdee · · Score: 4, Funny

    That slashdot didn't support unicode

    1. Re:Ithought by Anonymous Coward · · Score: 2, Informative

      > That slashdot didn't support unicode

      However Soylent News has had full unicode support since last year.
      Here is a recent thread with lots of greek.

    2. Re:Ithought by hcs_$reboot · · Score: 4, Funny

      Slashdot supports Unicode / UTF8 from 0x20 to 0x7F.

      --
      Slashdot, fix the reply notifications... You won't get away with it...
    3. Re:Ithought by JayStraw · · Score: 0

      I want my, I want my, I want my line feed pleeeaaaase

    4. Re:Ithought by antiperimetaparalogo · · Score: 0

      > That slashdot didn't support unicode

      However Soylent News has had full unicode support since last year. Here is a recent thread with lots of greek.

      "Greek" you wrote? Hmmm... i know many people will be glad if a Greek Nationalist like me leave this barbaric Slashdot for the less barbaric Soylent, but even here exist people needing Greek support (and a Greek like me i may add!) - e.g., recently i read (and made a comment about it) a Slashdot summary where they missed the Greek "m" for the Micro unit, not to mention some (few) barbarians here who like and know Greek.

      Anyway, i congratulate Soylent News, not only for their unicode support but also for the rest developments of the code - but let's not criticize Slashdot so much, just a couple of days ago they added... a "share" button!

      --
      Antisthenes: "Wisdom begins by examining the words/names." - excuse my English, i am (slightly...) better with my Greek!
    5. Re:Ithought by sound+vision · · Score: 3, Informative

      Your post... has a Facebook icon next to it.
      I knew Slashdot was going in a different direction, but... Facebook? The alt-text says "From Facebook". I'm not even completely sure what that means, but I don't want anything "from Facebook" in here. I hope you know your post has fucked up my world. I'm going to have a hard time sleeping now. Bro... your post has a Facebook icon on it! However you managed to get that to appear, don't do it ever again! And tell all your friends not to. Together, we can make Slashdot sane again...

    6. Re:Ithought by tlhIngan · · Score: 4, Interesting

      That slashdot didn't support unicode

      It does. It's actually fully Unicode-compliant. It's just on the input and recently (as of a couple of years ago) the output side passes through a Unicode whitelist.

      You see, a Unicode codepoint is not necessarily a character. It can be a character modifier. So you can be handling a string containing multiple codepoints, and yet on screen it only resolves to one character. Some of these include right-to-left overrides (which alter the flow of text on the screen so you can write a string and the display agent will reverse it). There are other modifiers that include flourishes, and Unicode 8 adds "skin type modifiers" as well for emoji. As in, if you display a face, the font should use a "non-human shading" (Apple chose a Simpsons-like yellow, Microsoft chose a pale zombie-ish hue). But a skintone/diversity modifier, when combined with the emoji codepoint, can give you a variety of skin tones.
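
      A minimal Python sketch of the codepoint-versus-character point above. The two sample strings are illustrative picks, not anything from the post, and the names printed depend on the Unicode version bundled with your Python (hence the fallback name).

```python
import unicodedata

# One character on screen, two code points: "e" plus COMBINING ACUTE ACCENT.
decomposed = "e\u0301"
# An emoji followed by one of the skin-tone modifiers added in Unicode 8.0.
waving = "\U0001F44B\U0001F3FD"

def describe(text):
    """Return (code point, name, has non-zero combining class) per code point."""
    return [
        (
            f"U+{ord(ch):04X}",
            unicodedata.name(ch, "<unnamed>"),  # fallback for older Unicode tables
            unicodedata.combining(ch) != 0,
        )
        for ch in text
    ]

for label, sample in (("decomposed e-acute", decomposed), ("waving hand", waving)):
    print(label, "->", len(sample), "code points")
    for row in describe(sample):
        print("   ", row)
```

      Note that len() counts code points, not what a reader would call characters, which is the gap the post is describing.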

      And it's also what screwed up iOS - the string you send is full of modifiers which makes it extremely hard to decide where to break the line. (Arabic is one where there are lots of modifiers because a character can appear differently based on the characters that appear before and after it).

      And what does this have to do with /.? Easy - a lot of commenters abused the modifiers to screw with the website. And unless you know how to handle Unicode, it's really hard to properly reset the parser state. /. used to be able to display the screwed-up comments - if you Google for the oddball string "5:erocS" it would show it (because Google ignores modifiers). In case you wonder, that's the string "Score:5" that commenters used to fake-moderate their posts. But since /. strips Unicode on display now, you get to see the messed-up post as it was typed out.
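
      A rough sketch of the kind of filtering described above, assuming the abuse was done with explicit bidirectional controls such as U+202E (right-to-left override). This is not Slashdot's actual filter; the helper name and the set of classes to drop are choices made for the example.

```python
import unicodedata

# Bidi classes of the explicit embedding, override, and isolate controls.
EXPLICIT_BIDI = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}

def strip_bidi_controls(text):
    """Drop explicit bidirectional formatting characters, keep everything else."""
    return "".join(
        ch for ch in text if unicodedata.bidirectional(ch) not in EXPLICIT_BIDI
    )

# Typed as "5:erocS" behind a right-to-left override, so a bidi-aware
# renderer would display it reversed, as "Score:5".
spoofed = "\u202e5:erocS\u202c"
cleaned = strip_bidi_controls(spoofed)
print(repr(spoofed), "->", repr(cleaned))
```

      With the controls removed, what is left is the text in the order it was typed, which matches what the post says you now see.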

    7. Re: Ithought by Anonymous Coward · · Score: 0

      There are all sorts of character modifiers, but not an "unconditionally end character and reset state" byte?

    8. Re:Ithought by KiloByte · · Score: 4, Insightful

      That slashdot didn't support unicode

      It does. It's actually fully Unicode-compliant

      No, Slashdot's database works in ISO-8859-1. You're confusing Slashcode which can do Unicode with Slashdot which still hasn't deployed it.

      --
      The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
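
      To illustrate why a database working in ISO-8859-1 rules out Greek, here is a small Python check. The sample strings and the helper name are made up for the example; nothing here reflects how Slashcode itself does it.

```python
def fits_latin1(text):
    """True if every character has a slot in ISO-8859-1 (code points 0-255)."""
    try:
        text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return False
    return True

samples = {
    "ascii": "Score:5",
    "french": "caf\u00e9",                      # e-acute is in Latin-1
    "greek": "\u03bb\u03cc\u03b3\u03bf\u03c2",  # Greek letters are not
    "euro sign": "\u20ac",                      # neither is the euro sign
}

for label, text in samples.items():
    print(label, fits_latin1(text), len(text.encode("utf-8")), "bytes as UTF-8")
```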
    9. Re:Ithought by Smallpond · · Score: 2

      You're just mad because your post didn't get an AOL icon.

    10. Re:Ithought by KGIII · · Score: 1

      Make a version of /. in Greek. Put it out there and let it take the time to grow. If enough people like it then they will go to it.

      --
      "So long and thanks for all the fish."
    11. Re:Ithought by phantomfive · · Score: 1

      It means he logged in using a facebook account instead of a slashdot account.

      --
      "First they came for the slanderers and i said nothing."
    12. Re:Ithought by Hognoxious · · Score: 1

      Good idea. As a bonus it might keep the fucking windbag quiet for a while.

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
    13. Re:Ithought by Anonymous Coward · · Score: 0

      Only 20 years from now it will.

    14. Re:Ithought by unixisc · · Score: 1

      Also, /. even has an IPv6 address, from what I read

    15. Re:Ithought by Anonymous Coward · · Score: 1

      It means he logged in using a facebook account instead of a slashdot account.

      And barely anyone flinches when the G+ authentication users post, which is much more common.

    16. Re:Ithought by Anonymous Coward · · Score: 0

      I throw up in my mouth a bit nearly every time I see these cross community yayhoos.

    17. Re:Ithought by daveime · · Score: 1

      when the G+ authentication user posts.

      FTFY

    18. Re:Ithought by antiperimetaparalogo · · Score: 1
      You can't avoid Greek in any "barbaric" language (in the same way French people can't avoid English nowadays, regardless of how much they try!) - in the Soylent news story "Is the Internet a Failed Utopia?" the other guy mentioned as an example of Greek support, you already have the word "Utopia" as a problematic (Greek originating) English term, because if it is an "a/u-topos" already, then it makes no-sense (even in English - from my signature: Antisthenes: "Wisdom begins by examining the words/names.").

      "Make a version of /. in Greek"? Sorry, but it's a bad idea dude - like every other Greek... i can't stand Greeks!

      --
      Antisthenes: "Wisdom begins by examining the words/names." - excuse my English, i am (slightly...) better with my Greek!
  2. Audience consortium decides... by Anonymous Coward · · Score: 0

    Beta creep sucks :(

  3. I'm going back to ASCII by OrangeTide · · Score: 0, Flamebait

    I'm kind of sick of all of this nonsense. If you have something to say that isn't in ASCII, use a GIF or MP3. Even ASCII has a lot of garbage in it like @.

    --
    “Common sense is not so common.” — Voltaire
    1. Re: I'm going back to ASCII by Anonymous Coward · · Score: 0

      It's garbage until you have to read or write one of those languages. Chinese is often rendered that way making the text unreadable unless you have good eyesight and know the character.

      You can't resize the text, search it or copy from it making it hugely frustrating.

    2. Re:I'm going back to ASCII by sexconker · · Score: 1

      I'd be happy with a reduced set of 64 characters.
      A-Z
      0-9 ,./?;":[]\ =-+)(*&^%$#@!~
      EOF
      Return/Newline (we don't use typewriters, let's use a single character)

      Drop ', `, _, <, >, {, }, tab, and |
      Yes, these are all in use, but fuck it.

    3. Re: I'm going back to ASCII by OrangeTide · · Score: 2, Insightful

      Sorry, why do we need multiple languages again?

      --
      “Common sense is not so common.” — Voltaire
    4. Re:I'm going back to ASCII by OrangeTide · · Score: 1

      what are you using @ ~ ^ # for other than Rogue/Nethack ?

      ITA2 without the letter/figure shift (so 6-bit instead of 5-bit) would be fine.

      A single alphabet for every computer user seems like an efficient use of resources and enables wider communication. (and an alphabet that isn't constantly expanding in size at a seemingly exponential rate). Latin alphabet as used in English seems OK, Cyrillic is better in many ways.

      --
      “Common sense is not so common.” — Voltaire
    5. Re: I'm going back to ASCII by Hognoxious · · Score: 1

      It's garbage until you have to read or write one of those languages.

      Don't do it then.

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
    6. Re: I'm going back to ASCII by Anonymous Coward · · Score: 0, Insightful

      the world would do well with a single global language instead of the literally 1000s of languages and dialects we have now. communication, or rather the inability to communicate with one another, is the key reason (religion being a close second) for conflict and strife in our society.

    7. Re:I'm going back to ASCII by Anonymous Coward · · Score: 0

      Latin alphabet as used in English seems OK

      I promise to start using the English alphabet exclusively as soon as you add the letters in it that are necessary for writing my name correctly.

    8. Re:I'm going back to ASCII by Dutch+Gun · · Score: 4, Interesting

      Don't let the door hit you in the ass on the way out to the pasture of obsolescence. The rest of us will continue to use Unicode, which, despite some flaws, such as their mess-up with Han Unification, does a pretty good job at solving the problem of language intercommunication. If anyone thinks they can do a better job (not counting reverting back to English-only ASCII), have at it.

      So go ahead and use ASCII, don't type in any of those dern foreign charcturs, and pretend you're back in the happy past where we had a mess of incompatible standards, and no way to easily discern which of the many possible encodings was actually used, resulting in the scrambled text we always used to see (notice you *don't* actually see that much anymore?). And most software just ignored the rest of the non-English-speaking world anyhow because of that mess.

      Personally, I'm thankful people are willing to take on largely thankless (and mind-numbing to most of us) tasks such as these.

      --
      Irony: Agile development has too much intertia to be abandoned now.
    9. Re: I'm going back to ASCII by Anonymous Coward · · Score: 1

      We're shedding languages like crazy, that with lots of small languages going extinct and all. And as sad as extinction is, for practical purposes over 9000 languages is a bit much. Yet I don't think a single language would be a good thing either. Language shapes your thought-space, meaning that some languages make it easier, sometimes outright enable, thinking thoughts that in other languages aren't very accessible or even possible. So having at least a couple sufficiently different languages is the better idea. Otherwise you foster a monoculture of thought, and we already know what such things do to crops or even "operating system ecosystems": It fosters sickliness and unhealth.

      Besides, without conflict we won't improve. We actually do need a bit of disagreement now and then. With but a single language we'd be stuck to a single thought-space, making it that much harder to shed new light on old conflicts. For all of you sad sacks with fluency in only one human language, go learn at least one other, preferably from a completely different language family. See what it does for your ability to think.

    10. Re: I'm going back to ASCII by prefec2 · · Score: 2

      No we would not do well with only one language, we would loose a lot of culture. It would be like one standard food for everyone. Furthermore, your proposition is ludicrous, as language changes all the time. That's why new street languages pop up and then evolve in something different. Language is reflection of culture. It is not like a programming language. If you want to communicate with other people you should learn additional languages. And while you are at it, also try to learn something about their culture. Otherwise could would not be able to understand them. So even with only one language, you would not be able to understand them.

      BTW: The most problems between societies are not based on religion. Religion is only used as a vehicle to transport the hostility. It is about greed, ignorance, stupidity, and frustrations.

    11. Re: I'm going back to ASCII by Hognoxious · · Score: 1

      we would loose a lot of culture.

      You don't have much to spare, it seems.

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
    12. Re:I'm going back to ASCII by Anonymous Coward · · Score: 0

      ASCII isn't English-only. It supports latin-1 just fine. It's the implementations that don't, and you can blame stuck-in-the-mud 'merkin computer nerds for that.

      It was the glass terminal that didn't support <character><BS><accent> that you could do with paper terminals, in fact that was the way the ASCII committee designed the thing to work. Clunky, multi-byte construct, you say? unicode with its multi-byte codepoints and its encouragement to use multiple code points even if a single do-all equivalent code point exists, I say.

      Personally, I think unicode is a bad idea done badly, by committee. You could say "but at least it exists", and I could just as easily reply "precluding serious effort to come up with something better, though such is indeed needed." It's not cost-free to use unicode, you know.

      It's not even close to "cheap" if you tally all the costs you usually overlook, like how more than half the space taken by C libraries is devoted to "multi-byte support" and yet it isn't enough for serious unicode handling. Concepts that we take for granted with ASCII, like a "canonical representation" plain don't exist, are not even defined for unicode. This example in fact caused an insidious security problem not long ago. So yeah, unicode isn't free and not obviously better than the alternatives except for the buzzword band wagon factor. That's really the only thing it has going for it.
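
      On the "canonical representation" point: the standard does define normalization forms (NFC, NFD and friends), but software has to apply them explicitly, and a raw comparison will not do it for you. A small Python sketch, with same_text being a hypothetical helper for the example:

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT, two code points

def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print("raw equality:       ", precomposed == decomposed)
print("equal after NFC:    ", same_text(precomposed, decomposed))
print("NFD of precomposed: ", [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", precomposed)])
```

      Code that compares or hashes untrusted strings without a step like this can treat two visually identical names as different, which is the general shape of the security problems alluded to above.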

    13. Re: I'm going back to ASCII by prefec2 · · Score: 1

      This is an orthographic mistake which may happen even to native speakers. So please forgive me when I make some mistakes. Hopefully you still got the message.

    14. Re: I'm going back to ASCII by Anonymous Coward · · Score: 0

      Which religions gave us WW1, WW2, Vietnam, the Cold War, the Korean war, and the Opium Wars again?

    15. Re: I'm going back to ASCII by Anonymous Coward · · Score: 0

      Sorry, why do we need multiple languages again?

      We don't, so why don't you stop speaking English and start to use Finnish like God himself does.

    16. Re:I'm going back to ASCII by AmiMoJo · · Score: 1

      It's a shame Unicode has become the standard. Its numerous flaws can't be overlooked, but we seem to be stuck with it now. Maybe the consortium will eventually fix those things, especially CJK support, but so far there is little sign that they care.

      It's terrible that we screwed up something so important.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    17. Re:I'm going back to ASCII by Anonymous Coward · · Score: 0

      It's barbarians like you who are responsible for Slashdot's inability to accept ye English "thorn" character!

    18. Re: I'm going back to ASCII by hackwrench · · Score: 1

      Not like a programming language? Then what's with all the different numbers after C++ standards and all the different variants of BASIC?

    19. Re:I'm going back to ASCII by Hognoxious · · Score: 2

      What are you doing these days? When Doves Cry was part of my childhood.

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
    20. Re: I'm going back to ASCII by Smallpond · · Score: 1

      Sorry, why do we need multiple languages again?

      http://www.scientificamerican....

    21. Re: I'm going back to ASCII by Anonymous Coward · · Score: 0

      Religions don't give rise to war. They are used as a justification for war.

    22. Re: I'm going back to ASCII by Anonymous Coward · · Score: 0

      > as language changes all the time
      Let's pretend that "loose" means "lose", "looser" means "loser".

    23. Re:I'm going back to ASCII by Smallpond · · Score: 1

      C isn't a good example. It doesn't even do a good job of handling strings of 1-byte characters.

    24. Re: I'm going back to ASCII by alexgieg · · Score: 1

      Replying to undo incorrect moderation.

      --
      Conservatism: (n.) love of the existing evils. Liberalism: (n.) desire to substitute new evils for the existing ones.
    25. Re: I'm going back to ASCII by alexgieg · · Score: 1

      Which religions gave us WW1, WW2, Vietnam, the Cold War, the Korean war, and the Opium Wars again?

      Monarchism (+Communism in Russia), Fascism+Capitalism+Communism, Communism+Capitalism, Communism+Capitalism, Communism+Capitalism, Mercantilism. You forgot: all the current Middle East wars (Fascism+Capitalism), all the current African wars (Tribalism+Communism+Capitalism), and all the many single-country revolutions of the 20th century (mostly Communism, with a few Fascism and Capitalism thrown in).

      --
      Conservatism: (n.) love of the existing evils. Liberalism: (n.) desire to substitute new evils for the existing ones.
    26. Re: I'm going back to ASCII by Anonymous Coward · · Score: 0

      let's suppose we implement this immediately. what do we do with all the written language up till now.
      do we translate it all, live with the loss of meaning, and burn the originals? how do we do this for
      video?

    27. Re:I'm going back to ASCII by OrangeTide · · Score: 1

      You may have to adjust your expectations as to the correct form of your name.

      I would be OK defining a few standards for transcription. This is different than just using UTF-8 because the undecoded form is still human readable.

      ps - people pronounce my name wrong half the time but it's only 4 letters long and is a common noun in most parts of the US.

      --
      “Common sense is not so common.” — Voltaire
    28. Re: I'm going back to ASCII by Cassini2 · · Score: 1

      Sorry, why do we need multiple languages again?

      Have you read the latest C++ spec? That's what happens when a single language does everything.

      The same effect happens in people languages too.

    29. Re: I'm going back to ASCII by Hognoxious · · Score: 1

      We're shedding languages like crazy, that with lots of small languages going extinct and all.

      Don't tell HR. They'll be adding "fluent in Kernowek and Ainu" to every job description.

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
    30. Re:I'm going back to ASCII by Anonymous Coward · · Score: 0

      be specific. what numerous flaws, and why can't they be ignored? the one example
      you give is bogus. cjk are supported.

    31. Re: I'm going back to ASCII by drnb · · Score: 1

      Which religions gave us WW1, WW2, Vietnam, the Cold War, the Korean war, and the Opium Wars again?

      Monarchism (+Communism in Russia), Fascism+Capitalism+Communism, Communism+Capitalism, Communism+Capitalism, Communism+Capitalism, Mercantilism. You forgot: all the current Middle East wars (Fascism+Capitalism), all the current African wars (Tribalism+Communism+Capitalism), and all the many single-country revolutions of the 20th century (mostly Communism, with a few Fascism and Capitalism thrown in).

      You almost had a point until you suggested that current wars in the middle east and africa don't have a major religious component. Then you lost credibility. Then I started to rethink the earlier ones and began to note that religion existed there too. For example the Nazis in fact were trying to create an alternative religion with Naziism, complete with its own tenets of faith, saints, mythologies, etc. Imperial Japan's military justified everything done as service to the living-god emperor. So yeah, religion has helped give us many of the wars you claim otherwise.

    32. Re: I'm going back to ASCII by alexgieg · · Score: 1

      You're confusing causation with correlation. Religion is a good way to make people do things for you, but your actual reasons are different. From the current conflicting parties, only ISIS is really driven by religion first and foremost, so I'll concede on that one. As for the others, nope, the driving impetus is non-religious even though religion is used for gluing purposes.

      --
      Conservatism: (n.) love of the existing evils. Liberalism: (n.) desire to substitute new evils for the existing ones.
    33. Re:I'm going back to ASCII by Anonymous Coward · · Score: 0

      what do you mean "undecoded"? even ascii needs to be transformed from a bit pattern to a different bit pattern on the screen.

    34. Re:I'm going back to ASCII by pjt33 · · Score: 1

      what are you using @ for other than Rogue/Nethack ?

      Back in days of yore, before Facebook messaging and Whatsapp but after bang paths, @ was an essential part of a quaint communications system called "e-mail". Some old fogies still use it. Now get off my lawn and go ask the person who (I presume) sold you that 6-digit uid for a history lesson.

    35. Re:I'm going back to ASCII by AmiMoJo · · Score: 1

      Han unification.
      Variable width encodings.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    36. Re:I'm going back to ASCII by Dutch+Gun · · Score: 1

      CJK is supported, but the early Unicode committee unfortunately decided to "unify" the codepoints of characters with shared ancestry, even though they may be rendered differently in different languages.

      Technically speaking, Unicode represents characters, not glyphs, so it's a matter of whether you consider the different language-specific visual representations of those characters to be distinct characters themselves, not simply visual differences of the same character (which most native speakers would, of course). Obviously, since most text would not bother to interject a Chinese version of a character into an otherwise Japanese text, this works out reasonably well in most situations, so long as you're indicating which language you're reading (something Unicode was not supposed to need, right?). However, this multiple mapping creates significant difficulty for scholarly work in particular, which may need to reference individual glyphs of various countries within a single piece of work - something that was not easily done thanks to the way the characters were unified. Or for instance, say you have a small section of Japanese text inside a larger Chinese document, or vice versa, as you might see on the web... oops, sorry.

      In version 3.2, Unicode introduced the ability to select among visual variants to help solve this issue, but I think many people feel this is a bandaid (and apparently poorly supported) solution over a poorly thought-out decision in the first place, made by representatives from North American corporations (with Chinese advisers) who didn't appreciate or care about the problems they were creating with this decision for other countries like Japan and Korea. It also may be the case that there was pressure to fit the codes into a 16-bit space, which was an initial goal until it was recognized to be completely infeasible to do so, but is pointless with the large encoding space we have today of a million character code points. Note that separate encodings would take 120,000 points instead of 21,000, which while significant, could easily be accommodated even today.

      I have no idea what other flaws AmiMoJo is talking about, but the gripes with the way the committee handled Han unification was legitimate.

      --
      Irony: Agile development has too much intertia to be abandoned now.
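
      A small Python sketch of what unification means in practice. U+9AA8 is used here because it is a commonly cited ideograph that Chinese and Japanese fonts draw differently; the language tags and the variation sequence are purely illustrative and are not claimed to be a registered sequence.

```python
import unicodedata

han = "\u9aa8"  # one code point, whichever language the text is in
print(f"U+{ord(han):04X}", unicodedata.name(han))

# The language has to travel out of band (markup, metadata, font choice),
# because the string itself is identical in both cases.
tagged = [("ja", han), ("zh-Hans", han)]
print("same code point in both:", tagged[0][1] == tagged[1][1])

# Variation selectors append a second code point to request a specific glyph.
vs17 = "\U000E0100"
variant = han + vs17
print(len(variant), "code points;", unicodedata.name(vs17))

# Size of the code space today: 17 planes of 65,536 code points each.
total_code_points = 17 * 0x10000
print(total_code_points)
```

      The last figure is where the "million character code points" above comes from.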
    37. Re: I'm going back to ASCII by OrangeTide · · Score: 1

      There is a difference between forcing everyone to speak a certain language at home, and having one language that we use online.

      I'm also pretty skeptical of the value of "culture".

      Language does impact your brain in profound ways, there is real science behind that. Unlike most(all?) claims to the value of culture. (often it implies one culture is more valuable than another, which smells a bit like stuff old racist white guys say)

      --
      “Common sense is not so common.” — Voltaire
    38. Re: I'm going back to ASCII by drnb · · Score: 1

      You need to concede a lot of the fighting going on in Africa too.

      As for Imperial Japan, the emperor worship was far more than a sales technique. The leadership were true believers. The god status of their emperor, their head of state, made all other heads of state vastly inferior, all other peoples vastly inferior, the wishes of all others vastly inferior. Imperial Japan's "superiority" was firmly based in their religion, it inspired their "divine destiny" to rule vast parts of Asia.

      As for Naziism, it manifested religious overtones and included efforts to create an alternative religious experience for the people that predated the start of the war. Early on they absolutely recognized the power of religion and were creating an alternative one to displace christianity. It was far more than a simple method of selling the war. Its religious-like tenets, mythologies, religious knights were also part of their belief in their "superiority", in their "destiny" to rule Europe. It really was a "religion", not an established one, an emerging one and thankfully a failed one.

      And now that you inspired further thought we have the communist states of Stalin's Soviet Union and Mao's Communist China. Here too we have a religious-like activity, a worship of the state. Again, not a sales method but a fervent belief system. I suppose you could counter with all sort of euphemisms regarding Stalin's and Mao's states but at its heart we will also find a worship of the state, a faith based belief system, numerous religious-like behaviors. The newness of such belief systems don't really undermine their religious-like nature.

      In short you seem to be focused on established religions providing inspiration. I'm focused on a "religious" belief system being behind the motivations for conflict. I think the former is a more valid basis for examining the influence of religion on conflict.

    39. Re: I'm going back to ASCII by alexgieg · · Score: 1

      That's called ideology, which is a useful distinction. You can have religion as ideology, and a-religion as ideology. Conversely, you can have non-ideological religions and a-religions.

      About Japan, not really. The Japanese religion was forcefully changed by the state for the purposes of ideological indoctrination. Temples were closed, split, merged, priests reallocated and replaced, official doctrines for the specific purpose of mass submission developed, non-related philosophies (such as Bushido) reinterpreted and inserted into the mix etc. It was a religion constructed top-down for reasons of state. Ditto for Nazism. So, in both cases the use comes first, and the religious formulation later, as a byproduct.

      Whether for this they use preexisting cultural elements is a matter of ease of manipulation. Reworking something is easier than developing something from scratch.

      --
      Conservatism: (n.) love of the existing evils. Liberalism: (n.) desire to substitute new evils for the existing ones.
    40. Re:I'm going back to ASCII by Anonymous Coward · · Score: 0

      Yet there are allographs of a in Unicode. Those guys are so full of shit. /. will filter it out but I'm referring to U+0251

    41. Re: I'm going back to ASCII by prefec2 · · Score: 1

      Culture is the way people live together, their music and art, the way they address problems in life etc. There is no better culture. Just because racists think of their culture (which is often only a subculture as in a partial culture in a wider culture) as superior does not mean that it is that way or that we should use it in that way. I personally think culture is a dynamic, changing thing and it helps to learn from other cultures as it enriches me and my fellow humans around me.

    42. Re: I'm going back to ASCII by daveime · · Score: 1

      > Otherwise could would not be able to understand them. So even with only one language, you would not be able to understand them.

      Funny that, I can get by in 3 and am fluent in 2 more, and I still don't know what that first sentence was supposed to mean.

    43. Re: I'm going back to ASCII by prefec2 · · Score: 1

      Wonderful for you. I cannot understand it either. Most likely I should stop using my smartphone when writing comments online.

  4. bloatware by Anonymous Coward · · Score: 0, Funny

    Adding a bunch of useless characters, especially from computer-illiterate regions... Most languages can be written with English characters (ie. plain latin). That would make things a lot simpler.

    1. Re:bloatware by Zontar+The+Mindless · · Score: 2

      Most languages can be written with English characters (ie. plain latin).

      Name a language written with Latin characters (other than English) that does not use any special characters or diacritical marks whatsoever.

      (Even English requires extensions to the Latin character set, which originally had no "U", "J", or "W".)

      --
      Il n'y a pas de Planet B.
    2. Re: bloatware by Anonymous Coward · · Score: 0

      Dutch

    3. Re: bloatware by Anonymous Coward · · Score: 0

      There are plenty. I see Tagalog and Malaysian/Indonesian regularly.

    4. Re: bloatware by SirSlud · · Score: 1

      Dutch has an extra character not in the Latin character set.

      --
      "Old man yells at systemd"
    5. Re: bloatware by ciaran2014 · · Score: 2

      You mean "ij"? The unified ij character isn't used by anyone. Not sure if it's even recommended by any body.

      But Dutch does have accents (één, vóór, ...). News headlines this morning:

      "Verstekeling valt boven Londen uit vliegtuig na 11u lange vlucht, één overleeft"

      "Grieken demonstreren ook vóór de euro"

      --
      Help build the anti-software-patent wiki
    6. Re: bloatware by ciaran2014 · · Score: 1

      Accents in Tagalog are optional and rarely used, but they are there.

      I've never seen them used on websites, but they're used in most or all dictionaries.

      --
      Help build the anti-software-patent wiki
    7. Re: bloatware by Zontar+The+Mindless · · Score: 1

      Tagalog and Bahasa Malaysia/Indonesia aren't "most languages". :)

      That being said, I should have thought of the latter myself.

      --
      Il n'y a pas de Planet B.
    8. Re: bloatware by Anonymous Coward · · Score: 0

      Malaysian and Indonesian combined have 77 million speakers.

    9. Re: bloatware by Ilgaz · · Score: 2

      This is exactly why it took decades and crazy hacks for people to write their own language electronically.

      Thank God the virtually failed (but ultimately victorious) Plan 9 (UNIX2) came along, with idealistic developers who respected other cultures and came up with Unicode, and that companies like IBM/Microsoft/Adobe, along with Free software, supported it.

      Who knows if the software/hardware/network combination you use had a line coded by a person who is from those "computer illiterate" regions?

    10. Re: bloatware by smallfries · · Score: 1

      Does anybody else read this as the Librarian?

      --
      Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
    11. Re: bloatware by Hognoxious · · Score: 1

      This is exactly why it took decades and crazy hacks for people to write their own language electronically.

      You're confusing languages and alphabets. Ever heard of pinyin?

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
  5. CJK is Unicode's big failing by Anonymous Coward · · Score: 5, Interesting

    CJK in Unicode really kills me. I once had to write an application that generated PDF documents with both Japanese and Chinese text. When you do this with, say, English and Russian, you just need to pick a font set that covers both alphabets and basta. Not so with Chinese/Japanese. There are a number of glyphs that share a common historic root in these languages, and the Unicode folks decided to consecrate this historical relationship by recycling the character codes between the languages. Yet, the glyphs are substantially different when rendered. So you don't know what the glyph really represents until you know what font set is being applied to the string.

    What I ended up doing was processing each character individually and using a "look around" algorithm that would try to find clues in the context as to what language the glyph was in and render it with the right font. It never worked very well, but it worked well enough that the client decided not to refactor the controller that was generating the mixed language strings.
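
    A minimal sketch of what such a "look around" heuristic could look like, not the code actually used. The rule here is an assumption: kana near an ideograph suggests Japanese, anything else is treated as Chinese. The names `guess_language` and `window` are hypothetical.

```python
def is_kana(ch: str) -> bool:
    """True for Hiragana (U+3040-U+309F) or Katakana (U+30A0-U+30FF)."""
    return "\u3040" <= ch <= "\u30ff"


def is_han(ch: str) -> bool:
    """True for the main CJK Unified Ideographs block (U+4E00-U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"


def guess_language(text: str, index: int, window: int = 8) -> str:
    """Guess 'ja' or 'zh' for the ideograph at text[index].

    Assumed rule: if any kana appears within `window` characters on
    either side, call it Japanese; otherwise call it Chinese.
    Returns 'und' (undetermined) when the character is not an ideograph.
    """
    if not is_han(text[index]):
        return "und"
    lo = max(0, index - window)
    hi = min(len(text), index + window + 1)
    if any(is_kana(ch) for ch in text[lo:hi]):
        return "ja"
    return "zh"
```

    A Japanese phrase made only of kanji gets misread as Chinese under this rule, which is presumably the kind of case where it "never worked very well".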

    But I learned two valuable lessons that day: Unicode isn't that great after all and stay away from CJK contracts.

    1. Re:CJK is Unicode's big failing by fisted · · Score: 1

      the Unicode folks decided to consecrate this historical relationship by recycling the character codes between the languages.

      For example? I thought exactly this was /not/ being done by unicode.

    2. Re:CJK is Unicode's big failing by mwvdlee · · Score: 2

      There are certainly plenty of "repeat" characters in different contexts.
      For example the math alphanumerics: http://unicode.org/charts/PDF/...

      --
      Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?
    3. Re:CJK is Unicode's big failing by SirSlud · · Score: 0

      Unicode trips up shitty programmers. It sounds a little more like it was one of those "out of your league" problems. Probably best to stay away from CJK contracts tho. I've shipped shitloads of software with CJK localizations, and frankly, from that word soup you've produced, I don't think you have any idea what you're talking about.

      --
      "Old man yells at systemd"
    4. Re:CJK is Unicode's big failing by Anonymous Coward · · Score: 0
    5. Re:CJK is Unicode's big failing by Anonymous Coward · · Score: 0

      Ahhhh. Now I understand why Japanese or Chinese documents sometimes render with *two* fonts depending on the fonts installed on the machine, and the language that machine is set to. Rendering is falling back to another font for glyphs not in the local font.

      Totally annoying.

    6. Re:CJK is Unicode's big failing by gustygolf · · Score: 4, Informative

      In short:
      To render text properly in Japanese, you need a Japanese font. To render text properly in Chinese, you need a Chinese font. It's not just because of character coverage, but because of a thing called Han unification the consortium did.

      The Unicode consortium decided to map similar characters to the same code-point. Personally, I'm not particularly bothered by this, but it leads to the technical problem that each text must be supplied with a language tag to select a correct font.

      And this is problematic when there are two CJK languages mixed in the same document -- in the GP's case, Chinese and Japanese --, or when a program must automatically decide which font to render things in.

      Take a web browser for example. It reaches a random Chinese web page, encoded in UTF-8. The page's author never bothered adding a language tag. Now the web browser must guess whether to render the page in a Chinese font or a Japanese one. And a "guess" is really all that it can do.

      (Typically, software bases its guesses on the user's locale. It's pretty accurate -- Chinese users tend to view Chinese documents, Japanese Japanese ones. But the problems start when someone tries viewing a 'foreign' document...)
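
      A rough sketch of that fallback order, assuming a renderer that prefers an explicit language tag and only then falls back to the user's locale. The font names and the `pick_cjk_font` helper are made up for illustration; no real browser API is implied.

```python
# Hypothetical font names, for illustration only.
FONTS = {
    "ja": "ExampleMincho",
    "zh": "ExampleSong",
    "ko": "ExampleBatang",
}


def pick_cjk_font(lang_tag, user_locale, default="ExampleSong"):
    """Pick a font from the document's language tag, else the user's locale.

    Both arguments may be None. Only the primary subtag is considered,
    so "zh-CN" and "ja_JP.UTF-8" reduce to "zh" and "ja".
    """
    for candidate in (lang_tag, user_locale):
        if candidate:
            primary = candidate.replace("_", "-").split("-")[0].lower()
            if primary in FONTS:
                return FONTS[primary]
    # Neither source named a CJK language: this is the pure guess.
    return default
```

      With a tag, `pick_cjk_font("zh-CN", "ja_JP")` picks the Chinese font regardless of locale; without one, a Japanese user gets the Japanese font even on a Chinese page.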

      It's really quite ironic that the consortium decided on codepoint unification for the three languages that would most benefit from Unicode.

      --
      "Slow Down Cowboy! It's been 58 minutes since you last successfully posted a comment" -- slashdot, driving users away.
    7. Re:CJK is Unicode's big failing by Anonymous Coward · · Score: 0

      I thought a while back unicode did a deunification so each language has all their characters in their own pages.

    8. Re:CJK is Unicode's big failing by Anonymous Coward · · Score: 0

      You angry Bro?

    9. Re:CJK is Unicode's big failing by Anonymous Coward · · Score: 0

      Hi. GP here. I am glad you are such a leet programmer. I just try to feed my family by writing boring business software.

      Interestingly, most of the people I've met who have done text processing in CJK have an opinion of some sort on the Han unification beyond "good programmers can handle it". Some are for, others against. I can tell that you've never done either but you think that passing strings to the application framework to let it handle everything counts. While that is the safest and best approach 99.99% of the time, sometimes it does not work. This was one of those times.

      In this corner case, the Unicode encoding is ambiguous and in my mind that means it is broken.

    10. Re:CJK is Unicode's big failing by Anonymous Coward · · Score: 0

      They're not "similar" they're the same character

      You know how different people write a 4 in different ways? For some people it's two intersecting lines that don't join - for others it's one stroke. But those aren't two different numbers, are they? They're the same number, written in a different way.

      That's all that happened to Han characters.

      The question of what the character should look like is a typography question, which goes to the OS text renderer, not Unicode's problem. For several languages you need to know which language it is before you know how to render the character correctly. This includes some minority European languages. CJK is notable because lots of people threw their toys out of the pram, all in the same exact manner

      "This is the only way to draw character X, it's the way I learned when I was a baby, therefore other ways are wrong or a different character, never draw character X any other way than my way"

      But none of the actual linguists felt this way, Han unification is a product of local linguists who understand the language, as opposed to the usual armchair fans who've got no idea what they're up to. For the armchair fans it's "obvious" that their native tongue is correct and everybody else is in error and should go away.

    11. Re:CJK is Unicode's big failing by Anonymous Coward · · Score: 0

      Note that the Coward's proposed example of "Russian" doesn't work. Russian is a Cyrillic language and there isn't agreement on the entirety of Cyrillic for the correct appearance of characters (particularly in italic). So if you get a non-Russian Cyrillic Unicode font and render Russian text you get something which is "wrong" to the eyes of Russians.

      German would have been the same until relatively recently. Germans had a different preferred script (Fraktur) for Latin characters, so until the mid-20th century they would have been very unhappy with a Latin font that made the German letter 'k' look the same as the English letter 'k'.
       

    12. Re:CJK is Unicode's big failing by Anonymous Coward · · Score: 0

      For those who still hate Han unification, composing character schemes, etc. that Unicode adopted: back in the '90s, Lotus came up with an umbrella character set called LMBCS which allowed a stream of text to contain characters interleaved from different native character sets, using a prefix byte scheme. They used it across their product line, which included the 1-2-3 spreadsheet and Notes. Then it got blown away by Unicode.

      Bob Balaban wrote up LMBCS here: http://www.bobzblog.com/tuxedoguy.nsf/dx/introducing-geek-o-terica

    13. Re:CJK is Unicode's big failing by NostalgiaForInfinity · · Score: 1

      There are a number of glyphs that share a common historic root in these languages, and the Unicode folks decided to consecrate this historical relationship by recycling the character codes between the languages.

      This is not substantially different from what happened with Latin and Greek characters, both pre-Unicode and in Unicode.

      Yet, the glyphs are substantially different when rendered.

      Most of those glyph variants are similar enough that people have no trouble figuring them out. Those are the ideographs that were unified. For the rest, Unicode chose non-unified ideographs. Since CJK writing systems are big, unwieldy, and complex, they didn't get it all right on the first try, which is why they are adding new characters in new releases of Unicode.

      Furthermore, why does it even make sense to show a Japanese reader a Chinese glyph variant? Who are you designing your application for? I mean, if all you care about is appearance, you might as well embed images. But let's say a Japanese reader reads a text on Chinese philosophy. He wants to be able to enter and search for Chinese loan words and Chinese names; what good would it do him if the Chinese loan words and names are encoded in ways that he can't enter or search for?

      Unicode isn't that great after all

      These problems are intrinsic to the languages; they are not problems with Unicode. The real solution is political and cultural: if using strings across languages is a frequent use case, that use case can only be addressed by harmonizing the writing systems themselves and adapting real-world usage; it's not something that the encoding can solve. And, of course, that unification is already happening, just like it happened in the West when we mostly unified our variants of the Latin writing system across Europe.

    14. Re:CJK is Unicode's big failing by NostalgiaForInfinity · · Score: 1

      Personally, I'm not particularly bothered by this, but it leads to the technical problem that each text must be supplied with a language tag to select a correct font.

      That's roughly like saying that you need to render the words "automaton", "Tsirpas", and "Varoufakis" in Greek characters, and "Putin" and "Gorbachev" using Cyrillic characters, in Latin text: it serves little purpose and it would make the text unreadable for many readers.

      Most of the time, you should show Japanese glyph variants to Japanese readers because that's what they know how to read and how to enter with their keyboard.

      It reaches a random Chinese web page, encoded in UTF-8. The page's author never bothered adding a language tag. Now the web browser must guess whether to render the page in a Chinese font or a Japanese one. And a "guess" is really all that it can do.

      The reader decides which font he wants to use based on what he is comfortable reading, and you have no business overriding that. In case a character variant is significant, the Unicode consortium will probably have allocated a separate codepoint for it, and the writer has the option of choosing that.

    15. Re:CJK is Unicode's big failing by chad_r · · Score: 1

      That's roughly like saying that you need to render the words "automaton", "Tsirpas", and "Varoufakis" in Greek characters, and "Putin" and "Gorbachev" using Cyrillic characters, in Latin text: it serves little purpose and it would make the text unreadable for many readers.

      Yet, that was the spec the GP was trying to write to: a single PDF needing to render both Chinese and Japanese, each in their own font, yet with no language tagging on any of the text. I give him credit for trying to meet the requirements, but they were crap requirements.

    16. Re: CJK is Unicode's big failing by hackwrench · · Score: 1

      Says an armchair expert.

    17. Re:CJK is Unicode's big failing by NostalgiaForInfinity · · Score: 1

      Yes, I agree that that's what the GP was writing for. But it isn't Unicode's fault that the original text lacked the language tags or font information for his needs; Unicode didn't prevent the original authors from putting that in, either using Unicode's own language tag characters or (preferred) XML/HTML and/or metadata. However, most people really don't want the behavior he wants, which is why people don't usually do this.

    18. Re:CJK is Unicode's big failing by Anonymous Coward · · Score: 0

      not Unicode's problem

      Sure, Unicode is nothing more than a character classification system for linguists. The problem is that we're using it outside of its domain for applications such as desktop publishing and document interchange where it is not a very good fit. It just doesn't have the properties that people need in practice. Yet programmers (probably Westerners who think that Unicode is the best thing since sliced bread) continue cramming it into every corner of computing.

    19. Re:CJK is Unicode's big failing by baka_toroi · · Score: 1

      I really can't explain this properly because I can't show you the symbols, but I'm sick of seeing the Chinese variant of the first kanji in "chokusetsu" whenever I type an email in Japanese. Of all the huge corpus of Chinese ideograms, the ones with different stroke order should've been separately encoded. It's really weird behavior and to me it's a bad enough oversight.

      These problems are intrinsic to the languages; they are not problems with Unicode. The real solution is political and cultural: if using strings across languages is a frequent use case, that use case can only be addressed by harmonizing the writing systems themselves and adapting real-world usage; it's not something that the encoding can solve.

      I don't follow what you are trying to say. Are you saying the Japs and the Chinks should unify their writing systems? Because that's as disrespectful as the demonyms I have just used.

    20. Re:CJK is Unicode's big failing by NostalgiaForInfinity · · Score: 1

      I really can't explain this properly because I can't show you the symbols,

      I know the symbols; I can read Japanese.

      I'm sick of seeing the Chinese variant of the first kanji in "chokusetsu" whenever I type an email in Japanese

      Why would you be seeing the Chinese variant if you're writing an E-mail in Japanese, presumably using a Japanese font? Note that even Google Translate manages to show you the correct local variants:

      https://translate.google.com/#...

      Are you saying the Japs and the Chinks should unify their writing systems? Because that's as disrespectful as the demonyms I have just used.

      I don't see how suggesting that the Chinese and Japanese do what we in the West have done for thousands of years, namely rationalize, unify, and adapt our writing systems is "disrespectful". Writing in the West is several thousand years older than in China; we discarded ideographs in favor of our alphabet before the Chinese even had writing. I have lived in Western cities that had literate cultures a thousand years before the Japanese even had any writing.

      Your analogy between ethnic slurs and cultural disrespect doesn't work. Ethnicity is an arbitrary accident of birth and has no bearing on anyone's abilities, morals, or other characteristics. Disrespecting someone's ethnicity is therefore not rational. Culture, on the other hand, is a collection of values, norms, behaviors, and achievements. And you are absolutely right: while I find Chinese and Japanese culture interesting and like some aspects of each, overall, I consider those cultures failures and examples of how human societies and affairs should not be organized. And I think I have history on my side. So, in that sense, I "disrespect" those cultures as cultures.

    21. Re:CJK is Unicode's big failing by BobbyWang · · Score: 1

      That's roughly like saying that you need to render the words "automaton", "Tsirpas", and "Varoufakis" in Greek characters, and "Putin" and "Gorbachev" using Cyrillic characters, in Latin text: it serves little purpose and it would make the text unreadable for many readers.

      Latin, greek and cyrillic scripts have their own code points for (historically) common characters. For example the latin letter B (U+0042), the greek letter Beta (U+0392) and the cyrillic letter Ve (U+0412) are historically the same symbol but have their own code points in unicode. This makes it easy to embed snippets of greek script in an english text, for example, since a greek font will automatically be used for the greek script (instead of getting randomly mixed fonts risking a suboptimal rendition).
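
      One way to see this is to split a string into script runs. The sketch below leans on a shortcut that is an assumption, not a standard algorithm: it takes the first word of each character's Unicode name as the script label. The `script_of` and `script_runs` helpers are hypothetical.

```python
import unicodedata
from itertools import groupby


def script_of(ch):
    """Rough script label: the first word of the Unicode character name.

    Non-letters (spaces, digits, punctuation) are lumped into "COMMON".
    """
    if not ch.isalpha():
        return "COMMON"
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def script_runs(text):
    """Group consecutive characters that share a script label."""
    return [
        (script, "".join(chars))
        for script, chars in groupby(text, key=script_of)
    ]


# U+0042, U+0392 and U+0412 look alike but land in three separate runs.
print(script_runs("B\u0392\u0412"))
# A unified ideograph only reports "CJK", with no hint of Chinese or Japanese.
print(script_of("\u76f4"))
```

      A renderer can pick a Greek font for the GREEK run from the code points alone, while the CJK label carries no language information, which is the asymmetry under discussion.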

    22. Re:CJK is Unicode's big failing by NostalgiaForInfinity · · Score: 1

      Latin, greek and cyrillic scripts have their own code points for (historically) common characters.

      My point was about usage, not about the history of the alphabets: readers simply aren't served well by seeing characters whose meaning (phonetic or ideographic) they understand but that they don't recognize because they are rendered in a font that uses conventions they don't know.

      This makes it easy to embed snippets of greek script in an english text, for example, since a greek font will automatically be used for the greek script (instead of getting randomly mixed fonts risking a suboptimal rendition).

      Greek and Latin scripts are mutually unintelligible. That is, most of the letters differ between the alphabets. If you print a Latin text in Greek characters, people can't read it at all. (Also, to put this into perspective, Greek and Latin separated about the time that the Chinese got their writing system.)

      Chinese and Japanese mostly share the same character shapes; that's why it makes sense to code only the characters that are substantially different, and that's what Unicode does. Small variants are not coded separately not because evil Westerners are miserly with codepoints for CJK language, but because it's actually not useful.

      And this is exactly the same thing that we did for the different variants of the Latin alphabet (which are themselves roughly as old as Japanese writing): we share the common letters and provide variants for those that differ between Latin alphabets.

    23. Re:CJK is Unicode's big failing by BobbyWang · · Score: 1

      Yes, the western scripts have separated more (and for a longer time). But those eastern scripts have separated a bit too, at least when it comes to typesetting. It seems logical and convenient to have common code points for all CJK languages. But in reality it's actually causing problems since the same symbol is expected to look in one way for Chinese and slightly different for Japanese. It would probably be most convenient for everyone if they agreed on a common convention (as with the Latin script as you mentioned), but apparently they haven't. One solution to this could be to have code points for language context hints. Another could be to have entirely different sets of code points for the different languages. Both seem quite bad, but at least better than having algorithms trying to guess the language (which is still preferable to having suboptimal typesetting).

      By the way, there are plenty of examples where the same symbols have different code points intended for different contexts (Greek letters used for math etc). There are even Latin letters that look slightly different in different language contexts like U+0152 (filtered out by Slashdot), Ø and Ö (they all stem from a combination of O and E, Ö from the convention of writing the E above the O). Agreeing on one of the symbols for all affected languages would be logical and fully intelligible for everyone, but it would look wrong. The difference might not be as big for CJK languages (I don't know), but apparently big enough for it to matter. It seems easy to distinguish between what is a typesetting detail (like bold, italic and letters with or without serifs) and what is an entirely different symbol (like upper or lower case). But it's not in many cases. And I expect the view on these matters will continue to change over time.

      It's all a big mess. Not unicode specifically, but human writing in general.

    24. Re:CJK is Unicode's big failing by NostalgiaForInfinity · · Score: 1

      But in reality it's actually causing problems since the same symbol is expected to look in one way for Chinese and slightly different for Japanese.

      Well, a large number of people (including myself) believe it's the right thing to do. People like you lost that argument, that's why Unicode is the way it is. I'm simply explaining it, and I'm telling you that the justification isn't Western imperialism or American ignorance or whatever other cultural b.s. people like to attach to it.

      By the way, there are plenty of examples where the same symbols have different code points intended for different contexts (Greek letters used for math etc). There are even Latin letters that look slightly different in different language contexts like U+0152 (filtered out by Slashdot), Ø and Ö (they all stem from a combination of O and E, Ö from the convention of writing the E above the O). Agreeing on one of the symbols for all affected languages would be logical and fully intelligible for everyone, but it would look wrong.

      Yes, and Unicode CJK support does exactly the same thing that Latin script does for Latin alphabets: characters that look similar enough to be recognizable are shared, and characters that look significantly different and would be unintelligible get different codepoints. Since this is a much harder problem for CJK, they keep adding new codepoints.

      Unicode used to have language contexts, as well as other contexts. But markup standards like HTML and XML simply ignored the Unicode facilities. Having two separate standards for marking up regions of texts, possibly conflicting, overlapping, and inconsistently, was a problem. And people weren't using the Unicode facilities. So they were deprecated, then dropped.

      It's all a big mess. Not unicode specifically, but human writing in general.

      No, most writing systems are pretty simple: they have a few hundred symbols that are usually arranged in linear ways. In fact, even CJK isn't all that different and could easily be encoded in a few hundred codepoints (here); it was mostly a policy decision not to do that.

  6. the existing 21,499? by Anonymous Coward · · Score: 0

    According to "http://babelstone.blogspot.com.au/2005/11/how-many-unicode-characters-are-there.html", the last version has 113021 encoded characters.
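
    For what it's worth, the arithmetic with the figures quoted in this thread (7,716 added; 21,499 from the summary; 113,021 from the linked page) works out as follows. The `growth_percent` helper is just for illustration.

```python
def growth_percent(added, base):
    """Percentage growth when `added` items are added to `base` items."""
    return 100 * added / base


added = 7_716
summary_base = 21_499    # figure used in the story summary
thread_base = 113_021    # figure quoted from the linked page

print(f"{growth_percent(added, summary_base):.1f}%")
print(f"{growth_percent(added, thread_base):.1f}%")
```

    Against the larger base, the growth is a little under 7% rather than "more than 35%".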

    1. Re:the existing 21,499? by Hognoxious · · Score: 1

      the last version has 113021 encoded characters.

      Not too bad, only about 112765 too many.

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
  7. Unicode is badly designed by Anonymous Coward · · Score: 2, Interesting

    Is Unicode supposed to separate characters that look the same but are semantically different?

    Looks like the answer is yes...
    'LATIN CAPITAL LETTER A' (U+0041)
    'GREEK CAPITAL LETTER ALPHA' (U+0391)

    Looks like the answer is no...
    'RIGHT SINGLE QUOTATION MARK' (U+2019) -- this is the preferred character to use for apostrophe.
    (An apostrophe and closing a quotation are two very different things.)
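
    A quick way to check these yourself, sketched with Python's standard `unicodedata` module. The `describe` helper is hypothetical.

```python
import unicodedata


def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Two look-alike capitals with separate code points, then the two
# candidates for the apostrophe.
for ch in ("\u0041", "\u0391", "\u2019", "\u0027"):
    print(describe(ch))

# The look-alikes are different characters as far as comparison goes.
print("\u0041" == "\u0391")
```

    The single name for U+2019 is the point being made above: the same character serves as both the closing quote and the typographic apostrophe.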

    1. Re:Unicode is badly designed by Anonymous Coward · · Score: 0

      Good luck getting me to use quotes outside of ASCII.

    2. Re:Unicode is badly designed by Anonymous Coward · · Score: 1

      For you, they *look the same*, for philologists and typographers (the ones who must learn and create them) they ain't.

      They not only differ in shape (though to your eyes they *look the same*) but also in kerning and spacing (both vertical and horizontal separation from other characters).

    3. Re:Unicode is badly designed by Anonymous Coward · · Score: 0

      Surely the apostrophe is the preferred character to use for the apostrophe?
      'APOSTROPHE' (U+0027)
      http://unicode-table.com/en/0027/

    4. Re:Unicode is badly designed by Anonymous Coward · · Score: 0

      I know, but read its comments section and cry.

    5. Re:Unicode is badly designed by Anonymous Coward · · Score: 0

      They not only differ in shape (though to your eyes they *look the same*)

      It's very clear that they don't differ in shape if you open Latin A and Greek Alpha in two different tabs and switch back and forth between them.

    6. Re:Unicode is badly designed by Anonymous Coward · · Score: 0

      An apostrophe and closing a quotation are two very different things.

      True, but English traditionally uses the same character for it, and Unicode reflects that. Unicode also doesn't make a distinction between a "silent-e" and a regular "e", or between a German Umlaut and an English dieresis because, again, those distinctions are not made in the respective writing systems. Unicode doesn't usually try to introduce semantic distinctions that don't already exist in the writing system. And it inherits a number of legacy usages from ASCII anyway even if they contravene its own design principles.

    7. Re:Unicode is badly designed by arth1 · · Score: 1

      That depends on the font. Not on Unicode.
      You can have fonts where a Greek capital alpha looks very different from a Latin capital A, but that doesn't mean anything. There are fonts where zero and capital O look identical too, but that doesn't mean they are the same character, just because they appear identical looking in one particular font.

    8. Re:Unicode is badly designed by arth1 · · Score: 1

      or between a German Umlaut and an English dieresis

      Not to forget languages like Swedish, where ä is a letter in its own right, and neither an umlaut nor a dieresis.
      That means that technically, when written in a language that uses diereses, the dots can be stacked. A Swedish word like "nää" (nah-ah) rendered in a language with diereses would be written with two extra dots on the second letter to show that the second ä should also be pronounced.
      That's where Unicode fails - instead of having the diereses as a separate marker only, it has allowed for characters like "a with diereses", but not for varieties of letters that look like they already have them. What letters look like should not be any concern of Unicode. An ä with both umlaut and dieresis added should be perfectly acceptable to Unicode, and how the presenter wants to present it none of Unicode's concern. Whether it shows up with two, four or six dots above it, or a colon before it, or any other visual representation.
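
      Worth noting: Unicode does also provide the dots as a separate combining mark (U+0308 COMBINING DIAERESIS), and marks can be stacked at the encoding level. A small sketch using Python's standard `unicodedata` module. How a stacked sequence is drawn is up to the font and renderer, which this does not show.

```python
import unicodedata

precomposed = "\u00e4"   # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # 'a' followed by COMBINING DIAERESIS

# Normalization treats the two spellings as the same letter.
print(unicodedata.normalize("NFD", precomposed) == decomposed)
print(unicodedata.normalize("NFC", decomposed) == precomposed)

# A second combining diaeresis can be added on top of the first.
stacked = precomposed + "\u0308"
print([unicodedata.name(c) for c in unicodedata.normalize("NFD", stacked)])
```

      What the encoding does not record is whether a given pair of dots is meant as an umlaut, a dieresis, or part of the letter itself.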

    9. Re:Unicode is badly designed by Anonymous Coward · · Score: 1

      It's cool that the consortium made it possible to have fonts where a Greek capital alpha looks very different from a Latin capital A. But what if I'd like to have a font where a "right single quotation mark" and a "preferred apostrophe" look different? The consortium made it technically impossible by reusing the same unicode. They happily add 7,716 new characters, but the preferred apostrophe is still not its own character?

  8. Is it possible to see the version? by Anonymous Coward · · Score: 0

    Slashdot Glyphs obscure the titles. Please fix.

    1. Re:Is it possible to see the version? by Anonymous Coward · · Score: 0

      If only there was some other place to put the number of comments on each article, then it wouldn't cover up the title on screens that are less than 4k.

      I wonder where the number of comments could go?

  9. Runes by Whiteox · · Score: 1

    Unicode now has a set for pre-Latin Hungarian runes!
    Hanging out for the keyboard....

    --
    Don't be apathetic. Procrastinate!
  10. Existing 21,499? by Anonymous Coward · · Score: 0

    The comment about growth is so wrong as to be mind-boggling. Where on earth did that figure come from? Unicode 1.0.1 had more than that in the early 90s. See here for a good table with all the gory details.

  11. Re:I thought by antiperimetaparalogo · · Score: 1, Offtopic

    That slashdot didn't support unicode

    You thought right, Slashdot does not support unicode, this story is just news for nerds that is reported by accident, as stuff that matters for G[r]eeks only!

    note: i now continue my comment with a very interesting paragraph, but it is in Greek, so you can not read it, not even if you want to translate it:

    --
    Antisthenes: "Wisdom begins by examining the words/names." - excuse my English, i am (slightly...) better with my Greek!
  12. Good for Uganda by hcs_$reboot · · Score: 0

    Unicode adds support for new languages like Ik, used in Uganda

    Now Uganda needs computers to see what Unicode looks like.

    --
    Slashdot, fix the reply notifications... You won't get away with it...
    1. Re:Good for Uganda by hcs_$reboot · · Score: 1

      And literacy...

      Not if you use "OK Google" Voice search.

      --
      Slashdot, fix the reply notifications... You won't get away with it...
  13. Already = 65K characters by divec · · Score: 4, Informative

    "...adds 7,716 new characters to the existing 21,499 – that's more than 35% growth!"

    There were already 113K characters in Unicode version 7.0. Which is more than 2^16 characters, so remember:

    --

    perl -e 'fork||print for split//,"hahahaha"'

    1. Re:Already = 65K characters by fnj · · Score: 1

      1. Thank you! I KNEW that 21,499 figure was wrong

      2. Why does ANYBODY still use the mind-numbingly stupid UTF-16?

    2. Re:Already = 65K characters by xOneca · · Score: 1

      To add to #1: "UTF-16 is two bytes per character" is true as in "UTF-8 is one byte per character".
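
      A small illustration of that point, counting encoded bytes per character in Python. The `encoded_sizes` helper is hypothetical; "utf-16-le" is used so that no byte order mark is counted.

```python
def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


samples = {
    "LATIN A": "A",
    "E ACUTE": "\u00e9",
    "HAN U+76F4": "\u76f4",
    "EMOJI U+1F600": "\U0001F600",
}
for label, ch in samples.items():
    print(label, encoded_sizes(ch))
```

      Both encodings are variable-width: UTF-8 uses one to four bytes, UTF-16 uses two or four.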

    3. Re:Already = 65K characters by unixisc · · Score: 1

      "...adds 7,716 new characters to the existing 21,499 – that's more than 35% growth!"

      There were already 113K characters in Unicode version 7.0. Which is more than 2^16 characters, so remember:

      But wasn't UTF-16 supposed to cover all the practical languages (I'm not talking about Klingon or other languages created out of movies)? In which case, the 65k should have covered it. Why does Unicode need weirdass characters for playing cards or stuff of that nature? Just stick to their original roles - supporting the implementation of written & spoken languages in computers, and leave it at that.

    4. Re:Already = 65K characters by Anonymous Coward · · Score: 0

      Microsoft uses it for historic reasons: the wide-character ("Unicode") implementation of the Win32 API, as opposed to the "multibyte" one, uses 16-bit characters and goes as far back as Win95.

    5. Re:Already = 65K characters by petermgreen · · Score: 1

      2. Why does ANYBODY still use ........... UTF-16?

      Programmers use it because the programming environments they work in use it. Notably Windows, .NET and Java.

      the mind-numbingly stupid

      I wouldn't call it stupid. It was a way to add support for more characters to existing 16-bit Unicode systems with minimal breakage.

      --
      note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
    6. Re:Already = 65K characters by Matthias+Wiesmann · · Score: 1

      UTF-16 is an encoding which explains how to map bytes to code points (what you call characters), like UTF-8. UTF-16 encodes data in chunks of 16 bits, while UTF-8 encodes the data in chunks of 8 bits. UCS-2 was an encoding where only the first 2^16 code points could be encoded, in the same way that ASCII is an encoding where only the first 2^7 code points can be expressed, and ISO Latin-1 only encodes the first 2^8 code points. UCS-2 was an attempt to encode the "most common case" as you describe it. The problem is, in order to achieve this, Chinese and Japanese characters were crammed together (look up Han Unification) and were basically not usable. We are talking about around 1.5 billion people here. The fix was to add back the characters that had been removed, and go above the FFFF line.
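
How UTF-16 goes "above the FFFF line" can be sketched in a few lines of Python (an illustrative choice of language). The arithmetic below is the standard surrogate-pair scheme; the function names `to_surrogates` and `from_surrogates` are hypothetical helpers made up for this example.

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = cp - 0x10000                 # 20 significant bits remain
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low


def from_surrogates(high, low):
    """Inverse of to_surrogates: rebuild the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1F0A1)        # PLAYING CARD ACE OF SPADES
print(f"{high:04X} {low:04X}")

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
encoded = "\U0001F0A1".encode("utf-16-be")
assert encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

Code points up to U+FFFF are stored as a single 16-bit unit, which is why old UCS-2 data remains valid UTF-16.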

      As to why we need playing cards and smileys in Unicode, the reason is pretty simple: compatibility. The goal is to be able to convert all existing text data into Unicode; this is why DOS-era block drawing characters are defined as code points. Emoji were added for compatibility with the Japanese systems so that companies like Apple could enter that market with the iPhone. Without this, iPhone users would not have been able to exchange messages with other users.

      Remember that at one point in time, ASCII was the extended character set with unnecessary symbols like curly braces; this is why C++ compilers still have trigraph support.

  14. Slashdot Glyphs obscure the titles. Please fix. by ciaran2014 · · Score: 1

    I'm seeing this problem too.

    --
    Help build the anti-software-patent wiki
  15. Seems like it, but doesn't by Anonymous Coward · · Score: 1

    You can look at the size of the required but insufficient supporting libraries to get an indication (but only an indication, mind) of the cost of unicode. It's quite high. It even has a capturing effect for English, since lots of devs believe "it is the standard" or "it is the future" or somesuch nonsense, enabling the thing by default and adding even more code to "nicen up" any and all output even for text where pure ASCII would have been sufficient. This actually reduces interoperability for reasons of "modernity".

    You know, something like "smart quotes" in IRC (which strictly is against the IRC standard, since they do define the character set in use and it typically isn't utf-8). Petty? Eh, I still use ASCII-only on English-written IRC channels and I get to see the fall-out, even if you don't. I like my client, why are you throwing crap at it? Because your software thinks that's a good default, that's why.

    There are many more problems with unicode, including security problems, wilfully introduced interoperability problems, problems with having too many different encodings to do the same thing, and so on, and so forth. Usually subtle and hard-to-see problems, and there really isn't a good "universal" alternative, so people keep on using this one. Because it's "universal", see? Well, no, it's not, they're still working on that bit meaning that you get to keep upgrading all your programs to use newer and ever bigger libraries supporting more complex rules regularly. It's not stable.

    In short, unicode is about as universal as USB, including the built-in crappiness. That means that while it is something of an enabler, there's quite a cost attached. We do the unicode thing because it seems universal, but in practice it is far less so than it promises. And most of the time you don't really need that universality.

    Counterpoint: If we had a clear marker for encoding used, you could switch encodings and thereby switch rules on the fly, and use shorter encodings for the non-latin1-languages you use the most. Of course, you couldn't mix characters from fifty scripts at will, even mix and match accents among them. But again, the ability to do that is awesome expressive power that comes at a continuous cost but no practical gain.

    1. Re:Seems like it, but doesn't by Dutch+Gun · · Score: 4, Insightful

      Well, no, it's not, they're still working on that bit meaning that you get to keep upgrading all your programs to use newer and ever bigger libraries supporting more complex rules regularly. It's not stable.

      Nonsense. The Unicode encoding formats are stable, and have been for a very long time. New characters are added all the time, but the underlying OS and its fonts are typically upgraded to support these, and so most programs need to do absolutely nothing once their support is in place. The vast, vast majority of applications that support Unicode don't actually explicitly need to use those "official" Unicode libraries (which are monstrously complex), because all modern operating systems provide most of the support they need. For simple conversions, many languages have standard libraries available, or you can use OS-specific versions or any of a number of free, open-source, and very easy-to-use libraries.

      If you're concerned about size, just use UTF-8. There's no need to "switch encodings on the fly", because that's what variable-width encodings already do for you. And the vast majority of commonly used characters, even in Asian languages, need only 16 bits, not 24 or 32. The issue of inefficiency of text size with Asian languages is greatly exaggerated, and becoming less and less relevant anyhow with our machines with gigabytes of RAM and processors efficient enough to compress and decompress text on the fly. BTW, you can use UTF-8 just fine even in Microsoft and Apple environments. It just means you need to transcode from UTF-8 to UTF-16 or back again at any API boundary that takes text, and this is fairly simple to do. I've written my own cross-platform code this way because UTF-8 is a much easier encoding to work with internally IMO.
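
The "transcode at the API boundary" idea can be sketched roughly as follows. This is Python for illustration only, not the poster's own code; `to_utf16_for_api` and `from_utf16_api` are hypothetical helper names standing in for whatever wrapper sits in front of a UTF-16 platform API.

```python
def to_utf16_for_api(utf8_bytes):
    """Hypothetical boundary helper: UTF-8 in, UTF-16LE (no BOM) out."""
    return utf8_bytes.decode("utf-8").encode("utf-16-le")


def from_utf16_api(utf16_bytes):
    """Hypothetical boundary helper: UTF-16LE in, UTF-8 out."""
    return utf16_bytes.decode("utf-16-le").encode("utf-8")


text = "Unicode 8.0 \u65e5\u672c\u8a9e"    # ASCII plus three Japanese characters
internal = text.encode("utf-8")             # what the application keeps in memory
wire = to_utf16_for_api(internal)           # what a UTF-16 API would receive

assert from_utf16_api(wire) == internal     # the round trip is lossless

# Size of the same text in each encoding, in bytes.
print(len(internal), len(wire), len(text.encode("utf-32-le")))
```

For mixed ASCII and CJK text like this sample, UTF-8 comes out smaller than UTF-16; for pure CJK text the balance tips the other way, which is the size trade-off being argued about above.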

      I don't think anyone would try to argue that Unicode is a perfect solution, but it's a damn sight better than what we used to have. Your comparison to USB is pretty good, in fact. Ask just about any PC user what they'd prefer - modern USB devices or the old system of parallel, serial, PS/2, and joystick ports. Whatever faults USB has, it's a hell of an improvement over the old system.

      --
      Irony: Agile development has too much intertia to be abandoned now.
    2. Re:Seems like it, but doesn't by NostalgiaForInfinity · · Score: 1

      In short, unicode is about as universal as USB, including the built-in crappiness.

      And like USB, it's a lot better than what we had before. There are a few things wrong with Unicode, but nothing major. And where Unicode has problems, they are usually just problems that specific writing system users did to themselves and don't hurt anybody else (e.g., Chinese and Japanese screwed up a bit, but that really doesn't matter to the rest of the world).

      There are many more problems with unicode, including security problems, wilfully introduced interoperability problems, problems with having too many different encodings to do the same thing, and so on, and so forth.

      Those are problems with how you use Unicode, not with Unicode itself. Unicode addresses them through normalization; there are standard normalized forms you can use, plus a standard database of Unicode character properties for coming up with your own normalizations. That's really the best any system can do. The rest of the mess is just the inherent messiness of human writing systems.
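
As a small illustration of the normalization point, here is a sketch using Python's standard `unicodedata` module (the language is an illustrative choice). The helper name `same_text` is an assumption made up for this example.

```python
import unicodedata

composed = "\u00e9"       # "é" as a single code point
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# Two different code point sequences that render as the same text.
assert composed != decomposed


def same_text(a, b):
    """Compare two strings after bringing both to normalization form NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


assert same_text(composed, decomposed)

# The character property database is exposed as well, e.g. the general category.
print(unicodedata.category("\u0301"))       # category of the combining accent

# Compatibility normalization (NFKC) additionally folds characters such as
# the "fi" ligature into their plain equivalents.
print(unicodedata.normalize("NFKC", "\ufb01"))
```

Which form to normalize to (NFC, NFD, NFKC, NFKD) depends on the application; comparing user-visible names for security purposes usually needs more than this.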

      Counterpoint: If we had a clear marker for encoding used, you could switch encodings and thereby switch rules on the fly, and use shorter encodings for the non-latin1-languages you use the most. Of course, you couldn't mix characters from fifty scripts at will, even mix and match accents among them. But again, the ability to do that is awesome expressive power that comes at a continuous cost but no practical gain.

      I don't see how that would be any better: general purpose string libraries would have to support all major encodings, plus switching on the fly anyway. The only difference would be whether you indicate the encoding for every character or for parts of strings. If you chose markers to switch between encodings, libraries would simply convert strings into sequences of (encoding, character) pairs anyway and then transform those into integers because string algorithms and theoretical computer science have always assumed that strings are members of \Sigma^*, where \Sigma is a set of characters.

    3. Re:Seems like it, but doesn't by Anonymous Coward · · Score: 0

      New characters are added all the time, but the underlying OS and its fonts are typically upgraded to support these

      I still have Solaris 10 systems that other than fonts are still quite usable. Why should we get on the software upgrade treadmill? Install the software you need to solve the problem, and run it until it breaks. (which for Solaris seems to be a very long time)

      My language hasn't changed its alphabet in a very long time (English and C); nothing has been added. But people still think it's OK to post silly Unicode hacks. If you want to make art, maybe you should use an imageboard instead of a forum or IRC.

    4. Re:Seems like it, but doesn't by AmiMoJo · · Score: 1

      Variable-width encodings are a bad solution. UTF-8 was a reasonable hack to ease the transition to Unicode, but the standard encoding should have been 32 bits with no modifiers or multi-word encodings at all. Just give every character and every variation/modification a code and let the font rendering system worry about compounds and stuff like that.

      Then string manipulation is easy. No need to try to interpret the characters and understand every language.

      Instead we are stuck with UTF16 as the default, and even the larger encodings use modifiers etc.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    5. Re:Seems like it, but doesn't by Dutch+Gun · · Score: 2

      Instead we are stuck with UTF16 as the default, and even the larger encodings use modifiers etc.

      Who's "we"? Windows and Mac use UTF-16, while Linux and the web use the vastly superior UTF-8. Internally, assuming you're in a language that supports it like C++, you can actually use any encoding you want - it just means you need to transcode strings at API boundaries. You'd have to do this for one or more of your target platforms anyhow if you're writing cross-platform code (all three major PC OSes).

      A lot of Windows programmers think "Unicode == UTF-16", which is not the case at all. In my own applications, I use UTF-8 as the native format, even on Windows and Mac. When I need to render glyphs (I write games, so I have my own low-level bitmap-based glyph rendering system), I convert them to UTF-32 code points for simple mapping. If you want to, nothing is stopping you from using UTF-32 internally as you seem to prefer, but I've found there's really no need, because you can always convert between formats on the fly as needed.
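
A rough sketch of the "UTF-8 internally, code points for glyph lookup" approach might look like this in Python (illustrative only, not the poster's engine). `GLYPH_TABLE`, `MISSING_GLYPH`, and `glyphs_for` are hypothetical names, and the string values stand in for real bitmap data.

```python
# Hypothetical glyph table keyed by code point (the UTF-32 value of a character).
GLYPH_TABLE = {
    0x41: "glyph_A",
    0xE9: "glyph_e_acute",
}
MISSING_GLYPH = "glyph_missing"   # fallback, like the box shown for unknown characters


def glyphs_for(utf8_bytes):
    """Decode UTF-8 and map each code point to a glyph, one lookup per code point."""
    text = utf8_bytes.decode("utf-8")
    # Iterating a Python str yields whole code points, so ord() gives the
    # same integer a UTF-32 code unit would hold.
    return [GLYPH_TABLE.get(ord(ch), MISSING_GLYPH) for ch in text]


print(glyphs_for("A\u00e9\U0001F0A1".encode("utf-8")))
```

A one-glyph-per-code-point mapping like this ignores combining marks and complex scripts, which need a shaping step that is out of scope here.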

      --
      Irony: Agile development has too much intertia to be abandoned now.
  16. Oh yes, of course. by Anonymous Coward · · Score: 0

    So it's hard to use when it promises to be a single easy-to-use solve-all. And of course that's the fault of "shitty programmers".

    I say the shittiness starts with the unicode committee.

  17. Unicode can go fuck themselves by Anonymous Coward · · Score: 1

    They lost any and all respectability when they let the emoji cancer in. To hell with them.

    1. Re:Unicode can go fuck themselves by Anonymous Coward · · Score: 0

      Support this. There are now so many different emoji characters that you need a translation chart just to figure out what most of them mean. I thought emoji were a way of translating simple emotions (that everyone can understand without having to think about it) into simple symbols (that everyone can understand without having to think about it) to avoid having to type a lot of text.

      This idea was probably too Western-civilisation-centric in the first place, because different facial expressions can be interpreted very differently in different places. But /smile/ :-) and /sadface/ :-( are probably universally understood.

      Even in ASCII emojis started to proliferate more than was good for communication, but that was usually done in sub-cultures where the novel codes were easy to decode for that subculture's members.

      Additionally, introducing the same emoji in different colors was a really stupid idea (I'm looking at you, Apple, and you, UC) - they are just representations of text. Text is whatever color you want it to be.

      If people wanted them in their own skin tone, there should just be a setting in app preferences where you can pick a color for your symbols.

    2. Re:Unicode can go fuck themselves by unixisc · · Score: 1

      Emoji is the equivalent of ICANN's TLDs - too many of them just smothering a limited resource. There is no way TLDs can be adequately supported within IPv4, and IPv6 is by no means as widely adopted to justify being able to support this. Emojis are different b/w the platforms - iOS emojis can't be read on Android, whose emojis can't be read on Windows Phone.... And I agree on the skin tone emojis - that's a really stupid one. And sports - why are there symbols for just some sports (soccer, baseball,...) but not others (volleyball, cricket...)?

    3. Re:Unicode can go fuck themselves by oggiejnr · · Score: 1

      There was a reason for it. It was to allow for interop with Japanese text messaging systems.
      https://www.youtube.com/watch?v=tITwM5GDIAI

    4. Re:Unicode can go fuck themselves by pjt33 · · Score: 1

      You may or may not be pleased to know that this latest release of Unicode adds glyphs for volleyball and cricket.

    5. Re:Unicode can go fuck themselves by Anonymous Coward · · Score: 0

      You may or may not be pleased to know that this latest release of Unicode adds glyphs for volleyball and cricket.

      Hah!
      It's almost like the consortium just bought him off by careful planning
      or time travel. Bravo!

  18. Why we need multiple languages by prefec2 · · Score: 1

    Humans developed different languages in different regions. Now we have different languages with different features and different cultural ties. While it is often possible to translate the semantics of one language into an equivalent in another language, you have more trouble doing so with pragmatics. And in addition the result does not "taste" as good as the original. It is a little bit like food. You could just consume a nutritious supplement to sustain life. However, all the culture and tastes and emotions around food would be wasted. Even as a US-American you are aware that there are different feelings and moods attached to, let's say, porridge, a steak, a burger, a donut, a beer, Chinese take-out, pizza, corn etc.

    Recent studies showed that we even have different personalities depending on what language we are using. So it would be great to be only able to speak, read, and listen to one single language. And if we should agree on one: are you willing to learn Chinese?

    1. Re:Why we need multiple languages by Anonymous Coward · · Score: 0

      So it would be great to be only able to speak, read, and listen to one single language. And if we should agree on one: are you willing to learn Chinese?

      Chinese isn't a good choice because it's tonal and because of its writing system.

      English and/or Spanish are actually pretty good choices. Since English is the de-facto international standard, we might as well stick with it. What makes English hard is mostly its phonetics, but a "global standard English" could simplify that.

      (NB: I'm not a native English speaker.)

      Recent studies showed that we even have different personalities depending on what language we are using.

      The languages aren't the cause of the different personalities; they simply serve as markers for the context.

      Support for the Sapir-Whorf hypothesis is weak; except for practical considerations (difficulty of learning the language, size of vocabulary, etc.), it doesn't matter that much which language you use.

    2. Re:Why we need multiple languages by OrangeTide · · Score: 1

      Cyberspace is a single region.

      I'm willing to learn a new language if a reasonable proposal was put forward. Chinese is not a reasonable language for a cyberspace culture though. Russian might be OK, as would Greek, but Swahili seems like a good option, Korean would probably work fine as well. Japanese would probably be a terrible choice, and they would likely not appreciate a lot of foreigners contributing to alterations of their language.

      --
      “Common sense is not so common.” — Voltaire
  19. Touché by pjt33 · · Score: 1

    I would have said that even English requires diacritics to support some of its loanwords.

    grep -E "[éóèâêûäöñç]" /usr/share/dict/words | grep -v "[A-Z]" | wc
            174 174 1720

    1. Re:Touché by fnj · · Score: 1

      Odd; that command only works for me if I replace the second grep with egrep. I wonder why.

    2. Re:Touché by fnj · · Score: 1

      Hilarious; I had grep aliased to "grep -i --color=auto" because the idiots have deprecated GREP_OPTIONS. A lesson in unexpected interaction (-i and -v in this case).

      I fixed my alias. Thank you for leading me to find my bad practice.
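
The -i and -v interaction described above can be reproduced with a small Python sketch. Python's `re` module is used here as a stand-in for grep (real grep behaviour also depends on the locale), and the word list and the helper name `keep` are assumptions made up for the example.

```python
import re

words = ["café", "Zoë", "naïve", "123"]


def keep(word, ignore_case):
    """Mimic `grep -v "[A-Z]"`: keep a line only if the pattern does NOT match."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search("[A-Z]", word, flags) is None


# Without -i: only words containing an uppercase ASCII letter are dropped.
print([w for w in words if keep(w, ignore_case=False)])

# With -i: [A-Z] also matches a-z, so every word with an ASCII letter is dropped.
print([w for w in words if keep(w, ignore_case=True)])
```

With the case-insensitive flag, the filter meant to remove proper nouns removes nearly everything, which matches the empty result an aliased `grep -i` would give in the pipeline.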

    3. Re:Touché by Anonymous Coward · · Score: 0

      I get the same result as you, and it does not even include "déjà vu" for example, or "naïve" (in this case both spellings are allowed).

    4. Re:Touché by Hognoxious · · Score: 1

      That's because he missed out some characters from his pattern - there's no u with an umlaut either. As to u with a circumflex, I don't think I've ever seen that.

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
    5. Re:Touché by pjt33 · · Score: 1

      No, it's because those words aren't in /usr/share/dict/words. I started with a much larger list of characters, and filtered it down before posting to just those which actually matched something. The û is for croûton, croûton's, croûtons.

  20. Getting carried away? by bradley13 · · Score: 1

    I know bits are cheap, but... really? Font designers have to actually implement the characters - specifying hundreds of clipart characters seems kind of ridiculous. Design by committee, where no one ever says "no".

    Unicode is beginning to remind me too much of CSS3, where they let the specification blow up beyond all reason - making it essentially impossible for anyone to ever have a fully compliant implementation.

    --
    Enjoy life! This is not a dress rehearsal.
  21. How about Wenlin's CDL ? by Anonymous Coward · · Score: 0

    Is it better at CJK ?

  22. Cool... by Red_Chaos1 · · Score: 1

    ...even more "dominoes" to show up on my screen because the OS/applications can't render/display Unicode properly to save their lives.

    1. Re:Cool... by Anonymous Coward · · Score: 1

      Take your device over to http://www.alanwood.net/unicode and click around. I have Android 4.4.2 and a ton of the reference charts show seemingly inefficient repeats of glyphs, as well as reveal the fact that LG (or AOSP? --or someone involved with my default font, NewSmartGothic?) took the liberty of colorizing SOME emoji, and converting some entries to media-player-app-friendly button-faceplate glyphs. There are some similar arrow fonts where you can see bad size differences between "similar" arrow characters. I am sure someone out there writing technical or scientific docs is miffed after seeing his monochromatic PDF get childish spurts of blue, gray or pink from unexpected system-level font substitutions.
      And while creating the "pile of poo" character was not a good idea, some new systems even give it eyes and a happy smile. Good is bad and bad is good. I wonder who these artists are, getting commissioned to fontify concepts into scalable glyphs for thousands and thousands of nonessential ideas that can be covered by a normal emoticon, some creative stacking of emoji, or even words.

      Before rambling more, I'll mention the coming mess that is chromatic fonts - https://lwn.net/Articles/564944/ (encoding color/gradient info per letter IIRC)
      You already have support since Firefox 26. Enjoy! https://people.mozilla.org/~jkew/opentype-svg/GeckoEmoji.html

  23. It's for historical purposes by drnb · · Score: 1

    ... for practical purposes over 9000 languages is a bit much ...

    But for historical purposes it makes sense. Shouldn't we be digitizing as much of antiquity and vanishing cultures/languages as possible? Note that it's a pretty bad time for the physical preservation of antiquities in the cradle of civilization right now.

    It would be helpful to academics to have such languages in a textual format not merely an image format.

  24. ENGLISH uber alles by unixisc · · Score: 1

    Not a bad idea - abolishing every other language in the world - Chinese, Spanish, Arabic, Russian, Hindi, Urdu, Bengali, Swahili, Portuguese and the whole bunch of them. Just have ENGLISH - that too, the US one, and nothing else!!! Let everyone, including the Brits and Canucks, have to adjust - some more than others.

  25. Religions & war by unixisc · · Score: 1

    Which religions gave us WW1, WW2, Vietnam, the Cold War, the Korean war, and the Opium Wars again?

    While those may be the biggest recent wars, they are by no means the only wars in history. There were the Muslim conquests of everything from Spain to India b/w the 7th and 10th centuries, which obliterated Christianity, Zoroastrianism, Animism, Buddhism and Hinduism from a lot of the territories they conquered. There were the Conquistadors, who overran the Aztec, Mayan & Inca empires and replaced them w/ the Spanish Inquisition. There was the Thirty Years War, fought to determine whether Central Europe should be Catholic or Lutheran dominated. And today, there is the global Muslim campaign to destroy as much as possible of non-Muslim countries and subvert them until they become Islamic - that's the underpinnings of the campaigns of al Qaeda, ISIS, Hizbullah, Muslim Brotherhood and so on. Also, if one considers Communism a 'religion', which it is except that it substitutes some imaginary friends w/ dead friends, then you have the entire Soviet Purges, the Chinese Cultural Revolution and Pol Pot's holocaust in Cambodia to add to the mix.

  26. The (5:erocS) problem by tepples · · Score: 1

    Thank you for ninjaing me. I often chime in about this issue when someone complains about Slashdot's lack of support for Unicode. Most of the time, after I explain the code point whitelist and the reason for it, someone complains that a blacklist of dangerous code points would work better. My usual reply is that new versions of Unicode may insert new control code points that get activated before the Slashdot admins have the chance to add them to the blacklist. And besides, many characters outside the current whitelist are far more useful for what used to be called "ASCII art" than for readable text in the English language. For example, Oriya letter ii (U+0B08) looks to English speakers more like the head of a Smurf. And ASCII Goatse and ASCII Jack Off are why Slashdot had to add a lameness filter in the first place.

    But apparently, Slashdot doesn't strip bad characters on display, only on post. This post, for example, still contains a bidirectionality override.
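
A code point whitelist of the kind described can be sketched as below. This is not Slashdot's actual filter: the printable-ASCII range is an assumption based on the "0x20 to 0x7F" remark earlier in the thread, the list of bidirectional controls is deliberately partial, and the names `ALLOWED`, `BIDI_CONTROLS`, `strip_to_whitelist`, and `has_bidi_control` are hypothetical. Python is used for illustration.

```python
# Assumed whitelist: printable ASCII only (U+0020 through U+007E).
ALLOWED = range(0x20, 0x7F)

# Explicit bidirectional embedding, override, and isolate controls
# (U+202A..U+202E and U+2066..U+2069). Not an exhaustive list of risky characters.
BIDI_CONTROLS = set(range(0x202A, 0x202F)) | set(range(0x2066, 0x206A))


def strip_to_whitelist(text):
    """Drop every character whose code point is outside the whitelist."""
    return "".join(ch for ch in text if ord(ch) in ALLOWED)


def has_bidi_control(text):
    """Report whether the text contains an explicit bidi control character."""
    return any(ord(ch) in BIDI_CONTROLS for ch in text)


# A title with a right-to-left override (U+202E) around "Score:5".
title = "The (\u202eScore:5\u202c) problem"
print(has_bidi_control(title))
print(strip_to_whitelist(title))
```

Because a whitelist names what is allowed rather than what is forbidden, control characters added in a later Unicode version are rejected without any change to the filter. As written, this sketch would also strip newlines and tabs, so a real filter would need to decide how to treat them, and it would have to run on display as well as on post to catch already-stored text.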

  27. Korean is not an ideographic language by Anonymous Coward · · Score: 0

    While Unicode lumps Korean together with Chinese and Japanese ("CJK"), Korean has an alphabet. It is not an ideographic language like Chinese. https://en.wikipedia.org/wiki/Hangul.
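
The alphabetic nature of Hangul is visible in Unicode itself: a precomposed syllable decomposes into its letters (jamo) under normalization form NFD. A minimal Python sketch, with the sample syllable chosen arbitrarily for the example:

```python
import unicodedata

syllable = "\ud55c"                                  # the single syllable block "han"
jamo = unicodedata.normalize("NFD", syllable)        # split into individual letters

print(unicodedata.name(syllable))
for letter in jamo:
    print(f"U+{ord(letter):04X}", unicodedata.name(letter))

# Recomposition is lossless: NFC turns the letters back into one syllable block.
assert unicodedata.normalize("NFC", jamo) == syllable
```

The one syllable block becomes three letters (an initial consonant, a vowel, and a final consonant), which is not possible for a Chinese ideograph. Korean text can still contain Hanja, which do live in the unified CJK ideograph blocks.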

  28. Tower of Babel by tepples · · Score: 1

    Sorry, why do we need multiple languages again?

    Originally, to punish ancient Babylonians for trying to build a dangerously tall ziggurat. Since then, to preserve access to oral tradition.

  29. Prince Rogers Nelson by tepples · · Score: 1

    Both of this musician's names can be represented in ASCII: "Prince Rogers Nelson" and "O(+>".

  30. Lack of arrows, STILL by petteyg359 · · Score: 1

    So yet another major version number and they still haven't bothered to add the many arrow (and other directional) symbols that have been missing...