Slashdot Mirror


Unicode 7.0 Released, Supporting 23 New Scripts

An anonymous reader writes "The newest major version of the Unicode Standard was released today, adding 2,834 new characters, including two new currency symbols and 250 emoji. The inclusion of 23 new scripts is the largest addition of writing systems to Unicode since version 1.0 was published with Unicode's original 24 scripts. Among the new scripts are Linear A, Grantha, Siddham, Mende Kikakui, and the first shorthand encoded in Unicode, Duployan."

21 of 108 comments

  1. Seriously? by newsman220 · · Score: 4, Funny

    Still no Klingon?

  2. Pictographic symbols by toejam13 · · Score: 2

    Good. If you do a search of Wingdings on Google, many of the top results are questions on how to use the font with browsers other than IE. Since it isn't a Unicode compliant font, you can't. This update helps correct that problem.

    1. Re:Pictographic symbols by narcc · · Score: 2

      Used all-over?

  3. Why emoji? by Anonymous Coward · · Score: 2, Insightful

    What's the point of adding pictographic symbols to Unicode? Is this really something we want frozen in time for eternity? What's the benefit of standardizing them anyway?

    Wouldn't we be better off standardizing all characters used in written language and be done with it?

    1. Re:Why emoji? by RyuuzakiTetsuya · · Score: 4, Insightful

      Not everyone speaks English or Chinese or Spanish.

      Everyone recognizes stop sign, airport, pile of poop and other symbols. So communicating via pictographs is actually good. Even if it was incidental.

      --
      Non impediti ratione cogitationus.
    2. Re:Why emoji? by Guy+Harris · · Score: 3, Informative

      Not everyone speaks English or Chinese or Spanish.

      Everyone recognizes stop sign, airport, pile of poop and other symbols. So communicating via pictographs is actually good. Even if it was incidental.

      And many of them recognize this as well.

    3. Re:Why emoji? by BitZtream · · Score: 5, Interesting

      But they're not "standard" even if Unicode claims they are.

      They are standard in reference to Unicode because the Unicode Consortium defines the Unicode standard. Someone has to be the first to define the standard.

      but there is no central body that dictates exactly what they look like, so that pile of poop symbol will vary depending upon which texting app you use it with

      Yes, those are called fonts, and in case you haven't noticed, that was true before digital computers with silicon microprocessors even existed and has been true for thousands of years.

      The apps that use emojis are not coordinating with any standards body or ensuring that the intended meaning is preserved.

      Apple does, which is why the Messages app already matches the new code points. Google Hangouts seems to work fine as well. Both Messages and Hangouts convert even things like :) into the proper Unicode code point and use standard fonts for display. Sure, some half-assed apps may not work correctly, but anyone who supports Unicode and has fonts will receive them properly already.

      Emoji is somewhat silly, but it's hardly new, just go ask Japan. Just because you're new to the ballgame doesn't mean it's a new ballgame.

      --
      Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
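A minimal sketch of the emoticon substitution described in the comment above. The mapping table and the function name `replace_emoticons` are assumptions made up for illustration; they are not the table that Messages or Hangouts actually uses.

```python
# Hypothetical mapping from ASCII emoticons to Unicode code points.
# The pairings are illustrative assumptions, not any vendor's real table.
EMOTICONS = {
    ":)": "\U0001F642",  # SLIGHTLY SMILING FACE (added in Unicode 7.0)
    ":(": "\U0001F641",  # SLIGHTLY FROWNING FACE (added in Unicode 7.0)
    ";)": "\U0001F609",  # WINKING FACE
}


def replace_emoticons(text: str) -> str:
    """Replace each known ASCII emoticon with its single code point."""
    for ascii_form, code_point in EMOTICONS.items():
        text = text.replace(ascii_form, code_point)
    return text


message = replace_emoticons("see you soon :)")
```

Once the text holds a real code point, how it looks is left to whatever font the receiving device has, which is the point being argued here.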
  4. Re:Linear A? by Livius · · Score: 4, Insightful

    There are a few, and researchers and historians would like to have them on computer.

  5. Re:Klingon in more useful by LordLucless · · Score: 2

    but there just aren't enough extant samples to justify adding it to Unicode, and nobody can translate it.

    Unicode is supposed to be universal, and it has more than enough codepoints to spare - why is there a problem adding it? I'm sure having it in a standard encoding would prove useful to anyone who is trying to translate Linear A, or to archeologists/historians looking to digitize fragments we do have, etc.

    --
    Just because you're paranoid doesn't mean there isn't an invisible demon about to eat your face
  6. less useful how? Re:The larger, the less useful by Fubari · · Score: 4, Interesting

    Fragmented? I haven't heard of any Unicode forks. The people at the Unicode Consortium seem like they're doing OK. Unicode seems pretty backwards compatible; have any of the newer versions overwritten or changed the meaning of older versions (e.g. caused damage)? That isn't true for the various extended-ASCII encodings, which are an i18n abomination in the high-bit characters. Or for EBCDIC, which isn't self-compatible. One of the things I love about Unicode is that the characters (glyphs) stay where you put them, and don't transmute depending on what locale a program happens to run in.

    The larger Unicode becomes, the more fragmented the implementations will be.

    Maybe instead of fragmented, you mean there won't be font sets that can render all of Unicode's characters?
    *shrug* Even if that were a problem, the underlying data is intact and undamaged and will be viewable once a suitable font library is obtained.

    The more fragmented it is, the more errors and incompatibilities will compound. It will get less and less useful, and more and more bulky, and will eventually be as useful as Flash. (Well, it may not be that bad, but still, Flash was all things to all people, and almost universally installed, until it wasn't.)

    Can you give me an example of an incompatibility? I'm not saying there are none, just that I don't know of anything and that, in general, I've been very pleased with Unicode's stability - compared to other encodings - for doing data exchange.
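A small sketch of the "transmuting" high-bit characters mentioned in this comment, assuming only the codecs that ship with Python. The byte value 0xE9 is an arbitrary choice for illustration.

```python
# One byte, three legacy 8-bit encodings, three different characters.
raw = bytes([0xE9])
legacy = {name: raw.decode(name) for name in ("latin-1", "cp437", "koi8_r")}
# latin-1 -> U+00E9 (e with acute)
# cp437   -> U+0398 (Greek capital theta)
# koi8_r  -> U+0418 (Cyrillic capital I)

# As Unicode text the character has one fixed code point, and UTF-8
# round-trips it without reference to any locale.
e_acute = "\u00e9"
round_trip = e_acute.encode("utf-8").decode("utf-8")
```

The legacy bytes only mean something once you know which code page produced them; the Unicode code point carries its identity with it.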

    1. Re:less useful how? Re:The larger, the less useful by BetterThanCaesar · · Score: 2

      Unicode seems pretty backwards compatible; have any of the newer versions overwritten or changed the meaning of older versions (e.g. caused damage)?

      Yes. Version 2.0 completely changed the Hangul character set. Korean texts written with Unicode 1.1 were not readable in Unicode 2.0, and vice versa. This was 17 years ago, but note that it was after ISO had accepted version 1.1 as an ISO/IEC standard.

      --
      "Stop failing the Turing test!" -- Dilbert
    2. Re:less useful how? Re:The larger, the less useful by AmiMoJo · · Score: 4, Interesting

      The main problem is the broken CJK (Chinese, Japanese, Korean) support that has caused numerous ad hoc workarounds and hacks to be developed. In a nutshell, all three languages shared some common characters in the past, but over time they diverged. Unfortunately these characters share the same code points in Unicode, even though they are rendered differently depending on the language. A Japanese and a Chinese font will contain different glyphs for the same character.

      It is therefore impossible to mix Chinese and Japanese in the same plain text document. You need extra metadata to tell the editor which parts need Chinese characters and which need Japanese. There are Japanese bands that release songs with Chinese lyrics and vice versa, and books that contain both (e.g. textbooks, dictionaries). Unicode is unable to encode this data adequately.

      Even the web is somewhat broken because of this. If a random web page says it is encoded with Unicode there is no simple way for the browser to choose a Japanese, Korean or Chinese font, and all the major ones just use whatever the user's default is.

      It really isn't clear how this can be fixed now. Unicode could split the code pages but a lot of existing software will carry on using the old ones. It's a bit of a disaster, but most westerners don't seem to be aware of it.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
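A sketch of the situation described in the comment above, using U+76F4, which is often cited as an ideograph whose preferred glyph differs between Chinese and Japanese fonts. The helper `tag_lang` is hypothetical. It relies on the fact that HTML's `lang` attribute may be set on individual elements, which is one way to supply the out-of-band metadata that plain text lacks.

```python
import html
import unicodedata

# One unified ideograph; the code point is the same for every language.
ch = "\u76f4"

# Plain text carries only the code point. Nothing in it says "Japanese"
# or "Chinese", so the bytes are identical either way.
name = unicodedata.name(ch)
encoded = ch.encode("utf-8")


def tag_lang(text: str, lang: str) -> str:
    """Wrap text in an HTML span that carries a language tag (hypothetical helper)."""
    return f'<span lang="{html.escape(lang, quote=True)}">{html.escape(text)}</span>'


# The same character, marked up once as Japanese and once as Simplified Chinese.
mixed = tag_lang(ch, "ja") + " / " + tag_lang(ch, "zh-Hans")
```

Whether a given renderer then picks a different font for each span depends on the renderer and the fonts installed; the markup only makes the distinction available.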
  7. On Earth, Klingon is written in Latin by tepples · · Score: 2

    First I'll assume that you're talking about the KLI pIqaD for tlhIngan Hol, and not the Skybox pIqaD or the Mandel script. The Unicode team looked at encoding KLI pIqaD but decided against it because the Klingon-speaking community on Earth had already adopted a Latin-based script. (Reference: Klingon alphabets on Wikipedia) But it could use a slight spelling reform to make it case-insensitive.

  8. Peso vs. Dollar by steelfood · · Score: 2

    It's great they're adding new currency symbols for new currencies, but there's still a long-standing issue of the $ with one bar and $ with two bars. It's currently still considered a stylistic difference, but the scope of Unicode has evolved to account for every glyph known to man. Certainly, one- and two-bar $ can hardly be said to be the same glyph within this new context.

    Especially considering that there are already stylistic duplicates (half-width and full-width Latin forms vs. plain Latin), I can't understand the justification for leaving one- and two-bar $, which are historically separate glyphs, underrepresented.

    --
    "If a nation expects to be ignorant and free in a state of civilization, it expects what never was and never will be."
    1. Re:Peso vs. Dollar by lithis · · Score: 4, Informative

      Many of the stylistic duplicates, for example the half-width and full-width Latin forms that you mentioned, are only in Unicode because of backwards compatibility with pre-Unicode character sets. If there hadn't been character sets that had different encodings for half- and full-width forms, Unicode never would have had them either. So you can't use them to argue for more glyph variations in Unicode. The same applies to many of the formatted numbers, such as the Unicode characters "VII" (U+2166), "7." (U+248E), "(7)" (U+247A), and "1/7" (U+2150), and units of measure ("cm^2", U+33A0).

      (Oh, for Unicode support in Slashdot....)
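The compatibility status of the characters listed above can be checked with Python's standard `unicodedata` module. This is a sketch; U+FF21 (a full-width Latin capital A) is added here as an example of the full-width forms and is not one of the code points the comment names.

```python
import unicodedata

# The code points named above, plus one full-width Latin letter (U+FF21).
samples = ["\u2166", "\u248e", "\u247a", "\u2150", "\u33a0", "\uff21"]

for ch in samples:
    # decomposition() begins with a tag such as <compat>, <wide>, <square>
    # or <fraction> when the mapping is a compatibility one.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch),
          unicodedata.decomposition(ch),
          repr(unicodedata.normalize("NFKC", ch)))

# NFKC folds each compatibility character into its plain equivalent.
folded = [unicodedata.normalize("NFKC", ch) for ch in samples]
```

NFKC normalization maps each of these onto ordinary characters, which reflects their status as compatibility duplicates and not independent letters.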

  9. Latin unification too by tepples · · Score: 2

    True, some characters have forms that differ between traditional Chinese and Japanese. But that's not limited to Chinese and Japanese, as Unicode also has Latin unification. For example, the letter "i" is the same whether in English or Turkish, but its capital form differs between the two languages. And in Dutch, the letter 'y' with umlaut/diaeresis is supposed to be written using the rounded form, as it's considered a ligature of "ij". Implementations are supposed to define out-of-band language markers, such as HTML's lang= attribute, to handle this.
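A sketch of the Turkish "i" case mentioned above. Python's `str.upper()` takes no language argument, so the language has to arrive out of band; the helper `upper_for` and its `lang` parameter are hypothetical and cover only this one rule.

```python
def upper_for(text: str, lang: str) -> str:
    """Uppercase text, applying the Turkish dotted-i rule when lang is 'tr'.

    Hypothetical helper: the language is passed explicitly because the
    code point for "i" is the same in English and Turkish.
    """
    if lang == "tr":
        # Turkish: i -> U+0130 (capital I with dot above).
        # Dotless U+0131 already uppercases to plain I via str.upper().
        text = text.replace("i", "\u0130")
    return text.upper()


english = upper_for("istanbul", "en")
turkish = upper_for("istanbul", "tr")
```

The input string is identical in both calls; only the language marker changes the result.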

    1. Re:Latin unification too by AmiMoJo · · Score: 2

      The problem with unification is that metadata is often either unavailable or inadequate. The goal should be to represent all characters in plain text, not rely on specific document formats to provide context.

      How would a music player app handle a file tagged with a unified character? How would a file manager handle it? There is no context, no metadata to tell it what language is in use and what font to select. Anyone who uses both Japanese and Chinese can tell you this is a common problem, and I imagine Dutch people get it too.

      Even in HTML you only get to set one language for the entire document. Good luck writing a page in Chinese about learning Japanese. The ones I have seen tend to use GIFs to represent the characters that Unicode can't differentiate, but that means you can't copy/paste them and the fonts don't match.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
  10. Re:Klingon in more useful by lithis · · Score: 2

    There is already at least one effort to extend Unicode beyond the current maximum of 1.1 million characters: The UCS-X Family of UCS Extensions. It defines UCS-G, which supports over two billion characters, UCS-E with over nine quintillion, and UCS-Infinity with no upper bound. They each support 8-, 16-, and 32-bit variable-byte encodings (e.g. UTF-E-32, UTF-Infinity-8). It's been a while since I read about them, but I believe they are all compatible with UTF-8, UTF-16, and UTF-32.
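For reference, the "1.1 million" figure is the size of the current code space, 17 planes of 65,536 code points each (this count includes surrogates and noncharacters). A short sketch using only the Python standard library:

```python
import sys

PLANES = 17
CODE_POINTS_PER_PLANE = 2 ** 16
code_space = PLANES * CODE_POINTS_PER_PLANE  # 1,114,112 code points

# The last code point, U+10FFFF, takes four bytes in each encoding form.
last = chr(0x10FFFF)
widths = {
    "utf-8": len(last.encode("utf-8")),
    "utf-16-le": len(last.encode("utf-16-le")),  # one surrogate pair
    "utf-32-le": len(last.encode("utf-32-le")),
}

# Anything past U+10FFFF is outside the code space.
try:
    chr(0x110000)
    beyond_ok = True
except ValueError:
    beyond_ok = False
```

The ceiling comes from UTF-16, whose surrogate pairs cannot address anything above U+10FFFF, so an extension scheme would need a different story for 16-bit encodings.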

  11. Shit in one font vs. shit in another font by tepples · · Score: 2

    that pile of poop symbol will vary depending upon which texting app you use it with

    So will any symbol. Though an A set in three different typefaces probably produces three distinct glyphs on your machine, you can recognize them all as U+0041 LATIN CAPITAL LETTER A. Likewise, though U+1F4A9 appears different in different fonts, it'll look like shit in all of them.
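The character identities referred to above are fixed by the standard and can be looked up independently of any font. A minimal sketch with Python's `unicodedata`:

```python
import unicodedata

# Look up the standardized name of each code point; no font is involved.
names = {f"U+{ord(ch):04X}": unicodedata.name(ch) for ch in ("A", "\U0001F4A9")}

for code_point, name in names.items():
    print(code_point, name)
```

The name and code point are what Unicode standardizes; the glyph is left to the font.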

  12. Proprietary fonts by ortholattice · · Score: 5, Insightful

    Over the years, I've tried to use Unicode for math symbols on various web pages and tend to revert back to GIFs or LaTeX-generating tools due to problems with symbols missing from the font used by this or that browser/OS combination, or even incorrect symbols in some cases.

    IMO the biggest problem with Unicode is the lack of a public domain reference font. Instead, it is a mishmash of proprietary fonts each of which only partly implements the spec. Even the Unicode spec itself uses proprietary fonts from various sources and thus cannot be freely reproduced (it says so right in the spec), a terrible idea for a supposed "standard".

    I'd love to see a plain, unadorned public-domain reference font that incorporates all defined characters - indeed, it would seem to me to be the responsibility of the Unicode Standard committee to provide such a font. Then others can use it as a basis for their own fancy proprietary font variations, and I would have a reliable font I could revert to when necessary.

  13. Emoji? by bradley13 · · Score: 2

    Great, Unicode is already a fragmented mess, and now the standards organization justifies its existence by adding characters that do not exist.

    An earlier poster asked why anyone thinks Unicode is fragmented. The answer in one word: fonts. Different fonts support different subsets of Unicode, because the whole thing is just too big. If you expect your font to mostly be used in Europe, you are unlikely to bother with Asian characters. If you have an Asian font, it probably has only English characters, not the rest of Europe. If you have a font with complete mathematical symbols, it will include the Greek alphabet, but actual language support is a crapshoot.

    So the solution to this problem is to add made-up characters that no one cares about. "Man in business suit, levitating". Really?

    --
    Enjoy life! This is not a dress rehearsal.