Slashdot Mirror


New Unicode Bug Discovered For Common Japanese Character "No"

AmiMoJo writes: Some users have noticed a rendering problem with the Japanese character "no", which is extremely common in the Japanese language (forming parts of many words, or meaning something similar to the English word "of" on its own). The Unicode standard has apparently marked the character as sometimes being used in mathematical formulae, causing it to be rendered in a different font from the surrounding text in certain applications. Similar but more widespread issues have plagued Unicode for decades due to the decision to unify dissimilar characters in Chinese, Japanese and Korean.

22 of 196 comments

  1. What bug? by Ark42 · · Score: 4, Informative

    The character in question is Hiragana "No", codepoint U+306E. As far as I can tell, this has existed since Unicode 1.1 and there are no differences in the Unicode metadata when compared to any other Hiragana glyph. It is marked as IsAlphabetic=True, Category=Other Letter, and NumericType=None, for example. So are all the other common Hiragana glyphs. If there is a bug, it's clearly with some specific application, and not Unicode or Unicode metadata. Compare http://www.fileformat.info/inf... with any other Hiragana glyph, like http://www.fileformat.info/inf... (Hiragana "Ha").
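
    The comparison described above can be reproduced locally. This is a minimal sketch using Python's standard unicodedata module; the helper name describe and the choice of properties are illustrative assumptions, and the values reflect whatever Unicode database ships with the interpreter.

```python
import unicodedata


def describe(ch):
    """Collect a few Unicode properties of a single character."""
    return {
        "codepoint": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),       # "Lo" means Letter, other
        "numeric": unicodedata.numeric(ch, None),   # None when there is no numeric value
        "is_alpha": ch.isalpha(),
    }


no = describe("\u306e")  # Hiragana "No"
ha = describe("\u306f")  # Hiragana "Ha"

# For each compared property, record whether the two letters agree.
shared = {key: no[key] == ha[key] for key in ("category", "numeric", "is_alpha")}

if __name__ == "__main__":
    print(no)
    print(ha)
    print(shared)
```

    If every entry in shared is True, the two letters carry the same values for these properties, which is consistent with the point that nothing in this metadata singles out "No".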

    1. Re:What bug? by AmiMoJo · · Score: 2, Interesting

      The bug is that the Japanese, Chinese, Korean and mathematical versions of this character all share a common code point. There is no reliable way for an application to select the right character and render it properly.

      You can't mix C/J/K and mathematics in Unicode, which is a new bug beyond just the failure to support mixing C/J/K.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
  2. Re:Is it the same as in Chinese? by Chris Mattern · · Score: 4, Funny

    Like æ or ÃüY

    If so, seems many Chinese websites will have problems too, because it's used so often in Chinese.

    As you have just discovered, Slashdot cleverly avoids all Unicode bugs by not supporting Unicode at all.

  3. Re:Bug? by AmiMoJo · · Score: 2

    It's a Unicode bug. Unicode tries to merge different characters into a single code point, because long ago they had the same origin. This particular character exists in Japanese, Chinese, Korean and mathematics, so can be rendered four different ways, but they all share one code point.

    Applications have to guess what font to use. Being a mathematical program, this one defaults to the system language (Japanese) but has logic to detect this "no" character and render it in a different font. It isn't clever enough to notice that the rest of the sentence is Japanese, but it shouldn't have to be.
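
    Noticing that the surrounding text is Japanese does not have to be sophisticated in the easy cases. The following is a rough sketch of a script-based guess, not what the application in question does; the function name guess_cjk_language and the returned tags are made up for illustration, and text containing only Han characters stays ambiguous.

```python
def guess_cjk_language(text, fallback="und"):
    """Guess a language tag from the scripts present in text (rough heuristic)."""
    has_kana = has_hangul = has_han = False
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:      # Hiragana and Katakana blocks
            has_kana = True
        elif 0xAC00 <= cp <= 0xD7A3:    # precomposed Hangul syllables
            has_hangul = True
        elif 0x4E00 <= cp <= 0x9FFF:    # main CJK Unified Ideographs block
            has_han = True
    if has_kana:
        return "ja"             # kana is only used to write Japanese
    if has_hangul:
        return "ko"
    if has_han:
        return "han-ambiguous"  # Han alone does not identify the language
    return fallback


if __name__ == "__main__":
    for sample in ("私の本", "한국어", "中文", "hello"):
        print(sample, guess_cjk_language(sample))
```

    A sentence containing "no" (U+306E) falls in the kana range, so this sketch would label it Japanese; the hard case is short text made only of Han characters.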

  4. Re: Why not just use English, and only English? by Anonymous Coward · · Score: 2, Insightful

    There are more native Chinese speakers than English speakers. How about you learn Chinese and shut the fuck up?

  5. Nitpick by msobkow · · Score: 5, Informative

    This is not a "Unicode bug". It is a rendering bug exhibited by some applications.

    --
    I do not fail; I succeed at finding out what does not work.
    1. Re:Nitpick by AmiMoJo · · Score: 2

      How is an application supposed to know if a random character is Japanese, Chinese, Korean or mathematical? It would need some kind of strong AI to interpret and understand the text. It's a Unicode bug: merged characters are impossible to render correctly all the time because apps are forced to guess which font to use.

    2. Re:Nitpick by AmiMoJo · · Score: 2

      Software that doesn't have this bug only avoids it by not supporting mathematical symbols. So far there is no known software that avoids the CJK confusion problem either.

      Most software doesn't even try. How many programmers are even aware of the issue? No Unicode library is immune. It's a problem with the standard that can only be fixed by starting fresh with about 150,000 new CJK characters, and then updating all fonts and libraries to handle translation and equivalence.

    3. Re:Nitpick by Kjella · · Score: 3, Informative

      How is an application supposed to know if a random character is Japanese, Chinese, Korean or mathematical? It would need some kind of strong AI to interpret and understand the text. It's a Unicode bug: merged characters are impossible to render correctly all the time because apps are forced to guess which font to use.

      Except font encoding has never been part of the character encoding. You might want your English text in Arial, your French in Times New Roman and the formula in Courier, but Unicode doesn't encode that. You might argue that this is not a bug, that it's simply out of scope and should be solved by a higher level encoding like <font="some japanese font">konnichiwa</font><font="some chinese font">ni hao</font> and not plaintext Unicode. That's what the Unicode consortium says and if you express it as simply a style issue, it actually sounds plausible.

      On the other hand, you might argue that there's no reasonable way to map a "unihan" character to a glyph except as a band-aid, since the CJK styles are distinctly different and so any comprehensive font should have three variations; it shouldn't take three fonts to make a mixed CJK document look correct, just one. You might also argue that this information belongs on the lowest level and should be passed along as you copy-paste CJK snippets or pass them around in whatever interface or protocol you have, otherwise everything will need a document structure and not just a string.

      I don't think they should "unmerge" and duplicate all the han characters, that'd be silly. What they should do is add CJK indicators, say HANC, HANJ, HANK, like for bi-directional text, only simpler: no nesting, just one indicator applying until superseded by another. Like (HANJ) konnichiwa (HANC) ni hao, where the former will render as a Japanese han, the latter as a Chinese. If it doesn't have any indicator, well, take a guess. Am I missing something blindingly obvious or would this trivially solve the problem?
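
      The "one indicator applies until superseded" rule can be sketched in a few lines. Everything here is hypothetical: HANC, HANJ and HANK are the made-up indicators from this comment, not real Unicode characters, and they are written as literal "(HANJ)" style tokens only so the example is readable. The function name split_runs and the language tags are likewise assumptions.

```python
import re

# Hypothetical indicators from the proposal above; NOT real Unicode characters.
INDICATORS = {"(HANC)": "zh", "(HANJ)": "ja", "(HANK)": "ko"}
_TOKEN = re.compile("|".join(re.escape(token) for token in INDICATORS))


def split_runs(text, default=None):
    """Split text into (language, run) pairs.

    Each indicator applies to everything after it until another indicator
    supersedes it. Text before the first indicator gets `default`, which
    models the "take a guess" case.
    """
    runs = []
    lang = default
    pos = 0
    for match in _TOKEN.finditer(text):
        chunk = text[pos:match.start()]
        if chunk:
            runs.append((lang, chunk))
        lang = INDICATORS[match.group()]
        pos = match.end()
    tail = text[pos:]
    if tail:
        runs.append((lang, tail))
    return runs


if __name__ == "__main__":
    print(split_runs("(HANJ) konnichiwa (HANC) ni hao"))
```

      A renderer could then pick a font per run. With no nesting there is no stack to maintain, which is what makes this simpler than bidirectional control handling.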

      --
      Live today, because you never know what tomorrow brings
    4. Re:Nitpick by AmiMoJo · · Score: 2

      I agree, font encoding should not be part of the character encoding. Unicode even screws that up though, because there are things like text direction marks in it. Anyway, the problem is that often you have text without metadata. A file name, audio file metadata, a plain text database entry etc. You have to pick a font to render it, and the choice depends on the language because thanks to Unicode it's impossible to have a universal all-language font.

      You could have meta characters as you suggest, but that isn't what Unicode is supposed to be for. It's a character encoding scheme, not a metadata encoding scheme.

  6. Re: Why not just use English, and only English? by John Allsup · · Score: 3, Insightful

    Just write Chinese in pinyin and speak it normally. (The number of Chinese speakers does not matter; the issue is with how it is written down.) When it comes to ideograph based languages, we would have been better off designing an entirely separate text system rather than trying to shoehorn it into a font-character paradigm derived from the needs of writing and printing Latin scripts. Indeed having a writing system designed around the needs of calligraphy would be a useful thing, but like with ideograph based writing systems it is a long way from the use case we normally see with alphabet based writing systems.

    --
    John_Chalisque
  7. Re: Or speak English, it's 7bit clean by Anonymous Coward · · Score: 2, Interesting

    As I pointed out elsewhere here, Chinese can be written with a Latin alphabet and a few accents. Likewise languages such as Sanskrit. Just as there is a difference between English handwriting and what can be represented in ASCII, we face a related issue with ideograph based writing systems. We would be better off writing Chinese webpages in pinyin, and developing a separate system for calligraphy and ideographs.

    Except that there are so many homonyms in pinyin that a strong sense of the context is needed to read it. The logograms are much harder to write but reading is quite a bit easier, which is why they are still in use. That's not the same as English handwriting vs printing, where the differences are only in rendering and there is a 1:1 correspondence between a handwritten and a printed character.

  8. Re: Why not just use English, and only English? by amake · · Score: 2, Insightful

    we would have been better off

    No, you might have been better off. Chinese speakers would not. They would like to use their written language, as it exists today, on computers just like everyone else.

  9. Mandarin dependency and homophone confusion by tepples · · Score: 4, Interesting

    Just write Chinese in pinyin and speak it normally. (The number of Chinese speakers does not matter; the issue is with how it is written down.)

    "Chinese" is not a single spoken language. A passage written in one Chinese language, such as Mandarin, is often readable in another Chinese language, such as Cantonese, so long as they're written with Han characters. It's as if French could be read as Italian or Spanish with the same characters. In addition, different words that sound the same in a given Chinese language due to historic sound changes usually have different Han characters. They may end up sounding different in a different Chinese language whose different historic sound changes produced different homophone sets. Pinyin, on the other hand, depends on Mandarin and confuses homophones.

    1. Re:Mandarin dependency and homophone confusion by Fire_Wraith · · Score: 4, Informative

      To a degree, yes, because the symbols themselves are the same. Note, however, that some of the original Chinese characters were simplified by the PRC in the 50s and 60s; those simplified forms are used only in mainland China (and I think Singapore maybe?), not in Taiwan or Japan. Aside from that, though, the characters for something like 'University' would still be a combination of the character for 'large' and the character for 'school'. It might be pronounced totally differently, but could be read and understood by all.

      Fun fact: The proper reading of the characters for the country of "Japan" in Japanese is actually "Nihon" or "Nippon." However, in certain Chinese dialects, the characters that comprise it are pronounced more like "Zep-pen" or "Japan."

      What's also fascinating to consider is that Korean is the same way, but that in modern usage you hardly ever see the Chinese characters (Hanja) used, even though I think they're still taught in some schools. Almost everything I saw when I was in Korea was in Hangul, the Korean native alphabetic script.

    2. Re:Mandarin dependency and homophone confusion by aix tom · · Score: 2

      Another interesting little tidbit I stumbled upon while learning Japanese: "Peking" is written with the characters "North-Capital", "Nanking" is written with the characters "South-Capital", while "Tokyo" is written with the characters "East-Capital".

  10. They're trying to unify *similar* characters by ciaran2014 · · Score: 4, Informative

    A lot of people complain about the idea of unification without understanding it. I can't judge if unicode's unification is great or awful. The English-speaking media constantly says it's awful, but it's usually clear the authors don't know what unification is, who's driving it, or how unicode's work compares to what existed beforehand, so they can only be ignored. (They're sometimes trying to spin up some clickbait about ignorant westerners imposing blah blah blah on Asia, which just shows they know nothing about the topic.)

    The issue:

    There's a certain number of symbols which have been copied from one East Asian language to another. They're the same symbol, so unicode has one slot for that symbol. Then there's a second category where the symbol has been copied, but one group draws it a little different (the Japanese might like to put a little flick at the end of one line, or the Chinese draw the line a little slantier). And a third category where one group has developed a simplified symbol, which means again the traditional and the simplified symbols are the same thing but drawn differently. The two symbols are equivalent, the new one is just a new suggestion for how to draw it.

    Unification is about having one slot for the symbols in categories two and three and leaving it to the font to decide how to display it.

    (Unicode uses more precise terms, but I'm calling them "symbols" and "slots" for simplicity.)

    A disadvantage to this approach is that there can't be a font which would display a symbol both the way a Japanese would draw it and the way a Chinese would draw it. Fonts have to choose one style to draw each unified symbol.

    An advantage of this approach is that new languages and dialects can be supported without needing another 100,000 slots per language or dialect (we do all know there are more than three East Asian languages, don't we?), and it's much easier for fonts to add support for all the East Asian languages because once they've done Chinese, Japanese is automatically almost finished.
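
    The "one slot" idea can be seen directly in code. This is a small sketch using Python's standard unicodedata module; the sample words and variable names are chosen only for illustration. It uses U+76F4, a character commonly cited as being drawn differently in Chinese and Japanese fonts, and also checks one traditional/simplified pair, which turns out to occupy two separate slots, so the third category above is worth verifying against the FAQ rather than taking as a general rule.

```python
import unicodedata

HAN = "\u76f4"  # 直, one slot shared by Chinese and Japanese text

japanese_word = "正" + HAN   # 正直, as it might appear in Japanese text
chinese_word = HAN + "接"    # 直接, as it might appear in Chinese text

# The same code point appears in both strings; nothing in the plain text
# records which regional drawing style is wanted, so the font decides.
same_slot = japanese_word[1] == chinese_word[0]
slot_name = unicodedata.name(HAN)

# A traditional/simplified pair: 學 (traditional) and 学 (simplified).
traditional, simplified = "\u5b78", "\u5b66"
pair_is_unified = traditional == simplified

if __name__ == "__main__":
    print(f"U+{ord(HAN):04X}", slot_name, same_slot)
    print(f"U+{ord(traditional):04X}", f"U+{ord(simplified):04X}", pair_is_unified)
```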

    Here are some example symbols:

    https://en.wikipedia.org/wiki/...

    unicode.org's FAQ also has clarifications:

    If the character shapes are different in different parts of East Asia, why were the characters unified?
    http://www.unicode.org/faq/han...

    Isn't it true that some Japanese can't write their own names in Unicode?
    http://www.unicode.org/faq/han...

    (All that said, it's been years since I looked into this so there's a chance I've gotten some detail wrong, but I'm confident it's a good summary of the issue.)

    --
    Help build the anti-software-patent wiki
  11. Can anyone illustrate? by BlueMonk · · Score: 3, Insightful

    I have been reading the comments for 20 minutes because I don't understand Japanese, but I still don't understand the problem. There's a Japanese character called no, it looks very much like a lowercase English/Latin "e" rotated clockwise about 80 degrees and then flipped over the vertical axis. Is this being mixed up with something else or rendered wrongly? Can anybody provide examples of what it's getting mixed up with or how or where it's being rendered improperly?

  12. Re: Why not just use English, and only English? by interval1066 · · Score: 2

    I'll buy that, but even native Sinolanguage speakers have told me the learning curve for an alphabet is much shallower. Like, MUCH shallower. And since most modern technical terms have Greek and Latin roots, sometimes it's simpler for them to just use the Latin words, otherwise they have to convert the terms to native sounds using bizarre and difficult to use conversion systems. I do agree however that it would have been nice to use a system similar to Kanji right from the beginning had we had one.

    --
    Python: 'And then suddenly you have a language which says "we're all stuck with whatever the whiniest coder wants".'
  13. Re: Why not just use English, and only English? by interval1066 · · Score: 2

    English is the official technical language for flight. ALL international pilots, military and civ, MUST know enough English to pass flight school and to fly international commercial flights. It's also the official language of sea navigation, but to a lesser extent. I don't think you need to be as proficient. And English with a number of loan words from Greek and Latin is used in international Engineering. But yeah, English is spoken by the majority of technical people around the world as a common information exchange language.

  14. Kanbun: Reordering Chinese to Japanese by tepples · · Score: 2

    Japanese and Chinese syntax differ too much for parallels as close as those of Mandarin and Cantonese. Japanese puts the verb at the end (SOV) and marks noun case with postpositions (wa, ga, o, e). Chinese, on the other hand, puts the verb in the middle (SVO), more like English. (Other orders are possible: Welsh and Arabic put the verb at the beginning, or VSO, and Kashmiri and Dutch split the verb into a part that's second and a part at the end, or V2.)

    Chinese also uses serial verb construction, where verbs before the sentence's main verb double as prepositions. For example, a sentence that glosses literally as "I sit aircraft depart Shanghai arrive Beijing travel" is understood as "I by aircraft from Shanghai to Beijing travel." (English is also SVO, but manner and place phrases follow the verb, producing "I travel from Shanghai to Beijing by aircraft.") In Japanese, each of these prepositional verbs would have to go after the noun and would probably need a participle ending like -tte to link them into the sentence.

    For about eight centuries prior to World War II, Japanese used kanbun, a way to mark up Chinese text to show the equivalent word order in Japanese, allowing it to be read as Japanese. It used reordering marks called kaeriten.

    1. Re: Kanbun: Reordering Chinese to Japanese by _merlin · · Score: 2

      That sentence doesn't require multiple verb clauses in Japanese. You can use destination, origin and means particles "ni", "kara" and "de": Watashi wa Shanghai kara Beijing ni hikouki de ikimasu. Since it's a single verb clause you can reorder it however you want for emphasis as long as the verb comes last - the way I have it there emphasises the subject. If you want to emphasise means of travel and use implicit speaker-as-subject, you can say: Beijing ni Shanghai kara hikouki de ikimasu. It's all easy as long as you get your particles right.