New Unicode Bug Discovered For Common Japanese Character "No"

Posted by timothy on Friday July 17, 2015 @11:32PM from the perfectly-nice-in-garbage-out dept.

AmiMoJo writes: Some users have noticed a problem with the Japanese character "no", which is extremely common in the Japanese language (forming parts of many words, or meaning something similar to the English word "of" on its own). The Unicode standard has apparently marked the character as sometimes being used in mathematical formulae, causing it to be rendered in a different font from the surrounding text in certain applications. Similar but more widespread issues have plagued Unicode for decades due to the decision to unify dissimilar characters in Chinese, Japanese and Korean.

6 of 196 comments

What bug? by Ark42 · 2015-07-17 23:53 · Score: 4, Informative

The character in question is Hiragana "No", codepoint U+306E. As far as I can tell, this has existed since Unicode 1.1 and there are no differences in the Unicode metadata when compared to any other Hiragana glyph. It is marked as IsAlphabetic=True, Category=Other Letter, and NumericType=None, for example. So are all the other common Hiragana glyphs. If there is a bug, it's clearly with some specific application, and not Unicode or Unicode metadata. Compare http://www.fileformat.info/inf... with any other Hiragana glyph, like http://www.fileformat.info/inf... (Hiragana "Ha").
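
For anyone who wants to repeat the comparison locally, here is a minimal sketch using Python's standard `unicodedata` module. The `describe` helper is just a name made up for this example, and the module only exposes some of the Unicode properties, so this checks the fields mentioned above rather than the complete metadata.

```python
import unicodedata

def describe(ch):
    """Collect a few standard-library Unicode properties for one character."""
    return {
        "codepoint": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),      # "Lo" means Letter, other
        "numeric": unicodedata.numeric(ch, None),  # None when there is no numeric value
        "is_alpha": ch.isalpha(),
    }

no = describe("\u306e")  # Hiragana "No"
ha = describe("\u306f")  # Hiragana "Ha"

# Properties (other than the identity fields) where the two characters differ.
differing = {k for k in no if k not in ("codepoint", "name") and no[k] != ha[k]}

print(no)
print(ha)
print("differing properties:", differing)
```

If the set of differing properties comes back empty, that is consistent with the point above: at least for these fields, "No" is not singled out in the character data.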

--
Morphing Software
Nitpick by msobkow · 2015-07-18 00:21 · Score: 5, Informative

This is not a "Unicode bug". It is a rendering bug exhibited by some applications.

--
I do not fail; I succeed at finding out what does not work.
1. Re:Nitpick by Kjella · 2015-07-18 02:18 · Score: 3, Informative
  
  How is an application supposed to know if a random character is Japanese, Chinese, Korean or mathematical? It would need some kind of strong AI to interpret and understand the text. It's a Unicode bug, merged characters are impossible to render correctly all the time because apps are forced to guess which font to use.
  Except font encoding has never been part of the character encoding, you might want your English text in Arial, your French in Times New Roman and the formula in Courier, but Unicode doesn't encode that. You might argue that this is not a bug, that it's simply out of scope and should be solved by a higher level encoding like <font="some japanese font">konnichiwa</font><font="some chinese font">ni hao</font> and not plaintext Unicode. That's what the Unicode consortium says and if you express it as simply a style issue, it actually sounds plausible.
  On the other hand, you might argue that there's no reasonable way to map a "unihan" character to a glyph except as a band-aid, since the CJK styles are distinctly different and so any comprehensive font should have three variations; it shouldn't take three fonts to make a mixed CJK document look correct, just one. You might also argue that this information belongs on the lowest level and should be passed along as you copy-paste CJK snippets or pass them around in whatever interface or protocol you have, otherwise everything will need a document structure and not just a string.
  I don't think they should "unmerge" and duplicate all the han characters, that'd be silly. What they should do is add CJK indicators - say HANC, HANJ, HANK like for bi-directional text, only simpler with no nesting just one indicator applying until superseded by another. Like (HANJ) konnichiwa (HANC) ni hao and the former will render as a Japanese han, the latter as a Chinese. If it doesn't have any indicator, well take a guess. Am I missing something blindingly obvious or would this trivially solve the problem?
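
To make the idea concrete, here is a rough sketch of how a renderer might split a string into runs under such a scheme. The markers are written as literal `(HANC)`, `(HANJ)` and `(HANK)` text purely for illustration; they are hypothetical, not real Unicode characters, and `split_runs` is a made-up helper name.

```python
import re

# Hypothetical indicators from the proposal above; nothing like this is
# assumed to exist in Unicode under these names.
MARKER = re.compile(r"\((HANC|HANJ|HANK)\)")

def split_runs(text, default=None):
    """Return (indicator, run) pairs; an indicator applies until superseded."""
    runs = []
    current = default  # None means "no indicator seen, take a guess"
    pos = 0
    for match in MARKER.finditer(text):
        chunk = text[pos:match.start()]
        if chunk:
            runs.append((current, chunk))
        current = match.group(1)  # no nesting: the newest marker simply wins
        pos = match.end()
    tail = text[pos:]
    if tail:
        runs.append((current, tail))
    return runs

sample = "(HANJ)konnichiwa (HANC)ni hao"
print(split_runs(sample))
```

Each run could then be handed to whatever font the renderer prefers for that indicator, with `None` left to the existing guesswork.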
  
  --
  Live today, because you never know what tomorrow brings
They're trying to unify *similar* characters by ciaran2014 · 2015-07-18 02:25 · Score: 4, Informative

A lot of people complain about the idea of unification without understanding it. I can't judge if unicode's unification is great or awful. The English-speaking media constantly says it's awful, but it's usually clear the authors don't know what unification is, who's driving it, or how unicode's work compares to what existed beforehand, so they can only be ignored. (They're sometimes trying to spin up some clickbait about ignorant westerners imposing blah blah blah on Asia, which just shows they know nothing about the topic.)
The issue:
There's a certain number of symbols which have been copied from one East Asian language to another. They're the same symbol, so unicode has one slot for that symbol. Then there's a second category where the symbol has been copied, but one group draws it a little different (the Japanese might like to put a little flick at the end of one line, or the Chinese draw the line a little slantier). And a third category where one group has developed a simplified symbol, which means again the traditional and the simplified symbols are the same thing but drawn differently. The two symbols are equivalent, the new one is just a new suggestion for how to draw it.
Unification is about having one slot for the symbols in categories two and three and leaving it to the font to decide how to display it.
(Unicode uses more precise terms, but I'm calling them "symbols" and "slots" for simplicity.)
A disadvantage to this approach is that there can't be a font which would display a symbol both the way a Japanese would draw it and the way a Chinese would draw it. Fonts have to choose one style to draw each unified symbol.
An advantage of this approach is that new languages and dialects can be supported without needing another 100,000 slots per language or dialect (we do all know there are more than three East Asian languages, don't we?), and it's much easier for fonts to add support for all the East Asian languages because once they've done Chinese, Japanese is automatically almost finished.
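
A quick way to see which symbols share a slot is to print their code points. This is only a sketch with characters I picked myself: U+76F4 as a symbol that is drawn a little differently by region, and U+56FD / U+570B as one simplified/traditional pair. The `slots` helper is a made-up name.

```python
import unicodedata

def slots(text):
    """List the code point ("slot") of each character in the text."""
    return [f"U+{ord(ch):04X}" for ch in text]

# One symbol used in both Chinese and Japanese text. The string stores only
# the code point, with no record of which regional style was intended.
shared = "\u76f4"

# One simplified/traditional pair, chosen for this example.
simplified, traditional = "\u56fd", "\u570b"

print(slots(shared), unicodedata.name(shared))
print(slots(simplified), slots(traditional))
```

The first symbol occupies a single slot whatever the language, so the drawing style has to come from the font. The pair in the second line prints two different slots, which suggests that at least some simplified forms are encoded separately rather than unified, so the third category above may not hold across the board.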
Here are some example symbols:
https://en.wikipedia.org/wiki/...
unicode.org's FAQ also has clarifications:
If the character shapes are different in different parts of East Asia, why were the characters unified?
http://www.unicode.org/faq/han...
Isn't it true that some Japanese can't write their own names in Unicode?
http://www.unicode.org/faq/han...
(All that said, it's been years since I looked into this so there's a chance I've gotten some detail wrong, but I'm confident it's a good summary of the issue.)

--
Help build the anti-software-patent wiki
Re:Why not just use English, and only English? by interval1066 · 2015-07-18 03:17 · Score: 1, Informative

I think Chinese is the only language we need, it's already the most spoken language in the world.
Only in head count, not by region. If the world was populated only by the Chinese, which seems to be their goal, then yes, Chinese is the most spoken language in the world. However, if you break that fact down by dialect, your statement is really weak. Mao's goal to have the entire PRC speak Mandarin really failed.

It's a democratic language that will draw from other languages where necessary and useful.
Not really. Mao tried to force all Chinese to speak Mandarin, and he failed miserably. Kinda the exact opposite of "Democratic". But of course that's not the fault of the language per se...

It's a language that has proven it can adapt to changing circumstances.
Chinese may be, but if Japanese is an example, and Japanese is adapted from Chinese by Han explorers to Japan in the Iron Age, it's not very adaptable at all. The Japanese have developed THREE different writing systems to cope with some shortcomings of the language (only two tenses, underdeveloped pronoun system, etc). That may be a shortcoming of Japanese, but Japanese is just a symptom of a language root that isn't very forgiving. I will say, however, that a language that can be nuanced such that 9 different meanings come from changing the tone of one word may be more flexible than I give it credit for.

--
Python: 'And then suddenly you have a language which says "we're all stuck with whatever the whiniest coder wants".'
Re:Mandarin dependency and homophone confusion by Fire_Wraith · 2015-07-18 04:03 · Score: 4, Informative

To a degree, yes, because the symbols themselves are the same. Note however that some of the original Chinese characters were altered (simplified) by the PRC in the 50s and 60s; those are only used in mainland China (and I think Singapore maybe?), not in Taiwan or Japan. Aside from that though, the characters for something like 'University' would still be a combination of the character for 'large' and the character for 'school'. It might be pronounced totally differently, but could be read and understood by all. Fun fact: The proper reading of the characters for the country of "Japan" in Japanese is actually "Nihon" or "Nippon." However, in certain Chinese dialects, the characters that comprise it are pronounced more like "Zep-pen" or "Japan." What's also fascinating to consider is that Korean is the same way, but that in modern usage you hardly ever see the Chinese characters (Hanja) used, even though I think they're still taught in some schools. Almost everything I saw when I was in Korea was in Hangul, the Korean native alphabetic script.