New Unicode Bug Discovered For Common Japanese Character "No"
AmiMoJo writes: Some users have noticed that the Japanese character "no", which is extremely common in the Japanese language (forming parts of many words, or meaning something similar to the English word "of" on its own). The Unicode standard has apparently marked the character as sometimes being used in mathematical formulae, causing it to be rendered in a different font to the surrounding text in certain applications. Similar but more widespread issues have plagued Unicode for decades due to the decision to unify dissimilar characters in Chinese, Japanese and Korean.
Some users have noticed that the Japanese character "no", which is extremely common in the Japanese language.
Meanwhile, others have not.
And I have noticed that the yes, which is common in all languages.
Like æ or ä?
If so, it seems many Chinese websites will have problems too, because it's used so often in Chinese.
"The Unicode standard has apparently marked the character as sometimes being used in mathematical formulae"
How is this a "bug"? The character IS sometimes used in formulae. The fact that certain programs interpret it as such, even though the surrounding characters indicate that a particular instance is not part of a formula, is a bug in those applications, not Unicode.
Avantslash: low-bandwidth mobile slashdot.
There are way more than 64K code points now. So are there any plans to de-unify or separate codepoints for Chinese, Japanese and Korean? Or do we still have to specify language in HTML pages that contain a mix of two languages of CJK?
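For a sense of scale, the size of the code space can be checked with Python's standard library. This is only an illustrative sketch: the choice of U+20000 as the sample character is an assumption made for the example, not something the question mentions.

```python
import sys
import unicodedata

# The Unicode code space runs from U+0000 to U+10FFFF.
max_code_point = sys.maxunicode            # 0x10FFFF on Python 3.3+
code_space_size = max_code_point + 1       # number of possible code points
bits_needed = max_code_point.bit_length()  # bits needed to index the top code point

# A sample Han character outside the 16-bit range (illustrative choice).
sample = chr(0x20000)
sample_name = unicodedata.name(sample)
utf16_bytes = len(sample.encode("utf-16-be"))  # stored as a surrogate pair
utf8_bytes = len(sample.encode("utf-8"))

print(hex(max_code_point), code_space_size, bits_needed)
print(sample_name, utf16_bytes, utf8_bytes)
```

Anything above U+FFFF no longer fits in a single 16-bit unit, which is why UTF-16 needs two units for the sample character.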
You shouldn't need 16 bits to write a letter.
which is a little strangely phrased.
You have to assume there will be problems.
The character in question is Hiragana "No", codepoint U+306E. As far as I can tell, this has existed since Unicode 1.1 and there are no differences in the Unicode metadata when compared to any other Hiragana glyph. It is marked as IsAlphabetic=True, Category=Other Letter, and NumericType=None for example. So are all the other common Hiragana glyphs. If there is a bug, it's clearly with some specific application, and not Unicode or Unicode metadata. Compare http://www.fileformat.info/inf... with any other Hiragana glyph, like http://www.fileformat.info/inf... (Hiragana "Ha").
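The comparison can be reproduced roughly with Python's unicodedata module. This is a minimal sketch, not the tool the poster used: the helper name describe is made up for illustration, and unicodedata.numeric with a default of None is used here as a stand-in for the Numeric_Type property, which Python does not expose directly.

```python
import unicodedata

def describe(ch):
    """Collect a few Unicode properties for one character (hypothetical helper)."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),      # "Lo" means Letter, other
        "numeric": unicodedata.numeric(ch, None),  # None means no numeric value
        "is_alpha": ch.isalpha(),
    }

no = describe("\u306e")  # Hiragana "No"
ha = describe("\u306f")  # Hiragana "Ha"

# Properties that should match between the two letters.
shared = {k: no[k] == ha[k] for k in ("category", "numeric", "is_alpha")}

# Categories reported for every hiragana letter from U+3041 to U+3096.
hiragana_categories = {unicodedata.category(chr(cp)) for cp in range(0x3041, 0x3097)}

print(no)
print(ha)
print(shared, hiragana_categories)
```

If "No" carried special metadata, it would show up as a difference between the two dictionaries or as a second entry in the category set.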
Morphing Software
It's the character "no", not the word for "no". As this: の. It is usually used as the English, "of".
In practice, English is the only language we need.
It's already a first language to hundreds of millions of people. It's already a second language to billions. It's even a third, fourth, fifth and sometimes sixth language to many millions more.
It's the language of international business. It's the language of international academia. It's the language of international engineering. It's the language of international aircraft control.
It uses a sensible alphabet that's easy to represent digitally. It's a democratic language that will draw from other languages where necessary and useful. It's a language that has proven it can adapt to changing circumstances.
We shouldn't strive to eliminate other languages, of course. They do have their value, but more as historic curiosities for linguists and historians rather than something to use on a daily basis.
Computers are already able to do amazing things with English text. There's just no need to support other languages.
english haters or america/british haters will disagree, but english is the best choice--as the most widely spoken language (among all speakers, not just natives. sorry china, sorry arabs). it is also the language of the seas, and the language of the air, as well as the primary global language for commerce, internet and diplomacy, and it is often the fallback language among native speakers of non-english languages.
how many others are as easily typed and written (use 26 or fewer Latin characters, do not use accent marks, and use Arabic numerals) and widely spoken languages (billion+ speakers) are there? none.
when will it happen?
This is not a "Unicode bug". It is a rendering bug exhibited by some applications.
I do not fail; I succeed at finding out what does not work.
displays a lack of maturity (at least in this article) in addressing the issues surrounding CJK unification. He writes: "I can understand the rationale behind Han Unification but, since I have the emotional capacity of a child and just want things to work, I’m going to say that it’s dumb and stupid and I hate it." Then he notes in the comments (in response to "Eee" explaining the reasons the author claimed to understand): "I'm not sure if anything you've said is wrong, because I think it was mostly over my head, sorry :x"
Apart from that point, the debate over CJK unification IMO boils down to the role a character encoding should play in the display of information (i.e. what level should language-specific information be stored?).
The character in the Unicode table looks like a mashup of the hiragana (grammar-forming) version of the character, and the katakana (used as we do italics) form.
While there is the technology to do this on the web (the HTML ruby element), you won’t see it much. It just doesn’t work on all web browsers (like Firefox),
Wrong, in fact Firefox is the only browser with almost full ruby support and for older Firefoxen there have been ruby extensions for ages.
and few people choose to use it on their websites.
Their problem.
Just write chinese in pinyin and speak it normally. (the number of Chinese speakers does not matter, the issue is with how it is written down.)
"Chinese" is not a single spoken language. A passage written in one Chinese language, such as Mandarin, is often readable in another Chinese language, such as Cantonese, so long as they're written with Han characters. It's as if French could be read as Italian or Spanish with the same characters. In addition, different words that sound the same in a given Chinese language due to historic sound changes usually have different Han characters. They may end up sounding different in a different Chinese language whose different historic sound changes produced different homophone sets. Pinyin, on the other hand, depends on Mandarin and confuses homophones.
Please name said software.
Any HTML renderer ought to be able to tell an element with lang="zh-Hans" (Chinese using simplified characters) from one with lang="ja" (Japanese).
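As a rough sketch of what such a renderer could do, the lookup below maps a language tag to a font family and falls back to shorter prefixes of the tag. Everything here is an assumption for illustration: the FONT_BY_LANG table, the font names, and the choose_font helper are hypothetical, and real browsers use more elaborate matching (language tags are case-insensitive, for instance, which this sketch ignores).

```python
# Hypothetical mapping from language tag to font family (illustrative names).
FONT_BY_LANG = {
    "ja": "Noto Sans JP",
    "zh-Hans": "Noto Sans SC",
    "zh-Hant": "Noto Sans TC",
    "ko": "Noto Sans KR",
}
DEFAULT_FONT = "sans-serif"

def choose_font(lang_tag):
    """Pick a font for a lang tag, trying progressively shorter prefixes."""
    parts = lang_tag.split("-")
    while parts:
        candidate = "-".join(parts)
        if candidate in FONT_BY_LANG:
            return FONT_BY_LANG[candidate]
        parts.pop()  # drop the last subtag and try again
    return DEFAULT_FONT

for tag in ("ja", "ja-JP", "zh-Hans", "zh-Hans-CN", "zh", "en"):
    print(tag, "->", choose_font(tag))
```

The point is only that the language information lives in the markup, so a renderer has what it needs to choose between Japanese and Chinese glyph styles for the same text.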
A lot of people complain about the idea of unification without understanding it. I can't judge if unicode's unification is great or awful. The English-speaking media constantly says it's awful, but it's usually clear the authors don't know what unification is, who's driving it, or how unicode's work compares to what existed beforehand, so they can only be ignored. (They're sometimes trying to spin up some clickbait about ignorant westerners imposing blah blah blah on Asia, which just shows they know nothing about the topic.)
The issue:
There's a certain number of symbols which have been copied from one East Asian language to another. They're the same symbol, so unicode has one slot for that symbol. Then there's a second category where the symbol has been copied, but one group draws it a little different (the Japanese might like to put a little flick at the end of one line, or the Chinese draw the line a little slantier). And a third category where one group has developed a simplified symbol, which means again the traditional and the simplified symbols are the same thing but drawn differently. The two symbols are equivalent, the new one is just a new suggestion for how to draw it.
Unification is about having one slot for the symbols in categories two and three and leaving it to the font to decide how to display it.
(Unicode uses more precise terms, but I'm calling them "symbols" and "slots" for simplicity.)
A disadvantage to this approach is that there can't be a font which would display a symbol both the way a Japanese would draw it and the way a Chinese would draw it. Fonts have to choose one style to draw each unified symbol.
An advantage of this approach is that new languages and dialects can be supported without needing another 100,000 slots per language or dialect (we do all know there are more than three East Asian languages, don't we?), and it's much easier for fonts to add support for all the East Asian languages because once they've done Chinese, Japanese is automatically almost finished.
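The "one slot" idea can be seen in a short Python sketch. The character U+76F4 is picked here only as an illustration of a unified symbol that is drawn differently in Japanese and Chinese fonts, and the language tags are metadata the example attaches itself, not something stored in the text.

```python
import unicodedata

# U+76F4, chosen as an illustrative unified ideograph.
ch = "\u76f4"

# The same string tagged two ways; the tag lives outside the text itself.
tagged = [("ja", ch), ("zh-Hans", ch)]

code_points = {lang: f"U+{ord(text):04X}" for lang, text in tagged}
utf8_forms = {lang: text.encode("utf-8") for lang, text in tagged}
same_bytes = utf8_forms["ja"] == utf8_forms["zh-Hans"]

print(unicodedata.name(ch))
print(code_points)
print(same_bytes)
```

Both entries carry the same code point and the same bytes, so the drawing style has to come from the font or from language information supplied alongside the text.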
Here are some example symbols:
https://en.wikipedia.org/wiki/...
unicode.org's FAQ also has clarifications:
If the character shapes are different in different parts of East Asia, why were the characters unified?
http://www.unicode.org/faq/han...
Isn't it true that some Japanese can't write their own names in Unicode?
http://www.unicode.org/faq/han...
(All that said, it's been years since I looked into this so there's a chance I've gotten some detail wrong, but I'm confident it's a good summary of the issue.)
Help build the anti-software-patent wiki
Hey timothy, why not include the character in the summary, so we can all see it?
Some users have noticed that the Japanese character "no", which is extremely common in the Japanese language
I have been reading the comments for 20 minutes because I don't understand Japanese, but I still don't understand the problem. There's a Japanese character called no, it looks very much like a lowercase English/Latin "e" rotated clockwise about 80 degrees and then flipped over the vertical axis. Is this being mixed up with something else or rendered wrongly? Can anybody provide examples of what it's getting mixed up with or how or where it's being rendered improperly?
It would be funnier if the bug was on character "Ni"
This isn't the first time I've seen someone complain about the han unification. Since unicode was supposed to solve the problems of multiple encodings there is something to be said for the proposition that it shouldn't introduce new problems or continue problems such that it doesn't actually solve the problems it set out to solve. In that light, the pain and suffering resulting from trying to combine chinese and japanese text in a single document resulting in confused renderers and mixed-up fonts, like here but there are more ways that goes wrong, shows that unicode fails to properly fulfil this proposition and that the han unification as such wasn't such a great idea. Clever in theory, impractical in practice.
Since unicode was exactly supposed to be a solve-all silver bullet leaving us with nothing but rainbows and an easy life, the expressed sentiment is the salient point.
Therefore it can only be ignored.
Anyway, the unification idea does have a practical drawback in that using more than one language using those unified slots is going to be nothing but pain and suffering. Personally I think that unicode's less-than-21-bits codepoint space isn't going to be enough and that a system of encodings, rather than one do-all encoding, would be the more practical approach. That way renderers would know what language the text is in and so make better decisions picking fonts and such.
Just use English, people! If people did not use all these other silly little languages the world would be so much better.
Some users have noticed that the Japanese character "no", which is extremely common in the Japanese language (forming parts of many words, or meaning something similar to the English word "of" on its own).
What did they notice? Something about the Japanese character "no". But what exactly? The parenthetical gives some background on the character, but the sentence telling us what got noticed is never completed.
Oh well. Maybe it wasn't important.
Japanese and Chinese syntax differ too much for parallels as close as those of Mandarin and Cantonese. Japanese puts the verb at the end (SOV) and marks noun case with postpositions (wa, ga, o, e). Chinese, on the other hand, puts the verb in the middle (SVO), more like English. (Other orders are possible: Welsh and Arabic put the verb at the beginning, or VSO, and Kashmiri and Dutch split the verb into a part that's second and a part at the end, or V2.)
Chinese also uses serial verb construction, where verbs before the sentence's main verb double as prepositions. For example, a sentence that glosses literally as "I sit aircraft depart Shanghai arrive Beijing travel" is understood as "I by aircraft from Shanghai to Beijing travel." (English is also SVO, but manner and place phrases follow the verb, producing "I travel from Shanghai to Beijing by aircraft.") In Japanese, each of these prepositional verbs would have to go after the noun and would probably need a participle ending like -tte to link them into the sentence.
For about eight centuries prior to World War II, Japanese used kanbun, a way to mark up Chinese text to show the equivalent word order in Japanese, allowing it to be read as Japanese. It used reordering marks called kaeriten.
Let's be clear here: the character is U+306E, "hiragana letter no"
http://www.fileformat.info/info/unicode/char/306e/index.htm
The general category is "letter, other"
Nothing to do with math (that would be "math symbol")
If there is a bug, it is not in Unicode, but in some crappy software.
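The difference between "letter, other" and "math symbol" is easy to check with Python's unicodedata module. A minimal sketch, with the two math characters chosen purely as examples:

```python
import unicodedata

# Sample characters; the two math symbols are illustrative choices.
samples = {
    "hiragana no": "\u306e",
    "plus sign": "+",
    "n-ary summation": "\u2211",
}

# General category for each sample: "Lo" is Letter, other; "Sm" is Symbol, math.
categories = {label: unicodedata.category(ch) for label, ch in samples.items()}
math_symbols = [label for label, cat in categories.items() if cat == "Sm"]

print(categories)
print(math_symbols)
```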
Nothing to see, move along.
Just say "No" to Unicode.
Table-ized A.I.
Some users have noticed that the Japanese character "no", which is extremely common in the Japanese language (forming parts of many words, or meaning something similar to the English word "of" on its own).
That isn't even a sentence in English. It is extremely grating to read crap like this, and it does not convey much about the story.
unicode *is* a bug.
Even using vector fonts doesn't fix the problem that Unicode wasn't a great solution for managing the diversity of characters in many Asian languages.
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
I was once at a conference in Germany, most of which was given in English because it was an international crowd. One of the German speakers started off by saying that he used to start by apologizing for his bad English, but the host (who was Turkish) told him not to worry; Bad English is the most widely spoken language in the world. (Which is fine; English is flexible enough about most things that if you don't need to be subtle, Bad English will usually do.)
German's the only non-English language that I'm even vaguely functional in, and even then it was much more useful for me in Czechoslovakia, where people had learned German in school to deal with tourists, and I mainly wanted to talk to them about the same sets of things, like train schedules and getting food and hotels and which bridge went to the castle. Northern Germans speak a relatively comprehensible dialect, though too fast for me to do much in real time; understanding Austrians is more like being a New Yorker in deep Alabama. (I play music at a local German jam session, and some of the tunes have the lyrics translated from Bavarian or Swiss into German...)
I'm pretty sure my mom's manual typewriter when I was a kid didn't have 1, less sure about whether it had 0. But it did have the proper French and Spanish accent marks (left, right, circumflex, N~, cedilla, most of which my PC keyboard doesn't have), and you composed them with letters by using the backspace.
And yes, she could do two-column left-and-right-justified newsletters on it - she'd type a draft, count the letters, type the final. But she happily switched to using a Macintosh to type them, and let it handle that stuff.
Yes, I know you were trolling, but in your mythical 7-bit-clean English, even if you're not using English letters like ð or þ, or ligatures like æ, or distinguishing between short and long S's (you know, the s you used to think were f's), how do you put diaeresis marks over words like cooperate, or distinguish between m-dash and n-dash and hyphen, or get the left- and right-side quotation marks without using some Microsoft or Apple ``smart quote'' breakage, much less deal with accent marks in words of foreign origin that are now part of English because we stole them fair and square and they're ours now, or handle degree marks, or words with superscript letters like the abbreviations for the and that and George and Your, or ...
And turning them all into leet-speak, like earlier Ye Olde Hwaetever's, just doesn't count.