Slashdot Mirror


New Unicode Bug Discovered For Common Japanese Character "No"

AmiMoJo writes: Some users have noticed that the Japanese character "no", which is extremely common in the Japanese language (forming parts of many words, or meaning something similar to the English word "of" on its own). The Unicode standard has apparently marked the character as sometimes being used in mathematical formulae, causing it to be rendered in a different font to the surrounding text in certain applications. Similar but more widespread issues have plagued Unicode for decades due to the decision to unify dissimilar characters in Chinese, Japanese and Korean.

196 comments

  1. Indeed by Anonymous Coward · · Score: 0

    Some users have noticed that the Japanese character "no", which is extremely common in the Japanese language.
    Meanwhile, others have not.

    1. Re:Indeed by smittyoneeach · · Score: 1

      Sum of some, sometimes
      Somersaults sagaciously
      In the summertime

      --
      Get thee glass eyes, and, like a scurvy politician, seem to see things thou dost not.--King Lear
    2. Re:Indeed by Anonymous Coward · · Score: 0

      This article is actually a mass awareness campaign for 'no'.

    3. Re:Indeed by Anonymous Coward · · Score: 0

      Not even a fucking complete sentence! Does nobody edit this shit? Oh yeah, never mind.

    4. Re:Indeed by LordSamanon · · Score: 1

      I cringed immediately on seeing this.

    5. Re:Indeed by Anonymous Coward · · Score: 0

      I cringed about the "similar to the English word 'of'" part. That's not quite right.

      "No" is a possessive. It's actually roughly equivalent to the English apostrophe-S, but applies in situations where an English speaker would never consider the relationship to be possessive (like city of birth, belonging to a family, or other part-of-a-group type relationships that don't involve actual ownership).

  2. And I have. by Anonymous Coward · · Score: 0

    And I have noticed that the yes, which is common in all languages.

    1. Re:And I have. by Anonymous Coward · · Score: 0

      Actually, "no" is more common IMHO.

      Yes can be si (/. swallows accents), da, ja, sim, oui etc.

      No is usually no, but sometimes non, which is close, ne (similar), não (sounds like no... I hope the tilde shows). And there are no's like the Japanese one, which are not a real no, but are written as such.

    2. Re: And I have. by Anonymous Coward · · Score: 0

      I have noted that the too.

  3. Is it the same as in Chinese? by Anonymous Coward · · Score: 0

    Like æ or äï¼Y

    If so, it seems many Chinese websites will have problems too, because it's used so often in Chinese.

    1. Re:Is it the same as in Chinese? by Chris+Mattern · · Score: 4, Funny

      Like æ or ÃüY

      If so, it seems many Chinese websites will have problems too, because it's used so often in Chinese.

      As you have just discovered, Slashdot cleverly avoids all Unicode bugs by not supporting Unicode at all.

    2. Re:Is it the same as in Chinese? by ChunderDownunder · · Score: 1

      meanwhile the folks at soylent implemented it ages ago.

      With all the effort wasted on 'beta', I wonder how much of the open source slashcode remains.

    3. Re:Is it the same as in Chinese? by KiloByte · · Score: 1

      Actually, slashcode does support Unicode, all that needs to be done for /. to get Unicode is reconfiguring the database (and converting old comments, I guess).

      --
      The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
    4. Re:Is it the same as in Chinese? by Anonymous Coward · · Score: 0

      Unicode is a character set, not an encoding.
      UTF-8 is often used for encoding Unicode when transitioning, mostly because 7-bit ASCII is a compatible subset of UTF-8.
      So it would be possible to just switch to UTF-8 and say that the old comments should be considered as if they were already converted.
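
A minimal sketch of the point above, assuming the old comments are stored as raw bytes: pure 7-bit ASCII is byte-for-byte identical in UTF-8, so it needs no conversion, but anything stored as ISO-8859-1 with the high bit set would not decode as UTF-8. The helper name `is_valid_utf8` and the sample strings are made up for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Pure ASCII text produces the same bytes under both encodings.
ascii_comment = "Plain 7-bit text".encode("ascii")
same_bytes = ascii_comment == "Plain 7-bit text".encode("utf-8")

# An accented character stored as ISO-8859-1 is a single high byte,
# which is not a valid UTF-8 sequence on its own.
latin1_comment = "caf\u00e9".encode("iso-8859-1")

print(same_bytes)
print(is_valid_utf8(ascii_comment))
print(is_valid_utf8(latin1_comment))
```

So "treat the old comments as already converted" only holds for comments that never left the 7-bit range; the rest would need a real conversion pass.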

    5. Re:Is it the same as in Chinese? by Carewolf · · Score: 1

      Actually, slashcode does support Unicode, all that needs to be done for /. to get Unicode is reconfiguring the database (and converting old comments, I guess).

      No, it already works. It was active for a while some 10 years ago, but was removed because it was hard to sanitize. You could easily write your own comment score by reversing direction at the right time.

      Still they could reactivate it if they just found a reasonable way of sanitizing features they don't want.
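
One possible shape for that kind of sanitizing, as a sketch only: drop the Unicode bidirectional control characters (the ones that let text "reverse direction") and leave everything else alone. The function name and the exact set of characters removed are assumptions made for this example, not anything taken from slashcode.

```python
# Directional formatting characters that can reorder displayed text.
BIDI_CONTROLS = {
    "\u200e", "\u200f",            # LRM, RLM marks
    "\u202a", "\u202b", "\u202c",  # LRE, RLE, PDF
    "\u202d", "\u202e",            # LRO, RLO overrides
    "\u2066", "\u2067", "\u2068", "\u2069",  # LRI, RLI, FSI, PDI isolates
}


def strip_bidi_controls(text: str) -> str:
    """Remove bidi control characters, keeping all other characters."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)


spoofed = "Score: \u202e5 ,lufthgisnI\u202c"
print(strip_bidi_controls(spoofed))
print(strip_bidi_controls("\u306e") == "\u306e")
```

A filter like this would also break legitimate right-to-left text that relies on those controls, so it is a blunt tool rather than a complete answer.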

    6. Re:Is it the same as in Chinese? by Bing+Tsher+E · · Score: 1

      Slashdot is more in the spirit of Usenet than anything else. I wish they'd just strip the 8th bit on everything.

    7. Re:Is it the same as in Chinese? by Megane · · Score: 1

      Slashdot does support Unicode (assuming your browser can be convinced to post in the right encoding). It just happens to have most of the code points (basically everything above U+00FF) blacklisted.
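
A rough imitation of the behaviour described here, assuming the rule really is "keep code points up to U+00FF, silently drop the rest". The threshold and the function name are guesses for illustration; the real filter is not shown anywhere in this thread.

```python
MAX_ALLOWED = 0x00FF  # assumed cutoff, per the description above


def strip_disallowed(text: str, limit: int = MAX_ALLOWED) -> str:
    """Drop every character whose code point is above the limit."""
    return "".join(ch for ch in text if ord(ch) <= limit)


# U+00E9 (e with acute) survives; U+306E (hiragana no) is removed.
print(strip_disallowed("caf\u00e9 \u306e!"))
```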

      --
      #naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }
    8. Re:Is it the same as in Chinese? by interval1066 · · Score: 1

      If only everyone just used UTF8 encoding. Unfortunately, Microsoft insisted on using UTF16 and now here we are...

      --
      Python: 'And then suddenly you have a language which says "we're all stuck with whatever the whiniest coder wants".'
    9. Re:Is it the same as in Chinese? by jones_supa · · Score: 1

      No, it already works. It was active for a while some 10 years ago, but was removed because it was hard to sanitize. You could easily write your own comment score by reversing direction at the right time.

      Still they could reactivate it if they just found a reasonable way of sanitizing features they don't want.

      Dude, all other websites support Unicode. Sanitizing it properly cannot be rocket science.

    10. Re:Is it the same as in Chinese? by KingMotley · · Score: 1

      Considering that Windows NT was around *before* UTF-8, it would have been rather difficult to implement it. What you really meant to say was, unfortunately, standards committees are often too slow to implement things like UTF-8 in a timely manner.

    11. Re:Is it the same as in Chinese? by billyswong · · Score: 1

      Then what about just stripping Unicode from the comment subject and leaving the comment content intact?

    12. Re: Is it the same as in Chinese? by Anonymous Coward · · Score: 0

      Slashdot disabled unicode support because people were using it to post goatse "ascii" art in the comments

  4. Bug? by Anonymous Coward · · Score: 0

    "The Unicode standard has apparently marked the character as sometimes being used in mathematical formulae"

    How is this a "bug"? The character IS sometimes used in formulae. The fact that certain programs interpret it as such, even though the surrounding characters indicate that a particular instance is not part of a formula, is a bug in those applications, not Unicode.

    1. Re:Bug? by AmiMoJo · · Score: 2

      It's a Unicode bug. Unicode tries to merge different characters into a single code point, because long ago they had the same origin. This particular character exists in Japanese, Chinese, Korean and mathematics, so can be rendered four different ways, but they all share one code point.

      Applications have to guess what font to use. Being a mathematical program, this one defaults to the system language (Japanese) but has logic to detect this "no" character and render it in a different font. It isn't clever enough to notice that the rest of the sentence is Japanese, but it shouldn't have to be.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    2. Re:Bug? by Anonymous Coward · · Score: 0

      even though the surrounding characters indicate

      If you have to resort to heuristics to work with unicode then it's a failure. Dealing with characters should not require heuristics to work 95% of the time, and require AI to work 99% of the time -- it should be dead simple and work 100% of the time.

    3. Re:Bug? by Carewolf · · Score: 1

      It's a Unicode bug. Unicode tries to merge different characters into a single code point, because long ago they had the same origin. This particular character exists in Japanese, Chinese, Korean and mathematics, so can be rendered four different ways, but they all share one code point.

      Applications have to guess what font to use. Being a mathematical program, this one defaults to the system language (Japanese) but has logic to detect this "no" character and render it in a different font. It isn't clever enough to notice that the rest of the sentence is Japanese, but it shouldn't have to be.

      The funny thing is that the same has never been done with Latin letters and symbols, because that would be a mess. I really don't understand why they couldn't see it would be the same in Asian languages.

    4. Re:Bug? by Anonymous Coward · · Score: 0

      In programming often people will be too clever. They will add in extra complexity so the code matches the ideal form in their head. You can be very proud of this when you write it and be totally embarrassed by it months later.

    5. Re:Bug? by Anonymous Coward · · Score: 0

      I'm guessing due to 16 bit UCS2. Another thing to blame MS for ?

    6. Re:Bug? by Darinbob · · Score: 1

      Because the Europeans overrode the Asians when creating the Unicode "standard". They wanted to save code space, despite not being short on it (maybe some idiots think it could be done in 16 bits, but no one on the committee was that naive).

      In English, why are 1 and l not the same code point, despite having the same look in so many fonts? Many typewriters did not even have separate 1 and 0 keys (tell that to kids these days and they won't believe you). It sounds idiotic to us to give them the same ASCII code. Now imagine native speakers of Asian languages being told similar things about their writing systems.

      The problem with "no" is sort of a side issue in some sense to all this, but the problems with Han unification have been known for decades.

    7. Re:Bug? by Perky_Goth · · Score: 1

      I haven't read it, but the full explanation is on https://en.wikipedia.org/wiki/...

  5. No? by hankwang · · Score: 1

    I tried to RTFA, but it was in Japanese! I thought Japanese didn't have a word for "no":

    Japanese also lacks words for yes and no. The words "hai" and "iie" are mistaken by English speakers for equivalents to yes and no, but they actually signify agreement or disagreement with the proposition put by the question: "That's right." or "That's not right."

    1. Re:No? by Anonymous Coward · · Score: 0

      I think what is meant is not the word "no" as in "negatory" but the character no which is pronounced "noh".

    2. Re:No? by Anonymous Coward · · Score: 0

      "No" as in the kana character pronounced like "no".
      https://en.wikipedia.org/wiki/No_(kana)

    3. Re:No? by Anonymous Coward · · Score: 0

      They're referring to the character (no) rather than the word "no". They do actually say in the summary it's the CHARACTER, not the word.

      Also while that description for "hai" and "iie" is correct, I think it's splitting hairs. They can refer to yes and no, depending on the context.

    4. Re:No? by AmiMoJo · · Score: 1

      Correct. Unfortunately Slashdot does not allow me to enter Japanese text, hence the confusion.

      This is what happens when I type that character in Japanese: ã®

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    5. Re:No? by Applehu+Akbar · · Score: 1

      Most Japanese characters in the two phonetic alphabets stand for a consonant tied to a vowel. The no phonetic in grammar indicates possession. In the phrase Katoh no boshi (Kato's cap) all of the other characters would be the Chinese-derived kanji.

    6. Re: No? by Anonymous Coward · · Score: 0

      Pedant time. 'No' is a sign, not a character. (Speaking in terms of English terminology for Japanese language.., not unicode)

  6. Re:What, unicode less than perfect? by Anonymous Coward · · Score: 0

    There are way more than 64K code points now. So are there any plans to de-unify or separate codepoints for Chinese, Japanese and Korean? Or do we still have to specify language in HTML pages that contain a mix of two languages of CJK?

  7. Or speak English, it's 7bit clean by Anonymous Coward · · Score: 0

    You shouldn't need 16 bits to write a letter.

    1. Re: Or speak English, it's 7bit clean by John+Allsup · · Score: 1

      As I pointed out elsewhere here, Chinese can be written with a Latin alphabet and a few accents. Likewise languages such as Sanskrit. Just as there is a difference between English handwriting and what can be represented in ASCII, we face a related issue with ideograph based writing systems. We would be better off writing Chinese webpages in pinyin, and developing a separate system for calligraphy and ideographs.

      --
      John_Chalisque
    2. Re: Or speak English, it's 7bit clean by Anonymous Coward · · Score: 2, Interesting

      As I pointed out elsewhere here, Chinese can be written with a Latin alphabet and a few accents. Likewise languages such as Sanskrit. Just as there is a difference between English handwriting and what can be represented in ASCII, we face a related issue with ideograph based writing systems. We would be better off writing Chinese webpages in pinyin, and developing a separate system for calligraphy and ideographs.

      Except that there are so many homonyms in pinyin that a strong sense of the context is needed to read it. The logograms are much harder to write but reading is quite a bit easier, which is why they are still in use. That's not the same as English handwriting vs printing, where the differences are only in rendering and there is a 1:1 correspondence between a handwritten and a printed character.

    3. Re: Or speak English, it's 7bit clean by amake · · Score: 1

      No one, absolutely no one who is actually proficient in any of these languages, would find your proposal acceptable. The only people who advocate such things are, deservedly, dismissed as cranks.

      So instead, how about we fix the problems with the current, largely acceptable system we have now?

    4. Re: Or speak English, it's 7bit clean by Anonymous Coward · · Score: 0

      In Latin-derived languages, we must look at how words are written, not just how they sound. From the characters used it's often possible to derive the etymology and thus know the original meaning of the word.

      This seems not to be the case with English. Alas, just because it has no accents, some confusion arises. "Façade" (with a cedilla), for instance, clearly reminds one of "face" (a "house face"), whereas "facade" does not.

      Same thing, I suppose, with ideograms, which contain sub-drawings that ease their interpretation (I'm not Chinese; I'm gathering all the bits of knowledge I have, including those about Japanese ideograms). There are the famous examples of crisis (opportunity+danger), happiness (man+woman inside a house), confusion (man + 2 women in a house).

      Also, to make my point clearer, why are you using monospaced characters? Are you aware that proportional fonts are easier to read? They even lend themselves to some clever formatting, as used in concrete poetry, for instance.

      We also have (manu)script and Roman letters -- and they impart a totally different tone to the message. That's why people still use the abominable Comic Sans -- because it's the only alternative they've got to make a child's birthday invitation look less formal and more fun, while at the same time we use script letters to make a wedding more formal.

      Your idea of using ASCII because it uses fewer bits can be understood as "loss of information", and that's OK for simple telegrams (for instance). But not to represent all the richness necessary, for instance, in diplomatic messages.

    5. Re: Or speak English, it's 7bit clean by ciaran2014 · · Score: 1

      Example: the story of the man who tried to eat ten lions:

      Shí shì sh shì sh shì, shì sh, shì shí shí sh. Shì shí shí shì shì shì sh. Shí shí, shì shí sh shì shì. Shì shí, shì Sh shì shì shì. Shì shì shì shí sh, shì shì shì, sh shì shí sh shì shì. Shì shí shì shí sh sh, shì shí shì. Shí shì shì, shì sh shì shù shí shì. Shí shì shì, shì sh shì shí shì shí sh. Shí shí, sh shì shì shí sh sh, shí shí sh. Shì shì shì shì

      For the Chinese ideograph version and English translation, see slides 12 and 13 of

      https://web.csulb.edu/~txie/38...

      --
      Help build the anti-software-patent wiki
    6. Re: Or speak English, it's 7bit clean by Anonymous Coward · · Score: 0

      Or everyone could just learn IPA and STFU. I bet its entirety would fit nicely within an 8-bit character set and then everything else could be completely dropped for everything but historical reasons. This would eliminate the need for Unicode and would replace ASCII as "the" fixed-byte character set of choice.

    7. Re: Or speak English, it's 7bit clean by Zanadou · · Score: 1

      For the Chinese ideograph version and English translation, see slides 12 and 13 of...

      I know I'm a few days late to this thread, but how about we forgo the proprietary ppt filetype plug-in hell, and go straight to Wikipedia ?

    8. Re: Or speak English, it's 7bit clean by ciaran2014 · · Score: 1

      Excellent. I wish I'd found that when writing my comment. I'd read a printed version so I was just happy to find any version of it online.

      --
      Help build the anti-software-patent wiki
  8. I've seen that the summary by Anonymous Coward · · Score: 0

    which is a little strangely phrased.

  9. When you have 6000 keys on a typewriter by Anonymous Coward · · Score: 0

    You have to assume there will be problems.

  10. What bug? by Ark42 · · Score: 4, Informative

    The character in question is Hiragana "No", codepoint U+306E. As far as I can tell, this has existed since Unicode 1.1 and there are no differences in the Unicode metadata when compared to any other Hiragana glyph. It is marked as IsAlphabetic=True, Category=Other Letter, and NumericType=None for example. So are all the other common Hiragana glyphs. If there is a bug, it's clearly with some specific application, and not Unicode or Unicode metadata. Compare http://www.fileformat.info/inf... with any other Hiragana glyph, like http://www.fileformat.info/inf... (Hiragana "Ha").
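
The same comparison can be reproduced locally. This is a small sketch using Python's standard `unicodedata` module, which exposes the Unicode Character Database bundled with the interpreter; the `describe` helper is just a name chosen for this example.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few Unicode Character Database properties for one character."""
    return {
        "codepoint": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),      # "Lo" means Other Letter
        "numeric": unicodedata.numeric(ch, None),  # None means no numeric value
        "alphabetic": ch.isalpha(),
    }


no = describe("\u306e")  # Hiragana "No"
ha = describe("\u306f")  # Hiragana "Ha"

for props in (no, ha):
    print(props)
```

Apart from the code point and the name, the two sets of properties come out the same, which is consistent with the argument that nothing in the metadata singles out "No".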

    1. Re:What bug? by AmiMoJo · · Score: 2, Interesting

      The bug is that the Japanese, Chinese, Korean and mathematical versions of this character all share a common code point. There is no reliable way for an application to select the right character and render it properly.

      You can't mix C/J/K and mathematics in Unicode, which is a new bug beyond just the failure to support mixing C/J/K.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    2. Re:What bug? by Anonymous Coward · · Score: 1

      This is plainly wrong. The character discussed here is Hiragana, not Kanji. There is no unification for the Kana alphabets. As such, this *is* an application bug.

    3. Re:What bug? by Florian+Weimer · · Score: 1

      “-” looks very different in text and formulas, too. I don't get why people assume that you can get nice rendering without additional markup.

    4. Re:What bug? by Ark42 · · Score: 1

      I'm aware of the problems with the Han unification and certain Kanji being displayed "wrong" because the Chinese equivalent is drawn significantly differently from the Japanese Kanji, but this doesn't seem to be anything close to that kind of problem. I'm also aware of the Unicode block U+1D400 "Mathematical Alphanumeric Symbols" which is what should be used for formulas. Any application that is rendering one particular character in the Hiragana block in a different font than the rest of the Hiragana block is, quite frankly, just rendering it wrong. The bug is with the application as far as I'm concerned, and this clearly does not impact default system rendering or any common web browsers as far as I can see either.

    5. Re:What bug? by Anonymous Coward · · Score: 0

      Unicode is just one encoding out of many. Is there a universal character encoding where versions of characters have different code points? How about creating a new character encoding if none of the existing ones are satisfactory? If the Unicode consortium insists in making wrong technical decisions, then competition might be the solution.

    6. Re:What bug? by amake · · Score: 1

      There are no Chinese or Korean versions of this Japan-specific character. This is the first time I've ever heard of a "mathematical use" of this character, and I suspect the vast majority of users would be surprised at this as well.

    7. Re:What bug? by Anonymous Coward · · Score: 0

      Good luck with that. We're all nostalgic for "code pages", especially obscure ones that iconv can't handle. Even better, why don't you render your private characters as flashing gifs?

    8. Re:What bug? by Anonymous Coward · · Score: 0

      The bug seems to be that some agents try to render stuff as math that isn't marked up as math, and to be too smart about it. If you don't use MathML there should be no expectation of a formula being rendered nicely.

    9. Re:What bug? by Ark42 · · Score: 1

      Except while that is called "Hyphen-Minus" and can be used for two things, Unicode does try to solve that problem by having:
      00AD Soft Hyphen
      2010 Hyphen
      2011 Non-Breaking Hyphen
      2012 Figure Dash
      2013 En Dash
      2014 Em Dash
      2015 Horizontal Bar
      2212 Minus Sign
      2796 Heavy Minus Sign

      There is no "Mathematical Hiragana No" glyph defined by Unicode, and as such, it should never be rendered in a different font just because somebody *might* use it in a formula. The application is wrong, and there is no bug in Unicode.
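
For anyone who wants to check the list, here is a short sketch with Python's standard `unicodedata` module that prints the official name and general category of each code point above, and then confirms that no character is registered under a "mathematical hiragana" name. The name "MATHEMATICAL HIRAGANA LETTER NO" is a made-up string used only to show that the lookup fails.

```python
import unicodedata

CODEPOINTS = [0x002D, 0x00AD, 0x2010, 0x2011, 0x2012,
              0x2013, 0x2014, 0x2015, 0x2212, 0x2796]


def describe(cp: int) -> tuple:
    """Return (code point label, Unicode name, general category)."""
    ch = chr(cp)
    return (f"U+{cp:04X}", unicodedata.name(ch), unicodedata.category(ch))


def has_named_character(name: str) -> bool:
    """Return True if the Unicode database has a character with this name."""
    try:
        unicodedata.lookup(name)
        return True
    except KeyError:
        return False


for cp in CODEPOINTS:
    print(*describe(cp))

print(has_named_character("HIRAGANA LETTER NO"))
print(has_named_character("MATHEMATICAL HIRAGANA LETTER NO"))
```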

    10. Re:What bug? by Megane · · Score: 1

      My guess is that it can be used in certain numerical contexts, sort of like "No." ("number") in English. It can mean a quantity as in "n no x" (ippiki no neko), and maybe some other contexts. So something, probably an application, was coded to think of it as used in numerical contexts. The specific instance is about LaTeX, which is one of those ancient apps like emacs that is so old it had to create everything from scratch, so it's possibly specific to LaTeX or some port thereof.

      --
      #naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }
    11. Re:What bug? by AmiMoJo · · Score: 1

      It's been imported to China: http://portal.nifty.com/koneta...

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    12. Re:What bug? by t551 · · Score: 1

      The bug is not in LaTeX, but in MathJax, an HTML/Javascript reimplementation of the TeX mathematical markup for use on the web.

    13. Re:What bug? by butlerm · · Score: 1

      How can you tell that any of those pictures are from China? They all look like they are from Japan, a country that makes extremely heavy use of Chinese characters (much more than Korea for example), to me.

    14. Re:What bug? by AmiMoJo · · Score: 1

      All the text is in Chinese. The blog post itself is in Japanese, and it says that the pictures are of China.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    15. Re:What bug? by NostalgiaForInfinity · · Score: 1

      The bug is that the Japanese, Chinese, Korean and mathematical versions of this character all share a common code point. There is no reliable way for an application to select the right character and render it properly.

      What you probably mean is that an application can't select the right glyph based on the Unicode string. That is correct, but nothing specific to CJK. Without markup or metadata, Unicode often won't render as expected by readers even in Western languages. Unicode used to have its own system for marking language context, but it was dropped since it was redundant with widely used markup and metadata. If you don't know what language your string is written in, you can't pick the right glyphs, and that's true in many languages besides CJK. (CJK has good heuristics for language identification, so it's not a problem.)

      Mostly, CJK deunification is something some Westerners try to use for showing off their (usually limited) knowledge of Japanese and Chinese and demonstrate how morally superior they are to the culturally ignorant, imperialist, evil white men that, in their imagination, made up the Unicode consortium.

      In reality, the Chinese and Japanese are big boys. If they wanted to de-unify their scripts in Unicode, they'd have the political clout, and no Westerner would stop them because, frankly, nobody outside Japan or China gives a f*ck. However, I suspect they are too smart to screw themselves that way.

  11. No, it is the character pronounced as "no" by Anonymous Coward · · Score: 0

    It's the character "no", not the word for "no". As this: ã®. It is usually used like the English "of".

    1. Re: No, it is the character pronounced as "no" by Anonymous Coward · · Score: 0

      Duh /. don't support Unicode.

    2. Re: No, it is the character pronounced as "no" by Ash-Fox · · Score: 1

      Actually it does, it's just disabled in slashcode after the brief spam event when it was enabled.

      --
      Change is certain; progress is not obligatory.
    3. Re: No, it is the character pronounced as "no" by ledow · · Score: 1

      Fuck it doesn't even support ASCII, let alone Unicode.

      Try doing an English pound sign:

      £

      Nope.

    4. Re: No, it is the character pronounced as "no" by lokedhs · · Score: 0

      The pound sign is not part of ASCII though (its code point is greater than 127)

    5. Re: No, it is the character pronounced as "no" by alexhs · · Score: 1

      HTML entity pound sign (£): £
      Literal pound sign, as on my keyboard: £
      It's OK for me in preview mode.
      Maybe it's your browser's encoding that's broken ? I have it set as UTF-8. Your rendering (£ (*)) seems to indicate you sent the byte sequence for UTF-8. But I suspect that your browser set the character encoding as ISO-8859-1 in its headers.

      While I'm at it: "" <- This was supposed to be the "no" hiragana. Disallowed characters are stripped, rather than being "converted" to mojibake.

      (*) Fun fact: rendered as £ in your comment and in editing mode, but as £ in preview mode.
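
The mismatch described above is easy to reproduce. This sketch encodes a pound sign as UTF-8 and then decodes the bytes as ISO-8859-1, which is what a page does when the sender and receiver disagree about the encoding; it then shows that the damage is reversible as long as no bytes were dropped. The variable names are illustrative only.

```python
pound = "\u00a3"                             # pound sign
utf8_bytes = pound.encode("utf-8")           # two bytes: 0xC2 0xA3
misread = utf8_bytes.decode("iso-8859-1")    # each byte shown as its own character

# Reversing the wrong decode recovers the original character.
repaired = misread.encode("iso-8859-1").decode("utf-8")

# The hiragana "no" is three bytes in UTF-8, so it turns into three characters.
no_misread = "\u306e".encode("utf-8").decode("iso-8859-1")

print(utf8_bytes)
print([f"U+{ord(c):04X}" for c in misread])
print(repaired == pound)
print([f"U+{ord(c):04X}" for c in no_misread])
```

The middle character of the misread "no" is U+0081, an invisible control character, which fits the two-character garbage that shows up elsewhere in this discussion.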

      --
      I have discovered a truly marvelous proof of killer sig, which this margin is too narrow to contain.
    6. Re: No, it is the character pronounced as "no" by Anonymous Coward · · Score: 0

      Why should the American Standard Code for Information Interchange support an English pound sign??

    7. Re: No, it is the character pronounced as "no" by ledow · · Score: 1

      Then neither are basically all of the accented characters:

      ÃéÃÃÃ

      ÃÃÃÃ"Ãs

      Quarter, half, most of the currency symbols, etc.

      Extended ASCII is pretty bog-standard. But my point really? I press the pound-sign (or the other characters) on my keyboard, and Slashdot can't render them. Facebook can. The Register can. Every forum in the world can. But not Slashdot.

    8. Re: No, it is the character pronounced as "no" by epine · · Score: 1

      But I suspect that your browser set the character encoding as ISO-8859-1 in its headers.

      Drawing an inference from the not-fact that the top of the batting order in every Wikipedia FAQ does not include how to set your user agent to send the right encoding header, I'd suggest that Slashdot's long-disabled Unicode support fell far short of the mark in the first place. (2005 just called. It wants to dissolve its de facto clue-stick monopoly.)

      I authored a CJK word processor that ran under MS-DOS in the 1980s and early 1990s. Two of our linguists did our own in-house unification that ended up not so different than Unicode which came later.

      At the time that Unicode came out, our largest customer groups were embassies, diplomats (Snowden-style), and other academic linguists (with a strong representation from the Brigham Young young-adult diaspora). Maybe 40% of our new customers in the early 1990s were still running turbo XTs, 286s, and 386 castrati (16 MHz SX of the 16-bit bus resurrected). It takes a long time for the wallet of a dusty academic sinologist to recover from doling out $5000 in 1985 (true story, many times over). 20-year-old Mormon missionaries were not especially flush, either.

      Imagine this as your early-adopter power-user-base for the newly ratified Unicode 1.0 Asian language support.

      Many people at the time running Windows 3.11 were running in 4 MB. Multilingual software remained stuck in this grotesquely underpowered rut until the P54 was introduced in the mid-nineties.

      It's not just the print and display fonts that were a burden to the software of the day, but the mere Unicode code point tables themselves. 256 KB of code-point mapping tables was the rough equivalent of Google grabbing another 256 MB to process-isolate another browser tab (4 MB then, 4 GB now).

      Of course, one can code up a bespoke compression method and clever language subset overlays. I'm sure we invested more man-hours in bespoke compression methods and clever data overlays than Zuckerberg invested in coding up The Facebook, original edition.

      It's probably a good thing that Unicode was rushed to fruition, however broken it now appears to be twenty-five years later, before the first release of NCSA Mosaic. Otherwise, Unicode might have been cobbled together by Brendan Eich in a succession of 4 a.m. coding binges the week after he pounded out JavaScript.

      It's funny that this bug involves typesetting mathematics. If any software was broken with respect to Asian character support, it was surely the original TeX—paragon of infinite breakage that we all now know it to be.

      Back in the mid-to-late eighties, the very idea of sprinkling Asian fonts into math display mode would have been delegated to the savant sibling sequestered in Lamport's sound-proof attic.

    9. Re: No, it is the character pronounced as "no" by butlerm · · Score: 1

      ASCII is the American Standard Code for Information Interchange, a 7 bit encoding system. The most common strictly 8 bit encoding is ISO-8859-1, slightly expanded by Microsoft as Windows-1252, also known as Win-ASCII.

      Of course these days, everyone in their right mind should generally be using UTF-8 for transfer and storage. UCS-2 and UTF-16, though widely used internally, are basically a mistake for that kind of thing.
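
To make the differences concrete, here is a small sketch that encodes a few characters with Python's built-in codecs and shows which encodings can represent them and how many bytes they take. The `try_encode` helper and the choice of sample characters are assumptions made for this example.

```python
from typing import Optional


def try_encode(ch: str, codec: str) -> Optional[bytes]:
    """Return the encoded bytes, or None if the codec cannot represent ch."""
    try:
        return ch.encode(codec)
    except UnicodeEncodeError:
        return None


CODECS = ["ascii", "iso-8859-1", "cp1252", "utf-8", "utf-16-le"]
SAMPLES = ["A", "\u00a3", "\u20ac", "\u306e"]  # A, pound, euro, hiragana no

for ch in SAMPLES:
    for codec in CODECS:
        print(f"U+{ord(ch):04X}", codec, try_encode(ch, codec))
```

The pound sign is missing from ASCII but fits in one byte in ISO-8859-1; the euro sign is one of the characters Windows-1252 adds on top of ISO-8859-1; the hiragana character only fits in the Unicode encodings.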

  12. Why not just use English, and only English? by Anonymous Coward · · Score: 0, Troll

    In practice, English is the only language we need.

    It's already a first language to hundreds of millions of people. It's already a second language to billions. It's even a third, fourth, fifth and sometimes sixth language to many millions more.

    It's the language of international business. It's the language of international academia. It's the language of international engineering. It's the language of international aircraft control.

    It uses a sensible alphabet that's easy to represent digitally. It's a democratic language that will draw from other languages where necessary and useful. It's a language that has proven it can adapt to changing circumstances.

    We shouldn't strive to eliminate other languages, of course. They do have their value, but more as historic curiosities for linguists and historians rather than something to use on a daily basis.

    Computers are already able to do amazing things with English text. There's just no need to support other languages.

    1. Re:Why not just use English, and only English? by JustOK · · Score: 1

      Que?

      --
      rewriting history since 2109
    2. Re:Why not just use English, and only English? by Anonymous Coward · · Score: 0

      When I was a kid, I wondered this. Then I realised that communication is prerequisite to understanding non-trivial concepts, that language evolves with human development (technology, culture, movement, etc.), and that prescribing language prevents that evolution, therefore holding back human progress.

      Special interests can and often do demand too much investment into forcing a dying language to survive, but this is less harmful than forcing a living language to die.

    3. Re: Why not just use English, and only English? by Anonymous Coward · · Score: 2, Insightful

      There are more native Chinese speakers than English speakers. How about you learn Chinese and shut the fuck up?

    4. Re:Why not just use English, and only English? by Ash-Fox · · Score: 1

      In practice, English is the only language we need.

      I think Chinese is the only language we need, it's already the most spoken language in the world.

      It uses a sensible alphabet that's easy to represent digitally.

      Chinese is too.

      It's a democratic language that will draw from other languages where necessary and useful.

      Chinese does too.

      It's a language that has proven it can adapt to changing circumstances.

      Chinese does too.

      --
      Change is certain; progress is not obligatory.
    5. Re: Why not just use English, and only English? by John+Allsup · · Score: 3, Insightful

      Just write Chinese in pinyin and speak it normally. (The number of Chinese speakers does not matter; the issue is with how it is written down.) When it comes to ideograph-based languages, we would have been better off designing an entirely separate text system rather than trying to shoehorn it into a font-character paradigm derived from the needs of writing and printing Latin scripts. Indeed, having a writing system designed around the needs of calligraphy would be a useful thing, but as with ideograph-based writing systems, it is a long way from the use case we normally see with alphabet-based writing systems.

      --
      John_Chalisque
    6. Re:Why not just use English, and only English? by Anonymous Coward · · Score: 0

      Yes, please! Throw some steaks, ribs, sausages and sweetcorn on the 'que and grill 'em to perfection. Make sure you have enough for everyone here! Don't forget the brewskies, too!

    7. Re: Why not just use English, and only English? by amake · · Score: 2, Insightful

      we would have been better off

      No, you might have been better off. Chinese speakers would not. They would like to use their written language, as it exists today, on computers just like everyone else.

    8. Re: Why not just use English, and only English? by Anonymous Coward · · Score: 1

      https://en.wikipedia.org/wiki/Romanization_of_Chinese

      they tried, but failed.

      If you actually know a little bit more than "pinyin", you should understand why they failed.

    9. Re:Why not just use English, and only English? by Anonymous Coward · · Score: 0

      Both are of course wrong.

      There are plenty of ideas and concepts that can't be expressed in either English or Chinese.
      Learning multiple languages doesn't just make it possible for you to communicate with more people, it makes it possible for you to express thoughts that you earlier had problems getting a grip on.

      Perhaps that is why people who advocate only one language seem so simple-minded.

    10. Re:Why not just use English, and only English? by Anonymous Coward · · Score: 0

      Kanga Roo?

    11. Re: Why not just use English, and only English? by Anonymous Coward · · Score: 0

      because "symbols" occupy less space, are easier to comprehend without "reading" (I find it easier and less time-consuming to glance at Chinese characters and comprehend their meaning than having to read the same sentence or paragraph in romanized characters), and to be quite honest, are much more aesthetically pleasing.

    12. Re: Why not just use English, and only English? by Anonymous Coward · · Score: 0

      If you stop to think about why we're having *this* discussion in English, you should find your answer! Good luck.

    13. Re: Why not just use English, and only English? by tepples · · Score: 1

      "symbols" occupy less space

      Not if you have to make the font bigger to keep the strokes from touching each other. By that point, you could have used a smaller font on the Latin.

    14. Re: Why not just use English, and only English? by AmiMoJo · · Score: 1

      It would have been absolutely fine if they had just stuck to one codepoint per character and not tried to merge them.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    15. Re: Why not just use English, and only English? by Bing+Tsher+E · · Score: 1

      Then it would need to be written as it exists today, which would mean some sort of calligraphic text system. That wouldn't have been possible with the design of text-based systems 30 years ago, where character bitmaps were stored in a lookup ROM for a rasterizer to use, but it shouldn't be difficult today.

    16. Re:Why not just use English, and only English? by ArcadeMan · · Score: 1

      We shouldn't strive to eliminate other languages, of course. They do have their value, but more as historic curiosities for linguists and historians rather than something to use on a daily basis.

      Quoi?

    17. Re: Why not just use English, and only English? by Anonymous Coward · · Score: 0

      Shoulda woulda coulda. It's a little late for that dontyathink?

    18. Re: Why not just use English, and only English? by interval1066 · · Score: 2

      I'll buy that, but even native Sinolanguage speakers have told me the learning curve for an alphabet is much shallower. Like, MUCH shallower. And since most modern technical terms have Greek and Latin roots, sometimes it's simpler for them to just use the Latin words, otherwise they have to convert the terms to native sounds using bizarre and difficult to use conversion systems. I do agree however that it would have been nice to use a system similar to Kanji right from the beginning had we had one.

      --
      Python: 'And then suddenly you have a language which says "we're all stuck with whatever the whiniest coder wants".'
    19. Re: Why not just use English, and only English? by interval1066 · · Score: 2

      English is the official technical language for flight. ALL international pilots, military and civ, MUST know enough English to pass flight school and to fly international commercial flights. It's also the official language of sea navigation, but to a lesser extent. I don't think you need to be as proficient. And English with a number of loan words from Greek and Latin is used in international Engineering. But yeah, English is spoken by the majority of technical people around the world as a common information exchange language.

      --
      Python: 'And then suddenly you have a language which says "we're all stuck with whatever the whiniest coder wants".'
    20. Re:Why not just use English, and only English? by interval1066 · · Score: 1, Informative

      I think Chinese is the only language we need, it's already the most spoken language in the world.

      Only in head count, not by region. If the world was populated only by the Chinese, which seems to be their goal, then yes, Chinese is the most spoken language in the world. However, if you break that fact down by dialect, your statement is really weak. Mao's goal to have the entire PRC speak Mandarin really failed.

      It's a democratic language that will draw from other languages where necessary and useful.

      Not really. Mao tried to force all Chinese to speak Mandarin, and he failed miserably. Kinda the exact opposite of "Democratic". But of course that's not the fault of the language per se...

      It's a language that has proven it can adapt to changing circumstances.

      Chinese may be, but if Japanese is an example, and Japanese is adapted from Chinese by Han explorers to Japan in the Iron Age; it's not very adaptable at all. The Japanese have developed THREE different writing systems to cope with some shortcomings of the language (only two tenses, underdeveloped pronoun system, etc). That may be a shortcoming of Japanese, but Japanese is just a symptom of a language root that isn't very forgiving. I will say however that a language that can be nuanced such that 9 different meanings from changing the tone of one word may be more flexible than I give it credit for.

      --
      Python: 'And then suddenly you have a language which says "we're all stuck with whatever the whiniest coder wants".'
    21. Re:Why not just use English, and only English? by Anonymous Coward · · Score: 0

      > In practice, English is the only language we need.

      How convenient for you, since you already speak it and don't have to make any effort to learn...

    22. Re: Why not just use English, and only English? by Anonymous Coward · · Score: 0

      Chinese uses a sensible alphabet? What are you smoking?

    23. Re: Why not just use English, and only English? by Anonymous Coward · · Score: 0

      The argument he was making is that bitmap style fonts that are used are a bad solution to the ideograph problem. They should develop their own better solution instead of shoehorning a rendering system optimized for Latin scripts.

    24. Re: Why not just use English, and only English? by dunkelfalke · · Score: 1

      ICAO general rules and regulations

      4.4.1c - ICAO languages are English, Spanish, French, Arabic, Russian, and Chinese.

      --
      "It's such a fine line between stupid and clever" -- David St. Hubbins, Spinal Tap
    25. Re:Why not just use English, and only English? by KingMotley · · Score: 1

      I think Chinese is the only language we need, it's already the most spoken language in the world.

      That is false. English is the most spoken language in the world. Chinese is the most popular primary language.

    26. Re:Why not just use English, and only English? by KingMotley · · Score: 1

      Not even by head count. 1.5 billion people can speak English, compared with 1.0 billion who can speak "Chinese".

    27. Re:Why not just use English, and only English? by AmiMoJo · · Score: 1

      Chinese may be, but if Japanese is an example, and Japanese is adapted from Chinese by Han explorers to Japan in the Iron Age; it's not very adaptable at all. The Japanese have developed THREE different writing systems to cope with some shortcomings of the language (only two tenses, underdeveloped pronoun system, etc). That may be a shortcoming of Japanese, but Japanese is just a symptom of a language root that isn't very forgiving. I will say however that a language that can be nuanced such that 9 different meanings from changing the tone of one word may be more flexible than I give it credit for.

      That's not right. The exact origins of the Japanese language are lost to pre-history, only guessed at. It was the writing system that was brought over from China. Then katakana and hiragana were added to support the parts of the Japanese language that can't be written adequately in the Chinese system. They were simply added to support the way the language was already spoken, not to make up for any limitations.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    28. Re: Why not just use English, and only English? by Anonymous Coward · · Score: 0

      American website.

    29. Re:Why not just use English, and only English? by Fire_Wraith · · Score: 1

      Keep in mind that India, which is nearly as populous as China, is a predominantly English speaking country. If sheer number of speakers is a key, the future will probably turn out to be something like Firefly, with a mishmash of Chinese and English.

    30. Re:Why not just use English, and only English? by JustOK · · Score: 1

      nuq ghe''or vIghel SoH?

      --
      rewriting history since 2109
    31. Re:Why not just use English, and only English? by Ash-Fox · · Score: 1

      I honestly don't really care for the argument. I just think it's a stupid argument to make because you can apply it to other languages.

      --
      Change is certain; progress is not obligatory.
    32. Re: Why not just use English, and only English? by Ash-Fox · · Score: 1

      Chinese uses a sensible alphabet? What are you smoking?

      Tell me more, educate me.

      --
      Change is certain; progress is not obligatory.
    33. Re: Why not just use English, and only English? by Anonymous Coward · · Score: 0

      Educate yourself. You could start by learning what an alphabet is.

    34. Re: Why not just use English, and only English? by Ash-Fox · · Score: 1

      Done, don't see the issue with Chinese writing and its consonants. You have failed to identify the issue.

      --
      Change is certain; progress is not obligatory.
    35. Re:Why not just use English, and only English? by Ash-Fox · · Score: 1

      According to http://www.infoplease.com/ipa/...

      Chinese is apparently first when it comes to native speakers. What data distinguishes whether someone can speak it as a second language, and what level of language knowledge does the person need to be counted as speaking that language?

      --
      Change is certain; progress is not obligatory.
    36. Re: Why not just use English, and only English? by radarskiy · · Score: 1

      Note that problems also involve Korean, which is written with an alphabet not ideographs.

    37. Re: Why not just use English, and only English? by Anonymous Coward · · Score: 0

      > You have failed to identify the issue.

      No, you have failed to understand the issue.

      Written Chinese doesn't use a sensible alphabet, because it doesn't use an alphabet.

  13. single global language by Anonymous Coward · · Score: 0

    english haters or america/british haters will disagree, but english is the best choice--as the most widely spoken language (among all speakers, not just natives. sorry china, sorry arabs). it is also the language of the seas, and the language of the air, as well as the primary global language for commerce, internet and diplomacy, and it is often the fallback language among native speakers of non-english languages.

    how many other languages are as easily typed and written (use 26 or fewer Latin characters, do not use accent marks, and use Arabic numerals) and as widely spoken (billion+ speakers)? none.

    when will it happen?

    1. Re:single global language by AmiMoJo · · Score: 1

      English is fine for factual information like air traffic control or shipping, but it would never work for Japanese society. There are too many important things you can't adequately express in English that are essential to Japanese people. Same with Chinese.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    2. Re:single global language by Anonymous Coward · · Score: 0

      Not for the next century or two. Language shapes the way we think and our culture shapes our language. Why do you think language gets overhauled in Orwell's 1984? You're suggesting almost the same thing. English is one of the most basic languages and one can definitely argue that it is quite limited and limiting in certain ways.

    3. Re:single global language by dunkelfalke · · Score: 1

      Actually, English isn't very good for factual information either. It has too many homonyms, a very inconsistent spelling, too ambiguous sentences even with the very strict word order English has to use, no single language authority and too many standard variations.

      Other Germanic languages are much more precise, as are Slavic languages. Due to the more complicated grammar and being synthetic instead of analytic, the meaning of a sentence is clear even if words in the sentence are shifted around, the spelling is usually also phonemic (you write words as you hear them - regular spelling).

      --
      "It's such a fine line between stupid and clever" -- David St. Hubbins, Spinal Tap
    4. Re:single global language by Xtifr · · Score: 1

      Why do you think language gets overhauled in Orwell's 1984?

      Because Orwell was a little too enamored of the so-called "Sapir-Whorf hypothesis"? I hate to break it to you, but, despite its many obvious parallels to the real world, 1984 was ultimately a work of fiction.

      While it's undeniable that language has some influence on culture and thought, the idea that it can be as influential as proposed by some early SF writers (e.g. Orwell, Jack Vance's The Languages of Pao, or Samuel Delany's Babel-17) is mostly discredited.

  14. Nitpick by msobkow · · Score: 5, Informative

    This is not a "Unicode bug". It is a rendering bug exhibited by some applications.

    --
    I do not fail; I succeed at finding out what does not work.
    1. Re:Nitpick by AmiMoJo · · Score: 2

      How is an application supposed to know if a random character is Japanese, Chinese, Korean or mathematical? It would need some kind of strong AI to interpret and understand the text. It's a Unicode bug, merged characters are impossible to render correctly all the time because apps are forced to guess which font to use.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    2. Re:Nitpick by msobkow · · Score: 1

      Ask the people who wrote the software that doesn't exhibit the bug. Obviously it can be done.

      --
      I do not fail; I succeed at finding out what does not work.
    3. Re:Nitpick by Anonymous Coward · · Score: 0

      Since Unicode decided to unify characters which look the same in different languages, this is by design, not a bug.

      You could say there is a design flaw instead, but I can't agree with that. Answer these questions:

      Why is using the code point to determine the language supposed to be a feature?

      What does the part of 'en_US.UTF-8' before the dot mean to you?

      How do we distinguish different languages using Latin alphabet by code point?

    4. Re:Nitpick by Anonymous Coward · · Score: 0

      Just came to see the one comment that mattered

    5. Re:Nitpick by Anonymous Coward · · Score: 0

      Please name said software. The best CJK labeler that I know required 10,000 man-hours and 5 AI PhDs to develop, and is only 99.91% accurate.

    6. Re:Nitpick by Anonymous Coward · · Score: 0

      Not necessarily.

      You're assuming that other software has the same features and doesn't exhibit the bug. But it's quite possible that software that doesn't exhibit the bug is just rendering all the text in one font and isn't trying to make mathematical formulas look nice.

    7. Re:Nitpick by AmiMoJo · · Score: 2

      Software that doesn't have this bug only avoids it by not supporting mathematical symbols. So far there is no known software that avoids the CJK confusion problem either.

      Most software doesn't even try. How many programmers are even aware of the issue? No Unicode library is immune. It's a problem with the standard that can only be fixed by starting fresh with about 150,000 new CJK characters, and then updating all fonts and libraries to handle translation and equivalence.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    8. Re:Nitpick by thegarbz · · Score: 1

      In other news a new bug is shown to exhibit a behaviour where some mathematical programs substitute a Japanese character into the formula.

      The problem is it can't be done. Not without intelligent user / designer input (such as signifying that the unicode to be displayed is Japanese and not a maths formula). If an application is correct in determining one context it will be incorrect in determining the other.

    9. Re:Nitpick by Kjella · · Score: 3, Informative

      How is an application supposed to know if a random character is Japanese, Chinese, Korean or mathematical? It would need some kind of strong AI to interpret and understand the text. It's a Unicode bug, merged characters are impossible to render correctly all the time because apps are forced to guess which font to use.

      Except font encoding has never been part of the character encoding, you might want your English text in Arial, your French in Times New Roman and the formula in Courier, but Unicode doesn't encode that. You might argue that this is not a bug, that it's simply out of scope and should be solved by a higher level encoding like <font="some japanese font">konnichiwa</font><font="some chinese font">ni hao</font> and not plaintext Unicode. That's what the Unicode consortium says and if you express it as simply a style issue, it actually sounds plausible.

      On the other hand, you might argue that there's no reasonable way to map a "unihan" character to a glyph except as a band-aid, since the CJK styles are distinctly different and so any comprehensive font should have three variations; it shouldn't take three fonts to make a mixed CJK document look correct, just one. On that view, this information belongs on the lowest level and should be passed along as you copy-paste CJK snippets or pass them around in whatever interface or protocol you have, otherwise everything will need a document structure and not just a string.

      I don't think they should "unmerge" and duplicate all the han characters, that'd be silly. What they should do is add CJK indicators - say HANC, HANJ, HANK like for bi-directional text, only simpler with no nesting just one indicator applying until superseded by another. Like (HANJ) konnichiwa (HANC) ni hao and the former will render as a Japanese han, the latter as a Chinese. If it doesn't have any indicator, well take a guess. Am I missing something blindingly obvious or would this trivially solve the problem?
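
      The indicator idea above can be sketched in a few lines of Python. Everything here is hypothetical: Unicode defines no HANC/HANJ/HANK characters, so three private-use code points stand in for them, and `split_runs` is an illustrative name.

```python
# Hypothetical markers: private-use code points stand in for the proposed
# HANC / HANJ / HANK indicators. Unicode itself defines no such characters.
HANC, HANJ, HANK = "\uE000", "\uE001", "\uE002"
MARKERS = {HANC: "zh", HANJ: "ja", HANK: "ko"}

def split_runs(text, default=None):
    """Split text into (language, run) pairs.

    A marker applies to everything after it until another marker
    supersedes it; there is no nesting. Text before the first marker
    gets `default`, meaning the renderer has to guess.
    """
    runs = []
    lang = default
    current = []
    for ch in text:
        if ch in MARKERS:
            if current:
                runs.append((lang, "".join(current)))
                current = []
            lang = MARKERS[ch]
        else:
            current.append(ch)
    if current:
        runs.append((lang, "".join(current)))
    return runs

print(split_runs(HANJ + "\u76f4" + HANC + "\u76f4"))
```

      A renderer could then pick a Japanese or Chinese glyph variant per run, falling back to a guess only where the language is None.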

      --
      Live today, because you never know what tomorrow brings
    10. Re:Nitpick by Anonymous Coward · · Score: 0

      How is an application to know if I write English, French, Spanish, German, Italian, Portuguese, or any of the other languages that share the Roman alphabet? Simply: the user tells it. Demanding different code points for different languages is ridiculous. What will be next? Different code points for serif and sans serif? Different code points for regional dialects (say British English vs US English)?

      It's just absurd.

    11. Re:Nitpick by loufoque · · Score: 1

      In LaTeX, it's mathematical if it occurs in a math context, which is separated by $ characters.

    12. Re:Nitpick by Anonymous Coward · · Score: 0

      You know what? I'm using French characters to write this post, they look like English but it's really not, why? I just like it this way.

      And for AmiMoJo and whoever is giving score to his post, this is sarcastic.

    13. Re:Nitpick by AmiMoJo · · Score: 2

      I agree, font encoding should not be part of the character encoding. Unicode even screws that up though, because there are things like text direction marks in it. Anyway, the problem is that often you have text without metadata. A file name, audio file metadata, a plain text database entry etc. You have to pick a font to render it, and the choice depends on the language because thanks to Unicode it's impossible to have a universal all-language font.

      You could have meta characters as you suggest, but that isn't what Unicode is supposed to be for. It's a character encoding scheme, not a metadata encoding scheme.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    14. Re:Nitpick by Megane · · Score: 1

      This is not a unified character, it is Japanese-only. Some program (apparently LaTeX) is using the wrong font because it thinks it is part of a mathematical equation, even to the point of showing the wrong font for the character in a font character viewer window.

      --
      #naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }
    15. Re:Nitpick by Kjella · · Score: 1

      You could have meta characters as you suggest, but that isn't what Unicode is supposed to be for. It's a character encoding scheme, not a metadata encoding scheme.

      Actually I was thinking of it more like a "sticky" composite character, like you can have a + circle = å you'd have unihan + HAN(C|J|K) = "right" glyph while:

      a) Extending existing single-language CJK documents with just one character
      b) Preserving backwards compatibility with all current CJK systems
      c) Avoiding any complex CJK conversion functions
      d) Creating a simple way to override with "show as C/J/K"

      It would require adding a bit of intelligence to copy-paste for preservation, like:

      (HANC)abcde -> copy "cde" -> (HANC)cde

      But if the application doesn't, well you'll still get the correct unihan. Also on paste it could remove redundant markers, but they'd be harmless. Then you could have universal fonts with as little invasive changes as possible. The alternative would be creating literally hundreds of thousands of new code points.
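
      The copy-paste rule could look roughly like this in Python. As before, the marker code points and the function name `copy_with_marker` are assumptions made up for illustration; the indices count markers as ordinary characters.

```python
# Hypothetical markers (private-use code points), as in the proposal above.
HANC, HANJ, HANK = "\uE000", "\uE001", "\uE002"
MARKERS = (HANC, HANJ, HANK)

def copy_with_marker(text, start, end):
    """Copy text[start:end], re-prefixing the marker in effect at `start`."""
    active = ""
    for ch in text[:start]:
        if ch in MARKERS:
            active = ch  # the most recent marker wins
    fragment = text[start:end]
    # If the fragment already begins with its own marker, no prefix is needed.
    if fragment[:1] in MARKERS:
        return fragment
    return active + fragment

# (HANC)abcde -> copy "cde" -> (HANC)cde
copied = copy_with_marker(HANC + "abcde", 3, 6)
print(copied == HANC + "cde")
```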

      --
      Live today, because you never know what tomorrow brings
    16. Re:Nitpick by AmiMoJo · · Score: 1

      That wouldn't really improve things IMHO, because you would still be reliant on the application knowing how to handle the character. In practice what would you do, add it to the start of file names? Then on all current software your filename would start with a little box representing an unknown character. The whole concept of composite characters is ridiculous as well, they should all get their own code points and let the font system handle saving some memory by re-using parts of glyphs. Otherwise your simple character count suddenly requires a massive look-up table of composite characters.

      The goal should be to make handling Unicode text as simple as possible without huge code libraries, metadata tables and the like. Everything else is prone to screw ups - for example with the text direction mark, there was a security flaw where you could include one in a file name to make "document.fdp.com" look like "document.moc.pdf". The right-to-left override character is placed after the first period and is invisible.
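
      For what it's worth, spotting that kind of file name is not hard. Here is a small Python sketch; the function name `find_bidi_controls` is made up for illustration, and the set lists the explicit bidirectional formatting characters as I understand them.

```python
import unicodedata

# Explicit bidirectional formatting characters.
BIDI_CONTROLS = {
    "\u200e", "\u200f",                                # LRM, RLM
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}

def find_bidi_controls(name):
    """Return (index, character name) for every bidi control in `name`."""
    return [
        (i, unicodedata.name(ch))
        for i, ch in enumerate(name)
        if ch in BIDI_CONTROLS
    ]

# The override sits right after the first period and is invisible on screen.
suspicious = "document.\u202efdp.com"
print(find_bidi_controls(suspicious))
```

      A file manager could refuse such names or show the control characters explicitly instead of honoring them.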

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    17. Re:Nitpick by Megol · · Score: 1

      The problem is outside the problem domain Unicode attempts to solve, so it isn't strange that it doesn't solve it. For some other problems Unicode tries to solve, the result is a mess (example: bidirectional text), so that is probably a good thing.

    18. Re:Nitpick by Kjella · · Score: 1

      That wouldn't really improve things IMHO, because you would still be reliant on the application knowing how to handle the character. In practice what would you do, add it to the start of file names? Then on all current software your filename would start with a little box representing an unknown character.

      Yes, until the software got updated to treat it as a non-printing character but it wouldn't make everything unreadable, there's bad and there's much much worse.

      The whole concept of composite characters is ridiculous as well, they should all get their own code points and let the font system handle saving some memory by re-using parts of glyphs. Otherwise your simple character count suddenly requires a massive look-up table of composite characters.

      It already does for a huge number of reasons. Oh and if you thought giving every character a code point would mean a 1:1 mapping to glyphs that's still wrong, many characters map to alternate glyphs depending on the context. For example Arabic and Latin cursive characters substitute different glyphs to connect glyphs together depending on whether the character is the initial character in a word, the final character, a medial character or an isolated character.
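
      A small Python illustration of why a "simple character count" is already not simple: the same å can be one code point or two. The helper `count_user_characters` is an illustrative approximation only; proper grapheme segmentation follows the rules in Unicode Standard Annex #29, which this does not implement.

```python
import unicodedata

composed = "\u00e5"     # å as a single precomposed code point
decomposed = "a\u030a"  # a followed by COMBINING RING ABOVE

def count_user_characters(text):
    """Rough count: code points that are not combining marks."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

print(len(composed), len(decomposed))
print(unicodedata.normalize("NFC", decomposed) == composed)
print(count_user_characters(composed), count_user_characters(decomposed))
```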

      The goal should be to make handling Unicode text as simple as possible without huge code libraries, metadata tables and the like. Everything else is prone to screw ups - for example with the text direction mark, there was a security flaw where you could include one in a file name to make "document.fdp.com" look like "document.moc.pdf". The right-to-left override character is placed after the first period and is invisible.

      Well, you should have a filter there anyway because "foo/bar.*<hello?>" is not a valid filename either, though it's a valid Unicode string. That you don't restrict it to the valid subset isn't the standard's fault.

      --
      Live today, because you never know what tomorrow brings
    19. Re:Nitpick by GuB-42 · · Score: 1

      First of all, the hiragana "no" is always Japanese, not Chinese, not Korean. The CJK unification is only about han characters (in Japanese, that's kanji).
      As for maths, there are usually markers to indicate we are in an equation, which makes sense because Unicode is not powerful enough for this: fractions, integrals, matrices, etc. cannot be rendered with just code points. So in this case Unicode provides the characters (roman and Greek letters, numbers, mathematical symbols, the hiragana "no", etc.) and a higher level language (like MathML or LaTeX) deals with the structure. Because of this, Unicode doesn't have to dedicate a special page for mathematical versions of regular characters: the software can easily differentiate. If it is a MathML / LaTeX "$" block, render it with the math font, otherwise use the regular font.
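
      A toy Python version of that rule, assuming a simplified LaTeX-like input where "$" always toggles math mode (no escaped "\$", no "$$" display math). The function name `segment_math` is made up for illustration.

```python
def segment_math(text):
    """Split text into ("text", ...) and ("math", ...) segments.

    Pieces at even positions after splitting on "$" are ordinary text,
    pieces at odd positions are inside a math block. Empty pieces are
    skipped.
    """
    segments = []
    for index, piece in enumerate(text.split("$")):
        if not piece:
            continue
        kind = "math" if index % 2 == 1 else "text"
        segments.append((kind, piece))
    return segments

# The hiragana "no" outside the dollar signs stays in the regular font.
print(segment_math("\u9762\u7a4d\u306e$x^2$\u3067\u3059"))
```

      The font choice then depends only on the segment kind, never on the code point itself.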

  15. The author of the second linked article by Anonymous Coward · · Score: 0

    displays a lack of maturity (at least in this article) in addressing the issues surrounding CJK unification. He writes: "I can understand the rationale behind Han Unification but, since I have the emotional capacity of a child and just want things to work, I’m going to say that it’s dumb and stupid and I hate it." Then he notes in the comments (in response to "Eee" explaining the reasons the author claimed to understand): "I'm not sure if anything you've said is wrong, because I think it was mostly over my head, sorry :x"

    Apart from that point, the debate over CJK unification IMO boils down to the role a character encoding should play in the display of information (i.e. what level should language-specific information be stored?).

  16. Just a rendering problem? by Applehu+Akbar · · Score: 1

    The character in the Unicode table looks like a mashup of the hiragana (grammar-forming) version of the character, and the katakana (used as we do italics) form.

  17. ruby is in Firefox by Anonymous Coward · · Score: 0

    While there is the technology to do this on the web (the HTML ruby element), you won’t see it much. It just doesn’t work on all web browsers (like Firefox),

    Wrong, in fact Firefox is the only browser with almost full ruby support and for older Firefoxen there have been ruby extensions for ages.

    and few people choose to use it on their websites.

    Their problem.

  18. Mandarin dependency and homophone confusion by tepples · · Score: 4, Interesting

    Just write chinese in pinyin and speak it normally. (the number of Chinese speakers does not matter, the issue is with how it is written down.)

    "Chinese" is not a single spoken language. A passage written in one Chinese language, such as Mandarin, is often readable in another Chinese language, such as Cantonese, so long as they're written with Han characters. It's as if French could be read as Italian or Spanish with the same characters. In addition, different words that sound the same in a given Chinese language due to historic sound changes usually have different Han characters. They may end up sounding different in a different Chinese language whose different historic sound changes produced different homophone sets. Pinyin, on the other hand, depends on Mandarin and confuses homophones.

    1. Re:Mandarin dependency and homophone confusion by interval1066 · · Score: 1

      Something I've always been curious about though; my understanding is that a Japanese speaker can understand written Chinese, to a certain extent. Is that not correct? I know that the reverse isn't really possible due to the Japanese use of Kana. But if the text is written using Han glyphs Cantonese, Mandarin, Hunan, Kan, Taiwan, etc, and Japanese speakers can sort-of understand each other's written stuff, or is that just nonsense?

      --
      Python: 'And then suddenly you have a language which says "we're all stuck with whatever the whiniest coder wants".'
    2. Re:Mandarin dependency and homophone confusion by Fire_Wraith · · Score: 4, Informative

      To a degree, yes, because the symbols themselves are the same. Note however that some of the original Chinese characters have been altered in use (simplified) by the PRC in the 50s and 60s, but those are only used in mainland China (and I think Singapore maybe?), but not Taiwan or Japan. Aside from that though, the characters for something like 'University' would still be a combination of the character for 'large' and the character for 'school'. It might be pronounced totally differently, but could be read and understood by all. Fun fact: The proper reading of the characters for the country of "Japan" in Japanese is actually "Nihon" or "Nippon." However, in certain Chinese dialects, the characters that comprise it are pronounced more like "Zep-pen" or "Japan." What's also fascinating to consider is that Korean is the same way, but that in modern usage you hardly ever see the Chinese characters (Hanja) used, even though I think they're still taught in some schools. Almost everything I saw when I was in Korea was in Hangul, the Korean native alphabetic script.

    3. Re:Mandarin dependency and homophone confusion by phantomfive · · Score: 1

      But if the text is written using Han glyphs Cantonese, Mandarin, Hunan, Kan, Taiwan, etc, and Japanese speakers can sort-of understand each other's written stuff, or is that just nonsense?

      I went to China once with a professor of ancient Korean. He couldn't speak any Chinese, but he learned enough Chinese characters from studying Korean that he could write well enough to communicate with a taxi driver. They had to write to each other, they couldn't speak.

      Essentially, there was an old style of Chinese that everyone wrote in (but probably no one ever spoke, including Chinese). Over time, Japan, Korea, Hong Kong and eventually all of China modified the writing system to match the speaking system. (Here is a really good link on that topic).

      The meaning of the individual characters is mostly the same. Sometimes though, you combine two characters together to make a complete word, and there is more variation (in two-character words) between the different countries. Also, there is some variation in the styles of various characters.

      There's more to say but that's probably an earful already lol

      --
      "First they came for the slanderers and i said nothing."
    4. Re:Mandarin dependency and homophone confusion by aix+tom · · Score: 2

      Another interesting little tidbit I stumbled upon while learning Japanese: "Peking" is written with the characters "North-Capital", "Nanking" is written with the characters "South-Capital", while "Tokio" is written with the characters "East-Capital".

    5. Re:Mandarin dependency and homophone confusion by Anonymous Coward · · Score: 0

      On that same thought pattern, is "To-kyo" the reverse of "Kyo-to"? Because it sure looks like it to an otherwise uninformed English speaker. (And does that mean it means "Capital-East"?)

      Just curious.

    6. Re:Mandarin dependency and homophone confusion by billyswong · · Score: 1

      No. Kyoto is literally capital-city, or capital-capital (slashdot chopped all Han characters :( )

    7. Re: Mandarin dependency and homophone confusion by jrumney · · Score: 1

      If you write it with long vowels spelt out, Toukyou is clearly different than Kyouto, though the Kyou in both cases is the same, Kyoto being the former capital. On the subject of Chinese being able to understand written Japanese, it is only partially the case, as Chinese characters are not always used for their meaning in Japanese. Sometimes they were used for their (Middle Chinese) sound.

    8. Re:Mandarin dependency and homophone confusion by Fire_Wraith · · Score: 1

      No, it's a different character for 'To'.

      Different characters can have the same phonetic representation in Japanese, which is one of the tricky parts of the language. English has homonyms too, though they're usually easier to differentiate based on context. Kanji puns from this are definitely a big deal in Japanese humor, as you might expect.

      Also, fun fact, prior to the Tokugawa era where Tokyo became the capital, it was called Edo.

  19. Language markup by tepples · · Score: 1

    Please name said software.

    Any HTML renderer ought to be able to tell an element with lang="zh-Hans" (Chinese using simplified characters) from one with lang="ja" (Japanese).
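
    As a rough sketch of how a renderer could act on that attribute: look the language tag up in a table and fall back to shorter prefixes of the tag. The font names below are invented placeholders, and the prefix-trimming lookup is a simplifying assumption, not how any particular browser does it.

```python
# Hypothetical font names; a real renderer would consult the system font list.
FONT_BY_LANG = {
    "ja": "ExampleSans JP",
    "zh-hans": "ExampleSans SC",
    "zh-hant": "ExampleSans TC",
    "ko": "ExampleSans KR",
}


def font_for_lang(lang_tag, default="ExampleSans"):
    """Pick a font for a lang attribute value such as 'ja' or 'zh-Hans-CN'.

    Tries the full tag first, then drops trailing subtags one at a time.
    """
    tag = lang_tag.lower()
    while tag:
        if tag in FONT_BY_LANG:
            return FONT_BY_LANG[tag]
        tag = tag.rpartition("-")[0]  # "zh-hans-cn" -> "zh-hans" -> "zh" -> ""
    return default


for example in ("ja", "zh-Hans", "zh-Hans-CN", "en-US"):
    print(example, "->", font_for_lang(example))
```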

  20. They're trying to unify *similar* characters by ciaran2014 · · Score: 4, Informative

    A lot of people complain about the idea of unification without understanding it. I can't judge if unicode's unification is great or awful. The English-speaking media constantly says it's awful, but it's usually clear the authors don't know what unification is, who's driving it, or how unicode's work compares to what existed beforehand, so they can only be ignored. (They're sometimes trying to spin up some clickbait about ignorant westerners imposing blah blah blah on Asia, which just shows they know nothing about the topic.)

    The issue:

    There's a certain number of symbols which have been copied from one East Asian language to another. They're the same symbol, so unicode has one slot for that symbol. Then there's a second category where the symbol has been copied, but one group draws it a little different (the Japanese might like to put a little flick at the end of one line, or the Chinese draw the line a little slantier). And a third category where one group has developed a simplified symbol, which means again the traditional and the simplified symbols are the same thing but drawn differently. The two symbols are equivalent, the new one is just a new suggestion for how to draw it.

    Unification is about having one slot for the symbols in categories two and three and leaving it to the font to decide how to display it.

    (Unicode uses more precise terms, but I'm calling them "symbols" and "slots" for simplicity.)

    A disadvantage to this approach is that there can't be a font which would display a symbol both the way a Japanese would draw it and the way a Chinese would draw it. Fonts have to choose one style to draw each unified symbol.

    An advantage of this approach is that new languages and dialects can be supported without needing another 100,000 slots per language or dialect (we do all know there are more than three East Asian languages, don't we?), and it's much easier for fonts to add support for all the East Asian languages because once they've done Chinese, Japanese is automatically almost finished.

    Here are some example symbols:

    https://en.wikipedia.org/wiki/...

    unicode.org's FAQ also has clarifications:

    If the character shapes are different in different parts of East Asia, why were the characters unified?
    http://www.unicode.org/faq/han...

    Isn't it true that some Japanese can't write their own names in Unicode?
    http://www.unicode.org/faq/han...

    (All that said, it's been years since I looked into this so there's a chance I've gotten some detail wrong, but I'm confident it's a good summary of the issue.)

    --
    Help build the anti-software-patent wiki
    1. Re:They're trying to unify *similar* characters by Anonymous Coward · · Score: 0

      you are summarizing A issue, not THE issue the author was making up.

    2. Re:They're trying to unify *similar* characters by ciaran2014 · · Score: 1

      > you are summarizing A issue, not THE issue the author was making up.

      Yes, my post only relates to the last line of the summary.

      --
      Help build the anti-software-patent wiki
    3. Re:They're trying to unify *similar* characters by AmiMoJo · · Score: 1

      An advantage of this approach is that new languages and dialects can be supported without needing another 100,000 slots per language or dialect (we do all know there are more than three East Asian languages, don't we?), and it's much easier for fonts to add support for all the East Asian languages because once they've done Chinese, Japanese is automatically almost finished.

      The first one isn't really an advantage, since there is no shortage of code points. There are massive disadvantages though.

      From a software point of view it would be good to have universal fonts that can render any Unicode character correctly for anyone in the world. The Unicode consortium has tried to support this by splitting some of the more distinct symbols into separate code points for each language, but it's far from complete and every new version adds many more. The FAQ is a joke - when people point out that some Japanese people can't even write their name in Unicode they just wave their hands and say "oh we will fix it one day, and the older standards are just as bad!"

      Back to the point though, universal fonts are impossible. If it's a question of saving memory you just create a font format that lets you assign the same glyph to multiple code points. Instead the application has to figure out what language to use and load an appropriate font. If the wrong one is selected the result might be more or less readable, but again the Unicode guys acknowledge that there are many instances where it isn't and are still trying to fix it by adding new code points. New code points for existing characters are a disaster because older apps/fonts don't support them.

      Plus, I think we should be aiming higher than just "legible". There is a reason why a lot of Japanese software still uses Shift-JIS, it's because Unicode tends to render badly, especially on systems where multiple languages are in use. There are Unicode libraries with masses of code that tries to sort it all out, but it's hacky and incomplete.

      Neither option is perfect, but unification has really hindered adoption of Unicode in East Asia. Even when it is implemented, it tends to be broken.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    4. Re:They're trying to unify *similar* characters by ciaran2014 · · Score: 1

      Thanks for this reply!

      Can you give me an example of a Japanese name that can't be written in unicode? I keep hearing English speakers mention this problem but I've never seen exactly what the problem is.

      --
      Help build the anti-software-patent wiki
    5. Re:They're trying to unify *similar* characters by ciaran2014 · · Score: 1

      > it would be good to have universal fonts that can render
      > any Unicode character correctly for anyone in the world

      But a line has to be drawn between substance and style. There are two (main) ways to draw the number 4. One has a slanty line and is closed at the top, the other is made of straight lines and is open at the top. Or the number 7. For English speakers it's two lines, but for French speakers there's also a horizontal bar across the middle. Should unicode have two 4's and two 7's, or should this be left to the font? The unicode consortium (AFAIK) has decided to give 4 and 7 just one code point each and let the font decide how to display it.

      If you agree 4 and 7 should only have one code point then you agree that some unification is good. The question is the degree. It seems that for East Asian languages unicode started off conservative and they're adding more code points based on real world feedback. That sounds like a reasonable approach (given that there was no perfect approach they could have adopted from the start).

      (Or if you think 4 and 7 should each have (at least) two code points, then I think you're creating an impossible and impractical system which covers every way of writing every symbol.)

      Can Shift JIS display Chinese and Korean? If it can't then it "solves" the problem by ignoring the problem. Some people might find it better today, but unicode has a chance to eventually do what Shift JIS can do, and Shift JIS will never be able to do what unicode can do, so the eventual winner seems clear.

      > Even when [unicode] is implemented, it tends to be broken.

      Not really unicode's fault. Yes, they keep adding new code points (although this is partly because East Asian languages create new ideographs), and yes unicode is newer so application/font developers have had less time to implement it, but this is true of any big new project that's working on something massively complex and isn't finished yet.

      --
      Help build the anti-software-patent wiki
    6. Re:They're trying to unify *similar* characters by Anonymous Coward · · Score: 0

      ...a little slantier...

      You racist pig.

  21. Why not show us the character? by Anonymous Coward · · Score: 0

    Hey timothy, why not include the character in the summary, so we can all see it?

  22. In other news by rabbin · · Score: 1

    Some slashdot editors have failed to notice that incomplete sentences, which are less and less common in the first sentence of slashdot summaries.

    Some users have noticed that the Japanese character "no", which is extremely common in the Japanese language

  23. Can anyone illustrate? by BlueMonk · · Score: 3, Insightful

    I have been reading the comments for 20 minutes because I don't understand Japanese, but I still don't understand the problem. There's a Japanese character called no, it looks very much like a lowercase English/Latin "e" rotated clockwise about 80 degrees and then flipped over the vertical axis. Is this being mixed up with something else or rendered wrongly? Can anybody provide examples of what it's getting mixed up with or how or where it's being rendered improperly?

    1. Re:Can anyone illustrate? by Anonymous Coward · · Score: 0

      As you see on the linked site, the char is being rendered as a math symbol instead of a normal letter. They look almost the same, but it's always annoying if random letters are rendered with unexpected fonts.

      In this case, the common letter "no" (meaning 'of') is rendered as if it was part of a math equation because it is also defined as being a valid math symbol (by Unicode), and the application failed to correctly guess that it was normal Japanese (even though it was surrounded by normal Japanese).

      This basically means math and Japanese should not be mixed together.

    2. Re:Can anyone illustrate? by Anonymous Coward · · Score: 0

      https://twitter.com/hyuki/status/621667718056427521/photo/1

      this character, the last one in this image. It's shown with two strokes, but it should be a single line. And the font looks weird compared to the other letters. (And maybe doesn't seem like a big deal, but it is, just wrong.)

    3. Re:Can anyone illustrate? by phantomfive · · Score: 1

      Here's a picture. Notice that the character at the end is rendered in a different font than the rest of the characters. It's not a critical bug, the text is still legible, just an annoying cosmetic bug.

      --
      "First they came for the slanderers and i said nothing."
    4. Re:Can anyone illustrate? by BlueMonk · · Score: 1

      So, pardon my apparent inexperience with Unicode, fonts and glyphs, but this looks like an application or framework issue wherein someone decided that we should switch fonts in the middle of a string if there's another font that contains a glyph for the character we're after in some circumstances. Is that what's happening? Why shouldn't all text drawing operations be restricted to the currently active font, and make it the responsibility of the application developer and user to pick a font that contains all the glyphs required by their application. This doesn't really seem like a fault in Unicode, but in how the application or framework outsmarted itself in trying to switch fonts. Following the K.I.S.S. principle, this never would have happened, right? The application should simply stick to a single font. Also, under what circumstances (if any) would that "wrong" character ever be desired? Is it ever correct? Does it have a similar meaning in these other circumstances?

    5. Re:Can anyone illustrate? by Anonymous Coward · · Score: 0

      All applications switch fonts. Bold, italic, and combinations of those are all different fonts. An application wishing to use any of that must support having multiple fonts loaded.

      Most applications take it a step further and examine the text they wish to render, making sure the selected font can render each glyph. If not, they load a font that can display that glyph.

      One might now think "why not just render text using only (if possible) a font that supports the selected language?". And that is the problem. In this case, the glyph has two languages: Math and Japanese. Given that most applications are written in the West, they assume Math before Japanese.

      But even if the application correctly determines the language, it cannot know 100% which "language" the user desired when this glyph is drawn; absent user markup, this can only be guessed at, and a Japanese writer may have wanted to show a Math symbol instead of a normal glyph. And the problem is that Unicode overlapped code points that really should not have been overlapped. A Math "no" should never be considered visually identical to a Japanese "no"; for non-visual purposes, it has always been up to applications to treat visually similar or identical characters as equivalent.
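
      The per-character font selection described above can be sketched roughly as follows. Everything here is hypothetical: the font names, the coverage sets (in particular the idea that a math font ships its own glyph for U+306E), and the first-match rule are assumptions for illustration, not the behaviour of any specific application.

```python
# Hypothetical coverage tables: which code points each font can draw.
# Real fonts expose this through their character map; these sets are made up.
FONT_COVERAGE = {
    "MathFont": {0x306E, 0x2211, 0x222B},        # pretend it includes a "no" glyph
    "JapaneseFont": set(range(0x3040, 0x30A0)),  # the whole Hiragana block
}


def pick_fonts(text, priority):
    """Return one font name per character: the first font in `priority`
    that covers the character, or None if no listed font does."""
    result = []
    for ch in text:
        chosen = None
        for font in priority:
            if ord(ch) in FONT_COVERAGE[font]:
                chosen = font
                break
        result.append(chosen)
    return result


text = "\u3053\u306E\u307B\u3093"  # hiragana: ko, no, ho, n

math_first = pick_fonts(text, ["MathFont", "JapaneseFont"])
japanese_first = pick_fonts(text, ["JapaneseFont", "MathFont"])

print(math_first)
print(japanese_first)
```

      With the math font checked first, only the "no" is pulled out of the Japanese font, which is the mixed-font run people are complaining about. Checking the font already in use first keeps the whole run consistent.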

    6. Re:Can anyone illustrate? by Actually,+I+do+RTFA · · Score: 1

      I can give an example, if you don't mind me turning to Greek. Imagine some program renders mathematical symbols differently from text. Imagine that someone writes out, using Unicode, the formula for the area of a circle. No problem, right? The pi is clearly a math symbol. But imagine the same thing if you were reading Greek. And beyond that, imagine if, in all the Greek you read, the software thought pi was being used in a mathematical sense.

      --
      Your ad here. Ask me how!
    7. Re:Can anyone illustrate? by AmiMoJo · · Score: 1

      It's rendered in a way that a Japanese person could read it, but looks ugly because software can't tell if it is Japanese, Chinese or mathematical. It's rather jarring in the middle of sentence and makes the output unsuitable for publishing without manual editing.

      This is due to Unicode assigning the same code to the Japanese, Chinese and mathematical versions. It would be like they tried to merge the Latin "o" and Cyrillic "o". Imagine if every "o" character you wrote was rendered in a different font to all the other Latin characters. Fortunately, even though they look the same they have unique code points in Unicode, so that doesn't happen.
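
      The Latin/Cyrillic half of that comparison is easy to check with Python's standard unicodedata module. This only demonstrates that the two look-alike letters have separate code points; it says nothing about how any renderer picks fonts.

```python
import unicodedata

latin_o = "\u006F"     # o as used in English
cyrillic_o = "\u043E"  # the look-alike letter used in Russian

for ch in (latin_o, cyrillic_o):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Despite looking identical in most fonts, they are different characters.
same_character = latin_o == cyrillic_o
print("same character?", same_character)
```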

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    8. Re: Can anyone illustrate? by BlueMonk · · Score: 1

      What I still don't understand is, if there's only one code point for this character, where are the multiple renderings coming from? Multiple fonts? Is the source of the problem that Japanese fonts are providing a bad glyph/rendering for this character that doesn't match the style of the rest of the font, or is it that they are unable to provide both glyphs because there's only one code point? Would there still be a problem if they just changed their glyph to the other style; could this just be considered a bug in Japanese fonts?

    9. Re: Can anyone illustrate? by BlueMonk · · Score: 1

      It's still not clear how an application rendering Japanese text could end up making the bad assumption. If it's using a Japanese font, why would it bother to switch to another font when the character to be rendered exists in the current font? Does the problem only occur when the current font *doesn't* contain the character, and then the application goes hunting for it and ends up picking up characters from potentially multiple inconsistent fonts? That seems like an application issue, failing to try to retain a consistent font in this defaulting process. It points again to the notion that we should not even be doing that, but rather force applications to use "Unicode fonts" if they want to support Unicode text properly. This seems like a font issue more than a Unicode issue. Does Unicode have separate code points for italic and bold characters in other languages? Why should that information be part of the character instead of the font?

    10. Re: Can anyone illustrate? by Actually,+I+do+RTFA · · Score: 1

      There are multiple fonts: a "math" font and a "Japanese" font. The problem is that it goes jjjjjjjmjjjjjj for (j)apanese and (m)ath. It's just some programmer who has seen the math usage but never the Japanese usage, assuming that the code point was always mathy instead of handling the case sanely.

      --
      Your ad here. Ask me how!
  24. Way to miss an opportunity! by juanfgs · · Score: 1

    It would be funnier if the bug was on character "Ni"

  25. Salient point: It’s dumb and stupid and I ha by Anonymous Coward · · Score: 0

    This isn't the first time I've seen someone complain about the han unification. Since unicode was supposed to solve the problems of multiple encodings there is something to be said for the proposition that it shouldn't introduce new problems or continue problems such that it doesn't actually solve the problems it set out to solve. In that light, the pain and suffering resulting from trying to combine chinese and japanese text in a single document resulting in confused renderers and mixed-up fonts, like here but there are more ways that goes wrong, shows that unicode fails to properly fulfil this proposition and that the han unification as such wasn't such a great idea. Clever in theory, impractical in practice.

    Since unicode was exactly supposed to be a solve-all silver bullet leaving us with nothing but rainbows and an easy life, the expressed sentiment is the salient point.

  26. Your post is in English by Anonymous Coward · · Score: 0

    Therefore it can only be ignored.

    Anyway, the unification idea does have a practical drawback in that using more than one language using those unified slots is going to be nothing but pain and suffering. Personally I think that unicode's less-than-21-bits codepoint space isn't going to be enough and that a system of encodings, rather than one do-all encoding, would be the more practical approach. That way renderers would know what language the text is in and so make better decisions picking fonts and such.
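
    For reference, the size of that code point space can be worked out in a few lines. The constants are the standard ones (17 planes of 65,536 code points, top value U+10FFFF); the variable names are just for this sketch.

```python
import sys

MAX_CODE_POINT = 0x10FFFF        # highest code point Unicode defines
PLANES = 17                      # plane 0 (BMP) plus 16 supplementary planes
CODE_POINTS_PER_PLANE = 0x10000  # 65,536 per plane

total_slots = PLANES * CODE_POINTS_PER_PLANE
bits_needed = MAX_CODE_POINT.bit_length()

# Python refuses to build a character past the limit.
try:
    chr(MAX_CODE_POINT + 1)
    beyond_limit_ok = True
except ValueError:
    beyond_limit_ok = False

print(total_slots, "code points in total")
print(bits_needed, "bits to hold the largest one")
print("2**21 =", 2 ** 21, "so the space is smaller than a full 21 bits")
print("chr() past the limit allowed?", beyond_limit_ok)
```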

  27. Solution by Anonymous Coward · · Score: 0

    Just use English, people! If people did not use all these other silly little languages the world would be so much better.

  28. Notice This by Anonymous Coward · · Score: 0

    Some users have noticed that the Japanese character "no", which is extremely common in the Japanese language (forming parts of many words, or meaning something similar to the English word "of" on its own).

    What did they notice? Something about the Japanese character "no". But what exactly? The parenthetical gives some background on the character, but the sentence telling us what got noticed is never completed.

    Oh well. Maybe it wasn't important.

    1. Re:Notice This by Desty · · Score: 1

      Don't you just hate sentences that (while providing important background information)?

  29. Kanbun: Reordering Chinese to Japanese by tepples · · Score: 2

    Japanese and Chinese syntax differ too much for parallels as close as those of Mandarin and Cantonese. Japanese puts the verb at the end (SOV) and marks noun case with postpositions (wa, ga, o, e). Chinese, on the other hand, puts the verb in the middle (SVO), more like English. (Other orders are possible: Welsh and Arabic put the verb at the beginning, or VSO, and Kashmiri and Dutch split the verb into a part that's second and a part at the end, or V2.)

    Chinese also uses serial verb construction, where verbs before the sentence's main verb double as prepositions. For example, a sentence that glosses literally as "I sit aircraft depart Shanghai arrive Beijing travel" is understood as "I by aircraft from Shanghai to Beijing travel." (English is also SVO, but manner and place phrases follow the verb, producing "I travel from Shanghai to Beijing by aircraft.") In Japanese, each of these prepositional verbs would have to go after the noun and would probably need a participle ending like -tte to link them into the sentence.

    For about eight centuries prior to World War II, Japanese used kanbun, a way to mark up Chinese text to show the equivalent word order in Japanese, allowing it to be read as Japanese. It used reordering marks called kaeriten.

    1. Re: Kanbun: Reordering Chinese to Japanese by _merlin · · Score: 2

      That sentence doesn't require multiple verb clauses in Japanese. You can use destination, origin and means particles "ni", "kara" and "de": Watashi wa Shanghai kara Beijing ni hikouki de ikimasu. Since it's a single verb clause you can reorder it however you want for emphasis as long as the verb comes last - the way I have it there emphasises the subject. If you want to emphasise means of travel and use implicit speaker-as-subject, you can say: Beijing ni Shanghai kara hikouki de ikimasu. It's all easy as long as you get your particles right.

    2. Re: Kanbun: Reordering Chinese to Japanese by _merlin · · Score: 1

      Gah posting at 4:20AM is a bad idea. I emphasised destination in the second example. To emphasise means of transport: Hikouki de Beijing ni Shanghai kara ikimasu. Just put the aspect you want to emphasise (and its associated particle) first. The only part that absolutely must be in a certain place in the sentence is the verb that comes last.

    3. Re:Kanbun: Reordering Chinese to Japanese by billstewart · · Score: 1

      And apparently Korean's even weirder. (I'm going by my childhood memories of my mom describing her job translating Korean during the early 50s. Unfortunately, I don't think she still has her books on basic Chinese characters these days, though I could just as easily find them in a bookstore around here.)

      Some parts of Silicon Valley have a lot of Korean restaurants. I don't think I've seen any Chinese characters on their signs or menus, just alphabetic Korean.

      --

      Bill Stewart
      New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
    4. Re:Kanbun: Reordering Chinese to Japanese by Fire_Wraith · · Score: 1

      Korean sentence structure and grammar are pretty similar to Japanese. I had very little trouble picking up Korean after learning Japanese, because all the concepts were the same (topic/subject/object markers, use of counters, etc), it was just different words. A lot of the Sino-Korean words were also very similar to their Sino-Japanese counterparts, too. It's not surprising, since they're both from the same linguistic family and root, and both share a ton of Chinese influence.

      If anything, the biggest trouble I had was keeping the two separate, since using the wrong one is a rather bad faux pas...

    5. Re:Kanbun: Reordering Chinese to Japanese by StormShaman · · Score: 1

      Both the Korean and Japanese language groups are language isolates, and are not thought to be in the same language family.

  30. Absolute BS by Anonymous Coward · · Score: 1

    Let's be clear here: the character is U+306E, "hiragana letter no"
    http://www.fileformat.info/info/unicode/char/306e/index.htm

    The general category is "letter, other"
    Nothing to do with math (that would be "math symbol")
    If there is a bug, it is not in Unicode, but in some crappy software.

    Nothing to see, move along.
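
    The name and general category quoted above can be checked with Python's standard unicodedata module. Note this only looks at the general category; Unicode defines other properties that this sketch does not examine.

```python
import unicodedata

NO = "\u306E"  # the character under discussion

name = unicodedata.name(NO)          # formal character name
category = unicodedata.category(NO)  # two-letter general category code

# For comparison, a character that really is a math symbol.
summation_category = unicodedata.category("\u2211")  # N-ARY SUMMATION

print(f"U+{ord(NO):04X} {name}: category {category}")
print(f"U+2211: category {summation_category}")
```

    "Lo" is the code for "letter, other" and "Sm" is the code for "symbol, math".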

  31. Solution by Tablizer · · Score: 1

    Just say "No" to Unicode.

  32. Timothy can't write in English. by Gibgezr · · Score: 1

    Some users have noticed that the Japanese character "no", which is extremely common in the Japanese language (forming parts of many words, or meaning something similar to the English word "of" on its own).

    That isn't even a sentence in English. It is extremely grating to read crap like this, and it does not convey much about the story.

    1. Re:Timothy can't write in English. by gustygolf · · Score: 1

      Yes, I found it difficult as well, took a while to figure what it was referring to. And I *know* Japanese.

      Write 'Japanese hiragana character "no"' and I'll understand it. Ideally, also add the Unicode code point (U+306E HIRAGANA LETTER NO), maybe even a link to that character's data on e.g. fileformat.info.

      Of course, it's still a sentence fragment, and that's pretty jarring.

      --
      "Slow Down Cowboy! It's been 58 minutes since you last successfully posted a comment" -- slashdot, driving users away.
  33. ascii forevar! by Anonymous Coward · · Score: 0

    unicode *is* a bug.

  34. MOD PARENT UP, PLEASE! by billstewart · · Score: 1

    Even using vector fonts doesn't fix the problem that Unicode wasn't a great solution for managing the diversity of characters in many Asian languages.

    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
    1. Re:MOD PARENT UP, PLEASE! by Anonymous Coward · · Score: 0

      So... unicode is a system of inputting alphabets, not vector-based calligraphy characters. It's not our fault western language has declined so far into abstraction.

  35. Bad English is the world's most common language by billstewart · · Score: 1

    I was once at a conference in Germany, most of which was given in English because it was an international crowd. One of the German speakers started off by saying that he used to start by apologizing for his bad English, but the host (who was Turkish) told him not to worry; Bad English is the most widely spoken language in the world. (Which is fine; English is flexible enough about most things that if you don't need to be subtle, Bad English will usually do.)

    German's the only non-English language that I'm even vaguely functional in, and even then it was much more useful for me in Czechoslovakia, where people had learned German in school to deal with tourists, and I mainly wanted to talk to them about the same sets of things, like train schedules and getting food and hotels and which bridge went to the castle. Northern Germans speak a relatively comprehensible dialect, though too fast for me to do much in real time; understanding Austrians is more like being a New Yorker in deep Alabama. (I play music at a local German jam session, and some of the tunes have the lyrics translated from Bavarian or Swiss into German...)

    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
  36. Typewriter character sets without 1 and 0 by billstewart · · Score: 1

    I'm pretty sure my mom's manual typewriter when I was a kid didn't have 1, less sure about whether it had 0. But it did have the proper French and Spanish accent marks (left, right, circumflex, N~, cedilla, most of which my PC keyboard doesn't have), and you composed them with letters by using the backspace.

    And yes, she could do two-column left-and-right-justified newsletters on it - she'd type a draft, count the letters, type the final. But she happily switched to using a Macintosh to type them, and let it handle that stuff.

    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
  37. English? 7bit clean?? Bwahahah! by billstewart · · Score: 1

    Yes, I know you were trolling, but in your mythical 7-bit-clean English, even if you're not using English letters like ð or þ, or ligatures like æ, or distinguishing between short and long S's (you know, the s you used to think were f's), how do you put diaeresis marks over words like cooperate, or distinguish between m-dash and n-dash and hyphen, or get the left- and right-side quotation marks without using some Microsoft or Apple ``smart quote'' breakage, much less deal with accent marks in words of foreign origin that are now part of English because we stole them fair and square and they're ours now, or handle degree marks, or words with superscript letters like the abbreviations for the and that and George and Your, or ...

    And turning them all into leet-speak, like earlier Ye Olde Hwaetever's, just doesn't count.

    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks