Slashdot Mirror


New Unicode Bug Discovered For Common Japanese Character "No"

AmiMoJo writes: Some users have noticed a problem with the Japanese character "no", which is extremely common in the Japanese language (forming parts of many words, or meaning something similar to the English word "of" on its own). The Unicode standard has apparently marked the character as sometimes being used in mathematical formulae, causing it to be rendered in a different font from the surrounding text in certain applications. Similar but more widespread issues have plagued Unicode for decades due to the decision to unify dissimilar characters in Chinese, Japanese and Korean.

125 of 196 comments

  1. No? by hankwang · · Score: 1
    I tried to RTFA, but it was in Japanese! I thought Japanese didn't have a word for "no":

    Japanese also lacks words for yes and no. The words "hai" and "iie" are mistaken by English speakers for equivalents to yes and no, but they actually signify agreement or disagreement with the proposition put by the question: "That's right." or "That's not right."

    1. Re:No? by AmiMoJo · · Score: 1

      Correct. Unfortunately Slashdot does not allow me to enter Japanese text, hence the confusion.

      This is what happens when I type that character in Japanese: ã®

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    2. Re:No? by Applehu+Akbar · · Score: 1

      Most Japanese characters in the two phonetic alphabets stand for a consonant tied to a vowel. The no phonetic in grammar indicates possession. In the phrase Katoh no boshi (Kato's cap) all of the other characters would be the Chinese-derived kanji.

  2. Re:Indeed by smittyoneeach · · Score: 1

    Sum of some, sometimes
    Somersaults sagaciously
    In the summertime

    --
    Get thee glass eyes, and, like a scurvy politician, seem to see things thou dost not.--King Lear
  3. What bug? by Ark42 · · Score: 4, Informative

    The character in question is Hiragana "No", codepoint U+306E. As far as I can tell, this has existed since Unicode 1.1 and there are no differences in the Unicode metadata when compared to any other Hiragana glyph. It is marked as IsAlphabetic=True, Category=Other Letter, and NumericType=None for example. So are all the other common Hiragana glyphs. If there is a bug, it's clearly with some specific application, and not Unicode or Unicode metadata. Compare http://www.fileformat.info/inf... with any other Hiragana glyph, like http://www.fileformat.info/inf... (Hiragana "Ha").
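The properties cited above can be checked locally. A minimal sketch, assuming Python's standard `unicodedata` module: the helper name `describe` is made up for illustration, `str.isalpha()` is only a rough stand-in for the Alphabetic property, and the results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def describe(ch):
    """Collect a few Unicode properties for a single character."""
    return {
        "codepoint": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        # General category; "Lo" means "Letter, other".
        "category": unicodedata.category(ch),
        # Numeric value, or None when the character has no numeric type.
        "numeric": unicodedata.numeric(ch, None),
        # Rough proxy for the Alphabetic property (based on the category).
        "is_alpha": ch.isalpha(),
    }


no = describe("\u306e")  # Hiragana "No"
ha = describe("\u306f")  # Hiragana "Ha", used here as the comparison glyph

# Everything except the code point and the name should match.
same_metadata = all(no[key] == ha[key] for key in ("category", "numeric", "is_alpha"))
print(no)
print(ha)
print("same metadata:", same_metadata)
```

If the two dictionaries differ only in code point and name, that supports the reading that nothing in the character data singles out "No".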

    1. Re:What bug? by AmiMoJo · · Score: 2, Interesting

      The bug is that the Japanese, Chinese, Korean and mathematical versions of this character all share a common code point. There is no reliable way for an application to select the right character and render it properly.

      You can't mix C/J/K and mathematics in Unicode, which is a new bug beyond just the failure to support mixing C/J/K.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    2. Re:What bug? by Anonymous Coward · · Score: 1

      This is plainly wrong. The character discussed here is Hiragana, not Kanji. There is no unification for the Kana alphabets. As such, this *is* an application bug.

    3. Re:What bug? by Florian+Weimer · · Score: 1

      “-” looks very different in text and formulas, too. I don't get why people assume that you can get nice rendering without additional markup.

    4. Re:What bug? by Ark42 · · Score: 1

      I'm aware of the problems with the han unification and certain Kanji being displayed "wrong" because the Chinese equivalent is drawn significantly differently from the Japanese Kanji, but this doesn't seem to be anything close to that kind of problem. I'm also aware of the Unicode block at U+1D400, "Mathematical Alphanumeric Symbols", which is what should be used for formulas. Any application that renders one particular character in the Hiragana block in a different font than the rest of the Hiragana block is, quite frankly, just rendering it wrong. The bug is with the application as far as I'm concerned, and this clearly does not impact default system rendering or any common web browsers as far as I can see either.

    5. Re:What bug? by amake · · Score: 1

      There are no Chinese or Korean versions of this Japan-specific character. This is the first time I've ever heard of a "mathematical use" of this character, and I suspect the vast majority of users would be surprised at this as well.

    6. Re:What bug? by Ark42 · · Score: 1

      Except while that is called "Hyphen-Minus" and can be used for two things, Unicode does try to solve that problem by having:
      00AD Soft Hyphen
      2010 Hyphen
      2011 Non-Breaking Hyphen
      2012 Figure Dash
      2013 En Dash
      2014 Em Dash
      2015 Horizontal Bar
      2212 Minus Sign
      2796 Heavy Minus Sign

      There is no "Mathematical Hiragana No" glyph defined by Unicode, and as such, it should never be rendered in a different font just because somebody *might* use it in a formula. The application is wrong, and there is no bug in Unicode.
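To make the contrast concrete, here is a small sketch, assuming Python's standard `unicodedata` module. It looks up the hyphen, dash and minus characters listed above by code point, then checks whether any character can be found under a name like the one in question. `has_character` is a hypothetical helper, and the "mathematical" name passed to it is only a guess at what such a character might have been called.

```python
import unicodedata

# U+002D plus the code points from the list above.
DASH_LIKE = [0x002D, 0x00AD, 0x2010, 0x2011, 0x2012,
             0x2013, 0x2014, 0x2015, 0x2212, 0x2796]

# Map each code point to its official character name.
names = {cp: unicodedata.name(chr(cp)) for cp in DASH_LIKE}


def has_character(name):
    """Return True if the bundled Unicode data defines a character with this name."""
    try:
        unicodedata.lookup(name)
        return True
    except KeyError:
        return False


for cp, name in names.items():
    print(f"U+{cp:04X} {name}")

print(has_character("HIRAGANA LETTER NO"))               # the ordinary letter
print(has_character("MATHEMATICAL HIRAGANA LETTER NO"))  # illustrative name only
```

The separate minus sign at U+2212 is what lets an application distinguish the two uses of "-" without guessing; the lookup is one way to confirm there is no analogous second code point for "No".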

    7. Re:What bug? by Megane · · Score: 1

      My guess is that it can be used in certain numerical contexts, sort of like "No." ("number") in English. It can mean a quantity as in "n no x" (ippiki no neko), and maybe some other contexts. So something, probably an application, was coded to think of it as used in numerical contexts. The specific instance is about LaTeX, which is one of those ancient apps like emacs that is so old it had to create everything from scratch, so it's possibly specific to LaTeX or some port thereof.

      --
      #naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }
    8. Re:What bug? by AmiMoJo · · Score: 1

      It's been imported to China: http://portal.nifty.com/koneta...

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    9. Re:What bug? by t551 · · Score: 1

      The bug is not in LaTeX, but in MathJax, an HTML/Javascript reimplementation of the TeX mathematical markup for use on the web.

    10. Re:What bug? by butlerm · · Score: 1

      How can you tell that any of those pictures are from China? They all look like they are from Japan, a country that makes extremely heavy use of Chinese characters (much more than Korea for example), to me.

    11. Re:What bug? by AmiMoJo · · Score: 1

      All the text is in Chinese. The blog post itself is in Japanese, and it says that the pictures are of China.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    12. Re:What bug? by NostalgiaForInfinity · · Score: 1

      The bug is that the Japanese, Chinese, Korean and mathematical versions of this character all share a common code point. There is no reliable way for an application to select the right character and render it properly.

      What you probably mean is that an application can't select the right glyph based on the Unicode string. That is correct, but nothing specific to CJK. Without markup or metadata, Unicode often won't render as expected by readers even in Western languages. Unicode used to have its own system for marking language context, but it was dropped since it was redundant with widely used markup and metadata. If you don't know what language your string is written in, you can't pick the right glyphs, and that's true in many languages besides CJK. (CJK has good heuristics for language identification, so it's not a problem.)

      Mostly, CJK deunification is something some Westerners try to use for showing off their (usually limited) knowledge of Japanese and Chinese and demonstrate how morally superior they are to the culturally ignorant, imperialist, evil white men that, in their imagination, made up the Unicode consortium.

      In reality, the Chinese and Japanese are big boys. If they wanted to de-unify their scripts in Unicode, they'd have the political clout, and no Westerner would stop them because, frankly, nobody outside Japan or China gives a f*ck. However, I suspect they are too smart to screw themselves that way.

  4. Re:Is it the same as in Chinese? by Chris+Mattern · · Score: 4, Funny

    Like æ or ÃüY

    If so, it seems many Chinese websites will have problems too, because it's used so often in Chinese.

    As you have just discovered, Slashdot cleverly avoids all Unicode bugs by not supporting Unicode at all.

  5. Re:Bug? by AmiMoJo · · Score: 2

    It's a Unicode bug. Unicode tries to merge different characters into a single code point, because long ago they had the same origin. This particular character exists in Japanese, Chinese, Korean and mathematics, so can be rendered four different ways, but they all share one code point.

    Applications have to guess what font to use. Being a mathematical program, this one defaults to the system language (Japanese) but has logic to detect this "no" character and render it in a different font. It isn't clever enough to notice that the rest of the sentence is Japanese, but it shouldn't have to be.

    --
    const int one = 65536; (Silvermoon, Texture.cs)
    SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
  6. Re:Why not just use English, and only English? by JustOK · · Score: 1

    Que?

    --
    rewriting history since 2109
  7. Re: No, it is the character pronounced as "no" by Ash-Fox · · Score: 1

    Actually it does, it's just disabled in slashcode after the brief spam event when it was enabled.

    --
    Change is certain; progress is not obligatory.
  8. Re: Why not just use English, and only English? by Anonymous Coward · · Score: 2, Insightful

    There are more native Chinese speakers than English speakers. How about you learn Chinese and shut the fuck up?

  9. Re:Why not just use English, and only English? by Ash-Fox · · Score: 1

    In practice, English is the only language we need.

    I think Chinese is the only language we need, it's already the most spoken language in the world.

    It uses a sensible alphabet that's easy to represent digitally.

    Chinese is too.

    It's a democratic language that will draw from other languages where necessary and useful.

    Chinese does too.

    It's a language that has proven it can adapt to changing circumstances.

    Chinese does too.

    --
    Change is certain; progress is not obligatory.
  10. Nitpick by msobkow · · Score: 5, Informative

    This is not a "Unicode bug". It is a rendering bug exhibited by some applications.

    --
    I do not fail; I succeed at finding out what does not work.
    1. Re:Nitpick by AmiMoJo · · Score: 2

      How is an application supposed to know if a random character is Japanese, Chinese, Korean or mathematical? It would need some kind of strong AI to interpret and understand the text. It's a Unicode bug, merged characters are impossible to render correctly all the time because apps are forced to guess which font to use.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    2. Re:Nitpick by msobkow · · Score: 1

      Ask the people who wrote the software that doesn't exhibit the bug. Obviously it can be done.

      --
      I do not fail; I succeed at finding out what does not work.
    3. Re:Nitpick by AmiMoJo · · Score: 2

      Software that doesn't have this bug only avoids it by not supporting mathematical symbols. So far there is no known software that avoids the CJK confusion problem either.

      Most software doesn't even try. How many programmers are even aware of the issue? No Unicode library is immune. It's a problem with the standard that can only be fixed by starting fresh with about 150,000 new CJK characters, and then updating all fonts and libraries to handle translation and equivalence.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    4. Re:Nitpick by thegarbz · · Score: 1

      In other news a new bug is shown to exhibit a behaviour where some mathematical programs substitute a Japanese character into the formula.

      The problem is it can't be done. Not without intelligent user / designer input (such as signifying that the unicode to be displayed is Japanese and not a maths formula). If an application is correct in determining one context it will be incorrect in determining the other.

    5. Re:Nitpick by Kjella · · Score: 3, Informative

      How is an application supposed to know if a random character is Japanese, Chinese, Korean or mathematical? It would need some kind of strong AI to interpret and understand the text. It's a Unicode bug, merged characters are impossible to render correctly all the time because apps are forced to guess which font to use.

      Except font encoding has never been part of the character encoding, you might want your English text in Arial, your French in Times New Roman and the formula in Courier, but Unicode doesn't encode that. You might argue that this is not a bug, that it's simply out of scope and should be solved by a higher level encoding like <font="some japanese font">konnichiwa</font><font="some chinese font">ni hao</font> and not plaintext Unicode. That's what the Unicode consortium says and if you express it as simply a style issue, it actually sounds plausible.

      On the other hand, you might argue that there's no reasonable way to map a "unihan" character to a glyph except as a band-aid since the CJK styles are distinctly different and so any comprehensive font should have three variations, it shouldn't take three fonts to make a mixed CJK document look correct just one. That this information belongs on the lowest level and should be passed along as you copy-paste CJK snippets or pass them around in whatever interface or protocol you have, otherwise everything will need a document structure and not just a string.

      I don't think they should "unmerge" and duplicate all the han characters, that'd be silly. What they should do is add CJK indicators - say HANC, HANJ, HANK like for bi-directional text, only simpler with no nesting just one indicator applying until superseded by another. Like (HANJ) konnichiwa (HANC) ni hao and the former will render as a Japanese han, the latter as a Chinese. If it doesn't have any indicator, well take a guess. Am I missing something blindingly obvious or would this trivially solve the problem?

      --
      Live today, because you never know what tomorrow brings
    6. Re:Nitpick by loufoque · · Score: 1

      In LaTeX, it's mathematical if it occurs in a math context, which is separated by $ characters.

    7. Re:Nitpick by AmiMoJo · · Score: 2

      I agree, font encoding should not be part of the character encoding. Unicode even screws that up though, because there are things like text direction marks in it. Anyway, the problem is that often you have text without metadata. A file name, audio file metadata, a plain text database entry etc. You have to pick a font to render it, and the choice depends on the language because thanks to Unicode it's impossible to have a universal all-language font.

      You could have meta characters as you suggest, but that isn't what Unicode is supposed to be for. It's a character encoding scheme, not a metadata encoding scheme.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    8. Re:Nitpick by Megane · · Score: 1

      This is not a unified character, it is Japanese-only. Some program (apparently LaTeX) is using the wrong font because it thinks it is part of a mathematical equation, even to the point of showing the wrong font for the character in a font character viewer window.

      --
      #naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }
    9. Re:Nitpick by Kjella · · Score: 1

      You could have meta characters as you suggest, but that isn't what Unicode is supposed to be for. It's a character encoding scheme, not a metadata encoding scheme.

      Actually I was thinking of it more like a "sticky" composite character, like you can have a + circle = å you'd have unihan + HAN(C|J|K) = "right" glyph while:

      a) Extending existing single-language CJK documents with just one character
      b) Preserving backwards compatibility with all current CJK systems
      c) Avoiding any complex CJK conversion functions
      d) Creating a simple way to override with "show as C/J/K"

      It would require adding a bit of intelligence to copy-paste for preservation, like:

      (HANC)abcde -> copy "cde" -> (HANC)cde

      But if the application doesn't, well you'll still get the correct unihan. Also on paste it could remove redundant markers, but they'd be harmless. Then you could have universal fonts with as little invasive changes as possible. The alternative would be creating literally hundreds of thousands of new code points.
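The proposal above can be sketched in a few lines. This is purely illustrative: no HANC/HANJ/HANK characters exist in Unicode, so the sketch borrows three private-use code points as stand-ins, and the function names `split_runs` and `copy_slice` are invented here.

```python
# Hypothetical markers, represented by private-use code points for this sketch.
HANC, HANJ, HANK = "\ue000", "\ue001", "\ue002"
MARKERS = {HANC: "zh", HANJ: "ja", HANK: "ko"}


def split_runs(text, default=None):
    """Split text into (language, run) pairs; a marker applies until superseded."""
    lang, buf, runs = default, [], []
    for ch in text:
        if ch in MARKERS:
            if buf:
                runs.append((lang, "".join(buf)))
                buf = []
            lang = MARKERS[ch]
        else:
            buf.append(ch)
    if buf:
        runs.append((lang, "".join(buf)))
    return runs


def copy_slice(text, start, end):
    """Copy visible characters [start, end), keeping the marker in effect at start."""
    active, out, pos = "", [], 0
    for ch in text:
        if ch in MARKERS:
            if pos < start:
                active = ch      # marker seen before the selection begins
            elif pos < end:
                out.append(ch)   # marker inside the selection is preserved
            continue
        if start <= pos < end:
            out.append(ch)
        pos += 1                 # markers never advance the visible position
    return active + "".join(out)


copied = copy_slice(HANC + "abcde", 2, 5)        # mirrors the "cde" example
print(split_runs(HANJ + "ab" + HANC + "cd"))
print(copied == HANC + "cde")
```

A renderer would then pick a glyph style per run, falling back to a guess when the language is `None`. The copy helper can emit a redundant marker when one sits exactly at the selection start, which matches the "redundant but harmless" case described above.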

      --
      Live today, because you never know what tomorrow brings
    10. Re:Nitpick by AmiMoJo · · Score: 1

      That wouldn't really improve things IMHO, because you would still be reliant on the application knowing how to handle the character. In practice what would you do, add it to the start of file names? Then on all current software your filename would start with a little box representing an unknown character. The whole concept of composite characters is ridiculous as well, they should all get their own code points and let the font system handle saving some memory by re-using parts of glyphs. Otherwise your simple character count suddenly requires a massive look-up table of composite characters.

      The goal should be to make handling Unicode text as simple as possible without huge code libraries, metadata tables and the like. Everything else is prone to screw ups - for example with the text direction mark, there was a security flaw where you could include one in a file name to make "document.fdp.com" look like "document.moc.pdf". The right-to-left mark is after the first period and invisible.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    11. Re:Nitpick by Megol · · Score: 1

      The problem is outside the problem domain Unicode attempts to solve so it isn't strange it doesn't solve it. For some other problems Unicode tries to solve the result is a mess (example: bidirectional text) so that is probably a good thing.

    12. Re:Nitpick by Kjella · · Score: 1

      That wouldn't really improve things IMHO, because you would still be reliant on the application knowing how to handle the character. In practice what would you do, add it to the start of file names? Then on all current software your filename would start with a little box representing an unknown character.

      Yes, until the software got updated to treat it as a non-printing character but it wouldn't make everything unreadable, there's bad and there's much much worse.

      The whole concept of composite characters is ridiculous as well, they should all get their own code points and let the font system handle saving some memory by re-using parts of glyphs. Otherwise your simple character count suddenly requires a massive look-up table of composite characters.

      It already does for a huge number of reasons. Oh and if you thought giving every character a code point would mean a 1:1 mapping to glyphs that's still wrong, many characters map to alternate glyphs depending on the context. For example Arabic and Latin cursive characters substitute different glyphs to connect glyphs together depending on whether the character is the initial character in a word, the final character, a medial character or an isolated character.
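As a small illustration of why counting is already not trivial, the sketch below, assuming Python's standard `unicodedata` module, compares the precomposed and decomposed spellings of "å". The helper names are invented, and skipping combining marks is only an approximation; full grapheme segmentation needs more than this.

```python
import unicodedata

composed = "\u00e5"     # one code point: LATIN SMALL LETTER A WITH RING ABOVE
decomposed = "a\u030a"  # two code points: "a" + COMBINING RING ABOVE


def count_base_characters(text):
    """Rough count: code points that are not combining marks."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(len(composed), len(decomposed))     # code point counts differ
print(count_base_characters(decomposed))  # the approximate count ignores the mark
print(composed == decomposed)             # raw comparison
print(same_text(composed, decomposed))    # comparison after normalization
```

Both spellings display the same way, so any code that counts or compares code points directly has to decide which form it expects.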

      The goal should be to make handling Unicode text as simple as possible without huge code libraries, metadata tables and the like. Everything else is prone to screw ups - for example with the text direction mark, there was a security flaw where you could include one in a file name to make "document.fdp.com" look like "document.moc.pdf". The right-to-left mark is after the first period and invisible.

      Well, you should have a filter there anyway because "foo/bar.*<hello?>" is not a valid filename either, though it's a valid unicode string. That you don't restrict it to the valid subset isn't the standard's fault.
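A filter of the kind described could look roughly like this. It is a sketch under assumptions, not a complete filename sanitizer: the set below covers the explicit directional formatting characters plus the two directional marks, and the function names are invented. The character that visually reverses Latin text in the file name example is U+202E RIGHT-TO-LEFT OVERRIDE.

```python
# Explicit directional embeddings, overrides and isolates, plus LRM and RLM.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
    "\u200e", "\u200f",
}


def has_bidi_controls(name):
    """Return True if the name contains any directional control character."""
    return any(ch in BIDI_CONTROLS for ch in name)


def strip_bidi_controls(name):
    """Return the name with directional control characters removed."""
    return "".join(ch for ch in name if ch not in BIDI_CONTROLS)


# The override sits right after the first period, as in the example above.
spoofed = "document.\u202efdp.com"

print(has_bidi_controls(spoofed))
print(strip_bidi_controls(spoofed))
print(has_bidi_controls("document.pdf"))
```

Whether to reject such names or silently strip the characters is a policy choice; legitimate right-to-left file names may rely on some of these marks, so rejecting only the override characters is a plausible narrower variant.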

      --
      Live today, because you never know what tomorrow brings
    13. Re:Nitpick by GuB-42 · · Score: 1

      First of all, the hiragana "no" is always Japanese, not Chinese, not Korean. The CJK unification is only about han characters (in Japanese, that's kanji).
      As for maths, there are usually markers to indicate we are in an equation, which makes sense because Unicode is not powerful enough for this: fractions, integrals, matrices, etc. cannot be rendered with just code points. So in this case Unicode provides the characters (Roman and Greek letters, numbers, mathematical symbols, the hiragana "no", etc.) and a higher-level language (like MathML or LaTeX) deals with the structure. Because of this, Unicode doesn't have to dedicate a special page to mathematical versions of regular characters: the software can easily differentiate. If it is inside a MathML element or a LaTeX "$" block, render it with the math font; otherwise, use the regular font.
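The rule in the last sentence can be sketched as a tiny splitter. This is a deliberate simplification: it treats every `$` as a math delimiter, ignores escaped `\$` and `$$ ... $$` display math, and the function name `split_math` is invented for the example.

```python
def split_math(text):
    """Split text into ("text" | "math", segment) pairs using "$" delimiters."""
    parts = text.split("$")
    # A balanced string has an even number of "$", hence an odd number of parts.
    if len(parts) % 2 == 0:
        raise ValueError("unbalanced $ delimiter")
    return [
        ("math" if index % 2 else "text", segment)
        for index, segment in enumerate(parts)
        if segment
    ]


# U+306E appears once in running text and once inside a math block.
sample = "A\u306eB $x\u306ey$ C"
segments = split_math(sample)

for kind, segment in segments:
    font = "math font" if kind == "math" else "regular font"
    print(f"{kind:4} -> {font}: {segment!r}")
```

The font decision depends only on which segment a character falls in, never on which character it is, so "No" in running text keeps the regular font.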

  11. Re: Why not just use English, and only English? by John+Allsup · · Score: 3, Insightful

    Just write Chinese in pinyin and speak it normally. (The number of Chinese speakers does not matter; the issue is with how it is written down.) When it comes to ideograph-based languages, we would have been better off designing an entirely separate text system rather than trying to shoehorn it into a font-character paradigm derived from the needs of writing and printing Latin scripts. Indeed, having a writing system designed around the needs of calligraphy would be a useful thing, but as with ideograph-based writing systems it is a long way from the use case we normally see with alphabet-based writing systems.

    --
    John_Chalisque
  12. Re: Or speak English, it's 7bit clean by John+Allsup · · Score: 1

    As I pointed out elsewhere here, Chinese can be written with a Latin alphabet and a few accents. Likewise languages such as Sanskrit. Just as there is a difference between English handwriting and what can be represented in ASCII, we face a related issue with ideograph-based writing systems. We would be better off writing Chinese webpages in pinyin, and developing a separate system for calligraphy and ideographs.

    --
    John_Chalisque
  13. Re:Is it the same as in Chinese? by ChunderDownunder · · Score: 1

    meanwhile the folks at soylent implemented it ages ago.

    With all the effort wasted on 'beta', I wonder how much of the open source slashcode remains.

  14. Re:single global language by AmiMoJo · · Score: 1

    English is fine for factual information like air traffic control or shipping, but it would never work for Japanese society. There are too many important things you can't adequately express in English that are essential to Japanese people. Same with Chinese.

    --
    const int one = 65536; (Silvermoon, Texture.cs)
    SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
  15. Re:Is it the same as in Chinese? by KiloByte · · Score: 1

    Actually, slashcode does support Unicode, all that needs to be done for /. to get Unicode is reconfiguring the database (and converting old comments, I guess).

    --
    The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
  16. Re: Or speak English, it's 7bit clean by Anonymous Coward · · Score: 2, Interesting

    As I pointed out elsewhere here, Chinese can be written with a Latin alphabet and a few accents. Likewise languages such as Sanskrit. Just as there is a difference between English handwriting and what can be represented in ASCII, we face a related issue with ideograph-based writing systems. We would be better off writing Chinese webpages in pinyin, and developing a separate system for calligraphy and ideographs.

    Except that there are so many homonyms in pinyin that a strong sense of the context is needed to read it. The logograms are much harder to write but reading is quite a bit easier, which is why they are still in use. That's not the same as English handwriting vs printing, where the differences are only in rendering and there is a 1:1 correspondence between a handwritten and a printed character.

  17. Re: No, it is the character pronounced as "no" by ledow · · Score: 1

    Fuck it doesn't even support ASCII, let alone Unicode.

    Try doing an English pound sign:

    £

    Nope.

  18. Re: Or speak English, it's 7bit clean by amake · · Score: 1

    No one, absolutely no one who is actually proficient in any of these languages, would find your proposal acceptable. The only people who advocate such things are, deservedly, dismissed as cranks.

    So instead, how about we fix the problems with the current, largely acceptable system we have now?

  19. Re: Why not just use English, and only English? by amake · · Score: 2, Insightful

    we would have been better off

    No, you might have been better off. Chinese speakers would not. They would like to use their written language, as it exists today, on computers just like everyone else.

  20. Re: Why not just use English, and only English? by Anonymous Coward · · Score: 1

    https://en.wikipedia.org/wiki/Romanization_of_Chinese

    they tried, but failed.

    If you actually do know a little bit more than "pinyin", you should understand why they failed.

  21. Re:Is it the same as in Chinese? by Carewolf · · Score: 1

    Actually, slashcode does support Unicode, all that needs to be done for /. to get Unicode is reconfiguring the database (and converting old comments, I guess).

    No, it already works. It was active for a while some 10 years ago, but was removed because it was hard to sanitize. You could easily write your own comment score by reversing direction at the right time.

    Still they could reactivate it if they just found a reasonable way of sanitizing features they don't want.

  22. Re:Bug? by Carewolf · · Score: 1

    It's a Unicode bug. Unicode tries to merge different characters into a single code point, because long ago they had the same origin. This particular character exists in Japanese, Chinese, Korean and mathematics, so can be rendered four different ways, but they all share one code point.

    Applications have to guess what font to use. Being a mathematical program, this one defaults to the system language (Japanese) but has logic to detect this "no" character and render it in a different font. It isn't clever enough to notice that the rest of the sentence is Japanese, but it shouldn't have to be.

    The funny thing is that the same has never been done with Latin letters and symbols, because that would be a mess. I really don't understand why they couldn't see it would be the same in Asian languages.

  23. Just a rendering problem? by Applehu+Akbar · · Score: 1

    The character in the Unicode table looks like a mashup of the hiragana (grammar-forming) version of the character, and the katakana (used as we do italics) form.

  24. Mandarin dependency and homophone confusion by tepples · · Score: 4, Interesting

    Just write Chinese in pinyin and speak it normally. (The number of Chinese speakers does not matter; the issue is with how it is written down.)

    "Chinese" is not a single spoken language. A passage written in one Chinese language, such as Mandarin, is often readable in another Chinese language, such as Cantonese, so long as they're written with Han characters. It's as if French could be read as Italian or Spanish with the same characters. In addition, different words that sound the same in a given Chinese language due to historic sound changes usually have different Han characters. They may end up sounding different in a different Chinese language whose different historic sound changes produced different homophone sets. Pinyin, on the other hand, depends on Mandarin and confuses homophones.

    1. Re:Mandarin dependency and homophone confusion by interval1066 · · Score: 1

      Something I've always been curious about though; my understanding is that a Japanese speaker can understand written Chinese, to a certain extent. Is that not correct? I know that the reverse isn't really possible due to the Japanese use of Kana. But if the text is written using Han glyphs Cantonese, Mandarin, Hunan, Kan, Taiwan, etc, and Japanese speakers can sort-of understand each other's written stuff, or is that just nonsense?

      --
      Python: 'And then suddenly you have a language which says "we're all stuck with whatever the whiniest coder wants".'
    2. Re:Mandarin dependency and homophone confusion by Fire_Wraith · · Score: 4, Informative

      To a degree, yes, because the symbols themselves are the same. Note however that some of the original Chinese characters have been altered in use (simplified) by the PRC in the 50s and 60s, but those are only used in mainland China (and I think Singapore maybe?), but not Taiwan or Japan. Aside from that though, the characters for something like 'University' would still be a combination of the character for 'large' and the character for 'school'. It might be pronounced totally differently, but could be read and understood by all. Fun fact: The proper reading of the characters for the country of "Japan" in Japanese is actually "Nihon" or "Nippon." However, in certain Chinese dialects, the characters that comprise it are pronounced more like "Zep-pen" or "Japan." What's also fascinating to consider is that Korean is the same way, but that in modern usage you hardly ever see the Chinese characters (Hanja) used, even though I think they're still taught in some schools. Almost everything I saw when I was in Korea was in Hangul, the Korean native alphabetic script.

    3. Re:Mandarin dependency and homophone confusion by phantomfive · · Score: 1

      But if the text is written using Han glyphs Cantonese, Mandarin, Hunan, Kan, Taiwan, etc, and Japanese speakers can sort-of understand each other's written stuff, or is that just nonsense?

      I went to China once with a professor of ancient Korean. He couldn't speak any Chinese, but he learned enough Chinese characters from studying Korean that he could write well enough to communicate with a taxi driver. They had to write to each other, they couldn't speak.

      Essentially, there was an old style of Chinese that everyone wrote in (but probably no one ever spoke, including Chinese). Over time, Japan, Korea, Hong Kong and eventually all of China modified the writing system to match the speaking system. (Here is a really good link on that topic).

      The meaning of the individual characters is mostly the same. Sometimes though, you combine two characters together to make a complete word, and there is more variation (in two-character words) between the different countries. Also, there is some variation in the styles of various characters.

      There's more to say but that's probably an earful already lol

      --
      "First they came for the slanderers and i said nothing."
    4. Re:Mandarin dependency and homophone confusion by aix+tom · · Score: 2

      Another interesting little tidbit I stumbled upon while learning Japanese: "Peking" is written with the characters "North-Capital", "Nanking" is written with the characters "South-Capital", while "Tokyo" is written with the characters "East-Capital".

    5. Re:Mandarin dependency and homophone confusion by billyswong · · Score: 1

      No. Kyoto is literally capital-city, or capital-capital (slashdot chopped all Han characters :( )

    6. Re: Mandarin dependency and homophone confusion by jrumney · · Score: 1

      If you write it with long vowels spelt out, Toukyou is clearly different than Kyouto, though the Kyou in both cases is the same, Kyoto being the former capital. On the subject of Chinese being able to understand written Japanese, it is only partially the case, as Chinese characters are not always used for their meaning in Japanese. Sometimes they were used for their (Middle Chinese) sound.

    7. Re:Mandarin dependency and homophone confusion by Fire_Wraith · · Score: 1

      No, it's a different character for 'To'.

      Different characters can have the same phonetic representation in Japanese, which is one of the tricky parts of the language. English has homonyms too, though they're usually easier to differentiate based on context. Kanji puns from this are definitely a big deal in Japanese humor, as you might expect.

      Also, fun fact, prior to the Tokugawa era where Tokyo became the capital, it was called Edo.

  25. Re: No, it is the character pronounced as "no" by alexhs · · Score: 1

    HTML entity pound sign (&pound;): £
    Literal pound sign, as on my keyboard: £
    It's OK for me in preview mode.
    Maybe it's your browser's encoding that's broken ? I have it set as UTF-8. Your rendering (£ (*)) seems to indicate you sent the byte sequence for UTF-8. But I suspect that your browser set the character encoding as ISO-8859-1 in its headers.
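
    A minimal Python sketch of that kind of mismatch (it only illustrates the general mechanism; what your browser and Slashdot actually exchanged is an assumption on my part):

```python
# The pound sign U+00A3 encoded as UTF-8 is the two bytes C2 A3.
pound = "\u00a3"
utf8_bytes = pound.encode("utf-8")

# Misreading those same bytes as ISO-8859-1 yields two characters
# (U+00C2 followed by U+00A3) instead of one.
misread = utf8_bytes.decode("iso-8859-1")

print(utf8_bytes.hex())
print(len(pound), len(misread))
print(misread == "\u00c2\u00a3")
```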

    While I'm at it: "" <- This was supposed to be the "no" hiragana. Disallowed characters are stripped, rather than being "converted" to mojibake.

    (*) Fun fact: rendered as £ in your comment and in editing mode, but as £ in preview mode.

    --
    I have discovered a truly marvelous proof of killer sig, which this margin is too narrow to contain.
  26. Language markup by tepples · · Score: 1

    Please name said software.

    Any HTML renderer ought to be able to tell an element with lang="zh-Hans" (Chinese using simplified characters) from one with lang="ja" (Japanese).
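
    As a rough sketch of what a renderer could do with that attribute (the table, the pick_font helper and the choice of font families are illustrative assumptions, not any real browser's logic):

```python
# Hypothetical mapping from language tags to font families.
FONT_BY_LANG = {
    "ja": "Noto Sans CJK JP",
    "zh-hans": "Noto Sans CJK SC",
    "zh-hant": "Noto Sans CJK TC",
    "ko": "Noto Sans CJK KR",
}

def pick_font(lang_tag, default="sans-serif"):
    """Return a font for a language tag, trying shorter prefixes on a miss."""
    tag = lang_tag.lower()
    while tag:
        if tag in FONT_BY_LANG:
            return FONT_BY_LANG[tag]
        # Drop the last subtag: "zh-hans-cn" -> "zh-hans" -> "zh" -> "".
        tag = tag.rpartition("-")[0]
    return default

print(pick_font("ja"))
print(pick_font("zh-Hans-CN"))
print(pick_font("en"))
```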

  27. Re:single global language by dunkelfalke · · Score: 1

    Actually, English isn't very good for factual information either. It has too many homonyms, a very inconsistent spelling, too ambiguous sentences even with the very strict word order English has to use, no single language authority and too many standard variations.

    Other Germanic languages are much more precise, as are Slavic languages. Due to the more complicated grammar and being synthetic instead of analytic, the meaning of a sentence is clear even if words in the sentence are shifted around, the spelling is usually also phonemic (you write words as you hear them - regular spelling).

    --
    "It's such a fine line between stupid and clever" -- David St. Hubbins, Spinal Tap
  28. Re: Why not just use English, and only English? by tepples · · Score: 1

    "symbols" occupy less space

    Not if you have to make the font bigger to keep the strokes from touching each other. By that point, you could have used a smaller font on the Latin.

  29. Re: Why not just use English, and only English? by AmiMoJo · · Score: 1

    It would have been absolutely fine if they had just stuck to one codepoint per character and not tried to merge them.

    --
    const int one = 65536; (Silvermoon, Texture.cs)
    SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
  30. Re:Is it the same as in Chinese? by Bing+Tsher+E · · Score: 1

    Slashdot is more in the spirit of Usenet than anything else. I wish they'd just strip the 8th bit on everything.

  31. Re: Why not just use English, and only English? by Bing+Tsher+E · · Score: 1

    Then it would need to be written as it exists today, which would mean some sort of calligraphic text system. That wouldn't have been possible with the design of text-based systems 30 years ago, where character bitmaps were stored in a lookup ROM for a rasterizer to use, but it shouldn't be difficult today.

  32. Re:Why not just use English, and only English? by ArcadeMan · · Score: 1

    We shouldn't strive to eliminate other languages, of course. They do have their value, but more as historic curiosities for linguists and historians rather than something to use on a daily basis.

    Quoi?

  33. They're trying to unify *similar* characters by ciaran2014 · · Score: 4, Informative

    A lot of people complain about the idea of unification without understanding it. I can't judge if unicode's unification is great or awful. The English-speaking media constantly says it's awful, but it's usually clear the authors don't know what unification is, who's driving it, or how unicode's work compares to what existed beforehand, so they can only be ignored. (They're sometimes trying to spin up some clickbait about ignorant westerners imposing blah blah blah on Asia, which just shows they know nothing about the topic.)

    The issue:

    There's a certain number of symbols which have been copied from one East Asian language to another. They're the same symbol, so unicode has one slot for that symbol. Then there's a second category where the symbol has been copied, but one group draws it a little different (the Japanese might like to put a little flick at the end of one line, or the Chinese draw the line a little slantier). And a third category where one group has developed a simplified symbol, which means again the traditional and the simplified symbols are the same thing but drawn differently. The two symbols are equivalent, the new one is just a new suggestion for how to draw it.

    Unification is about having one slot for the symbols in categories two and three and leaving it to the font to decide how to display it.

    (Unicode uses more precise terms, but I'm calling them "symbols" and "slots" for simplicity.)
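
    To make "one slot" concrete, here is a small Python sketch using the standard unicodedata module. The character U+76F4 is my own pick of an ideograph that is often cited in this debate; nothing in the code depends on the language of the surrounding text:

```python
import unicodedata

# One "slot" (code point) for the ideograph, whatever language it appears in.
ch = "\u76f4"
print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# The string holds a single code point; which regional shape is drawn
# is decided later by the font, not by the text itself.
print(len(ch))
```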

    A disadvantage to this approach is that there can't be a font which would display a symbol both the way a Japanese would draw it and the way a Chinese would draw it. Fonts have to choose one style to draw each unified symbol.

    An advantage of this approach is that new languages and dialects can be added and supported without needing another 100,000 slots per language or dialect (we do all know there are more than three East Asian languages, don't we?), and it's much easier for fonts to add support for all the East Asian languages because once they've done Chinese, Japanese is automatically almost finished.

    Here are some example symbols:

    https://en.wikipedia.org/wiki/...

    unicode.org's FAQ also has clarifications:

    If the character shapes are different in different parts of East Asia, why were the characters unified?
    http://www.unicode.org/faq/han...

    Isn't it true that some Japanese can't write their own names in Unicode?
    http://www.unicode.org/faq/han...

    (All that said, it's been years since I looked into this so there's a chance I've gotten some detail wrong, but I'm confident it's a good summary of the issue.)

    --
    Help build the anti-software-patent wiki
    1. Re:They're trying to unify *similar* characters by ciaran2014 · · Score: 1

      > you are summarizing A issue, not THE issue the author was making up.

      Yes, my post only relates to the last line of the summary.

      --
      Help build the anti-software-patent wiki
    2. Re:They're trying to unify *similar* characters by AmiMoJo · · Score: 1

      An advantage of this approach is that new languages and dialects can be added and supported without needing another 100,000 slots per language or dialect (we do all know there are more than three East Asian languages, don't we?), and it's much easier for fonts to add support for all the East Asian languages because once they've done Chinese, Japanese is automatically almost finished.

      The first one isn't really an advantage, since there is no shortage of code points. There are massive disadvantages though.

      From a software point of view it would be good to have universal fonts that can render any Unicode character correctly for anyone in the world. The Unicode consortium has tried to support this by splitting some of the more distinct symbols into separate code points for each language, but it's far from complete and every new version adds many more. The FAQ is a joke - when people point out that some Japanese people can't even write their name in Unicode they just wave their hands and say "oh we will fix it one day, and the older standards are just as bad!"

      Back to the point though, universal fonts are impossible. If it's a question of saving memory you just create a font format that lets you assign the same glyph to multiple code points. Instead the application has to figure out what language to use and load an appropriate font. If the wrong one is selected the result might be more or less readable, but again the Unicode guys acknowledge that there are many instances where it isn't and are still trying to fix it by adding new code points. New code points for existing characters are a disaster because older apps/fonts don't support them.

      Plus, I think we should be aiming higher than just "legible". There is a reason why a lot of Japanese software still uses Shift-JIS, it's because Unicode tends to render badly, especially on systems where multiple languages are in use. There are Unicode libraries with masses of code that tries to sort it all out, but it's hacky and incomplete.

      Neither option is perfect, but unification has really hindered adoption of Unicode in East Asia. Even when it is implemented, it tends to be broken.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    3. Re:They're trying to unify *similar* characters by ciaran2014 · · Score: 1

      Thanks for this reply!

      Can you give me an example of a Japanese name that can't be written in unicode? I keep hearing English speakers mention this problem but I've never seen exactly what the problem is.

      --
      Help build the anti-software-patent wiki
    4. Re:They're trying to unify *similar* characters by ciaran2014 · · Score: 1

      > it would be good to have universal fonts that can render
      > any Unicode character correctly for anyone in the world

      But a line has to be drawn between substance and style. There are two (main) ways to draw the number 4. One has a slanty line and is closed at the top, the other is made of straight lines and is open at the top. Or the number 7. For English speakers it's two lines, but for French speakers there's also a horizontal bar across the middle. Should unicode have two 4's and two 7's, or should this be left to the font? The unicode consortium (AFAIK) has decided to give 4 and 7 just one code point each and let the font decide how to display it.

      If you agree 4 and 7 should only have one code point then you agree that some unification is good. The question is the degree. It seems that for East Asian languages unicode started off conservative and they're adding more code points based on real world feedback. That sounds like a reasonable approach (given that there was no perfect approach they could have adopted from the start).

      (Or if you think 4 and 7 should each have (at least) two code points, then I think you're creating an impossible and impractical system which covers every way of writing every symbol.)

      Can Shift JIS display Chinese and Korean? If it can't then it "solves" the problem by ignoring the problem. Some people might find it better today, but unicode has a chance to eventually do what Shift JIS can do, and Shift JIS will never be able to do what unicode can do, so the eventual winner seems clear.
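
      A quick way to check this, sketched with Python's built-in codecs (the sample characters are my own picks, and Python's "shift_jis" codec is assumed to be representative of the encoding):

```python
def encodable(text, codec):
    """Return True if every character in text can be encoded with codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

japanese = "\u306e"  # hiragana "no"
korean = "\ud55c"    # a Hangul syllable

print(encodable(japanese, "shift_jis"))
print(encodable(korean, "shift_jis"))
print(encodable(korean, "utf-8"))
```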

      > Even when [unicode] is implemented, it tends to be broken.

      Not really unicode's fault. Yes, they keep adding new code points (although this is partly because East Asian languages create new ideographs), and yes unicode is newer so application/font developers have had less time to implement it, but this is true of any big new project that's working on something massively complex and isn't finished yet.

      --
      Help build the anti-software-patent wiki
  34. Re: Or speak English, it's 7bit clean by ciaran2014 · · Score: 1

    Example: the story of the man who tried to eat ten lions:

    Shí shì sh shì sh shì, shì sh, shì shí shí sh. Shì shí shí shì shì shì sh. Shí shí, shì shí sh shì shì. Shì shí, shì Sh shì shì shì. Shì shì shì shí sh, shì shì shì, sh shì shí sh shì shì. Shì shí shì shí sh sh, shì shí shì. Shí shì shì, shì sh shì shù shí shì. Shí shì shì, shì sh shì shí shì shí sh. Shí shí, sh shì shì shí sh sh, shí shí sh. Shì shì shì shì

    For the Chinese ideograph version and English translation, see slides 12 and 13 of

    https://web.csulb.edu/~txie/38...

    --
    Help build the anti-software-patent wiki
  35. In other news by rabbin · · Score: 1
    Some slashdot editors have failed to notice that incomplete sentences, which are less and less common in the first sentence of slashdot summaries.

    Some users have noticed that the Japanese character "no", which is extremely common in the Japanese language

  36. Can anyone illustrate? by BlueMonk · · Score: 3, Insightful

    I have been reading the comments for 20 minutes because I don't understand Japanese, but I still don't understand the problem. There's a Japanese character called no, it looks very much like a lowercase English/Latin "e" rotated clockwise about 80 degrees and then flipped over the vertical axis. Is this being mixed up with something else or rendered wrongly? Can anybody provide examples of what it's getting mixed up with or how or where it's being rendered improperly?

    1. Re:Can anyone illustrate? by phantomfive · · Score: 1

      Here's a picture. Notice that the character at the end is rendered in a different font than the rest of the characters. It's not a critical bug, the text is still legible, just an annoying cosmetic bug.

      --
      "First they came for the slanderers and i said nothing."
    2. Re:Can anyone illustrate? by BlueMonk · · Score: 1

      So, pardon my apparent inexperience with Unicode, fonts and glyphs, but this looks like an application or framework issue wherein someone decided that we should switch fonts in the middle of a string if there's another font that contains a glyph for the character we're after in some circumstances. Is that what's happening? Why shouldn't all text drawing operations be restricted to the currently active font, and make it the responsibility of the application developer and user to pick a font that contains all the glyphs required by their application. This doesn't really seem like a fault in Unicode, but in how the application or framework outsmarted itself in trying to switch fonts. Following the K.I.S.S. principle, this never would have happened, right? The application should simply stick to a single font. Also, under what circumstances (if any) would that "wrong" character ever be desired? Is it ever correct? Does it have a similar meaning in these other circumstances?

    3. Re:Can anyone illustrate? by Actually,+I+do+RTFA · · Score: 1

      I can give an example, if you don't mind me running to Greek. Imagine some program renders mathematical symbols differently from text. Imagine that someone writes out, using unicode, the formula for the area of a circle. No problem, right? The pi is clearly a math symbol. But imagine the same thing if you were reading Greek. And beyond that, imagine if all the Greek you read thought pi was being used in a mathematical sense.

      --
      Your ad here. Ask me how!
    4. Re:Can anyone illustrate? by AmiMoJo · · Score: 1

      It's rendered in a way that a Japanese person could read it, but looks ugly because software can't tell if it is Japanese, Chinese or mathematical. It's rather jarring in the middle of sentence and makes the output unsuitable for publishing without manual editing.

      This is due to Unicode assigning the same code to the Japanese, Chinese and mathematical versions. It would be like they tried to merge the Latin "o" and Cyrillic "o". Imagine if every "o" character you wrote was rendered in a different font to all the other Latin characters. Fortunately, even though they look the same they have unique code points in Unicode, so that doesn't happen.
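
      The Latin/Cyrillic part is easy to verify with Python's standard unicodedata module (a small sketch, nothing more):

```python
import unicodedata

latin_o = "o"          # U+006F
cyrillic_o = "\u043e"  # U+043E, looks the same in most fonts

for ch in (latin_o, cyrillic_o):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Same appearance, different code points, so the strings compare unequal.
print(latin_o == cyrillic_o)
```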

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    5. Re: Can anyone illustrate? by BlueMonk · · Score: 1

      What I still don't understand is, if there's only one code point for this character, where are the multiple renderings coming from? Multiple fonts? Is the source of the problem that Japanese fonts are providing a bad glyph/rendering for this character that doesn't match the style of the rest of the font, or is it that they are unable to provide both glyphs because there's only one code point? Would there still be a problem if they just changed their glyph to the other style; could this just be considered a bug in Japanese fonts?

    6. Re: Can anyone illustrate? by BlueMonk · · Score: 1

      It's still not clear how an application rendering Japanese text could end up making the bad assumption. If it's using a Japanese font, why would it bother to switch to another font when the character to be rendered exists in the current font? Does the problem only occur when the current font *doesn't* contain the character, and then the application goes hunting for it and ends up picking up characters from potentially multiple inconsistent fonts? That seems like an application issue, failing to try to retain a consistent font in this defaulting process. It points again to the notion that we should not even be doing that, but rather force applications to use "Unicode fonts" if they want to support Unicode text properly. This seems like a font issue more than a Unicode issue. Does Unicode have separate code points for italic and bold characters in other languages? Why should that information be part of the character instead of the font?

    7. Re: Can anyone illustrate? by Actually,+I+do+RTFA · · Score: 1

      There are multiple fonts: a "math" font and a "japanese" font. The problem is it goes jjjjjjjmjjjjjj for (j)apanese and (m)ath. It's just some programmer who has used the math usage, but never the Japanese usage, assuming that code point was always mathy, as opposed to doing some sane handling of the case.
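
      Roughly what I mean, as a Python sketch. The override table and both functions are hypothetical; I have no idea what the affected software really does internally:

```python
import unicodedata

# Assumption: the buggy software hard-codes U+306E as a "math" character.
MATH_OVERRIDES = {"\u306e"}

def naive_font_runs(text):
    """Pick (m)ath or (j)apanese per character from the hard-coded table."""
    return "".join("m" if ch in MATH_OVERRIDES else "j" for ch in text)

def saner_font_runs(text):
    """Only switch to the math font for real math symbols (category Sm)."""
    return "".join(
        "m" if unicodedata.category(ch) == "Sm" else "j" for ch in text
    )

sample = "\u308f\u305f\u3057\u306e\u672c"  # five characters, the fourth is "no"
print(naive_font_runs(sample))
print(saner_font_runs(sample))
```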

      --
      Your ad here. Ask me how!
  37. Way to miss an opportunity! by juanfgs · · Score: 1

    It would be funnier if the bug was on character "Ni"

  38. Re: Why not just use English, and only English? by interval1066 · · Score: 2

    I'll buy that, but even native Sinolanguage speakers have told me the learning curve for an alphabet is much shallower. Like, MUCH shallower. And since most modern technical terms have Greek and Latin roots, sometimes it's simpler for them to just use the Latin words, otherwise they have to convert the terms to native sounds using bizarre and difficult to use conversion systems. I do agree however that it would have been nice to use a system similar to Kanji right from the beginning had we had one.

    --
    Python: 'And then suddenly you have a language which says "we're all stuck with whatever the whiniest coder wants".'
  39. Re:Is it the same as in Chinese? by Megane · · Score: 1

    Slashdot does support Unicode (assuming your browser can be convinced to post in the right encoding). It just happens to have most of the code points (basically everything above U+00FF) blacklisted.
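
    In effect the filter behaves something like this Python sketch (purely my guess at the behaviour; the function name is made up and this is not Slashdot's actual code):

```python
def strip_above_latin1(text):
    """Drop every character whose code point is above U+00FF."""
    return "".join(ch for ch in text if ord(ch) <= 0xFF)

# The pound sign (U+00A3) survives, hiragana "no" (U+306E) is removed.
print(repr(strip_above_latin1("\u00a3 \u306e x")))
```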

    --
    #naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }
  40. Re: Why not just use English, and only English? by interval1066 · · Score: 2

    English is the official technical language for flight. ALL international pilots, military and civ, MUST know enough English to pass flight school and to fly international commercial flights. It's also the official language of sea navigation, but to a lesser extent. I don't think you need to be as proficient. And English with a number of loan words from Greek and Latin is used in international Engineering. But yeah, English is spoken by the majority of technical people around the world as a common information exchange language.

    --
    Python: 'And then suddenly you have a language which says "we're all stuck with whatever the whiniest coder wants".'
  41. Re:Why not just use English, and only English? by interval1066 · · Score: 1, Informative

    I think Chinese is the only language we need, it's already the most spoken language in the world.

    Only in head count, not by region. If the world was populated only by the Chinese, which seems to be their goal, then yes, Chinese is the most spoken language in the world. However, if you break that fact down by dialect, your statement is really weak. Mao's goal to have the entire PRC speak Mandarin really failed.

    It's a democratic language that will draw from other languages where necessary and useful.

    Not really. Mao tried to force all Chinese to speak Mandarin, and he failed miserably. Kinda the exact opposite of "Democratic". But of course that's not the fault of the language per se...

    It's a language that has proven it can adapt to changing circumstances.

    Chinese may be, but if Japanese is an example, and Japanese is adapted from Chinese by Han explorers to Japan in the Iron Age; it's not very adaptable at all. The Japanese have developed THREE different writing systems to cope with some shortcomings of the language (only two tenses, underdeveloped pronoun system, etc). That may be a shortcoming of Japanese, but Japanese is just a symptom of a language root that isn't very forgiving. I will say however that a language that can be nuanced such that 9 different meanings from changing the tone of one word may be more flexible than I give it credit for.

    --
    Python: 'And then suddenly you have a language which says "we're all stuck with whatever the whiniest coder wants".'
  42. Re:Is it the same as in Chinese? by interval1066 · · Score: 1

    If only everyone just used UTF8 encoding. Unfortunately, Microsoft insisted on using UTF16 and now here we are...

    --
    Python: 'And then suddenly you have a language which says "we're all stuck with whatever the whiniest coder wants".'
  43. Re: No, it is the character pronounced as "no" by ledow · · Score: 1

    Then neither are basically all of the accented characters:

    ÃéÃÃÃ

    ÃÃÃÃ"Ãs

    Quarter, half, most of the currency symbols, etc.

    Extended ASCII is pretty bog-standard. But my point really? I press the pound-sign (or the other characters) on my keyboard, and Slashdot can't render them. Facebook can. The Register can. Every forum in the world can. But not Slashdot.

  44. Re: Why not just use English, and only English? by dunkelfalke · · Score: 1

    ICAO general rules and regulations

    4.4.1c - ICAO languages are English, Spanish, French, Arabic, Russian, and Chinese.

    --
    "It's such a fine line between stupid and clever" -- David St. Hubbins, Spinal Tap
  45. Kanbun: Reordering Chinese to Japanese by tepples · · Score: 2

    Japanese and Chinese syntax differ too much for parallels as close as those of Mandarin and Cantonese. Japanese puts the verb at the end (SOV) and marks noun case with postpositions (wa, ga, o, e). Chinese, on the other hand, puts the verb in the middle (SVO), more like English. (Other orders are possible: Welsh and Arabic put the verb at the beginning, or VSO, and Kashmiri and Dutch split the verb into a part that's second and a part at the end, or V2.)

    Chinese also uses serial verb construction, where verbs before the sentence's main verb double as prepositions. For example, a sentence that glosses literally as "I sit aircraft depart Shanghai arrive Beijing travel" is understood as "I by aircraft from Shanghai to Beijing travel." (English is also SVO, but manner and place phrases follow the verb, producing "I travel from Shanghai to Beijing by aircraft.") In Japanese, each of these prepositional verbs would have to go after the noun and would probably need a participle ending like -tte to link them into the sentence.

    For about eight centuries prior to World War II, Japanese used kanbun, a way to mark up Chinese text to show the equivalent word order in Japanese, allowing it to be read as Japanese. It used reordering marks called kaeriten.

    1. Re: Kanbun: Reordering Chinese to Japanese by _merlin · · Score: 2

      That sentence doesn't require multiple verb clauses in Japanese. You can use destination, origin and means particles "ni", "kara" and "de": Watashi wa Shanghai kara Beijing ni hikouki de ikimasu. Since it's a single verb clause you can reorder it however you want for emphasis as long as the verb comes last - the way I have it there emphasises the subject. If you want to emphasise means of travel and use implicit speaker-as-subject, you can say: Beijing ni Shanghai kara hikouki de ikimasu. It's all easy as long as you get your particles right.

    2. Re: Kanbun: Reordering Chinese to Japanese by _merlin · · Score: 1

      Gah posting at 4:20AM is a bad idea. I emphasised destination in the second example. To emphasise means of transport: Hikouki de Beijing ni Shanghai kara ikimasu. Just put the aspect you want to emphasise (and its associated particle) first. The only part that absolutely must be in a certain place in the sentence is the verb that comes last.

    3. Re:Kanbun: Reordering Chinese to Japanese by billstewart · · Score: 1

      And apparently Korean's even weirder. (I'm going by my childhood memories of my mom describing her job translating Korean during the early 50s. Unfortunately, I don't think she still has her books on basic Chinese characters these days, though I could just as easily find them in a bookstore around here.)

      Some parts of Silicon Valley have a lot of Korean restaurants. I don't think I've seen any Chinese characters on their signs or menus, just alphabetic Korean.

      --

      Bill Stewart
      New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
    4. Re:Kanbun: Reordering Chinese to Japanese by Fire_Wraith · · Score: 1

      Korean sentence structure and grammar is pretty similar to Japanese. I had very little trouble picking up Korean after learning Japanese, because all the concepts were the same (topic/subject/object markers, use of counters, etc), it was just different words. A lot of the Sino-Korean words were also very similar to their Sino-Japanese counterparts, too. It's not surprising, since they're both from the same linguistic family and root, and both share a ton of Chinese influence.

      If anything, the biggest trouble I had was keeping the two separate, since using the wrong one is a rather bad faux pas...

    5. Re:Kanbun: Reordering Chinese to Japanese by StormShaman · · Score: 1

      Both the Korean and Japanese language groups are language isolates, and are not thought to be in the same language family.

  46. Re:Indeed by LordSamanon · · Score: 1

    I cringed immediately on seeing this.

  47. Absolute BS by Anonymous Coward · · Score: 1

    Let's be clear here: the character is U+306E, "hiragana letter no"
    http://www.fileformat.info/info/unicode/char/306e/index.htm

    The general category is "letter, other"
    Nothing to do with math (that would be "math symbol")
    If there is a bug, it is not in Unicode, but in some crappy software.
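
    Anyone can check this with Python's standard unicodedata module (a small sketch; the summation sign is only there for contrast):

```python
import unicodedata

no = "\u306e"
print(unicodedata.name(no))      # character name from the Unicode database
print(unicodedata.category(no))  # general category, "Lo" = Letter, other

# For contrast, a real math symbol such as N-ARY SUMMATION is category "Sm".
print(unicodedata.category("\u2211"))
```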

    Nothing to see, move along.

  48. Re: No, it is the character pronounced as "no" by epine · · Score: 1

    But I suspect that your browser set the character encoding as ISO-8859-1 in its headers.

    Drawing an inference from the not-fact that the top of the batting order in every Wikipedia FAQ does not include how to set your user agent to send the right encoding header, I'd suggest that Slashdot's long-disabled Unicode support fell far short of the mark in the first place. (2005 just called. It wants to dissolve its de facto clue-stick monopoly.)

    I authored a CJK word processor that ran under MS-DOS in the 1980s and early 1990s. Two of our linguists did our own in-house unification that ended up not so different than Unicode which came later.

    At the time that Unicode came out, our largest customer groups were embassies, diplomats (Snowden-style), and other academic linguists (with a strong representation from the Brigham Young young-adult diaspora). Maybe 40% of our new customers in the early 1990s were still running turbo XTs, 286s, and 386 castrati (16 MHz SX of the 16-bit bus resurrected). It takes a long time for the wallet of a dusty academic sinologist to recover from doling out $5000 in 1985 (true story, many times over). 20-year-old Mormon missionaries were not especially flush, either.

    Imagine this as your early-adopter power-user-base for the newly ratified Unicode 1.0 Asian language support.

    Many people at the time running Windows 3.11 were running in 4 MB. Multilingual software remained stuck in this grotesquely underpowered rut until the P54 was introduced in the mid-nineties.

    It's not just the print and display fonts that were a burden to the software of the day, but the mere Unicode code point tables themselves. 256 KB of code-point mapping tables was the rough equivalent of Google grabbing another 256 MB to process-isolate another browser tab (4 MB then, 4 GB now).

    Of course, one can code up a bespoke compression method and clever language subset overlays. I'm sure we invested more man-hours in bespoke compression methods and clever data overlays than Zuckerberg invested in coding up The Facebook, original edition.

    It's probably a good thing that Unicode was rushed to fruition, however broken it now appears to be twenty-five years later, before the first release of NCSA Mosaic. Otherwise, Unicode might have been cobbled together by Brendan Eich in a succession of 4 a.m. coding binges the week after he pounded out JavaScript.

    It's funny that this bug involves typesetting mathematics. If any software was broken with respect to Asian character support, it was surely the original TeX—paragon of infinite breakage that we all now know it to be.

    Back in the mid-to-late eighties, the very idea of sprinkling Asian fonts into math display mode would have been delegated to the savant sibling sequestered in Lamport's sound-proof attic.

  49. Solution by Tablizer · · Score: 1

    Just say "No" to Unicode.

  50. Timothy can't write in English. by Gibgezr · · Score: 1

    Some users have noticed that the Japanese character "no", which is extremely common in the Japanese language (forming parts of many words, or meaning something similar to the English word "of" on its own).

    That isn't even a sentence in English. It is extremely grating to read crap like this, and it does not convey much about the story.

    1. Re:Timothy can't write in English. by gustygolf · · Score: 1

      Yes, I found it difficult as well, took a while to figure what it was referring to. And I *know* Japanese.

      'Japanese hiragana character "no"' and I'll understand it. Hopefully even add a unicode codepoint (U+306E HIRAGANA LETTER NO), maybe even a link to that character's data on e.g. fileformat.info.

      Of course, it's still a sentence fragment, and that's pretty jarring.

      --
      "Slow Down Cowboy! It's been 58 minutes since you last successfully posted a comment" -- slashdot, driving users away.
  51. Re:Is it the same as in Chinese? by jones_supa · · Score: 1

    No, it already works. It was active for a while some 10 years ago, but was removed because it was hard to sanitize. You could easily write your own comment score by reversing direction at the right time.

    Still they could reactivate it if they just found a reasonable way of sanitizing features they don't want.

    Dude, all other websites support Unicode. Sanitizing it properly cannot be rocket science.

  52. Re:Why not just use English, and only English? by KingMotley · · Score: 1

    I think Chinese is the only language we need, it's already the most spoken language in the world.

    That is false. English is the most spoken language in the world. Chinese is the most popular primary language.

  53. Re:Why not just use English, and only English? by KingMotley · · Score: 1

    Not even by head count. 1.5 billion people can speak English, compared with 1.0 billion who can speak "Chinese".

  54. Re: No, it is the character pronounced as "no" by butlerm · · Score: 1

    ASCII is the American Standard Code for Information Interchange, a 7 bit encoding system. The most common strictly 8 bit encoding is ISO-8859-1, slightly expanded by Microsoft as Windows-1252, also known as Win-ASCII.

    Of course these days, everyone in their right mind should generally be using UTF-8 for transfer and storage. UCS-2 and UTF-16, though widely used internally, are basically a mistake for that kind of thing.
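
    A rough sketch of the differences, using only Python's built-in codecs. The helper names is_ascii and encoded_sizes are invented for this example, and the byte counts are for the "-le" codec variants, which emit no byte-order mark.

```python
NO = "\u306e"  # HIRAGANA LETTER NO

def is_ascii(text: str) -> bool:
    """True if every character fits in ASCII's 7 bits (0..127)."""
    return all(ord(c) < 128 for c in text)

def encoded_sizes(text: str) -> dict:
    """Byte length of text in a few Unicode encodings (no BOM)."""
    return {name: len(text.encode(name))
            for name in ("utf-8", "utf-16-le", "utf-32-le")}

print(is_ascii("no"), is_ascii(NO))  # True False
print(encoded_sizes("no"))           # ASCII text: UTF-8 is the smallest
print(encoded_sizes(NO))             # kana: 3 bytes in UTF-8, 2 in UTF-16

# ISO-8859-1 and Windows-1252 disagree on bytes 0x80..0x9F.
# Windows-1252 puts the euro sign at 0x80; ISO-8859-1 has a control code.
print(repr(b"\x80".decode("cp1252")))   # '€'
print(repr(b"\x80".decode("latin-1")))  # '\x80'
```

    UTF-8 leaves plain ASCII text byte-for-byte unchanged, which is a large part of why it suits transfer and storage.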

  55. Re:Is it the same as in Chinese? by KingMotley · · Score: 1

    Considering that Windows NT was around *before* UTF-8, it would have been rather difficult to implement it. What you really meant to say was, unfortunately, standards committees are often too slow to implement things like UTF-8 in a timely manner.

  56. Re:Why not just use English, and only English? by AmiMoJo · · Score: 1

    Chinese may be, but if Japanese is an example, and Japanese is adapted from Chinese by Han explorers to Japan in the Iron Age; it's not very adaptable at all. The Japanese have developed THREE different writing systems to cope with some shortcomings of the language (only two tenses, underdeveloped pronoun system, etc). That may be a shortcoming of Japanese, but Japanese is just a symptom of a language root that isn't very forgiving. I will say however that a language that can be nuanced such that 9 different meanings from changing the tone of one word may be more flexible than I give it credit for,

    That's not right. The exact origins of the Japanese language are lost to pre-history, only guessed at. It was the writing system that was brought over from China. Then katakana and hiragana were added to support the parts of the Japanese language that can't be written adequately in the Chinese system. They were simply added to support the way the language was already spoken, not to make up for any limitations.

    --
    const int one = 65536; (Silvermoon, Texture.cs)
    SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
  57. Re:Bug? by Darinbob · · Score: 1

    Because the Europeans overrode the Asians when creating the Unicode "standard". They wanted to save code space, despite not being short on it (maybe some idiots thought it could be done in 16 bits, but no one on the committee was that naive).

    In English, why are 1 and l not the same code point, despite having the same look in so many fonts? Many typewriters did not even have separate 1 and 0 keys (tell that to kids these days and they won't believe you). It sounds idiotic to us to give them the same ASCII code. Now imagine native speakers of Asian languages being told similar things about their writing systems.

    The problem with "no" is sort of a side issue in some sense to all this, but the problems with Han unification have been known for decades.
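
    To make the comparison concrete, here is a small sketch using Python's standard unicodedata module. The helper name label is invented for this example, and U+76F4 is only one commonly cited unified ideograph, picked here as an illustration.

```python
import unicodedata

def label(ch: str) -> str:
    """Return 'U+XXXX NAME' for one character (hypothetical helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Look-alikes in Latin text still get separate code points.
for ch in ("1", "l", "0", "O"):
    print(repr(ch), label(ch))

# A unified Han character gets a single code point. Whether it is drawn
# in the Chinese or the Japanese style is left to the font or to
# language tagging, not to the character code.
han = "\u76f4"
print(label(han))  # U+76F4 CJK UNIFIED IDEOGRAPH-76F4
```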

  58. Re:single global language by Xtifr · · Score: 1

    Why do you think language gets overhauled in Orwell's 1984?

    Because Orwell was a little too enamored of the so-called "Sapir-Whorf hypothesis"? I hate to break it to you, but, despite its many obvious parallels to the real world, 1984 was ultimately a work of fiction.

    While it's undeniable that language has some influence on culture and thought, the idea that it can be as influential as proposed by some early SF writers (e.g. Orwell, Jack Vance's The Languages of Pao, or Samuel Delany's Babel-17) is mostly discredited.

  59. Re:Is it the same as in Chinese? by billyswong · · Score: 1

    Then what about just stripping Unicode from the comment subject, and leaving the comment content intact?

  60. MOD PARENT UP, PLEASE! by billstewart · · Score: 1

    Even using vector fonts doesn't fix the problem that Unicode wasn't a great solution for managing the diversity of characters in many Asian languages.

    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
  61. Bad English is the world's most common language by billstewart · · Score: 1

    I was once at a conference in Germany, most of which was given in English because it was an international crowd. One of the German speakers started off by saying that he used to start by apologizing for his bad English, but the host (who was Turkish) told him not to worry; Bad English is the most widely spoken language in the world. (Which is fine; English is flexible enough about most things that if you don't need to be subtle, Bad English will usually do.)

    German's the only non-English language that I'm even vaguely functional in, and even then it was much more useful for me in Czechoslovakia, where people had learned German in school to deal with tourists, and I mainly wanted to talk to them about the same sets of things, like train schedules and getting food and hotels and which bridge went to the castle. Northern Germans speak a relatively comprehensible dialect, though too fast for me to do much in real time; understanding Austrians is more like being a New Yorker in deep Alabama. (I play music at a local German jam session, and some of the tunes have the lyrics translated from Bavarian or Swiss into German...)

    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
  62. Typewriter character sets without 1 and 0 by billstewart · · Score: 1

    I'm pretty sure my mom's manual typewriter when I was a kid didn't have 1, less sure about whether it had 0. But it did have the proper French and Spanish accent marks (left, right, circumflex, N~, cedilla, most of which my PC keyboard doesn't have), and you composed them with letters by using the backspace.

    And yes, she could do two-column left-and-right-justified newsletters on it - she'd type a draft, count the letters, type the final. But she happily switched to using a Macintosh to type them, and let it handle that stuff.

    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
  63. English? 7bit clean?? Bwahahah! by billstewart · · Score: 1

    Yes, I know you were trolling, but in your mythical 7-bit-clean English, even if you're not using English letters like ð or þ, or ligatures like æ, or distinguishing between short and long S's (you know, the s's you used to think were f's), how do you put diaeresis marks over words like cooperate, or distinguish between m-dash and n-dash and hyphen, or get the left- and right-side quotation marks without using some Microsoft or Apple ``smart quote'' breakage, much less deal with accent marks in words of foreign origin that are now part of English because we stole them fair and square and they're ours now, or handle degree marks, or words with superscript letters like the abbreviations for the and that and George and Your, or ...

    And turning them all into leet-speak, like earlier Ye Olde Hwaetever's, just doesn't count.

    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
  64. Re:Why not just use English, and only English? by Fire_Wraith · · Score: 1

    Keep in mind that India, which is nearly as populous as China, is a predominantly English speaking country. If sheer number of speakers is a key, the future will probably turn out to be something like Firefly, with a mishmash of Chinese and English.

  65. Re:Why not just use English, and only English? by JustOK · · Score: 1

    nuq ghe''or vIghel SoH?

    --
    rewriting history since 2109
  66. Re:Why not just use English, and only English? by Ash-Fox · · Score: 1

    I honestly don't really care for the argument. I just think it's a stupid argument to make because you can apply it to other languages.

    --
    Change is certain; progress is not obligatory.
  67. Re: Why not just use English, and only English? by Ash-Fox · · Score: 1

    Chinese uses a sensible alphabet? What are you smoking?

    Tell me more, educate me.

    --
    Change is certain; progress is not obligatory.
  68. Re:Notice This by Desty · · Score: 1

    Don't you just hate sentences that (while providing important background information)?

  69. Re: Why not just use English, and only English? by Ash-Fox · · Score: 1

    Done, don't see the issue with Chinese writing and its consonants. You have failed to identify the issue.

    --
    Change is certain; progress is not obligatory.
  70. Re:Why not just use English, and only English? by Ash-Fox · · Score: 1

    According to http://www.infoplease.com/ipa/...

    Chinese is apparently first when it comes to native speakers. What data distinguishes whether someone can speak it as a second language, and what level of language knowledge does a person need to be counted as speaking that language?

    --
    Change is certain; progress is not obligatory.
  71. Re: Or speak English, it's 7bit clean by Zanadou · · Score: 1

    For the Chinese ideograph version and English translation, see slides 12 and 13 of...

    I know I'm a few days late to this thread, but how about we forgo the proprietary ppt filetype plug-in hell, and go straight to Wikipedia?

  72. Re: Why not just use English, and only English? by radarskiy · · Score: 1

    Note that the problems also involve Korean, which is written with an alphabet, not ideographs.

  73. Re: Or speak English, it's 7bit clean by ciaran2014 · · Score: 1

    Excellent. I wish I'd found that when writing my comment. I'd read a printed version so I was just happy to find any version of it online.

    --
    Help build the anti-software-patent wiki
  74. Re:Bug? by Perky_Goth · · Score: 1

    I haven't read it, but the full explanation is on https://en.wikipedia.org/wiki/...