Slashdot Mirror


Why Unicode Won't Work on the Internet

We received this interesting submission from N. Carroll: "Unicode, the commercial equivalent of UCS-2 (ISO 10646-1), has been widely assumed to be a comprehensive solution for electronically mapping all the characters of the world's languages, being a 16-bit character definition allowing a theoretical total of over 65,000 characters. However, the complete character sets of the world add up to approximately 170,000 characters. This paper summarizes the political turmoil and technical incompatibilities that are beginning to manifest themselves on the Internet as a consequence of that oversight. (For the more technical: the recently announced Unicode 3.1 won't work either.)" Read the full article.

16 of 416 comments

  1. Unicode Character Set vs Character Encoding by Jordy · · Score: 5
    The current permutation of Unicode gives a theoretical maximum of approximately 65,000 characters (actually limited to 49,194 by the standard).
    The biggest problem with Unicode is that no one understands what it is. Unicode defines two things, a character set that maps a character into a character code and a number of encoding methods that map a character code into a byte sequence.

    ISO 10646, the Universal Character Set defines a 31 bit character set (2,147,483,648 character codes), not a 16 bit character set. Unicode 3.0's character set corresponds to ISO 10646-1:2000. Unicode 3.1 which was recently released goes a bit further.

    UCS-2, as mentioned by this article, is the same as UTF-16 and is severely limited by its 16 bit implementation. UTF-16 is unfortunately used by Windows and Java, but is rarely used on the web. The article claims UTF-16 can only map 65,000 characters, but using surrogate pairs it can actually map over 1 million characters.
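
    The "over 1 million" figure follows from simple counting. A minimal sketch in Python, assuming the surrogate ranges defined by the Unicode Standard (U+D800..U+DBFF for high surrogates, U+DC00..U+DFFF for low surrogates); the constant names are illustrative only:

```python
# Count the code points reachable through UTF-16 surrogate pairs.
HIGH_SURROGATES = range(0xD800, 0xDC00)  # 1,024 possible lead values
LOW_SURROGATES = range(0xDC00, 0xE000)   # 1,024 possible trail values

pairs = len(HIGH_SURROGATES) * len(LOW_SURROGATES)
print(pairs)    # 1048576 code points beyond the 16-bit range
print(2 ** 16)  # 65536 values available to a single 16-bit unit
```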

    Thankfully, there are several other encoding methods for Unicode. UTF-8, which is a variable length encoding most commonly used on the web allows a mapping of Unicode from U-00000000 to U-7FFFFFFF (all 2^31 character codes). It also has a nice feature of the lower 7 bits being ASCII, so there is no conversion necessary from ASCII to UTF-8.
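
    The ASCII compatibility and the variable length can both be checked with Python's built-in codecs. This is only a sketch: the sample characters are arbitrary, and Python's codec follows the later restriction of UTF-8 to U+10FFFF, so the longer forms needed for the full 31-bit range cannot be shown this way.

```python
# ASCII text is byte-for-byte identical in ASCII and in UTF-8.
text = "plain ASCII"
assert text.encode("utf-8") == text.encode("ascii")

# Characters outside ASCII take two or more bytes each.
samples = ["A", "\u00e9", "\u20ac", "\U00010348"]
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)  # [1, 2, 3, 4]
```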

    UTF-32 or UCS-4 is a 32 bit character encoding used by a number of Unix systems. It's not exactly the most space efficient form (UTF-8 requires roughly 1.1 bytes per character for most Latin languages), but it can handle the entire Unicode character set.

    A good document on this is available at UTF-8 And Unicode FAQ
    --
    The world is neither black nor white nor good nor evil, only many shades of CowboyNeal.
  2. Re:UTF-8 should be fine for almost any application by spitzak · · Score: 5
    Thanks for some more intelligent discussion about UTF-8.

    I might add a few things:

    In UTF-8, it is not just NULL and Escape that never appear inside multibyte characters; in fact *all* 7-bit characters never appear there (every byte of a multibyte character has the high bit set). This means that *any* program that treats all bytes with the high bit set as a "letter" will work and can parse, hash, match, search, etc identifiers/words with foreign letters in them!
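
    A rough illustration of that point: the byte-oriented splitter below knows nothing about UTF-8 and simply treats every high-bit byte as a letter. The function name `split_words` and the sample text are made up for this sketch.

```python
def split_words(data: bytes) -> list:
    """Split bytes into words; ASCII letters, digits and high-bit bytes count as letters."""
    words, current = [], bytearray()
    for b in data:
        if b >= 0x80 or chr(b).isalnum():
            current.append(b)
        elif current:
            words.append(bytes(current))
            current.clear()
    if current:
        words.append(bytes(current))
    return words

# "naive" and "cafe" with accents, three kanji, and a plain ASCII word.
data = "na\u00efve caf\u00e9, \u65e5\u672c\u8a9e text".encode("utf-8")
words = [w.decode("utf-8") for w in split_words(data)]
print(words)
```

    Because no byte of a multibyte sequence is below 0x80, the ASCII space and comma can never be mistaken for part of a character, so every word comes out as valid UTF-8.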

    In addition the UTF-8 encoding is just heavy enough that random line noise is very unlikely to match a UTF-8 encoding. If programs treat "illegal" UTF-8 encodings as individual bytes in the ISO-8859-1 character set, it will display virtually all existing ASCII/ISO-8859-1 documents unchanged!
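
    One possible reading of that fallback rule, sketched in Python: decode as UTF-8 and, wherever the decoder reports invalid bytes, map just those bytes through ISO-8859-1. The helper name `decode_lenient` is hypothetical, and real programs may choose a different policy.

```python
def decode_lenient(data: bytes) -> str:
    """Decode UTF-8, treating each invalid byte run as ISO-8859-1."""
    parts = []
    while data:
        try:
            parts.append(data.decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # Keep the valid prefix, then map the offending bytes one-to-one.
            parts.append(data[:err.start].decode("utf-8"))
            parts.append(data[err.start:err.end].decode("iso-8859-1"))
            data = data[err.end:]
    return "".join(parts)

legacy = b"caf\xe9"                    # ISO-8859-1 bytes, not valid UTF-8
modern = "caf\u00e9".encode("utf-8")   # the same word as UTF-8
print(decode_lenient(legacy) == decode_lenient(modern))  # True
```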

    The end result is that it should be easy to switch all interfaces (not just over the network, but inside programs and to libraries) to UTF-8. This will vastly simplify the handling of Unicode because there will be no need for ASCII backward-compatibility interfaces. We could also eliminate all the "locale" crap and make ctype.h the simple thing it once was.

    Even Arabic will encode smaller in UTF-8 than UTF-16. This is due to the fact that very common characters (not just English, but things like space and newline) are only one byte.
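
    A quick size check, using an arbitrary two-word Arabic phrase written with escapes. Arabic letters take two bytes in both encodings, so in this sketch the whole saving comes from the single space; text with more spaces, newlines, digits or markup would save more.

```python
# Eleven Arabic letters and one ASCII space.
text = ("\u0627\u0644\u0633\u0644\u0627\u0645"
        " "
        "\u0639\u0644\u064a\u0643\u0645")

utf8_size = len(text.encode("utf-8"))
utf16_size = len(text.encode("utf-16-le"))  # no byte order mark
print(utf8_size, utf16_size)  # 23 24
```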

  3. Some errors by BJH · · Score: 5

    Hiragana, which is somewhat cursive, can be used to augment Kanji - in fact, everything in Kanji can be written in Hiragana. Katakana, which is much more fluid in appearance than is Hiragana, is used to write any word which does not have its roots in Kanji, such as the many foreign words and ideas which have drifted into general use over the centuries.

    In actual fact, Katakana is much more angular than Hiragana - definitely not "fluid" in appearance. Furthermore, anything that can be written in Kanji can be written (phonetically) in either Hiragana or Katakana - the use of Katakana for foreign words is nothing more than custom, not a limitation of the characters.

    Thus it can be said that Hiragana can form pictures but Katakana can only form sounds...

    That should probably read "Kanji can form pictures but Hiragana/Katakana can only form sounds..."

    Romaji is used to try and keep the whole written thing from getting out of control, with most Western concepts and necessary words being introduced into the language through this mechanism.

    Bollocks. Romaji is hardly ever used (except for advertisements, and then only rarely, or textbooks for foreigners). It's definitely not the main conduit for Western ideas.

    After a time these words (even though they will still maintain their "Roman" form for awhile longer) will become unrecognizable to the people they were originally borrowed from, such as the phrase, "Personal Computer," which is now "PersaCom" in Japan.

    Again, this is incorrect. Words don't *have* a Roman form in everyday use; sure, you can express them in Romaji but no-one ever does. As for "personal computer", the correct Romanization is 'pasokon', not 'PersaCom'. (Where did he get that from?!)

    The rest of the 1,950 have to be memorized fully by the time of graduation from high school in Grade Twelve. Please remember that this total is only the legal minimum required threshold to be considered literate. And this is to be absorbed completely, along with a back-breaking load of other subjects.

    Ummm... that's actually not too hard. I (along with everyone else at my language school) memorized more than 1300 Kanji in less than a year... and none of us were Japanese. I know it must seem like an impossible total to people used to ASCII, but there are many common points between Kanji that simplify the learning process greatly.

    That said, I've long been against the current Unicode "standard", as have many technical people in Japan, for a number of reasons. Some of those are:

    - No standard conversion tables from existing character sets (SJIS, EUC-JP, ISO-2022-JP).
    Several conversion tables do exist, but there are minor differences between them that make it impossible to go from, say, SJIS to Unicode and back to SJIS without the possibility of changing the characters used.

    - A draconian unification of CJK characters.
    The Unicode Consortium basically forced the standards bodies in China, Japan and Korea to unify certain similar Kanji onto single code points, which doesn't allow for cases where, say, Japanese actually has two or three distinctive writings that are used in different situations.

    - The ugly "extensions".
    Unicode has been effectively ruined as a method of data exchange by its treatment of characters not in the 60,000-character basic standard.
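
    The conversion-table complaint in the first item can be observed with the two Shift_JIS tables that ship with Python, `shift_jis` and `cp932`. This is a sketch using one well-known byte pair (the wave dash); it says nothing about which table is "right", and other converters may differ again.

```python
raw = b"\x81\x60"  # the JIS X 0208 wave dash as Shift_JIS bytes

via_jis = raw.decode("shift_jis")  # Python's JIS-based table
via_ms = raw.decode("cp932")       # Python's Microsoft-based table
print(f"U+{ord(via_jis):04X} vs U+{ord(via_ms):04X}")

# Text decoded with one table may not encode cleanly with the other.
try:
    round_trip = via_ms.encode("shift_jis")
except UnicodeEncodeError:
    round_trip = None
print(round_trip == raw)
```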

    I could go on, but I should get some sleep...

  4. Re:You bring up a good point by gleam · · Score: 5

    The writing system with the smallest alphabet that is in current use is Hawaiian, with 12 letters. (aeiou hklmnpw) source

    A good source for your obscure questions is, as always, the Straight Dope, which answers the "Chinese Typewriter" question here.

    Regards,
    gleam

    --
    this .sig is not a .sig.
  5. A Plan for the Improvement of English Spelling by bgarcia · · Score: 4

    Go read the original story here, by Mark Twain.

    --
    I'm a leaf on the wind. Watch how I soar.
  6. too sinocentric, but Unicode has problems by rjh3 · · Score: 4

    Ah, the horrors of Unicode. The referenced article is too Sinocentric. Unicode's problems go further. Unicode is both a european solution to european problems and a european solution to asian problems.

    The Japanese hate Unicode. If you bother to ask them, which the web did not, you find a loud and impolite dislike for Unicode. The Japanese want their ISO 2022 solution, aka shift-JIS.

    The history of encodings is roughly:
    1. There was chaos.
    2. Then there was ASCII (the roman alphabet) pleasing to latin and english speakers.
    3. Then there were all the ISO 8859 and ISO 2022 encodings. These let all the european languages mix together with ASCII.
    4. Then Japan, Korea, and Vietnam define their own ISO 2022 encodings that make sense in the local language, and let these languages mix together with the european languages and ASCII.
    5. But ISO 2022 is a complex patchwork of special cases. So at the same time the Asians were inventing their ISO 2022 solutions, Unicode was being invented.
    Unicode 1.0 provided a viable solution to modern european languages, but could not encode historical documents or asian languages properly. The Unicode 2.0 effort fixed the historical european language problem by adding in the alphabets for these "dead" languages. Unicode 2.0 brought the asian encodings to the point where they were usable.

    Japan and Korea get no benefit from Unicode. In fact, their ISO 2022 encodings are at least in "alphabetical order" for the relevant alphabets. Unicode is just a jumble.

    Meanwhile China has a unique problem. They do not have an agreed alphabet. The Japanese all around the world agree on what characters define Kanji. There may be different fonts, but there is one agreed alphabet. Similarly, the Koreans and the Vietnamese have one agreed alphabet. These alphabets are huge, with thousands of characters, but they are fixed and agreed worldwide.

    China has not agreed on an alphabet. Different regions use different alphabets. Chinese speak numerous different languages and have invented an amazing alphabet that works as a single writing form for all those languages. But there are disagreements. Furthermore, some regions of China are still inventing new letters for the alphabet. It is not a fixed and stable thing like european alphabets. You can invent new letters. (These really are new letters, not just new fonts.)

    The Chinese have invented many encodings as a result. The two most popular (Big5 and GB2312) are not ISO 2022 compatible. There is a new, less widely used encoding that is a superset encoding of BIG5, GB2312, and other encodings, and that is ISO 2022 compatible.

    Unicode did not accept the approach of leaving all these alphabets as different. They share most of their glyphs. Giving each region and language its own complete section would have blown the 50K limit of Unicode 2.0. They smushed all these different alphabets into one blob by combining anything that had similar glyphs into one character.

    This left Unicode 2.0 telling the Chinese, ignore all those letters we don't like. You don't use them much anyhow. It destroyed any notion of alphabetic order in the encodings for any asian language. And it is usable for modern text communication. Unicode 3.0 promises to do better, and probably will.

    But since all these languages can use the ISO 2022 encodings with fully compatible mixture of languages, why not just use ISO 2022 and forget Unicode? The problem is the patchwork nature of ISO 2022. The encoding rules are complex. ISO 2022 is a terrible internal format. A chinese character may take from 2 to 9 bytes to encode. And it gets worse as you dig further. UCS-2 and UCS-4 are very nice friendly internal formats for computers. It is trivial to convert from UCS-2 or UCS-4 into UTF-8 for transmission.
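
    To give a sense of how little work that conversion is, here is a hand-written sketch of the code point to UTF-8 step, limited to the range up to U+10FFFF. The function name `utf8_encode` is illustrative, and the sketch does not reject surrogate values as a strict encoder would.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point (0..0x10FFFF) as UTF-8 with shifts and masks."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Compare against Python's own encoder for a few sample code points.
for cp in (0x41, 0xE9, 0x4E2D, 0x10348):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```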

    It is also pretty simple to translate from UCS-2 or UCS-4 into ISO 2022 encodings. So the ISO 2022 encodings actually can make sense for network transmission.

    These issues will just get worse as you include other languages, like historical chinese, chinese border languages, and south asian languages. As with chinese, some of these have the fundamentally hard problem that they do not agree on a single alphabet.

  7. Wrong, wrong! by Mendax+Veritas · · Score: 4
    UCS-2 is not the only form of Unicode, and it's well known that 64k characters isn't enough. Besides, why should ordinary ISO-8859 (Latin-1) text be doubled in size by making every character 16 bits? UTF-8 is a much better solution, and it is good enough. Granted, string handling with variable-length characters is a bit of a pain (especially if you're used to assuming that a buffer of N bytes is long enough for a string of N characters, or you want to scan the string backwards), but it's the best solution we've got. It's the recommended encoding for XML documents, and is used today in web browsers (check out that "Always send URLs as UTF-8" option in Internet Explorer).
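
    Both pains mentioned above (N bytes is not N characters, and scanning backwards) can be seen in a few lines. The sketch assumes the buffer holds valid UTF-8; the helper name `previous_char_start` is made up for illustration.

```python
def previous_char_start(buf: bytes, pos: int) -> int:
    """Return the index where the character ending just before pos begins."""
    pos -= 1
    # Continuation bytes look like 10xxxxxx; step back over them.
    while pos > 0 and (buf[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos

buf = "a\u00e9\u4e2d".encode("utf-8")  # 3 characters
print(len(buf))                         # 6 bytes

starts = []
pos = len(buf)
while pos > 0:
    pos = previous_char_start(buf, pos)
    starts.append(pos)
print(starts)  # [3, 1, 0]
```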

    It is a shame that there are so many different Unicode encodings. I think we ought to just standardize on UTF-8.

  8. unicode does *not* encode 65,536 characters by egomaniac · · Score: 4

    It encodes over one million codepoints, actually (the erroneous statements of other posters notwithstanding). All currently assigned Unicode characters exist within the basic Unicode Plane 0, as it's called, which handles ~50,000 characters. Twenty-some-odd-thousand of those characters are in the CJK block (Chinese, Japanese, and Korean characters).

    Now, a range of Unicode characters is set aside for so-called "surrogates", and a high surrogate and a low surrogate character placed next to one another form a "surrogate pair" which specifies an extended character in UCS Plane 1. None of the UCS Plane 1 codepoints are actually assigned to anything yet, but since there are about 2^20 (~one million) Plane 1 codepoints, they will easily handle all remaining glyphs with a ton left over. Tengwar, Klingon and others have all been considered for Plane 1 encoding (although I just checked and Klingon has been rejected. Sorry folks).
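
    The pairing itself is a small piece of arithmetic. A sketch following the UTF-16 definition, with illustrative function names and U+10348 as an arbitrary sample code point:

```python
def to_surrogates(cp: int) -> tuple:
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    offset = cp - 0x10000  # a 20-bit value
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def from_surrogates(high: int, low: int) -> int:
    """Recombine a surrogate pair into the code point it stands for."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogates(0x10348)
print(hex(high), hex(low))               # 0xd800 0xdf48
print(hex(from_surrogates(high, low)))   # 0x10348
```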

    So, the simple fact is that anyone who says Unicode can't support enough characters has been smoking a bit too much crack lately. Do yourself a favor and go read the spec before getting your panties in a twist.

    --
    ZFS: because love is never having to say fsck
  9. Overstating and misunderstanding the problem by kurisuto · · Score: 4
    This article mischaracterizes the issue concerning the Chinese characters. To take a western example as an illustration, the number one is handwritten in America as a vertical stroke, but in Germany as an upside-down V. However, folks in America and Germany agree that this is "the same character"; we simply have a different way of writing it. Unicode recognizes this sameness by assigning the same code to the character "one"; the way to display it locally is a presentation issue, not an encoding one.

    This is exactly the issue with the Chinese characters. For a given character, there might be a difference between the Taiwanese way of writing it, the Japanese way, and the mainland Chinese way; but the character is still recognized as being the same, despite these presentation-level differences.

    For someone to demand that each national presentation form have its own character code is to misunderstand what Unicode is designed for. It encodes abstract characters, not presentation forms. Unicode does not have separate codes for "A" in Garamond and "A" in Helvetica.

  10. Solution - Everybody use Euro-English! by saider · · Score: 4

    The European Commission has just announced an agreement whereby English will be the official language of the EU rather than German which was the other possibility. As part of the negotiations, Her Majesty's Government conceded that English spelling had some room for improvement and has accepted a 5 year phase-in plan that would be known as "Euro-English".

    In the first year, "s" will replace the soft "c". Sertainly, this will make the sivil servants jump with joy. The hard "c" will be dropped in favour of the "k". This should klear up konfusion and keyboards kan have 1 less letter.

    There will be growing publik enthusiasm in the sekond year, when the troublesome "ph" will be replaced with "f". This will make words like "fotograf" 20% shorter.

    In the 3rd year, publik akseptanse of the new spelling kan be ekspekted to reach the stage where more komplikated changes are possible. Governments will enkorage the removal of double letters, which have always ben a deterent to akurate speling. Also, al wil agre that the horible mes of the silent "e"s in the language is disgraseful, and they should go away.

    By the fourth year, peopl wil be reseptiv to steps such as replasing "th" with "z" and "w" with "v". During ze fifz year, ze unesesary "o" kan be dropd from vords kontaining "ou" and similar changes vud of kors be aplid to ozer kombinations of leters.

    After zis fifz yer, ve vil hav a reli sensibl riten styl. Zer vil be no mor trubl or difikultis and evrivun vil find it ezi to understand ech ozer. Ze drem vil finali kum tru!


    --


    Remember, You are unique...just like everyone else.
  11. ISO-2022-JP and "alphabetical order" by achurch · · Score: 4

    >>Japan and Korea get no benefit from Unicode. In fact, their ISO 2022 encodings are at least in "alphabetical order" for the relevant alphabets. Unicode is just a jumble.

    I can't speak for Korean, but there is no such thing as an alphabetic order for Kanji. In Japanese, Kanji almost always have at least two pronunciations, and often more.

    While it is true that most all kanji have multiple pronunciations, the kanji in ISO-2022-JP are most definitely in order. Level 1 characters (0x3021-0x4F7E) are ordered by their primary reading, and Level 2 characters (0x5021-0x7426?) are ordered first by radical and then by number of strokes. In both cases it's easy to locate a character if for some reason you can't type it normally (e.g. it's not in your IME dictionary)--I've had to do this on occasion, in fact.
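
    The ranges quoted above can be turned into a small lookup. This sketch derives the JIS code from Python's `euc_jp` codec (in EUC-JP a JIS X 0208 character is its two JIS bytes with the high bit set); the function names are hypothetical, and the Level 2 upper bound 0x7426 is taken from the comment, question mark and all.

```python
from typing import Optional

def jis_code(ch: str) -> Optional[int]:
    """Return the 16-bit JIS X 0208 code of ch, or None if it has none."""
    try:
        raw = ch.encode("euc_jp")
    except UnicodeEncodeError:
        return None
    if len(raw) != 2 or raw[0] < 0xA1:
        return None  # ASCII, half-width katakana, or not in JIS X 0208
    return ((raw[0] & 0x7F) << 8) | (raw[1] & 0x7F)

def kanji_level(ch: str) -> Optional[int]:
    """Classify a character as Level 1, Level 2, or neither (None)."""
    code = jis_code(ch)
    if code is None:
        return None
    if 0x3021 <= code <= 0x4F7E:
        return 1
    if 0x5021 <= code <= 0x7426:
        return 2
    return None

print(hex(jis_code("\u4e9c")), kanji_level("\u4e9c"))  # 0x3021 1
print(kanji_level("\u3042"))                            # None (hiragana)
```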

    Unicode is, for all intents and purposes, completely random. Even without the problems of characters being inappropriately merged, there is no way you could try and find a character in Unicode; if your dictionary doesn't have it, tough luck. To me, that's an even scarier concept: for all practical purposes it could eliminate characters from the language. After all, if nobody can type it who's going to use it?

    Have you ever tried to program in shift-JIS? It is horrific.

    I will agree with this. Leaving aside the original poster's confusion of ISO-2022-JP and shi[f]t-JIS (the former is the official standard, aka JIS, while the latter is a poorly-thought-out Microsoft hack), dealing with strings that contain both half-width (1-byte) and full-width (2-byte) characters is a major PITA. About the only thing that can be said for it is the number of bytes is equal to the number of half-width character positions needed; and even that only applies to EUC and SJIS, since JIS has escape sequences to squeeze everything into 7-bit characters.
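
    The byte-count observation can be checked with Python's codecs. The sketch uses three ASCII letters followed by two kanji; `display_columns` is an illustrative helper built on `unicodedata.east_asian_width`, not part of any encoding standard.

```python
import unicodedata

def display_columns(text: str) -> int:
    """Count half-width columns: wide and full-width characters take two."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in text)

text = "ABC\u6f22\u5b57"  # three ASCII letters, then two kanji
print(display_columns(text))          # 7
print(len(text.encode("shift_jis")))  # 7
print(len(text.encode("euc_jp")))     # 7

# ISO-2022-JP stays 7-bit but pays for it with escape sequences.
jis = text.encode("iso2022_jp")
print(len(jis) > 7, max(jis) < 0x80)  # True True
```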

    On the other hand, there's the character order consideration, which along with the problem of merged characters seems to be what draws so much dislike for Unicode from Japanese.

    --
    BACKNEXTFINISHCANCEL

  12. After some skimming... by update() · · Score: 4
    I planned to read this through before posting. I really did. But then, in the second paragraph I hit:
    Wieger's seminal book about the characters and construction of China, published in 1915, was to become the defacto source against which all others would (and still should) be compared - with several caveats. Amongst these is a noticeable bias on his part against Taoism which becomes more evident in his analysis of the Tao Tsang (i.e., Taoist Canon of Official Writings [written 'DaoZang' in the PinYin Romanization of Mainland China] )
    and I decided to skim the rest.

    To summarize, for those whose eyes completely glazed over, his point is that Unicode doesn't sufficiently cover the full range of Chinese characters and that not using a larger set is a result of a longstanding Western prejudice that the Chinese don't need so many characters.

    Now, I'm not Chinese so my opinion counts for little here, but my impression is that Unicode isn't nearly as controversial as he makes it out. His analogy "To express it in Western terms, how would English-speakers like it if we were suddenly restricted to an alphabet which is missing five or six of its letters because they could be considered "similar" (such as "M" and "N" sounding and looking so much like each other) and too "complex" ("Q" and "X" - why, they are the nothing more a fancier "C" and an "Z")." ignores the fact that Chinese orthography has a tradition of simplification and variants. I suspect Unicode is a lot more upsetting to a "reference writer specializing in rare Taoist religious texts and medical works" than to ordinary Chinese users who want to run Photoshop or put their wedding pictures on a web page.

    Unsettling MOTD at my ISP.

  13. So... by ackthpt · · Score: 4
    4av3 3v3r0n3 1n t4e w0r1d 13arn t0 typ3 l33t!

    --
    All your .sig are belong to us!

    --

    A feeling of having made the same mistake before: Deja Foobar
  14. Unicode's reply by roozbeh · · Score: 4

    It's probably too late, but following is a response from one of the editors of the Unicode Standard:

    Dear Mr. Carroll,

    I have just finished reading the article you published today on the Hastings Research website, authored by Norman Goundry, entitled "Why Unicode Won't Work on the Internet: Linguistic, Political, and Technical Limitations."

    Mr. Goundry's grounding in Chinese is evident, and I will not quibble with his background East Asian historical discussion, but his understanding of the Unicode Standard in particular and of the history of Han character encoding standardization is woefully inadequate. He makes a number of egregiously incorrect statements about both, which call into question the quality of research which went into the Unicode side of this article. And as they are based on a number of false premises, the article's main conclusions are also completely unreliable.

    Here are some specific comments on items in the article which are either misleading or outright false.

    Before getting into Unicode per se, Mr. Goundry provides some background on East Asian writing systems. The Chinese material seems accurate to me. However, there is an inaccurate statement about Hangul: "Technically, it was designed from the start to be able to describe *any sound* the human throat and mouth is capable of producing in speech, ..." This is false. The Hangul system was closely tied to the Old Korean sound system. It has a rather small number of primitives for consonants and vowels, and then mechanisms for combining them into consonantal and vocalic nuclei clusters and then into syllables. However, the inventory of sounds represented by the Jamo pieces of the Hangul are not even remotely close to describing any sound of human speech. Hangul is not and never was a rival for IPA (the International Phonetic Alphabet).

    In the section on "The Inability of Unicode To Fully Address Oriental Characters", Mr. Goundry states that "Unicode's stated purpose is to allow a formalized font system to be generated from a list of placement numbers which can articulate *every single written language* on the planet." While the intended scope of the Unicode Standard is indeed to include all significant writing systems, present and past, as well as major collections of symbols, the Unicode Standard is *not* about creating "formalized font systems", whatever that might mean. Mr. Goundry, while critiquing Anglo-centricity in thinking about the Web and the Internet as an "unfortunate flaw in Western attitudes" seems to have made the mistake of confusing glyph and character -- an unfortunate flaw in Eastern attitudes that often attends those focussing exclusively on Han characters.

    Immediately thereafter, Mr. Goundry starts making false statements about the architecture of the Unicode Standard, making tyro's mistakes in confusing codespace with the repertoire of encoded characters. In fact the codespace of the Unicode Standard contains 1,114,112 code points -- positions where characters can be encoded. The number he then cites, 49,194, was the number of standardized, encoded characters in the Unicode Standard, Version 3.0; that number has (as he notes below) risen to 94,140 standardized, encoded characters in the *current* version of the Unicode Standard, i.e., Version 3.1. After taking into account code points set aside for private use characters, there are still 882,373 code points unassigned but available for future encoding of characters as needed for writing systems as yet unencoded or for the extension of sets such as the Han characters.

    *Even if* Mr. Goundry's calculation of 170,000 characters needed for China, Taiwan, Japan, and Korea were accurate, the Unicode Standard could accommodate that number of characters easily. (Note that it already includes 70,207 unified Han ideographs.) However, Mr. Goundry apparently has no understanding of the implications or history of Han unification as it applies to the Unicode Standard (and ISO/IEC 10646). Furthermore, he makes a completely false assertion when he states that Mainland China, Taiwan, Korea, and Japan "were not invited to the initial party."

    Starting with the second problem first, a perusal of the Han Unification History, Appendix A of the Unicode Standard, Version 3.0, will show just how utterly false Mr. Goundry's implication that the Asian countries were left out of the consideration of encoding of Han characters in the Unicode Standard is. Appendix A is available online, so there really is no valid research excuse for not having considered it before haring off to invent nonexistent history about the project, even if Mr. Goundry didn't have a copy of the standard sitting on his desk. See:

    http://www.unicode.org/unicode/uni2book/appA.pdf

    The "historical" discussion which follows in Mr. Goundry's account, starting with "The reaction was predictable..." is nothing less than fantasy history that has nothing to do with the actual involvement of the standardization bodies of China, Japan, Korea, Taiwan, Hong Kong, Singapore, Vietnam, and the United States in Han character encoding in 10646 and the Unicode Standard over the last 11 years.

    Furthermore, Mr. Goundry's assertions about the numbers of characters to be encoded show a complete misunderstanding of the basics of Han unification for character encoding. The principles of Han unification were developed on the model of the main *Japanese* national character encoding, and were fully assented to by the Chinese, Korean, and other national bodies involved. So an assertion such as "they [Taiwan] could not use the same number [for their 50,000 characters] as those assigned over to the Communists on the Mainland" is not only false but also scurrilously misrepresents the actual cooperation that took place among all the participants in the process.

    Your (Mr. Carroll's) editorial observation that "It is only when you get *all* the nationalities in the same room that the problem becomes manifest," runs afoul of this fantasy history. All the nationalities have been participating in the Han unification for over a decade now. The effort is led by China, which has the greatest stakeholding in Han characters, of course, but Japan, Korea, Taiwan and the others are full participants, and their character requirements have *not* been neglected.

    And your assertion that many Westerners have a "tendency .. to dismiss older Oriental characters as 'classic,'" is also a fantasy that has nothing to do with the reality of the encoding in the Unicode Standard. If you would bother to refer to the documentation for the Unicode Standard, Version 3.1, you would find that among the sources exhaustively consulted for inclusion in the Unicode Standard are the KangXi dictionary (cited by Mr. Goundry), but also Hanyu Da Zidian, Ci Yuan, Ci Hai, the Chinese Encyclopedia, and the Siku Quanshu. Those are *the* major references for Classical Chinese -- the Siku Quanshu *is* the Classical canon, a massive collection of Classical Chinese works which is now available on CDROM using Unicode. In fact, the company making it available is led by the same man who represents the Chinese national standards body for character encoding and who chairs the Ideographic Rapporteur Group (the international group that assists the ISO working group in preparing the Han character encoding for 10646 and the Unicode Standard).

    Mr. Goundry's argument for "Why Unicode 3.1 Does Not Solve the Problem" is merely that "[94,140 characters] still falls woefully short of the 170,000+ characters needed"-- and is just bogus. First of all the number 170,000 is pulled out of the air by considering Chinese, Japanese, and Korean repertoires *without* taking Han unification into account. In fact, many *more* than 170,000 candidate characters were considered by the IRG for encoding -- see the lists of sources in the standard itself. The 70,207 unified Han ideographs (and 832 CJK compatibility ideographs) already in the Unicode Standard more than cover the kinds of national sources Mr. Goundry is talking about.

    Next Mr. Goundry commits an error in misunderstanding the architecture of the Unicode Standard, claiming that "two *separate* 16 bit blocks do not solve the problem at all." That is not how the Unicode Standard is built. Mr. Goundry claims that "18 bits wide" would be enough -- but in fact, the Unicode Standard codespace is 21 bits wide (see the numbers cited above). So this argument just falls to pieces.
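
    The widths involved are easy to verify. A minimal sketch, assuming only the codespace upper bound U+10FFFF and the 170,000 figure quoted from the article:

```python
MAX_CODE_POINT = 0x10FFFF  # last code point in the Unicode codespace

print(MAX_CODE_POINT + 1)           # 1114112 code points
print(MAX_CODE_POINT.bit_length())  # 21 bits
print(2 ** 18, 2 ** 18 >= 170_000)  # 262144 True
```

    So 18 bits would indeed hold 170,000 characters, but the 21-bit codespace already exceeds that several times over.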

    The next section on "The Political Significance Of This Expressed In Western Terms" is a complete farce based on false premises. I can only conclude that the aim of this rhetoric is to convince some ignorant Westerners who don't actually know anything about East Asian writing systems -- or the Unicode Standard, for that matter -- that what is going on is comparable to leaving out five or six letters of the Latin alphabet or forcing "the French ... to use the German alphabet". Oh my! In fact, nothing of the kind is going on, and these are completely misleading metaphors.

    The problem of URL encodings for the Web is a significant problem, but it is not a problem *created* by the Unicode Standard. It is a problem which is being actively worked on by the IETF currently, and it is quite likely that the Unicode Standard will be a significant part of the *solution* to the problem, enabling worldwide interoperability, rather than obstructing it.

    And it isn't clear where Mr. Goundry comes up with asides about "Ascii-dependent browsers". I would counter that Mr. Goundry is naive if he hasn't examined recently the internationalized capabilities of major browsers such as Internet Explorer -- which themselves depend on the Unicode Standard.

    Mr. Goundry's conclusion then presents a muddled summary of Unicode encoding forms, completely missing the point that UTF-8, UTF-16, and UTF-32 are each completely interoperable encoding forms, each of which can express the entire range of the Unicode Standard. It is incorrect to state that "Unicode 3.1 has increased the complexity of UCS-2." The architecture of the Unicode Standard has included UTF-16 (not UCS-2) since the publication of Unicode 2.0 in 1996; Unicode 3.1 merely started the process of standardizing characters beyond the Basic Multilingual Plane.
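
    The interoperability of the three encoding forms can be demonstrated with Python's built-in codecs. This is a sketch over one arbitrary sample string that mixes ASCII, Latin-1, Han, and a supplementary-plane character:

```python
text = "A\u00e9\u4e2d\U00010348"
names = ("utf-8", "utf-16-be", "utf-32-be")
forms = {name: text.encode(name) for name in names}

for name, data in forms.items():
    # Every form decodes back to exactly the same sequence of code points.
    assert data.decode(name) == text

# Transcoding between forms is lossless in both directions.
as_utf16 = forms["utf-8"].decode("utf-8").encode("utf-16-be")
print(as_utf16 == forms["utf-16-be"])  # True
```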

    And if Mr. Goundry (or anyone else) dislikes the architectural complexity of UTF-16, UTF-32 is *precisely* the kind of flat encoding that he seems to imply would be preferable because it would not "exacerbate the complexity of font mapping".

    In sum, I see no point in Mr. Goundry's FUD-mongering about the Unicode Standard and East Asian writing systems.

    Finally, the editorial conclusion, to wit, "Hastings [has] been experimenting with workarounds, which we believe can be language- and device-compatible for all nationalities," leads me to believe that there may be a hidden agenda for Hastings in posting this piece of so-called research about Unicode. Post a seemingly well-researched white paper with a scary headline about how something doesn't work, convince some ignorant souls that they have a "problem" that Unicode doesn't address and which is "politically explosive", and then turn around and sell them consulting and vaporware to "fix" their problem. Uh-huh. Well, I'm not buying it.

    --Ken Whistler, B.A. (Chinese), Ph.D. (Linguistics),
    Technical Director, Unicode, Inc.
    Co-Editor, The Unicode Standard, Version 3.0


  15. Hmm.. I must have been using something else then? by vidarh · · Score: 4
    I've been using Unicode in various incarnations for a long time. And UCS-2 is not the only way to encode Unicode. UTF-8 is perhaps a lot more widespread, as it is the de facto standard encoding for exchange of XML documents over the web.

    UCS-4 is also quite common, and allows for the new extensions.

    UTF-16 is used by some who need to extend their UCS-2 applications, or who mostly need text that works with UCS-2 but want to be prepared for more.

    Yes, a lot of things are difficult with Unicode. But if you look at most recent internationalization efforts, Unicode is what people use.

  16. Re:You bring up a good point by SpeelingChekka · · Score: 4

    Does anyone know a real language that has a simpler writing system than English?

    Spoken like a true English-is-my-home-language person. English is NOT a simple language by any means, ask any foreigner who has learned English. Almost every rule in English has several exceptions, and many things in English cannot be deduced from rules, they must simply each be learned, and there are hundreds of these. Pronunciation is ridiculous, which you've mentioned, but apart from pronunciation there is grammar, spelling, plural forms, tenses and possessive forms, all of these have strange nuances in English. The plural of dish is dishes, but the plural of fish is fish - sorry, no rule you can deduce that from, you must just learn that. The past tense of "hang" depends on what is getting hung/hanged. The rule says "add an apostrophe s" for possessive form, but of course there are exceptions, e.g. "it", "her" etc., or when the subject is a plural already, then you add an apostrophe but no "s". And the rules for when something is a plural "are" not always clear (and thus even educated people often aren't sure whether to use "are" or "is"). "Bananas are nice" is easy, but "A bunch of bananas" or "a group of individuals", are these plural or singular? And the examples get more and more complex. And there are obscure rules such as '"their" may be used in place of "his/her"'. And there are so many exceptions to rules like "i before e except after c", rules which many educated people even sometimes struggle to remember. I can name many University educated adults with English as their first language who still don't even know the difference between "lend" and "borrow" - that says something about the language.

    I'm glad English is my home language, but I feel sorry for foreigners who have to learn English as a second language.

    Is it just a coincidence that the simplest writing system was the first to be digitized?

    Yes, actually, it is. ASCII was probably the first wide-scale character set standard used in computing - what does the "A" stand for?