Slashdot Mirror


Why Unicode Won't Work on the Internet

We received this interesting submission from N. Carroll: "Unicode, the commercial equivalent of UCS-2 (ISO 10646-1), has been widely assumed to be a comprehensive solution for electronically mapping all the characters of the world's languages, being a 16-bit character definition allowing a theoretical total of over 65,000 characters. However, the complete character sets of the world add up to approximately 170,000 characters. This paper summarizes the political turmoil and technical incompatibilities that are beginning to manifest themselves on the Internet as a consequence of that oversight. (For the more technical: the recently announced Unicode 3.1 won't work either.)" Read the full article.
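The arithmetic behind the submission's claim can be checked directly. This is a minimal sketch: the 170,000 figure is taken from the submission as quoted and is not verified here, and the constant names are chosen for illustration only.

```python
# Compare the capacity of a 16-bit code unit with the character count
# claimed in the submission (the 170,000 figure is the submitter's, not ours).
CODE_UNIT_BITS = 16
CLAIMED_CHARACTERS = 170_000

capacity = 2 ** CODE_UNIT_BITS                        # distinct 16-bit values
fits = capacity >= CLAIMED_CHARACTERS                 # does the claimed set fit?
bits_needed = (CLAIMED_CHARACTERS - 1).bit_length()   # smallest fixed width that would fit

print(capacity, fits, bits_needed)
```

Under those assumptions, a fixed-width code would need at least 18 bits per character to give every claimed character its own value.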

172 of 416 comments

  1. Languages by Alex+Belits · · Score: 2

    "Unicoders" ignore the fact that any multilingual text is inherently stateful, so their idea of stateless stream of giant "characters" that will be easy to process is flawed at the core. In fact it's useful for decorative purposes only -- while it's easy to _display_ a unicode text (given in any of countless Unicode encodings), it's impossible to process or edit it without at least some state (current language) information to determine, what input method, dictionary, grammar rule, etc. to apply to any substring, so the goal of stateless text is just as misguided as the initial stateless filesystem representation in NFS. But if statelessness is kicked out of the window (like it should for anything multilingual), then information about both language and charset can be easily added to any substring, so all national charsets, ones that were specifically designed to be used in some particular language, and to which all processing rules and dictionaries were already written, can be used -- programs that don't care about charsets and languages will just handle them transparently as sequence of bytes, and programs that care should use state information anyway.

    How to include state information is a good question -- there are a lot of possibilities, and one of them is modification of HTML and XML specs to add a charset attribute to everything that can have LANG. The problem is, for purely political reasons those specs specify only a global charset for the whole document, and include LANG but don't include charset as an attribute for everything, to make it impossible to use any non-Unicode charset for multilingual documents in them. This does not serve any legitimate purpose, and is an example of blatant sabotage of the specs to serve the interests of a small but very influential and vocal group of companies that are interested in making multilingual processing as complicated as possible, so every simple task requires a huge bloated application just to comply with the sabotaged specs, instead of simple byte-value transparency that otherwise would be sufficient. Raising the barrier for entry, decommodification and contamination of the standards at its finest.
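    One way to picture the per-substring language and charset attributes described above is a list of tagged runs. This is only an illustrative sketch, not a proposal from the comment: the `Run` type, its field names, and the helper functions are hypothetical, and the charset names are simply Python codec names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Run:
    """A substring carrying its own state (hypothetical structure)."""
    lang: str      # language tag, e.g. "ru"
    charset: str   # Python codec name for the national charset
    data: bytes    # text bytes, stored untouched in that charset


def raw_bytes(runs: List[Run]) -> bytes:
    # A program that does not care about languages just moves bytes around.
    return b"".join(run.data for run in runs)


def to_text(runs: List[Run]) -> str:
    # A program that does care uses the per-run charset to decode each run.
    return "".join(run.data.decode(run.charset) for run in runs)


doc = [
    Run("en", "ascii", b"Hello, "),
    Run("ru", "koi8_r", "мир".encode("koi8_r")),
    Run("ja", "shift_jis", "世界".encode("shift_jis")),
]

print(to_text(doc))
print([run.lang for run in doc])
```

    The byte-transparent path (`raw_bytes`) never looks at the tags, while the language-aware path (`to_text`) cannot work without them.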

    The reason why things like that are possible is that the demand for multilingual text processing (multilingual as in one document that contains text in more than one language other than English, because English is usually supported within non-Unicode national charsets and works just fine with them) is currently very low, and was even lower when those "standards" were adopted, so obvious flaws did not cause immediate havoc. This is a commonly used strategy -- when no one needs something, write a standard for it that favors you, create a lot of noise around it, declare that it "dominates the industry" because no one else is doing it, and then wait until the need becomes more or less apparent. Then when it happens, everyone will somehow remember a piece of your noise, and you can loudly proclaim that all that time you were busy including the great new standard into the innards of your software and lobbying all standards groups to include some reference into standards (that everyone, of course, ignored all that time because of the lack of the need for application). So, at the time when need is "more or less apparent" and the requirements for the quality of applications and standards are low, you can expand the "use" of your standard by people who don't need it or care about it, just because it was included into some of your products -- if feature support in them is ridiculously poor, no one would notice because there isn't that much use anyway. The development of other, superior, standards will be stifled because you will always be able to claim that everyone is happy with your standard because there aren't many people complaining -- of course, there won't be many complaining because almost no one actually uses it yet for what it was supposed to be used for in the first place. At the time when real need arises so many products and standards will be contaminated with your standard that people will have to use it despite the obvious flaws. If the standard stinks, you still can claim that no one made anything better anyway, so everyone should just use your POS, and if it breaks others' software design, they should just adopt yours.

    If this sounds too close to some particular company's favorite strategy, it probably is -- Microsoft with its nauseating file/documents formats design, mediocre and bloated, display/printing-only oriented text editing software is one of the most enthusiastic backers of the Unicode, and they do it despite the fact that their software itself often gets into trouble because Unicode is both hard to use and hard to implement. It doesn't matter, important thing is, if we had trouble with it doing it half-assed, everyone that will try to do it better will have much more trouble. Scorched earth strategy.

    --
    Contrary to the popular belief, there indeed is no God.
    1. Re:Languages by Alex+Belits · · Score: 2

      No. It's "Microsoft likes Unicode because it sucks, and because it's sticky enough to cause trouble for others".

      --
      Contrary to the popular belief, there indeed is no God.
    2. Re:Languages by scrytch · · Score: 2

      So basically your argument boils down to "Microsoft likes Unicode, therefore it sucks"? You come up with some fuzzy vague idea of encoding "language attributes" like grammar and dictionaries into character sets ... somehow, meanwhile conflating character sets with documents... I'm surprised you haven't asked for binary to be revamped. Try losing the scare quotes too, your sneering disdainful superiority for the subject and everyone associated with it was already fairly apparent.

      As Rand would say, A is A. Whatever sort of semantic meaning the letter might have in the context it's used in is not Unicode's problem. I hear we have things like document formats that handle that.
      --
      I've finally had it: until slashdot gets article moderation, I am not coming back.
  2. Re:Unicode's reply by Alex+Belits · · Score: 2

    It does not matter what Unicode can in theory contain -- the allocation of characters is handled by a single, and not in any way open, organization, so the standard is all that is allocated, and not what in theory could be allocated if the Unicode consortium were benevolent enough, which we all know it is not. Even if it were, there is always some need to represent, in some consistent and unambiguous manner, text in languages that can't possibly be accepted into Unicode, such as fictional languages -- they can be easily handled by any expandable charsets-handling system and it wouldn't be rocket science to develop one, however Unicode supporters do everything that is possible for humans, and sometimes more, to prevent any competing system from being developed. Also it does not matter what the stated goals of Unicode are -- in fact it is being hawked as the required internal representation of all text in all applications, and as the origin for encodings used for data manipulation, storage and transmission. These are facts, and so are the real problems that Unicode generates if used in that way. I have no problem with the Unicode standard being a big dusty book used as a simplified manual for the world's alphabets, or as an intermediate format for font handling and text conversion between different charsets of the same language. The problem is, Unicode is being used for things it is inadequate for, and its existence is loudly proclaimed as the reason to make no progress in development of any solution for multilingual text handling that is not entirely based on Unicode-derived representation over the wire and in storage. This is selfish and counterproductive.

    --
    Contrary to the popular belief, there indeed is no God.
  3. Re:Unicode's reply by Alex+Belits · · Score: 2

    1. The standard is expandable if, and only if, it does not require a change of itself to adopt an expansion. For example, the addition of a new MIME type does not change the MIME standard, however the addition of a new tag does change the HTML standard, therefore HTML is not expandable, which is pretty easy to notice while comparing different HTML renderers. XML is a near-absurd case because it's basically an umbrella that allows one to declare all kinds of tags and therefore is supposed to be flexible and generate expandable standards, however the catch is, it does not provide any facility to automatically determine how to handle those tags' semantics in applications, so the mere possibility to declare something new does not make it expandable either if applications' algorithms have to be modified. In Unicode however the situation is much simpler -- any addition of characters IS a modification of the standard, and there is no possibility to automatically provide interoperability between older and newer versions.

    The existence of the procedure TO change the standard does not make it expandable.

    2. If Unicode will adopt all fictional languages/scripts/... it will become absolutely impossible to make complete fonts for it -- now it's merely a huge task, but then it will be plain impossible. The only real solution is to have standard that allows to name a language/charset combination, and leave the text in them intact until either user will install support for them, or application will automatically download it. Unicode doesn't help with it a single bit -- application encountered a character in unsupported range, and all it has is 16 or now 32 bits that it can only stuff in its virtual ass and report an error because no reasonable resolution can be made without some external assumption.

    3. ISO 2022 is a very poor implementation of stateful multi-charset character stream, and Unicoders are very fond of mentioning it as a proof that all possible stateful systems are bad. However repeating something that is false does not make it any less false -- in fact, after Unicode was adopted by IETF (on meetings behind the closed doors) all work on stateful character streams standardization was stopped.

    4. Computers can magically process all kinds of charsets. It's called byte-value transparency. Most of applications would work just fine if they just copied strings without making any assumptions about their structure or number of characters in them as long as bytes are bytes, and end of string is always 8-bit 0, what would be quite trivial for any stateful text system to implement. Tiny minority of programs need anything from a text that requires actual parsing other than finding newlines and, rarely, whitespaces. Display routines are different thing, however there aren't many of them, and all systems other than Windows support Unicode by combining and translating multiple fonts for multiple ranges, so supporting multiple fonts subsets for multiple marked charsets would be only easier to implement.
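    The byte-value transparency argument in point 4 can be illustrated with two helpers that never decode anything. This is a minimal sketch under one stated assumption: the charset never uses byte 0x00 or 0x0A inside a multibyte character (true for KOI8-R and UTF-8, not true for UTF-16). The function names are hypothetical.

```python
def c_string(buf: bytes) -> bytes:
    """Return the bytes up to the first 8-bit zero, like a C string copy."""
    end = buf.find(b"\x00")
    return buf if end == -1 else buf[:end]


def split_lines(buf: bytes) -> list:
    """Split on newline bytes without knowing the charset."""
    return buf.split(b"\n")


sample = "первая\nвторая"
for charset in ("koi8_r", "utf-8"):
    encoded = sample.encode(charset)
    # Neither helper is told which charset it is handling.
    lines = split_lines(c_string(encoded + b"\x00trailing garbage"))
    print(charset, [line.decode(charset) for line in lines])
```

    Both charsets pass through unchanged; only the final `decode` call, standing in for a display routine, needs to know which charset was used.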

    The problem is, there are too many Windows programmers writing internet drafts now, so semantics of text display routines got stirred up from system-specific and application-specific processing where they belong, and contaminated standards responsible for data transfer, where they don't belong, and a lot of people now believe that to transfer some data one has to know how to display it in some pretty letters. Shame on you.

    --
    Contrary to the popular belief, there indeed is no God.
  4. Re:Unicode's reply by Alex+Belits · · Score: 2

    1. ISO is a closed standards body -- if it does anything, it makes standard less open.

    2. Private use codes aren't standard -- they don't provide any guarantees of interoperability, and merely provide a way to break the standard while fooling a program that is compliant with it into behaving how the user wants. If there was a way to put somewhere even a name of a charset to map "private" codes to a font name, it would solve a piece of the problem, but alas -- Unicode is made under the slogan of total statelessness of text, so while applications' file formats may allow this, arbitrary substring in a text can't.
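    The point about private use codes carrying no meaning of their own can be seen with Python's standard `unicodedata` module. This is a small sketch; the helper name `is_private_use` is made up for illustration.

```python
import unicodedata


def is_private_use(ch: str) -> bool:
    # General category "Co" marks private-use code points.
    return unicodedata.category(ch) == "Co"


for ch in ("A", "\ue000", "\uf8ff"):
    # Private-use code points have no character name in the database,
    # so the text alone does not say which glyph or charset was intended.
    print(hex(ord(ch)), is_private_use(ch), unicodedata.name(ch, "<no name>"))
```

    Whatever a private-use code point is supposed to mean has to be agreed on outside the text stream, which is the interoperability gap the comment describes.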

    --
    Contrary to the popular belief, there indeed is no God.
  5. Re:Conspiracy Theories and Unicode by Alex+Belits · · Score: 2

    The major commercial Unix vendors have all made significant commitments to Unicode support, and even the Linux internationalization community is busy adding Unicode support to Linux. Apparently it doesn't matter to you that Sun, HP, Compaq, NCR, and major Linux I18N players participate in Unicode development, too. It isn't an either/or black and white issue. It isn't some gigantic conspiracy to use a bad standard to prevent the good guys from developing a good standard. But I guess you can believe whatever you want.

    This is, to say the least, incorrect. While there is a lot of effort to shoehorn Unicode into Unix and Unix software, the actual results are beyond miserable, precisely because Unicode does not work. Unix vendors solved this problem by adding a small amount of support for to/from Unicode conversion and by declaring that their filesystems support UTF-8, thus getting blessed by the Unicode consortium as compatible. Guess what, UTF-8 can be "supported" in that way even by an abacus, if that abacus is long enough and has at least 8 stones in a row, however actual use of it is a completely different thing -- I have never in my life seen a filename in UTF-8 outside of Unicoders' demos, and I am Russian myself and have a lot of friends that speak Japanese. So, again, Unix vendors' support of Unicode is in fact lip service, not unlike Microsoft's support of POSIX or claims that the Internet would support the OSI 7-layer model (which ended with "temporary solutions" known as TCP/IP and Berkeley sockets replacing it).

    --
    Contrary to the popular belief, there indeed is no God.
  6. Re:Statelessness of text by Alex+Belits · · Score: 2

    Statelessness of text is something that Unicode tried to achieve, and still is using as their main argument toward its acceptance. Latin-1 is quite irrelevant here because its goals weren't as pretentious as Unicode's, and its impact on existing applications was near zero, and was basically "where are we going to use those values anyway?" Unicode actually is supposed to be used for serious multiple languages support, and requires fundamental changes in both applications and protocols -- with protocols causing a lot of infiltration of Unicode-based requirements into otherwise transparent protocols. This would be to some extent justified if Unicode actually was a base for serious multilingual processing (which Latin-1 never claimed to be) but otherwise it isn't worth the effort and problems that Unicode brings in. So, the main advantage of Unicode over basically everything else imaginable (though not implemented because of pressure on IETF from Unicode), is statelessness of the text stream.

    --
    Contrary to the popular belief, there indeed is no God.
  7. Re:Unicode's reply by Alex+Belits · · Score: 2

    You've ranted on this every time Unicode has come up on Unicode, but assertion does not a proof make. You've never sketched out a better system, or said what makes ISO 2022 a poor implementation of what it is. Write an RFC, create a rough implementation of the system and if it really is better, then and only then can people evaluate and decide whether or not to use it. Until then, the choices are basically ISO 2022 or Unicode, and people will pick the choice that works best for them, and not worry about what could be the optimal

    I can do that if anyone will listen. The problem is, the actual problem that it will solve does not exist yet, its time hasn't come. Multilingual documents, for all purposes, don't exist beyond demos. Unicoders are using this to create their own standard that definitely wouldn't hold water if demand already existed, but they can with their propaganda flood everything involved with standards -- a certain person, Martin Duerst, subscribes to EVERY mailing list that may in any way touch multilingual text handling and every time someone mentions Unicode, floods it with tons of messages in support, and fiercely fights against every argument against. I have no idea what else that person does beyond that, if anything, and how many hours are in his day, but it's extremely hard to support any serious argument when one side is so active, and most people are disinterested.

    I have planned to do this when actually people will need multiple languages in their documents, and if someone can convince me that I have slept too long, and this time is now, I will happily start work, but otherwise it will be not just fighting with windmills, but fighting with windmills when there is no wind.

    --
    Contrary to the popular belief, there indeed is no God.
  8. Re:Conspiracy Theories and Unicode by Alex+Belits · · Score: 2

    UTF-8 on an abacus -- yes, I guess that *is* a strawman that we should all take *real* seriously.

    I merely tried to explain that UTF-8 is specifically designed to be used with any imaginable system -- what says nothing about its usefulness.

    I presume you mean on Unix systems, where for most such systems, choice of UTF-8 for filenames would be problematical because they would run afoul of other parts of the system that don't handle them. Sure, such may be the case.

    This is simply false. UTF-8 filenames and data can be used in any Unix if one wants to sacrifice functionality that people expect from a fixed-length characters representation (ex: regexps matching, cutting text at arbitrary offsets). However it's not a problem of Unix that users expect their encodings to be easier to use than a mess that UTF-8 is -- on other systems there isn't any counterpart to this functionality in utilities that are in common use.
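    The "cutting text at arbitrary offsets" difference mentioned above can be reproduced in a few lines. This is a sketch only; `cut_ok` and `safe_cut` are hypothetical helper names, and `safe_cut` shows one common workaround (backing up over UTF-8 continuation bytes), not something proposed in the comment.

```python
text = "Привет"
utf8 = text.encode("utf-8")    # two bytes per Cyrillic letter
koi8 = text.encode("koi8_r")   # one byte per Cyrillic letter


def cut_ok(data: bytes, offset: int, charset: str) -> bool:
    """Does a blind byte-offset cut still decode cleanly?"""
    try:
        data[:offset].decode(charset)
        return True
    except UnicodeDecodeError:
        return False


def safe_cut(data: bytes, offset: int) -> bytes:
    """Back up until the cut no longer lands on a UTF-8 continuation byte."""
    while 0 < offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset -= 1
    return data[:offset]


print(len(koi8), len(utf8))
print(cut_ok(koi8, 3, "koi8_r"), cut_ok(utf8, 3, "utf-8"))
print(safe_cut(utf8, 3).decode("utf-8"))
```

    With the fixed-width charset any offset is a valid cut; with UTF-8 the tool has to know about the encoding to avoid splitting a character.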

    On the other hand, UTF-8 databases are now running routinely on Unix systems, and they work just fine, thank you.

    Show me. I have seen a shitload of data, marked as UTF-8, yet used exclusively as ASCII, or even with different encodings actually in the data, but never -- actual multilingual database in UTF-8. Again, it demonstrates my point that Unicoders are trying to sneak their "standard" in while there is no demand and therefore no scrutiny for the quality of things being introduced.

    > and I am Russian myself and have a lot of
    > friends that speak Japanese.

    Umm. And the relevance of that comment is what?

    It means that I am in my own experience familiar with handling of multiple encodings, with what people use in the real-life texts handling, and their willingness to use Unicode, that happens to be below zero. You can claim that their reasons are irrational, and Unicode is still the best solution for them, however I still don't see, why opinion of almost everyone who actually knows about the subject from practice, and is supposed to benefit from what Unicoders are proposing, can be dismissed so lightly.

    --
    Contrary to the popular belief, there indeed is no God.
  9. Re:Unicode's reply by Alex+Belits · · Score: 2

    Because, gee, the need to communicate with someone in another language is new.

    When people communicate, they choose one language for it -- usually one that both know best. No one speaks like "Ya odnowremenno trying goworit' po-english i russkomu, and esli ya by znal nihongo ya would simultaneously speak po-yaponski, too".

    It's very important to see the distinction between the need to support "multilingual document" that contains multiple languages within one body of text and to support documents in multiple languages within one system or program. Also historically it happened that documents in all languages can painlessly include ASCII text, so non-English language + English is usually treated the same way as a text in non-English language, not requiring any special tools to be handled. One may claim that this is wrong, but this is how things happened to be developed over decades.

    I've never seen VCR instructions in multiple languages

    Those are multiple documents, not one document with multiple languages in it. There is clear separation between versions in different languages, and this is already being accomplished easily, even in MIME email.

    , I've never seen a bilingual dictionary

    Dictionaries are special cases, and they usually are distributed in either printed form, or as a database -- they almost never are seen as plain text documents. In both for-print-only formats and in databases there are plenty of ways to represent languages and charsets as metadata, and absolutely all computer dictionaries that I have seen have chosen to use native encodings.

    , and the EU driver licenses only have one language on them, not every language of the EU.

    Again, I assume that the whole text of the license is repeated in multiple languages, not individual words are repeated in each language within one body of text, so the same definition of multiple documents applies.

    --
    Contrary to the popular belief, there indeed is no God.
  10. Re:Unicode's reply by Alex+Belits · · Score: 2

    So Reta Vortaro , an Esperanto dictionary with translations to many languages, is a demo. (Click on the j^ in the left frame, and then on the j^audo in the same frame, for the translation of that word into English, German, Polish and Russian, among others.)

    First, without any doubt it is a demo -- the set of languages to which translations are available varies from word to word, and in real life one would never want to have translations into multiple languages always appear, clogging the screen. Second, this is an application (even though a simple one), not a document, and there are plenty of ways for applications to handle multiple charsets even now. My point is, functionality that supports multiple languages within an application is completely orthogonal to the support of multiple languages within a single document or string. Unicoders love to mix those two.

    Or Freedict, a source of bilingual dictionaries for dict (including German and Greek, and German and Japanese) is just a demo too.

    Again -- I don't see why this particular application used UTF-8, however neither its design requires it, nor those files are for any purposes normal text documents -- even uncompressed, they have strict formatting and are even indexed, so they could use just any charsets/encodings possible.

    And the Debian main page , where it lists the names of the languages in which the page has been translated to in their own script at the bottom, is just a demo too.

    Absolutely. This list of languages is obviously a gimmick that provides nothing that list of languages in English wouldn't provide -- everyone in the world, for whatever reason, knows how his language's name looks in English even if he can't read English. In the case of Debian page, if I was looking for Russian translation, I certainly would search for "Russian" string to find the link (it's interesting that the word "Russian" is the only one, where both "native" and English name of the language are mentioned in the Debian page -- I assume, because a lot of Russians actually use Russian translation but don't have UTF-8 enabled or supported in their browsers). Also, Debian home page automatically chooses the language if it's announced by the browser, so if I really wanted Russian version and set language preferences in the browser, I wouldn't even have to touch anything else. And lo and behold -- when I choose Russian, the page appears in koi8-r, what happens to be Russian local charset, not any form of Unicode.
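    The behavior described above, a server picking the language announced by the browser and then serving the page in that language's local charset, might be sketched as follows. This is an illustration under loud assumptions: the `CHARSET_FOR_LANGUAGE` table and both function names are hypothetical, the parsing is a simplified reading of the `Accept-Language` header, and nothing here reflects the Debian site's actual configuration.

```python
# Hypothetical mapping from language to the charset a site might serve it in.
CHARSET_FOR_LANGUAGE = {"ru": "koi8-r", "en": "iso-8859-1"}


def preferred_languages(header: str) -> list:
    """Return language tags from an Accept-Language value, best first."""
    items = []
    for part in header.split(","):
        tag, _, params = part.strip().partition(";")
        if not tag:
            continue
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        if quality > 0:
            items.append((quality, tag.strip().lower()))
    items.sort(key=lambda item: -item[0])  # stable: ties keep header order
    return [tag for _, tag in items]


def content_type_for(header: str, default: str = "en") -> tuple:
    """Pick the first supported language and its charset (hypothetical policy)."""
    for tag in preferred_languages(header):
        primary = tag.split("-")[0]
        if primary in CHARSET_FOR_LANGUAGE:
            return primary, "text/html; charset=" + CHARSET_FOR_LANGUAGE[primary]
    return default, "text/html; charset=" + CHARSET_FOR_LANGUAGE[default]


print(content_type_for("ru, en;q=0.5"))
```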

    --
    Contrary to the popular belief, there indeed is no God.
  11. Re:Unicode's reply by Alex+Belits · · Score: 2

    You're also missing the other selling point of Unicode: it's simple. Yes, there are plenty of ways for an application to handle multiple character sets, but they're all more complex then just using Unicode.

    The simplicity of Unicode is only in its authors' imagination. Yes, it's easy to present Unicode to people who don't know the details as a simple solution -- the problem is, reality isn't as simple as it looks.

    I'm sure when typing up "German and English Sounds" for Project Gutenberg, that I could switch between Latin-1, some IPA character set, a character set with o-macron, a character set with u-breve, and whatever I need for the rest of characters Dr. Grandgent used, but it's much easier for me to use Unicode.

    When the goal is just to make a text that can be printed in pretty letters, anything is ok as long as it's implemented. This is why a lot of low-quality products such as MS Office are so popular -- in fact so popular that I often receive email with nothing but plain ASCII text as a MS Word file. However even in this case a complex typesetting system (that would most likely just use multiple fonts in whatever charsets they happen to be available because it cares more about fonts) would be more appropriate.

    When I start on "Old High German", I could dig up some obscure High German character set and switch to a Greek character set when he uses Greek words as examples . . . or I could just use Unicode. No matter how much you would dismiss it, it is a real problem and some of us use Unicode because it is a real and a simple solution to the problems we face.

    How deceptive. The implied assumption is that "obtaining" charset support is some kind of nonzero effort while using Unicode is smooth regardless of the language. Both things are incorrect -- in a system with multi-charset support the charsets support can be loaded automatically depending on the languages and charsets mentioned -- if someone wants to have support for everything Unicode supports at the extent Unicode supports it, he will only need fonts, and the amount of the information and resources used would be exactly the same as if he had their support in Unicode. However in practice usually the goal is different -- only few languages and charsets are in active use by the same user at the time, however he needs them to be supported with input methods (how to enter greek on this particular keyboard?), formatting rules, ordering, at least references to spellcheckers, etc.

    Again, Unicode user still ends up having to somehow get something language-specific, except that his language-specific data and procedures also have to be designed to use Unicode, what differs from the procedures that are already in use, and often open source. Software vendors would love that -- they can either keep making localized versions of all software with Unicode support but with different language-specific procedures, or try to make tools that can handle all languages and spend man-millennia rewriting trivial things and then release them as the only way to use Unicode in practice. In either case they get their money because old software, Unicode-supporting or not, will not match the requirements for multilingual documents processing, and their new solution will be complex and therefore hard to reproduce.

    My idea is that infrastructure for stateful text processing is as unavoidable as the existence of different languages and writing systems, so it would be foolish to try to deceive people into thinking that displaying pretty letters is the main problem of handling multiple languages or multilingual documents. I don't see how denying the undeniable is justified. Most people are ignorant about the details because at this moment the problem isn't evident, and the problem isn't evident because the whole field of its application is not in any way related to their everyday life, however I don't think that every kind of ignorance deserves to be abused with such long-lasting possible consequences.

    Extending the idea that in multilingual text attributes that should be applied to substrings ("state" when text is treated as a stream) are necessary, I can say that since statefulness is unavoidable anyway, charset/encoding is just as good an attribute as the language or, say, a language-dependent parameter such as direction (for example, in Japanese left-to-right and up-to-down directions are both acceptable, even though modern texts use left-to-right). The implementation of "full unicode" text processing, even in a primitive display-only manner, is not any simpler, and certainly isn't any lighter on resources than multiple charset support -- in fact multiple charsets support can be easily built on top of any existing text displaying or printing procedure that supports multiple fonts and multibyte characters. The only "big question" is how to represent attributes in a text stream, but this is merely a question of formally declaring some decision to be standard -- one can design many of them easily, and almost everything that a sane human mind can create at this moment in history would be infinitely superior to ISO 2022.

    --
    Contrary to the popular belief, there indeed is no God.
  12. Re:Unicode's reply by Alex+Belits · · Score: 2

    Come on. I've read the Unicode standard, I read unicode@unicode.org, I've read most of the publicly accessible proposals and I'm familiar with all the Technical Reports. There is a lot of complexity in Unicode, but it's mostly derived from the inescapable complexity of the writing systems and compatibility with older systems, and most of the complexity can be ignored if you are willing to support some subset (European systems, or European/Russian/CJKV systems). That complexity is going to exist whether you use Unicode or some other multilingual system. Supporting Unicode at the Xterm/Yudit-level is simple, and supporting Unicode in an application with Pango & GTK 2.0 should be just as simple.

    Then what was the point of your argument? If implemented in the display-only library and used for displaying/printing only, Unicode is just as "simple" as would be any other system, with or without multiple charsets. If program does anything complex, it should handle various language-dependent stuff anyway, however bare Unicode support provides no such infrastructure, and a reasonable infrastructure can be implemented either with or without Unicode. Then what is the advantage of Unicode? Being self-proclaimed status quo in standards' backroom-politics, that no one supports properly anyway, that is hard to segment into subsets, non-expandable, maintained by a closed standards body and requires more resources?

    I don't claim that Unicode theoretically can't be used as the base for languages support -- in theory it can, but the problem is, it provides no advantage compared to multi-charset system if used as a part of multilingual text support infrastructure. I have already explained why such infrastructure does not exist now, however I believe that when it will become necessary, someone will have to implement it anyway. So now, when no one needs it, Unicoders are busy to claim this "piece of noosphere", just like some people tried to sell land on Mars -- just because it's there, and before it will become obvious that it's not theirs.

    The goal of Project Gutenberg is to transcribe public domain texts in a format readable for the largest audience possible.

    By this logic it should use Microsoft Word or at least PDF -- both very widely supported, more widely than even plain text files in UTF-8 (yes, I know, Word can use Unicode internally -- this isn't the point).

    Unicode HTML and UTF-8 plain text are those formats.

    Are they? Most of my boxes don't have them installed -- the one I am writing this message on is an exception, but only because it has Mozilla, which is still bloatware. My handhelds most likely never will have them installed -- they don't have enough RAM, and need rather nontrivial manipulations of character size and formatting to keep texts in some languages readable, so a plain stream of Unicode text would be impossible to display without some heavy heuristics.

    Some proprietary and/or obscure complex typesetting format is neither portable nor accessible to a wide audience.

    I wouldn't dream of proposing a non-open standard for this. However, the trouble with open standards is that they never appear before they become necessary, and I, following the principle that standards and tools should be developed as the need arises, am not making any detailed proposals at this time. But when there is a need, the standard that is created must be open, expandable and easy to port and reimplement -- something that anything Unicode-based is not. If you mean that a charset is "proprietary", I am not aware of any charset except, maybe, "klingon in private unicode" that was in any way declared to be someone's property. If a multi-charset support infrastructure is created, it would be reasonable to include some common facility in the libraries that lets users allow programs, when they see an unknown language or charset, to automatically download fonts, tables and even formatting/comparison/input methods/... source code from servers that keep directories of known charsets and languages, and this would be an open, expandable and flexible infrastructure, available to everyone. If someone wants his language that never had a local charset in the first place to be represented by its range in Unicode, he should be able to do that; however, in a system like that there should be no reason to prevent established language/charset combinations from being used just because of someone's narrow view of the problem.

    Maybe I am wrong in this traditionalism, and it would be better if I made an infrastructure for stateful text support just to demonstrate this point -- after all, even with all dynamic fonts/code/input methods/... it won't be in any way more complex than any other solution, merely useless because right now still almost no one uses multiple languages in a single document. But maybe the need to demonstrate the solution for a problem that no one experiences yet is now reason enough when someone else is trying to sneak in an impractical solution as the standard while no one is looking.

    I see the current advance of Unicode as something that may serve some simple need now, but can severely limit further progress if accepted as widely as Unicoders are trying to get it accepted. That would not be "good enough" as TCP is "good enough", large SMP kernel lock was "good enough" or C pointers are "good enough" -- it's "good enough" as Windows, region codes, crippleware, etc. are "good enough" -- people accept them because those things are pushed, and the inconvenience they create isn't bad enough until it's too late, but when it's too late, people still use them because there is nothing else in sight.

    --
    Contrary to the popular belief, there indeed is no God.
  13. Re:Unicode Character Set vs Character Encoding by Jordy · · Score: 2

    Actually, the Unicode specification for UTF-8 places an artificial limit of four 8-bit code units on the variable-length encoding, as that is all that Unicode currently requires.

    ISO 10646 defines UTF-8 as having up to six 8-bit code units.

    At 4 bytes, UTF-8 can only map to 0x10FFFF. At 6, it can map to 0x7FFFFFFF.

    Of course, my math could be wrong.
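
    For anyone who wants to check that math, here is a minimal Python sketch (the function name is made up for illustration). It assumes the original 1-to-6-byte UTF-8 bit layout and computes the largest value each sequence length can hold. Note that the 4-byte pattern has room for 21 bits (0x1FFFFF); the 0x10FFFF ceiling is a limit Unicode sets, not one the bit layout forces.

```python
def utf8_max_code_point(n_bytes: int) -> int:
    """Largest value an n-byte sequence can carry in the original 1-6 byte UTF-8 layout."""
    if not 1 <= n_bytes <= 6:
        raise ValueError("the original UTF-8 layout uses 1 to 6 bytes")
    if n_bytes == 1:
        payload_bits = 7  # 0xxxxxxx
    else:
        # Lead byte: n_bytes one-bits, a zero, then (7 - n_bytes) payload bits.
        # Each continuation byte (10xxxxxx) adds 6 more payload bits.
        payload_bits = (7 - n_bytes) + 6 * (n_bytes - 1)
    return (1 << payload_bits) - 1


for n in range(1, 7):
    print(n, hex(utf8_max_code_point(n)))
```

    The 6-byte row comes out at 0x7fffffff, which matches the 31-bit figure quoted for ISO 10646.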

    --
    The world is neither black nor white nor good nor evil, only many shades of CowboyNeal.
  14. Unicode Character Set vs Character Encoding by Jordy · · Score: 5
    The current permutation of Unicode gives a theoretical maximum of approximately 65,000 characters (actually limited to 49,194 by the standard).
    The biggest problem with Unicode is that no one understands what it is. Unicode defines two things, a character set that maps a character into a character code and a number of encoding methods that map a character code into a byte sequence.

    ISO 10646, the Universal Character Set defines a 31 bit character set (2,147,483,648 character codes), not a 16 bit character set. Unicode 3.0's character set corresponds to ISO 10646-1:2000. Unicode 3.1 which was recently released goes a bit further.

    UCS-2, as mentioned by this article, is the same as UTF-16 and is severely limited by its 16-bit implementation. UTF-16 is unfortunately used by Windows and Java, but is rarely used on the web. The article claims UTF-16 can only map 65,000 characters, but using surrogate pairs it can actually map over 1 million characters.
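
    The surrogate-pair arithmetic is short enough to sketch in Python. This is only an illustration (the function name is hypothetical): a code point above U+FFFF is reduced to a 20-bit offset, and each half of that offset is carried by one 16-bit unit taken from a reserved range.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = code_point - 0x10000          # fits in 20 bits
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


# 1024 possible high surrogates times 1024 possible low surrogates
print(1024 * 1024)                         # 1048576 code points beyond the 16-bit range
print([hex(unit) for unit in to_surrogate_pair(0x10FFFF)])
```

    Those 1,048,576 extra code points are where the "over 1 million" figure comes from.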

    Thankfully, there are several other encoding methods for Unicode. UTF-8, which is a variable length encoding most commonly used on the web allows a mapping of Unicode from U-00000000 to U-7FFFFFFF (all 2^31 character codes). It also has a nice feature of the lower 7 bits being ASCII, so there is no conversion necessary from ASCII to UTF-8.

    UTF-32 or UCS-4 is a 32 bit character encoding used by a number of Unix systems. It's not exactly the most space efficient form (UTF-8 requires roughly 1.1 bytes per character for most Latin languages), but it can handle the entire Unicode character set.

    A good document on this is available at UTF-8 And Unicode FAQ
    --
    The world is neither black nor white nor good nor evil, only many shades of CowboyNeal.
    1. Re:Unicode Character Set vs Character Encoding by crath · · Score: 2

      Jordy, you're completely missing the point. You have clouded the writer's religious bias against western civilization by bringing facts into the discussion. If the writer had wanted to deal with facts he would never have written his article in the first place: he only wants us westerners to acknowledge that johnny-come-latelys to the Internet game should have an equal place at the table. In other words, stop the Internet and computer technology from moving forward until everyone's perspectives have been completely accommodated; everyone, that is, except for those who started the revolution!

    2. Re:Unicode Character Set vs Character Encoding by ClarkEvans · · Score: 2

      Nice summary. Although UTF-32 only implements a subset of UCS-4 due to compatibility issues. You can find more information on unicode at unicode.org, in particular their faq is very helpful, especially the sub-faq on UTF-16 and the BOM.

    3. Re:Unicode Character Set vs Character Encoding by ClarkEvans · · Score: 2

      Morgo is correct. Unicode is only capable of representing a sub-set of ISO 10646-1:2000. This is detailed in the UTF-32 definition among other places which says: UTF-32 is restricted in values to the range 0x000000 to 0x10FFFF, which precisely matches the range of characters defined in the Unicode Standard (and other standards such as XML), and those representable by UTF-8 and UTF-16.

    4. Re:Unicode Character Set vs Character Encoding by TekPolitik · · Score: 2
      UCS-2, as mentioned by this article, is the same as UTF-16 and is severely limited by it's 16 bit implementation.

      Not quite, although you could be forgiven for believing this. UCS-2 is just a truncated UCS-4, which represents exactly 65536 characters (less a little over 2048 now) and was originally the same as Unicode. UTF-16 is an encoding which extends the range of possible characters to around 1 million, and Unicode has been redefined to be the same as UTF-16. Current versions of Windows use UTF-16, not UCS-2.

      A good reference for this is the UTF-8 and Unicode FAQ for Unix/Linux

      The article's claim that Unicode can only map 65536 characters is fundamentally flawed, since its new definition as being the same as UTF-16 means it can probably map every character ever used, and in fact includes fictional scripts (including Tolkien, and more importantly, Klingon, although I'm not sure of the standardisation status of the latter at this time).

    5. Re:Unicode Character Set vs Character Encoding by blair1q · · Score: 2

      All encodings can handle the entire character set. They'd be pointless if they couldn't!

      Do they have all my Zapf DingBats?

      --Blair
      "And what do we do when the Venutians touch down?"

  15. Re:Duh. by jandrese · · Score: 2

    Just because you can't read other languages doesn't mean multi-language support is useless. Oh, and inputting Kanji on a keyboard is quite feasible; try using the Windows IME sometime (it's built into 2000).

    Down that path lies madness. On the other hand, the road to hell is paved with melting snowballs.

    --

    I read the internet for the articles.
  16. Unicode includes all common Asian character sets by Per+Abrahamsen · · Score: 3

    I.e. all the character sets *in common use* in Asia today map into a subset of Unicode. They even map into the 16 bit subset, but overlap in a way that makes slightly different characters from different character sets share the same code point. That is why an extended version of Unicode is used, so Chinese/Japanese/Korean characters have different codepoints.

    Unicode does not contain all characters ever used, for example it does not contain the Nordic runes. These are not used today except by scholars, who will need special software (most likely using the "reserved to the user" part of Unicode). The same is true for many ancient Asian characters.

  17. Babel by jafac · · Score: 2

    He sure did do a good job when he slapped that old Tower of Babel bitch down.

    --

    These are my friends, See how they glisten. See this one shine, how he smiles in the light.
  18. Use UTF-8, don't worry about sizes by iabervon · · Score: 3

    UTF-8 encodes 7-bit ASCII characters as themselves and all of the rest of UCS-4 (the unicode extension to 32-bits) as sequences of non-ascii characters. This means that apps which can't handle anything but ascii can simply ignore non-ascii and get all of the ascii characters (and, with minimal work, report the correct number of unknown characters).
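
    As a rough illustration of that "minimal work", here is a Python sketch of an ASCII-only consumer. The function name is made up, and it assumes the input is well-formed UTF-8. ASCII bytes are kept as-is, and every non-ASCII character is counted once by looking only at lead bytes, since continuation bytes always have the form 10xxxxxx.

```python
def split_ascii(data: bytes):
    """Return (ascii_text, unknown_count) for a well-formed UTF-8 byte string."""
    ascii_chars = []
    unknown = 0
    for b in data:
        if b < 0x80:
            ascii_chars.append(chr(b))     # plain ASCII, encoded as itself
        elif (b & 0xC0) != 0x80:
            unknown += 1                   # lead byte of a multibyte character
        # otherwise: continuation byte (10xxxxxx), already counted via its lead byte
    return "".join(ascii_chars), unknown


text, skipped = split_ascii("naïve café €5".encode("utf-8"))
print(repr(text), skipped)
```

    The program never has to decode anything; it only needs the two bit tests.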

    The only issue is that there's not a good way to set a mask for the characters such that 0-127 (which take up a single byte) are the common characters for the language, and so on, so English is more compact than other languages, even languages which don't require more characters.

  19. Re:2 + 1 bytes? by Jeremy+Erwin · · Score: 2
    Maybe use only 20 bits and leave 4 bits for something else (font style, inverse, etc.).

    Typically, one shouldn't apply font styles on a character by character b as iS.

  20. Chinese language(s) by danny · · Score: 2
    I highly recommend S Robert Ramsey's The Languages of China to anyone interested in language in China.

    Danny.

    --
    I have written over 900 book reviews
  21. Re:Overstating and misunderstanding the problem by tjansen · · Score: 2

    >> the number one is handwritten in America as a vertical stroke, but in Germany as an upside-down V

    No, the handwritten one in Germany looks more like the 1 in an Arial font. bye...

  22. Case by Pseudonymus+Bosch · · Score: 2

    26 letters

    You mean 26 uppercase and 26 lowercase.
    __

    --
    __
    Men with no respect for life must never be allowed to control the ultimate instruments of death.
    GW Bu
  23. Re:UCS-4 by spitzak · · Score: 2
    UTF-8 can encode 31 bit characters with an obvious extension of the standard (or perhaps this is part of the standard, I'm not sure).

    After that point it breaks down (if you continue the first byte is filled and the prefixes will have to go into the second byte). Alternatively you can quit at that point, use the remaining bit as the 32nd bit, and say that UTF-8 cannot encode more than 32 bits.

    Anyway, one big advantage of variable-sized encoding is that there is potentially no limit to the size of the data transferred.

    My personal opinion is that UTF-8 should be used *everywhere*, including all internal interfaces to libraries and services like X. The sooner we stamp out these "wide characters" and all the complexity they cause by doubling or quadrupling the number of interfaces we need, the better.

    UTF-8 probably does not address the concerns of the article, which is about the fact that Unicode does not contain all possible scribbles drawn by humans. But the fact that English speakers were able to compensate for decades and even adapted to the rather arbitrary and limited 62-character ascii set would indicate that people will easily compensate for this as well.

  24. Re:ASCII stupidity all over again... by spitzak · · Score: 2
    When ASCII was invented it was based on existing typewriters, including the ones sold in Europe. At that time output was on paper and it was rather easy to overstrike characters to produce accented characters. How else do you explain the existence of the '~', '^', and backquote characters (in addition, the underscore code was originally a macron)? They actually designed it so the countries of NATO could type as well as they could on a typewriter. They also deleted several characters Americans wanted (fractions, fl and fi ligatures, open and close quotes, cent sign were all very common on typewriters of that period).

    Yes, only a small set of countries was considered, and only minimal support. But this claim on "no support for anything not USA" is false.

  25. Re:ASCII stupidity all over again... by spitzak · · Score: 2
    I didn't claim they tried to support all European characters. What I meant is that they did not ignore them totally. They (rather stupidly) thought that a few accent marks would do the job.

    The cent sign was replaced with the caret. That is why shift+6 prints a caret, if you look at old typewriters that was how you printed a cent sign. This is in fact the main reason I think they considered European support, since from an American point of view the cent sign is more important. The fractions were what were replaced by the square braces. The curly braces, vertical bar, and apparently the tilde were added later (originally they printed as square braces, slash, and caret, and devices that totally ignored the lower-case bit were allowed, and the original tilde was changed to underscore because that character was missing originally).

    "Extended ASCII" usually refers to the replacement of several of the punctuation marks with European characters. This was pretty useless because by then most OS's had assigned meaning to those punctuation marks (like the square brackets), also only 5 or 6 new characters were available. This died almost immediately when people started supporting the 8th bit as data rather than parity.

  26. Re:Arabic space by spitzak · · Score: 2
    Is there a good reason for this or is this due to some stupidity in MSWord? I assume it has something to do with bidirectional scripts, but if normal space is not used for anything in Arabic then I would accuse MicroSoft of being stupid.

    If this is the normal non-breaking space character (0xA0 in Unicode) then it takes 2 bytes in Unicode.

  27. Re:Arabic space by spitzak · · Score: 2
    Yes, I am guessing that MS picked a character to mean "backwards wrapping space" or something, and it sounds like all Arabic must have the words separated by this character rather than space. It apparently is not the "non breaking space" or some Arabic equivalent, if I understand the original poster correctly.

    The question was "did they do this for a good reason?" I.e., does doing this allow formatting control that could not be achieved otherwise? Or were they just stupid/lazy, and if normal spaces were used with a slightly smarter program would it be just as good?

    I personally don't know anything about Arabic so I cannot answer these questions. My guess is that this is reasonable if there is a place that "normal spaces" are used in Arabic.

  28. Re:UTF-8 should be fine for almost any application by spitzak · · Score: 5
    Thanks for some more intelligent discussion about UTF-8.

    I might add a few things:

    In UTF-8 not just NULL or Escape are not in the multibyte characters, in fact *all* 7-bit characters are not in the multibyte characters (the multibytes have the high bit set in all bytes). This means that *any* program that treats all bytes with the high bit set as a "letter" will work and can parse, hash, match, search, etc identifiers/words with foreign letters in them!
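
    A small Python sketch of that idea, with hypothetical function names: a byte-level word splitter that knows nothing about UTF-8 and simply treats every byte with the high bit set as a letter. Because no byte of a multibyte sequence ever falls in the ASCII range, words containing non-ASCII letters come out intact.

```python
def is_word_byte(b: int) -> bool:
    """ASCII letters, digits and underscore, plus any byte with the high bit set."""
    return b >= 0x80 or chr(b).isalnum() or b == ord("_")


def split_words(data: bytes):
    """Split raw bytes into words without decoding them."""
    words, current = [], bytearray()
    for b in data:
        if is_word_byte(b):
            current.append(b)
        elif current:
            words.append(bytes(current))
            current = bytearray()
    if current:
        words.append(bytes(current))
    return words


raw = "größe = naïve_var + 1".encode("utf-8")
print([w.decode("utf-8") for w in split_words(raw)])
```

    The splitter works on bytes throughout; decoding happens only at the end, for display.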

    In addition the UTF-8 encoding is just heavy enough that random line noise is very unlikely to match a UTF-8 encoding. If programs treat "illegal" UTF-8 encodings as individual bytes in the ISO-8859-1 character set, it will display virtually all existing ASCII/ISO-8859-1 documents unchanged!
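
    One way to sketch that fallback in Python is with a custom decoding error handler. The handler name and function names below are invented for this example; `codecs.register_error` itself is part of the standard library. Any bytes the UTF-8 decoder rejects are reinterpreted as ISO-8859-1 instead of raising an error.

```python
import codecs


def _latin1_fallback(exc):
    """Error handler: reinterpret the bytes UTF-8 rejected as ISO-8859-1."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return bad.decode("latin-1"), exc.end   # (replacement text, where to resume)


codecs.register_error("latin1_fallback", _latin1_fallback)


def decode_lenient(data: bytes) -> str:
    """Decode as UTF-8, treating invalid sequences as individual ISO-8859-1 bytes."""
    return data.decode("utf-8", errors="latin1_fallback")


print(decode_lenient("café".encode("utf-8")))    # valid UTF-8 input
print(decode_lenient("café".encode("latin-1")))  # legacy ISO-8859-1 input
```

    Both calls print the same word. The approach can still misread an ISO-8859-1 document that happens to contain a byte run forming valid UTF-8, which is the rare case the "virtually all" above allows for.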

    The end result is that it should be easy to switch all interfaces (not just over the network, but inside programs and to libraries) to UTF-8. This will vastly simplify the handling of Unicode because there will be no need for ASCII back compatibility interfaces. We could also eliminate all the "locale" crap and make ctype.h the simple thing it once was.

    Even Arabic will encode smaller in UTF-8 than UTF-16. This is due to the fact that very common characters (not just English, but things like space and newline) are only one byte.
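
    A quick way to see this is to encode a short Arabic phrase both ways. The sketch below is only illustrative: the helper name is made up, the phrase is written with escapes to avoid bidirectional display problems, and UTF-16 is measured without a byte order mark.

```python
def encoded_sizes(text: str) -> dict:
    """Byte counts for the same text in UTF-8 and in UTF-16 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


# Eleven Arabic letters and one space ("as-salam alaykum")
phrase = "\u0627\u0644\u0633\u0644\u0627\u0645 \u0639\u0644\u064a\u0643\u0645"
print(encoded_sizes(phrase))   # {'utf-8': 23, 'utf-16': 24}
```

    The Arabic letters cost two bytes in either encoding, so the saving comes entirely from the space; the more spaces, digits, newlines and ASCII punctuation a text has, the bigger the gap.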

  29. Re:All Character sets simultaneously?? by K-Man · · Score: 2
    I got all sorts of spurious matches from the Latin words, which wouldn't happen if the Greek and Roman letters weren't sharing a single character space.


    However, in Unicode, Chinese, Korean, and Japanese all share the same codepoints and glyphs, so you can't grep for one language or another.

    For instance, if you were searching in Korean for "Kim Il Sung", this string in Unicode would be the same as the Chinese characters for "gold" (jin), "one" (yi), and "star" (sheng), so your search would get hits from other sino-based languages in addition to Korean.

    It's difficult even to sort Unicode correctly without choosing some language or another, due to this overlap of characters. "Alphabetical order" is different for the different Asian languages, even though they use the same characters.
    --
    ---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger
  30. Correct, also see link by K-Man · · Score: 2

    Yes, the author was overbroad with that statement. All languages work on a restricted set of phonemes; there are some 200+ identified, but no one language uses near that number. Hangul covers all the Korean phonemes, but not much else.

    Here's a good description of Hangul. If you check this page, you'll notice I was wrong about the vowels; they don't seem to describe their own pronunciations at all, but rather the yin and yang elements of their sounds :-P.

    --
    ---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger
  31. Re:You bring up a good point by K-Man · · Score: 3

    If you read the article, you'll find a decent description of Korean Hangul, which has around the same number of characters as English (IIRC, it has 24).

    Hangul outdoes the latin alphabet in several ways. For one, as you mention, pronunciation in English is difficult, while in Hangul it is almost completely unambiguous. Each phoneme maps to one character, and vice-versa. There is no confusion over whether to write "cat" or "kat", for example. Only one letter has the "k" sound.

    Each Hangul character is a pictogram describing the position of the tongue, palate, and lips to use when pronouncing it. Whereas most phonetic alphabets consist of ideograms recycled as phonetic symbols, Hangul seems to be the only one to consist of symbols constructed purely for phonetic meaning.

    Since the job of a phonetic alphabet is only to represent phonemes, I would say that this alphabet does the job better than latin.

    --
    ---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger
  32. Re:Unicode's reply by dvdeug · · Score: 2

    > the allocation of characters is handled by a single [...] organization,

    Slightly incorrect. It's handled by the Unicode Consortium AND the ISO 10646 standards group.

    > not in any way open, organization

    It's as open as, say, the ISO C++ standards group. That is, unless you're connected to the right corporation or country, you won't get a seat, but they still accept outside submissions and respect experts outside the group.

    > Unicode consortium would be benevolent enough, that we all know that it is not.

    Benevolent how? Benevolent enough for what? It took them less than a year to get LATIN CAPITAL N WITH LONG RIGHT LEG encoded, for a minor language with no political power (Lakota). They're constantly encoding new letters and scripts for groups with no political or economic clout (Z with hook below for Old High German, various Philippine scripts in 3.2).

    And no one's stopping you from hacking up your own multi-charset system, and using it wherever you want. But loudly claiming that you're being oppressed doesn't prove that you are, and doesn't prove that your system would actually be superior to Unicode.

  33. Re:No, _n_ bytes per character! by dvdeug · · Score: 2

    Part of the point of UTF-8 is that non-ASCII characters don't get encoded with ASCII characters. In your system, you can get an '/' or a '\0' or '\e' byte that doesn't represent that character, meaning that all Unix software needs to be changed to support your encoding. As it is, Linux accepts bytes for filenames without caring whether it's UTF-8 or some 8-bit code or some other multibyte code that obeys the same rule, knowing only that the byte '/' is uniquely the directory separator.
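
    To make the filename problem concrete, here is a minimal Python sketch. It uses UTF-16 purely as a stand-in for a multibyte encoding that does not follow the rule (it is not the parent poster's scheme), and the helper name is made up. The letter U+012F encodes in UTF-16 to a byte pair that includes 0x2F, the '/' byte, while its UTF-8 form contains only bytes with the high bit set.

```python
def contains_reserved_byte(encoded: bytes) -> bool:
    """True if the byte string contains NUL (0x00) or '/' (0x2F)."""
    return any(b in (0x00, 0x2F) for b in encoded)


name = "\u012f"  # LATIN SMALL LETTER I WITH OGONEK

print(name.encode("utf-16-be"), contains_reserved_byte(name.encode("utf-16-be")))
print(name.encode("utf-8"), contains_reserved_byte(name.encode("utf-8")))
```

    A kernel that splits paths on the '/' byte would cut the UTF-16 form in the middle of a character, but would pass the UTF-8 form through untouched.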

  34. Re:Unicode's reply by dvdeug · · Score: 2

    > ISO 2022 is a very poor implementation of stateful multi-charset character stream,

    You've ranted on this every time Unicode has come up on Unicode, but assertion does not a proof make. You've never sketched out a better system, or said what makes ISO 2022 a poor implementation of what it is. Write an RFC, create a rough implementation of the system and if it really is better, then and only then can people evaluate and decide whether or not to use it. Until then, the choices are basically ISO 2022 or Unicode, and people will pick the choice that works best for them, and not worry about what could be the optimal solution.

  35. Re:ISO-2022-JP and "alphabetical order" by dvdeug · · Score: 2

    Why is the character order a problem for the Japanese, and not the Germans, the French, the Lithuanians, the Belarusians, and almost every other language in the world? Latin-* does not encode anything besides English in alphabetical order, and neither does Unicode. (It's theoretically impossible; the Lithuanians want the Y to precede the J, and the Danish and the Swedes disagree about where the a with ring above goes.)

    If you go to the Unicode standard (found online at http://www.unicode.org/unicode/uni2book/u2.html ) they have an index with all the characters by radical and stroke. They also have an index with all the characters found in JIS sorted by their JIS index.

  36. Re:Unicode's reply by dvdeug · · Score: 2
    The problem is, the actual problem that it will solve does not exist yet, its time didn't come.

    Because, gee, the need to communicate with someone in another language is new. I've never seen VCR instructions in multiple languages, I've never seen a bilingual dictionary, and the EU driver licenses only have one language on them, not every language of the EU.

    Multilingual documents, for all purposes, don't exist beyond demos.

    So Reta Vortaro, an Esperanto dictionary with translations to many languages, is a demo. (Click on the j^ in the left frame, and then on the j^audo in the same frame, for the translation of that word into English, German, Polish and Russian, among others.) Or Freedict, a source of bilingual dictionaries for dict (including German and Greek, and German and Japanese) is just a demo too. And the Debian main page, where it lists the names of the languages in which the page has been translated to in their own script at the bottom, is just a demo too.

  37. Re:Unicode's reply by dvdeug · · Score: 2

    The fact that you chose to dismiss this stuff as demos does not change the fact that it's in actual use. Revo's author doesn't feel like changing the format of his dictionary because you don't agree with it. The web is full of gimmicks, but people like their gimmicks; why do you think Java took off? You can't just call it a gimmick and dismiss it; if that's what people do, then that's what people want to do.

    You're also missing the other selling point of Unicode: it's simple. Yes, there are plenty of ways for an application to handle multiple character sets, but they're all more complex than just using Unicode. I'm sure when typing up "German and English Sounds" for Project Gutenberg, that I could switch between Latin-1, some IPA character set, a character set with o-macron, a character set with u-breve, and whatever I need for the rest of characters Dr. Grandgent used, but it's much easier for me to use Unicode. When I start on "Old High German", I could dig up some obscure High German character set and switch to a Greek character set when he uses Greek words as examples . . . or I could just use Unicode. No matter how much you would dismiss it, it is a real problem and some of us use Unicode because it is a real and a simple solution to the problems we face.

  38. Re:Unicode's reply by dvdeug · · Score: 2

    > The simplicity of Unicode is only in its authors' imagination.

    Come on. I've read the Unicode standard, I read unicode@unicode.org, I've read most of the publicly accessible proposals and I'm familiar with all the Technical Reports. There is a lot of complexity in Unicode, but it's mostly derived from the inescapable complexity of the writing systems and compatibility with older systems, and most of the complexity can be ignored if you are willing to support some subset (European systems, or European/Russian/CJKV systems). That complexity is going to exist whether you use Unicode or some other multilingual system. Supporting Unicode at the Xterm/Yudit-level is simple, and supporting Unicode in an application with Pango & GTK 2.0 should be just as simple.

    > When the goal is just to make a text that can be printed in pretty letters [...] even in this case a complex typesetting system [...] would be more appropriate

    The goal of Project Gutenberg is to transcribe public domain texts in a format readable for the largest audience possible. Unicode HTML and UTF-8 plain text are those formats. Some proprietary and/or obscure complex typesetting format is neither portable nor accessible to a wide audience. Project Gutenberg has existed for 30 years. What "complex typesetting system" format can claim the same? How many "complex typesetting system"s that could handle it are available on many different platforms? At least 70% of the people on the net can read Unicode HTML, and many of the rest could with little work and no cash expenditure. What "complex typesetting system" can say the same? How is a "complex typesetting system" simpler than Unicode plain text?

    > he needs them to be supported with input methods (how to enter greek on this particular keyboard?), formatting rules, ordering, at least references to spellcheckers, etc.

    Nonsense. For "Old High German", I will map ALT-Z to ȥ in XEmacs. Spellcheckers don't exist for this language, I'm not going to sort the data, and it's just a z with a hook, so there's no special formatting rules. The people who read the book don't even need a way to enter the character, any more than reading the original book precipitated a need to enter it into the computer.

    Displaying pretty letters isn't the end all and be all of multilingual computing, but it's a damn good start. The only registered character set (ISO 2022 registry or IANA registry) that supports Lakota or the Cherokee syllabary is Unicode. No, on most systems, they can't get decent support; handcrafted keyboard maps must be used, there's no spelling or sorting support. But they can type the characters in and send them across the net and print up papers, which is better than nothing.

  39. Re:Esperanto, Ido, lojban; BCE by unitron · · Score: 2

    He explained it as "before the Christian era", no doubt for the benefit of those only familiar with B.C. and A.D., but did not define it as that, although anyone who needs it explained no doubt also needs it defined as "Before Common Era" (and should also be told that what comes after is "C.E.", or "Common Era", and that B.C.E. and C.E. correspond to B.C. and A.D., respectively), so he did screw up just a tad.

    --

    I see even classic Slashdot is now pretty much unusable on dial up anymore.

  40. Re:It works by h2odragon · · Score: 2
    No, that makes too much sense; it's not all inclusive so let's trash the whole thing, start over from scratch, and revert to 7bit ASCII in the meantime. We need a system that can handle every glyph that has ever had meaning to somebody, somewhere.

    ...for the sarcasm impaired, the above should be read as "good point".

  41. No you didn't read the article, or even think by A+nonymous+Coward · · Score: 2

    You are so euro-centric it's not even laughable. As the article said, those who claim Unicode good enough for the masses are the same foreigners who would scream and howl if someone tried to remove redundancies from the English language such as pork and ham, or argue and dispute, or ...

    I have read that an English language vocabulary of 300 words is good enough for most ordinary conversation. You are claiming the equivalent is good enough for ordinary use. You are mistaken.

    Unicode is a classic case of (western) imperialism, in which the imperialists are completely blinded as to why it is imperialistic, and continue to mutter "it's good enough, and we know what's good for you smelly foreigners."

    --

    1. Re:No you didn't read the article, or even think by WNight · · Score: 2

      Whoops, here comes the racism...

      Anything a white guy wants, or someone who might be a white guy, is wrong, euro-centric, penis-dominated, and wrong.

      Now anything a non-white, non-guy wants is automatically right.

      Now a person who is completely anonymous on the internet can be assumed to be white and male if they disagree with anything said by a non-white, non-male, or someone who lives outside of 'europe or north america'.

      You know, there are a lot of reasons for disliking Unicode, and a lot of reasons for not wanting to waste time implementing a system which has 1) grown monstrously beyond original specs and 2) doesn't help you at all.

      IMHO, you should use those ~65K characters and stop your pathetic sniveling. If you want a character set that supports more, make it yourself and get others to use it.

      If you ever want anyone outside of your immediate family to use it, you'll have to make it worth their while.

      What does the other 80% of the world get out of supporting your dead language? Uglier URLs? More bloated OSes? Slower web usage?

      Sure sounds important to me.

      And before you scream "Racist!", ask yourself if you have any proof, or if you're just pissed that I don't agree.

    2. Re:No you didn't read the article, or even think by WNight · · Score: 2

      I'm sure those 65000 characters would better represent any asian language than German or French is represented without their accents, yet speakers of those languages didn't pull this entitlement crap to make people support larger character sets.

      No language is going to be properly represented, especially when you consider ancient forms, so we'll have to accept that nothing is perfect. We've run into diminishing returns and now people want to increase the complexity of the system a hundred-fold just to get some characters that only a thousandth of one percent of the population will ever know are missing, let alone want to use in conversation.

      There's no written language that more than 80% of the world population uses, so I stand behind my original estimate.

      I'm not at all racist in what I say, I'm merely sick of catering to the special interest groups. Especially the special interest groups that claim to be part of a larger group. (In this case, 99.999% of the population of the original subject's country couldn't give a shit about having ancient characters from a dead language in their URLs, it's *his* issue, not theirs, but he's making it seem like a race issue and oppression of the little guy.)

  42. Re:C programs by Luke · · Score: 2

    I'd love to see the linux kernel coded in Python.

  43. What about the artist formerly known as Prince? by mattkime · · Score: 2

    Does the artist formerly known as Prince get his own character space as well?

    Will I need to download a new character set on windows to view it?

    --
    Know what I like about atheists? I've yet to meet one that believes God is on their side.
    1. Re:What about the artist formerly known as Prince? by kevinank · · Score: 2

      Prince is again Prince. He got his name back when the music industry contract that prohibited him from using his own name expired.

      --
      LibBT: BitTorrent for C - small - fast - clean (Now Versio
    2. Re:What about the artist formerly known as Prince? by vidarh · · Score: 2

      I don't know about the status, but I believe it was proposed by someone a while back... :-)

  44. Re:Quit whining and move to a phonetic alphabet by scrytch · · Score: 2

    > The obsession with phonetic spelling is an unhealthy and rediculous pathology

    Despite the fact that we move inexorably toward it anyway.
    --
    I've finally had it: until slashdot gets article moderation, I am not coming back.
  45. Re:Well DUH! It's not meant to have every characte by Mike+Buddha · · Score: 2

    Besides, translation software is coming along well enough that soon we will not have to worry about it too much.

    How is this translation software supposed to work if there is no standard for interchange? Magic? How are we supposed to translate these characters that have no symbol for the computers to process?

    There are well over 140,000 language characters on this earth, and there are many yet to have been entered into a computer.

    What makes you think that we can't encode all these characters? Are we going to run out of numbers? A 32-bit number can hold 4 billion different values, and if that isn't enough, we can use a 64-bit number. We certainly aren't going to run out of numbers.

    --
    by Mike Buddha -- Someday the mountain might get him, but the law never will.
  46. Re:After some skimming... by K. · · Score: 2

    I suspect Unicode is a lot more upsetting to
    a "reference writer specializing in rare Taoist
    religious texts and medical works" than to
    ordinary Chinese users who want to run Photoshop
    or put their wedding pictures on a web page.


    Let me get this straight - you think people
    should be prepared to accept having restricted
    access to the literature that underpins their
    culture in exchange for their very own
    geocities.cn?

    K.
    -

    --
    -- Proud descendant of semi-nomadic cattle-herders.
  47. Some errors by BJH · · Score: 5

    Hiragana, which is somewhat cursive, can be used to augment Kanji - in fact, everything in Kanji can be written in Hiragana. Katakana, which is much more fluid in appearance than is Hiragana, is used to write any word which does not have its roots in Kanji, such as the many foreign words and ideas which have drifted into general use over the centuries.

    In actual fact, Katakana is much more angular than Hiragana - definitely not "fluid" in appearance. Furthermore, anything that can be written in Kanji can be written (phonetically) in either Hiragana or Katakana - the use of Katakana for foreign words is nothing more than custom, not a limitation of the characters.

    Thus it can be said that Hiragana can form pictures but Katakana can only form sounds...

    That should probably read "Kanji can form pictures but Hiragana/Katakana can only form sounds..."

    Romaji is used to try and keep the whole written thing from getting out of control, with most Western concepts and necessary words being introduced into the language through this mechanism.

    Bollocks. Romaji is hardly ever used (except for advertisements, and then only rarely, or textbooks for foreigners). It's definitely not the main conduit for Western ideas.

    After a time these words (even though they will still maintain their "Roman" form for awhile longer) will become unrecognizable to the people they were originally borrowed from, such as the phrase, "Personal Computer," which is now "PersaCom" in Japan.

    Again, this is incorrect. Words don't *have* a Roman form in everyday use; sure, you can express them in Romaji but no-one ever does. As for "personal computer", the correct Romanization is "pasokon", not "PersaCom". (Where did he get that from?!)

    The rest of the 1,950 have to be memorized fully by the time of graduation from high school in Grade Twelve. Please remember that this total is only the legal minimum required threshold to be considered literate. And this is to be absorbed completely, along with a back-breaking load of other subjects.

    Ummm... that's actually not too hard. I (along with everyone else at my language school) memorized more than 1300 Kanji in less than a year... and none of us were Japanese. I know it must seem like an impossible total to people used to ASCII, but there are many common points between Kanji that simplify the learning process greatly.

    That said, I've long been against the current Unicode "standard", as have many technical people in Japan, for a number of reasons. Some of those are:

    - No standard conversion tables from existing character sets (SJIS, EUC-JP, ISO-2022-JP).
    Several conversion tables do exist, but there are minor differences between them that make it impossible to go from, say, SJIS to Unicode and back to SJIS without the possibility of changing the characters used.

    - A draconian unification of CJK characters.
    The Unicode Consortium basically forced the standards bodies in China, Japan and Korea to unify certain similar Kanji onto single code points, which doesn't allow for cases where, say, Japanese actually has two or three distinctive writings that are used in different situations.

    - The ugly "extensions".
    Unicode has been effectively ruined as a method of data exchange by its treatment of characters not in the 60,000-character basic standard.
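
    The round-trip problem in the first point can be sketched in Python. This is only an illustration, assuming that the `shift_jis` and `cp932` codecs bundled with CPython can stand in for two of the competing conversion tables; the byte pair 0x81 0x60 (the wave dash) is a commonly cited case where such tables disagree.

```python
# Two Shift-JIS tables shipped with CPython: the JIS-based one and
# Microsoft's code page 932 variant. Treating them as stand-ins for the
# competing conversion tables is an assumption of this sketch.
raw = b"\x81\x60"  # the wave dash in Shift-JIS

via_jis = raw.decode("shift_jis")
via_ms = raw.decode("cp932")

print(f"shift_jis -> U+{ord(via_jis):04X}")
print(f"cp932     -> U+{ord(via_ms):04X}")


def round_trips(data: bytes, decode_with: str, encode_with: str) -> bool:
    """Decode with one table, encode with another, compare to the input."""
    try:
        return data.decode(decode_with).encode(encode_with) == data
    except UnicodeError:
        # The second table has no mapping for the character at all.
        return False


print(round_trips(raw, "shift_jis", "shift_jis"))
print(round_trips(raw, "cp932", "shift_jis"))
```

    As long as both directions use the same table the bytes survive; the trouble starts when the two ends of an exchange picked different tables.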

    I could go on, but I should get some sleep...

  48. printing [ Chinese ] vs dictionary [ Chinese ] by peter303 · · Score: 2

    It's a lot like the Oxford English Dictionary
    versus Webster's Collegiate: Chinese printers have
    gotten by with 7-10K characters versus the 60-80K
    in the full language. Synonyms and homonyms are
    used for the more obscure words. The standard
    modern Chinese dictionaries only have this smaller
    number of characters.

  49. totally unconvinced by kaisyain · · Score: 2

    One of the author's main propositions seems to be that Communist Chinese and Taiwanese/Overseas Chinese want different spaces in Unicode for the same characters.

    I don't see every Western nation asking for its own encoding of "w" or accented characters. The author doesn't give any explanation for why we should pay attention to IMHO silly political whining in this particular case.

    The author further implicitly assumes that it is reasonable to include the deprecated K'ang Hsi characters in addition to the official characters, but gives no justification for this view. I don't see unicode trying to include all possible historical graphings of Western characters.

    1. Re:totally unconvinced by vidarh · · Score: 2
      One of the reasons they want different glyphs is that the characters actually look different in present day use.

      As for including all possible historical versions of Western characters, there are very few that are sufficiently different from present day renderings to be easy to confuse.

      But I agree that his criticism is mostly whining. Most of all because Unicode 3.1 has shown that unicode absolutely is not a static standard, but one that is evolving to encompass more characters on a regular basis. Perhaps some people will have problems using it today. In that case those people should interact with the standards committee instead of whining, and get their characters into the next version.

      But for most people (including most Chinese and Japanese people) the current Unicode standard will be comprehensive enough for most use.

    2. Re:totally unconvinced by vidarh · · Score: 2
      Did you actually read my post? I explained why there is a legitimate request for different versions of similar characters among the CJK glyphs. I also suggested that people who need the missing characters work to add them.

      Finally, however, I did suggest that to most people using Chinese, Japanese and Korean, the current set of 94,140 characters, of which about 65,000 are there for the benefit of Chinese, Japanese and Korean, would be sufficient.

      I did not write anything to imply that no one would run into limits. I did not write anything to imply that people who do run into limits should accept that (hence my suggestion that they work to have the characters they need accepted in forthcoming revisions of the standard).

      However I do stand by my claim that 94,140 characters will be enough for most people most of the time, including people using Chinese, Japanese and Korean.

      Now go learn something about how to parse basic English sentences.

  50. Re:Solution - Everybody use Euro-English! by sharkey · · Score: 2

    Das rubbernecken sightseenen keepen das cotten picken hands in das pockets, so relaxen und watchen das blinkenlights.

    --
    "Outlook not so good." That magic 8-ball knows everything! I'll ask about Exchange Server next.
  51. Re:You bring up a good point by gleam · · Score: 5

    The writing system with the smallest alphabet that is in current use is Hawaiian, with 12 letters. (aeiou hklmnpw) source

    A good source for your obscure questions is, as always, the Straight Dope, which answers the "Chinese Typewriter" question here.

    Regards,
    gleam

    --
    this .sig is not a .sig.
  52. Re:Quit whining and move to a phonetic alphabet by dutky · · Score: 2

    The obsession with phonetic spelling is an unhealthy and ridiculous pathology: to understand why, have a look at Justin B. Rye's Spelling Reform page (subtitled And the Real Reason It's Impossible).

  53. Re:UTF-8 should be fine for almost any application by dutky · · Score: 3

    Of course, even if you could get China, Taiwan, Japan and Korea to agree on a unified character encoding similar to the ISO-Roman character set (where identical or analogous characters in the different alphabets shared the same character code) you would still need more than 50,000 encodings just for the unified Asian character set.

    I can see good reasons why languages using similar alphabets should have overlapping encodings, but this is probably better solved by providing translation tables between related alphabets than by forcing multiple alphabets to share a single encoding. While I may be able to write the French coup de grâce in the English alphabet as coup de grace, something has clearly been lost. Other European languages are even worse, even those that nominally use the Roman alphabet! Then there are questions of alphabetization between different languages and the questions of whether or not accented letters correspond to each other or to the unaccented letter.

    Call me a purist, but I think it would actually be much easier if we just had distinct representations for each language and had to perform some kind of mapping to display one language in another language's alphabet.

  54. Re:unicode does *not* encode 65,536 characters by jholder · · Score: 2

    This is incorrect, 44,946 surrogates were approved in March as part of Unicode 3.1.0.

    Unicode 3.1 and 10646-2 define three new supplementary planes:

    Supplementary Multilingual Plane (SMP) U+10000..U+1FFFF (1594 chars)
    Supplementary Ideographic Plane (SIP) U+20000..U+2FFFF (43,253 chars)
    Supplementary Special-purpose Plane (SSP) U+E0000..U+EFFFF (97 chars)

    Or plane 1, 2, and 14. (from the Unicode 3.1 Technical report, #27)
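
    For anyone wondering how those ranges relate to the 16-bit code units, here is a rough Python sketch of the arithmetic. The function names are made up for illustration; the constants are the standard UTF-16 surrogate ranges.

```python
def plane(code_point: int) -> int:
    """Return the plane number (0 = BMP) of a code point."""
    return code_point >> 16


def to_surrogates(code_point: int):
    """Split a supplementary code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need surrogates")
    offset = code_point - 0x10000      # 20 bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


for cp in (0x10000, 0x20000, 0xE0001):
    high, low = to_surrogates(cp)
    print(f"U+{cp:X}: plane {plane(cp)}, surrogates {high:04X} {low:04X}")
```

    Two 10-bit halves give 2^20 supplementary code points, which is where the "16 extra planes" figure comes from.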

    --
    -- John
  55. Re:After some skimming... by WNight · · Score: 3

    Oh gawd, just listen to the feelings of entitlement in that message...

    You want the ability to search through some insanely large character set, so to do so you're willing to force everyone else to make their communications much less efficient just so you can have a free ride.

    You know, it's not a coincidence that the western world (using small variations on the roman character set) pretty well invented modern technology. It's only about a thousand times easier to process a smaller and simpler alphabet.

    There's a reason we don't use prose to command computers: until all cheap desktop models come with the ability to understand natural language, a stripped-down and unambiguous command set will be more efficient.

    I've got a lot of characters I'd find handy if we were to implement a new standard, and I'd want to expand into basic pictograms (standard symbols, etc) as well. Now I realize this isn't interesting to other people, so I'm not going to jump up and down and shout "Racist" just because people aren't anxious to bloat a new standard just to appease me. If I want those features I'll make my own font and make it available with any works that I produce which would require it.

    In short, grow up, the world does *not* owe you anything. If you want it, do it yourself instead of crying when someone else doesn't.

  56. Re:Quit whining and move to a phonetic alphabet by HenryFlower · · Score: 2
    But you wouldn't say that if you could read ancient Greek (I assume that's what you meant, not modern Greek). If you could, you would be happy that there need be no longer a half-dozen idiosyncratic methods for encoding ancient Greek, with equally idiosyncratic input methods. All you are really saying is: if I don't need it no-one does. And I don't see how a character set allowing faithful encoding of Greek characters and diacritics places any special burdens on you, who don't need to use them....

    Or perhaps you are being very subtly sarcastic? Or trolling?

  57. This is so wrong by jfedor · · Score: 2

    The author of the article and the guy who submitted the story clearly don't have a clue about Unicode. Unicode can encode over one million characters, as stated here.

    Unicode may have its problems, but this is not one of them.

    -jfedor

  58. Compaction and Traction by JJ · · Score: 2

    The 64,000 should suffice. Ideographic scripts, like Chinese, are where the problem arises. The number of characters in Chinese is not fixed, unlike the number in most alphabets. I have a Chinese novella which was written in just 300 characters. 10,000 would be a good place to start, a few thousand more would cover all but specialized texts. Japanese could fold into Chinese, since there are only 2000 kanji characters and a few hundred kana.
    Throw in Arabic, Cyrillic, Sanskrit, Dravidian, Hangul (Korean) and Navaho and you still add only a few thousand. The odd European characters (the 'ss' in German, the extra Danish vowels, . . .) add a few hundred tops. Even the special linguist marks and punctuation don't add much.
    If you have to double the Chinese, now you run into trouble. It's classical characters vs. simplified. The latter is for the PRC. If you also bloat the number of characters required so that specialized religious characters are required, now you start to push the system. 64K would be fine if a special marker character could be used which signifies that the next character is from the special table. Unicode has resisted this effort.

    --
    So long and thanks for all the fish . . . !!!
    1. Re:Compaction and Traction by JJ · · Score: 2

      You know, this is actually the one topic that I am probably best versed on discussing. My info sci masters advisor was on the committee which established ASCII and my linguistics masters was on medieval Chinese dictionaries. Plus, I used to live in Japan.
      There are _slightly_ more than 2000 kanji in Japanese, but Japanese printers, like my wife's father, don't use more than 2100 absolute tops.
      Chinese characters obey Zipf's law on a near perfect logarithmic scale. As in, the first ten characters make up about 60% of written text. Each factor of ten up from that covers about 60% of what is left. At 10,000 characters you have all but about 2.5% of most newspaper text. The few thousand extra that I spoke of cover mostly proper names.
      Chinese most certainly can be written satisfactorily in this manner.
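
      The arithmetic behind that 2.5% figure can be checked with a few lines of Python. This is just a sketch of the rule of thumb as stated here (60% for the first ten characters, then 60% of the remainder per tenfold increase), not a fit to any real corpus; the function name is made up.

```python
import math

BASE_COVERAGE = 0.60  # share claimed for the ten most frequent characters


def coverage(n_chars: int) -> float:
    """Cumulative share of text covered by the n most frequent characters,
    if every tenfold increase covers 60% of whatever was still missing."""
    decades = math.log10(n_chars)
    return 1.0 - (1.0 - BASE_COVERAGE) ** decades


for n in (10, 100, 1_000, 10_000):
    print(f"{n:>6} characters: {coverage(n):.2%} covered")
```

      Four tenfold steps leave 0.4^4 = 0.0256 of the text uncovered, which matches the "all but about 2.5%" figure.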

      --
      So long and thanks for all the fish . . . !!!
  59. Re:Quit whining and move to a phonetic alphabet by scruffy · · Score: 2
    Did you actually read what you linked to? Justin Rye is very sympathetic to "spelling reform", but he realizes it is utopian:
    The flaws of the standard orthography are indefensible - but it has an extensive Installed User Base, and can thus afford to ignore criticism in exactly the same manner as Fahrenheit thermometers, QWERTY keyboards, and certain software packages, which can all rely on conformism, short-termism, and sheer laziness for their continued survival.
  60. Quit whining and move to a phonetic alphabet by scruffy · · Score: 3
    Phonetic writing is one of the greatest inventions of mankind. All a speaker needs to be literate is to learn the mapping between sounds and letters. Could anything be easier?

    But like companies who still maintain their legacy software written in Cobol and who knows what else, countries and cultures hold onto their legacy alphabets, despite all their disadvantages, and despite all the moaning and groaning about education, literacy, and how hard it is to type 10,000 characters on a 100-key keyboard.

    I agree there is a serious problem of understanding texts written in the "old way". There is a simple solution here, too, i.e., we just translate what's most important to the "new way" and let scholars work on the texts that don't get translated. Before anyone gets too hot here, the situation is not that much different than translating literature from one language to another. It is too much work to translate everything that is written in English into French, so one focuses on the texts that are important enough for translation.

    Also, English has a lot of problems here, as it is mostly phonetic, but a large percentage is not, large enough to make learning English a lot more difficult than say learning Spanish.

    I realize this is way too utopian. We Americans can't even move to metric, much less anything more "radical". I just needed to respond to the whining.

    1. Re:Quit whining and move to a phonetic alphabet by Baki · · Score: 2
      Same goes for the original classic texts of our 'western civilization' in Greek and Latin.

      Does that mean that everyone should be able to read/write Greek and Latin? Or should everyone learn Hebrew to read the bible?

      Reading classical texts IMO has no relevance for a character set of today. The waste (difficult input/output methods, waste of space and processing speed) in comparison with the occasional gain (being able to process classical texts on modern computers by everyone) just isn't worth it.

    2. Re:Quit whining and move to a phonetic alphabet by Baki · · Score: 2
      Of course the west doesn't have to decide for China or Japan. They can make their own judgement, and see for themselves whether it is worthwhile and cost-effective to maintain an old and complex character system today.

      But those that don't have the need of daily reading/writing such characters shouldn't be forced to "suffer" in terms of waste of memory and processing speed (2 or even 4 bytes per character). In that sense UTF-8 (a variable-length encoding of Unicode in which plain ASCII text still takes 1 byte per character) is a reasonable alternative.

      What I don't understand is why all possible characters of the world should be in 1 big character set. I know it simplifies some things, but it also costs a lot.

      Why not use a system of multiple (standardized) character sets? This is extendable, you can always add new encodings/sets etc. The only really fixed mechanism needed is a way to specify a switch from one encoding to the other.
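
      One existing scheme that works roughly this way is ISO-2022-JP, which uses escape sequences to switch between character sets in mid-stream. The Python sketch below uses it purely as an example of such a switching mechanism (it relies on the `iso2022_jp` codec bundled with CPython); it is not meant as a design for the general system suggested above.

```python
text = "abc\u65e5\u672c"  # "abc" followed by the two kanji for "Japan"

encoded = text.encode("iso2022_jp")
print(encoded)

TO_JIS_X_0208 = b"\x1b$B"  # ESC $ B: switch to the two-byte kanji set
TO_ASCII = b"\x1b(B"       # ESC ( B: switch back to ASCII

print(TO_JIS_X_0208 in encoded, TO_ASCII in encoded)
print(encoded.decode("iso2022_jp") == text)
```

      The price is that the stream becomes stateful: a reader has to have seen the most recent escape sequence to know what the following bytes mean.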

  61. Nordic Runes? by reverse+solidus · · Score: 2
  62. A Plan for the Improvement of English Spelling by bgarcia · · Score: 4

    Go read the original story here, by Mark Twain.

    --
    I'm a leaf on the wind. Watch how I soar.
    1. Re:A Plan for the Improvement of English Spelling by tswinzig · · Score: 2

      Actually, I think the slashdot post was a lot funnier... he managed to convert english to german by the last year...

      --

      "And like that ... he's gone."
  63. Re:Perl in Hierogliphics by CharlieG · · Score: 2

    Gee, I thought it already exists - they call it APL

    --
    -- 73 de KG2V For the Children - RKBA! "You are what you do when it counts" - the Masso
  64. Re:Is this really such a problem? by revscat · · Score: 2

    Firstly, many cultures are still too poverty-stricken to have electricity and running water, let alone net access. For these people, the thorny issue of whether Unicode has the capacity to represent their native language is totally irrelevant.

    It's totally irrelevant for poor rural populations, true. But as more and more of the world's population moves towards being centered around urban areas this is indeed relevant. It is relevant to those who desire the full functionality of the Internet in their native character set. I believe (and this is a belief, not a fact) that one way to help out those who are poor is by opening them up to the modern economy and make it as accessible as possible. One way to do this is by making sure they can use the latest technology in their native tongue, lowering the slope of the learning curve.

    Secondly, the rate at which languages are dying is still accelerating. Every year, we lose several languages as native speakers die of old age without their descendents having ever learned their original language.

    This is indeed tragic, but it quite simply cannot be helped. It's so common as to be a cliche: "Life Sucks", or "Shit Happens", or even "C'est la vie." I hope that there are linguists and philologists who are archiving these languages for future generations and our general cultural awareness. BUT: People must eat, and they have a strong desire to make themselves and their families prosperous. If, when all things are considered, making sure that you live your life only speaking language X turns out to be counterproductive, then that language will become less important. There have been many languages that have come and gone throughout the millennia; humanity continues to advance. Would the world be a richer place if all those languages were still around? Certainly. But it would also be more confusing. And remember: If people can speak to each other, there is less of a chance they'll start killing each other. (LESS of a chance, mind you.)

    I'm a Taoist at heart in matters such as this. For every yin, there is a yang, for every good, there is a bad. Life goes on.

    - Rev.
  65. Re:ASCII stupidity all over again... by gorilla · · Score: 2

    ASCII is, and always was, a 7 bit standard, which encoded 95 printable characters and 33 control codes. 'high-ascii' just does not exist, and never did.
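
    The counts are easy to verify. A quick Python sketch, taking "printable" to mean space through tilde and "control" to mean the codes below space plus DEL:

```python
codes = range(128)  # 7 bits give 2**7 = 128 code points

printable = [c for c in codes if 0x20 <= c <= 0x7E]    # space through tilde
control = [c for c in codes if c < 0x20 or c == 0x7F]  # C0 controls plus DEL

print(len(printable), len(control), len(printable) + len(control))
```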

  66. It works by Kohath · · Score: 2

    Imperfect != "does not work"

    1. Re:It works by TommyW · · Score: 2

      True, but "does not work" implies "imperfect."

      The point the article is making is that this system cannot be made to work for everybody at once.

      So you either put up boundaries, and have systems that work perfectly, but only within those boundaries, or you need a system with wider scope at the outset.

      --
      Too stupid to live.
      Too stubborn to die.
  67. Re:Wrong, wrong! by csbruce · · Score: 2

    or you want to scan the string backwards

    UTF-8 can indeed be scanned backwards. You could also locate the start of the current character given a random pointer into a byte buffer. RTFM. UTF-8 can also directly encode 2 billion characters. UTF-8 is the right general solution to data interchange, and this is why it's catching on.
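
    A minimal Python sketch of both tricks, assuming the buffer holds valid UTF-8 (the function names are invented for the example). It relies only on the fact that continuation bytes always have the bit pattern 10xxxxxx, which no lead byte or ASCII byte shares.

```python
def char_start(buf: bytes, pos: int) -> int:
    """Return the index of the first byte of the character covering buf[pos]."""
    # Step back over continuation bytes (0b10xxxxxx) until a lead byte
    # or an ASCII byte is found.
    while pos > 0 and (buf[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


def iter_backwards(buf: bytes):
    """Yield the characters of a valid UTF-8 buffer from last to first."""
    end = len(buf)
    while end > 0:
        start = char_start(buf, end - 1)
        yield buf[start:end].decode("utf-8")
        end = start


# One character each of 1, 2, 3 and 4 bytes.
data = "a\u00e9\u65e5\U00020000".encode("utf-8")
print(char_start(data, 5))  # index 5 is the last byte of the 3-byte character
print(list(iter_backwards(data)))
```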

  68. Re:You bring up a good point by ncc74656 · · Score: 2
    The plural of dish is dishes, but the plural of fish is fish

    "Fishes" is also a valid plural form of "fish." "Fishes" refers to a group of different species, while the plural "fish" refers to a group that is all of the same species. The plecostomuses (sp?) and cichlids in my tank at home are fishes; the trout in a pond are fish.

    (Your point that English has tons of rules and even more exceptions to those rules still stands, though.)

    "A bunch of bananas" or "a group of individuals", are these plural or singular?

    A bunch and a group are both singular, though some Brits would disagree (their usage used to treat a group as a plural object ("and the crowd are going wild!"), but that is starting to change in more recent usage).

    --
    20 January 2017: the End of an Error.
  69. Well DUH! It's not meant to have every character by stienman · · Score: 2

    Japanese alone learn some 50,000 symbols before they leave their 5th year of schooling. Unicode was never meant to hold one spot for every character. It was meant to be used as a set of code pages much like ASCII was. But it had to be larger than 256 to hold a reasonably representative set of one language at one time (such as Japanese, or Chinese (two dialects), etc).

    Most documents consist largely of one language, so you start the document by stating the code page you're using. Very few documents need more than one set of 65,536 characters, but you can intersperse sets if needed.

    But the idea of having one universal character set is ludicrous. There are well over 140,000 language characters on this earth, and there are many yet to have been entered into a computer. Sure, we could use 4 bytes per character, but is it really necessary? Absolutely not! Talk about inefficient. The only case where that would be more efficient than code pages is when the majority of documents extensively use more than 64k characters within each document.

    Besides, translation software is coming along well enough that soon we will not have to worry about it too much.

    -Adam

    This sig 80% recycled bits, 20% post user.

  70. Re:too sinocentric, but Unicode has problems by olevy · · Score: 2

    I worked as a programmer in Japan for 4 years, and I've also done several projects in Unicode.

    There are a couple of things I would like to point out:

    >>Japan and Korea get no benefit from Unicode. In fact, their ISO 2022 encodings are at least in "alphabetical order" for the relevant alphabets. Unicode is just a jumble.

    I can't speak for Korean, but there is no such thing as an alphabetic order for Kanji. In Japanese, Kanji almost always have at least two pronunciations, and often more.

    >>The Japanese hate Unicode. If you bother to ask them, which the web did not, you find a loud and impolite dislike for Unicode. The Japanese want their ISO 2022 solution, aka shift-JIS.

    Have you ever tried to program in shift-JIS? It is horrific. Basically they mix one byte and two byte characters. The problem is that if you jump into the middle of the string there is no way to know if you are looking at a one byte character or the second byte of a two byte character. You also can't tell the number of characters in a string simply by looking at the length. It is a *terrible* standard.
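
    To make that concrete, here is a small Python sketch using CPython's `shift_jis` codec. The katakana character picked here is one commonly cited offender, because its second byte is the same value as the ASCII backslash.

```python
text = "\u30bd"  # katakana SO, a commonly cited problem character
encoded = text.encode("shift_jis")

print(encoded)                  # two bytes for one character
print(len(text), len(encoded))  # character count differs from byte count

# The second byte is 0x5C, which on its own is the ASCII backslash.
# Code that lands on it without scanning from the start of the string
# cannot tell a trail byte from a real one-byte character.
trail = encoded[1:2]
print(trail == b"\\")
```

    Any naive byte-oriented search for a backslash (path separators, escape processing) will match inside that character.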

  71. Re:After some skimming... by tytso · · Score: 2
    Now, I'm not Chinese so my opinion counts for little here, but my impression is that Unicode isn't nearly as controversial as he makes it out. His analogy "To express it in Western terms, how would English-speakers like it if we were suddenly restricted to an alphabet which is missing five or six of its letters because they could be considered "similar" (such as "M" and "N" sounding and looking so much like each other) and too "complex" ("Q" and "X" - why, they are nothing more than a fancier "C" and "Z")." ignores the fact that Chinese orthography has a tradition of simplification and variants. I suspect Unicode is a lot more upsetting to a "reference writer specializing in rare Taoist religious texts and medical works" than to ordinary Chinese users who want to run Photoshop or put their wedding pictures on a web page.

    Actually his original analogy was flawed and designed to yank people's chain... A better analogy of what's going on would be to say that the Germans and the French wanted to have their own Unicode code point for the letter "A", since obviously the German A is very different from the French A. Repeat for all the letters in the alphabet. The excuses for saying that the German "A" should have a different value than a French "A" are (a) The Germans and the French hate each other, and (b) French tend to use a sans-serif'ed font. When told by the standards committee that font issues were independent of Unicode assignment, the response was this was obviously anti-European imperialism....

    That's basically what's going on here with the folks who are complaining about Han Unification. Many Asian languages are descended originally from Chinese, just as many European languages are descended from Latin and Germanic roots. So it's not surprising that the systems of orthography share a lot in common. The difference is that each Asian country refuses to share any codepoints with any other Asian country, because They Hate Each Other, and there seems to be some widespread belief that doing so would somehow be causing their national language to lose face.

    As someone who's Chinese, I think I can safely say to those people who like to bitch and moan about Han Unification..... Grow up!

  72. Re:Alrighty by hernick · · Score: 2

    Have you ever seen an IME ? The program a Japanese person would use to enter their 10,000 characters ?

    You spell out the word phonetically, and press space as you complete each word - the computer will show possible kanji, and you can cycle through them with the space key.

    It actually works pretty well. Their keyboards pretty much look just like ours.

  73. UTF-8 should be fine for almost any application by AdamBa · · Score: 2
    The purists who want 4-byte characters go beyond just wanting to allow 50,000 Kanji or insisting that Japanese and Chinese Kanji with the same stroke pattern not share the same character. They want a separate character for the English lower-case 'e', the French lower-case 'e', the German lower-case 'e', etc. This is not at all necessary. YES, there may be some Kanji that fall out of use if the set listed in the Unicode standard becomes the only one used, but you have to counter that with the fact that suddenly these languages can have a universally-recognized way to encode them, as opposed to the 5 or whatever ways that previously existed to encode Japanese (which all had limited character sets anyway).

    UTF-8 is very nice because 7-bit characters encode as one byte. Also it is defined so there won't be a NULL or a hex 1B (decimal 27 -- the ASCII escape character) inside any multi-byte character, not even in its second or third byte. So it will generally be passed through correctly by programs expecting straight 8-bit ASCII. UTF-8 is also encoded and decoded via a trivial algorithm, as opposed to the DBCS used in Windows which needs lookup tables.

    One negative of UTF-8 is that Unicode characters at 0x0800 or above (using more than 11 bits) encode in UTF-8 as 3 bytes, not 2 as in Unicode. I think that range includes things like some Indian written languages. But I think that tradeoff is worth it.
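
    A quick way to check these properties, using Python's built-in UTF-8 codec. The sample characters and the helper names `utf8_len` and `multibyte_is_ascii_free` are just picked for this sketch; it shows where the one-, two-, and three-byte boundaries fall and that no byte of a multi-byte character lands in the ASCII range.

```python
# Sample characters chosen for illustration, with their code points.
samples = {
    "A": 0x41,     # ASCII
    "é": 0xE9,     # Latin-1 range
    "ب": 0x628,    # Arabic letter beh
    "अ": 0x905,    # Devanagari letter a
    "海": 0x6D77,  # CJK ideograph
}


def utf8_len(ch):
    """Number of bytes the character takes in UTF-8."""
    return len(ch.encode("utf-8"))


def multibyte_is_ascii_free(text):
    """True if no byte of any multi-byte character falls in 0x00-0x7F."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True


for ch, cp in samples.items():
    print(f"U+{cp:04X} -> {utf8_len(ch)} byte(s)")

print(multibyte_is_ascii_free("".join(samples)))
```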

    - adam

  74. Re:Solution - Everybody use Euro-English! by rkent · · Score: 2
    Oh come on, that's not so hard to read... remember, as Andrew Jackson said:

    "It's a damn poor mind that can think of only one way to spell a word!"

    ---

  75. Ideographics Have Their Place In English Too.... by EXTomar · · Score: 2

    Let's see...you used "10,000" and "100". Those are ideographic representations of "ten thousand" and "one hundred". There are hundreds of ideographs in common US English yet someone wants to harp on a language that uses ideographs for 95% of their written word?

    The point is that any character encoding should have been robust enough to encode any language used at any point in the history of mankind (okay...encoding things like Ancient Latin might be more academic than anything).

  76. Re:You bring up a good point by mrogers · · Score: 2

    Don't confuse the Latin alphabet with the English language! In Czech, the Latin alphabet (plus a few accents) is used phonetically.
    --

  77. Flamebait :) by phunhippy · · Score: 3

    Learn english.. 26 letters 10 numerals.. assorted punctuation.. ;)

  78. Re:After some skimming... by kevinank · · Score: 2

    Special letter forms don't need to be coded into unicode to be viewable. SVG, Postscript and other languages do a perfectly good level of presentation. So unless you can convince me that a Korean/Chinese person will be trying to do a word search through an historical Japanese/Taiwanese/Vietnamese document and will always inadvertently find the Korean ACK/Chinese SPOO when what he was really looking for was the Japanese FOOFLE/Taiwanese FLUM, I don't see the problem.

    Personally I can't understand why anyone in the world would want to search in a character set of more than 60,000 characters. I'd personally be pissed off if the UNICODE committee started adding special letter forms for US product trademarks (so they would render correctly) when as a user I'd rather just have them be findable.

    Really, the author needs to understand the use of the ALT tag.

    --
    LibBT: BitTorrent for C - small - fast - clean (Now Versio
  79. Technically Illiterate by tbray · · Score: 2

    This article is technically illiterate. UCS-2, which he references heavily, basically doesn't exist any more and hasn't for a while. UTF-8 and UTF-16 are perfectly adequate encodings each of which can handle all of the extended characters, up to a million or so in number (17 planes of 64k, to be precise).
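
    The "17 planes of 64k" figure works out as follows (plain arithmetic; the variable names are just for this sketch):

```python
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000  # 64k = 65,536

total = PLANES * CODE_POINTS_PER_PLANE
print(total)           # 1114112
print(hex(total - 1))  # 0x10ffff, the highest code point
```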

    He's correct that the ability to do computing in an Asian environment has lagged behind Western-language capabilities. However, as of Unicode 3.1 (in fact, as of Unicode 2), the support for what you need to do *business* computing has been pretty well there.

    The job of collating and organizing all the tens of thousands of characters required to handle the classical texts is under way but will take a while to finish. Then there's the really hard problem of building quality fonts to support all these things.

    But the title and premise are wrong. You can use Unicode on the net today just fine, lots of people are doing it, and anyone who builds a significant application today and *doesn't* build in support for international character handling is just out 'n' out stupid. It's not that hard.

    Cheers, Tim Bray (tbray@textuality.com)
  80. Re:After some skimming... by TheReverand · · Score: 2

    Then how do they sing danny boy?

  81. Re:Danny Boy? by TheReverand · · Score: 2

    Tell that to Tommy Makem

  82. 10100010100 by 4of12 · · Score: 2

    We Bynari take issue with this.

    With much grief and gnashing of teeth do we stoop to use this ill-conceived and bloated Latin based alphabet with 26 characters to respond to this bigoted viewpoint in a way that your feeble minds may understand.

    Our alphabet has exactly two letters.

    --
    "Provided by the management for your protection."
  83. too sinocentric, but Unicode has problems by rjh3 · · Score: 4

    Ah, the horrors of Unicode. The referenced article is too Sinocentric. Unicode's problems go further. Unicode is both a european solution to european problems and a european solution to asian problems.

    The Japanese hate Unicode. If you bother to ask them, which the web did not, you find a loud and impolite dislike for Unicode. The Japanese want their ISO 2022 solution, aka shift-JIS.

    The history of encodings is roughly:
    1. There was chaos.
    2. Then there was ASCII (the roman alphabet) pleasing to latin and english speakers.
    3. Then there were all the ISO 8859 and ISO 2022 encodings. These let all the european languages mix together with ASCII.
    4. Then Japan, Korea, and Vietnam define their own ISO 2022 encodings that make sense in the local language, and let these languages mix together with the european languages and ASCII.
    5. But ISO 2022 is a complex patchwork of special cases. So at the same time the Asians were inventing their ISO 2022 solutions, Unicode was being invented.
    Unicode 1.0 provided a viable solution to modern european languages, but could not encode historical documents or asian languages properly. The Unicode 2.0 effort fixed the historical european language problem by adding in the alphabets for these "dead" languages. Unicode 2.0 brought the asian encodings to the point where they were usable.

    Japan and Korea get no benefit from Unicode. In fact, their ISO 2022 encodings are at least in "alphabetical order" for the relevant alphabets. Unicode is just a jumble.

    Meanwhile China has a unique problem. They do not have an agreed alphabet. The Japanese all around the world agree on what characters define Kanji. There may be different fonts, but there is one agreed alphabet. Similarly, the Koreans and the Vietnamese have one agreed alphabet. These alphabets are huge, with thousands of characters, but they are fixed and agreed worldwide.

    China has not agreed on an alphabet. Different regions use different alphabets. Chinese speak numerous different languages and have invented an amazing alphabet that works as a single writing form for all those languages. But there are disagreements. Furthermore, some regions of China are still inventing new letters for the alphabet. It is not a fixed and stable thing like european alphabets. You can invent new letters. (These really are new letters, not just new fonts.)

    The Chinese have invented many encodings as a result. The two most popular (Big5 and GB2312) are not ISO 2022 compatible. There is a new, less widely used encoding that is a superset encoding of BIG5, GB2312, and other encodings, and that is ISO 2022 compatible.

    Unicode did not accept the approach of leaving all these alphabets as different. They share most of their glyphs. Giving each region and language its own complete section would have blown the 50K limit of Unicode 2.0. They smushed all these different alphabets into one blob by combining anything that had similar glyphs into one character.

    This left Unicode 2.0 telling the Chinese, ignore all those letters we don't like. You don't use them much anyhow. It destroyed any notion of alphabetic order in the encodings for any asian language. And it is usable for modern text communication. Unicode 3.0 promises to do better, and probably will.

    But since all these languages can use the ISO 2022 encodings with fully compatible mixture of languages, why not just use ISO 2022 and forget Unicode? The problem is the patchwork nature of ISO 2022. The encoding rules are complex. ISO 2022 is a terrible internal format. A chinese character may take from 2 to 9 bytes to encode. And it gets worse as you dig further. UCS-2 and UCS-4 are very nice friendly internal formats for computers. It is trivial to convert from UCS-2 or UCS-4 into UTF-8 for transmission.

    It is also pretty simple to translate from UCS-2 or UCS-4 into ISO 2022 encodings. So the ISO 2022 encodings actually can make sense for network transmission.

    These issues will just get worse as you include other languages, like historical chinese, chinese border languages, and south asian languages. As with chinese, some of these have the fundamentally hard problem that they do not agree on a single alphabet.

  84. Re:UTF8 by Mendax+Veritas · · Score: 2

    No, because the aliens are all so technologically and socially advanced that they've standardized on Esperanto.

  85. Wrong, wrong! by Mendax+Veritas · · Score: 4
    UCS-2 is not the only form of Unicode, and it's well known that 64k characters isn't enough. Besides, why should ordinary ISO-8859 (Latin-1) text be doubled in size by making every character 16 bits? UTF-8 is a much better solution, and it is good enough. Granted, string handling with variable-length characters is a bit of a pain (especially if you're used to assuming that a buffer of N bytes is long enough for a string of N characters, or you want to scan the string backwards), but it's the best solution we've got. It's the recommended encoding for XML documents, and is used today in web browsers (check out that "Always send URLs as UTF-8" option in Internet Explorer).

    It is a shame that there are so many different Unicode encodings. I think we ought to just standardize on UTF-8.

    1. Re:Wrong, wrong! by hackbod · · Score: 3

      People who think there is a problem with the number of different Unicode encodings -- including the authors of this article -- completely misunderstand how unicode works. The different encodings are -not- different character sets -- in fact, they are different ways to write the -same- standard Unicode character set. The transformation between UTF-8, UTF-16, and UTF-32 is only a simple bit manipulation -- it is completely independent of the character set.

      An implication of this is that UTF-8, UTF-16, and UTF-32 can all express the EXACT SAME NUMBER OF CHARACTER CODES. So, if you think UTF-32 is good enough for you, then UTF-16 and UTF-8 are just as good. The latter two simply use multi-word or multi-byte sequences to express the upper character values.
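
      A small sketch of that point using Python's standard codecs: the same four code points survive a round trip through each encoding, and only the byte counts differ. The sample string and the helper name `encoded_sizes` are chosen for illustration.

```python
# One character from each UTF-8 length class: U+0041, U+00E9, U+6D77, U+20000.
text = "A\u00e9\u6d77\U00020000"


def encoded_sizes(s):
    """Byte length of `s` in each encoding, after checking the round trip."""
    sizes = {}
    for codec in ("utf-8", "utf-16-le", "utf-32-le"):
        data = s.encode(codec)
        assert data.decode(codec) == s  # same code points come back
        sizes[codec] = len(data)
    return sizes


print(encoded_sizes(text))
```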

      After using BeOS for a number of years, where all character strings are natively handled as UTF-8, I am a very strong believer in Unicode. Having a Western perspective I may be missing something, but none of the "problems" mentioned in this article are actually problems that Unicode has.

      Of course, once you start using Unicode, the main problem you are going to run into is having fonts with the characters you need. And if the Chinese, Japanese, etc. really need 50,000 of their very own characters, then this is going to be that much more of a problem. Unfortunately, there is no easy solution to this -- but it doesn't have anything to do with the encoding you use, so changing to another encoding is not going to help here.

  86. Re:Hmm.. I must have been using something else the by ClarkEvans · · Score: 2

    And UCS-2 is not the only way to encode Unicode.

    You mean Unicode is not the only way to encode UCS-2. UCS-2 is a character set, unicode is an encoding of this character set.

  87. Re:Hmm.. I must have been using something else the by ClarkEvans · · Score: 2

    UTF-16 is used by some that needs to extend their UCS-2 applications to UTF-16.

    Whoa! UCS-2 is the character set. You can encode UCS-2 using either UTF-16 or UTF-8. Once again, Unicode is an *encoding* and UCS is the *character set*. Big difference and you seem to be reversing them.

  88. Language ID? by kreyg · · Score: 2

    being a 16-bit character definition allowing a theoretical total of over 65,000 characters. However, the complete character sets of the world add up to approximately 170,000 characters.

    So, add a byte or two per document as a language ID...

    Anybody feel like joining me at Milliways?

    --
    sig fault
  89. unicode does *not* encode 65,536 characters by egomaniac · · Score: 4

    It encodes over one million codepoints, actually (the erroneous statements of other posters notwithstanding). All currently assigned Unicode characters exist within the basic Unicode Plane 0, as it's called, which handles ~50,000 characters. Twenty-some-odd-thousand of those characters are in the CJK block (Chinese, Japanese, and Korean characters).

    Now, a range of Unicode characters is set aside for so-called "surrogates", and a high surrogate and a low surrogate character placed next to one another form a "surrogate pair" which specifies an extended character in UCS Plane 1. None of the UCS Plane 1 codepoints are actually assigned to anything yet, but since there are about 2^20 (~one million) Plane 1 codepoints, they will easily handle all remaining glyphs with a ton left over. Tengwar, Klingon and others have all been considered for Plane 1 encoding (although I just checked and Klingon has been rejected. Sorry folks).

    So, the simple fact is that anyone who says Unicode can't support enough characters has been smoking a bit too much crack lately. Do yourself a favor and go read the spec before getting your panties in a twist.

    --
    ZFS: because love is never having to say fsck
    1. Re:unicode does *not* encode 65,536 characters by vidarh · · Score: 2
      AFAIK, each plane is only 16 bit. For Unicode 3.1, for instance, the new characters are placed in planes 1,2 and 14. But you're right that Unicode as a whole encodes over a million codepoints.

      The "surrogate pair" method only applies to UTF-16 encoding, AFAIK. UCS-4 should be equivalent to UCS-2 with surrogate pairs, except that the codepoint is always encoded as a 32 bit value, whether or not a single 16-bit character or a pair of two 16-bit surrogates are used.
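
      For the curious, the surrogate arithmetic used by UTF-16 is short enough to sketch. The function names `to_surrogates` and `from_surrogates` are invented here; the result is cross-checked against Python's own UTF-16 codec.

```python
def to_surrogates(cp):
    """Split a supplementary code point into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points beyond the BMP need surrogates")
    offset = cp - 0x10000             # 20 bits remain
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


def from_surrogates(high, low):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x20000)
print(hex(high), hex(low))  # 0xd840 0xdc00

# Cross-check against the built-in codec (big-endian, no byte order mark).
expected = high.to_bytes(2, "big") + low.to_bytes(2, "big")
print("\U00020000".encode("utf-16-be") == expected)  # True
```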

  90. Unicode has this covered. by tjwhaynes · · Score: 3

    Had this researcher bothered to read the Unicode technical introduction, the following would have been obvious.

    In all, the Unicode Standard, Version 3.0 provides codes for 49,194 characters from the world's alphabets, ideograph sets, and symbol collections. These all fit into the first 64K characters, an area of the codespace that is called basic multilingual plane, or BMP for short.

    There are about 8,000 unused code points for future expansion in the BMP, plus provision for another 917,476 supplementary code points. Approximately 46,000 characters are slated to be added to the Unicode Standard in upcoming versions.

    The Unicode Standard also reserves code points for private use. Vendors or end users can assign these internally for their own characters and symbols, or use them with specialized fonts. There are 6,400 private use code points on the BMP and another 131,068 supplementary private use code points, should 6,400 be insufficient for particular applications.

    Plenty of room.

    Cheers,

    Toby Haynes

    --
    Anything I post is strictly my own thoughts and doesn't necessarily have anything to do with the opinions of IBM.
  91. This article is stupid by MrResistor · · Score: 2
    HTML 4 includes country codes so the browser knows how to interpret the Unicode character. Thus, the same 16 bit number will display a different character for an English document than it will for a Mandarin document.

    In other words, Unicode doesn't need to account for every single character in the world!

    But of course, this was posted on the internet, so it MUST be true...

    --
    Under capitalism man exploits man. Under communism it's the other way around.
  92. Misconceptions in article by HalfFlat · · Score: 3

    As a preliminary, Unicode and ISO 10646 aren't the same standard, but are kept pretty much in synchronisation. ISO 10646 provides a character set with a 4-byte representation, and a compatible smaller set with a 2-byte representation. These representations have encodings such as UTF-8, UTF-16, and UTF-32. UTF-32 encodes every Unicode character in 32 bits and can represent the full 2^31 codepoints, while UTF-8 and UTF-16 as described in the Unicode 3.1 document are variable length representations that can represent approximately 2,100,000 and 1,100,000 codepoints respectively.

    One of the design principles was to provide a lossless representation of any currently used character set in Unicode, so that a round-trip re-encoding of text from one encoding to Unicode and back again would lose no information. Another was to keep distinct code-points for any characters that had different semantics, or different 'abstract shapes'.

    It turns out that one can satisfy these requirements for the Japanese kanji, Chinese hanzi (traditional and simplified) and Korean hanja without requiring a separate code-point for each; in Unicode version 2.0, approximately 121,000 such characters were able to be represented in 20,902 code points. Note that those characters which have distinct shapes but the same meaning, and those which are similar enough to be classified as calligraphic variants but have distinct meanings, are all represented by distinct code-points. (One caveat: in practice there are some exceptions as regards the preservation of information after a round-trip encoding to Unicode and back. For example, the CCCII encoding of hanzi explicitly catalogues calligraphic variations, and as such doesn't map 1-1 onto Unicode.)

    Of course, the actual glyph that corresponds to one of these unified codes will change depending upon the context in which it is rendered. For example the character 0x6d77 corresponding to the character for sea in both Chinese (Mandarin 'hai3') and Japanese ('umi') is drawn with one fewer stroke in Japanese than in Chinese. These typographical details are important, but can (and debatably, should) be dealt with outside the context of character encoding. Unicode has support for language tags which in the absence of any higher-level information can indicate the language context of the characters following them. Typically though, this information should be stored as part of a richer document structure (as is possible in XML for example.) Correct display of characters will require the presence of the appropriate font and a mechanism (such as LOCALE in a simple one language case) for selecting this font.
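
    A rough sketch of keeping language context outside the character code: the text run carries the same code point either way, and a language tag held alongside it selects the font. The font names and the `pick_font` helper are placeholders invented for this example, not real fonts or a real API.

```python
SEA = "\u6d77"  # the unified code point for "sea"

# Hypothetical font table keyed by language tag; the names are placeholders.
FONT_FOR_LANG = {
    "ja": "ExampleMincho",
    "zh-Hant": "ExampleMing",
    "zh-Hans": "ExampleSong",
}


def pick_font(lang, default="ExampleFallback"):
    """Choose a font from the language tag, not from the character code."""
    return FONT_FOR_LANG.get(lang, default)


# Each run pairs a language tag (document structure) with its text.
runs = [("ja", SEA), ("zh-Hant", SEA)]
for lang, text in runs:
    print(f"U+{ord(text):04X} lang={lang} font={pick_font(lang)}")
```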

    Given this unification then, one really can fit most of the characters for which there are already extant (non-Unicode) encodings into 16 bits. With Unicode 3.1/ISO 10646-2 (which uses more than 65536 codepoints) this representation is AFAIK pretty much complete, including for example all of the hanzi of CNS 11643-1992 and CNS 11643-1986 plane 15 (the most complete hanzi encoding outside of CCCII.)

    With this in mind, one can argue against the points raised in the article:

    1. The unification scheme allows the representation of the 170,000 characters the author calculates in 70,000 or so codepoints, which it now does with Unicode 3.1. The use of external context is still necessary for correct rendering, but if the document has no structure for representing language context, there are Unicode language tags that can fill this role. Similarly, context would be required for the presentation of different calligraphic variants of Roman characters (e.g. fraktur.)
    2. Unification is quite unlike the analogy described 'in Western Terms'. 'M' and 'N' could not be identified, as they semantically distinguish words (e.g., 'rum' and 'run' have very different meanings.) Traditional characters and their simplified analogues are not identified under Unicode, so even if 'Q' were simply a fancier 'C' (which of course it is not), it wouldn't be given the same codepoint.
    3. Unicode is not limited to 16 bits as stated in the introduction to the article. There are over 2000 million available codepoints in UCS-4 and UTF-8, and UTF-16 can represent approximately 1 million of these. There is plenty of room - even in UTF-16 - to encode more characters as the need arises.
    4. With the exception of calligraphic variants in CCCII, Unicode can already faithfully represent characters in the major Chinese, Japanese and Korean character encoding standards.

    A little bit of research by the article author would have made the article unnecessary.

    References:
    Unicode 3.1 document;
    CJKV Information Processing, Ken Lunde.

    PS: In the time it took me to read the article, do some research and write this response, there have been over 300 slashdot comments. Wow.

  93. Re:Is this a problem? by autechre · · Score: 2

    Well, if we want to have the "standard" language be "Chinese", you'll first have to decide which one you want.

    China has 7 main dialects, according to my Chinese language class teacher. People in Shanghai speak a language that can almost be considered completely different than the one in Beijing. They use the same characters for writing, but use them to mean different things. At the very least, you have Mandarin and Cantonese.

    Also, while Chinese is a grammatically simple language (no conjugation, no pluralisation, etc.), it is less fun to write, because there is no alphabet. Yes, there is a different character for every word. Yes, there is a rhyme/reason to the characters, but that doesn't make it all that much less difficult to learn all of them. Oh, and you have to decide whether you want simplified or traditional Chinese characters to be the "standard", too.

    Finally, while the population of China is certainly the largest in the world, do they really have the most people _online_? I have no statistics, I'm actually curious.

    Sotto la panca, la capra crepa

    --
    WMBC freeform/independent online radio.
  94. Tengwar is SCRIPT not language by yerricde · · Score: 2

    rough translation of this Quenya Elvish phrase which is a derivative of the Tengwar elven language

    Script != language. The word tengwar is Quenya for "letters." Calling the tengwar script a language is like calling the cyrillic script (used for Russian), the katakana and hiragana scripts (used for Japanese), or the latin-1 script (used for many Western European languages) a language.

    Find Tolkien's tengwar and more in the conscript registry, which uses the 'private use' area of the Unicode space for scripts invented in modern times (all scripts are invented at some time or other). And there are "surrogate" codes in Unicode UTF-16 for a million additional code positions.

    --
    Will I retire or break 10K?
  95. Tengwar: Another alphabet designed on phonetics by yerricde · · Score: 2

    Whereas most phonetic alphabets consist of ideograms recycled as phonetic symbols, Hangul seems to be the only one to consist of symbols constructed purely for phonetic meaning.

    If you like hangul, you'll probably also like J.R.R. Tolkien's tengwar. Regular changes to the shapes of the consonants denote stop/fric/nasal and voiced/less. The structure of the script is such that unused letters (after t series, p series, and k series) can be used to represent sounds unique to a given language. It's available in both vowel-pointed (like devanagari and biblical hebrew) and vowel-letter (like greek/latin/cyrillic) modes.

    I'm not 100% sure about the legal status of a post-1923 script. Can a script be copyrighted or trademarked? Probably not. (Patents don't apply; it's been more than 20 years since the entire system was disclosed in RotK.)

    --
    Will I retire or break 10K?
  96. "Ye Olde" typo and Walt Disneþ^WDisney by yerricde · · Score: 2
    Old English, by the way, did have more letters than are found from modern english ("thorn" letter for "th", and couple of others).

    The letter thorn looks like Þ (U+00DE; Alt+0222; capital) or þ (U+00FE; Alt+0254; lowercase).

    Thus, "Ye olde ..." is a kind of a typo; the first letter wasn't Y, but was close enough visually that it started at some point to be thought to be Y...

    Except DisneyCo (famous for buying bad legislation) actually does the opposite: using þ instead of y in the corporate logo.

    --
    Will I retire or break 10K?
  97. Basic English by yerricde · · Score: 2

    if someone tried to remove redundancies from the English language such as pork and ham, or argue and dispute

    C. K. Ogden once did just this, reducing the English vocabulary to a set of 850 basic English words, but the result has (some foreigners claim too many) idiosyncratic idioms and turns of phrase.

    --
    Will I retire or break 10K?
  98. Likewise, for Latin-1 based languages... by yerricde · · Score: 2

    Likewise, in Unicode, English, German, and Finnish all share the same codepoints and glyphs, so you can't grep for one language or another without using META headers or something similar.

    For instance, if you were searching in English for "gift", this string in Unicode would be the same as the German characters for "poison" (Gift), so your search would get hits from other latin-based languages in addition to English.

    It's difficult even to sort Unicode correctly without choosing some language or another, due to this overlap of characters. "Alphabetical order" is a bit different for the different European languages, even though they use the same characters.

    Translation: Language collision can be avoided by exact phrase matching ("perpetual copyright" wouldn't return many matches for non-English documents) and specifying the natural language of a document either in the document or in the headers.
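
    On the sorting point, here is a deliberately simplified sketch of why the language has to be chosen first. The two key functions are rough approximations written for this example (Swedish puts å, ä, ö after z; German dictionary order treats ä like a); real collation would use a proper locale library.

```python
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"


def swedish_key(word):
    """Order letters by their position in the Swedish alphabet (simplified)."""
    return [SWEDISH_ALPHABET.index(c) for c in word.lower()]


GERMAN_FOLD = str.maketrans("äöü", "aou")


def german_key(word):
    """Fold umlauts to their base vowels, roughly as German dictionaries do."""
    return word.lower().translate(GERMAN_FOLD).replace("ß", "ss")


words = ["zon", "ärm", "apa"]
print(sorted(words, key=swedish_key))  # ['apa', 'zon', 'ärm']
print(sorted(words, key=german_key))   # ['apa', 'ärm', 'zon']
```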

    --
    Will I retire or break 10K?
  99. And Unicode distinguishes those. by yerricde · · Score: 2
    1. The western way "1"
    2. The common Chinese character
    3. The complex Chinese character used for legal documents, cheques (when not in English), etc
    4. The Chinese character used in markets

    So then, are they the same, or not? The answer is NO.

    Unicode would distinguish among these four forms because they are distinct characters, but it would not distinguish among similar forms of the SAME character. Unicode does not distinguish sans-serif from roman from italic from fraktur from monospace; that's the job of the stylesheet.

    To answer another common objection: When two characters look the same but are not the same, they are assigned separate code points. For instance, Latin capital letter A, Greek capital letter Alpha, and the Cyrillic equivalent look exactly the same. Chinese 'yi' (one) is the same character with the same origin as Japanese 'ichi' (one), but it is not the same character as the hyphen, which in turn is not the same character as the em-dash.
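
    This is easy to see with Python's `unicodedata` module, which reports the official name of each code point. The list of characters is just a sample picked for this sketch.

```python
import unicodedata

lookalikes = [
    "A",       # Latin
    "\u0391",  # Greek
    "\u0410",  # Cyrillic
    "-",       # hyphen-minus
    "\u2014",  # em dash
    "\u4e00",  # CJK "one", shared by Chinese and Japanese
]

for ch in lookalikes:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```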

    --
    Will I retire or break 10K?
  100. UTF-8 by yerricde · · Score: 2

    As long as C programs have to be written in ASCII

    Supporting UTF-8 variable names as an extension to C and to C++ would not break any standard because, by definition of UTF-8, any valid ASCII string equals its UTF-8 representation.
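
    The ASCII/UTF-8 equivalence can be checked directly. This sketch is in Python rather than C; the sample source line and identifier are made up, and the last check only shows that Python's own identifier rules already accept non-ASCII names.

```python
source = "int main(void) { return 0; }"

# Any pure-ASCII text has byte-for-byte the same UTF-8 encoding.
same_bytes = source.encode("ascii") == source.encode("utf-8")
print(same_bytes)  # True

# A non-ASCII name only changes the bytes of the non-ASCII letters.
identifier = "größe"
print(identifier.encode("utf-8"))
print(identifier.isidentifier())  # True
```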

    english will be the standard

    Programming languages use English as the standard for keywords because more programming language designers can speak English than any other language.

    Limit use of 'to be' verbs to add power to your English.

    --
    Will I retire or break 10K?
  101. "Extended ASCII" misnames ISO-8859-1 by yerricde · · Score: 2

    Perhaps not, but there is such a creature as "Extended ASCII".

    Say not "extended ASCII" or "high ASCII" but "ISO-8859-1" or "ISO Latin-1." Latin-1 happens to use the same characters at codepoints 00 to 7f as ASCII, but that of itself does not make it ASCII. Unicode uses the same characters at codepoints U+0000 to U+00FF as Latin-1, but...
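
    A short check of both statements with Python's standard codecs (variable names are just for this sketch): every Latin-1 byte decodes to the Unicode code point with the same number, while ASCII proper stops at 0x7F.

```python
all_bytes = bytes(range(256))

# Latin-1 maps byte N to code point U+00NN for every byte value.
decoded = all_bytes.decode("latin-1")
latin1_matches_unicode = [ord(c) for c in decoded] == list(range(256))
print(latin1_matches_unicode)  # True

# ASCII covers only the lower half; 0xE9 ("é" in Latin-1) is not ASCII.
try:
    bytes([0xE9]).decode("ascii")
    high_byte_is_ascii = True
except UnicodeDecodeError:
    high_byte_is_ascii = False
print(high_byte_is_ascii)  # False
```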

    --
    Will I retire or break 10K?
  102. Re:You bring up a good point by joto · · Score: 2
    This of course depends on what you mean by a simple writing system. If by simple you mean only "has few letters" then English is about as simple as it can be. However, english spelling is extremely idiosyncratic, there are no simple rules to follow, and almost every word is spelt in some not entirely logical way.

    If you choose this view, then yes, most european languages have much more logical spelling than english. One exception might be french, which is not at all written as it is spoken, although there is admittedly a system to it.

    Accent marks and diacriticals don't make the writing system more difficult, they simply make it possible to write more phonetically. I would prefer the writing systems of german, danish, norwegian or swedish any day before english. I don't know any east-european languages, but I would be very surprised if most of the accents and diacriticals weren't there for a good reason, and I doubt they can be much worse than english.

    From what little I know of russian, it has a very simple writing system that is even clearer and simpler than e.g. german, danish or norwegian.

    On the other hand, if someone makes a truly simplified and logical spelling of the english language popular, e.g: "I thought my bones were breaking during the fight" -> "Ai thokt mai bowns wer breiking diuring the fait", it could eventually become as simple as most other european languages (or those written with the cyrillic character set).

    Of course, most languages have some kind of idiosyncrasies when it comes to spelling, but english is certainly not among the easiest. And the few added letters in some european languages are laughable. German adds a few umlauts and ß, danish adds æ and ø, norwegian adds æ, ø and å, swedish adds å, ä and ö, and so on... No big deal! Besides, none of the above mentioned languages makes any use of x or z except in foreign words. Scandinavian languages never use w except in foreign words. The same is true for c in norwegian. So the letter count is mostly similar, as is true for cyrillic.

  103. Re:Bummer by joto · · Score: 2

    Yeah, that would be really useful. Only spammers would know how to copy and paste your email-address, while ordinary people you tell your email-address can't type it...

  104. UTF8 by Srin+Tuar · · Score: 2
    UTF8 is capable of encoding up to 31 bits per character, which is 2,147,483,648 distinct glyphs. This should be plenty for all languages, and at least for linux/*nix, it is well recognized as the way to go.

    One upside of it is that there is almost no cost for english/ascii, which will remain 1 byte per character. You don't even have to recompile most apps to support it- only those that format character glyphs.
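
    The 31-bit figure comes from the original UTF-8 design, which allowed sequences of up to six bytes. A sketch of the length rule under that original scheme (the function name is invented here; note that the current definition in RFC 3629 stops at four bytes and U+10FFFF, which is also what Python's codec enforces):

```python
# Largest value that fits in a sequence of 1..6 bytes in the original scheme:
# 7, 11, 16, 21, 26 and 31 payload bits respectively.
LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]


def utf8_length_original(value):
    """Bytes needed for `value` under the original 1-6 byte UTF-8 scheme."""
    if value < 0:
        raise ValueError("value must be non-negative")
    for nbytes, limit in enumerate(LIMITS, start=1):
        if value <= limit:
            return nbytes
    raise ValueError("value needs more than 31 bits")


print(2 ** 31)                           # 2147483648
print(utf8_length_original(0x41))        # 1
print(utf8_length_original(0x7FFFFFFF))  # 6
```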

  105. You bring up a good point by Srin+Tuar · · Score: 2
    Does anyone know a real language that has a simpler writing system than english?

    Almost every other european language I have seen uses some set of accent marks or diacriticals. And having studied japanese and vietnamese, they have orders of magnitude more complexity. Even esperanto has a larger alphabet than english.

    Is it just a coincidence that the simplest writing system was the first to be digitized? Too bad pronunciation of english isn't equally simple.

    1. Re:You bring up a good point by SpeelingChekka · · Score: 4

      Does anyone know a real language that has a simpler writing system than english?

      Spoken like a true English-is-my-home-language person. English is NOT a simple language by any means; ask any foreigner who has learned English. Almost every rule in English has several exceptions, and many things in English cannot be deduced from rules; they must simply each be learned, and there are hundreds of these. Pronunciation is ridiculous, which you've mentioned, but apart from pronunciation there are grammar, spelling, plural forms, tenses and possessive forms, and all of these have strange nuances in English. The plural of dish is dishes, but the plural of fish is fish - sorry, no rule you can deduce that from, you must just learn that. The past tense of "hang" depends on what is getting hung/hanged. The rule says "add an apostrophe s" for possessive form, but of course there are exceptions, e.g. "it", "her" etc., or when the subject is a plural already, then you add an apostrophe but no "s". And the rules for when something is a plural "are" not always clear (and thus even educated people often aren't sure whether to use "are" or "is"). "Bananas are nice" is easy, but "A bunch of bananas" or "a group of individuals", are these plural or singular? And the examples get more and more complex. And there are obscure rules such as '"their" may be used in place of "his/her"'. And there are so many exceptions to rules like "i before e except after c", rules which even many educated people sometimes struggle to remember. I can name many University educated adults with English as their first language who still don't even know the difference between "lend" and "borrow" - that says something about the language.

      I'm glad English is my home language, but I feel sorry for foreigners who have to learn English as a second language.

      Is it just a coincidence that the simplest writing system was the first to be digitized

      Yes, actually, it is. ASCII was probably the first wide-scale character set standard used in computing - what does the "A" stand for?

  106. (reply to AC) by Srin+Tuar · · Score: 2
    Wrong, look here: unicode faq

    quote: All possible 2^31 UCS codes can be encoded.

  107. Unicode and CJK Characters by torokun · · Score: 3
    There are some good comments here, clarifying why this article is fundamentally wrong in its assumption that Unicode only encodes 2^16 characters. This is the first reason why this article is wrong.

    The other reasons are more subtle, and I'm not sure that everyone here understands what's going on with CJK characters, so here's a little background.

    The characters we're talking about originated in china, and spread to Korea, Vietnam, and Japan. Vietnam has switched to a western alphabet now, so let's leave them out. ;) At one point, although there have always been alternative forms for some characters, there was a reasonably standard set of Chinese characters used throughout these countries (recorded in the KangXi dictionary)...

    The Japanese invented a number of their own characters, which I'm sure number less than 1000. Up until World War II, this was basically the situation. (So at this time, the required number of characters to encode would have been less than 50,000 -- Chinese characters and Japanese additions.) Then all hell broke loose, so to speak.

    The Japanese simplified a large number of their characters systematically, immediately following WWII ( So they started substituting simpler characters for the disallowed ones in these compounds, and thereby subtly changed the meaning of the words.

    On to China -- they also began a campaign of character simplification, which would span quite a few years, although theirs was much more radical than the Japanese approach. In fact, some of the simplified versions the government came out with were so repulsive, they were eventually retracted because everyone refused to use them. ;) So they ended up with a few thousand ( Finally, Korea, Taiwan, and Hong-Kong basically kept the traditional chinese characters.

    So, that gives us the basic 40,000, plus 3000 Japanese (kokuji and shinjitai), plus maybe 10,000 chinese (jiantizi), plus some other stuff not mentioned here, giving a grand estimate of around 55,000.

    The key to this is that the vast majority of characters used are common among all 5 locales. This was the only reason that anyone even attempted to encode the CJK characters in the first place. The re-unification of all the disparate character sets was called Han-Unification during the Unicode development process.

    This, combined with the surrogate encoding area, ensures that there will be plenty of space for everyone... :)

  108. Helping the poor by peccary · · Score: 2

    One way to help out those who are poor is by opening them up to the modern economy and making it as accessible as possible.
    One way to do this is by making sure they possess the knowledge and skills of the modern economy. One of those skills is the dominant language. If you want to be rich, you learn to act, speak, think, like the rich people. Preserving the "native language and culture" is the province of romantic idealists.

    Don't go calling me a cultural imperialist, now. I actually read, speak, and write three languages, and could easily add a couple of more. I love the differentness of distant cultures. I am a "romantic pragmatist." I would love to see this differentness preserved, but I recognize that its passage is inevitable. The fact is that all these languages and cultures sprang up because the world was so vast. Groups of people were totally isolated from each other.

    The "western" world isn't that large anymore -- it's actually smaller than it's ever been. When Alexander the Great ruled the world, it was months from one end to the other. Now, the western world is maybe a day from one end to the other.
    The natural circumstances under which those languages arose simply do not exist any longer. They are fish out of water, and they must naturally pass on -- it's just the way of things.

    There may be room for many different languages when the human race colonizes the solar system, but I suspect that even then, the communications delays will be low enough that a single culture will be maintained, more or less.

  109. Cultural Heritage is important! by peccary · · Score: 3

    I mean, imagine how much poorer you would be if you had been unable to read the epic poems of early Anglo-Saxon culture in their original form! Or the early Judaic and Greek writings on which much of our more recent culture is based.

    You *have* read Beowulf, and the Canterbury Tales, haven't you? Along with Plato's Republic in Greek, and the Dead Sea scrolls?

    Now imagine how hard this would be if your computer didn't support the full character set in which they were written.

  110. Ignorant nonsense. by fm6 · · Score: 2
    I'm repeating what other posters have already said, but I think it's worth boiling down the basic issue.

    The simple fact is these guys are totally ignorant. They confuse a particular 16-bit implementation of the Unicode "basic plane" with Unicode itself. If they'd done any research at all, they'd know that there are 16 planes, with support for about 1 million characters. Plus there are some "private spaces" so people can create their own extensions of Unicode. There's already the ConScript registry (which supports Shavian and Klingon).

    I'm reminded of people who thought computers would never catch on because keypunches were too bulky.

    Another ignorant assertion: that 1.5 billion people "speak" Mandarin. Mandarin is the standard dialect of Chinese, but only about 800 million people actually speak it.

    __

  111. Re:another drawback of unicode by kurisuto · · Score: 2

    There is in fact a group working on Unicode encodings for the Egyptian hieroglyphic character set. The codes will go in the range of Unicode that is reached through "surrogate characters". Regular Unicode uses the codes from 0 thru 2^16-1; the range above that, from 2^16 thru 0x10FFFF, is addressed with surrogate pairs and has been designated by the Unicode Consortium for exactly this kind of case, i.e. large, rarely used character sets.

  112. Re:Unicode includes all common Asian character set by kurisuto · · Score: 2
    To the contrary, Unicode 3.0 does include the Germanic runes.

    I don't see a need for special software to display runes. It's just a matter of having a font architecture which allows you to create and install a font for an arbitrary subrange of the Unicode space.

  113. Re:Define two unicode escape chars = 196000 chars. by kurisuto · · Score: 2
    Your solution is roughly functionally equivalent to the UTF-8 encoding of Unicode. UTF-8 is a way of representing the Unicode character space. It has the following properties:

    • Every character in the U0000-U007F range is represented by a single byte which happens to correspond to the ASCII code for the same character (so purely ASCII text is identical to the same text in UTF-8).
    • Characters outside the U0000-U007F range are represented by two, three, or four bytes.
    • You can tell what kind a given byte is by examining its high bits (it's either in the ASCII range, or is the first byte of a multi-byte encoding, or is a non-initial byte of a multi-byte encoding).
    Since UTF-8 is backward-compatible with ASCII, it is becoming widely accepted.
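
    As an illustration of the third property, here is a minimal sketch that classifies bytes purely by their high bits. The function name byte_kind is invented for this example; it only looks at bit patterns and does not check that a sequence is valid UTF-8:

```python
def byte_kind(b: int) -> str:
    """Classify one byte of UTF-8 text by its high bits alone."""
    if b < 0x80:
        return "ascii"          # 0xxxxxxx: a complete one-byte character
    if b < 0xC0:
        return "continuation"   # 10xxxxxx: non-initial byte of a sequence
    return "lead"               # 11xxxxxx: first byte of a multi-byte sequence


sample = "a\u00e9\u20ac".encode("utf-8")   # 'a', e-acute, euro sign
print([byte_kind(b) for b in sample])
# ['ascii', 'lead', 'continuation', 'lead', 'continuation', 'continuation']

# Purely ASCII text is byte-for-byte identical in UTF-8.
print("GET / HTTP/1.1".encode("ascii") == "GET / HTTP/1.1".encode("utf-8"))  # True
```

    Because every byte announces its own role, a program can resynchronize from any position in a stream without knowing what came before.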
  114. Overstating and misunderstanding the problem by kurisuto · · Score: 4
    This article mischaracterizes the issue concerning the Chinese characters. To take a western example as an illustration, the number one is handwritten in America as a vertical stroke, but in Germany as an upside-down V. However, folks in America and Germany agree that this is "the same character"; we simply have a different way of writing it. Unicode recognizes this sameness by assigning the same code to the character "one"; the way to display it locally is a presentation issue, not an encoding one.

    This is exactly the issue with the Chinese characters. For a given character, there might be a difference between the Taiwanese way of writing it, the Japanese way, and the mainland Chinese way; but the character is still recognized as being the same, despite these presentation-level differences.

    For someone to demand that each national presentation form have its own character code is to misunderstand what Unicode is designed for. It encodes abstract characters, not presentation forms. Unicode does not have separate codes for "A" in Garamond and "A" in Helvetica.

  115. Re:Solution - Everybody use Euro-English! by Skuto · · Score: 2

    a) This is (adapted?) from Mark Twain, it's in
    most fortunes.

    b) No matter how funny it looks, if you read
    it aloud it's perfectly understandable...

    c) ...but it keeps reminding me of 'Allo Allo?'

    --
    GCP

  116. Solution - Everybody use Euro-English! by saider · · Score: 4

    The European Commission has just announced an agreement whereby English will be the official language of the EU rather than German which was the other possibility. As part of the negotiations, Her Majesty's Government conceded that English spelling had some room for improvement and has accepted a 5 year phase-in plan that would be known as "Euro-English".

    In the first year, "s" will replace the soft "c". Sertainly, this will make the sivil servants jump with joy. The hard "c" will be dropped in favour of the "k". This should klear up konfusion and keyboards kan have 1 less letter.

    There will be growing publik enthusiasm in the sekond year, when the troublesome "ph" will be replaced with "f". This will make words like "fotograf" 20% shorter.

    In the 3rd year, publik akseptanse of the new spelling kan be ekspekted to reach the stage where more komplikated changes are possible. Governments will enkorage the removal of double letters, which have always ben a deterent to akurate speling. Also, al wil agre that the horible mes of the silent "e"s in the language is disgraseful, and they should go away.

    By the fourth year, peopl wil be reseptiv to steps such as replasing "th" with "z" and "w" with "v". During ze fifz year, ze unesesary "o" kan be dropd from vords kontaining "ou" and similar changes vud of kors be aplid to ozer kombinations of leters.

    After zis fifz yer, ve vil hav a reli sensibl riten styl. Zer vil be no mor trubl or difikultis and evrivun vil find it ezi to understand ech ozer. Ze drem vil finali kum tru!


    --


    Remember, You are unique...just like everyone else.
  117. Re:After some skimming... by gea · · Score: 2

    Consider English literature and ASCII. If you look at a reproduction of Beowulf in the original Old English, you find lots of characters that aren't present in ASCII. That doesn't mean ASCII is worthless, and it doesn't mean anyone had to accept restricted access to literature. It just means there was room for improvement because ASCII wasn't suitable for all purposes.

    The Unicode designers got bogged down trying to create an encoding suitable for every possible purpose. If the goals had been more modest, say to allow Chinese language URLs, there would have been faster ways to go about it.

  118. ISO-2022-JP and "alphabetical order" by achurch · · Score: 4

    >>Japan and Korea get no benefit from Unicode. In fact, their ISO 2022 encodings are at least in "alphabetical order" for the relevant alphabets. Unicode is just a jumble.

    I can't speak for Korean, but there is no such thing as an alphabetic order for Kanji. In Japanese, Kanji almost always have at least two pronunciations, and often more.

    While it is true that most all kanji have multiple pronunciations, the kanji in ISO-2022-JP are most definitely in order. Level 1 characters (0x3021-0x4F7E) are ordered by their primary reading, and Level 2 characters (0x5021-0x7426?) are ordered first by radical and then by number of strokes. In both cases it's easy to locate a character if for some reason you can't type it normally (e.g. it's not in your IME dictionary)--I've had to do this on occasion, in fact.

    Unicode is, for all intents and purposes, completely random. Even without the problems of characters being inappropriately merged, there is no way you could try and find a character in Unicode; if your dictionary doesn't have it, tough luck. To me, that's an even scarier concept: for all practical purposes it could eliminate characters from the language. After all, if nobody can type it who's going to use it?

    Have you ever tried to program in shift-JIS? It is horrific.

    I will agree with this. Leaving aside the original poster's confusion of ISO-2022-JP and shi[f]t-JIS (the former is the official standard, aka JIS, while the latter is a poorly-thought-out Microsoft hack), dealing with strings that contain both half-width (1-byte) and full-width (2-byte) characters is a major PITA. About the only thing that can be said for it is the number of bytes is equal to the number of half-width character positions needed; and even that only applies to EUC and SJIS, since JIS has escape sequences to squeeze everything into 7-bit characters.
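
    A small sketch of that byte-count observation, using the euc_jp, shift_jis and iso2022_jp codecs that ship with Python. The sample string is an arbitrary choice with three full-width and three half-width characters:

```python
# Three kanji (full-width, 2 columns each) plus three ASCII letters (1 column each).
text = "\u65e5\u672c\u8a9eabc"
columns = 2 * 3 + 1 * 3          # half-width character positions needed: 9

euc = text.encode("euc_jp")
sjis = text.encode("shift_jis")
jis = text.encode("iso2022_jp")

print(columns, len(euc), len(sjis))   # 9 9 9
print(len(jis) > columns)             # True: escape sequences add bytes
print(b"\x1b" in jis)                 # True: ESC switches between character sets
print(all(b < 0x80 for b in jis))     # True: everything stays in 7 bits
```

    For this string the EUC and SJIS lengths match the column count, while the 7-bit JIS form pays for its escape sequences.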

    On the other hand, there's the character order consideration, which along with the problem of merged characters seems to be what draws so much dislike for Unicode from Japanese.

    --
    BACKNEXTFINISHCANCEL

  119. Re:I had no trouble reading that at all by tswinzig · · Score: 2

    Naturally you'd have to do something about homonyms (I'll sounds just like aisle, anyway). Probably best to just work around them.

    Ill is not a homonym for I'll. You're talking about how things sound, and the discussion centers on how English LOOKS. :-)

    --

    "And like that ... he's gone."
  120. After some skimming... by update() · · Score: 4
    I planned to read this through before posting. I really did. But then, in the second paragraph I hit:
    Wieger's seminal book about the characters and construction of China, published in 1915, was to become the defacto source against which all others would (and still should) be compared - with several caveats. Amongst these is a noticeable bias on his part against Taoism which becomes more evident in his analysis of the Tao Tsang (i.e., Taoist Canon of Official Writings [written 'DaoZang' in the PinYin Romanization of Mainland China] )
    and I decided to skim the rest.

    To summarize, for those whose eyes completely glazed over, his point is that Unicode doesn't sufficiently cover the full range of Chinese characters and that not using a larger set is a result of a longstanding Western prejudice that the Chinese don't need so many characters.

    Now, I'm not Chinese so my opinion counts for little here, but my impression is that Unicode isn't nearly as controversial as he makes it out. His analogy "To express it in Western terms, how would English-speakers like it if we were suddenly restricted to an alphabet which is missing five or six of its letters because they could be considered "similar" (such as "M" and "N" sounding and looking so much like each other) and too "complex" ("Q" and "X" - why, they are the nothing more a fancier "C" and an "Z")." ignores the fact that Chinese orthography has a tradition of simplification and variants. I suspect Unicode is a lot more upsetting to a "reference writer specializing in rare Taoist religious texts and medical works" than to ordinary Chinese users who want to run Photoshop or put their wedding pictures on a web page.

    Unsettling MOTD at my ISP.

    1. Re:After some skimming... by bmongar · · Score: 2

      Of course the display sets could develop diphthongs, i.e. more than one unicode char to represent a Chinese character. Part of the problem with the Chinese character set is that it is not a character set so much as a dictionary, with words having only one character, and no restriction on adding new ones. So don't make a single character for each of them, use letter combinations. Of course that is my western bias.

      --
      As x approaches total apathy I couldn't care less.
  121. In other news... by ackthpt · · Score: 3

    Bush bolts GOP to join Democrats, fires entire Whitehouse staff

    Linus Torvalds to join Microsoft as OfficeXP advocate

    NASA on Moonshots, "Ok, ok, they were all actually faked on a soundstage in Toledo, Ohio and the ISS is really in a warehouse in Newark, New Jersey"

    Oracle CEO, Larry Ellison to give fortune to charity, dumps japanese kimonos for Dockers and GAP T-shirts

    RIAA to drop all charges against Napster, "All a big fsck-up, we'll all get rich together"

    Taiwan throws in towel, joins PRC, turning over massive US military and intelligence assets

    Rob Malda signed by Disney, epic picture planned, based upon this short. Sez Malda, "Anime's not mainstream enough anyway."

    --
    All your .sig are belong to us!

    --

    A feeling of having made the same mistake before: Deja Foobar
  122. So... by ackthpt · · Score: 4
    4av3 3v3r0n3 1n t4e w0r1d 13arn t0 typ3 l33t!

    --
    All your .sig are belong to us!

    --

    A feeling of having made the same mistake before: Deja Foobar
  123. ASCII stupidity all over again... by Matthias+Wiesmann · · Score: 2

    It's not new, and alas not surprising.

    When they did ASCII, it was a standard by the US, for the US. The mess it created in the high-ascii range (128-255) is still not resolved, and I'm talking about diacritical characters like those used in western european languages (French, German, Spanish etc...), nothing fancy or very exotic. The problem was, of course, that the europeans were not involved in the process.

    Now they do a universal standard that should correct all problems and, surprise, they don't actually bother to check with the people involved. Even if they did, it would make sense to have provisions for a few unknown character sets (like ancient civilisations or the myriad of small groups of people living in lost parts of the world).

    Anyway, if computer history has taught us something, it is that a 16bit range is never sufficient for practical uses. Well, just another sad example of one size does not fit all... But I suppose the slashdot response will be - why the hell don't they all speak/write english...

    1. Re:ASCII stupidity all over again... by vidarh · · Score: 2

      Get your facts straight. Unicode isn't written in stone. It is an evolving standard. And one of the reasons it is taking so long is precisely because everyone affected can get involved - there's been a lot of infighting about which glyphs should make it and how to organize them. The result, however, is that most commonly used scripts can be handled by the current version of Unicode. More will most likely be handled in the future.

  124. Alrighty by rabtech · · Score: 2

    The guy obviously has an anti-western mindset.

    But to simplify, the crux of his argument seems to be that in order to read ancient works from the Chinese/Japanese/etc, they need about 40,000 to 50,000 characters each.

    But in reality, the average Japanese person would use less than 10,000 characters. In fact, probably much less.

    Besides -- it is mostly a moot point until you can show me a keyboard capable of entering 50,000 unique symbols efficiently.

    His solution seems to be allocating 32-bits of storage per character, rather than the 16-bit Unicode standard we have now.

    For the foreseeable future, it would seem that Latin-esque alphabets have the upper hand. It just makes more sense, especially in terms of programming and protocols. Do we really need web servers that understand how to read "GET / HTTP/1.1" in thirty different character sets?


    -- russ

    --
    Natural != (nontoxic || beneficial)
    1. Re:Alrighty by vidarh · · Score: 2
      Input methods for Chinese, Japanese and Korean exist, and can efficiently handle the number of characters required. Some do it by typing out the romanized sound, and mapping it to the characters.

      And actually, the "Unicode standard we have now" does not fit in UCS-2 (16 bit). It requires one of the UTF-* encodings (which are variable length encodings), or UCS-4 (32 bit).

      As for his gripes about Unicode 3.1, sure, there are things you can't write with it. But it's a good step forward. And it doesn't fill the entire glyph-space, by far. The 32 bit encodings, because of the way they are arranged, can "only" handle about a million characters if I remember correctly, but that is still way more than is needed.
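
      To make the "does not fit in 16 bits" point concrete, here is a hedged sketch of the surrogate-pair arithmetic that UTF-16 uses for code points above U+FFFF. The helper name to_surrogates is invented here; U+20000 is picked as an example because it lies in the CJK extension added in Unicode 3.1:

```python
def to_surrogates(cp: int) -> tuple:
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000                   # 20 bits of payload
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return high, low


cp = 0x20000
print([hex(u) for u in to_surrogates(cp)])     # ['0xd840', '0xdc00']
print(chr(cp).encode("utf-16-be").hex())       # d840dc00
print(len(chr(cp).encode("utf-8")))            # 4
print(len(chr(cp).encode("utf-32-be")))        # 4
print(1024 * 1024)                             # 1048576 code points reachable via pairs
```

      The 1024 x 1024 combinations of high and low surrogates are where the "about a million" figure comes from.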

    2. Re:Alrighty by vidarh · · Score: 2
      Do you use Linux? Try starting "kterm" or similar. If you're using Redhat and Gnome you'll likely find it under "System" in the program menu as "Kanji terminal". Try holding down alt and pressing a couple of character combinations.

      You don't need a special keyboard.

  125. No, _n_ bytes per character! by The+Monster · · Score: 3
    Sort of. You define a 32-bit space for now, then use something like UTF-8 to encode it.

    Personally, I think UTF-8 is just a wee bit inefficient. I worked out a scheme long ago that defines a theoretically infinite namespace, and encodes 7-bit ASCII exactly the same as it is now. If anyone cares, it's as simple as this:

    A "character" is defined as a sequence of bytes ("octets" for the RFC-phile) that ends with a value which has the most-significant bit clear. (If you treat the byte as signed, this means nonnegative; if unsigned, it's < 128, whichever test you'd prefer to code. I have my preference...)
    This gives 2^(7 * n) possible characters of length n:
    1. 128.
    2. 16,384, cumulative 16,512.
    3. 2,097,152, cumulative 2,113,664.
    4. 268,435,456, cumulative 270,549,120.
    5. 34,359,738,368, cumulative 34,630,287,488.
    6. 4,398,046,511,104, cumulative 4,432,676,798,592.
    7. ...
    As you can see, 3 bytes allow encoding that covers pretty much every estimate I've seen here.
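
    A minimal sketch of the parsing rule and the counts above. Only the termination rule comes from the description; the function names are invented for illustration, and no mapping from byte sequences to actual characters is assumed:

```python
def split_characters(data: bytes) -> list:
    """Split a byte stream into characters: each ends at a byte with MSB clear."""
    chars, current = [], bytearray()
    for b in data:
        current.append(b)
        if b < 0x80:                 # most-significant bit clear: end of character
            chars.append(bytes(current))
            current.clear()
    if current:
        raise ValueError("stream ends in the middle of a character")
    return chars


def count_of_length(n: int) -> int:
    """Number of distinct characters that are exactly n bytes long."""
    return 2 ** (7 * n)


print(split_characters(b"A\x81\x02\xff\xfe\x7f"))
# [b'A', b'\x81\x02', b'\xff\xfe\x7f']
print(count_of_length(3))                              # 2097152
print(sum(count_of_length(n) for n in range(1, 4)))    # 2113664
```

    An agent that knows only this rule can still skip over characters it cannot display, which is the property the scheme is after.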

    The system can be arbitrarily extended any time it's necessary, and existing agents that understand the fundamental rule would know how to parse these extended characters; although they would not know how to present the characters, they would be able to present an appropriate token indicating this fact, rather than displaying gibberish composed of the 8-bit "ascii" encoding they do understand.

    --

    [100% ISO 646 Compliant]
    SVM, ERGO MONSTRO.

  126. A tough problem... by RareHeintz · · Score: 2
    This is a problem that the Chinese gov't has realized in the past, and the development of the Pin-Yin phonetic romanization system was originally started with an eye toward phasing out the (admittedly more cumbersome, but significantly more beautiful) ideograms. (Of course, they had no idea about the Unicode issue back then, but I'm speaking of the larger issues of having a huge, ideogrammatic written language, of which the Unicode problem is just a new manifestation.)

    I don't know where these plans for conversion to a phonetic written language stand now, though I'm sure it wouldn't be hard to find out.

    OK,
    - B
    --

  127. Unicode's reply by roozbeh · · Score: 4

    It's probably too late, but following is a response from one of the editors of the Unicode Standard:

    Dear Mr. Carroll,

    I have just finished reading the article you published today on the Hastings Research website, authored by Norman Goundry, entitled "Why Unicode Won't Work on the Internet: Linguistic, Political, and Technical Limitations."

    Mr. Goundry's grounding in Chinese is evident, and I will not quibble with his background East Asian historical discussion, but his understanding of the Unicode Standard in particular and of the history of Han character encoding standardization is woefully inadequate. He makes a number of egregiously incorrect statements about both, which call into question the quality of research which went into the Unicode side of this article. And as they are based on a number of false premises, the article's main conclusions are also completely unreliable.

    Here are some specific comments on items in the article which are either misleading or outright false.

    Before getting into Unicode per se, Mr. Goundry provides some background on East Asian writing systems. The Chinese material seems accurate to me. However, there is an inaccurate statement about Hangul: "Technically, it was designed from the start to be able to describe *any sound* the human throat and mouth is capable of producing in speech, ..." This is false. The Hangul system was closely tied to the Old Korean sound system. It has a rather small number of primitives for consonants and vowels, and then mechanisms for combining them into consonantal and vocalic nuclei clusters and then into syllables. However, the inventory of sounds represented by the Jamo pieces of the Hangul are not even remotely close to describing any sound of human speech. Hangul is not and never was a rival for IPA (the International Phonetic Alphabet).

    In the section on "The Inability of Unicode To Fully Address Oriental Characters", Mr. Goundry states that "Unicode's stated purpose is to allow a formalized font system to be generated from a list of placement numbers which can articulate *every single written language* on the planet." While the intended scope of the Unicode Standard is indeed to include all significant writing systems, present and past, as well as major collections of symbols, the Unicode Standard is *not* about creating "formalized font systems", whatever that might mean. Mr. Goundry, while critiquing Anglo-centricity in thinking about the Web and the Internet as an "unfortunate flaw in Western attitudes" seems to have made the mistake of confusing glyph and character -- an unfortunate flaw in Eastern attitudes that often attends those focussing exclusively on Han characters.

    Immediately thereafter, Mr. Goundry starts making false statements about the architecture of the Unicode Standard, making tyro's mistakes in confusing codespace with the repertoire of encoded characters. In fact the codespace of the Unicode Standard contains 1,114,112 code points -- positions where characters can be encoded. The number he then cites, 49,194, was the number of standardized, encoded characters in the Unicode Standard, Version 3.0; that number has (as he notes below) risen to 94,140 standardized, encoded characters in the *current* version of the Unicode Standard, i.e., Version 3.1. After taking into account code points set aside for private use characters, there are still 882,373 code points unassigned but available for future encoding of characters as needed for writing systems as yet unencoded or for the extension of sets such as the Han characters.
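
    The codespace figure can be checked with a few lines of arithmetic. This is only an illustrative sketch: the constant names are made up here, and the 17-plane layout (code points 0 through 0x10FFFF) is stated as background rather than taken from the letter itself:

```python
PLANES = 17                      # plane 0 plus 16 supplementary planes
CODE_POINTS_PER_PLANE = 0x10000  # 65,536 positions per plane
MAX_CODE_POINT = 0x10FFFF

codespace = PLANES * CODE_POINTS_PER_PLANE
print(codespace)                        # 1114112
print(codespace == MAX_CODE_POINT + 1)  # True
print(MAX_CODE_POINT.bit_length())      # 21 bits cover every code point

# The repertoire (characters actually encoded) is a separate, smaller number.
encoded_in_3_1 = 94_140
print(encoded_in_3_1 < 170_000 < codespace)  # True
```

    The last line shows the distinction being drawn: the disputed 170,000 exceeds the 3.1 repertoire but is far below the size of the codespace.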

    *Even if* Mr. Goundry's calculation of 170,000 characters needed for China, Taiwan, Japan, and Korea were accurate, the Unicode Standard could accommodate that number of characters easily. (Note that it already includes 70,207 unified Han ideographs.) However, Mr. Goundry apparently has no understanding of the implications or history of Han unification as it applies to the Unicode Standard (and ISO/IEC 10646). Furthermore, he makes a completely false assertion when he states that Mainland China, Taiwan, Korea, and Japan "were not invited to the initial party."

    Starting with the second problem first, a perusal of the Han Unification History, Appendix A of the Unicode Standard, Version 3.0, will show just how utterly false Mr. Goundry's implication that the Asian countries were left out of the consideration of encoding of Han characters in the Unicode Standard is. Appendix A is available online, so there really is no valid research excuse for not having considered it before haring off to invent nonexistent history about the project, even if Mr. Goundry didn't have a copy of the standard sitting on his desk. See:

    http://www.unicode.org/unicode/uni2book/appA.pdf

    The "historical" discussion which follows in Mr. Goundry's account, starting with "The reaction was predictable..." is nothing less than fantasy history that has nothing to do with the actual involvement of the standardization bodies of China, Japan, Korea, Taiwan, Hong Kong, Singapore, Vietnam, and the United States in Han character encoding in 10646 and the Unicode Standard over the last 11 years.

    Furthermore, Mr. Goundry's assertions about the numbers of characters to be encoded show a complete misunderstanding of the basics of Han unification for character encoding. The principles of Han unification were developed on the model of the main *Japanese* national character encoding, and were fully assented to by the Chinese, Korean, and other national bodies involved. So an assertion such as "they [Taiwan] could not use the same number [for their 50,000 characters] as those assigned over to the Communists on the Mainland" is not only false but also scurrilously misrepresents the actual cooperation that took place among all the participants in the process.

    Your (Mr. Carroll's) editorial observation that "It is only when you get *all* the nationalities in the same room that the problem becomes manifest," runs afoul of this fantasy history. All the nationalities have been participating in the Han unification for over a decade now. The effort is led by China, which has the greatest stakeholding in Han characters, of course, but Japan, Korea, Taiwan and the others are full participants, and their character requirements have *not* been neglected.

    And your assertion that many Westerners have a "tendency .. to dismiss older Oriental characters as 'classic,'" is also a fantasy that has nothing to do with the reality of the encoding in the Unicode Standard. If you would bother to refer to the documentation for the Unicode Standard, Version 3.1, you would find that among the sources exhaustively consulted for inclusion in the Unicode Standard are the KangXi dictionary (cited by Mr. Goundry), but also Hanyu Da Zidian, Ci Yuan, Ci Hai, the Chinese Encyclopedia, and the Siku Quanshu. Those are *the* major references for Classical Chinese -- the Siku Quanshu *is* the Classical canon, a massive collection of Classical Chinese works which is now available on CDROM using Unicode. In fact, the company making it available is led by the same man who represents the Chinese national standards body for character encoding and who chairs the Ideographic Rapporteur Group (the international group that assists the ISO working group in preparing the Han character encoding for 10646 and the Unicode Standard).

    Mr. Goundry's argument for "Why Unicode 3.1 Does Not Solve the Problem" is merely that "[94,140 characters] still falls woefully short of the 170,000+ characters needed"-- and is just bogus. First of all the number 170,000 is pulled out of the air by considering Chinese, Japanese, and Korean repertoires *without* taking Han unification into account. In fact, many *more* than 170,000 candidate characters were considered by the IRG for encoding -- see the lists of sources in the standard itself. The 70,207 unified Han ideographs (and 832 CJK compatibility ideographs) already in the Unicode Standard more than cover the kinds of national sources Mr. Goundry is talking about.

    Next Mr. Goundry commits an error in misunderstanding the architecture of the Unicode Standard, claiming that "two *separate* 16 bit blocks do not solve the problem at all." That is not how the Unicode Standard is built. Mr. Goundry claims that "18 bits wide" would be enough -- but in fact, the Unicode Standard codespace is 21 bits wide (see the numbers cited above). So this argument just falls to pieces.
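    The codespace arithmetic can be checked with a minimal sketch. The upper bound U+10FFFF and the 65,536-code-point plane size are the standard's figures; the variable names are only illustrative.

```python
# Unicode codespace: code points U+0000 through U+10FFFF.
MAX_CODE_POINT = 0x10FFFF
PLANE_SIZE = 0x10000  # 65,536 code points per plane

total_code_points = MAX_CODE_POINT + 1          # size of the whole codespace
planes = total_code_points // PLANE_SIZE        # number of 16-bit planes
bits_needed = MAX_CODE_POINT.bit_length()       # bits to hold the largest code point
capacity_18_bits = 2 ** 18                      # what an "18 bits wide" code could hold

print(planes, total_code_points, bits_needed, capacity_18_bits)
```

    Under these assumptions the codespace is considerably larger than an 18-bit code, and it is a single range of code points rather than two separate 16-bit blocks.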

    The next section on "The Political Significance Of This Expressed In Western Terms" is a complete farce based on false premises. I can only conclude that the aim of this rhetoric is to convince some ignorant Westerners who don't actually know anything about East Asian writing systems -- or the Unicode Standard, for that matter -- that what is going on is comparable to leaving out five or six letters of the Latin alphabet or forcing "the French ... to use the German alphabet". Oh my! In fact, nothing of the kind is going on, and these are completely misleading metaphors.

    The problem of URL encodings for the Web is a significant problem, but it is not a problem *created* by the Unicode Standard. It is a problem which is being actively worked on by the IETF currently, and it is quite likely that the Unicode Standard will be a significant part of the *solution* to the problem, enabling worldwide interoperability, rather than obstructing it.

    And it isn't clear where Mr. Goundry comes up with asides about "Ascii-dependent browsers". I would counter that Mr. Goundry is naive if he hasn't examined recently the internationalized capabilities of major browsers such as Internet Explorer -- which themselves depend on the Unicode Standard.

    Mr. Goundry's conclusion then presents a muddled summary of Unicode encoding forms, completely missing the point that UTF-8, UTF-16, and UTF-32 are each completely interoperable encoding forms, each of which can express the entire range of the Unicode Standard. It is incorrect to state that "Unicode 3.1 has increased the complexity of UCS-2." The architecture of the Unicode Standard has included UTF-16 (not UCS-2) since the publication of Unicode 2.0 in 1996; Unicode 3.1 merely started the process of standardizing characters beyond the Basic Multilingual Plane.
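    As a rough illustration of that interoperability, the sketch below encodes one short string in all three forms using Python's built-in codecs and checks that each decodes back to the same text. The sample string is an arbitrary choice; U+20000 is the first code point of CJK Unified Ideographs Extension B, which lies beyond the Basic Multilingual Plane.

```python
# Sample text: ASCII letter, Latin-1 letter, BMP Han character, and a
# supplementary-plane Han character (U+20000).
text = "A\u00e9\u4e2d\U00020000"

# Encode the same text in each encoding form (big-endian, no byte order mark).
encoded = {name: text.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

for name, data in encoded.items():
    # Every form decodes back to exactly the same string.
    assert data.decode(name) == text
    print(name, len(data), data.hex())
```

    The byte sequences differ, but the sequence of code points recovered from each one is identical, which is the sense in which the forms are interchangeable.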

    And if Mr. Goundry (or anyone else) dislikes the architectural complexity of UTF-16, UTF-32 is *precisely* the kind of flat encoding that he seems to imply would be preferable because it would not "exacerbate the complexity of font mapping".

    In sum, I see no point in Mr. Goundry's FUD-mongering about the Unicode Standard and East Asian writing systems.

    Finally, the editorial conclusion, to wit, "Hastings [has] been experimenting with workarounds, which we believe can be language- and device-compatible for all nationalities," leads me to believe that there may be a hidden agenda for Hastings in posting this piece of so-called research about Unicode. Post a seemingly well-researched white paper with a scary headline about how something doesn't work, convince some ignorant souls that they have a "problem" that Unicode doesn't address and which is "politically explosive", and then turn around and sell them consulting and vaporware to "fix" their problem. Uh-huh. Well, I'm not buying it.

    --Ken Whistler, B.A. (Chinese), Ph.D. (Linguistics),
    Technical Director, Unicode, Inc.
    Co-Editor, The Unicode Standard, Version 3.0

    --

  128. Duh. by Shoten · · Score: 2

    This should be obvious to anyone who has ever looked at a unicode chart or has had to click "Cancel" when asked to install character support for any of the myriad languages that need language packs to be displayed in Windows. Ok, so they built a way to theoretically support all of these characters. This does not mean that I can read Japanese, however, and making it possible to see it in my browser will not change that fact, nor will it make Google searchable in Japanese, cause IRC to support katakana or hiragana characters (and just freaking forget kanji unless you want to chat with a graphics tablet). Unicode has purposes (besides making it easier to hack web servers, that is), but the hopes and dreams built around it are a classic case of throwing tech at a social barrier to try and make it go away.

    --

    For your security, this post has been encrypted with ROT-13, twice.
  129. Re:I had no trouble reading that at all by cryptochrome · · Score: 2

    Naturally you'd have to do something about homonyms (I'll sounds just like aisle, anyway). Probably best to just work around them.

    cryptochrome
    time to get ill

    --

    ---If you can't trust a nerd, who can you trust?

  130. Pictographs suck by cryptochrome · · Score: 3

    For crying out loud, somebody tries to do something nice for somebody and they come back and accuse them of cultural chauvinism. The powers that be didn't have to develop unicode or UCS at all. They only developed it because the proliferation of language protocols was making the internet difficult to use for foreign languages and multinational businesses in general.

    And besides which, the point of the article is moot. As this article states:

    ISO 10646 defines formally a 31-bit character set. However, of this huge code space, so far characters have been assigned only to the first 65534 positions (0x0000 to 0xFFFD). This 16-bit subset of UCS is called the Basic Multilingual Plane (BMP) or Plane 0. The characters that are expected to be encoded outside the 16-bit BMP belong all to rather exotic scripts (e.g., Hieroglyphs) that are only used by specialists for historic and scientific purposes. Current plans suggest that there will never be characters assigned outside the 21-bit code space from 0x000000 to 0x10FFFF, which covers a bit over one million potential future characters.

    The italics and bold are mine. The 16 bit system was not meant to be completely comprehensive - it was meant to be useful for everyday use. Which, since it covers the characters literate people are expected to know in these systems, it does. The rest of the characters are academic (literally). If these characters are so important why don't they expect all of their own countrymen to know them?

    The proprietors of the internet could have happily stuck with the regular 8-bit Roman alphabet system forever (the internet being an American military invention in the first place). The roman alphabet was just part of the system. Hell, even a 16-bit code would have covered all script-based writing and scientific/miscellaneous notation systems easily, while leaving codes or a dedicated bit for the eastern pictograph systems to signal an extension of the protocol and letting them work out their own standard amongst themselves. It would have been fun to watch them (particularly Taiwan and China) squabble for dominance over it too. No one is forcing these eastern nations (or any non-roman-alphabet users) to use unicode or UCS, or the internet or computers for that matter. If they really wanted to, they could come up with their own systems based on their own languages. They just hopped on board and adapted it to their own needs like everyone else because it's a good idea, and it would be way too difficult to build around their own languages. But isn't it funny how every one of these eastern countries (except Japan thanks to hiragana and katakana) adapted the phonic roman alphabet to simplify the teaching of their own languages? With at least 170,000 characters between them, defenders of these languages claim they are a rich cultural heritage and a beautiful illustrated system. You could just as easily say that modern use of these pictograph-based written languages is oppressively difficult and ensures a lot of time and effort wasted just trying to learn to write at best, and is a stratifying system which guarantees high rates of illiteracy at worst. Erosion of these rigid and limited pictographic writing systems in favor of flexible and encompassing phonic ones is no accident or western conspiracy. Just as UCS was developed to make computer communication universal, the adaptation of phonic systems is the tendency to make literacy universal.

    cryptochrome

    P.S. Some may think that ISO 10646 (aka UCS-2) is not Unicode, but in fact as that same article points out "They joined their efforts and worked together on creating a single code table. Both projects still exist and publish their respective standards independently, however the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of the Unicode and ISO 10646 standards compatible and they closely coordinate any further extensions."

    --

    ---If you can't trust a nerd, who can you trust?

  131. I had no trouble reading that at all by cryptochrome · · Score: 3

    The irony of that message being marked as funny (adapted as it is from Mark Twain) is that after a few seconds to adjust, I had no trouble reading that statement at all.

    We tend to forget that there have been a lot of different spelling and notation systems for English. Even today, the British and American methods aren't identical. For all the fun we make and fear we have of the idea that the English (or any other language's) orthographic system should be simplified and made consistent with pronunciation, it is not a bad idea. It would greatly simplify the process of becoming literate and save tons of effort spent trying to learn irregular spellings. Beyond that, applying the same principles to pronunciation, the alphabetic letters (children's difficulty distinguishing b and d is universal), and vocabulary would accomplish the same goals with learning and using language.

    cryptochrome

    P.S. You forgot to mention dropping that pesky capitalization system. of course half the messages on the net don't bother with it. same thing goes for dealing with contractions, a la dont, wont, ill, and so on.

    --

    ---If you can't trust a nerd, who can you trust?

  132. Re:2 + 1 bytes? by vidarh · · Score: 2
    Uhm. Unicode already has at least three representations that allow for about a million characters each: UTF-8 (8 bit for US-ASCII, 2-4 bytes for everything else), UTF-16 (usually 16 bit, 32 bit for alternate "planes") and UCS-4 (32 bit).

    In other words, the limitation currently isn't lack of space in the Unicode encodings (unless you use UCS-2), but the fact that they simply haven't gotten around to specifying any more characters yet - unicode is still a work in progress.

  133. Re:Uh, I Don't Get It by vidarh · · Score: 2
    UCS-4 is not a character set. It is an encoding of Unicode, similar to UCS-2 (UCS-2 is 16 bit, UCS-4 is 32 bit), and UTF-7, UTF-8 and UTF-16 (variable length encodings).

    Except for UCS-2 (and perhaps UTF-7? I don't remember), all of them can encode about a million glyphs (the reason it's not more is due to the way the codespace is laid out, separating things in "planes", and reserving a lot of space for private use etc.)

  134. Re:All Character sets simultaneously?? by vidarh · · Score: 2

    Wrong. The worst case for unicode is 4 times larger than normal. If you only use non-ASCII text sporadically, you can use UTF-8 and will get by with much less than that (as UTF-8 encodes all ASCII text in one byte).
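    A small sketch of the size comparison, using Python's built-in codecs. The helper name `encoded_sizes` and both sample strings are made up for illustration; the 4x figure corresponds to a fixed 32-bit encoding measured against one-byte-per-character text.

```python
def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

ascii_text = "plain ASCII text"        # 16 characters, all ASCII
mixed_text = "na\u00efve caf\u00e9"    # 10 characters, two of them non-ASCII

ascii_sizes = encoded_sizes(ascii_text)
mixed_sizes = encoded_sizes(mixed_text)

print(ascii_sizes)
print(mixed_sizes)
```

    For the ASCII sample, UTF-8 is the same size as the original bytes and UTF-32 is four times that; the mostly-ASCII sample grows only slightly in UTF-8.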

  135. Re:Did you even read the article ? by vidarh · · Score: 2

    You mean that can't satisfy the bureaucrats. Most ordinary people won't see many restrictions from the current standard, as it does contain about 65,000 CJK codepoints, which should be more than enough for ordinary use. Those do at this point also include "compatibility" characters - duplicates that are there to satisfy worries about compatibility with pre-existing encoding systems.

  136. Re:Hmm.. I must have been using something else the by vidarh · · Score: 2

    No. See the glossary at www.unicode.org - UCS-2 and UCS-4 are encoding forms of the unified character set defined by the ISO/IEC 10646 standards, which now include at least 10646-1 and 10646-2. Unicode is mostly a different name for the ISO/IEC standards, but also includes additional information about the use of the characters.

  137. Re:Hmm.. I must have been using something else the by vidarh · · Score: 2

    See my other post below. ISO/IEC 10646 and the Unicode standards define the character sets. UCS-2 and UCS-4 are encodings of those character sets. UTF-7/UTF-8/UTF-16 are transformation formats that allow variable length encodings of the UCS-2 and UCS-4 encodings.

  138. Hmm.. I must have been using something else then? by vidarh · · Score: 4
    I've been using Unicode in various incarnations for a long time. And UCS-2 is not the only way to encode Unicode. UTF-8 is perhaps a lot more widespread, as it is the de facto standard encoding for exchange of XML documents over the web.

    UCS-4 is also quite common, and allows for the new extensions.

    UTF-16 is used by those who need to extend their UCS-2 applications, or who mostly need text that works with UCS-2 but want to be prepared for more.
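    To make the UCS-2 to UTF-16 step concrete, here is a minimal sketch of how UTF-16 represents a code point above U+FFFF as a pair of 16-bit surrogate units. The function name `to_surrogate_pair` is hypothetical; the constants 0xD800 and 0xDC00 are the start of the high and low surrogate ranges.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go in the low surrogate
    return high, low

high, low = to_surrogate_pair(0x20000)
pair_bytes = high.to_bytes(2, "big") + low.to_bytes(2, "big")

# Cross-check against Python's own UTF-16 encoder.
matches_codec = pair_bytes == "\U00020000".encode("utf-16-be")
print(hex(high), hex(low), matches_codec)
```

    Characters inside the 16-bit range are stored as a single unit exactly as in UCS-2, which is why existing UCS-2 text remains valid UTF-16.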

    Yes, a lot of things are difficult with Unicode. But if you look at most recent internationalization efforts, unicode is what people use.

  139. Re:More Flamebait :) by tb3 · · Score: 2

    Klingon into Unicode? I knew those people were obsessed, but that's just asinine! Fictional languages shouldn't even be considered, where would it end?

    "What are we going to do tonight, Bill?"

    --

    www.lucernesys.com Horizon: Calendar-based personal finance

  140. 16-bit Should Be Enough. by robbyjo · · Score: 2

    First of all, I think the editor (not the author) is right: "We're not in the same room". Therefore, 16-bit should be enough to encode even all the 50,000+ chars of K'ang Hsi dictionary. Moreover, if we try to encode ALL characters in the world, how redundant it would be. Surely Hindi speaking people won't speak Chinese and Hindi at the same time.

    Moreover, we have "Content Language" and "language" tag in HTML, don't we? If we ever want to encode two or more different languages, we can simply include these tags and be done with it. The browser can then pick the appropriate fonts and voila!

    Of the claimed 170,000 characters from the Orient, many can be unified since they are the same (in Japanese Kanji, Simplified, and Traditional Chinese). Simplified and Traditional Chinese share a lot of similarities. Even the simplified writings of a particular character often look nearly the same as the traditional one. Thus, the encoding for these two can be unified, only the font bitmap is different. Moreover, it won't be logical to use both simplified and traditional characters in the same article (except if they are exactly the same). So, these can save 50,000 characters.

    Japanese kanji also share a lot of similarities with both Traditional and Simplified Chinese (more with traditional than simplified). So, the encoding can be simplified too. Save another thousand characters.

    --

    --
    Error 500: Internal sig error
  141. More Flamebait :) by bark76 · · Score: 3

    Maybe if people didn't try to get character sets like Klingon, Cirth and Tengwar added into unicode we wouldn't have this problem!

  142. Re:Perl in Hierogliphics by Magumbo · · Score: 2

    :) Oh yeah! The king of wacky, terse, symbolic programming. You've gotta love it.

    --

  143. another drawback of unicode by Magumbo · · Score: 3

    And we must not forget about hieroglyphics. Unicode certainly has forgotten about them. That would be so cool to write perl code with little cats, birds, ankhs, and various other squiggles.

    --

  144. Re:Well DUH! It's not meant to have every characte by haruharaharu · · Score: 2

    50k? The numbers i got from the Japanese Ministry of education were closer to 900.

    --
    Reboot macht Frei.
  145. You don't really KNOW about unicode, do you? by absurd_spork · · Score: 2
    Honestly, you don't really KNOW about Unicode and how it works, do you?

    The idea behind Unicode is to have a uniform encoding for all the world's scripts, not for all the world's languages. The necessity of this is evident for anyone who has experience with the insufficiencies of the individual codepage systems (Windows CPxxx, ISO 8859-x, ISCII etc.) currently in use. Have you ever tried to send an Arabic e-mail through a non-Arabic mailserver or run a program with German character support on a codepage 450 windows? Unicode is designed to make programs and data interoperable regardless of either's language encoding.

    Just because you don't know Japanese it doesn't make the rendering of Japanese pointless. Just because you don't have a clue how a Chinese or Japanese Kanji input system works doesn't render the idea of being able to chat in IRC using Japanese characters entirely pointless.

  146. Re:Is this a problem? by trash+eighty · · Score: 2
    would you make the same comments if you were not able to speak/read english?

    millions of pages of the web are not in english y'know?

  147. Use Chinese for English Data compression! by eknuds · · Score: 2

    Actually, if you wanted to, you could write English/German/French/Spanish using Chinese! It would actually be fairly simple, one Chinese character == one English word. Just have the display program figure it out, ie translate the Unicode Chinese into the English/French/etc.. Achieve instant 50% or greater data compression. It's the perfect compression solution for us bandwidth-sucking Westerners! No verb tense or plurals? Don't need it! It's all fluff anyway! I'm only 1/4 joking...