Slashdot Mirror


Why Unicode Won't Work on the Internet

We received this interesting submission from N. Carroll: "Unicode, the commercial equivalent of UCS-2 (ISO 10646-1), has been widely assumed to be a comprehensive solution for electronically mapping all the characters of the world's languages, being a 16-bit character definition allowing a theoretical total of over 65,000 characters. However, the complete character sets of the world add up to approximately 170,000 characters. This paper summarizes the political turmoil and technical incompatibilities that are beginning to manifest themselves on the Internet as a consequence of that oversight. (For the more technical: the recently announced Unicode 3.1 won't work either.)" Read the full article.

416 comments

  1. Is this a problem? by Anonymous Coward · · Score: 1
    English is not only the de facto standard of the internet but also a world language.

    Introducing foreign language character sets and languages only splinters the internet into artificial factions and we end up having borders on the net. Is that what you want?

    1. Re:Is this a problem? by Anonymous Coward · · Score: 1

      This argument is meaningless. All the people who speak Chinese, for the most part, are concentrated in a single geographic region. Not to mention that a large portion of those 1.3 billion people have never seen or used a computer. English is more widely spoken and understood around the world than Chinese. The Chinese can do whatever they want with their internal networks but worldwide use of Chinese? please....

    2. Re:Is this a problem? by Anonymous Coward · · Score: 1

      That's easy. Most of the characters mean the same thing in every dialect. It is mainly when there are newer contemporary terms that they differ more widely. The Official language of China is Mandarin, which is leftover from the Ching dynasty, so that is what would be used.

      There may be 7 main dialects, but there are probably more than 100 dialects in China. Since you are reading the text and not listening to it, you don't have to worry about how the different dialects sound. Also, with Chinese movies, it doesn't matter what dialect the movie is in (mainly Mandarin or Cantonese), they are generally subtitled for everyone to watch.

    3. Re:Is this a problem? by Isaac-Lew · · Score: 1

      English is spoken as a second language by many. Also, English speakers are more spread out over the world than any other language (except possibly Spanish, but they use a similar character set).

    4. Re:Is this a problem? by cicho · · Score: 1

      I read and speak English fine, thank you - but is it OK with you if I sign my posts with my name? I need ISO-8859-2 for that, or should I dumb down the spelling of my name to satisfy the "de facto standard"?

      --
      "Only the small secrets need to be protected. The big ones are kept secret by public incredulity." - Marshall McLuhan
    5. Re:Is this a problem? by pompomtom · · Score: 1

      Yes, it is. God you're an idiot. So everyone who CAN'T read and write English is going to be made to learn, or denied the net? Even when communicating with their compatriots? Get a grip. Fine, a lot of people speak English (or at least US English), and it is probably the biggest SECOND language in the world. So what? Do you think it would be easier to organise some protocol for all characters, or to teach millions (if not billions) of people a new language? Someone mod me redundant, as I hope everyone understands what a dolt this guy is.

      Buckets,

      pompomtom

      --

      "There's an exception to every rule. Except for some rules"
    6. Re:Is this a problem? by autechre · · Score: 2

      Well, if we want to have the "standard" language be "Chinese", you'll first have to decide which one you want.

      China has 7 main dialects, according to my Chinese language class teacher. People in Shanghai speak a language that can almost be considered completely different than the one in Beijing. They use the same characters for writing, but use them to mean different things. At the very least, you have Mandarin and Cantonese.

      Also, while Chinese is a grammatically simple language (no conjugation, no pluralisation, etc.), it is less fun to write, because there is no alphabet. Yes, there is a different character for every word. Yes, there is a rhyme/reason to the characters, but that doesn't make it all that much less difficult to learn all of them. Oh, and you have to decide whether you want simplified or traditional Chinese characters to be the "standard", too.

      Finally, while the population of China is certainly the largest in the world, do they really have the most people _online_? I have no statistics, I'm actually curious.

      Sotto la panca, la capra crepa

      --
      WMBC freeform/independent online radio.
    7. Re:Is this a problem? by lingsb · · Score: 1

      Well, if we want to have the "standard" language be "Chinese", you'll first have to decide which one you want.

      China has 7 main dialects, according to my Chinese language class teacher. People in Shanghai speak a language that can almost be considered completely different than the one in Beijing. They use the same characters for writing, but use them to mean different things. At the very least, you have Mandarin and Cantonese.

      Yeah, they speak in different dialects, but the characters mean the same thing... So someone in shanghai can read a beijing newspaper, and understand it...

      --

      -BB

    8. Re:Is this a problem? by cougio · · Score: 1
      lOl

      There's a few things you are forgetting.

      US's net isn't growing much anymore. The rest of the world's net use is growing exponentially.

      Everyone wants to keep their languages. English is not a good standard for communication. And the world is getting pissed off at US imperialism. So we're going to push a new standard (such as Interlingua) and there is going to be only one border: surrounding the US. You in your hole and the rest communicating.

      Is that what you want? I'm starting to believe it's what I want.

    9. Re:Is this a problem? by kenthorvath · · Score: 1

      Yes, but the Chinese aren't even allowed to read half of the stuff on the Net anyway... And then you have to worry about dialects and then you introduce a whole world of problems. Best to stick with good old Alphabet languages and BabelFish.

    10. Re:Is this a problem? by Hektor_Troy · · Score: 1

      Why not have the standard language be Chinese? It's spoken and written by more people than English, so that would make more sense.

      --
      We do not live in the 21st century. We live in the 20 second century.
    11. Re:Is this a problem? by trash+eighty · · Score: 2
      would you make the same comments if you were not able to speak/read english?

      millions of pages of the web are not in english y'know?

    12. Re:Is this a problem? by dl_j10 · · Score: 1

      Yes, it is true. But the Americans can't understand the other two-thirds of the stuff in the world anyway.

    13. Re:Is this a problem? by ruzulo · · Score: 1

      reading your comment i think it's not worth reading english web pages either.

  2. Re:another drawback of unicode by Anonymous Coward · · Score: 1

    That would be so cool to write perl code with little cats, birds, ankhs, and various other squiggles.


    Must...resist... making... lame ... Perl readability joke... arrrghhh....

  3. Re:Is this a problem? - FYI by Anonymous Coward · · Score: 1

    FYI, there are more Chinese who speak English than there are Americans who speak English. (I saw that on the Discovery Channel :p )

  4. Define two unicode escape chars = 196000 chars. by Anonymous Coward · · Score: 1
    What's the big whoop? Just define two measly escape characters in the 16 bit unicode set that mean "look at next two bytes for real character". This way unicode stays 16 bit for the bulk of the world and can expand when needed.

    In fact I'd propose 8 bit ASCII as the standard with say, 4 escape characters, each of which is followed by two bytes. This allows 252 + 64K + 64K + 64K + 64K or roughly 256000 characters and does so WITHOUT breaking most ASCII based services and code out on the net and in the world.

    Keep it compatible, stupid!
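
    The arithmetic in this proposal can be checked with a few lines. This is only a sketch of the counting, assuming each escape is followed by exactly two bytes (65,536 values) and that the escape codes themselves stop being usable as ordinary characters; the variable names are made up for illustration.

```python
# Counting sketch for the escape-character proposal above.
# Assumption: every escape is followed by two bytes, i.e. 2**16 extra values.
PER_ESCAPE = 2 ** 16

# Variant 1: 16-bit Unicode with 2 code points reserved as escapes.
sixteen_bit_total = (2 ** 16 - 2) + 2 * PER_ESCAPE

# Variant 2: 8-bit ASCII-style code with 4 byte values reserved as escapes.
eight_bit_total = (2 ** 8 - 4) + 4 * PER_ESCAPE

print(sixteen_bit_total, eight_bit_total)
```

    Under those assumptions the first variant lands close to the 196000 in the subject line, and the second comes out slightly above the "roughly 256000" quoted, since four blocks of 64K are 262,144 on their own.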

    1. Re:Define two unicode escape chars = 196000 chars. by kurisuto · · Score: 2
      Your solution is roughly functionally equivalent to the UTF-8 encoding of Unicode. UTF-8 is a way of representing the Unicode character space. It has the following properties:

      • Every character in the U+0000-U+007F range is represented by a single byte which happens to correspond to the ASCII code for the same character (so purely ASCII text is identical to the same text in UTF-8).
      • Characters outside the U+0000-U+007F range are represented by two, three, or four bytes.
      • You can tell what kind a given byte is by examining its high bits (it's either in the ASCII range, or is the first byte of a multi-byte encoding, or is a non-initial byte of a multi-byte encoding).
      Since UTF-8 is backward-compatible with ASCII, it is becoming widely accepted.
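
      The three properties listed above can be seen directly with Python's built-in UTF-8 codec. This is a minimal sketch: `byte_kind` is a helper name invented here, and it only looks at the high bits, it does not validate the sequence.

```python
def byte_kind(b: int) -> str:
    """Classify a single UTF-8 byte by its high bits (no validation)."""
    if b < 0x80:
        return "ascii"         # 0xxxxxxx: a complete one-byte character
    if b < 0xC0:
        return "continuation"  # 10xxxxxx: non-initial byte of a sequence
    return "lead"              # 11xxxxxx: first byte of a multi-byte sequence

# One sample character for each encoded length (1 to 4 bytes).
for ch in ["A", "\u00e9", "\u4e2d", "\U00020000"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", len(encoded), [byte_kind(b) for b in encoded])

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
assert "plain ASCII".encode("utf-8") == "plain ASCII".encode("ascii")
```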
  5. Actually, India speaks English by Anonymous Coward · · Score: 1

    A good portion of India's 1 billion inhabitants speak English. The CIA World Factbook calls English India's "most important language for national, political, and commercial communication." So if even 30% speak/read English, that quantity alone puts it in competition with the number of people literate in Chinese.

    Don't believe me? See: http://www.cia.gov/cia/publications/factbook/geos/in.html#People

    Plus, add a few hundred million who speak 'Bad EU English', U.S./Canadian, Australian, etc.

    But most importantly, most of the technical documentation for the Internet and Web is in an English derivative.

    However, I do think that we can use Traditional Chinese to replace all these stooopid colorful little icons on my computer. At least then I can look up what they mean. :)

  6. Re:Well DUH! It's not meant to have every characte by Anonymous Coward · · Score: 1

    >Japanese alone learn some 50,000 symbols before they leave their 5th year of schooling. Man, I strongly doubt that even a single person in the world knows 50000 characters off-hand. Even if you divided by 10, it would still be too large. Japanese children learn slightly more than 2000 characters before they graduate from *HIGH SCHOOL*. Stop showing your ignorance. Christian Laforte

  7. Languages by Alex+Belits · · Score: 2

    "Unicoders" ignore the fact that any multilingual text is inherently stateful, so their idea of stateless stream of giant "characters" that will be easy to process is flawed at the core. In fact it's useful for decorative purposes only -- while it's easy to _display_ a unicode text (given in any of countless Unicode encodings), it's impossible to process or edit it without at least some state (current language) information to determine, what input method, dictionary, grammar rule, etc. to apply to any substring, so the goal of stateless text is just as misguided as the initial stateless filesystem representation in NFS. But if statelessness is kicked out of the window (like it should for anything multilingual), then information about both language and charset can be easily added to any substring, so all national charsets, ones that were specifically designed to be used in some particular language, and to which all processing rules and dictionaries were already written, can be used -- programs that don't care about charsets and languages will just handle them transparently as sequence of bytes, and programs that care should use state information anyway.
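
    To make the idea concrete, here is a rough sketch of what "language and charset attached to any substring" could look like. Everything in it is hypothetical (the `TaggedRun` type and both functions are invented for this illustration and are not part of any standard); it only uses codecs that ship with Python.

```python
from dataclasses import dataclass

@dataclass
class TaggedRun:
    lang: str      # language tag, e.g. "ru"
    charset: str   # Python codec name of a national charset
    data: bytes    # raw bytes in that charset

def passthrough(runs):
    """A program that doesn't care about charsets: just moves bytes."""
    return b"".join(r.data for r in runs)

def decode(runs):
    """A program that does care: uses each run's state to decode it."""
    return [(r.lang, r.data.decode(r.charset)) for r in runs]

runs = [
    TaggedRun("en", "ascii", b"Hello, "),
    TaggedRun("ru", "koi8_r", "мир".encode("koi8_r")),
    TaggedRun("ja", "euc_jp", "世界".encode("euc_jp")),
]
print(len(passthrough(runs)), decode(runs))
```

    Note that once the tags are dropped, the concatenated bytes alone no longer say which charset each part is in, which is exactly the state the comment argues has to be carried somewhere.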

    How to include state information is a good question -- there are a lot of possibilities, and one of them is modification of the HTML and XML specs to add a charset attribute to everything that can have LANG. The problem is, for purely political reasons those specs specify only a global charset for the whole document, and include LANG but don't include charset as an attribute for everything, to make it impossible to use any non-Unicode charset for multilingual documents in them. This does not serve any legitimate purpose, and is an example of blatant sabotage of the specs to serve the interests of a small but very influential and vocal group of companies that are interested in making multilingual processing as complicated as possible, so every simple task requires a huge bloated application just to comply with the sabotaged specs, instead of the simple byte-value transparency that otherwise would be sufficient. Raising the barrier for entry, decommodification and contamination of the standards at its finest.

    The reason why things like that are possible is that in fact the demand for multilingual text processing (multilingual as in one document that contains text in more than one language other than English, because English is usually supported within non-Unicode national charsets and works just fine with them) is currently very low, and was even lower when those "standards" were adopted, so obvious flaws did not cause immediate havoc. This is a commonly used strategy -- when no one needs something, write a standard for it that favors you, create a lot of noise around it, declare that it "dominates the industry" because no one else is doing it, and then wait until the need becomes more or less apparent. Then when it happens, everyone will somehow remember a piece of your noise, and you can loudly proclaim that all that time you were busy including the new great standard into the innards of your software and lobbying all standards groups to include some reference into standards (that everyone, of course, ignored all that time because of the lack of the need for application). So, at the time when the need is "more or less apparent" and the requirements for the quality of applications and standards are low, you can expand the "use" of your standard by people who don't need it or care about it, just because it was included into some of your products -- if feature support in them is ridiculously poor, no one will notice because there isn't that much use anyway. The development of other, superior, standards will be stifled because you will always be able to claim that everyone is happy with your standard because there aren't many people complaining -- of course, there won't be many complaining because almost no one actually uses it yet for what it was supposed to be used for in the first place. By the time real need arises, so many products and standards will be contaminated with your standard that people will have to use it despite the obvious flaws. If the standard stinks, you can still claim that no one made anything better anyway, so everyone should just use your POS, and if it breaks others' software design, they should just adopt yours.

    If this sounds too close to some particular company's favorite strategy, it probably is -- Microsoft with its nauseating file/documents formats design, mediocre and bloated, display/printing-only oriented text editing software is one of the most enthusiastic backers of the Unicode, and they do it despite the fact that their software itself often gets into trouble because Unicode is both hard to use and hard to implement. It doesn't matter, important thing is, if we had trouble with it doing it half-assed, everyone that will try to do it better will have much more trouble. Scorched earth strategy.

    --
    Contrary to the popular belief, there indeed is no God.
    1. Re:Languages by Alex+Belits · · Score: 2

      No. It's "Microsoft likes Unicode because it sucks, and because it's sticky enough to cause trouble for others".

      --
      Contrary to the popular belief, there indeed is no God.
    2. Re:Languages by scrytch · · Score: 2

      So basically your argument boils down to "Microsoft likes Unicode, therefore it sucks"? You come up with some fuzzy vague idea of encoding "language attributes" like grammar and dictionaries into character sets ... somehow, meanwhile conflating character sets with documents... I'm surprised you haven't asked for binary to be revamped. Try losing the scare quotes too, your sneering disdainful superiority for the subject and everyone associated with it was already fairly apparent.

      As Rand would say, A is A. Whatever sort of semantic meaning the letter might have in the context it's used in is not Unicode's problem. I hear we have things like document formats that handle that.
      --
      I've finally had it: until slashdot gets article moderation, I am not coming back.
    3. Re:Languages by tristan+f. · · Score: 1

      I don't know how to put this delicately, but... you're a blowhard.

      --
      Hi, I'm a pretentious cock who will make some gay comment about ignoring AC posts here.
  8. Re:Unicode's reply by Alex+Belits · · Score: 2

    It does not matter what Unicode in theory can hold -- the allocation of characters is handled by a single, and not in any way open, organization, so the standard is what has been allocated, not what in theory could be allocated if the Unicode consortium were benevolent enough, which we all know it is not. Even if it were, there is always some need to represent, in some consistent and unambiguous manner, text in languages that can't possibly be accepted into Unicode, such as fictional languages -- they can be easily handled by any expandable charset-handling system and it won't be rocket science to develop one, however Unicode supporters do everything that is possible for humans, and sometimes more, to prevent any competing system from being developed. Also it does not matter what the stated goals of Unicode are -- in fact it is being hawked as the required internal representation of all text in all applications, and as the origin for encodings used for data manipulation, storage and transmission. These are facts, and so are the real problems that Unicode generates if used in that way. I have no problem with the Unicode standard being a big dusty book used as a simplified manual for the world's alphabets, or as an intermediate format for font handling and text conversion between different charsets of the same language. The problem is, Unicode is being used for things it is inadequate for, and its existence is loudly proclaimed as the reason to make no progress in development of any solution for multilingual text handling that is not entirely based on Unicode-derived representation over the wire and in storage. This is selfish and counterproductive.

    --
    Contrary to the popular belief, there indeed is no God.
  9. Re:Unicode's reply by Alex+Belits · · Score: 2

    1. The standard is expandable if, and only if, it does not require a change of itself to adopt an expansion. For example, the addition of a new MIME type does not change the MIME standard, however the addition of a new tag does change the HTML standard, therefore HTML is not expandable, which is pretty easy to notice when comparing different HTML renderers. XML is a near-absurd case because it's basically an umbrella that allows one to declare all kinds of tags and therefore is supposed to be flexible and generate expandable standards, however the catch is, it does not provide any facility to automatically determine how to handle those tags' semantics in applications, so the mere possibility to declare something new does not make it expandable either if applications' algorithms have to be modified. In Unicode however the situation is much simpler -- any addition of characters IS a modification of the standard, and there is no possibility to automatically provide interoperability between older and newer versions.

    The existence of the procedure TO change the standard does not make it expandable.

    2. If Unicode will adopt all fictional languages/scripts/... it will become absolutely impossible to make complete fonts for it -- now it's merely a huge task, but then it will be plain impossible. The only real solution is to have standard that allows to name a language/charset combination, and leave the text in them intact until either user will install support for them, or application will automatically download it. Unicode doesn't help with it a single bit -- application encountered a character in unsupported range, and all it has is 16 or now 32 bits that it can only stuff in its virtual ass and report an error because no reasonable resolution can be made without some external assumption.

    3. ISO 2022 is a very poor implementation of stateful multi-charset character stream, and Unicoders are very fond of mentioning it as a proof that all possible stateful systems are bad. However repeating something that is false does not make it any less false -- in fact, after Unicode was adopted by IETF (on meetings behind the closed doors) all work on stateful character streams standardization was stopped.

    4. Computers can magically process all kinds of charsets. It's called byte-value transparency. Most of applications would work just fine if they just copied strings without making any assumptions about their structure or number of characters in them as long as bytes are bytes, and end of string is always 8-bit 0, what would be quite trivial for any stateful text system to implement. Tiny minority of programs need anything from a text that requires actual parsing other than finding newlines and, rarely, whitespaces. Display routines are different thing, however there aren't many of them, and all systems other than Windows support Unicode by combining and translating multiple fonts for multiple ranges, so supporting multiple fonts subsets for multiple marked charsets would be only easier to implement.
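
    The "bytes are bytes, end of string is an 8-bit 0" condition in point 4 can be tried out on a few encodings. This is just a sketch: `c_strlen` is a made-up stand-in for what a C-style string routine would see, and the sample text is arbitrary.

```python
def c_strlen(buf: bytes) -> int:
    """Bytes a NUL-terminated string routine would see before stopping."""
    idx = buf.find(b"\x00")
    return len(buf) if idx < 0 else idx

text = "Привет, мир"

for codec in ("koi8_r", "utf-8", "utf-16-le"):
    raw = text.encode(codec)
    # Compare the real length with what survives a byte-transparent copy.
    print(codec, len(raw), c_strlen(raw))
```

    In this sample the single-byte national charset and UTF-8 both pass through intact, while the 16-bit form gets cut at the first zero byte, so whether an encoding is byte-transparent in this sense depends on its byte layout rather than on whether it is stateful.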

    The problem is, there are too many Windows programmers writing internet drafts now, so semantics of text display routines got stirred up from system-specific and application-specific processing where they belong, and contaminated standards responsible for data transfer, where they don't belong, and a lot of people now believe that to transfer some data one has to know how to display it in some pretty letters. Shame on you.

    --
    Contrary to the popular belief, there indeed is no God.
  10. Re:Unicode's reply by Alex+Belits · · Score: 2

    1. ISO is a closed standards body -- if it does anything, it makes standard less open.

    2. Private use codes aren't standard -- they don't provide any guarantees of interoperability, and merely provide a way to break the standard while fooling a program that is compliant with it into behaving how the user wants. If there was a way to put somewhere even a name of a charset to map "private" codes to a font name, it would solve a piece of the problem, but alas -- Unicode is made under the slogan of total statelessness of text, so while applications' file formats may allow this, arbitrary substring in a text can't.
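
    For what it's worth, this is roughly how a private use code looks from a program's side, using Python's standard `unicodedata` module. A small sketch only; U+E000 is picked as the first code point of the Private Use Area in the Basic Multilingual Plane.

```python
import unicodedata

pua = "\ue000"  # first Private Use Area code point in the BMP

# General category reported by Python's bundled Unicode database.
print(unicodedata.category(pua))

# There is no standard name, so the supplied default is returned.
print(unicodedata.name(pua, "<no name>"))

# It still encodes and transfers like any other code point.
print(pua.encode("utf-8"))
```

    The code point travels fine, but nothing in the text itself says which private mapping or font it belongs to, which is the interoperability gap described in point 2.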

    --
    Contrary to the popular belief, there indeed is no God.
  11. Re:Conspiracy Theories and Unicode by Alex+Belits · · Score: 2

    The major commercial Unix vendors have all made significant commitments to Unicode support, and even the Linux internationalization community is busy adding Unicode support to Linux. Apparently it doesn't matter to you that Sun, HP, Compaq, NCR, and major Linux I18N players participate in Unicode development, too. It isn't an either/or black and white issue. It isn't some gigantic conspiracy to use a bad standard to prevent the good guys from developing a good standard. But I guess you can believe whatever you want.

    This is, to say the least, incorrect. While there is a lot of effort to shoehorn Unicode into Unix and Unix software, the actual results are beyond miserable, precisely because Unicode does not work. Unix vendors solved this problem by adding a small amount of support for to/from Unicode conversion and by declaring that their filesystems support UTF-8, thus getting blessed by the Unicode consortium as compatible. Guess what, UTF-8 can be "supported" in that way even by an abacus, if that abacus is long enough and has at least 8 stones in a row, however actual use of it is a completely different thing -- I have never in my life seen a filename in UTF-8 outside of Unicoders' demos, and I am Russian myself and have a lot of friends that speak Japanese. So, again, Unix vendors' support of Unicode is in fact lip service, not unlike Microsoft's support of POSIX or claims that the Internet would support the OSI 7-layer model (which ended with "temporary solutions" known as TCP/IP and Berkeley sockets replacing it).

    --
    Contrary to the popular belief, there indeed is no God.
  12. Re:Statelessness of text by Alex+Belits · · Score: 2

    Statelessness of text is something that Unicode tried to achieve, and is still using as the main argument for its acceptance. Latin-1 is quite irrelevant here because its goals weren't as pretentious as Unicode's, and its impact on existing applications was near zero, and was basically "where are we going to use those values anyway?" Unicode actually is supposed to be used for serious multiple-language support, and requires fundamental changes in both applications and protocols -- with protocols causing a lot of infiltration of Unicode-based requirements into otherwise transparent protocols. This would be to some extent justified if Unicode actually was a base for serious multilingual processing (which Latin-1 never claimed to be) but otherwise it isn't worth the effort and problems that Unicode brings in. So, the main advantage of Unicode over basically everything else imaginable (though not implemented because of pressure on IETF from Unicode) is statelessness of the text stream.

    --
    Contrary to the popular belief, there indeed is no God.
  13. Re:Unicode's reply by Alex+Belits · · Score: 2

    You've ranted on this every time Unicode has come up on Unicode, but assertion does not a proof make. You've never sketched out a better system, or said what makes ISO 2022 a poor implementation of what it is. Write an RFC, create a rough implementation of the system and if it really is better, then and only then can people evaluate and decide whether or not to use it. Until then, the choices are basically ISO 2022 or Unicode, and people will pick the choice that works best for them, and not worry about what could be the optimal

    I can do that if anyone will listen. The problem is, the actual problem that it will solve does not exist yet, its time hasn't come. Multilingual documents, for all purposes, don't exist beyond demos. Unicoders are using this to create their own standard that definitely wouldn't hold water if demand already existed, but they can flood everything involved with standards with their propaganda -- a certain person, Martin Duerst, subscribes to EVERY mailing list that may in any way touch multilingual text handling and every time someone mentions Unicode, floods it with tons of messages in support, and fiercely fights against every argument against. I have no idea what else that person does beyond that, if anything, and how many hours are in his day, but it's extremely hard to support any serious argument when one side is so active, and most people are disinterested.

    I have planned to do this when actually people will need multiple languages in their documents, and if someone can convince me that I have slept too long, and this time is now, I will happily start work, but otherwise it will be not just fighting with windmills, but fighting with windmills when there is no wind.

    --
    Contrary to the popular belief, there indeed is no God.
  14. Re:Conspiracy Theories and Unicode by Alex+Belits · · Score: 2

    UTF-8 on an abacus -- yes, I guess that *is* a strawman that we should all take *real* seriously.

    I merely tried to explain that UTF-8 is specifically designed to be used with any imaginable system -- what says nothing about its usefulness.

    I presume you mean on Unix systems, where for most such systems, choice of UTF-8 for filenames would be problematical because they would run afoul of other parts of the system that don't handle them. Sure, such may be the case.

    This is simply false. UTF-8 filenames and data can be used in any Unix if one wants to sacrifice functionality that people expect from a fixed-length characters representation (ex: regexps matching, cutting text at arbitrary offsets). However it's not a problem of Unix that users expect their encodings to be easier to use than a mess that UTF-8 is -- on other systems there isn't any counterpart to this functionality in utilities that are in common use.
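
    The "cutting text at arbitrary offsets" case mentioned above is easy to reproduce. Below is a small sketch: `safe_cut` is a hypothetical helper that backs up to a character boundary by skipping continuation bytes (the ones of the form 10xxxxxx), which is the extra work a fixed-length encoding would not need.

```python
raw = "Привет".encode("utf-8")  # 6 characters, 12 bytes

def safe_cut(buf: bytes, offset: int) -> bytes:
    """Cut at offset, backing up while it points at a continuation byte."""
    while 0 < offset < len(buf) and (buf[offset] & 0xC0) == 0x80:
        offset -= 1
    return buf[:offset]

naive = raw[:5]  # slices through the middle of the third character
try:
    naive.decode("utf-8")
except UnicodeDecodeError as exc:
    print("naive cut failed:", exc.reason)

# The boundary-aware cut drops the partial character instead.
print(safe_cut(raw, 5).decode("utf-8"))
```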

    On the other hand, UTF-8 databases are now running routinely on Unix systems, and they work just fine, thank you.

    Show me. I have seen a shitload of data, marked as UTF-8, yet used exclusively as ASCII, or even with different encodings actually in the data, but never -- actual multilingual database in UTF-8. Again, it demonstrates my point that Unicoders are trying to sneak their "standard" in while there is no demand and therefore no scrutiny for the quality of things being introduced.

    > and I am Russian myself and have a lot of
    > friends that speak Japanese.

    Umm. And the relevance of that comment is what?

    It means that I am in my own experience familiar with handling of multiple encodings, with what people use in the real-life texts handling, and their willingness to use Unicode, that happens to be below zero. You can claim that their reasons are irrational, and Unicode is still the best solution for them, however I still don't see, why opinion of almost everyone who actually knows about the subject from practice, and is supposed to benefit from what Unicoders are proposing, can be dismissed so lightly.

    --
    Contrary to the popular belief, there indeed is no God.
  15. Re:Unicode's reply by Alex+Belits · · Score: 2

    Because, gee, the need to communicate with someone in another language is new.

    When people communicate, they choose one language for it -- usually one that both know best. No one speaks like "Ya odnowremenno trying goworit' po-english i russkomu, and esli ya by znal nihongo ya would simultaneously speak po-yaponski, too".

    It's very important to see the distinction between the need to support "multilingual document" that contains multiple languages within one body of text and to support documents in multiple languages within one system or program. Also historically it happened that documents in all languages can painlessly include ASCII text, so non-English language + English is usually treated the same way as a text in non-English language, not requiring any special tools to be handled. One may claim that this is wrong, but this is how things happened to be developed over decades.

    I've never seen VCR instructions in multiple languages

    Those are multiple documents, not one document with multiple languages in it. There is clear separation between versions in different languages, and this is already being accomplished easily, even in MIME email.

    , I've never seen a bilingual dictionary

    Dictionaries are special cases, and they usually are distributed in either printed form, or as a database -- they are almost never seen as plain text documents. In both for-print-only formats and in databases there are plenty of ways to represent languages and charsets as metadata, and absolutely all computer dictionaries that I have seen have chosen to use native encodings.

    , and the EU driver licenses only have one language on them, not every language of the EU.

    Again, I assume that the whole text of the license is repeated in multiple languages, not individual words are repeated in each language within one body of text, so the same definition of multiple documents applies.

    --
    Contrary to the popular belief, there indeed is no God.
  16. Re:Unicode's reply by Alex+Belits · · Score: 2

    So Reta Vortaro , an Esperanto dictionary with translations to many languages, is a demo. (Click on the j^ in the left frame, and then on the j^audo in the same frame, for the translation of that word into English, German, Polish and Russian, among others.)

    First, without any doubt it is a demo -- the set of languages to which translations are available varies from word to word, and in real life one would never want to have translation into multiple languages to always appear, clogging the screen. Second, this is an application (even though a simple one), not a document, and there are plenty of ways for applications to handle multiple charsets even now. My point is, functionality that supports multiple languages within an application is completely orthogonal to the support of multiple languages within a single document or string. Unicoders love to mix those two.

    Or Freedict, a source of bilingual dictionaries for dict (including German and Greek, and German and Japanese) is just a demo too.

    Again -- I don't see why this particular application used UTF-8, however neither its design requires it, nor those files are for any purposes normal text documents -- even uncompressed, they have strict formatting and are even indexed, so they could use just any charsets/encodings possible.

    And the Debian main page , where it lists the names of the languages in which the page has been translated to in their own script at the bottom, is just a demo too.

    Absolutely. This list of languages is obviously a gimmick that provides nothing that list of languages in English wouldn't provide -- everyone in the world, for whatever reason, knows how his language's name looks in English even if he can't read English. In the case of Debian page, if I was looking for Russian translation, I certainly would search for "Russian" string to find the link (it's interesting that the word "Russian" is the only one, where both "native" and English name of the language are mentioned in the Debian page -- I assume, because a lot of Russians actually use Russian translation but don't have UTF-8 enabled or supported in their browsers). Also, Debian home page automatically chooses the language if it's announced by the browser, so if I really wanted Russian version and set language preferences in the browser, I wouldn't even have to touch anything else. And lo and behold -- when I choose Russian, the page appears in koi8-r, what happens to be Russian local charset, not any form of Unicode.

    --
    Contrary to the popular belief, there indeed is no God.
  17. Re:Unicode's reply by Alex+Belits · · Score: 2

    You're also missing the other selling point of Unicode: it's simple. Yes, there are plenty of ways for an application to handle multiple character sets, but they're all more complex than just using Unicode.

    The simplicity of Unicode is only in its authors' imagination. Yes, it's easy to present Unicode to people who don't know the details as a simple solution -- the problem is, reality isn't as simple as it looks.

    I'm sure when typing up "German and English Sounds" for Project Gutenberg, that I could switch between Latin-1, some IPA character set, a character set with o-macron, a character set with u-breve, and whatever I need for the rest of characters Dr. Grandgent used, but it's much easier for me to use Unicode.

    When the goal is just to make a text that can be printed in pretty letters, anything is ok as long as it's implemented. This is why a lot of low-quality products such as MS Office are so popular -- in fact so popular that I often receive email with nothing but plain ASCII text as a MS Word file. However even in this case a complex typesetting system (that would most likely just use multiple fonts in whatever charsets they happen to be available because it cares more about fonts) would be more appropriate.

    When I start on "Old High German", I could dig up some obscure High German character set and switch to a Greek character set when he uses Greek words as examples . . . or I could just use Unicode. No matter how much you would dismiss it, it is a real problem and some of us use Unicode because it is a real and a simple solution to the problems we face.

    How deceptive. The implied assumption is that "obtaining" charset support is some kind of nonzero effort while using Unicode is smooth regardless of the language. Both things are incorrect -- in a system with multi-charset support the charsets support can be loaded automatically depending on the languages and charsets mentioned -- if someone wants to have support for everything Unicode supports at the extent Unicode supports it, he will only need fonts, and the amount of the information and resources used would be exactly the same as if he had their support in Unicode. However in practice usually the goal is different -- only few languages and charsets are in active use by the same user at the time, however he needs them to be supported with input methods (how to enter greek on this particular keyboard?), formatting rules, ordering, at least references to spellcheckers, etc.

    Again, Unicode user still ends up having to somehow get something language-specific, except that his language-specific data and procedures also have to be designed to use Unicode, what differs from the procedures that are already in use, and often open source. Software vendors would love that -- they can either keep making localized versions of all software with Unicode support but with different language-specific procedures, or try to make tools that can handle all languages and spend man-millennia rewriting trivial things and then release them as the only way to use Unicode in practice. In either case they get their money because old software, Unicode-supporting or not, will not match the requirements for multilingual documents processing, and their new solution will be complex and therefore hard to reproduce.

    My idea is that infrastructure for stateful text processing is as unavoidable as the existence of different languages and writing systems, so it would be foolish to try to deceive people into thinking that displaying pretty letters is the main problem of handling multiple languages or multilingual documents. I don't see how denying the undeniable is justified. Most people are ignorant about the details because at this moment the problem isn't evident, and the problem isn't evident because the whole field of its application is not in any way related to their everyday life; however, I don't think that every kind of ignorance deserves to be abused with such long-lasting possible consequences.

    Extending the idea that in multilingual text attributes that should be applied to substrings ("state" when text is treated as a stream) are necessary, I can say that since statefulness is unavoidable anyway, charset/encoding is just as good an attribute as the language or, say, a language-dependent parameter such as direction (for example, in Japanese left-to-right and up-to-down directions are both acceptable, even though modern texts use left-to-right). The implementation of "full unicode" text processing, even in a primitive display-only manner, is not any simpler, and certainly isn't any lighter on resources than multiple charset support -- in fact multiple charsets support can be easily built on top of any existing text displaying or printing procedure that supports multiple fonts and multibyte characters. The only "big question" is how to represent attributes in a text stream, but this is merely a question of formally declaring some decision to be standard -- one can design many of them easily, and almost everything that a sane human mind can create at this moment in history would be infinitely superior to iso 2022.
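
    To make the "attributes on substrings" idea concrete, here is a minimal sketch, not a proposal for an actual format. The Run class, its field names and the to_text helper are all hypothetical, and the charset names are simply Python's codec names. Python strings are Unicode internally, so the decoding step at the end is only a convenient way to print the result; the stored document itself stays in the native charsets.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Run:
    """One substring of a document plus the state that applies to it."""
    language: str   # hypothetical language tag, e.g. "ru"
    charset: str    # Python codec name used for `data`
    direction: str  # "ltr" or "rtl"
    data: bytes     # the text, stored in its native charset


def to_text(runs: List[Run]) -> str:
    """Decode each run with its own charset and join them for display."""
    return "".join(run.data.decode(run.charset) for run in runs)


# A document mixing three languages, each kept in a native 8-bit charset.
document = [
    Run("en", "ascii", "ltr", b"Hello, "),
    Run("ru", "koi8_r", "ltr", "привет".encode("koi8_r")),
    Run("en", "ascii", "ltr", b" and "),
    Run("el", "iso8859_7", "ltr", "γειά".encode("iso8859_7")),
]

print(to_text(document))
print([(run.language, run.charset) for run in document])
```

    Language-specific procedures (input methods, ordering, spellcheckers) would hang off the language and charset fields of each run; none of that is shown here.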

    --
    Contrary to the popular belief, there indeed is no God.
  18. Re:Unicode's reply by Alex+Belits · · Score: 2

    Come on. I've read the Unicode standard, I read unicode@unicode.org, I've read most of the publicly accessible proposals and I'm familiar with all the Technical Reports. There is a lot of complexity in Unicode, but it's mostly derived from the inescapable complexity of the writing systems and compatibility with older systems, and most of the complexity can be ignored if you are willing to support some subset (European systems, or European/Russian/CJKV systems). That complexity is going to exist whether you use Unicode or some other multilingual system. Supporting Unicode at the Xterm/Yudit-level is simple, and supporting Unicode in an application with Pango & GTK 2.0 should be just as simple.

    Then what was the point of your argument? If implemented in the display-only library and used for displaying/printing only, Unicode is just as "simple" as would be any other system, with or without multiple charsets. If program does anything complex, it should handle various language-dependent stuff anyway, however bare Unicode support provides no such infrastructure, and a reasonable infrastructure can be implemented either with or without Unicode. Then what is the advantage of Unicode? Being self-proclaimed status quo in standards' backroom-politics, that no one supports properly anyway, that is hard to segment into subsets, non-expandable, maintained by a closed standards body and requires more resources?

    I don't claim that Unicode theoretically can't be used as the base for languages support -- in theory it can, but the problem is, it provides no advantage compared to multi-charset system if used as a part of multilingual text support infrastructure. I have already explained why such infrastructure does not exist now, however I believe that when it will become necessary, someone will have to implement it anyway. So now, when no one needs it, Unicoders are busy to claim this "piece of noosphere", just like some people tried to sell land on Mars -- just because it's there, and before it will become obvious that it's not theirs.

    The goal of Project Gutenberg is to transcribe public domain texts in a format readable for the largest audience possible.

    By this logic it should use Microsoft Word or at least PDF -- both very widely supported, more wide than even plain text files in UTF-8 (yes, I know, Word can use unicode internally -- this isn't the point).

    Unicode HTML and UTF-8 plain text are those formats.

    Are they? Most of my boxes don't have them installed -- the one I am writing this message on is an exception, but only because it has Mozilla, what is still a bloatware. My handhelds most likely never will have them installed -- they don't have enough ram, and need rather nontrivial manipulations with characters size and formatting to keep texts in some languages readable, so plain stream of unicode text would be impossible to display without some heavy heuristics.

    Some proprietary and/or obscure complex typesetting format is neither portable nor accessible to a wide audience.

    I wouldn't dream to propose a non-open standard for this. However the trouble with open standards is that they never appear before they become necessary, and I, following the principle that standards and tools should be developed as the need arises, am not making any detailed proposals at this time. But when there will be a need, the standard that will be created must be open, expandable and easy to port and reimplement -- something that anything Unicode-based is not. If you mean that charset is "proprietary", I am not aware of any charset except, maybe, "klingon in private unicode" that was in any way declared to be someone's property. If multi-charset support infrastructure will be created, it would be reasonable to include some common facility into the libraries that will make it possible for users to allow programs, when they see an unknown language or charset, to automatically download fonts, tables and even formatting/comparison/input methods/... source code from some servers that keep directories of known charsets and languages, and this would be an open, expandable and flexible infrastructure, available to everyone. If someone wants his language that never had local charset in the first place to be represented by its range in Unicode, he should be able to do that, however in a system like that there should be no reason to prevent established language/charsets combinations from being used just because of someone's narrow view of the problem.

    Maybe I am wrong in this traditionalism, and it will be better if I made an infrastructure for stateful text support just to demonstrate this point -- after all, even with all dynamic fonts/code/input methods/... it won't be in any way more complex than any other solution, merely useless because right now still almost no one uses multiple languages in a single document. But maybe the need to demonstrate the solution for a problem that no one experiences yet is now a good enough reason when someone else is trying to sneak in an impractical solution as the standard while no one is looking.

    I see current advance of Unicode as something that may serve some simple need now, but can severely limit further progress if accepted as widely as Unicoders are trying to get accepted. That would not be "good enough" as TCP is "good enough", large SMP kernel lock was "good enough" or C pointers are "good enough" -- it's "good enough" as Windows, region codes, crippleware, etc. are "good enough" -- people accept them because those things are pushed, and the inconvenience they create isn't bad enough until it's too late, but when it's too late, people still use them because there is nothing else in sight.

    --
    Contrary to the popular belief, there indeed is no God.
  19. Re:Unicode Character Set vs Character Encoding by Jordy · · Score: 2

    Actually, the Unicode specification for UTF-8 places an artificial limit of 4 8 bit code units for variable length encoding as that is all that Unicode currently requires.

    ISO 10646 defines UTF-8 as having up to 6 8 bit code units.

    At 4 bytes, UTF-8 can only map to 0x10FFFF. At 6, it can map to 0x7FFFFFFF.

    Of course, my math could be wrong.
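
    A minimal sketch to check that math, assuming the original 1-to-6-byte UTF-8 layout (the function names are made up for illustration). It suggests the four-byte bit pattern could actually reach 0x1FFFFF, so 0x10FFFF is a cap imposed by the standard rather than by the encoding itself.

```python
def utf8_payload_bits(length: int) -> int:
    """Code point bits carried by a UTF-8 sequence of `length` bytes,
    under the original 1-to-6-byte scheme."""
    if not 1 <= length <= 6:
        raise ValueError("original UTF-8 sequences are 1 to 6 bytes long")
    if length == 1:
        return 7  # 0xxxxxxx
    # The lead byte spends `length` one-bits and a zero bit on the prefix,
    # leaving 7 - length payload bits; each continuation byte (10xxxxxx)
    # carries 6 more.
    return (7 - length) + 6 * (length - 1)


def utf8_max_code_point(length: int) -> int:
    """Largest value the bit pattern of a `length`-byte sequence can hold."""
    return (1 << utf8_payload_bits(length)) - 1


for n in range(1, 7):
    print(n, utf8_payload_bits(n), hex(utf8_max_code_point(n)))
```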

    --
    The world is neither black nor white nor good nor evil, only many shades of CowboyNeal.
  20. Unicode Character Set vs Character Encoding by Jordy · · Score: 5
    The current permutation of Unicode gives a theoretical maximum of approximately 65,000 characters (actually limited to 49,194 by the standard).
    The biggest problem with Unicode is that no one understands what it is. Unicode defines two things, a character set that maps a character into a character code and a number of encoding methods that map a character code into a byte sequence.

    ISO 10646, the Universal Character Set defines a 31 bit character set (2,147,483,648 character codes), not a 16 bit character set. Unicode 3.0's character set corresponds to ISO 10646-1:2000. Unicode 3.1 which was recently released goes a bit further.

    UCS-2, as mentioned by this article, is the same as UTF-16 and is severely limited by its 16 bit implementation. UTF-16 is unfortunately used by Windows and Java, but is rarely used on the web. The article claims UTF-16 can only map 65,000 characters, but using surrogate pairs can actually map over 1 million characters.
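
    For the curious, here is roughly how a surrogate pair is built. This is a sketch with a made-up function name, cross-checked against Python's own UTF-16 encoder; 10 bits go into each half, which is where the "over 1 million" (2^20) figure comes from.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above the 16-bit range into two 16-bit units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need a surrogate pair")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x10400)

# Cross-check against the built-in big-endian UTF-16 codec.
assert chr(0x10400).encode("utf-16-be") == (
    high.to_bytes(2, "big") + low.to_bytes(2, "big")
)

print(hex(high), hex(low))
print("pairs available:", 0x400 * 0x400)
```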

    Thankfully, there are several other encoding methods for Unicode. UTF-8, which is a variable length encoding most commonly used on the web allows a mapping of Unicode from U-00000000 to U-7FFFFFFF (all 2^31 character codes). It also has a nice feature of the lower 7 bits being ASCII, so there is no conversion necessary from ASCII to UTF-8.

    UTF-32 or UCS-4 is a 32 bit character encoding used by a number of Unix systems. It's not exactly the most space efficient form (UTF-8 requires roughly 1.1 bytes per character for most Latin languages), but it can handle the entire Unicode character set.

    A good document on this is available at UTF-8 And Unicode FAQ
    --
    The world is neither black nor white nor good nor evil, only many shades of CowboyNeal.
    1. Re:Unicode Character Set vs Character Encoding by jholder · · Score: 1

      You say:

      ISO 10646, the Universal Character Set defines a 31 bit character set (2,147,483,648 character codes), not a 16 bit character set. Unicode 3.0's character set corresponds to ISO 10646-1:2000.

      Actually, this is incorrect. Unicode / ISO 10646, when you count the high and low surrogate range of code points, addresses just over 21 bits worth of data - 1,112,063 code points, including private use area.

      Why do I know? Been to recent Unicode conferences, have the standard, and I write software for it.

      --
      -- John
    2. Re:Unicode Character Set vs Character Encoding by jholder · · Score: 1

      0x10ffff = 1,114,111 code points. So, what I said still stands.
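
      A quick check of these figures, assuming the count runs from 0 through 0x10FFFF inclusive and that the surrogate range is D800 to DFFF (the constant names are made up). Counted that way, both totals seem to land one above the numbers quoted in this thread.

```python
TOTAL_CODE_POINTS = 0x10FFFF + 1          # 0 through 0x10FFFF inclusive
SURROGATES = 0xDFFF - 0xD800 + 1          # code points reserved for UTF-16 pairs
USABLE = TOTAL_CODE_POINTS - SURROGATES   # everything except the surrogates

BMP = 1 << 16                             # the 16-bit basic plane
SUPPLEMENTARY = 1 << 20                   # reachable via surrogate pairs

print(f"{TOTAL_CODE_POINTS:,} code points in total")
print(f"{SURROGATES:,} surrogates")
print(f"{USABLE:,} left over for characters and private use")
print(BMP + SUPPLEMENTARY == TOTAL_CODE_POINTS)
```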

      --
      -- John
    3. Re:Unicode Character Set vs Character Encoding by Morgo · · Score: 1
      UTF-8, which is a variable length encoding most commonly used on the web allows a mapping of Unicode from U-00000000 to U-7FFFFFFF


      No, Unicode only allows character values up to 0x10FFFF (the 16-bit basic multilingual plane, plus 2^20 surrogate pairs). This conveniently means that all characters can be expressed in four bytes in UTF-8, as noted here.

      That FAQ, as I pointed out in another thread, is not 100% accurate.

      ...but it can handle the entire Unicode character set.

      All encodings can handle the entire character set. They'd be pointless if they couldn't!

    4. Re:Unicode Character Set vs Character Encoding by Morgo · · Score: 1
      Do they have all my Zapf DingBats?

      Yeah, actually they do :-) But I don't have the book with me right now so I can't tell you the exact range of codes.

      In fact they're one of the few (maybe the only) ranges of characters where the shape of the glyph is defined by the standard. Most characters are just encoded by semantic value (in other words, the letter 'A' is encoded once, and not separately for different fonts and styles - an italic 'A' is still just an 'A').

    5. Re:Unicode Character Set vs Character Encoding by crath · · Score: 2

      Jordy, you're completely missing the point. You have clouded the writer's religious bias against western civilization by bringing facts into the discussion. If the writer had wanted to deal with facts he would never have written his article in the first place: he only wants us westerners to acknowledge that johnny-come-latelys to the Internet game should have an equal place at the table. In other words, stop the Internet and computer technology from moving forward until everyone's perspectives have been completely accommodated; everyone, that is, except for those who started the revolution!

    6. Re:Unicode Character Set vs Character Encoding by ClarkEvans · · Score: 2

      Nice summary. Although UTF-32 only implements a subset of UCS-4 due to compatibility issues. You can find more information on unicode at unicode.org, in particular their faq is very helpful, especially the sub-faq on UTF-16 and the BOM.

    7. Re:Unicode Character Set vs Character Encoding by ClarkEvans · · Score: 2

      Morgo is correct. Unicode is only capable of representing a sub-set of ISO 10646-1:2000. This is detailed in the UTF-32 definition among other places which says: UTF-32 is restricted in values to the range 0x000000 to 0x10FFFF, which precisely matches the range of characters defined in the Unicode Standard (and other standards such as XML), and those representable by UTF-8 and UTF-16.

    8. Re:Unicode Character Set vs Character Encoding by TekPolitik · · Score: 2
      UCS-2, as mentioned by this article, is the same as UTF-16 and is severely limited by its 16 bit implementation.

      Not quite, although you could be forgiven for believing this. UCS-2 is just a truncated UCS-4, which represents exactly 65536 characters (less a little over 2048 now) and was originally the same as Unicode. UTF-16 is an encoding which extends the range of possible characters to around 1 million, and Unicode has been redefined to be the same as UTF-16. Current versions of Windows use UTF-16, not UCS-2.

      A good reference for this is the UTF-8 and Unicode FAQ for Unix/Linux.

      The article's claim that Unicode can only map 65536 characters is fundamentally flawed, since its new definition as being the same as UTF-16 means it can probably map every character ever used, and in fact includes fictional scripts (including Tolkien, and more importantly, Klingon, although I'm not sure of the standardisation status of the latter at this time).

    9. Re:Unicode Character Set vs Character Encoding by blair1q · · Score: 2

      All encodings can handle the entire character set. They'd be pointless if they couldn't!

      Do they have all my Zapf DingBats?

      --Blair
      "And what do we do when the Venutians touch down?"

    10. Re:Unicode Character Set vs Character Encoding by ek_adam · · Score: 1

      Yes. Here's a chart.

  21. Re:Duh. by jandrese · · Score: 2

    Just because you can't read other languages doesn't mean multi-language support is useless. Oh, and inputting Kanji on a keyboard is quite feasible, try using the Windows IME sometime (It's built into 2000).

    Down that path lies madness. On the other hand, the road to hell is paved with melting snowballs.

    --

    I read the internet for the articles.
  22. Uh, I Don't Get It by Aaron+M.+Renn · · Score: 1

    It looks like the argument is that since ancient Chinese texts can't be fully reproduced in Unicode, that the standard is flawed. I disagree. There is already a four byte character set out there - UCS-4 I believe, which is ISO something or other - that will easily handle all characters as necessary. This set can be used for replicating all XX,000 old school Chinese characters. Think of it as the SGML of character sets. But for common applications, XML (ie, Unicode) will continue to do just nicely.

    1. Re:Uh, I Don't Get It by Old+Wolf · · Score: 1
      Is anyone else getting sick of seeing -- Remove "Trash+" to reach my inbox instead of my trash folder.. I write on behalf of *myself only* - not my employe after every 3rd or 4th message in this thread?

      This brings me to another beef: people sometimes "spam-proof" their email by adding characters to the -username- part and not the domain. This means that any spam will still go to the domain provider anyway, but get bounced. The spam bandwidth is still used and the site owner still gets spam. Lame! Modifying the domain means that the mail will never go anywhere.

    2. Re:Uh, I Don't Get It by vidarh · · Score: 2
      UCS-4 is not a character set. It is an encoding of Unicode, similar to UCS-2 (UCS-2 is 16 bit, UCS-4 is 32 bit), and UTF-7, UTF-8 and UTF-16 (variable length encodings).

      Except for UCS-2 (and perhaps UTF-7? I don't remember), all of them can encode about a million glyphs (the reason it's not more is due to the way the codespace is laid out, separating things in "planes", and reserving a lot of space for private use etc.)

  23. Re:Overstating and misunderstanding the problem by imroy · · Score: 1
    There's probably variation among Germans, but at least some Germans do write a "1" as an upside-down V.

    yes, I quickly noticed this when I spent a short time in Germany a few years ago. Sort of a very tall and skinny v, I'd say. Looked kinda like a greek letter I thought, but I can't remember which. need sleep...

  24. Unicode includes all common Asian character sets by Per+Abrahamsen · · Score: 3

    I.e. all the character sets *in common use* in Asia today, maps into a subset of Unicode. They even map into the 16 bit subset, but overlap in a way that make slightly different characters from different character sets share the same code point. That is why an extended version of Unicode is used, so Chinese/Japanese/Korean characters have different codepoints.

    Unicode does not contain all characters ever used, for example it does not contain the Nordic runes. These are not used today except by scholars, who will need special software (most likely using the "reserved to the user" part of Unicode). The same is true for many ancient Asian characters.

  25. Babel by jafac · · Score: 2

    He sure did do a good job when he slapped that old Tower of Babel bitch down.

    --

    These are my friends, See how they glisten. See this one shine, how he smiles in the light.
  26. UCS-4 by Iffy+Bonzoolie · · Score: 1

    Isn't this what UCS-4 is for? I can't imagine there are more than a billion characters. Of course, most of the Unicode software that deals with wide characters won't work with UCS-4. But any decent UTF-8 based program should support up to 6 bytes per character.

    But I guess internally most programs use 16-bit characters, because it's easier to deal with, and just convert into more compact forms like UTF-8 when they want to save or transfer it.

    --
    Run a pencil-and-paper RPG campaign with your far-off friends: Gametable!
    1. Re:UCS-4 by spitzak · · Score: 2
      UTF-8 can encode 31 bit characters with an obvious extension of the standard (or perhaps this is part of the standard, I'm not sure).

      After that point it breaks down (if you continue the first byte is filled and the prefixes will have to go into the second byte). Alternatively you can quit at that point, use the remaining bit as the 32'nd bit, and say that UTF-8 cannot encode more than 32 bits.

      Anyway, one big advantage of variable-sized encoding is that there is potentially no limit to the size of the data transferred.

      My personal opinion is that UTF-8 should be used *everywhere*, including all internal interfaces to libraries and services like X. The sooner we stamp out these "wide characters" and all the complexity they cause by doubling or quadrupling the number of interfaces we need, the better.

      UTF-8 probably does not address the concerns of the article, which is about the fact that Unicode does not contain all possible scribbles drawn by humans. But the fact that English speakers were able to compensate for decades and even adapted to the rather arbitrary and limited 62-character ascii set would indicate that people will easily compensate for this as well.

  27. Re:Duh. by Malc · · Score: 1

    You have to install character support for those other languages because most fonts don't contain complete coverage of the Unicode character set. If you install "Arial Unicode MS" off the Office 2000 CD, you get character support for a lot of languages. Sorry, I can't remember what option to choose in the Office 2000 setup. Don't forget, Win9x/ME is multi-byte only via code pages with Unicode being a per application thing, and WinNT/2K/XP is Unicode only but with little support as all the Windows applications try to be Win9x compatible with the least amount of effort.

  28. Re:Solution - Everybody use Euro-English! by Zarquon · · Score: 1

    Mark Twain wasn't the only one with a story like this. There was a similar one I ran across in an Astounding anthology.

    --
    "'Tis great confidence in a friend to tell him your faults, greater to tell him his." --Poor Richard's Almanac
  29. Use UTF-8, don't worry about sizes by iabervon · · Score: 3

    UTF-8 encodes 7-bit ASCII characters as themselves and all of the rest of UCS-4 (the unicode extension to 32-bits) as sequences of non-ascii characters. This means that apps which can't handle anything but ascii can simply ignore non-ascii and get all of the ascii characters (and, with minimal work, report the correct number of unknown characters).
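
    A rough sketch of what such an ASCII-only app could do, assuming the input is well-formed UTF-8 (the function name is made up). Every non-ASCII character starts with exactly one byte that is not of the form 10xxxxxx, so counting those lead bytes gives the number of unknown characters without decoding anything.

```python
from typing import Tuple


def ascii_view(data: bytes) -> Tuple[str, int]:
    """Return the ASCII characters in `data` and the number of non-ASCII
    characters that were skipped. Assumes well-formed UTF-8 input."""
    ascii_chars = []
    unknown = 0
    for byte in data:
        if byte < 0x80:
            ascii_chars.append(chr(byte))   # plain 7-bit ASCII
        elif byte & 0xC0 != 0x80:
            unknown += 1                    # lead byte of a multibyte character
        # bytes of the form 10xxxxxx are continuation bytes: skip them
    return "".join(ascii_chars), unknown


text, unknown = ascii_view("naïve café 日本".encode("utf-8"))
print(repr(text), unknown)
```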

    The only issue is that there's not a good way to set a mask for the characters such that 0-127 (which take up a single byte) are the common characters for the language, and so on, so English is more compact than other languages, even languages which don't require more characters.

  30. Re:2 + 1 bytes? by Jeremy+Erwin · · Score: 2
    Maybe use only 20 bits and leave 4 bits for something else (font style, inverse, etc.).

    Typically, one shouldn't apply font styles on a character by character b as iS.

  31. Chinese language(s) by danny · · Score: 2
    I highly recommend S Robert Ramsey's The Languages of China to anyone interested in language in China.

    Danny.

    --
    I have written over 900 book reviews
  32. Re:Overstating and misunderstanding the problem by tjansen · · Score: 2

    the number one is handwritten in America as a vertical stroke, but in Germany as an upside-down V

    No, the handwritten one in Germany looks more like the 1 in an Arial font. bye...

  33. Case by Pseudonymus+Bosch · · Score: 2

    26 letters

    You mean 26 uppercase and 26 lowercase.
    __

    --
    __
    Men with no respect for life must never be allowed to control the ultimate instruments of death.
    GW Bu
  34. Re:ASCII stupidity all over again... by spitzak · · Score: 2
    When ASCII was invented it was based on existing typewriters, including the ones sold in Europe. At that time output was on paper and it was rather easy to overstrike characters to produce accented characters. How else do you explain the existence of the '~', '^', and backquote characters (in addition the underscore code was originally a macron). They actually designed it so the countries of NATO could type as well as they could on a typewriter. They also deleted several characters Americans wanted (fractions, fl and fi ligatures, open and close quotes, cent sign were all very common on typewriters of that period).

    Yes, only a small set of countries was considered, and only minimal support. But this claim on "no support for anything not USA" is false.

  35. Re:ASCII stupidity all over again... by spitzak · · Score: 2
    I didn't claim they tried to support all European characters. What I meant is that they did not ignore them totally. They (rather stupidly) thought that a few accent marks would do the job.

    The cent sign was replaced with the caret. That is why shift+6 prints a caret, if you look at old typewriters that was how you printed a cent sign. This is in fact the main reason I think they considered European support, since from an American point of view the cent sign is more important. The fractions were what were replaced by the square braces. The curly braces, vertical bar, and apparently the tilde were added later (originally they printed as square braces, slash, and caret, and devices that totally ignored the lower-case bit were allowed, and the original tilde was changed to underscore because that character was missing originally).

    "Extended ASCII" usually refers to the replacement of several of the punctuation marks with European characters. This was pretty useless because by then most OS's had assigned meaning to those punctuation marks (like the square brackets), also only 5 or 6 new characters were available. This died almost immediately when people started supporting the 8th bit as data rather than parity.

  36. Re:Arabic space by spitzak · · Score: 2
    Is there a good reason for this or is this due to some stupidity in MSWord? I assume it has something to do with bidirectional scripts, but if normal space is not used for anything in Arabic then I would accuse Microsoft of being stupid.

    If this is the normal non-breaking space character (0xA0 in Unicode) then it takes 2 bytes in Unicode.

  37. Re:Arabic space by spitzak · · Score: 2
    Yes, I am guessing that MS picked a character to mean "backwards wrapping space" or something, and it sounds like all Arabic must have the words seperated by this character rather than space. It apparently is not the "non breaking space" or some Arabic equivalent, if I understand the orginal poster correct.

    The question was "did they do this for a good reason?" I.e., does doing this allow formatting control that could not be achieved otherwise? Or were they just stupid/lazy, and if normal spaces were used with a slightly smarter program would it be just as good?

    I personally don't know anything about Arabic so I cannot answer these questions. My guess is that this is reasonable if there is a place that "normal spaces" are used in Arabic.

  38. Re:UTF-8 should be fine for almost any application by spitzak · · Score: 5
    Thanks for some more intelligent discussion about UTF-8.

    I might add a few things:

    In UTF-8 it is not just NUL and Escape that never appear inside multibyte characters; in fact *all* 7-bit characters are absent from them (every byte of a multibyte sequence has the high bit set). This means that *any* program that treats all bytes with the high bit set as a "letter" will work and can parse, hash, match, search, etc. identifiers/words with foreign letters in them!
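
    A minimal Python sketch of that property. The sample string and both helper names are made up for illustration; the tokenizer is deliberately byte-oriented and knows nothing about Unicode beyond "high bit set means letter".

```python
def ascii_bytes_in_multibyte(text):
    """Collect any byte below 0x80 found inside a multibyte UTF-8 sequence."""
    stray = []
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1:
            stray.extend(b for b in encoded if b < 0x80)
    return stray


def split_identifiers(data):
    """Byte-level tokenizer: ASCII alphanumerics, '_' and any high-bit byte are 'letters'."""
    words, current = [], bytearray()
    for b in data:
        if b >= 0x80 or chr(b).isalnum() or b == ord("_"):
            current.append(b)
        elif current:
            words.append(bytes(current))
            current.clear()
    if current:
        words.append(bytes(current))
    return words


sample = "naïve = größe + 名前"  # hypothetical identifiers
print(ascii_bytes_in_multibyte(sample))
print([w.decode("utf-8") for w in split_identifiers(sample.encode("utf-8"))])
```

    The first call should find no stray ASCII bytes, and the tokenizer never splits a word in the middle of a character even though it only looks at single bytes.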

    In addition the UTF-8 encoding is just heavy enough that random line noise is very unlikely to match a UTF-8 encoding. If programs treat "illegal" UTF-8 encodings as individual bytes in the ISO-8859-1 character set, it will display virtually all existing ASCII/ISO-8859-1 documents unchanged!
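
    One way that fallback might look in Python, as a sketch only: decode as UTF-8, and wherever the decoder reports an illegal sequence, treat just those bytes as ISO-8859-1. The function name is hypothetical. Note the hedge in the text still applies: a Latin-1 byte pair that happens to form legal UTF-8 would be read as UTF-8.

```python
def decode_utf8_or_latin1(data):
    """Decode bytes as UTF-8, falling back to ISO-8859-1 for illegal sequences only."""
    pieces = []
    pos = 0
    while pos < len(data):
        try:
            pieces.append(data[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are offsets relative to data[pos:]
            good_end = pos + err.start
            bad_end = pos + err.end
            pieces.append(data[pos:good_end].decode("utf-8"))
            pieces.append(data[good_end:bad_end].decode("latin-1"))
            pos = bad_end
    return "".join(pieces)


legacy = "café".encode("latin-1")   # old ISO-8859-1 document
modern = "café".encode("utf-8")     # same text as UTF-8
print(decode_utf8_or_latin1(legacy), decode_utf8_or_latin1(modern))
```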

    The end result is that it should be easy to switch all interfaces (not just over the network, but inside programs and to libraries) to UTF-8. This will vastly simplify the handling of Unicode because there will be no need for ASCII backward-compatibility interfaces. We could also eliminate all the "locale" crap and make ctype.h the simple thing it once was.

    Even Arabic will encode smaller in UTF-8 than UTF-16. This is due to the fact that very common characters (not just English, but things like space and newline) are only one byte.
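
    A quick check of the size claim, using an arbitrary Arabic greeting as the sample (my choice, not from the thread). Arabic letters take two bytes in both encodings, so the saving comes entirely from spaces and other ASCII.

```python
text = "السلام عليكم"  # sample phrase, chosen only for illustration

utf8_len = len(text.encode("utf-8"))
utf16_len = len(text.encode("utf-16-le"))  # "-le" avoids counting a byte-order mark

print("characters:", len(text))
print("UTF-8 bytes:", utf8_len)
print("UTF-16 bytes:", utf16_len)
```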

  39. Re:All Character sets simultaneously?? by K-Man · · Score: 2
    I got all sorts of spurious matches from the Latin words, which wouldn't happen if the Greek and Roman letters weren't sharing a single character space.


    However, in Unicode, Chinese, Korean, and Japanese all share the same codepoints and glyphs, so you can't grep for one language or another.

    For instance, if you were searching in Korean for "Kim Il Sung", this string in Unicode would be the same as the Chinese characters for "gold" (jin), "sun" (ri), and "become" (cheng), so your search would get hits from other sino-based languages in addition to Korean.
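
    A small sketch of the search problem. The three snippets and their language labels are invented for illustration; the point is only that the hanja/hanzi/kanji spelling is the same code point sequence in each, so a plain substring search cannot tell the languages apart.

```python
needle = "金日成"

# Hypothetical documents; the language tag lives outside the text itself.
documents = {
    "ko": "金日成 주석",
    "zh": "金日成是朝鲜领导人",
    "ja": "金日成について",
}

hits = [lang for lang, body in documents.items() if needle in body]
print(hits)
print([f"U+{ord(ch):04X}" for ch in needle])
```

    Telling the hits apart needs out-of-band information such as a language tag, which is the complaint being made here.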

    It's difficult even to sort Unicode correctly without choosing some language or another, due to this overlap of characters. "Alphabetical order" is different for the different Asian languages, even though they use the same characters.
    --
    ---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger
  40. Correct, also see link by K-Man · · Score: 2

    Yes, the author was overbroad with that statement. All languages work on a restricted set of phonemes; there are some 200+ identified, but no one language uses near that number. Hangul covers all the Korean phonemes, but not much else.

    Here's a good description of Hangul. If you check this page, you'll notice I was wrong about the vowels; they don't seem to describe their own pronunciations at all, but rather the yin and yang elements of their sounds :-P.

    --
    ---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger
  41. Re:You bring up a good point by K-Man · · Score: 3

    If you read the article, you'll find a decent description of Korean Hangul, which has around the same number of characters as English (IIRC, it has 24).

    Hangul outdoes the Latin alphabet in several ways. For one, as you mention, pronunciation in English is difficult, while in Hangul it is almost completely unambiguous. Each phoneme maps to one character, and vice-versa. There is no confusion over whether to write "cat" or "kat", for example. Only one letter has the "k" sound.

    Each Hangul character is a pictogram describing the position of the tongue, palate, and lips to use when pronouncing it. Whereas most phonetic alphabets consist of ideograms recycled as phonetic symbols, Hangul seems to be the only one to consist of symbols constructed purely for phonetic meaning.

    Since the job of a phonetic alphabet is only to represent phonemes, I would say that this alphabet does the job better than Latin.

    --
    ---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger
  42. Re:Unicode's reply by dvdeug · · Score: 2

    > the allocation of characters is handled by a single [...] organization,

    Slightly incorrect. It's handled by the Unicode Consortium AND the ISO 10646 standards group.

    > not in any way open, organization

    It's as open as, say, the ISO C++ standards group. That is, unless you're connected to the right corporation or country, you won't get a seat, but they still accept outside submissions and respect experts outside the group.

    > Unicode consortium would be benevolent enough, that we all know that it is not.

    Benevolent how? Benevolent enough for what? It took them less than a year to get LATIN CAPITAL LETTER N WITH LONG RIGHT LEG encoded, for a minor language with no political power (Lakota). They're constantly encoding new letters and scripts for groups with no political or economic clout (Z with hook below for Old High German, various Philippine scripts in 3.2).

    And no one's stopping you from hacking up your own multi-charset system, and using it wherever you want. But loudly claiming that you're being oppressed doesn't prove that you are, and doesn't prove that your system would actually be superior to Unicode.

  43. Re:No, _n_ bytes per character! by dvdeug · · Score: 2

    Part of the point of UTF-8 is that non-ASCII characters don't get encoded with ASCII characters. In your system, you can get an '/' or a '\0' or '\e' byte that doesn't represent that character, meaning that all Unix software needs to be changed to support your encoding. As it is, Linux accepts bytes for filenames without caring whether it's UTF-8 or some 8-bit code or some other multibyte code that obeys the same rule, knowing only that the byte '/' is uniquely the directory separator.
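
    To make the '/' hazard concrete, here is a sketch that uses UTF-16 as a stand-in for an encoding that breaks the rule (the parent's proposed scheme isn't reproduced in this thread, so this is an assumption for illustration). U+012F, the Lithuanian i with ogonek, contains a 0x2F byte in UTF-16 but not in UTF-8.

```python
def has_ascii_byte(encoded):
    """True if any byte of the encoded form falls in the 7-bit ASCII range."""
    return any(b < 0x80 for b in encoded)


ch = "\u012f"  # LATIN SMALL LETTER I WITH OGONEK
utf16 = ch.encode("utf-16-be")
utf8 = ch.encode("utf-8")

print(b"/" in utf16)  # a byte-oriented path parser would see a separator here
print(b"/" in utf8)


def bmp_multibyte_is_ascii_free():
    """Check every non-ASCII BMP code point (surrogates excluded) under UTF-8."""
    for cp in range(0x80, 0x10000):
        if 0xD800 <= cp <= 0xDFFF:
            continue
        if has_ascii_byte(chr(cp).encode("utf-8")):
            return False
    return True


print(bmp_multibyte_is_ascii_free())
```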

  44. Re:Unicode's reply by dvdeug · · Score: 2

    > ISO 2022 is a very poor implementation of stateful multi-charset character stream,

    You've ranted on this every time Unicode has come up, but assertion does not a proof make. You've never sketched out a better system, or said what makes ISO 2022 a poor implementation of what it is. Write an RFC, create a rough implementation of the system and if it really is better, then and only then can people evaluate and decide whether or not to use it. Until then, the choices are basically ISO 2022 or Unicode, and people will pick the choice that works best for them, and not worry about what could be the optimal solution.

  45. Re:ISO-2022-JP and "alphabetical order" by dvdeug · · Score: 2

    Why is the character order a problem for the Japanese, and not the Germans, the French, the Lithuanians, the Belarusians, and almost every other language in the world? Latin-* does not encode anything besides English in alphabetical order, and neither does Unicode. (It's theoretically impossible; the Lithuanians want the Y to precede the J, and the Danish and the Swedes disagree about where the a with ring above goes.)
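
    A toy illustration of why no single code point order can serve as "alphabetical order". The alphabets below are the standard letter orders as I understand them, and the word lists are just examples; real collation uses per-locale tailoring tables rather than anything this simple.

```python
LITHUANIAN = "aąbcčdeęėfghiįyjklmnoprsštuųūvzž"
SWEDISH = "abcdefghijklmnopqrstuvwxyzåäö"
DANISH = "abcdefghijklmnopqrstuvwxyzæøå"


def collation_key(alphabet):
    """Build a sort key that ranks letters by their position in `alphabet`."""
    rank = {letter: i for i, letter in enumerate(alphabet)}

    def key(word):
        # Letters outside the alphabet sort after everything else.
        return [rank.get(c, len(alphabet)) for c in word.lower()]

    return key


lt_words = ["jonas", "yla", "ieva"]
print(sorted(lt_words))                                 # raw code point order
print(sorted(lt_words, key=collation_key(LITHUANIAN)))  # y comes before j

print(sorted(["öl", "ål"], key=collation_key(SWEDISH)))  # å right after z
print(sorted(["ål", "øl"], key=collation_key(DANISH)))   # å at the very end
```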

    If you go to the Unicode standard (found online at http://www.unicode.org/unicode/uni2book/u2.html ) they have an index with all the characters by radical and stroke. They also have an index with all the characters found in JIS sorted by their JIS index.

  46. Re:Unicode's reply by dvdeug · · Score: 2
    The problem is, the actual problem that it will solve does not exist yet, its time didn't come.

    Because, gee, the need to communicate with someone in another language is new. I've never seen VCR instructions in multiple languages, I've never seen a bilingual dictionary, and the EU driver licenses only have one language on them, not every language of the EU.

    Multilingual documents, for all purposes, don't exist beyond demos.

    So Reta Vortaro, an Esperanto dictionary with translations to many languages, is a demo. (Click on the j^ in the left frame, and then on the j^audo in the same frame, for the translation of that word into English, German, Polish and Russian, among others.) Or Freedict, a source of bilingual dictionaries for dict (including German and Greek, and German and Japanese) is just a demo too. And the Debian main page, where it lists at the bottom the names of the languages into which the page has been translated, each in its own script, is just a demo too.

  47. Re:Unicode's reply by dvdeug · · Score: 2

    The fact that you chose to dismiss this stuff as demos does not change the fact that it's in actual use. Revo's author doesn't feel like changing the format of his dictionary because you don't agree with it. The web is full of gimmicks, but people like their gimmicks; why do you think Java took off? You can't just call it a gimmick and dismiss it; if that's what people do, then that's what people want to do.

    You're also missing the other selling point of Unicode: it's simple. Yes, there are plenty of ways for an application to handle multiple character sets, but they're all more complex than just using Unicode. I'm sure when typing up "German and English Sounds" for Project Gutenberg, that I could switch between Latin-1, some IPA character set, a character set with o-macron, a character set with u-breve, and whatever I need for the rest of the characters Dr. Grandgent used, but it's much easier for me to use Unicode. When I start on "Old High German", I could dig up some obscure High German character set and switch to a Greek character set when he uses Greek words as examples . . . or I could just use Unicode. No matter how much you would dismiss it, it is a real problem and some of us use Unicode because it is a real and a simple solution to the problems we face.

  48. Re:Unicode's reply by dvdeug · · Score: 2

    > The simplicity of Unicode is only in its authors' imagination.

    Come on. I've read the Unicode standard, I read unicode@unicode.org, I've read most of the publicly accessible proposals and I'm familiar with all the Technical Reports. There is a lot of complexity in Unicode, but it's mostly derived from the inescapable complexity of the writing systems and compatibility with older systems, and most of the complexity can be ignored if you are willing to support some subset (European systems, or European/Russian/CJKV systems). That complexity is going to exist whether you use Unicode or some other multilingual system. Supporting Unicode at the Xterm/Yudit-level is simple, and supporting Unicode in an application with Pango & GTK 2.0 should be just as simple.

    > When the goal is just to make a text that can be printed in pretty letters [...] even in this case a complex typesetting system [...] would be more appropriate

    The goal of Project Gutenberg is to transcribe public domain texts in a format readable for the largest audience possible. Unicode HTML and UTF-8 plain text are those formats. Some proprietary and/or obscure complex typesetting format is neither portable nor accessible to a wide audience. Project Gutenberg has existed for 30 years. What "complex typesetting system" format can claim the same? How many "complex typesetting system"s that could handle it are available on many different platforms? At least 70% of the people on the net can read Unicode HTML, and many of the rest could with little work and no cash expenditure. What "complex typesetting system" can say the same? How is a "complex typesetting system" simpler than Unicode plain text?

    > he needs them to be supported with input methods (how to enter greek on this particular keyboard?), formatting rules, ordering, at least references to spellcheckers, etc.

    Nonsense. For "Old High German", I will map ALT-Z to ȥ in XEmacs. Spellcheckers don't exist for this language, I'm not going to sort the data, and it's just a z with a hook, so there's no special formatting rules. The people who read the book don't even need a way to enter the character, any more than reading the original book precipitated a need to enter it into the computer.

    Displaying pretty letters isn't the end all and be all of multilingual computing, but it's a damn good start. The only registered character set (ISO 2022 registry or IANA registry) that supports Lakota or the Cherokee syllabary is Unicode. No, on most systems, they can't get decent support; handcrafted keyboard maps must be used, there's no spelling or sorting support. But they can type the characters in and send them across the net and print up papers, which is better than nothing.

  49. Re:umm by Ares · · Score: 1

    The encodings are a temporary fix to a permanent problem, much in the same way that NAT "expands" the IPv4 address space. The real solution is to use a character set that can encompass all the world's characters at the same time.

  50. Re:Esperanto, Ido, lojban; BCE by unitron · · Score: 2

    He explained it as "before the Christian era", no doubt for the benefit of those only familiar with B.C. and A.D., but did not define it as that, although anyone who needs it explained no doubt also needs it defined as "Before Common Era" (and should also be told that what comes after is "C.E.", or "Common Era", and that B.C.E. and C.E. correspond to B.C. and A.D., respectively), so he did screw up just a tad.

    --

    I see even classic Slashdot is now pretty much unusable on dial up anymore.

  51. I wish it was so simple. by spacehunt · · Score: 1
    Simplified and Traditional Chinese share a lot of similarities. Even the simplified writings of a particular character often look nearly the same as the traditional one. Thus, the encoding for these two can be unified, only the font bitmap is different.

    You can't do that, since very often a simplified character maps to several traditional characters. Even if you can, it won't be a saving of 50,000 characters, only several thousand at best.
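
    A sketch of why a font swap can't stand in for the mapping. The table holds a handful of well-known one-to-many cases and is nowhere near complete; the names are made up for illustration.

```python
# Simplified character -> traditional characters it can stand for (tiny sample).
SIMPLIFIED_TO_TRADITIONAL = {
    "发": ["發", "髮"],        # "to emit" vs. "hair"
    "干": ["乾", "幹", "干"],  # "dry" vs. "to do" vs. "shield"
    "后": ["後", "后"],        # "after" vs. "empress"
}


def traditional_candidates(ch):
    """Return every traditional form a simplified character may correspond to."""
    return SIMPLIFIED_TO_TRADITIONAL.get(ch, [ch])


for ch in "发干后":
    options = traditional_candidates(ch)
    print(ch, "->", options, "(ambiguous)" if len(options) > 1 else "")
```

    If the two sets shared one code point per pair, rendering 发 with a "traditional font" would have to guess between 發 and 髮 without knowing the word it appears in.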

  52. Quit whining about how hard it is to type by spacehunt · · Score: 1

    Just check out the latest Nokia phones sold in Asia to see how easy it is to type Chinese SMS messages. Or how about the input method they use on those ESDlife terminals around Hong Kong. Both use less than a dozen keys to enter Chinese, with no need for prior training.

  53. Re:Prejudice? Or technical hurdle... by spacehunt · · Score: 1
    The computer industry is still strongest in the US, and most OS software is still written by US-based companies. Why don't some Chinese software developers come up with their own language standard and write a bunch of software with it?

    The same reason that you can't get a name brand PC without Windows preloaded.

  54. Re:Some errors by MaufTarkie · · Score: 1

    Furthermore, anything that can be written in Kanji can be written (phonetically) in either Hiragana or Katakana - the use of Katakana for foreign words is nothing more than custom, not a limitation of the characters.

    Both kanas are derived from the Chinese characters that they "borrowed"; hiragana is the smoothing down of an entire kanji character, while katakana is more or less radicals taken from kanji.

    Hiragana was initially used solely by women and was derived from the "caoshuti" (cursive) forms of Chinese characters in the Heian era (794-1192). It was initially called "onna de", or "women's character".

    Katakana (literally, "side script") is derived from a Chinese character of the same sound, ignoring semantic meaning. It was invented by Kibi no Makibi (AD 693-755). They were initially used as pronunciation aids in Buddhist scripts, but later became verb endings.

    If I'm to believe my last professor, at the end of WWII, this all changed. Hiragana was changed to reflect verb endings and particles, while katakana was reserved for foreign words, usually Dutch, German, and Chinese -- and with the post-war American occupation, English.

    --
    Without you I'm one step closer to happiness without violence.
  55. Re:After some skimming... by Zagadka · · Score: 1

    Yes, his analogy was a bit off. It would be more accurate to say "imagine if English-speakers were restricted to an alphabet which is missing characters like Æ or fi (the ligature)". While Unicode is missing lots of Chinese characters, the vast majority of the characters which are missing are characters that only historians use. One only needs to know about 2000 characters to be considered fluent in Chinese, and if you know 7000-8000 characters, you're way above average.

  56. Re:umm by Zagadka · · Score: 1

    Chinese uses unicode by combinations of roots and the other parts of the characters

    While many of the more complex Chinese characters do consist of simpler radicals used in combination, they're not encoded that way in Unicode. For example, the word "ma" used at the end of many questions consists of the character for mouth and the character for horse. In Unicode, the encoding for "ma" is completely unrelated to the character for mouth and the character for horse though.
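
    This is easy to check from Python. The sketch uses the simplified forms of the three characters (my choice); it prints each code point and confirms that normalization does not break "ma" into its parts.

```python
import unicodedata

ma, mouth, horse = "吗", "口", "马"

for label, ch in (("ma", ma), ("mouth", mouth), ("horse", horse)):
    print(f"{label:5} {ch} U+{ord(ch):04X}")

# Canonical decomposition leaves the ideograph untouched: the standard
# defines no decomposition of "ma" into mouth + horse.
print(unicodedata.normalize("NFD", ma) == ma)
print(repr(unicodedata.decomposition(ma)))
```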

    (i know that they use some other type of syllabic system for teaching the writing system, or so i recall from Chinese lessons on TV... maybe that should be used to replace the pictograph system in place now, a system which was kept by the emperors in order to keep the masses illiterate)

    Just because you can't read Chinese characters, it doesn't mean Chinese people can't.

    The "syllabic system" you're talking about is probably pinyin (or perhaps bopomofo, but it doesn't really matter - they're isomorphic). Converting Chinese text to pinyin actually results in information loss. It isn't a really viable solution. The Chinese people also like their linguistic system, despite what American public schools have taught you.

  57. Re:umm by Zagadka · · Score: 1

    It sounds like it was traditional Chinese with bopomofo annotations. That's apparently a common way of teaching characters in Taiwan.

    Bopomofo is essentially the same as pinyin, except pinyin uses English letters, while bopomofo uses non-roman characters. Mainland China uses pinyin, while Taiwan uses bopomofo. There's a 1-to-1 mapping between bopomofo and pinyin. Chinese is also tonal, so pinyin and bopomofo are also often augmented with "tone marks" or sometimes just numbers. Those would be the accents you saw.

    There's a pretty good page with more info here.

  58. Re:umm by Zagadka · · Score: 1

    When was the last time you met someone who left China and said "thank goodness I don't have to use that pictograph system [sic] those emperors put in place to keep everyone illiterate"? Most Chinese continue to enjoy reading Chinese text long after they've left China and learned to read and write languages with phonetic alphabets.

    And despite their many flaws, you can't really accuse China's communist government of encouraging illiteracy. On the contrary, the communist government in China actually created simplified versions of a large number of the commonly used but complex characters (back in the 50's), and these became the standard character set in mainland China. (That's why there are both "simplified" and "traditional" Chinese characters.) It's also interesting to note that Taiwan, which is much more democratic than mainland China, still uses the traditional character set -- the same character set supposedly used to keep everyone illiterate.

  59. Re:It works by h2odragon · · Score: 2
    No, that makes too much sense; it's not all inclusive so let's trash the whole thing, start over from scratch, and revert to 7bit ASCII in the meantime. We need a system that can handle every glyph that has ever had meaning to somebody, somewhere.

    ...for the sarcasm impaired, the above should be read as "good point".

  60. Re:Alrighty by ajf · · Score: 1

    Have you ever seen an IME? The program a Japanese person would use to enter their 10,000 characters?

    You spell out the word phonetically, and press space as you complete each word - the computer will show possible kanji, and you can cycle through them with the space key.

    So why can't Unicode take this approach, and encode words in a similar phonetic fashion? Nobody expects a codepoint for every word in English, German and French.
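
    For readers who haven't used one, the IME workflow quoted above can be sketched roughly like this. The class, its method names and the two-entry candidate table are all invented for illustration; a real IME uses large dictionaries and context.

```python
# Reading -> kanji candidates (tiny illustrative table).
CANDIDATES = {
    "hashi": ["橋", "箸", "端"],  # bridge, chopsticks, edge
    "kami": ["紙", "神", "髪"],   # paper, god, hair
}


class ToyIME:
    """Cycle through kanji candidates for a phonetic reading, one per space press."""

    def __init__(self, table):
        self.table = table
        self.reading = ""
        self.index = 0

    def enter_reading(self, reading):
        self.reading = reading
        self.index = 0

    def press_space(self):
        # Unknown readings fall back to the reading itself.
        options = self.table.get(self.reading, [self.reading])
        choice = options[self.index % len(options)]
        self.index += 1
        return choice


ime = ToyIME(CANDIDATES)
ime.enter_reading("hashi")
print([ime.press_space() for _ in range(4)])
```

    The same table also shows what a purely phonetic encoding would lose: one reading stands for several unrelated words, and the writer's choice among them is exactly the information the kanji carries.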

    --

    I miss Meept.

  61. Re:umm by amorsen · · Score: 1

    Contrary to popular belief, and contrary to what most people will think after reading the article, Unicode can contain all characters in the world. What the article actually says, in its own roundabout way, is that the UCS-2 encoding of Unicode cannot encode all Unicode characters, and that some of the characters it cannot encode are fairly important.

    I can only agree with that. UCS-2 is a silly idea, and UTF-16 is a bad bandaid for it. UTF-8 is great, and if you really must have an encoding with equally much space used by all characters, use UCS-4. UTF-8 is infinitely extendable and will never run out of characters, not even theoretically. UCS-4 can encode millions of characters; the measly 170,000 characters mentioned in the article do not create a problem for UCS-4.
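
    A short Python check of the encoding sizes involved, using U+20000 (the first code point of CJK Extension B, outside the 16-bit range) as the test character. The helper name is made up.

```python
def fits_in_ucs2(ch):
    """UCS-2 has exactly one 16-bit unit per character, so it stops at U+FFFF."""
    return ord(ch) <= 0xFFFF


ch = "\U00020000"

utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")   # needs a surrogate pair
utf32 = ch.encode("utf-32-be")   # UCS-4 style fixed width

print(fits_in_ucs2(ch))
print(len(utf8), utf8.hex())
print(len(utf16), utf16.hex())
print(len(utf32), utf32.hex())
```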

    The only problem Unicode has, is that Microsoft chose UCS-2 for some important things in Windows and Office. They are fairly alone in that stupidity.

    --
    Finally! A year of moderation! Ready for 2019?
  62. No you didn't read the article, or even think by A+nonymous+Coward · · Score: 2

    You are so euro-centric it's not even laughable. As the article said, those who claim Unicode good enough for the masses are the same foreigners who would scream and howl if someone tried to remove redundancies from the English language such as pork and ham, or argue and dispute, or ...

    I have read that an English language vocabulary of 300 words is good enough for most ordinary conversation. You are claiming the equivalent is good enough for ordinary use. You are mistaken.

    Unicode is a classic case of (western) imperialism, in which the imperialists are completely blinded as to why it is imperialistic, and continue to mutter "it's good enough, and we know what's good for you smelly foreigners."

    1. Re:No you didn't read the article, or even think by WNight · · Score: 2

      Whoops, here comes the racism...

      Anything a white guy wants, or someone who might be a white guy, is wrong, euro-centric, penis-dominated, and wrong.

      Now anything a non-white, non-guy wants wants is automatically right.

      Now a person who is completely anonymous on the internet can be assumed to be white and male if they disagree with anything said by a non-white, non-male, or someone who lives outside of 'europe or north america'.

      You know, there are a lot of reasons for disliking Unicode, and a lot of reasons for not wanting to waste time implementing a system which has 1) grown monstrously beyond original specs and 2) doesn't help you at all.

      IMHO, you should use those ~65K characters and stop your pathetic sniveling. If you want a character set that supports more, make it yourself and get others to use it.

      If you ever want anyone outside of your immediate family to use it, you'll have to make it worth their while.

      What does the other 80% of the world get out of supporting your dead language? Uglier URLs? More bloated OSes? Slower web usage?

      Sure sounds important to me.

      And before you scream "Racist!", ask yourself if you have any proof, or if you're just pissed that I don't agree.

    2. Re:No you didn't read the article, or even think by WNight · · Score: 2

      I'm sure those 65000 characters would better represent any asian language than German or French is represented without their accents, yet speakers of those languages didn't pull this entitlement crap to make people support larger character sets.

      No language is going to be properly represented, especially when you consider ancient forms, so we'll have to accept that nothing is perfect. We've run into diminishing returns and now people want to increase the complexity of the system a hundred-fold just to get some characters that only a thousandth of one percent of the population will ever know are missing, let alone want to use in conversation.

      There's no written language that more than 80% of the world population uses, so I stand behind my original estimate.

      I'm not at all racist in what I say, I'm merely sick of catering to the special interest groups. Especially the special interest groups that claim to be part of a larger group. (In this case, 99.999% of the population of the original subject's country couldn't give a shit about having ancient characters from a dead language in their URLs, it's *his* issue, not theirs, but he's making it seem like a race issue and oppression of the little guy.)

    3. Re:No you didn't read the article, or even think by Troed · · Score: 1
      The proof of your racism lies in your own text ..

      What does the other 80% of the world get out of supporting your dead language?

      He's correct, you wouldn't know how badly 65000 characters express his particular language. Who are you to say he should use them and be satisfied?

      I guess you're American, that would explain everything.

  63. Input devices are much more of an issue by sacherjj · · Score: 1

    Eventually, with much nudging along in the territories of high-resolution color and graphics, better input devices (such as the scanner, which can be thought of a fax machine for computers), better output devices such as the inkjet and laser printer, and even bastardized keyboards and software which could generate thousands of characters - if only one can remember each and every one of the input codes. Graphics tablets eased the pain of having to get something into and out of the computer. But none of this is yet fully satisfactory, and perhaps it will remain in this state until the advent of the intelligent, voice-understanding, "computer" finally comes into our daily lives.

    I recently saw a story about Japanese reporters who send their stories in via a phone call and dictation on the other end, because it is so much faster than trying to get it into the computer for digital transmission. I don't think the common character representation is the only issue here. As the article states, some languages are just much, much harder to digitize.

  64. Re:But for Java by sacherjj · · Score: 1

    It won't work if you try to write your Java in Mandarin... :p

  65. Re:C programs by Luke · · Score: 2

    I'd love to see the linux kernel coded in Python.

  66. What about the artist formerly known as Prince? by mattkime · · Score: 2

    Does the artist formerly known as Prince get his own character space as well?

    Will I need to download a new character set on windows to view it?

    --
    Know what I like about atheists? I've yet to meet one that believes God is on their side.
    1. Re:What about the artist formerly known as Prince? by kevinank · · Score: 2

      Prince is again Prince. He got his name back when the music industry contract that prohibited him from using his own name expired.

      --
      LibBT: BitTorrent for C - small - fast - clean (Now Versio
    2. Re:What about the artist formerly known as Prince? by vidarh · · Score: 2

      I don't know about the status, but I believe it was proposed by someone a while back... :-)

  67. Re:Quit whining and move to a phonetic alphabet by scrytch · · Score: 2

    > The obsession with phonetic spelling is an unhealthy and ridiculous pathology

    Despite the fact that we move inexorably toward it anyway.
    --

    --
    I've finally had it: until slashdot gets article moderation, I am not coming back.
  68. Re:Quit whining and move to a phonetic alphabet by DavidTC · · Score: 1

    I think we should move to Esperanto.

    -David T. C.

    --
    If corporations are people, aren't stockholders guilty of slavery?
  69. Re:Solution - Everybody use Euro-English! by RSevrinsky · · Score: 1
    The sad thing is that this is already the standard in SMS and Instant Messaging....

    - Richie

  70. Re:Well DUH! It's not meant to have every characte by Mike+Buddha · · Score: 2

    Besides, translation software is coming along well enough that soon we will not have to worry about it too much.

    How is this translation software supposed to work if there is no standard for interchange? Magic? How are we supposed to translate these characters that have no symbol for the computers to process?

    There are well over 140,000 language characters on this earth, and there are many yet to have been entered into a computer.

    What makes you think that we can't encode all these characters? Are we going to run out of numbers? A 32-bit number can hold 4 billion different values, and if that isn't enough, we can use a 64-bit number. We certainly aren't going to run out of numbers.
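
    The arithmetic, spelled out as a sketch (the 140,000 figure is the one quoted above):

```python
characters = 140_000

for bits in (16, 32, 64):
    capacity = 2 ** bits
    print(f"{bits}-bit: {capacity:,} values, enough: {capacity >= characters}")

# Smallest fixed width, in bits, that can number every character from 0 upward.
bits_needed = (characters - 1).bit_length()
print(bits_needed)
```

    So 16 bits falls short, while anything from 18 bits up has room to spare.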

    --
    by Mike Buddha -- Someday the mountain might get him, but the law never will.
  71. Re:After some skimming... by K. · · Score: 2

    I suspect Unicode is a lot more upsetting to a "reference writer specializing in rare Taoist religious texts and medical works" than to ordinary Chinese users who want to run Photoshop or put their wedding pictures on a web page.

    Let me get this straight - you think people should be prepared to accept having restricted access to the literature that underpins their culture in exchange for their very own geocities.cn?

    K.
    -

    --
    -- Proud descendant of semi-nomadic cattle-herders.
  72. Re:After some skimming... by BJH · · Score: 1

    Why exclude the Constitution? Let's put everything published before...oh, say 1800... in there. The Gutenberg Bible (in fact, all Bibles preceding the modern versions), Shakespeare, Chaucer, say goodbye to them all!

    I bet you can't even name one major Chinese or Japanese text. Plenty of people still study them in high school or university, just as you'd study Shakespeare. Don't spout crap when you don't know what you're talking about!

  73. Re:After some skimming... by BJH · · Score: 1

    And what is the point of that? You end up with Unicode characters of unequal length, which further complicates the whole problem (actually, these already exist...)

    Part of the problem with the Chinese character set is that it is not a character set so much as a dictionary

    Oh, bullshit. It's a character set just as much as ASCII is.

  74. Re:totally unconvinced by BJH · · Score: 1

    Thank you for deciding how 1.3 billion Chinese, 120 million Japanese and 50-odd million Koreans should write.

    In news today: The Chinese have embarked upon a simplification of the U.S. Constitution, stating that "it's too hard to understand." The result is expected to be declared an ISO standard within the next three years, with adoption by the US expected to be completed by 2005.

    Now go learn something about the languages of which you speak.


  75. Re:Compaction and Traction by BJH · · Score: 1

    Jesus, what is it about this article that attracts a level of cluelessness normally only seen in "IANAL" threads?

    Japan does not have only 2,000 Kanji. Chinese can not be written satisfactorily with 10,000 characters. And how would you like it if the Chinese told you you can't write Shakespeare the way he was meant to be written???

  76. Re:Alrighty by BJH · · Score: 1

    And you obviously have a Western-centric mindset. Not to mention a stunning lack of knowledge of how Japanese, Chinese and Korean are actually used on today's computers.

    Sheesh.

  77. Re:totally unconvinced by BJH · · Score: 1


    I did indeed read your post, including the bit saying, Perhaps some people will have problems using it today. In that case those people should interact with the standards committee instead of whining, and get their characters into the next version.
    What you don't seem to get is that the Chinese, Japanese and Koreans have been trying to get the Unicode Consortium to produce a sensible standard from day one and they simply refuse to do so. Perhaps you should go look up the history of Unicode; there's been a lot of serious discussion and objection among the CJK people that's never made it into the open.

    Now go learn something about how to parse basic English sentences.

    Perhaps you'd like to debate the subject on a Japanese forum? No? I thought not. Excuse me if I don't view monolingualism as some indication of superiority...

  78. Re:After some skimming... by BJH · · Score: 1

    Oh wonderful. So now we're not allowed to do searches, either. Where'd you come up with that bright idea, buddy?

    And in case you didn't know, there are perfectly acceptable ways of inputting as many characters as you like in Chinese, Japanese or Korean. Just because you don't know how to doesn't mean everybody else doesn't.

  79. Re:Quit whining and move to a phonetic alphabet by BJH · · Score: 1

    It's not "whining", you knuckle-dragging moron. Just as Shakespeare is best appreciated in the original English (have you ever read a translation of Shakespeare? Didn't think so...), classical Chinese or Japanese texts are best read in the original. You know why? Because the author can convey subtle nuances and differences in meaning by choosing a particular character over others that have similar meanings.

    Go learn Japanese or Chinese, and try and read a phonetic transliteration of a classical text. See how far you get.

    Sheesh.

  80. Re:Quit whining and move to a phonetic alphabet by BJH · · Score: 1

    *Sigh*. Actually, I have read Hofstadter, but if you want to be difficult, let me qualify my statement: "Have you ever read a translation of Shakespeare into a non-European language?"

  81. Re:Compaction and Traction by BJH · · Score: 1

    There are _slightly_ more than 2000 kanji in Japanese, but Japanese printers, like my wife's father, don't use more than 2100 absolute tops.

    Funny, my Postscript printer sitting beside me can do about 6,500 Kanji, in pretty much any Japanese font available.

    Chinese characters obey Zipf's law on a near perfect logarithmic scale. As in, the first ten characters make up about 60% of written text.

    For someone who's supposed to have worked with Chinese dictionaries, that's an awfully strange claim to make. Unless you meant the first thousand characters, not the first ten.

  82. Re:Some errors by BJH · · Score: 1

    It was a course meant to take non-Japanese speakers from zero to college level in one year. Of course, that's impossible, but you can get by until you find your feet.

  83. Re:Nonsense by BJH · · Score: 1

    You titled your reply "Nonsense", but...

    1) You say that there are, indeed, problems with the conversion tables. When you're trying to convert megabytes of electronic documents, a problem that may seem "small" to you seems much bigger, believe me.

    2) You admit that code unification has several disadvantages, and then say that the distinctions "the Japanese" wanted were ported over. Ummm... have you been to a Japanese discussion on Unicode issues? The only Japanese who are satisfied with the current standard are those who were paid by the Unicode Consortium to put their rubber stamp on it.

    3) So why are there so many different encodings for Unicode, if UTF-8 is so great? Oh, by the way, the problem is that Unicode doesn't allow many people to encode their languages fully.

    So, if I may ask, how was my post nonsense?

  84. Some errors by BJH · · Score: 5

    Hiragana, which is somewhat cursive, can be used to augment Kanji - in fact, everything in Kanji can be written in Hiragana. Katakana, which is much more fluid in appearance than is Hiragana, is used to write any word which does not have its roots in Kanji, such as the many foreign words and ideas which have drifted into general use over the centuries.

    In actual fact, Katakana is much more angular than Hiragana - definitely not "fluid" in appearance. Furthermore, anything that can be written in Kanji can be written (phonetically) in either Hiragana or Katakana - the use of Katakana for foreign words is nothing more than custom, not a limitation of the characters.

    Thus is can be said that Hiragana can form pictures but Katakana can only form sounds...

    That should probably read "Kanji can form pictures but Hiragana/Katakana can only form sounds..."

    Romaji is used to try and keep the whole written thing from getting out of control, with most Western concepts and necessary words being introduced into the language through this mechanism.

    Bollocks. Romaji is hardly ever used (except for advertisements, and then only rarely, or textbooks for foreigners). It's definitely not the main conduit for Western ideas.

    After a time these words (even though they will still maintain their "Roman" form for awhile longer) will become unrecognizable to the people they were originally borrowed from, such as the phrase, "Personal Computer," which is now "PersaCom" in Japan.

    Again, this is incorrect. Words don't *have* a Roman form in everyday use; sure, you can express them in Romaji but no-one ever does. As for "personal computer", the correct Romanization is "pasokon", not "PersaCom". (Where did he get that from?!)

    The rest of the 1,950 have to been memorized fully by the time of graduation from high school in Grade Twelve. Please remember that this total is only the legal minimum required threshold to be considered literate. And this is to be absorbed completely, along with a back-breaking load of other subjects.

    Ummm... that's actually not too hard. I (along with everyone else at my language school) memorized more than 1300 Kanji in less than a year... and none of us were Japanese. I know it must seem like an impossible total to people used to ASCII, but there are many common points between Kanji that simplify the learning process greatly.

    That said, I've long been against the current Unicode "standard", as have many technical people in Japan, for a number of reasons. Some of those are:

    - No standard conversion tables from existing character sets (SJIS, EUC-JP, ISO-2022-JP).
    Several conversion tables do exist, but there are minor differences between them that make it impossible to go from, say, SJIS to Unicode and back to SJIS without the possibility of changing the characters used.

    - A draconian unification of CJK characters.
    The Unicode Consortium basically forced the standards bodies in China, Japan and Korea to unify certain similar Kanji onto single code points, which doesn't allow for cases where, say, Japanese actually has two or three distinctive writings that are used in different situations.

    - The ugly "extensions".
    Unicode has been effectively ruined as a method of data exchange by its treatment of characters not in the 60,000-character basic standard.

    I could go on, but I should get some sleep...

    1. Re:Some errors by salyavin · · Score: 1

      Nice post, I'd mod you up if I had the points.

      There is one small correction though. Katakana are not only used for foreign words but also for emphasis, kinda like italics. I confirmed this with my teacher, who remembers the war and the changes in the language since.

    2. Re:Some errors by ProfBooty · · Score: 1

      1300 in one year? Wow, that's a pretty intensive course! I assume you studied in Japan? When I was there for school I picked up a lot more from walking around on the street than from class.

      Then again, I didn't have much impetus to study more characters once I started using a kokugojiten (Japanese-Japanese dictionary for non-Japanese-speaking people) and kanji dictionary.

      Too bad I have forgotten so many, guess I will have to start studying again for the JLPT.

      --
      Bring back the old version of slashdot.
  85. Re:Yes, it is by Delphis · · Score: 1

    Well most people's English is better than a lot of the rest of the world's Chinese - in Europe and the USA. I still think all character sets should be supported, but the internet is, as of now, mainly used by nations whose primary language is English, and even where it isn't the primary language (e.g. continental Europe), people are mostly fluent in English.
    You could say 'Well, those lazy-ass Americans/Brits etc. they should go learn another language' .. it's not going to happen though. Spanish is more likely to take over than Chinese at any rate.

    --
    Delphis

  86. Re:Prejudice? Or technical hurdle... by Delphis · · Score: 1

    Finally, imagine that a political body imposes a deadline on imported programs.. that they must support their new standard by such-and-so a date or it won't be permitted within the country. The Chinese did this, extending the deadline to Sept. 2001. I only found out about this yesterday.


    The Chinese lately seem to just be trying to piss everyone off as much as possible.

    Granted, people whining about prejudice when they don't understand the technical reasons behind it doesn't help at all. It's sad really. These people need to grow up and understand that large changes don't happen immediately, and that if the state of affairs is how it is then it doesn't necessarily mean that anyone is trying to 'oppress' them. Take a fucking pill, people. There's such a bandwagon for being the 'oppressed people' that people who feel begrudged immediately assume this without THINKING.

    ARrrrrghh.. Can't we all just get along?

    --
    Delphis

  87. Re:Esperanto, Ido, lojban; BCE by Detritus · · Score: 1

    A.D. stands for Anno Domini, Year of our Lord in English. The problem is that for most of the world's population, Jesus is not "our lord".

    --
    Mea navis aericumbens anguillis abundat
  88. printing [ Chinese ] vs dictionary [ Chinese ] by peter303 · · Score: 2

    It's a lot like the Oxford English Dictionary
    versus Webster's Collegiate - Chinese printers have
    gotten by with 7-10K characters versus the 60-80K
    in the full language. Synonyms and homonyms are
    used for the more obscure words. The standard
    modern Chinese dictionaries only have this smaller
    number of characters.

  89. Re:Solution - Everybody use Euro-English! by Gulthek · · Score: 1

    So that world in Planetfall was actually Earth of the future? :-)

    SEENIK VISTA
    Xis stuneeng vuu uf xee Kalamontee Valee kuvurz oovur fortee skwaar miilz uf xatfaamus tuurist spot. Xee larj bildeeng at xee bend in xee Gulmaan Rivur iz xee formur pravincul kapitul bildeeng.

  90. totally unconvinced by kaisyain · · Score: 2

    One of the author's main propositions seems to be that Communist Chinese and Taiwanese/Overseas Chinese want different spaces in Unicode for the same characters.

    I don't see every Western nation asking for its own encoding of "w" or accented characters. The author doesn't give any explanation for why we should pay attention to IMHO silly political whining in this particular case.

    The author further implicitly assumes that it is reasonable to include the deprecated K'ang Hsi characters in addition to the official characters, but gives no justification for this view. I don't see unicode trying to include all possible historical graphings of Western characters.

    1. Re:totally unconvinced by vidarh · · Score: 2
      One of the reasons they want different glyphs is that the characters actually look different in present day use.

      As for including all possible historical versions of Western characters, there are very few that are sufficiently different from present day renderings to be easy to confuse.

      But I agree that his criticism is mostly whining. Most of all because Unicode 3.1 has shown that unicode absolutely is not a static standard, but one that is evolving to encompass more characters on a regular basis. Perhaps some people will have problems using it today. In that case those people should interact with the standards committee instead of whining, and get their characters into the next version.

      But for most people (including most Chinese and Japanese people) the current Unicode standard will be comprehensive enough for most use.

    2. Re:totally unconvinced by vidarh · · Score: 2
      Did you actually read my post? I explained why there is a legitimate request for different versions of similar characters among the CJK glyphs. I also suggested that people who need the missing characters work to add them.

      Finally, however, I did suggest that to most people using Chinese, Japanese and Korean, the current set of 94,140 characters, of which about 65,000 are there for the benefit of Chinese, Japanese and Korean, would be sufficient.

      I did not write anything to imply that no one would run into limits. I did not write anything to imply that people who do run into limits should accept that (hence my suggestion that they work to have the characters they need accepted in forthcoming revisions of the standard).

      However I do stand by my claim that 94,140 characters will be enough for most people most of the time, including people using Chinese, Japanese and Korean.

      Now go learn something about how to parse basic English sentences.

    3. Re:totally unconvinced by trash+eighty · · Score: 1

      it's not the same thing though, and in thousands of cases the characters are not the same

    4. Re:totally unconvinced by trash+eighty · · Score: 1

      indeed, i have heard criticism about unicode from asians for years, this is not news at all. hey u westerners, CJK scripts are different you see

  91. Re:Solution - Everybody use Euro-English! by sharkey · · Score: 2

    Das rubbernecken sightseenen keepen das cotten picken hands in das pockets, so relaxen und watchen das blinkenlights.

    --
    "Outlook not so good." That magic 8-ball knows everything! I'll ask about Exchange Server next.
  92. There's no hard limit at 65k anyhow by boots@work · · Score: 1

    Mr Goundry seems to be ignoring the fact that UTF-8, which for many purposes is the most useful representation of UNICODE characters, can compatibly and simply represent the full ISO-10646 character space of 2^31 characters, just by using more of the reserved prefix bits. I can't see what his objection is to this.

    Plane zero (the first 65k characters) is supposed to be enough for most ordinary people speaking modern languages. The additional codeplanes exist to satisfy the valid needs of linguists studying archaic or obscure languages, and our author is welcome to use them.

    65k characters seems like a reasonable limit for ordinary use: going beyond that is going to require a whole new level of complexity in font representation, character entry, and so on. Even being able to display all those characters only gets you halfway: you have to take into account ligature, composition, kerning and layout rules, and it's not at all clear many program authors will find this worthwhile for obscure dialects.

    The elegance of UNICODE is that it offers a smooth migration path away from ASCII through UTF-8, captures a great majority of uses in plane zero, and can expand to handle more obscure cases.
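    To make the "2^31 characters" point concrete, here is a rough Python sketch of how many bytes a value needs under the UTF-8 scheme as originally specified (RFC 2279), which went up to 6 bytes and 31 bits. The function name and the table are made up for illustration; note that present-day UTF-8 is restricted to 4 bytes and stops at U+10FFFF.

```python
# Largest value that fits in 1..6 bytes under the original UTF-8 design
# (7, 11, 16, 21, 26 and 31 payload bits respectively).
# Hypothetical helper for illustration; modern UTF-8 only uses the first four.
_LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]


def original_utf8_length(value: int) -> int:
    """Return how many bytes the original 1-to-6 byte UTF-8 scheme needs."""
    if value < 0:
        raise ValueError("value must be non-negative")
    for num_bytes, limit in enumerate(_LIMITS, start=1):
        if value <= limit:
            return num_bytes
    raise ValueError("value does not fit in 31 bits")


for value in (0x41, 0xFFFF, 0x10000, 0x7FFFFFFF):
    print(hex(value), original_utf8_length(value))
```

    Everything in plane zero fits in at most 3 bytes; anything beyond it needs 4 or more.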

  93. Re:Solution - Everybody use Euro-English! by augustz · · Score: 1

    Funny funny.... why does this post get lamed?

  94. Re:You bring up a good point by gleam · · Score: 5

    The writing system with the smallest alphabet that is in current use is Hawaiian, with 12 letters. (aeiou hklmnpw) source

    A good source for your obscure questions is, as always, the Straight Dope, which answers the "Chinese Typewriter" question here.

    Regards,
    gleam

    --
    this .sig is not a .sig.
  95. Re:another drawback of unicode by Ole+Marggraf · · Score: 1

    Actually, someone thought about hieroglyphs in UCS (this was mentioned in the quickies section some time ago):


    http://anubis.dkuug.dk/jtc1/sc2/wg2/docs/n1637/n1637.htm


    I don't know whether it is/will be implemented at the end. Looking at the limited character space, probably not.

    --
    God, root, what is difference? - Pitr
  96. Re:Quit whining and move to a phonetic alphabet by dutky · · Score: 2

    The obsession with phonetic spelling is an unhealthy and ridiculous pathology: to understand why, have a look at Justin B. Rye's Spelling Reform page (subtitled And the Real Reason It's Impossible).

  97. Re:UTF-8 should be fine for almost any application by dutky · · Score: 3

    Of course, even if you could get China, Taiwan, Japan and Korea to agree on a unified character encoding similar to the ISO-Roman character set (where identical or analogous characters in the different alphabets shared the same character code) you would still need more than 50,000 encodings just for the unified Asian character set.

    I can see good reasons why languages using similar alphabets should have overlapping encodings, but this is probably better solved by providing translation tables between related alphabets than by forcing multiple alphabets to share a single encoding. While I may be able to write the French coup de grâce in the English alphabet as coup de grace, something has clearly been lost. Other European languages are even worse, even those that nominally use the Roman alphabet! Then there are questions of alphabetization between different languages and the questions of whether or not accented letters correspond to each other or to the unaccented letter.

    Call me a purist, but I think it would actually be much easier if we just had distinct representations for each language and had to perform some kind of mapping to display one language in another language's alphabet.

  98. Re:I'll take that challenge by bears · · Score: 1

    There's a couple of well-known palindromes:

    Snug & raw was I ere I saw war & guns.
    Lewd I did live & evil did I dwel.

    Imagine these in a list of palindromes. Substitute the & and their meaning remains, but their palindromeness - and hence reason for inclusion in the list - is gone.

  99. Re:Did you even read the article ? by jholder · · Score: 1

    Actually, if you have fonts that render the UniHan in a locale-sensitive way, it works perfectly. The UniHan in Japan show the kanji with the proper (culturally) way to draw them with a Japanese font, the UniHan in China show the hanzi correctly culturally, although you need different fonts for Simplified and Traditional, and similarly for Korea.

    It is not so different from using different type faces - i.e., in old Germany, most things were printed in a gothic-style font that was culturally correct, whereas printing in other countries at the same time used more "roman" based typesetting. However, the old German gothic 't' was unified with the Latin 't' - are we complaining? It is a similar issue.

    --
    -- John
  100. Re:unicode does *not* encode 65,536 characters by jholder · · Score: 2

    This is incorrect, 44,946 surrogates were approved in March as part of Unicode 3.1.0.

    Unicode 3.1 and 10646-2 define three new supplementary planes:

    Supplementary Multilingual Plane (SMP) U+10000..U+1FFFF (1594 chars)
    Supplementary Ideographic Plane (SIP) U+20000..U+2FFFF (43,253 chars)
    Supplementary Special-purpose Plane (SSP) U+E0000..U+EFFFF (97 chars)

    Or plane 1, 2, and 14. (from the Unicode 3.1 Technical report, #27)
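    For anyone wondering how the ranges above map to "plane 1, 2, and 14": the plane number is just the code point shifted right by 16 bits. A small Python sketch (the function name is invented for illustration):

```python
def plane_of(code_point: int) -> int:
    """Return the Unicode plane (0-16) that contains code_point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    # Each plane holds 0x10000 code points, so the plane is the high bits.
    return code_point >> 16


# First code point of each supplementary plane listed above.
for name, start in [("SMP", 0x10000), ("SIP", 0x20000), ("SSP", 0xE0000)]:
    print(name, plane_of(start))
```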

    --
    -- John
  101. Re:UTF8 by lordpixel · · Score: 1
    no. not only 8 bits.

    Variable length encoding, 8, 16 or 24 bits depending on how common the character is

    --

    Lord Pixel - The cat who walks through walls
    A little bigger on the inside than out

  102. Re:UTF8 by lordpixel · · Score: 1

    Indeed, but the character numbers were initially ordered in something resembling common usage. At least, all of the 1 byte characters correspond (very closely or exactly) to latin-1 encoding, of which the first half corresponds to ascii, of course.

    This is a pretty good assumption about probability of occurrence on today's Internet; I won't try to predict the future :)


    --

    Lord Pixel - The cat who walks through walls
    A little bigger on the inside than out

  103. Classical specialists need special tools by redvine · · Score: 1

    The writer of this article has a clear bias, being a reference writer specializing in rare Taoist religious texts and medical works. He certainly doesn't seem concerned about encoding Egyptian hieroglyphs or Sanskrit. Or what about mathematical or chemical symbolic systems?

    If the web handled day-to-day writing that would be pretty remarkable. Classicists specializing in arcane texts of certain eras may have to install special plug-ins to be able to quote from original arcane works. Why should the day-to-day system be burdened with ancient languages that one small group of specialists use?

    1. Re:Classical specialists need special tools by Another+MacHack · · Score: 1

      How is it a burden to the day-to-day system? Just don't use the characters you don't need to use. The varying width UTF encodings use fewer bytes for characters closer to the beginning of the code-space, and if that is a bad match to your frequency usage, there's always compression.

  104. Re:Unicode != UCS-2 by divbyzero · · Score: 1

    Wrong. You're confusing UCS-2, the encoding, with UCS, the character repertoire.

    To spell it out, "Universal Character Set - Two Byte Encoding" (UCS-2) is one of many encodings which can represent the "Basic Multilingual Plane" (BMP) subset of the "Universal Character Set" (UCS) character repertoire.


    --
    But my grandest creation, as history will tell,
    Was Firefrorefiddle, the Fiend of the Fell.
  105. Re:After some skimming... by WNight · · Score: 3

    Oh gawd, just listen to the feelings of entitlement in that message...

    You want the ability to search through some insanely large character set, so to do so you're willing to force everyone else to make their communications much less efficient just so you can have a free ride.

    You know, it's not a coincidence that the western world (using small variations on the roman character set) pretty well invented modern technology. It's only about a thousand times easier to process a smaller and simpler alphabet.

    There's a reason we don't use prose to command computers: until all cheap desktop models come with the ability to understand natural language, a stripped-down and unambiguous command set will be more efficient.

    I've got a lot of characters I'd find handy if we were to implement a new standard, and I'd want to expand into basic pictograms (standard symbols, etc) as well. Now I realize this isn't interesting to other people, so I'm not going to jump up and down and shout "Racist" just because people aren't anxious to bloat a new standard just to appease me. If I want those features I'll make my own font and make it available with any works that I produce which would require it.

    In short, grow up, the world does *not* owe you anything. If you want it, do it yourself instead of crying when someone else doesn't.

  106. Re:Helping the poor by mattsouthworth · · Score: 1

    but "helping the poor" is an imperialist structure, it just is. It's imperialist to think that these 'poor' people (more often than not impoverished because of imperialist actions, say, building a dam to flood their farmland, whoa, sorry, I'll try to keep my own bias under control...) want to be 'rich' in a Western sense. Before you can help someone, you have to ask him or her what would be helpful - ya know? It's wrong to assume that someone would want to learn English and make American dollars, maybe they would just like to get their farmland back and live the way they had for the last 1,000 years.

  107. Why do we have UTF-8, UTF-16 and UTF-32? by mcdurdin · · Score: 1

    There are a lot of misguided or uninformed comments about Unicode here.

    Just FYI, the Unicode mailing list have already read and dismissed the claims of this document -- the document has a lot of factual errors. For instance, Unicode 3.1 supports 1,000,000+ characters, not ~90,000.

    The first thing to remember is that The Unicode Standard itself is really just a list of characters associated with codepoints.

    Not all the Unicode codepoints have been allocated yet -- some of them never will be. Space has been left with most alphabets to encode other characters discovered or new characters.

    In raw form, these code points run from 0x0001 to 0x10FFFF, in 17 planes of ~65536 code points each. A few of these characters are reserved, such as 0xFFFF. That makes over 1,000,000 characters, which should be enough for anyone. (Although 640K may come to mind for some...)

    All the UTF-* encoding forms are just ways of representing these characters. ALL of them support the full range of Unicode codepoints.

    UTF-32 represents the codepoints exactly in 4 octets.

    UTF-16 represents the codepoints in 2 octets for plane 0 (0x0001-0xFFFF), and in 4 octets for the remaining planes, using 'surrogate pairs'. Surrogate pairs are two UTF-16 codes, the first from the range D800-DBFF, the second from DC00-DFFF, encoding code points (UCP) in planes 1-16 as follows: UCP = (surr1-0xD800)*0x0400 + (surr2-0xDC00) + 0x10000.
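    The surrogate-pair formula above is easy to try out. Here is a minimal Python sketch of it, plus the inverse; the function names are my own and not taken from any library.

```python
def decode_surrogates(surr1: int, surr2: int) -> int:
    """Combine a high/low surrogate pair into a code point in planes 1-16."""
    if not (0xD800 <= surr1 <= 0xDBFF and 0xDC00 <= surr2 <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Same formula as in the text above.
    return (surr1 - 0xD800) * 0x0400 + (surr2 - 0xDC00) + 0x10000


def encode_surrogates(code_point: int) -> tuple[int, int]:
    """Split a code point in planes 1-16 into its high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in planes 1-16")
    offset = code_point - 0x10000          # 20-bit value
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


high, low = encode_surrogates(0x20000)     # first code point of plane 2
print(hex(high), hex(low), hex(decode_surrogates(high, low)))
```

    The two 10-bit halves give 2^20 = 1,048,576 code points, which is exactly planes 1 through 16.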

    UTF-8 is very clever. It manages to encode European codepoints in an average of 1.1 bytes. If the top bit of the octet is 0, it represents a standard ASCII character. Go read the Unicode website (http://www.unicode.org/) if you want to know more about it.
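    A quick way to see the "top bit is 0 means ASCII" property is to ask Python's built-in UTF-8 encoder. This is only a sketch; the sample characters are arbitrary picks of mine.

```python
def utf8_bytes(ch: str) -> bytes:
    """Encode a single character with Python's built-in UTF-8 codec."""
    return ch.encode("utf-8")


def is_single_ascii_byte(encoded: bytes) -> bool:
    """True if the encoding is one byte whose top bit is 0."""
    return len(encoded) == 1 and (encoded[0] & 0x80) == 0


# An ASCII letter, an accented Latin letter, a kanji, and a plane-2 code point.
for ch in ["A", "\u00e9", "\u65e5", chr(0x20000)]:
    encoded = utf8_bytes(ch)
    print(f"U+{ord(ch):04X}", len(encoded), is_single_ascii_byte(encoded))
```

    Every byte of a multi-byte sequence has its top bit set, so ASCII text passes through unchanged and can never be mistaken for part of another character.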

    There's also UTF-7 for the truly insane. And UTF-8s as used by Oracle and so on...

    The Unicode Standard is not perfect -- what is? -- but it is definitely the only standard out there that even comes close to approaching the goal of supporting all the world's characters.

  108. Re:Quit whining and move to a phonetic alphabet by HenryFlower · · Score: 2
    But you wouldn't say that if you could read ancient Greek (I assume that's what you meant, not modern Greek). If you could, you would be happy that there need be no longer a half-dozen idiosyncratic methods for encoding ancient Greek, with equally idiosyncratic input methods. All you are really saying is: if I don't need it no-one does. And I don't see how a character set allowing faithful encoding of Greek characters and diacritics places any special burdens on you, who don't need to use them....

    Or perhaps you are being very subtly sarcastic? Or trolling?

  109. This is so wrong by jfedor · · Score: 2

    The author of the article and the guy who submitted the story clearly don't have a clue about Unicode. Unicode can encode over one million characters, as stated here.

    Unicode may have its problems, but this is not one of them.

    -jfedor

  110. Duh. Duh. by Ranger+Nik · · Score: 1

    it should be pretty obvious that unicode is for people who speak that language.
    unicode's purpose is not to teach everyone how to read japanese or to make google's search engine read it (the latter could be done, though). it doesn't try to solve the problem of different languages on the planet.

    as an english speaker, i find it pretty convenient that i can type and use english characters on my computer. as opposed to, say, japanese kanji - even though kanji is not all that hard to learn. what a concept... my own characters! wow!

  111. Re:umm by Ranger+Nik · · Score: 1

    not true. the purpose of Unicode is to have one unique number (unique... maybe that's where the name is coming from) for each character on the planet.
    e.g. all of chinese, korean, english (doesn't make a big dent), arabian (whoops - dunno what it's really called), etc has to fit in there.

    there are not unicodes for every language. there is only one for all of them. i think the reason was to make things more simple.

    ... they should have used 32 bits to begin with...

  112. Compaction and Traction by JJ · · Score: 2

    The 64,000 should suffice. Ideographic scripts, like Chinese, are where the problem arises. The number of characters in Chinese is not fixed, unlike the number in most alphabets. I have a Chinese novella which was written in just 300 characters. 10,000 would be a good place to start, a few thousand more would cover all but specialized texts. Japanese could fold into Chinese, since there are only 2000 kanji characters and a few hundred kana.
    Throw in Arabic, Cyrillic, Sanskrit, Dravidian, Hangul (Korean) and Navaho and you still add only a few thousand. The odd European characters (the 'ss' in German, the extra Danish vowels, . . .) add a few hundred tops. Even the special linguist marks and punctuation don't add much.
    If you have to double the Chinese, now you run into trouble. It's classical characters vs. simplified. The latter is for the PRC. If you also bloat the number of characters required so that specialized religious characters are required, now you start to push the system. 64K would be fine if a special marker character could be used which signifies that the next character is from the special table. Unicode has resisted this effort.

    --
    So long and thanks for all the fish . . . !!!
    1. Re:Compaction and Traction by JJ · · Score: 2

      You know, this is actually the one topic that I am probably best versed on discussing. My info sci masters advisor was on the committee which established ASCII and my linguistics masters was on medieval Chinese dictionaries. Plus, I used to live in Japan.
      There are _slightly_ more than 2000 kanji in Japanese, but Japanese printers, like my wife's father, don't use more than 2100 absolute tops.
      Chinese characters obey Zipf's law on a near perfect logarithmic scale. As in, the first ten characters make up about 60% of written text. Each factor of ten up from that includes about 60% of what is left. At 10,000 characters you have all but about 2.5% of most newspaper text. The few thousand extra that I spoke of covers mostly proper names.
      Chinese most certainly can be written satisfactorily in this manner.
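      The arithmetic behind those figures can be checked with a few lines of Python. This only reproduces the claim as stated (60% for the first ten characters, then 60% of the remainder for each further factor of ten); it is not measured against any real corpus, and the function name is made up.

```python
def coverage(decades: int, share_per_decade: float = 0.60) -> float:
    """Fraction of text covered by the 10**decades most common characters,
    assuming each factor of ten covers share_per_decade of what is left."""
    if decades < 0:
        raise ValueError("decades must be non-negative")
    return 1.0 - (1.0 - share_per_decade) ** decades


for decades in range(1, 5):
    print(f"{10 ** decades:>6} characters: {coverage(decades):.2%} covered")
```

      Under that assumption, 10,000 characters leave 0.4^4 = 2.56% uncovered, which is where the "about 2.5%" comes from.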

      --
      So long and thanks for all the fish . . . !!!
    2. Re:Compaction and Traction by topham · · Score: 1
      You'll have difficulty entering Shakespeare as it was originally written.

      The standard keyboard doesn't have the appropriate characters.

      (Not to say you CANNOT do it, but rather the 26 characters on your keyboard are not actually enough.)

    3. Re:Compaction and Traction by egomaniac · · Score: 1

      Japanese *has* folded into Chinese, as has Korean. It's the CJK block (Chinese, Japanese, Korean). There are two blocks, Traditional and Simplified. Additional characters could be encoded in Plane 1 (as I've pointed out elsewhere in this thread).

      The CJK block, IIRC, already has ~27,000 characters in it. About 42,000 of the Unicode codepoints are already assigned.

      And your comment about "Unicode has resisted this effort" is worse than clueless - this is already implemented as surrogate pairs, which enable about a million more characters.

      This is worse than an IANAL thread ... we need IKANAU (I Know Absolutely Nothing About Unicode) disclaimers.

      --
      ZFS: because love is never having to say fsck
  113. Re:Quit whining and move to a phonetic alphabet by scruffy · · Score: 2
    Did you actually read what you linked to? Justin Rye is very sympathetic to "spelling reform", but he realizes it is utopian:
    The flaws of the standard orthography are indefensible - but it has an extensive Installed User Base, and can thus afford to ignore criticism in exactly the same manner as Fahrenheit thermometers, QWERTY keyboards, and certain software packages, which can all rely on conformism, short-termism, and sheer laziness for their continued survival.
  114. Quit whining and move to a phonetic alphabet by scruffy · · Score: 3
    Phonetic writing is one of the greatest inventions of mankind. All a speaker needs to be literate is to learn the mapping between sounds and letters. Could anything be easier?

    But like companies who still maintain their legacy software written in Cobol and who knows what else, countries and cultures hold onto their legacy alphabets, despite all their disadvantages, and despite all the moaning and groaning about education, literacy, and how hard it is to type 10,000 characters on a 100-key keyboard.

    I agree there is a serious problem of understanding texts written in the "old way". There is a simple solution here, too, i.e., we just translate what's most important to the "new way" and let scholars work on the texts that don't get translated. Before anyone gets too hot here, the situation is not that much different than translating literature from one language to another. It is too much work to translate everything that is written in English into French, so one focuses on the texts that are important enough for translation.

    Also, English has a lot of problems here, as it is mostly phonetic, but a large percentage is not, large enough to make learning English a lot more difficult than say learning Spanish.

    I realize this is way too utopian. We Americans can't even move to metric, much less anything more "radical". I just needed to respond to the whining.

    1. Re:Quit whining and move to a phonetic alphabet by borkbork · · Score: 1

      I think Mark Twain had a great response to this:

      European officials have often pointed out that English spelling is unnecessarily difficult; for example: cough, plough, rough, through and thorough. What is clearly needed is a phased programme of changes to iron out these anomalies. The programme would, of course, be administered by a committee staff at top level by participating nations.

      In the first year, for example, the committee would suggest using 's' instead of the soft 'c'. Sertainly, sivil servants in all sities would resieve this news with joy. Then the hard 'c' could be replaced by 'k' sinse both letters are pronounsed alike. Not only would this klear up konfusion in the minds of klerikal workers, but typewriters kould be made with one less letter.

      There would be growing enthusiasm when in the sekond year, it was announsed that the troublesome 'ph' would henseforth be written 'f'. This would make words like 'fotograf' twenty persent shorter in print.

      In the third year, publik akseptanse of the new spelling kan be expekted to reash the stage where more komplikated shanges are possible. Governments would enkourage the removal of double leters whish have always been a deterent to akurate speling. We would al agre that the horible mes of silent 'e's in the languag is disgrasful. Therefor we kould drop them and kontinu to read and writ as though nothing had hapend. By this tim it would be four years sins the skem began and peopl would be reseptive to steps sutsh as replasing 'th' by 'z'. Perhaps zen ze funktion of 'w' kould be taken on by 'v', vitsh is, after al, half a 'w'. Shortly after zis, ze unesesary 'o' kould be dropd from vords kontaining 'ou'. Similar arguments vud of kors be aplid to ozer kombinations of leters.

      Kontinuing zis proses yer after yer, ve vud eventuli hav a reli sensibl riten styl. After tventi yers zer vud be no mor trubls, difikultis and evrivun vud find it ezi tu understand ech ozer. Ze drems of the Guvermnt vud finali hav kum tru.

      --
      ---- There is a fine line between sayings that make sense.
    2. Re:Quit whining and move to a phonetic alphabet by Another+MacHack · · Score: 1
      Why not use a system of multiple (standardized) character sets. This is extendable, you can always add new encodings/sets etc. The only really fixed mechanism needed is a way to specify a switch from one encoding to the other.

      How is this better than a system with one huge character set, and then per-language character encodings which are more efficient for each language? It sounds to me like what you describe is essentially equivalent to the process of adding to the ISO 10646 character space from time to time, and having a bunch of different character encodings, like ISO Latin 1, Windows CP 1252, Mac Extended ASCII, OEM CP foo for language bar, etc.

      The major advantage to having a consistent character encoding is that programs (like, say, an email gateway) can much more easily process documents in any language or subset of the universal alphabet, without having to know anything about the encoding used. Code pages are okay if you never need more than 256 characters at a time.
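      A small Python sketch of why a gateway has to know the code page. It decodes the single byte 0xE9 under four of the 8-bit sets mentioned above, using codec names from Python's standard library; the choice of byte is arbitrary and only for illustration.

```python
# One byte, four "extended ASCII" code pages: the meaning of 0xE9 depends
# entirely on which code page the reader assumes.
raw = bytes([0xE9])
code_pages = ["latin-1", "cp1252", "cp437", "mac_roman"]

decoded = {name: raw.decode(name) for name in code_pages}
for name, ch in decoded.items():
    print(f"{name:>9}: U+{ord(ch):04X} {ch}")

# With a single universal character set the byte sequence is unambiguous:
# the UTF-8 bytes for U+00E9 decode to the same character everywhere.
utf8_bytes = "\u00e9".encode("utf-8")
print("utf-8 bytes:", utf8_bytes, "->", utf8_bytes.decode("utf-8"))
```

      A program that only forwards or stores the bytes needs an out-of-band label for the first case, but not for the second.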

    3. Re:Quit whining and move to a phonetic alphabet by si1k · · Score: 1

      Actually, it should be noted that just as the alphabetic system is optimised for speed of input (writing or typing) and to some degree for speed of learning (questionable), the Chinese/Japanese/Korean system of ideographs is optimized for *reading speed*. So, forgetting the difficulty for highly literate cultures of switching over to a phonetic system, there is also the fact that alphabets just aren't innately superior.

      Research in Japan showed that people read characters faster than syllabaries, the Japanese equivalent of alphabets. It's just like the reason that weird spellings in English increase reading speed, by making it easier to distinguish between words.

      The real issue *is* a technical one. We simply have to learn how to better store characters (developing an encoding scheme that harmonizes with other character sets, as Unicode claims to) and more importantly, how to improve input methods for C/J/K characters. But it's not a flaw in the system.

      The point of view of many westerners that the characters are too complex and hard to learn is very much equivalent to the Chinese perspective that conjugating verbs is unneeded complexity.

      As a matter of fact, Chinese characters are very easy to learn because of the degree to which the character itself reflects its meaning. I can look at a character and know it's a type of bird without even knowing the character beforehand.

    4. Re:Quit whining and move to a phonetic alphabet by Baki · · Score: 2
      Same goes for the original classic texts of our 'western civilization' in Greek and Latin.

      Does that mean that everyone should be able to read/write Greek and Latin? Or should everyone learn Hebrew to read the bible?

      Reading classical texts IMO has no relevance for a character set of today. The waste (difficult input/output methods, waste of space and processing speed) in comparison with the occasional gain (being able to process classical texts on modern computers by everyone) just isn't worth it.

    5. Re:Quit whining and move to a phonetic alphabet by Baki · · Score: 2
      Of course the west doesn't have to decide for China or Japan. They can make their own judgement, and see for themselves whether it is worthwhile and cost-effective to maintain an old and complex character system today.

      But those that don't have the need of daily reading/writing such characters shouldn't be forced to "suffer" in terms of waste of memory and processing speed (2 or even 4 bytes per character). In that sense UTF-8 (a variable-width encoding of Unicode that needs only 1 byte for ASCII characters, much like ISO Latin-1 does) is a reasonable alternative.

      What I don't understand is why all possible characters of the world should be in 1 big character set. I know it simplifies some things, but it also costs a lot.

      Why not use a system of multiple (standardized) character sets. This is extendable, you can always add new encodings/sets etc. The only really fixed mechanism needed is a way to specify a switch from one encoding to the other.
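      For a rough sense of the memory cost discussed above, here is a hedged Python sketch that measures a few sample strings in three Unicode encodings. The sample strings and the helper name `encoded_sizes` are made up for illustration; the codec names are the standard ones in Python.

```python
# Byte counts for the same text in three Unicode encodings. UTF-8 spends
# one byte on each ASCII character and more on everything else; UTF-16
# and UTF-32 spend two and four bytes on each of these characters.
def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes for each encoding."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


samples = {
    "English": "Unicode works",
    "German": "Gr\u00fc\u00dfe",
    "Chinese": "\u7edf\u4e00\u7801",
}
for label, text in samples.items():
    print(f"{label:>8} ({len(text):2} chars): {encoded_sizes(text)}")
```

      On these samples the mostly-ASCII text pays no penalty in UTF-8, while the Chinese sample is smaller in UTF-16 than in UTF-8.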

    6. Re:Quit whining and move to a phonetic alphabet by rkent · · Score: 1
      But like companies who still maintain their legacy software written in Cobol and who knows what else, countries and cultures hold onto their legacy alphabets, despite all their disadvantages

      Interesting my ass. This is a troll if ever there was one. Yes, what a good point, let's conflate entire cultural traditions into a banal comparison to US corporate "progress." I'm sure you'd appreciate it if your cultural treasures were abandoned as "legacy" just because there was a more efficient alternative.

      ---

    7. Re:Quit whining and move to a phonetic alphabet by potifar · · Score: 1

      Phonetic spelling is really a good thing, but it most definitely cannot replace traditional spelling. Using phonetic spelling would for example mean that Chinese from different parts of the country no longer could read each others' texts. It would also mean that British English, U.S. English, Australian English etc etc would all be spelt differently. Not very convenient.

    8. Re:Quit whining and move to a phonetic alphabet by molybdenum · · Score: 1

      All a speaker needs to be literate is to learn the mapping between sounds and letters. Could anything be easier?

      Actually, yes; the way we currently do things. Using a phonetic alphabet as the sole method of writing could perhaps be easy for speakers in the same area, but what about others? I live in the Midwest, and I know that I don't pronounce the word "pie" the same way as someone from the South. Granted, English spelling leaves something to be desired, but the fact that it can be understood by English speakers all over the world is what matters.

      Ben

    9. Re:Quit whining and move to a phonetic alphabet by 2Bits · · Score: 1
      Yeah, anyone who wrote this kind of crap, obviously, knows only one language. If you have ever tried to learn another language that is not derived from Latin or Anglo-Saxon, you wouldn't write shit like this.

    10. Re:Quit whining and move to a phonetic alphabet by trash+eighty · · Score: 1
      ah but tonal languages like chinese are not served by phonetic alphabets very well. there are thousands of homophones in chinese.

      for example : yi can mean clothing, one, depend on, doctor, he, she, move, will... and quite a few more too

      so what do you think will be easier to use? a phonetic yi or different unique characters for each meaning. i dunno about you but i find pinyin very hard to understand.

  115. Unicode for aliens by WyldOne · · Score: 1
    My god you people are whining about only 170000 symbols to represent language. Boy will you be surprised when you have to speak Altarian - some of the symbols are math formulas, some can't be produced by the human vocal system, some are gestures, and some are transmitted telepathically.

    And on the lighter side. IMHO it would be better to code the document with a tag to represent the symbol set and not worry about being able to stuff all of the world's symbols in a single symbol set. Although the fight to be language '1' would be interesting. BTW, we have lost so many languages/symbol representations already.

    --

    make Linux, not Microsoft. sin(beast) = -0.809016994374947424102293417182819
  116. Nordic Runes? by reverse+solidus · · Score: 2
  117. Re:More Flamebait :) by tommyk · · Score: 1

    It would end with Tengwar, as mentioned above. A "made up" language that is truly phonetic, designed by a linguist. You can put any language into it, and people who don't know the language can still "read", or at least pronounce, what is written.

    Of course, actually using it for that purpose is as popular as writing Esperanto in it.

  118. Re:You bring up a good point by Another+MacHack · · Score: 1

    The article did make the claim that Hangul was "designed ... to be able to describe any sound the human throat and mouth is capable of producing in speech".

    Hangul is missing representations for a variety of trills, fricatives, pharyngeal sounds, uvular and glottal sounds, clicks, and a variety of other sounds not present in Korean speech.

  119. Re:ASCII stupidity all over again... by Another+MacHack · · Score: 1

    Perhaps not, but there is such a creature as "Extended ASCII". Many in fact. It would be wrong to call them standards (except possibly de-facto standards) but equally wrong to suggest they don't exist.

  120. Re:Some Article by Another+MacHack · · Score: 1

    You seem to be suffering from a number of misconceptions. First, although Kanji may be entered phonetically, there is then a procedure by which the input method allows the user to specify which of the (generally numerous) Kanji which have that pronunciation was supposed to be entered. This can either be entered using the Roman alphabet or Hiragana. However, what appears on the screen when all is said and done isn't just phonetic, and with good reason; there is otherwise a lot of ambiguity, far more than results from homonymy and homophony in English.

    It would be just as jarring to a native speaker of Japanese to try to read Japanese purely in Hiragana as it would be for a native English speaker to read something written in a phonetic alphabet. Furthermore, a lot of meaning is lost, since there are so many "homophones".

    As for your assertion that it's only CJK which prevent a universal character set in 256 code points, you seem to be forgetting Cyrillic, Greek, Thai, Arabic, Hebrew, and any number of other languages which use completely different alphabets.

    The UniHan controversy is best viewed as a philosophical disagreement over whether the various language variants are different characters (controversial) or just different glyphs (uncontested).

    By way of analogy, there is a pretty good mapping between the Greek alphabet and ours, but that doesn't mean it would be trivial for most English speakers to read English which had been transliterated into Greek. It's more than the difference between the letter "A" in Times New Roman vs Arial.

  121. Re:"Extended ASCII" misnames ISO-8859-1 by Another+MacHack · · Score: 1

    There are any number of extended ASCII character sets. ISO Latin-1 is one, but there are many others: any of the ISO Latins, the DOS extended ASCII set, the Mac OS character set. Hence "Many in fact".

  122. Re:A Plan for the Improvement of English Spelling by bgarcia · · Score: 1

    I didn't mean to imply that the post was bad in any way. I thought it was pretty good, but it was obviously inspired by Twain, so I thought others should read the Master Troll.

    --
    I'm a leaf on the wind. Watch how I soar.
  123. A Plan for the Improvement of English Spelling by bgarcia · · Score: 4

    Go read the original story here, by Mark Twain.

    --
    I'm a leaf on the wind. Watch how I soar.
    1. Re:A Plan for the Improvement of English Spelling by tswinzig · · Score: 2

      Actually, I think the slashdot post was a lot funnier... he managed to convert english to german by the last year...

      --

      "And like that ... he's gone."
  124. Re:After some skimming... by robosmurf · · Score: 1

    What this post (and the article) seem to not grasp is that the simple 2 byte encoding is not the only way to encode the unicode character set. It is not even a particularly good way of encoding the character set. ;)

    Other encodings give you far more than 2^16 characters.

    In fact, using the surrogate system you can get over a million characters into even the common 2-byte encoding. (I'm assuming this has not changed in the later specs, I've only got the Unicode 2.0 spec to hand.)
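    A minimal Python sketch of the surrogate arithmetic mentioned above, assuming the usual UTF-16 ranges (high surrogates from 0xD800, low surrogates from 0xDC00, 1,024 values each). The function names are made up for illustration.

```python
# Surrogate pairs: 1,024 high surrogates times 1,024 low surrogates give
# 1,048,576 extra code points, U+10000 through U+10FFFF.
HIGH_START, LOW_START = 0xD800, 0xDC00


def to_surrogates(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)


def from_surrogates(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


high, low = to_surrogates(0x20000)
print(f"U+20000 -> {high:04X} {low:04X}")
print("extra code points:", 1024 * 1024)
```

    The pair printed for U+20000 is the same sequence Python's own `utf-16-be` codec produces for that character.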

    Unicode does seem to have been well thought out, and these sorts of problems have been anticipated.

    There does seem to be a huge amount of misinformation and misunderstanding in the article and this discussion. However, this is probably not helped by the Unicode standard not being freely available (as far as I can tell).

  125. Re:Perl in Hierogliphics by CharlieG · · Score: 2

    Gee, I thought it already exists - they call it APL

    --
    -- 73 de KG2V For the Children - RKBA! "You are what you do when it counts" - the Masso
  126. Re:Is this really such a problem? by revscat · · Score: 2

    Firstly, many cultures are still too poverty-stricken to have electricity and running water, let alone net access. For these people, the thorny issue of whether Unicode has the capacity to represent their native language is totally irrelevant.

    It's totally irrelevant for poor rural populations, true. But as more and more of the world's population moves towards being centered around urban areas this is indeed relevant. It is relevant to those who desire the full functionality of the Internet in their native character set. I believe (and this is a belief, not a fact) that one way to help out those who are poor is by opening them up to the modern economy and make it as accessible as possible. One way to do this is by making sure they can use the latest technology in their native tongue, lowering the slope of the learning curve.

    Secondly, the rate at which languages are dying is still accelerating. Every year, we lose several languages as native speakers die of old age without their descendents having ever learned their original language.

    This is indeed tragic, but it quite simply cannot be helped. It's so common as to be a cliche: "Life Sucks", or "Shit Happens", or even "C'est la vie." I hope that there are linguists and philologists who are archiving these languages for future generations and our general cultural awareness. BUT: People must eat, and they have a strong desire to make themselves and their families prosperous. If, when all things are considered, making sure that you live your life only speaking language X turns out to be counterproductive, then that language will become less important. There have been many languages that have come and gone throughout the millennia; humanity continues to advance. Would the world be a richer place if all those languages were still around? Certainly. But it would also be more confusing. And remember: If people can speak to each other, there is less of a chance they'll start killing each other. (LESS of a chance, mind you.)

    I'm a Taoist at heart in matters such as this. For every yin, there is a yang, for every good, there is a bad. Life goes on.

    - Rev.
  127. Re:ASCII stupidity all over again... by gorilla · · Score: 2

    ASCII is, and always was, a 7 bit standard, which encoded 95 printable characters and 33 control codes. 'high-ascii' just does not exist, and never did.
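    The counts above are easy to check. A short Python sketch, assuming the usual split of the 7-bit range: 0x20 through 0x7E printable (space included), everything else control.

```python
# ASCII is a 7-bit code: 128 values in total.
codes = range(128)

# Printable characters run from space (0x20) through tilde (0x7E).
printable = [c for c in codes if 0x20 <= c <= 0x7E]
# Control codes are 0x00-0x1F plus DEL (0x7F).
control = [c for c in codes if c < 0x20 or c == 0x7F]

print(len(printable), "printable,", len(control), "control")
print("bits needed for the largest code:", max(codes).bit_length())
```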

  128. Re:I'll take that challenge by Tower · · Score: 1

    Funny, it's always looked like more of a backwards 's' with a backslash to me, but hey, whatever floats your boat.
    --
    "It's tough to be bilingual when you get hit in the head."
  129. It works by Kohath · · Score: 2

    Imperfect != "does not work"

    1. Re:It works by TommyW · · Score: 2

      True, but "does not work" implies "imperfect."

      The point the article is making is that this system cannot be made to work for everybody at once.

      So you either put up boundaries, and have systems that work perfectly, but only within those boundaries, or you need a system with wider scope at the outset.

      --
      Too stupid to live.
      Too stubborn to die.
  130. technical critique by lucentshoe · · Score: 1

    Unicode is not a "16-bit character definition". Unicode is a "character coding system" for assigning code points to abstract characters. i'll hereby suggest that the author of this piece has confused Unicode itself with one of the encoding forms of Unicode, that is, ways that characters are expressed as bitstrings. please to shoot this down.

    a "character coding system" (drawing on http://www.unicode.org/ and my copy of the standard 3.0 here) is a system for assigning characters to code points. Unicode 3.1 assigns some 94,000 odd characters, and the roadmap for allocations (start at http://www.unicode.org/pending/pending.html) will assign more in the future. these assignments are just that: an abstract character to an integer value in the Unicode repertoire. this assignment does not dictate how to represent the character as data in any way.

    There are a variety of encoding forms of Unicode, each for ways of representing characters in the repertoire as data (not at all "on screen", that's glyphs, and that's a whole other issue). The different encoding schemes have different strengths and weaknesses. UTF-16 is a form that uses fixed-width 16-bit sequences as the base unit (though through a concept known as Surrogates, two such scalars adjacent to each other can represent a value normally not expressible with just 16 bits). UTF-8 is a different form that uses a variable number of 8-bit sequences to represent characters. There is a UTF-32 form, a UTF-EBCDIC form, believe it or don't. These are just encoding forms, they make no restrictions on what or how many characters get assigned. If the Unicode Consortium wanted to assign abstract characters to values that exceed the limits of current encoding forms, we could certainly do something about that, but it isn't the horrible catastrophe the author makes it out to be.
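    To illustrate the difference between a code point and its encoding forms, here is a small Python sketch. It prints the bytes that three of the forms named above produce for the same characters; `encoding_forms` is a made-up helper and the sample characters are arbitrary.

```python
# One abstract character, one code point, several byte representations.
def encoding_forms(ch: str) -> dict:
    """Map each encoding form to the hex bytes it produces for one character."""
    return {
        name: " ".join(f"{b:02x}" for b in ch.encode(name))
        for name in ("utf-8", "utf-16-be", "utf-32-be")
    }


# U+0041 (Latin A), U+4E2D (a CJK ideograph), U+20000 (outside the 16-bit range)
for ch in ("A", "\u4e2d", "\U00020000"):
    print(f"U+{ord(ch):04X}", encoding_forms(ch))
```

    The code point stays the same in every row; only the serialized bytes change with the form.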

    this is just the thing that leaps out at me. thoughts?

    1. Re:technical critique by dot11 · · Score: 1

      Unicode was originally designed as a pure 16-bit encoding, aimed at representing all modern scripts. (From the FAQ at www.unicode.org.)

      Support of characters outside the primary 16-bit plane was added merely as an afterthought. UTF-16 is not limited to 16 bits/char, but it won't work unless many programs properly support surrogate pairs. Which I think is unlikely, since it would be a rarely used feature.
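      A hedged Python sketch of what goes wrong when a program treats every 16-bit unit as one character. The helper `utf16_code_units` and the sample string are made up for illustration.

```python
# A program that assumes "one 16-bit unit = one character" miscounts and
# can cut a surrogate pair in half.
def utf16_code_units(text: str) -> list:
    """Return the UTF-16 code units of `text` as integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


text = "a\U00020000b"                 # three characters, one outside the 16-bit plane
units = utf16_code_units(text)

print("characters:", len(text))
print("code units:", [f"{u:04X}" for u in units])

# Naively keeping "the first two characters" by slicing code units leaves
# a lone high surrogate at the end, which is not a valid character.
naive_prefix = units[:2]
print("naive prefix:", [f"{u:04X}" for u in naive_prefix])
```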

  131. Re:Wrong, wrong! by csbruce · · Score: 2

    or you want to scan the string backwards

    UTF-8 can indeed be scanned backwards. You could also locate the start of the current character given a random pointer into a byte buffer. RTFM. UTF-8 can also directly encode 2 billion characters. UTF-8 is the right general solution to data interchange, and this is why it's catching on.
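    A minimal Python sketch of the backward scan, assuming the input is well-formed UTF-8. It relies on the fact that continuation bytes always have the bit pattern 10xxxxxx; the function names are made up for illustration.

```python
# UTF-8 continuation bytes look like 10xxxxxx, so from any position you
# can step back until you reach a byte that is not a continuation byte.
def char_start(buf: bytes, pos: int) -> int:
    """Index of the first byte of the character that contains buf[pos]."""
    while pos > 0 and (buf[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


def iter_backwards(buf: bytes):
    """Yield the characters of a well-formed UTF-8 buffer, last one first."""
    end = len(buf)
    while end > 0:
        start = char_start(buf, end - 1)
        yield buf[start:end].decode("utf-8")
        end = start


data = "a\u00e9\u4e2d".encode("utf-8")   # 1-byte, 2-byte and 3-byte characters
print(char_start(data, 4))               # a pointer into the middle of the last character
print(list(iter_backwards(data)))
```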

  132. Re:Mrrp, wrong by Morgo · · Score: 1
    Sorry, but you're wrong, as is that FAQ.

    ISO 10646 != Unicode, and UCS-1 != UTF-8.

    UCS allows 31-bit character codes, Unicode however only allows up to 0x10FFFF, which is a little over 2^20. UCS characters may occupy up to six bytes, but according to this page, "All three encoding forms [UTF-8, UTF-16, UTF-32] need at most 4 bytes (or 32-bits) of data for each character."

    BTW, it's important to recognise the difference between scalar values, which are the numerical values assigned to characters, and encodings (UTF-8 and so on), which are just ways of encoding those scalar values with different levels of memory efficiency, ease of parsing etc. Every encoding covers the same range of scalar values (ie. all of them).
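    The numbers above can be checked in a few lines of Python. This is only a sketch: it takes the 0x10FFFF ceiling from this comment and uses Python's built-in codecs; `round_trip_lengths` is a made-up helper.

```python
# The Unicode scalar range tops out at 0x10FFFF: 17 planes of 65,536
# values, a little more than 2**20.
MAX_SCALAR = 0x10FFFF
total = MAX_SCALAR + 1
print(total, "=", total // 65536, "planes;", "2**20 =", 2 ** 20)


def round_trip_lengths(scalar: int) -> dict:
    """Encode one scalar value in each form, check it decodes back, return byte counts."""
    ch = chr(scalar)
    lengths = {}
    for enc in ("utf-8", "utf-16-be", "utf-32-be"):
        data = ch.encode(enc)
        assert data.decode(enc) == ch   # every form recovers the same scalar value
        lengths[enc] = len(data)
    return lengths


print(round_trip_lengths(MAX_SCALAR))
```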

    (Unfortunately the official Unicode standard is only available in dead tree form, so it's kinda hard to give relevant links...)

  133. Re:Overstating and misunderstanding the problem by Morgo · · Score: 1
    This is exactly the issue with the Chinese characters. For a given character, there might be a difference between the Taiwanese way of writing it, the Japanese way, and the mainland Chinese way; but the character is still recognized as being the same, despite these presentation-level differences.

    I'd like to make a couple more points about this: suppose the characters have been encoded by arrogant know-nothing westerners, and it were the case that logically distinct characters have been unified, the real solution is to get onto the committee and have new characters assigned. If, as the article suggests, around 170,000 code points are genuinely needed, then fine - Unicode can handle that many.

    If there are any characters not included in Unicode, it's because it doesn't need them, or it doesn't support them yet. As has been pointed out by many posters, characters are being added all the time. Just last month, version 3.1 of the standard was published which just about doubled the number of assigned characters over 3.0.

  134. Re:You bring up a good point by ncc74656 · · Score: 2
    The plural of dish is dishes, but the plural of fish is fish

    "Fishes" is also a valid plural form of "fish." "Fishes" refers to a group of different species, while the plural "fish" refers to a group that is all of the same species. The plecostomuses (sp?) and cichlids in my tank at home are fishes; the trout in a pond are fish.

    (Your point that English has tons of rules and even more exceptions to those rules still stands, though.)

    "A bunch of bananas" or "a group of individuals", are these plural or singular?

    A bunch and a group are both singular, though some Brits would disagree (their usage used to treat a group as a plural object ("and the crowd are going wild!"), but that is starting to change in more recent usage).

    --
    20 January 2017: the End of an Error.
  135. Well DUH! It's not meant to have every character by stienman · · Score: 2

    Japanese alone learn some 50,000 symbols before they leave their 5th year of schooling. Unicode was never meant to hold one spot for every character. It was meant to be used as a set of code pages much like ascii was. But it had to be larger than 256 to hold a reasonably representative set of one language at one time (such as Japanese, or Chinese (two dialects), etc).

    Most documents consist largely of one language, so you start the document by stating the code page you're using. Very few documents need more than one set of 65,536 characters, but you can intersperse sets if needed.

    But the idea of having one universal character set is ludicrous. There are well over 140,000 language characters on this earth, and there are many yet to have been entered into a computer. Sure, we could use 4 bytes per character, but is it really necessary? Absolutely not! Talk about inefficient. The only case where that would be more efficient than code pages is when the majority of documents extensively use more than 64k characters within each document.

    Besides, translation software is coming along well enough that soon we will not have to worry about it too much.

    -Adam

    This sig 80% recycled bits, 20% post user.

  136. What about Klingon? by sometwo · · Score: 1
    Unicode even has Klingon in its standard. What's next: that unnamed Star Wars language? We are definitely going to have to change to 64 bits to get all of those Sci Fi languages onto our computers.

    I hate it when I browse a Star Trek web site and I can't read that Klingon.

  137. Re:After some skimming... by Old+Wolf · · Score: 1

    What he meant was that Chinese characters are not an attempt at representing Chinese sounds.

  138. Re:All Character sets simultaneously?? by Old+Wolf · · Score: 1
    You may want to write a comparative linguistic document, or perhaps write a manual that includes a glossary for various languages.

    Here is an example of such a page:
    http://wolf.project-w.com/chess/pieces.html

    Of course, to view this your browser will need to support Unicode encoding, and have the appropriate Unicode fonts.

    I have also created a test page for various operating systems and browsers to view Unicode text: here.

    My opinion on this debate? When loading this page, I didn't expect to see 75% of it being Americans saying, why doesn't everyone use English (!) A better solution, IMO, would be to pick a character encoding that can a) write all possible characters with a LOT of redundancy (who would ever need 2^31 IP addresses?), and b) not take up too much storage space for simple / common characters (I don't want to use 1K to write one sentence in a 4-byte charset).
    Then, this encoding should be verified with all governments and, pending acceptance, made an ISO standard.

  139. Re:Overstating and misunderstanding the problem by PylonHead · · Score: 1

    I guess I am misunderstanding you.

    How do you intend to encode 170,000 possible characters in 2 bytes worth of data?

    Are you suggesting that unicode use additional bytes, or is there already an "escape" code which allows multicharacter encoding?

    Isn't multicharacter encoding what unicode was meant to eliminate?

    --
    # (/.);;
    - : float -> float -> float =
  140. Never mind.. question answered elsewhere (nt) by PylonHead · · Score: 1

    nt

    --
    # (/.);;
    - : float -> float -> float =
  141. Don't be fooled by olevy · · Score: 1

    I worked in Japan for 4 years as a programmer, and I am somewhat fluent in Japanese. I also was project lead for a commercial Japanese computer dictionary.

    Don't be fooled, this is not a technical article, but instead a political rant. I've talked to some of the designers of Unicode, and they tried to be very, very respectful of Asian wishes. But all of the nations of Asia refuse to cooperate on most things including character encodings.

    While I worked in Japan, officials claimed that Japan could not import beef because Japanese have evolved longer intestines and therefore can't properly digest red meat. This is laughably not true, but many Japanese still believe it. They also claimed that ski gear couldn't be imported because Japanese snow (hence the laws of physics) is fundamentally different. And of course they claim they couldn't possibly import another character set because their characters are unique.

    One thing that was never mentioned in the article was the difference between a glyph (how a character looks) and what it means. So for example the letter "A" set in one typeface and "A" set in another are the same character but they have different glyphs.

    What the unicode designers did was to identify all of the unique characters in all the mainstream languages. It turns out that Japan, Taiwan, Korea and China share a large number of characters. This should not be terribly surprising because all of these countries directly imported their characters from China. What does differ from country to country is how those characters are represented.

    A lot of Asians seem to really hate the idea of using the same code point for the same characters mainly, I think, because they don't really like the idea of sharing *anything* with the other countries. It is a political and cultural thing, not a technical thing. From a technical point of view it is just gross to think of assigning multiple codepoints for the exact same character.

    Which is not to say that Unicode is perfect. Unicode, for better or worse, solves just the problem of encoding the unique characters, it has very little to say about the font problem. That however is still a wonderful thing to solve -- you no longer have to worry about losing meaning, at worst the characters might end up looking a little funny.

    And Unicode works. In creating our Japanese dictionary we were forced to use Shift-JIS (one of the Japanese standards), and it was just horrible because there were so many Chinese characters outside the standard Shift-JIS encoding that we needed. Unicode would have greatly simplified the problem for us.

  142. Re:too sinocentric, but Unicode has problems by olevy · · Score: 2

    I worked as a programmer in Japan for 4 years, and I've also done several projects in Unicode.

    There are a couple of things I would like to point out:

    >>Japan and Korea get no benefit from Unicode. In fact, their ISO 2022 encodings are at least in "alphabetical order" for the relevant alphabets. Unicode is just a jumble.

    I can't speak for Korean, but there is no such thing as an alphabetic order for Kanji. In Japanese, Kanji almost always have at least two pronunciations, and often more.

    >>The Japanese hate Unicode. If you bother to ask them, which the web did not, you find a loud and impolite dislike for Unicode. The Japanese want their ISO 2022 solution, aka shift-JIS.

    Have you ever tried to program in shift-JIS? It is horrific. Basically they mix one-byte and two-byte characters. The problem is that if you jump into the middle of the string there is no way to know if you are looking at a one-byte character or the second byte of a two-byte character. You also can't tell the number of characters in a string simply by looking at the length. It is a *terrible* standard.
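
    To make the mid-string ambiguity concrete, here is a minimal sketch. It assumes Python 3 and its bundled "shift_jis" codec; the sample string is my own choice, not something from the comment above.

```python
# Illustrative sketch only, using Python's bundled "shift_jis" codec.
text = "表A"                       # one two-byte kanji followed by ASCII "A"
data = text.encode("shift_jis")    # bytes 0x95 0x5C 0x41

# The kanji's second byte is 0x5C, the same value as an ASCII backslash,
# so a byte looked at in isolation cannot be classified.
trail_byte = data[1]
looks_like_backslash = trail_byte == ord("\\")

# Byte length and character count disagree.
byte_length = len(data)
char_count = len(text)

print(data.hex(), looks_like_backslash, byte_length, char_count)
```

    The only reliable way to know what data[1] is would be to scan from the start of the string, which is exactly the complaint.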

  143. Re:After some skimming... by tytso · · Score: 2
    Now, I'm not Chinese so my opinion counts for little here, but my impression is that Unicode isn't nearly as controversial as he makes it out. His analogy "To express it in Western terms, how would English-speakers like it if we were suddenly restricted to an alphabet which is missing five or six of its letters because they could be considered "similar" (such as "M" and "N" sounding and looking so much like each other) and too "complex" ("Q" and "X" - why, they are the nothing more a fancier "C" and an "Z")." ignores the fact that Chinese orthography has a tradition of simplification and variants. I suspect Unicode is a lot more upsetting to a "reference writer specializing in rare Taoist religious texts and medical works" than to ordinary Chinese users who want to run Photoshop or put their wedding pictures on a web page.

    Actually his original analogy was flawed and designed to yank people's chain... A better analogy of what's going on would be to say that the Germans and the French wanted to have their own Unicode code point for the letter "A", since obviously the German A is very different from the French A. Repeat for all the letters in the alphabet. The excuses for saying that the German "A" should have a different value than a French "A" are (a) the Germans and the French hate each other, and (b) the French tend to use a sans-serif font. When told by the standards committee that font issues were independent of Unicode assignment, the response was that this was obviously anti-European imperialism....

    That's basically what's going on here with the folks who are complaining about Han Unification. Many Asian languages are descended originally from Chinese, just as many European languages are descended from Latin and Germanic roots. So it's not surprising that the systems of orthography share a lot in common. The difference is that each Asian country refuses to share any codepoints with any other Asian country, because They Hate Each Other, and there seems to be some widespread belief that doing so would somehow be causing their national language to lose face.

    As someone who's Chinese, I think I can safely say to those people who like to bitch and moan about Han Unification..... Grow up!

  144. Unicode & a lot of characters by kune · · Score: 1

    First, UNICODE gives an advantage only to the US and English-speaking countries, because UTF-8 is compatible with ASCII (American Standard Code for Information Interchange).

    Germany has the problem that UNICODE uses the same code points as ISO 8859 Latin 1, but UTF-8 encodes all characters above code point 127 differently.
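
    A small sketch of that mismatch, assuming Python 3's standard "latin-1" and "utf-8" codecs; the letter "ü" is just a convenient example.

```python
# Illustrative sketch only: same code point, different bytes on the wire.
ch = "ü"                           # U+00FC in both Latin-1 and Unicode
latin1 = ch.encode("latin-1")      # one byte: 0xFC
utf8 = ch.encode("utf-8")          # two bytes: 0xC3 0xBC

# The code point number equals the Latin-1 byte value...
same_code_point = ord(ch) == latin1[0]

# ...but reading UTF-8 bytes as if they were Latin-1 gives two wrong letters.
mojibake = utf8.decode("latin-1")

print(latin1.hex(), utf8.hex(), same_code_point, mojibake)
```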

    UNICODE 3.1 differentiates between code points and encodings. There is a range of encodings: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE and UTF-32LE. This way it is possible to extend the code space to the 170,000 code points requested by the author of the article. At least one more supplementary plane has to be used. It should be noted that the policy of UNICODE forbids the assignment of the same character to several code points unless there already exists a standard defining the characters. So it might be that fewer than 170,000 code points are necessary.
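
    The code point / encoding split can be sketched like this, assuming Python 3's standard codecs. U+20000 is my own pick of a code point outside the first 65,536; any supplementary-plane character would do.

```python
# Illustrative sketch only: one code point, several byte encodings.
ch = chr(0x20000)   # a code point in supplementary plane 2

encodings = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]
encoded = {name: ch.encode(name) for name in encodings}

for name, raw in encoded.items():
    print(f"{name:10} {raw.hex()}")

# Every encoding decodes back to the same single code point.
round_trips = all(raw.decode(name) == ch for name, raw in encoded.items())
```

    The bytes differ in every row, but the character being described is the same one each time.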

    UNICODE made the mistake of assuming that 16 bits are enough to encode all the characters of the world. But they have corrected that mistake without breaking the current implementations, by introducing code point and encoding semantics.

    I believe that UNICODE is the best thing we have now, and everybody criticising UNICODE should make proposals for improvements or another scheme. As far as I understand, the definition of a worldwide unique character code is far from easy.

  145. Re:Alrighty by hernick · · Score: 2

    Have you ever seen an IME ? The program a Japanese person would use to enter their 10,000 characters ?

    You spell out the word phonetically, and press space as you complete each word - the computer will show possible kanji, and you can cycle through them with the space key.

    It actually works pretty well. Their keyboards pretty much look just like ours.

  146. Re:UTF-8 should be fine for almost any application by AdamBa · · Score: 1
    You are a purist!!

    I don't mean to have the a-accent-grave in French be mapped to a plain English a. Certainly you should keep the accents. But do the g and r and c and e need to be different characters?

    Mapping these things is a nightmare. Imagine someone writing say a BIOS, or something else with limited storage and code, who wants to display a startup message...do they need to store a separate Unicode message for each language just so they can say "Dell Computer" or whatever properly for each one? And then you have to store all those glyphs for multiple fonts, so you would probably wind up mapping them all back on top of each other. Gack.

    - adam

  147. UTF-8 should be fine for almost any application by AdamBa · · Score: 2
    The purists who want 4-byte characters go beyond just wanting to allow 50,000 Kanji or insisting that Japanese and Chinese Kanji with the same stroke pattern not share the same character. They want a separate character for the English lower-case 'e', the French lower-case 'e', the German lower-case 'e', etc. This is not at all necessary. YES, there may be some Kanji that fall out of use if the set listed in the Unicode standard becomes the only one used, but you have to counter that with the fact that suddenly these languages can have a universally-recognized way to encode them, as opposed to the 5 or whatever ways that previously existed to encode Japanese (which all had limited character sets anyway).

    UTF-8 is very nice because 7-bit characters encode as one byte. Also it is defined so there won't be a NUL or a hex 1B (decimal 27 -- the ASCII escape character) anywhere in the data stream unless the text itself contains one, not even in the second or third byte of an encoded character. So it will generally be passed through correctly by programs expecting straight 8-bit ASCII. UTF-8 is also encoded and decoded via a trivial algorithm, as opposed to the DBCS used in Windows which needs lookup tables.
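
    That property can be checked by brute force. The sketch below assumes Python 3 and simply encodes every non-ASCII code point; the function name is made up for illustration.

```python
# Illustrative sketch only: collect every byte value that ever appears
# inside a multi-byte UTF-8 sequence.
def bytes_used_by_non_ascii():
    seen = set()
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:      # surrogates cannot be encoded in UTF-8
            continue
        seen.update(chr(cp).encode("utf-8"))
    return seen

seen = bytes_used_by_non_ascii()
lowest = min(seen)

# No NUL, no ESC, in fact nothing from the 7-bit ASCII range at all.
print(0x00 in seen, 0x1B in seen, hex(lowest))
```

    So a NUL or ESC byte in a UTF-8 stream can only mean that the text really contains U+0000 or U+001B.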

    One negative of UTF-8 is that Unicode characters at 0x800 or above (using more than 11 bits) encode in UTF-8 as 3 bytes, not 2 as in 16-bit Unicode. I think that range includes things like some Indian written languages (Arabic, at 0x0600, still fits in 2 bytes). But I think that tradeoff is worth it.
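
    A quick size comparison, assuming Python 3's standard codecs. The four sample characters are my own picks; as far as I know the UTF-8 thresholds are one byte below U+0080, two bytes below U+0800, and three bytes up to U+FFFF, so Arabic letters (around U+0600) take two bytes while Devanagari (around U+0900) takes three.

```python
# Illustrative sketch only: bytes per character in UTF-8 versus UTF-16.
samples = {
    "Latin e": "e",                # U+0065
    "Arabic alef": "\u0627",       # U+0627
    "Devanagari a": "\u0905",      # U+0905
    "CJK one": "\u4e00",           # U+4E00
}

utf8_len = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
utf16_len = {name: len(ch.encode("utf-16-be")) for name, ch in samples.items()}

for name in samples:
    print(f"{name:14} utf-8: {utf8_len[name]}  utf-16: {utf16_len[name]}")
```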

    - adam

    1. Re:UTF-8 should be fine for almost any application by Keick · · Score: 1

      Actually, there is a separate Arabic space character (shared with the Hebrew space as well). This character is sometimes called the non-breaking space character. Microsoft script processors, such as in W2K, use this character to ensure correct word order rendering. The use of a standard space will really screw up the word order on W2K.

  148. Oh dear - error ridden by orblee · · Score: 1
    This article, although pointing out something I didn't know - that of there being 170,000 characters in existence on this planet, is wrong about Unicode. Firstly, there is now a UTF-32 standard which should be able to deal with just about everything, but that aside, there are more than 65,536 possible characters in UTF-16.

    Out of those 65,536 possible values, 2,048 are reserved as surrogates so that we can use pairs of 16-bit words to extend the set. A pair of words reaches over a million additional characters, which will do the lot.

    Okay, the writers of Unicode may have been slightly short-sighted, but also they probably considered the problems of using a 32-bit character set and decided against it (and 24-bit for that matter). They have added an extending property to the 16-bit Unicode standard and that should cope with much. I don't know how the Chinese/Japanese/Korean population deal with their HUGE character sets now (you couldn't have a keyboard big enough) but they must have a shorter, simpler method of coping with everyday data input. Surely, this pairing of UTF-16 words will do?
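
    For reference, a sketch of the surrogate arithmetic as I understand UTF-16: 2,048 16-bit values (0xD800 to 0xDFFF) are set aside, and one high/low pair reaches 1,024 x 1,024 further code points. The helper name is made up, and Python 3's "utf-16-be" codec is used only as a cross-check.

```python
# Illustrative sketch only: how a code point above U+FFFF becomes two 16-bit words.
HIGH_START, LOW_START = 0xD800, 0xDC00

def to_surrogates(cp):
    """Split a code point in U+10000..U+10FFFF into a (high, low) surrogate pair."""
    offset = cp - 0x10000                      # 20 bits remain
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)

reserved = 0xDFFF - 0xD800 + 1                 # 16-bit values set aside
reachable_by_pairs = 1024 * 1024               # code points one pair can address
total_code_points = 17 * 65536                 # 17 planes of 64k

high, low = to_surrogates(0x20000)
by_hand = high.to_bytes(2, "big") + low.to_bytes(2, "big")
by_codec = chr(0x20000).encode("utf-16-be")

print(reserved, reachable_by_pairs, total_code_points, by_hand == by_codec)
```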

  149. Re:Overstating and misunderstanding the problem by Kanasta · · Score: 1

    Here's the Chinese perspective. How do you write the number one?

    There's:
    1) The western way "1"
    2) The common Chinese character
    3) The complex Chinese character used for legal documents, cheques (when not in English), etc
    4) The Chinese character used in markets

    Now, are they the same? Well, in legal documents etc, you MUST use (3) and none of the other characters.

    (4) is ONLY ever used in the markets, however, in the markets (1) and (2) may also be used, but never (3).

    OK, what about brand names or product names? There exists a magazine using (3), and I'm sure there'd be lots of crap flying if some reporter decided to talk about it using any of (1),(2),(4).

    So then, are they the same, or not? The answer is NO. They all represent the number 1, sure, but there are concrete differences in where and how they may be used. Swapping them randomly would surely be unacceptable in certain circumstances.


    ---

  150. Re:After some skimming... by jmccay · · Score: 1

    There is one small problem with the Chinese written language (not the English converted language, but the original). 170,000 characters would not be enough to express the whole language. (I think it may be over 1,000,000 characters/symbols, but nobody knows the exact count.) Granted it has been a long time since I have done anything with Chinese, but I do remember there are a lot of characters/symbols. When I last checked, it didn't have an alphabet. This means virtually every word has its own character, or combination of characters.
    If I remember correctly, Unicode had a few different ways to represent characters. It depended on the language you were talking about. Some characters were meant to be combined with others to form the actual characters. They used this to extend the characters.
    In some sense it is not practical to represent some nonalphabetic languages with computers. The number of characters would skyrocket. The only way I could think of to represent the entire Chinese language would be to have a symbol for each possible brush stroke and combine them to form characters/symbols.
    There will always be some languages that will not be practical to completely represent on a computer.

    --
    At the next eco-hypocrisy-meeting, count the private jets used to get to the meeting. Should be interesting to see that
  151. Re:Solution - Everybody use Euro-English! by rkent · · Score: 2
    Oh come on, that's not so hard to read... remember, as Andrew Jackson said:

    "It's a damn poor mind that can think of only one way to spell a word!"

    ---

  152. Re: Simpler than English by Christopher+Whitt · · Score: 1

    If by real you mean a natural language, then I don't have any great suggestions, although I know that French is much more regular and has a smaller vocabulary.

    On the other hand Lojban is a real language. Like Esperanto, Lojban is regular (the rules of the language have no exceptions), but it has only 6 vowels, 12 consonants, and 3 semi-letters. Other benefits are an unambiguous grammar based on principles of logic, cultural neutrality, ease of learning, and phonetic spelling.

    Christopher

  153. Re:After some skimming... by CloudWarrior · · Score: 1
    Let me get this straight - you think people should be prepared to accept having restricted access to the literature that underpins their culture in exchange for their very own geocities.cn?
    People can already have access to their literature - you put it up as an image file. The only advantages of using text instead are that it allows you to search and edit it. Since it's classic literature, you don't need to edit it, and since there isn't a sensible way to input the characters, there wouldn't be a sensible way to search it in any case.

    CloudWarrior .o. "I may be in the gutter but I look to the stars"
  154. Idographics Have Their Place In English Too.... by EXTomar · · Score: 2

    Let's see...you used "10,000" and "100". Those are ideographic representations of "ten thousand" and "one hundred". There are hundreds of ideographs in common US English, yet someone wants to harp on a language that uses ideographs for 95% of its written words?

    The point is that any character encoding should have been robust enough to encode any language used at any point in the history of mankind (okay...encoding things like Ancient Latin might be more academic than anything).

  155. Frequency rates for Chinese characters by willis · · Score: 1
    In a class I took at Peking University, titled "The Study of Modern Chinese Characters" we covered this stuff.

    In a study by Zhou Youguang published in "Zhongguo Yuwen Zongheng Tan" in 1992, he gives the following stats:
    number    frequency
    1000      90.000%
    2400      99.000%
    3800      99.900%
    5200      99.990%
    6600      99.999%
    There we go, that's the facts on frequency.

    The other thing is that there are currently two different systems for writing Chinese on the net -- GB (guobiao) from mainland China and Big5 (dawu) from Taiwan.
    Using the results of frequency studies, the GB format was made to only include a certain set -- 7237, I believe. This is what almost every Chinese from mainland China uses right now, and it's working pretty damn well.
    Big5 has something like 1X,000 characters, and that seems to work just fine as well.

    If you ask me, the largest problem that folks can face is that they receive email with scrambled codes or don't have a Big5 converter or something, not that there aren't enough codes for folks to adequately express themselves.

    (I haven't read the article, but I believe that before they planned to use escape sequences to make these more un-used characters -- and I'll betcha 99.99999% of users will never run into them, and not need the fonts, etc.)
    (that class was boring as all hell -- can't believe it came in useful on /.)

    --

    there is no thing
    what else could you want?
  156. Re:You bring up a good point by mrogers · · Score: 2

    Don't confuse the Latin alphabet with the English language! In Czech, the Latin alphabet (plus a few accents) is used phonetically.
    --

  157. Does this means, all ancient chars should be maped by Artemis3 · · Score: 1
    OK, so answer me a question. Does this mean that all ancient characters should be mapped so that historians could use them as well? Why only ancient Chinese then? What about glyphs and all the ancient cultures?

    --

    --
    Artix
    Your Linux, your init.
  158. Re:Flamebait :) by phunhippy · · Score: 1

    Did ya consider getting her Hooked on Phonics? :) Or was the trouble of her reading through your comic collection because, like every other sane person, Marvel's NEW UNIVERSE, STUNK!

    ;)

  159. Flamebait :) by phunhippy · · Score: 3

    Learn English.. 26 letters 10 numerals.. assorted punctuation.. ;)

  160. Re:After some skimming... by kevinank · · Score: 2

    Special letter forms don't need to be coded into Unicode to be viewable. SVG, Postscript and other languages do a perfectly good job of presentation. So unless you can convince me that a Korean/Chinese person will be trying to do a word search through a historical Japanese/Taiwanese/Vietnamese document and will always inadvertently find the Korean ACK/Chinese SPOO when what he was really looking for was the Japanese FOOFLE/Taiwanese FLUM, I don't see the problem.

    Personally I can't understand why anyone in the world would want to search in a character set of more than 60,000 characters. I'd personally be pissed off if the UNICODE committee started adding special letter forms for US product trademarks (so they would render correctly) when as a user I'd rather just have them be findable.

    Really, the author needs to understand the use of the ALT tag.

    --
    LibBT: BitTorrent for C - small - fast - clean (Now Versio
  161. Why not encode brush strokes? by MountainLogic · · Score: 1

    Phonic based encoding simply lists letters (say, 6x8=48 bits for a typical 6-letter word). Why can't brush strokes be used to describe ideograms?
    I assume (with our knowledge) that common ideograms are of limited complexity due to simplification over time and that seldom used ones tend to remain complex. With a stroke encoding scheme you should end up with a unique string of bits not too much longer than a roman word.
    True, this would play hob with the functions in the standard libs, but in real life you really are searching for words (multi byte symbols), not single chars.

    -Scott

  162. Technically Illiterate by tbray · · Score: 2

    This article is technically illiterate. UCS-2, which he references heavily, basically doesn't exist any more and hasn't for a while. UTF-8 and UTF-16 are perfectly adequate encodings, each of which can handle all of the extended characters, up to a million or so in number (17 planes of 64k, to be precise).

    He's correct that the ability to do computing in an Asian environment has lagged behind Western-language capabilities. However, as of Unicode 3.1 (in fact, as of Unicode 2), the support for what you need to do *business* computing has been pretty well there.

    The job of collating and organizing all the tens of thousands of characters required to handle the classical texts is under way but will take a while to finish. Then there's the really hard problem of building quality fonts to support all these things.

    But the title and premise are wrong. You can use Unicode on the net today just fine, lots of people are doing it, and anyone who builds a significant application today and *doesn't* build in support for international character handling is just out 'n' out stupid. It's not that hard.

    Cheers, Tim Bray (tbray@textuality.com)
  163. Some Article by ahde · · Score: 1
    the author spends the first few paragraphs blasting the guy that made his profession possible for being a Christian, and tries to prove his knowledge is superior to a hundred year old work.

    Excuse me, but Unicode isn't supposed to describe fonts. As bad as Unicode is, every character in every language can be represented in way, way fewer than 65,000 code points. Korean has around 50 characters. The Japanese use less than 1000 Kanji in practical use. You couldn't find a Chinese, or even an American, who has more than a 10,000 word vocabulary if you tried.

    What's more, if accents, umlauts, or whatever were used as separate characters, everything except Chinese and Japanese kanji could fit in 256 characters.

    Most Japanese get by using 2-character combinations, on a variation of a standard keyboard, which is a lot easier than trying to use a 10,000 letter typewriter. Every word in Japanese can be represented with 50 hiragana.

    The Chinese character set can be at least partially blamed for the high level of illiteracy in China. It is ancient. It has lasted so long mainly as a tactic specifically to keep the general populace ignorant.

    And we don't need to fit Mayan sculpture into unicode.

  164. Re:You bring up a good point by ahde · · Score: 1

    Anyone who speaks english and cant figure out what fishes are ise fit to be hung, in bunches.

  165. Re:After some skimming... by TheReverand · · Score: 2

    Then how do they sing danny boy?

  166. Re:Danny Boy? by TheReverand · · Score: 2

    Tell that to Tommy Makem

  167. How simple is English? by The+Trinidad+Kid · · Score: 1

    Sorry to rain on your parade but English is not a particularly simple written language.

    The Western tradition uses 2 complementary (but distinct) alphabets - the Latin, majuscule or upper case alphabet and the hunnish, minuscule or lower case one.

    These 2 alphabets have a 100% redundancy between them, and about a 50% overlap, and their mixed usage is context dependent and purely conventional and dates from the Renaissance. Their usages prior to that were in substantially non-overlapping geographical areas (and/or time periods).

    In addition to this the English tradition chucks in an ideogram set to represent numbers, except that unlike the latinate or hunnish alphabets, this ideogram set reads right to left like the Arabic from whence it was bodged.

    So, let's recapitulate, 2 alphabets with 100% semantic redundancy and 50% overlap of form which read left to right, and an ideogram set that reads right to left. Simple? Or just what you are used to?

    --
    http://scottish.politicaldiscussion.org
    1. Re:How simple is English? by The+Trinidad+Kid · · Score: 1

      If you look at the history of writing you will see that in different periods different letter sets were used by different people: consider this rendering of a simple text in a 'roman' hand (400 BC to 400 AD ish) and the same one in a Carolingian hand (Carolingian referring to Charlemagne, Charles the Great, circa 750 AD onwards).

      Well one looks like it is all written in upper case and one in all lower case.

      A general overview can be found here.

      Mixed case (dual alphabet) stuff only took off with the invention of printing. The issue of whether the lower and upper case character sets are different alphabets is simply one of degree: how different are they from each other and from other alphabets (like the Greek one)? This article makes the point that in ancient Greece there were also no "lower case" letters, only "upper case" ones - modern Greek developed a dual alphabet in emulation of the modern Latin one.

      Would you consider this to be a different alphabet? - I can barely read it, and certainly not in blocks - and it was used all over Germany until 1941 when it was banned by Hitler.

      Cyrillic also only gets dual case in the time of Peter the Great, having been "upper case" only before. Lots of languages only have one case.

      --
      http://scottish.politicaldiscussion.org
    2. Re:How simple is English? by de+Selby · · Score: 1

      It's not two alphabets, it's two similar forms for each letter. And the usage is really really really easy.

  168. 10100010100 by 4of12 · · Score: 2

    We Bynari take issue with this.

    With much grief and gnashing of teeth do we stoop to use this ill-conceived and bloated Latin based alphabet with 26 characters to respond to this bigoted viewpoint in a way that your feeble minds may understand.

    Our alphabet has exactly two letters.

    --
    "Provided by the management for your protection."
  169. Did you even read the article ? by dingbat_hp · · Score: 1

    So what if you've been using Unicode for ages, Unicode can't handle Chinese in a way that can simultaneously satisfy mainland and non-mainland Chinese.

    #karma_whore
    M$oft is what most people use. Doesn't make it right though.

    1. Re:Did you even read the article ? by dingbat_hp · · Score: 1

      Most ordinary people won't see many restrictions from the current standard

      "Most ordinary people" think that a TV-soap is top-quality drama. This isn't about what's "most popular" or even "occasionally used", it's about claiming to offer a complete support that even the academics will be happy with. So what if it's an obscure issue of interest only to linguists? Linguists also have needs that are worth supporting, not just sports broadcasters and populist entertainment.

      As with anything that pits mainland China against Taiwan, an awful lot of hot-air is going to be generated by the governments. It doesn't change there being a very real, and worthwhile, issue behind all this.

      I'm only surprised that North Korea hasn't laid into the debate too. Maybe 50 years just hasn't been enough to produce a convincing, "decadent capitalist Korean dialect of the Southern region".

    2. Re:Did you even read the article ? by rst2003 · · Score: 1

      So what if you've been using Unicode for ages, Unicode can't handle Chinese in a way that can simultaneously satisfy mainland and non-mainland Chinese.

      Exactly. Just like 7-bit ASCII can't handle English in a way that can simultaneously satisfy Helvetica and Times Roman.

      --
      apply XOR 0x03 to characters in email address
    3. Re:Did you even read the article ? by vidarh · · Score: 2

      You mean that can't satisfy the bureaucrats. Most ordinary people won't see many restrictions from the current standard, as it does contain about 65,000 CJK codepoints, which should be more than enough for ordinary use. Those do at this point also include "compatibility" characters - duplicates that are there to satisfy worries about compatibility with pre-existing encoding systems.

  170. Try reading the article by dingbat_hp · · Score: 1
    • You seem to be expecting Unicode to do some magic semantic translation for you.
    • Dropping the word "hiragana" into a posting doesn't make you an expert linguist.
    • You've entirely ignored the real issue that this article raises.
    --
    "There's something wrong with our bloody moderators today"
    Admiral_Jellicoe@slashdot
  171. Solresol is even smaller by dingbat_hp · · Score: 1

    The smallest alphabet is that of Solresol, with 7. The "letters" (or segmental phonemes, if you're being picky) in Solresol may be represented in several ways, not just written, and it's fundamental to the language that they're all identical. It's often called a "musical language", because of their ancestry from the Western chromatic scale, but they have equally valid written and spoken forms, even to the tone deaf. Solresol is interesting for several reasons, although I'd not claim that it has a particular significant future.

    • First wholly invented language to achieve any sort of widespread acceptance.
    • First "interlangua", the notion of a translation intermediary language capable of expressing all other language translations x->y as the sequence of x->solresol and solresol->y.
    • First language to formally separate semantics and encoding, i.e. the musical phoneme is exactly equivalent to the written phoneme (or that phoneme expressed in arranged pebbles, or smell-o-vision). As a result, it's entirely phonetic, but the distinction goes a long way beyond that.

    And of course, it's in Unicode too.

    On the downside, it's just French with squeaky noises.

    Hawaiian is probably the "naturally evolved" human language with the shortest alphabet.

  172. too sinocentric, but Unicode has problems by rjh3 · · Score: 4

    Ah, the horrors of Unicode. The referenced article is too Sinocentric. Unicode's problems go further. Unicode is both a european solution to european problems and a european solution to asian problems.

    The Japanese hate Unicode. If you bother to ask them, which the web did not, you find a loud and impolite dislike for Unicode. The Japanese want their ISO 2022 solution, aka shift-JIS.

    The history of encodings is roughly:
    1. There was chaos.
    2. Then there was ASCII (the roman alphabet) pleasing to latin and english speakers.
    3. Then there were all the ISO 8859 and ISO 2022 encodings. These let all the european languages mix together with ASCII.
    4. Then Japan, Korea, and Vietnam define their own ISO 2022 encodings that make sense in the local language, and let these languages mix together with the european languages and ASCII.
    5. But ISO 2022 is a complex patchwork of special cases. So at the same time the Asians were inventing their ISO 2022 solutions, Unicode was being invented.
    Unicode 1.0 provided a viable solution to modern european languages, but could not encode historical documents or asian languages properly. The Unicode 2.0 effort fixed the historical european language problem by adding in the alphabets for these "dead" languages. Unicode 2.0 brought the asian encodings to the point where they were usable.

    Japan and Korea get no benefit from Unicode. In fact, their ISO 2022 encodings are at least in "alphabetical order" for the relevant alphabets. Unicode is just a jumble.

    Meanwhile China has a unique problem. They do not have an agreed alphabet. The Japanese all around the world agree on what characters define Kanji. There may be different fonts, but there is one agreed alphabet. Similarly, the Koreans and the Vietnamese have one agreed alphabet. These alphabets are huge, with thousands of characters, but they are fixed and agreed worldwide.

    China has not agreed on an alphabet. Different regions use different alphabets. Chinese speak numerous different languages and have invented an amazing alphabet that works as a single writing form for all those languages. But there are disagreements. Furthermore, some regions of China are still inventing new letters for the alphabet. It is not a fixed and stable thing like european alphabets. You can invent new letters. (These really are new letters, not just new fonts.)

    The Chinese have invented many encodings as a result. The two most popular (Big5 and GB2312) are not ISO 2022 compatible. There is a new, less widely used encoding that is a superset encoding of BIG5, GB2312, and other encodings, and that is ISO 2022 compatible.

    Unicode did not accept the approach of leaving all these alphabets as different. They share most of their glyphs. Giving each region and language its own complete section would have blown the 50K limit of Unicode 2.0. They smushed all these different alphabets into one blob by combining anything that had similar glyphs into one character.

    This left Unicode 2.0 telling the Chinese, ignore all those letters we don't like. You don't use them much anyhow. It destroyed any notion of alphabetic order in the encodings for any asian language. And it is usable for modern text communication. Unicode 3.0 promises to do better, and probably will.

    But since all these languages can use the ISO 2022 encodings with a fully compatible mixture of languages, why not just use ISO 2022 and forget Unicode? The problem is the patchwork nature of ISO 2022. The encoding rules are complex. ISO 2022 is a terrible internal format. A Chinese character may take from 2 to 9 bytes to encode. And it gets worse as you dig further. UCS-2 and UCS-4 are very nice friendly internal formats for computers. It is trivial to convert from UCS-2 or UCS-4 into UTF-8 for transmission.

    It is also pretty simple to translate from UCS-2 or UCS-4 into ISO 2022 encodings. So the ISO 2022 encodings actually can make sense for network transmission.

    These issues will just get worse as you include other languages, like historical Chinese, Chinese border languages, and South Asian languages. As with Chinese, some of these have the fundamentally hard problem that they do not agree on a single alphabet.

  173. Re:Wrong, wrong, WRONG, WRONG! by Mendax+Veritas · · Score: 1

    If you're going to troll, at least make an account for it. Posting at 0 doesn't do you much good.

  174. Re:Wrong, wrong! by Mendax+Veritas · · Score: 1

    Yes, you're right. MBCS strings can't easily be scanned backwards because it's a little tricky to figure out whether the preceding byte is the trailing half of a double-byte character, but that's not true of UTF-8, which guarantees that the leading byte of a character will never have its high bit set, while all other bytes will. So when scanning backwards through a string, you just back up your pointer until you find a byte with a cleared high bit, and that's the start of the preceding character.

  175. Re:UTF8 by Mendax+Veritas · · Score: 2

    No, because the aliens are all so technologically and socially advanced that they've standardized on Esperanto.

  176. Wrong, wrong! by Mendax+Veritas · · Score: 4
    UCS-2 is not the only form of Unicode, and it's well known that 64k characters isn't enough. Besides, why should ordinary ISO-8859 (Latin-1) text be doubled in size by making every character 16 bits? UTF-8 is a much better solution, and it is good enough. Granted, string handling with variable-length characters is a bit of a pain (especially if you're used to assuming that a buffer of N bytes is long enough for a string of N characters, or you want to scan the string backwards), but it's the best solution we've got. It's the recommended encoding for XML documents, and is used today in web browsers (check out that "Always send URLs as UTF-8" option in Internet Explorer).

    It is a shame that there are so many different Unicode encodings. I think we ought to just standardize on UTF-8.

    1. Re:Wrong, wrong! by hackbod · · Score: 3

      People who think there is a problem with the number of different Unicode encodings -- including the authors of this article -- completely misunderstand how Unicode works. The different encodings are -not- different character sets -- in fact, they are different ways to write the -same- standard Unicode character set. The transformation between UTF-8, UTF-16, and UTF-32 is only a simple bit manipulation -- it is completely independent of the character set.

      An implication of this is that UTF-8, UTF-16, and UTF-32 can all express the EXACT SAME NUMBER OF CHARACTER CODES. So, if you think UTF-32 is good enough for you, then UTF-16 and UTF-8 are just as good. The latter two simply use multi-word or multi-byte sequences to express the upper character values.
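
      As a rough illustration of that point, here is a minimal Python sketch that writes the same code points in all three forms and converts between them. The sample characters and the helper name transcode are assumptions chosen for the example, not anything from the article or the standard.

```python
# Sample code points: ASCII, Latin-1, a CJK ideograph, and one beyond the BMP.
# The choice of samples is arbitrary and only for illustration.
SAMPLES = ["A", "\u00e9", "\u6d77", "\U00010400"]


def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Re-express the same code points in another encoding form (hypothetical helper)."""
    return data.decode(src).encode(dst)


for ch in SAMPLES:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")
    utf32 = ch.encode("utf-32-be")
    # All three byte sequences decode back to the identical code point.
    assert utf8.decode("utf-8") == utf16.decode("utf-16-be") == ch
    assert utf32.decode("utf-32-be") == ch
    # Going from one form to another yields exactly the other form's bytes.
    assert transcode(utf8, "utf-8", "utf-16-be") == utf16
    print(f"U+{ord(ch):04X}: utf-8={utf8.hex()} utf-16={utf16.hex()} utf-32={utf32.hex()}")
```

      The character beyond the BMP takes four bytes in each of the three forms here, which is the "multi-word or multi-byte sequence" case mentioned above.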

      After using BeOS for a number of years, where all character strings are natively handled as UTF-8, I am a very strong believer in Unicode. Having a Western perspective I may be missing something, but none of the "problems" mentioned in this article are actually problems that Unicode has.

      Of course, once you start using Unicode, the main problem you are going to run into is having fonts with the characters you need. And if the Chinese, Japanese, etc. really need 50,000 of their very own characters, then this is going to be that much more of a problem. Unfortunately, there is no easy solution to this -- but it doesn't have anything to do with the encoding you use, so changing to another encoding is not going to help here.

    2. Re:Wrong, wrong! by GordoSlasher · · Score: 1

      Not quite. The UTF-8 lead-in characters all have the high bit set. When scanning backwards, you back up your pointer until you find either a byte with a cleared high bit, or a byte that is a UTF-8 lead-in byte. A typical three-character Japanese name will be encoded as 9 bytes, all with the high bit set.
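
      A minimal Python sketch of the backward scan described here, assuming the buffer is well-formed UTF-8 and the starting position sits on a character boundary. The function names are made up for the example. It relies on the fact that continuation bytes always have the bit pattern 10xxxxxx, so anything else is either an ASCII byte or a lead-in byte.

```python
def prev_char_start(buf: bytes, pos: int) -> int:
    """Index where the character ending just before `pos` begins.

    Assumes well-formed UTF-8 and 0 < pos <= len(buf).
    """
    i = pos - 1
    # Skip continuation bytes (10xxxxxx); stop at ASCII or a lead-in byte.
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i


def chars_backwards(buf: bytes):
    """Yield the characters of a UTF-8 buffer from last to first."""
    pos = len(buf)
    while pos > 0:
        start = prev_char_start(buf, pos)
        yield buf[start:pos].decode("utf-8")
        pos = start


# A three-character name, 9 bytes in UTF-8, every byte with the high bit set.
name = "\u5c71\u7530\u592a".encode("utf-8")
assert len(name) == 9 and all(b & 0x80 for b in name)
assert list(chars_backwards(name)) == ["\u592a", "\u7530", "\u5c71"]
```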

  177. Re:Hmm.. I must have been using something else the by ClarkEvans · · Score: 2

    "And UCS-2 is not the only way to encode Unicode." You mean Unicode is not the only way to encode UCS-2. UCS-2 is a character set, Unicode is an encoding of this character set.

  178. Re:Hmm.. I must have been using something else the by ClarkEvans · · Score: 2

    "UTF-16 is used by some that needs to extend their UCS-2 applications to UTF-16." Whoa! UCS-2 is the character set. You can encode UCS-2 using either UTF-16 or UTF-8. Once again, Unicode is an *encoding* and UCS is the *character set*. Big difference and you seem to be reversing them.

  179. Language ID? by kreyg · · Score: 2

    "...being a 16-bit character definition allowing a theoretical total of over 65,000 characters. However, the complete character sets of the world add up to approximately 170,000 characters."

    So, add a byte or two per document as a language ID...

    Anybody feel like joining me at Milliways?

    --
    sig fault
  180. Re:UTF8 by egomaniac · · Score: 1

    *sigh*. No.

    UTF-8 is an encoding format, which specifies a means of encoding Unicode characters using variable-length byte sequences. The number of bytes it uses to encode characters does not dictate how many characters Unicode supports.

    Unicode, as I've stated elsewhere, supports a little over a million characters. There are ~50,000 characters in Plane 0, and 2^20 (~1 million) in Plane 1. Plane 1 is made up of surrogate pairs, which are two special characters next to one another (a high surrogate and a low surrogate). There are 1024 of each, leading to 2^20 Plane 1 characters.

    --
    ZFS: because love is never having to say fsck
  181. Re:Unicode has this covered. by egomaniac · · Score: 1

    No, the private use area is inappropriate for this sort of thing. Private use characters are (as the name implies) not intended to be visible to other applications; they are for encoding weird data within a single application.

    There is a much larger block of public code points, which allows for over a million characters (none of which have been assigned yet, but the code points are there).

    --
    ZFS: because love is never having to say fsck
  182. Re:unicode does *not* encode 65,536 characters by egomaniac · · Score: 1

    Not sure where you got Planes 1, 2, and 14 from. It's just Plane 0 (normal characters) and Plane 1 (surrogates). No characters whatsoever are assigned outside of Plane 0, although some are pending approval.

    UTF-8, UTF-16 (UCS-2), and UCS-4 (not just UTF-16, as you say) all allow Plane 1 to be addressed, as would any other encoding which covered the surrogate codepoints (although none such exists, to my knowledge). UTF-8 allows this either through discrete encoding of two separate surrogate characters, which takes six bytes, or a special 4-byte encoding which encodes the Plane 1 character directly (rather than as two surrogates).

    --
    ZFS: because love is never having to say fsck
  183. unicode does *not* encode 65,536 characters by egomaniac · · Score: 4

    It encodes over one million codepoints, actually (the erroneous statements of other posters notwithstanding). All currently assigned Unicode characters exist within the basic Unicode Plane 0, as it's called, which handles ~50,000 characters. Twenty-some-odd-thousand of those characters are in the CJK block (Chinese, Japanese, and Korean characters).

    Now, a range of Unicode characters is set aside for so-called "surrogates", and a high surrogate and a low surrogate character placed next to one another form a "surrogate pair" which specifies an extended character in UCS Plane 1. None of UCS Plane 1 codepoints are actually assigned to anything yet, but since there are about 2^20 (~one million) Plane 1 codepoints, they will easily handle all remaining glyphs with a ton left over. Tengwar, Klingon and others have all been considered for Plane 1 encoding (although I just checked and Klingon has been rejected. Sorry folks).

    So, the simple fact is that anyone who says Unicode can't support enough characters has been smoking a bit too much crack lately. Do yourself a favor and go read the spec before getting your panties in a twist.

    --
    ZFS: because love is never having to say fsck
    1. Re:unicode does *not* encode 65,536 characters by vidarh · · Score: 2
      AFAIK, each plane is only 16 bit. For Unicode 3.1, for instance, the new characters are placed in planes 1,2 and 14. But you're right that Unicode as a whole encodes over a million codepoints.

      The "surrogate pair" method only applies to UTF-16 encoding, AFAIK. UCS-4 should be equivalent to UCS-2 with surrogate pairs, except that the codepoint is always encoded as a 32 bit value, whether or not a single 16-bit character or a pair of two 16-bit surrogates are used.
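
      For the curious, the surrogate pair arithmetic is small enough to sketch in a few lines of Python. The function names are hypothetical; the constants 0xD800, 0xDC00 and 0x10000 are the ones UTF-16 defines. A code point above the 16-bit range is reduced to a 20-bit value, and each surrogate carries 10 of those bits.

```python
def to_surrogates(cp: int) -> tuple:
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points beyond the 16-bit range use surrogates")
    v = cp - 0x10000                     # 20-bit value
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def from_surrogates(high: int, low: int) -> int:
    """Recombine a surrogate pair into the single code point it stands for."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# A code point in plane 2: the same value is one 32-bit unit in UCS-4
# and a pair of 16-bit units in UTF-16.
cp = 0x20000
high, low = to_surrogates(cp)
assert (high, low) == (0xD840, 0xDC00)
assert from_surrogates(high, low) == cp

# Cross-check against Python's own UTF-16 encoder.
units = chr(cp).encode("utf-16-be")
assert units == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```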

  184. Westerner's attitude? by gaemon · · Score: 1

    I'm really sad that real Westerner's attitude prevails right here in slashdot. I'm not surprised, even emacs rmail writers think MIME as a useless thing. that's the Westerner's attitude, so ignorant about I18N. probably most of you MERKINs don't know what I18N is in the first place.

    Mr. Carroll is just wrong in everything he claims. even the most classical and even a little bit absurd, not proven to exist and pure theoretical Hangul (Korean script) glyphs are included in the block starting in U+1100. I won't respond to each piece of FUD he is spreading since the reply from Unicode above sums them up very well.

    I'm a Korean. for us the ISO-10646 and the Unicode is the Right Thing(tm), not only a Good Thing(tm).

    well some of the Japanese seem to hate Unicode. so be it. but let me tell you this. the very notion that a coding system must be defined in a lexicographical order is just OBSOLETE. that's why you have LC_COLLATE in POSIX locale.

    and about ability to code fictional script in Unicode, you can use the 31-bit space in UCS-4, just set the MSB 1. you can do whatever crazy thing in that space. that's why ISO-10646 is a 31-bit representation, not 32-bit.

    the REAL PROBLEM in Unicode is that the standard itself is unavailable on the web, so most /.ers (and most merkins) don't bother to find out what it is, let alone actually read it. the standard is a hefty volume of dead trees (paper) and costs a hefty sum from your purse. until the standard itself is available on the web, FUD like Mr. Carroll is spreading won't be stopped in English-speaking nations, most of all US of A.

    but shouldn't you RTFM before you throw flame on everything that makes you feel shit, mostly because you have to pay (storage space) for things you don't want (languages other than english the language so-much-perfect-for-everything-even-jesus-speaks-in-it)? I expected that much from /.ers. maybe I expected too much.

    ignorance IS the human anyway.

  185. Unicode by ralmeida · · Score: 1

    Check this link to see why unicode characters won't work on the internet:

    http://dábliü.ämêricõ.îñamè.com/índiçý.html

    --
    This space left intentionally blank.
  186. 2 + 1 bytes? by whovian · · Score: 1

    Hm. log(170000)/log(2) = 17.4, so at least 18 bits is needed, as I cursorily understand this, to encode present human languages. Clearly a 3-byte unicode standard is needed. Maybe use only 20 bits and leave 4 bits for something else (font style, inverse, etc.).
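
    The arithmetic checks out. A quick Python sketch (the helper name bits_needed is made up for the example) that works it out for the article's 170,000 figure:

```python
import math


def bits_needed(n_chars: int) -> int:
    """Smallest number of bits that can give each of n_chars a distinct code."""
    return max(1, (n_chars - 1).bit_length())


n = 170_000                      # character count claimed in the article
print(round(math.log2(n), 1))    # fractional number of bits
bits = bits_needed(n)
whole_bytes = math.ceil(bits / 8)

assert bits == 18
assert whole_bytes == 3
assert 2 ** 17 < n <= 2 ** 18
```

    So 18 bits is the minimum, and rounding up to whole bytes gives 3.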

    --
    To-do List: Receive telemarketing call during a tornado warning. Check.
    1. Re:2 + 1 bytes? by vidarh · · Score: 2
      Uhm. Unicode already has at least four representations that allow for about a million characters each: UTF-8 (8 bit for US-ASCII, 2-4(?) bytes for everything else), UTF-16 (usually 16 bit, 32 bit for alternate "planes") and UCS-4/UTF-32 (32 bit).

      In other words, the limitation currently isn't lack of space in the Unicode encodings (unless you use UCS-2), but the fact that they simply haven't gotten around to specifying any more characters yet - Unicode is still a work in progress.

  187. next version of unicode should be 24 bit by j0nb0y · · Score: 1

    I wonder how difficult it would be to make the next version of Unicode be 24 bit? It would break all existing implementations of course, but since Unicode doesn't solve the problem it was designed to solve, continued existence in its present form is certainly not beneficial...

    Maybe it should be 32 bit just to make sure...

    --
    If you had super powers, would you use them for good, or for awesome?
  188. Language geeking by persist1 · · Score: 1
    "From what little I know of russian, it has a very simple writing system that is even clearer and simpler than e.g. german, danish or norwegian."

    This is actually more-or-less true... and within my experience the only language with easier rules of pronunciation is Spanish as spoken in Mexico and Central America.

    It helps that Cyrillic denotes ten vowel sounds and a pseudo-vowel with ten characters (accounting for almost a third of the characters used).

    One has to learn how stress falls in a word to know how to pronounce it properly, but there's a certain rhythm to that.

    Where Russian kills is not with the pronunciation (once you've gotten used to it - Russian has a lot of sounds that English speakers are never taught to make) but with the grammar. It's not that there are a lot of exceptions, but rather that there are six cases (where German has four and Latin seven). There's also a lot of ambiguity where verbs are concerned, almost as bad as the ambiguities in English verb usage. The lack of articles (a/an/the) in Russian takes some getting used to, but inflection generally helps there.

    My biggest gripe about Russian, though, has to do with prepositions. More on that if anybody asks.

    --
    ...When in doubt, think for yourself.
  189. Re:I had no trouble reading that at all by quahog · · Score: 1

    wont, dont, ill (as in sick), or I'll :)

  190. Re:You bring up a good point by d-rock · · Score: 1

    Just a comment on the "Straight Dope". I don't know if the info on the Chinese Typewriter was valid several years ago, but I know it's no longer true. Both Chinese and Japanese keyboards have multiple glyphs on each key, because both Chinese and Japanese have phonetic syllabaries (alphabets). In Chinese, it's Pinyin, in Japanese it's Katakana or Hiragana (same sounds, slightly different drawings). Either way, you input complex characters using multiple keystrokes, but not in English characters.

    So, for instance, "Flower" in Japanese is pronounced "hana". That is two characters, "ha" and "na". If you type ha-na I think you will see a menu pop up with possible Kanji (pictographs) and you can choose from them. I have only used a Chinese keyboard but I assume it's very similar.

    Derek

    --
    Don't Panic...
  191. Unicode has this covered. by tjwhaynes · · Score: 3

    Had this researcher bothered to read the Unicode technical introduction, the following would have been obvious.

    In all, the Unicode Standard, Version 3.0 provides codes for 49,194 characters from the world's alphabets, ideograph sets, and symbol collections. These all fit into the first 64K characters, an area of the codespace that is called basic multilingual plane, or BMP for short.

    There are about 8,000 unused code points for future expansion in the BMP, plus provision for another 917,476 supplementary code points. Approximately 46,000 characters are slated to be added to the Unicode Standard in upcoming versions.

    The Unicode Standard also reserves code points for private use. Vendors or end users can assign these internally for their own characters and symbols, or use them with specialized fonts. There are 6,400 private use code points on the BMP and another 131,068 supplementary private use code points, should 6,400 be insufficient for particular applications.

    Plenty of room.
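
    The quoted totals can be reproduced with a little bookkeeping. The Python sketch below is one reading of where the numbers come from, not something stated in the quoted introduction: it assumes 16 supplementary planes of 65,536 code points each, the last two code points of every plane set aside as noncharacters, and two whole planes (15 and 16) given over to private use.

```python
PLANE_SIZE = 0x10000            # 65,536 code points per plane
SUPPLEMENTARY_PLANES = 16       # planes 1 through 16
PRIVATE_USE_PLANES = 2          # planes 15 and 16 (assumed breakdown)
NONCHARACTERS_PER_PLANE = 2     # U+nFFFE and U+nFFFF in every plane

usable_per_plane = PLANE_SIZE - NONCHARACTERS_PER_PLANE

# Supplementary private use code points.
supplementary_private = PRIVATE_USE_PLANES * usable_per_plane

# Supplementary code points left over for ordinary assignment.
supplementary_public = (SUPPLEMENTARY_PLANES - PRIVATE_USE_PLANES) * usable_per_plane

print(supplementary_private, supplementary_public)
assert supplementary_private == 131_068
assert supplementary_public == 917_476
```

    Both results match the figures quoted above, which suggests the breakdown is at least consistent with the source.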

    Cheers,

    Toby Haynes

    --
    Anything I post is strictly my own thoughts and doesn't necessarily have anything to do with the opinions of IBM.
  192. Re:This article is stupid by MrResistor · · Score: 1

    No, I propose that folks use readers that interpret Unicode appropriately for the language they wish to use, whether that's a browser, an email proggy, or whatever. The article's point was that Unicode won't work because it can't display every single character that anyone ever came up with. My point was that it doesn't have to. My inbox is cluttered with all sorts of spam that says something like &*^%*&%UYGVKNB&^$*^%#^%$FCJUY%^$&^%U^TRU&^%#$^$@#^ %$&YT. Why should a non-English speaker expect otherwise?

    --
    Under capitalism man exploits man. Under communism it's the other way around.
  193. This article is stupid by MrResistor · · Score: 2
    HTML 4 includes country codes so the browser knows how to interpret the Unicode character. Thus, the same 16 bit number will display a different character for an English document than it will for a Mandarin document.

    In other words, Unicode doesn't need to account for every single character in the world!

    But of course, this was posted on the internet, so it MUST be true...

    --
    Under capitalism man exploits man. Under communism it's the other way around.
    1. Re:This article is stupid by dot11 · · Score: 1

      So you propose we write everything in HTML 4? (Everyone likes HTML formatted email, I know.)

  194. Re:All Character sets simultaneously?? by spiro_killglance · · Score: 1
    In the output pages from a search engine, the indexed pages could have come from any website in any character encoding. The descriptions on the search engine's results page should actually display the characters of whatever pages were found, which could be from multiple sites.

  195. Re:After some skimming... by RFC959 · · Score: 1
    I agree; the author seems more politically than technically motivated. So Unicode doesn't contain EVERY glyph ever created by humans. So what? Try typing "naive" correctly on your US keyboard. Somehow we've managed to survive this horrible cultural imperialism...

    The author allows his enthusiasm to carry him away more than once. For example,

    "[Hangul] was designed from the start to be able to describe any sound the human throat and mouth is capable of producing in speech..."
    Yes, Hangul is a remarkable invention, but try asking a Korean to say "Flushing" some time.
    "[Hangul] can be written with clarity, in a 24 X 24[dot-per-inch] space."
    Who cares? What does that have to do with Unicode, which has absolutely nothing to do with the physical representation of the glyph?
    "...the phrase, "Personal Computer"...is now 'PersaCom' in Japan."
    "PersaCom"? I've never seen or heard it rendered that way, and I really doubt that "persacom" is technically considered pronounceable Japanese. (I've always seen it rendered "pasokon".) And it still has nothing to do with Unicode. And printer manufacturers really made 8-pin printers so they could print hiragana and katakana, and they invented modes so they could print more complex characters but they sold them to Americans as "graphics modes", and, and, and...a whole flood of undocumented irrelevance.
  196. Minor nitpick by HalfFlat · · Score: 1

    Actually UTF-16 can't represent the same range as UTF-8 or UTF-32, it's a bit weird. UTF-16 uses surrogate characters to represent the 16 UCS-4 planes 0x00010000 through 0x0010FFFF as a pair of 16-bit words.
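
    The ceiling follows directly from the size of the two surrogate ranges. A small Python sketch of the arithmetic, using the standard UTF-16 surrogate boundaries (the variable names are just for the example):

```python
HIGH_SURROGATES = range(0xD800, 0xDC00)   # 1024 lead values
LOW_SURROGATES = range(0xDC00, 0xE000)    # 1024 trail values

pairs = len(HIGH_SURROGATES) * len(LOW_SURROGATES)
first_beyond_bmp = 0x10000
last_reachable = first_beyond_bmp + pairs - 1
planes_reachable = pairs // 0x10000

assert pairs == 2 ** 20                   # 1,048,576 pairs
assert last_reachable == 0x10FFFF         # highest code point UTF-16 can name
assert planes_reachable == 16             # planes 1 through 16

# A 31-bit UCS-4 value beyond that ceiling has no UTF-16 form at all.
assert 0x7FFFFFFF > last_reachable
```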

  197. Misconceptions in article by HalfFlat · · Score: 3

    As a preliminary, Unicode and ISO 10646 aren't the same standard, but are kept pretty much in synchronisation. ISO 10646 provides a character set with a 4-byte representation, and a compatible smaller set with a 2-byte representation. These representations have encodings such as UTF-8, UTF-16, and UTF-32. UTF-32 encodes every Unicode character in 32 bits and can represent the full 2^31 codepoints, while UTF-8 and UTF-16 as described in the Unicode 3.1 document are variable length representations that can represent approximately 2,100,000 and 1,100,000 codepoints respectively.

    One of the design principles was to provide a lossless representation of any currently used character set in Unicode, so that a round-trip re-encoding of text from one encoding to Unicode and back again would lose no information. Another was to keep distinct code-points for any characters that had different semantics, or different 'abstract shapes'.

    It turns out that one can satisfy these requirements for the Japanese kanji, Chinese hanzi (traditional and simplified) and Korean hanja without requiring a separate code-point for each; in Unicode version 2.0, approximately 121,000 such characters were able to be represented in 20,902 code points. Note that those characters which have distinct shapes but the same meaning, and those which are similar enough to be classified as calligraphic variants but have distinct meanings, are all represented by distinct code-points. (One caveat: in practice there are some exceptions as regards the preservation of information after a round-trip encoding to Unicode and back. For example, the CCCII encoding of hanzi explicitly catalogues calligraphic variations, and as such doesn't map 1-1 onto Unicode.)

    Of course, the actual glyph that corresponds to one of these unified codes will change depending upon the context in which it is rendered. For example the character 0x6d77 corresponding to the character for sea in both Chinese (Mandarin 'hai3') and Japanese ('umi') is drawn with one fewer stroke in Japanese than in Chinese. These typographical details are important, but can (and debatably, should) be dealt with outside the context of character encoding. Unicode has support for language tags which in the absence of any higher-level information can indicate the language context of the characters following them. Typically though, this information should be stored as part of a richer document structure (as is possible in XML for example.) Correct display of characters will require the presence of the appropriate font and a mechanism (such as LOCALE in a simple one language case) for selecting this font.
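
    To make the separation concrete, here is a small Python sketch. The code point 0x6d77 comes from the paragraph above; the list of (language, text) pairs is only a stand-in, assumed for the example, for whatever richer document structure actually carries the language context.

```python
import unicodedata

sea = chr(0x6D77)

# The code point has one identity regardless of who is reading it.
assert unicodedata.name(sea) == "CJK UNIFIED IDEOGRAPH-6D77"

# Language context lives outside the character data. These tags are a
# hypothetical stand-in for markup such as a language attribute.
tagged_runs = [("zh", sea), ("ja", sea)]

# The encoded text is byte-for-byte identical in both runs; only the tag,
# and therefore the font a renderer would pick, differs.
encoded = {lang: text.encode("utf-8") for lang, text in tagged_runs}
assert encoded["zh"] == encoded["ja"] == bytes.fromhex("e6b5b7")
```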

    Given this unification then, one really can fit most of the characters for which there are already extant (non-Unicode) encodings into 16 bits. With Unicode 3.1/ISO 10646-2 (which uses more than 65536 codepoints) this representation is AFAIK pretty much complete, including for example all of the hanzi of CNS 11643-1992 and CNS 11643-1986 plane 15 (the most complete hanzi encoding outside of CCCII.)

    With this in mind, one can argue against the points raised in the article:

    1. The unification scheme allows the representation of the 170,000 characters the author calculates in 70,000 or so codepoints, which it now does with Unicode 3.1. The use of external context is still necessary for correct rendering, but if the document has no structure for representing language context, there are Unicode language tags that can fill this role. Similarly, context would be required for the presentation of different calligraphic variants of Roman characters (e.g. fraktur.)
    2. Unification is quite unlike the analogy described 'in Western Terms'. 'M' and 'N' could not be identified, as they semantically distinguish words (e.g., 'rum' and 'run' have very different meanings.) Traditional characters and their simplified analogues are not identified under Unicode, so even if 'Q' were simply a fancier 'C' (which of course it is not), it wouldn't be given the same codepoint.
    3. Unicode is not limited to 16 bits as stated in the introduction to the article. There are over 2000 million available codepoints in UCS-4 and UTF-8, and UTF-16 can represent approximately 1 million of these. There is plenty of room - even in UTF-16 - to encode more characters as the need arises.
    4. With the exception of calligraphic variants in CCCII, Unicode can already faithfully represent characters in the major Chinese, Japanese and Korean character encoding standards.

    A little bit of research by the article author would have made the article unnecessary.

    References:
    Unicode 3.1 document;
    CJKV Information Processing, Ken Lunde.

    PS: In the time it took me to read the article, do some research and write this response, there have been over 300 slashdot comments. Wow.

  198. Nonsense by GCP · · Score: 1

    Far more "technical people in Japan" are in favor of Unicode than are opposed to it, and the percentage opposed appears to shrink every month as the feared "dangers" somehow don't materialize but the benefits do.

    Re: your little list...

    -- The conversion tables differ only very slightly, and *almost* everyone uses the tables at the Unicode.org site, either directly or by calling converters in the OS. Still, there are potential tiny differences, as you see in all cases of matching massive character sets across borders, though in the case of Unicode the problem is much smaller.

    -- CJK Unification has the disadvantage that you can't be certain of picking a font that is guaranteed to be acceptable based on the code point alone. In practice, this rarely turns out to be much of a problem, but it can happen. On the other hand, there are some nice benefits of the unification that more than make up for that one problem. The problem you cite isn't a problem. The character distinctions that the Japanese want to make have been made by the Japanese in the JIS X character sets. Those distinctions were then directly ported over to Unicode.

    -- Certainly you're referring to UTF-16 surrogates, but calling it "Unicode" to make the problem sound larger. In fact, UTF-8 is "Unicode", too. It's the greatest method for text data exchange ever created, and it has no 64K issue, no endianness issue, is self-synchronizing (if you miss a byte in a stream, only one character [code point] is lost), and many other nice features. The greatest of all is, of course, that it can encode virtually every language in the world in the same encoding.

    --
    "Those who have never entered upon scientific pursuits know not a tithe of the poetry by which they are surrounded."
    1. Re:Nonsense by GCP · · Score: 1

      1) The conversion tables you use depend on your needs. The problem is not in the nature of Unicode, but in the nature of the script system itself and the slight differences in opinion of the experts who have designed all the common CJK character sets. It's quite similar to debates among professors in Asia regarding proper stroke order or even number of strokes in characters. No two experts answer exactly the same for all characters. Without Unicode, there are various alternative mappings among popular CJK character sets. Conversions to and from Unicode seem to have fewer problems than the others. What would you suggest? Create a new character set and declare a set of maps as standard? As soon as you do, a dozen other parties will make slight adjustments to reflect their views and publish "corrected" maps. It's not Unicode that is the problem. And, by the way, I have converted megabytes of electronic text of various sorts, and working with Unicode has been bliss compared to legacy CJK character sets.

      2) No, I admitted that CJK unification has one disadvantage but several nice advantages. Not unifying would reverse that. I can't think of a system that would have only advantages. Can you? "The Japanese" at JIS couldn't, which is why they essentially invented CJK unification several years before the Unicode Consortium was created. Not only have I been to a lot of Japanese discussions of Unicode, but my job has been to deliver the Unicode support demanded by our Japanese customers in products that I *guarantee* you would recognize. "Paid by the Unicode Consortium"? The average 7-Eleven has a bigger annual budget than the Unicode Consortium. What are you smoking?

      3) Why are there so many different tools if a hammer is so great?

      And, if Unicode lacks the characters to fully cover some writing system, please identify the missing characters and submit your data as scores of individuals and government bodies have done and continue to do. Surely you'll be able to find quite a few examples of characters missing from Unicode 3.1 that you need, right?

      --
      "Those who have never entered upon scientific pursuits know not a tithe of the poetry by which they are surrounded."
    2. Re:Nonsense by kwhistler · · Score: 1

      1. There are crossmapping difficulties between all large East Asian character encodings. This kind of problem predates Unicode, and Unicode has inherited many of the inconsistencies already present. For accurate mapping between particular vendor implementations (e.g. Code Page 932 on Windows) and Unicode, the right thing to do is to use the vendor's own mapping of their code page to Unicode. (Those tables are also often posted on the Unicode website, or can be obtained from vendors themselves.)

      And what is your *alternative* anyway? Do you think you can find more authoritative and less problematical tables for converting "megabytes of electronic documents" between, say CNS 11643 and JIS X 0208, without making use of Unicode?

      2. What the previous poster was pointing out is that all character distinctions made in the official JIS national standards are also made by the Unicode Standard. I might also point out that the number one OS in Japan (Windows) and the number one word processor in Japan (Ichitaro Dasshu) are both Unicode-based. Most people in Japan are perfectly satisfied with such products, as regards their character handling, and neither know nor care that they are based on Unicode inside.

      The Unicode Consortium has not paid anyone to put a rubber stamp on anything. Put up or shut up on a claim like that.

      And yes I do attend Japanese discussions on Unicode issues. (Although I don't hang out on boards devoted to TRON or Giga, which tend to uninformed Unicode-bashing.) I have been personally acquainted with the head of the Japanese national standards body delegation into the ISO committees for a number of years now. He is hardly a stalking horse for the Unicode Consortium! But he and JSC2 have cooperated for years with ISO SC2/WG2 in the development of 10646 *and* in the Han unification that that implies--the same Han unification used in the Unicode Standard.

      3. UTF-8 is great for some purposes. UTF-16 is great for others. And UTF-32 is great for yet others. Since all of them represent the same characters in the standard, and all those three forms interoperate easily (the conversion code is posted on the Unicode website, for anyone who cares), where's the problem? The reason there are 3 encoding forms is because the software vendor community demanded it: UTF-8 for 8-bit API compatibility and UNIX stream/file transparency; UTF-16 for size and processing efficiencies for most text; UTF-32 for UNIX 32-bit wchar_t implementations of character processing.

      And then you toss off a total non sequitur: "...the problem is that Unicode doesn't allow many people to encode their languages fully." Unicode doesn't encode *languages* -- it encodes characters from scripts. End users don't "encode their languages" -- they represent text in their languages on computer systems that make use of encoded characters. But now that we've got some terms straightened out, would you care to specify an instance of a language that Unicode doesn't represent fully? It's funny how the Unicode Consortium seems to have convinced experts from the Library of Congress, the Research Libraries Group, the European Community, and so on about this, but hasn't been able to convince you!

      So, since you asked, that was how your post was nonsense.

  199. No, no by GCP · · Score: 1

    ISO 10646 itself is now restricted to the range described by UTF-32. ISO has agreed to close down the state space and to never define any code points that can't be reached by UTF-16 surrogates, which is where the UTF-32 boundary came from. UCS-4 is now obsolete, even at ISO.

    --
    "Those who have never entered upon scientific pursuits know not a tithe of the poetry by which they are surrounded."
  200. Re:You bring up a good point by Kaiwen · · Score: 1
    Pronunciation is ridiculous

    English phonetics is only "ridiculous" when it's learned with one eye on the written form. If English were written using, say, the IPA, the consistencies would be much more apparent; as it is, however, they're obscured by the massive kludge of historical accretions that passes for modern written English. How many vowels does English have, for example? Five? It's actually closer to twenty.

    And there are obscure rules such as '"their" may be used in place of "his/her"

    Not in my classes. Any student of mine who tries to use third person plural as a substitute for "he" or "she" once doesn't make the mistake a second time.

    the difference between "lend" and "borrow"

    This is no more difficult than the difference between "bring" and "take" or "come" and "go", and strikes me as inexcusably ignorant. If it's not a problem for my seven-year-olds, it shouldn't be a problem for the educational elite.

  201. UCS by markbthomas · · Score: 1

    What about UCS?
    Support for up to 31 bits per character and backward compatible with UTF-8

    0x00000000 - 0x0000007F: 0xxxxxxx
    0x00000080 - 0x000007FF: 110xxxxx 10xxxxxx
    0x00000800 - 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
    0x00010000 - 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    0x00200000 - 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    0x04000000 - 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

    This still leaves us with 0xFF and 0xFE as escape characters!
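
    The table above can be sketched as a small encoder. This is only an illustration of the original 1-to-6-byte layout (the function name is an assumption); note that UTF-8 as currently specified stops at four bytes and U+10FFFF.

```python
def encode_31bit(cp):
    """Encode a code point using the 1-to-6-byte layout from the table above."""
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("code point outside the 31-bit range")
    if cp < 0x80:
        return bytes([cp])
    # (upper limit of the range, lead-byte marker, number of continuation bytes)
    layout = (
        (0x7FF, 0xC0, 1),
        (0xFFFF, 0xE0, 2),
        (0x1FFFFF, 0xF0, 3),
        (0x3FFFFFF, 0xF8, 4),
        (0x7FFFFFFF, 0xFC, 5),
    )
    for limit, lead, extra in layout:
        if cp <= limit:
            # Continuation bytes carry 6 bits each, most significant group first.
            tail = [0x80 | ((cp >> (6 * i)) & 0x3F) for i in range(extra - 1, -1, -1)]
            return bytes([lead | (cp >> (6 * extra))] + tail)


print(encode_31bit(0x4E00).hex())
print(encode_31bit(0x7FFFFFFF).hex())
```

    Even the largest 31-bit value starts with 0xFD, so 0xFE and 0xFF never appear in the output.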

  202. The point of c -> k and c -> s by yerricde · · Score: 1

    Looks like you've just stuffed yourself...

    The point of the transformations hard c -> k and soft c -> s is to free the letter 'c' to stand uniquely for the 'tsh' sound of church and Medici.

    --
    Will I retire or break 10K?
  203. Tengwar is SCRIPT not language by yerricde · · Score: 2

    rough translation of this Quenya Elvish phrase which is a derivative of the Tengwar elven language

    Script != language. The word tengwar is Quenya for "letters." Calling the tengwar script a language is like calling the cyrillic script (used for Russian), the katakana and hiragana scripts (used for Japanese), or the latin-1 script (used for many Western European languages) a language.

    Find Tolkien's tengwar and more in the conscript registry, which uses the 'private use' area of the Unicode space for scripts invented in modern times (all scripts are invented at some time or other). And there are "surrogate" codes in Unicode UTF-16 for a million additional code positions.
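
    The "million additional code positions" figure follows from simple arithmetic on the surrogate ranges. A quick sketch (variable names are just for illustration):

```python
# UTF-16 surrogate ranges: high U+D800..U+DBFF, low U+DC00..U+DFFF.
high_surrogates = 0xDBFF - 0xD800 + 1
low_surrogates = 0xDFFF - 0xDC00 + 1

# Every (high, low) pair addresses one supplementary code position.
supplementary = high_surrogates * low_surrogates

# The 16-bit plane plus the supplementary positions gives the whole code space.
total = 0x10000 + supplementary

# The private use area in the 16-bit plane, U+E000..U+F8FF.
private_use = 0xF8FF - 0xE000 + 1

print(supplementary, total, private_use)
```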

    --
    Will I retire or break 10K?
  204. Tengwar: Another alphabet designed on phonetics by yerricde · · Score: 2

    Whereas most phonetic alphabets consist of ideograms recycled as phonetic symbols, Hangul seems to be the only one to consist of symbols constructed purely for phonetic meaning.

    If you like hangul, you'll probably also like J.R.R. Tolkien's tengwar. Regular changes to the shapes of the consonants denote stop/fric/nasal and voiced/less. The structure of the script is such that unused letters (after t series, p series, and k series) can be used to represent sounds unique to a given language. It's available in both vowel-pointed (like devanagari and biblical hebrew) and vowel-letter (like greek/latin/cyrillic) modes.

    I'm not 100% sure about the legal status of a post-1923 script. Can a script be copyrighted or trademarked? Probably not. (Patents don't apply; it's been more than 20 years since the entire system was disclosed in RotK.)

    --
    Will I retire or break 10K?
  205. "Ye Olde" typo and Walt Disneþ^WDisney by yerricde · · Score: 2
    Old English, by the way, did have more letters than are found from modern english ("thorn" letter for "th", and couple of others).

    The letter thorn looks like Þ (U+00DE; Alt+0222; capital) or þ (U+00FE; Alt+0254; lowercase).
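
    For anyone who wants to check those numbers, a small sketch: the Alt codes are the decimal forms of the same values, and (as an assumption about how zero-prefixed Alt codes work on Windows) they index the Windows-1252 code page, which agrees with Unicode at these two positions.

```python
import unicodedata

thorns = ["\u00de", "\u00fe"]
for ch in thorns:
    # Prints the code point in hex, the same value in decimal, and the official name.
    print(f"U+{ord(ch):04X}", ord(ch), unicodedata.name(ch))

# The decimal values 222 and 254 decode to the same letters in Windows-1252.
from_alt_codes = bytes([222, 254]).decode("cp1252")
```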

    Thus, "Ye olde ..." is a kind of a typo; the first letter wasn't Y, but was close enough visually that it started at some point to be thought to be Y...

    Except DisneyCo (famous for buying bad legislation) actually does the opposite: using þ instead of y in the corporate logo.

    --
    Will I retire or break 10K?
  206. Basic English by yerricde · · Score: 2

    if someone tried to remove redundancies from the English language such as pork and ham, or argue and dispute

    C. K. Ogden once did just this, reducing the English vocabulary to a set of 850 basic English words, but the result has (some foreigners claim too many) idiosyncratic idioms and turns of phrase.

    --
    Will I retire or break 10K?
  207. Likewise, for Latin-1 based languages... by yerricde · · Score: 2

    Likewise, in Unicode, English, German, and Finnish all share the same codepoints and glyphs, so you can't grep for one language or another without using META headers or something similar.

    For instance, if you were searching in English for "gift", this string in Unicode would be the same as the German characters for "poison" (Gift), so your search would get hits from other latin-based languages in addition to English.

    It's difficult even to sort Unicode correctly without choosing some language or another, due to this overlap of characters. "Alphabetical order" is a bit different for the different European languages, even though they use the same characters.

    Translation: Language collision can be avoided by exact phrase matching ("perpetual copyright" wouldn't return many matches for non-English documents) and specifying the natural language of a document either in the document or in the headers.
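
    To make the sorting point concrete, here is a deliberately simplified sketch. The two key functions are hypothetical stand-ins; real collation (locale-tailored) handles far more cases. The same code points sort differently depending on which language's alphabetical order is chosen.

```python
# Hypothetical, simplified orderings for illustration only.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"


def swedish_key(word):
    # In Swedish, a-ring, a-umlaut and o-umlaut are separate letters sorted after z.
    return [SWEDISH_ALPHABET.index(c) for c in word.lower()]


GERMAN_FOLD = str.maketrans("\u00e4\u00f6\u00fc", "aou")


def german_key(word):
    # Dictionary-style German ordering treats umlauted vowels like their base vowels.
    return word.lower().translate(GERMAN_FOLD).replace("\u00df", "ss")


words = ["zebra", "\u00e4pple", "apple", "\u00f6l", "ost"]
swedish_order = sorted(words, key=swedish_key)
german_order = sorted(words, key=german_key)
print(swedish_order)
print(german_order)
```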

    --
    Will I retire or break 10K?
  208. And Unicode distinguishes those. by yerricde · · Score: 2
    1. The western way "1"
    2. The common Chinese character
    3. The complex Chinese character used for legal documents, cheques (when not in English), etc
    4. The Chinese character used in markets

    So then, are they the same, or not? The answer is NO.

    Unicode would distinguish among these four forms because they are distinct characters, but it would not distinguish among similar forms of the SAME character. Unicode does not distinguish sans-serif from roman from italic from fraktur from monospace; that's the job of the stylesheet.

    To answer another common objection: When two characters look the same but are not the same, they are assigned separate code points. For instance, Latin capital letter A, Greek capital letter Alpha, and the Cyrillic equivalent look exactly the same but are encoded separately. Chinese 'yi' (one) is the same character with the same origin as Japanese 'ichi' (one), but it is not the same character as the hyphen, which in turn is not the same character as the em-dash.
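
    A short sketch of that distinction, using Python's bundled character database (the choice of examples is mine):

```python
import unicodedata

# Three letters that usually render identically but are separate characters.
lookalikes = ["\u0041", "\u0391", "\u0410"]
for ch in lookalikes:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# One ideograph shared by Chinese and Japanese, versus two horizontal-stroke punctuation marks.
strokes = ["\u4e00", "\u002d", "\u2014"]
for ch in strokes:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# All six are distinct characters despite the visual overlap.
distinct = len(set(lookalikes + strokes))
```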

    --
    Will I retire or break 10K?
  209. UTF-8 by yerricde · · Score: 2

    As long as C programs have to be written in ASCII

    Supporting UTF-8 variable names as an extension to C and to C++ would not break any standard because, by definition of UTF-8, any valid ASCII string equals its UTF-8 representation.
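
    The compatibility property is easy to check. A tiny sketch, where the C snippet is just an arbitrary sample string:

```python
# An arbitrary ASCII-only source line used as sample data.
source = "int main(void) { return 0; }"

ascii_bytes = source.encode("ascii")
utf8_bytes = source.encode("utf-8")

# Pure ASCII text is byte-for-byte identical in UTF-8.
same = ascii_bytes == utf8_bytes

# Every byte stays below 0x80, so an ASCII-only tool sees nothing unusual.
all_seven_bit = all(b < 0x80 for b in utf8_bytes)
print(same, all_seven_bit)
```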

    english will be the standard

    Programming languages use English as the standard for keywords because more programming language designers can speak English than any other language.

    Limit use of 'to be' verbs to add power to your English.

    --
    Will I retire or break 10K?
  210. "Extended ASCII" misnames ISO-8859-1 by yerricde · · Score: 2

    Perhaps not, but there is such a creature as "Extended ASCII".

    Say not "extended ASCII" or "high ASCII" but "ISO-8859-1" or "ISO Latin-1." Latin-1 happens to use the same characters at codepoints 00 to 7f as ASCII, but that of itself does not make it ASCII. Unicode uses the same characters at codepoints U+0000 to U+00FF as Latin-1, but...
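
    A quick sketch of the two subset relationships described above, using Python's built-in codecs:

```python
all_bytes = bytes(range(256))

# Latin-1 agrees with ASCII on 0x00..0x7F.
low_half_matches = all_bytes[:128].decode("ascii") == all_bytes[:128].decode("latin-1")

# Unicode agrees with Latin-1 on U+0000..U+00FF.
unicode_matches = all_bytes.decode("latin-1") == "".join(chr(i) for i in range(256))

# But bytes 0x80..0xFF are simply not ASCII.
try:
    all_bytes[128:].decode("ascii")
    high_half_is_ascii = True
except UnicodeDecodeError:
    high_half_is_ascii = False

print(low_half_matches, unicode_matches, high_half_is_ascii)
```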

    --
    Will I retire or break 10K?
  211. Bummer by SpanishInquisition · · Score: 1

    I always wanted to have greek letters AND hebrew letters AND a smiley face in my email address.
    --

    --
    Je t'aime Stéphanie
    1. Re:Bummer by joto · · Score: 2

      Yeah, that would be really useful. Only spammers would know how to copy and paste your email-address, while ordinary people you tell your email-address can't type it...

  212. Re:You bring up a good point by timbu2 · · Score: 1

    Almost every rule in English has several exceptions, and many things in English cannot be deduced from rules, they must simply each be learned, and there are hundreds of these. Pronunciation is ridiculous, which you've mentioned, but apart from pronunciation is grammar, spelling, plural forms, tenses and possessive forms, all of these have strange nuances in English.

    Sounds suspiciously like perl. How many times have I explained about scalar and list context.

  213. Re:You bring up a good point by joto · · Score: 2
    This of course depends on what you mean by a simple writing system. If by simple you mean only "has few letters" then English is about as simple as it can be. However, english spelling is extremely idiosyncratic, there are no simple rules to follow, and almost every word is spelt in some not entirely logical way.

    If you choose this view, then yes, most european languages have much more logical spelling than english. One exception might be french, which is not at all written as it is spoken, although there is admittedly a system to it.

    Accent marks and diacriticals don't make the writing system more difficult; they simply make it possible to write more phonetically. I would prefer the writing systems of german, danish, norwegian or swedish any day before english. I don't know any east-european languages, but I would be very surprised if most of the accents and diacriticals weren't there for a good reason, and I doubt they can be much worse than english.

    From what little I know of russian, it has a very simple writing system that is even clearer and simpler than e.g. german, danish or norwegian.

    On the other hand, if someone makes a truly simplified and logical spelling of the english language popular, e.g: "I thought my bones were breaking during the fight" -> "Ai thokt mai bowns wer breiking diuring the fait", it could eventually become as simple as most other european languages (or those written with the cyrillic character set).

    Of course, most languages have some kind of idiosyncrasies when it comes to spelling, but english is certainly not among the easiest. And the few added letters in some european languages are laughable. German adds a few umlauts and ß, danish adds æ and ø, norwegian adds æ, ø and å, swedish adds å, ä and ö, and so on... No big deal! Besides, none of the above mentioned languages makes any use of x or z except in foreign words. Scandinavian languages never use w except in foreign words. The same is true for c in norwegian. So the letter count is mostly similar, as is true for cyrillic.

  214. Re:You bring up a good point by Doomdark · · Score: 1
    Well, finnish doesn't make use of a few of ascii-letters (except for loan words), such as 'b', 'c', 'f', 'q', 'w', 'x' and 'z'. It does have 2 additional characters (a and o with umlauts; 3 if you count in 'swedish o'). Diacritics, accent marks etc. are not used (umlauts are part of those 2/3 specific letters). Of course, nowadays all ascii letters are used and available due to foreign loans (and some ancient texts did use letters like 'w' in place of what nowadays would use 'v').

    Old English, by the way, did have more letters than are found in modern english ("thorn" letter for "th", and a couple of others). Thus, "Ye olde ..." is a kind of a typo; the first letter wasn't Y, but was close enough visually that it started at some point to be thought to be Y... And Old English was, alas, easier to pronounce than modern english. Thanks a bunch, latin-loving grammarians, who bastardized the spelling of words like "island", "herb" and n+1 others (the idea was to emphasize the origin of loan words, independent of whether spelling was consistent with pronunciation). Syntax and grammar were more complex, though (with germanic inflections... of which 'bewitched' and 'awaken' are remnants)

    On an unrelated note, letters 'j' and 'u' were not part of european languages (that's why romans had the funny habit of using 'v' everywhere...) before being invented a few centuries ago (ie. "i" was used for both "i" and "j", "v" for both "u" and "v").

    Oh and finally; it probably was a coincidence in sense that if computer science had bloomed in some other country (say, Germany), it would most likely have contained the local character additions (which in general in west Europe isn't all that many really... some languages do use diacritics more heavily, many do not)

    --
    I like paying taxes. With them I buy civilization -- Oliver Wendell Holmes
  215. Re:You bring up a good point by Doomdark · · Score: 1
    To really understand the English language itself, you need some knowledge of the many languages that it adopted words and rules from. Sadly, I don't know a lot about that.

    I'd recommend David Crystal's "The Cambridge Encyclopedia of the English Language" (or whatever the title was, I don't have the book at hand right now). I'm not a native speaker, and found it very interesting reading (and it's rather complete in explaining the history of the english language).

    In nutshell; english is a germanic language, derived from 'old german'; oldest non-germanic influences from celtic languages (but very little) and roman. More influence (loan words mainly) from vikings (Norse is a germanic language, so not much grammatical changes). Major changes thanks to french conquerors; tons of loan words (many originally from Latin), messed up spelling. Both grammar and spelling further complicated by scholars who loved Latin so much they changed lots of rules... just because they thought Latin grammar "was perfect" and a model for all civilized languages.

    Of course, english has word loans from dozens of languages (surprisingly many from, say, portuguese and dutch... even one from finnish).

    --
    I like paying taxes. With them I buy civilization -- Oliver Wendell Holmes
  216. Re:one from finnish? by Doomdark · · Score: 1

    "sauna"... what a surprise! :-) (AFAIR, the source was one of the big encyclopedias)

    --
    I like paying taxes. With them I buy civilization -- Oliver Wendell Holmes
  217. All Character sets simultaneously?? by -tji · · Score: 1
    Why would one want to represent all character sets of the world simultaneously??

    In the WWW, doesn't the HTTP header contain character set information, so the client knows which of the many character sets/languages to use? Then, only the size of that one character set is important (which will always be FAR less than 64K).
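
    For reference, here is roughly what "use the charset from the header" looks like in code. This is a sketch: the helper name and the ISO-8859-1 fallback are assumptions made for the example, not something the thread specifies.

```python
from email.message import Message


def charset_from_content_type(header_value, default="iso-8859-1"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset lowercased, or None if absent.
    return msg.get_content_charset() or default


# Sample body bytes standing in for a response payload.
body = "caf\u00e9".encode("iso-8859-1")
charset = charset_from_content_type("text/html; charset=ISO-8859-1")
decoded = body.decode(charset)
print(charset, decoded)
```

    This works when a whole document uses one character set; it does not by itself help with text that mixes scripts from several legacy character sets.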

    1. Re:All Character sets simultaneously?? by kurisuto · · Score: 1
      I have to represent multiple character sets all the time in my line of work (linguistics). For example, my dissertation included Roman, Greek, Cyrillic, IPA, and Runic characters, among others.

      You're correct that it is possible to mark ranges of text as belonging to a particular character set, but there are many drawbacks to this solution. For example, I was recently trying to grep Greek words from a text in both Greek and Latin; both languages were encoded in the same character space, with tags to show what text was in what language. I got all sorts of spurious matches from the Latin words, which wouldn't happen if the Greek and Roman letters weren't sharing a single character space.

      There are workarounds, but they are a huge hassle for anyone who has to regularly work with multilingual text.
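
      By contrast, when each script has its own code points, picking out the Greek words is a matter of matching code point ranges. A rough sketch, where the sample sentence and the two ranges chosen (Greek and Greek Extended) are my own assumptions:

```python
import re

# Sample sentence mixing Latin-script and Greek-script words (written with escapes
# so the exact code points are unambiguous).
text = (
    "Compare \u03bb\u03cc\u03b3\u03bf\u03c2 with logos, "
    "and \u1f00\u03c1\u03c7\u03ae with arche."
)

# Match runs of characters from the Greek (U+0370..U+03FF) and
# Greek Extended (U+1F00..U+1FFF) blocks only.
greek_words = re.findall(r"[\u0370-\u03FF\u1F00-\u1FFF]+", text)
print(greek_words)
```

      The transliterated Latin-script words never match, because they do not share code points with the Greek letters.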

    2. Re:All Character sets simultaneously?? by OhPlz · · Score: 1

      What if you wanted to view two documents at once and their mappings conflicted? Granted, present day HTML restricts a document to one character set, but that may not be the case forever (or maybe it's not even the case now, what version of HTML are they up to now?).

    3. Re:All Character sets simultaneously?? by vidarh · · Score: 2

      Wrong. The worst case for unicode is 4 times larger than normal. If you only use non-ASCII text sporadically, you can use UTF-8 and will get by with much less than that (as UTF-8 encodes all ASCII text in one byte).
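
      The sizes are easy to measure. A small sketch with made-up sample text:

```python
# 1000 characters of plain ASCII, plus a variant with three Greek letters appended.
ascii_text = "Plain English text. " * 50
mixed_text = ascii_text + "\u03b1\u03b2\u03b3"


def sizes(s):
    """Return the encoded size in bytes for each Unicode encoding form."""
    return {enc: len(s.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


ascii_sizes = sizes(ascii_text)
mixed_sizes = sizes(mixed_text)
print(ascii_sizes)
print(mixed_sizes)
```

      Only the fixed-width 32-bit form reaches the 4x figure; with UTF-8 the occasional non-ASCII character costs a couple of extra bytes.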

    4. Re:All Character sets simultaneously?? by Ubi_NL · · Score: 1

      Is that a problem? I mean a *real* problem? Having my documents in unicode means my files get 18 times larger than 'normal' only for the off chance I want to put a funny character in.

      There are better ways for this. Why not just put a language directive in the header of a HTTP file?


      --

      If an experiment works, something has gone wrong.
  218. 1st Posts of the World by Shocker69 · · Score: 1

    Here is your first lesson if Slashdot goes global.

    French = Première Distribution

    German = Erster Pfosten

    Italian = Primo Alberino

    Portuguese = Primeiro Borne

    Spanish = Primer Poste

  219. Re:In other news... by Shocker69 · · Score: 1

    Unfortunately these would all be rejected by the Slashdot editors.

  220. Esperanto, Ido, lojban; BCE by mrBlond · · Score: 1
    Ido fixes most of the stupid things in Esperanto, and lojban is much more logical.

    ...and isn't explaining B.C.E. as "before the Christian era" defeating the object? The reason I use BCE (before the common era) and CE (common era) instead of BC and AD is to remove the references to religious myth.
    --
    mrBlond

    --
    CowboyNeal for president!
    "Hit any user to continue."
    1. Re:Esperanto, Ido, lojban; BCE by GunFodder · · Score: 1

      What the hell is the "Common Era?" I am not a Christian but I think it is reasonable to respect the Christians that bothered to count 2000 years. Calling it the "Common Era" takes the Gregorian Calendar completely out of context.

  221. BCE by mrBlond · · Score: 1
    There are billions of people who do not believe the same thing Christians do about Jesus. Some of them prefer CE to AD.

    Interestingly (because Islam uses a lunar calendar, and Christianity a solar) Mohammed will one day be "older" than Jesus :)
    --
    mrBlond

    --
    CowboyNeal for president!
    "Hit any user to continue."
  222. Mrrp, wrong by Srin+Tuar · · Score: 1
    Man, go read for yourself. here is a link: http://www.cl.cam.ac.uk/~mgk25/unicode.html

    What you are talking about is UTF-16. Unicode can support up to 2^31 character codes, but they are not all reserved yet.

  223. funny by Srin+Tuar · · Score: 1

    most of the cruft you mention is as irrelevant as calligraphy. upper vs lower case could be called a stylistic difference.

    thats why the original implementations of english-machines were all caps. and this reply works fine in all lower.

    and sorry, the numbers are written and read left to right. 123 is "one hundred twenty three" not "three hundred twenty one"

    now, if you mention grammar, spelling, etc, youd have a point.

    1. Re:funny by matrix29 · · Score: 1

      I was polishing my Polish yesterday and a Father on TV offered to read the good Book to me. My friend came up to me. I said, "Sue, not now. Put the Turtle Wax away. I want a glass of Coke." I then decided to watch the Batman episode with the Green Hornet. I thought it was a shame they never had The Green Arrow on as well with Robin & Mr. Batman.

      ---- In all LOWER CASE. ----

      i was polishing my polish yesterday and a father on tv offered to read the good book to me. my friend came up to me. i said, "sue, not now. put the turtle wax away. i want a glass of coke." i then decided to watch the batman episode with the green hornet. i thought it was a shame they never had the green arrow on as well with robin & mr. batman.

      ---- A little harder to follow I think ----

      UPPER CASE & lower case are functions of punctuation, not disposable symbols. A word may appear the same in a single case, but the context can be easily lost without it. In some cases it denotes Titles, abbreviations, acronyms, and Proper Names for Pronouns. Case SWAPPING can denote emotion or highlighting. *Punctuation* can give some words needed importance and add emotion to rather dull sentences. All lower case denotes a passive voice, like in a poem or in a whisper.

      --
      "Face it, a nation that maintains a 72% approval rating on George W. Bush is a nation with a very loose grip on reality.
  224. UTF8 by Srin+Tuar · · Score: 2
    UTF8 is capable of encoding up to 31 bits per character, which is 2,147,483,648 distinct character codes. This should be plenty for all languages, and at least for linux/*nix, it is well recognized as the way to go.

    One upside of it is that there is almost no cost for english/ascii, which will remain 1 byte per character. You don't even have to recompile most apps to support it, only those that format character glyphs.
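
    A sketch of why byte-oriented programs mostly keep working (the path is a made-up example): every byte of a multi-byte UTF-8 sequence is 0x80 or above, so code that only looks for ASCII delimiters never splits a character in half.

```python
# A made-up path containing Han and accented Latin characters, as UTF-8 bytes.
path = "/home/\u7528\u6237/r\u00e9sum\u00e9.txt".encode("utf-8")

# Byte-level splitting on an ASCII delimiter, as an encoding-unaware program would do.
parts = path.split(b"/")

# Each piece is still valid UTF-8 on its own.
names = [p.decode("utf-8") for p in parts]

# No byte belonging to a non-ASCII character can be mistaken for an ASCII byte.
non_ascii_bytes = [b for ch in "\u7528\u6237\u00e9" for b in ch.encode("utf-8")]
all_high = all(b >= 0x80 for b in non_ascii_bytes)
print(names, all_high)
```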

  225. You bring up a good point by Srin+Tuar · · Score: 2
    Does anyone know a real language that has a simpler writing system than english?

    Almost every other european language I have seen uses some set of accent marks or diacriticals. And having studied japanese and vietnamese, they have orders of magnitude more complexity. Even esperanto has a larger alphabet than english.

    Is it just a coincidence that the simplest writing system was the first to be digitized? Too bad pronunciation of english isn't equally simple.

    1. Re:You bring up a good point by de+Selby · · Score: 1

      To really understand the English language itself, you need some knowledge of the many languages that it adopted words and rules from. Sadly, I don't know a lot about that.

      But, while English is one of the most difficult languages to use correctly, no one cares. Bad punctuation, spelling and improper plurals may look bad, but English can be understood no matter how many mistakes are introduced. That's why it takes only a few months to learn enough English to be understood, but many years to get most of it right.

    2. Re:You bring up a good point by de+Selby · · Score: 1

      Even more, English is expressive. Depending on what you want to say, you can choose a word derived from another language, say greek or french, to set the tone.

    3. Re:You bring up a good point by zephc · · Score: 1

      esperanto may have a slightly larger alphabet, but it has perfectly logical and regular rules for spelling as well as grammar, and is FAR easier to learn than english
      ----

      --
      "I would say that 99 per cent of what my father has written about his own life is false." - L. Ron Hubbard Jr.
    4. Re:You bring up a good point by SpeelingChekka · · Score: 1

      Sorry, I misunderstood what you meant by "simple". And yes I know what the "A" stands for.

    5. Re:You bring up a good point by SpeelingChekka · · Score: 1

      but English can be understood no matter how many mistakes are introduced

      I can think of many people who are supposedly educated, who have English as their primary language, who often fail to understand things. How often does it happen, for example, that someone posts something that is obviously humorous, only to have several people who completely miss the fact that the post is humorous reply? I mean, we have supposedly educated adults who cannot even pick up the basic tone of a simple piece of writing. How many of us have posted something well-thought out and well-written to an online forum, only to get some inane reply from some moron who completely missed the most fundamental point of your post?

      Mind you, this probably has nothing to do with the English language, and everything to do with culture and education ("uh duh, I don't want to think"; "why do they teach us this crap at school that we're never going to use"; "who needs to learn how to use English properly, as long as people can still understand each other" etc etc).

    6. Re:You bring up a good point by SpeelingChekka · · Score: 1

      English phonetics is only "ridiculous" when it's learned with one eye on the written form

      How can you possibly 'separate' the written form from the spoken? The very fact that it is essentially impossible to deduce pronunciation from spelling is ridiculous in itself. "Gill" and "fill" look like they should be pronounced the same, but aren't. "One" and "won" look different but are pronounced the same. "Going", "boing" and "doing" all look the same, but are all three pronounced differently. "One"/"tone". "Edit"/"edited" (one "t"), "Spot"/"spotted" (two "t"s). "Post"/"lost". "Meat"/"great". "Dome"/"come". "Comb"/"womb". Pronunciation of "women". Words like "light" and "thought", which if you didn't just happen to know how they should be pronounced, you'd have a tough time figuring it out. "Tough"/"dough"/"plough" - all look the same, but three different pronunciations. "Dough","doe","dow" - three spellings for a word that sounds the same. There are hundreds of inconsistencies, those are just a few off the top of my head. And all of these things have to be learnt on an individual basis. It's silly to try to pretend that these inconsistencies "don't count" by making some assertion that pronunciation should be learnt without "one eye on the written form". Why should it have to be learnt this way in the first place? Let's face it, let's not be zealous about it - English is ridiculous. I'm not anti-English, actually I happen to like English, although it may not sound like it :). I'm all for English being taught everywhere possible, and I do believe that English should become an "international language" of sorts, as it already has to an extent. But that doesn't mean we should ignore all its flaws. English is like the Win32 API - it has all the obvious evidence (kludges, nuances and inconsistencies) of something that evolved over a long period of time from many separate sources, as opposed to something that was designed.

      Not in my classes. Any student of mine who tries to use third person plural as a substitute for "he" or "she"

      Uh .. that's the thing .. you CAN (as far as I know, it's allowed). In place of using the term "his/her", e.g: "the reader should boot up his/her computer and then format their C drive" is valid in place of "the reader should boot up his/her computer and then format his/her C drive". Unless you misunderstood what I was referring to?

      This is no more difficult than the difference between "bring" and "take" or "come" and "go", and strikes me as inexcusably ignorant

      Ignorant it may be, but if you're looking at a basic rule that the majority of people can't remember, then you have to admit that the problem may lie with the rule, not the people. It's like the butterfly ballot issue - people debate whether it was confusing or not - but the fact is, people got confused by it - therefore, it WAS confusing. There really is no question, if people found it confusing, then it is confusing. These things (English language / ballots) are designed for people - not a small minority of the most intelligent people, but ALL people. "Ignorance" tends to spread by spoken word too. When I was still in school I could remember the difference between "lend" and "borrow". Once I left school my English skills started to slide a bit, as they weren't actively maintained. Also, I mixed a lot with people of a different language at University, who tend to speak English with very broken grammar. After some years I heard the terms "lend"/"borrow" used interchangeably so many times that I can't remember the difference any more.

    7. Re:You bring up a good point by SpeelingChekka · · Score: 4

      Does anyone know a real language that has a simpler writing system than english?

      Spoken like a true English-is-my-home-language person. English is NOT a simple language by any means, ask any foreigner who has learned English. Almost every rule in English has several exceptions, and many things in English cannot be deduced from rules, they must simply each be learned, and there are hundreds of these. Pronunciation is ridiculous, which you've mentioned, but apart from pronunciation is grammar, spelling, plural forms, tenses and possessive forms, all of these have strange nuances in English. The plural of dish is dishes, but the plural of fish is fish - sorry, no rule you can deduce that from, you must just learn that. The past tense of "hang" depends on what is getting hung/hanged. The rule says "add an apostrophe s" for possessive form, but of course there are exceptions, e.g. "it" "her" etc, or when the subject is a plural already, then you add an apostrophe but no "s". And the rules for when something is a plural "are" not always clear (and thus even educated people often aren't sure whether to use "are" or "is"). "Bananas are nice" is easy, but "A bunch of bananas" or "a group of individuals", are these plural or singular? And the examples get more and more complex. And there are obscure rules such as '"their" may be used in place of "his/her". And there are so many exceptions to rules like "i before e except after c", rules which many educated people even sometimes struggle to remember. I can name many University educated adults with English as their first language who still don't even know the difference between "lend" and "borrow" - that says something about the language.

      I'm glad English is my home language, but I feel sorry for foreigners who have to learn English as a second language.

      Is it just a coincidence that the simplest writing system was the first to be digitized

      Yes, actually, it is. ASCII was probably the first wide-scale character set standard used in computing - what does the "A" stand for?

    8. Re:You bring up a good point by linca · · Score: 1

      It depends on what you call "simple"

      Latin (still in use as a nationwide language, though only in the Vatican :)) has fewer letters than english, is written only in capitals, and ancient versions use no punctuation. Classical arabic (without diacriticals) has even fewer letters. Of course it is unreadable... I guess Hebrew is low on the number of letters, too.

      Saying that accents complicate a writing system is not so true either: you get a more powerful writing system with fewer signs to write it. Indeed, the use of doubled vowels, or particular successions of consonants, to mark particular pronunciations is complicated too.

    9. Re:You bring up a good point by linca · · Score: 1

      Yes, that's why it hasn't got many letters, like Arabic. And english kids complaining about homework... I even heard people say english is a hard language... Have they ever seen any other language?!?

    10. Re:You bring up a good point by 21mhz · · Score: 1
      From what little I know of russian, it has a very simple writing system that is even clearer and simpler than e.g. german, danish or norwegian.

      Schaz. It's not particularly simple or phonetically clear. There are rules that help to deduce what letter to spell, but in many cases you just have to know, as you have to do with English. The percentage of people who spell poorly is large here in Russia.

      --
      My exception safety is -fno-exceptions.
    11. Re:You bring up a good point by Epicure · · Score: 1

      Modern American english's biggest influence has become technology. Think of how many words have had to be created in the past few years....

  226. (reply to AC) by Srin+Tuar · · Score: 2
    Wrong, look here: unicode faq

    quote: All possible 2^31 UCS codes can be encoded.

  227. Unicode and CJK Characters by torokun · · Score: 3
    There are some good comments here, clarifying why this article is fundamentally wrong in its assumption that Unicode only encodes 2^16 characters. This is the first reason why this article is wrong.

    The other reasons are more subtle, and I'm not sure that everyone here understands what's going on with CJK characters, so here's a little background.

    The characters we're talking about originated in china, and spread to Korea, Vietnam, and Japan. Vietnam has switched to a western alphabet now, so let's leave them out. ;) At one point, although there have always been alternative forms for some characters, there was a reasonably standard set of Chinese characters used throughout these countries (recorded in the KangXi dictionary)...

    The Japanese invented a number of their own characters, which I'm sure number less than 1000. Up until World War II, this was basically the situation. (So at this time, the required number of characters to encode would have been less than 50,000 -- Chinese characters and Japanese additions.) Then all hell broke loose, so to speak.

    The Japanese simplified a large number of their characters systematically, immediately following WWII ( So they started substituting simpler characters for the disallowed ones in these compounds, and thereby subtly changed the meaning of the words.

    On to China -- they also began a campaign of character simplification, which would span quite a few years, although theirs was much more radical than the Japanese approach. In fact, some of the simplified versions the government came out with were so repulsive, they were eventually retracted because everyone refused to use them. ;) So they ended up with a few thousand ( Finally, Korea, Taiwan, and Hong-Kong basically kept the traditional chinese characters.

    So, that gives us the basic 40,000, plus 3000 Japanese (kokuji and shinjitai), plus maybe 10,000 chinese (jiantizi), plus some other stuff not mentioned here, giving a grand estimate of around 55,000.

    The key to this is that the vast majority of characters used are common among all 5 locales. This was the only reason that anyone even attempted to encode the CJK characters in the first place. The re-unification of all the disparate character sets was called Han-Unification during the Unicode development process.

    This, combined with the surrogate encoding area, ensures that there will be plenty of space for everyone... :)

  228. A Short History of Character Encoding by KidSock · · Score: 1

    In the beginning there was ASCII. It's a 7-bit code, which means you only have room for 128 codes: the common English characters plus some control codes. This didn't do any good for foreigners, so they made up language-specific code pages like Cp437 or the MS Windows Latin-1 encoding Cp1252 that just redefined what codes corresponded to what characters. This was a little ugly because you could not easily use characters from different languages together. So then someone came up with ISO-8859, which was backwards compatible with ASCII, meaning all the lower codes were ASCII. So this was a step in the right direction, but the extra 128 codes gained from that extra bit didn't give you much; you still needed language-specific versions: ISO-8859-1 is Latin-1 for Western Europe, ISO-8859-2 is for Central and Eastern Europe, etc.

    You see, the barrier here is the dependence on fitting character data into an 8-bit byte. Anything more and you really screw up existing kernels, libraries, and programs that depend on a character being one byte, like terminal drivers, strlen, your ini file parser, etc. Finally, both ISO and the Unicode consortium, at first independently, decided to come up with a universal character set. Both standards resulted in what amounts to a set of tables that defined exactly the same codes for all the characters in every language. At first I think they thought they could get away with a 2-byte code. This was called UCS-2, which is the route Microsoft is going and I believe Java as well.

    Now this expanded the number of possible characters considerably, however this still didn't solve the existing dependency on 8-bit character strings. For that they came up with UTF-8. The clever trick here is that they cannibalize the high bits of the first byte to indicate that more bytes get tacked on. If the first two bits are on, there are two bytes to play with. If the first three bits are on in the first byte then there are three bytes to store your large UCS code corresponding to some exotic character. But this still wasn't enough. The characters started to push the envelope of two bytes, and so they upgraded to UCS-4, which has 4 bytes and will hold all the characters of every language including the languages of yet-to-be-discovered alien civilizations. But now you have software, like from MS, that favors the somewhat more efficient and practical two-byte UCS-2 codeset, so you need to extend the UTF-8 concept to give you UTF-16. Well, that's about where we stand and there's a lot I left out.
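
For anyone who wants to see the UTF-8 trick in concrete terms, here is a rough Python sketch. It is not taken from any standard library source; the function name `utf8_encode` is made up for illustration, and it only handles code points up to U+10FFFF (it does not reject surrogate values, which a real encoder would).

```python
def utf8_encode(cp: int) -> bytes:
    """Hand-rolled UTF-8 encoder (illustrative sketch, hypothetical helper)."""
    if cp < 0x80:
        # 0xxxxxxx: plain ASCII, one byte, high bit off
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx: two high bits on in the lead byte
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx: three high bits on in the lead byte
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:
        # 11110xxx plus three continuation bytes
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range for this sketch")


# Compare against Python's built-in encoder for a few sample characters.
for ch in ["A", "\u00e9", "\u6f22", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", utf8_encode(ord(ch)).hex(), ch.encode("utf-8").hex())
```

Note that every byte of a multi-byte sequence has its high bit set, so ASCII-only code like strlen never mistakes part of an exotic character for an ASCII byte.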

    Interesting?

    Read this: http://www.cl.cam.ac.uk/~mgk25/unicode.html

  229. Helping the poor by peccary · · Score: 2

    One way to help out those who are poor is by opening them up to the modern economy and making it as accessible as possible.
    One way to do this is by making sure they possess the knowledge and skills of the modern economy. One of those skills is the dominant language. If you want to be rich, you learn to act, speak, think, like the rich people. Preserving the "native language and culture" is the province of romantic idealists.

    Don't go calling me a cultural imperialist, now. I actually read, speak, and write three languages, and could easily add a couple of more. I love the differentness of distant cultures. I am a "romantic pragmatist." I would love to see this differentness preserved, but I recognize that its passage is inevitable. The fact is that all these languages and cultures sprang up because the world was so vast. Groups of people were totally isolated from each other.

    The "western" world isn't that large anymore -- it's actually smaller than it's ever been. When Alexander the Great ruled the world, it was months from one end to the other. Now, the western world is maybe a day from one end to the other.
    The natural circumstances under which those languages arose simply do not exist any longer. They are fish out of water, and they must naturally pass on -- it's just the way of things.

    There may be room for many different languages when the human race colonizes the solar system, but I suspect that even then, the communications delays will be low enough that a single culture will be maintained, more or less.

  230. Cultural Heritage is important! by peccary · · Score: 3

    I mean, imagine how much poorer you would be if you had been unable to read the epic poems of early Anglo-Saxon culture in their original form! Or the early Judaic and Greek writings on which much of our more recent culture is based.

    You *have* read Beowulf, and the Canterbury Tales, haven't you? Along with Plato's Republic in Greek, and the Dead Sea scrolls?

    Now imagine how hard this would be if your computer didn't support the full character set in which they were written.

    1. Re:Cultural Heritage is important! by Computer! · · Score: 1

      Not harder at all. I read them on PAPER printed with INK. Making everyone pay for expanding a character set we all use in order to view the Dead Sea Scrolls is ridiculous. These documents can be preserved in antiquity in several dozen different ways, including taking pictures.

      --
      If you fall off a building, go real limp, because maybe you'll look like a dummy and people will be like hey, free dummy
  231. Ignorant nonsense. by fm6 · · Score: 2
    I'm repeating what other posters have already said, but I think it's worth boiling down the basic issue.

    The simple fact is these guys are totally ignorant. They confuse a particular 16-bit implementation of the Unicode "basic plane" with Unicode itself. If they'd done any research at all, they'd know that there are 16 more planes beyond it, with support for about 1 million characters. Plus there are "private spaces" so people can create their own extensions of Unicode. There's already the ConScript registry (which supports Shavian and Klingon).

    I'm reminded of people who thought computers would never catch on because keypunches were too bulky.

    Another ignorant assertion: that 1.5 billion people "speak" Mandarin. Mandarin is the standard dialect of Chinese, but only about 800 million people actually speak it.

    __

  232. Re:After some skimming... by kurisuto · · Score: 1
    The author's example of treating "M" and "N" as the same character is just plain wrong. Changing one for the other changes the meaning (e.g. "smack" vs. "snack"). Not so for the various presentation forms of a single Chinese character.

    A better example would be the ampersand character (&). I can think of several ways to write that character, but I challenge anyone to come up with a sentence where changing one presentation form of the ampersand for another changes the meaning of the sentence.

  233. Re:Is this really such a problem? by kurisuto · · Score: 1
    Your arguments fail on languages such as Russian, Greek, Arabic, Hebrew, Hindi, etc. These languages use non-Roman character sets, but each has many millions of speakers; none is faced with language death in the foreseeable future. None of these languages can be considered politically or economically insignificant.

    Even if a language does die, there is often still a need to work with it. For example, I specialize professionally in various pre-modern languages, some of which have not survived to the present. I still need a way to encode these languages as I use computers to produce dictionaries and online corpora.

  234. Re:I'll take that challenge by kurisuto · · Score: 1
    You're giving a case where the visual representation of the ampersand is described in the text. What I'm looking for is an example like the following:

    Bob & Mary are coming over on Tuesday.

    If you substitute the other presentation forms of the ampersand here, the sentence still means exactly the same thing (i.e., there are no situations where the sentence is TRUE with one form of the ampersand, and FALSE with another).

    Contrast "I gave him a snack". Substituting "m" for "n" does change the meaning ("I gave him a smack"). There could be cases where I gave someone a snack without giving him a smack, and vice versa.

  235. Re:Overstating and misunderstanding the problem by kurisuto · · Score: 1

    There's probably variation among Germans, but at least some Germans do write a "1" as an upside-down V.

  236. Re:another drawback of unicode by kurisuto · · Score: 2

    There is in fact a group working on Unicode encodings for the Egyptian hieroglyphic character set. The codes will go in the "surrogate characters" range of Unicode. Regular Unicode uses the codes between 0 thru 2^16-1; the range reached through surrogates runs from 2^16 thru 0x10FFFF, and has been designated by the Unicode Consortium for exactly this kind of case, i.e. large, rarely used character sets.
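
To make the surrogate mechanism concrete, here is a small Python sketch of how UTF-16 maps a code point above 2^16-1 onto two 16-bit units. The function name `to_surrogates` is invented for this example, and U+13000 is simply used as a sample code point outside the 16-bit range.

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a code point in 0x10000..0x10FFFF into a UTF-16 surrogate pair.

    Illustrative helper, not a library function.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    v = cp - 0x10000             # 20 bits of payload
    high = 0xD800 + (v >> 10)    # top 10 bits go in the high surrogate
    low = 0xDC00 + (v & 0x3FF)   # bottom 10 bits go in the low surrogate
    return high, low


high, low = to_surrogates(0x13000)
print(hex(high), hex(low))

# 1024 high surrogates times 1024 low surrogates gives the extra code points.
extra_code_points = 1024 * 1024
print(extra_code_points, extra_code_points // 0x10000, "planes of 65536")
```

Two 10-bit payloads give 1,048,576 additional code points, which is 16 planes of 65,536 each on top of the original 16-bit range.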

  237. Re:Unicode includes all common Asian character set by kurisuto · · Score: 2
    To the contrary, Unicode 3.0 does include the Germanic runes.

    I don't see a need for special software to display runes. It's just a matter of having a font architecture which allows you to create and install a font for an arbitrary subrange of the Unicode space.

  238. Overstating and misunderstanding the problem by kurisuto · · Score: 4
    This article mischaracterizes the issue concerning the Chinese characters. To take a western example as an illustration, the number one is handwritten in America as a vertical stroke, but in Germany as an upside-down V. However, folks in America and Germany agree that this is "the same character"; we simply have a different way of writing it. Unicode recognizes this sameness by assigning a single character code for "one"; the way to display it locally is a presentation issue, not an encoding one.

    This is exactly the issue with the Chinese characters. For a given character, there might be a difference between the Taiwanese way of writing it, the Japanese way, and the mainland Chinese way; but the character is still recognized as being the same, despite these presentation-level differences.

    For someone to demand that each national presentation form have its own character code is to misunderstand what Unicode is designed for. It encodes abstract characters, not presentation forms. Unicode does not have separate codes for "A" in Garamond and "A" in Helvetica.

  239. People didn't get your joke by GodSpiral · · Score: 1

    That was really hilarious. Loudest laugh I ever remember having.

    It was particularly clever in that it transformed english into the correct language: early 60's sitcom version of German.

    After zis fifz yer, ve vil hav a reli sensibl riten styl. Zer vil be no mor trubl or difikultis and evrivun vil find it ezi to understand ech ozer. Ze drem vil finali kum tru!

  240. Re:After some skimming... by 2Bits · · Score: 1
    It may be true that Unicode is enough for normal web publishing, as well as normal newspaper publishing, which requires only around 5 thousand Chinese characters. But the Chinese language has more than 60 thousand characters (research from the '90s shows that there are a little more than 65 thousand). So the 2-byte character in Unicode is just about enough to represent all Chinese characters, if you take all possible combinations.

    Personally, I think Unicode is not enough. What if I want to digitize the whole collection of the Beijing library, which contains millions of texts from thousands of years? How am I going to represent all the characters with Unicode?

    You may think that Chinese orthography has a tradition of simplification and variants, but this only applies to modern use of certain characters. These simplifications and variants can't replace the characters in ancient texts, or they will totally alter the meaning of the texts.

    I think Unicode is developed by a for-profit corporation, which tends to oversimplify without doing thorough research into a specific culture before trying to encode the language.

  241. Re:Solution - Everybody use Euro-English! by Skuto · · Score: 2

    a) This is (adapted?) from Mark Twain, it's in
    most fortunes.

    b) No matter how funny it looks, if you read
    it aloud it's perfectly understandable...

    c) ...but it keeps reminding me of 'Allo Allo?'

    --
    GCP

  242. Solution - Everybody use Euro-English! by saider · · Score: 4

    The European Commission has just announced an agreement whereby English will be the official language of the EU rather than German which was the other possibility. As part of the negotiations, Her Majesty's Government conceded that English spelling had some room for improvement and has accepted a 5 year phase-in plan that would be known as "Euro-English".

    In the first year, "s" will replace the soft "c". Sertainly, this will make the sivil servants jump with joy. The hard "c" will be dropped in favour of the "k". This should klear up konfusion and keyboards kan have 1 less letter.

    There will be growing publik enthusiasm in the sekond year, when the troublesome "ph" will be replaced with "f". This will make words like "fotograf" 20% shorter.

    In the 3rd year, publik akseptanse of the new spelling kan be ekspekted to reach the stage where more komplikated changes are possible. Governments will enkorage the removal of double letters, which have always ben a deterent to akurate speling. Also, al wil agre that the horible mes of the silent "e"s in the language is disgraseful, and they should go away.

    By the fourth year, peopl wil be reseptiv to steps such as replasing "th" with "z" and "w" with "v". During ze fifz year, ze unesesary "o" kan be dropd from vords kontaining "ou" and similar changes vud of kors be aplid to ozer kombinations of leters.

    After zis fifz yer, ve vil hav a reli sensibl riten styl. Zer vil be no mor trubl or difikultis and evrivun vil find it ezi to understand ech ozer. Ze drem vil finali kum tru!


    --


    Remember, You are unique...just like everyone else.
  243. Characters, not glyphs! by Wulfstan · · Score: 1

    I think, frankly, that this report is rubbish. The purpose of Unicode is NOT to provide a full listing of all possible glyphs; it is to provide a list of characters. The author of the report appears to me to have made a reasonably common mistake when reading through the Unicode spec; he sees one of the Unified Han characters, says "Ha! That looks nothing like the character in !" and assumes that Unicode is some Western pigheaded colonialist rubbish.

    For a more complete discussion, which summarises more accurately the way to use the Unified Han character section of the Unicode specification, trot off to here. Particularly read the section on "why were the characters unified". Unicode isn't perfect, but the Unified Han system is a good attempt to minimize bloat in the character tables.

    p.s. Those dudes from the Klingon Language Institute have been trying to get themselves a spot in the Unicode tables for ages and have recently had their application rejected :-( see here

    --
    --- Nick, hard at work :->
  244. binary by SkyLeach · · Score: 1

    Let's just all learn to speak binary. We can speak in long and short dashes built from words gathered from all languages. Morse code can become popular again. -1010011 1001100 (SL)

    --
    My $0.02 will always be worth more than your €0.02, so :-p
    1. Re:binary by thinkit · · Score: 1
      11.0010010000111111011010101000100010000101

      and if you don't recognize that, you have "pie" on your face. as for morse code, it's actually trinary, with the delay being important.

      --
      --how long till the operators are jailed for anime-induced pedophelia and /. dies?
  245. Re:After some skimming... by gea · · Score: 2

    Consider English literature and ASCII. If you look at a reproduction of Beowulf in the original Old English, you find lots of characters that aren't present in ASCII. That doesn't mean ASCII is worthless, and it doesn't mean anyone had to accept restricted access to literature. It just means there was room for improvement because ASCII wasn't suitable for all purposes.

    The Unicode designers got bogged down trying to create an encoding suitable for every possible purpose. If the goals had been more modest, say to allow Chinese language URLs, there would have been faster ways to go about it.

  246. Re:No it's not by Pru · · Score: 1

    >>>>> Yeah, let's get down to the lowest common denominator and make laws that require all internet content to be at least in X number of languages. A bit like the silly EU regulation because of which every fucking document, web site and audio recording concerning the Union must be available in at least. You have to be kidding, right? Laws that govern the internet? Isn't that dumber than multiple languages?

  247. ISO-2022-JP and "alphabetical order" by achurch · · Score: 4

    >>Japan and Korea get no benefit from Unicode. In fact, their ISO 2022 encodings are at least in "alphabetical order" for the relevant alphabets. Unicode is just a jumble.

    I can't speak for Korean, but there is no such thing as an alphabetic order for Kanji. In Japanese, Kanji almost always have at least two pronunciations, and often more.

    While it is true that most all kanji have multiple pronunciations, the kanji in ISO-2022-JP are most definitely in order. Level 1 characters (0x3021-0x4F7E) are ordered by their primary reading, and Level 2 characters (0x5021-0x7426?) are ordered first by radical and then by number of strokes. In both cases it's easy to locate a character if for some reason you can't type it normally (e.g. it's not in your IME dictionary)--I've had to do this on occasion, in fact.
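
A quick way to see the ordering point is to walk the first few Level 1 code positions and look at what comes out. The Python sketch below decodes JIS X 0208 positions through the EUC-JP codec (which is the same two bytes with the high bits set); the helper name `jis_to_char` is made up for this example.

```python
def jis_to_char(jis_code: int) -> str:
    """Decode a two-byte JIS X 0208 code such as 0x3021 via EUC-JP.

    Illustrative helper: EUC-JP stores each JIS byte with its high bit set.
    """
    hi, lo = jis_code >> 8, jis_code & 0xFF
    return bytes([hi | 0x80, lo | 0x80]).decode("euc_jp")


# First six Level 1 kanji, in JIS order (all have readings starting with "a").
first_six = [jis_to_char(0x3021 + i) for i in range(6)]
code_points = [ord(c) for c in first_six]

for ch, cp in zip(first_six, code_points):
    print(ch, f"U+{cp:04X}")

# The same characters do not appear in ascending order in Unicode.
print("Unicode order matches JIS order:", code_points == sorted(code_points))
```

The Unicode values jump around because the unified ideograph block is not sorted by Japanese reading, which is the practical complaint here when you try to hunt for a character by hand.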

    Unicode is, for all intents and purposes, completely random. Even without the problems of characters being inappropriately merged, there is no way you could try and find a character in Unicode; if your dictionary doesn't have it, tough luck. To me, that's an even scarier concept: for all practical purposes it could eliminate characters from the language. After all, if nobody can type it who's going to use it?

    Have you ever tried to program in shift-JIS? It is horrific.

    I will agree with this. Leaving aside the original poster's confusion of ISO-2022-JP and shi[f]t-JIS (the former is the official standard, aka JIS, while the latter is a poorly-thought-out Microsoft hack), dealing with strings that contain both half-width (1-byte) and full-width (2-byte) characters is a major PITA. About the only thing that can be said for it is the number of bytes is equal to the number of half-width character positions needed; and even that only applies to EUC and SJIS, since JIS has escape sequences to squeeze everything into 7-bit characters.
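
The byte-count remark can be checked with Python's built-in codecs. This is only a sketch using a sample string chosen for illustration (three ASCII letters plus two kanji), not a claim about any particular program's behaviour.

```python
sample = "ABC\u6f22\u5b57"  # "ABC" followed by two full-width kanji

sjis = sample.encode("shift_jis")
euc = sample.encode("euc_jp")
jis = sample.encode("iso2022_jp")

# 3 half-width columns for "ABC" + 2 columns each for the two kanji.
display_columns = 3 * 1 + 2 * 2

print("columns:", display_columns)
print("Shift-JIS bytes:", len(sjis))
print("EUC-JP bytes:", len(euc))
print("ISO-2022-JP bytes:", len(jis), "(includes escape sequences)")

# ISO-2022-JP stays 7-bit clean; the other two use bytes with the high bit set.
print("ISO-2022-JP 7-bit only:", all(b < 0x80 for b in jis))
print("Shift-JIS 7-bit only:", all(b < 0x80 for b in sjis))
```

For text like this, the SJIS and EUC byte lengths line up with the column count, while the ISO-2022-JP form is longer because of the escape sequences that switch between ASCII and kanji.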

    On the other hand, there's the character order consideration, which along with the problem of merged characters seems to be what draws so much dislike for Unicode from Japanese.

    --
    BACKNEXTFINISHCANCEL

  248. Re: Simpler than English by jpm242 · · Score: 1

    French isn't simpler than english. It has many exceptions to just about every rule and its vocabulary is as large as the english language. It's much harder to learn than english.

    I speak both, and anyone who does can agree with me.

    JP

    --
    --- Worst tagline ever.
  249. Re: Simpler than English by cougio · · Score: 1
    To quote an AC that no one will see,

    "English is a Germanic language, with the influences of: Latin, French and Scandinavian dialects. French is a Romantic language with far less influences."

    It is harder to learn than english only because of the exceptions. And it's the language of poetry. And has even more vocabulary than english (tons of beautiful synonyms...)

    Alors VIVE LE QUÉBEC!!!

  250. Ancient Latin... by IngramJames · · Score: 1

    Ancient Latin only used upper case. And one bit of punctuation.

    I.SAY.BRING.BACK.THE.GOOD.OLD.DAYS.WHERE.DID.THE.FIRST.SENTENCE.END.IS.THIS.A.THIRD
    --------------- ------------

    --
    'No rational religion claims "supernatural" exists, that's an atheist slander.' - seen on slashdot.
  251. Re:After some skimming... by Decado · · Score: 1

    Well since the language used for Irish uses just 18 of the 26 letters of the english alphabet and gets by just fine I assume it wouldn't matter that much. For the record the missing letters are j, k, q, v, w, x, y and z

    --

    Slashdot: Proof that a million monkeys at a million typewriters can create a masterpiece

  252. Re:Prejudice? Or technical hurdle... by GunFodder · · Score: 1

    Actually I think there is some discrimination, but it is motivated not by hate but rather by ignorance. The computer industry is still strongest in the US, and most OS software is still written by US-based companies. Why don't some Chinese software developers come up with their own language standard and write a bunch of software with it? Then the Western software industry will be forced to deal with the situation.

  253. Prejudice? Or technical hurdle... by fleeb_fantastique · · Score: 1

    I find myself somewhat frustrated with the viewpoint that western programmers "discriminate" against other cultures because the culture has too many characters, where "discriminate" implies a political, social, or personal conflict. The problem, frankly, seems more technical in nature to me.

    Many operating systems have a design that uses a smaller character set, if for no other reason than to help conserve space. Take your average file system; the character set doesn't permit Unicode characters in most cases, and even the C++ STL doesn't have a spec for streaming files with wchar_t names.

    Then consider that you have several evolving programs that have to be modified to use a different character set. From experience, I can tell you that, particularly for complex programs, this is not a trivial job.

    Finally, imagine that a political body imposes a deadline on imported programs.. that they must support their new standard by such-and-so a date or it won't be permitted within the country. The Chinese did this, extending the deadline to Sept. 2001. I only found out about this yesterday.

    It doesn't make a job easier.

    --
    And so it goes.
  254. Re:I'll take that challenge by easyfrag · · Score: 1
    To delimit a path, *nix uses a slash, whereas MS* uses a backslash; if you get these confused it helps to remember that ampersand is a rounded "E" with a slash through it.

    You have obviously never worked a help desk.

  255. Re:I had no trouble reading that at all by tswinzig · · Score: 1

    same thing goes for dealing with contractions, a la dont, wont, ill, and so on.

    I'm sorry, are you ill?

    --

    "And like that ... he's gone."
  256. Re:I had no trouble reading that at all by tswinzig · · Score: 2

    Naturally you'd have to do something about homonyms (I'll sounds just like aisle, anyway). Probably best to just work around them.

    Ill is not a homonym for I'll. You're talking about how things sound, and the discussion centers on how English LOOKS. :-)

    --

    "And like that ... he's gone."
  257. Is this really such a problem? by Dan+Hayes · · Score: 1

    While it's a noble and practical goal to eventually allow every language to be rendered as part of a website to allow for maximum access, I don't think that this limitation will really be much of a problem in the long run for two reasons.

    Firstly, many cultures are still too poverty-stricken to have electricity and running water, let alone net access. For these people, the thorny issue of whether Unicode has the capacity to represent their native language is totally irrelevant. And in many of these places, political and economic instability caused by civil wars, corporate greed and a lack of resources will mean this situation will continue for some time.

    Secondly, the rate at which languages are dying is still accelerating. Every year, we lose several languages as native speakers die of old age without their descendants having ever learned their original language. Cultural assimilation has proceeded at a brisk pace, with western countries only too willing to help with the "modernisation" of other cultures, which invariably results in a loss of their original heritage and linguistic uniqueness. And already globalisation is turning English into the de facto second language of the world.

    By the time the 65K limit would become a problem, I estimate it won't be a problem any more - there will be far fewer languages around, and only a subset of those will require online access. If all else fails, many of the majority remaining will speak English anyway.

    1. Re:Is this really such a problem? by trash+eighty · · Score: 1
      Firstly, many cultures are still too poverty-stricken to have electricity and running water, let alone net access.

      such as china, taiwan and japan? =P

      And already globalisation is turning English into the de facto second language of the world.

      this is not and never will be true, no matter how many times people repeat this fallacy

  258. Mac OSX has character mapping problems by wmulvihillDxR · · Score: 1

    In using OSX from the start, it has always had problems mapping characters. Even the "normal" weird ASCII characters get mapped strangely. Upside-down question mark is one. I could deal with it changing, but what frustrates me is that it doesn't change back whenever you go back to other UNIX systems. For instance, downloading a text file with weird ASCII characters with OSX's scp will make things go awry. But then transferring that file back up does not switch it back. Weird stuff!

    --
    Check out Althea for a stable IMAP email client for X. Now with SSL!
  259. Re:After some skimming... by update() · · Score: 1
    I don't presume to say what people should accept.

    I'm stating my impression about what they do accept (that Chinese users and standards bodies are far less troubled about Unicode than is the author) and speculating on why that might be (that out-of-the-box support to edit ancient texts in Word is more important to a scholar than to the vast majority of users).

    Unsettling MOTD at my ISP.

  260. Re:More Flamebait :) by update() · · Score: 1
    Ironically, I was just reading this story on As The Apple Turns:

    However, that's not to say that Mac OS X is truly uncrashable. (Yet.) We appear to be somewhat lucky on the stability end, whereas some other hapless customers are not. For instance, take Tony Smith over at The Register; the poor man nearly reached his wit's end trying to keep his Mac OS X-loaded blue and white G3 from taking frequent and unplanned trips to Crashville. (Spookily enough, Tony's crashes left him with "nothing but a blank, mid-blue screen"-- is Apple hard at work reverse-engineering Microsoft's Blue Screen of Death?) After multiple reinstalls, he eventually figured out what was causing his grief: an aftermarket PCI ATI Radeon graphics card, which he determined was not supported. Replacing it with his original OEM Rage 128 card left his system solid as a rock. Or so he thought.

    Once he got around to reinstalling his third-party fonts, his crashes came back. And so, by adding one font at a time, he was eventually able to isolate the real cause of all his woes: "a single Star Trek symbol font... OS X doesn't like it one little bit." So while Mac OS X is able to use his zippy Radeon card after all, Tony will sadly have to boot back into Mac OS 9 whenever he wants to stick the Starfleet Insignia into one of his party invitations. Now that's a problem that Apple's really going to have to fix before Mac OS X will ever catch on as a mainstream operating system.

    In fairness to OS X, I think it was actually application crashes -- the font wasn't bringing the system down.

    Unsettling MOTD at my ISP.

  261. After some skimming... by update() · · Score: 4
    I planned to read this through before posting. I really did. But then, in the second paragraph I hit:
    Wieger's seminal book about the characters and construction of China, published in 1915, was to become the defacto source against which all others would (and still should) be compared - with several caveats. Amongst these is a noticeable bias on his part against Taoism which becomes more evident in his analysis of the Tao Tsang (i.e., Taoist Canon of Official Writings [written 'DaoZang' in the PinYin Romanization of Mainland China] )
    and I decided to skim the rest.

    To summarize, for those whose eyes completely glazed over, his point is that Unicode doesn't sufficiently cover the full range of Chinese characters and that not using a larger set is a result of a longstanding Western prejudice that the Chinese don't need so many characters.

    Now, I'm not Chinese so my opinion counts for little here, but my impression is that Unicode isn't nearly as controversial as he makes it out. His analogy "To express it in Western terms, how would English-speakers like it if we were suddenly restricted to an alphabet which is missing five or six of its letters because they could be considered "similar" (such as "M" and "N" sounding and looking so much like each other) and too "complex" ("Q" and "X" - why, they are the nothing more a fancier "C" and an "Z")." ignores the fact that Chinese orthography has a tradition of simplification and variants. I suspect Unicode is a lot more upsetting to a "reference writer specializing in rare Taoist religious texts and medical works" than to ordinary Chinese users who want to run Photoshop or put their wedding pictures on a web page.

    Unsettling MOTD at my ISP.

    1. Re:After some skimming... by bmongar · · Score: 2

      Of course the display sets could develop diphthongs, i.e. more than one Unicode char to represent a Chinese character. Part of the problem with the Chinese character set is that it is not a character set so much as a dictionary, with words having only one character, and no restriction on adding new ones. So don't make a single character for each of them, use letter combinations. Of course that is my western bias.

      --
      As x approaches total apathy I couldn't care less.
  262. In other news... by ackthpt · · Score: 3

    Bush bolts GOP to join Democrats, fires entire Whitehouse staff

    Linus Torvalds to join Microsoft as OfficeXP advocate

    NASA on Moonshots, "Ok, ok, they were all actually faked on a soundstage in Toledo, Ohio and the ISS is really in a warehouse in Newark, New Jersey"

    Oracle CEO, Larry Ellison to give fortune to charity, dumps japanese kimonos for Dockers and GAP T-shirts

    RIAA to drop all charges against Napster, "All a big fsck-up, we'll all get rich together"

    Taiwan throws in towel, joins PRC, turning over massive US military and intelligence assets

    Rob Malda signed by Disney, epic picture planned, based upon this short. Sez Malda, "Anime's not mainstream enough anyway."

    --
    All your .sig are belong to us!

    --

    A feeling of having made the same mistake before: Deja Foobar
  263. So... by ackthpt · · Score: 4
    4av3 3v3r0n3 1n t4e w0r1d 13arn t0 typ3 l33t!

    --
    All your .sig are belong to us!

    --

    A feeling of having made the same mistake before: Deja Foobar
  264. ASCII stupidity all over again... by Matthias+Wiesmann · · Score: 2

    It's not new, and alas not surprising.

    When they did ASCII, it was a standard by the US, for the US. The mess it created in the high-ASCII range (128-255) is still not resolved, and I'm talking about diacritical characters like those used in western European languages (French, German, Spanish etc...), nothing fancy or very exotic. The problem was, of course, that the Europeans were not involved in the process.

    Now they do a universal standard that should correct all problems and, surprise, they don't actually bother to check with the people involved. Even if they did, it would make sense to have provisions for a few unknown character sets (like ancient civilisations or the myriad of small groups of people living in lost parts of the world).

    Anyway, if computer history has taught us anything, it is that a 16-bit range is never sufficient for practical uses. Well, just another sad example of one size does not fit all... But I suppose the Slashdot response will be - why the hell don't they all speak/write English...

    1. Re:ASCII stupidity all over again... by vidarh · · Score: 2

      Get your facts straight. Unicode isn't written in stone. It is an evolving standard. And one of the reasons it is taking so long is precisely because everyone affected can get involved - there's been a lot of infighting about which glyphs should make it and how to organize them. The result, however, is that most commonly used scripts can be handled by the current version of Unicode. More will most likely be handled in the future.

  265. Re:Duh. by devnullkac · · Score: 1

    I disagree. I don't see Unicode (or its alternatives) as a way to resolve language barriers. Rather, it defines a framework within which all programmers can use the same libraries and programming languages to develop applications using their own language.

    To use a gardening analogy: it doesn't make us all plant the same things or help us understand the meaning of what someone else has planted; it just lets us all use the same tools for working in our gardens.

    --
    What do you mean they cut the power? How can they cut the power, man? They're animals!
  266. Alrighty by rabtech · · Score: 2

    The guy obviously has an anti-western mindset.

    But to simplify, the crux of his argument seems to be that in order to read ancient works from the Chinese/Japanese/etc, they need about 40,000 to 50,000 characters each.

    But in reality, the average Japanese person would use fewer than 10,000 characters. In fact, probably far fewer.

    Besides -- it is mostly a moot point until you can show me a keyboard capable of entering 50,000 unique symbols efficiently.

    His solution seems to be allocating 32 bits of storage per character, rather than the 16-bit Unicode standard we have now.

    For the foreseeable future, it would seem that Latin-esque alphabets have the upper hand. It just makes more sense, especially in terms of programming and protocols. Do we really need web servers that understand how to read "GET / HTTP/1.1" in thirty different character sets?


    -- russ

    --
    Natural != (nontoxic || beneficial)
    1. Re:Alrighty by Pembers · · Score: 1

      OK, the author didn't actually say it, but he seems to be implying that the percentage of people in Japan and China who want (or need) to read the older texts is much higher than in Western countries. They therefore have a greater need of those characters than we would.

      For example, how many English speakers have read Chaucer in the original 14th-century language?

      As well as that, when the Communists came to power in China, they effectively changed the entire written language. This is difficult for us to grasp as (Quebec notwithstanding), nothing like it has ever happened in any English-speaking country. Imagine being told that everything you wrote or read from now on would have to be in Russian or Arabic or Hebrew. If English wasn't actually banned, you would still need some standard for representing it in a computer.

    2. Re:Alrighty by vidarh · · Score: 2
      Input methods for Chinese, Japanese and Korean exist, and can efficiently handle the number of characters required. Some do it by typing out the romanized sound, and mapping it to the characters.

      And actually, the "Unicode standard we have now" does not fit in UCS-2 (16 bit). It requires one of the UTF-* encodings (which are variable length encodings), or UCS-4 (32 bit).

      As for his gripes about Unicode 3.1, sure, there are things you can't write with it. But it's a good step forward. And it doesn't fill the entire glyph-space, by far. The 32 bit encodings, because of the way they are arranged can "only" handle about a million characters if I remember correctly, but that is still way more than is needed.

    3. Re:Alrighty by vidarh · · Score: 2
      Do you use Linux? Try starting "kterm" or similar. If you're using Redhat and Gnome you'll likely find it under "System" in the program menu as "Kanji terminal". Try holding down alt and pressing a couple of character combinations.

      You don't need a special keyboard.

    4. Re:Alrighty by haruharaharu · · Score: 1

      Sure, a Japanese person needs less than 10k glyphs, but they all need different sets of 10k glyphs.

      --
      Reboot macht Frei.
  267. umm by zephc · · Score: 1

    isnt that why there are all those different encodings, so there IS no overlap? i dont think there is any language that cannot use the 65K characters to construct its written language on a computer.

    kanji has what, 3000 characters? Chinese uses unicode by combinations of roots and the other parts of the characters (its kinda complex, but it works! :P)

    So what is the problem? I see no conflicts in any encoding scheme, where even really complex ones like Chinese still work?

    (i know that they use some other type of syllabic system for teaching the writing system, or so i recall from Chinese lessons on TV... maybe that should be used to replace the pictograph system in place now, a system which was kept by the emperors in order to keep the masses illiterate)
    ----

    --
    "I would say that 99 per cent of what my father has written about his own life is false." - L. Ron Hubbard Jr.
    1. Re:umm by Computer! · · Score: 1

      Yeah, you're all right. Americans are stupid and Amero-centric. Microsoft is stupid and Amero-centric. I think most of you claiming this are stupid and you-centric. The difference? My language, and character set are covered by Unicode. In fact, my language is covered in ASCII. This is not just because ASCII was invented by Americans, as might seem obvious. This is because the English language is an incredible language. We use 26 letters to express every concept imaginable. It takes the Chinese 50K+, and that's our fault? So that some obscure historical works can be digitized, I've got to increase storage space two-fold on all character strings that might be sent across the Internet? The total cost of this Unicode pissing match might be up in the Trillions of dollars, with re-coding, re-compiling, storage space and bandwidth. All because of several nations that, for the most part, couldn't have cared less when these decisions were being made. Now that there's money and power involved in computing, they want to break standards. Screw them! They can hump along in whatever system they are provided, and stop griping about unreadable 2000-year-old texts. There never seems to be a shortage of Chinese programmers, so it's obviously not hurting them. Maybe this will be an incentive to simplify their written language, because until they do they'll never really catch up. Not to mention that all programming languages are essentially in English anyway, so in order to write any software you have to speak English. This should allow for us patriots to continue to maintain our stranglehold on the world economy. This is not a troll. This is for real. It's called social darwinism. The freedom provided by democracy, along with the ease of encoding provided by the English language has allowed the US to invent computing as we know it, and excel in all aspects of it. Why shouldn't we continue to lead the world in the development of technology? Because of Confucius? I think not!

      --
      If you fall off a building, go real limp, because maybe you'll look like a dummy and people will be like hey, free dummy
  268. side topic by deXela · · Score: 1

    >Unicode, the semi-commercial equivalent of
    >UCS-2 (ISO 10646-1)

    How is Unicode commercial, and if it is, how does that affect its use with software-libre? Who owns Unicode, and what kind of license does it have? A quick look at www.unicode.org isn't very informative on this subject.

  269. No, _n_ bytes per character! by The+Monster · · Score: 3
    Sort of. You define a 32-bit space for now, then use something like UTF-8 to encode it.

    Personally, I think UTF-8 is just a wee bit inefficient. I worked out a scheme long ago that defines a theoretically infinite namespace, and encodes 7-bit ASCII exactly the same as it is now. If anyone cares, it's as simple as this:

    A "character" is defined as a sequence of bytes ("octets" for the RFC-phile) that ends with a value which has the most-significant bit clear. (If you treat byte as unsigned, this means nonnegative; if signed, it's < 128, whichever test you'd prefer to code. I have my preference...)
    This gives 2^(7 * n) possible characters of length n:
    1. 128.
    2. 16,384, cumulative 16,512.
    3. 2,097,152, cumulative 2,113,664.
    4. 268,435,456, cumulative 270,549,120.
    5. 34,359,738,368, cumulative 34,630,287,488.
    6. 4,398,046,511,232, cumulative 4,432,676,798,720.
    7. ...
    As you can see, 3 bytes allow encoding that covers pretty much every estimate I've seen here.

    The system can be arbitrarily extended any time it's necessary, and existing agents that understand the fundamental rule would know how to parse these extended characters; although they would not know how to present the characters, they would be able to present an appropriate token indicating this fact, rather than displaying gibberish composed of the 8-bit "ascii" encoding they do understand.
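
    A minimal Python sketch of the scheme described above, under some assumptions the post does not spell out: each byte contributes its low 7 bits, most significant group first; every byte except the last has its high bit set; and characters are numbered so that all 1-byte sequences come first, then all 2-byte sequences, and so on (which is what the cumulative totals in the list suggest). The function names are made up for illustration.

```python
def encode(index: int) -> bytes:
    """Encode a character number as a byte sequence whose last byte has the high bit clear."""
    if index < 0:
        raise ValueError("character number must be non-negative")
    # Find the sequence length n: skip past all characters of shorter lengths.
    n, offset = 1, 0
    while index - offset >= 128 ** n:
        offset += 128 ** n
        n += 1
    payload = index - offset
    out = []
    for i in range(n - 1, -1, -1):
        group = (payload >> (7 * i)) & 0x7F
        # Set the high bit on every byte except the final one (i == 0).
        out.append((group | 0x80) if i else group)
    return bytes(out)


def decode(data: bytes) -> list:
    """Split a byte string into character numbers using the 'high bit clear ends a character' rule."""
    result = []
    payload, length = 0, 0
    for byte in data:
        payload = (payload << 7) | (byte & 0x7F)
        length += 1
        if byte < 0x80:  # high bit clear: this byte ends the character
            offset = sum(128 ** k for k in range(1, length))
            result.append(offset + payload)
            payload, length = 0, 0
    if length:
        raise ValueError("input ends in the middle of a character")
    return result


print(encode(65))        # b'A': 7-bit ASCII is unchanged
print(encode(128))       # b'\x80\x00': first two-byte character
print(decode(b"Hi"))     # [72, 105]
```

    A parser that knows only the framing rule can still skip over sequences it has no glyph for, which is the property the post is after.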

    --

    [100% ISO 646 Compliant]
    SVM, ERGO MONSTRO.

    1. Re: No, _n_ bytes per character! by Kearwood · · Score: 1

      This number format is identical to the one used inside midi files to store the length of time between events... Macromedia Flash files also use variable bit-length to compress their structures..

  270. Not China, Greece. by titaniumball2000 · · Score: 1
    There was a young fellow from Sparta, A really magnificent farter,

    On the strength of one bean

    He'd fart God Save the Queen

    And Beethoven's Moonlight Sonata.

  271. A tough problem... by RareHeintz · · Score: 2
    This is a problem that the Chinese gov't has recognized in the past, and the development of the Pin-Yin phonetic romanization system was originally started with an eye toward phasing out the (admittedly more cumbersome, but significantly more beautiful) ideograms. (Of course, they had no idea about the Unicode issue back then, but I'm speaking of the larger issues of having a huge, ideogrammatic written language, of which the Unicode problem is just a new manifestation.)

    I don't know where these plans for conversion to a phonetic written language stand now, though I'm sure it wouldn't be hard to find out.

    OK,
    - B
    --

  272. Not my problem by pkesel · · Score: 1

    Is everyone who wants to publish on the web or otherwise electronically owed a simple solution to all their problems? If your language is not representable in a particular code set DON'T USE IT. If you need to publish in your native language, SOLVE YOUR OWN PROBLEMS! Or better yet, USE PAPER!

    Problems are solved by those who need to do so. They solve their problems, and if it catches yours too, good for you. If not, get to work.

    --
    - Sig this!
  273. Unicode's reply by roozbeh · · Score: 4

    It's probably too late, but following is a response from one of the editors of the Unicode Standard:

    Dear Mr. Carroll,

    I have just finished reading the article you published today on the Hastings Research website, authored by Norman Goundry, entitled "Why Unicode Won't Work on the Internet: Linguistic, Political, and Technical Limitations."

    Mr. Goundry's grounding in Chinese is evident, and I will not quibble with his background East Asian historical discussion, but his understanding of the Unicode Standard in particular and of the history of Han character encoding standardization is woefully inadequate. He makes a number of egregiously incorrect statements about both, which call into question the quality of research which went into the Unicode side of this article. And as they are based on a number of false premises, the article's main conclusions are also completely unreliable.

    Here are some specific comments on items in the article which are either misleading or outright false.

    Before getting into Unicode per se, Mr. Goundry provides some background on East Asian writing systems. The Chinese material seems accurate to me. However, there is an inaccurate statement about Hangul: "Technically, it was designed from the start to be able to describe *any sound* the human throat and mouth is capable of producing in speech, ..." This is false. The Hangul system was closely tied to the Old Korean sound system. It has a rather small number of primitives for consonants and vowels, and then mechanisms for combining them into consonantal and vocalic nuclei clusters and then into syllables. However, the inventory of sounds represented by the Jamo pieces of the Hangul is not even remotely close to describing any sound of human speech. Hangul is not and never was a rival for IPA (the International Phonetic Alphabet).

    In the section on "The Inability of Unicode To Fully Address Oriental Characters", Mr. Goundry states that "Unicode's stated purpose is to allow a formalized font system to be generated from a list of placement numbers which can articulate *every single written language* on the planet." While the intended scope of the Unicode Standard is indeed to include all significant writing systems, present and past, as well as major collections of symbols, the Unicode Standard is *not* about creating "formalized font systems", whatever that might mean. Mr. Goundry, while critiquing Anglo-centricity in thinking about the Web and the Internet as an "unfortunate flaw in Western attitudes" seems to have made the mistake of confusing glyph and character -- an unfortunate flaw in Eastern attitudes that often attends those focussing exclusively on Han characters.

    Immediately thereafter, Mr. Goundry starts making false statements about the architecture of the Unicode Standard, making tyro's mistakes in confusing codespace with the repertoire of encoded characters. In fact the codespace of the Unicode Standard contains 1,114,112 code points -- positions where characters can be encoded. The number he then cites, 49,194, was the number of standardized, encoded characters in the Unicode Standard, Version 3.0; that number has (as he notes below) risen to 94,140 standardized, encoded characters in the *current* version of the Unicode Standard, i.e., Version 3.1. After taking into account code points set aside for private use characters, there are still 882,373 code points unassigned but available for future encoding of characters as needed for writing systems as yet unencoded or for the extension of sets such as the Han characters.

    *Even if* Mr. Goundry's calculation of 170,000 characters needed for China, Taiwan, Japan, and Korea were accurate, the Unicode Standard could accommodate that number of characters easily. (Note that it already includes 70,207 unified Han ideographs.) However, Mr. Goundry apparently has no understanding of the implications or history of Han unification as it applies to the Unicode Standard (and ISO/IEC 10646). Furthermore, he makes a completely false assertion when he states that Mainland China, Taiwan, Korea, and Japan "were not invited to the initial party."

    Starting with the second problem first, a perusal of the Han Unification History, Appendix A of the Unicode Standard, Version 3.0, will show just how utterly false Mr. Goundry's implication that the Asian countries were left out of the consideration of encoding of Han characters in the Unicode Standard is. Appendix A is available online, so there really is no valid research excuse for not having considered it before haring off to invent nonexistent history about the project, even if Mr. Goundry didn't have a copy of the standard sitting on his desk. See:

    http://www.unicode.org/unicode/uni2book/appA.pdf

    The "historical" discussion which follows in Mr. Goundry's account, starting with "The reaction was predictable..." is nothing less than fantasy history that has nothing to do with the actual involvement of the standardization bodies of China, Japan, Korea, Taiwan, Hong Kong, Singapore, Vietnam, and the United States in Han character encoding in 10646 and the Unicode Standard over the last 11 years.

    Furthermore, Mr. Goundry's assertions about the numbers of characters to be encoded show a complete misunderstanding of the basics of Han unification for character encoding. The principles of Han unification were developed on the model of the main *Japanese* national character encoding, and were fully assented to by the Chinese, Korean, and other national bodies involved. So an assertion such as "they [Taiwan] could not use the same number [for their 50,000 characters] as those assigned over to the Communists on the Mainland" is not only false but also scurrilously misrepresents the actual cooperation that took place among all the participants in the process.

    Your (Mr. Carroll's) editorial observation that "It is only when you get *all* the nationalities in the same room that the problem becomes manifest," runs afoul of this fantasy history. All the nationalities have been participating in the Han unification for over a decade now. The effort is led by China, which has the greatest stakeholding in Han characters, of course, but Japan, Korea, Taiwan and the others are full participants, and their character requirements have *not* been neglected.

    And your assertion that many Westerners have a "tendency .. to dismiss older Oriental characters as 'classic,'" is also a fantasy that has nothing to do with the reality of the encoding in the Unicode Standard. If you would bother to refer to the documentation for the Unicode Standard, Version 3.1, you would find that among the sources exhaustively consulted for inclusion in the Unicode Standard are the KangXi dictionary (cited by Mr. Goundry), but also Hanyu Da Zidian, Ci Yuan, Ci Hai, the Chinese Encyclopedia, and the Siku Quanshu. Those are *the* major references for Classical Chinese -- the Siku Quanshu *is* the Classical canon, a massive collection of Classical Chinese works which is now available on CDROM using Unicode. In fact, the company making it available is led by the same man who represents the Chinese national standards body for character encoding and who chairs the Ideographic Rapporteur Group (the international group that assists the ISO working group in preparing the Han character encoding for 10646 and the Unicode Standard).

    Mr. Goundry's argument for "Why Unicode 3.1 Does Not Solve the Problem" is merely that "[94,140 characters] still falls woefully short of the 170,000+ characters needed"-- and is just bogus. First of all the number 170,000 is pulled out of the air by considering Chinese, Japanese, and Korean repertoires *without* taking Han unification into account. In fact, many *more* than 170,000 candidate characters were considered by the IRG for encoding -- see the lists of sources in the standard itself. The 70,207 unified Han ideographs (and 832 CJK compatibility ideographs) already in the Unicode Standard more than cover the kinds of national sources Mr. Goundry is talking about.

    Next Mr. Goundry commits an error in misunderstanding the architecture of the Unicode Standard, claiming that "two *separate* 16 bit blocks do not solve the problem at all." That is not how the Unicode Standard is built. Mr. Goundry claims that "18 bits wide" would be enough -- but in fact, the Unicode Standard codespace is 21 bits wide (see the numbers cited above). So this argument just falls to pieces.
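
    The codespace figures quoted in this letter can be checked with a few lines of arithmetic. This is only a sketch: the split into 17 planes of 65,536 code points each is general Unicode background rather than something stated in the letter, and the variable names are illustrative.

```python
PLANES = 17                       # plane 0 (the BMP) plus 16 supplementary planes
CODE_POINTS_PER_PLANE = 1 << 16   # 65,536

codespace = PLANES * CODE_POINTS_PER_PLANE
max_code_point = codespace - 1
bits_needed = max_code_point.bit_length()

print(codespace)            # 1114112, the figure cited in the letter
print(hex(max_code_point))  # 0x10ffff
print(bits_needed)          # 21

# Even the article's own "18 bits" would cover its 170,000-character estimate.
print(1 << 18)              # 262144
```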

    The next section on "The Political Significance Of This Expressed In Western Terms" is a complete farce based on false premises. I can only conclude that the aim of this rhetoric is to convince some ignorant Westerners who don't actually know anything about East Asian writing systems -- or the Unicode Standard, for that matter -- that what is going on is comparable to leaving out five or six letters of the Latin alphabet or forcing "the French ... to use the German alphabet". Oh my! In fact, nothing of the kind is going on, and these are completely misleading metaphors.

    The problem of URL encodings for the Web is a significant problem, but it is not a problem *created* by the Unicode Standard. It is a problem which is being actively worked on by the IETF currently, and it is quite likely that the Unicode Standard will be a significant part of the *solution* to the problem, enabling worldwide interoperability, rather than obstructing it.

    And it isn't clear where Mr. Goundry comes up with asides about "Ascii-dependent browsers". I would counter that Mr. Goundry is naive if he hasn't examined recently the internationalized capabilities of major browsers such as Internet Explorer -- which themselves depend on the Unicode Standard.

    Mr. Goundry's conclusion then presents a muddled summary of Unicode encoding forms, completely missing the point that UTF-8, UTF-16, and UTF-32 are each completely interoperable encoding forms, each of which can express the entire range of the Unicode Standard. It is incorrect to state that "Unicode 3.1 has increased the complexity of UCS-2." The architecture of the Unicode Standard has included UTF-16 (not UCS-2) since the publication of Unicode 2.0 in 1996; Unicode 3.1 merely started the process of standardizing characters beyond the Basic Multilingual Plane.
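
    To illustrate the point about interoperable encoding forms, here is a small Python sketch that encodes one character from outside the Basic Multilingual Plane in all three forms and decodes it back. The choice of U+20000 (the first code point of Plane 2) is just an example, and the big-endian codec names are used so that no byte order mark appears in the output.

```python
ch = chr(0x20000)  # a code point beyond the 16-bit BMP, chosen for illustration

# Encode the same character in each encoding form.
forms = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

for name, raw in forms.items():
    # Each form round-trips to the identical code point.
    assert raw.decode(name) == ch
    print(name, raw.hex(), len(raw), "bytes")

# utf-8      f0a08080  4 bytes
# utf-16-be  d840dc00  4 bytes (one surrogate pair)
# utf-32-be  00020000  4 bytes
```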

    And if Mr. Goundry (or anyone else) dislikes the architectural complexity of UTF-16, UTF-32 is *precisely* the kind of flat encoding that he seems to imply would be preferable because it would not "exacerbate the complexity of font mapping".

    In sum, I see no point in Mr. Goundry's FUD-mongering about the Unicode Standard and East Asian writing systems.

    Finally, the editorial conclusion, to wit, "Hastings [has] been experimenting with workarounds, which we believe can be language- and device-compatible for all nationalities," leads me to believe that there may be a hidden agenda for Hastings in posting this piece of so-called research about Unicode. Post a seemingly well-researched white paper with a scary headline about how something doesn't work, convince some ignorant souls that they have a "problem" that Unicode doesn't address and which is "politically explosive", and then turn around and sell them consulting and vaporware to "fix" their problem. Uh-huh. Well, I'm not buying it.

    --Ken Whistler, B.A. (Chinese), Ph.D. (Linguistics),
    Technical Director, Unicode, Inc.
    Co-Editor, The Unicode Standard, Version 3.0

    --

    1. Re:Unicode's reply by kwhistler · · Score: 1

      > there is always some need to represent, in some
      > consistent and unambiguous manner, text in
      > languages that can't be possibly accepted into
      > Unicode, such as fictional languages

      Well, fictional *languages* are easy to represent in Unicode, if you use one of the existing scripts in the standard. Pig Latin, whatever. In fact this is exactly how Klingon works -- it's all done in Latin transliteration anyway by the Trekkies and the official Klingon Language Institute (I kid you not), so it already works in Unicode.

      If you are talking about fictional *scripts*, then in fact the most important, most studied and cited of those are Cirth and Tengwar, the scripts invented by Tolkien. Guess what, those *are* roadmapped for inclusion in the Unicode Standard. You might want to actually take a gander at the official roadmaps for Unicode and 10646 before mouthing off about what cannot possibly be included in the standards:

      http://www.egt.ie/standards/iso10646/ucs-roadmap.html

      > they can be easily handled by any expandable
      > charsets-handling system...

      And Unicode is not expandable? It is already planned for expansion to include Egyptian hieroglyphics, Sumero-Akkadian cuneiform, Limbu, Buginese, Avestan, and dozens of other minority and historic scripts you've probably never heard of. There are 882,373 code points still available for that kind of expansion, which is something like 800,000 more than all the known requirements of all the known writing systems current and past. And beyond that, there are 137,468 private use characters permanently set aside for anyone to define anything they damn please with. And if goofy expansion systems are your cup of tea, then your private use of Unicode private use characters could be to define them in pairs (for example) to create over 18 billion encodings of things, or in triples to create 2,597,794,797,367,232 (that's 2 and a half quadrillion) encodings of things. *That* should keep you busy.
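
      The private-use figures above can be reproduced with a short calculation. In this sketch the three private-use ranges (U+E000..U+F8FF in the BMP, plus all but the last two code points of planes 15 and 16) are filled in from general Unicode background, not from the post itself, and the names are illustrative.

```python
# Private-use ranges, assumed from general Unicode background:
BMP_PRIVATE_USE = 0xF8FF - 0xE000 + 1        # 6,400 code points
PLANE_PRIVATE_USE = 0xFFFD - 0x0000 + 1      # 65,534 per plane (planes 15 and 16)

private_use = BMP_PRIVATE_USE + 2 * PLANE_PRIVATE_USE
pairs = private_use ** 2
triples = private_use ** 3

print(f"{private_use:,}")  # 137,468
print(f"{pairs:,}")        # 18,897,451,024
print(f"{triples:,}")      # 2,597,794,797,367,232
```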

      > Unicode supporters do everything that is
      > possible ... to prevent any competing system
      > from being developed.

      Well, that is some pretty hyperbole. No one is holding any guns to anyone's heads on this, figuratively or literally. The main reason competing systems are having little success is that universal character encoding schemes are *enormous* undertakings and commitments of resources. Try looking at the Acknowledgements page of the Unicode Standard: 5 pages long in small print! You try organizing hundreds of people to work on a project for a decade, and then get hundreds of companies and dozens of other standards to implement what you come up with. Most competing efforts simply founder quickly on the sheer amount of work involved.

      > The problem is, Unicode is being used for things
      > it is inadequate for..

      Such as? Perhaps you could be more explicit in stating an example, so it would be possible to evaluate what you are talking about.

      You seem to distrust the universality of the Unicode Standard. But for use on the Internet, and as the backbone of XML, HTML, Java, and other standards, it is the universality which is the attraction and the big advantage. What are you proposing instead? Use ISO 2022 with Escape switching to hundreds of individual encodings, many of which may have totally incompatible models of text handling and which thus would have little or no chance of being correctly handled or rendered on any average system that might encounter them? Do you think that computer systems just magically deal with some arbitrary, idiosyncratic local encoding because somebody, somewhere thought it was a better idea for whatever language they are familiar with?

  274. moderators dude! by invalid_user · · Score: 1

    Please mod this one up some more. Please? TIA.

  275. Are you Chinese? by invalid_user · · Score: 1
    I mean, it's not like you can't read old Chinese literature with the current character sets, right? Most of these characters are archaic and often substituted with variants which either look or sound similar to the original (which are easy to annotate with clarifications). For God's sake, 99% of Chinese people won't even come across any of these words in their _entire_ life.

    If the guy who wrote this article really cares so much about this archaic heritage, let him return to writing with the original (fan-ti) characters, okay?

    Societies change. People change. "yi" is inevitable.

  276. Duh. by Shoten · · Score: 2

    This should be obvious to anyone who has ever looked at a unicode chart or has had to click "Cancel" when asked to install character support for any of the myriad languages that need language packs to be displayed in Windows. Ok, so they built a way to theoretically support all of these characters. This does not mean that I can read Japanese, however, and making it possible to see it in my browser will not change that fact, nor will it make Google searchable in Japanese, cause IRC to support katakana or hiragana characters (and just freaking forget kanji unless you want to chat with a graphics tablet). Unicode has purposes (besides making it easier to hack web servers, that is), but the hopes and dreams built around it are a classic case of throwing tech at a social barrier to try and make it go away.

    --

    For your security, this post has been encrypted with ROT-13, twice.
  277. But for Java by Husaria · · Score: 1

    They made it work for Java, I'm sure an interpreter for HTML, ASP, could be worked out using Unicode.
    Download once, read anywhere

  278. Unicode != UCS-2 by Snowhare · · Score: 1

    The author of that article did something fundamentally wrong (equated Unicode with UCS-2) and then proceeded with perfect logic to produce garbage output. As others have pointed out, UCS-2 is just an _ENCODING_ of Unicode, and not even the most general one. UTF-32 can handle 4 BILLION code points. Even the more common UTF-8 can handle over a million. Now if the author of that article could see beyond his anti-western bias, he would have learned that people who work routinely with Unicode addressed his underinformed problem years ago.

    If you think even every human language when put together needs more than 4 billion code points, you live in a different universe than I do.

  279. IOW, Unicode can't do everything. by AnotherBlackHat · · Score: 1
    Ok, you've convinced me - Unicode can't handle the large number of Asian "letters." I never liked Unicode anyway - let's just go back to 8 bit codes, admit that letters only work for European languages, and force everyone else to use graphics.

  280. 4 bytes per character? by oogoody · · Score: 1

    Don't think so.

  281. Unicode Surrogates by Mumbly_Joe · · Score: 1
    I've been writing Unicode code lately (using UTF-8 encoding) and I've been reading the 3.0 standard.

    The Unicode standard supports surrogates, which are pairs of 16-bit code units. These pairs define about 1 million additional code points within the standard. A "code point" is a unique value for some character.

    There is plenty of room in the Unicode space for all the characters.
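    The surrogate arithmetic is short enough to sketch. This is a minimal illustration of the standard UTF-16 pairing rule; the function names `to_surrogates` and `from_surrogates` are made up for the example.

```python
# Sketch of UTF-16 surrogate pairing (function names are illustrative).
HIGH_START, LOW_START = 0xD800, 0xDC00
SURROGATE_PAIRS = 1024 * 1024  # 1024 high values x 1024 low values


def to_surrogates(code_point):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary range")
    offset = code_point - 0x10000          # 20-bit value
    high = HIGH_START + (offset >> 10)     # top 10 bits
    low = LOW_START + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def from_surrogates(high, low):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


print([hex(v) for v in to_surrogates(0x20000)])
print(SURROGATE_PAIRS)
```

    Two 10-bit halves give 1024 x 1024 combinations, which is where the "about 1 million" figure comes from.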

  282. Re:More Flamebait :) by cavemanf16 · · Score: 1
    "Elen sela illumen omentielvo!"

    Spelling and pronunciation not perfect, but it means "A star shines upon the hour of our meeting!" - rough translation of this Quenya Elvish phrase which is a derivative of the Tengwar elven language built by JRR Tolkien. And yes, I have actually used it with close friends before.

  283. Re:I had no trouble reading that at all by cryptochrome · · Score: 2

    Naturally you'd have to do something about homonyms (I'll sounds just like aisle, anyway). Probably best to just work around them.

    cryptochrome
    time to get ill

    --

    ---If you can't trust a nerd, who can you trust?

  284. Pictographs suck by cryptochrome · · Score: 3

    For crying out loud, somebody tries to do something nice for somebody and they come back and accuse them of cultural chauvinism. The powers that be didn't have to develop Unicode or UCS at all. They only developed it because the proliferation of language protocols was making the internet difficult to use for foreign languages and multinational businesses in general.

    And besides which, the point of the article is moot. As this article states:

    ISO 10646 defines formally a 31-bit character set. However, of this huge code space, so far characters have been assigned only to the first 65534 positions (0x0000 to 0xFFFD). This 16-bit subset of UCS is called the Basic Multilingual Plane (BMP) or Plane 0. The characters that are expected to be encoded outside the 16-bit BMP belong all to rather exotic scripts (e.g., Hieroglyphs) that are only used by specialists for historic and scientific purposes. Current plans suggest that there will never be characters assigned outside the 21-bit code space from 0x000000 to 0x10FFFF, which covers a bit over one million potential future characters.

    The italics and bold are mine. The 16 bit system was not meant to be completely comprehensive - it was meant to be useful for everyday use. Which, since it covers the characters literate people are expected to know in these systems, it does. The rest of the characters are academic (literally). If these characters are so important why don't they expect all of their own countrymen to know them?

    The proprietors of the internet could have happily stuck with the regular 8-bit Roman alphabet system forever (the internet being an American military invention in the first place). The roman alphabet was just part of the system. Hell, even a 16-bit code would have covered all script-based writing and scientific/miscellaneous notation systems easily, while leaving codes or a dedicated bit for the eastern pictograph systems to signal an extension of the protocol and letting them work out their own standard amongst themselves. It would have been fun to watch them (particularly Taiwan and China) squabble for dominance over it too.

    No one is forcing these eastern nations (or any non-roman-alphabet users) to use Unicode or UCS, or the internet or computers for that matter. If they really wanted to, they could come up with their own systems based on their own languages. They just hopped on board and adapted it to their own needs like everyone else because it's a good idea, and it would be way too difficult to build around their own languages.

    But isn't it funny how every one of these eastern countries (except Japan thanks to hiragana and katakana) adapted the phonic roman alphabet to simplify the teaching of their own languages? With at least 170,000 characters between them, defenders of these languages claim they are a rich cultural heritage and a beautiful illustrated system. You could just as easily say that modern use of these pictograph-based written languages is oppressively difficult and ensures a lot of time and effort wasted just trying to learn to write at best, and is a stratifying system which guarantees high rates of illiteracy at worst. Erosion of these rigid and limited pictographic writing systems in favor of flexible and encompassing phonic ones is no accident or western conspiracy. Just as UCS was developed to make computer communication universal, the adaptation of phonic systems is the tendency to make literacy universal.

    cryptochrome

    P.S. Some may think that ISO 10646 (aka UCS-2) is not Unicode, but in fact as that same article points out "They joined their efforts and worked together on creating a single code table. Both projects still exist and publish their respective standards independently, however the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of the Unicode and ISO 10646 standards compatible and they closely coordinate any further extensions."

    --

    ---If you can't trust a nerd, who can you trust?

  285. I had no trouble reading that at all by cryptochrome · · Score: 3

    The irony of that message being marked as funny (adapted as it is from Mark Twain) is that after a few seconds to adjust, I had no trouble reading that statement at all.

    We tend to forget that there have been a lot of different spelling and notation systems for english. Even today, the british and american methods aren't identical. For all the fun we make and fear we have of the idea that the english (or any other language's) orthographic system should be simplified and made consistent with pronunciation, it is not a bad idea. It would greatly simplify the process of becoming literate and save tons of effort spent trying to learn irregular spellings. Beyond that, applying the same principles to pronunciation, the alphabetic letters (children's difficulty distinguishing b and d is universal), and vocabulary would accomplish the same goals with learning and using language.

    cryptochrome

    P.S. You forgot to mention dropping that pesky capitalization system. of course half the messages on the net don't bother with it. same thing goes for dealing with contractions, a la dont, wont, ill, and so on.

    --

    ---If you can't trust a nerd, who can you trust?

  286. Re:Pictographic icons are not letters! by Alanus · · Score: 1

    In Japanese house has 10 strokes. The problem is that the strokes cannot be mapped to individual "characters" since the individual strokes have no standardized position, direction or size. To encode the strokes would take a lot of information (a bitmap would probably be easier).

  287. Re:Hmm.. I must have been using something else the by vidarh · · Score: 2

    No. See the glossary at www.unicode.org - UCS-2 and UCS-4 are encoding forms of the unified character set defined by the ISO/IEC 10646 standards, which now include at least 10646-1 and 10646-2. Unicode is mostly a different name for the ISO/IEC standards, but also includes additional information about the use of the characters.

  288. Re:Hmm.. I must have been using something else the by vidarh · · Score: 2

    See my other post below. ISO/IEC 10646 and the Unicode standards define the character sets. UCS-2 and UCS-4 are encodings of those character sets. UTF-7/UTF-8/UTF-16 are transformation formats that allow variable length encodings of the UCS-2 and UCS-4 encodings.

  289. Hmm.. I must have been using something else then? by vidarh · · Score: 4
    I've been using Unicode in various incarnations for a long time. And UCS-2 is not the only way to encode Unicode. UTF-8 is perhaps a lot more widespread, as it is the de facto standard encoding for exchange of XML documents over the web.

    UCS-4 is also quite common, and allows for the new extensions.

    UTF-16 is used by those who need to extend their UCS-2 applications, or who mostly need text that works with UCS-2 but want to be prepared for more.

    Yes, a lot of things are difficult with Unicode. But if you look at most recent internationalization efforts, unicode is what people use.

  290. Re:More Flamebait :) by tb3 · · Score: 2

    Klingon into Unicode? I knew those people were obsessed, but that's just asinine! Fictional languages shouldn't even be considered, where would it end?

    "What are we going to do tonight, Bill?"

    --

    www.lucernesys.com Horizon: Calendar-based personal finance

  291. 16-bit Should Be Enough. by robbyjo · · Score: 2

    First of all, I think the editor (not the author) is right: "We're not in the same room". Therefore, 16-bit should be enough to encode even all the 50,000+ chars of K'ang Hsi dictionary. Moreover, if we try to encode ALL characters in the world, how redundant it would be. Surely Hindi speaking people won't speak Chinese and Hindi at the same time.

    Moreover, we have "Content Language" and "language" tag in HTML, don't we? If we ever want to encode two or more different languages, we can simply include these tags and be done with it. The browser can then pick the appropriate fonts and voila!

    Of the claimed 170,000 characters from the Orient, many can be unified since they are the same (in Japanese Kanji, Simplified, and Traditional Chinese). Simplified and Traditional Chinese share a lot of similarities. Even the simplified writings of a particular character often look nearly the same as the traditional one. Thus, the encoding for these two can be unified, only the font bitmap is different. Moreover, it won't be logical to use both simplified and traditional characters in the same article (except if they are exactly the same). So, these can save 50,000 characters.

    Japanese kanji also share a lot of similarities with both Traditional and Simplified Chinese (more with traditional than simplified). So, the encoding can be simplified too. Save another thousand characters.

    --

    --
    Error 500: Internal sig error
  292. The reason why Microsoft don't like unicode: by DavidJA · · Score: 1
    As taken from my web log:
    /scripts/..Á%8s../winnt/system32/cmd.exe?/c+dir
    /scripts/..Á%pc../winnt/system32/cmd.exe?/c+dir

    Note to moderators: this is not flame bait, it's funny!

  293. More Flamebait :) by bark76 · · Score: 3

    Maybe if people didn't try to get character sets like Klingon, Cirth and Tengwar added into unicode we wouldn't have this problem!

  294. Danny Boy? by Flying+Headless+Goku · · Score: 1

    You mean that song with words written by an English lawyer, using the tune from Londonderry Air, and marketed most successfully in the United States?

    It is an Irish style song, not an Irish song.
    --

    --
  295. Re:Perl in Hierogliphics by Magumbo · · Score: 2

    :) Oh yeah! The king of wacky, terse, symbolic programming. You've gotta love it.

    --

  296. another drawback of unicode by Magumbo · · Score: 3

    And we must not forget about hieroglyphics. Unicode certainly has forgotten about them. That would be so cool to write perl code with little cats, birds, ankhs, and various other squiggles.

    --

  297. (correction to reply to AC) by haruharaharu · · Score: 1

    Right. Check your own faq:

    in UCS, up to 6-byte long UTF-8 sequences are possible to represent characters up to U-7FFFFFFF
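    A rough sketch of how that original 1-to-6-byte scheme lays out its bits. Python's built-in codec only goes up to four bytes (the later U+10FFFF limit), so this is hand-rolled; the function names `utf8_length` and `encode_ucs_utf8` are invented for the illustration.

```python
# Sketch of the original UTF-8 scheme, which allowed sequences of up to 6 bytes.
# Largest value that fits in 1, 2, 3, 4, 5 and 6 bytes respectively.
LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]


def utf8_length(code_point):
    """Number of bytes the original scheme needs for this value."""
    if code_point < 0:
        raise ValueError("negative value")
    for n, limit in enumerate(LIMITS, start=1):
        if code_point <= limit:
            return n
    raise ValueError("value does not fit in 31 bits")


def encode_ucs_utf8(code_point):
    """Encode a 31-bit UCS value as a 1 to 6 byte sequence."""
    n = utf8_length(code_point)
    if n == 1:
        return bytes([code_point])
    out = []
    for _ in range(n - 1):
        out.append(0x80 | (code_point & 0x3F))  # continuation byte: 10xxxxxx
        code_point >>= 6
    lead_prefix = (0xFF << (8 - n)) & 0xFF      # n one-bits followed by a zero
    out.append(lead_prefix | code_point)
    return bytes(reversed(out))


print(encode_ucs_utf8(0x7FFFFFFF).hex())
```

    For values up to U+10FFFF this produces the same bytes as the modern four-byte-limited UTF-8.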

    --
    Reboot macht Frei.
  298. Re:Well DUH! It's not meant to have every characte by haruharaharu · · Score: 2

    50k? The numbers i got from the Japanese Ministry of education were closer to 900.

    --
    Reboot macht Frei.
  299. I'll take that challenge by MarkusQ · · Score: 1
    First off, I agree with you. But you post such an interesting challenge I can't resist:

    A better example would be the ampersand character (&). I can think of several ways to write that character, but I challenge anyone to come up with a sentence where changing one presentation form of the ampersand for another changes the meaning of the sentence.

    How about:

    "To delimit a path, *nix uses a slash, whereas MS* uses a backslash; if you get these confused it helps to remember that ampersand is a rounded "E" with a slash through it."

    Contrived, I will admit, but I think it answers your challenge.

    --MarkusQ

  300. It already works by Greedy · · Score: 1

    Having worked on internationalisation at several companies, and now working in China, I have found the Unicode standard to be a very good thing.

    First of all, it might not be perfect, but it works very well. It is a lot easier to have one table instead of the many different encodings there used to be. It makes life as a software developer a LOT easier. Companies can easily tell their developers to write in Unicode instead of ASCII (C's wchar_t works fine), and if they ever want to write a multilanguage version then it is SO much easier. This will increase the number of translated applications in the future.

    Second, Unicode is already in major use. Every new OS out these days uses Unicode internally. Win2k and Symbian EPOC are good examples. The latter is also the reason why all mobile phones work perfectly with Chinese characters. The IETF is also working on standardisation of foreign characters in DNS names, and also uses Unicode for that (note: it does not restrict itself to Unicode), as mentioned in http://search.ietf.org/internet-drafts/draft-ietf-idn-requirements-07.txt

    Conclusion: thanks to the Unicode standard for making the life of a software developer a LOT easier!

  301. Yes, it is by absurd_spork · · Score: 1

    Just because English is the most popular language on the Internet at the moment, that doesn't mean that other languages are not used or that other languages might not take over that role in the future. If, for example, the growth of Internet accessibility in China keeps up at that rate, Chinese will be language #1 on the Internet by 2007, especially since Chinese will be read and understood by Koreans and Japanese as well.

    1. Re:Yes, it is by trash+eighty · · Score: 1
      well they know the characters but thats like a german knowing what the letters in an english sentence are

      not that useful

  302. You don't really KNOW about unicode, do you? by absurd_spork · · Score: 2
    Honestly, you don't really KNOW about Unicode and how it works, do you?

    The idea behind Unicode is to have a uniform encoding for all the world's scripts, not for all the world's languages. The necessity of this is evident for anyone who has experience with the insufficiencies of the individual codepage systems (Windows CPxxx, ISO 8859-x, ISCII etc.) currently in use. Have you ever tried to send an Arabic e-mail through a non-Arabic mailserver or run a program with German character support on a codepage 450 Windows? Unicode is designed to make programs and data interoperable regardless of either's language encoding.

    Just because you don't know Japanese it doesn't make the rendering of Japanese pointless. Just because you don't have a clue how a Chinese or Japanese Kanji input system works doesn't render the idea of being able to chat in IRC using Japanese characters entirely pointless.

  303. Workaround by oliveloaf · · Score: 1

    Why not use unicode for everyday use, and a PDF'ish format that could have every character of said language for special purposes, i.e. historical documents.

  304. Why get all upset about it? by m08593 · · Score: 1

    On the one side, you have a country with a pretty but otherwise messy, outdated, and unwieldy writing system, unwilling to move to a more convenient alphabetic writing system (rightly or wrongly). On the other side, you have a large collection of western corporations that desperately want to sell lots of equipment there without the cost of doing specialized software development. This ought to be an interesting fight.

    1. Re:Why get all upset about it? by m08593 · · Score: 1
      Actually, I do know it. I also know other people who spent years learning it, including native speakers. And I've had enough opportunities to watch native users of those writing systems struggle with them. And my conclusion is: alphabetic writing systems are simply a better idea. That has nothing to do with whether other aspects of English are complex or baroque (which they are).

      There is nothing western about alphabetic writing systems: they trace their roots to the Middle East and are used as much in Asia as in Europe and the US.

      But reality is that the CJKs aren't going to change. Hell would freeze over before those cultures would undertake such a step even if it were practical. They are stuck with an unwieldy writing system. In fact, they probably like the barriers that creates to foreign competition; that alone makes Unicode a losing proposition. And the only reason US companies bother is because they want to export there; non-CJK users otherwise have little interest in paying extra for the complexities and cultural sensitivities of CJK countries.

  305. you can't make the Chinese happy anyway by m08593 · · Score: 1
    No matter what people come up with, I suspect the Chinese are not going to be happy with a character set that wasn't designed by them. They will probably also not be happy with character sets that accommodate Taiwan.

    The best solution, in my opinion, rather than to come up with a global standard, is to let different countries work out their own coding schemes and then come up with a way of encapsulating those schemes in an 8bit code. That way, people who don't need Chinese or Japanese don't have to pay for the overhead resulting from the complexity of those writing systems.

    Mixed language editors would continue to be the specialty software they are and have to come up with their own representations.

    Of course, US software vendors hate that because they would have to spend a lot more money on customizing their software to particular target markets; they can't just translate a file of message strings. They might even have serious competition from local vendors.

  306. Re: Is this a problem by slashtop · · Score: 1

    Well, if we want to have the "standard" language be "Chinese", you'll first have to decide which one you want.
    China has 7 main dialects, according to my Chinese language class teacher. People in Shanghai speak a language that can almost be considered completely different than the one in Beijing. They use the same characters for writing, but use them to mean different things. At the very least, you have Mandarin and Cantonese.
    Also, while Chinese is a grammatically simple language (no conjugation, no pluralisation, etc.), it is less fun to write, because there is no alphabet. Yes, there is a different character for every word. Yes, there is a rhyme/reason to the characters, but that doesn't make it all that much less difficult to learn all of them. Oh, and you have to decide whether you want simplified or traditional Chinese characters to be the "standard", too.
    Finally, while the population of China is certainly the largest in the world, do they really have the most people _online_? I have no statistics, I'm actually curious.
    Sotto la panca, la capra crepa
    sopra la panca, la capra campa

    I am Chinese. Chinese has only one charset, and everyone can speak Mandarin to interact with each other. Of course, they can also talk in their local native language using the same charset as Mandarin!
    enjoy!

  307. You are SO naive... by brendano · · Score: 1

    Phonetic writing is one of the greatest inventions of mankind. All a speaker needs to be literate is to learn the mapping between sounds and letters. Could anything be easier?

    No offense, but your post reeks of the naive, self-absorbed Western arrogance that the entire world hates. Have you actually LEARNED or even had tiny experience with a non-phonetic written language? There are COMPLETELY different ways of communicating ideas or expressing emotions. Forcing billions of people to convert to your culturally imperialist straitjacket isn't just infeasible, it ignores an entire dimension of humanity.

    I agree there is a serious problem of understanding texts written in the "old way". There is a simple solution here, too, i.e., we just translate what's most important to the "new way" and let scholars work on the texts that don't get translated. Before anyone gets too hot here, the situation is not that much different than translating literature from one language to another. It is too much work to translate everything that is written in English into French, so one focuses on the texts that are important enough for translation.

    Um... Ever read a great piece of literature in one language, then read the translation? It's NEVER the same. Languages are much more than communication -- they embody entire modes of thinking, of cultural assumptions. Any modern linguist (Noam Chomsky comes to mind) could tell you that. In the case of, say, poetry, you will always lose the meter and rhythm. In the case of, say, political works, you will always substitute in words that have the wrong connotations or sound funny in the new language. Translators ALWAYS struggle with these issues.

    While the author of this original article may be misinformed on the particulars of Unicode, or may be flawed in asserting Unicode should accommodate every single character system that exists, it's the overall message from your post that is most misinformed -- that the language YOU use is the best, and the rest of the world should convert to YOUR way of thinking and communicating without your even having to try to understand them. Anyone who actually has studied the issue would not reach the sadly shallow conclusion you have.

    I realize this is way too utopian. We Americans can't even move to metric, much less anything more "radical". I just needed to respond to the whining.
    Who's whining? People who point out the horrendous exclusion and bias in the Western-dominated technological conferences that dictate standards of communication, or people like you who don't understand anything past what you grew up with?

    This is nothing like the metric system. Ditching the inch and pound is nothing like ditching the bedrock texts of your culture.

    Personally, I do not fully agree with the article's author. Of course ancient documents don't need to be represented in the character set intended for everyday use by businesses or other entities. However, the attitude expressed in your post is what causes so much resentment and pain in the first place.

    I'm sorry if this post sounds rantish. I'm just ashamed to belong to a community that produces posts like yours.

    Language is about what people think, how people live, and at root, their culture. Language is not about the undefinable concept of "efficiency."

    --
    -Brendan
  308. Use Chinese for English Data compression! by eknuds · · Score: 2

    Actually, if you wanted to, you could write English/German/French/Spanish using Chinese! It would actually be fairly simple, one Chinese character == one English word. Just have the display program figure it out, ie translate the Unicode Chinese into the English/French/etc.. Achieve instant 50% or greater data compression. It's the perfect compression solution for us bandwidth-sucking Westerners! No verb tense or plurals? Don't need it! It's all fluff anyway! I'm only 1/4 joking...

  309. Conspiracy Theories and Unicode by kwhistler · · Score: 1

    Much of this sounds like the old evil empire Microsoft conspiracy theory out to squash the good cowboy Linux true blue we want to save the world from evil story.

    What this *really* has to do with Unicode isn't clear. The major commercial Unix vendors have all made significant commitments to Unicode support, and even the Linux internationalization community is busy adding Unicode support to Linux. Apparently it doesn't matter to you that Sun, HP, Compaq, NCR, and major Linux I18N players participate in Unicode development, too. It isn't an either/or black and white issue. It isn't some gigantic conspiracy to use a bad standard to prevent the good guys from developing a good standard. But I guess you can believe whatever you want.

    As for multilingual text and statelessness, was kann ich Ihnen sagen? Comment pourrai-je réparer ma bêtise? Oops! Sorry, I guess I couldn't do that in Unicode, could I, or Code Page 1252, or Latin-1 for that matter?

    Stateful language processing has its place, in multilingual text or monolingual text, even. But how you construct that stateful processing is not dictated to you by Unicode, any more than it is dictated to you by having Latin-1 implementations on Unixes. XML defaults to Unicode, but you can use it with any character set you choose to mark. And if you use it with Unicode, you can span mark any statefulness you want into it.

    But in any case, feel free to go off and invent your systems of language and charset tagged substrings handled "transparently as sequences of bytes" and come back to show us all when you have your better mousetrap working.

    1. Re:Conspiracy Theories and Unicode by kwhistler · · Score: 1

      > While there is a lot of effort to shoehorn
      > Unicode into Unix and Unix software, the actual
      > results are beyond miserable, precisely because
      > Unicode does not work.

      Ah, I understand now. Not that I am going to praise the Unix vendors' support of Unicode as the best and most usable around, but I suggest you try making that claim directly to the Unicode representatives working for Sun, Compaq and others, and see if you can pull off such a claim.

      > ... thus getting blessed by Unicode consortium
      > as compatible.

      Wrong again, Alex. The Unix vendors added Unicode support because they perceived it to be in their commercial interest to support a universal character encoding standard that other vendors and standards were starting to make widespread use of, and which growing numbers of customers started to ask them to support.

      The Unicode Consortium doesn't "bless" any vendor, and doesn't have any certification program that anyone needs to pass in order to be declared "compatible". People claim themselves conformant to the standard if they choose, and if their implementation is defective or non-conformant, they get beaten on by disappointed customers, not by the Unicode Consortium.

      > UTF-8 can be "supported" in that way even by
      > abacus, if that abacus is long enough and
      > has at least 8 stones in a row, ...

      Well, most of us also took elementary computer science, and learned that any algorithm can be implemented on a Turing machine. So I guess we should go to the NOAA weather modelers, when they run a weather simulation on a supercomputer, and let them know they could use a Turing machine, instead, eh?

      UTF-8 on an abacus -- yes, I guess that *is* a strawman that we should all take *real* seriously.

      > I have never in my life seen a filename in UTF-8
      > outside of Unicoders' demos...

      I presume you mean on Unix systems, where for most such systems, choice of UTF-8 for filenames would be problematical because they would run afoul of other parts of the system that don't handle them. Sure, such may be the case.

      On the other hand, UTF-8 databases are now running routinely on Unix systems, and they work just fine, thank you.

      > and I am Russian myself and have a lot of
      > friends that speak Japanese.

      Umm. And the relevance of that comment is what?

      > So, again, Unix vendors' support of Unicode is
      > in fact a lip service, ...

      Implying that you think it is a cynically added feature to get a checkmark or a brownie point somewhere, and that they all think it is really doomed to the trashheap of history like OSI. I'm hardly going to take your word for it. I suggest you get some international architects for the Unix vendors to come on list and support your contentions.

  310. Unicode closed to participation? by kwhistler · · Score: 1

    And to further support your point against Alex, I would like to point out that I have attended nearly every Unicode Technical Committee meeting, since its inception, and to the best of my knowledge, *never* has an interested participant or observer been turned away at the door, whether they were formally a member or not.

    Also, unlike ISO, which restricts primary membership to accredited national bodies (but does, however, allow expert participation in the working groups, regardless), the Unicode Consortium memberships are open to anyone who wants to pay the dues. In the history of the Consortium, there have been cases of an individual person forking out for a full membership because they wanted voting participation on a particular issue, and the Consortium has not only commercial corporations as members, but also national governments, state governments, libraries, academic institutions, whatever. Anyone who wishes to participate is welcome.

    Anybody in the world can and does join the open discussion list, unicode@unicode.org, hosted by the Consortium, and is free to discuss or browbeat on whatever Unicode-related topic concerns them.

    So unless those who claim that Unicode is a closed cabal mean by that that the Consortium should be subsidizing free memberships (it is a registered non-profit corporation) or should be holding its deliberations on public-access TV, I fail to see what the knock is on the Consortium.

  311. Statelessness of text by kwhistler · · Score: 1

    > Unicode is made under the slogan of total
    > statelessness of text, so while applications'
    > file formats may allow this, arbitrary
    > substring in a text can't.

    You keep harping on this "statelessness of text" issue as if this is something that Unicode caused that is destroying the capabilities for decent multilingual processing. But in fact, the same assumptions, as regard text representation, underlie ISO 8859-1 (Latin-1), Code Page 1252, or nearly every other character set in widespread use in the world today. You can use Latin-1 to mix English, French, German, Spanish and any other of dozens of languages, but you cannot do tagging of charset or language in arbitrary substrings of Latin-1 without the use of a higher-level markup language, any more than you can in Unicode.

    All character encodings work that way -- except for 2022, which itself is just a framework for implementing switching between the other character encodings in stream, and doesn't have the kind of language tagging for arbitrary substrings you seem to be advocating, anyway.
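    To make the point concrete, here is a small sketch (Python 3; the sample sentence and variable names are just illustrative assumptions). The bytes for mixed German and French text carry no language marker in Latin-1, Code Page 1252, or UTF-8 alike; any such tagging has to come from markup layered on top.

```python
# Sketch: mixed-language text is just a sequence of characters in every encoding.
german = "Was kann ich Ihnen sagen? "
french = "Comment réparer ma bêtise?"
mixed = german + french

latin1 = mixed.encode("latin-1")
cp1252 = mixed.encode("cp1252")
utf8 = mixed.encode("utf-8")

# Latin-1 and CP1252 agree byte for byte on these characters,
# and neither stream records where the German stops and the French starts.
print(latin1 == cp1252)
print(len(mixed), len(latin1), len(utf8))

# Round-tripping through any of the three gives back the same untagged text.
print(latin1.decode("latin-1") == utf8.decode("utf-8") == mixed)
```

    The only difference between the encodings is the byte count: the two accented letters take two bytes each in UTF-8 and one byte each in the 8-bit sets.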

    So what is the basis for the knock on Unicode here?

  312. Unicode character allocation (was Unicode's reply) by Mokurai · · Score: 1
    You say "the allocation of characters is handled by a single, and not in any way open, organization," but this is not in any way the case. Apart from the fact that Unicode, Inc. and ISO have published their plans for encoding every known writing system, and invited anyone to propose any needed characters, they have set aside a Private Use Area of more than 130,000 code points where anyone at all can encode their own characters. No character encoding in history has so liberally supported so many user communities.

    The PUA is presently used for such things as corporate logos, which are not accepted into Unicode, and for many writing systems which have not been worked out in sufficient detail for formal encoding. This includes real, historical languages, and also Tolkien's Cirth and Tengwar. I hear Klingon is out there too.
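    The "more than 130,000" figure can be checked with a little arithmetic. This sketch assumes the three private use ranges as defined in the standard (one in the BMP plus Planes 15 and 16); the variable names are made up for the example.

```python
# Sketch: counting private use code points (ranges are inclusive).
import unicodedata

PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),      # Private Use Area in the BMP
    (0xF0000, 0xFFFFD),    # Plane 15
    (0x100000, 0x10FFFD),  # Plane 16
]

sizes = [end - start + 1 for start, end in PRIVATE_USE_RANGES]
total_private_use = sum(sizes)

print(sizes)
print(total_private_use)

# Python's Unicode database reports category "Co" (private use) for these code points.
print(unicodedata.category(chr(0xE000)), unicodedata.category(chr(0xF0000)))
```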

    So what do you claim that you can't do?

    --
    "A knot!" said Alice, ever ready to be useful. "Oh, do let me help to undo it!"