Why Unicode Won't Work on the Internet
We received this interesting submission from N. Carroll: "Unicode, the commercial equivalent of UCS-2 (ISO 10646-1), has been widely
assumed to be a comprehensive solution for electronically mapping all the
characters of the world's languages, being a 16-bit character definition
allowing a theoretical total of over 65,000 characters. However, the
complete character sets of the world add up to approximately 170,000
characters. This paper summarizes the political turmoil and technical
incompatibilities that are beginning to manifest themselves on the Internet
as a consequence of that oversight. (For the more technical: the recently
announced Unicode 3.1 won't work either.)" Read the full article.
Introducing foreign language character sets and languages only splinters the internet into artificial factions and we end up having borders on the net. Is that what you want?
That would be so cool to write perl code with little cats, birds, ankhs, and various other squiggles.
Must... resist... making... lame... Perl readability joke... arrrghhh....
FYI, there are more Chinese who speak English than there are Americans who speak English. (I saw that on the Discovery Channel :p )
In fact I'd propose 8-bit ASCII as the standard with, say, 4 escape characters, each of which is followed by two bytes. This allows 252 + 64K + 64K + 64K + 64K, or roughly 262,000 characters, and does so WITHOUT breaking most ASCII-based services and code out on the net and in the world.
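For anyone who wants to check the arithmetic, here is a minimal Python sketch of such a scheme. The four escape byte values are an assumption (the proposal does not name any), and `capacity`, `encode_wide` and `decode` are hypothetical helper names, not part of any existing standard.

```python
# Assumed escape bytes: four rarely used control codes. The proposal above
# does not say which values to reserve, so these are placeholders.
ESCAPES = (0x1C, 0x1D, 0x1E, 0x1F)


def capacity():
    """Number of distinct characters the scheme can represent."""
    single_byte = 256 - len(ESCAPES)      # 252 plain one-byte characters
    escaped = len(ESCAPES) * 65536        # four sets of 64K two-byte codes
    return single_byte + escaped


def encode_wide(plane, value):
    """Encode character `value` (0..65535) of extended set `plane` (0..3)."""
    return bytes([ESCAPES[plane], value >> 8, value & 0xFF])


def decode(data):
    """Split a byte string into plain bytes and escaped wide characters."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b in ESCAPES:
            # Escape byte: the next two bytes form a 16-bit code.
            out.append(("wide", ESCAPES.index(b), (data[i + 1] << 8) | data[i + 2]))
            i += 3
        else:
            # Any other byte stands for itself, so plain ASCII passes through.
            out.append(("byte", b))
            i += 1
    return out


print(capacity())
print(decode(b"Hi" + encode_wide(2, 0x4E2D)))
```

With these assumptions the total comes to 252 + 4 x 65,536 = 262,396 characters.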
Keep it compatible, stupid!
A good portion of India's 1 billion inhabitants speak English. The CIA World Factbook calls English India's "most important language for national, political, and commercial communication." So if even 30% speak/read English, that quantity alone puts it in competition with the number of people literate in Chinese.
Don't believe me? See: http://www.cia.gov/cia/publications/factbook/geos/in.html#People :)
Plus, add a few hundred million who speak 'Bad EU English', U.S./Canadian, Australian, etc.
But most importantly, most of the technical documentation for the Internet and Web is in an English derivative.
However, I do think that we can use Traditional Chinese to replace all these stooopid colorful little icons on my computer. At least then I can look up what they mean.
> Japanese alone learn some 50,000 symbols before they leave their 5th year of schooling.
Man, I strongly doubt that even a single person in the world knows 50,000 characters off-hand. Even if you divided by 10, it would still be too large. Japanese children learn slightly more than 2,000 characters before they graduate from *HIGH SCHOOL*. Stop showing your ignorance. Christian Laforte
"Unicoders" ignore the fact that any multilingual text is inherently stateful, so their idea of stateless stream of giant "characters" that will be easy to process is flawed at the core. In fact it's useful for decorative purposes only -- while it's easy to _display_ a unicode text (given in any of countless Unicode encodings), it's impossible to process or edit it without at least some state (current language) information to determine, what input method, dictionary, grammar rule, etc. to apply to any substring, so the goal of stateless text is just as misguided as the initial stateless filesystem representation in NFS. But if statelessness is kicked out of the window (like it should for anything multilingual), then information about both language and charset can be easily added to any substring, so all national charsets, ones that were specifically designed to be used in some particular language, and to which all processing rules and dictionaries were already written, can be used -- programs that don't care about charsets and languages will just handle them transparently as sequence of bytes, and programs that care should use state information anyway.
How to include state information is a good question -- there are a lot of possibilities, and one of them is to modify the HTML and XML specs to add a charset attribute to everything that can have LANG. The problem is, for purely political reasons those specs specify only a global charset for the whole document, and include LANG but not charset as a per-element attribute, which makes it impossible to use any non-Unicode charset for multilingual documents in them. This does not serve any legitimate purpose, and is an example of blatant sabotage of the specs to serve the interests of a small but very influential and vocal group of companies that are interested in making multilingual processing as complicated as possible, so that every simple task requires a huge bloated application just to comply with the sabotaged specs, instead of the simple byte-value transparency that would otherwise be sufficient. Raising the barrier to entry, decommodification and contamination of the standards at its finest.
The reason why things like that are possible is that the demand for multilingual text processing (multilingual meaning one document that contains text in more than one language other than English, because English is usually supported within non-Unicode national charsets and works just fine with them) is currently very low, and was even lower when those "standards" were adopted, so obvious flaws did not cause immediate havoc. This is a commonly used strategy -- when no one needs something, write a standard for it that favors you, create a lot of noise around it, declare that it "dominates the industry" because no one else is doing it, and then wait until the need becomes more or less apparent. When that happens, everyone will somehow remember a piece of your noise, and you can loudly proclaim that all that time you were busy building the great new standard into the innards of your software and lobbying all the standards groups to include some reference to it in their standards (which everyone, of course, ignored all that time because there was no need to apply it). So, at the time when the need is "more or less apparent" and the quality requirements for applications and standards are low, you can expand the "use" of your standard among people who don't need it or care about it, just because it was included in some of your products -- if feature support in them is ridiculously poor, no one will notice because there isn't that much use anyway. The development of other, superior standards will be stifled because you will always be able to claim that everyone is happy with your standard because there aren't many people complaining -- of course, there won't be many complaining because almost no one yet uses it for what it was supposed to be used for in the first place. By the time a real need arises, so many products and standards will be contaminated with your standard that people will have to use it despite the obvious flaws.
If standard stinks, you still can claim that no one made anything better anyway, so everyone should just use your POS, and if it breaks others' software design, they should just adopt yours.
If this sounds too close to some particular company's favorite strategy, it probably is -- Microsoft with its nauseating file/documents formats design, mediocre and bloated, display/printing-only oriented text editing software is one of the most enthusiastic backers of the Unicode, and they do it despite the fact that their software itself often gets into trouble because Unicode is both hard to use and hard to implement. It doesn't matter, important thing is, if we had trouble with it doing it half-assed, everyone that will try to do it better will have much more trouble. Scorched earth strategy.
Contrary to the popular belief, there indeed is no God.
It does not matter what Unicode can in theory contain -- the allocation of characters is handled by a single, and not in any way open, organization, so the standard is what has been allocated, not what could in theory be allocated if the Unicode consortium were benevolent enough, which we all know it is not. Even if it were, there is always some need to represent, in some consistent and unambiguous manner, text in languages that can't possibly be accepted into Unicode, such as fictional languages -- they can be easily handled by any expandable charset-handling system and it wouldn't be rocket science to develop one, however Unicode supporters do everything that is possible for humans, and sometimes more, to prevent any competing system from being developed. Also it does not matter what the stated goals of Unicode are -- in fact it is being hawked as the required internal representation of all text in all applications, and as the origin for encodings used for data manipulation, storage and transmission. These are facts, and so are the real problems that Unicode generates if used in that way. I have no problem with the Unicode standard being a big dusty book used as a simplified manual for the world's alphabets, or as an intermediate format for font handling and text conversion between different charsets of the same language. The problem is, Unicode is being used for things it is inadequate for, and its existence is loudly proclaimed as the reason to make no progress in developing any solution for multilingual text handling that is not entirely based on a Unicode-derived representation over the wire and in storage. This is selfish and counterproductive.
1. A standard is expandable if, and only if, it does not require a change to itself to adopt an expansion. For example, the addition of a new MIME type does not change the MIME standard, however the addition of a new tag does change the HTML standard, therefore HTML is not expandable, which is pretty easy to notice when comparing different HTML renderers. XML is a near-absurd case because it's basically an umbrella that allows one to declare all kinds of tags and therefore is supposed to be flexible and generate expandable standards, however the catch is, it does not provide any facility to automatically determine how to handle those tags' semantics in applications, so the mere possibility of declaring something new does not make it expandable either if applications' algorithms have to be modified. In Unicode however the situation is much simpler -- any addition of characters IS a modification of the standard, and there is no possibility of automatically providing interoperability between older and newer versions.
The existence of the procedure TO change the standard does not make it expandable.
2. If Unicode will adopt all fictional languages/scripts/... it will become absolutely impossible to make complete fonts for it -- now it's merely a huge task, but then it will be plain impossible. The only real solution is to have standard that allows to name a language/charset combination, and leave the text in them intact until either user will install support for them, or application will automatically download it. Unicode doesn't help with it a single bit -- application encountered a character in unsupported range, and all it has is 16 or now 32 bits that it can only stuff in its virtual ass and report an error because no reasonable resolution can be made without some external assumption.
3. ISO 2022 is a very poor implementation of stateful multi-charset character stream, and Unicoders are very fond of mentioning it as a proof that all possible stateful systems are bad. However repeating something that is false does not make it any less false -- in fact, after Unicode was adopted by IETF (on meetings behind the closed doors) all work on stateful character streams standardization was stopped.
4. Computers can magically process all kinds of charsets. It's called byte-value transparency. Most of applications would work just fine if they just copied strings without making any assumptions about their structure or number of characters in them as long as bytes are bytes, and end of string is always 8-bit 0, what would be quite trivial for any stateful text system to implement. Tiny minority of programs need anything from a text that requires actual parsing other than finding newlines and, rarely, whitespaces. Display routines are different thing, however there aren't many of them, and all systems other than Windows support Unicode by combining and translating multiple fonts for multiple ranges, so supporting multiple fonts subsets for multiple marked charsets would be only easier to implement.
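A minimal Python sketch of what byte-value transparency means in practice: the copy routine below only looks for newline bytes and never decodes anything, so it handles KOI8-R and Shift_JIS lines in the same stream without knowing either charset. The function name `copy_lines` and the sample strings are illustrative assumptions.

```python
import io


def copy_lines(src, dst):
    """Copy newline-terminated records from src to dst without decoding them."""
    count = 0
    for line in src:        # binary file objects split on b"\n" only
        dst.write(line)     # bytes go through untouched
        count += 1
    return count


# Two lines in two different national charsets inside one byte stream.
koi8_line = "Привет".encode("koi8_r")
sjis_line = "日本語".encode("shift_jis")
src = io.BytesIO(koi8_line + b"\n" + sjis_line + b"\n")
dst = io.BytesIO()

copied = copy_lines(src, dst)
print(copied, dst.getvalue() == src.getvalue())
```

This only works as long as no multibyte sequence contains the newline byte, which holds for the two charsets used here.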
The problem is, there are too many Windows programmers writing internet drafts now, so semantics of text display routines got stirred up from system-specific and application-specific processing where they belong, and contaminated standards responsible for data transfer, where they don't belong, and a lot of people now believe that to transfer some data one has to know how to display it in some pretty letters. Shame on you.
1. ISO is a closed standards body -- if it does anything, it makes standard less open.
2. Private use codes aren't standard -- they don't provide any guarantees of interoperability, and merely provide a way to break the standard while fooling a program that is compliant with it into behaving how the user wants. If there was a way to put somewhere even a name of a charset to map "private" codes to a font name, it would solve a piece of the problem, but alas -- Unicode is made under the slogan of total statelessness of text, so while applications' file formats may allow this, arbitrary substring in a text can't.
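To see what a program actually gets when it meets a private use code, here is a small sketch using Python's standard `unicodedata` module. The helper name `describe` is made up for the example.

```python
import unicodedata


def describe(ch):
    """Return (general category, character name or None) for one character."""
    return unicodedata.category(ch), unicodedata.name(ch, None)


# U+E000 is the first code point of the Private Use Area.
print(describe("\ue000"))
print(describe("A"))
```

The private use character reports category `Co` and has no name at all, so nothing in the text itself says which font or private agreement it belongs to.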
The major commercial Unix vendors have all made significant commitments to Unicode support, and even the Linux internationalization community is busy adding Unicode support to Linux. Apparently it doesn't matter to you that Sun, HP, Compaq, NCR, and major Linux I18N players participate in Unicode development, too. It isn't an either/or black and white issue. It isn't some gigantic conspiracy to use a bad standard to prevent the good guys from developing a good standard. But I guess you can believe whatever you want.
This is, to say the least, incorrect. While there is a lot of effort to shoehorn Unicode into Unix and Unix software, the actual results are beyond miserable, precisely because Unicode does not work. Unix vendors solved this problem by adding a little support for to/from Unicode conversion and by declaring that their filesystems support UTF-8, thus getting blessed by the Unicode consortium as compatible. Guess what, UTF-8 can be "supported" in that way even by an abacus, if that abacus is long enough and has at least 8 stones in a row, however actual use of it is a completely different thing -- I have never in my life seen a filename in UTF-8 outside of Unicoders' demos, and I am Russian myself and have a lot of friends that speak Japanese. So, again, Unix vendors' support of Unicode is in fact lip service, not unlike Microsoft's support of POSIX or claims that the Internet would support the OSI 7-layer model (which ended with the "temporary solutions" known as TCP/IP and Berkeley sockets replacing it).
Statelessness of text is something that Unicode tried to achieve, and is still using as its main argument for acceptance. Latin-1 is quite irrelevant here because its goals weren't as pretentious as Unicode's, and its impact on existing applications was near zero, basically "where are we going to use those values anyway?" Unicode is actually supposed to be used for serious multiple-language support, and requires fundamental changes in both applications and protocols -- with protocols suffering a lot of infiltration of Unicode-based requirements into otherwise transparent protocols. This would be to some extent justified if Unicode actually were a base for serious multilingual processing (which Latin-1 never claimed to be), but otherwise it isn't worth the effort and problems that Unicode brings in. So, the main advantage of Unicode over basically everything else imaginable (though not implemented because of pressure on the IETF from Unicode) is statelessness of the text stream.
You've ranted on this every time Unicode has come up, but assertion does not a proof make. You've never sketched out a better system, or said what makes ISO 2022 a poor implementation of what it is. Write an RFC, create a rough implementation of the system and if it really is better, then and only then can people evaluate and decide whether or not to use it. Until then, the choices are basically ISO 2022 or Unicode, and people will pick the choice that works best for them, and not worry about what could be the optimal
I can do that if anyone will listen. The problem is, the actual problem that it will solve does not exist yet; its time hasn't come. Multilingual documents, for all purposes, don't exist beyond demos. Unicoders are using this to create their own standard that definitely wouldn't hold water if demand already existed, but they can flood everything involved with standards with their propaganda -- a certain person, Martin Duerst, subscribes to EVERY mailing list that may in any way touch multilingual text handling and, every time someone mentions Unicode, floods it with tons of messages in support, and fiercely fights every argument against. I have no idea what else that person does beyond that, if anything, and how many hours there are in his day, but it's extremely hard to sustain any serious argument when one side is so active and most people are uninterested.
I have planned to do this when actually people will need multiple languages in their documents, and if someone can convince me that I have slept too long, and this time is now, I will happily start work, but otherwise it will be not just fighting with windmills, but fighting with windmills when there is no wind.
UTF-8 on an abacus -- yes, I guess that *is* a strawman that we should all take *real* seriously.
I merely tried to explain that UTF-8 is specifically designed to be used with any imaginable system -- what says nothing about its usefulness.
I presume you mean on Unix systems, where for most such systems, choice of UTF-8 for filenames would be problematical because they would run afoul of other parts of the system that don't handle them. Sure, such may be the case.
This is simply false. UTF-8 filenames and data can be used in any Unix if one wants to sacrifice functionality that people expect from a fixed-length characters representation (ex: regexps matching, cutting text at arbitrary offsets). However it's not a problem of Unix that users expect their encodings to be easier to use than a mess that UTF-8 is -- on other systems there isn't any counterpart to this functionality in utilities that are in common use.
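The "cutting text at arbitrary offsets" problem is easy to reproduce. A minimal Python sketch, assuming a four-letter Russian word as the sample and a made-up helper name `cut_ok`:

```python
text = "файл"                      # four Cyrillic letters
koi8 = text.encode("koi8_r")       # one byte per letter
utf8 = text.encode("utf-8")        # two bytes per letter


def cut_ok(data, offset, encoding):
    """True if the first `offset` bytes still decode as valid text."""
    try:
        data[:offset].decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


print(len(koi8), len(utf8))
print(cut_ok(koi8, 3, "koi8_r"))   # any byte offset is a character boundary
print(cut_ok(utf8, 3, "utf-8"))    # offset 3 lands in the middle of a letter
print(cut_ok(utf8, 4, "utf-8"))    # offset 4 happens to be a boundary
```

With the fixed-length charset every byte offset is safe; with UTF-8 a tool has to find a character boundary first.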
On the other hand, UTF-8 databases are now running routinely on Unix systems, and they work just fine, thank you.
Show me. I have seen a shitload of data, marked as UTF-8, yet used exclusively as ASCII, or even with different encodings actually in the data, but never -- actual multilingual database in UTF-8. Again, it demonstrates my point that Unicoders are trying to sneak their "standard" in while there is no demand and therefore no scrutiny for the quality of things being introduced.
> and I am Russian myself and have a lot of
> friends that speak Japanese.
Umm. And the relevance of that comment is what?
It means that I am in my own experience familiar with handling of multiple encodings, with what people use in the real-life texts handling, and their willingness to use Unicode, that happens to be below zero. You can claim that their reasons are irrational, and Unicode is still the best solution for them, however I still don't see, why opinion of almost everyone who actually knows about the subject from practice, and is supposed to benefit from what Unicoders are proposing, can be dismissed so lightly.
Because, gee, the need to communicate with someone in another language is new.
When people communicate, they choose one language for it -- usually one that both know best. No one speaks like "Ya odnowremenno trying goworit' po-english i russkomu, and esli ya by znal nihongo ya would simultaneously speak po-yaponski, too".
It's very important to see the distinction between the need to support "multilingual document" that contains multiple languages within one body of text and to support documents in multiple languages within one system or program. Also historically it happened that documents in all languages can painlessly include ASCII text, so non-English language + English is usually treated the same way as a text in non-English language, not requiring any special tools to be handled. One may claim that this is wrong, but this is how things happened to be developed over decades.
I've never seen VCR instructions in multiple languages
Those are multiple documents, not one document with multiple languages in it. There is clear separation between versions in different languages, and this is already being accomplished easily, even in MIME email.
, I've never seen a bilingual dictionary
Dictionaries are special cases, and they usually are distributed in either printed form, or as a database -- they almost never are seen as plain text documents. In both for-print-only formats and in databases there are plenty of ways to represent languages and charsets as metadata, and absolutely all computer dictionaries that I have seen chosen to use native encodings.
, and the EU driver licenses only have one language on them, not every language of the EU.
Again, I assume that the whole text of the license is repeated in multiple languages, not individual words are repeated in each language within one body of text, so the same definition of multiple documents applies.
So Reta Vortaro , an Esperanto dictionary with translations to many languages, is a demo. (Click on the j^ in the left frame, and then on the j^audo in the same frame, for the translation of that word into English, German, Polish and Russian, among others.)
First, without any doubt it is a demo -- the set of languages to which translations are available varies from word to word, and in real life one would never want translations into multiple languages to always appear, clogging the screen. Second, this is an application (even though a simple one), not a document, and there are plenty of ways for applications to handle multiple charsets even now. My point is, functionality that supports multiple languages within an application is completely orthogonal to the support of multiple languages within a single document or string. Unicoders love to mix those two.
Or Freedict, a source of bilingual dictionaries for dict (including German and Greek, and German and Japanese) is just a demo too.
Again -- I don't see why this particular application used UTF-8, however neither its design requires it, nor those files are for any purposes normal text documents -- even uncompressed, they have strict formatting and are even indexed, so they could use just any charsets/encodings possible.
And the Debian main page , where it lists the names of the languages in which the page has been translated to in their own script at the bottom, is just a demo too.
Absolutely. This list of languages is obviously a gimmick that provides nothing that list of languages in English wouldn't provide -- everyone in the world, for whatever reason, knows how his language's name looks in English even if he can't read English. In the case of Debian page, if I was looking for Russian translation, I certainly would search for "Russian" string to find the link (it's interesting that the word "Russian" is the only one, where both "native" and English name of the language are mentioned in the Debian page -- I assume, because a lot of Russians actually use Russian translation but don't have UTF-8 enabled or supported in their browsers). Also, Debian home page automatically chooses the language if it's announced by the browser, so if I really wanted Russian version and set language preferences in the browser, I wouldn't even have to touch anything else. And lo and behold -- when I choose Russian, the page appears in koi8-r, what happens to be Russian local charset, not any form of Unicode.
You're also missing the other selling point of Unicode: it's simple. Yes, there are plenty of ways for an application to handle multiple character sets, but they're all more complex than just using Unicode.
The simplicity of Unicode is only in its authors' imagination. Yes, it's easy to present Unicode to people who don't know the details as a simple solution -- the problem is, reality isn't as simple as it looks.
I'm sure when typing up "German and English Sounds" for Project Gutenberg, that I could switch between Latin-1, some IPA character set, a character set with o-macron, a character set with u-breve, and whatever I need for the rest of characters Dr. Grandgent used, but it's much easier for me to use Unicode.
When the goal is just to make a text that can be printed in pretty letters, anything is ok as long as it's implemented. This is why a lot of low-quality products such as MS Office are so popular -- in fact so popular that I often receive email with nothing but plain ASCII text as an MS Word file. However even in this case a complex typesetting system (that would most likely just use multiple fonts in whatever charsets they happen to be available in, because it cares more about fonts) would be more appropriate.
When I start on "Old High German", I could dig up some obscure High German character set and switch to a Greek character set when he uses Greek words as examples . . . or I could just use Unicode. No matter how much you would dismiss it, it is a real problem and some of us use Unicode because it is a real and a simple solution to the problems we face.
How deceptive. The implied assumption is that "obtaining" charset support is some kind of nonzero effort while using Unicode is smooth regardless of the language. Both things are incorrect -- in a system with multi-charset support the charsets support can be loaded automatically depending on the languages and charsets mentioned -- if someone wants to have support for everything Unicode supports at the extent Unicode supports it, he will only need fonts, and the amount of the information and resources used would be exactly the same as if he had their support in Unicode. However in practice usually the goal is different -- only few languages and charsets are in active use by the same user at the time, however he needs them to be supported with input methods (how to enter greek on this particular keyboard?), formatting rules, ordering, at least references to spellcheckers, etc.
Again, Unicode user still ends up having to somehow get something language-specific, except that his language-specific data and procedures also have to be designed to use Unicode, what differs from the procedures that are already in use, and often open source. Software vendors would love that -- they can either keep making localized versions of all software with Unicode support but with different language-specific procedures, or try to make tools that can handle all languages and spend man-millennia rewriting trivial things and then release them as the only way to use Unicode in practice. In either case they get their money because old software, Unicode-supporting or not, will not match the requirements for multilingual documents processing, and their new solution will be complex and therefore hard to reproduce.
My idea is that infrastructure for stateful text processing is as unavoidable as the existence of different languages and writing systems, so it would be foolish to try to deceive people into thinking that displaying pretty letters is the main problem of handling multiple languages or multilingual documents. I don't see how denying the undeniable is justified. Most people are ignorant about the details because at this moment the problem isn't evident, and the problem isn't evident because the whole field of its application is not in any way related to their everyday life, however I don't think that every kind of ignorance deserves to be abused with such long-lasting possible consequences.
Extending the idea that in multilingual text, attributes that apply to substrings ("state" when text is treated as a stream) are necessary, I can say that since statefulness is unavoidable anyway, charset/encoding is just as good an attribute as the language or, say, a language-dependent parameter such as direction (for example, in Japanese both left-to-right and top-to-bottom directions are acceptable, even though modern texts use left-to-right). The implementation of "full Unicode" text processing, even in a primitive display-only manner, is not any simpler, and certainly isn't any lighter on resources, than multiple-charset support -- in fact multiple-charset support can be easily built on top of any existing text displaying or printing procedure that supports multiple fonts and multibyte characters. The only "big question" is how to represent attributes in a text stream, but this is merely a question of formally declaring some decision to be the standard -- one can design many of them easily, and almost everything that a sane human mind can create at this moment in history would be infinitely superior to ISO 2022.
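A rough Python sketch of what "attributes on substrings" could look like: each run of text carries its language and charset and keeps its bytes in the native encoding. Everything here is hypothetical, including the `Run` class, the field names, and the use of Python codec names as stand-ins for registered charset names. Note also that `to_display` decodes into Python's own `str` type, which is Unicode internally; that is a convenience of the example, not part of the idea.

```python
from dataclasses import dataclass


@dataclass
class Run:
    lang: str       # language tag, e.g. "ru"
    charset: str    # Python codec name standing in for a charset name
    data: bytes     # text in its native encoding, left untouched


def passthrough(runs):
    """What a program that doesn't care about charsets does: copy bytes."""
    return b"".join(r.data for r in runs)


def to_display(runs):
    """What a display routine does: interpret each run by its own charset."""
    return "".join(r.data.decode(r.charset) for r in runs)


doc = [
    Run("en", "ascii", b"Hello, "),
    Run("ru", "koi8_r", "мир".encode("koi8_r")),
    Run("ja", "shift_jis", "世界".encode("shift_jis")),
]

print(to_display(doc))
print([(r.lang, r.charset, len(r.data)) for r in doc])
```

A spellchecker or input method would pick its rules from `lang` in the same way the display routine picks a decoder from `charset`.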
Come on. I've read the Unicode standard, I read unicode@unicode.org, I've read most of the publicly accessible proposals and I'm familiar with all the Technical Reports. There is a lot of complexity in Unicode, but it's mostly derived from the inescapable complexity of the writing systems and compatibility with older systems, and most of the complexity can be ignored if you're willing to support some subset (European systems, or European/Russian/CJKV systems). That complexity is going to exist whether you use Unicode or some other multilingual system. Supporting Unicode at the Xterm/Yudit level is simple, and supporting Unicode in an application with Pango & GTK 2.0 should be just as simple.
Then what was the point of your argument? If implemented in the display-only library and used for displaying/printing only, Unicode is just as "simple" as would be any other system, with or without multiple charsets. If program does anything complex, it should handle various language-dependent stuff anyway, however bare Unicode support provides no such infrastructure, and a reasonable infrastructure can be implemented either with or without Unicode. Then what is the advantage of Unicode? Being self-proclaimed status quo in standards' backroom-politics, that no one supports properly anyway, that is hard to segment into subsets, non-expandable, maintained by a closed standards body and requires more resources?
I don't claim that Unicode theoretically can't be used as the base for languages support -- in theory it can, but the problem is, it provides no advantage compared to multi-charset system if used as a part of multilingual text support infrastructure. I have already explained why such infrastructure does not exist now, however I believe that when it will become necessary, someone will have to implement it anyway. So now, when no one needs it, Unicoders are busy to claim this "piece of noosphere", just like some people tried to sell land on Mars -- just because it's there, and before it will become obvious that it's not theirs.
The goal of Project Gutenberg is to transcribe public domain texts in a format readable for the largest audience possible.
By this logic it should use Microsoft Word or at least PDF -- both very widely supported, more wide than even plain text files in UTF-8 (yes, I know, Word can use unicode internally -- this isn't the point).
Unicode HTML and UTF-8 plain text are those formats.
Are they? Most of my boxes don't have them installed -- the one I am writing this message on is an exception, but only because it has Mozilla, which is still bloatware. My handhelds most likely never will have them installed -- they don't have enough RAM, and need rather nontrivial manipulation of character size and formatting to keep texts in some languages readable, so a plain stream of Unicode text would be impossible to display without some heavy heuristics.
Some proprietary and/or obscure complex typesetting format is neither portable nor accessible to a wide audience.
I wouldn't dream of proposing a non-open standard for this. However, the trouble with open standards is that they never appear before they become necessary, and I, following the principle that standards and tools should be developed as the need arises, am not making any detailed proposals at this time. But when there is a need, the standard that is created must be open, expandable and easy to port and reimplement -- something that anything Unicode-based is not. If you mean that a charset is "proprietary", I am not aware of any charset except, maybe, "Klingon in private Unicode" that was in any way declared to be someone's property. If a multi-charset support infrastructure is created, it would be reasonable to include some common facility in the libraries that lets users allow programs, when they see an unknown language or charset, to automatically download fonts, tables and even formatting/comparison/input methods/... source code from servers that keep directories of known charsets and languages; this would be an open, expandable and flexible infrastructure, available to everyone. If someone wants his language, which never had a local charset in the first place, to be represented by its range in Unicode, he should be able to do that; however, in a system like that there should be no reason to prevent established language/charset combinations from being used just because of someone's narrow view of the problem.
Maybe I am wrong in this traditionalism, and it would be better if I built an infrastructure for stateful text support just to demonstrate this point -- after all, even with all the dynamic fonts/code/input methods/... it won't be in any way more complex than any other solution, merely useless, because right now almost no one uses multiple languages in a single document. But maybe the need to demonstrate the solution to a problem that no one experiences yet is reason enough when someone else is trying to sneak in an impractical solution as the standard while no one is looking.
I see the current advance of Unicode as something that may serve some simple need now, but can severely limit further progress if accepted as widely as Unicoders are trying to get it accepted. That would not be "good enough" as TCP is "good enough", the large SMP kernel lock was "good enough" or C pointers are "good enough" -- it's "good enough" as Windows, region codes, crippleware, etc. are "good enough" -- people accept them because those things are pushed, and the inconvenience they create isn't bad enough until it's too late, but when it's too late, people still use them because there is nothing else in sight.
Contrary to the popular belief, there indeed is no God.
Actually, the Unicode specification for UTF-8 places an artificial limit of four 8-bit code units on the variable-length encoding, as that is all that Unicode currently requires.
ISO 10646 defines UTF-8 as having up to six 8-bit code units.
At 4 bytes, Unicode's UTF-8 stops at 0x10FFFF (the four-byte form itself has room up to 0x1FFFFF). At 6, it can map to 0x7FFFFFFF.
Of course, my math could be wrong.
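For anyone who wants to check the arithmetic, here is a minimal Python sketch of the bit budget of an n-byte UTF-8 sequence under the original 1-to-6-byte design. The function names are made up for illustration; the only assumption is the standard layout, where a multibyte lead byte carries n one-bits and a zero, and each continuation byte carries 6 payload bits.

```python
def utf8_payload_bits(length: int) -> int:
    """Code point bits that fit in a UTF-8 sequence of `length` bytes (1..6)."""
    if length == 1:
        return 7                      # 0xxxxxxx
    if not 2 <= length <= 6:
        raise ValueError("the original UTF-8 design covers 1 to 6 bytes")
    lead_bits = 7 - length            # 110xxxxx has 5, 1110xxxx has 4, ...
    return lead_bits + 6 * (length - 1)


def utf8_max_code_point(length: int) -> int:
    """Largest value representable in a UTF-8 sequence of `length` bytes."""
    return (1 << utf8_payload_bits(length)) - 1


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, hex(utf8_max_code_point(n)))
```

By this count the four-byte form has room for 21 bits, so the 0x10FFFF ceiling is a policy limit rather than a limit of the byte layout, and the six-byte form reaches 0x7FFFFFFF.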
The world is neither black nor white nor good nor evil, only many shades of CowboyNeal.
ISO 10646, the Universal Character Set defines a 31 bit character set (2,147,483,648 character codes), not a 16 bit character set. Unicode 3.0's character set corresponds to ISO 10646-1:2000. Unicode 3.1 which was recently released goes a bit further.
UCS-2, as mentioned by this article, is the fixed-width 16-bit form that UTF-16 extends, and is severely limited by its 16-bit implementation. UTF-16 is unfortunately used by Windows and Java, but is rarely used on the web. The article claims a 16-bit encoding can only map 65,000 characters, but UTF-16, using surrogate pairs, can actually map over 1 million characters.
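A rough sketch of how the surrogate mechanism gets past 65,536, in Python. The helper name `to_surrogate_pair` is invented for this example; the arithmetic is the standard UTF-16 rule (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves).

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20 bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


# 1024 high surrogates times 1024 low surrogates
SUPPLEMENTARY_CODE_POINTS = 0x400 * 0x400

if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1D11E)   # MUSICAL SYMBOL G CLEF
    print(hex(high), hex(low), SUPPLEMENTARY_CODE_POINTS)
```

The pair count is 1,048,576, which is where the "over 1 million" figure comes from, on top of the code points that fit in a single 16-bit unit.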
Thankfully, there are several other encoding methods for Unicode. UTF-8, which is a variable length encoding most commonly used on the web allows a mapping of Unicode from U-00000000 to U-7FFFFFFF (all 2^31 character codes). It also has a nice feature of the lower 7 bits being ASCII, so there is no conversion necessary from ASCII to UTF-8.
UTF-32 or UCS-4 is a 32 bit character encoding used by a number of Unix systems. It's not exactly the most space efficient form (UTF-8 requires roughly 1.1 bytes per character for most Latin languages), but it can handle the entire Unicode character set.
A good document on this is available at UTF-8 And Unicode FAQ
Just because you can't read other languages doesn't mean multi-language support is useless. Oh, and inputting Kanji on a keyboard is quite feasible, try using the Windows IME sometime (It's built into 2000).
Down that path lies madness. On the other hand, the road to hell is paved with melting snowballs.
I read the internet for the articles.
It looks like the argument is that since ancient Chinese texts can't be fully reproduced in Unicode, that the standard is flawed. I disagree. There is already a four byte character set out there - UCS-4 I believe, which is ISO something or other - that will easily handle all characters as necessary. This set can be used for replicating all XX,000 old school Chinese characters. Think of it as the SGML of character sets. But for common applications, XML (ie, Unicode) will continue to do just nicely.
Yes, I quickly noticed this when I spent a short time in Germany a few years ago. Sort of a very tall and skinny v, I'd say. Looked kinda like a Greek letter, I thought, but I can't remember which. Need sleep...
I.e. all the character sets *in common use* in Asia today map into a subset of Unicode. They even map into the 16-bit subset, but overlap in a way that makes slightly different characters from different character sets share the same code point. That is why an extended version of Unicode is used, so Chinese/Japanese/Korean characters have different code points.
Unicode does not contain all characters ever used, for example it does not contain the Nordic runes. These are not used today except by scholars, who will need special software (most likely using the "reserved to the user" part of Unicode). The same is true for many ancient Asian characters.
He sure did do a good job when he slapped that old Tower of Babel bitch down.
These are my friends, See how they glisten. See this one shine, how he smiles in the light.
Isn't this what UCS-4 is for? I can't imagine there are more than a billion characters. Of course, most of the Unicode software that deals with wide characters won't work with UCS-4. But any decent UTF-8 based program should support up to 6 bytes per character.
But I guess internally most programs use 16-bit characters, because it's easier to deal with, and just convert into more compact forms like UTF-8 when they want to save or transfer it.
Run a pencil-and-paper RPG campaign with your far-off friends: Gametable!
You have to install character support for those other languages because most fonts don't contain complete coverage of the Unicode character set. If you install "Arial Unicode MS" off the Office 2000 CD, you get character support for a lot of languages. Sorry, I can't remember what option to choose in the Office 2000 setup. Don't forget, Win9x/ME is multi-byte only via code pages with Unicode being a per application thing, and WinNT/2K/XP is Unicode only but with little support as all the Windows applications try to be Win9x compatible with the least amount of effort.
Mark Twain wasn't the only one with a story like this. There was a similar one I ran across in an Astounding anthology.
"'Tis great confidence in a friend to tell him your faults, greater to tell him his." --Poor Richard's Almanac
UTF-8 encodes 7-bit ASCII characters as themselves and all of the rest of UCS-4 (the Unicode extension to 32 bits) as sequences of non-ASCII bytes. This means that apps which can't handle anything but ASCII can simply ignore non-ASCII and get all of the ASCII characters (and, with minimal work, report the correct number of unknown characters).
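As a sketch of what such an ASCII-only app could do, assuming its input is well-formed UTF-8: keep every byte below 0x80, and count one unknown character per lead byte (0xC0 and above), since continuation bytes always fall in 0x80..0xBF. The function name `ascii_view` is hypothetical.

```python
from typing import Tuple


def ascii_view(data: bytes) -> Tuple[str, int]:
    """Return (ASCII-only text, number of non-ASCII characters skipped).

    Assumes `data` is well-formed UTF-8.
    """
    ascii_text = bytes(b for b in data if b < 0x80).decode("ascii")
    # Each multibyte character has exactly one lead byte (>= 0xC0).
    unknown = sum(1 for b in data if b >= 0xC0)
    return ascii_text, unknown


if __name__ == "__main__":
    sample = "Z\u00fcrich \u6771\u4eac ok".encode("utf-8")
    print(ascii_view(sample))
```

No table lookups or decoding state are needed, which is the "minimal work" part of the claim.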
The only issue is that there's not a good way to set a mask for the characters such that 0-127 (which take up a single byte) are the common characters for the language, and so on, so English is more compact than other languages, even languages which don't require more characters.
Typically, one shouldn't apply font styles on a character-by-character basis.
Danny.
I have written over 900 book reviews
> the number one is handwritten in America as a vertical stroke, but in Germany as an upside-down V
No, the handwritten one in Germany looks more like the 1 in an Arial font. bye...
26 letters
You mean 26 uppercase and 26 lowercase.
Men with no respect for life must never be allowed to control the ultimate instruments of death.
GW Bu
Yes, only a small set of countries was considered, and only minimal support. But this claim on "no support for anything not USA" is false.
The cent sign was replaced with the caret. That is why shift+6 prints a caret, if you look at old typewriters that was how you printed a cent sign. This is in fact the main reason I think they considered European support, since from an American point of view the cent sign is more important. The fractions were what were replaced by the square braces. The curly braces, vertical bar, and apparently the tilde were added later (originally they printed as square braces, slash, and caret, and devices that totally ignored the lower-case bit were allowed, and the original tilde was changed to underscore because that character was missing originally).
"Extended ASCII" usually refers to the replacement of several of the punctuation marks with European characters. This was pretty useless because by then most OS's had assigned meaning to those punctuation marks (like the square brackets), also only 5 or 6 new characters were available. This died almost immediately when people started supporting the 8th bit as data rather than parity.
If this is the normal non-breaking space character (0xA0 in Unicode) then it takes 2 bytes in UTF-8.
The question was "did they do this for a good reason?" I.e., does doing this allow formatting control that could not be achieved otherwise? Or were they just stupid/lazy, and if normal spaces were used with a slightly smarter program, would it be just as good?
I personally don't know anything about Arabic so I cannot answer these questions. My guess is that this is reasonable if there is a place that "normal spaces" are used in Arabic.
I might add a few things:
In UTF-8 not just NUL or Escape are absent from the multibyte characters; in fact *all* 7-bit characters are absent from them (the multibyte sequences have the high bit set in all bytes). This means that *any* program that treats all bytes with the high bit set as a "letter" will work and can parse, hash, match, search, etc. identifiers/words with foreign letters in them!
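A small Python sketch of that idea: a byte-level tokenizer that knows nothing about Unicode and simply treats every high-bit byte as a letter. The pattern and the name `byte_words` are illustrative assumptions, not anyone's actual parser.

```python
import re

# ASCII letters, digits, underscore, plus any byte with the high bit set.
WORD = re.compile(rb"[A-Za-z0-9_\x80-\xff]+")


def byte_words(data: bytes) -> list:
    """Split raw bytes into 'words' without decoding them."""
    return WORD.findall(data)


if __name__ == "__main__":
    line = "na\u00efve caf\u00e9_1 = 2".encode("utf-8")
    print([w.decode("utf-8") for w in byte_words(line)])
```

Because a multibyte sequence never contains a byte below 0x80, the tokenizer cannot split a character in half or mistake part of one for punctuation.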
In addition the UTF-8 encoding is just heavy enough that random line noise is very unlikely to match a UTF-8 encoding. If programs treat "illegal" UTF-8 encodings as individual bytes in the ISO-8859-1 character set, it will display virtually all existing ASCII/ISO-8859-1 documents unchanged!
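One way to sketch that fallback in Python is a custom decode error handler: bytes that do not form valid UTF-8 are reinterpreted one by one as ISO-8859-1. The handler name "latin1-fallback" and the function `decode_lenient` are made up for this example; `codecs.register_error` is the standard library hook it relies on.

```python
import codecs


def _latin1_fallback(exc):
    """Decode the offending bytes as ISO-8859-1 and resume after them."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return bad.decode("latin-1"), exc.end


codecs.register_error("latin1-fallback", _latin1_fallback)


def decode_lenient(data: bytes) -> str:
    """Decode as UTF-8, treating invalid bytes as ISO-8859-1 characters."""
    return data.decode("utf-8", errors="latin1-fallback")


if __name__ == "__main__":
    print(decode_lenient("na\u00efve".encode("utf-8")))   # valid UTF-8
    print(decode_lenient(b"caf\xe9 au lait"))             # legacy Latin-1
```

This is only a heuristic: a Latin-1 document that happens to contain a byte run that is also valid UTF-8 would be read as UTF-8, which is the "very unlikely" case the comment is betting on.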
The end result is that it should be easy to switch all interfaces (not just over the network, but inside programs and to libraries) to UTF-8. This will vastly simplify the handling of Unicode because there will be no need for ASCII backward-compatibility interfaces. We could also eliminate all the "locale" crap and make ctype.h the simple thing it once was.
Even Arabic will encode smaller in UTF-8 than UTF-16. This is due to the fact that very common characters (not just English, but things like space and newline) are only one byte.
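A quick way to test this on a sample, sketched in Python. The sample string is an arbitrary two-word Arabic greeting chosen for this example, and `encoded_sizes` is a hypothetical helper; the little-endian codec names are used so that no byte-order mark is counted.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of `text` in each encoding, without a byte-order mark."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# 11 Arabic letters and one space.
SAMPLE = "\u0627\u0644\u0633\u0644\u0627\u0645 \u0639\u0644\u064a\u0643\u0645"

if __name__ == "__main__":
    print(encoded_sizes(SAMPLE))
```

Each Arabic letter takes two bytes in both UTF-8 and UTF-16, so the whole saving in this sample comes from the single space. Running text with spaces, newlines and digits widens the gap, while a string with no ASCII at all would come out the same size in both.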
However, in Unicode, Chinese, Korean, and Japanese all share the same codepoints and glyphs, so you can't grep for one language or another.
For instance, if you were searching in Korean for "Kim Il Sung", this string in Unicode would be the same as the Chinese characters for "gold" (jin), "one" (yi), and "star" (sheng), so your search would get hits from other sino-based languages in addition to Korean.
It's difficult even to sort Unicode correctly without choosing some language or another, due to this overlap of characters. "Alphabetical order" is different for the different Asian languages, even though they use the same characters.
---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger
Yes, the author was overbroad with that statement. All languages work on a restricted set of phonemes; there are some 200+ identified, but no one language uses near that number. Hangul covers all the Korean phonemes, but not much else.
:-P.
Here's a good description of Hangul. If you check this page, you'll notice I was wrong about the vowels; they don't seem to describe their own pronunciations at all, but rather the yin and yang elements of their sounds
If you read the article, you'll find a decent description of Korean Hangul, which has around the same number of characters as English (IIRC, it has 24).
Hangul outdoes the Latin alphabet in several ways. For one, as you mention, pronunciation in English is difficult, while in Hangul it is almost completely unambiguous. Each phoneme maps to one character, and vice-versa. There is no confusion over whether to write "cat" or "kat", for example. Only one letter has the "k" sound.
Each Hangul character is a pictogram describing the position of the tongue, palate, and lips to use when pronouncing it. Whereas most phonetic alphabets consist of ideograms recycled as phonetic symbols, Hangul seems to be the only one to consist of symbols constructed purely for phonetic meaning.
Since the job of a phonetic alphabet is only to represent phonemes, I would say that this alphabet does the job better than Latin.
> the allocation of characters is handled by a single [...] organization,
Slightly incorrect. It's handled by the Unicode Consortium AND the ISO 10646 standards group.
> not in any way open, organization
It's as open as, say, the ISO C++ standards group. That is, unless you're connected to the right corporation or country, you won't get a seat, but they still accept outside submissions and respect experts outside the group.
> Unicode consortium would be benevolent enough, that we all know that it is not.
Benevolent how? Benevolent enough for what? It took them less than a year to get LATIN CAPITAL N WITH LONG RIGHT LEG encoded, for a minor language with no political power (Lakota). They're constantly encoding new letters and scripts for groups with no political or economic clout (Z with hook below for Old High German, various Philippine scripts in 3.2).
And no one's stopping you from hacking up your own multi-charset system, and using it wherever you want. But loudly claiming that you're being oppressed doesn't prove that you are, and doesn't prove that your system would actually be superior to Unicode.
Part of the point of UTF-8 is that non-ASCII characters don't get encoded with ASCII characters. In your system, you can get an '/' or a '\0' or '\e' byte that doesn't represent that character, meaning that all Unix software needs to be changed to support your encoding. As it is, Linux accepts bytes for filenames without caring whether it's UTF-8 or some 8-bit code or some other multibyte code that obeys the same rule, knowing only that the byte '/' is uniquely the directory separator.
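To make the contrast concrete, here is a Python sketch. The first function brute-force checks that no non-ASCII code point ever produces a byte below 0x80 in UTF-8. The second is a purely hypothetical stand-in for an "escape byte plus two raw bytes" scheme of the kind proposed earlier in the thread; the escape value 0x1B and the name `escape_encode` are assumptions made only for illustration.

```python
def utf8_reuses_ascii_bytes() -> bool:
    """True if any non-ASCII code point encodes to a byte below 0x80."""
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not encodable on their own
        if min(chr(cp).encode("utf-8")) < 0x80:
            return True
    return False


ESCAPE = 0x1B  # hypothetical escape byte, not from any real standard


def escape_encode(value: int) -> bytes:
    """Hypothetical scheme: escape byte followed by two raw payload bytes."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("payload must fit in two bytes")
    return bytes([ESCAPE, value >> 8, value & 0xFF])


if __name__ == "__main__":
    print(utf8_reuses_ascii_bytes())        # UTF-8 keeps ASCII bytes unique
    print(b"/" in escape_encode(0x2F41))    # raw payload can contain '/'
```

With raw two-byte payloads, any value whose high or low byte is 0x2F or 0x00 would confuse code that scans for '/' or NUL, which is the compatibility problem the comment describes.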
> ISO 2022 is a very poor implementation of stateful multi-charset character stream,
You've ranted on this every time Unicode has come up, but assertion does not a proof make. You've never sketched out a better system, or said what makes ISO 2022 a poor implementation of what it is. Write an RFC, create a rough implementation of the system and if it really is better, then and only then can people evaluate and decide whether or not to use it. Until then, the choices are basically ISO 2022 or Unicode, and people will pick the choice that works best for them, and not worry about what could be the optimal solution.
Why is the character order a problem for the Japanese, and not the Germans, the French, the Lithuanians, the Belarusians, and almost every other language in the world? Latin-* does not encode anything besides English in alphabetical order, and neither does Unicode. (It's theoretically impossible; the Lithuanians want the Y to precede the J, and the Danish and the Swedes disagree about where the a with ring above goes.)
If you go to the Unicode standard (found online at http://www.unicode.org/unicode/uni2book/u2.html ) they have an index with all the characters by radical and stroke. They also have an index with all the characters found in JIS sorted by their JIS index.
Because, gee, the need to communicate with someone in another language is new. I've never seen VCR instructions in multiple languages, I've never seen a bilingual dictionary, and the EU driver licenses only have one language on them, not every language of the EU.
Multilingual documents, for all purposes, don't exist beyond demos.
So Reta Vortaro, an Esperanto dictionary with translations to many languages, is a demo. (Click on the j^ in the left frame, and then on the j^audo in the same frame, for the translation of that word into English, German, Polish and Russian, among others.) Or Freedict, a source of bilingual dictionaries for dict (including German and Greek, and German and Japanese) is just a demo too. And the Debian main page, where it lists the names of the languages in which the page has been translated to in their own script at the bottom, is just a demo too.
The fact that you chose to dismiss this stuff as demos does not change the fact that it's in actual use. Revo's author doesn't feel like changing the format of his dictionary because you don't agree with it. The web is full of gimmicks, but people like their gimmicks; why do you think Java took off? You can't just call it a gimmick and dismiss it; if that's what people do, then that's what people want to do.
You're also missing the other selling point of Unicode: it's simple. Yes, there are plenty of ways for an application to handle multiple character sets, but they're all more complex than just using Unicode. I'm sure when typing up "German and English Sounds" for Project Gutenberg, that I could switch between Latin-1, some IPA character set, a character set with o-macron, a character set with u-breve, and whatever I need for the rest of characters Dr. Grandgent used, but it's much easier for me to use Unicode. When I start on "Old High German", I could dig up some obscure High German character set and switch to a Greek character set when he uses Greek words as examples . . . or I could just use Unicode. No matter how much you would dismiss it, it is a real problem and some of us use Unicode because it is a real and a simple solution to the problems we face.
> The simplicity of Unicode is only in its authors' imagination.
Come on. I've read the Unicode standard, I read unicode@unicode.org, I've read most of the publicly accessible proposals and I'm familiar with all the Technical Reports. There is a lot of complexity in Unicode, but it's mostly derived from the inescapable complexity of the writing systems and compatibility with older systems, and most of the complexity can be ignored if you are willing to support some subset (European systems, or European/Russian/CJKV systems). That complexity is going to exist whether you use Unicode or some other multilingual system. Supporting Unicode at the Xterm/Yudit-level is simple, and supporting Unicode in an application with Pango & GTK 2.0 should be just as simple.
> When the goal is just to make a text that can be printed in pretty letters [...] even in this case a complex typesetting system [...] would be more appropriate
The goal of Project Gutenberg is to transcribe public domain texts in a format readable for the largest audience possible. Unicode HTML and UTF-8 plain text are those formats. Some proprietary and/or obscure complex typesetting format is neither portable nor accessible to a wide audience. Project Gutenberg has existed for 30 years. What "complex typesetting system" format can claim the same? How many "complex typesetting system"s that could handle it are available on many different platforms? At least 70% of the people on the net can read Unicode HTML, and many of the rest could with little work and no cash expenditure. What "complex typesetting system" can say the same? How is a "complex typesetting system" simpler than Unicode plain text?
> he needs them to be supported with input methods (how to enter greek on this particular keyboard?), formatting rules, ordering, at least references to spellcheckers, etc.
Nonsense. For "Old High German", I will map ALT-Z to ȥ in XEmacs. Spellcheckers don't exist for this language, I'm not going to sort the data, and it's just a z with a hook, so there's no special formatting rules. The people who read the book don't even need a way to enter the character, any more than reading the original book precipitated a need to enter it into the computer.
Displaying pretty letters isn't the end all and be all of multilingual computing, but it's a damn good start. The only registered character set (ISO 2022 registry or IANA registry) that supports Lakota or the Cherokee syllabary is Unicode. No, on most systems, they can't get decent support; handcrafted keyboard maps must be used, there's no spelling or sorting support. But they can type the characters in and send them across the net and print up papers, which is better than nothing.
The encodings are a temporary fix to a permanent problem, much in the same way that NAT "expands" the IPv4 address space. The real solution is to use a character set that can encompass all the world's characters at the same time.
He explained it as "before the Christian era", no doubt for the benefit of those only familiar with B.C. and A.D., but did not define it as that, although anyone who needs it explained no doubt also needs it defined as "Before Common Era" (and should also be told that what comes after is "C.E.", or "Common Era", and that B.C.E. and C.E. correspond to B.C. and A.D., respectively), so he did screw up just a tad.
I see even classic Slashdot is now pretty much unusable on dial-up.
You can't do that, since very often a simplified character maps to several traditional characters. Even if you can, it won't be a saving of 50,000 characters, only several thousand at best.
Just check out the latest Nokia phones sold in Asia to see how easy it is to type Chinese SMS messages. Or how about the input method they use on those ESDlife terminals around Hong Kong. Both use less than a dozen keys to enter Chinese, with no need for prior training.
The same reason that you can't get a name brand PC without Windows preloaded.
Both kanas are derived from the Chinese characters that they "borrowed"; hiragana is the smoothing down of an entire kanji character, while katakana is more or less radicals taken from kanji.
Hiragana was used initially solely by women and was derived from Chinese character's "caoshuti" in Heian era (794~1192). It was initially called "onna de", or "women's character".
Katakana (literally, "side script") is derived from a Chinese character of the same sound, ignoring semantic meaning. It was invented by Kibi no Makibi (AD 693-755). They were initially used as pronunciation aids in Buddhist scripts; but later became verb endings.
If I'm to believe my last professor, at the end of WWII, this all changed. Hiragana was changed to reflect verb endings and particles, while katakana was reserved for foreign words, usually Dutch, German, and Chinese -- and with the post-war American occupation, English.
Without you I'm one step closer to happiness without violence.
Yes, his analogy was a bit off. It would be more accurate to say "imagine if English-speakers were restricted to an alphabet which is missing characters like Æ or fi (the ligature)". While Unicode is missing lots of Chinese characters, the vast majority of the characters which are missing are characters that only historians use. One only needs to know about 2000 characters to be considered fluent in Chinese, and if you know 7000-8000 characters, you're way above average.
Chinese uses unicode by combinations of roots and the other parts of the characters
While many of the more complex Chinese characters do consist of simpler radicals used in combination, they're not encoded that way in Unicode. For example, the word "ma" used at the end of many questions consists of the character for mouth and the character for horse. In Unicode, the encoding for "ma" is completely unrelated to the character for mouth and the character for horse though.
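This is easy to see with Python's `unicodedata` module. The sketch below uses the traditional forms of the three characters (U+55CE for "ma", U+53E3 for mouth, U+99AC for horse); the helper name `describe` is made up for the example.

```python
import unicodedata

MA = "\u55ce"      # question particle "ma" (traditional form)
MOUTH = "\u53e3"   # mouth
HORSE = "\u99ac"   # horse (traditional form)


def describe(ch: str) -> tuple:
    """Return (code point label, character name, decomposition mapping)."""
    return (
        "U+%04X" % ord(ch),
        unicodedata.name(ch),
        unicodedata.decomposition(ch),  # empty string if there is none
    )


if __name__ == "__main__":
    for ch in (MA, MOUTH, HORSE):
        print(describe(ch))
```

Each character is a single, atomic code point with an empty decomposition, so nothing in the encoding ties "ma" back to its mouth and horse components.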
(i know that they use some other type of syllabic system for teaching the writing system, or so i recall from Chinese lessons on TV... maybe that should be used to replace the pictograph system in place now, a system which was kept by the emperors in order to keep the masses illiterate)
Just because you can't read Chinese characters, it doesn't mean Chinese people can't.
The "syllabic system" you're talking about is probably pinyin (or perhaps bopomofo, but it doesn't really matter - they're isomorphic). Converting Chinese text to pinyin actually results in information loss. It isn't really a viable solution. The Chinese people also like their linguistic system, despite what American public schools have taught you.
It sounds like it was traditional Chinese with bopomofo annotations. That's apparently a common way of teaching characters in Taiwan.
Bopomofo is essentially the same as pinyin, except pinyin uses English letters, while bopomofo uses non-roman characters. Mainland China uses pinyin, while Taiwan uses bopomofo. There's a 1-to-1 mapping between bopomofo and pinyin. Chinese is also tonal, so pinyin and bopomofo are also often augmented with "tone marks" or sometimes just numbers. Those would be the accents you saw.
There's a pretty good page with more info here.
When was the last time you met someone who left China and said "thank goodness I don't have to use that pictograph system [sic] those emperors put in place to keep everyone illiterate". Most Chinese continue to enjoy reading Chinese text long after they've left China and learned to read and write languages with phonetic alphabets.
And despite its many flaws, you can't really accuse China's communist government of encouraging illiteracy. On the contrary, the communist government in China actually created simplified versions of a large number of the commonly used but complex characters (back in the 50's), and these became the standard character set in mainland China. (that's why there are both "simplified" and "traditional" Chinese characters) It's also interesting to note that Taiwan, which is much more democratic than mainland China, still uses the traditional character set -- the same character set supposedly used to keep everyone illiterate.
So why can't Unicode take this approach, and encode words in a similar phonetic fashion? Nobody expects a codepoint for every word in English, German and French.
I miss Meept.
Contrary to popular belief, and contrary to what most people will think after reading the article, Unicode can contain all characters in the world. What the article actually says, in its own roundabout way, is that the UCS-2 encoding of Unicode cannot encode all Unicode characters, and that some of the characters it cannot encode are fairly important.
I can only agree with that. UCS-2 is a silly idea, and UTF-16 is a bad bandaid for it. UTF-8 is great, and if you really must have an encoding in which every character takes the same amount of space, use UCS-4. UTF-8 as originally designed extends to six bytes per character (31 bits), so it is in no danger of running out of room. UCS-4 can encode millions of characters; the measly 170,000 characters mentioned in the article do not create a problem for UCS-4.
The only problem Unicode has, is that Microsoft chose UCS-2 for some important things in Windows and Office. They are fairly alone in that stupidity.
Finally! A year of moderation! Ready for 2019?
You are so euro-centric it's not even laughable. As the article said, those who claim Unicode good enough for the masses are the same foreigners who would scream and howl if someone tried to remove redundancies from the English language such as pork and ham, or argue and dispute, or ...
I have read that an English language vocabulary of 300 words is good enough for most ordinary conversation. You are claiming the equivalent is good enough for ordinary use. You are mistaken.
Unicode is a classic case of (western) imperialism, in which the imperialists are completely blinded as to why it is imperialistic, and continue to mutter "it's good enough, and we know what's good for you smelly foreigners."
--
Infuriate left and right
Eventually, with much nudging along in the territories of high-resolution color and graphics, better input devices (such as the scanner, which can be thought of as a fax machine for computers), better output devices such as the inkjet and laser printer, and even bastardized keyboards and software which could generate thousands of characters - if only one can remember each and every one of the input codes. Graphics tablets eased the pain of having to get something into and out of the computer. But none of this is yet fully satisfactory, and perhaps it will remain in this state until the advent of the intelligent, voice-understanding, "computer" finally comes into our daily lives.
I recently saw a story about Japanese reporters who send their stories in via a phone call and dictation on the other end, because it is so much faster than trying to get it into the computer for digital transmission. I don't think the common character representation is the only issue here. As the article states, some languages are just much, much harder to digitize.
It won't work if you try to write your Java in Mandarin... :p
I'd love to see the linux kernel coded in Python.
Does the artist formerly known as Prince get his own character space as well?
Will I need to download a new character set on windows to view it?
Know what I like about atheists? I've yet to meet one that believes God is on their side.
> The obsession with phonetic spelling is an unhealthy and ridiculous pathology
Despite the fact that we move inexorably toward it anyway.
--
I've finally had it: until slashdot gets article moderation, I am not coming back.
I think we should move to Esperanto.
-David T. C.
If corporations are people, aren't stockholders guilty of slavery?
- Richie
Besides, translation software is coming along well enough that soon we will not have to worry about it too much.
How is this translation software supposed to work if there is no standard for interchange? Magic? How are we supposed to translate these characters that have no symbol for the computers to process?
There are well over 140,000 language characters on this earth, and there are many yet to have been entered into a computer.
What makes you think that we can't encode all these characters? Are we going to run out of numbers? A 32-bit number can hold 4 billion different values, and if that isn't enough, we can use a 64-bit number. We certainly aren't going to run out of numbers.
by Mike Buddha -- Someday the mountain might get him, but the law never will.
I suspect Unicode is a lot more upsetting to
a "reference writer specializing in rare Taoist
religious texts and medical works" than to
ordinary Chinese users who want to run Photoshop
or put their wedding pictures on a web page.
Let me get this straight - you think people
should be prepared to accept having restricted
access to the literature that underpins their
culture in exchange for their very own
geocities.cn?
K.
-
-- Proud descendant of semi-nomadic cattle-herders.
Why exclude the Constitution? Let's put everything published before...oh, say 1800... in there. The Gutenberg Bible (in fact, all Bibles preceding the modern versions), Shakespeare, Chaucer, say goodbye to them all!
I bet you can't even name one major Chinese or Japanese text. Plenty of people still study them in high school or university, just as you'd study Shakespeare. Don't spout crap when you don't know what you're talking about!
And what is the point of that? You end up with Unicode characters of unequal length, which further complicates the whole problem (actually, these already exist...)
Part of the problem with the Chinese character set is that it is not a character set so much as a dictionary
Oh, bullshit. It's a character set just as much as ASCII is.
Thank you for deciding how 1.3 billion Chinese, 120 million Japanese and 50-odd million Koreans should write.
In news today: The Chinese have embarked upon a simplification of the U.S. Constitution, stating that "it's too hard to understand." The result is expected to be declared an ISO standard within the next three years, with adoption by the US expected to be completed by 2005.
Now go learn something about the languages of which you speak.
Jesus, what is it about this article that attracts a level of cluelessness normally only seen in "IANAL" threads?
Japan does not have only 2,000 Kanji. Chinese can not be written satisfactorily with 10,000 characters. And how would you like it if the Chinese told you you can't write Shakespeare the way he was meant to be written???
And you obviously have a Western-centric mindset. Not to mention a stunning lack of knowledge of how Japanese, Chinese and Korean are actually used on today's computers.
Sheesh.
I did indeed read your post, including the bit saying, Perhaps some people will have problems using it today. In that case those people should interact with the standards committee instead of whining, and get their characters into the next version.
What you don't seem to get is that the Chinese, Japanese and Koreans have been trying to get the Unicode Consortium to produce a sensible standard from day one and they simply refuse to do so. Perhaps you should go look up the history of Unicode; there's been a lot of serious discussion and objection among the CJK people that's never made it into the open.
Now go learn something about how to parse basic English sentences.
Perhaps you'd like to debate the subject on a Japanese forum? No? I thought not. Excuse me if I don't view monolingualism as some indication of superiority...
Oh wonderful. So now we're not allowed to do searches, either. Where'd you come up with that bright idea, buddy?
And in case you didn't know, there are perfectly acceptable ways of inputting as many characters as you like in Chinese, Japanese or Korean. Just because you don't know how to doesn't mean everybody else doesn't.
It's not "whining", you knuckle-dragging moron. Just as Shakespeare is best appreciated in the original English (have you ever read a translation of Shakespeare? Didn't think so...), classical Chinese or Japanese texts are best read in the original. You know why? Because the author can convey subtle nuances and differences in meaning by choosing a particular character over others that have similar meanings.
Go learn Japanese or Chinese, and try and read a phonetic transliteration of a classical text. See how far you get.
Sheesh.
*Sigh*. Actually, I have read Hofstadter, but if you want to be difficult, let me qualify my statement: "Have you ever read a translation of Shakespeare into a non-European language?"
There are _slightly_ more than 2000 kanji in Japanese, but Japanese printers, like my wife's father, don't use more than 2100 absolute tops.
Funny, my Postscript printer sitting beside me can do about 6,500 Kanji, in pretty much any Japanese font available.
Chinese characters obey Zipf's law on a near perfect logarithmic scale. As in, the first ten characters make up about 60% of written text.
For someone who's supposed to have worked with Chinese dictionaries, that's an awfully strange claim to make. Unless you meant the first thousand characters, not the first ten.
It was a course meant to take non-Japanese speakers from zero to college level in one year. Of course, that's impossible, but you can get by until you find your feet.
You titled your reply "Nonsense", but...
1) You say that there are, indeed problems with the conversion tables. When you're trying to convert megabytes of electronic documents, a problem that may seem "small" to you seems much bigger, believe me.
2) You admit that code unification has several disadvantages, and then say that the distinctions "the Japanese" wanted were ported over. Ummm... have you been to a Japanese discussion on Unicode issues? The only Japanese who are satisfied with the current standard are those who were paid by the Unicode Consortium to put their rubber stamp on it.
3) So why are there so many different encodings for Unicode, if UTF-8 is so great? Oh, by the way, the problem is that Unicode doesn't allow many people to encode their languages fully.
So, if I may ask, how was my post nonsense?
Hiragana, which is somewhat cursive, can be used to augment Kanji - in fact, everything in Kanji can be written in Hiragana. Katakana, which is much more fluid in appearance than is Hiragana, is used to write any word which does not have its roots in Kanji, such as the many foreign words and ideas which have drifted into general use over the centuries.
In actual fact, Katakana is much more angular than Hiragana - definitely not "fluid" in appearance. Furthermore, anything that can be written in Kanji can be written (phonetically) in either Hiragana or Katakana - the use of Katakana for foreign words is nothing more than custom, not a limitation of the characters.
Thus it can be said that Hiragana can form pictures but Katakana can only form sounds...
That should probably read "Kanji can form pictures but Hiragana/Katakana can only form sounds..."
Romaji is used to try and keep the whole written thing from getting out of control, with most Western concepts and necessary words being introduced into the language through this mechanism.
Bollocks. Romaji is hardly ever used (except for advertisements, and then only rarely, or textbooks for foreigners). It's definitely not the main conduit for Western ideas.
After a time these words (even though they will still maintain their "Roman" form for awhile longer) will become unrecognizable to the people they were originally borrowed from, such as the phrase, "Personal Computer," which is now "PersaCom" in Japan.
Again, this is incorrect. Words don't *have* a Roman form in everyday use; sure, you can express them in Romaji but no-one ever does. As for "personal computer", the correct Romanization is 'pasokon', not 'PersaCom'. (Where did he get that from?!)
The rest of the 1,950 have to be memorized fully by the time of graduation from high school in Grade Twelve. Please remember that this total is only the legal minimum required threshold to be considered literate. And this is to be absorbed completely, along with a back-breaking load of other subjects.
Ummm... that's actually not too hard. I (along with everyone else at my language school) memorized more than 1300 Kanji in less than a year... and none of us were Japanese. I know it must seem like an impossible total to people used to ASCII, but there are many common points between Kanji that simplify the learning process greatly.
That said, I've long been against the current Unicode "standard", as have many technical people in Japan, for a number of reasons. Some of those are:
- No standard conversion tables from existing character sets (SJIS, EUC-JP, ISO-2022-JP).
Several conversion tables do exist, but there are minor differences between them that make it impossible to go from, say, SJIS to Unicode and back to SJIS without the possibility of changing the characters used.
- A draconian unification of CJK characters.
The Unicode Consortium basically forced the standards bodies in China, Japan and Korea to unify certain similar Kanji onto single code points, which doesn't allow for cases where, say, Japanese actually has two or three distinctive writings that are used in different situations.
- The ugly "extensions".
Unicode has been effectively ruined as a method of data exchange by its treatment of characters not in the 60,000-character basic standard.
I could go on, but I should get some sleep...
Well, most people's English, in Europe and the USA at least, is better than a lot of the rest of the world's Chinese. I still think all character sets should be supported, but the internet is, as of now, mainly used by nations that speak English as their primary language, and even where it isn't the primary language (e.g. continental Europe), people are mostly fluent in English. .. it's not going to happen though. Spanish is more likely to take over than Chinese at any rate.
You could say 'Well, those lazy-ass Americans/Brits etc. they should go learn another language'
--
Delphis
Finally, imagine that a political body imposes a deadline on imported programs.. that they must support their new standard by such-and-so a date or it won't be permitted within the country. The Chinese did this, extending the deadline to Sept. 2001. I only found out about this yesterday.
The Chinese lately seem to just be trying to piss everyone off as much as possible.
Granted, people whining about prejudice when they don't understand the technical reasons behind it doesn't help at all. It's sad really. These people need to grow up and understand that large changes don't happen immediately, and that if the state of affairs is how it is then it doesn't necessarily mean that anyone is trying to 'oppress' them. Take a fucking pill, people. There's such a bandwagon for being the 'oppressed people' that people who feel begrudged immediately assume this without THINKING.
ARrrrrghh.. Can't we all just get along?
--
Delphis
A.D. stands for Anno Domini, Year of our Lord in English. The problem is that for most of the world's population, Jesus is not "our lord".
Mea navis aericumbens anguillis abundat
It's a lot like the Oxford English Dictionary
versus Webster's Collegiate - Chinese printers have
gotten by with 7-10K characters versus the 60-80K
in the full language. Synonyms and homonyms are
used for the more obscure words. The standard
modern Chinese dictionaries only have this smaller
number of characters.
So that world in Planetfall was actually Earth of the future? :-)
SEENIK VISTA
Xis stuneeng vuu uf xee Kalamontee Valee kuvurz oovur fortee skwaar miilz uf xatfaamus tuurist spot. Xee larj bildeeng at xee bend in xee Gulmaan Rivur iz xee formur pravincul kapitul bildeeng.
One of the author's main propositions seems to be that Communist Chinese and Taiwanese/Overseas Chinese want different spaces in Unicode for the same characters.
I don't see every Western nation asking for its own encoding of "w" or accented characters. The author doesn't give any explanation for why we should pay attention to IMHO silly political whining in this particular case.
The author further implicitly assumes that it is reasonable to include the deprecated K'ang Hsi characters in addition to the official characters, but gives no justification for this view. I don't see unicode trying to include all possible historical graphings of Western characters.
Das rubbernecken sightseenen keepen das cotten picken hands in das pockets, so relaxen und watchen das blinkenlights.
--
"Outlook not so good." That magic 8-ball knows everything! I'll ask about Exchange Server next.
Mr Goundry seems to be ignoring the fact that UTF-8, which for many purposes is the most useful representation of UNICODE characters, can compatibly and simply represent the full ISO-10646 character space of 2^31 characters, just by using more of the reserved prefix bits. I can't see what his objection is to this.
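As a rough sketch of what "using more of the prefix bits" means, here is the original UTF-8 layout with up to six bytes, which covers 31 bits. The function name `utf8_extended` is hypothetical. Note that the later RFC 3629 definition of UTF-8 stops at four bytes (U+10FFFF), so Python's built-in codec does not produce the five and six byte forms.

```python
def utf8_extended(cp):
    """Encode a code point below 2**31 using the original 1-6 byte UTF-8 layout."""
    if not 0 <= cp < 2**31:
        raise ValueError("code point out of the 31-bit range")
    if cp < 0x80:
        return bytes([cp])  # plain ASCII, one byte
    # (payload bits, lead-byte prefix, number of continuation bytes)
    layouts = ((11, 0xC0, 1), (16, 0xE0, 2), (21, 0xF0, 3),
               (26, 0xF8, 4), (31, 0xFC, 5))
    for bits, lead, n_cont in layouts:
        if cp < (1 << bits):
            out = []
            for _ in range(n_cont):
                out.append(0x80 | (cp & 0x3F))  # continuation byte: 10xxxxxx
                cp >>= 6
            out.append(lead | cp)               # lead byte carries the top bits
            return bytes(reversed(out))


print(utf8_extended(0x20AC).hex())       # euro sign, three bytes
print(utf8_extended(0x7FFFFFFF).hex())   # largest 31-bit value, six bytes
```

For every code point that modern UTF-8 allows, this agrees with `chr(cp).encode("utf-8")`; the longer forms are only what the older design permitted.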
Plane zero (the first 65k characters) is supposed to be enough for most ordinary people speaking modern languages. The additional codeplanes exist to satisfy the valid needs of linguists studying archaic or obscure languages, and our author is welcome to use them.
65k characters seems like a reasonable limit for ordinary use: going beyond that is going to require a whole new level of complexity in font representation, character entry, and so on. Even being able to display all those characters only gets you halfway: you have to take into account ligature, composition, kerning and layout rules, and it's not at all clear many program authors will find this worthwhile for obscure dialects.
The elegance of UNICODE is that it offers a smooth migration path away from ASCII through UTF-8, captures a great majority of uses in plane zero, and can expand to handle more obscure cases.
Funny funny.... why does this post get lamed?
The writing system with the smallest alphabet that is in current use is Hawaiian, with 12 letters. (aeiou hklmnpw) source
A good source for your obscure questions is, as always, the Straight Dope, which answers the "Chinese Typewriter" question here.
Regards,
gleam
this
Actually,someone thought about hieroglyphs in UCS (this was mentioned in the quickies section some time ago):
n 16 37.htm
http://anubis.dkuug.dk/jtc1/sc2/wg2/docs/n1637/
I don't know whether it is/will be implemented at the end. Looking at the limited character space, probably not.
God, root, what is difference? - Pitr
The obsession with phonetic spelling is an unhealthy and ridiculous pathology: to understand why, have a look at Justin B. Rye's Spelling Reform page (subtitled And the Real Reason It's Impossible).
Of course, even if you could get China, Taiwan, Japan and Korea to agree on a unified character encoding similar to the ISO-Roman character set (where identical or analogous characters in the different alphabets shared the same character code) you would still need more than 50,000 encodings just for the unified Asian character set.
I can see good reasons why languages using similar alphabets should have overlapping encodings, but this is probably better solved by providing translation tables between related alphabets than by forcing multiple alphabets to share a single encoding. While I may be able to write the French coup de grâce in the English alphabet as coup de grace, something has clearly been lost. Other European languages are even worse, even those that nominally use the Roman alphabet! Then there are questions of alphabetization between different languages and the questions of whether or not accented letters correspond to each other or to the unaccented letter.
Call me a purist, but I think it would actually be much easier if we just had distinct representations for each language and had to perform some kind of mapping to display one language in another language's alphabet.
There's a couple of well-known palindromes:
Snug & raw was I ere I saw war & guns.
Lewd I did live & evil did I dwel.
Imagine these in a list of palindromes. Substitute the & and their meaning remains, but their palindromeness - and hence reason for inclusion in the list - is gone.
Actually, if you have fonts that render the UniHan in a locale-sensitive way, it works perfectly. The UniHan in Japan show the kanji with the proper (culturally) way to draw them with a Japanese font, the UniHan in China show the hanzi correctly culturally, although you need different fonts for Simplified and Traditional, and similarly for Korea.
It is not so different from using different type faces - ie, in old Germany, most things were printed in a gothic-style font that was culturally correct, whereas printing in other countries at the same time used more "roman" based typesetting. However, the old German gothic 't' was unified with the Latin 't' - are we complaining? It is a similar issue.
-- John
This is incorrect, 44,946 surrogates were approved in March as part of Unicode 3.1.0.
Unicode 3.1 and 10646-2 define three new supplementary planes:
Supplementary Multilingual Plane (SMP) U+10000..U+1FFFF (1594 chars)
Supplementary Ideographic Plane (SIP) U+20000..U+2FFFF (43,253 chars)
Supplementary Special-purpose Plane (SSP) U+E0000..U+EFFFF (97 chars)
Or plane 1, 2, and 14. (from the Unicode 3.1 Technical report, #27)
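To make the plane numbering concrete: a code point's plane is simply its value shifted right by 16 bits. A minimal sketch follows; the function names and the short labels in `PLANE_NAMES` are mine, with only the four planes named in this thread filled in.

```python
# Labels for the planes mentioned in the thread; everything else is "other".
PLANE_NAMES = {0: "BMP", 1: "SMP", 2: "SIP", 14: "SSP"}


def plane_of(cp):
    """Return the plane number (0-16) that a code point falls in."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return cp >> 16


def plane_name(cp):
    """Return the short label for a code point's plane."""
    return PLANE_NAMES.get(plane_of(cp), "other")


for cp in (0x0041, 0x10000, 0x20000, 0xE0001):
    print(f"U+{cp:04X} -> plane {plane_of(cp)} ({plane_name(cp)})")
```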
-- John
Variable length encoding, 8, 16 or 24 bits depending on how common the character is
Lord Pixel - The cat who walks through walls
A little bigger on the inside than out
Indeed, but the character numbers were initially ordered in something resembling common usage. At least, all of the 1 byte characters correspond (very closely or exactly) to Latin-1 encoding, of which the first half corresponds to ASCII, of course
This is a pretty good assumption on probability of occurrence on today's Internet, I won't try to predict the future :)
Lord Pixel - The cat who walks through walls
A little bigger on the inside than out
The writer of this article has a clear bias, being a reference writer specializing in rare Taoist religious texts and medical works. He certainly doesn't seem concerned about encoding Egyptian hieroglyphs or Sanskrit. Or what about mathematical or chemical symbolic systems?
If the web handled day-to-day writing that would be pretty remarkable. Classicists specializing in arcane texts of certain eras may have to install special plug-ins to be able to quote from original arcane works. Why should the day-to-day system be burdened with ancient languages that one small group of specialists use?
Wrong. You're confusing UCS-2, the encoding, with UCS, the character repertoire.
To spell it out, "Universal Character Set - Two Byte Encoding" (UCS-2) is one of many encodings which can represent the "Basic Multilingual Plane" (BMP) subset of the "Universal Character Set" (UCS) character repertoire.
But my grandest creation, as history will tell,
Was Firefrorefiddle, the Fiend of the Fell.
Oh gawd, just listen to the feelings of entitlement in that message...
You want the ability to search through some insanely large character set, so to do so you're willing to force everyone else to make their communications much less efficient just so you can have a free ride.
You know, it's not a coincidence that the western world (using small variations on the roman character set) pretty well invented modern technology. It's only about a thousand times easier to process a smaller and simpler alphabet.
There's a reason we don't use prose to command computers: until all cheap desktop models come with the ability to understand natural language, a stripped-down and unambiguous command set will be more efficient.
I've got a lot of characters I'd find handy if we were to implement a new standard, and I'd want to expand into basic pictograms (standard symbols, etc) as well. Now I realize this isn't interesting to other people, so I'm not going to jump up and down and shout "Racist" just because people aren't anxious to bloat a new standard just to appease me. If I want those features I'll make my own font and make it available with any works that I produce which would require it.
In short, grow up, the world does *not* owe you anything. If you want it, do it yourself instead of crying when someone else doesn't.
but "helping the poor" is an imperialist structure, it just is. It's imperialist to think that these 'poor' people (more often than not impoverished because of imperialist actions, say, building a dam to flood their farmland, woah, sorry, I'll try to keep my own bias under control...) want to be 'rich' in a Western sense. Before you can help someone, you have to ask him or her what would be helpful - ya know? It's wrong to assume that someone would want to learn english and make american dollars, maybe they would just like to get their farmland back and live the way they had for the last 1,000 years.
There are a lot of misguided or uninformed comments about Unicode here.
Just FYI, the Unicode mailing list have already read and dismissed the claims of this document -- the document has a lot of factual errors. For instance, Unicode 3.1 supports 1,000,000+ characters, not ~90,000.
The first thing to remember is that The Unicode Standard itself is really just a list of characters associated with codepoints.
Not all the Unicode codepoints have been allocated yet -- some of them never will be. Space has been left with most alphabets to encode other characters discovered or new characters.
In raw form, these code points run from 0x0001 to 0x10FFFF, in 17 planes of 65536 code points each. A few of these characters are reserved, such as 0xFFFF. That makes over 1,000,000 characters, which should be enough for anyone. (Although 640K may come to mind for some...)
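The arithmetic behind "over 1,000,000" is easy to check. A small sketch, where the variable names are mine and the surrogate range is the D800-DFFF block that UTF-16 sets aside:

```python
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000          # 65,536

total = PLANES * CODE_POINTS_PER_PLANE   # every code point, 0 through 0x10FFFF
surrogates = 0xDFFF - 0xD800 + 1         # reserved for UTF-16 surrogate pairs
usable = total - surrogates              # what is left for actual characters

print(total, surrogates, usable)
```

Even after setting aside the surrogate range, the space is far larger than the 170,000 or so characters the article says are needed.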
All the UTF-* encoding forms are just ways of representing these characters. ALL of them support the full range of Unicode codepoints.
UTF-32 represents the codepoints exactly in 4 octets.
UTF-16 represents the codepoints in 2 octets for plane 0 (0x0001-0xFFFF), and in 4 octets for the remaining planes, using 'surrogate pairs'. Surrogate pairs are two UTF-16 codes, the first from the range D800-DBFF, the second from DC00-DFFF, encoding code points (UCP) in planes 1-16 as follows: UCP = (surr1-0xD800)*0x0400 + (surr2-0xDC00) + 0x10000.
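The surrogate formula above can be sketched directly in code. The function names are hypothetical; the constants are the ones given in the post, and the result is cross-checked against Python's own UTF-16 codec.

```python
def to_surrogates(cp):
    """Split a code point in planes 1-16 into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above plane 0 need surrogates")
    offset = cp - 0x10000
    surr1 = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    surr2 = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits of the offset
    return surr1, surr2


def from_surrogates(surr1, surr2):
    """Inverse: UCP = (surr1-0xD800)*0x0400 + (surr2-0xDC00) + 0x10000."""
    return (surr1 - 0xD800) * 0x0400 + (surr2 - 0xDC00) + 0x10000


hi, lo = to_surrogates(0x1D11E)        # MUSICAL SYMBOL G CLEF
print(hex(hi), hex(lo))
print(hex(from_surrogates(hi, lo)))
```

Since each half carries 10 bits, the pairs cover 1024 * 1024 = 1,048,576 code points, which is exactly planes 1 through 16.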
UTF-8 is very clever. It manages to encode European codepoints in an average of 1.1 bytes. If the top bit of the octet is 0, it represents a standard ASCII character. Go read the Unicode website (http://www.unicode.org/) if you want to know more about it.
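A quick way to see the top-bit rule in action, using Python's built-in codec. The sample string is my own choice; any mostly-ASCII text with an accented letter would do.

```python
text = "coup de grâce"
encoded = text.encode("utf-8")

# Bytes with the top bit clear are plain ASCII characters.
ascii_bytes = [b for b in encoded if b < 0x80]
# Bytes with the top bit set belong to a multi-byte sequence.
high_bytes = [b for b in encoded if b >= 0x80]

print(len(text), len(encoded))
print(len(ascii_bytes), bytes(high_bytes).hex())

# Pure ASCII text is byte-for-byte identical in UTF-8.
print("plain ASCII".encode("utf-8") == "plain ASCII".encode("ascii"))
```

Only the accented letter costs an extra byte, which is why mostly-Latin text stays close to one byte per character.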
There's also UTF-7 for the truly insane. And UTF-8s as used by Oracle and so on...
The Unicode Standard is not perfect -- what is? -- but it is definitely the only standard out there that even comes close to approaching the goal of supporting all the world's characters.
Or perhaps you are being very subtly sarcastic? Or trolling?
The author of the article and the guy who submitted the story clearly don't have a clue about Unicode. Unicode can encode over one million characters, as stated here.
Unicode may have its problems, but this is not one of them.
-jfedor
it should be pretty obvious that unicode is for people who speak that language.
unicode's purpose is not to teach everyone how to read japanese or to make google's search engine read it (the latter could be done, though). it doesn't try to solve the problem of different languages on the planet.
as an english speaker, i find it pretty convenient that i can type and use english characters on my computer. as opposed to, say, japanese kanji - even though kanji is not all that hard to learn. what a concept... my own characters! wow!
not true. the purpose of Unicode is to have one unique number (unique... maybe that's where the name is coming from) for each character on the planet.
e.g. all of chinese, korean, english (doesn't make a big dent), arabian (whoops - dunno what it's really called), etc has to fit in there.
there are not unicodes for every language. there is only one for all of them. i think the reason was to make things more simple.
... they should have used 32 bits to begin with...
The 64,000 should suffice. Ideographic scripts, like Chinese, are where the problem arises. The number of characters in Chinese is not fixed, unlike the number in most alphabets. I have a Chinese novella which was written in just 300 characters. 10,000 would be a good place to start, a few thousand more would cover all but specialized texts. Japanese could fold into Chinese, since there are only 2000 kanji characters and a few hundred kana.
Throw in Arabic, Cyrillic, Sanskrit, Dravidian, Hangul (Korean) and Navaho and you still add only a few thousand. The odd European characters (the 'ss' in German, the extra Danish vowels, . . .) add a few hundred tops. Even the special linguist marks and punctuation don't add much.
If you have to double the Chinese, now you run into trouble. It's classical characters vs. simplified. The latter is for the PRC. If you also bloat the number of characters required so that specialized religious characters are required, now you start to push the system. 64K would be fine if a special marker character could be used which signifies that the next character is from the special table. Unicode has resisted this effort.
So long and thanks for all the fish . . . !!!
But like companies who still maintain their legacy software written in Cobol and who knows what else, countries and cultures hold onto their legacy alphabets, despite all their disadvantages, and despite all the moaning and groaning about education, literacy, and how hard it is to type 10,000 characters on a 100-key keyboard.
I agree there is a serious problem of understanding texts written in the "old way". There is a simple solution here, too, i.e., we just translate what's most important to the "new way" and let scholars work on the texts that don't get translated. Before anyone gets too hot here, the situation is not that much different than translating literature from one language to another. It is too much work to translate everything that is written in English into French, so one focuses on the texts that are important enough for translation.
Also, English has a lot of problems here, as it is mostly phonetic, but a large percentage is not, large enough to make learning English a lot more difficult than say learning Spanish.
I realize this is way too utopian. We Americans can't even move to metric, much less anything more "radical". I just needed to respond to the whining.
And on the lighter side. IMHO it would be better to code the document with a tag to represent the symbol set and not worry about being able to stuff all of the world's symbols in a single symbol set. Although the fight to be language '1' would be interesting. BTW, we have lost so many languages/symbol representations already.
make Linux, not Microsoft. sin(beast) = -0.809016994374947424102293417182819
Nordic Runes.
It would end with Tengwar, as mentioned above. A "made up" language that is truly phonetic, designed by a linguist. You can put any language into it, and people who don't know the language can still "read", or at least pronounce, what is written.
Of course, actually using it for that purpose is as popular as writing esperanto in it.
The article did make the claim that Hangul was "designed ... to be able to describe any sound the human throat and mouth is capable of producing in speech".
Hangul is missing representations for a variety of trills, fricatives, pharyngeal sounds, uvular and glottal sounds, clicks, and a variety of other sounds not present in Korean speech.
Perhaps not, but there is such a creature as "Extended ASCII". Many in fact. It would be wrong to call them standards (except possibly de-facto standards) but equally wrong to suggest they don't exist.
You seem to be suffering from a number of misconceptions. First, although Kanji may be entered phonetically, there is then a procedure by which the input method allows the user to specify which of the (generally numerous) Kanji which have that pronunciation was supposed to be entered. This can either be entered using the Roman alphabet or Hiragana. However, what appears on the screen when all is said and done isn't just phonetic, and with good reason; there is otherwise a lot of ambiguity, far more than results from homonymy and homophony in English.
It would be just as jarring to a native speaker of Japanese to try to read Japanese purely in Hiragana as it would be for a native English speaker to read something written in a phonetic alphabet. Furthermore, a lot of meaning is lost, since there are so many "homophones".
As for your assertion that it's only CJK which prevent a universal character set in 256 code points, you seem to be forgetting Cyrillic, Greek, Thai, Arabic, Hebrew, and any number of other languages which use completely different alphabets.
The UniHan controversy is best viewed as a philosophical difference between whether the various language variants are different characters (controversial) or just different glyphs (uncontested).
By way of analogy, there is a pretty good mapping between the Greek alphabet and ours, but that doesn't mean it would be trivial for most English speakers to read English which had been transliterated into Greek. It's more than the difference between the letter "A" in Times New Roman vs Arial.
There are any number of extended ASCII character sets. ISO Latin-1 is one, but there are many others: any of the ISO Latins, the DOS extended ASCII set, the Mac OS character set. Hence "Many in fact".
I didn't mean to imply that the post was bad in any way. I thought it was pretty good, but it was obviously inspired by Twain, so I thought others should read the Master Troll.
I'm a leaf on the wind. Watch how I soar.
Go read the original story here, by Mark Twain.
I'm a leaf on the wind. Watch how I soar.
What this post (and the article) seem to not grasp is that the simple 2 byte encoding is not the only way to encode the unicode character set. It is not even a particularly good way of encoding the character set. ;)
Other encodings give you far more than 2^16 characters.
In fact, using the surrogate system you can get over a million characters into even the common 2-byte encoding. (I'm assuming this has not changed in the later specs, I've only got the Unicode 2.0 spec to hand.)
Unicode does seem to have been well thought out, and these sort of problems have been anticipated.
There does seem to be a huge amount of misinformation and misunderstanding in the article and this discussion. However, this is probably not helped by the Unicode standard not being freely available (as far as I can tell).
Gee, I thought it already exists - they call it APL
-- 73 de KG2V For the Children - RKBA! "You are what you do when it counts" - the Masso
Firstly, many cultures are still too poverty-stricken to have electricity and running water, let alone net access. For these people, the thorny issue of whether Unicode has the capacity to represent their native language is totally irrelevant.
It's totally irrelevant for poor rural populations, true. But as more and more of the world's population moves towards being centered around urban areas this is indeed relevant. It is relevant to those who desire the full functionality of the Internet in their native character set. I believe (and this is a belief, not a fact) that one way to help out those who are poor is by opening them up to the modern economy and make it as accessible as possible. One way to do this is by making sure they can use the latest technology in their native tongue, lowering the slope of the learning curve.
Secondly, the rate at which languages are dying is still accelerating. Every year, we lose several languages as native speakers die of old age without their descendants having ever learned their original language.
This is indeed tragic, but it quite simply cannot be helped. It's so common as to be a cliche: "Life Sucks", or "Shit Happens", or even "C'est la vie." I hope that there are linguists and philologists who are archiving these languages for future generations and our general cultural awareness. BUT: People must eat, and they have a strong desire to make themselves and their families prosperous. If, when all things are considered, making sure that you live your life only speaking language X turns out to be counterproductive, then that language will become less important. There have been many languages that have come and gone throughout the millennia; humanity continues to advance. Would the world be a richer place if all those languages were still around? Certainly. But it would also be more confusing. And remember: If people can speak to each other, there is less of a chance they'll start killing each other. (LESS of a chance, mind you.)
I'm a Taoist at heart in matters such as this. For every yin, there is a yang, for every good, there is a bad. Life goes on.
- Rev.
ASCII is, and always was, a 7 bit standard, which encoded 95 printable characters and 33 control codes. 'high-ascii' just does not exist, and never did.
Funny, it's always looked like more of a backwards 's' with a backslash to me, but hey, whatever floats your boat.
--
"It's tough to be bilingual when you get hit in the head."
Imperfect != "does not work"
Unicode is not a "16-bit character definition". Unicode is a "character coding system" for assigning code points to abstract characters. i'll hereby suggest that the author of this piece has confused Unicode itself with one of the encoding forms of Unicode, that is, ways that characters are expressed as bitstrings. please to shoot this down.
a "character coding system" (drawing on http://www.unicode.org/ and my copy of the standard 3.0 here) is a system for assigning characters to code points. Unicode 3.1 assigns some 94,000 odd characters, and the roadmap for allocations (start at http://www.unicode.org/pending/pending.html) will assign more in the future. these assignments are just that: an abstract character to an integer value in the Unicode repertoire. this assignment does not dictate how to represent the character as data in any way.
There are a variety of encoding forms of Unicode, each for ways of representing characters in the repertoire as data (not at all "on screen", that's glyphs, and that's a whole other issue). The different encoding schemes have different strengths and weaknesses. UTF-16 is a form that uses fixed-width 16-bit sequences as the base unit (though through a concept known as Surrogates, two such units adjacent to each other can represent a value normally not expressible with just 16 bits). UTF-8 is a different form that uses a variable number of 8-bit sequences to represent characters. There is a UTF-32 form, a UTF-EBCDIC form, believe it or don't. These are just encoding forms, they make no restrictions on what or how many characters get assigned. If the Unicode Consortium wanted to assign abstract characters to values that exceed the limits of current encoding forms, we could certainly do something about that, but it isn't the horrible catastrophe the author makes it out to be.
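To make the code point vs. encoding form distinction concrete, here is a small Python sketch. The choice of U+1D11E (a musical symbol outside the first 65,536 code points) and all variable names are mine, purely for illustration; nothing here comes from the article.

```python
# One abstract character, one code point, several encoding forms.
# U+1D11E (MUSICAL SYMBOL G CLEF) is an arbitrary example of a character
# that does not fit in a single 16-bit unit.
clef = "\U0001D11E"

code_point = ord(clef)              # the integer Unicode assigns to the character
utf8 = clef.encode("utf-8")         # variable number of 8-bit units
utf16 = clef.encode("utf-16-be")    # 16-bit units; here a surrogate pair
utf32 = clef.encode("utf-32-be")    # a single 32-bit unit

print(hex(code_point))   # 0x1d11e
print(utf8.hex())        # f09d849e
print(utf16.hex())       # d834dd1e
print(utf32.hex())       # 0001d11e

# Every form decodes back to the same single character.
round_trips = [
    utf8.decode("utf-8"),
    utf16.decode("utf-16-be"),
    utf32.decode("utf-32-be"),
]
```

The assignment (the code point) never changes; only the bytes used to carry it do.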
this is just the thing that leaps out at me. thoughts?
or you want to scan the string backwards
UTF-8 can indeed be scanned backwards. You could also locate the start of the current character given a random pointer into a byte buffer. RTFM. UTF-8 can also directly encode 2 billion characters. UTF-8 is the right general solution to data interchange, and this is why it's catching on.
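For anyone who doubts it, a rough sketch of both tricks in Python. It assumes the buffer is well-formed UTF-8, and the function names are made up for this post. The only fact it leans on is that continuation bytes always look like 10xxxxxx, while the first byte of a character never does.

```python
def char_start(buf: bytes, i: int) -> int:
    """Return the index of the first byte of the character that contains buf[i]."""
    # Continuation bytes have the form 10xxxxxx; anything else starts a character.
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i


def chars_backwards(buf: bytes):
    """Yield the characters of a well-formed UTF-8 buffer, last one first."""
    end = len(buf)
    while end > 0:
        start = char_start(buf, end - 1)
        yield buf[start:end].decode("utf-8")
        end = start


# One character each of 1, 2, 3 and 4 bytes.
data = "a\u00e9\u20ac\U0001D11E".encode("utf-8")
reversed_text = "".join(chars_backwards(data))
```

Here `char_start(data, 4)` lands in the middle of the three-byte euro sign and backs up to index 3, where that character begins.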
ISO 10646 != Unicode, and UCS-1 != UTF-8.
UCS allows 31-bit character codes, Unicode however only allows up to 0x10FFFF, which is a little over 2^20. UCS characters may occupy up to six bytes, but according to this page, "All three encoding forms [UTF-8, UTF-16, UTF-32] need at most 4 bytes (or 32-bits) of data for each character."
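A quick way to check the 0x10FFFF ceiling and the four-byte claim, sketched in Python 3 (variable names are mine). It relies on Python's own `chr()` refusing anything past the Unicode range, which is a property of Python, not something the standard's text says about Python.

```python
top = chr(0x10FFFF)   # the highest code point Unicode allows

# Byte length of that code point in each of the three encoding forms.
sizes = {form: len(top.encode(form)) for form in ("utf-8", "utf-16-be", "utf-32-be")}

# One past the limit is rejected outright.
try:
    chr(0x110000)
    beyond_limit_ok = True
except ValueError:
    beyond_limit_ok = False

# "A little over 2^20": 17 planes of 65,536 code points each.
total_code_points = 0x10FFFF + 1
```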
BTW, it's important to recognise the difference between scalar values, which are the numerical values assigned to characters, and encodings (UTF-8 and so on), which are just ways of encoding those scalar values with different levels of memory efficiency, ease of parsing etc. Every encoding covers the same range of scalar values (ie. all of them).
(Unfortunately the official Unicode standard is only available in dead tree form, so it's kinda hard to give relevant links...)
I'd like to make a couple more points about this: suppose the characters have been encoded by arrogant know-nothing westerners, and it were the case that logically distinct characters have been unified, the real solution is to get onto the committee and have new characters assigned. If, as the article suggests, around 170,000 code points are genuinely needed, then fine - Unicode can handle that many.
If there are any characters not included in Unicode, it's because it doesn't need them, or it doesn't support them yet. As has been pointed out by many posters, characters are being added all the time. Just last month, version 3.1 of the standard was published which just about doubled the number of assigned characters over 3.0.
"Fishes" is also a valid plural form of "fish." "Fishes" refers to a group of different species, while the plural "fish" refers to a group that is all of the same species. The plecostomuses (sp?) and cichlids in my tank at home are fishes; the trout in a pond are fish.
(Your point that English has tons of rules and even more exceptions to those rules still stands, though.)
A bunch and a group are both singular, though some Brits would disagree (their usage used to treat a group as a plural object ("and the crowd are going wild!"), but that is starting to change in more recent usage).
20 January 2017: the End of an Error.
Japanese alone learn some 50,000 symbols before they leave their 5th year of schooling. Unicode was never meant to hold one spot for every character. It was meant to be used as a set of code pages much like ascii was. But it had to be larger than 256 to hold a reasonably representative set of one language at one time (such as Japanese, or Chinese (two dialects), etc).
Most documents consist largely of one language, so you start the document by stating the code page you're using. Very few documents need more than one set of 65,536 characters, but you can intersperse sets if needed.
But the idea of having one universal character set is ludicrous. There are well over 140,000 language characters on this earth, and there are many yet to have been entered into a computer. Sure, we could use 4 bytes per character, but is it really necessary? Absolutely not! Talk about inefficient. The only case where that would be more efficient than code pages is when the majority of documents extensively use more than 64k characters within each document.
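On the "4 bytes per character" worry, it may help to see what the variable-width forms actually cost. A small Python sketch with sample strings I picked myself; it only measures sizes and takes no side in the code page argument.

```python
sentence = "Most documents consist largely of one language."

# Plain ASCII text: 1 byte per character in UTF-8, 2 in UTF-16, 4 in UTF-32.
ascii_sizes = {
    "utf-8": len(sentence.encode("utf-8")),
    "utf-16": len(sentence.encode("utf-16-be")),
    "utf-32": len(sentence.encode("utf-32-be")),
}

# A mixed string: an ASCII letter, a kanji (U+65E5) and a character
# outside the first 65,536 code points (U+1D11E).
mixed = "A\u65e5\U0001D11E"
mixed_sizes = {
    "utf-8": len(mixed.encode("utf-8")),       # 1 + 3 + 4
    "utf-16": len(mixed.encode("utf-16-be")),  # 2 + 2 + 4
    "utf-32": len(mixed.encode("utf-32-be")),  # 4 + 4 + 4
}
```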
Besides, translation software is coming along well enough that soon we will not have to worry about it too much.
-Adam
This sig 80% recycled bits, 20% post user.
I hate it when I browse a Star Trek web site and I can't read that Klingon.
What he meant was that Chinese characters are not an attempt at representing Chinese sounds.
Here is an example of such a page:
http://wolf.project-w.com/chess/pieces.html
Of course, to view this your browser will need to support Unicode encoding, and have the appropriate Unicode fonts.
I have also created a test page for various operating systems and browsers to view Unicode text: here.
My opinion on this debate? When loading this page, I didn't expect to see 75% of it being Americans saying, why doesn't everyone use English (!) A better solution, IMO, would be to pick a character encoding that can a) write all possible characters with a LOT of redundancy (who would ever need 2^31 IP addresses?), and b) not take up too much storage space for simple / common characters (I don't want to use 1K to write one sentence in a 4-byte charset).
Then, this encoding should be verified with all governments and, pending acceptance, made an ISO standard.
I guess I am misunderstanding you.
How do you intend to encode 170,000 possible characters in 2 bytes worth of data?
Are you suggesting that unicode use additional bytes, or is there already an "escape" code which allows multicharacter encoding?
Isn't multicharacter encoding what unicode was meant to eliminate?
# (/.);;
- : float -> float -> float =
nt
# (/.);;
- : float -> float -> float =
I worked in Japan for 4 years as a programmer, and I am somewhat fluent in Japanese. I also was project lead for a commercial Japanese computer dictionary.
Don't be fooled, this is not a technical article, but instead a political rant. I've talked to some of the designers of Unicode, and they tried to be very, very respectful of Asian wishes. But all of the nations of Asia refuse to cooperate on most things including character encodings.
While I worked in Japan, officials claimed that Japan could not import beef because Japanese have evolved longer intestines and therefore can't properly digest red meat. This is laughably not true, but many Japanese still believe it. They also claimed that ski gear couldn't be imported because Japanese snow (hence the laws of physics) is fundamentally different. And of course they claim they couldn't possibly import another character set because their characters are unique.
One thing that was never mentioned in the article was the difference between a glyph (how a character looks) and what it means. So for example the letter "A" and "A" are the same characters but they have different glyphs.
What the unicode designers did was to identify all of the unique characters in all the mainstream languages. It turns out that Japan, Taiwan, Korea and China share a large number of characters. This should not be terribly surprising because all of these countries directly imported their characters from China. What does differ from country to country is how those characters are represented.
A lot of Asians seem to really hate the idea of using the same code point for the same characters mainly, I think, because they don't really like the idea of sharing *anything* with the other countries. It is a political and cultural thing, not a technical thing. From a technical point of view it is just gross to think of assigning multiple codepoints for the exact same character.
Which is not to say that Unicode is perfect. Unicode, for better or worse, solves just the problem of encoding the unique characters, it has very little to say about the font problem. That however is still a wonderful thing to solve -- you no longer have to worry about losing meaning, at worst the characters might end up looking a little funny.
And Unicode works. In creating our Japanese dictionary we were forced to use Shift-JIS (one of the Japanese standards), and it was just horrible because there were so many Chinese characters outside the standard Shift-JIS encoding that we needed. Unicode would have greatly simplified the problem for us.
I worked as a programmer in Japan for 4 years, and I've also done several projects in Unicode.
There are a couple of things I would like to point out:
>>Japan and Korea get no benefit from Unicode. In fact, their ISO 2022 encodings are at least in "alphabetical order" for the relevant alphabets. Unicode is just a jumble.
I can't speak for Korean, but there is no such thing as an alphabetic order for Kanji. In Japanese, Kanji almost always have at least two pronunciations, and often more.
>>The Japanese hate Unicode. If you bother to ask them, which the web did not, you find a loud and impolite dislike for Unicode. The Japanese want their ISO 2022 solution, aka shift-JIS.
Have you ever tried to program in shift-JIS? It is horrific. Basically they mix one byte and two byte characters. The problem is that if you jump into the middle of the string there is no way to know if you are looking at a one byte character or the second byte of a two byte character. You also can't tell the number of characters in a string simply by looking at the length. It is a *terrible* standard.
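A minimal Python illustration of both complaints, using Python's built-in `shift_jis` codec. The sample string (one kanji plus one ASCII letter) is my own choice, picked because the kanji's second byte happens to collide with an ASCII value.

```python
# U+8868 is a common kanji; "A" is plain ASCII.
text = "\u8868A"
data = text.encode("shift_jis")

print(len(text), len(data))   # 2 characters, 3 bytes
print(data.hex())             # 955c41

# Looked at in isolation, byte 1 is indistinguishable from an ASCII
# backslash (0x5C), although here it is the second byte of the kanji.
middle_byte = data[1]
looks_like_backslash = middle_byte == ord("\\")
```

Code that searches the raw bytes for a backslash (a path separator, an escape character) will get a false hit inside the kanji, which is exactly the "jump into the middle of the string" problem.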
Actually his original analogy was flawed and designed to yank people's chain... A better analogy of what's going on would be to say that the Germans and the French wanted to have their own Unicode code point for the letter "A", since obviously the German A is very different from the French A. Repeat for all the letters in the alphabet. The excuses for saying that the German "A" should have a different value than a French "A" are (a) The Germans and the French hate each other, and (b) French tend to use a sans-serif'ed font. When told by the standards committee that font issues were independent of Unicode assignment, the response was this was obviously anti-European imperialism....
That's basically what's going on here with the folks who are complaining about Han Unification. Many Asian languages are descended originally from Chinese, just as many European languages are descended from Latin and Germanic roots. So it's not surprising that the systems of orthography share a lot in common. The difference is that each Asian country refuses to share any codepoints with any other Asian country, because They Hate Each Other, and there seems to be some widespread belief that doing so would somehow be causing their national language to lose face.
As someone who's Chinese, I think I can safely say to those people who like to bitch and moan about Han Unification..... Grow up!
First, UNICODE gives an advantage only to the US and English-speaking countries, because UTF-8 is compatible with ASCII (American Standard Code for Information Interchange).
Germany has the problem that UNICODE uses the same code points as ISO 8859 Latin 1, but UTF-8 encodes all characters above code point 127 differently.
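The German case in a few lines of Python, as a sketch with names of my own choosing. The letter ä has code point 228 in both Latin 1 and Unicode, but the bytes on the wire differ.

```python
a_umlaut = "\u00e4"                    # the letter a-umlaut, code point 228

latin1 = a_umlaut.encode("latin-1")    # one byte, equal to the code point: e4
utf8 = a_umlaut.encode("utf-8")        # two bytes: c3 a4

# Reading UTF-8 bytes as if they were Latin 1 gives the familiar garbage
# (capital A with tilde followed by the currency sign).
mojibake = utf8.decode("latin-1")

# Plain ASCII text is byte-for-byte identical in both encodings.
ascii_same = "Unicode".encode("latin-1") == "Unicode".encode("utf-8")
```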
UNICODE 3.1 differentiates code points and encodings. There is a range of encodings: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE and UTF-32LE. This way it is possible to extend the code space to the 170.000 code points requested by the author of the article. At least another supplementary plane has to be used. It should be noted that the policy of UNICODE forbids the assignment of the same character to several code points unless there already exists a standard defining the characters. So it might be that less than 170.000 code points are necessary.
UNICODE made the mistake to assume that 16 bits are enough to encode all the characters of the world. But they have corrected that mistake without breaking the current implementations by introducing code point and encoding semantics.
I believe that UNICODE is the best thing we have now and everybody criticising UNICODE should make proposals for improvements or another scheme. As far as I understand, the definition of a world wide unique character code is far from easy.
Have you ever seen an IME ? The program a Japanese person would use to enter their 10,000 characters ?
You spell out the word phonetically, and press space as you complete each word - the computer will show possible kanji, and you can cycle through them with the space key.
It actually works pretty well. Their keyboards pretty much look just like ours.
I don't mean to have the a-accent-grave in French be mapped to a plain English a. Certainly you should keep the accents. But do the g and r and c and e need to be different characters?
Mapping these things is a nightmare. Imagine someone writing say a BIOS, or something else with limited storage and code, who wants to display a startup message...do they need to store a separate Unicode message for each language just so they can say "Dell Computer" or whatever properly for each one? And then you have to store all those glyphs for multiple fonts, so you would probably wind up mapping them all back on top of each other. Gack.
- adam
UTF-8 is very nice because 7-bit characters encode as one byte. Also it is defined so there won't be a NULL or a hex 01B (decimal 27 -- the telnet escape character) anywhere in the data stream, even in the second or third byte of an encoded character. So it will generally be passed through correctly by programs expecting straight 8-bit ASCII. UTF-8 is also encoded and decoded via a trivial algorithm, as opposed to the DBCS used in Windows which needs lookup tables.
One negative of UTF-8 is that Unicode characters at 0x800 or above (using more than 11 bits) encode in UTF-8 as 3 bytes, not 2 as in Unicode. I think that range includes things like Arabic and some Indian written languages. But I think that tradeoff is worth it.
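To show how small the algorithm is, here is a hand-rolled encoder in Python, checked against the built-in codec. It is a sketch only: the function name and sample code points are mine, and it does not reject surrogates or out-of-range values the way a real encoder should. The 2-byte to 3-byte boundary sits where more than 11 bits are needed.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point (0..0x10FFFF) as UTF-8 using only shifts and masks."""
    if cp < 0x80:        # up to 7 bits  -> 1 byte
        return bytes([cp])
    if cp < 0x800:       # up to 11 bits -> 2 bytes
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # up to 16 bits -> 3 bytes
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([       # up to 21 bits -> 4 bytes
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# ASCII 'A', the last 2-byte value, the first 3-byte value,
# a Devanagari letter, the euro sign, and a 4-byte character.
samples = [0x41, 0x7FF, 0x800, 0x0905, 0x20AC, 0x1D11E]
lengths = [len(utf8_encode(cp)) for cp in samples]   # [1, 2, 3, 3, 3, 4]
```

Every byte of a multi-byte sequence has its high bit set, which is why no NUL or ESC can show up inside one.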
- adam
Out of those 65,536 possible characters, 20,000 characters or so are reserved so that we can use pairs of words to double the set. In fact, the specification allows us to add to it almost ad infinitum, continuously adding more characters. Just two words covers over 100,000 characters, three will do the lot.
Okay, the writers of unicode may have been slightly short-sighted, but also they probably considered the problems of using a 32-bit character set and decided against it (and 24-bit for that matter). They have added an extending property to the 16-bit unicode standard and that should cope with much. I don't know how the chinese/japanese/korean population deal with their HUGE character sets now (you couldn't have a keyboard big enough) but they must have a shorter, simpler method of coping with everyday data input. Surely, this double and triple pairing of UTF-16 will do?
Here's the Chinese perspective. How do you write the number one?
There's:
1) The western way "1"
2) The common Chinese character
3) The complex Chinese character used for legal documents, cheques (when not in English), etc
4) The Chinese character used in markets
Now, are they the same? Well, in legal documents etc, you MUST use (3) and none of the other characters.
(4) is ONLY ever used in the markets, however, in the markets (1) and (2) may also be used, but never (3).
OK, what about brand names or product names? There exists a magazine using (3), and I'm sure there'd be lots of crap flying if some reporter decided to talk about it using any of (1),(2),(4).
So then, are they the same, or not? The answer is NO. They all represent the number 1, sure, but there are concrete differences in where and how they may be used. Swapping them randomly would surely be unacceptable in certain circumstances.
---
There is one small problem with the Chinese written language (not the English converted language, but the original). 170,000 characters would not be enough to express the whole language. (I think it may be over 1,000,000 characters/symbols, but nobody knows the exact count.) Granted it has been a long time since I have done anything with Chinese, but I do remember there are a lot of characters/symbols. When I last checked, it didn't have an alphabet. This means virtually every word has its own character, or combination of characters.
If I remember correctly, Unicode had a few different ways to represent characters, depending on the language you were talking about. Some characters were meant to be combined with others to form the actual characters. They used this to extend the characters.
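The combining mechanism is easy to see with an accented Latin letter. This Python sketch uses the standard `unicodedata` module; the example character is my own pick, and it shows only the general idea of combining marks, not how any CJK text is encoded.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points, one visible letter.
decomposed = "e\u0301"

# NFC normalization folds the pair into the single precomposed code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))   # 2 1

# NFD goes the other way and splits the precomposed character again.
split_again = unicodedata.normalize("NFD", "\u00e9")
```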
In some sense it is not practical to represent some nonalphabetic languages with computers. The number of characters would skyrocket. The only way I could think of to represent the entire Chinese language would be to have a symbol for each possible brush stroke and combine them to form characters/symbols.
There will always be some languages that will not be practical to completely represent on a computer.
At the next eco-hypocrisy-meeting, count the private jets used to get to the meeting. Should be interesting to see that
"It's a damn poor mind that can think of only one way to spell a word!"
---
If by real you mean a natural language, then I don't have any great suggestions, although I know that French is much more regular and has a smaller vocabulary.
On the other hand Lojban is a real language. Like esperanto, Lojban is regular (the rules of the language have no exceptions), but it has only 6 vowels, 12 consonants, and 3 semi-letters. Other benefits are an unambiguous grammar based on principles of logic, culturally neutral, simple to learn, and it uses phonetic spelling.
Christopher
Mozilla
CloudWarrior
Let's see...you used "10,000" and "100". Those are ideographic representations of "ten thousand" and "one hundred". There are hundreds of ideographs in common US English yet someone wants to harp on a language that uses ideographs for 95% of their written word?
The point is that any character encoding should have been robust enough to encode any language used at any point in the history of mankind (okay...encoding things like Ancient Latin might be more academic than anything).
In a study by Zhou Youguang published in "Zhongguo Yuwen Zongheng Tan" in 1992, he gives the following stats:
number    frequency
1000      90.000%
2400      99.000%
3800      99.900%
5200      99.990%
6600      99.999%
There we go, that's the facts on frequency.
The other thing is that there are currently two different systems for writing Chinese on the net -- GB (guobiao) from mainland China and Big5 (dawu) from Taiwan.
Using the results of frequency studies, the GB format was made to only include a certain set -- 7237, I believe. This is what almost every Chinese from mainland China uses right now, and it's working pretty damn well.
Big5 has something like 1X,000 characters, and that seems to work just fine as well.
If you ask me, the largest problem that folks can face is that they receive email with scrambled codes or don't have a Big5 converter or something, not that there aren't enough codes for folks to adequately express themselves.
(I haven't read the article, but I believe that before they planned to use escape sequences to make these more un-used characters -- and I'll betcha 99.99999% of users will never run into them, and not need the fonts, etc.)
(that class was boring as all hell -- can't believe it came in useful on /.)
there is no thing
what else could you want?
Don't confuse the Latin alphabet with the English language! In Czech, the Latin alphabet (plus a few accents) is used phonetically.
--
--
Artix
Your Linux, your init.
Did ya consider getting her Hooked on Phonics? :) Or was the trouble of her reading through your comic collection because like every other sane person, Marvel's NEW UNIVERSE, STUNK!
;)
Learn english.. 26 letters 10 numerals.. assorted punctuation.. ;)
Special letter forms don't need to be coded into unicode to be viewable. SVG, Postscript and other languages do a perfectly good level of presentation. So unless you can convince me that a Korean/Chinese person will be trying to do a word search through an historical Japanese/Taiwanese/Vietnamese document and will always inadvertently find the Korean ACK/Chinese SPOO when what he was really looking for was the Japanese FOOFLE/Taiwanese FLUM.
Personally I can't understand why anyone in the world would want to search in a character set of more than 60,000 characters. I'd personally be pissed off if the UNICODE committee started adding special letter forms for US product trademarks (so they would render correctly) when as a user I'd rather just have them be findable.
Really, the author needs to understand the use of the ALT tag.
LibBT: BitTorrent for C - small - fast - clean (Now Versio
Phonic based encoding simply lists letter (say, 6x8=48 bits for a typical word "letter") Why can't brush strokes be used to describe ideograms?
I assume (with our knowledge) that common ideograms are of limited complexity due to simplification over time and that seldom used ones tend to remain complex. With a stroke encoding scheme you should end-up with a unique string of bits not too much longer than a roman word.
True, this would play hob with the functions in the standard libs, but in real life you really are searching for words (multi byte symbols), not single chars.
-Scott
This article is technically illiterate. UCS-2, which he references heavily, basically doesn't exist any more and hasn't for a while. UTF-8 and UTF-16 are perfectly adequate encodings each of which can handle all of the extended characters, up to a million or so in number (17 planes of 64k, to be precise).
He's correct that the ability to do computing in an Asian environment has lagged behind Western-language capabilities. However, as of Unicode 3.1 (in fact, as of Unicode 2), the support for what you need to do *business* computing has been pretty well there.
The job of collating and organizing all the tens of thousands of characters required to handle the classical texts is under way but will take a while to finish. Then there's the really hard problem of building quality fonts to support all these things.
But the title and premise are wrong. You can use Unicode on the net today just fine, lots of people are doing it, and anyone who builds a significant application today and *doesn't* build in support for international character handling is just out 'n' out stupid. It's not that hard.
Cheers, Tim Bray (tbray@textuality.com)
Excuse me, but Unicode isn't supposed to describe fonts. As bad as Unicode is, every character in every language can be represented in way, way, less than 65,000 bits. Korean has around 50 characters. The Japanese use less than 1000 Kanji in practical use. You couldn't find a Chinese, or even an American who has more than a 10,000 word vocabulary if you tried.
What's more, if accents, umlauts, or whatever were used as separate characters, everything except chinese and japanese kanji could fit in 256 characters.
Most japanese get by using 2-character combinations, on a variation of a standard keyboard, which is a lot easier than trying to use a 10,000 letter typewriter. Every word in japanese can be represented with 50 hiragana.
the chinese character set can be at least partially blamed for the high level of illiteracy in china. It is ancient. It has lasted so long mainly as a tactic specifically to keep the general populace ignorant.
And we don't need to fit Mayan sculpture into unicode.
Anyone who speaks english and can't figure out what fishes are is fit to be hung, in bunches.
Then how do they sing danny boy?
Tell that to Tommy Makem
Sorry to rain on your parade but English is not a particularly simple written language.
The western tradition uses 2 complementary (but distinct) alphabets - the Latin, Majuscule or upper case alphabet and the hunnish, Minuscule or lower case one.
These 2 alphabets have a 100% redundancy between them, and about a 50% overlap and their mixed-usage is context dependent and purely conventional and dates from the Renaissance. Their usages prior to that were in substantially non-overlapping geographical areas (and/or time periods).
In addition to this the English tradition chucks in an ideogram set to represent numbers, except that unlike the latinate or hunnish alphabets, this ideogram set reads right to left like the Arabic from whence it was bodged.
So, let's recapitulate, 2 alphabets with 100% semantic redundancy and 50% overlap of form which read left to right, and an ideogram set that reads right to left. Simple? Or just what you are used to?
http://scottish.politicaldiscussion.org
We Bynari take issue with this.
With much grief and gnashing of teeth do we stoop to use this ill-conceived and bloated Latin based alphabet with 26 characters to respond to this bigoted viewpoint in a way that your feeble minds may understand.
Our alphabet has exactly two letters.
"Provided by the management for your protection."
So what if you've been using Unicode for ages, Unicode can't handle Chinese in a way that can simultaneously satisfy mainland and non-mainland Chinese.
#karma_whore
M$oft is what most people use. Doesn't make it right though.
- You seem to be expecting Unicode to do some magic semantic translation for you.
- Dropping the word "hiragana" into a posting doesn't make you an expert linguist.
- You've entirely ignored the real issue that this article raises.
--"There's something wrong with our bloody moderators today"
Admiral_Jellicoe@slashdot
The smallest alphabet is that of Solresol, with 7. The "letters" (or segmental phonemes, if you're being picky) in Solresol may be represented in several ways, not just written, and it's fundamental to the language that they're all identical. It's often called a "musical language", because of their ancestry from the Western chromatic scale, but they have equally valid written and spoken forms, even to the tone deaf. Solresol is interesting for several reasons, although I'd not claim that it has a particular significant future.
And of course, it's in Unicode too.
On the downside, it's just French with squeaky noises.
Hawaiian is probably the "naturally evolved" human language with the shortest alphabet.
Ah, the horrors of Unicode. The referenced article is too Sinocentric. Unicode's problems go further. Unicode is both a european solution to european problems and a european solution to asian problems.
The Japanese hate Unicode. If you bother to ask them, which the web did not, you find a loud and impolite dislike for Unicode. The Japanese want their ISO 2022 solution, aka shift-JIS.
The history of encodings is roughly:
1. There was chaos.
2. Then there was ASCII (the roman alphabet) pleasing to latin and english speakers.
3. Then there were all the ISO 8859 and ISO 2022 encodings. These let all the european languages mix together with ASCII.
4. Then Japan, Korea, and Vietnam define their own ISO 2022 encodings that make sense in the local language, and let these languages mix together with the european languages and ASCII.
5. But ISO 2022 is a complex patchwork of special cases. So at the same time the Asians were inventing their ISO 2022 solutions, Unicode was being invented.
Unicode 1.0 provided a viable solution to modern european languages, but could not encode historical documents or asian languages properly. The Unicode 2.0 effort fixed the historical european language problem by adding in the alphabets for these "dead" languages. Unicode 2.0 brought the asian encodings to the point where they were usable.
Japan and Korea get no benefit from Unicode. In fact, their ISO 2022 encodings are at least in "alphabetical order" for the relevant alphabets. Unicode is just a jumble.
Meanwhile China has a unique problem. They do not have an agreed alphabet. The Japanese all around the world agree on what characters define Kanji. There may be different fonts, but there is one agreed alphabet. Similarly, the Koreans and the Vietnamese have one agreed alphabet. These alphabets are huge, with thousands of characters, but they are fixed and agreed worldwide.
China has not agreed on an alphabet. Different regions use different alphabets. Chinese speak numerous different languages and have invented an amazing alphabet that works as a single writing form for all those languages. But there are disagreements. Furthermore, some regions of China are still inventing new letters for the alphabet. It is not a fixed and stable thing like european alphabets. You can invent new letters. (These really are new letters, not just new fonts.)
The Chinese have invented many encodings as a result. The two most popular (Big5 and GB2312) are not ISO 2022 compatible. There is a new, less widely used encoding that is a superset encoding of BIG5, GB2312, and other encodings, and that is ISO 2022 compatible.
Unicode did not accept the approach of leaving all these alphabets as different. They share most of their glyphs. Giving each region and language its own complete section would have blown the 50K limit of Unicode 2.0. They smushed all these different alphabets into one blob by combining anything that had similar glyphs into one character.
This left Unicode 2.0 telling the Chinese, ignore all those letters we don't like. You don't use them much anyhow. It destroyed any notion of alphabetic order in the encodings for any asian language. And it is usable for modern text communication. Unicode 3.0 promises to do better, and probably will.
But since all these languages can use the ISO 2022 encodings with fully compatible mixture of languages, why not just use ISO 2022 and forget Unicode? The problem is the patchwork nature of ISO 2022. The encoding rules are complex. ISO 2022 is a terrible internal format. A chinese character may take from 2 to 9 bytes to encode. And it gets worse as you dig further. UCS-2 and UCS-4 are very nice friendly internal formats for computers. It is trivial to convert from UCS-2 or UCS-4 into UTF-8 for transmission.
It is also pretty simple to translate from UCS-2 or UCS-4 into ISO 2022 encodings. So the ISO 2022 encodings actually can make sense for network transmission.
These issues will just get worse as you include other languages, like historical chinese, chinese border languages, and south asian languages. As with chinese, some of these have the fundamentally hard problem that they do not agree on a single alphabet.
If you're going to troll, at least make an account for it. Posting at 0 doesn't do you much good.
Yes, you're right. MBCS strings can't easily be scanned backwards because it's a little tricky to figure out whether the preceding byte is the trailing half of a double-byte character, but that's not true of UTF-8, which guarantees that every continuation byte has the bit pattern 10xxxxxx, while the leading byte of a character never does (it is either 0xxxxxxx or starts with 11). So when scanning backwards through a string, you just back up your pointer until you find a byte that doesn't start with 10, and that's the start of the preceding character.
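A minimal sketch of that backward scan, assuming well-formed UTF-8 in which continuation bytes have the form 10xxxxxx. The function name `prev_char_start` and the sample string are made up for illustration.

```python
def prev_char_start(data: bytes, pos: int) -> int:
    """Return the index where the character ending just before `pos` begins."""
    i = pos - 1
    # Continuation bytes look like 10xxxxxx; step back over them.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i


# One 1-byte, one 2-byte and one 3-byte character.
text = "a\u00e9\u20ac".encode("utf-8")

# Walk the buffer from the end, collecting the start of each character.
starts = []
end = len(text)
while end > 0:
    start = prev_char_start(text, end)
    starts.append(start)
    end = start

print(starts)
```

The scan never needs to look at more than a few bytes per character, which is the property the legacy double-byte encodings lack.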
No, because the aliens are all so technologically and socially advanced that they've standardized on Esperanto.
It is a shame that there are so many different Unicode encodings. I think we ought to just standardize on UTF-8.
And UCS-2 is not the only way to encode Unicode. You mean Unicode is not the only way to encode UCS-2. UCS-2 is a character set, unicode is an encoding of this character set.
UTF-16 is used by some that needs to extend their UCS-2 applications to UTF-16. Whoa! UCS-2 is the character set. You can encode UCS-2 using either UTF-16 or UTF-8. Once again, Unicode is an *encoding* and UCS is the *character set*. Big difference and you seem to be reversing them.
being a 16-bit character definition allowing a theoretical total of over 65,000 characters. However, the complete character sets of the world add up to approximately 170,000 characters.
So, add a byte or two per document as a language ID...
Anybody feel like joining me at Milliways?
sig fault
*sigh*. No.
UTF-8 is an encoding format, which specifies a means of encoding Unicode characters using variable-length byte sequences. The number of bytes it uses to encode characters does not dictate how many characters Unicode supports.
Unicode, as I've stated elsewhere, supports a little over a million characters. There are ~50,000 characters in Plane 0, and 2^20 (~1 million) in Plane 1. Plane 1 is made up of surrogate pairs, which are two special characters next to one another (a high surrogate and a low surrogate). There are 1024 of each, leading to 2^20 Plane 1 characters.
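To make the surrogate arithmetic concrete, here is a rough sketch. The constants 0xD800 and 0xDC00 come from the UTF-16 definition rather than from the post, and the helper name `to_surrogates` is an assumption for illustration.

```python
HIGH_START = 0xD800  # first high (leading) surrogate
LOW_START = 0xDC00   # first low (trailing) surrogate


def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the surrogate-addressable range")
    offset = code_point - 0x10000          # 20-bit value
    high = HIGH_START + (offset >> 10)     # top 10 bits
    low = LOW_START + (offset & 0x3FF)     # bottom 10 bits
    return high, low


# 1024 high surrogates times 1024 low surrogates.
pair_count = 1024 * 1024

high, low = to_surrogates(0x1D11E)
print(hex(high), hex(low), pair_count)
```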
ZFS: because love is never having to say fsck
No, the private use area is inappropriate for this sort of thing. Private use characters are (as the name implies) not intended to be visible to other applications; they are for encoding weird data within a single application.
There is a much larger block of public code points, which allows for over a million characters (none of which have been assigned yet, but the code points are there).
Not sure where you got Planes 1, 2, and 14 from. It's just Plane 0 (normal characters) and Plane 1 (surrogates). No characters whatsoever are assigned outside of Plane 0, although some are pending approval.
UTF-8, UTF-16 (UCS-2), and UCS-4 (not just UTF-16, as you say) all allow Plane 1 to be addressed, as would any other encoding which covered the surrogate codepoints (although none such exists, to my knowledge). UTF-8 allows this either through discrete encoding of two separate surrogate characters, which takes six bytes, or a special 4-byte encoding which encodes the Plane 1 character directly (rather than as two surrogates).
It encodes over one million codepoints, actually (the erroneous statements of other posters notwithstanding). All currently assigned Unicode characters exist within the basic Unicode Plane 0, as it's called, which handles ~50,000 characters. Twenty-some-odd-thousand of those characters are in the CJK block (Chinese, Japanese, and Korean characters).
Now, a range of Unicode characters is set aside for so-called "surrogates", and a high surrogate and a low surrogate character placed next to one another form a "surrogate pair" which specifies an extended character in UCS Plane 1. None of UCS Plane 1 codepoints are actually assigned to anything yet, but since there are about 2^20 (~one million) Plane 1 codepoints, they will easily handle all remaining glyphs with a ton left over. Tengwar, Klingon and others have all been considered for Plane 1 encoding (although I just checked and Klingon has been rejected. Sorry folks).
So, the simple fact is that anyone who says Unicode can't support enough characters has been smoking a bit too much crack lately. Do yourself a favor and go read the spec before getting your panties in a twist.
I'm really sad that real Westerner's attitude prevails right here in slashdot. I'm not surprised, even emacs rmail writers think MIME as a useless thing. that's the Westerner's attitude, so ignorant about I18N. probably most of you MERKINs don't know what I18N is in the first place.
Mr. Caroll is just wrong in everything he claims. even the most classical and even a little bit absurd, not proven to exist and pure theoretical Hangul (Korean script) glyphs are included in the block starting in U+1100. I won't repeat at each FUD's he is spreading since the reply from Unicode above sums them up very well.
I'm a Korean. for us the ISO-10646 and the Unicode is the Right Thing(tm), not only a Good Thing(tm).
well some of the Japanese seem to hate Unicode. so be it. but let me tell you this. the very notion that a coding system must be defined in a lexicographical order is just OBSOLETE. that's why you have LC_COLLATE in POSIX locale.
and about ability to code fictional script in Unicode, you can use the 31-bit space in UCS-4, just set the MSB 1. you can do whatever crazy thing in that space. that's why ISO-10646 is a 31-bit representation, not 32-bit.
the REAL PROBLEM in Unicode is that the standard itself is unavailable in web, so most /.ers (and most merkins) don't bother to find out what it is, so much for actually reading it. the standard is a hefty volume of dead trees (paper) and costs a hefty fit from your purse. before the standard itself is available on web FUD's like Mr. Caroll is spreading won't be stopped in english nations, most of all US of A.
but wouldn't you must RTFM before you throw flame on everything that makes you feel shit, mostly because you have to pay (storage space) for things you don't want (languages other than english the language so-much-perfect-for-everything-even-jesus-speaks-in-it)? I expected that much from /.ers. maybe I expected too much.
ignorance IS the human anyway.
Check this link to see why unicode characters won't work on the internet:
http://dábliü.ämêricõ.îñamè.com/índiçý.html
--
This space left intentionally blank.
Hm. log(170000)/log(2) = 17.4, so at least 18 bits is needed, as I cursorily understand this, to encode present human languages. Clearly a 3-byte unicode standard is needed. Maybe use only 20 bits and leave 4 bits for something else (font style, inverse, etc.).
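The bit count above can be checked quickly. This is only the arithmetic from the post, taking the article's figure of 170,000 characters at face value.

```python
import math

characters = 170_000  # figure claimed in the article

# Smallest number of bits whose range covers that many code points.
bits_needed = math.ceil(math.log2(characters))

# Capacity of the 20-bit field suggested in the post.
capacity_20_bits = 2 ** 20

print(bits_needed, capacity_20_bits)
```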
To-do List: Receive telemarketing call during a tornado warning. Check.
I wonder how difficult it would be to make the next version of unicode be 24 bit? It would break all existing implementations of course, but since unicode doesn't solve the problem it was designed to solve, continued existence in its present form is certainly not beneficial...
Maybe it should be 32 bit just to make sure...
--
If you had super powers, would you use them for good, or for awesome?
This is actually more-or-less true... and within my experience the only language with easier rules of pronunciation is Spanish as spoken in Mexico and Central America.
It helps that Cyrillic denotes ten vowel sounds and a pseudo-vowel with ten characters (accounting for almost a third of the characters used).
One has to learn how stress falls in a word to know how to pronounce it properly, but there's a certain rhythm to that.
Where Russian kills is not with the pronunciation (once you've gotten used to it - Russian has a lot of sounds that English speakers are never taught to make) but with the grammar. It's not that there are a lot of exceptions, but rather that there are six cases (where German has four and Latin seven). There's also a lot of ambiguity where verbs are concerned, almost as bad as the ambiguities in English verb usage. The lack of articles (a/an/the) in Russian takes some getting used to, but inflection generally helps there.
My biggest gripe about Russian, though, has to do with prepositions. More on that if anybody asks.
...When in doubt, think for yourself.
wont, dont, ill (as in sick), or I'll :)
Just a comment on the "Straight Dope". I don't know if the info on the Chinese Typewriter was valid several years ago, but I know it's no longer true. Both Chinese and Japanese keyboards have multiple glyphs on each key, because both Chinese and Japanese have phonetic syllabaries (alphabets). In Chinese, it's Pinyin, in Japanese it's Katakana or Hiragana (same sounds, slightly different drawings). Either way, you input complex characters using multiple keystrokes, but not in English characters.
So, for instance, "Flower" in Japanese is pronounced "hana". That is two characters, "ha" and "na". If you type ha-na I think you will see a menu pop up with possible Kanji (pictographs) and you can choose from them. I have only used a Chinese keyboard but I assume it's very similar.
Derek
Don't Panic...
Had this researcher bothered to read the Unicode technical introduction, the following would have been obvious.
In all, the Unicode Standard, Version 3.0 provides codes for 49,194 characters from the world's alphabets, ideograph sets, and symbol collections. These all fit into the first 64K characters, an area of the codespace that is called basic multilingual plane, or BMP for short.
There are about 8,000 unused code points for future expansion in the BMP, plus provision for another 917,476 supplementary code points. Approximately 46,000 characters are slated to be added to the Unicode Standard in upcoming versions.
The Unicode Standard also reserves code points for private use. Vendors or end users can assign these internally for their own characters and symbols, or use them with specialized fonts. There are 6,400 private use code points on the BMP and another 131,068 supplementary private use code points, should 6,400 be insufficient for particular applications.
Plenty of room.
Cheers,
Toby Haynes
Anything I post is strictly my own thoughts and doesn't necessarily have anything to do with the opinions of IBM.
No, I propose that folks use readers that interpret Unicode appropriately for the language they wish to use, whether that's a browser, an email proggy, or whatever. The article's point was that Unicode won't work because it can't display every single character that anyone ever came up with. My point was that it doesn't have to. My inbox is cluttered with all sorts of spam that says something like &*^%*&%UYGVKNB&^$*^%#^%$FCJUY%^$&^%U^TRU&^%#$^$@#^ %$&YT. Why should a non-English speaker expect otherwise?
Under capitalism man exploits man. Under communism it's the other way around.
In other words, Unicode doesn't need to account for every single character in the world!
But of course, this was posted on the internet, so it MUST be true...
The author allows his enthusiasm to carry him away more than once. For example,
Yes, Hangul is a remarkable invention, but try asking a Korean to say "Flushing" some time. Who cares? What does that have to do with Unicode, which has absolutely nothing to do with the physical representation of the glyph? "PersaCom"? I've never seen or heard it rendered that way, and I really doubt that "persacom" is technically considered pronouncable Japanese. (I've always seen it rendered "pasokon".) And it still has nothing to do with Unicode. And printer manufacturers really made 8-pin printers so they could print hiragana and katakana, and they invented modes so they could print more complex characters but they sold them to Americans as "graphics modes", and, and, and...a whole flood of undocumented irrelevance.
Actually UTF-16 can't represent the same range as UTF-8 or UTF-32, it's a bit weird. UTF-16 uses surrogate characters to represent the 16 UCS-4 planes 0x00010000 through 0x0010FFFF as a pair of 16-bit words.
As a preliminary, Unicode and ISO 10646 aren't the same standard, but are kept pretty much in synchronisation. ISO 10646 provides a character set with a 4-byte representation, and a compatible smaller set with a 2-byte representation. These representations have encodings such as UTF-8, UTF-16, and UTF-32. UTF-32 encodes every Unicode character in 32 bits and can represent the full 2^31 codepoints, while UTF-8 and UTF-16 as described in the Unicode 3.1 document are variable length representations that can represent approximately 2,100,000 and 1,100,000 codepoints respectively.
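The approximate capacities quoted above can be reproduced with a little arithmetic. This sketch assumes the four-byte UTF-8 form carries 21 payload bits and that UTF-16 reaches up to 0x10FFFF; the variable names are illustrative only.

```python
# Four-byte UTF-8: 3 payload bits in the lead byte, 6 in each of 3 continuation bytes.
utf8_four_byte_limit = 2 ** (3 + 6 + 6 + 6)

# UTF-16: the 16-bit plane plus 16 further planes reached through surrogate pairs.
utf16_limit = 0x10FFFF + 1

# The full 31-bit space of the 4-byte representation.
ucs4_limit = 2 ** 31

print(utf8_four_byte_limit, utf16_limit, ucs4_limit)
```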
One of the design principles was to provide a lossless representation of any currently used character set in Unicode, so that a round-trip re-encoding of text from one encoding to Unicode and back again would lose no information. Another was to keep distinct code-points for any characters that had different semantics, or different 'abstract shapes'.
It turns out that one can satisfy these requirements for the Japanese kanji, Chinese hanzi (traditional and simplified) and Korean hanja without requiring a separate code-point for each; in Unicode version 2.0, approximately 121,000 such characters were able to be represented in 20,902 code points. Note that those characters which have distinct shapes but the same meaning, and those which are similar enough to be classified as calligraphic variants but have distinct meanings, are all represented by distinct code-points. (One caveat: in practice there are some exceptions as regards the preservation of information after a round-trip encoding to Unicode and back. For example, the CCCII encoding of hanzi explicitly catalogues calligraphic variations, and as such doesn't map 1-1 onto Unicode.)
Of course, the actual glyph that corresponds to one of these unified codes will change depending upon the context in which it is rendered. For example the character 0x6d77 corresponding to the character for sea in both Chinese (Mandarin 'hai3') and Japanese ('umi') is drawn with one fewer stroke in Japanese than in Chinese. These typographical details are important, but can (and debatably, should) be dealt with outside the context of character encoding. Unicode has support for language tags which in the absence of any higher-level information can indicate the language context of the characters following them. Typically though, this information should be stored as part of a richer document structure (as is possible in XML for example.) Correct display of characters will require the presence of the appropriate font and a mechanism (such as LOCALE in a simple one language case) for selecting this font.
Given this unification then, one really can fit most of the characters for which there are already extant (non-Unicode) encodings into 16 bits. With Unicode 3.1/ISO 10646-2 (which uses more than 65536 codepoints) this representation is AFAIK pretty much complete, including for example all of the hanzi of CNS 11643-1992 and CNS 11643-1986 plane 15 (the most complete hanzi encoding outside of CCCII.)
With this in mind, one can argue against the points raised in the article:
A little bit of research by the article author would have made the article unnecessary.
References:
Unicode 3.1 document;
CJKV Information Processing, Ken Lunde.
PS: In the time it took me to read the article, do some research and write this response, there have been over 300 slashdot comments. Wow.
Far more "technical people in Japan" are in favor of Unicode than are opposed to it, and the percentage opposed appears to shrink every month as the feared "dangers" somehow don't materialize but the benefits do.
Re: your little list...
-- The conversion tables differ only very slightly, and *almost* everyone uses the tables at the Unicode.org site, either directly or by calling converters in the OS. Still, there are potential tiny differences, as you see in all cases of matching massive character sets across borders, though in the case of Unicode the problem is much smaller.
-- CJK Unification has the disadvantage that you can't be certain of picking a font that is guaranteed to be acceptable based on the code point alone. In practice, this rarely turns out to be much of a problem, but it can happen. On the other hand, there are some nice benefits of the unification that more than make up for that one problem. The problem you cite isn't a problem. The character distinctions that the Japanese want to make have been made by the Japanese in the JIS X character sets. Those distinctions were then directly ported over to Unicode.
-- Certainly you're referring to UTF-16 surrogates, but calling it "Unicode" to make the problem sound larger. In fact, UTF-8 is "Unicode", too. It's the greatest method for text data exchange ever created, and it has no 64K issue, no endianness issue, is self-synchronizing (if you miss a byte in a stream, only one character [code point] is lost), and many other nice features. The greatest of all is, of course, that it can encode virtually every language in the world in the same encoding.
"Those who have never entered upon scientific pursuits know not a tithe of the poetry by which they are surrounded."
ISO 10646 itself is now restricted to the range described by UTF-32. ISO has agreed to close down the state space and to never define any code points that can't be reached by UTF-16 surrogates, which is where the UTF-32 boundary came from. UCS-4 is now obsolete, even at ISO.
"Those who have never entered upon scientific pursuits know not a tithe of the poetry by which they are surrounded."
English phonetics is only "ridiculous" when it's learned with one eye on the written form. If English were written using, say, the IPA, the consistencies would be much more apparent; as it is, however, they're obscured by the massive kludge of historical accretions that passes for modern written English. How many vowels does English have, for example? Five? It's actually closer to twenty.
And there are obscure rules such as '"their" may be used in place of "his/her"
Not in my classes. Any student of mine who tries to use third person plural as a substitute for "he" or "she" once doesn't make the mistake a second time.
the difference between "lend" and "borrow"
This is no more difficult than the difference between "bring" and "take" or "come" and "go", and strikes me as inexcusably ignorant. If it's not a problem for my seven-year-olds, it shouldn't be a problem for the educational elite.
What about UCS?
Support for up to 31bits per character and backwardly compatible with UTF-8
0x00000000 - 0x0000007F: 0xxxxxxx
0x00000080 - 0x000007FF: 110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0x00200000 - 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x04000000 - 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
This still leaves us with 0xFF and 0xFE as escape characters!
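The table above translates fairly directly into code. What follows is a sketch of the original six-form scheme exactly as tabulated, not of today's UTF-8, which is restricted to four bytes and U+10FFFF; the function name `encode_utf8_31bit` is hypothetical.

```python
# (highest code point, total bytes, lead-byte marker) for each multi-byte row.
FORMS = (
    (0x7FF, 2, 0xC0),
    (0xFFFF, 3, 0xE0),
    (0x1FFFFF, 4, 0xF0),
    (0x3FFFFFF, 5, 0xF8),
    (0x7FFFFFFF, 6, 0xFC),
)


def encode_utf8_31bit(code_point: int) -> bytes:
    """Encode a value of up to 31 bits using the byte patterns in the table."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("value does not fit in 31 bits")
    if code_point < 0x80:
        return bytes([code_point])  # single byte, identical to ASCII
    for limit, length, marker in FORMS:
        if code_point <= limit:
            break
    out = []
    # Peel off 6 bits at a time for the 10xxxxxx continuation bytes.
    for _ in range(length - 1):
        out.append(0x80 | (code_point & 0x3F))
        code_point >>= 6
    out.append(marker | code_point)  # remaining bits go in the lead byte
    return bytes(reversed(out))


print(encode_utf8_31bit(0x20AC).hex())
```

Neither 0xFE nor 0xFF can come out of this function, since the largest lead byte is 0xFD and continuation bytes never exceed 0xBF.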
Looks like you've just stuffed yourself...
The point of the transformations hard c -> k and soft c -> s is to free the letter 'c' to stand uniquely for the 'tsh' sound of church and Medici.
Will I retire or break 10K?
rough translation of this Quenya Elvish phrase which is a derivative of the Tengwar elven language
Script != language. The word tengwar is Quenya for "letters." Calling the tengwar script a language is like calling the cyrillic script (used for Russian), the katakana and hiragana scripts (used for Japanese), or the latin-1 script (used for many Western European languages) a language.
Find Tolkien's tengwar and more in the conscript registry, which uses the 'private use' area of the Unicode space for scripts invented in modern times (all scripts are invented at some time or other). And there are "surrogate" codes in Unicode UTF-16 for a million additional code positions.
Whereas most phonetic alphabets consist of ideograms recycled as phonetic symbols, Hangul seems to be the only one to consist of symbols constructed purely for phonetic meaning.
If you like hangul, you'll probably also like J.R.R. Tolkien's tengwar. Regular changes to the shapes of the consonants denote stop/fric/nasal and voiced/less. The structure of the script is such that unused letters (after t series, p series, and k series) can be used to represent sounds unique to a given language. It's available in both vowel-pointed (like devanagari and biblical hebrew) and vowel-letter (like greek/latin/cyrillic) modes.
I'm not 100% sure about the legal status of a post-1923 script. Can a script be copyrighted or trademarked? Probably not. (Patents don't apply; it's been more than 20 years since the entire system was disclosed in RotK.)
The letter thorn looks like Þ (U+00DE; Alt+0222; capital) or þ (U+00FE; Alt+0254; lowercase).
Thus, "Ye olde ..." is a kind of a typo; the first letter wasn't Y, but was close enough visually that it started at some point to be thought to be Y...
Except DisneyCo (famous for buying bad legislation) actually does the opposite: using instead of y in the corporate logo.
if someone tried to remove redundancies from the English language such as pork and ham, or argue and dispute
C. K. Ogden once did just this, reducing the English vocabulary to a set of 850 basic English words, but the result has (some foreigners claim too many) idiosyncratic idioms and turns of phrase.
Likewise, in Unicode, English, German, and Finnish all share the same codepoints and glyphs, so you can't grep for one language or another without using META headers or something similar.
For instance, if you were searching in English for "gift", this string in Unicode would be the same as the German characters for "poison" (Gift), so your search would get hits from other latin-based languages in addition to English.
It's difficult even to sort Unicode correctly without choosing some language or another, due to this overlap of characters. "Alphabetical order" is a bit different for the different European languages, even though they use the same characters.
Translation: Language collision can be avoided by exact phrase matching ("perpetual copyright" wouldn't return many matches for non-English documents) and specifying the natural language of a document either in the document or in the headers.
So then, are they the same, or not? The answer is NO.
Unicode would distinguish among these four forms because they are distinct characters, but it would not distinguish among similar forms of the SAME character. Unicode does not distinguish sans-serif from roman from italic from fraktur from monospace; that's the job of the stylesheet.
To answer another common objection: When two characters look the same but are not the same, they are assigned separate code points. For instance, Latin capital letter A, Greek capital letter Alpha, and the Cyrillic equivalent look exactly the same, yet each has its own code point. Chinese 'yi' (one) is the same character with the same origin as Japanese 'ichi' (one), but it is not the same character as hyphen, which in turn is not the same character as em-dash.
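A quick way to see this for yourself, using Python's standard `unicodedata` module. The particular sample characters are chosen for illustration.

```python
import unicodedata

# Three capital letters that usually render identically.
lookalikes = ["A", "\u0391", "\u0410"]
# One unified ideograph, plus two dash-like characters that are kept apart.
others = ["\u4e00", "-", "\u2014"]

for ch in lookalikes + others:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

distinct = len(set(lookalikes)) == len(lookalikes)
```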
As long as C programs have to be written in ASCII
Supporting UTF-8 variable names as an extension to C and to C++ would not break any standard because, by definition of UTF-8, any valid ASCII string equals its UTF-8 representation.
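The compatibility claim is easy to check. In this sketch the C snippet and the identifier `größe` are made-up examples.

```python
ascii_source = "int main(void) { return 0; }"

# Pure ASCII text is byte-for-byte identical when encoded as UTF-8.
same_bytes = ascii_source.encode("ascii") == ascii_source.encode("utf-8")

# A non-ASCII identifier only adds bytes with the high bit set, so it can
# never be mistaken for ASCII punctuation or keywords.
identifier = "gr\u00f6\u00dfe"
encoded = identifier.encode("utf-8")
high_bytes = [b for b in encoded if b >= 0x80]

print(same_bytes, len(encoded), len(high_bytes))
```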
english will be the standard
Programming languages use English as the standard for keywords because more programming language designers can speak English than any other language.
Limit use of 'to be' verbs to add power to your English.
Perhaps not, but there is such a creature as "Extended ASCII".
Say not "extended ASCII" or "high ASCII" but "ISO-8859-1" or "ISO Latin-1." Latin-1 happens to use the same characters at codepoints 00 to 7f as ASCII, but that of itself does not make it ASCII. Unicode uses the same characters at codepoints U+0000 to U+00FF as Latin-1, but...
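The relationship between the three sets can be sketched with Python's built-in codecs (note that Python's `latin-1` codec maps all 256 byte values, control ranges included):

```python
all_bytes = bytes(range(256))

# Every Latin-1 byte decodes to the Unicode code point with the same number.
decoded = all_bytes.decode("latin-1")
latin1_matches_unicode = all(ord(ch) == b for ch, b in zip(decoded, all_bytes))

# The lower half agrees with ASCII; the upper half is not ASCII at all.
lower_half_is_ascii = all_bytes[:128].decode("ascii") == decoded[:128]
try:
    all_bytes[128:].decode("ascii")
    upper_half_is_ascii = True
except UnicodeDecodeError:
    upper_half_is_ascii = False

print(latin1_matches_unicode, lower_half_is_ascii, upper_half_is_ascii)
```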
I always wanted to have greek letters AND hebrew letters AND a smiley face in my email address.
--
Je t'aime Stéphanie
Almost every rule in English has several exceptions, and many things in English cannot be deduced from rules, they must simply each be learned, and there are hundreds of these. Pronunciation is ridiculous, which you've mentioned, but apart from pronunciation is grammar, spelling, plural forms, tenses and possessive forms, all of these have strange nuances in English.
Sounds suspiciously like perl. How many times have I explained about scalar and list context.
If you choose this view, then yes, most european languages have much more logical spelling than english. One exception might be french, which is not at all written as it is spoken, although there is admittedly a system to it.
Accent marks and diacriticals don't make the writing system more difficult, they simply make it possible to write more phonetically. I would prefer the writing systems of german, danish, norwegian or swedish any day before english. I don't know any east-european languages, but I would be very surprised if most of the accents and diacriticals weren't there for a good reason, and I doubt they can be much worse than english.
From what little I know of russian, it has a very simple writing system that is even clearer and simpler than e.g. german, danish or norwegian.
On the other hand, if someone makes a truly simplified and logical spelling of the english language popular, e.g: "I thought my bones were breaking during the fight" -> "Ai thokt mai bowns wer breiking diuring the fait", it could eventually become as simple as most other european languages (or those written with the cyrillic character set).
Of course, most languages have some kind of idiosyncrasies when it comes to spelling, but english is certainly not among the easiest. And the few added letters in some european languages are laughable. German adds a few umlauts and ß, danish adds æ and ø, norwegian adds æ, ø and å, swedish adds å, ä and ö, and so on... No big deal! Besides, none of the above mentioned languages makes any use of x or z except in foreign words. Scandinavian languages never use w except in foreign words. The same is true for c in norwegian. So the letter count is mostly similar, as is true for cyrillic.
Old English, by the way, did have more letters than are found from modern english ("thorn" letter for "th", and couple of others). Thus, "Ye olde ..." is a kind of a typo; the first letter wasn't Y, but was close enough visually that it started at some point to be thought to be Y...
And Old English was, alas, easier to pronounce than modern english. Thanks a bunch, latin-loving grammarians, who bastardized spelling of words like "island", "herb" and n+1 others (idea was to emphasize the origin of loan words, independent of whether spelling was consistent with pronunciation). Syntax and grammar were more complex, though (with germanic inflections... of which 'bewitched' and 'awaken' are remnants)
On an unrelated note, letters 'j' and 'u' were not part of european languages (that's why romans had funny habit of using 'v' everywhere...) before being invented few centuries ago (ie. "i" was used for both "i" and "j", "u" for "u" and "v").
Oh and finally; it probably was a coincidence in sense that if computer science had bloomed in some other country (say, Germany), it would most likely have contained the local character additions (which in general in west Europe isn't all that many really... some languages do use diacritics more heavily, many do not)
I like paying taxes. With them I buy civilization -- Oliver Wendell Holmes
I'd recommend David Crystal's "Cambridge Encyclopedia of English Language" (or whatever title was, I don't have the book at hand right now). I'm not a native speaker, and found it very interesting reading (and it's rather complete in explaining history of english language).
In nutshell; english is a germanic language, derived from 'old german'; oldest non-germanic influences from celtic languages (but very little) and roman. More influence (loan words mainly) from vikings (Norse is a germanic language, so not much grammatical changes). Major changes thanks to french conquerors; tons of loan words (many originally from Latin), messed up spelling. Both grammar and spelling further complicated by scholars who loved Latin so much they changed lots of rules... just because they thought Latin grammar "was perfect" and a model for all civilized languages.
Of course, english has word loans from dozens of languages (surprisingly many from, say, portuguese and dutch... even one from finnish).
"sauna"... what a surprise! :-)
(AFAIR, the source was one of the big encyclopedias)
In the WWW, doesn't the HTTP header contain character set information, so the client knows which of the many character sets/languages to use? Then, only the size of that one character set is important (which will always be FAR less than 64K).
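For what it's worth, pulling the charset out of a Content-Type header value might look roughly like this. The helper name `charset_from_content_type` is hypothetical, and the `iso-8859-1` fallback is an assumption based on the historical HTTP/1.1 default for text types.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "iso-8859-1") -> str:
    """Return the charset parameter of a Content-Type value, lowercased."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the fallback when no charset is given.
    return msg.get_content_charset(default)


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

This only tells the client how to decode one document; a page that mixes several languages still needs a character set that covers all of them.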
Here is your first lesson if Slashdot goes global.
French = Première Distribution
German = Erster Pfosten
Italian = Primo Alberino
Portuguese = Primeiro Borne
Spanish = Primer Poste
Unfortunately these would all be rejected by the Slashdot editors.
...and isn't explaining B.C.E. as "before the Christian era" defeating the object? The reason I use BCE (before the common era) and CE (common era) instead of BC and AD is to remove the references to religious myth.
--
mrBlond
CowboyNeal for president!
"Hit any user to continue."
Interestingly (because Islam uses a lunar calendar, and Christianity a solar) Mohammed will one day be "older" than Jesus :)
What you are talking about is UTF-16. Unicode can support up to 2^31 character codes, but they are not all reserved yet.
most of the cruft you mention is as irrelevant as calligraphy. upper vs lower case could be called a stylistic difference.
thats why the original implementations of english-machines were all caps. and this reply works fine in all lower.
and sorry, the numbers are written and read left to right. 123 is "one hundred twenty three" not "three hundred twenty one"
now, if you mention grammar, spelling, etc, youd have a point.
One upside of it is that there is almost no cost for english/ascii, which will remain 1 byte per character. You dont even have to recompile most apps to support it- only those that format character glyphs.
Almost every other european language I have seen uses some set of accent marks or diacriticals. And having studied japanese and vietnamese, they have orders of magnitude more complexity. Even esperanto has a larger alphabet than english.
Is it just a coincidence that the simplest writing system was the first to be digitized? Too bad pronunciation of english isnt equally simple.
quote: All possible 2^31 UCS codes can be encoded.
The other reasons are more subtle, and I'm not sure that everyone here understands what's going on with CJK characters, so here's a little background.
The characters we're talking about originated in china, and spread to Korea, Vietnam, and Japan. Vietnam has switched to a western alphabet now, so let's leave them out. ;) At one point, although there have always been alternative forms for some characters, there was a reasonably standard set of Chinese characters used throughout these countries (recorded in the KangXi dictionary)...
The Japanese invented a number of their own characters, which I'm sure number less than 1000. Up until World War II, this was basically the situation. (So at this time, the required number of characters to encode would have been less than 50,000 -- Chinese characters and Japanese additions.) Then all hell broke loose, so to speak.
The Japanese simplified a large number of their characters systematically, immediately following WWII ( So they started substituting simpler characters for the disallowed ones in these compounds, and thereby subtly changed the meaning of the words.
On to China -- they also began a campaign of character simplification, which would span quite a few years, although theirs was much more radical than the Japanese approach. In fact, some of the simplified versions the government came out with were so repulsive, they were eventually retracted because everyone refused to use them. ;) So they ended up with a few thousand (
Finally, Korea, Taiwan, and Hong Kong basically kept the traditional Chinese characters.
So, that gives us the basic 40,000, plus 3000 Japanese (kokuji and shinjitai), plus maybe 10,000 Chinese (jiantizi), plus some other stuff not mentioned here, giving a grand estimate of around 55,000.
The key to this is that the vast majority of characters used are common among all 5 locales. This was the only reason that anyone even attempted to encode the CJK characters in the first place. The re-unification of all the disparate character sets was called Han-Unification during the Unicode development process.
This, combined with the surrogate encoding area, ensures that there will be plenty of space for everyone... :)
In the beginning there was ASCII. It's a 7-bit code, which means you only have room for 128 codes: the common English characters plus control codes. This didn't do any good for foreigners, so they made up language-specific code pages like Cp437 or the MS Windows Latin-1 encoding Cp1252 that just redefined what codes corresponded to what characters. This was a little ugly because you could not easily use characters from different languages together. So then someone came up with ISO-8859, which was backwards compatible with ASCII, meaning all the lower codes were ASCII. So this was a step in the right direction, but the extra 128 codes gained from that extra bit didn't give you much; you still needed language-specific versions: ISO-8859-1 is Latin-1 for Western European languages, ISO-8859-2 is for Central and Eastern European ones, etc. You see, the barrier here is the dependence on fitting character data into an 8-bit byte. Anything more and you really screw up existing kernels, libraries, and programs that depend on a character being one byte, like terminal drivers, strlen, your ini file parser, etc.
Finally, both ISO and the Unicode Consortium, at first independently, decided to come up with a universal character set. Both standards resulted in what amounts to a set of tables that defined exactly the same codes for all the characters in every language. At first I think they thought they could get away with a 2-byte code. This was called UCS-2, which is the route Microsoft is going and I believe Java as well. Now this expanded the number of possible characters considerably, however this still didn't solve the existing dependency on 8-bit character strings. For that they came up with UTF-8. The clever trick here is that they cannibalize the high bit to indicate that another byte gets tacked on. That gives you two bytes to play with. If the first three bits are on in the first byte then there are three bytes to store your large UCS code corresponding to some exotic character. But this still wasn't enough.
The characters started to push the envelope of two bytes, so they upgraded to UCS-4, which has 4 bytes and will hold all the characters of every language including the languages of yet-to-be-discovered alien civilizations. But now you have software, like from MS, that favors the somewhat more efficient and practical two-byte UCS-2 codeset, so you need to extend the UTF-8 concept to give you UTF-16. Well, that's about where we stand and there's a lot I left out.
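For anyone who wants to see the UTF-8 trick in action, here is a minimal Python sketch. The helper name utf8_length_from_lead and the sample characters are made up for illustration; the bit patterns are the ones UTF-8 uses for its lead bytes (up to four bytes in current practice).

```python
def utf8_length_from_lead(lead: int) -> int:
    """Return how many bytes a UTF-8 sequence has, judging by its first byte."""
    if lead < 0x80:            # 0xxxxxxx: plain ASCII, one byte
        return 1
    if lead & 0xE0 == 0xC0:    # 110xxxxx: two-byte sequence
        return 2
    if lead & 0xF0 == 0xE0:    # 1110xxxx: three-byte sequence
        return 3
    if lead & 0xF8 == 0xF0:    # 11110xxx: four-byte sequence
        return 4
    raise ValueError("continuation byte or invalid lead byte")

# Illustrative samples: ASCII letter, accented Latin letter, a Han character,
# and a character outside the 16-bit range.
samples = ["A", "\u00e9", "\u6f22", "\U00013000"]
for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", encoded.hex(), utf8_length_from_lead(encoded[0]))
```

Note that the ASCII letter comes out as the same single byte it always was, which is why old byte-oriented code mostly keeps working.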
Interesting?
Read this: http://www.cl.cam.ac.uk/~mgk25/unicode.html
One way to help out those who are poor is by opening them up to the modern economy and make it as accessible as possible.
One way to do this is by making sure they possess the knowledge and skills of the modern economy. One of those skills is the dominant language. If you want to be rich, you learn to act, speak, think, like the rich people. Preserving the "native language and culture" is the province of romantic idealists.
Don't go calling me a cultural imperialist, now. I actually read, speak, and write three languages, and could easily add a couple of more. I love the differentness of distant cultures. I am a "romantic pragmatist." I would love to see this differentness preserved, but I recognize that its passage is inevitable. The fact is that all these languages and cultures sprang up because the world was so vast. Groups of people were totally isolated from each other.
The "western" world isn't that large anymore -- it's actually smaller than it's ever been. When Alexander the Great ruled the world, it was months from one end to the other. Now, the western world is maybe a day from one end to the other.
The natural circumstances under which those languages arose simply do not exist any longer. They are fish out of water, and they must naturally pass on -- it's just the way of things.
There may be room for many different languages when the human race colonizes the solar system, but I suspect that even then, the communications delays will be low enough that a single culture will be maintained, more or less.
I mean, imagine how much poorer you would be if you had been unable to read the epic poems of early Anglo-Saxon culture in their original form! Or the early Judaic and Greek writings on which much of our more recent culture is based.
You *have* read Beowulf, and the Canterbury Tales, haven't you? Along with Plato's Republic in Greek, and the Dead Sea scrolls?
Now imagine how hard this would be if your computer didn't support the full character set in which they were written.
The simple fact is these guys are totally ignorant. They confuse a particular 16-bit implementation of the Unicode "basic plane" with Unicode itself. If they'd done any research at all, they'd know that there are 17 planes (the basic plane plus 16 more), with support for about 1 million characters. Plus there are some "private spaces" so people can create their own extensions of Unicode. There's already the ConScript registry (which supports Shavian and Klingon).
I'm reminded of people who thought computers would never catch on because keypunches were too bulky.
Another ignorant assertion: that 1.5 billion people "speak" Mandarin. Mandarin is the standard dialect of Chinese, but only about 800 million people actually speak it.
__
A better example would be the ampersand character (&). I can think of several ways to write that character, but I challenge anyone to come up with a sentence where changing one presentation form of the ampersand for another changes the meaning of the sentence.
Even if a language does die, there is often still a need to work with it. For example, I specialize professionally in various pre-modern languages, some of which have not survived to the present. I still need a way to encode these languages as I use computers to produce dictionaries and online corpora.
If you substitute the other presentation forms of the ampersand here, the sentence still means exactly the same thing (i.e., there are no situations where the sentence is TRUE with one form of the ampersand, and FALSE with another).
Contrast "I gave him a snack". Substituting "m" for "n" does change the meaning ("I gave him a smack"). There could be cases where I gave someone a snack without giving him a smack, and vice versa.
There's probably variation among Germans, but at least some Germans do write a "1" as an upside-down V.
There is in fact a group working on Unicode encodings for the Egyptian hieroglyphic character set. The codes will go in the range of Unicode that is reached through "surrogate characters". Regular Unicode uses the codes 0 thru 2^16-1; the range reached through surrogates runs from 2^16 thru 0x10FFFF, and has been designated by the Unicode Consortium for exactly this kind of case, i.e. large, rarely used character sets.
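A rough Python sketch of how a code above 2^16-1 is split into a surrogate pair in UTF-16. The function name to_surrogates is invented here, and U+13000 is used only as a sample code point beyond the 16-bit range; the arithmetic is the standard UTF-16 one.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a code point above 0xFFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points 0x10000..0x10FFFF need surrogates")
    offset = code_point - 0x10000      # 20 bits remain after removing the base
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low

high, low = to_surrogates(0x13000)
print(hex(high), hex(low))

# Cross-check against Python's own UTF-16 encoder.
reference = chr(0x13000).encode("utf-16-be")
print(reference == high.to_bytes(2, "big") + low.to_bytes(2, "big"))

# 1024 high surrogates times 1024 low surrogates: 16 extra planes of 65,536 codes.
print(1024 * 1024 == 16 * 0x10000)
```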
I don't see a need for special software to display runes. It's just a matter of having a font architecture which allows you to create and install a font for an arbitrary subrange of the Unicode space.
This is exactly the issue with the Chinese characters. For a given character, there might be a difference between the Taiwanese way of writing it, the Japanese way, and the mainland Chinese way; but the character is still recognized as being the same, despite these presentation-level differences.
For someone to demand that each national presentation form have its own character code is to misunderstand what Unicode is designed for. It encodes abstract characters, not presentation forms. Unicode does not have separate codes for "A" in Garamond and "A" in Helvetica.
That was really hilarious. Loudest laugh I ever remember having.
It was particularly clever in that it transformed english into the correct language: early 60's sitcom version of German.
After zis fifz yer, ve vil hav a reli sensibl riten styl. Zer vil be no mor trubl or difikultis and evrivun vil find it ezi to understand ech ozer. Ze drem vil finali kum tru!
Personally, I think Unicode is not enough. What if I want to digitize the whole collection of the Beijing library, which contains millions of texts from thousands of years? How am I going to represent all the characters with Unicode?
You may think that Chinese orthography has a tradition of simplification and variants, but this only applies to modern use of certain characters. These simplifications and variants can't replace the characters in ancient texts, or they will totally alter the meaning of the texts.
I think Unicode is developed by a for-profit corporation, which tends to oversimplify without doing thorough research into a specific culture before trying to encode the language.
a) This is (adapted?) from Mark Twain, it's in most fortunes.
b) No matter how funny it looks, if you read it aloud it's perfectly understandable...
c) ...but it keeps reminding me of 'Allo Allo?'
--
GCP
The European Commission has just announced an agreement whereby English will be the official language of the EU rather than German which was the other possibility. As part of the negotiations, Her Majesty's Government conceded that English spelling had some room for improvement and has accepted a 5 year phase-in plan that would be known as "Euro-English".
In the first year, "s" will replace the soft "c". Sertainly, this will make the sivil servants jump with joy. The hard "c" will be dropped in favour of the "k". This should klear up konfusion and keyboards kan have 1 less letter.
There will be growing publik enthusiasm in the sekond year, when the troublesome "ph" will be replaced with "f". This will make words like "fotograf" 20% shorter.
In the 3rd year, publik akseptanse of the new spelling kan be ekspekted to reach the stage where more komplikated changes are possible. Governments will enkorage the removal of double letters, which have always ben a deterent to akurate speling. Also, al wil agre that the horible mes of the silent "e"s in the language is disgraseful, and they should go away.
By the fourth year, peopl wil be reseptiv to steps such as replasing "th" with "z" and "w" with "v". During ze fifz year, ze unesesary "o" kan be dropd from vords kontaining "ou" and similar changes vud of kors be aplid to ozer kombinations of leters.
After zis fifz yer, ve vil hav a reli sensibl riten styl. Zer vil be no mor trubl or difikultis and evrivun vil find it ezi to understand ech ozer. Ze drem vil finali kum tru!
Remember, You are unique...just like everyone else.
I think, frankly, that this report is rubbish. The purpose of Unicode is NOT to provide a full listing of all possible glyphs; it is to provide a list of characters. The author of the report appears to me to have made a reasonably common mistake when reading through the Unicode spec; he sees one of the Unified Han characters, says "Ha! That looks nothing like the character in !" and assumes that Unicode is some Western pigheaded colonialist rubbish.
:-( see here
For a more complete discussion, which summarises more accurately the way to use the Unified Han character section of the Unicode specification, trot off to here. Particularly read the section on "why were the characters unified". Unicode isn't perfect, but the Unified Han system is a good attempt to minimize bloat in the character tables.
p.s. Those dudes from the Klingon Language Institute have been trying to get themselves a spot in the Unicode tables for ages and have recently had their application rejected
--- Nick, hard at work
Let's just all learn to speak binary. We can speak in long and short dashes built from words gathered from all languages. Morse code can become popular again. -1010011 1001100 (SL)
My $0.02 will always be worth more than your €0.02, so
Consider English literature and ASCII. If you look at a reproduction of Beowulf in the original Old English, you find lots of characters that aren't present in ASCII. That doesn't mean ASCII is worthless, and it doesn't mean anyone had to accept restricted access to literature. It just means there was room for improvement because ASCII wasn't suitable for all purposes.
The Unicode designers got bogged down trying to create an encoding suitable for every possible purpose. If the goals had been more modest, say to allow Chinese language URLs, there would have been faster ways to go about it.
>>>>> Yeah, let's get down to the lowest common denominator and make laws that require all internet content to be at least in X number of languages. A bit like the silly EU regulation because of which every fucking document, web site and audio recording concerning the Union must be available in at least.
You have to be kidding, right? Laws that govern the internet? Isn't that dumber than multiple languages?
>>Japan and Korea get no benefit from Unicode. In fact, their ISO 2022 encodings are at least in "alphabetical order" for the relevant alphabets. Unicode is just a jumble.
I can't speak for Korean, but there is no such thing as an alphabetic order for Kanji. In Japanese, Kanji almost always have at least two pronunciations, and often more.
While it is true that most all kanji have multiple pronunciations, the kanji in ISO-2022-JP are most definitely in order. Level 1 characters (0x3021-0x4F7E) are ordered by their primary reading, and Level 2 characters (0x5021-0x7426?) are ordered first by radical and then by number of strokes. In both cases it's easy to locate a character if for some reason you can't type it normally (e.g. it's not in your IME dictionary)--I've had to do this on occasion, in fact.
Unicode is, for all intents and purposes, completely random. Even without the problems of characters being inappropriately merged, there is no way you could try and find a character in Unicode; if your dictionary doesn't have it, tough luck. To me, that's an even scarier concept: for all practical purposes it could eliminate characters from the language. After all, if nobody can type it who's going to use it?
Have you ever tried to program in shift-JIS? It is horrific.
I will agree with this. Leaving aside the original poster's confusion of ISO-2022-JP and shi[f]t-JIS (the former is the official standard, aka JIS, while the latter is a poorly-thought-out Microsoft hack), dealing with strings that contain both half-width (1-byte) and full-width (2-byte) characters is a major PITA. About the only thing that can be said for it is the number of bytes is equal to the number of half-width character positions needed; and even that only applies to EUC and SJIS, since JIS has escape sequences to squeeze everything into 7-bit characters.
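To make the byte counts concrete, a small Python sketch using the codecs that ship with Python ("shift_jis", "euc_jp", "iso2022_jp"). The sample string is made up: one half-width Latin letter followed by one full-width hiragana.

```python
# Latin 'a' (half-width) followed by hiragana 'a', U+3042 (full-width).
text = "a\u3042"

# Encoded size in bytes under each of the three encodings.
sizes = {name: len(text.encode(name)) for name in ("shift_jis", "euc_jp", "iso2022_jp")}
print(sizes)

# SJIS and EUC: 1 byte + 2 bytes, matching the 3 half-width display positions.
# ISO-2022-JP stays 7-bit but pays for it with escape sequences.
jis_bytes = text.encode("iso2022_jp")
print(jis_bytes)
print(max(jis_bytes) < 0x80)
```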
On the other hand, there's the character order consideration, which along with the problem of merged characters seems to be what draws so much dislike for Unicode from Japanese.
--
BACKNEXTFINISHCANCEL
French isn't simpler than english. It has many exceptions to just about every rule and its vocabulary is as large as the english language. It's much harder to learn than english.
I speak both, and anyone who does can agree with me.
JP
--- Worst tagline ever.
"English is a Germanic language, with the influences of: Latin, French and Scandinavian dialects. French is a Romantic language with far less influences."
It is harder to learn than english only because of the exceptions. And it's the language of poetry. And has even more vocabulary than english (tons of beautiful synonyms...)
Alors VIVE LE QUÉBEC!!!
Ancient Latin only used upper case. And one bit of punctuation.
I.SAY.BRING.BACK.THE.GOOD.OLD.DAYS.WHERE.DID.THE.FIRST.SENTENCE.END.IS.THIS.A.THIRD-
--------------
'No rational religion claims "supernatural" exists, that's an atheist slander.' - seen on slashdot.
Well since the language used for Irish uses just 18 of the 26 letters of the english alphabet and gets by just fine I assume it wouldn't matter that much. For the record the missing letters are j, k, q, v, w, x, y and z
Slashdot: Proof that a million monkeys at a million typewriters can create a masterpiece
Actually I think there is some discrimination, but it is not motivated from hate but rather ignorance. The computer industry is still strongest in the US, and most OS software is still written by US-based companies. Why don't some Chinese software developers come up with their own language standard and write a bunch of software with it? Then the Western software industry will be forced to deal with the situation.
I find myself somewhat frustrated with the viewpoint that western programmers "discriminate" against other cultures because the culture has too many characters, where "discriminate" implies a political, social, or personal conflict. The problem, frankly, seems more technical in nature to me.
Many operating systems have a design that uses a smaller character set, if for no other reason than to help conserve space. Take your average file system; the character set doesn't permit Unicode characters in most cases, and even the C++ STL doesn't have a spec for streaming files with wchar_t names.
Then consider that you have several evolving programs that have to be modified to use a different character set. From experience, I can tell you that, particularly for complex programs, this is not a trivial job.
Finally, imagine that a political body imposes a deadline on imported programs.. that they must support their new standard by such-and-so a date or it won't be permitted within the country. The Chinese did this, extending the deadline to Sept. 2001. I only found out about this yesterday.
It doesn't make a job easier.
And so it goes.
You have obviously never worked a help desk.
same thing goes for dealing with contractions, a la dont, wont, ill, and so on.
I'm sorry, are you ill?
"And like that
Naturally you'd have to do something about homonyms (I'll sounds just like aisle, anyway). Probably best to just work around them.
:-)
Ill is not a homonym for I'll. You're talking about how things sound, and the discussion centers on how English LOOKS.
"And like that
While it's a noble and practical goal to eventually allow every language to be rendered as part of a website to allow for maximum access, I don't think that this limitation will really be much of a problem in the long run for two reasons.
Firstly, many cultures are still too poverty-stricken to have electricity and running water, let alone net access. For these people, the thorny issue of whether Unicode has the capacity to represent their native language is totally irrelevant. And in many of these places, political and economic instability caused by civil wars, corporate greed and a lack of resources will mean this situation will continue for some time.
Secondly, the rate at which languages are dying is still accelerating. Every year, we lose several languages as native speakers die of old age without their descendants having ever learned their original language. Cultural assimilation has proceeded at a brisk pace, with western countries only too willing to help with the "modernisation" of other cultures, which invariably results in a loss of their original heritage and linguistic uniqueness. And already globalisation is turning English into the de facto second language of the world.
By the time the 65K limit would become a problem, I estimate it won't be a problem any more - there will be far fewer languages around, and only a subset of those will require online access. If all else fails, many of the majority remaining will speak English anyway.
In using OSX from the start, it has always had problems mapping characters. Even the "normal" weird ASCII characters get mapped strangely. Upside-down question mark is one. I could deal with it changing, but what frustrates me is that it doesn't change back whenever you go back to other UNIX systems. For instance, downloading a text file with weird ASCII characters with OSX's scp will make things go awry. But then transferring that file back up does not switch it back. Weird stuff!
Check out Althea for a stable IMAP email client for X. Now with SSL!
I'm stating my impression about what they do accept (that Chinese users and standards bodies are far less troubled about Unicode than is the author) and speculating on why that might be (that out-of-the-box support to edit ancient texts in Word is more important to a scholar than to the vast majority of users).
Unsettling MOTD at my ISP.
However, that's not to say that Mac OS X is truly uncrashable. (Yet.) We appear to be somewhat lucky on the stability end, whereas some other hapless customers are not. For instance, take Tony Smith over at The Register; the poor man nearly reached his wit's end trying to keep his Mac OS X-loaded blue and white G3 from taking frequent and unplanned trips to Crashville. (Spookily enough, Tony's crashes left him with "nothing but a blank, mid-blue screen"-- is Apple hard at work reverse-engineering Microsoft's Blue Screen of Death?) After multiple reinstalls, he eventually figured out what was causing his grief: an aftermarket PCI ATI Radeon graphics card, which he determined was not supported. Replacing it with his original OEM Rage 128 card left his system solid as a rock. Or so he thought.
Once he got around to reinstalling his third-party fonts, his crashes came back. And so, by adding one font at a time, he was eventually able to isolate the real cause of all his woes: "a single Star Trek symbol font... OS X doesn't like it one little bit." So while Mac OS X is able to use his zippy Radeon card after all, Tony will sadly have to boot back into Mac OS 9 whenever he wants to stick the Starfleet Insignia into one of his party invitations. Now that's a problem that Apple's really going to have to fix before Mac OS X will ever catch on as a mainstream operating system.
In fairness to OS X, I think it was actually application crashes -- the font wasn't bringing the system down.
Unsettling MOTD at my ISP.
Wieger's seminal book about the characters and construction of China, published in 1915, was to become the defacto source against which all others would (and still should) be compared - with several caveats. Amongst these is a noticeable bias on his part against Taoism which becomes more evident in his analysis of the Tao Tsang (i.e., Taoist Canon of Official Writings [written 'DaoZang' in the PinYin Romanization of Mainland China] )
and I decided to skim the rest.
To summarize, for those whose eyes completely glazed over, his point is that Unicode doesn't sufficiently cover the full range of Chinese characters and that not using a larger set is a result of a longstanding Western prejudice that the Chinese don't need so many characters.
Now, I'm not Chinese so my opinion counts for little here, but my impression is that Unicode isn't nearly as controversial as he makes it out. His analogy "To express it in Western terms, how would English-speakers like it if we were suddenly restricted to an alphabet which is missing five or six of its letters because they could be considered "similar" (such as "M" and "N" sounding and looking so much like each other) and too "complex" ("Q" and "X" - why, they are the nothing more a fancier "C" and an "Z")." ignores the fact that Chinese orthography has a tradition of simplification and variants. I suspect Unicode is a lot more upsetting to a "reference writer specializing in rare Taoist religious texts and medical works" than to ordinary Chinese users who want to run Photoshop or put their wedding pictures on a web page.
Unsettling MOTD at my ISP.
Bush bolts GOP to join Democrats, fires entire Whitehouse staff
Linus Torvalds to join Microsoft as OfficeXP advocate
NASA on Moonshots, "Ok, ok, they were all actually faked on a soundstage in Toledo, Ohio and the ISS is really in a warehouse in Newark, New Jersey"
Oracle CEO, Larry Ellison to give fortune to charity, dumps japanese kimonos for Dockers and GAP T-shirts
RIAA to drop all charges against Napster, "All a big fsck-up, we'll all get rich together"
Taiwan throws in towel, joins PRC, turning over massive US military and intelligence assets
Rob Malda signed by Disney, epic picture planned, based upon this short. Sez Malda, "Anime's not mainstream enough anyway."
--
All your .sig are belong to us!
A feeling of having made the same mistake before: Deja Foobar
It's not new, and alas not surprising.
When they did ASCII, it was a standard by the US, for the US. The mess it created in the high-ASCII range (128-255) is still not resolved, and I'm talking about diacritical characters like those used in western European languages (French, German, Spanish etc.), nothing fancy or very exotic. The problem was, of course, that the Europeans were not involved in the process.
Now they do a universal standard that should correct all problems and, surprise, they don't actually bother to check with the people involved. Even if they did, it would make sense to have provisions for a few unknown character sets (like ancient civilisations or the myriad of small groups of people living in lost parts of the world).
Anyway, if computer history has taught us something, it is that a 16-bit range is never sufficient for practical uses. Well, just another sad example of one size does not fit all... But I suppose the slashdot response will be - why the hell don't they all speak/write english...
I disagree. I don't see Unicode (or its alternatives) as a way to resolve language barriers. Rather, it defines a framework within which all programmers can use the same libraries and programming languages to develop applications using their own language.
To use a gardening analogy: it doesn't make us all plant the same things or help us understand the meaning of what someone else has planted; it just lets us all use the same tools for working in our gardens.
What do you mean they cut the power? How can they cut the power, man? They're animals!
The guy obviously has an anti-western mindset.
But to simplify, the crux of his argument seems to be that in order to read ancient works from the Chinese/Japanese/etc, they need about 40,000 to 50,000 characters each.
But in reality, the average Japanese person would use less than 10,000 characters. In fact, probably much less.
Besides -- it is mostly a moot point until you can show me a keyboard capable of entering 50,000 unique symbols efficiently.
His solution seems to be allocating 32-bits of storage per character, rather than the 16-bit Unicode standard we have now.
For the foreseeable future, it would seem that Latin-esque alphabets have the upper hand. It just makes more sense, especially in terms of programming and protocols. Do we really need web servers that understand how to read "GET / HTTP/1.1" in thirty different character sets?
-- russ
Natural != (nontoxic || beneficial)
isnt that why there are all those different encodings, so there IS no overlap? i dont think there is any language that cannot use the 65K characters to construct its written language on a computer.
kanji has what, 3000 characters? Chinese uses unicode by combinations of roots and the other parts of the characters (its kinda complex, but it works! :P)
So what is the problem? I see no conflicts in any encoding scheme, where even really complex ones like Chinese still work?
(i know that they use some other type of syllabic system for teaching the writing system, or so i recall from Chinese lessons on TV... maybe that should be used to replace the pictograph system in place now, a system which was kept by the emperors in order to keep the masses illiterate)
----
"I would say that 99 per cent of what my father has written about his own life is false." - L. Ron Hubbard Jr.
>Unicode, the semi-commercial equivalent of
>UCS-2 (ISO 10646-1)
How is unicode commercial, and if it is, how does that affect its use with software-libre? Who owns unicode, what kind of license does it have? A quick look at www.unicode.org isn't very informative on this subject.
Personally, I think UTF-8 is just a wee bit inefficient. I worked out a scheme long ago that defines a theoretically infinite namespace, and encodes 7-bit ASCII exactly the same as it is now. If anyone cares, it's as simple as this:
This gives 2^(7 * n) possible characters of length n:
- n = 1: 128.
- n = 2: 16,384, cumulative 16,512.
- n = 3: 2,097,152, cumulative 2,113,664.
- n = 4: 268,435,456, cumulative 270,549,120.
- n = 5: 34,359,738,368, cumulative 34,630,287,488.
- n = 6: 4,398,046,511,232, cumulative 4,432,676,798,720.
...
As you can see, 3 bytes allow encoding that covers pretty much every estimate I've seen here. The system can be arbitrarily extended any time it's necessary, and existing agents that understand the fundamental rule would know how to parse these extended characters; although they would not know how to present the characters, they would be able to present an appropriate token indicating this fact, rather than displaying gibberish composed of the 8-bit "ascii" encoding they do understand.
[100% ISO 646 Compliant]
SVM, ERGO MONSTRO.
On the strength of one bean
He'd fart God Save the Queen
And Beethoven's Moonlight Sonata.
I don't know where these plans for conversion to a phonetic written language stand now, though I'm sure it wouldn't be hard to find out.
OK,
- B
--
http://www.bradheintz.com/
- updated
Is everyone who wants to publish on the web or otherwise electronically owed a simple solution to all their problems? If your language is not representable in a particular code set DON'T USE IT. If you need to publish in your native language, SOLVE YOUR OWN PROBLEMS! Or better yet, USE PAPER!
Problems are solved by those who need to do so. They solve their problems, and if it catches yours too, good for you. If not, get to work.
- Sig this!
It's probably too late, but following is a response from one of the editors of the Unicode Standard:
Dear Mr. Carroll,
I have just finished reading the article you published today on the Hastings Research website, authored by Norman Goundry, entitled "Why Unicode Won't Work on the Internet: Linguistic, Political, and Technical Limitations."
Mr. Goundry's grounding in Chinese is evident, and I will not quibble with his background East Asian historical discussion, but his understanding of the Unicode Standard in particular and of the history of Han character encoding standardization is woefully inadequate. He makes a number of egregiously incorrect statements about both, which call into question the quality of research which went into the Unicode side of this article. And as they are based on a number of false premises, the article's main conclusions are also completely unreliable.
Here are some specific comments on items in the article which are either misleading or outright false.
Before getting into Unicode per se, Mr. Goundry provides some background on East Asian writing systems. The Chinese material seems accurate to me. However, there is an inaccurate statement about Hangul: "Technically, it was designed from the start to be able to describe *any sound* the human throat and mouth is capable of producing in speech, ..." This is false. The Hangul system was closely tied to the Old Korean sound system. It has a rather small number of primitives for consonants and vowels, and then mechanisms for combining them into consonantal and vocalic nuclei clusters and then into syllables. However, the inventory of sounds represented by the Jamo pieces of the Hangul are not even remotely close to describing any sound of human speech. Hangul is not and never was a rival for IPA (the International Phonetic Alphabet).
In the section on "The Inability of Unicode To Fully Address Oriental Characters", Mr. Goundry states that "Unicode's stated purpose is to allow a formalized font system to be generated from a list of placement numbers which can articulate *every single written language* on the planet." While the intended scope of the Unicode Standard is indeed to include all significant writing systems, present and past, as well as major collections of symbols, the Unicode Standard is *not* about creating "formalized font systems", whatever that might mean. Mr. Goundry, while critiquing Anglo-centricity in thinking about the Web and the Internet as an "unfortunate flaw in Western attitudes" seems to have made the mistake of confusing glyph and character -- an unfortunate flaw in Eastern attitudes that often attends those focussing exclusively on Han characters.
Immediately thereafter, Mr. Goundry starts making false statements about the architecture of the Unicode Standard, making tyro's mistakes in confusing codespace with the repertoire of encoded characters. In fact the codespace of the Unicode Standard contains 1,114,112 code points -- positions where characters can be encoded. The number he then cites, 49,194, was the number of standardized, encoded characters in the Unicode Standard, Version 3.0; that number has (as he notes below) risen to 94,140 standardized, encoded characters in the *current* version of the Unicode Standard, i.e., Version 3.1. After taking into account code points set aside for private use characters, there are still 882,373 code points unassigned but available for future encoding of characters as needed for writing systems as yet unencoded or for the extension of sets such as the Han characters.
*Even if* Mr. Goundry's calculation of 170,000 characters needed for China, Taiwan, Japan, and Korea were accurate, the Unicode Standard could accommodate that number of characters easily. (Note that it already includes 70,207 unified Han ideographs.) However, Mr. Goundry apparently has no understanding of the implications or history of Han unification as it applies to the Unicode Standard (and ISO/IEC 10646). Furthermore, he makes a completely false assertion when he states that Mainland China, Taiwan, Korea, and Japan "were not invited to the initial party."
Starting with the second problem first, a perusal of the Han Unification History, Appendix A of the Unicode Standard, Version 3.0, will show just how utterly false Mr. Goundry's implication that the Asian countries were left out of the consideration of encoding of Han characters in the Unicode Standard is. Appendix A is available online, so there really is no valid research excuse for not having considered it before haring off to invent nonexistent history about the project, even if Mr. Goundry didn't have a copy of the standard sitting on his desk. See:
http://www.unicode.org/unicode/uni2book/appA.pdf
The "historical" discussion which follows in Mr. Goundry's account, starting with "The reaction was predictable..." is nothing less than fantasy history that has nothing to do with the actual involvement of the standardization bodies of China, Japan, Korea, Taiwan, Hong Kong, Singapore, Vietnam, and the United States in Han character encoding in 10646 and the Unicode Standard over the last 11 years.
Furthermore, Mr. Goundry's assertions about the numbers of characters to be encoded show a complete misunderstanding of the basics of Han unification for character encoding. The principles of Han unification were developed on the model of the main *Japanese* national character encoding, and were fully assented to by the Chinese, Korean, and other national bodies involved. So assertions such as "they [Taiwan] could not use the same number [for their 50,000 characters] as those assigned over to the Communists on the Mainland" is not only false but also scurrilously misrepresents the actual cooperation that took place among all the participants in the process.
Your (Mr. Carroll's) editorial observation that "It is only when you get *all* the nationalities in the same room that the problem becomes manifest," runs afoul of this fantasy history. All the nationalities have been participating in the Han unification for over a decade now. The effort is led by China, which has the greatest stakeholding in Han characters, of course, but Japan, Korea, Taiwan and the others are full participants, and their character requirements have *not* been neglected.
And your assertion that many Westerners have a "tendency .. to dismiss older Oriental characters as 'classic,'" is also a fantasy that has nothing to do with the reality of the encoding in the Unicode Standard. If you would bother to refer to the documentation for the Unicode Standard, Version 3.1, you would find that among the sources exhaustively consulted for inclusion in the Unicode Standard are the KangXi dictionary (cited by Mr. Goundry), but also Hanyu Da Zidian, Ci Yuan, Ci Hai, the Chinese Encyclopedia, and the Siku Quanshu. Those are *the* major references for Classical Chinese -- the Siku Quanshu *is* the Classical canon, a massive collection of Classical Chinese works which is now available on CDROM using Unicode. In fact, the company making it available is led by the same man who represents the Chinese national standards body for character encoding and who chairs the Ideographic Rapporteur Group (the international group that assists the ISO working group in preparing the Han character encoding for 10646 and the Unicode Standard).
Mr. Goundry's argument for "Why Unicode 3.1 Does Not Solve the Problem" is merely that "[94,140 characters] still falls woefully short of the 170,000+ characters needed"-- and is just bogus. First of all the number 170,000 is pulled out of the air by considering Chinese, Japanese, and Korean repertoires *without* taking Han unification into account. In fact, many *more* than 170,000 candidate characters were considered by the IRG for encoding -- see the lists of sources in the standard itself. The 70,207 unified Han ideographs (and 832 CJK compatibility ideographs) already in the Unicode Standard more than cover the kinds of national sources Mr. Goundry is talking about.
Next Mr. Goundry commits an error in misunderstanding the architecture of the Unicode Standard, claiming that "two *separate* 16 bit blocks do not solve the problem at all." That is not how the Unicode Standard is built. Mr. Goundry claims that "18 bits wide" would be enough -- but in fact, the Unicode Standard codespace is 21 bits wide (see the numbers cited above). So this argument just falls to pieces.
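The codespace figures quoted above can be checked with a few lines of arithmetic. This is only an illustrative sketch; the variable names are made up for the example, and the only inputs are the 17 planes of 65,536 code points and the top code point 0x10FFFF.

```python
# Codespace arithmetic for the figures quoted in the letter.
# All names here are illustrative, not taken from the Unicode Standard.
PLANES = 17                       # planes 0 through 16
CODE_POINTS_PER_PLANE = 0x10000   # 65,536 positions per plane
MAX_CODE_POINT = 0x10FFFF         # highest code point in the codespace

codespace_size = PLANES * CODE_POINTS_PER_PLANE
bits_needed = MAX_CODE_POINT.bit_length()
capacity_18_bits = 2 ** 18        # what an "18 bits wide" scheme could hold

print(codespace_size)     # total code points in the codespace
print(bits_needed)        # width needed to address the top code point
print(capacity_18_bits)   # positions available in 18 bits
```

An 18-bit scheme would cover the claimed 170,000 characters, but the actual codespace is wider still, so the "two separate 16 bit blocks" objection does not apply.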
The next section on "The Political Significance Of This Expressed In Western Terms" is a complete farce based on false premises. I can only conclude that the aim of this rhetoric is to convince some ignorant Westerners who don't actually know anything about East Asian writing systems -- or the Unicode Standard, for that matter -- that what is going on is comparable to leaving out five or six letters of the Latin alphabet or forcing "the French ... to use the German alphabet". Oh my!
In fact, nothing of the kind is going on, and these are completely misleading metaphors.
The problem of URL encodings for the Web is a significant problem, but it is not a problem *created* by the Unicode Standard. It is a problem which is being actively worked on by the IETF currently, and it is quite likely that the Unicode Standard will be a significant part of the *solution* to the problem, enabling worldwide interoperability, rather than obstructing it.
And it isn't clear where Mr. Goundry comes up with asides about "Ascii-dependent browsers". I would counter that Mr. Goundry is naive if he hasn't examined recently the internationalized capabilities of major browsers such as Internet Explorer -- which themselves depend on the Unicode Standard.
Mr. Goundry's conclusion then presents a muddled summary of Unicode encoding forms, completely missing the point that UTF-8, UTF-16, and UTF-32 are each completely interoperable encoding forms, each of which can express the entire range of the Unicode Standard. It is incorrect to state that "Unicode 3.1 has increased the complexity of UCS-2." The architecture of the Unicode Standard has included UTF-16 (not UCS-2) since the publication of Unicode 2.0 in 1996; Unicode 3.1 merely started the process of standardizing characters beyond the Basic Multilingual Plane.
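To make the point about interoperable encoding forms concrete, here is a small sketch using Python's built-in codecs. The sample character U+20000 (the first code point of CJK Unified Ideographs Extension B, which Unicode 3.1 added beyond the Basic Multilingual Plane) is chosen purely for illustration.

```python
# One supplementary-plane character, encoded in three encoding forms.
# U+20000 is used only as an example of a character outside the BMP.
ch = "\U00020000"

# Big-endian variants are used so that no byte order mark is emitted.
encoded = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

for name, data in encoded.items():
    # Every form decodes back to the same single character.
    round_trip = data.decode(name)
    print(name, data.hex(), round_trip == ch)
```

The byte sequences differ, but each one carries the same code point, so converting between the forms loses nothing.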
And if Mr. Goundry (or anyone else) dislikes the architectural complexity of UTF-16, UTF-32 is *precisely* the kind of flat encoding that he seems to imply would be preferable because it would not "exacerbate the complexity of font mapping".
In sum, I see no point in Mr. Goundry's FUD-mongering about the Unicode Standard and East Asian writing systems.
Finally, the editorial conclusion, to wit, "Hastings [has] been experimenting with workarounds, which we believe can be language- and device-compatible for all nationalities," leads me to believe that there may be a hidden agenda for Hastings in posting this piece of so-called research about Unicode. Post a seemingly well-researched white paper with a scary headline about how something doesn't work, convince some ignorant souls that they have a "problem" that Unicode doesn't address and which is "politically explosive", and then turn around and sell them consulting and vaporware to "fix" their problem. Uh-huh. Well, I'm not buying it.
--Ken Whistler, B.A. (Chinese), Ph.D. (Linguistics),
Technical Director, Unicode, Inc.
Co-Editor, The Unicode Standard, Version 3.0
--
--
A member of the first GPL-ed software project in my country
Please mod this one up some more. Please? TIA.
If the guy who wrote this article really cares so much about this archaic heritage, let him return to writing with the original (fan-ti) characters, okay?
Societies change. People change. "yi" is inevitable.
This should be obvious to anyone who has ever looked at a unicode chart or has had to click "Cancel" when asked to install character support for any of the myriad languages that need language packs to be displayed in Windows. Ok, so they built a way to theoretically support all of these characters. This does not mean that I can read Japanese, however, and making it possible to see it in my browser will not change that fact, nor will it make Google searchable in Japanese, cause IRC to support katakana or hiragana characters (and just freaking forget kanji unless you want to chat with a graphics tablet). Unicode has purposes (besides making it easier to hack web servers, that is), but the hopes and dreams built around it are a classic case of throwing tech at a social barrier to try and make it go away.
For your security, this post has been encrypted with ROT-13, twice.
They made it work for Java; I'm sure an interpreter for HTML or ASP could be worked out using Unicode.
Download once, read anywhere
Slashdot Hypocrisy at work?
The author of that article did something fundamentally wrong (equated Unicode with UCS-2) and then proceeded with perfect logic to produce garbage output. As others have pointed out, UCS-2 is just an _ENCODING_ of Unicode, and not even the most general one. UTF-32 can handle 4 BILLION code points. Even the more common UTF-8 can handle over a million. Now if the author of that article could see beyond his anti-western bias, he would have learned that people who work routinely with Unicode addressed his underinformed problem years ago.
If you think even every human language when put together needs more than 4 billion code points, you live in a different universe than I do.
Don't think so.
The Unicode standard supports surrogates, which are pairs of 16-bit code units. These pairs define about an additional 1 million code points within the standard. A "code point" is a unique value for some character.
There is plenty of room in the Unicode space for all the characters.
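A rough sketch of how the surrogate mechanism works, for anyone who wants to see the arithmetic. The function names are invented for this example; the constants (0xD800, 0xDC00, 0x10000) are the standard UTF-16 ones.

```python
# Sketch of UTF-16 surrogate pair arithmetic.
# Function names are hypothetical, chosen only for this example.

def to_surrogate_pair(code_point):
    """Split a code point above the BMP into (high, low) 16-bit surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF use surrogates")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# 1,024 high surrogates times 1,024 low surrogates.
extra_code_points = 1024 * 1024

high, low = to_surrogate_pair(0x20000)
print(hex(high), hex(low), hex(from_surrogate_pair(high, low)))
print(extra_code_points)
```

Each half carries 10 bits, which is where the "about an additional 1 million" figure comes from.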
Spelling and pronunciation not perfect, but it means "A star shines upon the hour of our meeting!" - rough translation of this Quenya Elvish phrase which is a derivative of the Tengwar elven language built by JRR Tolkien. And yes, I have actually used it with close friends before.
Naturally you'd have to do something about homonyms (I'll sounds just like aisle, anyway). Probably best to just work around them.
cryptochrome
time to get ill
---If you can't trust a nerd, who can you trust?
For crying out loud, somebody tries to do something nice for somebody and they come back and accuse them of cultural chauvinism. The powers that be didn't have to develop Unicode or UCS at all. They only developed it because the proliferation of language protocols was making the internet difficult to use for foreign languages and multinational businesses in general.
And besides which, the point of the article is moot. As this article states:
ISO 10646 defines formally a 31-bit character set. However, of this huge code space, so far characters have been assigned only to the first 65534 positions (0x0000 to 0xFFFD). This 16-bit subset of UCS is called the Basic Multilingual Plane (BMP) or Plane 0. The characters that are expected to be encoded outside the 16-bit BMP belong all to rather exotic scripts (e.g., Hieroglyphs) that are only used by specialists for historic and scientific purposes. Current plans suggest that there will never be characters assigned outside the 21-bit code space from 0x000000 to 0x10FFFF, which covers a bit over one million potential future characters.
The italics and bold are mine. The 16 bit system was not meant to be completely comprehensive - it was meant to be useful for everyday use. Which, since it covers the characters literate people are expected to know in these systems, it does. The rest of the characters are academic (literally). If these characters are so important why don't they expect all of their own countrymen to know them?
The proprietors of the internet could have happily stuck with the regular 8-bit Roman alphabet system forever (the internet being an American military invention in the first place). The roman alphabet was just part of the system. Hell, even a 16-bit code would have covered all script-based writing and scientific/miscellaneous notation systems easily, while leaving codes or a dedicated bit for the eastern pictograph systems to signal an extension of the protocol and letting them work out their own standard amongst themselves. It would have been fun to watch them (particularly Taiwan and China) squabble for dominance over it too. No one is forcing these eastern nations (or any non-roman-alphabet users) to use Unicode or UCS, or the internet or computers for that matter. If they really wanted to, they could come up with their own systems based on their own languages. They just hopped on board and adapted it to their own needs like everyone else because it's a good idea, and it would be way too difficult to build around their own languages. But isn't it funny how every one of these eastern countries (except Japan thanks to hiragana and katakana) adapted the phonic roman alphabet to simplify the teaching of their own languages? With at least 170,000 characters between them, defenders of these languages claim they are a rich cultural heritage and a beautiful illustrated system. You could just as easily say that modern use of these pictograph-based written languages is oppressively difficult, ensuring a lot of time and effort wasted just trying to learn to write at best, and a stratifying system which guarantees high rates of illiteracy at worst. Erosion of these rigid and limited pictographic writing systems in favor of flexible and encompassing phonic ones is no accident or western conspiracy. Just as UCS was developed to make computer communication universal, the adaptation of phonic systems is the tendency to make literacy universal.
cryptochrome
P.S. Some may think that ISO 10646 (aka UCS) is not Unicode, but in fact as that same article points out "They joined their efforts and worked together on creating a single code table. Both projects still exist and publish their respective standards independently, however the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of the Unicode and ISO 10646 standards compatible and they closely coordinate any further extensions. "
---If you can't trust a nerd, who can you trust?
The irony of that message being marked as funny (adapted as it is from Mark Twain) is that after a few seconds to adjust, I had no trouble reading that statement at all.
We tend to forget that there have been a lot of different spelling and notation systems for English. Even today, the British and American methods aren't identical. For all the fun we make and fear we have of the idea that the English (or any other language's) orthographic system should be simplified and made consistent with pronunciation, it is not a bad idea. It would greatly simplify the process of becoming literate and save tons of effort spent trying to learn irregular spellings. Beyond that, applying the same principles to pronunciation, the alphabetic letters (children's difficulty distinguishing b and d is universal), and vocabulary would accomplish the same goals with learning and using language.
cryptochrome
P.S. You forgot to mention dropping that pesky capitalization system. of course half the messages on the net don't bother with it. same thing goes for dealing with contractions, a la dont, wont, ill, and so on.
---If you can't trust a nerd, who can you trust?
In Japanese house has 10 strokes. The problem is that the strokes cannot be mapped to individual "characters" since the individual strokes have no standardized position, direction or size. To encode the strokes would take a lot of information (a bitmap would probably be easier).
No. See the glossary at www.unicode.org - UCS-2 and UCS-4 are encoding forms of the unified character set defined by the ISO/IEC 10646 standards, which now include at least 10646-1 and 10646-2. Unicode is mostly a different name for the ISO/IEC standards, but also includes additional information about the use of the characters.
See my other post below. ISO/IEC 10646 and the Unicode standards define the character sets. UCS-2 and UCS-4 are encodings of those character sets. UTF-7/UTF-8/UTF-16 are transformation formats that allow variable length encodings of the UCS-2 and UCS-4 encodings.
UCS-4 is also quite common, and allows for the new extensions.
UTF-16 is used by those who need to extend their UCS-2 applications, or who mostly need text that works with UCS-2 but want to be prepared for more.
Yes, a lot of things are difficult with Unicode. But if you look at most recent internationalization efforts, unicode is what people use.
Klingon into Unicode? I knew those people were obsessed, but that's just asinine! Fictional languages shouldn't even be considered, where would it end?
"What are we going to do tonight, Bill?"
www.lucernesys.com Horizon: Calendar-based personal finance
First of all, I think the editor (not the author) is right: "We're not in the same room". Therefore, 16-bit should be enough to encode even all the 50,000+ chars of K'ang Hsi dictionary. Moreover, if we try to encode ALL characters in the world, how redundant it would be. Surely Hindi speaking people won't speak Chinese and Hindi at the same time.
Moreover, we have "Content Language" and "language" tag in HTML, don't we? If we ever want to encode two or more different languages, we can simply include these tags and be done with it. The browser can then pick the appropriate fonts and voila!
Of the claimed 170,000 characters from the Orient, many can be unified since they are the same (in Japanese Kanji, Simplified, and Traditional Chinese). Simplified and Traditional Chinese share a lot of similarities. Even the simplified writings of a particular character often look nearly the same as the traditional one. Thus, the encoding for these two can be unified; only the font bitmap is different. Moreover, it won't be logical to use both simplified and traditional characters in the same article (except if they are exactly the same). So, these can save 50,000 characters.
Japanese kanji also share a lot of similarities with both Traditional and Simplified Chinese (more with traditional than simplified). So, the encoding can be unified too. Save another thousand characters.
--
Error 500: Internal sig error
Note to moderators: this is not flame bait, it's funny!
Maybe if people didn't try to get character sets like Klingon, Cirth and Tengwar added into unicode we wouldn't have this problem!
Help find a cure for cancer!
You mean that song with words written by an English lawyer, using the tune from Londonderry Air, and marketed most successfully in the United States?
It is an Irish style song, not an Irish song.
--
:) Oh yeah! The king of wacky, terse, symbolic programming. You've gotta love it.
--
And we must not forget about hieroglyphics. Unicode certainly has forgotten about them. That would be so cool to write perl code with little cats, birds, ankhs, and various other squiggles.
--
Right. Check your own faq:
in UCS, up to 6-byte long UTF-8 sequences are possible to represent characters up to U-7FFFFFFF
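As a rough illustration of that FAQ line, the original UCS definition of UTF-8 assigns sequence lengths by code point range roughly as sketched below. The function name is made up for this example; note that UTF-8 as restricted to the Unicode codespace (up to U+10FFFF) never needs more than four bytes.

```python
# Sequence length under the original (31-bit UCS) definition of UTF-8.
# utf8_length_ucs is a hypothetical helper name for this sketch.
_UPPER_LIMITS = [
    (0x7F, 1),        # 7 bits
    (0x7FF, 2),       # 11 bits
    (0xFFFF, 3),      # 16 bits
    (0x1FFFFF, 4),    # 21 bits
    (0x3FFFFFF, 5),   # 26 bits
    (0x7FFFFFFF, 6),  # 31 bits
]


def utf8_length_ucs(code_point):
    """Return how many bytes the original UTF-8 scheme uses for a value."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    for upper, length in _UPPER_LIMITS:
        if code_point <= upper:
            return length
    raise ValueError("value does not fit in 31 bits")


for cp in (0x41, 0x20AC, 0x10FFFF, 0x7FFFFFFF):
    print(hex(cp), utf8_length_ucs(cp))
```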
Reboot macht Frei.
50k? The numbers I got from the Japanese Ministry of Education were closer to 900.
Reboot macht Frei.
A better example would be the ampersand character (&). I can think of several ways to write that character, but I challenge anyone to come up with a sentence where changing one presentation form of the ampersand for another changes the meaning of the sentence.
How about:
"To delimit a path, *nix uses a slash, whereas MS* uses a backslash; if you get these confused it helps to remember that ampersand is a rounded "E" with a slash through it."
Contrived, I will admit, but I think it answers your challenge.
--MarkusQ
Having worked on internationalisation at several companies, and now working in China, I have found the Unicode standard to be a very good thing. First of all, it might not be perfect, but it works very well. It is a lot easier to have one table instead of the different encodings there used to be. It makes life as a software developer a LOT easier. Companies can easily tell their developers to write in Unicode instead of ASCII (C wchar_t works fine), and if they ever want to write a multilanguage version then it is SO much easier. This will increase the number of translated applications in the future. Second, Unicode is already in widespread use. Every new OS out these days uses Unicode internally. Win2k and Symbian EPOC are good examples. The latter is also the reason why all mobile phones work perfectly with Chinese characters. The IETF is also working on standardisation of foreign characters in DNS names and also uses Unicode for that (note: it does not restrict itself to Unicode), as mentioned in http://search.ietf.org/internet-drafts/draft-ietf-idn-requirements-07.txt
Conclusion
Thanks to the unicode standard for making the life of a software developer a LOT easier!
Just because English is the most popular language on the Internet at the moment, that doesn't mean that either other languages were not used or that other languages might not take over that role in the future. If, for example, the growth of Internet accessibility in China keeps up at that rate, Chinese will be language #1 in the Internet by 2007, especially since Chinese will be read and understood by Koreans and Japanese as well.
There is absolutely no reason to panic.
The idea behind Unicode is to have a uniform encoding for all the world's scripts, not for all the world's languages. The necessity of this is evident for anyone who has experience with the insufficiencies of the individual codepage systems (Windows CPxxx, ISO 8859-x, ISCII etc.) currently in use. Have you ever tried to send an Arabic e-mail through a non-Arabic mailserver or run a program with German character support on a codepage 450 windows? Unicode is designed to make programs and data interoperable regardless of either's language encoding.
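A tiny sketch of the codepage problem, using Python's built-in codecs. The byte value 0xE4 and the three code pages are picked arbitrarily for illustration.

```python
# The same byte means a different letter under each legacy code page.
# The byte 0xE4 and the three codecs are arbitrary illustrative choices.
raw = b"\xe4"

legacy = {
    name: raw.decode(name)
    for name in ("latin-1", "cp1251", "iso8859_7")
}
for name, text in legacy.items():
    print(name, text, hex(ord(text)))

# In Unicode each of those letters has its own code point, so all three
# can sit in one string and survive a round trip through a single encoding.
mixed = "".join(legacy.values())
assert mixed.encode("utf-8").decode("utf-8") == mixed
```

Without knowing which code page the sender used, the receiver cannot tell which of the three letters was meant; with Unicode the question does not arise.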
Just because you don't know Japanese it doesn't make the rendering of Japanese pointless. Just because you don't have a clue how a Chinese or Japanese Kanji input system works doesn't render the idea of being able to chat in IRC using Japanese characters entirely pointless.
There is absolutely no reason to panic.
Why not use unicode for everyday use, and a PDF'ish format that could have every character of said language for special purposes, i.e. historical documents.
On the one side, you have a country with a pretty but otherwise messy, outdated, and unwieldy writing system, unwilling to move to a more convenient alphabetic writing system (rightly or wrongly). On the other side, you have a large collection of western corporations that desperately want to sell lots of equipment there without the cost of doing specialized software development. This ought to be an interesting fight.
The best solution, in my opinion, rather than to come up with a global standard, is to let different countries work out their own coding schemes and then come up with a way of encapsulating those schemes in an 8bit code. That way, people who don't need Chinese or Japanese don't have to pay for the overhead resulting from the complexity of those writing systems.
Mixed language editors would continue to be the specialty software they are and have to come up with their own representations.
Of course, US software vendors hate that because they would have to spend a lot more money on customizing their software to particular target markets; they can't just translate a file of message strings. They might even have serious competition from local vendors.
Well, if we want to have the "standard" language be "Chinese", you'll first have to decide which one you want.
China has 7 main dialects, according to my Chinese language class teacher. People in Shanghai speak a language that can almost be considered completely different than the one in Beijing. They use the same characters for writing, but use them to mean different things. At the very least, you have Mandarin and Cantonese.
Also, while Chinese is a grammatically simple language (no conjugation, no pluralisation, etc.), it is less fun to write, because there is no alphabet. Yes, there is a different character for every word. Yes, there is a rhyme/reason to the characters, but that doesn't make it all that much less difficult to learn all of them. Oh, and you have to decide whether you want simplified or traditional Chinese characters to be the "standard", too.
Finally, while the population of China is certainly the largest in the world, do they really have the most people _online_? I have no statistics, I'm actually curious.
Sotto la panca, la capra crepa
sopra la panca, la capra campa
I am Chinese. Chinese has only one character set, and everyone can speak Mandarin to communicate with each other. Of course, they can also speak their local native languages, which use the same character set as Mandarin!
enjoy!
Phonetic writing is one of the greatest inventions of mankind. All a speaker needs to be literate is to learn the mapping between sounds and letters. Could anything be easier?
No offense, but your post reeks of the naive, self-absorbed Western arrogance that the entire world hates. Have you actually LEARNED or even had tiny experience with a non-phonetic written language? There are COMPLETELY different ways of communicating ideas or expressing emotions. Forcing billions of people to convert to your culturally imperialist straitjacket isn't just infeasible, it ignores an entire dimension of humanity.
I agree there is a serious problem of understanding texts written in the "old way". There is a simple solution here, too, i.e., we just translate what's most important to the "new way" and let scholars work on the texts that don't get translated. Before anyone gets too hot here, the situation is not that much different than translating literature from one language to another. It is too much work to translate everything that is written in English into French, so one focuses on the texts that are important enough for translation.
Um... Ever read a great piece of literature in one language, then read the translation? It's NEVER the same. Languages are much more than communication -- they embody entire modes of thinking, of cultural assumptions. Any modern linguist (Noam Chomsky comes to mind) could tell you that. In the case of, say, poetry, you will always lose the meter and rhythm. In the case of, say, political works, you will always substitute in words that have the wrong connotations or sound funny in the new language. Translators ALWAYS struggle with these issues.
While the author of this original article may be misinformed on the particulars of Unicode, or may be flawed in asserting Unicode should accommodate every single character system that exists, it's the overall message from your post that is most misinformed -- that the language YOU use is the best, and the rest of the world should convert to YOUR way of thinking and communicating without your even having to try to understand them. Anyone who actually has studied the issue would not reach the sadly shallow conclusion you have.
I realize this is way too utopian. We Americans can't even move to metric, much less anything more "radical". I just needed to respond to the whining.
Who's whining? People who point out the horrendous exclusion and bias in the Western-dominated technological conferences that dictate standards of communication, or people like you who don't understand anything past what you grew up with?
This is nothing like the metric system. Ditching the inch and pound is nothing like ditching the bedrock texts of your culture.
Personally, I do not fully agree with the article's author. Of course ancient documents don't need to be represented in the character set intended for everyday use by businesses or other entities. However, the attitude expressed in your post is what causes so much resentment and pain in the first place.
I'm sorry if this post sounds rantish. I'm just ashamed to belong to a community that produces posts like yours.
Language is about what people think, how people live, and at root, their culture. Language is not about the undefinable concept of "efficiency."
-Brendan
Actually, if you wanted to, you could write English/German/French/Spanish using Chinese! It would actually be fairly simple, one Chinese character == one English word. Just have the display program figure it out, ie translate the Unicode Chinese into the English/French/etc.. Achieve instant 50% or greater data compression. It's the perfect compression solution for us bandwidth-sucking Westerners! No verb tense or plurals? Don't need it! It's all fluff anyway! I'm only 1/4 joking...
Much of this sounds like the old evil empire Microsoft conspiracy theory out to squash the good cowboy Linux true blue we want to save the world from evil story.
What this *really* has to do with Unicode isn't clear. The major commercial Unix vendors have all made significant commitments to Unicode support, and even the Linux internationalization community is busy adding Unicode support to Linux. Apparently it doesn't matter to you that Sun, HP, Compaq, NCR, and major Linux I18N players participate in Unicode development, too. It isn't an either/or black and white issue. It isn't some gigantic conspiracy to use a bad standard to prevent the good guys from developing a good standard. But I guess you can believe whatever you want.
As for multilingual text and statelessness, was kann ich Ihnen sagen? Comment pourrai-je réparer ma bêtise? Oops! Sorry, I guess I couldn't do that in Unicode, could I, or Code Page 1252, or Latin-1 for that matter?
Stateful language processing has its place, in multilingual text or monolingual text, even. But how you construct that stateful processing is not dictated to you by Unicode, any more than it is dictated to you by having Latin-1 implementations on Unixes. XML defaults to Unicode, but you can use it with any character set you choose to mark. And if you use it with Unicode, you can span mark any statefulness you want into it.
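To make the markup point concrete, here is a minimal sketch in Python, assuming the standard-library xml.etree.ElementTree parser. The element names (p, span) and the sample sentences are illustrative choices, not anything prescribed by XML or Unicode; only the xml:lang attribute and the encoding declaration come from the XML specification itself.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined xml: prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Language state lives in the markup, not in the character encoding.
doc = (
    '<p xml:lang="en">What can I say? '
    '<span xml:lang="de">Was kann ich Ihnen sagen?</span> '
    '<span xml:lang="fr">Comment pourrai-je réparer ma bêtise?</span>'
    '</p>'
)
root = ET.fromstring(doc)
tagged = [(el.get(XML_LANG), el.text) for el in root.iter("span")]
for lang, text in tagged:
    print(lang, text)

# The same kind of document can declare a non-Unicode character set instead.
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><p>b\xeatise</p>'
latin1_text = ET.fromstring(latin1_doc).text
print(latin1_text)
```

The spans carry whatever state the author wants to mark, and the parser hands back the same text whether the bytes on disk were Unicode or Latin-1.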
But in any case, feel free to go off and invent your systems of language and charset tagged substrings handled "transparently as sequences of bytes" and come back to show us all when you have your better mousetrap working.
And to further support your point against Alex, I would like to point out that I have attended nearly every Unicode Technical Committee meeting, since its inception, and to the best of my knowledge, *never* has an interested participant or observer been turned away at the door, whether they were formally a member or not.
Also, unlike ISO, which restricts primary membership to accredited national bodies (but does, however, allow expert participation in the working groups, regardless), the Unicode Consortium memberships are open to anyone who wants to pay the dues. In the history of the Consortium, there have been cases of an individual person forking out for a full membership because they wanted voting participation on a particular issue, and the Consortium has not only commercial corporations as members, but also national governments, state governments, libraries, academic institutions, whatever. Anyone who wishes to participate is welcome.
Anybody in the world can and does join the open discussion list, unicode@unicode.org, hosted by the Consortium, and is free to discuss or browbeat on whatever Unicode-related topic concerns them.
So unless those who claim that Unicode is a closed cabal mean by that that the Consortium should be subsidizing free memberships (it is a registered non-profit corporation) or should be holding its deliberations on public-access TV, I fail to see what the knock is on the Consortium.
> Unicode is made under the slogan of total
> statelessness of text, so while applications'
> file formats may allow this, arbitrary
> substring in a text can't.
You keep harping on this "statelessness of text" issue as if this is something that Unicode caused that is destroying the capabilities for decent multilingual processing. But in fact, the same assumptions, as regards text representation, underlie ISO 8859-1 (Latin-1), Code Page 1252, or nearly every other character set in widespread use in the world today. You can use Latin-1 to mix English, French, German, Spanish and any other of dozens of languages, but you cannot do tagging of charset or language in arbitrary substrings of Latin-1 without the use of a higher-level markup language, any more than you can in Unicode.
All character encodings work that way -- except for 2022, which itself is just a framework for implementing switching between the other character encodings in stream, and doesn't have the kind of language tagging for arbitrary substrings you seem to be advocating, anyway.
So what is the basis for the knock on Unicode here?
The PUA is presently used for such things as corporate logos, which are not accepted into Unicode, and for many writing systems which have not been worked out in sufficient detail for formal encoding. This includes real, historical languages, and also Tolkien's Cirth and Tengwar. I hear Klingon is out there too.
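For reference, a small sketch of how software can recognize Private Use Area code points, assuming Python's standard unicodedata module. The three ranges are the private-use ranges defined by the Unicode Standard; the mapping at the end is a purely hypothetical private agreement, invented here for illustration.

```python
import unicodedata

# Private-use ranges: one in the BMP, plus most of planes 15 and 16.
PUA_RANGES = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]

def is_private_use(ch):
    """True if the character has general category Co (private use)."""
    return unicodedata.category(ch) == "Co"

total = sum(hi - lo + 1 for lo, hi in PUA_RANGES)
print(total)

# The standard assigns no name or meaning to these code points.
no_name = unicodedata.name("\ue000", None)

# Hypothetical private agreement between two parties (not part of Unicode).
private_agreement = {"\ue000": "corporate logo (hypothetical)"}
for ch, meaning in private_agreement.items():
    print(f"U+{ord(ch):04X}", is_private_use(ch), meaning)
```

Anything stored this way only means something to parties who share the same agreement, which is the trade-off of using the PUA rather than a formally encoded script.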
So what do you claim that you can't do?
"A knot!" said Alice, ever ready to be useful. "Oh, do let me help to undo it!"