Why Unicode Won't Work on the Internet

Posted by ryuzaki0 on Tuesday June 5, 2001 @03:11AM from the -Linguistic,-Political,-and-Technical-Limitations dept.

We received this interesting submission from N. Carroll: "Unicode, the commercial equivalent of UCS-2 (ISO 10646-1), has been widely assumed to be a comprehensive solution for electronically mapping all the characters of the world's languages, being a 16-bit character definition allowing a theoretical total of over 65,000 characters. However, the complete character sets of the world add up to approximately 170,000 characters. This paper summarizes the political turmoil and technical incompatibilities that are beginning to manifest themselves on the Internet as a consequence of that oversight. (For the more technical: the recently announced Unicode 3.1 won't work either.)" Read the full article.

Unicode Character Set vs Character Encoding by Jordy · 2001-06-05 00:03 · Score: 5

The current permutation of Unicode gives a theoretical maximum of approximately 65,000 characters (actually limited to 49,194 by the standard).
The biggest problem with Unicode is that no one understands what it is. Unicode defines two things, a character set that maps a character into a character code and a number of encoding methods that map a character code into a byte sequence.

ISO 10646, the Universal Character Set, defines a 31 bit character set (2,147,483,648 character codes), not a 16 bit character set. Unicode 3.0's character set corresponds to ISO 10646-1:2000. Unicode 3.1, which was recently released, goes a bit further.

UCS-2, as mentioned by this article, is the same as UTF-16 and is severely limited by its 16 bit implementation. UTF-16 is unfortunately used by Windows and Java, but is rarely used on the web. The article claims UTF-16 can only map 65,000 characters, but using surrogate pairs it can actually map over 1 million characters.
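To make the "over 1 million" figure concrete, here is a rough Python sketch of the surrogate pair arithmetic. The function name `to_surrogate_pair` is made up for illustration; the constants are the standard UTF-16 surrogate ranges.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded with surrogates")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go in the low surrogate
    return high, low


# 1,024 high surrogates times 1,024 low surrogates
SUPPLEMENTARY_CODE_POINTS = 0x400 * 0x400

if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1D11E)  # MUSICAL SYMBOL G CLEF
    print(f"U+1D11E -> {high:04X} {low:04X}")
    print(f"{SUPPLEMENTARY_CODE_POINTS:,} code points reachable via surrogates")
```

That is 1,048,576 code points on top of the 16-bit range, which is where the "over 1 million" comes from.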

Thankfully, there are several other encoding methods for Unicode. UTF-8, which is a variable length encoding most commonly used on the web, allows a mapping of Unicode from U-00000000 to U-7FFFFFFF (all 2^31 character codes). It also has a nice feature of the lower 7 bits being ASCII, so there is no conversion necessary from ASCII to UTF-8.
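A small Python sketch of both properties, the variable length and the ASCII compatibility. The sample characters are arbitrary picks. One caveat: Python's codec implements the later, stricter definition of UTF-8 that stops at U+10FFFF, so the 5 and 6 byte forms needed to reach U-7FFFFFFF do not show up here.

```python
# One character from each UTF-8 length class that Python's codec supports.
SAMPLES = ["A", "\u00e9", "\u3042", "\U0001D11E"]


def utf8_hex(ch):
    """Return the UTF-8 bytes of a single character as spaced hex."""
    return ch.encode("utf-8").hex(" ")


if __name__ == "__main__":
    for ch in SAMPLES:
        print(f"U+{ord(ch):04X} -> {utf8_hex(ch)}")

    # Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
    text = "plain ASCII text"
    print(text.encode("ascii") == text.encode("utf-8"))
```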

UTF-32 or UCS-4 is a 32 bit character encoding used by a number of Unix systems. It's not exactly the most space efficient form (UTF-8 requires roughly 1.1 bytes per character for most Latin languages), but it can handle the entire Unicode character set.
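For a feel of the space difference, a quick Python check. The sample sentence is an arbitrary choice, and the exact ratio obviously depends on how many accented letters a text contains.

```python
def bytes_per_character(text, encoding):
    """Average encoded size per character of `text` in the given encoding."""
    return len(text.encode(encoding)) / len(text)


# 16 characters, one of which (the u-umlaut) is outside ASCII.
SAMPLE = "Unicode f\u00fcr alle"

if __name__ == "__main__":
    # The "-le" codec names are used so no byte order mark is counted.
    for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
        print(encoding, bytes_per_character(SAMPLE, encoding))
```

For this sample UTF-8 comes out just over one byte per character, while UTF-32 is always exactly four.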

A good document on this is available at UTF-8 And Unicode FAQ

--
The world is neither black nor white nor good nor evil, only many shades of CowboyNeal.

Re:UTF-8 should be fine for almost any application by spitzak · 2001-06-05 02:39 · Score: 5

Thanks for some more intelligent discussion about UTF-8.
I might add a few things:
In UTF-8 it is not just NUL and Escape that never appear inside multibyte characters; in fact *no* 7-bit character does (every byte of a multibyte character has the high bit set). This means that *any* program that treats all bytes with the high bit set as a "letter" will work and can parse, hash, match, search, etc. identifiers/words with foreign letters in them!
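A rough Python sketch of that idea: a byte-oriented word splitter that knows nothing about UTF-8 except "high bit set means letter". The names `ASCII_WORD_BYTES` and `split_words` are made up for the example.

```python
import string

# ASCII bytes that count as part of a word or identifier.
ASCII_WORD_BYTES = frozenset(
    (string.ascii_letters + string.digits + "_").encode("ascii")
)


def split_words(data):
    """Split raw bytes into words, treating every byte >= 0x80 as a letter."""
    words = []
    current = bytearray()
    for byte in data:
        if byte >= 0x80 or byte in ASCII_WORD_BYTES:
            current.append(byte)
        elif current:
            words.append(bytes(current))
            current.clear()
    if current:
        words.append(bytes(current))
    return words


if __name__ == "__main__":
    raw = "na\u00efve caf\u00e9 = gr\u00f6\u00dfe_2;".encode("utf-8")
    print([word.decode("utf-8") for word in split_words(raw)])
```

Because ASCII separators never occur inside a multibyte sequence, words are never cut in the middle of a character. The flip side is that non-ASCII punctuation and spaces get treated as letters too.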
In addition, the UTF-8 encoding is just heavy enough that random line noise is very unlikely to match a UTF-8 encoding. If programs treat "illegal" UTF-8 encodings as individual bytes in the ISO-8859-1 character set, they will display virtually all existing ASCII/ISO-8859-1 documents unchanged!
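One way to sketch this fallback in Python is a custom decode error handler. `codecs.register_error` is a real standard library call; the handler name `latin1_fallback` and the helper `decode_lenient` are invented for this example.

```python
import codecs


def _latin1_fallback(error):
    """Decode the offending bytes as ISO-8859-1 and resume after them."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad_bytes = error.object[error.start:error.end]
    return bad_bytes.decode("latin-1"), error.end


# "latin1_fallback" is a made-up handler name, not a built-in one.
codecs.register_error("latin1_fallback", _latin1_fallback)


def decode_lenient(data):
    """Decode as UTF-8, reading each illegal byte as an ISO-8859-1 character."""
    return data.decode("utf-8", errors="latin1_fallback")


if __name__ == "__main__":
    mixed = "na\u00efve ".encode("utf-8") + "caf\u00e9 ok".encode("latin-1")
    print(decode_lenient(mixed))
```

Valid UTF-8 passes through untouched, and an old ISO-8859-1 document only gets misread if its accented bytes happen to line up as a legal UTF-8 sequence.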
The end result is that it should be easy to switch all interfaces (not just over the network, but inside programs and to libraries) to UTF-8. This will vastly simplify the handling of Unicode because there will be no need for ASCII backward-compatibility interfaces. We could also eliminate all the "locale" crap and make ctype.h the simple thing it once was.
Even Arabic will encode smaller in UTF-8 than UTF-16. This is due to the fact that very common characters (not just English, but things like space and newline) are only one byte.
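A quick Python check of the Arabic case, using a short greeting as an arbitrary sample. The Arabic letters themselves take two bytes in both encodings, so the whole difference comes from the one-byte characters.

```python
def size_breakdown(text):
    """Compare the UTF-8 and UTF-16 sizes of `text`, counting ASCII characters."""
    return {
        "characters": len(text),
        "ascii_characters": sum(1 for ch in text if ord(ch) < 0x80),
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-le" is used so the byte order mark is not counted.
        "utf-16": len(text.encode("utf-16-le")),
    }


# Arabic letters separated by an ordinary ASCII space.
SAMPLE = "السلام عليكم"

if __name__ == "__main__":
    for key, value in size_breakdown(SAMPLE).items():
        print(key, value)
```

UTF-8 saves exactly one byte per space, newline, digit or ASCII punctuation mark, which adds up quickly in real documents and even more so in markup.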

Some errors by BJH · 2001-06-05 00:02 · Score: 5

Hiragana, which is somewhat cursive, can be used to augment Kanji - in fact, everything in Kanji can be written in Hiragana. Katakana, which is much more fluid in appearance than is Hiragana, is used to write any word which does not have its roots in Kanji, such as the many foreign words and ideas which have drifted into general use over the centuries.

In actual fact, Katakana is much more angular than Hiragana - definitely not "fluid" in appearance. Furthermore, anything that can be written in Kanji can be written (phonetically) in either Hiragana or Katakana - the use of Katakana for foreign words is nothing more than custom, not a limitation of the characters.

Thus it can be said that Hiragana can form pictures but Katakana can only form sounds...

That should probably read "Kanji can form pictures but Hiragana/Katakana can only form sounds..."

Romaji is used to try and keep the whole written thing from getting out of control, with most Western concepts and necessary words being introduced into the language through this mechanism.

Bollocks. Romaji is hardly ever used (except for advertisements, and then only rarely, or textbooks for foreigners). It's definitely not the main conduit for Western ideas.

After a time these words (even though they will still maintain their "Roman" form for awhile longer) will become unrecognizable to the people they were originally borrowed from, such as the phrase, "Personal Computer," which is now "PersaCom" in Japan.

Again, this is incorrect. Words don't *have* a Roman form in everyday use; sure, you can express them in Romaji but no-one ever does. As for "personal computer", the correct Romanization is "pasokon", not "PersaCom". (Where did he get that from?!)

The rest of the 1,950 have to be memorized fully by the time of graduation from high school in Grade Twelve. Please remember that this total is only the legal minimum required threshold to be considered literate. And this is to be absorbed completely, along with a back-breaking load of other subjects.

Ummm... that's actually not too hard. I (along with everyone else at my language school) memorized more than 1300 Kanji in less than a year... and none of us were Japanese. I know it must seem like an impossible total to people used to ASCII, but there are many common points between Kanji that simplify the learning process greatly.

That said, I've long been against the current Unicode "standard", as have many technical people in Japan, for a number of reasons. Some of those are:

- No standard conversion tables from existing character sets (SJIS, EUC-JP, ISO-2022-JP).
Several conversion tables do exist, but there are minor differences between them that make it impossible to go from, say, SJIS to Unicode and back to SJIS without the possibility of changing the characters used.

- A draconian unification of CJK characters.
The Unicode Consortium basically forced the standards bodies in China, Japan and Korea to unify certain similar Kanji onto single code points, which doesn't allow for cases where, say, Japanese actually has two or three distinctive writings that are used in different situations.

- The ugly "extensions".
Unicode has been effectively ruined as a method of data exchange by its treatment of characters not in the 60,000-character basic standard.

I could go on, but I should get some sleep...

Re:You bring up a good point by gleam · 2001-06-05 00:17 · Score: 5

The writing system with the smallest alphabet that is in current use is Hawaiian, with 12 letters. (aeiou hklmnpw) source

A good source for your obscure questions is, as always, the Straight Dope, which answers the "Chinese Typewriter" question here.

Regards,
gleam

--
this .sig is not a .sig.