kwhistler · Slashdot Mirror

Sure it is on Why Unicode Will Work On The Internet · 2001-06-09 07:02 · Score: 1

Granted that a universal character encoding with the scope of Unicode is very complex, and a *full* implementation that does justice to all parts of it is beyond the capability of all but a few large software companies.

But there are several mitigating points you may be missing.

First, conformance to the Unicode Standard does not mean you have to actively support the repertoire of all the characters. It is perfectly compliant to just pay attention, say, to the Ethiopic characters, and do the best-in-the-world Ethiopic word processor or whatever, while simply passing through and essentially ignoring all the rest of the characters. In this sense, the Unicode Standard is not inhibiting local best-of-breed development, but rather enabling it without diversion down the path of having to start off with local character encoding standards (often 8-bit font hacks) that don't, in turn, interoperate with anybody *else's* software.
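
A minimal sketch of the "handle one script, pass the rest through" idea, assuming Python 3 and the Ethiopic block range U+1200..U+137F. The names `is_ethiopic`, `process`, and `mark` are hypothetical and only illustrate the shape of such code; a real word processor would do far more.

```python
# Ethiopic block in the Unicode code charts (assumed range for this sketch).
ETHIOPIC_START, ETHIOPIC_END = 0x1200, 0x137F


def is_ethiopic(ch: str) -> bool:
    """Return True if the character falls in the Ethiopic block."""
    return ETHIOPIC_START <= ord(ch) <= ETHIOPIC_END


def process(text: str, handle_ethiopic) -> str:
    """Apply handle_ethiopic to Ethiopic characters only.

    Every other character is passed through untouched, which is all
    that is needed for characters the application does not care about.
    """
    return "".join(handle_ethiopic(ch) if is_ethiopic(ch) else ch for ch in text)


def mark(ch: str) -> str:
    """Hypothetical script-specific handler: bracket the character."""
    return f"[{ch}]"


sample = "Hello ሰላም 世界"
result = process(sample, mark)
print(result)
```

The Latin and Han characters in `sample` survive unchanged even though the code knows nothing about them.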

Second, most serious software development these days is modular anyway. You depend on other people to provide generic platform services, or to develop general libraries of routines that you turn around and use. Much Unicode development falls into that category. If Windows (or some other platform) does a good job of implementing Unicode, other developers can turn around and make use of the APIs those platforms provide to build applications on top of them. Or you call into libraries that specialize in these issues. Nobody much goes around building their own graphics routines nowadays, for example -- you depend on the platforms or specialized libraries to provide such services and get on with concerns about the rest of your application.

> Why should programmers for any one market have
> to deal with the complexities of the other
> writing systems?

Well, in principle they should not, unless their concern is explicitly with rendering and writing system support.

What you may be missing here is that the alternative to Unicode is having to deal with the complexities of character encoding support for hundreds of existing character encodings. That is far more of a generic burden on application development than having a *single* encoding (you usually pick either UTF-8 or UTF-16 and stick with it) for the character handling. There is a reason why Java just defined its strings from the beginning in terms of Unicode, and why that model took off so quickly.
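
To illustrate the single-encoding point, here is a rough sketch in Python 3, whose `str` type is Unicode throughout. Legacy encodings are dealt with once, at the input boundary, and everything after that works on one text type. The helper name `to_text` and the sample inputs are assumptions made up for the example.

```python
def to_text(raw: bytes, source_encoding: str) -> str:
    """Decode bytes from a known legacy encoding into Unicode text."""
    return raw.decode(source_encoding)


# Three inputs in three unrelated legacy encodings (sample data for this sketch).
legacy_inputs = [
    ("koi8_r", "Привет".encode("koi8_r")),
    ("shift_jis", "こんにちは".encode("shift_jis")),
    ("latin_1", "Grüße".encode("latin_1")),
]

# After the boundary, the application only ever sees Unicode strings.
texts = [to_text(raw, enc) for enc, raw in legacy_inputs]
combined = " / ".join(texts)

# Pick one encoding form for output and stick with it.
utf8_out = combined.encode("utf-8")
print(combined)
```

None of the three legacy encodings could hold `combined` on its own; the Unicode string holds all of it without any switching logic in the application.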

And who specifies the languages, pray tell? on Why Unicode Will Work On The Internet · 2001-06-09 06:39 · Score: 1

Sorry, but this is just a goofy idea.

Beyond the problem that nobody yet has a foolproof, standardizable listing of the 6000+ languages in current use on the planet, let alone the thousands more historical languages and all the dialects, having a character encoding that requires language identification on a character-by-character basis couldn't work in practice. How do you deal with borrowed vocabulary? How does a user input this stuff -- maybe they don't even know? How do you deal with conversion of text that isn't identified this way? And on and on.

There are good reasons why character sets are built the way they currently are, and why language identification is treated as an issue for markup of text, rather than for character encoding.

Re: Troller on Why Unicode Will Work On The Internet · 2001-06-09 06:22 · Score: 1

It was news to me, too. *hehe* Since I live in Berkeley, California, a well-known hotspot for white supremacist agitating (*rolls eyes*), I guess I'll have to check with my black, Hispanic, and gay neighbors to see where I got this reputation.

It was just a troll, and an anonymous one at that.

--Ken Whistler

Re:Conspiracy Theories and Unicode on Why Unicode Won't Work on the Internet · 2001-06-09 06:04 · Score: 1

> While there is a lot of effort to shoehorn
> Unicode into Unix and Unix software, the actual
> results are beyond miserable, precisely because
> Unicode does not work.

Ah, I understand now. Not that I am going to praise the Unix vendors' support of Unicode as the best and most usable around, but I suggest you try making that claim directly to the Unicode representatives working for Sun, Compaq and others, and see if you can pull it off.

> ... thus getting blessed by Unicode consortium
> as compatible.

Wrong again, Alex. The Unix vendors added Unicode support because they perceived it to be in their commercial interest to support a universal character encoding standard that other vendors and standards were starting to make widespread use of, and which growing numbers of customers started to ask them to support.

The Unicode Consortium doesn't "bless" any vendor, and doesn't have any certification program that anyone needs to pass in order to be declared "compatible". People claim themselves conformant to the standard if they choose, and if their implementation is defective or non-conformant, they get beaten on by disappointed customers, not by the Unicode Consortium.

> UTF-8 can be "supported" in that way even by
> abacus, if that abacus is long enough and
> has at least 8 stones in a row, ...

Well, most of us also took elementary computer science, and learned that any algorithm can be implemented on a Turing machine. So I guess we should go to the NOAA weather modelers, when they run a weather simulation on a supercomputer, and let them know they could use a Turing machine, instead, eh?

UTF-8 on an abacus -- yes, I guess that *is* a strawman that we should all take *real* seriously.

> I have never in my life seen a filename in UTF-8
> outside of Unicoders' demos...

I presume you mean on Unix systems, where for most such systems, choice of UTF-8 for filenames would be problematical because they would run afoul of other parts of the system that don't handle them. Sure, such may be the case.

On the other hand, UTF-8 databases are now running routinely on Unix systems, and they work just fine, thank you.

> and I am Russian myself and have a lot of
> friends that speak Japanese.

Umm. And the relevance of that comment is what?

> So, again, Unix vendors' support of Unicode is
> in fact a lip service, ...

Implying that you think it is a cynically added feature to get a checkmark or a brownie point somewhere, and that they all think it is really doomed to the trashheap of history like OSI. I'm hardly going to take your word for it. I suggest you get some international architects for the Unix vendors to come on list and support your contentions.

Statelessness of text on Why Unicode Won't Work on the Internet · 2001-06-09 05:37 · Score: 1

> Unicode is made under the slogan of total
> statelessness of text, so while applications'
> file formats may allow this, arbitrary
> substring in a text can't.

You keep harping on this "statelessness of text" issue as if this is something that Unicode caused that is destroying the capabilities for decent multilingual processing. But in fact, the same assumptions, as regard text representation, underlie ISO 8859-1 (Latin-1), Code Page 1252, or nearly every other character set in widespread use in the world today. You can use Latin-1 to mix English, French, German, Spanish and any other of dozens of languages, but you cannot do tagging of charset or language in arbitrary substrings of Latin-1 without the use of a higher-level markup language, any more than you can in Unicode.

All character encodings work that way -- except for 2022, which itself is just a framework for implementing switching between the other character encodings in stream, and doesn't have the kind of language tagging for arbitrary substrings you seem to be advocating, anyway.
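
The switching behaviour of 2022 can be seen with Python's `iso2022_jp` codec, used here only as a convenient ISO 2022 profile. This is a small sketch, not a statement about how any particular system uses 2022: the escape sequences select a character set, not a language, and a byte substring cut out without its escape sequence no longer decodes to the same text.

```python
text = "abc あ xyz"
encoded = text.encode("iso2022_jp")

# The stream contains escape sequences that switch character sets:
# ESC $ B selects JIS X 0208, ESC ( B switches back to ASCII.
print(encoded)

# Bytes for the kana alone, with the surrounding escapes stripped off.
kana_with_escapes = "あ".encode("iso2022_jp")
kana_bytes = kana_with_escapes[3:-3]

# Decoded from the initial (ASCII) state, the same bytes mean something else.
misread = kana_bytes.decode("iso2022_jp")
print(repr(misread))
```

The whole stream round-trips, but the meaning of a substring depends on state set up earlier in the stream, and nothing in it identifies a language.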

So what is the basis for the knock on Unicode here?

Unicode closed to participation? on Why Unicode Won't Work on the Internet · 2001-06-09 05:22 · Score: 1

And to further support your point against Alex, I would like to point out that I have attended nearly every Unicode Technical Committee meeting, since its inception, and to the best of my knowledge, *never* has an interested participant or observer been turned away at the door, whether they were formally a member or not.

Also, unlike ISO, which restricts primary membership to accredited national bodies (but does, however, allow expert participation in the working groups, regardless), the Unicode Consortium memberships are open to anyone who wants to pay the dues. In the history of the Consortium, there have been cases of an individual person forking out for a full membership because they wanted voting participation on a particular issue, and the Consortium has not only commercial corporations as members, but also national governments, state governments, libraries, academic institutions, whatever. Anyone who wishes to participate is welcome.

Anybody in the world can and does join the open discussion list, unicode@unicode.org, hosted by the Consortium, and is free to discuss or browbeat on whatever Unicode-related topic concerns them.

So unless those who claim that Unicode is a closed cabal mean by that that the Consortium should be subsidizing free memberships (it is a registered non-profit corporation) or should be holding its deliberations on public-access TV, I fail to see what the knock is on the Consortium.

Conspiracy Theories and Unicode on Why Unicode Won't Work on the Internet · 2001-06-07 09:49 · Score: 1

Much of this sounds like the old conspiracy theory in which evil-empire Microsoft is out to squash good-cowboy Linux: the true-blue, we-want-to-save-the-world-from-evil story.

What this *really* has to do with Unicode isn't clear. The major commercial Unix vendors have all made significant commitments to Unicode support, and even the Linux internationalization community is busy adding Unicode support to Linux. Apparently it doesn't matter to you that Sun, HP, Compaq, NCR, and major Linux I18N players participate in Unicode development, too. It isn't an either/or black and white issue. It isn't some gigantic conspiracy to use a bad standard to prevent the good guys from developing a good standard. But I guess you can believe whatever you want.

As for multilingual text and statelessness, was kann ich Ihnen sagen? Comment pourrai-je réparer ma bêtise? Oops! Sorry, I guess I couldn't do that in Unicode, could I, or Code Page 1252, or Latin-1 for that matter?
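
The same mixed German and French sentence can be checked mechanically. A small sketch in Python 3: the text round-trips through Latin-1, Code Page 1252, and UTF-8 alike, and in none of them is there any language tag on the substrings.

```python
text = "was kann ich Ihnen sagen? Comment pourrai-je réparer ma bêtise?"

encoded = {}
for codec in ("latin_1", "cp1252", "utf_8"):
    encoded[codec] = text.encode(codec)
    # Each encoding represents the mixed-language text losslessly.
    assert encoded[codec].decode(codec) == text

# For these characters Latin-1 and CP1252 produce identical bytes.
same_bytes = encoded["latin_1"] == encoded["cp1252"]
print(same_bytes)
```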

Stateful language processing has its place, in multilingual text or monolingual text, even. But how you construct that stateful processing is not dictated to you by Unicode, any more than it is dictated to you by having Latin-1 implementations on Unixes. XML defaults to Unicode, but you can use it with any character set you choose to mark. And if you use it with Unicode, you can span mark any statefulness you want into it.

But in any case, feel free to go off and invent your systems of language and charset tagged substrings handled "transparently as sequences of bytes" and come back to show us all when you have your better mousetrap working.

Re:Unicode's reply on Why Unicode Won't Work on the Internet · 2001-06-07 09:11 · Score: 1

> there is always some need to represent, in some
> consistent and unambiguous manner, text in
> languages that can't be possibly accepted into
> Unicode, such as fictional languages

Well, fictional *languages* are easy to represent in Unicode, if you use one of the existing scripts in the standard. Pig Latin, whatever. In fact this is exactly how Klingon works -- it's all done in Latin transliteration anyway by the Trekkies and the official Klingon Language Institute (I kid you not), so it already works in Unicode.

If you are talking about fictional *scripts*, then in fact the most important, most studied and cited of those are Cirth and Tengwar, the scripts invented by Tolkien. Guess what, those *are* roadmapped for inclusion in the Unicode Standard. You might want to actually take a gander at the official roadmaps for Unicode and 10646 before mouthing off about what cannot possibly be included in the standards:

http://www.egt.ie/standards/iso10646/ucs-roadmap.html

> they can be easily handled by any expandable
> charsets-handling system...

And Unicode is not expandable? It is already planned for expansion to include Egyptian hieroglyphics, Sumero-Akkadian cuneiform, Limbu, Buginese, Avestan, and dozens of other minority and historic scripts you've probably never heard of. There are 882,373 code points still available for that kind of expansion, which is something like 800,000 more than all the known requirements of all the known writing systems current and past. And beyond that, there are 137,468 private use characters permanently set aside for anyone to define anything they damn please with. And if goofy expansion systems are your cup of tea, then your private use of Unicode private use characters could be to define them in pairs (for example) to create over 18 billion encodings of things, or in triples to create 2,597,794,797,367,232 (that's 2 and a half quadrillion) encodings of things. *That* should keep you busy.
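
The private-use arithmetic above can be reproduced in a few lines. This sketch assumes the three private-use ranges are U+E000..U+F8FF in the BMP plus U+F0000..U+FFFFD and U+100000..U+10FFFD in planes 15 and 16; the variable names are made up for the example.

```python
# Private-use ranges (inclusive), as assumed for this sketch.
bmp_private = 0xF8FF - 0xE000 + 1        # 6,400
plane15_private = 0xFFFFD - 0xF0000 + 1  # 65,534
plane16_private = 0x10FFFD - 0x100000 + 1  # 65,534

total_private = bmp_private + plane15_private + plane16_private

# Using private-use characters in pairs or triples as an "expansion system".
pairs = total_private ** 2
triples = total_private ** 3

print(f"{total_private:,} private use characters")
print(f"{pairs:,} pairs")
print(f"{triples:,} triples")
```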

> Unicode supporters do everything that is
> possible ... to prevent any competing system
> from being developed.

Well, that is some pretty hyperbole. No one is holding any guns to anyone's heads on this, figuratively or literally. The main reason competing systems are having little success is that universal character encoding schemes are *enormous* undertakings and commitments of resources. Try looking at the Acknowledgements page of the Unicode Standard: 5 pages long in small print! You try organizing hundreds of people to work on a project for a decade, and then get hundreds of companies and dozens of other standards to implement what you come up with. Most competing efforts simply founder quickly on the sheer amount of work involved.

> The problem is, Unicode is being used for things
> it is inadequate for..

Such as? Perhaps you could be more explicit in stating an example, so it would be possible to evaluate what you are talking about.

You seem to distrust the universality of the Unicode Standard. But for use on the Internet, and as the backbone of XML, HTML, Java, and other standards, it is the universality which is the attraction and the big advantage. What are you proposing instead? Use ISO 2022 with Escape switching to hundreds of individual encodings, many of which may have totally incompatible models of text handling and which thus would have little or no chance of being correctly handled or rendered on any average system that might encounter them? Do you think that computer systems just magically deal with some arbitrary, idiosyncratic local encoding because somebody, somewhere thought it was a better idea for whatever language they are familiar with?

Re:Nonsense on Why Unicode Won't Work on the Internet · 2001-06-07 06:47 · Score: 1

1. There are crossmapping difficulties between all large East Asian character encodings. This kind of problem predates Unicode, and Unicode has inherited many of the inconsistencies already present. For accurate mapping between particular vendor implementations (e.g. Code Page 932 on Windows) and Unicode, the right thing to do is to use the vendor's own mapping of their code page to Unicode. (Those tables are also often posted on the Unicode website, or can be obtained from vendors themselves.)

And what is your *alternative* anyway? Do you think you can find more authoritative and less problematical tables for converting "megabytes of electronic documents" between, say CNS 11643 and JIS X 0208, without making use of Unicode?
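
A concrete case of why the vendor's own table matters, sketched with the codecs bundled with CPython (an assumption of this example, not something the tables on the Unicode website are needed for). The same two bytes decode to different Unicode characters depending on whether the generic Shift_JIS mapping or Microsoft's Code Page 932 mapping is applied.

```python
raw = b"\x81\x60"  # the "wave dash" position in Shift_JIS-family encodings

via_cp932 = raw.decode("cp932")          # Microsoft's vendor mapping
via_shift_jis = raw.decode("shift_jis")  # mapping based on the JIS tables

print(f"cp932     -> U+{ord(via_cp932):04X}")
print(f"shift_jis -> U+{ord(via_shift_jis):04X}")

# Each mapping round-trips with itself, so data stays intact as long as
# the same table is used in both directions.
roundtrip_ok = via_cp932.encode("cp932") == raw
```

Text that came from a Code Page 932 system should therefore go through the `cp932` table both ways; mixing the two tables is where characters get lost.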

2. What the previous poster was pointing out is that all character distinctions made in the official JIS national standards are also made by the Unicode Standard. I might also point out that the number one OS in Japan (Windows) and the number one word processor in Japan (Ichitaro Dasshu) are both Unicode-based. Most people in Japan are perfectly satisfied with such products, as regards their character handling, and neither know nor care that they are based on Unicode inside.

The Unicode Consortium has not paid anyone to put a rubber stamp on anything. Put up or shut up on a claim like that.

And yes, I do attend Japanese discussions on Unicode issues. (Although I don't hang out on boards devoted to TRON or Giga, which tend toward uninformed Unicode-bashing.) I have been personally acquainted with the head of the Japanese national standards body delegation into the ISO committees for a number of years now. He is hardly a stalking horse for the Unicode Consortium! But he and JSC2 have cooperated for years with ISO SC2/WG2 in the development of 10646 *and* in the Han unification that that implies--the same Han unification used in the Unicode Standard.

3. UTF-8 is great for some purposes. UTF-16 is great for others. And UTF-32 is great for yet others. Since all of them represent the same characters in the standard, and all three forms interoperate easily (the conversion code is posted on the Unicode website, for anyone who cares), where's the problem? The reason there are 3 encoding forms is that the software vendor community demanded it: UTF-8 for 8-bit API compatibility and UNIX stream/file transparency; UTF-16 for size and processing efficiencies for most text; UTF-32 for UNIX 32-bit wchar_t implementations of character processing.
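
A short sketch of that interoperability, using Python 3's built-in codecs rather than the conversion code mentioned above. The sample string is chosen so its four characters take 1, 2, 3, and 4 bytes in UTF-8.

```python
# U+0041, U+00E9, U+4E16, U+1D11E (a character outside the BMP)
text = "A\u00e9\u4e16\U0001D11E"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")
utf32 = text.encode("utf-32-be")

# Same characters, different sizes per encoding form.
sizes = (len(utf8), len(utf16), len(utf32))
print(sizes)

# Converting between forms is lossless in every direction.
utf8_to_utf16 = utf8.decode("utf-8").encode("utf-16-be")
utf16_to_utf32 = utf16.decode("utf-16-be").encode("utf-32-be")
utf32_to_utf8 = utf32.decode("utf-32-be").encode("utf-8")
```

The big-endian variants are used here only so that no byte order mark is added to the output.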

And then you toss off a total non sequitur: "...the problem is that Unicode doesn't allow many people to encode their languages fully." Unicode doesn't encode *languages* -- it encodes characters from scripts. End users don't "encode their languages" -- they represent text in their languages on computer systems that make use of encoded characters. But now that we've got some terms straightened out, would you care to specify an instance of a language that Unicode doesn't represent fully? It's funny how the Unicode Consortium seems to have convinced experts from the Library of Congress, the Research Libraries Group, the European Community, and so on about this, but hasn't been able to convince you!

So, since you asked, that was how your post was nonsense.

User: kwhistler

Comments · 9