Domain: unicode.org
Stories and comments across the archive that link to unicode.org.
Comments · 276
-
Re:♬ First musical post! ♫
By the way, have you ever considered using Byzantine musical notes, or glyphs from the full set of notes?
-
OT: iso-8859-7 characters in your signature
-Ãéñãéïò Ôóßñïò (you need iso-8859-7 and respective font to view that correctly)
Use Unicode instead of iso-8859-7 in your signature and everyone with good software will see your text, with no need to write "you need iso-8859-7 and respective font to view that correctly". The Greek characters start at U+0391 (here's a PDF chart and Named and Numeric HTML Entities). Greek characters are very important in Latin-based languages for mathematical formulas, so they are usually installed by default in modern operating systems, and they even have named HTML entities, so you can write &alpha; &beta; &gamma; &delta; &Psi; &Omega; in your sig or comments and get α β γ δ Ψ Ω. I don't know what software you use, but I know that under Debian GNU/Linux, which I use, the Unicode Greek fonts are installed by default and Mozilla displays them by default as well (along with lots of characters from many exotic scripts). Hope it helps.
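As a rough illustration of the entity approach described above, here is a minimal Python sketch that uses only the standard library `html` module. The variable names are invented for the example, and the snippet only shows how the references resolve; it says nothing about what any particular browser or font will display.

```python
import html

# Named entities as they would be typed into an HTML signature.
typed = "&alpha; &beta; &gamma; &delta; &Psi; &Omega;"

# html.unescape resolves both named and numeric character references.
rendered = html.unescape(typed)
print(rendered)

# The Greek capital letters begin at U+0391 (GREEK CAPITAL LETTER ALPHA).
first_capital = chr(0x0391)

# A numeric reference is an alternative where the named form is not understood.
numeric_alpha = html.unescape("&#x03B1;")
```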
—
-
Re:Terminology whine
How do you sort stuff alphabetically if you can't just do an integer comparison?
Unicode Sorting Algorithm.
Would be really slow to use some funky custom sorting routine.
What are you running? There are massive databases that use binary compare, and bitty boxes that use binary compare, but even my 386 should be able to do decent sorting in a negligible amount of time.
I don't know of many character sets that put the characters in sort order. ASCII doesn't work for English, because capital letters and lower case letters don't sort together. Latin-1 puts all its characters after ASCII, when some of them should sort with the ASCII characters.
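To make the point about code order concrete, here is a small Python sketch comparing a plain code point sort with a crude folding key. The helper `rough_sort_key` is hypothetical and deliberately simplistic: it is not the Unicode Collation Algorithm and knows nothing about language-specific rules; it only shows why raw integer comparison gives surprising results.

```python
import unicodedata

def rough_sort_key(text):
    """Crude, locale-blind key: fold case, then drop combining accents."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

words = ["apple", "Banana", "cherry"]
accented = ["zebra", "\u00e9cole", "eagle"]  # \u00e9 is e with acute accent

# Plain comparison uses code point values: 'B' (0x42) sorts before 'a' (0x61).
by_code_point = sorted(words)
by_rough_key = sorted(words, key=rough_sort_key)

# In Latin-1 order the accented letter (0xE9) lands after 'z' (0x7A).
accented_by_code_point = sorted(accented)
accented_by_rough_key = sorted(accented, key=rough_sort_key)
```

A real application would more likely rely on a locale-aware collation library than on a hand-written key like this one.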
As for why, the fact is it's not an option in a multilingual environment. Lithuanian sorts y between i and j; Swedish, German and Danish use some of the same accented characters, but sort them differently. The whole concept of binary sorting fails for some languages; Maltese and traditional Spanish both sort two letters ("ch" and "ll" for Spanish) as if they were one, and German sorts one letter ("ß") as if it were two ("ss").
-
Re:My problem with OS X
Indeed, right-to-left language support is currently very weak on OS X, but this is rumored to change with Jaguar. Certainly there's been talk of it from Apple people before (e.g. Peter Lofting at the ATypI Conference in Copenhagen last year). You can of course already use many right-to-left languages on MacOS 9 and with X Windows on OS X (e.g. AbiWord). From an implementation perspective, because of system-wide Unicode support it's certainly easier than ever: just get the language sets out the door at Apple. Additional language support should be quicker than with pre-OS X systems, and more uniform. Nice to see another international ISO technology that Apple and Xerox started coming back to help them. I never cease to be amazed at how innovative those two companies really were back in the day. And to a greater or lesser extent still are.
-
Re:TLC/Discovery Special -- Question ...
Shuttle? Are you smoking something? My entire comment was two sentences, and made no mention of anything specific, let alone a shuttle. What I was saying is that you seemed to imply that since something is maliciously planned it cannot be foreseen. I proceeded to say that you obviously don’t (or shouldn’t) work in any security-related field, as most things one has to foresee are malicious in nature. And that’s all I said.
And why don’t you check the HTML source before implying that because I’m typing something correctly I must be using Windows. First off, Windows does use typographically correct quotes (so-called “curlies”) but uses the wrong codes: it uses Windows Code Page 1252 codes 145, 146, 147, 148 (decimal). Codes from 128 to 159 are not assigned to printable characters in either ISO-8859 (an 8-bit superset including ASCII that nearly parallels CP1252) or ISO-10646 (Unicode, UTF-8, etc.), and thus don’t work on any system unless the page charset is specified as windows-1252 or the browser assumes what they mean (the Macintosh MSIE does this, naturally).
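The code-page mismatch described above can be checked with Python's built-in codecs. This is only a sketch: the variable names are illustrative, and `cp1252` and `latin-1` are Python's codec names for windows-1252 and ISO-8859-1.

```python
import unicodedata

# Byte values 145-148 as they would appear in a windows-1252 page.
raw = bytes([145, 146, 147, 148])

# Decoded with the right charset, they become the Unicode curly quotes.
as_cp1252 = raw.decode("cp1252")
code_points = [f"U+{ord(ch):04X}" for ch in as_cp1252]

# Decoded as ISO-8859-1, the same bytes land in the C1 control range
# (general category "Cc"), so there is nothing printable to show.
as_latin1 = raw.decode("latin-1")
categories = [unicodedata.category(ch) for ch in as_latin1]
```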
However: I type punctuation using the correct Unicode values, which should work on any standards-compliant browser. These are U+2018, U+2019, U+201C, U+201D (hexadecimal). You can type codes like this by typing &#xNNNN; for hex codes, or just &#NNNN; for decimal. You can also use the ISO-8879 names for these characters (lsquo, rsquo, ldquo, rdquo), but a lot more browsers have problems with those.
-
*Bzzzzt*, wrong!
"Opera will be a good browser when it supports all the latest HTML/XHTML standards and CSS. Until my (100% properly coded and W3C validated) websites render as perfectly in Opera as they do in Mozilla and IE, Opera can't really be classified as 'the best browser out there.'"
I'd say that you're either a lying sack of shit, or someone who doesn't know what they're talking about. (Take your pick!)
Opera supports HTML 4.01, XHTML, XML, CSS1 and most of CSS2, and has for a long time. Opera 6 also supports PNG, Unicode, ECMA-262 2ed (that's "JavaScript 1.3" to you, idiot), and most of ECMA-262 3ed, plus some JScript methods in IE mode. However, Opera does not support DOM fully just yet. They're working on it though.
-
Why Arabic numbers are the way they are
Your reasoning about the digit order in Arabic is wrong because it is completely irrelevant for addition whether your MS digit is on the left or on the right.
Arabic numbers are written LS right, MS left because of the way numbers are read in classical Arabic. Classical Arabic (unlike Modern Standard Arabic) reads numbers LS digit first. Since Arabic is written right-to-left, the LS digit comes first, i.e. on the right. That's why Arabic numbers in Arabic script are written the way they are.
Since numeral ordering is a relatively script-independent thing, the order of the numerals was retained when the Arabic digits were adopted into the Latin script (probably in medieval Spain). This is convenient because most Indo-European languages pronounce their numbers MS digits first.
BTW, the Arabic numbers weren't even invented by the Arabs. Arabic numbers were originally invented in India and written in the Sanskrit language and the Devanagari script, which runs left-to-right. Sanskrit numerals are pronounced MS digit first, so it makes sense that way as well. In Arabic, the so-called "Arabic" digits are called Indian digits even today.
-
Re:My pet hates with Java
The surrogate codepoints (0xD800-0xDFFF) have long been reserved so that UTF-16 can encode codepoints out to U+10FFFF. Codepoints beyond U+FFFF haven't been assigned yet, but a few scripts using them have already been proposed for Unicode 3.2. It's almost certainly going to happen, and if java.lang.String still lets you chop a character in half then it's seriously broken.
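The surrogate mechanism mentioned above is simple arithmetic, sketched here in Python. The function name `to_surrogate_pair` and the sample code point are chosen only for illustration; the result is cross-checked against Python's own `utf-16-be` codec.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000       # 20 bits remain
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low

sample = 0x10400  # an arbitrary code point beyond U+FFFF
high, low = to_surrogate_pair(sample)

# Cross-check against the standard encoder (big-endian, no byte order mark).
encoded = chr(sample).encode("utf-16-be")
units = [int.from_bytes(encoded[i:i + 2], "big") for i in range(0, len(encoded), 2)]

# Cutting between the two 16-bit units leaves a lone surrogate,
# which is the "chopped in half" character the comment warns about.
half = encoded[:2]
```

Trying `half.decode("utf-16-be")` raises `UnicodeDecodeError`, which is the behaviour a string type has to guard against if it lets callers split on 16-bit units.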
-
Re:My pet hates with Java
I guess it's pointless to argue with trolls, but:
Java's 'char' type is supposed to represent unicode characters. However it is only two bytes wide so does not support codepoints beyond U+FFFF, such as various Chinese characters.
So you didn't know that the Unicode standard quite explicitly states that "Unicode character values are 16 bit"? Hence, no "codepoints" beyond U+FFFF exist!
You can't pass them by reference.
I have seen exactly one method that is dragged out as example of why you need to pass primitives by reference, and that's swap(a,b), which is just of academic interest.
A function cannot take variable arguments of arbitrary type.
Then use Object[]. "Variable arguments of arbitrary type" is a hack that works in C because C isn't properly strongly-typed, and has no place in Java.
You can't create a list anonymously.
As others have pointed out: Yes, you have been able to since JRE 1.1 (prior to that, only in declarations).
The syntax is intentionally inflexible. [...] Compare C, where anyone can write a reasonable assert macro.
Gee, how "standard" that sounds, so now I have to learn some nincompoop's particular fancy if I take over their code. Macros suck anyway because they ignore the type system, and most C++ literature I've seen wisely discourages their use.
As for flexibility, are you opposed to standards? Then stop using C because its syntax is also "inflexibly" specified in a specification.
byte is signed in Java
All numeric values except the special-purpose char are consistently signed. No need to go and check every time you use a numeric variable. This is a good thing.
-
Re:Use of ¬
That thing is used in some logic textbooks as the NOT symbol, and AppleScript (Macintosh scripting language) uses it at the end of a line to signal that the code continues on the next line (like how a \ is used at the end of a line in shell scripts):
set d to (display dialog "What the hell is this?" ¬
with buttons { "OK", "Cancel" })
set x to button returned of d
In Unicode, it is U+00AC, and is called the not sign and an angled dash in the documentation [PDF].
Why did you mention UK keyboards; is that thing some kind of British symbol that I am unaware of? Or did you mean to type the pound sign and my browser is displaying it wrong? (I see a sideways L-like thing, FYI.)
-
Adobe will never drop the Asian market.
Adobe will never stop supporting the Asian market, though they might very well shift their business focus. Adobe is a major driver of internationalization standards. They are extremely active in driving the Unicode standard, and the standard book for internationalization, CJKV, is written by Ken Lunde, an Adobe employee who is treated with great care.
Adobe might very well stop internationalizing some of their products (like Photoshop), but I am sure they have an agenda for the Asian market.
-
Re:So what ASCII value will the Euro be?
It doesn’t have an ASCII value. Ignore the nitwits below you: ASCII is only 7 bits (128 code values, 0 to 127), and they’re giving you the Windows-specific 8-bit value 128. On Macintosh, for example, it has a different location: 219 (Option-shift-2).
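The different byte values can be confirmed with Python's bundled codecs. This is a sketch under the assumption that Python's `cp1252` and `mac_roman` codecs match the Windows and Macintosh encodings the comment has in mind; the variable names are illustrative.

```python
euro = "\u20ac"  # EURO SIGN

# ASCII stops at code value 127, so the euro cannot be encoded at all.
try:
    euro.encode("ascii")
    in_ascii = True
except UnicodeEncodeError:
    in_ascii = False

# The single-byte value depends entirely on the code page in use.
windows_byte = euro.encode("cp1252")[0]
mac_byte = euro.encode("mac_roman")[0]

# UTF-8 represents the same code point as a three-byte sequence.
utf8_bytes = euro.encode("utf-8")
```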
Its Unicode value is 0x20AC, which is a standard. It can be written as &#x20AC; in HTML/XML, or &euro; for the few browsers that support the symbolic name (mine doesn’t seem to).
-
Compression Scheme for Unicode
Check this standard for Unicode compression.
It compresses 16-bit Unicode chars to 8 bits, using some reserved tags to switch the character windows. A sample Java implementation is available. The best thing is that most of the standard ASCII chars will still be encoded as 8-bit ASCII after the compression. So you can still store all your data in 8-bit ASCII and convert it to Unicode before displaying it. And you don't have to modify your old data!!!
-
Re:Unicode my ass
Dump out a Word document. You will see each character takes 16 bits. The original Unicode standard specified 16 bits for each character. For example, here is Unicode version 2.1.8. Note that each character is 16 bits. That's not just another "encoding", that is the official standard. The whole point of Unicode is a unique code for each letter.
Other standards have cropped up to deal with the PITA of dealing with 16-bit characters, including UTF-8. But UTF-8 is more of a compression encoding.
Now, they may have moved to 32 bits, I'm not sure. I haven't paid attention to Unicode for a number of years. According to the unicode.org site, they've recently added a whole slew of new symbols, which brought them up to around 94,000 symbols, which obviously exceeds 16 bits.
-
Re:So...
I agree in principle, but then there’ll be an even worse schism between MSIE and the OSS browsers than now. At least the two sides try to implement the same standards now. If the OSS browser coders intentionally decide to start using their own standards, it will only get worse. We have to be able to implement the same stuff the commercial browsers do unless we’re trying to marginalize ourselves.
FYI, the IETF and ISO have been standardizing on proprietary standards for years. POSIX, for example (ISO 1003.6). And UTF-8, UTF-16, and UTF-32 are all standards but are owned by Unicode, Inc.
-
Re:*nix has this stupid fixation with case too
That sounds like a good reason to use a unicode file system, and is completely orthogonal to filenames that are case sensitive.
Yeah, a unicode file system sounds great, but my point is that it has everything to do with case sensitivity. Case folding of unicode text is complicated and resource intensive. I don't want my kernel doing it.
It's a pain in the @$$ to have to manually edit 5,000 source files.
Yeah, sucks for you, sorry :-)
-
Re:Glyphs versus characters in Castillian
In Castillian Spanish, 'ch' and 'll' are characters that require two glyphs to print. However, for alphabetization purposes, 'ch' and 'll' are distinct characters
Well, in general this is what should occur. If it does not, then it is because the national standards for those languages had already been treating them as two separate characters, with two encoding points.
http://www.unicode.org/unicode/standard/where/
Says:
For compatibility with pre-existing standards, there are characters that are equivalently represented either as sequences of code points or as a single code point called a composite character. For example, the i with 2 dots in naïve could be presented either as i + diaeresis (0069 0308) or as the composite character i + diaeresis (00EF).
That's where Unicode bends from pure idealism to practical matters. They'd prefer to have one char, but if Spain is already using two characters for 'ch', then the Unicode consortium would obey their practice.
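The two equivalent spellings of "naïve" quoted above can be compared with Python's standard `unicodedata` module. This is a minimal sketch with invented variable names; it shows only canonical normalization and does not touch language-specific sorting.

```python
import unicodedata

decomposed = "nai\u0308ve"   # i followed by U+0308 COMBINING DIAERESIS
precomposed = "na\u00efve"   # the single code point U+00EF

# A raw comparison sees different code point sequences.
same_raw = decomposed == precomposed

# Normalizing both to one form makes the equivalence visible.
same_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
same_nfd = unicodedata.normalize("NFD", precomposed) == decomposed
```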
simple-minded comparison routines that compare character codes report erroneous comparison values because they don't realize that 'cg'
Of course, the same goes for any Unicode sorting. No Unicode sorting can be assumed to work based just on the character values. That's why all software libraries that allow for sorting Unicode have functions for the programmer to use, based on the language/locale in effect.
-
Re:Glyphs versus characters in Castillian
In Unicode terms, "ch" is called a grapheme, which is different from a character. (Or you may want to call it a letter.) It is encoded using the two characters "c" and "h". It is something that is considered a unit in some places, but not in others. I would recommend taking a look at the Unicode Standard book, which you can read online. These things are in chapters 1 and 2.
About string ordering, Unicode's code point order does not claim anything. If you look at ASCII, you will find that even that is not suitable for normal English sorting, since "B" is encoded before "a". But don't go away. Unicode has a Collation Algorithm that specifies what one should do for advanced natural-language ordering of strings, and it also says what one should do with the Castillian "ch".
--
-
Re:yes, unicode works, but is unnecessary.
Also, if there's redundancy in Unicode, I imagine most of that space could be saved with gzip, which also has good support over the web, though like Unicode is far underused.
Well, one may also try the Standard Compression Scheme for Unicode.
--
-
Fun Unicode demos
For a couple of cool demos of the kind of multilingual Web pages that Ken Whistler is talking about, see the announcement for the Tenth Unicode Conference or "I don't know, I only work here." Both of these pages demonstrate Han unification, in which the same code points tagged as different languages get different visual presentation in a compliant browser.
-
Unicode's reply
It's probably too late, but following is a response from one of the editors of the Unicode Standard:
Dear Mr. Carroll,
I have just finished reading the article you published today on the Hastings Research website, authored by Norman Goundry, entitled "Why Unicode Won't Work on the Internet: Linguistic, Political, and Technical Limitations."
Mr. Goundry's grounding in Chinese is evident, and I will not quibble with his background East Asian historical discussion, but his understanding of the Unicode Standard in particular and of the history of Han character encoding standardization is woefully inadequate. He makes a number of egregiously incorrect statements about both, which call into question the quality of research which went into the Unicode side of this article. And as they are based on a number of false premises, the article's main conclusions are also completely unreliable.
Here are some specific comments on items in the article which are either misleading or outright false.
Before getting into Unicode per se, Mr. Goundry provides some background on East Asian writing systems. The Chinese material seems accurate to me. However, there is an inaccurate statement about Hangul: "Technically, it was designed from the start to be able to describe *any sound* the human throat and mouth is capable of producing in speech, ..." This is false. The Hangul system was closely tied to the Old Korean sound system. It has a rather small number of primitives for consonants and vowels, and then mechanisms for combining them into consonantal and vocalic nuclei clusters and then into syllables. However, the inventory of sounds represented by the Jamo pieces of the Hangul is not even remotely close to describing any sound of human speech. Hangul is not and never was a rival for IPA (the International Phonetic Alphabet).
In the section on "The Inability of Unicode To Fully Address Oriental Characters", Mr. Goundry states that "Unicode's stated purpose is to allow a formalized font system to be generated from a list of placement numbers which can articulate *every single written language* on the planet." While the intended scope of the Unicode Standard is indeed to include all significant writing systems, present and past, as well as major collections of symbols, the Unicode Standard is *not* about creating "formalized font systems", whatever that might mean. Mr. Goundry, while critiquing Anglo-centricity in thinking about the Web and the Internet as an "unfortunate flaw in Western attitudes" seems to have made the mistake of confusing glyph and character -- an unfortunate flaw in Eastern attitudes that often attends those focussing exclusively on Han characters.
Immediately thereafter, Mr. Goundry starts making false statements about the architecture of the Unicode Standard, making tyro's mistakes in confusing codespace with the repertoire of encoded characters. In fact the codespace of the Unicode Standard contains 1,114,112 code points -- positions where characters can be encoded. The number he then cites, 49,194, was the number of standardized, encoded characters in the Unicode Standard, Version 3.0; that number has (as he notes below) risen to 94,140 standardized, encoded characters in the *current* version of the Unicode Standard, i.e., Version 3.1. After taking into account code points set aside for private use characters, there are still 882,373 code points unassigned but available for future encoding of characters as needed for writing systems as yet unencoded or for the extension of sets such as the Han characters.
*Even if* Mr. Goundry's calculation of 170,000 characters needed for China, Taiwan, Japan, and Korea were accurate, the Unicode Standard could accommodate that number of characters easily. (Note that it already includes 70,207 unified Han ideographs.) However, Mr. Goundry apparently has no understanding of the implications or history of Han unification as it applies to the Unicode Standard (and ISO/IEC 10646). Furthermore, he makes a completely false assertion when he states that Mainland China, Taiwan, Korea, and Japan "were not invited to the initial party."
Starting with the second problem first, a perusal of the Han Unification History, Appendix A of the Unicode Standard, Version 3.0, will show just how utterly false Mr. Goundry's implication that the Asian countries were left out of the consideration of encoding of Han characters in the Unicode Standard is. Appendix A is available online, so there really is no valid research excuse for not having considered it before haring off to invent nonexistent history about the project, even if Mr. Goundry didn't have a copy of the standard sitting on his desk. See:
http://www.unicode.org/unicode/uni2book/appA.pdf
The "historical" discussion which follows in Mr. Goundry's account, starting with "The reaction was predictable..." is nothing less than fantasy history that has nothing to do with the actual involvement of the standardization bodies of China, Japan, Korea, Taiwan, Hong Kong, Singapore, Vietnam, and the United States in Han character encoding in 10646 and the Unicode Standard over the last 11 years.
Furthermore, Mr. Goundry's assertions about the numbers of characters to be encoded show a complete misunderstanding of the basics of Han unification for character encoding. The principles of Han unification were developed on the model of the main *Japanese* national character encoding, and were fully assented to by the Chinese, Korean, and other national bodies involved. So assertions such as "they [Taiwan] could not use the same number [for their 50,000 characters] as those assigned over to the Communists on the Mainland" is not only false but also scurrilously misrepresents the actual cooperation that took place among all the participants in the process.
Your (Mr. Carroll's) editorial observation that "It is only when you get *all* the nationalities in the same room that the problem becomes manifest," runs afoul of this fantasy history. All the nationalities have been participating in the Han unification for over a decade now. The effort is led by China, which has the greatest stakeholding in Han characters, of course, but Japan, Korea, Taiwan and the others are full participants, and their character requirements have *not* been neglected.
And your assertion that many Westerners have a "tendency ... to dismiss older Oriental characters as 'classic,'" is also a fantasy that has nothing to do with the reality of the encoding in the Unicode Standard. If you would bother to refer to the documentation for the Unicode Standard, Version 3.1, you would find that among the sources exhaustively consulted for inclusion in the Unicode Standard are the KangXi dictionary (cited by Mr. Goundry), but also Hanyu Da Zidian, Ci Yuan, Ci Hai, the Chinese Encyclopedia, and the Siku Quanshu. Those are *the* major references for Classical Chinese -- the Siku Quanshu *is* the Classical canon, a massive collection of Classical Chinese works which is now available on CDROM using Unicode. In fact, the company making it available is led by the same man who represents the Chinese national standards body for character encoding and who chairs the Ideographic Rapporteur Group (the international group that assists the ISO working group in preparing the Han character encoding for 10646 and the Unicode Standard).
Mr. Goundry's argument for "Why Unicode 3.1 Does Not Solve the Problem" is merely that "[94,140 characters] still falls woefully short of the 170,000+ characters needed" -- and is just bogus. First of all the number 170,000 is pulled out of the air by considering Chinese, Japanese, and Korean repertoires *without* taking Han unification into account. In fact, many *more* than 170,000 candidate characters were considered by the IRG for encoding -- see the lists of sources in the standard itself. The 70,207 unified Han ideographs (and 832 CJK compatibility ideographs) already in the Unicode Standard more than cover the kinds of national sources Mr. Goundry is talking about.
Next Mr. Goundry commits an error in misunderstanding the architecture of the Unicode Standard, claiming that "two *separate* 16 bit blocks do not solve the problem at all." That is not how the Unicode Standard is built. Mr. Goundry claims that "18 bits wide" would be enough -- but in fact, the Unicode Standard codespace is 21 bits wide (see the numbers cited above). So this argument just falls to pieces.
The next section on "The Political Significance Of This Expressed In Western Terms" is a complete farce based on false premises. I can only conclude that the aim of this rhetoric is to convince some ignorant Westerners who don't actually know anything about East Asian writing systems -- or the Unicode Standard, for that matter -- that what is going on is comparable to leaving out five or six letters of the Latin alphabet or forcing "the French ... to use the German alphabet". Oh my! In fact, nothing of the kind is going on, and these are completely misleading metaphors.
The problem of URL encodings for the Web is a significant problem, but it is not a problem *created* by the Unicode Standard. It is a problem which is being actively worked on by the IETF currently, and it is quite likely that the Unicode Standard will be a significant part of the *solution* to the problem, enabling worldwide interoperability, rather than obstructing it.
And it isn't clear where Mr. Goundry comes up with asides about "Ascii-dependent browsers". I would counter that Mr. Goundry is naive if he hasn't examined recently the internationalized capabilities of major browsers such as Internet Explorer -- which themselves depend on the Unicode Standard.
Mr. Goundry's conclusion then presents a muddled summary of Unicode encoding forms, completely missing the point that UTF-8, UTF-16, and UTF-32 are each completely interoperable encoding forms, each of which can express the entire range of the Unicode Standard. It is incorrect to state that "Unicode 3.1 has increased the complexity of UCS-2." The architecture of the Unicode Standard has included UTF-16 (not UCS-2) since the publication of Unicode 2.0 in 1996; Unicode 3.1 merely started the process of standardizing characters beyond the Basic Multilingual Plane.
And if Mr. Goundry (or anyone else) dislikes the architectural complexity of UTF-16, UTF-32 is *precisely* the kind of flat encoding that he seems to imply would be preferable because it would not "exacerbate the complexity of font mapping".
In sum, I see no point in Mr. Goundry's FUD-mongering about the Unicode Standard and East Asian writing systems.
Finally, the editorial conclusion, to wit, "Hastings [has] been experimenting with workarounds, which we believe can be language- and device-compatible for all nationalities," leads me to believe that there may be a hidden agenda for Hastings in posting this piece of so-called research about Unicode. Post a seemingly well-researched white paper with a scary headline about how something doesn't work, convince some ignorant souls that they have a "problem" that Unicode doesn't address and which is "politically explosive", and then turn around and sell them consulting and vaporware to "fix" their problem. Uh-huh. Well, I'm not buying it.
--Ken Whistler, B.A. (Chinese), Ph.D. (Linguistics),
Technical Director, Unicode, Inc.
Co-Editor, The Unicode Standard, Version 3.0
--
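The codespace figures quoted in the letter above can be checked with a few lines of Python. The split into 17 planes of 65,536 code points is not spelled out in the letter and is supplied here as background; the constant names are illustrative.

```python
PLANES = 17                          # assumption stated above: 17 planes
CODE_POINTS_PER_PLANE = 0x10000      # 65,536 code points each

codespace = PLANES * CODE_POINTS_PER_PLANE
highest = codespace - 1              # the last code point, U+10FFFF
bits_needed = highest.bit_length()

# Figures quoted in the letter.
encoded_in_3_1 = 94_140
claimed_requirement = 170_000

fits_in_16_bits = encoded_in_3_1 <= 2 ** 16
fits_in_codespace = claimed_requirement <= codespace
```

So the Version 3.1 repertoire no longer fits in 16 bits, while the disputed 170,000 figure is a small fraction of the 21-bit codespace.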
-
Re:Unicode Character Set vs Character Encoding
Yes. Here's a chart.
-
Misconceptions in article
As a preliminary, Unicode and ISO 10646 aren't the same standard, but are kept pretty much in synchronisation. ISO 10646 provides a character set with a 4-byte representation, and a compatible smaller set with a 2-byte representation. These representations have encodings such as UTF-8, UTF-16, and UTF-32. UTF-32 encodes every Unicode character in 32 bits and can represent the full 2^31 codepoints, while UTF-8 and UTF-16 as described in the Unicode 3.1 document are variable length representations that can represent approximately 2,100,000 and 1,100,000 codepoints respectively.
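The approximate capacities given above follow from the bit layouts of the encodings. The sketch below works them out in Python, assuming the four-byte UTF-8 form and the surrogate-pair scheme of UTF-16; the variable names are made up for the example.

```python
# UTF-8 four-byte form: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
utf8_payload_bits = 3 + 6 + 6 + 6
utf8_four_byte_capacity = 2 ** utf8_payload_bits

# UTF-16: every 16-bit value, plus every high/low surrogate combination,
# minus the 2,048 surrogate values, which are not characters themselves.
surrogate_pairs = 1024 * 1024
utf16_capacity = 2 ** 16 + surrogate_pairs - 2048

# The highest code point takes four bytes in all three encoding forms.
top = chr(0x10FFFF)
sizes = {name: len(top.encode(name)) for name in ("utf-8", "utf-16-be", "utf-32-be")}
```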
One of the design principles was to provide a lossless representation of any currently used character set in Unicode, so that a round-trip re-encoding of text from one encoding to Unicode and back again would lose no information. Another was to keep distinct code-points for any characters that had different semantics, or different 'abstract shapes'.
It turns out that one can satisfy these requirements for the Japanese kanji, Chinese hanzi (traditional and simplified) and Korean hanja without requiring a separate code-point for each; in Unicode version 2.0, approximately 121,000 such characters were able to be represented in 20,902 code points. Note that those characters which have distinct shapes but the same meaning, and those which are similar enough to be classified as calligraphic variants but have distinct meanings, are all represented by distinct code-points. (One caveat: in practice there are some exceptions as regards the preservation of information after a round-trip encoding to Unicode and back. For example, the CCCII encoding of hanzi explicitly catalogues calligraphic variations, and as such doesn't map 1-1 onto Unicode.)
Of course, the actual glyph that corresponds to one of these unified codes will change depending upon the context in which it is rendered. For example the character 0x6d77 corresponding to the character for sea in both Chinese (Mandarin 'hai3') and Japanese ('umi') is drawn with one fewer stroke in Japanese than in Chinese. These typographical details are important, but can (and debatably, should) be dealt with outside the context of character encoding. Unicode has support for language tags which in the absence of any higher-level information can indicate the language context of the characters following them. Typically though, this information should be stored as part of a richer document structure (as is possible in XML for example.) Correct display of characters will require the presence of the appropriate font and a mechanism (such as LOCALE in a simple one language case) for selecting this font.
Given this unification then, one really can fit most of the characters for which there are already extant (non-Unicode) encodings into 16 bits. With Unicode 3.1/ISO 10646-2 (which uses more than 65536 codepoints) this representation is AFAIK pretty much complete, including for example all of the hanzi of CNS 11643-1992 and CNS 11643-1986 plane 15 (the most complete hanzi encoding outside of CCCII.)
With this in mind, one can argue against the points raised in the article:
- The unification scheme allows the representation of the 170,000 characters the author calculates in 70,000 or so codepoints, which it now does with Unicode 3.1. The use of external context is still necessary for correct rendering, but if the document has no structure for representing language context, there are Unicode language tags that can fill this role. Similarly, context would be required for the presentation of different calligraphic variants of Roman characters (e.g. fraktur.)
- Unification is quite unlike the analogy described 'in Western Terms'. 'M' and 'N' could not be identified, as they semantically distinguish words (e.g., 'rum' and 'run' have very different meanings.) Traditional characters and their simplified analogues are not identified under Unicode, so even if 'Q' were simply a fancier 'C' (which of course it is not), it wouldn't be given the same codepoint.
- Unicode is not limited to 16 bits as stated in the introduction to the article. There are over 2000 million available codepoints in UCS-4 and UTF-8, and UTF-16 can represent approximately 1 million of these. There is plenty of room - even in UTF-16 - to encode more characters as the need arises.
- With the exception of calligraphic variants in CCCII, Unicode can already faithfully represent characters in the major Chinese, Japanese and Korean character encoding standards.
A little bit of research by the article author would have made the article unnecessary.
References:
Unicode 3.1 document;
CJKV Information Processing, Ken Lunde.
PS: In the time it took me to read the article, do some research and write this response, there have been over 300 slashdot comments. Wow.
-
technical critique
Unicode is not a "16-bit character definition". Unicode is a "character coding system" for assigning code points to abstract characters. i'll hereby suggest that the author of this piece has confused Unicode itself with one of the encoding forms of Unicode, that is, ways that characters are expressed as bitstrings. please to shoot this down.
a "character coding system" (drawing on http://www.unicode.org/ and my copy of the standard 3.0 here) is a system for assigning characters to code points. Unicode 3.1 assigns some 94,000 odd characters, and the roadmap for allocations (start at http://www.unicode.org/pending/pending.html) will assign more in the future. these assignments are just that: an abstract character to an integer value in the Unicode repertoire. this assignment does not dictate how to represent the character as data in any way.
There are a variety of encoding forms of Unicode, each for ways of representing characters in the repertoire as data (not at all "on screen", that's glyphs, and that's a whole other issue). The different encoding schemes have different strengths and weaknesses. UTF-16 is a form that uses fixed-width 16-bit sequences as the base unit (though through a concept known as Surrogates, two such scalars adjacent to each other can represent a value normally not expressible with just 16-bits). UTF-8 is a different form that uses a variable number of 8-bit sequences to represent characters. There is a UTF-32 form, a UTF-EBCDIC form, believe it or don't. These are just encoding forms, they make no restrictions on what or how many characters get assigned. If the Unicode Consortium wanted to assign abstract characters to values that exceed the limits of current encoding forms, we could certainly do something about that, but it isn't the horrible catastrophe the author makes it out to be.
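To make the surrogate idea concrete, here is a rough Python sketch of how a value above 16 bits can be split into two 16-bit units. The function name to_surrogate_pair is made up for this example, and the sample character (U+1D11E, a musical symbol) is just a convenient value beyond U+FFFF.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) 16-bit surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF needs a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))               # 0xd834 0xdd1e

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
print(chr(0x1D11E).encode("utf-16-be").hex())   # d834dd1e
```

The pair carries 20 bits in total, which is where the extra million or so values on top of the 16-bit range come from.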
this is just the thing that leaps out at me. thoughts?
-
Re:Unicode Character Set vs Character Encoding
Morgo is correct. Unicode is only capable of representing a sub-set of ISO 10646-1:2000. This is detailed in the UTF-32 definition among other places which says: UTF-32 is restricted in values to the range 0x000000 to 0x10FFFF, which precisely matches the range of characters defined in the Unicode Standard (and other standards such as XML), and those representable by UTF-8 and UTF-16.
-
Re:Unicode Character Set vs Character Encoding
Nice summary. Although UTF-32 only implements a subset of UCS-4 due to compatibility issues. You can find more information on unicode at unicode.org, in particular their faq is very helpful, especially the sub-faq on UTF-16 and the BOM.
-
Re:Unicode Character Set vs Character Encoding
UTF-8, which is a variable length encoding most commonly used on the web allows a mapping of Unicode from U-00000000 to U-7FFFFFFF
No, Unicode only allows character values up to 0x10FFFF (the 16-bit basic multilingual plane, plus 2^20 surrogate pairs). This conveniently means that all characters can be expressed in four bytes in UTF-8, as noted here.
That FAQ, as I pointed out in another thread, is not 100% accurate.
...but it can handle the entire Unicode character set.
All encodings can handle the entire character set. They'd be pointless if they couldn't!
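The four-byte claim is easy to check. A small Python sketch, assuming a Python 3 interpreter (whose chr() uses the same 0x10FFFF ceiling); the helper name is_valid_code_point is invented here:

```python
# The highest code point, U+10FFFF, fits in four UTF-8 bytes.
highest = chr(0x10FFFF)
encoded = highest.encode("utf-8")
print(len(encoded), encoded.hex())   # 4 f48fbfbf


def is_valid_code_point(value):
    """True if Python's chr() accepts the value (it rejects anything above 0x10FFFF)."""
    try:
        chr(value)
        return True
    except ValueError:
        return False


print(is_valid_code_point(0x10FFFF))  # True
print(is_valid_code_point(0x110000))  # False
```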
-
Re:Mrrp, wrong
Sorry, but you're wrong, as is that FAQ.
ISO 10646 != Unicode, and UCS-1 != UTF-8.
UCS allows 31-bit character codes, Unicode however only allows up to 0x10FFFF, which is a little over 2^20. UCS characters may occupy up to six bytes, but according to this page, "All three encoding forms [UTF-8, UTF-16, UTF-32] need at most 4 bytes (or 32-bits) of data for each character."
BTW, it's important to recognise the difference between scalar values, which are the numerical values assigned to characters, and encodings (UTF-8 and so on), which are just ways of encoding those scalar values with different levels of memory efficiency, ease of parsing etc. Every encoding covers the same range of scalar values (ie. all of them).
(Unfortunately the official Unicode standard is only available in dead tree form, so it's kinda hard to give relevant links...)
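In lieu of a link, the scalar-value-versus-encoding distinction can be illustrated with a short Python sketch. The three sample values and the helper name encoded_lengths are arbitrary choices for this example; the point is only that the same scalar value survives a round trip through each encoding while the byte counts differ.

```python
# Arbitrary samples: a Latin letter, a CJK ideograph, a character beyond U+FFFF.
samples = [0x0041, 0x6D77, 0x1D11E]
codecs = ("utf-8", "utf-16-be", "utf-32-be")


def encoded_lengths(code_point):
    """Return {codec: byte length} after checking a lossless round trip."""
    ch = chr(code_point)
    lengths = {}
    for codec in codecs:
        data = ch.encode(codec)
        assert data.decode(codec) == ch   # same scalar value comes back
        lengths[codec] = len(data)
    return lengths


for cp in samples:
    print(f"U+{cp:04X}", encoded_lengths(cp))
```

None of the three samples needs more than four bytes in any of the encodings, in line with the quoted sentence.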
-
Characters, not glyphs!
I think, frankly, that this report is rubbish. The purpose of Unicode is NOT to provide a full listing of all possible glyphs; it is to provide a list of characters. The author of the report appears to me to have made a reasonably common mistake when reading through the Unicode spec; he sees one of the Unified Han characters, says "Ha! That looks nothing like the character in !" and assumes that Unicode is some Western pigheaded colonialist rubbish.
For a more complete discussion, which summarises more accurately the way to use the Unified Han character section of the Unicode specification, trot off to here. Particularly read the section on "why were the characters unified". Unicode isn't perfect, but the Unified Han system is a good attempt to minimize bloat in the character tables.
p.s. Those dudes from the Klingon Language Institute have been trying to get themselves a spot in the Unicode tables for ages and have recently had their application rejected :-( see here
-
This is so wrong
The author of the article and the guy who submitted the story clearly don't have a clue about Unicode. Unicode can encode over one million characters, as stated here.
Unicode may have its problems, but this is not one of them.
-jfedor
-
More Flamebait :)
Maybe if people didn't try to get character sets like Klingon, Cirth and Tengwar added into unicode we wouldn't have this problem!
-
Duh.
This should be obvious to anyone who has ever looked at a unicode chart or has had to click "Cancel" when asked to install character support for any of the myriad languages that need language packs to be displayed in Windows. Ok, so they built a way to theoretically support all of these characters. This does not mean that I can read Japanese, however, and making it possible to see it in my browser will not change that fact, nor will it make Google searchable in Japanese, cause IRC to support katakana or hiragana characters (and just freaking forget kanji unless you want to chat with a graphics tablet). Unicode has purposes (besides making it easier to hack web servers, that is), but the hopes and dreams built around it are a classic case of throwing tech at a social barrier to try and make it go away.
-
Re:Is this a big deal?
One problem Konqueror has is with font handling. For example, any page encoded in UTF-8 is butt-ugly in Konqueror until you set the font size to be precisely 15 points.
Pages encoded in Japanese are ugly unless you set the font size to 16 points.
Mozilla works around this problem differently. Instead of "we will scale the closest sized font we have, even if it looks ugly", it does "we will resize the letters to the closest font size available, without scaling", which gives much more satisfactory results.
- Sam
-
Good, but they'd be better using unicode
From the site (to be precise, http://users.shore.net/~ndm/java/mmexplorer/mmset.html):
You will need a browser that supports the "font face" HTML command and has access to the Symbol font. [...] The formula "j R j" should show up as "phi arrow phi". If you see "jRj" or if you see some kind of dark diamond between two phi's then you will not be able to view these pages properly.
This is a particularly bad way of displaying mathematical formulae, because the meaning of the text depends in a very messy way on the layout (i.e., what font it is in). It shouldn't be the case that just looking at a formula in a different font renders it completely meaningless.
The pleasant way to use mathematical symbols online is using Unicode. The unicode character set, which is supported by all common web browsers including Netscape 4, contains all the symbols a mathematician could want (indeed, arguably, all the symbols anyone could want), such as GREEK SMALL LETTER PHI, RIGHTWARDS ARROW, DIAMOND OPERATOR, LEFT NORMAL FACTOR SEMIDIRECT PRODUCT etc..
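As a rough illustration, the named characters above can be looked up with Python's standard unicodedata module and turned into numeric HTML character references. This is only a sketch: the variable names are made up, and it assumes the Python build's character database includes these names (they have been in Unicode for a long time).

```python
import unicodedata

# Character names as given in the comment above.
names = [
    "GREEK SMALL LETTER PHI",
    "RIGHTWARDS ARROW",
    "DIAMOND OPERATOR",
    "LEFT NORMAL FACTOR SEMIDIRECT PRODUCT",
]

references = {}
for name in names:
    ch = unicodedata.lookup(name)            # raises KeyError for an unknown name
    references[name] = f"&#x{ord(ch):X};"    # numeric HTML character reference
    print(f"U+{ord(ch):04X}  {references[name]}  {name}")
```

Written this way, "phi arrow phi" becomes a sequence of three distinct characters whose meaning does not depend on which font the reader has selected.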
If a browser doesn't have a particular symbol, the user will see a mark that shows that a character is missing. What they won't see is characters which are semantically different, like "R" instead of RIGHTWARDS ARROW. If the user saves the page as a text file, the maths symbols will still be present and retain their meaning.
For more complicated mathematical expressions, the way to go is MathML. However, most browsers other than Mozilla can't support this yet, though you may be able to get plug-ins. Nevertheless, anything has to be better than encoding semantic information through font choice.
-
Question about Keyboard PDAs w/Unicode
Myself, I'd use a pocket computer mainly for use in libraries, since I'm a scientist and in the library it's a real pain in the rear that you have to lock up your notebook in the reading room because it's too clumsy to carry around and you don't want it stolen. So I need a pocket device that I can carry around comfortably with which I can type text comfortably as well, which rules out most pen-type ("pocket pc"-esque) PDAs.
Since I'm in Islamic studies, I make extensive use of non-Roman characters, including Arabic as well as Roman transcription for Arabic. Hence, I need a PDA that supports Unicode as well, and in a useable version (i.e. with a unicode-based word processor, which rules out EPOC and PalmOS)
The only pocket OS that does Unicode fairly well and is in the market already is Windows CE, I'm ashamed to say, and the only CE PDA with a keyboard is the $800 HP Jornada 700 series (the older 680 and 690 as well), all of which are damn expensive.
So, does anyone know of another keyboard PDA with an operating system that supports Unicode that is either available already or under development? Will the pocket versions of Linux support Unicode? (The Qt-based developments should, I think.) Has anyone ever used Unicode on a PDA? I'd be really interested.
-
Re:Character Sets...
What about character sets? I'm sure Unicode won't be able to handle all the alien characters. We'll probably need 32-bit characters.
We already have UTF-32.
-
Find a Unicode editor
Personally I like Bell Labs' Plan 9 sam editor (Unix and Windows versions are available) -- it handles Unicode text using a nifty ed-like command language. You'll need a Unicode editor, because according to the Unicode pipeline, several musical scripts are proposed for inclusion in Unicode:
- U+1D000..U+1D0F5 - Byzantine Musical Symbols
- U+1D100..U+1D1DD - Western Musical Symbols
FYI, other Unicode editors for Unix are available, e.g. yudit. Good luck!
-
Re:Note
Actually, the original APL symbols are in Unicode! Starting with the "I-beam" character at hex value 0x2336. The section they are under is called "Miscellaneous Technical".
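If you want to check this without the printed charts, Python's unicodedata module can name the character at that position. A minimal sketch, assuming a Python 3 interpreter; the variable name i_beam is just for illustration:

```python
import unicodedata

i_beam = "\u2336"
print(f"U+{ord(i_beam):04X}", unicodedata.name(i_beam))
# U+2336 APL FUNCTIONAL SYMBOL I-BEAM
```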
-
Re:Writing obfuscated Perl
it's not APL if you don't have the domino and triangle in your character set
Fortunately, the APL symbols are included in Unicode! Starting with the I-beam at hex 0x2336. The code charts are here.
-
How will USians type them?
Every keyboard and OS in the world supports ASCII (positions U+0000 to U+007F of Unicode 3). Not every keyboard and OS supports Unihan (roughly U+4E00 to U+9FFF). One generally has to buy CJK input support for common consumer operating systems.