Domain: unicode.org
Stories and comments across the archive that link to unicode.org.
Comments · 276
-
Re:Looks like this
Note that according to the Unicode roadmap, tengwar and cirth are tentatively allocated in the range 0x13000-0x130ff, rather than 0x1cc00-0x1ccff as in the proposal.
-
What XML REALLY is....
It's not a data format.
It's not a framework.
XML is a badly-formed roman numeral.
It should probably be written "MXL".
But even that might be a problem. You might need to use the Unicode Standard symbols: U+2169, U+216F, U+216C
-
Re:my reasons.......
You can use a 64k lookup table. Fast and easy.
No, you can't. As previously mentioned, many languages don't have 1:1 mappings for case conversion. German, for example, maps the lowercase eszett (ß) to capital "SS". That's two glyphs. There are some which require three.
It also doesn't solve problems of languages sharing glyphs, which were also previously mentioned. One might expect, for example, lower-case English "a" and lower-case Greek "alpha" to be different variables, but they map to the same upper-case glyph. Then there's the special Turkic rule for uppercase I and dotted uppercase I.
This is all covered in section 5.18 of the Unicode 4.0 standard.
Interestingly, there are also case folding rules for Deseret which is outside the basic multilingual plane. I think most Mormon programmers speak English, though, so we can probably ignore this for now.
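A rough way to check the points above is to ask Python's built-in string methods, which apply the full (locale-independent) Unicode case mappings. This is only a sketch: the helper name `upper_code_points` and the sample characters are illustrative choices, not anything from the Unicode text.

```python
def upper_code_points(ch):
    """Return the code points produced by uppercasing a single character."""
    return [ord(c) for c in ch.upper()]


# Illustrative samples (assumed for this sketch).
samples = {
    "eszett": "\u00df",                    # LATIN SMALL LETTER SHARP S
    "ffi ligature": "\ufb03",              # LATIN SMALL LIGATURE FFI
    "deseret small long i": "\U00010428",  # outside the BMP
}

for label, ch in samples.items():
    result = " ".join(f"U+{cp:04X}" for cp in upper_code_points(ch))
    print(f"{label}: U+{ord(ch):04X} -> {result}")

# One lowercase code point can turn into two or three uppercase code points,
# so a table with one slot per 16-bit character cannot hold the result.
assert "\u00df".upper() == "SS"
assert "\ufb03".upper() == "FFI"

# Deseret has case pairs above U+FFFF, beyond the reach of a 64k table.
assert "\U00010428".upper() == "\U00010400"

# Python does not apply the Turkic rule: plain "i" uppercases to "I",
# and dotted capital I lowercases to "i" plus COMBINING DOT ABOVE.
assert "i".upper() == "I"
assert "\u0130".lower() == "i\u0307"
```

A Turkish-aware conversion would need locale information on top of this, which is exactly what a flat lookup table has no room for.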
-
Re:wtf? This wasn't automatic?
Take for instance the Unicode standard. It's an open standard, and quite important for internationalisation in our digital age, but you'll pay $74 to get it.
Or just download the free PDFs from the Unicode Consortium's website. You aren't permitted to print them, but I for one can cope with missing out on the dead tree format.
-
Byte != character
This is an unfortunate misconception propagated by C/C++. A character today is a 16-bit unsigned integer.
-
sounds like
Sounds like a job for Unicode.
Unicode.org
-
Re:Translation
Gee... I wish I could speak whale.
Well, it wouldn't really help much in this case.
You just lose too much in the transliteration, since Slashdot doesn't support the "Whale, Northwest Pacific" Unicode character set.
-
Re:c# and Stdin/Stdout anyone?
Unicode will remain limited to U+0 .. U+10FFFF (the highest codepoint UTF-16 can represent via surrogates) for a long time, possibly forever.
Definitely forever; it's inscribed on stone tablets kept in a hidden vault deep below the secret ISO headquarters on the dark side of the moon.
UCS-4 is absurdly wasteful (at least eleven bits per character are never used) so UTF-8 and UTF-16 are both reasonable compressed encodings...
Two points:
1. UTF32 (what the Unicode consortium calls what you call UCS-4) is more efficient than the smaller encodings. Your arrays of 16-bit characters (if you're using UTF16 internally) are likely to be aligned on 32-bit boundaries anyway, so you're actually wasting space using UTF16 if you're using characters outside the BMP. Besides, for most applications, strings aren't that large anyway. Does it really matter if your data is using 100 kb instead of 60 kb?
2. UTF-8 is absurdly wasteful. Not only is it slow to encode and decode, but it also uses more space than UTF16 for any far-Eastern script.
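The size arguments on both sides are easy to measure. Here is a minimal sketch, assuming Python's standard codecs and a hypothetical helper `encoded_sizes`; the sample strings are made up for illustration, and real workloads will mix scripts differently.

```python
def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# ASCII text, Japanese text, and one character outside the BMP.
for sample in ("hello", "\u65e5\u672c\u8a9e", "\U00010400"):
    print(repr(sample), encoded_sizes(sample))

# The top of the code space follows from the surrogate mechanism:
# 0x10000 plus the largest 20-bit value a surrogate pair can carry.
HIGHEST = 0x10000 + (0x3FF << 10) + 0x3FF
assert HIGHEST == 0x10FFFF
```

For the Japanese sample UTF-8 is the largest of the two compact forms (three bytes per character against two), while for the non-BMP character all three encodings take four bytes.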
See also the Unicode FAQ on UTFs.
-
Re:Abolish "intellectual property".
The idea is to create standards that everybody will use so that we can communicate with each other.
And this particular standard has uses far, far beyond just the Internet.
I work at a fulfilment company. So obviously we've done a lot of work integrating our internal systems with our clients' preferred couriers - DHL, UPS and so forth. And when we exchange address and customs-value information, we use ISO 3166 three-letter country codes (e.g. USA, GBR, DEU, FRA) and ISO 4217 currency codes (USD, GBP, EUR).
Now clearly both we and the couriers and the clients who supply us with order information benefit from the availability of these agreed codes, that's the point of standards.
But the amount of re-infrastructuring work we'd have to do to move away from using the ISO codes would be prohibitive (i.e. it could be a company-killing non-profit expense), which hardly seems fair recompense for the amount of work that goes into being compliant in the first place.
I also note that the Unicode Technical Committee has concerns about these standards remaining royalty free and notes that a lot of third-party pre-existing work was used in creating them in the first place.
TomV
-
A pretty keyboard doesn't necessarily solve this
I hope you realize that you can paste all the happy stickers on your keys that you want, or even get all the keyboards with exactly those glyphs that you want already on them, yet still find yourself with nothing usable. What precisely are the codes being delivered by those keys, and how exactly will your system interpret such codes?
Imagine you want to write out Jean-Baptiste Moliere's name correctly--and in all caps, to boot. Now, that first e should carry a grave accent. So do you just find a keyboard with a capital e+grave on it? Let's say that your system interprets a keypress there to mean character number 0xC8. In the ISO 8859-1 (Latin1 for Western European languages) eight-bit encoding, this number is indeed a LATIN CAPITAL LETTER E WITH GRAVE.
So you might appear to be all taken care of. But you aren't. Tomorrow, you decide you'd like to write "correctly" the famous name of the inventor of robots, Karel Capek (aka Karla Capka). That C there should carry a caron, because it's not pronounced "Kapka", but "Chapka". So you go find yourself a Czech keyboard, and lo and behold, it has the proper character!
Are you set? Not at all; to the contrary, now you're in trouble. Because you might well find that the character generated by that key, as recognized by your computer, is also number 0xC8. In the ISO 8859-2 (Latin2 for Eastern European languages) eight-bit encoding, that same 0xC8 is now taken to mean a LATIN CAPITAL LETTER C WITH CARON.
See the problem? If you look at Karel's name in your trusty Latin1 locale, it will be screwed up, and if you look at Jean-Baptiste's under a Latin2 locale, then it will be screwed up. You can't win.
Now, as for the Euro symbol, you're going to have even more (none-)fun, because you aren't going to find a suitable ISO eight-bit encoding that includes it. The 8859's just aren't going to do it for you.
Of course, were this all in ISO 10646 (that is, in Unicode), these particular problems would go away. There, the LATIN CAPITAL LETTER E WITH GRAVE is at U+00C8 (yes, really; the same as in Latin1), but the LATIN CAPITAL LETTER C WITH CARON is at U+010C, a completely distinct numeric code point. This is as it should be, since those really are different glyphs, so they shouldn't share the same numeric representation. On the matter of the Euro for your keyboard, under Unicode, you've even got EURO SIGN sitting there at U+20AC for you.
Even if you tried to go this route, I suspect that you're probably just exchanging one set of problems for another. After all, how well is your system truly set up for you to use Unicode? Can it map keyboard events into appropriate code points? And what about the tools you're using? What are you going to do with it once you have it? Consider the multiplicity of external encodings for the same code points, such as for disk storage, network transfers, etc, that you find in UTF-8, UTF-16LE, etc.
So, I don't think there are answers to the submitter's query that are at all so simple as others have presented the matter here. For the curious, here's a good reference on the mess we're in now, called appropriately enough, ISO Alphabet Soups.
--tom
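The 0xC8 collision described above can be reproduced in a few lines. This is a sketch using Python's standard codec names; the variable names are arbitrary.

```python
raw = bytes([0xC8])  # the single byte the keyboard example delivers

# The same byte means different characters under different 8-bit encodings.
latin1 = raw.decode("iso-8859-1")
latin2 = raw.decode("iso-8859-2")
print(f"Latin1: U+{ord(latin1):04X}  Latin2: U+{ord(latin2):04X}")

# In Unicode the two characters have distinct code points.
assert latin1 == "\u00c8"  # LATIN CAPITAL LETTER E WITH GRAVE
assert latin2 == "\u010c"  # LATIN CAPITAL LETTER C WITH CARON

# Latin1 has no slot for the euro sign at all.
euro = "\u20ac"
try:
    euro.encode("iso-8859-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False

# One code point, several external byte representations.
external_forms = {
    "utf-8": latin2.encode("utf-8"),
    "utf-16-le": latin2.encode("utf-16-le"),
    "utf-16-be": latin2.encode("utf-16-be"),
}
print(external_forms)
```

The last dictionary is the "multiplicity of external encodings" problem in miniature: the code point is settled, but the bytes on disk still depend on which encoding form was chosen.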
-
Re:Unicode
Once those non-fictional languages for which our understanding is in a state that can support Unicode are done,
Why? Why is it more important that we chase down every script once invented by a missionary who managed to translate half the book of Luke into it for a now extinct tribe before we start encoding a fairly well-known and commonly used script?
[and until] we have some idea of the scope of room that will be necessary for the encoding of the remaining current repertoire
We do. Look at the Unicode Roadmaps. Notice that after they've placed every script they could conceive of encoding, there are still large spots open on the SMP, and they don't have the foggiest what's going into planes 4-13. Space is not a problem.
Let's say that a Chinese writer is born who is at least as important as Tolkien. In his works, he uses unencoded (new or not) standard Chinese characters. Are you saying that it is more important to get Tolkien's fictional scripts, which are not the actual medium of his literary work, but are in fact part of the "message" of his literary work, encoded than it would be to get the new characters from the hypothetical Chinese writer, which as postulated WOULD BE part of the actual medium of his work, encoded?
I think your distinction is without point. Any encoding of the Lord of the Rings needs Tengwar and Cirth for the title pages and indexes. And actually, I would think that Tengwar would be more important, as there's people out there writing stuff in Tengwar, whereas depending on the use of this word, it may never appear outside the context of his work.
I realize you don't care what people are using to write unless they have a college degree writing for academic pursuits, or if they happen to live in the wilds of Africa, but actual use is important.
Honestly, if this Chinese Robert Heinlein invented the Chinese word grok, would you be so quick to offer him a new character for a fictional word? Why?
they tend to be "we don't understand the repertoire well enough" or "we don't agree that the proposed repertoire properly represents the script," not "hieroglyphics should never be encoded in Unicode."
So why should Tengwar, a well understood script, wait on something that we don't know enough to encode, and frankly, if two hundred years hasn't done it, possibly we won't ever know enough to encode?
-
Re:This is the reason Unicode is so screwed up
"Feature creep?" You mean like fiction writers inventing new alphabets and languages like Elvish? It's Unicode that's trying to bring some uniformity and saneness to this human condition of Babel.
Does supporting fiction writers inventing new alphabets and languages justify the increased complexity of Unicode? According to a retrospective on a decade of Unicode, increasing the fixed char size to 16 bits was good enough for real world practical work (as opposed to "play"). I put stuff in bold for emphasis:
Lee Collins, now at Apple works with Davis' new character encoding proposals for future Apple systems. One system includes fixed-width, 16-bit characters, under the name "High Text" (in opposition to "Lower Text" ASCII). Collins investigates: ... ...
[Collins says:] "At Apple, we were not easy converts, however. We had some serious issues, both technical and practical. On the technical side:
* Would the increase in the size of text for America and Western Europe be acceptable to our customers there?
* Could the Chinese, Japanese, and Korean ideographs be successfully unified?
* Even then, could all the modern characters in common use actually fit into 16 bits?
Our investigations, headed by Lee Collins, showed that we could get past these technical issues. ...
And, in terms of character count, when we counted up the upper bounds for the modern characters in common use, we came in well under 16 bits.
Moreover, we also verified that no matter how you coded it, a mixed byte character set was always less efficient to access than Unicode was.
We ended up satisfying ourselves that the overall architecture was correct."
So from a practical standpoint, 16 bits could do the job.
Remember also that the FSF, in implementing wchar_t for the GNU C++ compiler, simply added another 16 bits, to make it 32 bits, anticipating a possible change to Unicode. Little did they expect the arrival of variable width encoding.
Something must be said in favor of fixed-width char sizes. Is it not faster to index and process them? Is it not easier to set a fixed buffer size and check buffer overflows, as opposed to constantly fragmenting the heap for the sake of each character?
Are these considerations irrelevant for programmers working with embedded devices, which have small amounts of memory? Give them "Unicode input", and those devices will choke.
-
Re:This is the reason Unicode is so screwed up
Am I the only one unhappy with the current Unicode? The problem is that there's just not one Unicode -- there's THREE (UTF-8, UTF-16, and UTF-32).
No, there is just one Unicode. There are three different ways to represent Unicode data, UTF-8, -16, and -32 as you mentioned.
The Unicode organization seems pretty disciplined to me! To encode all the Asian characters in active use, they needed a character space bigger than 2^16. So, they've chosen a 21-bit system that maps very easily to 16 bits for most of the characters. And UTF-8 maps the whole Unicode space down to 8 bits for the US-ASCII characters.
If a filename is encoded in UTF-8, there's still a maximum number of bytes allowed in the name, it's just that in Unicode that could be a variable number of "characters", though never more than the number of bytes.
Then there's always UTF-32, which just directly maps the 21-bit Unicode character number onto a longint. Easy to deal with those...
and again, UTF-16 was good enough
You mean a previous version of Unicode that was limited to a 16-bit representation? That'd be way back at Unicode version 1. Unicode version 2.0 brought with it the UTF-x encodings and the 21-bit character space. It's been that way ever since!!!
The point is, 16 bits was not enough for the world's actively used languages.
Once expanded past 16 bits, there was enough room for all the active languages, and also room for some lesser used scripts, like Linear B.
They're not even taking ALL possible scripts and languages, even if the proper channels have been used to propose them. For example, Klingon was rejected, even though it can be argued that it is a scholarly language worthy of study.
No, Unicode is a very well planned and thought out standard, and it is now THE standard for international text.
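The filename point can be made concrete with a small sketch. `truncate_utf8` is a hypothetical helper, not any file system's API; it cuts a name to a byte budget and drops a trailing partial character rather than leaving broken bytes behind.

```python
def truncate_utf8(name, max_bytes):
    """Cut name so its UTF-8 form fits in max_bytes, keeping whole characters."""
    encoded = name.encode("utf-8")[:max_bytes]
    # errors="ignore" discards an incomplete multi-byte sequence at the cut.
    return encoded.decode("utf-8", errors="ignore")


name = "citro\u00ebn"
print(len(name), "characters,", len(name.encode("utf-8")), "bytes")

# The character count never exceeds the byte count.
assert len(name) <= len(name.encode("utf-8"))

# A three-byte euro sign either fits entirely or is dropped entirely.
print(repr(truncate_utf8("ab\u20ac", 3)), repr(truncate_utf8("ab\u20ac", 5)))
```

With UTF-32 the same budget check is a plain multiplication by four, which is the "easy to deal with" property mentioned above.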
- Peter
-
From the site.
We will not discuss the Cirth, the angular letters seen in the inscription on Balin's tomb. The Cirth are also called runes, while Tengwar is translated as "letters".
I'm no Tolkien expert, but can anyone tell me if "runes" here correspond to the actual, real world runes, that is, letters of the ancient Runic alphabet?
If they are, then typing them is no difficult feat, given that there are fonts available (as the page I linked to shows), and the fact that the alphabet is already recognised by Unicode 2.0 (here as well it seems, although I'm too lazy to actually check it).
(/.-tters from the Indian sub-continent will, of course, note the irony in being able to effortlessly type obscure ancient and artificial scripts, while struggling for normal, regular, alive Indic languages)
-
Re:This is the reason Unicode is so screwed up
"Feature creep?" You mean like fiction writers inventing new alphabets and languages like Elvish? It's Unicode that's trying to bring some uniformity and saneness to this human condition of Babel.
Your problem is that you're confusing the Universal Character Set (UCS), which is the core of Unicode, with a character encoding, such as UTF-xx and so forth. UTF-16 is NOT Unicode! When will that myth ever die? Perhaps you should go visit the Unicode Consortium home page and read through some of their FAQs.
And there's way more than just three encodings, but there's only one Unicode (actually there's ISO
If these Elvish characters are more than just a curious fad then what's wrong with assigning them Unicode code points? The only problem would be doing so prematurely before all the characters have been reasonably determined and stable. Giving them codepoints allows font designers and other software applications to unambiguously exchange Elvish text. Granted though, the Unicode Consortium is primarily concerned with real human languages rather than inventions of fiction.
As far as encodings, keep in mind that Unicode is essentially a 20-bit character set allowing slightly more than one million separate characters to be defined (I say 20-bits loosely since the UCS codepoints really don't map to bits at all). So even your beloved UTF-16 (or the older UCS-2) is unnecessarily messy; having to use the low and high surrogate pairs to properly encode the entire UCS repertoire. Not to mention things like byte order issues and so forth.
This is why I actually love UTF-8, it is actually very simple and easy to work with. I think a lot of people get scared-off because it is variable-width, but for anybody who has actually coded using it, it is a very nice and easy to use encoding. Of course people primarily communicating in non-Latin languages may have other opinions. That's fine too.
As far as Project Gutenberg selecting US-ASCII, well, it sure looks identical to UTF-8 to me! In fact ASCII text is identical to UTF-8 text (but not the other way around). Now when they start archiving lots of non-English public domain texts, well, they may start rethinking the ASCII limitations and I'd be very surprised if UTF-8 is not the adopted character encoding. In fact they could just make the policy change right now, and they'd have to retype exactly zero documents in their collection.
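Two of the claims above lend themselves to a quick check: the surrogate arithmetic that UTF-16 needs above U+FFFF, and the byte-for-byte identity of ASCII and UTF-8. The sketch below assumes a hypothetical helper `to_surrogates`; the formula itself is the standard UTF-16 one.

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = cp - 0x10000              # a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = to_surrogates(0x10400)
print(f"U+10400 -> {high:04X} {low:04X}")

# The hand-computed pair matches what the UTF-16 codec emits.
assert "\U00010400".encode("utf-16-be") == bytes([0xD8, 0x01, 0xDC, 0x00])

# Byte order is a further choice UTF-16 forces on you.
assert "A".encode("utf-16-le") == b"A\x00"
assert "A".encode("utf-16-be") == b"\x00A"

# Pure ASCII text is already valid UTF-8, byte for byte.
text = "Project Gutenberg"
assert text.encode("ascii") == text.encode("utf-8")
```

The last assertion is why an ASCII archive could be relabelled as UTF-8 without touching a single file.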
-
Unicode
-
Re:The ONLY Universal EBook Format!
I'm happy you brought Project Gutenberg into the discussion. PG is a wonderful resource. When the project was nascent in the 1970s and 80s reading ASCII on your TRS-80 probably seemed pretty neat. But now that the PG dream of preserving and distributing the printed word through digital technology has stagnated into a dogmatic cult with the goal of preserving ASCII it's time to reevaluate the meaning of Project Gutenberg.
Those of us who are literate and computer savvy and have seen places other than the USA recognize the harm that reducing printed material to chunks of ASCII does. And far from mere loss of formatting or typographical embellishments much of the meaning of a text is destroyed when run through the chunky sieve of ASCII conversion. Most accented Roman characters cannot be rendered in ASCII. Non-Roman characters cannot be rendered in ASCII. Typographical features such as relative type size, style, and formatting are either lost entirely or reduced to the low-res rendering capabilities of monospaced ASCII. ASCII has no provision for rendering traditional methods of communicating typographically such as small caps, ligatures, distinction between hyphen, endash, and emdash, etc. despite the fact that virtually every printed text makes use of these features.
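The losses listed above are easy to demonstrate. A minimal sketch follows, assuming a hypothetical helper `ascii_failures` and a made-up sample string; it lists the characters that a conversion to US-ASCII would have to drop or approximate.

```python
import unicodedata


def ascii_failures(text):
    """Return the characters in text that US-ASCII cannot represent."""
    return [ch for ch in text if ord(ch) > 0x7F]


# Accented letters plus an en dash and an em dash (illustrative sample).
sample = "Moli\u00e8re, 1622\u20131673 \u2014 na\u00efve"
for ch in ascii_failures(sample):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# ASCII has a single hyphen-like character; Unicode keeps the dashes apart.
for ch in "-\u2013\u2014":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Features such as small caps or relative type size are a separate matter: those belong to markup or fonts rather than to the character encoding.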
Digital technology has progressed without our friends at Project Gutenberg. There is an alternative to ASCII which is now standard to all major computing platforms: Unicode. From the unicode.org website:
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
Encoding the PG texts in Unicode would require no extra effort on the part of the PG volunteers (well, those who have moved on from their TRS-80s, anyway).
Why not use technology that attempts to accommodate the typographical traditions inherent in your source material rather than reducing that material to fit an obsolete technology?
And even if you still cling to your belief in the infinite beauty, timelessness, and universality of ASCII, please stop using linefeeds every 70 characters within paragraphs. WTF, do you Project Gutenbergers imagine we read these texts on TRS-80s?
-
Re:Klingon in Unicode
Similar complaints have been raised with regard to the Chinese, Japanese and Korean characters which are unified in Unicode.
The idea is that Unicode encodes characters and not glyphs. The same character may have different glyphs, for example the difference between traditional and simplified Chinese. It has spawned a lot of controversy, and personally I can understand both sides of the argument. But this is the way it works in Unicode, and I don't see anyone else coming up with a better suggestion on a standard character encoding scheme.
I'd say that probably exactly the same reasoning lies behind the codification of the runic characters.
-
Re:Close but not quite.
No they shouldn't. Plain and simple. Case-insensitivity has no business in a file system.
Allow me to expand a little on why this is the case:
Case-insensitivity is a complicated business as soon as you leave the simple domain of the English language, and this is the reason you usually only hear English-speaking people wanting case-insensitive file systems.
An example: German has a letter ß, which in upper case becomes SS. tschüß -> TSCHÜSS. Now, when lowercasing, you can't just map SS to ß, instead it becomes ss. I.e. TSCHÜSS -> tschüss.
Do you start to realise the implications this has on a case-insensitive file system? (the question to answer is: are "tschüß" and "tschüss" considered to be the same file?)
It gets worse. In French, as spoken in France, the letter ë is converted to uppercase E. I.e. citroën -> CITROEN. But in Canadian French, it becomes Ë. I.e. citroën -> CITROËN.
When you start to bring in other languages, for example the Japanese full-width and half-width Latin characters, it starts to get really messy.
In order to handle all of this in a case-insensitive file system the file system itself needs not only to be aware of the intricate details of character encodings and casing for different languages, every single file system operation would also have to look at the currently selected locale in order to determine whether two names are equivalent or not. If you believe this is simple, read the FAQs at the Unicode site and you will never again suggest that the file system should be case-insensitive.
However, making a user application work independently of case in file names is a reasonable idea. However, it would have to be specified by the UI framework, for example Gnome. I'm not sure exactly if that idea would work at all since I haven't given it much thought.
I'm so happy the Unix file system is case-significant.
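To see how much depends on the rule chosen, here is a sketch with two hypothetical comparison keys, `fold_simple` and `fold_full`. Neither is what any real file system does; they only show that "the same name" changes with the folding rule.

```python
import unicodedata


def fold_simple(name):
    """Naive key: plain lowercasing."""
    return name.lower()


def fold_full(name):
    """Broader key: compatibility normalization followed by case folding."""
    return unicodedata.normalize("NFKC", name).casefold()


eszett_name = "tsch\u00fc\u00df"  # spelled with sharp s
ss_name = "tsch\u00fcss"          # spelled with double s

# Uppercasing and lowercasing again does not return the original spelling.
assert eszett_name.upper().lower() == ss_name

# Whether the two spellings name "the same file" depends on the key.
print(fold_simple(eszett_name) == fold_simple(ss_name))
print(fold_full(eszett_name) == fold_full(ss_name))

# Full-width Latin letters only compare equal to ASCII under the broader key.
print(fold_simple("\uff21") == "a", fold_full("\uff21") == "a")
```

Neither key handles the locale-dependent cases (French versus Canadian French, Turkish), which would need the user's locale passed into every comparison.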
-
Re:They make their IT folks out of iron there!
scripsit HarveyBirdman:
Hey, you know those clicking noises in the Bushman language? Are there HTML codes for those?
Well, checking the Unicode pages, I find:
- LATIN LETTER DENTAL CLICK: ǀ
- LATIN LETTER LATERAL CLICK: ǁ
- LATIN LETTER ALVEOLAR CLICK: ǂ
- LATIN LETTER RETROFLEX CLICK: ǃ
So yes, there are. I note, however, that in Zulu orthography the first two and the last can be rendered with c, x, and q, respectively. Cake. See the Latin Extended-B code chart.
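For the "HTML codes" part of the question, a numeric character reference can be built from the code point alone. The sketch below uses a hypothetical helper `char_ref` together with Python's standard `html` and `unicodedata` modules.

```python
import html
import unicodedata


def char_ref(ch):
    """Return the hexadecimal HTML character reference for one character."""
    return f"&#x{ord(ch):X};"


# The four click letters sit at U+01C0 through U+01C3.
for cp in range(0x01C0, 0x01C4):
    ch = chr(cp)
    print(char_ref(ch), unicodedata.name(ch))

# Round trip: the reference decodes back to the same character.
assert html.unescape(char_ref("\u01c0")) == "\u01c0"
```

Whether a browser then shows anything useful still depends on having a font that covers Latin Extended-B.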
-
Re:Best quote from the article
No need to clutter up the ASCII table with more special-purpose characters... languages 100 years from now will almost certainly use Unicode characters for some identifiers and operators. No more confusion about '!=' versus '<>' in different languages. In fact, someone will probably add this to C++ and perl within the next ten years.
Some possibilities:
- \u2200 and \u2203: "For All" and "There Exists" operators
- \u2261: "Identical To" (like Java's '==' versus 'equals()' distinction)
- \u2264 and \u2265: less-than-or-equals, greater-than-or-equals
Of course, programmers will still have to type using QWERTY keyboards for a while, so there will be shortcuts, escape-sequences, etc. just as at present...
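The shortcut idea can be sketched as a simple translation layer. The table `ASCII_SPELLINGS` and the function `to_ascii_operators` are hypothetical names for this illustration, and no existing language is claimed to work this way.

```python
# Map Unicode operator characters to the ASCII spellings typed today.
ASCII_SPELLINGS = str.maketrans({
    "\u2260": "!=",  # NOT EQUAL TO
    "\u2264": "<=",  # LESS-THAN OR EQUAL TO
    "\u2265": ">=",  # GREATER-THAN OR EQUAL TO
})


def to_ascii_operators(source):
    """Rewrite Unicode comparison operators into their ASCII equivalents."""
    return source.translate(ASCII_SPELLINGS)


line = "if a \u2264 b and b \u2260 c:"
print(to_ascii_operators(line))
```

An editor could run the reverse mapping for display, so the file on disk and the keys on a QWERTY keyboard stay ASCII while the screen shows the mathematical symbols.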
-
Unicode for the next version?
I am looking forward to the Unicode version of whitespace. This would truly demonstrate the expressiveness of Unicode as it has several whitespace characters.
$ grep ";WS;" UnicodeData.txt
000C;<control>;Cc;0;WS;;;;;N;FORM FEED (FF);;;;
0020;SPACE;Zs;0;WS;;;;;N;;;;;
1680;OGHAM SPACE MARK;Zs;0;WS;;;;;N;;;;;
2000;EN QUAD;Zs;0;WS;2002;;;;N;;;;;
2001;EM QUAD;Zs;0;WS;2003;;;;N;;;;;
2002;EN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2003;EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2004;THREE-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2005;FOUR-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2006;SIX-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2007;FIGURE SPACE;Zs;0;WS;<noBreak> 0020;;;;N;;;;;
2008;PUNCTUATION SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2009;THIN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
200A;HAIR SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2028;LINE SEPARATOR;Zl;0;WS;;;;;N;;;;;
202F;NARROW NO-BREAK SPACE;Zs;0;WS;<noBreak> 0020;;;;N;;;;;
205F;MEDIUM MATHEMATICAL SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
3000;IDEOGRAPHIC SPACE;Zs;0;WS;<wide> 0020;;;;N;;;;;
-
Re:Yes, and partly language designers' doing
Actually...
It may not be well known, but Ruby is ten years old today (24 February 2003). Perl is 16 years old (1987). I'm not sure how old Python is. Unicode 1.0 was fully released in June 1992, five years after the project started (mostly at Apple, apparently).
Except for a few "gotchas" (which are, per Matz, part of what will be cleaned up on the way to Ruby 2.0), Ruby supports a lot of Unicode (UTF-8 at least) already, as well as other multibyte character systems (Shift-JIS and EUC, primarily, as it was originally written in Japan and is still primarily maintained there).
The point is that Unicode is still immature (4.0 is on the way, but how many systems yet support 3.2?). I suspect that Ruby 2.0 will be fully Unicode compliant. (Even with Unicode compliance, of course, one still has to write programs to be compliant; language Unicode compliance is merely support.)
-austin
-
"Unicode" is more than the BMP
Unicode contains merely the lower sixteen bits of the UCS (Universal Character Set), aka ISO 10646. UCS defines a 31-bit character set; the lower 65534 positions, which Unicode dupes, is the Basic Multilingual Plane (BMP) or Plane 0.
You're confusing Unicode with UTF-16. Unicode covers the entire defined UCS code space: "the Unicode standard and ISO/IEC 10646 now support three encoding forms that use a common repertoire of characters but allow for encoding as many as a million more characters."
But here's something I'm curious about, from the same page:
For example, a group of choreographers may design a set of characters for dance notation and encode the characters using code points in user space.
Doesn't dance notation require just four characters, left down up right?
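To make the Unicode-versus-UTF-16 distinction above concrete, here is a minimal Python sketch. The sample character (a Deseret letter from Plane 1) and the variable names are illustrative choices, not anything taken from the quoted page. It shows one code point outside the BMP in the three encoding forms, and how the UTF-16 surrogate pair is derived.

```python
# Sketch: one code point above U+FFFF, seen through three encoding forms.
ch = "\U00010400"  # DESERET CAPITAL LETTER LONG I, in Plane 1 (outside the BMP)

utf8 = ch.encode("utf-8")        # four 8-bit code units
utf16 = ch.encode("utf-16-be")   # two 16-bit code units (a surrogate pair)
utf32 = ch.encode("utf-32-be")   # one 32-bit code unit

# Derive the surrogate pair by hand to show where the UTF-16 bytes come from.
offset = ord(ch) - 0x10000
high = 0xD800 + (offset >> 10)    # high (lead) surrogate
low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate

print(hex(ord(ch)), utf8.hex(), utf16.hex(), utf32.hex())
print(hex(high), hex(low))
```

The character is a single Unicode code point in all three cases; only the number of code units differs. A 16-bit unit is a UTF-16 detail, not the size of the Unicode code space.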
-
Talk to Unicode Consortium first
I want to Google in Quenya, Sindarin, and possibly the Black Speech, darn it.
Those three languages use tengwar as their native writing system. Tengwar doesn't even have a Unicode block (but it's proposed), let alone support in Windows.
-
Re:Reputation
No shit. His last article was libelous, and several Slashdot readers turned up the truth that his employer was working on a proprietary competitor to the Unicode standard.
How about this little snippet?
[...]being a 16-bit character definition allowing a theoretical total of over 65,000 characters. However, the complete character sets of the world add up to approximately 170,000 characters.
This person does not even do the most cursory research on his subjects. For the uninformed, Unicode assigns a unique address to every human character (i.e., letter, kanji, hieroglyph). The entire code range is 32-bit (4,294,967,296), with various text formats for addressing those codes (UTF-8 and UTF-16 being the most popular).
This person is, at best, an attention seeker. He's more likely a very public troll. -
Re:Well, I dunno
No, that's the case. Some characters do have their forms changed in simplified characters, but mostly it's the *shapes* (the technical term is a radical, noun sense 4) that have been simplified. An example is the character for 'fish': The four dots have been changed to a horizontal line in the simplified character. Another typical simplification is the removal of a radical.
For some reason this change is easier going from simplified to traditional - maybe it's more intuitive to deal with something being added than something being removed. Maybe a linguist can tell us why.
-
Re:Use something else
People who fear that a switch from US-ASCII to UTF-8 will break their existing programs should really read the Bell Labs document linked above, section 2.3 of the Unicode spec, or RFC 2044. UTF-8 was designed very carefully to make life extremely easy for people making that exact migration. There are amazingly few circumstances where it even matters that it is variable width. Those people who are suggesting UCS-2, UCS-4, etc. as alternatives in order to solve the nonexistent problem of UTF-8's variable-width nature should really take a closer look at it.
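A small Python sketch of the two properties that make the migration painless. The sample strings and variable names are made up for illustration. ASCII text is byte-for-byte unchanged in UTF-8, and no byte inside a multi-byte sequence can be mistaken for an ASCII character, so byte-oriented code that looks for ASCII delimiters keeps working.

```python
ascii_text = "path/to/file.txt"
mixed_text = "naïve/日本語/file.txt"

# 1. Pure ASCII text encodes to exactly the same bytes in US-ASCII and UTF-8.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# 2. Every byte belonging to a non-ASCII character has its high bit set,
#    so a byte below 0x80 always means the ASCII character itself.
high_bit_only = all(
    b >= 0x80
    for c in mixed_text
    if ord(c) > 0x7F
    for b in c.encode("utf-8")
)

# Consequence: a byte-level split on an ASCII delimiter never cuts a
# multi-byte character in half.
parts = mixed_text.encode("utf-8").split(b"/")
decoded_parts = [p.decode("utf-8") for p in parts]

print(same_bytes, high_bit_only, decoded_parts)
```

Variable width only starts to matter when code indexes by byte offset and assumes that is a character offset, which is one of the few circumstances mentioned above.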
-
Re:Unicode
The Klingon alphabet was disapproved for inclusion in Unicode in May 2001.
-
Re:Unicode
2. The trend for english to become the "standard" language world-wide
That's the part I'm worried about.
Unicode is backwards-compatible with ASCII, so the legacy/source code argument is irrelevant. There are already compilers available (such as Vector Pascal) that interpret Greek, Cyrillic, Katakana and Hiragana. Heck, by 2012 I want to be able to code in Klingon! -
Re:Keyboard/Mouse sub-categories
My prediction: I see, you see, we all see ASCII. Yep, plain text will still be there.
Good grief... If the majority of the Computing Universe hasn't standardized on Unicode by 2012 I will have no hope for Humanity... -
Re:Curly vs. straight quotes.
Here's what I could find: [a list of names of Unicode characters]
The name of a Unicode character never tells you everything about it. In this instance, Apostrophe is merely a legacy backlash - if you were reading the standard (available for free at the Unicode website), you would see the note below apostrophe noting this, and the note below the single right quote noting that it was the real apostrophe. -
Re:Hypocrit
I'm not sure about this page in particular, but I am sure that Netscape 4.08 (the browser) cannot display NCR unicode characters without a charset to help it along. (btw. 4.79 is the suite number, not the browser)
-
Re:ASCII Only?
SGML? How about just straight-up UTF-8?
-
Re:New Newline Character?
Each of them has a different function. 000A and 000D are for compatibility with ASCII. 0085 is for a unified character to replace the 000D 000A pair used on some OSes. However, some programs (e.g. Notepad) use line breaks when they really mean paragraph separators, so Unicode defined two codes which mean REAL line separator and REAL paragraph separator. This report explains it quite clearly.
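As a rough illustration in Python (the sample string is invented), the standard library treats all of the characters mentioned above as line boundaries in `str.splitlines()`, and `unicodedata` shows that only U+2028 and U+2029 have dedicated separator categories while the others are plain control characters.

```python
import unicodedata

# LF, CR LF, NEL, LINE SEPARATOR and PARAGRAPH SEPARATOR in one string.
text = "one\ntwo\r\nthree\x85four\u2028five\u2029six"

# str.splitlines() recognises every one of them; CR LF counts as one break.
lines = text.splitlines()

# General categories: Cc = control, Zl = line separator, Zp = paragraph separator.
categories = {
    f"U+{ord(c):04X}": unicodedata.category(c)
    for c in ("\n", "\r", "\x85", "\u2028", "\u2029")
}

print(lines)
print(categories)
```

Note that the split itself does not record which kind of break was seen. A program that cares about the line/paragraph distinction would have to inspect the characters rather than just split on them.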
-
Re:So?
0x85 is à (a grave). So everyone in France?
No, you're looking at the extended ASCII chart. What this is talking about is Unicode. A Unicode 0x0085 is the control character NEL (http://www.unicode.org/charts/PDF/U0080.pdf, page 3). NEL is NExt Line.
Chris Beckenbach
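A quick Python check of the point above. It assumes the "extended ASCII chart" in question is IBM code page 437, which is a guess on my part; other 8-bit code pages put different characters at 0x85. The same byte value is a letter in that code page but a control character when read as the Unicode code point U+0085.

```python
import unicodedata

raw = bytes([0x85])

# Interpreted through code page 437 (assumed here to be the chart the
# parent was reading), byte 0x85 is a-grave.
as_cp437 = raw.decode("cp437")

# ISO 8859-1 maps each byte straight to the code point of the same value,
# so this yields the Unicode character U+0085.
as_latin1 = raw.decode("latin-1")

print(repr(as_cp437), hex(ord(as_cp437)))
print(repr(as_latin1), unicodedata.category(as_latin1))
```

So a-grave does exist in Unicode, just at U+00E0 rather than U+0085.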
-
Re: There is an Indian Linux distro in development
Indian Linux is your answer. The website says it will be developed in all 18 official Indian languages.
Might be slightly misleading of course; I'm presuming they really meant all 10 ISCII ("Indian Standard Code for Information Interchange") alphabets in transmutation to give, I don't know, 12 or so languages. Will be interesting to see if they later provide for transcribing the Arabic script as well; the website at present seems to be suggesting only native Indian scripts. Not to accuse them of ethnic bias; I'm pretty sure it's plain intellectual laziness.
A More Detailed Explanation:- Hindi, Sanskrit, Marathi and Nepali use the Devnagri script; a few languages such as Konkani and Manipuri use the Roman script and scripts of other languages. Sindhi, Kashmiri and Urdu use the Arabic script (or modifications thereof). Unicode doesn't recognise the Assamese script to be different from the Bengali one, but provides for two additional Assamese-only characters; not sure if ISCII does that as well. (IndLinux's page gives separate keymaps for Assamese and Bengali; I neither speak nor read these languages, so I don't know if they are significantly distinct.) All other languages, namely Gujarati, Oriya, Tamil, Telugu, Malayalam and Kannada, have their own unique scripts.
Tamil is way ahead in implementation though; the Tamil Linux group is very active; the website says you can use Tamil in Mandrake 9.0. Can't read Tamil myself, but the KDE snapshots provided look extremely cool to me.
-
Unicode lacks smileys :-/
Seems that Redhat and others are moving to the UTF/Unicode which should solve lots of problems in char encoding.
Unfortunately, there are only T H R E E smileys in Unicode when people nowadays use tens of them in daily text. It looks like Unicode is going to be another version of ASCII - a collection of junk that nobody uses.
:-) -
Re:Can someone explain what "i18n" is?
actually, unicode decided not to encode the klingon characters for several reasons. the movie producers don't use them consistently and the creators drew two very different alphabets, neither of which is widely used by the fanfic community which uses the latin transliteration. this is documented on the proposed characters page.
proposal to encode klingon -
Perl6 + Unicoded Operators = APL?
In memory of the original python/perl parrot
Despite your major efforts at rationalization, Perl6 looks to be just as complex as Perl5, if not more so, when it comes to the human reader's interpretation of the meaning of the combinations of punctuation marks, brackets, etc. in Perl6 source code.
Why not just be done with the concept of multi-punctuation operators and just map each of the operators onto one of the many single Unicode characters available? Imagine the money the Perl institute could make from the sale of keyboards.
-
Re:Arial Unicode MS Equally Important
Arial Unicode was available for download with a click-through licence that basically required you to say "I own a copy of FrontPage 2000". I don't have a downloaded version to hand to check the exact words. I noticed it had been removed from the MS web site a few weeks ago and I just assumed that they had re-organised their web site (they have never seemed to care about the persistence of URIs...), but I guess I was wrong and it was a conscious decision to remove it.
The reason why Arial Unicode is (was?) important is that as far as I can make out it's the only way to put several languages on the web (using Unicode), specifically Indic ones, including Punjabi and Gujarati. The site that uses these languages I worked on can be found here.
There is no support for Indic languages in X11 (or OSX AFAIK). Gnome2 and Pango should fix this though :-)
Windows still has the best internationalisation support (most languages), but a default Red Hat install with the latest Mozilla is getting very good -- all the demo languages on the Unicode web site work with no problems and also all the UTF-8 samples on this page work -- this is better than Windows 2000 (I have not tried with later versions).
-
Re:ASCII Centric
-
Re:A 2-letter "country code"...
Koreans don't seem to like Unicode either. Most e-mails I receive are encoded as "euc-kr"...
related MLP:
- broad overview (Netscape FAQ)
- better introduction (*nix Unicode FAQ)
- Unicode homepage
Unfortunately, Unix Unicode implementations appear to use UTF-8 (see here) which is rather inefficient for non-ASCII encodings...
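To put rough numbers on "inefficient", here is a small Python sketch comparing encoded sizes. The sample strings and the helper name `byte_sizes` are my own, and real-world text mixes ASCII markup and whitespace with Hangul, so actual ratios will vary.

```python
def byte_sizes(text):
    """Return the encoded length of text, in bytes, for three encodings."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "euc_kr", "utf-16-le")}

samples = {"English": "Hello, world", "Korean": "안녕하세요"}

for label, text in samples.items():
    print(label, byte_sizes(text))
```

For pure Hangul, UTF-8 needs three bytes per syllable against two in EUC-KR or UTF-16, while for ASCII it is UTF-16 that doubles the size. Whether that counts as inefficient depends on how much of the data is actually non-ASCII.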
This problem appears to be bigger than the internet; it is deeply rooted in the C library itself... -
Indeed, it's not free
The mention of M$ Word put me on alert, as have previous stories here which have demonstrated that XML will simply be a container for proprietary data formats like M$ Word. Closer examination, however, reveals a much more horrible arrangement.
XML is dependent on Unicode, as the US Government site's reference states. Follow the W3C to Unicode:
Unicode is required by modern standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, etc., and is the official way to implement ISO/IEC 10646.
Unicode is owned by Unicode Incorporated and all of its documents and standards are issued under a restrictive license with a unilateral change clause:
Modification by Unicode. Unicode shall have the right to modify this Agreement at any time by posting it to this site. The user may not assign any part of this Agreement without Unicode's prior written consent.
Dare I compare this evil arrangement to ASCII and other predecessors? To have IBM, M$, Sun and others OWN the very format your data takes and to be able to change it and break previous implementations at whim, and YOU may not? Who wants to bet a plump nickel that anything vaguely resembling Unicode in the future will be called a "derivative" and its distribution halted? Is this not a collusion of commercial software vendors to control information at its most basic representation? Does anyone else here see this as the ultimate extension of copyright? Evil, Evil, Evil.
I'd rather see the US government continue to publish in the American Standard Code for Information Interchange. This extensible standard is no standard at all.
-
Re:♬ First musical post! ♫
By the way, have you ever considered using Byzantine musical notes, or glyphs from the full set of notes?