Domain: unicode.org
Stories and comments across the archive that link to unicode.org.
Comments · 276
-
FUD. The Unicode codespace is open.
Unicode.org has charts of the entire Unicode codespace (yes, including Chinese) in both PDF and HTML formats. There's also an ISO/IEC standard that mirrors the Unicode standard. Heck, the Unicode book (over 1,000 pages) is only $50, less than the cost of many college textbooks half the size.
<O
( \
XGNOME vs. KDE: the game! -
How it's supposed to work
Recently I posted this comment mentioning that there's really no reason why a domain such as www.中国.com (you should see two Chinese ideograms meaning "China" between the "www." and the ".com" parts; further, if you click on this link, your browser should open a window telling you that the domain "www.中国.com" does not exist, with the same two Chinese ideograms) shouldn't exist.
Let us recall: first, as specified by the HTML specification, every HTML document, no matter what character encoding it is stored in, is written in the all-encompassing Unicode character set. So when you write something like "&#20013;&#22269;" in HTML, it refers to the Unicode characters (decimal) 20013 and 22269, no matter what the current character encoding and font are. So that's how you write the link text. Second, as for the URL itself: although it is not (as far as I know) formally recommended by an Internet standard, it is widely recognized that non-ASCII characters in URLs are encoded as UTF-8 and the resulting bytes are then %-encoded into ASCII.
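As a rough illustration of the two steps above (characters identified by Unicode number, then UTF-8 followed by %-encoding), here is a minimal Python sketch. The choice of Python and of the standard library's urllib.parse.quote is mine; it is not something the HTML specification or any draft prescribes.

```python
from urllib.parse import quote, unquote

text = "\u4e2d\u56fd"  # the two ideograms for "China"

# Each character is identified by its Unicode number, whatever the page encoding.
code_points = [ord(ch) for ch in text]

# URL form: encode as UTF-8, then %-encode every non-ASCII byte.
escaped = quote(text, safe="")

print(code_points)               # [20013, 22269]
print(escaped)                   # %E4%B8%AD%E5%9B%BD
print(unquote(escaped) == text)  # True
```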
The whole process is described in this Internet Draft ("Internationalized Uniform Resource Identifiers"; WORK IN PROGRESS!) by Larry Masinter and Martin Duerst where the relationship between URIs and IURIs (Internationalized URIs) is discussed in detail.
The DNS is the toughest part of all. The DNS specification (RFC1034) states (section 3.1) that DNS data is to be taken as binary for possible upward compatibility (this was wonderful foresight on Mockapetris' part!). Consequently, as far as the standards go, there is nothing wrong with using (UTF-8 encoded Unicode) 8-bit data in DNS labels. Except, of course, that many "buggy" implementations will have to be corrected for broken assumptions, *sigh*. The IDNS working group suggests using a UTF-5 encoding to avoid going beyond the current domain name limits: I think this is not a good thing and we should stick to UTF-8 and repair broken software.
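To make the "labels are just binary" point concrete, here is a small Python sketch of the RFC 1035 wire format for a name (each label is a length octet followed by that many raw octets, and a zero octet ends the name). The function name encode_name is hypothetical, and the sketch says nothing about whether any given resolver or registry would accept such a name.

```python
def encode_name(labels):
    """Encode labels in RFC 1035 wire format: length octet + raw octets, zero octet at the end."""
    out = bytearray()
    for label in labels:
        raw = label.encode("utf-8")  # assumption: non-ASCII labels are sent as UTF-8
        if not 0 < len(raw) <= 63:
            raise ValueError("a label must be 1 to 63 octets long")
        out.append(len(raw))
        out += raw
    out.append(0)  # the root label terminates the name
    if len(out) > 255:
        raise ValueError("a name must fit in 255 octets")
    return bytes(out)

wire = encode_name(["www", "\u4e2d\u56fd", "com"])
print(wire.hex())  # 0377777706e4b8ade59bbd03636f6d00
```

Note that the 63-octet label limit is counted in bytes, so a UTF-8 label of ideograms holds at most 21 characters.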
Oh, and incidentally, see this page to find out how broken your browser's Unicode support is.
-
Re:Most of you are missing the point
Ah, one of my favorite pet peeves. You've completely misunderstood the way Unicode works on web pages; but it's not really your fault, it's because Netscape Navigator is completely broken in this respect (it's far more - and far worse - than broken, in fact).
Neither the HTTP headers sent by Slashdot nor the preamble of the HTML file specify a character encoding. Therefore the encoding is the default encoding, i.e. ISO-8859-1 (aka latin-1). What you've written, then, is not "sayonara" but "comma cube comma ae comma E-grave comma c-cedilla". If you see anything else, your browser is broken! You've posted Shift_JIS-encoded data in an ISO-8859-1-encoded page and that doesn't make sense.
Now this does not mean that you can't have Japanese in HTML, even if the page is encoded as ISO-8859-1. Indeed, "at the bottom", every HTML document is written in Unicode, and every Unicode character is available, if not readily through the encoding (not necessarily UTF-8), then at least through SGML numeric entities of the form &#xxxxx; (where xxxxx is the decimal form of the Unicode character number). Consequently, the correct way of posting "sayonara" is "さよなら" (which I've written as "&#12373;&#12424;&#12394;&#12425;"). Again, if you see anything other than the hiragana for "sayonara" here (or perhaps a transcription of it, e.g. with lynx), especially if you see latin-1 characters, again, your browser is broken.
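For anyone who wants to check this at home, here is a short Python sketch of both halves of the argument: producing numeric entities for the hiragana, and reproducing the junk you get when Shift_JIS bytes are read as a Western 8-bit encoding. One assumption to flag: I decode the bytes as Windows-1252 rather than strict ISO-8859-1 so that byte 0x82 shows up as the "comma" mentioned above; under strict Latin-1 that byte is a control character.

```python
import html

sayonara = "\u3055\u3088\u306a\u3089"  # hiragana sa-yo-na-ra

# Numeric entities: decimal Unicode numbers, usable whatever the page encoding is.
refs = sayonara.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(refs)  # &#12373;&#12424;&#12394;&#12425;

# A conforming parser maps the entities back to the same characters.
print(html.unescape(refs) == sayonara)  # True

# Raw Shift_JIS bytes misread as a Western 8-bit encoding.
junk = sayonara.encode("shift_jis").decode("cp1252")
print(junk)
```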
The brokenness about Netscape is that it assumes that numeric SGML character entities are to be interpreted in the current document encoding, and that is completely wrong. They should always be interpreted as Unicode character numbers. So this has somehow led to the conception that the basic HTML character data is in the character set of the encoding, which it is not! Fortunately, Mozilla repairs this brokenness, hopefully before any serious damage is done.
I posted another comment on this article to the effect that you can even have valid Chinese characters (in my example, 中国, i.e. "China" in Chinese) in the host name part of a URL. It just happens that such domain names are not given out, but there is nothing wrong with it.
For more examples of Unicode and to see how badly your browser is broken, follow this link.
Sorry about the rant.
-
Chinese characters in domain names?
Hey, how come they won't let you register domain names with arbitrary Unicode characters in them? Why can't you buy www.中国.com? Yes, this is perfectly valid: the name is UTF-8-encoded and then %-encoded as part of the URL (and the DNS specifications do allow binary data). If I didn't mess it up too much, 中国 (your browser should show this as two Chinese ideograms) means "China" in Chinese (disclaimer: I don't know Chinese).
Before such languages as Chinese and Hindi become truly usable on the Internet, support for the Unicode standard will have to make much progress. Click here to see how badly your browser supports Unicode.
-
Re:Say it ain't so!
demoroniser: DEMORONISER Correct Moronic Microsoft HTML
This page describes, in Unix manual page style, a Perl program available for downloading from this site which corrects numerous errors and incompatibilities in HTML generated by, or edited with, Microsoft applications. The demoroniser keeps you from looking dumber than a bag of dirt when your Web page is viewed by a user on a non-Microsoft platform.
NAME
demoroniser - correct moronic and gratuitously incompatible HTML generated by Microsoft applications
SYNOPSIS
demoroniser [ -u ] [ -w cols ] [ infile ] [ outfile ]
DESCRIPTION
Many slick, high profile corporate Web sites I visit seemed to exhibit terrible grammar completely inconsistent with the obvious investment in graphics and design. Apostrophes and quote marks were frequently omitted, and every couple of paragraphs words were run together which should have been separated by a punctuation mark of some kind.
This remained a mystery to me until I wanted to convert a presentation I'd developed in 1996 using Microsoft PowerPoint into a set of Web pages. A friend was kind enough to run the presentation through PowerPoint's "Save as HTML" feature (I have abandoned all use of Microsoft products, so I did not have a current version of PowerPoint which includes this feature). When I got the PowerPoint-generated HTML back and viewed it in my browser, I discovered that it contained precisely the same grammatical errors I'd noted on so many Web sites, and which certainly were not present in my original presentation.
A little detective work revealed that, as is usually the case when you encounter something shoddy in the vicinity of a computer, Microsoft incompetence and gratuitous incompatibility were to blame. Western language HTML documents are written in the ISO 8859-1 Latin-1 character set, with a specified set of escapes for special characters. Blithely ignoring this prescription, as usual, Microsoft use their own "extension" to Latin-1, in which a variety of characters which do not appear in Latin-1 are inserted in the range 0x82 through 0x95--this having the merit of being incompatible with both Latin-1 and Unicode, which reserve this region for additional control characters.
These characters include open and close single and double quotes, em and en dashes, an ellipsis and a variety of other things you've been dying for, such as a capital Y umlaut and a florin symbol. Well, okay, you say, if Microsoft want to have their own little incompatible character set, why not? Because it doesn't stop there--in their inimitable fashion (who would want to?)--they aggressively pollute the Web pages of unknowing and innocent victims worldwide with these characters, with the result that the owners of these pages look like semi-literate morons when their pages are viewed on non-Microsoft platforms (or on Microsoft platforms, for that matter, if the user has selected as the browser's font one of the many TrueType fonts which do not include the incompatible Microsoft characters).
You see, "state of the art" Microsoft Office applications sport a nifty feature called "smart quotes." (Rule of thumb--every time Microsoft use the word "smart," be on the lookout for something dumb). This feature is on by default in both Word and PowerPoint, and can be disabled only by finding the little box buried among the dozens of bewildering option panels these products contain. If enabled, and you type the string,
"Halt," he cried, "this is the police!"
"smart quotes" transforms the ASCII quote characters automatically into the incompatible Microsoft opening and closing quotes. ASCII single and double quotes are similarly transformed (even though ASCII already contains apostrophe and single open quote characters), and double hyphens are replaced by the incompatible em dash symbol. What other horrors occur, I know not. If the user notices this happening at all, their reaction might be "Thank you Billy-boy--that looks ever so much nicer," not knowing they've been set up to look like a moron to folks all over the world.
You see, when you export a document as text for hand-editing into HTML, or avail yourself of the "Save as HTML" features in newer versions of Office applications, these incompatible, Microsoft-specific characters remain in place. When viewed by a user on a non-Microsoft platform, they will not be displayed properly--most browsers seem to just drop them, as opposed to including a symbol indicating an undisplayable character. Hence, the apparently ungrammatical text, which the author of the page, editing on a Microsoft platform, will never be aware of.
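The character-replacement step is easy to sketch. What follows is not the demoroniser itself (that is a Perl program) but a minimal Python illustration of the same idea: bytes in the 0x80-0x9F region are read as the Microsoft extension where it defines a character, and everything non-ASCII is then written out as a numeric character reference, which is one possible HTML-compliant equivalent. The function names are made up for this example.

```python
def repair_smart_chars(raw: bytes) -> str:
    """Decode bytes as Latin-1, but read 0x80-0x9F as Windows-1252 where defined."""
    out = []
    for b in raw:
        one = bytes([b])
        if 0x80 <= b <= 0x9F:
            try:
                out.append(one.decode("cp1252"))
            except UnicodeDecodeError:
                out.append(one.decode("latin-1"))  # unassigned in cp1252: leave as a control
        else:
            out.append(one.decode("latin-1"))
    return "".join(out)

def to_html_ascii(text: str) -> str:
    """Replace every non-ASCII character by a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

page = b"\x93Halt,\x94 he cried, \x93this is the police!\x94"
print(to_html_ascii(repair_smart_chars(page)))
# &#8220;Halt,&#8221; he cried, &#8220;this is the police!&#8221;
```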
Having no desire to hand-edit the HTML for a long presentation to correct a raft of Microsoft-induced incompatibilities, I wrote a Perl program, the demoroniser, to transform Microsoft's "junk HTML" into at least a starting point for something I'd consider presentable on my site. In addition to replacing the incompatible characters with HTML-compliant equivalents wherever possible (a few rarely-encountered characters which can't be translated result in warning messages if encountered), the following sloppy or downright wrong HTML is corrected.
- The missing semicolon at the end of numeric character escapes (=) is supplied.
- Numeric renderings of special characters (< > &) are replaced with readable equivalents.
- Unquoted <table> tags containing non-alphanumeric characters are quoted.
- PowerPoint's mis-nesting of <font> and <strong> tags is corrected.
- PowerPoint's boneheaded use of <ul> and </ul> tags to accomplish paragraph breaks is corrected and the proper <p> tags inserted.
- Missing <tr> tags in text-only slides are inserted.
- Nugatory </p> tags are removed.
- Unmatched <li> tags in headings are removed.
- Idiot "paragraph-long lines" are broken into something suitable for editing with a normal text editor.
-w cols
Wrap output lines at column cols. By default, lines are wrapped at column 72. A cols specification of 0 disables line wrapping. demoroniser attempts to wrap lines so as to preserve their meaning. Lines are broken at white space whenever possible. If this cannot be done, a line longer than the cols specification will remain in the output HTML.
BUGS
demoroniser is a Perl script. In order to use it, you must have Perl installed on your system. demoroniser was developed using Perl 4.0, patch level 36.
FILES
If no outfile is specified, output is written to standard output. If no infile is specified, input is read from standard input.
SEE ALSO
perl(1)
Download demoroniser.zip
AUTHOR
John Walker
http://www.fourmilab.ch/
This software is in the public domain. Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee is hereby granted, without any conditions or restrictions. This software is provided "as is" without express or implied warranty.
by John Walker
January 16th, 1998 -
Hieroglyphs and More!
Hieroglyphs are boring. The real news is that, according to the Unicode Proposed Characters Page, the Klingon alphabet, and the Tengwar and Cirth runes (from Tolkien) are under investigation for inclusion in Unicode.
---
Zardoz has spoken! -
Read the spec again, sparky
Unicode can be extended by 16 additional 64K segments by pairing two of the reserved surrogate code values. That's about 1 million extra characters. RTFFAQ.
We don't know how bad things are in north korea, but here are some pictures of hungry children. -- CNN -
Re:Unicode
UTF-16 uses two-byte objects, like UCS-2. Most UTF-16 characters are a single object. The difference is that in UTF-16 you can use "surrogate pairs" (two objects) to refer to characters outside the Basic Multilingual Plane (BMP). UCS-2 simply doesn't permit references outside the BMP.
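The surrogate arithmetic is short enough to show. A minimal Python sketch follows; the function name to_utf16_units is made up for illustration, and the ranges used are the standard ones (high surrogates from 0xD800, low surrogates from 0xDC00, 10 bits carried by each).

```python
def to_utf16_units(cp):
    """Return the UTF-16 objects (one or two 16-bit values) for a code point."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if cp < 0x10000:
        return [cp]  # BMP: a single object, same as UCS-2
    if cp > 0x10FFFF:
        raise ValueError("beyond the range reachable with surrogate pairs")
    v = cp - 0x10000  # 20 bits left to split between the two surrogates
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]

print([hex(u) for u in to_utf16_units(0x4E2D)])    # ['0x4e2d']
print([hex(u) for u in to_utf16_units(0x10000)])   # ['0xd800', '0xdc00']
print([hex(u) for u in to_utf16_units(0x10FFFF)])  # ['0xdbff', '0xdfff']
print(1024 * 1024)  # 1048576 characters reachable through pairs
```

The last line is where the "16 extra segments of 64K" figure comes from: 1024 high surrogates times 1024 low surrogates.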
UCS-4 gives access to the entire 31-bit ISO 10646 character set, but it's fairly inefficient since most planes in that range haven't even been assigned yet.
See appendix C of The Unicode Standard, Version 2.0 for details, or the Unicode Consortium's Web site.
(Why doesn't /. allow use of the <cite> element?) -
Are you insane?
Hmm...you probably could go read things at the site in your posting, http://www.unicode.org/, and learn these things. It's not hard, it's just text after all... http://www.unicode.org/unicode/standard/principles.html
Windows 2000 is Unicode to the extreme, even supporting Dvengali, Thai and Arabic script on all versions.
No, that's just full Unicode support. If you truly support Unicode you must support the entire character set, not just the part you feel like using. It's like saying you are ASCII compliant, but only for characters 12-34. BTW: NT 3.5x, NT4 and CE (all versions) are Unicode compliant as well. (Probably all NT versions, but I'm not positive.)
But what are the pitfalls?
Hmm...you might actually support the 4 billion people in the world who never heard of ASCII until the British invaded, uh, I mean colonized their homeland.
Seriously, a Unicode "character" is 16 bits. Which means if you are developing an app for English speaking/reading Americans, your text resources will double in size. Bummer... (Although there is a UTF-8 encoding I don't know much about; as I understand it, it stores ASCII characters in a single byte and uses longer sequences for everything else.)
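The size question is easy to measure. A quick Python check, assuming 16-bit text means UTF-16 without a byte-order mark (the sample strings are just examples):

```python
samples = {"english": "hello world", "chinese": "\u4e2d\u56fd"}

sizes = {}
for name, s in samples.items():
    utf16 = len(s.encode("utf-16-le"))  # two bytes per BMP character, no byte-order mark
    utf8 = len(s.encode("utf-8"))       # one byte per ASCII character, more for others
    sizes[name] = (utf16, utf8)
    print(name, utf16, utf8)
# english 22 11
# chinese 4 6
```

So the doubling is real for ASCII-only text stored as 16-bit characters, while the same text in UTF-8 stays at its ASCII size.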
Another pain in the ass is dealing with network protocols. Everybody expects ASCII. So you do a lot of converting, which is a pain. Although maybe some day everything will be XML based and you could be using Unicode plaintext for that...
How do you handle whitespace?
Uh, just like whitespace. In the NT world you can use iswspace instead of isspace. Basically all your ASCII CRT functions have a Unicode equivalent. This way brain dead programmers don't go running around saying the sky is falling just because they didn't bother to read up on the subject...
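The same idea outside the NT world, as a small Python sketch: the string method isspace plays the role that iswspace plays in C. This is only an analogue; I am not claiming the C runtime classifies exactly the same characters, since that depends on the locale.

```python
candidates = {
    "space": " ",
    "no-break space": "\u00a0",
    "ideographic space": "\u3000",
    "letter a": "a",
}
for name, ch in candidates.items():
    print(name, ch.isspace())

# split() with no argument breaks on any Unicode whitespace, not just ASCII blanks.
words = "foo\u3000bar".split()
print(words)  # ['foo', 'bar']
```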
How do you make your API display the characters in the right fonts
Uh...instead of a small lookup table for the few ASCII characters you need a big lookup table for Unicode (probably a smarter implementation that maps in sections at a time when you use them). If your OS supports it, you don't worry about it at all. You use something like DrawText( L"blah blah blah" ). What gets difficult is using an ASCII based text editor to enter Unicode strings. Basically you have to use cut and paste from an app that does support it or type things in manually.
All of these issues are becoming more important as the world becomes more switched on, and the boundaries shrink between places
And you're saying it wasn't important when the western Europeans were ranging all over Africa, Asia and the Americas conquering people? It's only important for those with compassion and understanding. Just like it should have been important to the Detroit automakers who were pissed off that Japan wouldn't let them sell cars there: when they finally did get to sell some cars there, maybe they should have checked which side of the car was preferred for the steering wheel. -
Re:(OT) 'f' was not used for 's'
(This comment looks best in a browser that supports a lot of Unicode.)
This letter "very much like f", ſ, is called long s. It had the advantage of looking good on paper, enabling more ligatures (st, sh, etc.), and generally fitting the way type was designed. The italic print version looked like ∫ (an integral sign). Something similar was used in the old Gaelic and German alphabets (surviving today in the German letter ß, which is long-s + s and no relation to the Greek lowercase β (beta)).
-
A single set standard of characters
Have you been to the Unicode site lately? But there is one problem: there are more distinct characters in this world's writing systems than there are 16-bit integers; some scripts will never be included into the codespace.
-
Support for alphabets not in Unicode?
-
Re:It's UNICODE
Since the code was a TWO BYTE code, and the browser displayed it as ONE question mark, then the browser knew how to convert UTF-8 encoding into a raw numeric code. It just didn't have a glyph to render it with, so it substituted the question mark.
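For the curious, turning a two-byte UTF-8 sequence into one numeric code is just bit shuffling. A small Python sketch; the function name is made up, and the sample bytes (0xC3 0xA9, an e with acute accent) are only an example, not the bytes from the page being discussed.

```python
def decode_two_byte(b1, b2):
    """Combine a UTF-8 lead byte 110xxxxx and a continuation byte 10xxxxxx into one code."""
    if b1 & 0xE0 != 0xC0 or b2 & 0xC0 != 0x80:
        raise ValueError("not a two-byte UTF-8 sequence")
    cp = ((b1 & 0x1F) << 6) | (b2 & 0x3F)  # 5 bits from the lead byte, 6 from the second
    if cp < 0x80:
        raise ValueError("overlong encoding")
    return cp

cp = decode_two_byte(0xC3, 0xA9)
print(hex(cp))                                 # 0xe9
print(chr(cp) == b"\xc3\xa9".decode("utf-8"))  # True
```

A browser that gets this far has one character in hand; showing a single question mark just means no installed font has a glyph for it.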
That may NOT be the standard, but it is also the case that many standards groups are spending (wasting?) too much time making things like XML more complicated than they need to be, and not keeping all aspects of standards up to date (like officially supporting UTF-8 encoded UNICODE, which is trivial to implement in validators ... for those who use such things).
It is already common for standards to be extended and the extensions to be accepted. Netscape added animation to GIF, and while there were some purists crying foul, others just got on with making things better, leaving standards groups to eat their dust. I extended GIF to support true-color images and browsers support that, too (Netscape, Explorer, and Opera, that I have tested). People did bitch and whine about it because it wasn't described in the standard for GIF, but it did work, it did not conflict with the literal standard, and it was the only way to get true-color into web pages until PNG came along (which admittedly was slowed due to browser makers dragging their feet).
So I suspect as soon as you have full UNICODE support in X windows and/or the font server, with proper fonts, it will work fine (despite what some useless validator says).
-
It's UNICODE
Seeing these characters myself, I extracted the codes and looked them up. The code I get where I expected the ASCII symmetrical apostrophe is actually the UNICODE right apostrophe.
It's sad, but in some ways, Microsoft is actually LEADING technology. In this case it is the adoption of UNICODE international character set. I wish the Unix/BSD/Linux community would get their act together and get these things working.
-
Re:It's all very clear now (the settlement)
Its not like there are different GPLs out there that might cause confusion. GPL is GPL.
Well, technically any General Public License is a GPL. Saying that it is covered under the GNU GPL would be better, I think (however, I'm not sure if even that is sufficient for coverage. IANAL)
By comparison, when you put a copyright notice on something, all you need is the word "copyright" or the © symbol. There is no requirement to quote the copyright act.
Heh. Maybe we should lobby the Unicode consortium to add a copyleft character.
More seriously, doesn't the GPL contain a clause that states that copies of covered source code must contain the GPL? If so, didn't they already violate the terms of the GPL?
---
Zardoz has spoken! -
Re:Japanese Input
Have you noticed on English Windows that all MS programs stuff up JIS/EUC encodings when you try to copy them to the clipboard? You end up with ?????????? when you do a paste.
I have had the same experience (using Chinese text rather than Japanese). But it is not a bug, it is a feature, and a good one (sort of...). What's happening is that Windows (95/98, at least, and in some respects NT; 2000 is different, they say) is sort-of Unicode-compliant, but not entirely. When you copy the text out of IE (4 or 5) it sits in your clipboard in Unicode rather than in pairs of bytes representing each kanji. This is a good thing. It means that if the application into which you are pasting understands Unicode encoding, it will treat the kanji as individual, two-byte characters rather than pairs of characters that happen to get rendered on your screen/printout as a single kanji.
The problem is that most Windows software doesn't deal with Unicode properly. It can't handle the string it's getting from the clipboard, and can only render it with ????s. At the moment, Word 97 and all the Office 2000 apps can handle Unicode, so if you paste from IE into one of them it should work fine (assuming you have Japanese fonts installed; in Access you need to specify an alternate font to display mixed text). You can also test this by pasting text from one IE window into a form in another (e.g., on a Japanese site).
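The symptom is easy to reproduce outside Windows. The Python sketch below only mimics the effect (Unicode text pushed through a Western 8-bit code page); it is not a description of what the Windows clipboard actually does internally, which I haven't verified.

```python
text = "\u4e2d\u56fd"  # what the clipboard holds, as Unicode

# An application limited to a Western 8-bit code page cannot represent it.
lossy = text.encode("cp1252", errors="replace")
print(lossy)  # b'??'

# A Unicode-aware application keeps the characters intact.
intact = text.encode("utf-16-le").decode("utf-16-le")
print(intact == text)  # True
```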
Overall, it's a good thing that the newer MS apps support Unicode. They've now outpaced MacOS in some respects (Worldscript is dead). And Linux is still finicky about Unicode (I've given up on Redhat 6.1 and am going to try Mandrake 7.0, which seems cleverer about it). But the downside is that older apps won't understand text from the newer ones (be it pasted from the clipboard or imported from a file). There are workarounds (usually involving 3rd parties or roll-your-own).
-
Re:One Question Companys now Ask themselves
The decision for Quote.Com to change wasn't only based on the "reliability" of the platform.
Their decision was also based on facts like:
- There are more pre-built software components for IIS/SQL server
- Things like XML support are very primitive in PERL, for example.
- MCSEs are cheaper to hire than Unix admin/programmers
- With more, cheaper machines, you can play the "uptime numbers game"
A lot of developers are working on XML support in PERL (there is a Perl/XML FAQ), but you still can't support Unicode. Perl still relies on 8-bit character sets, so we use UTF-8 instead of 16-bit Unicode. Unicode support is necessary for a complete XML implementation.
You'll also find that MCSEs will be cheaper to hire than Unix programmers. This is partly due to their (general) lack of skills, and partly due to their great abundance. An MCSE course only teaches you how to think the Microsoft Way. I wouldn't trust an MCSE to maintain or write code in C++ or Perl, for example. Without the MFC and a pointy-clicky interface, an MCSE can't function.
However, give the MCSE the MFC and a pointy-clicky interface, and an MCSE can deliver a program faster than a "traditional" developer. The fact that the program inherits all the bugs and mis-features of the MFC is not an issue here. The fact that the program was slapped together without regard for maintenance or robustness is also not an issue here. The issue that Quote.Com chose to focus on was delivery time, not quality of product.
As for the uptime numbers game, it works like this:
If you have 1 Sun server, and you need to upgrade the hardware, you need to shut it down. If it takes 1 hour to shut down, replace the hardware and restart, then you have 1 hour downtime.
If you have 2 Windows NT servers (for the same price as 1 Sun machine), and you need to upgrade the hardware, you need to shut them down. If you do it one machine at a time, and take 4 hours total to replace the hardware, then the server pool still has 0 hours downtime. Windows NT pundits will happily overlook the fact that the individual machines are constantly being overhauled.
In addition, Microsoft introduces the idea of "scheduled downtime". That is - you plan to reboot each machine once a day, to make sure the system remains stable. So twice a day, you have one of your two machines reboot. One machine reboots in the morning, the other in the afternoon. The total downtime of the server pool as a whole is still 0 hours (because you're not counting "scheduled downtime" as "real downtime").
Now combine the MCSE factor with the downtime numbers game factor, and you'll find that you can get away with shoddy code, because when your server crashes, it's not really downtime anymore. The problem of data integrity in your backend database is something for the DBA to worry about. You've got your uptime figures and time-to-delivery figures up there in the top 10. If the DBA complains about data integrity, you sack her and hire someone with a more "can-do" attitude. You don't want slackers in your Microsoft Powered enterprise!
Daily Reboots:
-
Re:Why does the dash break telnet/ftp?
The reference is RFC1035 (``Domain Names — Implementation and Specification'') by Mockapetris. But read it carefully: the section 2.3.1 is entitled ``Preferred name syntax'' (emphasis is mine). There is nothing illegal about names not following this convention; in fact, RFC1035 domain names can contain arbitrary characters, even binary (including the dot character, the null character, and mixed case!). The RFC essentially restates the golden rule: be liberal in what you accept and conservative in what you send (i.e. use the suggested conventions if possible when creating domain names, but be prepared to accept any kind of data when you receive a domain name). The RFC suggests the conventions you name precisely for compatibility with mail and TELNET (see the notes at the beginning of section 2.3.1), so you are putting the cart before the horses (``illegal'' names are ``illegal'' because they break TELNET, not the other way around).
A domain name is not supposed to start with a digit, but this rule is violated in the very RFC for the IN-ADDR.ARPA domain. Arguably, this is not a problem because you can't TELNET directly to an IN-ADDR.ARPA domain host (I find it rather unfortunate that I can't type telnet t.z.y.x.in-addr.arpa as a substitute for telnet x.y.z.t; I've never understood why that is disallowed).
I wonder, with all the fad on Unicode, whether Unicode characters will end up being allowed in domain names. Then every trademark-owning company would rush to register their name in every possible script. Or, worse even, get their logo added to the Unicode tables and register the <logo>.com domain. Fortunately, the Unicode Consortium has decided never to include logos in the official Unicode tables (but they might get added in the user-reserved areas if some vendor (i.e. Micro$oft) is influential enough to provide a kind of standard in this domain). Imagine people not knowing how to write any more, just choosing the company's logo in a huge table, and pasting it in the location bar of their browser...
-
Answer to Q2: Unicode 2.1 is in HTML 4.0
HTML 4.0 includes version 2.1 of the Unicode standard for international characters which assigns a unique identifier to each of 38,887 characters in the set of the world's major languages. This work is being coordinated by the Unicode Consortium.
-
Re:wow blatant disrespect and bigotry
Actually, Webslacker's Mandarin was at most half correct. Webslacker's sentence has the following errors:
Webslacker: Woh Duh Monitor Way Sa Ma Lan Sah?
1. Webslacker was not using standard Mandarin pronunciations. The correct pinyin is as follows:
Wo3 de5 Monitor wei2 she2 me5 lan2 se4?
2. 'Monitor' is not Chinese. Chinese for 'monitor' is 'xian3 shi4 qi4'.
3. The grammar of the sentence is wrong. Simply put, Webslacker's translation sounded like a Chinese dub of Yoda's speech. The correct way to say 'Why is my monitor blue?' in Chinese is as follows:
Wei2 she2 me5 wo3 de5 xian3 shi4 qi4 shi4 lan2 se4?
For those without a proper browser, the UCS-2 codes for the sentence are as follows:
U+70BA U+751A U+9EBC U+6211 U+7684 U+986F U+793A U+5668 U+662F U+85CD U+8272 U+003F
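If your browser can't show the characters but you have a Unicode-aware terminal, a short sketch like the following turns the U+ listing back into text and into big-endian UCS-2 bytes. The helper name decode_code_points is made up for illustration; since every code point here is below U+10000, Python's utf-16-be codec gives the same two bytes per character as UCS-2 would.

```python
CODES = ("U+70BA U+751A U+9EBC U+6211 U+7684 U+986F "
         "U+793A U+5668 U+662F U+85CD U+8272 U+003F")


def decode_code_points(listing: str) -> str:
    """Turn a whitespace-separated 'U+XXXX' listing into a string."""
    chars = []
    for token in listing.split():
        if not token.startswith("U+"):
            raise ValueError("expected a U+XXXX token, got %r" % token)
        chars.append(chr(int(token[2:], 16)))  # hex digits after the 'U+'
    return "".join(chars)


sentence = decode_code_points(CODES)
ucs2 = sentence.encode("utf-16-be")  # two bytes per character for these code points

print(sentence)
print(ucs2.hex())
```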
References:
Chinese Characters and culture [http://www.zhongwen.com/]
Unihan Database [http://charts.unicode.org/]
-
What About OTHER Languages/Dictionaries?
So we're all in a rush to buy up every last word and phrase in the English language, for later resale to highest bidders. Gag me with snake-oil! Are we forever stuck with ASCII-ONLY DNS?
The fastest growth on the web now, finally, is non-English. Please correct me if I'm wrong, but it seems like the HTTP protocol works only with ASCII-only URLs. When will the web outgrow ASCII and move to Unicode? We need URLs in Czech, Chinese, Cyrillic, etc. Anybody know of specific initiatives?
Ask Slashdot: are any multilingual URL protocols being developed to allow us to record and browse the world in more of its many languages? Where are they?
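One approach that already fits inside ASCII-only URLs is to write the non-ASCII text as UTF-8 and then %-encode each byte. A minimal sketch with Python's standard urllib.parse follows; the host name and path are made up for illustration, and this covers only the path part of a URL, not the domain name.

```python
from urllib.parse import quote, unquote

# Two Chinese ideograms meaning "China", written as escapes so the
# example survives an ASCII-only editor.
word = "\u4e2d\u56fd"

encoded = quote(word)      # UTF-8 bytes, each written as %XX
decoded = unquote(encoded)  # back to the original text

# Hypothetical URL for illustration only.
url = "http://www.example.com/" + encoded
print(url)
print(decoded == word)
```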
"ever tried. ever failed. no matter. try again. fail again. fail better." - s. beckett
-
CJK unification
The idea is, you have two or three characters that look very similar. One is Chinese, one is Japanese, one may be Korean. They look similar, but usage differs (i.e., they have different meanings). Because of the ridiculous Unicode proposal, they are all unified into the same code point.
According to the Unicode web page and everything I have ever read on Unicode, the unification only takes place if the characters have the same meaning. Can you name an exception where that rule hasn't been followed?
depending on the Unicode font used, you might get the Chinese character, you might get the Japanese character, or maybe the Korean character
That's just font management and/or language management. Every decent DTP system needs font management anyway, and if it is going to get hyphenation right it needs language tagging, even if you are only using Latin-1.
The whole point of Unicode was that code points were supposed to be kept separate between languages
What on earth makes you think that was the whole point of Unicode?
One of the big points in Unicode is that it should be possible to convert from any character set encoding to Unicode and then back again. That has caused some compromises: for example, the Greek capital letter Alpha has a different code point from the Latin capital letter A, although you could argue that they are the same character.
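The round trip is easy to try with a legacy Greek encoding. This sketch assumes ISO 8859-7 as the source character set (Python's iso8859_7 codec), where Latin A is byte 0x41 and capital Alpha is byte 0xC1; the variable names are only illustrative.

```python
import unicodedata

# Latin "A" followed by Greek capital alpha, as ISO 8859-7 bytes.
legacy = bytes([0x41, 0xC1])

text = legacy.decode("iso8859_7")         # legacy encoding -> Unicode
round_tripped = text.encode("iso8859_7")  # Unicode -> legacy encoding

names = [unicodedata.name(ch) for ch in text]
print(names)
print(round_tripped == legacy)
```

If the two letters shared one code point, the encode step would have no way to tell which byte to emit, and the bytes would not come back unchanged.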
But you can't make that 2-way conversion guarantee for encoding systems that let you switch from character set to character set with escape codes. Among the problems that make this impossible is the fact that you can use the same escape codes to switch into Unicode, so you would get an infinite recursion.
If you want to convert to and from escape-code switching encoding systems you will have to extract the implicit language and font information and make it explicit in the Unicode version of your data. That is probably a good idea anyway, and is possible in HTML and any other serious text format.
If it's a 'plain text file' then you can't embed the font or the language information, but that's why plain text sucks, and the same problem appears in Latin-1 plain text files.
-
Unicode? Internationalization!
I agree that Unicode will not solve all the world's problems, but...
For language labeling, what's wrong with Unicode Technical Report #7: Plane 14 Characters for Language Tags?
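As a rough sketch of what that report describes: a language tag is the character U+E0001 (LANGUAGE TAG) followed by tag characters in U+E0020 to U+E007E, which mirror printable ASCII at an offset of 0xE0000. The helper names below are made up for illustration, and note that later versions of the Unicode Standard deprecate this use in favour of markup-level language tagging.

```python
LANGUAGE_TAG = "\U000E0001"  # U+E0001 LANGUAGE TAG
TAG_OFFSET = 0xE0000         # tag characters mirror ASCII at this offset


def to_tag_chars(tag: str) -> str:
    """Encode an ASCII language tag such as 'he' as Plane 14 tag characters."""
    if not all(0x20 <= ord(c) <= 0x7E for c in tag):
        raise ValueError("language tag must be printable ASCII")
    return LANGUAGE_TAG + "".join(chr(TAG_OFFSET + ord(c)) for c in tag)


def from_tag_chars(tagged: str) -> str:
    """Recover the ASCII language tag from a Plane 14 tag sequence."""
    if not tagged.startswith(LANGUAGE_TAG):
        raise ValueError("sequence does not start with U+E0001")
    return "".join(chr(ord(c) - TAG_OFFSET) for c in tagged[1:])


tagged = to_tag_chars("he")
print(["U+%05X" % ord(c) for c in tagged])
print(from_tag_chars(tagged))
```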
Also, regarding XML's requirement to use Unicode, if XML permitted multiple character sets, wouldn't the parsing become much more complicated?
A related question for the Linux GUI geeks: is anyone working on making the text widgets in the various GUI toolkits handle bidirectional text properly? The last time I bought a computer, I bought a reconditioned Mac, specifically because I knew I could get decent Hebrew support with WorldScript.