ICANN Plans Non-English Character Domain Testbed
Wanted writes: "This article reveals ICANN's plan to open registration of domain names with national characters. Actually it's Network Solutions who are responsible for the technical side of implementing that project. Initially they want to support CJK (Chinese, Japanese, Korean), then Spanish and other European languages. I don't know why they like Spaniards, but I'd rather talk about supporting ISO-8859-1 than about particular languages. Nevertheless the Internationalized Domain Names IETF Working Group should be pretty happy about it. I wonder how you would type www.wong-kar-wai.org in Chinese with a classic keyboard :)"
I guess most of the posters here are Americans, because you just can't grasp that there is a need for this in most other countries. If you can't type ö on your keyboard - too bad. You know what? There will be software that helps you with this, or web services. I promise: the majority of times you want to visit a webpage on a URL with a >7 bit character in it, you will have a link for it. Just point and click! What is so difficult with that? The big problem will occur when you try to read the page. It will consist of letters in combinations that you cannot understand. They will not make up English words. You will have to take a course in another language to get it. That will be the hard part. But just relax, because the English/American web will still be the dominant one, and you will not feel you have missed anything.
www.ràÐÊMåRk-LâwSÜïTs-ÄRè-ÙS.com
[ maur_at_technologist.com ] "For a sufficiently powerful message,
[ http://maur.litestep.com ] the medium is irrelevant."
First, 8-bit computers are still in use now, and their bus width does not prevent them from dealing with data of any format.
Second, DNS already is in use, and NO ONE BUT "UNICODERS" EVER COMPLAINED about ASCII use in it. There is no demand for this feature, only some people's desire to break all existing software to sell "updates".
A domain name is an address. An address should be reachable from everywhere and everything. This works even for postal addresses -- I can write them in English, and they will reach the intended destination in any country, be it US, Spain, Russia or Japan. The same functionality is available now with DNS, but if this proposal is implemented, it won't be available from every computer unless everyone switches to Unicode -- and that won't happen until Hell freezes over (being Russian I have every reason to be sure about that).
If someone is too concerned about "good-looking" addresses, they should implement some name-translation service like AOL keywords for people who don't like DNS, but the basic architecture of the Internet should not lose interoperability just because someone wants to add one useless feature to his shitty software.
Contrary to the popular belief, there indeed is no God.
Not completely true. The domain name is an alias. The dotted quad is the address.
Tell me why I can't put a name server out there that supports more characters? Yes, compatibility will be a problem but not an impossible one. The DNS request must inform the server which charset it is using. Default to UTF-8 of course.
So we have the following alternatives. (Let's accept "unicode" for "standardized extended charset" in the following OK?)
1) I run a web server with an ASCII domain name. No problem, a Unicode DNS will be able to handle the ASCII subset.
2) I run a web server with a Unicode domain name. I must register my domain with a Unicode DNS and I'd be wise to also register an ASCII domain name as an alias if I think my domain name will cause trouble.
This works even for postal addresses -- I can write them in English, and they will reach the intended destination in any country, be it US, Spain, Russia or Japan
Well, but can you write an address in Russian and expect the letter to be delivered in the US? Or in Russia, if mailed in the US?
If someone is too concerned about "good-looking" addresses, they should implement some name-translation service like AOL keywords for people who don't like DNS, but the basic architecture of the Internet should not lose interoperability just because someone wants to add one useless feature to his shitty software.
Someone sure is "too concerned" and I rather have ICANN setting a standard than wait until there is an AOL/MS proprietary name space.
I do see your point. Clueless fiddling with DNS is not a good thing. *My* point is that that is exactly what will happen unless it is done the proper way.
All opinions are my own - until criticized
Numbers. 417 million people speak Spanish, 191 million speak Portuguese, 128 million each speak French and German, and no other Latin-alphabet European language has as many as 100 million speakers. It isn't that NSI prefers Spaniards, it's that it prefers larger markets over smaller ones.
CJK has a similar "numbers" vibe. Since the CJK character sets are generally handled by a single solution in software (esp. since written forms of Japanese and Korean include both native syllabic/alphabetic [respectively] scripts and Chinese ideographic script), you get Japan, Korea, and Greater China in one fell swoop. (Greater China here not only including the PRC and Taiwan, but the Chinese-speaking groups in Malaysia, Singapore, and Indonesia.)
So why not Devanagari too? Because 1) there are a lot more CJK and Spanish language customers than Hindi/Bengali customers due to internet penetration and financial factors, and 2) the people who would buy the domains in India generally are of the educated classes that speak English. So there's less demand for Devanagari.
Steven E. Ehrbar
http://domän.nu/
Interesting, BIND 8 works with it (my nameserver), but when I enter that in nslookup it pukes (i.e. I can use a webbrowser [IE 5.5], but can't type it into nslookup).
--
If you sell a product internationally, and use lots of strange letters in the url, you are just stupid. Of course you should use a "classic" url for this. But if you have a company that delivers shrimp sandwiches within a Swedish town, you should be able to use the domain räksmörgåsar.se instead of raksmorgasar.se.
Does this mean I can register micrösoft.com and yàhoo.com and släshdot.org?
--
Conflicts with existing names are mostly dangerous when the user might type the name and make a typo. Evidently you would not type such a thing as "amazоn.com" (here the "o" before the "n" is replaced by the Cyrillic form of the same letter, which is supposedly indistinguishable from it). When you are following links, well, you are following links, and you are therefore trusting the site with the links to some extent. After all, non-power users rarely read the URL written at the bottom of the page, in any case: if someone writes a site which looks very much like a well-known site and links to it, whatever the URL, many users will be fooled. I don't think "internationalized URLs" will be a major change in this respect.
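To make the look-alike case concrete, here is a minimal Python sketch. The two names and the helper `describe_non_ascii` are made up for illustration; it only shows that the strings are different to software even when a font draws them identically.

```python
import unicodedata

# Hypothetical names for illustration: the second one swaps in U+043E,
# the Cyrillic small letter o, just before the "n".
latin_name = "amazon.com"
mixed_name = "amaz\u043en.com"


def describe_non_ascii(name: str) -> list[tuple[int, str]]:
    """Return (position, Unicode name) for every non-ASCII character."""
    return [(i, unicodedata.name(ch)) for i, ch in enumerate(name) if ord(ch) > 127]


print(latin_name == mixed_name)        # False: the code points differ
print(describe_non_ascii(mixed_name))  # [(4, 'CYRILLIC SMALL LETTER O')]
```

A registry or browser could run a check like this to flag names that mix scripts, though whether anyone would do so is a policy question, not a technical one.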
Slashdot's handling of accented characters in nicknames was completely grotesque, in any case. It was done naïvely by taking the 8-bit data as submitted and using it in the URL. But this is not how it works: the data should have been encoded in UTF-8 beforehand.
--
Here you should see an upper-case E with an acute accent: É. Here you should see an upper-case Y with two dots on it: Ÿ. Here you should see a capital Greek Gamma: Γ. Here you should see a Hebrew aleph and a Hebrew beth: אב; of course, the aleph should be on the right because it is first (unless there was a line split between the two). Here you should see the Devanagari "OM" sign: ॐ. Here you should see a smiling face: ☺. Here you should see the Chinese (or Japanese) character for "sun": 日. None of this should depend on your selected "document encoding". If you did not see all that, then your browser is broken and you should change it.
Finally you are mistaken in the assumption that 31 characters would be limiting for Chinese names. Those Chinese characters are much more powerful than letters or digits, so far fewer of them are required to form a name.
That's interesting -- that had entirely escaped me. (This from the kid who spent years studying Egyptian and Mayan ideograms. *smack*) And I thought 63 characters was incredibly long in English. You could have several haikus in Japanese ideograms as a domain name!
I like ICANN but
NSI really pisses
me off a lot
-Waldo
-------------------
I could see this as a possible way of the internet "cleaving" into national groups. What I mean is that there is no easy way for me to type in Asian characters to get to a site, and if someone is used to an Asian-only computer system, how do they go to Russian sites without clicking on a link or knowing an IP address?
At the risk of sounding anglo-centric, isn't this a big blow against interoperability?
{resume, résume, resumé, résumé}.{com, net, org, new TLDs}
Costly!
The only thing I'm worried about is that infrastructure/backbone-level software might break.
Because:
1) I can't read a Japanese-language site whether or not I can get to it.
2) If I could read it, I'd use software that let me input it.
3) A rational web designer will register a non-accented Roman/ASCII character name if they intend to reach an audience that may include people who can't input other characters. The irrational deserve to have their sites fail anyway.
Steven E. Ehrbar
Domain names should map from something like: "Señor Hussong's Cantina.com" to "senorhussongscantina.com". Spaces, punctuation, and hyphens should be deleted. Special characters should be translated into the closest low ascii character.
This way, you can write your domain name however you want, and there isn't so much of a potential for people registering something similar.
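A rough Python sketch of that kind of folding, assuming Unicode NFKD decomposition is an acceptable stand-in for "closest low ASCII character". The function names are hypothetical, and letters that do not decompose into a base letter plus accent (ø, ß and all non-Latin scripts) are simply dropped here, which a real scheme would have to handle.

```python
import unicodedata


def fold_label(label: str) -> str:
    """Strip accents, spaces and punctuation from one dot-separated label."""
    decomposed = unicodedata.normalize("NFKD", label)  # "ñ" becomes "n" + combining tilde
    kept = [ch for ch in decomposed if ch.isascii() and ch.isalnum()]
    return "".join(kept).lower()


def fold_domain(domain: str) -> str:
    """Fold every label of a domain name, keeping the dots."""
    return ".".join(fold_label(part) for part in domain.split("."))


print(fold_domain("Señor Hussong's Cantina.com"))  # senorhussongscantina.com
print(fold_domain("räksmörgåsar.se"))              # raksmorgasar.se
```

Because many spellings fold to one name, the registry would need to refuse a registration whose folded form is already taken.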
Hyphens have got to be the dumbest idea of all time. If you have a multi-word name, you almost have to register both with and without the hyphen or you will lose visitors.
Even better would be using something like soundex, which makes a "hash" of a name so that similar sounding words map to the same value. Memorizing exact spelling is not something people are used to doing.
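For reference, a sketch of the classic American Soundex in Python. Taking "something like soundex" to mean exactly this variant is an assumption; it is tuned for English surnames and would need a different table for other languages.

```python
def soundex(word: str) -> str:
    """Classic American Soundex: first letter plus three digits."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    letters_only = [ch for ch in word.lower() if ch.isalpha()]
    if not letters_only:
        return ""

    result = letters_only[0].upper()
    previous = codes.get(letters_only[0], "")
    for ch in letters_only[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = codes.get(ch, "")  # vowels get "", which resets `previous`
        if code and code != previous:
            result += code
        previous = code
    return (result + "000")[:4]  # pad or truncate to four characters


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Note how coarse the hash is: Robert and Rupert collide, so two unrelated businesses could easily end up competing for the same key.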
They shouldn't do CJK domain names, they should just define a standard translation, which can then be incorporated into client software and possibly into DNS systems. What's next, I'll be unable to get to a site unless I also choose the correct encoding? Let's see, was that "cool-shit.org in 8859-1, or coolshit.org in Japanese encoding, or maybe cool-shit.net in Eastern European encoding. Or was it coolshít.org?"
> Unless I'm mistaken, Unicode is a combination of two ASCII characters to create a single one
You are mistaken.
I've finally had it: until slashdot gets article moderation, I am not coming back.
Slashdot has had to ban accented characters to prevent this kind of abuse; ICANN should do the same lest a similar outbreak of mimicry infect the entire Web.
History of the "Something must be done to control the outbreak" syndrome.
Early 1990s: OMG! People are making up their own web sites in large numbers. Thousands of people will see them and be unable to distinguish fact from fiction.
Mid 1990s: OMG! People are now making up their own news sites. Millions of people are reading them and can't tell the difference between real and fake news.
Late 1990s: OMG! People are posting stock market tips which are causing market fluctuations. People will be unable to tell the difference between real and fake stock market news!
Early 2000s: OMG! People are allowed to use accented chars. Millions of people will be diverted to fake sites which use similar accented chars in their domain name, and thus be unable to tell the difference between real and fake sites!
Here, take a chill pill. Welcome to the internet, my friend.
w/m
This is a bad idea -- domain names must be interoperable on all systems, with or without Unicode or any other charset support, with or without a keyboard capable of entering certain characters. The ASCII subset allowed in DNS now is the only subset supported by absolutely all computers (even ones that natively use EBCDIC), and no matter how the use of other charsets (and/or Unicode) will expand, this is not going to change. I see it as an attempt to just promote "unicodefication" of existing standards for no good reason.
And if anyone cares, my native language has nothing to do with ASCII.
Contrary to the popular belief, there indeed is no God.
Slashdot has had to ban accented characters to prevent this kind of abuse; ICANN should do the same lest a similar outbreak of mimicry infect the entire Web.
People should realize that it's a world wide web. It's not only American, and it should not only be in English -- diversity is important. And if you want to support other languages, you have to accept accented characters; they are not only "decorative", they make a whole difference.
Sure people will abuse it. But we already have slahsdot.org and other similar sites. It's already being abused.
Você não acha?
--
This space left intentionally blank.
In the UTF-8 encoding (defined by RFC2279), it takes between one and six octets (bytes) to encode one character, although no currently assigned character needs more than three. UTF-8 can address all the 2147483648 characters of ISO-10646-1.
In the UTF-16 encoding (RFC2781), it takes either two or four octets (bytes) to encode one character, although no currently assigned character needs four. UTF-16 can access only the first 1114112 characters of ISO-10646-1 (the first 17 planes), which form the Unicode range proper.
Both these encodings use characters outside the ASCII range (i.e. 8-bit characters), which are not supported by current BIND versions, but which are still permitted by the DNS standards (RFC1034&1035).
However, the proposed IDNS standard does not use either of these encodings (IMHO not using UTF-8 is a terrible mistake) but yet another one, called UTF-5 (see "draft-jseng-utf5-00" in Internet Drafts).
In the UTF-5 encoding (defined by the aforementioned draft), it takes between one and eight octets (bytes) to encode one character, although no currently assigned character needs more than four. UTF-5 can address all the 2147483648 characters of ISO-10646-1.
If UTF-5 is used on DNS labels, you can have up to 15 Chinese characters in such a label.
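The per-character sizes above can be checked with a short Python sketch. The UTF-8 and UTF-16 figures come from Python's own codecs; the `utf5_encode` function is only my reading of the draft (hex digits of the code point, with the leading digit shifted into G-V to mark the start of a character), so treat it as an assumption and not as the draft's reference algorithm.

```python
# Quintet alphabet assumed from draft-jseng-utf5-00:
# values 0-15 map to 0-9A-F, values 16-31 map to G-V.
QUINTETS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"
MAX_LABEL_OCTETS = 63  # DNS label length limit (RFC 1035)


def utf5_encode(text: str) -> str:
    """Encode text one code point at a time, one ASCII octet per hex digit."""
    out = []
    for ch in text:
        nibbles = format(ord(ch), "X")             # hex digits, no leading zeros
        lead = QUINTETS[int(nibbles[0], 16) + 16]  # high bit set on the first quintet
        out.append(lead + nibbles[1:])
    return "".join(out)


def octets_per_char(ch: str) -> dict[str, int]:
    """Octets needed for one character in each encoding."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),  # -be avoids counting a byte order mark
        "utf-5": len(utf5_encode(ch)),
    }


sizes = octets_per_char("中")  # U+4E2D, a common Chinese character
print(sizes)
print({name: MAX_LABEL_OCTETS // n for name, n in sizes.items()})
```

With four octets per character, 63 // 4 gives the 15 characters per label quoted above; the same label would hold 21 such characters in UTF-8.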
--
And it's "non-ASCII", not "non-English". There are already plenty of domain names that are non-English, as others have pointed out already. ASCII is a character set (of sorts); English is a human language. The differences are defined in detail in the requirements document for the IETF's working group.
Full details on the working group can be found at http://www.i-d-n.net. Maybe folks should consider reading the copious archives before declaring that it can't be done. It can be done, and hopefully it can be done right. We're quite sure that the Powers That Be in the IETF won't allow it to become a standard if it isn't right.
Expensive, yes
A pain for hard core geeks to get used to, yes
Necessary, hell yes! Perhaps not today, but soon.
I do see your point, but the same argument could be used against 16-bit computers (8-bits is the current standard and programs and data must be interoperable...)
Do you know how much creative spelling there is, simply to force non a-z characters into the DNS? Simply removing dots, rings and accents is not good enough. (oops suddenly my domain name became equivalent to "www.faggot.com" or someone else's brand name)
Ever tried enforcing an "8 character a-z only" file name policy on a network where *some* servers and programs could not handle other names? Forget it. It was cheaper to dump those, buy a new network and Microsoft products (as you can see this was after the DOS days:-) even if technically inferior, than to handle the constant hassle.
People *hate* modifying spelling to comply with stupid limits. There is no standard way to map non a-z chars onto a DNS name.
ASCII is outdated, get rid of it!
Either it will be done in a standardized way, or it will be done by Microsoft. I prefer the former.
All opinions are my own - until criticized
It depends on which Chinese character set you use (either traditional for Hong Kong and Taiwan or simplified for China)
For each character set there's a choice between a couple of input methods to map keystrokes from a QWERTY keyboard to the actual Chinese characters. I normally use a method called traditional Cangjie and here's how you type the URL:
twlb vfog vfbtv . mg   jmso hodqn . dvii dttb
w    w    w     . wong kar  wai   . (--org--)
Of course there are rules to generate the above if you know what the word looks like :-). However, as you can see it's much more inconvenient that way, and anyone who thinks that the average person who doesn't know Chinese typing would be able to reach their Chinese domain is being silly at the least.
Keith So GnuPG fingerprint = 168F 874B 4E26 DCA8 B8BF 57F4 80F9 412E F82B AE4C
There are many Americans who understand internationalization issues very thoroughly, and some of them disagree with this proposal. It is a bad proposal because, first off, it really does not seem to understand internationalization issues. You do not accomplish I18N by using national character sets. Using an NCS is not making your content supported for an international environment. It is doing exactly the opposite. In many cases, there are half a dozen NCS that support the same damn alphabet. If you really want I18N, you need to use Unicode (preferably UTF8) or UCS.
If you are going to support multilingual domain names, resolution must occur in either Unicode or UCS. Let DNS lookup libraries handle the conversion from KOI-8 to UTF8. The user enters the domain in their NCS and the DNS server only has to handle one character set.
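A minimal sketch of that client-side step in Python, assuming the user's national character set is KOI8-R and the server speaks UTF-8. The function name is hypothetical; the `koi8_r` codec is part of the Python standard library.

```python
def koi8r_to_utf8(raw: bytes) -> bytes:
    """Re-encode bytes typed in KOI8-R as UTF-8 before sending the query."""
    return raw.decode("koi8_r").encode("utf-8")


typed = "россия".encode("koi8_r")  # what a KOI8-R terminal would hand over
query = koi8r_to_utf8(typed)

print(len(typed), len(query))  # 6 12: one octet per letter vs two in UTF-8
print(query.decode("utf-8"))   # россия
```

The resolver library has to know which character set the input is in; the bytes alone do not say, which is the practical catch with this design.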
Beyond the issue of I18N, however, is the issue of who a TLD is targeted at. If .com is aimed at a global audience, then domains registered under that TLD should support a global audience: i.e. ASCII or ISO-8859-1. NOTHING ELSE. Let .ru use more of the Unicode spectrum. Or even allow for a .ро (those were the Cyrillic letters for the first two characters in the Russian word Rossiya, in case your browser cannot resolve those) for domains aimed specifically at Russian speakers.
Recently I posted this comment mentioning the fact that there's really no reason why a domain such as www.中国.com (you should see two Chinese ideograms meaning "China" between the "www." and the ".com" parts; further, if you click on this link, your browser should open a window telling you that the domain "www.中国.com" does not exist, with the same two Chinese ideograms) doesn't exist.
Let us recall: first, as specified by the HTML specification, every HTML document, no matter what character set it is "encoded" as, is written in the all-encompassing Unicode character set. So when you write something like "&#20013;&#22269;" in HTML, it refers to the Unicode characters (decimal) 20013 and 22269, no matter what the current character encoding and font are. So that's how you write the link text. Second, as for the URL itself, well, although it is not (as far as I know) formally recommended by an Internet standard, it is widely recognized that URLs are written in the UTF-8 encoding format (which is afterward %-encoded into ASCII).
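Both steps can be reproduced with Python's standard library. This is only an illustration of the two facts above (the decimal code points and the UTF-8-then-%-encoding convention); the variable names are mine.

```python
from urllib.parse import quote, unquote

text = "中国"  # the two ideograms meaning "China"

code_points = [ord(ch) for ch in text]
escaped = quote(text)  # quote() encodes as UTF-8 by default, then %-escapes

print(code_points)       # [20013, 22269]
print(escaped)           # %E4%B8%AD%E5%9B%BD
print(unquote(escaped))  # 中国
```

Each ideogram becomes three UTF-8 octets, so the escaped form is pure ASCII and passes through software that knows nothing about Unicode.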
The whole process is described in this Internet Draft ("Internationalized Uniform Resource Identifiers"; WORK IN PROGRESS!) by Larry Masinter and Martin Duerst where the relationship between URIs and IURIs (Internationalized URIs) is discussed in detail.
The DNS is the toughest part of all. The DNS specification (RFC1034) states (section 3.1) that DNS data is to be taken as binary for possible upward compatibility (this was wonderful foresight on Mockapetris' part!). Consequently, there is nothing as per standards wrong with using (UTF-8 encoded Unicode) 8-bit data in DNS labels. Except, of course, that many "buggy" implementations will have to be corrected for broken assumptions, *sigh*. The IDNS working group suggests using a UTF-5 encoding to avoid going beyond the current domain name limits: I think this is not a good thing and we should stick to UTF-8 and repair broken software.
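As a sketch of what "8-bit data in DNS labels" would look like on the wire, here is the RFC 1035 name format (a length octet before each label, a zero octet at the end) applied to UTF-8 labels in Python. Putting raw UTF-8 in labels is the approach argued for here, not something deployed resolvers are known to accept, and `encode_name` is a hypothetical helper.

```python
def encode_name(name: str) -> bytes:
    """Encode a domain name in DNS wire format with UTF-8 labels."""
    wire = bytearray()
    for label in name.rstrip(".").split("."):
        raw = label.encode("utf-8")
        if not 1 <= len(raw) <= 63:   # the limit counts octets, not characters
            raise ValueError(f"label {label!r} is {len(raw)} octets long")
        wire.append(len(raw))         # length octet
        wire += raw                   # label bytes, treated as opaque binary
    wire.append(0)                    # root label terminates the name
    if len(wire) > 255:
        raise ValueError("encoded name exceeds 255 octets")
    return bytes(wire)


print(encode_name("www.中国.com"))
```

The two ideograms take six octets in their label, so nothing in the wire format itself objects; the trouble is in software that assumes labels are ASCII letters, digits and hyphens.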
Oh, and incidentally, see this page to know how broken your browser's Unicode support is.
Now when I buy a product and need technical support, they can tell me that my keyboard isn't compatible with their web site! I knew this day would come! Brian Tobin
I dont have anything against other cultures, and dont mind other languages existing, in writing or on web pages... but DNS is NOT the place for them.
a domain name is supposed to be universally accessible. this is going to make a great many pains in the asses.
old browsers wont work
english keyboards lack accented characters
its not fun changing your charset, then punching in random alt+XXX codes until you match the CJK symbol you're looking at.
the internet is really becoming dumb.
i dont see how this would be possible without the modification of every name server in the world to support multibyte domains... since BIND 9 is in feature freeze... this might get in to BIND 10... look for betas in about 10 years.
wouldn't this choke most applications? im not entirely sure how CJK are handled... doesnt seem to me like it would be a pop-in transition.
However, if you had some kind of translation software that automatically mapped the local character set back to ASCII (and of course disallowed name clashes for the mapped names when registration occurred), it could be a win/win situation both for making the DNS more useful for non-english speakers, and keeping the net globally accessible.
Any sufficiently advanced technology is indistinguishable from a rigged demo
--Andy Finkel (J. Klass?)
I think it is about time we tossed out DNS when it comes to URLs. It is ridiculous that so many millions of non-technical users are expected to use DNS. The further absurdity of the DNS system's application to URLs is realized by the endless "property" claims made by rich litigious corporations.
Why not use some kind of distributed, non-exclusive labeling system that lets IBM have the name "IBM". Maybe something LDAP based?
We are not going to get anywhere by patching up the DNS system a problem at a time. We need to engineer a new solution. I'm all for evolution but I don't want to wait for it to come up with something that works.
I have read about it on: http://www.spiegel.de but as far as I remember, it was only a doubtful claim by some politicians, who have a German special character like ä, ö or ü in their names and could not register it without using substitutions like ä --> ae and so on.
As far as I remember, the chairman of http://www.denic.de only laughed about this claim.
I think it would lead us to more problems than it would solve! Or is it just again about making $$?
Michael
I doubt that this will be a big problem for you. First of all, most URLs you will encounter as links, so you don't have to type anything at all. Second, if you can't type the url because you don't understand it, how do you expect to understand the information on the page that it points to? And third: there will probably be software that helps you with this.
Unless I'm mistaken, Unicode is a combination of two ASCII characters to create a single one, which is how Japanese, Chinese, etc., characters are created. 255^2 is a lot of characters. (65025, to be exact.) Doesn't this mean that these domains are limited to 31 characters? Further, can BIND *support* using characters beyond [a-z0-9-.]? I sure wouldn't think that it could.
I didn't find these questions answered anywhere on ICANN or NSI's sites. Anybody have any ideas?
-Waldo
-------------------
Look, I'm Panamanian. Spanish is my first language (it is Panamá, not Panama), but I just can't agree with this because I don't think it's practical at the moment. Take for example this web site we're building called galeriacentral.com. Everyone knows automatically how to access it when they hear an ad for it on the radio, but with the intl characters allowed, I would have to register galeriacentral.com, galeríacentral.com (correct form) and galerìacentral.com. And then someone would register galeríacentrál.com and I'd be screwed (cybersquatting is allowed in most parts of the world)...
.com/net/org are already abused enough to leave more room for stuff like slashdog.org.
My recommendation would be to leave it up to the countries' TLDs. So if I want to register cualquiercosa.com.pa then OK, but the regular
There are two kinds of people in the world: Those with good memory.
...namely, it's not yet practical. It may not be for a long time.
One, people talk about accented characters as being harder to recognize when spoken. While this is true, there's another problem, and one that's a lot tougher: there is no standard way to type these characters. On a Mac it's done one way (fairly intuitive, based on the character over which a given mark most frequently appears), on Windows it's done another way (an unnecessarily difficult process involving a four-digit keycode), and on Linux/Unix it's still another (I don't even know how it's done there).
Part of the reason the keyboard works so well is that it's at least semi-standardized; for the basic Roman character set I can move across platforms effortlessly. But when you start throwing diacriticals into the mix, I'm lost when I move from platform to platform. We need to solve that problem before we can even think of putting such characters in URL's. Can it be done? I think so.
Now, there's the problem of CJK characters in URL's. First of all, most computers aren't even capable of recognizing these without special software. As a result, the characters come out as a sequence of ASCII chars which if you're really lucky might all be printable. If you're not so lucky, the characters won't even be printable, or they'll be indistinguishable from one another so you still don't know what to type.
The answer here? Unicode (specifically UTF-8) helps, but many computers still don't support Unicode. Even in the case of those that do, I doubt there are any fonts which support every single character in the CJK set yet (remember, the Chinese character set in particular is truly vast; a two-byte encoding system is still insufficient for encoding all the possible characters). While all current operating systems can manage Unicode, many people are unable or unwilling to upgrade to current technology, and that's going to be a huge barrier to overcome (it may even prove insurmountable).
Supporting all the world's languages in URL's is a Good Thing. However, we have more than a few problems that we have to get through before we can accomplish that goal. The resources currently being spent on this project would be better spent solving those problems first.
----------
And worldsnames.net also features Japanese characters, Chinese, Korean, Arabic, Cyrillic.
Though one of the nice "features" of the internet is the fact that you have the opportunity to reach a global public. Which is rather hard when you have country/language specific characters. my 0.02