Internationalized Domain Names Coming Soon
rduke15 writes "You think you know how to parse a domain name for validity? Well, in case you haven't noticed, things are getting tougher as registrars keep adopting IDN (Internationalized Domain Names), which uses a weird encoding named Punycode to enable accented characters in domain names. The Register reports on Switzerland, Germany and Austria's joint move to enable IDN. See the overview in English from Switch. But I guess it would be difficult to talk about this on /., since it does not even support basic Latin-1 ... :-)"
I have mixed feelings about this. I am from Sweden, and it always looks kind of ugly when names lose their dots and circles in the domain name.
On the other hand, this is also quite convenient. I live in the US now, and I travel around quite a bit. I often surf on Swedish Internet sites, typically without access to a Swedish keyboard. It would not be very convenient if the domain names used non-English symbols.
Sometimes I go to Japanese sites also, and I am really glad that I don't have to install a Japanese word processor to do this...
Tor
Punycode *is* a Unicode encoding.
Unicode has many encodings; UTF-8 is one and Punycode is another. UTF-8 aims for efficiency when the majority of the text is ASCII, while Punycode aims to represent any Unicode string when you must fit in a 63-character DNS label and use only ASCII characters to do it.
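To make that concrete, here is a minimal Python sketch using only the standard library's `utf-8` and `punycode` codecs. The label `bücher` is just an illustrative choice, and note that the raw `punycode` codec does not add the `xn--` prefix used in DNS:

```python
# The same Unicode string, serialized with two different encodings.
label = "b\u00fccher"  # "bücher", written with an escape to keep the source ASCII

utf8_bytes = label.encode("utf-8")     # the non-ASCII letter becomes two bytes >= 0x80
puny_bytes = label.encode("punycode")  # the output is plain ASCII

print(utf8_bytes)  # b'b\xc3\xbccher'
print(puny_bytes)  # b'bcher-kva'

# Both encodings are lossless: decoding gives back the original code points.
assert utf8_bytes.decode("utf-8") == label
assert puny_bytes.decode("punycode") == label

# A DNS label allows at most 63 octets, and the "xn--" prefix uses 4 of them.
fits_in_label = len("xn--") + len(puny_bytes) <= 63
print(fits_in_label)  # True
```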
> You think you know how to parse a domain name for validity?
Yes, I do, and if you _read_ the RFC you'll see that nothing changes: these domain names are encoded into the same character set that the current DNS system uses. Hence, if you give me a URL, I can validate it with existing scripts. There's an example which shows that Bucher.ch (with an umlaut on the u) would be translated to xn--bcher-kva.ch, which looks totally parseable to me.
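As a rough illustration of what such an "existing script" might look like, here is a Python sketch. The function name `is_valid_hostname` and the exact rules it checks (letters, digits and hyphens, 1 to 63 characters per label, no leading or trailing hyphen, 253 characters overall) are my assumptions about a typical checker, not code from any particular tool. The `idna` codec is Python's built-in one:

```python
import re

# One label: letters, digits and hyphens only, 1-63 characters,
# and no hyphen at either end.
LABEL_RE = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")

def is_valid_hostname(name: str) -> bool:
    """Old-style hostname check that knows nothing about IDN."""
    if name.endswith("."):
        name = name[:-1]  # tolerate one trailing root dot
    if not name or len(name) > 253:
        return False
    return all(LABEL_RE.fullmatch(label) for label in name.split("."))

# The encoded form from the example passes the unmodified check.
print(is_valid_hostname("xn--bcher-kva.ch"))  # True

# Python's built-in codec produces that same encoded form.
encoded = "b\u00fccher.ch".encode("idna").decode("ascii")
print(encoded)  # xn--bcher-kva.ch
```

The checker never needs to know that the `xn--` label stands for something else; it only sees ordinary letters, digits and hyphens.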
John.
No. The problem that Punycode solves is keeping the encoded names valid RFC 1034 DNS names. That is, even when encoded, the name will be accepted by standard DNS validity checkers.
UTF-8 does not have this property.
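A quick byte-level way to see the difference, sketched in Python. The helper `only_ldh` and the sample label are hypothetical, chosen for illustration, and the check only covers the letters-digits-hyphen character rule, not length limits:

```python
import string

# Octets allowed in a classic hostname label: letters, digits, hyphen.
LDH = set((string.ascii_letters + string.digits + "-").encode("ascii"))

def only_ldh(encoded: bytes) -> bool:
    """True if every octet is a letter, a digit or a hyphen."""
    return all(octet in LDH for octet in encoded)

label = "b\u00fccher"  # "bücher"
ace = b"xn--" + label.encode("punycode")  # the form that goes on the wire with IDN
raw = label.encode("utf-8")               # what putting UTF-8 straight into DNS would send

print(ace, only_ldh(ace))  # b'xn--bcher-kva' True
print(raw, only_ldh(raw))  # b'b\xc3\xbccher' False
```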
the growth in cynicism and rebellion has not been without cause