Internationalized Domain Names Coming Soon
rduke15 writes "You think you know how to parse a domain name for validity? Well, in case you haven't noticed, things are getting tougher as registrars keep adopting IDN (Internationalized Domain Names), which uses a weird encoding named Punycode to enable accented characters in domain names. The Register reports about Switzerland, Germany and Austria's joint move to enable IDN. See the overview in English from Switch. But I guess it would be difficult to talk about this on /., since it does not even support basic Latin-1 ... :-)"
I'm delighted to say that Mozilla is one step ahead again: it has supported IDN since version 0.9.5. http://www.mozilla.org/projects/intl/idn_mozilla.html
I have mixed feelings about this. I am from Sweden, and it always looks kind of ugly when names lose their dots and circles in the domain name.
On the other hand, this is also quite convenient. I live in the US now, and I travel around quite a bit. I often surf on Swedish Internet sites, typically without access to a Swedish keyboard. It would not be very convenient if the domain names used non-English symbols.
Sometimes I go to Japanese sites also, and I am really glad that I don't have to install a Japanese word processor to do this...
Tor
Punycode *is* a Unicode encoding.
Unicode has many encodings; UTF-8 is one encoding and Punycode is another. UTF-8 aims for efficiency when the majority of the text is ASCII, and Punycode aims for completeness when you must fit in 63 characters per label and use only ASCII letters, digits, and hyphens to do it.
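To make the comparison concrete, here is a minimal Python sketch using the standard library's built-in `utf-8` and `punycode` codecs. The sample word (münchen) is only an illustration chosen for this example.

```python
# Compare two encodings of the same Unicode string.
# "m\u00fcnchen" is the word münchen; it is only a sample input.
word = "m\u00fcnchen"

utf8_bytes = word.encode("utf-8")      # the ü becomes the two bytes 0xC3 0xBC
puny_bytes = word.encode("punycode")   # ASCII only: b"mnchen-3ya"

def is_ascii(data: bytes) -> bool:
    """Return True if every byte is in the 7-bit ASCII range."""
    return all(b < 0x80 for b in data)

print(utf8_bytes, is_ascii(utf8_bytes))
print(puny_bytes, is_ascii(puny_bytes))

# Both encodings are lossless: decoding gives back the original string.
assert utf8_bytes.decode("utf-8") == word
assert puny_bytes.decode("punycode") == word
```

UTF-8 is shorter here but needs bytes above 0x7F; Punycode is longer but stays inside what an old DNS stack already accepts.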
It looks as if the goal is to implement this without breaking existing implementations. I did RTFA, although I might be missing something, but it seems to me that the translation is done by the client/local nameserver.
i would imagine it probably attempts to query with the unicode first, and upon failure tries the munged address. since both versions are in the whois db, as DNS servers become unicode compliant, this would be naturally phased out.
however, it means that any accent-containing domains would actually have two entries; i wonder, would you have to actually register twice (i.e., pay twice)?
one good thing is that it does look like sufficiently undesirable names are the result of the conversion, so i don't think there would be much overlap between existing domains and the converted form of new accent-containing domains...
have you been seen on slash?
Now I won't have to be limited to using a hyphen! I can register d[i-circ]xiechicks.com, or dixi[e-grave]chicks.com, or maybe dixie[c-cedil]hicks.com!
That last one would be doubly good, because if I understand the Punycode spec correctly, it'll get translated to ASCII as dixiehicks-XXXX.com. Not my opinion of the group, but maybe it would attract hits from the Toby Keith crowd.
Stressed? Me? Of course not. Stress is what a rubber band feels before it breaks, silly.
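For anyone who wants to check that guess about the ASCII form, here is a rough Python sketch. It assumes Python's built-in `punycode` and `idna` codecs (the latter follows the 2003 IDNA rules) and uses the made-up name from the comment above, with a real c-cedilla put back in.

```python
# Hypothetical name from the comment above: "dixie" + c-cedilla + "hicks".
label = "dixie\u00e7hicks"

raw = label.encode("punycode")          # bare Punycode, no prefix
ace = (label + ".com").encode("idna")   # full name, with the ACE prefix added

print(raw)   # ASCII letters kept in order, then "-" and the encoded suffix
print(ace)   # same thing with "xn--" in front of the label
```

So the guess is close: the plain letters do survive as "dixiehicks-" plus a suffix, but the label actually registered in DNS also carries the "xn--" prefix.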
Hint: ascii is 7-bit.
don't understand a lick of french? 'taco is a mean man'
> You think you know how to parse a domain name for validity?
Yes, I do, and if you _read_ the RFC you'll see that nothing changes, these domain names are encoded into the same character set as the current DNS system. And hence if you give me a URL I can validate it with existing scripts. There's an example which shows that Bucher.ch (with an umlaut on the u) would be translated to: xn--bcher-kva.ch which looks totally parseable to me.
John.
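A quick sketch of that claim in Python, for the curious. The regex below is a stand-in for "existing scripts" (a plain letters-digits-hyphen check written for this example, not anyone's production validator), and the conversion uses Python's built-in `idna` codec.

```python
import re

# Hypothetical "existing script": a classic letters-digits-hyphen check
# for each label, written for this example only.
LDH_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")

def looks_valid(hostname: str) -> bool:
    """Return True if every dot-separated label passes the LDH check."""
    return all(LDH_LABEL.match(label) for label in hostname.split("."))

unicode_name = "b\u00fccher.ch"                          # bücher.ch
ace_name = unicode_name.encode("idna").decode("ascii")   # xn--bcher-kva.ch

print(ace_name, looks_valid(ace_name))          # the encoded form passes
print(unicode_name, looks_valid(unicode_name))  # the raw Unicode form does not
```

The old check only keeps working if the name is converted before it is validated, which is the part applications have to add.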
Elsewhere in the world, the Arabic numeral system (0123456789) had zero, and before that, so did ancient India.
I don't think the Mayans even used a base-ten system like the rest of the world, so attributing zero to them seems odd to me.
No. The point of Punycode is that the encoded names are themselves valid RFC 1034 DNS names. That is, even when encoded, standard DNS validity checkers will accept the name.
UTF-8 does not have this property.
the growth in cynicism and rebellion has not been without cause
Nothing in the DNS infrastructure needs to be upgraded. There is only US-ASCII in the zones. BUT you have to upgrade your applications in order to display the names the way they are supposed to be read; otherwise you will end up with www.xn--rksmrgs-5wao1o.se instead of "www.raksmorgas.se".
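Roughly what that application-side upgrade amounts to, as a minimal Python sketch. It assumes the built-in `idna` codec; `to_display` is a hypothetical helper name. Note that the decoded name carries the accents that were stripped from the spelling in the comment above.

```python
def to_display(hostname: str) -> str:
    """Decode any xn-- labels in an ASCII hostname back to Unicode."""
    return hostname.encode("ascii").decode("idna")

ace = "www.xn--rksmrgs-5wao1o.se"
print(to_display(ace))   # www.räksmörgås.se
```

An application that skips this step still resolves the name fine; it just shows the user the xn-- form.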
Sounds like a great idea.... If you're willing to re-implement the DNS code in my Win-95 box.... or on my Amiga-4000. How about my 10 year old Apollo workstation or the SUN-3 that's still working just fine, thank you. etc. etc.
A lot of old DNS implementations would choke (and properly so) on UTF-8 encoded DNS names. We probably could have seeded the needs of the future by saying that IPv6 DNS servers should support Unicode, but I think that even that boat has been missed (or is quickly leaving the dock).
In the meantime the old DNS and its anglo-centric presumptions and restrictions are with us for the next few years (or decades, as the case may be). Clearly some people feel the need to live within those restrictions.
Free Software: Like love, it grows best when given away.
There's no need to put accents on things, you can spoof just as well without. For example: the Greek omicron, Russian lowercase o, and Latin lowercase o all look identical... but they are all different Unicode characters!
Unless the registries all implement some sort of canonicalization, owners of domain names containing the letter "o" are going to have a combinatorial explosion!
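A small Python illustration of the look-alike problem, using only the standard `unicodedata` module. The sample names and the `spoof_count` helper are made up for this sketch, and it only counts substitutions for the letter "o".

```python
import unicodedata

# Three look-alike letters: Latin o, Greek omicron, Cyrillic o.
LOOKALIKES = ["o", "\u03bf", "\u043e"]

for ch in LOOKALIKES:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

def spoof_count(name: str) -> int:
    """Count the look-alike spellings of `name`, including the original,
    if each Latin "o" may be swapped for any of the three letters."""
    return len(LOOKALIKES) ** name.count("o")

print(spoof_count("slashdot"))   # one "o": 3 spellings
print(spoof_count("voodoo"))     # four "o"s: 81 spellings
```

The count grows as 3 to the power of the number of "o"s, and that is before any other confusable letters are considered.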
They did obviously consider unicode, perhaps you did not RTFA. However their solution uses unicode at a different layer.
I think the *real* solution here is to reimplement ALL top level DNS servers to support unicode. But the overhead in doing this, when you really think about it, seems daunting (ICANN approval, unicode related bugs, getting everyone to use the new DNS server, etc). At least, since the ASCII characters supported by DNS are exactly the same in Unicode, backwards compatibility should not be a problem.
This solution is a workaround that uses unicode at the client level, encodes it to "Punycode" (which only contains characters supported by DNS, unlike, say, BASE-64 or Quoted-Printable), and sends the request to the DNS server. It is a quick and easy solution to a messy problem. But its hacky-ness makes me doubt it will be supported by whatever governing body influences this stuff (IETF, ICANN, etc).
-Mani
There are a couple others, but I don't remember them offhand... So in other words, these characters are unusable for a reason.
_______________________________
"I'm not Conceited...I'm just a realist..."
Actually, I'm aware of that, but Slashdot seems to have stripped out the accents from my stuff...
I am aware that the German scharf s is not a capital B. I had it correctly in my submission, but someone who was working on the slashcode thought it would be a good idea to eliminate accents, rather than to possibly HTMLize them.
Try it yourself: put a scharf s into a Slashdot comment, and see what happens.
I notice that you DIDN'T complain about the missing accent on the French e, or the missing slash through the Swedish o.
Now, as a speaker of German for 10 years, I'm going to leave it at that.
I am unamerican, and proud of it!
Why monsteras instead of moensteraas?
Good question. Basically people don't think, or are too lazy, to transliterate the letters properly.
Some places have the forethought to register both:
Munich in Germany has registered both "munchen.de" and "muenchen.de".
(But it's really a u with an umlaut)
http://www.xn--rksmrgs-5wao1o.se/ will work if you are using a recent Mozilla.
Thanks for the example. Let's do a few quick tests.
The encoded version always works, and leads to a page where you have an unencoded link (normal spelling with the accents).
Copied the unencoded version, and tried:
On WinXP:
- Mozilla 1.4 : OK
- MSIE 6, Opera 6.2 : NO
On Linux - Red Hat 6.2 (of course, that's a pretty old system):
- lynx, ping, host, dig, ... : NO
(cannot test Mozilla, since this server has no GUI.)
Well, I guess we'll have to live with that horrible Punycode.