ICANN Mulling Multilingual URLs

Posted by ryuzaki0 on Thursday October 11, 2007 @07:12AM from the so-many-ways-to-say-google dept.

griffjon writes "The Washington Post is reporting that ICANN is testing out fully multilingual domain names. These won't just be [non-western-language].com, but would have TLDs translated into other scripts, fixing annoyances for non-English speaking audiences. An example: 'Speakers of Hebrew, Arabic and any other language written from right to left must type half of the URL in one direction and the other half — the .com, .net or .org postscript — the opposite way.' Let's hope it goes better this time around: 'Next week's experiments use the domain name "example.test" translated into 11 languages. A previous model, however, used "hippopotamus" instead of "test." These plans went awry when an Israeli registrar realized the Hebrew word ICANN thought meant "hippopotamus" was an expletive and threatened to involve the Israeli government.'"

Seriously by El+Lobo · 2007-10-11 07:22 · Score: 3, Interesting

Seriously, multilingual domain names are a pain (for all of humanity). Visiting Japan last year, I saw a lot of servers with Japanese script in their names. As a foreigner, I didn't have the slightest idea what a site was about without clicking on it. Clicking on it didn't help either. Yes, a lot of Japanese people have the same problem with English domain names, but adding multilingual names adds more complexity to the whole thing. I would like to see the face of a Chinese guy trying to decipher some URL using Ukrainian characters... or... trying to type it on his Japanese keyboard...

--
It's time to realise that Abble's products are the biggest abomination these days. Just say NO to the dumb iAbble way!!
1. Re:Seriously by veganboyjosh · 2007-10-11 07:25 · Score: 2, Interesting
  
  Speaking of Asian (written) languages, don't a lot of them read top to bottom?
  
  How to accommodate those?
Re:Domain name != URL by CastrTroy · 2007-10-11 08:28 · Score: 2, Interesting

Sounds like something that the Canadian government would embrace. There are rules for government websites that the URL must be bilingual, so the directory path and file names must be mirrored to create the same structure in both French and English. The loophole in the rules is that you don't have to provide multiple directories and folders where the name isn't linguistic, such as calling your file 1243.html, or ESADOFE.html. So you can either mirror your directory structure in French and English, or have a completely incomprehensible, gibberish-based directory structure.

--

Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
Re:What word? by zunger · 2007-10-11 14:00 · Score: 2, Interesting

Behemot is the plural of behema; the word literally means (roughly) "large, mindless quadruped." In the plural it's often used as an equivalent to "livestock," and in Biblical Hebrew it was used as the (only) word for hippopotamus. In more modern Hebrew, the borrowed word "hipopotam" is used for hippo, and "behema" has a slightly more literary feel to it -- except when it's used to refer to a person, which is probably its most common use today. And not polite. :)
Re:Some actual facts by jc42 · 2007-10-12 06:35 · Score: 3, Interesting

Before they rush on with alphabets that read right to left and use alternative character sets they really should try English words with greater than 8 bit characters. Are they gonna actually work?

Well, lately I've been testing a lot of my old code in various UTF-8 environments, and I've been duly impressed by the fact that, as Ken intended, almost all the code "just works" with Arabic, Chinese, Japanese, etc.

It turns out that there's a simple explanation. If the code doesn't examine chars with bit 8 (the high bit) turned on, but just treats them as unexamined "data" (or letters if the code is trying to distinguish that way), then everything works right. The only time the code needs to actually look at non-ASCII characters' values is when the text is being rendered in physical form. And hardly any code ever actually does that. Almost all my code reads data from files and writes data to other files, but never does anything with the physical representation of the data. It passes the data to other programs for that.
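
A minimal Python sketch of that idea, under stated assumptions: the helper name `copy_fields` and the sample line are hypothetical, not taken from any real program. The code splits a line on an ASCII comma without ever decoding it. It can do that safely because every byte of a multi-byte UTF-8 sequence has the high bit set, so an ASCII separator byte never appears inside a non-ASCII character.

```python
def copy_fields(line: bytes, sep: bytes = b",") -> list[bytes]:
    """Split a line on an ASCII separator, treating everything else as opaque data."""
    # The bytes are never decoded here; bytes >= 0x80 are just passed through.
    return line.rstrip(b"\n").split(sep)


# Hypothetical sample: one English, one Japanese, one Arabic, one Hebrew field.
sample = "name,名前,اسم,שם\n".encode("utf-8")

fields = copy_fields(sample)

# Decoding happens only at the very end, where the text would be displayed.
decoded = [f.decode("utf-8") for f in fields]
```

The split itself only ever compares bytes against the comma, which is the "unexamined data" treatment described above.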

A case in point: I was recently working on some multi-language HTML files, and I decided to try a fun test with CSS: I defined a whole lot of classes whose names were in Chinese. This made sense, since these classes were being used for pieces of the text that contained mostly Chinese characters, not counting things like spaces and punctuation. I tested the CSS using more than a dozen browsers that I have installed on my Linux and OS X test machines. I was unable to find a single case where it didn't work. I even hunted down some Windows boxes and tested the files on IE6 and IE7; they worked fine (despite the well-known CSS incompatibilities in IE ;-). I also tried a few CSS class names with Arabic and Hebrew names, and they worked fine, too.

Now, I don't think for a second that the writers of all those browsers spent time making sure that their code could handle UTF-8-encoded Chinese identifiers in CSS. I suspect that most of them never even considered the possibility. I'd bet that the code just takes anything that's not a significant character in CSS syntax, and tacitly treats it as a "letter". This is all it takes to make UTF-8 work correctly in this case.
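
A toy illustration of that guess in Python. This is not any browser's actual parser, and the names `is_ident_byte` and `scan_class_names` are made up for the sketch. It scans raw UTF-8 bytes for `.classname` selectors and counts any byte with the high bit set as a "letter", which is enough to pick up Chinese and Arabic class names without charset-specific code.

```python
def is_ident_byte(b: int) -> bool:
    """True for ASCII letters, digits, '-' and '_', plus any byte >= 0x80."""
    return (
        b >= 0x80                      # any non-ASCII byte counts as a "letter"
        or 0x30 <= b <= 0x39           # 0-9
        or 0x41 <= b <= 0x5A           # A-Z
        or 0x61 <= b <= 0x7A           # a-z
        or b in b"-_"
    )


def scan_class_names(css: bytes) -> list[bytes]:
    """Collect the identifier following each '.' in the input (toy scanner only)."""
    names = []
    i = 0
    while i < len(css):
        if css[i] == ord("."):
            j = i + 1
            while j < len(css) and is_ident_byte(css[j]):
                j += 1
            if j > i + 1:
                names.append(css[i + 1:j])
            i = j
        else:
            i += 1
    return names


# Hypothetical stylesheet with Chinese and Arabic class names.
css = (
    ".标题 { color: red; }\n"
    ".正文 { color: black; }\n"
    ".عنوان { font-weight: bold; }\n"
).encode("utf-8")

names = [n.decode("utf-8") for n in scan_class_names(css)]
```

The scanner is deliberately naive (it would also pick up the fraction in a value like `1.5em`), but it shows that the only bytes it has to recognize are the ASCII ones that mean something in the syntax.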

I did mention this in a couple of browsers' newsgroups. The responses were basically of the form "Well, of course it works. Why wouldn't it? You don't need special code to handle charset=UTF-8, except for the rendering. You'd have to be a fairly incompetent programmer to write code that doesn't work correctly with UTF-8. Except for rendering."

I can hear people saying "but those browsers all need to render the text." Yeah, but the CSS routines don't render text. They parse the CSS input, and fill in fields in data structures that tell the rendering code how to position and color the text. But the charset-handling code is probably not called anywhere in the CSS modules; it's only called in the few places that actually need to color pixels on the screen.

Lots of people have suggested declaring UTF-8 to be the only encoding for URLs. If this is done, there's probably very little URL-handling code anywhere that needs to be changed; it'll mostly "just work", because char codes 0x80 to 0xFF are treated as "letters". The only question is whether the final step of rendering the text's pixels will produce the right glyph, and the URL-handling code doesn't care about that.
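
As a rough sketch of what "just works" could mean for URL-handling code, here is a hypothetical byte-level splitter in Python. The function `split_url` and the sample URL are assumptions for illustration only. It looks only at the ASCII delimiters `://` and `/` and passes bytes in the range 0x80 to 0xFF through untouched. It says nothing about whether a real resolver or web server would accept such bytes.

```python
def split_url(url: bytes) -> tuple[bytes, bytes, list[bytes]]:
    """Split a URL into scheme, host, and path segments using ASCII delimiters only."""
    scheme, _, rest = url.partition(b"://")
    host, _, path = rest.partition(b"/")
    # Drop empty segments produced by doubled or trailing slashes.
    segments = [s for s in path.split(b"/") if s]
    return scheme, host, segments


# Hypothetical URL with a Japanese host label and path, encoded as raw UTF-8.
raw = "http://例え.test/文書/索引.html".encode("utf-8")

scheme, host, segments = split_url(raw)

# Decode only for display; the splitting never looked at the non-ASCII bytes.
host_text = host.decode("utf-8")
segment_texts = [s.decode("utf-8") for s in segments]
```

Because no byte of a multi-byte UTF-8 character is below 0x80, the delimiters `:`, `/`, and `.` can never be found in the middle of one.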

I happen to have a DNS server handy. Maybe I'll try a little test: In one of the domains, I'll add hostnames in Russian, Chinese, Arabic, and maybe a few other non-Roman alphabets. I'll wait a while, and see if I can access the machines via those names from a few other machines. I'll predict that it'll also "just work".

--
Those who do study history are doomed to stand helplessly by while everyone else repeats it.