Slashdot Mirror


Multi-Lingual Web Sites and Character Encoding?

languageLost asks: "I'm working on developing a multi-lingual web site, and I've come across a major problem. It looks like when web browsers submit a form, they don't include the character encoding they used in the headers anywhere. This means I have no way to distinguish between ISO-Latin-1, Shift-JIS, or GBK, for example. Netscape Navigator, Internet Explorer and Mozilla all have this problem. The browsers do send a header called "Accept-charset", but that's not what I'm looking for (and this header typically lies, in any case). I need to know what encoding was used for the text in the form fields. Does anyone know how to do this without using "detection" heuristics? Why don't the browsers just say what encoding they're using?"

4 comments

  1. use multipart/form-data by regexp · · Score: 2
    The HTTP and HTML specs only provide for ASCII content in the most basic form data: by default, the data is sent back using enctype=application/x-www-form-urlencoded, which officially only supports ASCII.

    I'm not the best person to be able to tell you, but if you're building a form that is likely to contain a lot of non-ASCII data, I think the most effective solution is to use enctype=multipart/form-data, where each part can carry its own Content-Type header with a charset parameter. See this portion of the HTML spec for more detail.
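
As a rough sketch of the server side of this suggestion, the Python below reads a multipart/form-data body and decodes each field with the charset declared on its part. The function name `decode_multipart_fields`, the boundary `XYZ`, and the field name `comment` are made up for illustration, and the fallback charset is an assumption, since a browser may well omit the per-part charset.

```python
from email import message_from_bytes


def decode_multipart_fields(content_type, body, default_charset="utf-8"):
    """Decode the text fields of a multipart/form-data body, part by part."""
    # The email parser needs the request's Content-Type (with its boundary)
    # as a header in front of the body.
    raw = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n" + body
    message = message_from_bytes(raw)
    if not message.is_multipart():
        raise ValueError("body is not a well-formed multipart message")

    fields = {}
    for part in message.get_payload():
        name = part.get_param("name", header="content-disposition")
        # Use the charset declared on the part; fall back to an assumed
        # default when the browser did not send one.
        charset = part.get_content_charset() or default_charset
        fields[name] = part.get_payload(decode=True).decode(charset)
    return fields


# Hypothetical request body carrying one Shift-JIS field.
example_body = (
    b"--XYZ\r\n"
    b'Content-Disposition: form-data; name="comment"\r\n'
    b"Content-Type: text/plain; charset=shift_jis\r\n"
    b"\r\n" + "日本語".encode("shift_jis") + b"\r\n"
    b"--XYZ--\r\n"
)
fields = decode_multipart_fields("multipart/form-data; boundary=XYZ", example_body)
```

This handles only text fields; file uploads and parts with a missing or incorrect charset would need extra care.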

  2. Apache modules by babbage · · Score: 3
    I've never actually tried to use the facility, but Apache allows you to set environment variables more or less on the fly. Assuming that you're running Apache, look up the documentation on SetEnv. If you've got a copy of O'Reilly's Apache guide, the reference material starts on page 90. The syntax is one of:
    SetEnv variable value
    SetEnvIf attribute regex envar[=value] [..]

    How you actually get the character encoding into this new variable is the proverbial exercise left to the reader, but I'm pretty confident that it could be done. In the worst case scenario, you'd have to write a new module for Apache, but it's possible that something like this already exists. Surely this isn't a rare problem when getting into I18N issues....

  3. Forms are submitted with page's encoding by DarkToast · · Score: 2
    From my experience, the forms are submitted with the page's encoding. If the page containing the form is a UTF-8 encoded page, the content would be submitted as UTF-8 encoded. Simply set your HTTP header "Content-Type" to the right encoding (e.g. Content-Type: text/html; charset=UTF-8).
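
If that observation holds for the browsers you care about, the server can decode the submission with the encoding it used for the page. The sketch below uses Python's standard `urllib.parse`; `PAGE_ENCODING` and `decode_form` are names invented here, and the browser's behaviour is simulated with `urlencode` rather than captured from a real request.

```python
from urllib.parse import parse_qs, urlencode

# Assumption: the page containing the form was served with this charset.
PAGE_ENCODING = "shift_jis"


def decode_form(body, encoding=PAGE_ENCODING):
    """Decode an application/x-www-form-urlencoded body using the page's encoding."""
    # errors="strict" raises on byte sequences that are invalid in the
    # chosen encoding, instead of silently substituting characters.
    return parse_qs(body, encoding=encoding, errors="strict")


# Simulate a browser percent-encoding the field in the page's encoding.
submitted = urlencode({"comment": "日本語"}, encoding=PAGE_ENCODING)

form = decode_form(submitted)

# Decoding the same bytes with a different encoding yields mojibake,
# not an error, which is why the server has to know the encoding.
wrong = parse_qs(submitted, encoding="latin-1")
```

Note that the percent-encoded body itself carries no encoding label, so this works only as long as the server remembers which encoding each page was sent in.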

    Internet Explorer also tends to convert posted messages with characters which don't fit into the current encoding (e.g. Russian text, while the page is in ISO-8859-1 encoding) into numerical HTML entities (&#1234;...) which are the character's position in the UCS-2 (Unicode) table.
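
A server that receives such numeric entities could turn them back into characters along these lines. This is only a sketch: `decode_numeric_refs` is a hypothetical helper, it handles decimal references only, and it cannot tell an entity generated by the browser from one the user typed literally.

```python
import re

# Matches decimal numeric character references such as "&#1088;".
_NUMERIC_REF = re.compile(r"&#(\d+);")


def decode_numeric_refs(text):
    """Replace decimal numeric character references with the characters they name."""
    def replace(match):
        code_point = int(match.group(1))
        # Leave references outside the Unicode range untouched.
        if code_point > 0x10FFFF:
            return match.group(0)
        return chr(code_point)

    return _NUMERIC_REF.sub(replace, text)


# Russian text as it might arrive from a form on an ISO-8859-1 page.
decoded = decode_numeric_refs("&#1088;&#1091;&#1089; ok &amp;")
```

Named entities such as `&amp;` are deliberately left alone here, since the comment describes only numeric ones.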

    Are there other ways to know what the browser meant? I'm not sure.

  4. I've done this before by divbyzero · · Score: 3
    Speaking from experience, I can say that posting HTML form data (using either GET or POST) works just fine in arbitrary encodings. The encoding will always be that of the page containing the form.

    If your script is the one capturing the form data, then it is also usually the one which generated the page with the form on it, so you can tell the browser to switch into whatever encoding you want (using the charset option on the Content-type HTTP header or placing it in an HTML META tag).

    --
    But my grandest creation, as history will tell,
    Was Firefrorefiddle, the Fiend of the Fell.
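
To illustrate the last comment's point about the script choosing the encoding, here is a minimal sketch that produces a form page with the charset declared both in the Content-Type header and in a META tag. The function `render_form_page`, the `/submit` action, and the `comment` field are assumptions for the example, not part of any particular framework.

```python
from html import escape


def render_form_page(encoding, action="/submit"):
    """Return (headers, body) for a form page declared and encoded in `encoding`."""
    content_type = f"text/html; charset={encoding}"
    html_text = (
        "<html><head>"
        # Declare the charset inside the document as well as in the header.
        f'<meta http-equiv="Content-Type" content="{content_type}">'
        "</head><body>"
        f'<form method="post" action="{escape(action)}">'
        '<input type="text" name="comment"><input type="submit">'
        "</form></body></html>"
    )
    headers = [("Content-Type", content_type)]
    # Encode the body with the same charset that was declared.
    return headers, html_text.encode(encoding)


headers, body = render_form_page("shift_jis")
```

The script that later receives the submission would then decode the form data with the same encoding it passed in here.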