Multi-Lingual Web Sites and Character Encoding?
languageLost asks: "I'm working on developing a multi-lingual web site, and I've come across a major problem. It looks like when web browsers submit a form, they don't include the character encoding they used in the headers anywhere. This means I have no way to distinguish between ISO-Latin-1, Shift-JIS, or GBK, for example. Netscape Navigator, Internet Explorer and Mozilla all have this problem. The browsers do send a header called "Accept-charset", but that's not what I'm looking for (and this header typically lies, in any case). I need to know what encoding was used for the text in the form fields. Does anyone know how to do this without using "detection" heuristics? Why don't the browsers just say what encoding they're using?"
I'm not the best person to answer this, but if you're building a form that is likely to contain a lot of non-ASCII data, I think the most effective solution is to use enctype=multipart/form-data, which lets each submitted part carry its own Content-Type header with a charset parameter. See this portion of the HTML spec for more detail.
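For what it's worth, here is a rough Python sketch of reading that charset parameter on the server side. The function names (part_charset, decode_field) and the UTF-8 fallback are my own assumptions for illustration, not anything from the spec, and a browser may simply omit the parameter, in which case you are back to the fallback.

```python
from email.message import Message


def part_charset(content_type, fallback="utf-8"):
    """Return the charset parameter of a part's Content-Type header.

    `fallback` is an assumed default for parts that carry no charset.
    """
    if content_type is None:
        return fallback
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value and returns `fallback`
    # when the header has no charset parameter.
    return msg.get_content_charset(fallback)


def decode_field(raw, content_type, fallback="utf-8"):
    """Decode one form field's raw bytes using the part's declared charset.

    Raises LookupError for an unknown charset name and UnicodeDecodeError
    when the bytes do not match the declared charset.
    """
    return raw.decode(part_charset(content_type, fallback))
```

You would call decode_field with the raw bytes of one part and that part's Content-Type header value, after your multipart parser has split the request body.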
How you actually get the character encoding into this new variable is the proverbial exercise left to the reader, but I'm pretty confident that it could be done. In the worst case scenario, you'd have to write a new module for Apache, but it's possible that something like this already exists. Surely this isn't a rare problem when getting into I18N issues....
Internet Explorer also tends to convert any posted characters that don't fit into the page's current encoding (e.g. Russian text when the page is in ISO-8859-1) into numeric HTML entities (&#1234;...), where the number is the character's position in the UCS-2 (Unicode) table.
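If you want to turn those entities back into real characters on the server, something like this Python sketch should work. It assumes the field has already been decoded to a text string, and decode_ncrs is just a name I made up. Note that it cannot tell a browser-generated entity from one the user typed literally.

```python
import re

# Decimal numeric character references such as "&#1234;".
_NCR = re.compile(r"&#([0-9]+);")


def decode_ncrs(text):
    """Replace decimal numeric character references with their characters."""
    def repl(match):
        code_point = int(match.group(1))
        # Leave out-of-range values and surrogate code points untouched.
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)
        return chr(code_point)

    return _NCR.sub(repl, text)
```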
Are there other ways to know what the browser meant? I'm not sure.
If your script is the one capturing the form data, then it is also usually the one which generated the page with the form on it, so you can tell the browser to switch into whatever encoding you want (using the charset option on the Content-type HTTP header or placing it in an HTML META tag).
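A minimal Python sketch of that idea, assuming you pick UTF-8 for the page and that the browser submits the form in the page's encoding (which is the usual behaviour, though not guaranteed). PAGE_CHARSET, form_page_headers and parse_submission are hypothetical names for illustration.

```python
from urllib.parse import parse_qs

# Assumption: the script serves the form page in this encoding and
# expects the browser to submit the form in the same one.
PAGE_CHARSET = "utf-8"


def form_page_headers():
    """HTTP headers to send with the page that contains the form."""
    return [("Content-Type", f"text/html; charset={PAGE_CHARSET}")]


def parse_submission(body):
    """Parse an application/x-www-form-urlencoded request body (bytes).

    Percent-escapes are decoded as PAGE_CHARSET; errors="strict" makes a
    submission in some other encoding raise UnicodeDecodeError instead of
    being silently mangled.
    """
    return parse_qs(
        body.decode("ascii"),
        keep_blank_values=True,
        encoding=PAGE_CHARSET,
        errors="strict",
    )
```

The strict error handling at least gives you a signal when a browser ignored the declared encoding, even if it can't tell you which encoding it used instead.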