Unicode, WWW, Databases and Japanese Web Sites?
Matthew Branton asks: "I have recently started working on a database driven Japanese language web site, and frankly I am getting lost in a sea of complicated Unicode madness. I was toying with the idea of using Python 2.1, PostgreSQL and the mxODBC interface, has anyone else had experience with this particular setup? Are there more appropriate free solutions available, perhaps JSP/Servlet based?" Unicode received a lot of coverage on Slashdot this week, and maybe that's due to the increasing worldwide popularity of the net and the desire of users to read and work in their own national character sets rather than trying to make everything fit in Latin-1. Do Python, PostgreSQL or mxODBC have any issues with Unicode?
...and everything works just fine with Shift-JIS.
Contrary to the popular belief, there indeed is no God.
Even if you ignore the fact that many Anime fans visit Japanese sites on a regular basis, and /. Anime fans tend to be technical, and thus know what is used on Japanese sites, I am sure that I am not the only person who has been contracted for an Asian site (in my case, Chinese).
I *can* tell you that PHP and MySQL work quite nicely with Chinese character sets, but you will quickly run into a few dozen tiny issues involving things like sorting and oddball string functions. In PHP's case (and since I don't know Chinese), I created a testbed which ran an arbitrary string through every function that could possibly be in use. I then had a Chinese tester go through and make sure everything worked (iirc, only a few weird, replaceable functions like ucwords() didn't work). In MySQL's case, the fix was using the latest version, setting a few flags and writing the queries slightly differently than I normally do (BINARY flag, etc.): something that I contacted the developers about, and got (as of then) undocumented, cutting-edge answers.
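For the Python setup asked about in the question, the same testbed idea might look roughly like the sketch below. It is only an illustration: the `CANDIDATES` table, the `run_testbed` name and the sample string are assumptions made up for this example, not anything from the PHP original. The point is to push one sample string through every string operation the site relies on and hand the output to someone who can read the language.

```python
from typing import Callable, Dict

# Hypothetical list of string operations the site might rely on.
# Extend it with whatever functions your own code actually calls.
CANDIDATES: Dict[str, Callable[[str], str]] = {
    "upper": str.upper,
    "lower": str.lower,
    "title": str.title,
    "strip": str.strip,
    "reversed": lambda s: s[::-1],
    "first_two": lambda s: s[:2],
}


def run_testbed(sample: str) -> Dict[str, str]:
    """Run the sample through every candidate and collect the results."""
    results: Dict[str, str] = {}
    for name, func in CANDIDATES.items():
        try:
            results[name] = func(sample)
        except Exception as exc:  # record failures instead of stopping
            results[name] = f"ERROR: {exc!r}"
    return results


# Print a table that a native-speaking tester can review by eye.
report = run_testbed("中文测试")
for name, output in report.items():
    print(f"{name:10} {output}")
```

A program cannot judge whether the output is linguistically sensible, so the human review step described above is still needed.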
--
Evan
"$30 for the One True Ring. $10 each additional ring!" -- JRR "Bob" Tolkien
I've written several dynamic web sites using mySQL, JSP and servlets and this is what I did:
Stored the database internally in UTF-8 format and accessed it using unicode from the servlets.
Stored and served the HTML and JSP files in Shift-JIS format. As long as you tell the servlet that a JSP file is in shift-jis, any characters you write to the document as it's being processed will be converted automatically to shift-jis.
Because HTML forms are stupid and don't support (by default) character sets, you have to assume that the form will be sent back in the same character set the file was sent in. In Java, this meant telling the form handler to interpret the bytes as Shift-JIS, which automatically converted them back into Unicode. That let me handle the form data in a uniform manner (1 character is 1 character, even if its UTF-8 representation is more than one byte) and easily store it in the database in UTF-8.
As far as I know, Python supports Unicode strings, which would allow this type of handling. I recommend storing your database in some sort of native Unicode format, which should make this fairly seamless and should cut down on conversion costs, since UTF-8 to Unicode is a MUCH faster conversion than Shift-JIS to Unicode. So storing like this cuts out one step, assuming you wish to handle all form data as Unicode.
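In Python, the round trip described above could be sketched as follows. This is a minimal illustration using Python 3's built-in codecs, not a recipe for any particular web framework; the three function names are hypothetical, and it assumes the page was served as Shift-JIS so the browser submits form bytes in that encoding.

```python
def decode_form_field(raw: bytes, page_charset: str = "shift_jis") -> str:
    """Interpret submitted bytes in the charset the page was served in."""
    return raw.decode(page_charset)


def to_storage(text: str) -> bytes:
    """Encode text as UTF-8 for the database."""
    return text.encode("utf-8")


def render_for_page(stored: bytes, page_charset: str = "shift_jis") -> bytes:
    """Turn stored UTF-8 back into the charset used for the HTML output."""
    return stored.decode("utf-8").encode(page_charset)


# Simulate a browser submitting three kanji from a Shift-JIS page.
submitted = "日本語".encode("shift_jis")
text = decode_form_field(submitted)
stored = to_storage(text)

# Characters, Shift-JIS bytes, UTF-8 bytes
print(len(text), len(submitted), len(stored))  # prints: 3 6 9
```

Once decoded, the application only ever sees three characters; the byte counts differ only at the edges, where data enters from the form or leaves for the database and the page.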
- use the unicode like you've been trying
- use shift-JIS (like most Japanese websites)
The trick to handling Japanese data-driven content is to remove all the 'text' filters you may have in place for the content administration or data types in the system. Unfortunately, when you tell it 'text', it assumes the single-byte limitations (which is fine for ASCII). But Japanese is double-byte, and as long as you pass data along as raw data, everything should be okay. I'm sorry if that's restating the obvious, but it's a point many people overlook.
yoroshiku,
Dave
davejenkins.com
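To see why single-byte 'text' filters are dangerous, here is a small Python sketch. The backslash-stripping filter is a made-up example of such a filter, not taken from any specific system. In Shift-JIS the second byte of 表 is 0x5C, the same value as an ASCII backslash, so a filter that works byte by byte silently damages the text.

```python
# "表示" ("display") encoded as Shift-JIS; the kanji 表 is the bytes 0x95 0x5C.
data = "表示".encode("shift_jis")
has_backslash_byte = b"\\" in data

# A hypothetical single-byte filter that strips backslashes from input.
filtered = data.replace(b"\\", b"")
damaged = filtered.decode("shift_jis", errors="replace")

# The same filter applied to decoded characters leaves the text alone,
# because the string contains no backslash character at all.
safe = data.decode("shift_jis").replace("\\", "")

print(has_backslash_byte)   # the byte 0x5C really is in the encoded data
print(damaged == "表示")     # the byte-level filter changed the text
print(safe == "表示")        # the character-level filter did not
```

Either approach avoids the damage: pass the bytes through untouched, as suggested above, or decode them to characters before any filter gets to look at them.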
You could always try asking the same question on Slashdot Japan. You might get more of an answer. I think the /. readers there read a bit more Japanese.