Migrating Large Scale Applications from ASCII to Unicode?
bobm asks: "We've been asked to migrate our newer applications to Unicode. My biggest issue is that if we start storing user data in unicode we will no longer be able to provide complete updates the legacy (pure ASCII) systems. This is important in that we are currently updating > 25k customers a day and managment does not want that to be affected. I also haven't found a clean way to provide multilanguage data mining that can return a single language output. This doesn't even begin to address issues like data validation and display issues. (note: we currently handle the web pages in multiple language sets but require the data to be in ascii form.)
I've spent some time on Unicode.Org but I really haven't found any real world discussions on people doing this on a large scale (>1Tbyte databases)."
We were using Excel as the data entry client, the using Perl (with the Excel module, very good BTW) and/or VBA to extract the data and send it to Oracle, and ODBC to query from Oracle into Excel. This wasn't a decision we made, it was the clients(i.e. the customer, not the software) legacy way of doing things, and they weren't up for paying us to rebuild it, and retrain all their staff.
You can use Perl to extract the data from Oracle and write SQL INSERT or SQL*Loader scripts, but this is a real pain. Windows is pretty good for Unicode, actually, even Notepad is a Unicode text editor, but the actual encoding is (off the top of my head) fixed width (16 bit) UCS2. The locale of the Oracle client was UTF8 (variable width), and it was verifying that the translating worked that sucked up a lot of resource (we naively first assumed that it would just work). UTF8 is great because if you're only using a subset of it, it doesn't waste storage space. The Oracle server was Windows 2000, the client terminals were a variety of different versions of Windows, running Excel for some bits of the app, MSIE4 for others. On the web side, there was some rather crap ASP/COM based middleware, in the end we dumped it and redid it in Java just for the Unicode-nativeness of it.
Around that time (this was just over 6 months ago) I woulda killed for a Java API to Excel with access to all the objects exposed to VBA, which would have made things a breeze; maybe that exists now.
Does your application support multiple languages now? If it does, it probably has a default language for everything that should be present in case the specific language asked for is missing. Rather than have that be "en_us" (or whatever), make that "US English ASCII-friendly". You can then add a new language "US English Unicode". Then alter your mandate so that everything has at least that language. I'd add Unicode and ASCII flavors for all other languages too, although anything that doesn't use high chars can just be stored as ASCII with the Unicode encoding generated (if space is that much of an issue).
If your application database is not multi-lingual already, then you have some serious architechture work to do. I'd look at it from that standpoint though -- there is a wealth of reference material describing how to add language support to existing data and apps. Think of Unicode as another language.
Concentrate on these issues, and let the technical issues (such as encoding scheme) be decided after you know what you want to do. As far as that specific one goes (seems to have the most interest on this page so far), just use whatever you DBMS supports most natively.
-Richard
You're special forces then? That's great! I just love your olympics!