Migrating Large Scale Applications from ASCII to Unicode?
bobm asks: "We've been asked to migrate our newer applications to Unicode. My biggest issue is that if we start storing user data in unicode we will no longer be able to provide complete updates the legacy (pure ASCII) systems. This is important in that we are currently updating > 25k customers a day and managment does not want that to be affected. I also haven't found a clean way to provide multilanguage data mining that can return a single language output. This doesn't even begin to address issues like data validation and display issues. (note: we currently handle the web pages in multiple language sets but require the data to be in ascii form.)
I've spent some time on Unicode.Org but I really haven't found any real world discussions on people doing this on a large scale (>1Tbyte databases)."
We were using Excel as the data entry client, the using Perl (with the Excel module, very good BTW) and/or VBA to extract the data and send it to Oracle, and ODBC to query from Oracle into Excel. This wasn't a decision we made, it was the clients(i.e. the customer, not the software) legacy way of doing things, and they weren't up for paying us to rebuild it, and retrain all their staff.
You can use Perl to extract the data from Oracle and write SQL INSERT or SQL*Loader scripts, but this is a real pain. Windows is pretty good for Unicode, actually, even Notepad is a Unicode text editor, but the actual encoding is (off the top of my head) fixed width (16 bit) UCS2. The locale of the Oracle client was UTF8 (variable width), and it was verifying that the translating worked that sucked up a lot of resource (we naively first assumed that it would just work). UTF8 is great because if you're only using a subset of it, it doesn't waste storage space. The Oracle server was Windows 2000, the client terminals were a variety of different versions of Windows, running Excel for some bits of the app, MSIE4 for others. On the web side, there was some rather crap ASP/COM based middleware, in the end we dumped it and redid it in Java just for the Unicode-nativeness of it.
Around that time (this was just over 6 months ago) I woulda killed for a Java API to Excel with access to all the objects exposed to VBA, which would have made things a breeze; maybe that exists now.