Migrating Large Scale Applications from ASCII to Unicode?
bobm asks: "We've been asked to migrate our newer applications to Unicode. My biggest issue is that if we start storing user data in Unicode we will no longer be able to provide complete updates to the legacy (pure ASCII) systems. This is important in that we are currently updating > 25k customers a day and management does not want that to be affected. I also haven't found a clean way to provide multilanguage data mining that can return a single language output. This doesn't even begin to address issues like data validation and display issues. (Note: we currently handle the web pages in multiple language sets but require the data to be in ASCII form.)
I've spent some time on Unicode.Org but I really haven't found any real world discussions on people doing this on a large scale (>1Tbyte databases)."
You don't mention any specifics, so it's hard to give details in response. What databases? How free a hand do you have?
I'd suggest a message-oriented, XML-based system. You can model to your heart's content in XML: languages, charsets, etc. You can design nearly anything around that, and have various backends convert the XML messages (SOAP, possibly) to the kind of data that's useful for the given backend.
Unable to read configuration file '/bigassraid/htdig//conf/14229.conf'
Geocrawler error message.
What might be useful is to read how StarOffice did their Unicode and internationalization changes to an existing large code base at sun.com.
C.
I sometimes write stuff
For older clients, simply send a question mark or similar for any character not in the ASCII character set; this is extremely trivial to add to your back end. New clients get unicode and all the trappings that go with it. Be sure your support people are trained to explain that updating the client provides the new multinational functionality and eliminates the question mark placeholders.
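A minimal sketch of that fallback in Python, assuming the back end holds text as Unicode strings and the old clients accept only 7-bit ASCII. The function name `to_legacy_ascii` is made up for illustration.

```python
def to_legacy_ascii(text: str) -> bytes:
    """Encode text for an ASCII-only client.

    Every character outside ASCII is replaced by a single '?'.
    """
    return text.encode("ascii", errors="replace")


# "Zo\u00eb M\u00fcller" is "Zoe Muller" with a diaeresis on the e and the u.
print(to_legacy_ascii("Zo\u00eb M\u00fcller"))   # b'Zo? M?ller'
print(to_legacy_ascii("plain ascii"))            # b'plain ascii'
```

The replacement is lossy, so this belongs only on the outbound path to old clients, never on the data as stored.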
Regarding your question about different languages/encodings - you may need to include the language per record all the way through to the client end. Without knowing more about your output system, it's difficult to say what the display issues are, but it's difficult to believe many display libraries would limit you to a language per session.
If your application returns results in XML you can always "safely" encode parts of the text using character entities (&#nn;). Another solution is to return not one but several results, in various encodings (you would have either to store the native encoding of a text or to figure out what it could be).
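As a rough sketch of the character-entity approach, Python's standard codecs can produce the numeric references directly. The helper name `to_ascii_xml_text` is hypothetical, and the sketch assumes the result is used as XML text content (not as an attribute value).

```python
from xml.sax.saxutils import escape


def to_ascii_xml_text(text: str) -> str:
    """Return XML text content that contains only ASCII characters.

    Markup characters (&, <, >) are escaped first, then every non-ASCII
    character becomes a decimal character reference such as &#252;.
    """
    escaped = escape(text)
    return escaped.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


print(to_ascii_xml_text("M\u00fcller & Co"))   # M&#252;ller &amp; Co
```

Any conforming XML parser on the receiving side turns the references back into the original characters, so nothing is lost in transit.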
And I hope this kind of practical discussion can help to raise the level of interest in Unicode amongst application coders.
Although a lot of "core" coders (as in people who write languages and tools) are really into Unicode and trying to get their code to process it properly, I found that most "application programmers", people who use those tools, are not at all interested. They tend to think that all software should support their favorite encoding natively. They also tend to curse a lot when they get data in a different encoding ;--) Usually they view Unicode as yet another curse thrown upon them by an irresponsible buzzword-worshipping management.
In fact Unicode is certainly hard and painful to implement, but it is a standard, and at least it was written by people who know what they're doing. It solves problems that most of us either have had to deal with (oh, the agony of dealing with odd characters in SGML data) or will have to deal with. Face it, people: there are more and more people whose names include funny characters, even in the US, and that's too big a market to leave untapped.
So please view Unicode as a chance, and if the poster can do it on a terabyte of data, you can certainly do it on much less, especially as the tools are coming (yes, even Perl!)
Look, that's why there's rules, understand? So that you think before you break 'em. (Terry Pratchett)
Check this standard for Unicode compression.
It compresses 16-bit Unicode chars to 8-bit using some reserved tags to switch the character windows. A sample Java implementation is available. The best thing is that most of the standard ASCII chars will still be encoded as 8-bit ASCII after the compression. So you can still store all your data in 8-bit ASCII and convert it to Unicode before displaying it. And you don't have to modify your old data!!!
A very useful resource on Unicode is this page, written by Markus Kuhn. In particular you may be interested in How do I have to modify my software?; while it does concentrate on Unix, the general principles should be the same on any OS.
Steven Murdoch.
web: http://www.cl.cam.ac.uk/users/sjm217/
Oracle 8i, UTF8 character set. Compatibility with both Unicode and ASCII character sets. What're the problems? Well, clients that think that Unicode is UCS2 are one thing to watch out for, as is forgetting that there's more to life than Western European ISO.
Basically, 90% of the problems you will encounter are in converting between character sets to integrate with other things. If you can use Java (Unicode native) and PL/SQL for as much as possible, you'll have fewer problems. If your client is Excel (don't ask), that complicates matters. If you can assume that everything in the database is US7ASCII you're all set, because you won't need to do any data cleansing. If you have to convert stuff that's already there, then you will run into problems. What happened to me is that we had a Western European encoding, but people were entering Cyrillic data. It all came out fine on their desktops, which were configured for that character set, but the actual data in the database was gibberish as far as the queries were concerned. Non-trivial to fix.
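To illustrate the Cyrillic-in-a-Western-European-column problem, here is a small Python sketch. It assumes the desktops used Windows-1251 and the database was labelled Latin-1; the post doesn't say which encodings were actually involved, so both are stand-ins.

```python
# "Privet" (hello) in Cyrillic, written with escapes to keep the source ASCII.
original = "\u041f\u0440\u0438\u0432\u0435\u0442"

# Bytes the Cyrillic-configured desktop actually sent (assumed: cp1251).
stored_bytes = original.encode("cp1251")

# What the database believes it holds (assumed: Latin-1 column).
as_database_sees = stored_bytes.decode("latin-1")

# A query for a Cyrillic letter finds nothing: the column contains
# accented Latin letters that merely share the same byte values.
print("\u0440" in as_database_sees)        # False
print(ascii(as_database_sees))             # '\xcf\xf0\xe8\xe2\xe5\xf2'

# Repair is only possible if you know the encoding the users really had.
repaired = as_database_sees.encode("latin-1").decode("cp1251")
print(repaired == original)                # True
```

The hard part in practice is the last step: one column may mix rows from users with different desktop encodings, and nothing in the bytes says which is which.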
Good luck!
I'm finding it depressing seeing how things get modded here. This has been modded as funny??
The guy is absolutely right - using UTF-8 solves lots of problems when having to use legacy software with Unicode. I did one project working with twelve languages, including Arabic, Japanese, Hindi and Welsh, and we just used sed to search and replace marker tags in hundreds of UTF-8 files. Worked a treat.
What's with people assuming that UTF-8 is ASCII? It's not. UTF-8 is a multibyte representation that just happens to coincide with ASCII for characters 0 through 127. After that it takes two bytes to encode a character, possibly more when you get to "big" characters.
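A quick way to see the byte counts, sketched in Python. The sample characters are arbitrary picks, one from each UTF-8 length class.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


samples = ["A", "\u00e9", "\u20ac", "\U00010348"]  # A, e-acute, euro sign, Gothic hwair
for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_len(ch)} byte(s)")
# U+0041 -> 1 byte(s)
# U+00E9 -> 2 byte(s)
# U+20AC -> 3 byte(s)
# U+10348 -> 4 byte(s)

# Pure ASCII text is byte-for-byte identical in both encodings ...
print("hello".encode("utf-8") == "hello".encode("ascii"))   # True

# ... but anything else is not valid ASCII at all.
try:
    "\u00e9".encode("utf-8").decode("ascii")
except UnicodeDecodeError:
    print("not ASCII")                                      # not ASCII
```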
UTF-8 is an encoding for unicode characters.
i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
The way I understand this, you have old clients, new clients, and a server that must handle both. And the server and new clients should support Unicode.
First, although this is probably obvious, I should note that if your data is primarily text, then you're looking at up to a 2 TB database when you start using Unicode (depending on the encoding: a 16-bit encoding doubles ASCII text, while UTF-8 leaves it the same size).
This is sort of like supporting German language entry, and wanting to display it on English clients. It's not easy, but it can be done, to some extent. Most Unicode you encounter will have an equivalent ASCII representation; there are acceptable conversions for almost all non-Eastern character sets. You can serve up a converted representation to your ASCII clients.
DO NOT listen to the bullshit about serving up UTF-8 to ASCII clients. They can't understand it any more than I can understand German; it will seem to work only for low-ASCII characters, but break for all others.
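One possible way to produce such a converted representation, sketched in Python with the standard `unicodedata` module: decompose each character and drop the accents. The function name `ascii_fallback` is made up here, and this is only one of several conventions (German readers, for instance, would usually expect "ue" for u-umlaut, which needs a hand-written table).

```python
import unicodedata


def ascii_fallback(text: str) -> str:
    """Best-effort ASCII rendering of Unicode text.

    Compatibility-decomposes the text (NFKD), drops combining marks,
    and replaces whatever still isn't ASCII with '?'.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_marks.encode("ascii", errors="replace").decode("ascii")


print(ascii_fallback("Caf\u00e9 M\u00fcller"))   # Cafe Muller
print(ascii_fallback("stra\u00dfe"))             # stra?e  (sharp s has no decomposition)
print(ascii_fallback("\u65e5\u672c"))            # ??      (no ASCII equivalent at all)
```

The last two lines show where the "to some extent" bites: characters without a decomposition need a transliteration table, and Eastern scripts have no sensible ASCII form.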
As for data validation, you are going to have to have two rulesets. One will be client-side ASCII; the other a unicode ruleset used by both the new client and the server. Incoming ASCII from the old client should be converted to equivalent Unicode (that's the easy part) before being validated.
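A rough Python sketch of the server side of that arrangement. Both function names and the rule itself (a "name" field limited to letters, combining marks, spaces, hyphens and apostrophes, at most 64 code points) are invented for illustration.

```python
import unicodedata


def decode_legacy(raw: bytes) -> str:
    """Convert input from an old ASCII client to a Unicode string.

    Strict on purpose: bytes above 127 raise UnicodeDecodeError instead
    of being guessed at.
    """
    return raw.decode("ascii")


def validate_name(name: str) -> bool:
    """Hypothetical Unicode ruleset shared by new clients and the server."""
    name = unicodedata.normalize("NFC", name)
    if not 0 < len(name) <= 64:
        return False
    # Unicode general categories starting with L are letters, M are marks.
    return all(unicodedata.category(c)[0] in ("L", "M") or c in " -'" for c in name)


print(validate_name(decode_legacy(b"O'Brien")))   # True  (old client)
print(validate_name("Zo\u00eb"))                  # True  (new client)
print(validate_name("bob<script>"))               # False
```

Because legacy input is converted first, there is only one set of rules to maintain on the server; the old client's ASCII ruleset remains a stricter subset of it.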
Sorry, no real-world information here either; certainly not on a database that size.
i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
You first have to examine carefully the character set your current application can deal with. Is it ASCII? Or just the printable range? Or do most routines treat everything as sequences of 8-bit characters? Is the null character permitted in data? And so on.
After that, you have to identify the operations which are character set specific. This can be quite a bit of work. Character set specific operations include case conversion, collating, normalizing, measuring string length and character width (for formatting plain text), text rendering in general, and so on.
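A few of these operations, shown in Python to make the point concrete. The sample strings are arbitrary; the behaviour comes from Python's built-in `str` type and the standard `unicodedata` module.

```python
import unicodedata

word = "stra\u00dfe"        # German "strasse" written with a sharp s
upper = word.upper()        # case conversion can change the length
composed = "\u00e9"         # e-acute as a single code point
decomposed = "e\u0301"      # "e" followed by a combining acute accent

print(len(word), len(upper), upper)                          # 6 7 STRASSE
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(len(composed), len(decomposed))                        # 1 2
print(unicodedata.east_asian_width("\u65e5"))                # W (double-width cell)

# Naive sorting compares code points, so e-acute sorts after "z".
print(sorted(["zebra", "\u00e9clair"]) == ["zebra", "\u00e9clair"])  # True
```

Every line here corresponds to an assumption that ASCII-era code tends to bake in: uppercasing keeps the length, equal-looking strings compare equal, one character is one column, and byte order is alphabetical order.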
Now you look at your tools. Do they prefer some kind of Unicode encoding? For example, with Java or Windows, using UTF-16 is most convenient (some would say: mandated).
Now you put the pieces together and look for a suitable internal representation (not necessarily "Unicode", i.e. UTF-8, UTF-16, or UTF-32), identify points at which data has to be converted (usually, it is a good idea to minimize this, but if you want to fit everything together, there is sometimes no other choice), and modules and external tools which have to be replaced because adjusting them or adapting to them is too much work.
Your web page generation tools probably need a complete overhaul, so that they are able to minimize the charset being used (for example, German text is sent as ISO-8859-1, but Russian text as KOI8-R or something like that), since client-side Unicode support is mostly ready, but many people don't have the necessary fonts.
But if your database is currently dominated by ASCII or even typical Latin-1 text, that's a reasonable tradeoff: no increase for ASCII text, a slight increase for Latin-1 text (100% on a minority of the characters in actual text; anyone have actual stats handy?), 50% increase for the rest of the 16-bit range, and the same maximum character size (U+10000 - U+10FFFF take 4 bytes in both UTF-8 and UTF-16). And then you have the other advantages already mentioned: compatibility with 7-bit ASCII, NUL-terminated C strings, and ordinary 8-bit clean text channels. If you're currently in the ASCII or Latin-1 domain the question isn't even what you expect to store in the future, so much as how much cheaper disk space will be when you finally need to store it.
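The size comparison is easy to check for yourself. A small Python sketch, with sample strings chosen only to cover each range mentioned:

```python
def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 bytes, UTF-16 bytes) for text, without a byte-order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


samples = {
    "ASCII": "hello",                      # 5 characters
    "Latin-1": "caf\u00e9",                # 4 characters, one accented
    "CJK (16-bit range)": "\u65e5\u672c\u8a9e",   # 3 characters
    "above U+FFFF": "\U00010348",          # 1 character
}
for label, text in samples.items():
    utf8, utf16 = encoded_sizes(text)
    print(f"{label}: UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")
# ASCII: UTF-8 5 bytes, UTF-16 10 bytes
# Latin-1: UTF-8 5 bytes, UTF-16 8 bytes
# CJK (16-bit range): UTF-8 9 bytes, UTF-16 6 bytes
# above U+FFFF: UTF-8 4 bytes, UTF-16 4 bytes
```

Running the same function over a sample of real rows would give the "actual stats" asked for above, at least for your own data.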
Just in case any of this work is being done on Microsoft Windows, you should avoid "#define UNICODE", TCHAR, and _T(). These are mainly legacy tricks used to help Windows 3.1 developers cross-compile their code for NT. Microsoft themselves don't use them, and instead go with pure Unicode through the app. Even COM in Win32 since the first release of Windows 95 is all Unicode (BSTRs).
Of course, this would preclude you from using MFC, but then again, many think that avoiding it is a good thing (again, Microsoft is among those who avoid using it). But aside from other benefits, you'd end up with not needing to build two separate binaries: one for Windows NT/2K and one for Win9X.
Oh, and one other thing. If you are doing any portable code, remember that the Microsoft documentation lies and that wchar_t is not always 16-bit like they say. In fact, the spec recommends that it be 32-bit, and most other platforms (Linux included) define it thus.
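If you want to see what a given platform's C library does, one quick check (sketched in Python rather than C, using the standard `ctypes` module) is to ask for the size of the native wchar_t:

```python
import ctypes

# ctypes.c_wchar maps to the platform C compiler's wchar_t.
width_bytes = ctypes.sizeof(ctypes.c_wchar)
print(f"wchar_t is {width_bytes * 8}-bit on this platform")
```

This typically reports 16-bit on Windows and 32-bit on Linux, which is exactly why code that assumes one wchar_t per character, or a fixed on-disk size, doesn't port.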