Slashdot Mirror


Migrating Large Scale Applications from ASCII to Unicode?

bobm asks: "We've been asked to migrate our newer applications to Unicode. My biggest issue is that if we start storing user data in Unicode we will no longer be able to provide complete updates to the legacy (pure ASCII) systems. This is important because we are currently updating > 25k customers a day and management does not want that to be affected. I also haven't found a clean way to provide multilanguage data mining that can return a single-language output. This doesn't even begin to address issues like data validation and display. (Note: we currently handle the web pages in multiple language sets but require the data to be in ASCII form.) I've spent some time on Unicode.Org but I really haven't found any real-world discussions of people doing this on a large scale (>1Tbyte databases)."

5 of 202 comments

  1. Convert all interaction to XML by Kingpin · · Score: 5, Informative


    You don't mention any specifics, so it's hard to give details in response. What databases? How free a hand do you have?

    I'd suggest a message-oriented, XML-based system. You can model to your heart's content in XML: languages, charsets, etc. You can design nearly anything around that, and have various backends convert the XML messages (SOAP, possibly) to the kind of data that's useful for the given backend.
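
    As a rough sketch of that idea (not any particular system's format), a message can carry each field's language alongside the text, and a backend can parse it and convert as it sees fit. The element names `update` and `field`, the helper names, and the sample record are all made up for illustration; only the standard `xml:lang` attribute and Python's `xml.etree.ElementTree` are real.

```python
import xml.etree.ElementTree as ET

# The standard xml:lang attribute, in ElementTree's {namespace}name notation.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_message(record_id, fields):
    """Build one update message; fields maps name -> (language, text).

    The <update>/<field> layout is hypothetical.
    """
    root = ET.Element("update", {"id": str(record_id)})
    for name, (lang, text) in fields.items():
        field = ET.SubElement(root, "field", {"name": name, XML_LANG: lang})
        field.text = text
    # Serialize as UTF-8 bytes so any language fits in the message.
    return ET.tostring(root, encoding="utf-8")


def read_message(payload):
    """What a backend might do first: parse and return {name: (language, text)}."""
    root = ET.fromstring(payload)
    return {f.get("name"): (f.get(XML_LANG), f.text) for f in root.iter("field")}


payload = build_message(42, {"city": ("ru", "Москва"), "country": ("en", "Russia")})
fields = read_message(payload)
```

    Each backend then decides what to do with a field based on its language tag, for example passing English-only fields on to an ASCII system.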

    --
    Unable to read configuration file '/bigassraid/htdig//conf/14229.conf'
    Geocrawler error message.
  2. Perhaps useful, how staroffice did it. by caolan · · Score: 5, Informative

    What might be useful is to read how StarOffice did their Unicode and internationalization changes to an existing large code base, at sun.com
    C.

    --
    I sometimes write stuff
  3. Possible solutions and a plea by mir · · Score: 5, Insightful

    If your application returns results in XML you can always "safely" encode parts of the text using character entities (&#nn;). Another solution is to return not one but several results, in various encodings (you would have either to store the native encoding of a text or to figure out what it could be).
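
    A minimal sketch of the character-entity approach in Python, assuming the output has to stay pure ASCII: the `xmlcharrefreplace` error handler turns every non-ASCII character into a decimal reference. The function name and the sample name are made up for illustration.

```python
import html
from xml.sax.saxutils import escape


def to_ascii_xml_text(text):
    """Return XML-safe, pure-ASCII text.

    escape() handles &, < and >; the xmlcharrefreplace handler then
    replaces each non-ASCII character with a &#nn; reference.
    """
    return escape(text).encode("ascii", "xmlcharrefreplace").decode("ascii")


safe = to_ascii_xml_text("Zoë Müller")   # "Zo&#235; M&#252;ller"
restored = html.unescape(safe)           # any XML/HTML parser resolves the references
```

    The ASCII output survives a legacy pipeline untouched, while a Unicode-aware consumer gets the original characters back when it parses the references.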

    And I hope this kind of practical discussion can help to raise the level of interest in Unicode amongst application coders.

    Although a lot of "core" coders (as in people who write languages and tools) are really into Unicode and trying to get their code to process it properly, I have found that most "application programmers", the people who use those tools, are not at all interested. They tend to think that all software should support their favorite encoding natively. They also tend to curse a lot when they get data in a different encoding ;--) Usually they view Unicode as yet another curse thrown upon them by an irresponsible buzzword-worshipping management.

    In fact Unicode is certainly hard and painful to implement, but it is a standard, and at least it was written by people who know what they're doing. It solves problems that most of us either have had to deal with (oh, the agony of dealing with odd characters in SGML data) or will have to deal with. Face it, people: there are more and more people whose names include funny characters, even in the US, too many to leave that market untapped.

    So please view Unicode as an opportunity, and if the poster can do it on a terabyte of data, you can certainly do it on much less, especially as the tools are coming (yes, even Perl!).

    --
    Look, that's why there's rules, understand? So that you think before you break 'em. (Terry Pratchett)
  4. Useful resource on how to migrate software by sjmurdoch · · Score: 5, Informative

    A very useful resource on Unicode is this page, written by Markus Kuhn. In particular you may be interested in "How do I have to modify my software?"; while it does concentrate on Unix, the general principles should be the same on any OS.

    --
    Steven Murdoch.
    web: http://www.cl.cam.ac.uk/users/sjm217/
  5. Been there, done that by sql*kitten · · Score: 5, Insightful

    Oracle 8i, UTF8 character set. Compatibility with both Unicode and ASCII character sets. What're the problems? Well, clients that think that Unicode is UCS-2 are one thing to watch out for, as is forgetting that there's more to life than Western European ISO.
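
    To make those two points concrete, here is a small Python sketch (not Oracle-specific; the helper name and sample data are made up). ASCII text is byte-for-byte identical in UTF-8, so ASCII-only values can still be handed to a pure-ASCII system, and a character outside the Basic Multilingual Plane shows why "Unicode is UCS-2" is a bad assumption.

```python
def is_legacy_safe(text):
    """True if the value can go to a pure-ASCII system unchanged."""
    return text.isascii()  # str.isascii() needs Python 3.7+


# ASCII text has the same bytes in UTF-8, so existing ASCII data needs no conversion.
same_bytes = "Smith".encode("utf-8") == "Smith".encode("ascii")

# Route records: ASCII-only values can still feed the legacy updates.
names = ["Smith", "Müller", "Иванов"]
legacy_ok = [n for n in names if is_legacy_safe(n)]
needs_review = [n for n in names if not is_legacy_safe(n)]

# The UCS-2 pitfall: a character outside the Basic Multilingual Plane needs
# two 16-bit code units in UTF-16, which a UCS-2-only client cannot represent.
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF
utf16_units = len(clef.encode("utf-16-le")) // 2
utf8_bytes = len(clef.encode("utf-8"))
```

    What to do with the `needs_review` values (transliterate, escape, or hold back from the legacy feed) is a policy decision the sketch leaves open.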

    Basically, 90% of the problems you will encounter are in converting between character sets to integrate with other things. If you can use Java (Unicode native) and PL/SQL for as much as possible, you'll have fewer problems. If your client is Excel (don't ask), that complicates matters. If you can assume that everything in the database is US7ASCII you're all set, because you won't need to do any data cleansing. If you have to convert stuff that's already there, then you will run into problems. What happened to me is that we had a Western European encoding, but people were entering Cyrillic data. It all came out fine on their desktops, which were configured for that character set, but the actual data in the database was gibberish as far as the queries were concerned. Non-trivial to fix.
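
    The Cyrillic mix-up can be reproduced in a few lines of Python. This is only an illustration: the two encodings are assumptions (cp1251 standing in for the desktops' Cyrillic code page, latin-1 for the database's declared Western European character set), and the function names are made up.

```python
ORIGINAL = "Москва"
DESKTOP_ENCODING = "cp1251"    # assumed: what the desktops actually sent
DECLARED_ENCODING = "latin-1"  # assumed: what the database believed it held


def as_seen_by_database(text):
    """Bytes typed as Cyrillic, interpreted under the declared encoding."""
    return text.encode(DESKTOP_ENCODING).decode(DECLARED_ENCODING)


def repair(stored):
    """Undo the mislabeling, given that the real source encoding is known."""
    return stored.encode(DECLARED_ENCODING).decode(DESKTOP_ENCODING)


stored = as_seen_by_database(ORIGINAL)  # "Ìîñêâà": valid characters, wrong meaning
repaired = repair(stored)
```

    The bytes never change, which is why everything looks right on a desktop that decodes them as Cyrillic. The repair is mechanical only once you know which rows were entered under which code page, and finding that out is the non-trivial part.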

    Good luck!