Slashdot Mirror


Migrating Large Scale Applications from ASCII to Unicode?

bobm asks: "We've been asked to migrate our newer applications to Unicode. My biggest issue is that if we start storing user data in unicode we will no longer be able to provide complete updates to the legacy (pure ASCII) systems. This is important in that we are currently updating > 25k customers a day and management does not want that to be affected. I also haven't found a clean way to provide multilanguage data mining that can return a single language output. This doesn't even begin to address issues like data validation and display issues. (Note: we currently handle the web pages in multiple language sets but require the data to be in ASCII form.) I've spent some time on Unicode.Org but I really haven't found any real world discussions on people doing this on a large scale (>1Tbyte databases)."

27 of 202 comments

  1. Convert all interaction to XML by Kingpin · · Score: 5, Informative


    You don't mention any specifics, so it's hard to give details in response. What databases? How free a hand do you have?

    I'd suggest a message-oriented, XML-based system. You can model to your heart's content in XML: languages, charsets, etc. You can design nearly anything around that, and have various backends convert the XML messages (SOAP possibly) to the kind of data that's useful for the given backend.

    --
    Unable to read configuration file '/bigassraid/htdig//conf/14229.conf'
    Geocrawler error message.
    1. Re:Convert all interaction to XML by RelliK · · Score: 3, Informative

      Why is it that everybody jumps up and down thinking that XML is some kind of magic potion? XML is useful in some cases but it solves none of the problems bobm is asking about.

      --
      ___
      If you think big enough, you'll never have to do it.
  2. Suggestion. by Domini · · Score: 3, Insightful

    Why not encode the data using XML... that way most of your data already maps to the real data.

    This would be without the XML tags, of course. Just the encoding of the data...

    Thus, you will be using UNICODE, and encoding it in XML text.

    Hmm... at some places you may need an XML to unicode translator.

    The fact that you are still storing and transferring your data in ASCII does not mean it's an ASCII system... it's only your communication medium. This way systematic migration may become more possible.

  3. Perhaps useful, how staroffice did it. by caolan · · Score: 5, Informative

    What might be useful is to read how StarOffice did their Unicode and internationalization changes to an existing large code base at sun.com
    C.

    --
    I sometimes write stuff
  4. Capability levels & preserving language taggin by Snowfox · · Score: 4, Insightful
    Clients should be capable of telling their capability level, and servers should be able to use this to determine the data format they receive. If your clients can't return a capability level, new clients should have the feature added, and the lack of the feature should be considered capability level 0. Capability level 1 would be unicode display.

    For older clients, simply send a question mark or similar for any character not in the ASCII character set; this is extremely trivial to add to your back end. New clients get unicode and all the trappings that go with it. Be sure your support people are trained to explain that updating the client provides the new multinational functionality and eliminates the question mark placeholders.
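
    A minimal sketch of the capability-level idea in Python. The function name and the choice of UTF-8 as the wire format for new clients are assumptions for illustration; the post only says "unicode".

```python
def render_for_client(text: str, capability_level: int) -> bytes:
    """Return the bytes to send, based on the client's declared capability level."""
    if capability_level >= 1:
        # Level 1 and above: the client can display Unicode (UTF-8 assumed here).
        return text.encode("utf-8")
    # Level 0, or no level reported: legacy ASCII-only client.
    # errors="replace" substitutes "?" for every character ASCII cannot represent.
    return text.encode("ascii", errors="replace")


name = "Zo\u00eb M\u00fcller"
legacy = render_for_client(name, 0)   # question marks where the accents were
modern = render_for_client(name, 1)   # full UTF-8
```

    The substitution is lossy, so it belongs at the output edge only; the stored data stays intact for clients that upgrade later.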

    Regarding your question about different languages/encodings - you may need to include the language per record all the way through to the client end. Without knowing more about your output system, it's difficult to say what the display issues are, but it's difficult to believe many display libraries would limit you to a language per session.

  5. Possible solutions and a plea by mir · · Score: 5, Insightful

    If your application returns results in XML you can always encode "safely" parts of the text using character entities (&#nn;). Another solution is to return not one but several results, in various encodings (you would have either to store the native encoding of a text or to figure out what it could be).
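
    For what it's worth, the character-entity trick can be sketched in a few lines of Python. The helper name is hypothetical; the "xmlcharrefreplace" error handler and xml.sax.saxutils.escape are standard library.

```python
from xml.sax.saxutils import escape


def to_ascii_xml_text(text: str) -> str:
    """Return XML character data that contains only ASCII bytes."""
    # Escape the markup characters (&, <, >) first, then turn every
    # non-ASCII character into a numeric character reference (&#nn;).
    return escape(text).encode("ascii", errors="xmlcharrefreplace").decode("ascii")


safe = to_ascii_xml_text("Fran\u00e7ois & \u2603")
```

    Any conforming XML parser turns the references back into the original characters, so nothing is lost on the way through an ASCII-only channel.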

    And I hope this kind of practical discussion can help to raise the level of interest in Unicode amongst application coders.

    Although a lot of "core" coders (as in people who write languages and tools) are really into Unicode and trying to get their code to process it properly, I found that most "application programmers", people who use those tools, are not at all interested. They tend to think that all software should support their favorite encoding natively. They also tend to curse a lot when they get data in a different encoding ;--) Usually they view Unicode as yet another curse thrown upon them by an irresponsible buzzword-worshipping management.

    In fact Unicode is certainly hard and painful to implement, but it is a standard and at least written by people who know what they're doing. It solves problems that most of us either have had to deal with (oh the agony of dealing with odd characters in SGML data) or will have to deal with. Face it, people: there are too many people whose names include funny characters, even in the US, to leave that market untapped.

    So please view Unicode as a chance, and if the poster can do it on a terabyte of data, you can certainly do it on much less, especially as the tools are coming (yes, even Perl!)

    --
    Look, that's why there's rules, understand? So that you think before you break 'em. (Terry Pratchett)
    1. Re:Possible solutions and a plea by infiniti99 · · Score: 4, Informative

      In fact Unicode is certainly hard and painful to implement

      Maybe for library programmers. I have been extremely impressed with the Qt library's handling of Unicode characters. The QString class is used across the board and supports full Unicode. My project, Psi can handle unicode everywhere (chat, nicknames), thanks to Qt. Heck, I didn't even know about this for the longest time. In fact, getting unicode chat over Jabber took just one extra function call:

      QString::toUtf8();

      I just use that before sending content or attributes to the Jabber XML stream. Qt's parser already converts incoming UTF-8 to Unicode. This was so amazingly easy to use from an "application coder"'s standpoint it's not even funny.

      Of course, I can't speak any language other than English, so I personally won't be taking advantage of this. I know other people will though, and thankfully it was easy enough to put in.

      -Justin

    2. Re:Possible solutions and a plea by bertilow · · Score: 3, Insightful
      Of course, I can't speak any language other than English, so I personally won't be taking advantage of this. I know other people will though, and thankfully it was easy enough to put in.

      So you think Unicode is just for non-English text? Well, neither ASCII nor Latin 1 is really sufficient for English. There are plenty of characters above 255 in Unicode that are needed or useful for writing English. And then we have foreign names that tend to pop up in English texts with all sorts of funny characters that you need to write even if you only speak English.

  6. Compression Scheme for Unicode by GeLeTo · · Score: 4, Insightful

    Check this standard for Unicode compression.
    It compresses 16-bit Unicode chars to 8 bits, using some reserved tags to switch the character windows. A sample Java implementation is available. The best thing is that most of the standard ASCII chars will still be encoded as 8-bit ASCII after the compression. So you can still store all your data in 8-bit ASCII and convert it to Unicode before displaying it. And you don't have to modify your old data!!!

  7. Useful resource on how to migrate software by sjmurdoch · · Score: 5, Informative

    A very useful resource on Unicode is this page, written by Markus Kuhn. In particular you may be interested in How do I have to modify my software?; while it does concentrate on Unix, the general principles should be the same on any OS.

    --
    Steven Murdoch.
    web: http://www.cl.cam.ac.uk/users/sjm217/
  8. UTF-8 by bertilow · · Score: 3, Informative

    What's the problem? If you use the UTF-8 encoding
    for Unicode, all your data will be ASCII compatible.

    1. Re:UTF-8 by pubjames · · Score: 4, Informative

      I'm finding it depressing seeing how things get modded here. This has been modded as funny??

      The guy is absolutely right - using UTF-8 solves lots of problems when having to use legacy software with Unicode. I did one project working with twelve languages, including Arabic, Japanese, Hindi and Welsh, and we just used SED to search and replace marker tags in hundreds of UTF-8 files. Worked a treat.

  9. Been there, done that by sql*kitten · · Score: 5, Insightful

    Oracle 8i, UTF8 character set. Compatibility with both Unicode and ASCII character sets. What're the problems? Well, clients that think Unicode is UCS2 are one to watch out for, as is forgetting that there's more to life than Western European ISO.

    Basically, 90% of the problems you will encounter are in converting between character sets to integrate with other things. If you can use Java (Unicode native) and PL/SQL for as much as possible, you'll have fewer problems. If your client is Excel (don't ask) that complicates matters. If you can assume that everything in the database is US7ASCII you're all set, because you won't need to do any data cleansing. If you have to convert stuff that's already there, then you will run into problems. What happened to me is that we had a Western European encoding, but people were entering Cyrillic data. It all came out fine on their desktops, which were configured for that character set, but the actual data in the database was gibberish as far as the queries were concerned. Non-trivial to fix.
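
    To illustrate the Cyrillic-in-a-Western-European-column problem, here is a rough Python sketch. The two code pages (cp1251 on the desktops, ISO-8859-1 in the database) are assumptions picked for the example; the actual ones in that project are not stated.

```python
# "Privet" in Cyrillic, as typed on a desktop configured for a Cyrillic code page.
original = "\u041f\u0440\u0438\u0432\u0435\u0442"

# The desktop sends its own code page bytes (cp1251 assumed).
stored_bytes = original.encode("cp1251")

# The database believes the column holds Western European text (ISO-8859-1 assumed),
# so queries, sorting and conversions see accented Latin letters instead.
as_seen_by_database = stored_bytes.decode("latin-1")

# The fix is to reverse the wrong interpretation, which only works if you know
# (or can guess) which code page each row was really entered in.
repaired = as_seen_by_database.encode("latin-1").decode("cp1251")
```

    The bytes never changed, which is why it looked fine on the desktops; only the label on them was wrong. Finding the right label per row is the non-trivial part.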

    Good luck!

    1. Re:Been there, done that by sql*kitten · · Score: 4, Interesting

      We were using Excel as the data entry client, then using Perl (with the Excel module, very good BTW) and/or VBA to extract the data and send it to Oracle, and ODBC to query from Oracle into Excel. This wasn't a decision we made; it was the client's (i.e. the customer's, not the software's) legacy way of doing things, and they weren't up for paying us to rebuild it and retrain all their staff.

      You can use Perl to extract the data from Oracle and write SQL INSERT or SQL*Loader scripts, but this is a real pain. Windows is pretty good for Unicode, actually, even Notepad is a Unicode text editor, but the actual encoding is (off the top of my head) fixed width (16 bit) UCS2. The locale of the Oracle client was UTF8 (variable width), and it was verifying that the translating worked that sucked up a lot of resource (we naively first assumed that it would just work). UTF8 is great because if you're only using a subset of it, it doesn't waste storage space. The Oracle server was Windows 2000, the client terminals were a variety of different versions of Windows, running Excel for some bits of the app, MSIE4 for others. On the web side, there was some rather crap ASP/COM based middleware, in the end we dumped it and redid it in Java just for the Unicode-nativeness of it.

      Around that time (this was just over 6 months ago) I woulda killed for a Java API to Excel with access to all the objects exposed to VBA, which would have made things a breeze; maybe that exists now.

  10. Use UTF-8 by Argon · · Score: 3

    Consider using UTF-8 for export instead of direct Unicode. As long as the legacy systems are 8-bit clean, you can feed UTF-8 back to them without too many problems. There will be no issues at all for ASCII data since 7-bit ASCII is the same in UTF-8. You just need to convert front end applications to be UTF-8 aware. You need not convert legacy backends to understand Unicode; they will just store UTF-8 as some weird 8-bit characters. The beauty is you'll be able to convert them in phases and ASCII never stops working.
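
    A small Python sketch of what "8-bit clean" buys you, and what happens when a channel is not. The strip_high_bit helper is a hypothetical stand-in for a legacy component that masks off the top bit.

```python
ascii_record = "ORDER 12345: 3 widgets"
# Pure ASCII text is byte-for-byte identical in UTF-8.
same_bytes = ascii_record.encode("utf-8") == ascii_record.encode("ascii")

encoded = "caf\u00e9".encode("utf-8")        # b"caf\xc3\xa9"
# Every byte belonging to a non-ASCII character has its high bit set,
# so it can never be mistaken for an ASCII character.
high_bytes = [b for b in encoded if b >= 0x80]


def strip_high_bit(data: bytes) -> bytes:
    """Simulate a legacy channel that is NOT 8-bit clean (hypothetical)."""
    return bytes(b & 0x7F for b in data)


# An 8-bit clean store hands the bytes back untouched and the text survives.
survived = bytes(encoded).decode("utf-8")
# A 7-bit channel silently turns the two-byte sequence into unrelated ASCII.
mangled = strip_high_bit(encoded)
```

    So the phased approach works only for the legacy pieces that really do pass all eight bits through; each one has to be checked.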

  11. Re:I don't get it... by Twylite · · Score: 4, Informative

    What's with people assuming that UTF-8 is ASCII? It's not. UTF-8 is a multibyte representation that just happens to coincide with ASCII for characters 0 through 127. After that it takes two bytes to encode a character, possibly more when you get to "big" characters.

    UTF-8 is an encoding for unicode characters.

    --
    i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
  12. mySQL & PHP by mnordstr · · Score: 3, Informative

    In the development todo for mySQL 4, they have a list of "Things that must be done in the real near future". Quite far down on that list I found:

    "* Add support for UNICODE."

    That's great, because mySQL 4 is about to be released any day now.
    As a PHP developer I wanted to know if php supports unicode. This is what I found:

    Strings:
    "A string is series of characters. In PHP, a character is the same as a byte, that is, there are exactly 256 different characters possible. This also implies that PHP has no native support of Unicode."

  13. Use approximate character set conversion by Twylite · · Score: 4, Insightful

    The way I understand this, you have old clients, new clients, and a server that must handle both. And the server and new clients should support Unicode.

    First, although this is probably obvious, I should note that if your data is primarily text, then you're looking at a 2Tb database when you start using Unicode (depending on the encoding).

    My biggest issue is that if we start storing user data in unicode we will no longer be able to provide complete updates to the legacy (pure ASCII) systems

    This is sort of like supporting German language entry, and wanting to display it on English clients. It's not easy, but it can be done, to some extent. Most Unicode you encounter will have an equivalent ASCII representation; there are acceptable conversions for almost all non-Eastern character sets. You can serve up a converted representation to your ASCII clients.
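
    One way to sketch such an approximate conversion in Python, assuming that dropping accents is an acceptable approximation for your data (it often is for names, but it is a judgment call, and the function name is made up):

```python
import unicodedata


def approximate_ascii(text: str) -> str:
    """Best-effort ASCII rendering: strip accents, use '?' for the rest."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping the base letters.
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII (e.g. CJK, or letters with no decomposition)
    # becomes a question mark.
    return without_marks.encode("ascii", errors="replace").decode("ascii")


western = approximate_ascii("M\u00fcller, \u00c5ngstr\u00f6m")
eastern = approximate_ascii("\u65e5\u672c")
```

    Letters without a decomposition, such as the German sharp s, still come out as "?" here; a real system would want a hand-made mapping table for those.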

    DO NOT listen to the bullshit about serving up UTF-8 to ASCII clients. They can't understand it any more than I can understand German ; it will seem to work only for low-ASCII characters, but break for all others.

    As for data validation, you are going to have to have two rulesets. One will be client-side ASCII; the other a unicode ruleset used by both the new client and the server. Incoming ASCII from the old client should be converted to equivalent Unicode (that's the easy part) before being validated.

    Sorry, no realworld information here either ; certainly not on database that size.

    --
    i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
  14. Migration of data to unicode sets by Alan+Cox · · Score: 3, Informative

    Make sure you use UTF8. Firstly because unlike UCS2 (16bit) it can encode all the characters, not a subset of them. Eventually 16bit won't be enough for you. Secondly it's 7bit ASCII equivalent so there is no real problem with migration over time.
    Thirdly since ASCII 7bit is UTF8 ASCII space there isn't any data migration to be done to set this up.

  15. What do you mean by "ASCII"? by Florian+Weimer · · Score: 4, Informative

    You first have to examine carefully the character set your current application can deal with. Is it ASCII? Or just the printable range? Or do most routines treat everything as sequences of 8-bit characters? Is the null character permitted in data? And so on.

    After that, you have to identify the operations which are character set specific. This can be quite a bit of work. Character set specific operations include case conversion, collating, normalizing, measuring string length and character width (for formatting plain text), text rendering in general, and so on.

    Now you look at your tools. Do they prefer some kind of Unicode encoding? For example, with Java or Windows, using UTF-16 is most convenient (some would say: mandated).

    Now you put the pieces together and look for a suitable internal representation (not necessarily "Unicode", i.e. UTF-8, UTF-16, or UTF-32), identify points at which data has to be converted (usually, it is a good idea to minimize this, but if you want to fit everything together, there is sometimes no other choice), and modules and external tools which have to be replaced because adjusting them or adapting to them is too much work.

    Your web page generation tools probably need a complete overhaul, so that they are able to minimize the charset being used (for example, German text is sent as ISO-8859-1, but Russian text as KOI8-R or something like that), since client-side Unicode support is mostly ready, but many people don't have the necessary fonts.
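
    A rough Python sketch of "minimizing the charset": try the narrowest encodings first and fall back to UTF-8. The preference list and the function name are assumptions for illustration, not a recommendation for any particular site.

```python
# Narrowest first; UTF-8 last because it can encode any ordinary text.
PREFERRED_CHARSETS = ["ascii", "iso-8859-1", "koi8-r", "utf-8"]


def pick_charset(text: str) -> tuple[str, bytes]:
    """Return the first charset that can represent the text, plus the bytes."""
    for charset in PREFERRED_CHARSETS:
        try:
            return charset, text.encode(charset)
        except UnicodeEncodeError:
            continue  # this charset lacks a character; try the next one
    raise ValueError("text cannot be encoded in any preferred charset")


german_charset, _ = pick_charset("Gr\u00fc\u00dfe")
russian_charset, _ = pick_charset("\u041f\u0440\u0438\u0432\u0435\u0442")
mixed_charset, _ = pick_charset("Gr\u00fc\u00dfe \u041f\u0440\u0438\u0432\u0435\u0442")
```

    The chosen name is what you would put in the Content-Type charset parameter of the generated page.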

  16. Space tradeoffs by d5w · · Score: 4, Informative

    But if your database is currently dominated by ASCII or even typical Latin-1 text, that's a reasonable tradeoff; no increase for ASCII text, a slight increase for Latin-1 text (100% on a minority of the characters in actual text; anyone have actual stats handy?), 50% increase for the rest of the 16-bit range, and the same maximum character size (U+10000 - U+10FFFF take 4 bytes in both UTF-8 and UTF-16). And then you have the other advantages already mentioned: compatibility with 7-bit ASCII, NUL-terminated C strings, and ordinary 8-bit clean text channels. If you're currently in the ASCII or Latin-1 domain the question isn't even what you expect to store in the future, so much as how much cheaper disk space will be when you finally need to store it.
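
    The arithmetic is easy to check with a few lines of Python (a sketch; "utf-16-le" is used so the byte-order mark does not inflate the counts):

```python
def sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 bytes, UTF-16 bytes) for the given text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


ascii_sizes = sizes("hello")           # ASCII: 1 byte each in UTF-8, 2 in UTF-16
latin1_sizes = sizes("\u00e9")         # accented Latin-1 letter: 2 bytes in both
cjk_sizes = sizes("\u65e5")            # rest of the 16-bit range: 3 vs 2 (the 50% increase)
beyond_sizes = sizes("\U00010000")     # above the 16-bit range: 4 bytes in both
```

    Running the same function over a sample of your real data gives the actual growth factor, which is the number management will ask for.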

  17. That's a tough one -- some ideas though: by rjstanford · · Score: 3, Interesting
    The hardest problem to solve is the business one. Storing the data is easy -- scaling from 1TB to 2TB (or more) is a solved problem. The hard part is deciding what to do when an ASCII client requests information that you only have in Unicode.

    Does your application support multiple languages now? If it does, it probably has a default language for everything that should be present in case the specific language asked for is missing. Rather than have that be "en_us" (or whatever), make that "US English ASCII-friendly". You can then add a new language "US English Unicode". Then alter your mandate so that everything has at least that language. I'd add Unicode and ASCII flavors for all other languages too, although anything that doesn't use high chars can just be stored as ASCII with the Unicode encoding generated (if space is that much of an issue).

    If your application database is not multi-lingual already, then you have some serious architecture work to do. I'd look at it from that standpoint though -- there is a wealth of reference material describing how to add language support to existing data and apps. Think of Unicode as another language.

    Concentrate on these issues, and let the technical issues (such as encoding scheme) be decided after you know what you want to do. As far as that specific one goes (seems to have the most interest on this page so far), just use whatever your DBMS supports most natively.

    -Richard

    --
    You're special forces then? That's great! I just love your olympics!
  18. just in case... avoid #define UNICODE by mughi · · Score: 4, Informative

    Just in case any of this work is being done on Microsoft Windows, you should avoid "#define UNICODE", TCHAR, and _T(). These are mainly legacy tricks used to help Windows 3.1 developers cross-compile their code for NT. Microsoft itself doesn't use them, and instead goes with pure Unicode through the app. Even COM in Win32 since the first release of Windows 95 is all Unicode (BSTRs).

    Of course, this would preclude you from using MFC, but then again, many think that avoiding it is a good thing (again, Microsoft is among those who avoid using it). But aside from other benefits, you'd end up with not needing to build two separate binaries: one for Windows NT/2K and one for Win9X.

    Oh, and one other thing. If you are doing any portable code, remember that the Microsoft documentation lies and that wchar_t is not always 16-bit like they say. In fact, the spec recommends that it be 32-bit, and most other platforms (Linux included) define it thus.
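
    A quick way to see the wchar_t difference on a given machine, sketched in Python via ctypes (which exposes the platform C compiler's wchar_t as c_wchar):

```python
import ctypes

# Size in bytes of the platform's C wchar_t: typically 2 on Windows
# and 4 on Linux and most other Unix-like systems.
wchar_bytes = ctypes.sizeof(ctypes.c_wchar)
wchar_bits = wchar_bytes * 8
print(f"wchar_t is {wchar_bits} bits on this platform")
```

    Code that writes wchar_t buffers straight to disk or to the network therefore produces different bytes on different platforms; converting to an explicit encoding at the boundary avoids that.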

  19. Advantages and Disadvantages of UTF-8 by Anonymous Coward · · Score: 3, Insightful
    There seem to be a lot of posts advocating the use of UTF-8 without explaining what the advantages and disadvantages are. Also, some of the posts are simply incorrect.

    Here are some of the advantages and disadvantages of UTF-8:

    • UTF-8 allows you to encode any character in the entire ISO-10646 character set (which is potentially much larger than Unicode since it is a 31-bit code, rather than Unicode which is only a little over 20 bits, or 17 * 65,536 code points). This is probably not of great interest since it is not expected that the ISO character set will ever need to define any characters outside the Unicode range.
    • Strings encoded in UTF-8 can be processed by standard C language routines. A binary 0 embedded in the string can be used as a string terminator just as in 1-byte character sets. Note that routines like strlen() will return the number of bytes rather than the number of characters in a string.
    • UTF-8 preserves the Unicode sorting order so that string comparisons work the way you'd expect without having to convert to Unicode to do the comparison (but note that the Unicode sorting order is not likely to be a useful "language sensitive" sorting order if that matters for your application, so you may still need some way to perform that kind of sort).
    • If you have an arbitrary byte in a string, it is possible to determine unambiguously whether it is the starting byte for a character, and if not you can probe backwards for the starting byte. This is not true of all multibyte character set encodings. This can be very useful for some applications and not at all for others of course.
    • Characters within the ASCII range (00-7f) are transmitted unchanged.
    • Most alphabetic characters (including Hebrew and Arabic characters) are transmitted with only 2 bytes - the same as if you'd stored them as UCS-2 or UTF-16, but not as compact as if you'd stored them with their corresponding ISO 8859-x character set.
    • Ideographic characters and the remaining rare alphabetics within Unicode Plane 0 are transmitted with 3 bytes, which is 50% larger than if they'd been stored with UCS-2 or UTF-16 or (often) with their native computer character set like Shift-JIS.
    • All other Unicode characters (mostly historical Chinese and Japanese characters and character sets for dead languages) can be transmitted in at most 4 bytes.
    • Depending on your display systems, you may need transformation routines to convert to and from other formats used by those systems. For example, many printers or computer fonts that support large character sets might be arranged for use as Shift-JIS or Big5 rather than for Unicode.
    • Because it preserves a certain degree of compatibility with 1-byte character streams, many existing programs and subsystems can coexist with UTF-8 with little or no modification. That does not mean you can count on UTF-8 being safe anywhere that ASCII is safe; you need to evaluate each system on its own merits. However it is quite likely to make your conversion easier.
    Even if you don't use UTF-8 for the external storage format, many projects have found that its advantages make it ideal for processing data in memory. Other times using a fixed-width (16- or 32-bit) format is desirable; fortunately the conversion between UTF-8 and the fixed-width Unicode formats is quite easy and quick.
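
    Three of the properties listed above (byte count vs. character count, preserved sort order, and finding a character's starting byte) can be demonstrated with a short Python sketch. The helper name char_start is made up for the example.

```python
text = "na\u00efve \u65e5\u672c"
data = text.encode("utf-8")

# A byte-oriented routine such as C's strlen() counts bytes, not characters.
char_count = len(text)
byte_count = len(data)

# Sorting the encoded bytes gives the same order as sorting by code point.
words = ["zebra", "\u00e9clair", "apple", "\u65e5\u672c", "\U00010400"]
by_code_point = sorted(words)
by_utf8_bytes = [b.decode("utf-8") for b in sorted(w.encode("utf-8") for w in words)]


def char_start(buf: bytes, index: int) -> int:
    """Probe backwards from an arbitrary byte to the start of its character."""
    # Continuation bytes always have the bit pattern 10xxxxxx.
    while index > 0 and (buf[index] & 0xC0) == 0x80:
        index -= 1
    return index


# Byte 9 is the last byte of the first ideograph, which starts at byte 7.
start = char_start(data, 9)
```

    As the list says, this byte order is code point order, not a language-sensitive collation, so it is useful for indexes and binary search but not for user-facing sorting.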
  20. Encodings by osolemirnix · · Score: 3, Informative
    There is an additional problem with unicode in that you can convert from/to any encoding to unicode, but the encodings are not necessarily compatible.

    E.g. we had that with two different japanese kanji encodings (on Sun workstations and Windowze boxes). Both encodings converted to Unicode and back, but they both had characters not present in the other encoding. So if you created, say, a filename on one system, converted the string to unicode and back to the other encoding on the other system, then all you got was a lot of gibberish.
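
    The same effect can be shown with a pair of simpler encodings. This Python sketch uses ISO-8859-1 and KOI8-R as stand-ins for the two kanji encodings (an assumption made to keep the example short and certain); each has characters the other lacks, even though both convert to Unicode without trouble.

```python
def can_transcode(data: bytes, source: str, target: str) -> bool:
    """True if bytes in the source encoding survive a trip to the target encoding."""
    text = data.decode(source)      # legacy encoding to Unicode
    try:
        text.encode(target)         # Unicode to the other legacy encoding
        return True
    except UnicodeEncodeError:
        return False                # the target simply has no such character


latin_name = "caf\u00e9".encode("iso-8859-1")
cyrillic_name = "\u0416\u0443\u043a".encode("koi8-r")

latin_to_koi8 = can_transcode(latin_name, "iso-8859-1", "koi8-r")
koi8_to_latin = can_transcode(cyrillic_name, "koi8-r", "iso-8859-1")
ascii_both_ways = can_transcode(b"report.txt", "koi8-r", "iso-8859-1")
```

    Only the characters common to both encodings (here, ASCII) make the round trip, which is the point about needing one shared encoding end to end.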

    So storing your data in unicode alone doesn't solve all your problems. All the clients that access that data need to support the same encodings used. (e.g. your american windowze box cannot handle unicode with kanji stuff unless you have the right language pack installed)

    Essentially it boils down to: all your clients and servers must use the same encoding, whether you use unicode or something else.

    --

    Idempotent operation: Like MS software, wether you run it once or often, that doesn't make it any better.
  21. Re:Unicode not adequate for internationalization by Anonymous Coward · · Score: 3, Informative
    The problem is that there aren't currently any reasonable alternatives for handling the problems that you mention. All of the various national character sets and vendor character sets are subsets of Unicode, so if you want to write something today you have little practical alternative.

    There are two basic problems with Unicode: Han unification and ideographic character variations. Essentially all of the various Asian national character sets imply some form of Han unification, and their internal structures are quite different. In either event you are left with having to indicate the original language in order to display the "best possible" glyph, with the added burden that if you use the national character sets you'd have to have multiple interpretation and display systems to handle the very different character set encoding structures.

    The other issue is that of character variations and nuances. Unfortunately there aren't any character coding standards (as opposed to ideas that have been kicked around) that address this at all; if you include the Plane 2 characters in Unicode then it comes closer to handling this than any one national standard.

    I agree that Unicode isn't ideal, but there's nothing on the immediate horizon that looks much better, especially if you need to to be able to display text in any language. But if you can restrict yourself to a single language family (European, Hebrew, Arabic, Japanese, Chinese, etc) then there are already alternatives out there. Unicode is designed for applications where you don't have that luxury.

    If you have the need to handle multiple languages simultaneously, you're still probably better off converting to Unicode first and then converting to whatever "ultimate" encoding system emerges in 20 years or so.

  22. Don't by Alex+Belits · · Score: 3, Informative

    Unicode does not solve any problems with multilingual text processing -- what it solves is not a problem (having a non-iso8859-1 native language, I am qualified to testify that displaying and representing data in various languages wasn't a problem for at least 30 years already), and real problems -- rules, matching, hyphenation, spell checking, etc. remain problems with Unicode just like they are without it.

    To make it possible to process, transfer and store the data in multiple languages one does not need Unicode -- in fact Unicode usually only adds additional step that requires some knowledge of language context that may be unknown, unavailable for some kind of processing, or simply not disclosed by end-users. What is necessary is byte-value transparency, so text in multiple languages at least will not be distorted by "too smart" procedures that cut the upper bits or make some other ASCII-centric assumptions. If/when users will care about marking languages in a way more advanced than iso 2022, they probably will find byte-value transparent channels to be suitable for whatever they will use.

    However if/when real usable languages-handling infrastructure that will solve those problems will be created, it won't need unicode because it will have language metadata attached to the text already, and without metadata, text, in unicode or in native charsets, is not usable for most of applications if it's not somehow already known what language it is supposed to be in.

    --
    Contrary to the popular belief, there indeed is no God.