Slashdot Mirror


Migrating Large Scale Applications from ASCII to Unicode?

bobm asks: "We've been asked to migrate our newer applications to Unicode. My biggest issue is that if we start storing user data in Unicode we will no longer be able to provide complete updates to the legacy (pure ASCII) systems. This is important in that we are currently updating > 25k customers a day and management does not want that to be affected. I also haven't found a clean way to provide multilanguage data mining that can return a single language output. This doesn't even begin to address issues like data validation and display issues. (Note: we currently handle the web pages in multiple language sets but require the data to be in ASCII form.) I've spent some time on Unicode.Org but I really haven't found any real world discussions on people doing this on a large scale (>1Tbyte databases)."

14 of 202 comments

  1. Suggestion. by Domini · · Score: 3, Insightful

    Why not encode the data using XML... that way most of your data already maps to the real data.

    This would be without the XML tags, of course. Just the encoding of the data...

    Thus, you will be using UNICODE, and encoding it in XML text.

    Hmm... at some places you may need an XML to unicode translator.

    The fact that you are still storing and transferring your data in ASCII does not mean it's an ASCII system... it's only your communication medium. This way systematic migration may become more possible.

  2. Capability levels & preserving language taggin by Snowfox · · Score: 4, Insightful
    Clients should be capable of telling their capability level, and servers should be able to use this to determine the data format they receive. If your clients can't return a capability level, new clients should have the feature added, and the lack of the feature should be considered capability level 0. Capability level 1 would be unicode display.

    For older clients, simply send a question mark or similar for any character not in the ASCII character set; this is extremely trivial to add to your back end. New clients get unicode and all the trappings that go with it. Be sure your support people are trained to explain that updating the client provides the new multinational functionality and eliminates the question mark placeholders.
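
    A minimal sketch of the fallback described above, assuming Python on the back end. The function name `render_for_client`, the integer `capability_level` and the choice of UTF-8 for newer clients are illustrative assumptions, not part of the original poster's system.

```python
def render_for_client(text: str, capability_level: int) -> bytes:
    """Return the bytes to send to one client.

    capability_level is a hypothetical integer: 0 means a legacy
    ASCII-only client, 1 or higher means the client can display Unicode.
    """
    if capability_level >= 1:
        # Newer clients get the full text (UTF-8 is an assumption here).
        return text.encode("utf-8")
    # Legacy clients: each character outside ASCII becomes a single '?'.
    return text.encode("ascii", errors="replace")


print(render_for_client("Zo\u00eb M\u00fcller", 0))  # b'Zo? M?ller'
```

    Pure ASCII records pass through unchanged, so existing customers only see question marks in records that actually contain new characters.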

    Regarding your question about different languages/encodings - you may need to include the language per record all the way through to the client end. Without knowing more about your output system, it's difficult to say what the display issues are, but it's difficult to believe many display libraries would limit you to a language per session.

  3. Possible solutions and a plea by mir · · Score: 5, Insightful

    If your application returns results in XML you can always encode "safely" parts of the text using character entities (&#nn;). Another solution is to return not one but several results, in various encodings (you would either have to store the native encoding of a text or figure out what it could be).
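
    The character-entity approach can be sketched as follows, assuming Python; `to_ascii_xml_text` is a hypothetical helper name. It relies on the standard `xmlcharrefreplace` error handler, which writes each non-ASCII character as a decimal numeric reference.

```python
from xml.sax.saxutils import escape


def to_ascii_xml_text(text: str) -> str:
    """Return pure-ASCII text that is safe to place inside an XML element."""
    # Escape &, < and > first so the references added below stay unambiguous.
    escaped = escape(text)
    # Anything outside ASCII becomes a numeric reference such as &#233;.
    return escaped.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


print(to_ascii_xml_text("Dvo\u0159\u00e1k & S\u00f8n"))
# Dvo&#345;&#225;k &amp; S&#248;n
```

    Any conforming XML parser turns the references back into the original characters, so the transport stays ASCII while the content is Unicode.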

    And I hope this kind of practical discussion can help to raise the level of interest in Unicode amongst application coders.

    Although a lot of "core" coders (as in people who write languages and tools) are really into Unicode and trying to get their code to process it properly, I found that most "application programmers", people who use those tools, are not at all interested. They tend to think that all software should support their favorite encoding natively. They also tend to curse a lot when they get data in a different encoding ;--) Usually they view Unicode as yet another curse thrown upon them by an irresponsible buzzword-worshipping management.

    In fact Unicode is certainly hard and painful to implement, but it is a standard and at least written by people who know what they're doing. It solves problems that most of us either have had to deal with (oh the agony of dealing with odd characters in SGML data) or will have to deal with. Face it, people: there are more and more people whose names include funny characters, even in the US, and that market is too big to leave untapped.

    So please view Unicode as a chance, and if the poster can do it on a terabyte of data, you can certainly do it on much less, especially as the tools are coming (yes, even Perl!)

    --
    Look, that's why there's rules, understand? So that you think before you break 'em. (Terry Pratchett)
    1. Re:Possible solutions and a plea by bertilow · · Score: 3, Insightful
      Of course, I can't speak any language other than English, so I personally won't be taking advantage of this. I know other people will though, and thankfully it was easy enough to put in.

      So you think Unicode is just for non-English text? Well, neither ASCII nor Latin 1 is really sufficient for English. There are plenty of characters above 255 in Unicode that are needed or useful for writing English. And then we have foreign names that tend to pop up in English texts with all sorts of funny characters that you need to write even if you only speak English.

    2. Re:Possible solutions and a plea by Malc · · Score: 2, Insightful

      I'm an application programmer, and I can't say that I've found Unicode particularly hard. It's a blessed relief, especially after working with multi-byte character sets. Note, I'm not talking about UTF-8 or UTF-7, which are multi-byte representations, and are a pain in the arse. In C++, Unicode characters have a dedicated type (wchar_t), and you can index directly into strings, which you can't do with a multibyte char string (see: isleadbyte). The other big advantage of Unicode is being able to share stuff with systems in different localities... there are no "code pages" to worry about. On top of this, some OSes (Windows NT) have been Unicode-only for some time, so switching applications to Unicode is a more natural way of working.

  4. Re:just ignore it by Anonymous Coward · · Score: 2, Insightful

    Latin-1 accommodates Western Europe and the Americas. It doesn't work for Eastern Europe or Asia. With Latin-1, you're cutting out potential profits from Greece, Russia, Arab countries, China, and Japan. For an international company, Unicode IS about making money.

  5. Compression Scheme for Unicode by GeLeTo · · Score: 4, Insightful

    Check this standard for Unicode compression.
    It compresses 16-bit Unicode chars to 8-bit using some reserved tags to switch the character windows. A sample Java implementation is available. The best thing is that most of the standard ASCII chars will still be encoded as 8-bit ASCII after the compression. So you can still store all your data in 8-bit ASCII and convert it to Unicode before displaying it. And you don't have to modify your old data!!!

  6. Been there, done that by sql*kitten · · Score: 5, Insightful

    Oracle 8i, UTF8 character set. Compatibility with both Unicode and ASCII character sets. What're the problems? Well, clients that think that Unicode is UCS2 are one thing to watch out for; forgetting that there's more to life than Western European ISO is another.

    Basically, 90% of the problems you will encounter are in converting between character sets to integrate with other things. If you can use Java (Unicode native) and PL/SQL for as much as possible, you'll have fewer problems. If your client is Excel (don't ask) that complicates matters. If you can assume that everything in the database is US7ASCII you're all set, because you won't need to do any data cleansing. If you have to convert stuff that's already there, then you will run into problems. What happened to me is that we had a Western European encoding, but people were entering Cyrillic data. It all came out fine on their desktops, which were configured for that character set, but the actual data in the database was gibberish as far as the queries were concerned. Non-trivial to fix.
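
    The Cyrillic-in-a-Western-European-column failure can be reproduced in a few lines. This is only a sketch, assuming Python, and assuming the desktops used Windows-1251 and the database column was labelled ISO 8859-1; the post does not name either encoding, and the function names are hypothetical.

```python
def store_mislabelled(text: str) -> str:
    """Simulate what queries see when cp1251 bytes are read as Latin-1."""
    raw_bytes = text.encode("cp1251")   # what the Cyrillic desktops sent
    return raw_bytes.decode("latin-1")  # what the database believes it holds


def repair(garbled: str) -> str:
    """Undo the mislabelling: recover the raw bytes, then decode correctly."""
    return garbled.encode("latin-1").decode("cp1251")


original = "\u041f\u0440\u0438\u0432\u0435\u0442"  # Russian for "hello"
garbled = store_mislabelled(original)
print(garbled)                      # accented Latin letters, not Cyrillic
print(repair(garbled) == original)  # True
```

    The repair only works when you know which encoding each row really used, which is a large part of why cleansing mixed data is non-trivial.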

    Good luck!

  7. Re:everyone should learn English by Anonymous Coward · · Score: 1, Insightful

    Because chinese

    (a) has only a standardised written form, not spoken form

    (b) that written form is especially annoying to represent digitally.

    (c) it is a tonal language, and therefore not very easy to learn unless you have been raised from birth speaking it, since your brain won't have developed the requisite pitch analysis. There are many more non-tonal than tonal language speakers in the world, so standardising on a tonal language would place ALL of them at a disadvantage. It's easy for a tonal language speaker to go the other way though.

    spanish:

    (a) everyone would be spitting all over each other. That's just the way the language is.

    (b) It has bizarre gender constructions. Gendered nouns, again, are easy to learn from birth, but going from a non-gendered to a gendered language is difficult, since the brain's from-birth language database hasn't allocated a row for "gender".

    (c) It has annoying verb tense constructions. In english, one can easily construct new tenses to deal with problems encountered when talking about time travel/relativity in physics. "He would have been going to do that last week". That's a pain in the ass in spanish. Hence, native spanish speakers have a much shakier grasp of the concept of time.

    We should really standardise on a conlang like lojban. Then everyone would be at a roughly equal disadvantage, the language would be totally sanely constructed, amenable to computer parsing, and representable as ASCII.

  8. Re:UTF-8 by Anonymous Coward · · Score: 1, Insightful

    I'm finding it depressing seeing how things get modded here. This has been modded as funny??

    Just remember, this is Slashdot, not some fancy-pants two-year community college.

  9. Use approximate character set conversion by Twylite · · Score: 4, Insightful

    The way I understand this, you have old clients, new clients, and a server that must handle both. And the server and new clients should support Unicode.

    First, although this is probably obvious, I should note that if your data is primarily text, then you're looking at a 2Tb database when you start using Unicode (depending on the encoding).

    My biggest issue is that if we start storing user data in Unicode we will no longer be able to provide complete updates to the legacy (pure ASCII) systems

    This is sort of like supporting German language entry, and wanting to display it on English clients. It's not easy, but it can be done, to some extent. Most Unicode you encounter will have an equivalent ASCII representation; there are acceptable conversions for almost all non-Eastern character sets. You can serve up a converted representation to your ASCII clients.
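
    One way to approximate such a conversion, sketched in Python with the standard `unicodedata` module. The helper name `approximate_ascii` and the '?' placeholder are assumptions; this strips accents via compatibility decomposition and is not a full transliteration table.

```python
import unicodedata


def approximate_ascii(text: str, placeholder: str = "?") -> str:
    """Best-effort ASCII rendering of Unicode text for legacy clients."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
            continue
        # NFKD splits an accented letter into base letter + combining marks.
        decomposed = unicodedata.normalize("NFKD", ch)
        kept = "".join(c for c in decomposed if ord(c) < 128)
        # Characters with no ASCII base (CJK, for example) get the placeholder.
        out.append(kept if kept else placeholder)
    return "".join(out)


print(approximate_ascii("M\u00fcller"))        # Muller
print(approximate_ascii("Dvo\u0159\u00e1k"))   # Dvorak
print(approximate_ascii("\u65e5\u672c"))       # ??
```

    Language-specific conventions (German "ue" for "ü", for instance) are not covered by decomposition and would need a per-language mapping on top of this.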

    DO NOT listen to the bullshit about serving up UTF-8 to ASCII clients. They can't understand it any more than I can understand German; it will seem to work only for low-ASCII characters, but break for all others.

    As for data validation, you are going to have to have two rulesets. One will be client-side ASCII; the other a unicode ruleset used by both the new client and the server. Incoming ASCII from the old client should be converted to equivalent Unicode (that's the easy part) before being validated.

    Sorry, no real-world information here either; certainly not on a database that size.

    --
    i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
  10. Use UTF-8 encoding by Anonymous Coward · · Score: 1, Insightful

    If you use the UTF-8 encoding, of which ASCII is a subset, then you minimize the amount of code and text that has to change -- only the text that isn't expressible in ASCII changes, using multiple bytes per character, and ASCII string manipulations still "just work".
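
    A small Python check of both claims; the function names are hypothetical. Pure ASCII text produces identical bytes under either encoding, and a byte-level split on an ASCII delimiter cannot cut a multi-byte character in half because every byte of such a character is 0x80 or higher.

```python
def utf8_equals_ascii(text: str) -> bool:
    """True when the UTF-8 bytes are exactly the ASCII bytes."""
    try:
        return text.encode("utf-8") == text.encode("ascii")
    except UnicodeEncodeError:
        return False  # the text has characters ASCII cannot express


def split_fields(record: bytes) -> list:
    # Byte-oriented split, as legacy ASCII code would do it.
    return record.split(b",")


print(utf8_equals_ascii("invoice 42"))        # True
print(len("caf\u00e9".encode("utf-8")))       # 5 bytes for 4 characters

record = "na\u00efve,caf\u00e9".encode("utf-8")
print([f.decode("utf-8") for f in split_fields(record)])
```

    Operations that count or index characters, such as truncating to a fixed column width, still need attention because they now work on bytes.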

  11. Use UTF8 by The+Panther! · · Score: 2, Insightful

    While I know XML is a favored silver bullet by the popular press and developers, I still haven't decided if the infatuation with a complicated packaging scheme is really worthwhile. It's nice in a sense that there are off the shelf readers that can interpret the data for you, sure, but ultimately it's still up to your code to pull out the data in a meaningful way. A good XML reader will do two things for you: 1) provide a regular format for all data, and 2) handle string conversions to and from various encoding schemes.

    It seems to me quite silly to bother dealing with all sorts of encoding schemes if you can control the data from the get-go. Convert from whatever your input data is to UTF8 as early as possible. With that, you immediately have support as if you wrote everything as wide characters, but don't have to change much, if any, of your code. UTF8 is narrow, with reserved codes for multi-byte encoding. UTF8 doesn't require changing your string functions* that depend on a single terminating null, and you never really have to think about the encoding again. We've migrated from ASCII to UTF8 and now support whatever languages come in as an XML input format, but we immediately convert to UTF8 and forget the XML once we hit our database.

    * Caveat: Poorly encoded UTF8 can represent the same wide character in many ways. For this reason, a straight byte comparison of UTF8 strings is sometimes incorrect. Either you should test all strings at conversion time to see if they are minimally encoded, or convert to UCS2 and back again, just so all strings go through the same manipulative process, and give you the same byte stream. I learned this the hard way. With that out of the way, it's just like using normal ASCII.
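
    The "test at conversion time" option can be sketched in Python, whose strict UTF-8 decoder rejects non-minimal (overlong) sequences. The function names are hypothetical, and the round trip here goes through Python's own string type rather than UCS2.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """True if data is valid, minimally encoded UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


def canonical_utf8(data: bytes) -> bytes:
    """Decode and re-encode so all stored strings take the same path."""
    return data.decode("utf-8").encode("utf-8")


print(is_well_formed_utf8(b"/"))         # True
print(is_well_formed_utf8(b"\xc0\xaf"))  # False: overlong form of '/'
```

    Rejecting bad input at the boundary keeps later byte comparisons meaningful, since every stored string then has exactly one byte sequence per code point sequence.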

    --
    Any connection between your reality and mine is purely coincidental.
  12. Advantages and Disadvantages of UTF-8 by Anonymous Coward · · Score: 3, Insightful
    There seem to be a lot of posts advocating the use of UTF-8 without explaining what the advantages and disadvantages are. Also, some of the posts are simply incorrect.

    Here are some of the advantages and disadvantages of UTF-8:

    • UTF-8 allows you to encode any character in the entire ISO-10646 character set (which is potentially much larger than Unicode since it is a 31-bit code, rather than Unicode which is only a little over 20 bits, or 17 * 65,536 code points). This is probably not of great interest since it is not expected that the ISO character set will ever need to define any characters outside the Unicode range.
    • Strings encoded in UTF-8 can be processed by standard C language routines. A binary 0 embedded in the string can be used as a string terminator just as in 1-byte character sets. Note that routines like strlen() will return the number of bytes rather than the number of characters in a string.
    • UTF-8 preserves the Unicode sorting order so that string comparisons work the way you'd expect without having to convert to Unicode to do the comparison (but note that the Unicode sorting order is not likely to be a useful "language sensitive" sorting order if that matters for your application, so you may still need some way to perform that kind of sort).
    • If you have an arbitrary byte in a string, it is possible to determine unambiguously whether it is the starting byte for a character, and if not you can probe backwards for the starting byte. This is not true of all multibyte character set encodings. This can be very useful for some applications and not at all for others of course.
    • Characters within the ASCII range (00-7f) are transmitted unchanged.
    • Most alphabetic characters (including Hebrew and Arabic characters) are transmitted with only 2 bytes - the same as if you'd stored them as UCS-2 or UTF-16, but not as compact as if you'd stored them with their corresponding ISO 8859-x character set.
    • Ideographic characters and the remaining rare alphabetics within Unicode Plane 0 are transmitted with 3 bytes, which is 50% larger than if they'd been stored with UCS-2 or UTF-16 or (often) with their native computer character set like Shift-JIS.
    • All other Unicode characters (mostly historical Chinese and Japanese characters and character sets for dead languages) can be transmitted in at most 4 bytes.
    • Depending on your display systems, you may need transformation routines to convert to and from other formats used by those systems. For example, many printers or computer fonts that support large character sets might be arranged for use as Shift-JIS or Big5 rather than for Unicode.
    • Because it preserves a certain degree of compatibility with 1-byte character streams, many existing programs and subsystems can coexist with UTF-8 with little or no modification. That does not mean you can count on UTF-8 being safe anywhere that ASCII is safe; you need to evaluate each system on its own merits. However it is quite likely to make your conversion easier.
    Even if you don't use UTF-8 for the external storage format, many projects have found that its advantages make it ideal for processing data in memory. Other times using a fixed-width format (16 or 32-bit) is desirable; fortunately the conversion between UTF-8 and the fixed-width Unicode formats is quite easy and quick.
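
    Several of the points in the list above can be checked directly. The following is a rough Python sketch with hypothetical helper names; it shows the per-character byte counts, the bytes-versus-characters difference that strlen() would expose, backward probing for a character's first byte, and the preserved sort order.

```python
def encoded_sizes(ch: str) -> tuple:
    """Bytes needed for one character in UTF-8 and in UTF-16."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


def char_start(data: bytes, index: int) -> int:
    """Probe backwards from index to the first byte of its character."""
    # Continuation bytes have the bit pattern 10xxxxxx.
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index


def sorts_like_code_points(words: list) -> bool:
    """Compare code point order with plain byte order of the UTF-8 form."""
    by_utf8 = sorted(words, key=lambda w: w.encode("utf-8"))
    return sorted(words) == by_utf8


print(encoded_sizes("A"))           # (1, 2)  ASCII
print(encoded_sizes("\u05d0"))      # (2, 2)  Hebrew alef
print(encoded_sizes("\u65e5"))      # (3, 2)  CJK ideograph
print(encoded_sizes("\U00020000"))  # (4, 4)  outside Plane 0

data = "\u65e5\u672c\u8a9e".encode("utf-8")
print(len(data))                    # 9 bytes for 3 characters
print(char_start(data, 4))          # 3: byte 4 is inside the second character
print(sorts_like_code_points(["z", "\u00e9", "\u65e5", "\U00020000"]))  # True
```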