Slashdot Mirror


Migrating Large Scale Applications from ASCII to Unicode?

bobm asks: "We've been asked to migrate our newer applications to Unicode. My biggest issue is that if we start storing user data in Unicode we will no longer be able to provide complete updates to the legacy (pure ASCII) systems. This is important in that we are currently updating > 25k customers a day and management does not want that to be affected. I also haven't found a clean way to provide multilanguage data mining that can return a single language output. This doesn't even begin to address issues like data validation and display issues. (Note: we currently handle the web pages in multiple language sets but require the data to be in ASCII form.) I've spent some time on Unicode.Org but I really haven't found any real world discussions on people doing this on a large scale (>1Tbyte databases)."

27 of 202 comments

  1. Convert all interaction to XML by Kingpin · · Score: 5, Informative


    You don't mention any specifics, so it's hard to give details in response. What databases? How free a hand do you have?

    I'd suggest a message-oriented, XML-based system. You can model to your heart's content in XML: languages, charsets, etc. You can design nearly anything around that, and have various backends convert the XML messages (SOAP possibly) to the kind of data that's useful for the given backend.

    --
    Unable to read configuration file '/bigassraid/htdig//conf/14229.conf'
    Geocrawler error message.
    1. Re:Convert all interaction to XML by RelliK · · Score: 3, Informative

      Why is it that everybody jumps up and down thinking that XML is some kind of magic potion? XML is useful in some cases but it solves none of the problems bobm is asking about.

      --
      ___
      If you think big enough, you'll never have to do it.
  2. I don't get it... by jonr · · Score: 2, Informative

    All decent databases have Unicode support and allow you to convert the data on the fly. What's the problem here? And if you use UTF-8 encoding you have ASCII compatibility...
    J.

    1. Re:I don't get it... by Twylite · · Score: 4, Informative

      What's with people assuming that UTF-8 is ASCII? It's not. UTF-8 is a multibyte representation that just happens to coincide with ASCII for characters 0 through 127. After that it takes two bytes to encode a character, possibly more when you get to "big" characters.

      UTF-8 is an encoding for unicode characters.
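
      A minimal sketch of the point above, assuming Python 3; the sample characters and the helper name utf8_length are illustrative choices, not something from the post. It shows the byte count growing from one byte for ASCII up to four bytes for a character outside the 16-bit range.

```python
# Sample characters: ASCII letter, e-acute, euro sign, a character beyond U+FFFF.
SAMPLES = ["A", "\u00e9", "\u20ac", "\U00010400"]

def utf8_length(ch: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of one character."""
    return len(ch.encode("utf-8"))

for ch in SAMPLES:
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")
```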

      --
      i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
    2. Re:I don't get it... by cwiegand · · Score: 2, Informative

      Simple. Characters 0 - 127 have the 1st (or 8th) bit OFF (i.e. a space (32) = 00100000 in binary). Now, character 129 would be 10000001. In ASCII (or so-called "8-bit ASCII"), that would be fine. In UTF-8, though, the high bit indicates it's part of a multi-byte character, and the next byte ALSO has to have that high bit turned on.

      So, for chars 0-127, UTF-8 is a great way to use Unicode. For European languages, they just have an extra byte. But for unicode chars that would have the high byte turned OFF, you have a problem, and it takes more bytes to encode them.

      Basically, UTF-8 is a great way to move to Unicode, but don't consider it the destination. Use UTF-16, if you can.
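
      To make the high-bit rule concrete, here is a rough sketch assuming Python 3. The function names classify and describe are hypothetical. It labels each byte of a UTF-8 string by its top bits, so character 129 can be seen turning into a lead byte followed by a continuation byte.

```python
def classify(byte: int) -> str:
    """Classify one UTF-8 byte by its high bits."""
    if byte & 0x80 == 0x00:
        return "ascii"          # 0xxxxxxx: a complete one-byte character
    if byte & 0xC0 == 0x80:
        return "continuation"   # 10xxxxxx: second or later byte of a sequence
    return "lead"               # 11xxxxxx: first byte of a multi-byte sequence

def describe(text: str) -> list:
    """Return (bit pattern, class) for every byte of the UTF-8 encoding."""
    return [(f"{b:08b}", classify(b)) for b in text.encode("utf-8")]

print(describe(" "))        # the space character, code 32
print(describe("\u0081"))   # code 129 needs two bytes in UTF-8
```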

      --
      Define sqrt(x) as something really evil like (x / rand()), and bury it deep in a shared include somewhere.
  3. Perhaps useful, how staroffice did it. by caolan · · Score: 5, Informative

    What might be useful is to read how StarOffice did their Unicode and internationalization changes to an existing large code base at sun.com
    C.

    --
    I sometimes write stuff
  4. ebXML by Anonymous Coward · · Score: 2, Informative
  5. Useful resource on how to migrate software by sjmurdoch · · Score: 5, Informative

    A very useful resource on Unicode is this page, written by Markus Kuhn. In particular you may be interested in How do I have to modify my software?; while it does concentrate on Unix, the general principles should be the same on any OS.

    --
    Steven Murdoch.
    web: http://www.cl.cam.ac.uk/users/sjm217/
  6. UTF-8 by bertilow · · Score: 3, Informative

    What's the problem? If you use the UTF-8 encoding
    for Unicode, all your data will be ASCII compatible.

    1. Re:UTF-8 by pubjames · · Score: 4, Informative

      I'm finding it depressing seeing how things get modded here. This has been modded as funny??

      The guy is absolutely right - using UTF-8 solves lots of problems when having to use legacy software with Unicode. I did one project working with twelve languages, including arabic, japanese, hindi and welsh, and we just used SED to search and replace marker tags in hundreds of UTF-8 files. Worked a treat.
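
      A sketch of why byte-oriented tools like sed can do this safely, assuming Python 3: every byte of a multi-byte UTF-8 sequence has its high bit set, so an ASCII marker can never match in the middle of a non-ASCII character. The marker {{NAME}} and the function fill_marker are hypothetical, not the tags used on that project.

```python
MARKER = b"{{NAME}}"  # hypothetical marker tag, pure ASCII

def fill_marker(utf8_data: bytes, value: str) -> bytes:
    """Replace the ASCII marker in UTF-8 data without decoding the data first."""
    return utf8_data.replace(MARKER, value.encode("utf-8"))

# Japanese and Welsh greetings sharing one template, handled purely as bytes.
template = "ようこそ {{NAME}} / Croeso {{NAME}}".encode("utf-8")
result = fill_marker(template, "José")
print(result.decode("utf-8"))
```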

  7. Re:Urk by dragonfly_blue · · Score: 2, Informative
    Oh, and speaking of Unicode and Perl, I'd have to say that once again O'Reilly is probably a great place to start, and sending the dev team in charge of the Unicode conversions to ORA Unicode boot camps/geek cruises is probably not a half bad approach.


    There is also this fascinating title, which I've been meaning to read, merely because the page layout and typography within is a work of art. If you're in the bookstore and see this one, check it out. It's impressive.

    --
    Free music from Jack Merlot.
  8. mySQL & PHP by mnordstr · · Score: 3, Informative

    In the development todo for mySQL 4, they have a list of "Things that must be done in the real near future". Quite far down on that list I found:

    "* Add support for UNICODE."

    That's great, because mySQL 4 is about to be released any day now.
    As a PHP developer I wanted to know if php supports unicode. This is what I found:

    Strings:
    "A string is series of characters. In PHP, a character is the same as a byte, that is, there are exactly 256 different characters possible. This also implies that PHP has no native support of Unicode."

    1. Re:mySQL & PHP by Hooya · · Score: 2, Informative

      i've been involved in designing and implementing a site to support arabic, thai, japanese, chinese, russian, korean, hindi and some 15 other languages (the european ones) using, you guessed it, MySQL and PHP. php apparently supports UNICODE strings (we're using version 3.x even). in MySQL, we set the field to binary. i'm sure that adds some overhead but it works. we've used java to 'convert' strings from x encoding to UTF-8. iconv works too. now users can switch the language of the site purely by selecting an appropriate radio button for the desired language. and the languages are 'translated' gettext() style but thru database instead of files. this is a survey type site so the hittage is quite high and the site along with the database shows no signs of slowing down. i'm not sure if that's what you wanted to know but since our client (the browser) is multiple encoding compatible we have no problem. you might want to look into String class in java as it provides some neat encoding conversion in a roundabout sort of way. you possibly could get the Unicode string and then convert it to ASCII but i'm not sure what it does with the non-ASCII characters. as for MySQL database, set the field to binary. i don't know about oracle etc.. as i haven't found the need to use it.
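
      The two conversions mentioned above can be sketched as follows, assuming Python 3 rather than the Java or iconv tooling the post used; to_utf8 and to_ascii are hypothetical helper names. The second helper shows one possible answer to what happens to non-ASCII characters: with the "replace" error handler each becomes "?", while "strict" raises an error instead of losing data silently.

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Transcode bytes from a known legacy encoding to UTF-8."""
    return data.decode(source_encoding).encode("utf-8")

def to_ascii(text: str, errors: str = "replace") -> bytes:
    """Down-convert text to ASCII; 'replace' substitutes '?' for anything else."""
    return text.encode("ascii", errors=errors)

russian = "Привет".encode("koi8-r")          # pretend this came from a legacy system
print(to_utf8(russian, "koi8-r").decode("utf-8"))
print(to_ascii("caf\u00e9"))                  # lossy: the accented letter is replaced
```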

  9. Migration of data to unicode sets by Alan+Cox · · Score: 3, Informative

    Make sure you use UTF8. Firstly, because unlike UCS2 (16bit) it can encode all the characters, not a subset of them. Eventually 16bit won't be enough for you. Secondly, it's 7bit ASCII equivalent, so there is no real problem with migration over time.
    Thirdly, since ASCII 7bit is UTF8 ASCII space, there isn't any data migration to be done to set this up.

    1. Re:Migration of data to unicode sets by Anonymous Coward · · Score: 1, Informative


      Even though UCS2 is 16bit, UTF-16 is 16 or 32 bit. You can see the Unicode 3.1 standard for more details. Unicode can have surrogate pairs. So all of Unicode can be represented. This allows for 1 million different Unicode characters. I think Windows XP and MacOS 10.1 can handle surrogates.



      One open source project that can do UTF-16 is from IBM. You can go to http://oss.software.ibm.com/icu/ for more details on ICU4C. It also gives some information on converting character sets to and from Unicode. It also does collation, number formatting, and all sorts of Unicode manipulation.



      If all your data is ASCII there is probably no problem, but once you encounter character sets like ShiftJis and other East Asian encodings, you will encounter a huge number of variants. Finding the right kind of converter can be difficult. Fortunately ICU4C (see previous address) allows you to simulate several platforms' conversion behaviour, and if it doesn't, you can modify it yourself. It's all under the X license :-)

    2. Re:Migration of data to unicode sets by Anonymous Coward · · Score: 1, Informative
      UCS-2 is limited to 16 bits (65,536 code points) but UTF-16 is not - it uses a "surrogate" area to support multi-word encodings much like UTF-8 allows multi-byte encodings, with about 1,114,112 code points (17 * 65,536) and this is enough to support all of the characters that are ever likely to be defined.
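
      The surrogate mechanism can be sketched in a few lines, assuming Python 3. The function name to_surrogates is hypothetical; the arithmetic is the standard UTF-16 rule of subtracting 0x10000 and splitting the remaining 20 bits across two 16-bit words.

```python
import struct

def to_surrogates(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need surrogates")
    offset = code_point - 0x10000          # 20 bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

# Cross-check against the built-in UTF-16 encoder (big-endian, no BOM).
pair = to_surrogates(0x1D11E)
builtin = struct.unpack(">HH", chr(0x1D11E).encode("utf-16-be"))
print([hex(w) for w in pair], pair == builtin)
```

      The 17 * 65,536 figure is the 16-bit base plane plus the 16 additional planes that the 20 surrogate bits can address.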

      It is not necessarily true that you'll eventually need more than 65,536 code points; the code points outside Plane 0 of Unicode are used to encode historical characters and languages such as historical (archaic literary) Chinese characters, Egyptian hieroglyphics, Etruscan, etc. Even many dead writing systems such as Aramaic and the Northern European runes are supported in Plane 0. In addition many font formats do not support more than 65,536 characters, making working with such large character sets awkward. It really depends on your application whether you might need such a large character set, though if you can write your low-level routines now to support it then you can add the display layer later if and when it becomes necessary.

  10. What do you mean by "ASCII"? by Florian+Weimer · · Score: 4, Informative

    You first have to examine carefully the character set your current application can deal with. Is it ASCII? Or just the printable range? Or do most routines treat everything as sequences of 8-bit characters? Is the null character permitted in data? And so on.

    After that, you have to identify the operations which are character set specific. This can be quite a bit of work. Character set specific operations include case conversion, collating, normalizing, measuring string length and character width (for formatting plain text), text rendering in general, and so on.
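
    A few of these character-set-specific operations, sketched with Python 3's standard library as an illustration only (the helper same_text is a hypothetical name). The same visible text can have different lengths and compare unequal until it is normalized, and case conversion can change the length of a string.

```python
import unicodedata

composed = "\u00e9"       # e-acute as a single code point
decomposed = "e\u0301"    # letter e followed by a combining acute accent

def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(composed == decomposed)                  # raw comparison
print(same_text(composed, decomposed))         # comparison after normalization
print(len(composed), len(decomposed))          # lengths in code points
print(len(decomposed.encode("utf-8")))         # length in bytes
print("stra\u00dfe".upper())                   # case conversion may add characters
```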

    Now you look at your tools. Do they prefer some kind of Unicode encoding? For example, with Java or Windows, using UTF-16 is most convenient (some would say: mandated).

    Now you put the pieces together and look for a suitable internal representation (not necessarily "Unicode", i.e. UTF-8, UTF-16, or UTF-32), identify points at which data has to be converted (usually, it is a good idea to minimize this, but if you want to fit everything together, there is sometimes no other choice), and modules and external tools which have to be replaced because adjusting them or adapting to them is too much work.

    Your web page generation tools probably need a complete overhaul, so that they are able to minimize the charset being used (for example, German text is sent as ISO-8859-1, but Russian text as KOI8-R or something like that), since client-side Unicode support is mostly ready, but many people don't have the necessary fonts.
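
    One way to read "minimize the charset being used" is to pick the smallest legacy charset that can represent a given page and fall back to UTF-8 otherwise. The sketch below assumes Python 3; the candidate list, its order, and the name pick_charset are hypothetical choices for illustration.

```python
# Hypothetical preference order: narrowest charset first.
LEGACY_CHARSETS = ["us-ascii", "iso-8859-1", "koi8-r"]

def pick_charset(text: str, candidates=LEGACY_CHARSETS) -> str:
    """Return the first candidate charset that can encode the whole text."""
    for name in candidates:
        try:
            text.encode(name)
            return name
        except UnicodeEncodeError:
            continue
    return "utf-8"  # nothing narrower fits

for page in ["Hello", "Gr\u00fc\u00dfe", "Привет", "日本語"]:
    print(page, "->", pick_charset(page))
```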

  11. Re:Possible solutions and a plea by infiniti99 · · Score: 4, Informative

    In fact Unicode is certainly hard and painful to implement

    Maybe for library programmers. I have been extremely impressed with the Qt library's handling of Unicode characters. The QString class is used across the board and supports full Unicode. My project, Psi can handle unicode everywhere (chat, nicknames), thanks to Qt. Heck, I didn't even know about this for the longest time. In fact, getting unicode chat over Jabber took just one extra function call:

    QString::toUtf8();

    I just use that before sending content or attributes to the Jabber XML stream. Qt's parser already converts incoming UTF-8 to Unicode. This was so amazingly easy to use from an "application coder"'s standpoint it's not even funny.

    Of course, I can't speak any language other than English, so I personally won't be taking advantage of this. I know other people will though, and thankfully it was easy enough to put in.

    -Justin

  12. Space tradeoffs by d5w · · Score: 4, Informative

    But if your database is currently dominated by ASCII or even typical Latin-1 text, that's a reasonable tradeoff; no increase for ASCII text, a slight increase for Latin-1 text (100% on a minority of the characters in actual text; anyone have actual stats handy?), 50% increase for the rest of the 16-bit range, and the same maximum character size (U+10000 - U+10FFFF take 4 bytes in both UTF-8 and UTF-16). And then you have the other advantages already mentioned: compatibility with 7-bit ASCII, NUL-terminated C strings, and ordinary 8-bit clean text channels. If you're currently in the ASCII or Latin-1 domain the question isn't even what you expect to store in the future, so much as how much cheaper disk space will be when you finally need to store it.
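
    The size comparison can be checked directly. This is a small sketch assuming Python 3, with encoded_sizes as a hypothetical helper; "utf-16-le" is used so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for the same text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

samples = {
    "ASCII": "hello",
    "Latin-1 letter": "\u00e9",
    "rest of 16-bit range": "\u65e5",
    "beyond 16 bits": "\U00010400",
}
for label, text in samples.items():
    print(label, encoded_sizes(text))
```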

  13. Re:Been there, done that by Keick · · Score: 2, Informative

    Ditto. I was in charge of converting a legacy library system into supporting unicode, and it was easier than you might think. It was no small system either, with the main windows user interface weighing in at over 200K lines of code, and the server at over 500K lines... You get the gist.

    UTF8 is about the only way to go. Windows provides some decent conversions between local character sets and unicode (UTF8). Also, you may want to look at the Mozilla code, which had a decent UTF8 conversion set as well.

    The details are this: On the server we used Oracle 8i, and converted all the tables to UTF8. Importing old data was fairly straightforward, especially the English since it maps 1 to 1. We used Fulcrum to index with. Fulcrum was our biggest scare, but the easiest to fix. Fulcrum was only capable of ASCII, and even worse it used a lot of special control characters, which prevented us from using UTF8 with it. The trick was that we wrote our own UTF7 layer that encoded UTF8 into our homegrown UTF7 to avoid using the control chars. Beautiful.
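
    The post does not describe its homegrown UTF7 format, so the following only illustrates the general idea with the standard UTF-7 codec that ships with Python 3; it is a stand-in, not that project's scheme. The helper names are hypothetical. The point is that non-ASCII text can be carried through an ASCII-only component and recovered on the other side.

```python
def to_seven_bit(text: str) -> bytes:
    """Encode text so that every output byte is below 0x80 (standard UTF-7)."""
    return text.encode("utf-7")

def from_seven_bit(data: bytes) -> str:
    """Recover the original text from its 7-bit form."""
    return data.decode("utf-7")

title = "B\u00fccher f\u00fcr Kinder"
wire = to_seven_bit(title)
print(wire)                      # safe to hand to an ASCII-only indexer
print(from_seven_bit(wire))      # round-trips back to the original
```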

    The client side was our biggest hurdle, but Delphi and the Windows API saved our butts. Since all the code was based on a common library, i.e. the VCL, we simply rewrote the VCL to handle Unicode. All internal data was in UTF8, so only minor changes were needed for most of the controls. We wrote wrappers for the entire Windows API. Depending on which Windows you were using, we switched out layers. On English-only boxen, the layer simply converted UTF8 to ASCII and vice versa when dealing with the API. For boxen that supported Unicode, we used a different layer to convert between UTF8 and Unicode. For foreign language boxen, it was the same ASCII layer, but using local code page conversions, so the user would always at minimum see their language.

    If you want more details, feel free to email me at bfleming@rjktech.com

  14. Re:Learn your ASCII by Anonymous Coward · · Score: 1, Informative

    Learn your ASCII and discover that's it's only 7 bits.

  15. just in case... avoid #define UNICODE by mughi · · Score: 4, Informative

    Just in case any of this work is being done on Microsoft Windows, you should avoid "#define UNICODE", TCHAR, and _T(). These are mainly legacy tricks used to help Windows 3.1 developers cross-compile their code for NT. Microsoft themselves don't use them, and instead go with pure Unicode through the app. Even COM in Win32 since the first release of Windows 95 is all Unicode (BSTRs).

    Of course, this would preclude you from using MFC, but then again, many think that avoiding it is a good thing (again, Microsoft is among those who avoid using it). But aside from other benefits, you'd end up with not needing to build two separate binaries: one for Windows NT/2K and one for Win9X.

    Oh, and one other thing. If you are doing any portable code, remember that the Microsoft documentation lies and that wchar_t is not always 16-bit like they say. In fact, the spec recommends that it be 32-bit, and most other platforms (Linux included) define it thus.
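
    A quick way to see the platform difference without writing C is to ask Python 3's ctypes module for the size of the native wchar_t. This is only a probe of whatever machine runs it; the constant name WCHAR_BYTES is a hypothetical label.

```python
import ctypes

# Size of the platform's native wchar_t, as seen through ctypes.
WCHAR_BYTES = ctypes.sizeof(ctypes.c_wchar)
print(f"wchar_t is {WCHAR_BYTES * 8} bits on this platform")
```

    Portable code should therefore not assume that one wchar_t holds one UTF-16 code unit.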

  16. Migrating Applications from ASCII to Unicode by xnetinc · · Score: 2, Informative

    It sounds like part of your system is using code pages to communicate in various languages, like a web-based application. The data portion is not the linguistic text but just items that can be represented in ASCII. Some of your applications can only support ASCII and all the data in your database is ASCII. If it is truly ASCII 0 - 127 (0x7F) (7-bit clean) then you can often just redefine the database to declare that it contains UTF-8 (Unicode) data. But you must be sure that it is 7-bit clean first. One of the best Unicode support packages for C/C++ code (I assume that this is C) is ICU. http://oss.software.ibm.com/icu/ ICU uses UTF-16, but there is xIUA http://www.xnetinc.com/xiua/ which is also free open source software that adds UTF-8 support to ICU. Even better, it will allow you to add support and still run in code page first and then use the same code to support Unicode. It makes it easy to develop hybrid applications that may use Unicode in one part of the application and not in another. It will also allow you to use UTF-8 for database access, UTF-32 to interface with Linux Unicode wchar_t, and a mix of code page and UTF-8 requests to a browser.
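
    The "7-bit clean" check is easy to sketch, assuming Python 3 and that the column data can be read as raw bytes; first_non_ascii is a hypothetical helper. Data that passes is already valid UTF-8 byte for byte, which is why the relabelling works.

```python
def first_non_ascii(data: bytes) -> int:
    """Return the offset of the first byte >= 0x80, or -1 if the data is 7-bit clean."""
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            return offset
    return -1

clean = b"plain ASCII record"
dirty = b"caf\xe9"   # Latin-1 e-acute: not valid as UTF-8

print(first_non_ascii(clean))   # -1 means safe to declare as UTF-8
print(first_non_ascii(dirty))   # offset of the offending byte
```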

  17. Encodings by osolemirnix · · Score: 3, Informative
    There is an additional problem with unicode in that you can convert from/to any encoding to unicode, but the encodings are not necessarily compatible.

    E.g. we had that with two different japanese kanji encodings (on Sun workstations and Windowze boxes). Both encodings converted to Unicode and back, but they both had characters not present in the other encoding. So if you created, say, a filename on one system, converted the string to unicode and back to the other encoding on the other system, then all you got was a lot of gibberish.
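
    A rough way to detect this before it produces gibberish is to test which characters the target encoding cannot represent. The sketch assumes Python 3; unencodable is a hypothetical helper, and the euro sign against Latin-1 is used only as a stand-in for the kanji case described above.

```python
def unencodable(text: str, target_encoding: str) -> list:
    """List the characters in text that target_encoding cannot represent."""
    lost = []
    for ch in text:
        try:
            ch.encode(target_encoding)
        except UnicodeEncodeError:
            lost.append(ch)
    return lost

# The text is fine as Unicode but cannot round-trip through Latin-1.
print(unencodable("price: 5\u20ac", "latin-1"))
print(unencodable("price: 5", "latin-1"))
```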

    So storing your data in unicode alone doesn't solve all your problems. All the clients that access that data need to support the same encodings used. (e.g. your american windowze box cannot handle unicode with kanji stuff unless you have the right language pack installed)

    Essentially it boils down to: all your clients and servers must use the same encoding, whether you use unicode or something else.

    --

    Idempotent operation: Like MS software, wether you run it once or often, that doesn't make it any better.
  18. Re:Unicode not adequate for internationalization by Anonymous Coward · · Score: 3, Informative
    The problem is that there aren't currently any reasonable alternatives for handling the problems that you mention. All of the various national character sets and vendor character sets are subsets of Unicode, so if you want to write something today you have little practical alternative.

    There are two basic problems with Unicode: Han unification and ideographic character variations. Essentially all of the various Asian national character sets imply some form of Han unification, and their internal structures are quite different. In either event you are left with having to indicate the original language in order to display the "best possible" glyph, with the added burden that if you use the national character sets you'd have to have multiple interpretation and display systems to handle the very different character set encoding structures.

    The other issue is that of character variations and nuances. Unfortunately there aren't any character coding standards (as opposed to ideas that have been kicked around) that address this at all; if you include the Plane 2 characters in Unicode then it comes closer to handling this than any one national standard.

    I agree that Unicode isn't ideal, but there's nothing on the immediate horizon that looks much better, especially if you need to be able to display text in any language. But if you can restrict yourself to a single language family (European, Hebrew, Arabic, Japanese, Chinese, etc) then there are already alternatives out there. Unicode is designed for applications where you don't have that luxury.

    If you have the need to handle multiple languages simultaneously, you're still probably better off converting to Unicode first and then converting to whatever "ultimate" encoding system emerges in 20 years or so.

  19. XML & Unicode libraries by melatonin · · Score: 2, Informative
    Apple's CoreFoundation does a great job of dealing with Unicode and XML. It's an OO library written in C, and as such it has string objects and an xml parser/generator that works with its array and dictionary objects. It does an excellent job of abstracting Unicode messiness when working with XML.

    I've found CF a bit cumbersome to use by itself. A wrapper in an OO language like C++ or Objective-C is very convenient. Your Objective-C wrapper is commonly called the Cocoa Foundation framework :)

    It's been ported to Linux and FreeBSD, and I'd recommend it to anyone doing Unicode or XML work. The parser is currently non-validating, but there are so many other 'gifts' that come with CF that make it worthwhile.

    Hey, it was good enough to build an OS on.

    --
    Moderators should have to take a reading comprehension test.
  20. Don't by Alex+Belits · · Score: 3, Informative

    Unicode does not solve any problems with multilingual text processing -- what it solves is not a problem (having non-iso8859-1 native language, I am qualified to testify that displaying and representing data in various languages wasn't a problem for at least 30 years already), and real problems -- rules, matching, hyphenation, spell checking, etc. remain problems with Unicode just like they are without it.

    To make it possible to process, transfer and store the data in multiple languages one does not need Unicode -- in fact Unicode usually only adds additional step that requires some knowledge of language context that may be unknown, unavailable for some kind of processing, or simply not disclosed by end-users. What is necessary is byte-value transparency, so text in multiple languages at least will not be distorted by "too smart" procedures that cut the upper bits or make some other ASCII-centric assumptions. If/when users will care about marking languages in a way more advanced than iso 2022, they probably will find byte-value transparent channels to be suitable for whatever they will use.
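
    What a non-transparent channel does to text can be sketched as follows, assuming Python 3; strip_high_bit is a hypothetical stand-in for a "too smart" procedure that cuts the upper bit. The KOI8-R sample is an illustrative choice.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Simulate a 7-bit channel by clearing the top bit of every byte."""
    return bytes(b & 0x7F for b in data)

original = "Привет".encode("koi8-r")
damaged = strip_high_bit(original)

print(original.decode("koi8-r"))   # intact when passed through unchanged
print(damaged)                     # no longer the same text
```

    Pure ASCII data passes through such a channel unharmed, which is how the problem tends to go unnoticed until non-ASCII text arrives.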

    However if/when real usable languages-handling infrastructure that will solve those problems will be created, it won't need unicode because it will have language metadata attached to the text already, and without metadata, text, in unicode or in native charsets, is not usable for most of applications if it's not somehow already known what language it is supposed to be in.

    --
    Contrary to the popular belief, there indeed is no God.