Slashdot Mirror


Migrating Large Scale Applications from ASCII to Unicode?

bobm asks: "We've been asked to migrate our newer applications to Unicode. My biggest issue is that if we start storing user data in Unicode we will no longer be able to provide complete updates to the legacy (pure ASCII) systems. This is important in that we are currently updating > 25k customers a day and management does not want that to be affected. I also haven't found a clean way to provide multilanguage data mining that can return a single language output. This doesn't even begin to address issues like data validation and display issues. (Note: we currently handle the web pages in multiple language sets but require the data to be in ASCII form.) I've spent some time on Unicode.org but I really haven't found any real world discussions on people doing this on a large scale (>1Tbyte databases)."

202 comments

  1. Convert all interaction to XML by Kingpin · · Score: 5, Informative


    You don't mention any specifics, so it's hard to give details in response. What databases? How free a hand do you have?

    I'd suggest a message-oriented, XML-based system. You can model to your heart's content in XML: languages, charsets, etc. You can design nearly anything around that, and have various backends convert the XML messages (SOAP possibly) to the kind of data that's useful for the given backend.

    --
    Unable to read configuration file '/bigassraid/htdig//conf/14229.conf'
    Geocrawler error message.
    1. Re:Convert all interaction to XML by RelliK · · Score: 3, Informative

      Why is it that everybody jumps up and down thinking that XML is some kind of magic potion? XML is useful in some cases but it solves none of the problems bobm is asking about.

      --
      ___
      If you think big enough, you'll never have to do it.
  2. I don't get it... by jonr · · Score: 2, Informative

    All decent databases have Unicode support and allow you to convert the data on the fly. What's the problem here? And if you use UTF-8 encoding you have ASCII compatibility...
    J.

    1. Re:I don't get it... by Twylite · · Score: 4, Informative

      What's with people assuming that UTF-8 is ASCII? It's not. UTF-8 is a multibyte representation that just happens to coincide with ASCII for characters 0 through 127. After that it takes two bytes to encode a character, possibly more when you get to "big" characters.

      UTF-8 is an encoding for unicode characters.

      --
      i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
    2. Re:I don't get it... by Shafalus · · Score: 1
      After that it takes two bytes to encode a character, possibly more when you get to "big" characters.

      UTF-8 takes:

      • 1 byte from 0 to 0x7f
      • 2 bytes from 0x80 to 0x7ff
      • 3 bytes from 0x800 to 0xffff
      • 4 bytes from 0x10000 to 0x1fffff

      That's why it's only popular in Europe and the Middle East. Characters in scripts from India, South-East Asia and the native American languages take up more space in UTF-8 than in UTF-16.
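
      The ranges above can be sketched in C. This is only an illustration: the function names are made up, the 4-byte upper bound follows the list as written (Unicode itself stops at 0x10FFFF), and the UTF-16 helper ignores the surrogate range.

```c
#include <stdint.h>

/* Bytes UTF-8 needs for one code point, following the ranges listed above.
 * Returns 0 for values beyond the 4-byte range. */
int utf8_encoded_len(uint32_t cp)
{
    if (cp <= 0x7F)     return 1;
    if (cp <= 0x7FF)    return 2;
    if (cp <= 0xFFFF)   return 3;
    if (cp <= 0x1FFFFF) return 4;
    return 0;
}

/* Bytes UTF-16 needs for the same code point: one 16-bit unit up to 0xFFFF,
 * a surrogate pair (two units) above that. */
int utf16_encoded_len(uint32_t cp)
{
    return cp <= 0xFFFF ? 2 : 4;
}
```

      For example, U+0915 (a Devanagari letter) falls in the 3-byte UTF-8 range but needs only 2 bytes in UTF-16, while U+00E9 (e with acute) costs 2 bytes in both.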

      --

      Linux advocates are in a no Win situation

    3. Re:I don't get it... by ericf · · Score: 1

      Actually UTF-8 takes one byte for characters 0 to 255. From there the rest of the unicode charset takes between two and three bytes.

    4. Re:I don't get it... by bertilow · · Score: 0, Flamebait
      Actually UTF-8 takes one byte for characters 0 to 255.

      Actually you need to check your bullshit information. Characters 160 to 255 take up two bytes each in UTF-8.

    5. Re:I don't get it... by pne · · Score: 1

      What's with people assuming that UTF-8 is ASCII? It's not. UTF-8 is a multibyte representation that just happens to coincide with ASCII for characters 0 through 127.

      The original poster talked not about "the same as ASCII" but about "ASCII compatible". And if you have text that's in ASCII, then it's automatically in UTF-8 as well since, as you said, for characters 0 to 127 the ASCII bytes are the same as UTF-8 bytes.

      (Of course, this breaks if you have a language that uses a superset of ASCII such as iso-8859-1, but if you only have characters from "real" ASCII, then UTF-8 has the same representation as ASCII.)
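
      To make the parenthetical concrete, here is a rough sketch of what converting iso-8859-1 data to UTF-8 involves. The function name and buffer convention are invented for the example; the point is only that bytes below 0x80 pass through unchanged while every byte from 0x80 up becomes two bytes.

```c
#include <stddef.h>

/* Convert ISO-8859-1 bytes to UTF-8.
 * out must have room for 2 * len + 1 bytes.
 * Returns the number of bytes written, not counting the trailing NUL. */
size_t latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; ++i) {
        if (in[i] < 0x80) {
            out[n++] = in[i];                                  /* ASCII: unchanged */
        } else {
            out[n++] = (unsigned char)(0xC0 | (in[i] >> 6));   /* lead byte 110000xx */
            out[n++] = (unsigned char)(0x80 | (in[i] & 0x3F)); /* continuation 10xxxxxx */
        }
    }
    out[n] = '\0';
    return n;
}
```

      This works byte by byte because the first 256 Unicode code points are the iso-8859-1 characters; other 8-bit code pages would need a lookup table instead.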

      Cheers,
      Philip.

      --
      Esli epei etot cumprenan, shris soa Sfaha.
    6. Re:I don't get it... by jonr · · Score: 1

      Hello? Where did I say that UTF-8 is ASCII? I said compatible. UTF-8 was designed in this way, to make the transition somewhat easier.
      And this is marked up as Informative? Are moderators still getting crack with their moderation status?

    7. Re:I don't get it... by dvdeug · · Score: 2

      Please explain how this would work. If you store 0 .. 255 in one byte, how do you indicate a multibyte sequence?

    8. Re:I don't get it... by Anonymous Coward · · Score: 0

      That should be "Characters 128 to 255 take up 2 bytes ...".

    9. Re:I don't get it... by chubso · · Score: 1

      I agree. UTF-8 can make 8-bit and 31-bit Unicode play nice together. I just wish people would realize that supporting 16-bit characters is not supporting Unicode.

    10. Re:I don't get it... by cwiegand · · Score: 2, Informative

      Simple. Characters 0 - 127 have the high (8th) bit OFF (e.g. a space (32) is 00100000 in binary). Now, character 129 would be 10000001 in binary. In ASCII (or so-called "8-bit ASCII"), that would be fine. In UTF-8, though, the high bit indicates it's part of a multi-byte character, and the next byte ALSO has to have that high bit turned on.
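
      A small sketch of that high-bit rule, with made-up names. It classifies a single byte by its top bits; the extra check on lead bytes reflects that 0xC0, 0xC1 and 0xF5 through 0xFF never appear in valid UTF-8.

```c
enum utf8_byte_kind { UTF8_ASCII, UTF8_CONTINUATION, UTF8_LEAD, UTF8_INVALID };

/* Classify one byte of a UTF-8 stream by its high bits. */
enum utf8_byte_kind utf8_classify(unsigned char b)
{
    if (b < 0x80)               return UTF8_ASCII;        /* 0xxxxxxx */
    if ((b & 0xC0) == 0x80)     return UTF8_CONTINUATION; /* 10xxxxxx */
    if (b >= 0xC2 && b <= 0xF4) return UTF8_LEAD;         /* starts a 2-4 byte sequence */
    return UTF8_INVALID;                                  /* 0xC0, 0xC1, 0xF5..0xFF */
}
```

      Character 129, for instance, is stored as the two bytes 0xC2 0x81: a lead byte followed by one continuation byte, both with the high bit set.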

      So, for chars 0-127, UTF-8 is a great way to use Unicode. For European languages, they just have an extra byte. But for unicode chars that would have the high byte turned OFF, you have a problem, and it takes more bytes to encode them.

      Basically, UTF-8 is a great way to move to Unicode, but don't consider it the destination. Use UTF-16, if you can.

      --
      Define sqrt(x) as something really evil like (x / rand()), and bury it deep in a shared include somewhere.
    11. Re:I don't get it... by ChadN · · Score: 1

      That should be "Characters 128 to 255 take up 2 bytes ...".

      Actually, characters 128 - 2047 take up 2 bytes each. (I know, the original context of this thread was about ASCII or Latin-1 encoding; thus, I leave off the "dumbass!")

      --
      "It's overkill, of course. But you can never have too much overkill." - Anonymous Slashdot Coward
    12. Re:I don't get it... by BJH · · Score: 1

      Jeez, read his comment properly. He's replying to the parent comment that said that characters 0-255 are one byte, which is obviously complete bullshit.

    13. Re:I don't get it... by divbyzero · · Score: 1

      Mod this up, please. Too few people get this right, especially the parenthetical, and this is a rather clear, accurate statement of the situation.

      --
      But my grandest creation, as history will tell,
      Was Firefrorefiddle, the Fiend of the Fell.
    14. Re:I don't get it... by bertilow · · Score: 1
      That should be "Characters 128 to 255 take up 2 bytes..."

      There are no characters in Unicode between 128 and 160!

  3. Suggestion. by Domini · · Score: 3, Insightful

    Why not encode the data using XML... that way most of your data already maps to the real data.

    This would be without the XML tags, of course. Just the encoding of the data...

    Thus, you will be using UNICODE, and encoding it in XML text.

    Hmm... at some places you may need an XML to unicode translator.

    The fact that you are still storing and transfering your data in ASCII, does not mean it's a ASCII system... it's only your communication medium. This way systematic migration may become more possible.

  4. Perhaps useful, how staroffice did it. by caolan · · Score: 5, Informative

    What might be useful is to read how StarOffice did their Unicode and internationalization changes to an existing large code base, at sun.com
    C.

    --
    I sometimes write stuff
  5. Re:Even before I get to work by forged · · Score: 0, Offtopic

    No you don't.

    Your comments record with all those -1 scores indicate that you're just a lameass.

    In fact I don't even know why I am wasting my time replying to this. Gotta go, see you never.

  6. Capability levels & preserving language taggin by Snowfox · · Score: 4, Insightful
    Clients should be capable of telling their capability level, and servers should be able to use this to determine the data format they receive. If your clients can't return a capability level, new clients should have the feature added, and the lack of the feature should be considered capability level 0. Capability level 1 would be unicode display.

    For older clients, simply send a question mark or similar for any character not in the ASCII character set; this is extremely trivial to add to your back end. New clients get unicode and all the trappings that go with it. Be sure your support people are trained to explain that updating the client provides the new multinational functionality and eliminates the question mark placeholders.
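
    A minimal sketch of that fallback, assuming the new back end holds its text as well-formed UTF-8 (an assumption, as is the function name). Each non-ASCII character becomes a single question mark, however many bytes it occupied.

```c
#include <stddef.h>

/* Copy UTF-8 text to out, replacing every non-ASCII character with one '?'.
 * out must have room for strlen(in) + 1 bytes.
 * Returns the number of characters written, not counting the trailing NUL. */
size_t utf8_to_ascii_placeholder(const char *in, char *out)
{
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)in; *p; ++p) {
        if (*p < 0x80) {
            out[n++] = (char)*p;   /* plain ASCII: copy as is */
        } else if ((*p & 0xC0) != 0x80) {
            out[n++] = '?';        /* first byte of a character: one placeholder */
        }
        /* continuation bytes (10xxxxxx) are skipped */
    }
    out[n] = '\0';
    return n;
}
```

    Level 0 clients would get the output of this filter; clients reporting capability level 1 would get the original bytes untouched.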

    Regarding your question about different languages/encodings - you may need to include the language per record all the way through to the client end. Without knowing more about your output system, it's difficult to say what the display issues are, but it's difficult to believe many display libraries would limit you to a language per session.

  7. Re:Suggestion - XML by Domini · · Score: 1, Redundant

    Hehe... seems like someone else had the same idea already.

    And, I must also insist that more domain specific information be given to aid in giving a solution.

    PS: By no mean do I think XML is the begin and end of all things... just that it may actually be useful here...
    ;)

  8. ebXML by Anonymous Coward · · Score: 2, Informative
    1. Re:ebXML by mnot · · Score: 1

      WTF does ebXML have to do with i18n, and why is it 'informative'?

  9. Re:Even before I get to work by Anonymous Coward · · Score: 0



    You should really set your moderation threshold higher... replying to those posts is pointless, and causes follow-up posts like this etc. etc.
    ;)

    (I saw the post because I wanted to see who was first with a specific suggestion...)

    </offtopic>

  10. Possible solutions and a plea by mir · · Score: 5, Insightful

    If your application returns results in XML you can always "safely" encode parts of the text using character entities (&#nn;). Another solution is to return not one but several results, in various encodings (you would have either to store the native encoding of a text or to figure out what it could be).
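
    A rough sketch of the character entity approach, assuming the stored text is well-formed UTF-8. The function name and error convention are invented for illustration, and overlong or otherwise suspicious sequences are not rejected. The output is pure ASCII, so it survives a legacy pipeline and can still be decoded by an XML parser.

```c
#include <stdio.h>
#include <stddef.h>

/* Rewrite UTF-8 as pure ASCII, turning each non-ASCII character into a
 * decimal character reference such as &#233;.
 * Returns bytes written (not counting the NUL), or (size_t)-1 if the input
 * is malformed or out is too small. */
size_t utf8_to_char_refs(const char *in, char *out, size_t out_size)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;
    while (*p) {
        unsigned long cp;
        int extra;                                          /* continuation bytes expected */
        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return (size_t)-1;                             /* stray or invalid byte */
        ++p;
        while (extra-- > 0) {
            if ((*p & 0xC0) != 0x80) return (size_t)-1;     /* truncated sequence */
            cp = (cp << 6) | (unsigned long)(*p++ & 0x3F);
        }
        if (cp < 0x80) {
            if (n + 1 >= out_size) return (size_t)-1;
            out[n++] = (char)cp;
        } else {
            int w = snprintf(out + n, out_size - n, "&#%lu;", cp);
            if (w < 0 || (size_t)w >= out_size - n) return (size_t)-1;
            n += (size_t)w;
        }
    }
    if (n >= out_size) return (size_t)-1;
    out[n] = '\0';
    return n;
}
```

    Markup characters such as & and < would still need their usual escaping; that step is left out here.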

    And I hope this kind of practical discussion can help to raise the level of interest in Unicode amongst application coders.

    Although a lot of "core" coders (as in people who write languages and tools) are really into Unicode and trying to get their code to process it properly I found that most "application programmers", people who use those tools, are not at all interested. They tend to think that all software should support their favorite encoding natively. They also tend to curse a lot when they get data in a different encoding ;--) Usually they view Unicode as yet another curse thrown upon them by an irresponsible buzzword-worshipping management.

    In fact Unicode is certainly hard and painful to implement, but it is a standard, and at least it was written by people who know what they're doing. It solves problems that most of us either have had to deal with (oh the agony of dealing with odd characters in SGML data) or will have to deal with. Face it people, there are more and more people whose names include funny characters, even in the US, too many to leave that market untapped.

    So please view Unicode as a chance, and if the poster can do it on a terabyte of data, you can certainly do it on much less, especially as the tools are coming (yes, even Perl!)

    --
    Look, that's why there's rules, understand? So that you think before you break 'em. (Terry Pratchett)
    1. Re:Possible solutions and a plea by infiniti99 · · Score: 4, Informative

      In fact Unicode is certainly hard and painful to implement

      Maybe for library programmers. I have been extremely impressed with the Qt library's handling of Unicode characters. The QString class is used across the board and supports full Unicode. My project, Psi, can handle Unicode everywhere (chat, nicknames), thanks to Qt. Heck, I didn't even know about this for the longest time. In fact, getting Unicode chat over Jabber took just one extra function call:

      QString::toUtf8();

      I just use that before sending content or attributes to the Jabber XML stream. Qt's parser already converts incoming UTF-8 to Unicode. This was so amazingly easy to use from an "application coder"'s standpoint it's not even funny.

      Of course, I can't speak any language other than English, so I personally won't be taking advantage of this. I know other people will though, and thankfully it was easy enough to put in.

      -Justin

    2. Re:Possible solutions and a plea by bertilow · · Score: 3, Insightful
      Of course, I can't speak any language other than English, so I personally won't be taking advantage of this. I know other people will though, and thankfully it was easy enough to put in.

      So you think Unicode is just for non-English text? Well, neither ASCII nor Latin 1 is really sufficient for English. There are plenty of characters above 255 in Unicode that are needed or useful for writing English. And then we have foreign names that tend to pop up in English texts with all sorts of funny characters that you need to write even if you only speak English.

    3. Re:Possible solutions and a plea by Anonymous Coward · · Score: 0

      Well, neither ASCII nor Latin 1 is really sufficient for English.

      Excellent point :)

    4. Re:Possible solutions and a plea by Anonymous Coward · · Score: 0
      "... face it people, there are more and more people whose names include funny characters ..."

      I will give you the benefit of the doubt and assume that by funny you meant unusual, not laughable. Regardless, though, the characterization exemplifies why UTF-8 is considered such a pain in the ass by presbyopic developers who fail to realize the obvious: we live in a world of increasingly diverse communities for which ASCII information systems are woefully ill-suited. But most importantly, as someone who has "funny" characters embedded in his name, I think I have just as much a right as anyone else to have my name spelled correctly in the mailing list and telemarketing databases which are responsible for the plethora of crap that's sent to me, yet not me.

      Sincerely,
      Añoñymous Cöward

    5. Re:Possible solutions and a plea by Malc · · Score: 2, Insightful

      I'm an application programmer, and I can't say that I've found Unicode particularly hard. It's a blessed relief, especially after working with multi-byte character sets. Note, I'm not talking about UTF-8 or UTF-7, which are multi-byte representations, and are a pain in the arse. In C++, Unicode characters have a dedicated type (wchar_t), and you can index directly into strings, which you can't do with a multibyte char string (see: isleadbyte). The other big advantage of Unicode is being able to share stuff with systems in different localities... there are no "code pages" to worry about. On top of this, some OSes (Windows NT) have been Unicode-only for some time, so switching applications to Unicode is a more natural way of working.
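
      To illustrate the indexing point for UTF-8 specifically, a sketch (hypothetical function name, well-formed input assumed): finding the Nth character means scanning from the start, because characters have different byte lengths.

```c
#include <stddef.h>

/* Return a pointer to the start of the index-th character (0-based) of a
 * UTF-8 string, or NULL if the string has fewer characters.
 * This is a linear scan, unlike indexing an array of fixed-width units. */
const char *utf8_char_at(const char *s, size_t index)
{
    for (; *s; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80) {   /* not a continuation byte */
            if (index == 0) return s;
            --index;
        }
    }
    return NULL;
}
```

      With the string "a", e-acute, "z" stored as four bytes, character 2 (the "z") starts at byte offset 3, not 2.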

    6. Re:Possible solutions and a plea by Anonymous Coward · · Score: 0
      UNICODE is great in principle, until you try and find a UNICODE font. There just don't seem to be any out there - not free, at least. You can get fonts covering most languages, but if you want one that covers *every* language you have to pay big bucks [as a game developer, I'm mainly thinking of Japanese here].

      That's one reason ASCII is still so well loved, because ASCII fonts can be found *anywhere*.

      Does anyone have a good source for free UNICODE fonts?

    7. Re:Possible solutions and a plea by Anonymous Coward · · Score: 0

      Just get one that covers the ones that you're interested in. You care "mainly" about Japanese, right?

    8. Re:Possible solutions and a plea by Anonymous Coward · · Score: 0

      That's one reason ASCII is still so well loved, because ASCII fonts can be found *anywhere*.

      Does anyone have a good source for free UNICODE fonts?

      I did a little searching, and found this and this, which is a start.

      The problem is that designing fonts (good fonts) is hard. The difficulty is compounded by the large character set in Unicode. I mean, it takes years, man! :)

      - MFN

  11. Re:everyone should learn English by Anonymous Coward · · Score: 2, Troll

    Why not stick with the most used languages on the planet then? Chinese or Spanish?

  12. Re:everyone should learn English by Hektor_Troy · · Score: 1

    Actually the easiest thing would be for everyone to learn Chinese. There are at least twice as many people who speak Chinese as there are people who speak English; hence you don't have to teach Chinese to as many people as you would English.

    --
    We do not live in the 21st century. We live in the 20 second century.
  13. Re:just ignore it by Anonymous Coward · · Score: 2, Insightful

    Latin-1 accommodates Western Europe and the Americas. It doesn't work for Eastern Europe or Asia. With Latin-1, you're cutting out potential profits from Greece, Russia, Arab countries, China, and Japan. For an international company, Unicode IS about making money.

  14. Compression Scheme for Unicode by GeLeTo · · Score: 4, Insightful

    Check this standard for Unicode compression.
    It compresses 16-bit Unicode chars to 8-bit using some reserved tags to switch the character windows. A sample Java implementation is available. The best thing is that most of the standard ASCII chars will still be encoded as 8-bit ASCII after the compression. So you can still store all your data in 8-bit ASCII and convert it to Unicode before displaying it. And you don't have to modify your old data!!!

    1. Re:Compression Scheme for Unicode by bodin · · Score: 1

      But hey. Why not use UTF-8 instead?
      It's more widely spread and it also stores old ASCII data in 8-bit format.

  15. Re:Very Humanitarian by Anonymous Coward · · Score: 0

    Interesting read... but it's all just logic and truth.

    It won't find an open-minded intelligent audience (At least not a large one) in the States.
    Especially not now.

    Besides, posting things like this on a forum such as slashdot.org will rally less support and more irritation.

  16. Urk by dragonfly_blue · · Score: 1

    The title of this article alone was enough to give me horrible flashbacks to working with EBCDIC/ASCII conversions and IBM's weirdly proprietary and immutable standards. Thank GOD for Larry Wall, is all I have to say about that.

    --
    Free music from Jack Merlot.
    1. Re:Urk by dragonfly_blue · · Score: 2, Informative
      Oh, and speaking of Unicode and Perl, I'd have to say that once again O'Reilly is probably a great place to start, and sending the dev team in charge of the Unicode conversions to ORA Unicode boot camps/geek cruises is probably not a half bad approach.


      There is also this fascinating title, which I've been meaning to read, merely because the page layout and typography within is a work of art. If you're in the bookstore and see this one, check it out. It's impressive.

      --
      Free music from Jack Merlot.
    2. Re:Urk by Artichoke · · Score: 1


      > [...], check it out

      Direct link to the online sample pdf of Chapter 1

      ... and whilst I'd not go overboard on the beautiful tpyograpyh angle, it certainly looks an interesting read.
      [Note to self: get a life]

      --
      __
      Arse
    3. Re:Urk by darkonc · · Score: 2

      ASCII/EBCDIC conversions are probably not as bad as EBCDIC/EBCDIC conversions ... It took me a long time to realize that IBM has a number of EBCDIC encodings -- and you often don't know which one you're getting unless you know what kind of device you got it from.

      --
      Sometimes boldness is in fashion. Sometimes only the brave will be bold.
  17. Useful resource on how to migrate software by sjmurdoch · · Score: 5, Informative

    A very useful resource on Unicode is this page, written by Markus Kuhn. In particular you may be interested in How do I have to modify my software?; while it does concentrate on Unix, the general principles should be the same on any OS.

    --
    Steven Murdoch.
    web: http://www.cl.cam.ac.uk/users/sjm217/
  18. UTF-8 by bertilow · · Score: 3, Informative

    What's the problem? If you use the UTF-8 encoding
    for Unicode, all your data will be ASCII compatible.

    1. Re:UTF-8 by pubjames · · Score: 4, Informative

      I'm finding it depressing seeing how things get modded here. This has been modded as funny??

      The guy is absolutely right - using UTF-8 solves lots of problems when having to use legacy software with Unicode. I did one project working with twelve languages, including Arabic, Japanese, Hindi and Welsh, and we just used sed to search and replace marker tags in hundreds of UTF-8 files. Worked a treat.

    2. Re:UTF-8 by Anonymous Coward · · Score: 1, Insightful

      I'm finding it depressing seeing how things get modded here. This has been modded as funny??

      Just remember, this is Slashdot, not some fancy-pants two-year community college.

    3. Re:UTF-8 by teg · · Score: 2

      What's the problem? If you use the UTF-8 encoding
      for Unicode, all your data will be ASCII compatible.


      ASCII is 7 bit while UTF-8 is 8 bit. You would want UTF-7 to remain ASCII-"compatible" (UTF-7 is defined in RFC 2152).

    4. Re:UTF-8 by dvdeug · · Score: 2

      The normal meaning of ASCII compatible is an ASCII stream converted into that encoding doesn't change, with occasionally the further restriction being added that bytes in the range 00-7F are equal to ASCII characters (i.e. are not parts of multibyte characters.)

      In this sense, UTF-8 is ASCII compatible. UTF-7, on the other hand, munges certain ASCII characters, and uses bytes in the range 00-7F to stand for non-ASCII characters. If you have to deal with a 7 bit channel, UTF-7 may be the way to go, but otherwise you want to avoid it.
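
      One practical consequence of that stricter definition, sketched with an invented helper: since bytes 00-7F never occur inside a multibyte UTF-8 character, ordinary byte-oriented C routines can look for ASCII delimiters in UTF-8 text without decoding it.

```c
#include <string.h>

/* Count delimiter-separated fields in a UTF-8 string using plain strchr.
 * Safe for any ASCII delimiter, because a byte in 0x01-0x7F can only be
 * that ASCII character, never a fragment of a longer sequence. */
size_t count_fields(const char *utf8, char ascii_delim)
{
    size_t fields = 1;
    if (ascii_delim == '\0')
        return fields;   /* NUL terminates the string, so it cannot delimit */
    for (const char *p = strchr(utf8, ascii_delim); p; p = strchr(p + 1, ascii_delim))
        ++fields;
    return fields;
}
```

      The same scan over UTF-7 or UTF-16 data could report false matches, which is the distinction drawn above.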

  19. Why bother with perl? by Anonymous Coward · · Score: 0

    Much easier to do that sort of thing in C

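    /* Translate one EBCDIC byte to ASCII via a 256-entry lookup table. */
    unsigned char get_ascii(unsigned char ebcdic)
    {
    /* static const: the table is built once, not on every call */
    static const unsigned char map[256]={ ...... };
    return map[ebcdic]; /* unsigned index, so bytes >= 0x80 never go negative */
    }

    What's the big deal?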

    1. Re:Why bother with perl? by Anonymous Coward · · Score: 0

      Ah, yes, but I don't know C, and I do know Perl. :-)

  20. Been there, done that by sql*kitten · · Score: 5, Insightful

    Oracle 8i, UTF8 character set. Compatibility with both Unicode and ASCII character sets. What're the problems? Well, clients that think that Unicode is UCS2, is one to watch out for, or forgetting that there's more to life than Western European ISO.

    Basically, 90% of the problems you will encounter are in converting between character sets to integrate with other things. If you can use Java (Unicode native) and PL/SQL for as much as possible, you'll have fewer problems. If your client is Excel (don't ask) that complicates matters. If you can assume that everything in the database is US7ASCII you're all set, because you won't need to do any data cleansing. If you have to convert stuff that's already there, then you will run into problems. What happened to me is that we had a Western European encoding, but people were entering Cyrillic data. It all came out fine on their desktops, which were configured for that character set, but the actual data in the database was gibberish as far as the queries were concerned. Non-trivial to fix.

    Good luck!

    1. Re:Been there, done that by pubjames · · Score: 2

      If your client is Excel (don't ask) that complicates matters.

      Do you mean Microsoft Excel? Do you mind expanding on this a bit, because I am doing a project at the moment that involves a translation agency giving us translated files in Excel in lots of different languages.

    2. Re:Been there, done that by sql*kitten · · Score: 4, Interesting

      We were using Excel as the data entry client, then using Perl (with the Excel module, very good BTW) and/or VBA to extract the data and send it to Oracle, and ODBC to query from Oracle into Excel. This wasn't a decision we made; it was the client's (i.e. the customer's, not the software's) legacy way of doing things, and they weren't up for paying us to rebuild it and retrain all their staff.

      You can use Perl to extract the data from Oracle and write SQL INSERT or SQL*Loader scripts, but this is a real pain. Windows is pretty good for Unicode, actually, even Notepad is a Unicode text editor, but the actual encoding is (off the top of my head) fixed width (16 bit) UCS2. The locale of the Oracle client was UTF8 (variable width), and it was verifying that the translating worked that sucked up a lot of resource (we naively first assumed that it would just work). UTF8 is great because if you're only using a subset of it, it doesn't waste storage space. The Oracle server was Windows 2000, the client terminals were a variety of different versions of Windows, running Excel for some bits of the app, MSIE4 for others. On the web side, there was some rather crap ASP/COM based middleware, in the end we dumped it and redid it in Java just for the Unicode-nativeness of it.

      Around that time (this was just over 6 months ago) I woulda killed for a Java API to Excel with access to all the objects exposed to VBA, which would have made things a breeze; maybe that exists now.

    3. Re:Been there, done that by iawia · · Score: 1
    4. Re:Been there, done that by pubjames · · Score: 1, Offtopic

      Thanks for spending the time to reply to my query.

    5. Re:Been there, done that by Keick · · Score: 2, Informative

      Ditto. I was in charge of converting a legacy library system into supporting unicode, and it was easier than you might think. It was no small system either, with the main windows user interface weighing in at over 200K lines of code, and the server at over 500K lines... You get the gist.

      UTF8 is about the only way to go. Windows provides some decent conversions between local character sets and Unicode (UTF8). Also, you may want to look at the Mozilla code, which had a decent UTF8 conversion set as well.

      The details are these: on the server we used Oracle 8i, and converted all the tables to UTF8. Importing old data was fairly straightforward, especially the English, since it maps 1 to 1. We used Fulcrum to index with. Fulcrum was our biggest scare, but the easiest to fix. Fulcrum was only capable of ASCII, and even worse, it used a lot of special control characters, which prevented us from using UTF8 with it. The trick was that we wrote our own UTF7 layer that encoded UTF8 into our homegrown UTF7 to avoid using the control chars. Beautiful.

      The client side was our biggest hurdle, but Delphi and the Windows API saved our butts. Since all the code was based on a common library, i.e. the VCL, we simply rewrote the VCL to handle Unicode. All internal data was in UTF8, so only minor changes were needed for most of the controls. We wrote wrappers for the entire Windows API. Depending on which Windows you were using, we switched out layers. On English-only boxen, the layer simply converted UTF8 to ASCII and vice versa when dealing with the API. For boxen that supported Unicode, we used a different layer to convert between UTF8 and Unicode. For foreign language boxen, it was the same ASCII layer, but using local code page conversions, so the user would always at minimum see their language.

      If you want more details, feel free to email me at bfleming@rjktech.com

    6. Re:Been there, done that by oozer · · Score: 2

      You could always access Excel's object model from Java (well, at least since Microsoft's "evil" version of Java was available). Microsoft wanted people to write Java that would only run on Windows, so they coded up a Java->COM bridge. Since all of Excel's object model is available as COM objects to all comers (VBA is just a freebie crippled copy of VB), you could easily automate Excel from Java.

    7. Re:Been there, done that by sql*kitten · · Score: 2

      That's true, but at the time, no-one knew that this wouldn't disappear in the next release of Excel, for example MS had pretty much abandoned J++. So it was judged to be too risky to just use the old tech. Oh well. Maybe next time... :0)

  21. Re:Ignore to proselytising - don't use XML by 6*7 · · Score: 1

    IMHO XML isn't just "the flavor of the month", it's been around sometime now and probably will be (at least I hope).

    XML data can be bloated by using verbose tags, but nobody is forcing you to use descriptive tags. If you want, just use tags like <a> thru <zzzzzz>

  22. Use UTF-8 by Argon · · Score: 3

    Consider using UTF-8 for export instead of direct Unicode. As long as the legacy systems are 8-bit clean, you can feed UTF-8 back to them without too many problems. There will be no issues at all for ASCII data since 7-bit ASCII is the same in UTF-8. You just need to convert front end applications to be UTF-8 aware. You need not convert legacy backends to understand Unicode; they will just store UTF-8 as some weird 8-bit characters. The beauty is you'll be able to convert them in phases and ASCII never stops working.

    1. Re:Use UTF-8 by smallpaul · · Score: 2

      Consider using UTF-8 for export instead of direct Unicode.

      UTF-8 is Unicode. It is one way of representing Unicode on disk. It is as much Unicode as UTF-16, which is probably what you mean by "direct Unicode". They are just two different representations, like one's-complement or two's-complement integers. Both are integers!

    2. Re:Use UTF-8 by Argon · · Score: 1

      I know UTF-8 is Unicode Transformation Format, thank you. I've worked with it for two years. My point is that with UTF-8 you don't have intervening nulls, so legacy applications continue to work. UTF-8 made its beginning as FSS-UTF (File System Safe) in Plan 9.
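
      The null-byte point can be checked with a few lines of C. This is just a sketch with an invented helper name; the byte sequences in the note below are the standard encodings of the same three characters.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if buf[0..len) contains a zero byte, which any C-string API
 * (strlen, strcpy, printf("%s")) would treat as the end of the text. */
int has_embedded_nul(const unsigned char *buf, size_t len)
{
    return memchr(buf, 0, len) != NULL;
}
```

      "Hi" plus e-acute is 48 69 C3 A9 in UTF-8, with no zero bytes, but 48 00 69 00 E9 00 in little-endian UTF-16, which a legacy C program would cut off after the first character.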

    3. Re:Use UTF-8 by smallpaul · · Score: 2
      I understand your point but your terminology was confusing. It was an instance of a common terminological mistake that I thought was worth correcting. UTF-8 and Unicode are not alternatives. UTF-8 and UTF-16 are alternatives. UTF-8 and UTF-16 are both Unicode.

      One historical root of this terminological mistake is that there was a time where UTF-16 was a sort of blessed or default Unicode encoding. But that is no longer the case.

  23. Re:Ignore to proselytising - don't use XML by pubjames · · Score: 2

    Why has this been modded as insightful?

    It should be fairly obvious to anyone who is knowledgeable about technology (I hope that includes most Slashdot moderators) that this guy doesn't know what he is talking about.

  24. Re:everyone should learn English by Anonymous Coward · · Score: 1, Insightful

    Because chinese

    (a)has only a standardised written form, not spoken form

    (b)that written form is especially annoying to represent digitally.

    (c) it is a tonal language, and therefore not very easy to learn unless you have been raised from birth speaking it, since your brain won't have developed the requisite pitch analysis. There are many more non-tonal than tonal language speakers in the world, so standardising on a tonal language would place ALL of them at a disadvantage. It's easy for a tonal language speaker to go the other way though.

    spanish:

    (a) everyone would be spitting all over each other. That's just the way the language is.

    (b) It has bizarre gender constructions. Gendered nouns, again, are easy to learn from birth, but going from a non-gendered to a gendered language is difficult, since the brain's from-birth language database hasn't allocated a row for "gender".

    (c) It has annoying verb tense constructions. In english, one can easily construct new tenses to deal with problems encountered when talking about time travel/relativity in physics. "He would have been going to do that last week". That's a pain in the ass in spanish. Hence, native spanish speakers have a much shakier grasp of the concept of time.

    We should really standardise on conlang like lojban. Then everyone would be at a roughly equal disadvantage, the language would be totally sanely constructed, amenable to computer parsing, and representable as ascii.

  25. Re:everyone should learn English by Anonymous Coward · · Score: 0

    Flamebait? I thought this was fairly funny (and assume that it was intended that way).

  26. Re:everyone should learn English by SlamMan · · Score: 1

    but something like 60% of all people who speak chinese never speak to anyone else who speaks another language.

    --
    Mod point free since 2001
  27. Re:Even before I get to work by Domini · · Score: 1

    How do you know you were the first post?

    Also, can't someone else post quickly while you were typing away?

    Just curious...

  28. Re:everyone should learn English by mutende · · Score: 1

    Unless you mean British English, in which case ASCII is not sufficient to produce a £ sign...

    --
    Unselfish actions pay back better
  29. mySQL & PHP by mnordstr · · Score: 3, Informative

    In the development todo for mySQL 4, they have a list of "Things that must be done in the real near future". Quite far down on that list I found:

    "* Add support for UNICODE."

    That's great, because mySQL 4 is about to be released any day now.
    As a PHP developer I wanted to know if php supports unicode. This is what I found:

    Strings:
    "A string is series of characters. In PHP, a character is the same as a byte, that is, there are exactly 256 different characters possible. This also implies that PHP has no native support of Unicode."

    1. Re:mySQL & PHP by Tony+Blair · · Score: 0

      If MySQL isn't Unicode compliant how does this site work? Just curious.

    2. Re:mySQL & PHP by Yokaze · · Score: 1

      Yes, it's a shame that PHP doesn't have native unicode support.
      But if you use utf-8 and don't touch the strings and just pass them to the (unicode-capable) DB from the Webbrowser (or the reverse) it seems to work (at least for me using latin-1 and japanese characters).

      And there is an experimental multi-byte string module

      --
      "Between strong and weak, between rich and poor [...], it is freedom which oppresses and the law which sets free"
    3. Re:mySQL & PHP by Hooya · · Score: 2, Informative

      i've been involved in designing and implementing a site to support arabic, thai, japanese, chinese, russian, korean, hindi and some 15 other languages (the european ones) using, you guessed it, MySQL and PHP. php apparently supports UNICODE strings (we're using version 3.x even). in MySQL, we set the field to binary. i'm sure that adds some overhead but it works. we've used java to 'convert' strings from x encoding to UTF-8. iconv works too. now users can switch the language of the site purely by selecting an appropriate radio button for the desired language. and the languages are 'translated' gettext() style but thru database instead of files. this is a survey type site so the hittage is quite high and the site along with the database shows no signs of slowing down. i'm not sure if that's what you wanted to know but since our client (the browser) is multiple encoding compatible we have no problem. you might want to look into String class in java as it provides some neat encoding conversion in a roundabout sort of way. you possibly could get the Unicode string and then convert it to ASCII but i'm not sure what it does with the non-ASCII characters. as for MySQL database, set the field to binary. i don't know about oracle etc.. as i haven't found the need to use it.

    4. Re:mySQL & PHP by spitzak · · Score: 2
      There is no need for anything other than "bytes". Nothing requires each "character" to be a "byte". Just use UTF-8.

      If you think it is a problem that the characters are different sizes, please realize that UTF-16 uses prefix codes and thus it also has characters of different sizes. Even storing 32-bit Unicode would result in the need to treat multiple words as a "character" depending on how you think about prefix accent codes. Also try to get your coding out of the 1960's; modern software thinks about "words", which are of varying size.

      All this I18N and Unicode stuff would be a no-brainer (every single interface would use UTF-8) if it were not for this illusion by so many idiots that "characters" need to be equal in size. They aren't, it is impossible for them to be so. Deal with it.

    5. Re:mySQL & PHP by Anonymous Coward · · Score: 0

      Please realize that UCS2 is fine unless you really need those Aramaic characters and ancient Egyptian hieroglyphics. UTF-16 may be complete, but no-one needs completeness. Certainly not at the added complexity of handling multi-width characters. Everyone outside of a museum should be using UCS2.

    6. Re:mySQL & PHP by spitzak · · Score: 2
      UCS2 has been officially rejected by the standards bodies in preference for UTF-16. Also UCS2 would still encode accent prefix characters, which pretty much requires your program to think about multiple widths anyway.

      I also believe the added complexity of needing to handle both 8-bit text and UCS2 is way more complex than just using UTF-8 everywhere. I also have never seen an algorithm where the location of characters is calculated directly, rather than being offsets calculated by scanning all the letters before that point in a string. This means that variable-sized characters do not complicate any known algorithms.

      "Wide characters" have delayed our ability to get working internationalization for decades now. I strongly recommend that you stop contributing to this shameful history and start working with something that works like UTF-8.

  30. Use approximate character set conversion by Twylite · · Score: 4, Insightful

    The way I understand this, you have old clients, new clients, and a server that must handle both. And the server and new clients should support Unicode.

    First, although this is probably obvious, I should note that if your data is primarily text, then you're looking at a 2Tb database when you start using Unicode (depending on the encoding).

    My biggest issue is that if we start storing user data in unicode we will no longer be able to provide complete updates the legacy (pure ASCII) systems

    This is sort of like supporting German language entry, and wanting to display it on English clients. It's not easy, but it can be done, to some extent. Most Unicode you encounter will have an equivalent ASCII representation; there are acceptable conversions for almost all non-Eastern character sets. You can serve up a converted representation to your ASCII clients.
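
    A rough sketch of what such an approximate conversion could look like, in Python. The function name `to_approx_ascii` and the "?" placeholder are made up for illustration; the idea is just to decompose accented letters, drop the accents, and substitute whatever still isn't ASCII. It is lossy by design and would need per-language tuning in a real system.

```python
import unicodedata

def to_approx_ascii(text, placeholder="?"):
    """Best-effort, lossy ASCII rendering (hypothetical helper, not a standard API)."""
    out = []
    # NFKD splits an accented letter into a base letter plus combining marks.
    for ch in unicodedata.normalize("NFKD", text):
        if unicodedata.combining(ch):
            continue  # drop the accent, keep the base letter
        # Anything still outside 7-bit ASCII has no simple equivalent here.
        out.append(ch if ord(ch) < 128 else placeholder)
    return "".join(out)

# "Zoe Muller, Cafe" written with diaeresis and acute accents
print(to_approx_ascii("Zo\u00eb M\u00fcller, Caf\u00e9"))  # Zoe Muller, Cafe
# Two CJK characters have no ASCII approximation at all
print(to_approx_ascii("\u6771\u4eac"))                      # ??
```

    Note that characters without a decomposition (the German sharp s, for instance) fall through to the placeholder, so a lookup table for such cases would be the obvious next step.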

    DO NOT listen to the bullshit about serving up UTF-8 to ASCII clients. They can't understand it any more than I can understand German ; it will seem to work only for low-ASCII characters, but break for all others.

    As for data validation, you are going to have to have two rulesets. One will be client-side ASCII; the other a unicode ruleset used by both the new client and the server. Incoming ASCII from the old client should be converted to equivalent Unicode (that's the easy part) before being validated.

    Sorry, no realworld information here either ; certainly not on database that size.

    --
    i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
    1. Re:Use approximate character set conversion by Anonymous Coward · · Score: 0

      if you call something bullshit, please be correct yourself.

      If the clients have only ASCII (i.e. 7-bit characters as found almost everywhere), then there is no need to convert to UTF-8, as they are identical. If you have 8-bit clean processing of the data, no need to change anything for UTF-8, as long as you do not count characters by bytes.

      You only have to convert if you want to show UTF-8 to a human; then you have to find some substitute for the characters. (Which is normally lots of "?" when I get some Asian spam).

      Or if you do not use ASCII, but some 8-bit code ( many of these encoding are called ASCII often, but they are not, and in this discussion the difference is important).

    2. Re:Use approximate character set conversion by Anonymous Coward · · Score: 0

      Whether it makes sense to serve UTF-8 to an ASCII client will depend on what the client is doing with the data. For example, if the client is just sorting records, UTF-8 will preserve sorting order and is a perfectly acceptable transformation. However if the client is displaying the data on a screen or a printer, or otherwise trying to split up individual "characters" within the data stream, then serving UTF-8 to the client may not be appropriate. (Though remember that some clients that don't know about UTF-8 can still display it properly because they are just sending the data to an output device that does understand it - for example, many display terminals or text windows on various operating systems can handle UTF-8 characters).
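
      The sorting claim is easy to check. A small sketch in Python (the word list is made up): sorting by raw UTF-8 bytes gives the same order as sorting by Unicode code point, which is why a byte-oriented sort on a legacy client keeps working. This is binary order only; it says nothing about language-aware collation.

```python
# Made-up sample: ASCII words, an accented word, a CJK word, and one
# character beyond the 16-bit range (U+1D11E).
words = ["zebra", "\u00e9clair", "apple", "\u65e5\u672c", "\U0001d11e", "Zulu"]

# Python compares str values code point by code point.
by_code_point = sorted(words)

# A client that knows nothing about Unicode would compare raw bytes.
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

print(by_code_point == by_utf8_bytes)  # True
```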

  31. Re:everyone should learn English by pubjames · · Score: 2

    spanish: (a) everyone would be spitting all over each other. That's just the way the language is.

    I live in Spain and speak spanish. I've never found people spitting on each other a problem, perhaps you're thinking of a particular country in South America.

    (b) It has bizarre gender constructions.

    Bizarre?? Lots of languages, perhaps the majority, have this.

    English has many idiosyncrasies; one of the worst for people who are learning it is that it isn't pronounced as it is written. In this respect, Spanish is much more sensible and easier to learn. Also, phrasal verbs are a nightmare for anyone trying to learn English. In this respect Spanish is also easier.

    (c) It has annoying verb tense constructions. In english, one can easily construct new tenses to deal with problems encountered when talking about time travel/relativity in physics. "He would have been going to do that last week". That's a pain in the ass in spanish.

    This is relatively obscure. All working languages have their idiosyncrasies, including English.

    Hence, native spanish speakers have a much shakier grasp of the concept of time.

    Is this a joke?

  32. It might not be that bad by tbray · · Score: 2

    If you store it using UTF-8 (there are lots of options for storing Unicode) your problem may not be that bad. I'm assuming your system is in C or a derivative. UTF-8 avoids the obvious breakage of embedded null bytes. You might need to add an output filter to make sure you don't ship out any characters numbered higher than 127 to non-Unicode-savvy customers.

    On the other hand, if you've got deep assumptions that strlen(whatever) == numberOfCharsIn(whatever) then you're pretty well hosed.
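
    To make both points concrete, here is a small Python sketch. The helper names `byte_length` and `ascii_only` are invented for this example: the first shows what a C `strlen()` would report for UTF-8 data, the second is one possible shape for the output filter.

```python
def byte_length(text):
    """Number of bytes in the UTF-8 form, i.e. what strlen() would count."""
    return len(text.encode("utf-8"))

def ascii_only(text, replacement="?"):
    """Hypothetical output filter: replace anything numbered above 127."""
    return "".join(ch if ord(ch) < 128 else replacement for ch in text)

name = "Bj\u00f6rk"  # 5 characters; the o-with-diaeresis takes 2 bytes in UTF-8

print(len(name), byte_length(name))   # 5 6
print(ascii_only(name))               # Bj?rk

# No embedded null bytes in UTF-8, unlike a 16-bit encoding.
print(0 in name.encode("utf-8"))      # False
print(0 in name.encode("utf-16-le"))  # True
```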

  33. Re:everyone should learn English by swright · · Score: 1

    To bring it back into computing terms, it would actually be best for everyone to do English - the main reason being the small set of glyphs required to express full words and sentences (i.e. only 26 letters and the various punctuation, unlike european languages with all the accents and chinese et al with the 10,000 symbols or however many...)

  34. Re:everyone should learn English by BadDoggie · · Score: 1
    Why not?

    Because Chinese-speaking people can speak English but not many non-Chinese can speak Chinese. Hell, most Chinese can't talk to most other Chinese due to two primary written forms (old, simplified), 31 major dialects (Mandarin now primary, Cantonese losing place) and hundreds of "minor" dialects.

    Spanish-speaking people can also speak English [1] but most English-speakers can't handle much beyond "Yo quiero Taco Bell".

    All over the world, you see many people of different nationalities -- none of whom have English for a Mother Tongue -- talking in English to each other, albeit with accents and of varying quality. It is the de facto "lingua franca", which, incidentally, means "French language", which used to be the world-wide basis for talking. Hell, during the Napoleonic Wars, the British and Germans fighting the French often had to speak in French to each other.

    English took over more than fifty years ago. Remember that English is a bastard language which came together as a mixture of many different languages (primarily Old Norse [like Icelandic], Old German and French). It is adaptable. It is easy to grasp the basics and communicate your intended meaning despite incredibly bad construction and grammar, unlike in most other languages. So lay off.

    woof.

    [1] excludes New York, Texas, California, Florida

  35. Re:Ignore to proselytising - don't use XML by Anonymous Coward · · Score: 0

    Actually I used XML for 3 months. Why is it when anyone disagrees with someone on here they're immediately labelled as ignorant? Insecurity perhaps?

  36. Migration of data to unicode sets by Alan+Cox · · Score: 3, Informative

    Make sure you use UTF8. Firstly because unlike UCS2 (16bit) it can encode all the characters, not a subset of them. Eventually 16bit won't be enough for you. Secondly, it's 7bit ASCII equivalent, so there is no real problem with migration over time.
    Thirdly, since ascii 7bit is UTF8 ascii space, there isn't any data migration to be done to set this up.
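
    A quick way to convince yourself of the "no data migration" point, sketched in Python with a made-up record: pure 7-bit ASCII bytes decode identically as ASCII and as UTF-8, and re-encoding them as UTF-8 gives back exactly the same bytes.

```python
# Sample bytes as a pure-ASCII legacy system might store them (made-up data).
legacy = b"ACME Corp, 42 Main St."

as_ascii = legacy.decode("ascii")
as_utf8 = legacy.decode("utf-8")      # same bytes, read as UTF-8
round_trip = as_utf8.encode("utf-8")  # and written back out

print(as_ascii == as_utf8)   # True
print(round_trip == legacy)  # True

# A character outside the 16-bit range still encodes fine in UTF-8.
beyond_16_bits = "\U00020000"  # a CJK Extension B ideograph
print(len(beyond_16_bits.encode("utf-8")))  # 4
```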

    1. Re:Migration of data to unicode sets by Anonymous Coward · · Score: 1, Informative


      Even though UCS2 is 16bit, UTF-16 is 16 or 32 bit. You can see the Unicode 3.1 standard for more details. Unicode can have surrogate pairs. So all of Unicode can be represented. This allows for 1 million different Unicode characters. I think Windows XP and MacOS 10.1 can handle surrogates.



      One open source project that can do UTF-16 is from IBM. You can go to http://oss.software.ibm.com/icu/ for more details on ICU4C. It also gives some information on converting character sets to and from Unicode. It also does collation, number formatting, and all sorts of Unicode manipulation.



      If all your data is ASCII there is probably no problem, but once you encounter character sets like ShiftJis and other East Asian encodings, you will encounter a huge number of variants. Finding the right kind of converter can be difficult. Fortunately ICU4C (see previous address) allows you to simulate several platforms' conversion behaviour, and if it doesn't, you can modify it yourself. It's all under the X license :-)

    2. Re:Migration of data to unicode sets by Anonymous Coward · · Score: 1, Informative
      UCS-2 is limited to 16 bits (65,536 code points) but UTF-16 is not - it uses a "surrogate" area to support multi-word encodings much like UTF-8 allows multi-byte encodings, with about 1,114,112 code points (17 * 65,536) and this is enough to support all of the characters that are ever likely to be defined.
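
      For anyone who wants to see the surrogate mechanism spelled out, a minimal Python sketch follows. The function name `to_surrogates` is made up; the arithmetic is the standard UTF-16 rule of subtracting 0x10000 and splitting the remaining 20 bits into two 10-bit halves.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need surrogates")
    offset = code_point - 0x10000          # 20 bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1D11E)         # MUSICAL SYMBOL G CLEF
print(f"{high:04X} {low:04X}")             # D834 DD1E

# Cross-check against Python's own UTF-16 encoder.
print("\U0001d11e".encode("utf-16-be").hex())  # d834dd1e
```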

      It is not necessarily true that you'll eventually need more than 65,536 code points; the code points outside Plane 0 of Unicode are used to encode historical characters and languages such as historical (archaic literary) Chinese characters, Egyptian hieroglyphics, Etruscan, etc. Even many dead writing systems such as Aramaic and the Northern European runes are supported in Plane 0. In addition many font formats do not support more than 65,536 characters, making working with such large character sets awkward. It really depends on your application whether you might need such a large character set, though if you can write your low-level routines now to support it then you can add the display layer later if and when it becomes necessary.

    3. Re:Migration of data to unicode sets by Anonymous Coward · · Score: 0

      In Unicode 3 Chinese has spilled _way_ beyond 65535. 16 bits is certainly not enough for full language support.

    4. Re:Migration of data to unicode sets by Anonymous Coward · · Score: 0
      Strictly speaking, the extra Chinese characters were added for Unicode 3.1, not Unicode 3. But yes, for full language support you need more than 65536 code points. The point I was trying to make was twofold:
      1. Some applications may not need full language support.
      2. Many (most) computer hardware and software systems can't yet display characters beyond 65535.
      The team doing the conversion will need to evaluate what makes sense for their specific situation; certainly if it's not much more trouble then it makes a lot of sense to support more than 16 bits internally even if they can't yet display it.
  37. Re:everyone should learn English by dready24 · · Score: 1

    What we generalize as chinese is actually many different related languages (dialects) that cannot be mutually understood when spoken, but can be mutually understood when written in the standard manner.

  38. Unicode and ASCII by dotmaudot · · Score: 1

    I may have missed the point, but Unicode is a character set. Once you have converted the characters to Unicode, you still have to store them. Instead of using UCS-2 (two bytes per character), you may store them in UTF-8, where codes (0-127) are represented exactly like in ASCII.

  39. Re:everyone should learn English by THEbwana · · Score: 1

    I agree!
    - I've got a swedish keyboard at home, american and a swiss-german keyboards at work and from time to time I have to use the french keyboard. People around me speak english, high-german, swiss-german, italian, and french.
    CAN'T WE JUST STOP THIS MADNESS AND JUST USE ENGLISH! - It's the one language everyone understands!

  40. Re:Ignore to proselytising - don't use XML by dready24 · · Score: 1

    > Unicode is 2 bytes per char, ascii is 1. A
    > simple conversion program is trivial to write,
    > you simply have to find the mappings.

    This is not entirely true; there are many different ways to store Unicode: utf-7, utf-8, utf-16, utf-32, ucs-2, ucs-4, etc. I believe you are talking about encodings. Unicode is like ascii: one numeric identifier = one character. Now how you store your numeric identifier is another matter...
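
    As an illustration of "one identifier, several ways to store it", this Python sketch encodes a single character (the euro sign, picked arbitrarily) with three of the encodings listed above and prints the resulting bytes in hex.

```python
ch = "\u20ac"  # EURO SIGN: one character, one numeric identifier

print(hex(ord(ch)))  # 0x20ac

# The same identifier stored three different ways.
encodings = {codec: ch.encode(codec).hex()
             for codec in ("utf-8", "utf-16-be", "utf-32-be")}

for codec, hex_bytes in encodings.items():
    print(codec, hex_bytes)
# utf-8 e282ac
# utf-16-be 20ac
# utf-32-be 000020ac
```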

  41. Re:Ignore to proselytising - don't use XML by dingbat_hp · · Score: 1

    Actually I used XML for 3 months

    So you know almost nothing then ?

    I keep getting involved in usenet flames over XML because I'm still a newbie (not quite 3 years) - and the other guys have something like 10 years experience (they're SGML dinosaurs). Flaming XML is fun - that's why the interesting work has moved beyond it - but if you're going to do this, then attack the real issues with XML, tell us what they are, and tell us what your solutions are.

    After all, if you know everything about XML from just 3 months experience, then you're obviously much smarter than we are.

  42. Usenet by dingbat_hp · · Score: 1

    Read Usenet and c.i.w.a.h. You'll get flamed to a crisp by them (they're a little dysfunctional, to put it mildly), but there are a couple of people thereabouts who know how to do this right.

  43. Hahaha by Anonymous Coward · · Score: 0

    And if you keep zipping a file, it will compress to zero bytes :-)

  44. 1 Terabyte database into XML? by Anonymous Coward · · Score: 0

    XML would make it take up several more terabytes if you used it for storage...

    This is expensive and slow...

    1. Re:1 Terabyte database into XML? by Domini · · Score: 2

      That's a common myth.

      Besides... as my initial post said:
      without tags

      Which means that the person's username would STILL be stored in the ASCII DB as :

      "John Smith"

      which is valid XML data, but any hyphenated characters would have to be translated to valid XML data character sequences... which is the exception.

      As far as speed is concerned, rather focus on algorithmic improvements than linear improvements. There is hardware out that can handle XML natively already. I would not worry too much about speed.

    2. Re:1 Terabyte database into XML? by Isle · · Score: 1

      Errmm... valid XML data character sequences?
      You specify in XML what encoding you're using (ascii/latin1/utf8). XML is not an encoding in itself, although the weird HTML-"reinvent the wheel"-codes sometimes are used.

  45. Re:everyone should learn English by Anonymous Coward · · Score: 0

    How about Esperanto instead? It is an artificial language as opposed to "natural". It is a rule-based language derived from Latin. No exceptions to the rules the way that natural languages have (English in particular). Very amenable to computer parsing.

  46. Use UTF-8 encoding by Anonymous Coward · · Score: 1, Insightful

    If you use the UTF-8 encoding, of which ASCII is a subset, then you minimize the amount of code and text that has to change -- only the text that isn't expressable in ASCII changes, using multiple bytes per character, and ASCII string manipulations still "just work".
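
    A small Python sketch of why byte-oriented ASCII code "just works" on UTF-8: every byte of a multi-byte character has its high bit set, so an ASCII delimiter such as a comma can never appear inside one. The record below is made up.

```python
# A comma-separated record containing accented names, stored as UTF-8.
record = "Garc\u00eda,Jos\u00e9,M\u00e1laga".encode("utf-8")

# Byte-oriented split, as legacy ASCII code would do it.
fields = record.split(b",")

# Each field is still valid UTF-8 on its own.
names = [field.decode("utf-8") for field in fields]
print(len(fields))  # 3
print(names == ["Garc\u00eda", "Jos\u00e9", "M\u00e1laga"])  # True
```

    The usual caveat applies to anything that cuts at a fixed byte offset: truncating in the middle of a multi-byte character leaves an invalid sequence behind.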

  47. What do you mean by "ASCII"? by Florian+Weimer · · Score: 4, Informative

    You first have to examine carefully the character set your current application can deal with. Is it ASCII? Or just the printable range? Or do most routines treat everything as sequences of 8-bit characters? Is the null character permitted in data? And so on.

    After that, you have to identify the operations which are character set specific. This can be quite a bit of work. Character set specific operations include case conversion, collating, normalizing, measuring string length and character width (for formatting plain text), text rendering in general, and so on.
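
    A few of these operations are easy to demonstrate. The Python sketch below (sample strings chosen for illustration) shows two spellings of the same word that compare unequal until normalized, have different lengths, and a case conversion that changes the length of the string.

```python
import unicodedata

composed = "caf\u00e9"     # e-acute as a single code point
decomposed = "cafe\u0301"  # plain 'e' followed by COMBINING ACUTE ACCENT

# Same text to a human, different to a naive comparison.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 4 5

# Normalizing both sides to one form (NFC here) makes them comparable.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# Case conversion is not always one character in, one character out.
print("stra\u00dfe".upper())           # STRASSE
```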

    Now you look at your tools. Do they prefer some kind of Unicode encoding? For example, with Java or Windows, using UTF-16 is most convenient (some would say: mandated).

    Now you put the pieces together and look for a suitable internal representation (not necessarily "Unicode", i.e. UTF-8, UTF-16, or UTF-32), identify points at which data has to be converted (usually, it is a good idea to minimize this, but if you want to fit everything together, there is sometimes no other choice), and modules and external tools which have to be replaced because adjusting them or adapting to them is too much work.

    Your web page generation tools probably need a complete overhaul, so that they are able to minimize the charset being used (for example, German text is sent as ISO-8859-1, but Russian text as KOI8-R or something like that), since client-side Unicode support is mostly ready, but many people don't have the necessary fonts.

    1. Re:What do you mean by "ASCII"? by mughi · · Score: 1
      You first have to examine carefully the character set your current application can deal with. Is it ASCII? Or just the printable range? Or do most routines treat everything as sequences of 8-bit characters? Is the null character permitted in data? And so on.

      This is a key point. Is the person sure that their data is all ASCII? Often people confuse "ASCII" with "8-bit text" and get burned. Most common is calling data "ASCII" when it's really "Latin-1" or "Windows CodePage 1252". Others thinking of similar problems should be careful about getting terms correct. Remember, if the data has any values outside of the 0 through 127 range (more than 7-bit, more than 128 values), then it's not "ASCII".

      Also, if you can represent languages other than US English, Latin, Hawaiian and Swahili, then chances are good that you have something other than "ASCII" data.
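
      One way to check the claim before migrating is simply to scan for bytes outside 0 through 127. A minimal Python sketch, with a hypothetical helper name `first_non_ascii` and a made-up sample that is really Latin-1 (or CodePage 1252):

```python
def first_non_ascii(raw):
    """Return (offset, byte value) of the first byte above 127, or None."""
    for offset, byte in enumerate(raw):
        if byte > 127:
            return offset, byte
    return None

# Data that someone called "ASCII": 0xA3 is the pound sign in Latin-1.
sample = b"Price: \xa35"

print(first_non_ascii(b"plain text"))  # None
print(first_non_ascii(sample))         # (7, 163)
print(sample.decode("latin-1"))        # Price: followed by a pound sign and 5
```

      Data like this is neither ASCII nor valid UTF-8, so it has to be converted from its real 8-bit encoding rather than relabelled.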

  48. Re:Ignore to proselytising - don't use XML by pubjames · · Score: 2

    Actually I used XML for 3 months. Why is it when anyone disagrees with someone on here they're immediately labelled as ignorant? Insecurity perhaps?

    I complained about your posting being modded as 'insightful' because the posting contained dumb comments:

    XML is just the current flavour of the month

    This is a dumb comment. XML is built on top the experience of SGML, which has been around for a long time. If you understand the issues involved in software integration across multiple systems then you should understand why XML is a very important standard.

    Unicode is 2 bytes per char, ascii is 1. A simple conversion program is trivial to write, you simply have to find the mappings.

    Saying this is dumb in the context of the original question and also demonstrates a lack of understanding of what's involved in enterprise level software development.

    Actually I used XML for 3 months.

    So? I am fluent in Spanish. That doesn't mean that I am qualified to make comments about South American politics.

    Seriously, there's a huge difference between someone with trivial experience and someone who has worked on major projects at an enterprise level. So I stick by my original comment - you don't know what you are talking about in this context.

  49. Space tradeoffs by d5w · · Score: 4, Informative

    But if your database is currently dominated by ASCII or even typical Latin-1 text, that's a reasonable tradeoff; no increase for ASCII text, a slight increase for Latin-1 text (100% on a minority of the characters in actual text; anyone have actual stats handy?), 50% increase for the rest of the 16-bit range, and the same maximum character size (U+10000 - U+10FFFF take 4 bytes in both UTF-8 and UTF-16). And then you have the other advantages already mentioned: compatibility with 7-bit ASCII, NUL-terminated C strings, and ordinary 8-bit clean text channels. If you're currently in the ASCII or Latin-1 domain the question isn't even what you expect to store in the future, so much as how much cheaper disk space will be when you finally need to store it.
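
    The per-character numbers can be checked directly. This Python sketch encodes one arbitrarily chosen sample character from each range and records its size in UTF-8 and UTF-16:

```python
# One sample character per range (the choice of samples is arbitrary).
samples = {
    "ASCII (U+0041)": "A",
    "Latin-1 (U+00E9)": "\u00e9",
    "rest of 16-bit range (U+65E5)": "\u65e5",
    "beyond 16 bits (U+1D11E)": "\U0001d11e",
}

# Map each label to (bytes in UTF-8, bytes in UTF-16).
sizes = {label: (len(ch.encode("utf-8")), len(ch.encode("utf-16-le")))
         for label, ch in samples.items()}

for label, (utf8, utf16) in sizes.items():
    print(f"{label}: UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")
# ASCII (U+0041): UTF-8 1 bytes, UTF-16 2 bytes
# Latin-1 (U+00E9): UTF-8 2 bytes, UTF-16 2 bytes
# rest of 16-bit range (U+65E5): UTF-8 3 bytes, UTF-16 2 bytes
# beyond 16 bits (U+1D11E): UTF-8 4 bytes, UTF-16 4 bytes
```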

  50. It's easy by BasharTeg · · Score: 1

    char ascii;
    int unicode;
    unicode = (int)ascii;

    1. Re:It's easy by pne · · Score: 1

      char ascii;
      int unicode;
      unicode = (int)ascii;

      Unfortunately, that only works in-memory since files are sequences of octets (bytes), which only have 8 bits. So you have to convert your ints to octets somehow when saving. So you have to pick a Unicode Transformation Format... such as UTF-8 or UTF-16.

      Cheers,
      Philip

      --
      Esli epei etot cumprenan, shris soa Sfaha.
    2. Re:It's easy by BasharTeg · · Score: 1

      You do understand that was a joke right ?

      Besides, even if I was serious, what do you mean, convert your ints to octets when saving ? Are you saying a char is the only size of data you can print to a file ? Files are sequences of bytes, or sequences of words, or sequences of dwords, depending on how you interpret them. There's nothing to stop you from stamping ints directly into a file, 16 or 32 bits at a time.

      char ascii;
      int unicode;
      unicode = (int)ascii;
      fwrite(&unicode,sizeof(int),1,fptr);

      Get it ? It's still just a joke, but not only did you miss the joke, but you corrected me very poorly.
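
      Joke or not, the catch with stamping ints straight into a file is byte order. A Python sketch, assuming a 32-bit int as on most platforms of the time: the same value comes out as different bytes on little-endian and big-endian machines, which is exactly the difference between the UTF-32LE and UTF-32BE encoding schemes.

```python
import struct

code_point = ord("A")  # 0x41

# What fwrite() of a 32-bit int would produce on each kind of machine.
little = struct.pack("<I", code_point)  # little-endian, e.g. x86
big = struct.pack(">I", code_point)     # big-endian, e.g. SPARC

print(little)  # b'A\x00\x00\x00'
print(big)     # b'\x00\x00\x00A'

# The two layouts match the two fixed-order UTF-32 encoding schemes.
print(little == "A".encode("utf-32-le"))  # True
print(big == "A".encode("utf-32-be"))     # True
```

      Either way the file is full of null bytes, so anything expecting C strings will stop at the first character.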

  51. Re:everyone should learn English by archen · · Score: 1

    You are aware of the difficulties of writing on computers in Chinese aren't you? Japanese manages to avoid this as you can break any kanji down to its sound components (hiragana), which can be spelled via the keyboard (romaji), and then select from a small list. From my understanding, Chinese is far more complex, due to how the same english equivalent spelling can be pronounced a few different ways in Chinese. The alternative is to start changing every keyboard on the planet and redo every programming language (that's in english anyway) so that it can be easily done in Chinese. Either way it would be a mess.

  52. Re:everyone should learn English by bLanark · · Score: 1
    Unless you mean British English, in which case ASCII is not sufficient to produce a £ sign...

    Well, soon we won't need that pound currency symbol either. It'll be the Euro...

    Do you think a million users around the world are staring at their screens at a (hash | square | # | £ | some other symbol) and wondering what the hell we're talking about?

    --
    Note to ACs: I won't mod you up, even if you are being funny or insightful. So take a chance! It's not real life!
  53. Re:everyone should learn English by joemiah · · Score: 1

    Because Chinese-speaking people can speak English but not many non-Chinese can speak Chinese.

    Most Chinese (the majority living in P.R.China) do _not_ speak English.

    Hell, most Chinese can't talk to most other Chinese due to two primary written forms (old, simplified), 31 major dialects (Mandarin now primary, Cantonese losing place) and hundreds of "minor" dialects.

    (a) Given that a person has a good grasp of either simplified or traditional characters, it is normally not a major issue for them to read the other character set. (b) Mandarin (Putonghua) is the Chinese national language. People retain their spoken dialects from speaking with their parents and local community, but learn Putonghua in the PRC.

  54. Re:everyone should learn English by joemiah · · Score: 1

    (a)has only a standardised written form, not spoken form


    If you consider the standardised written form to be "simplified characters", then the standardised spoken form is Putonghua (Mandarin).



    (b)that written form is especially annoying to represent digitally.


    Do you mean to store, or to input the data? Both are easy. There are many popular input schemes used (based upon personal preference) and a proficient typist will have no issue with this. As for storage, I believe the most popular encoding atm (for simplified chinese) is GB2312.



    c) it is a tonal language, and therefore not very easy to learn unless you have been raised from birth speaking it, since your brain won't have developed the requisite pitch analysis. There are many more non-tonal than tonal language speakers in the world, so standardising on a tonal language would place ALL of them at a disadvantage. It's easy for a tonal language speaker to go the other way though.


    I've met plenty of people who have had no issues in learning Chinese due to their "non-tonal" upbringing. Hell, there seems to be plenty of Mormon missionaries walking around Chinatown speaking with near perfect Mandarin.

  55. That don't make no sense... by FatSean · · Score: 1

    XML isn't a character set encoding.

    --
    Blar.
  56. This IS what UTF-8 was designed for by Anonymous Coward · · Score: 0
    absolute ASCII compatibility.


    ASCII is a 7 bit standard encoding. UTF-8 uses those first seven bits EXACTLY the same as regular ASCII.


    The 8th bit of every byte is used to encode higher unicode characters.


    It was designed, in part by Ken Thompson, to allow unicode in existing ascii software (such as the unix family of os's) without code changes or even recompilation.

  57. Re:everyone should learn English by pubjames · · Score: 2

    Spanish-speaking people can also speak English [1]

    [1] excludes New York, Texas, California, Florida

    Ha. That's funny. What you mean to say is that most latin americans living in some parts of the USA can speak English. The world is a big place you know. The vast majority of the people of this world cannot speak English.

    Why don't you take a trip to China or Colombia and see how you get on only speaking English?

  58. Am I wrong or will Unicode double your DB size? by Control-Z · · Score: 1
    We have a database of around 300MB that fits nicely on a CD-ROM.

    I'm assuming converting to Unicode would double the size and we would have to introduce some sort of compression to fit it on a CD-ROM?

    1. Re:Am I wrong or will Unicode double your DB size? by Control-Z · · Score: 1
      Well, one important thing I forgot to mention is that everything else included on the CD is another 150 MB or so!

    2. Re:Am I wrong or will Unicode double your DB size? by Edward+Kmett · · Score: 2

      Your best bet would be to use UTF8 to encode the information rather than UTF16. If your data is all ASCII right now, then you shouldn't see an increase in size (unless you use an overabundance of high (0x80+) characters). The increase for high Unicode characters later on becomes incremental. A lot of Unicode's bad name comes from the bloated, oversimplistic nature of the UTF16 and UTF32 formats. They are useful as internal representations for small buffers, but not for large amounts of data.

      --
      Sanity is a sandbox. I prefer the swings.
    3. Re:Am I wrong or will Unicode double your DB size? by Anonymous Coward · · Score: 0
      UTF-8 does have some useful properties, but it may not be the most appropriate format for external storage for every application.

      Using UTF-16 or UCS-2 would only double the size of the database if the entire database was stored as ASCII or Latin-1 (no binary data for example). Using UTF-8 would probably not make such a database much larger than the same database stored in ASCII or Latin-1. On the other hand, if the database contains a lot of Asian characters, UTF-8 is likely to make the database larger, not smaller, than the same database in UTF-16 or UCS-2.

      There are also transformation formats that are specifically designed to minimize the size of the stored data; these do not have some of the other useful properties of UTF-8 such as preserving the sorting order, being "safe" for processing by the C runtime, being able to find the "beginning" of a character without having to scan forward from the start of the string, etc. In applications where space is at a premium these may be worth investigating. For example look at SCSU at http://www.unicode.org/unicode/reports/tr6/ which gives good compression on many Unicode strings at considerable cost in convenience.
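
      The size trade-off between UTF-8 and UTF-16 is easy to check for your own data. A rough Python sketch, with two made-up sample strings standing in for real database contents:

```python
def sizes(text):
    """Return the encoded size in bytes for each encoding (no BOM)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le")}

ascii_sample = "customer record 12345"   # 21 ASCII characters
cjk_sample = "日本語のテキスト"            # 8 Japanese characters

print(sizes(ascii_sample))  # UTF-16 is exactly twice the UTF-8 size here
print(sizes(cjk_sample))    # UTF-8 needs 3 bytes per character, UTF-16 needs 2
```

      Running the same function over a sample of real rows would give a better estimate than either rule of thumb.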

    4. Re:Am I wrong or will Unicode double your DB size? by spitzak · · Score: 2
      The inefficiency of UTF-8 for storing Asian text is way overstated. This is because people are not making realistic measurements of the number of spaces, punctuation marks, numbers, control characters, and English words that are in real Asian text.

      I think all other attempts at byte coding other than UTF-8 can be safely ignored. If you want compression you can use normal byte-based data compression methods like gzip. This will work on both UTF-8 and even on UTF-16 to reduce them to much smaller than any standard encoding scheme can.
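
      A quick way to try this on your own data, sketched in Python with zlib (the DEFLATE algorithm that gzip also uses). The sample text is invented and highly repetitive, so the ratios it prints say nothing about a real database:

```python
import zlib

# Made-up mixed ASCII/Japanese text, repeated to give the compressor something to work with.
text = "注文番号 12345, status: 出荷済み\n" * 200

for enc in ("utf-8", "utf-16-le"):
    raw = text.encode(enc)
    packed = zlib.compress(raw)
    print(enc, "raw:", len(raw), "compressed:", len(packed))
```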

    5. Re:Am I wrong or will Unicode double your DB size? by Edward+Kmett · · Score: 2

      Yes. However, I was replying to an individual who already had a large ASCII database. Hence I feel it is reasonable to assume that his existing data would transcode nicely to UTF-8 with little if any size bloat. I was trying to dispel the myth that Unicode encoding would flat-out double the size of his existing data. UTF-16 would guarantee that.

      --
      Sanity is a sandbox. I prefer the swings.
  59. Learn your ASCII by Anonymous Coward · · Score: 0

    8-bit Ascii, Code 156 is the pound symbol.

    1. Re:Learn your ASCII by mutende · · Score: 1

      There's no such thing as 8bit ASCII, it's called iso-8859-1 (a.k.a. Latin1).

      --
      Unselfish actions pay back better
    2. Re:Learn your ASCII by Anonymous Coward · · Score: 1, Informative

      Learn your ASCII and discover that it's only 7 bits.
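
      For what it's worth, code 156 as the pound sign matches the old IBM PC code page 437, which is presumably what the parent had in mind. A small Python check of how the same byte reads under different encodings:

```python
byte = bytes([156])

print(byte.decode("cp437"))          # the pound sign in IBM PC code page 437
print(repr(byte.decode("latin-1")))  # '\x9c', a C1 control character in ISO-8859-1
print("£".encode("latin-1"))         # b'\xa3': the pound sign is code 163 in Latin-1

# Plain ASCII stops at 127, so byte 156 is simply not ASCII.
try:
    byte.decode("ascii")
except UnicodeDecodeError as exc:
    print("not ASCII:", exc.reason)
```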

  60. That's a tough one -- some ideas though: by rjstanford · · Score: 3, Interesting
    The hardest problem to solve is the business one. Storing the data is easy -- scaling from 1TB to 2TB (or more) is a solved problem. The hard part is deciding what to do when an ASCII client requests information that you only have in Unicode.

    Does your application support multiple languages now? If it does, it probably has a default language for everything that should be present in case the specific language asked for is missing. Rather than have that be "en_us" (or whatever), make that "US English ASCII-friendly". You can then add a new language "US English Unicode". Then alter your mandate so that everything has at least that language. I'd add Unicode and ASCII flavors for all other languages too, although anything that doesn't use high chars can just be stored as ASCII with the Unicode encoding generated (if space is that much of an issue).

    If your application database is not multi-lingual already, then you have some serious architecture work to do. I'd look at it from that standpoint though -- there is a wealth of reference material describing how to add language support to existing data and apps. Think of Unicode as another language.

    Concentrate on these issues, and let the technical issues (such as encoding scheme) be decided after you know what you want to do. As far as that specific one goes (seems to have the most interest on this page so far), just use whatever your DBMS supports most natively.
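
    One possible shape for the "ASCII-friendly default language" idea, as a Python sketch. The CATALOG dictionary, the language tags and the lookup() helper are all hypothetical stand-ins for whatever your multilingual tables look like:

```python
# Hypothetical stand-in for a multilingual table: key -> {language tag -> text}.
CATALOG = {
    "greeting": {
        "en_US-ascii": "Welcome to the cafe",
        "en_US-unicode": "Welcome to the café",
    },
}

DEFAULT_LANG = "en_US-ascii"  # assumed to exist for every key

def lookup(key, lang, ascii_only):
    """Return the text for lang, falling back to the ASCII-friendly default."""
    variants = CATALOG[key]
    text = variants.get(lang)
    # Legacy (ASCII-only) clients never receive text they cannot store.
    if text is not None and (not ascii_only or text.isascii()):
        return text
    return variants[DEFAULT_LANG]

print(lookup("greeting", "en_US-unicode", ascii_only=False))
print(lookup("greeting", "en_US-unicode", ascii_only=True))
```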

    -Richard

    --
    You're special forces then? That's great! I just love your olympics!
  61. Mainframe Address Database - Unicode Problems by justanyone · · Score: 1
    I worked on a magazine subscription website cds.com in 1996 and had severe problems with addresses being too U.S.A. specific. The mainframe code (5000+ mainly Cobol programs) presumed fixed-byte records with a 13 character City Name field (USPS standards abbreviate all U.S. cities to 13 chars) and 7 chars for postal code.

    The issue of International addresses (city names like 'Petrakalinosorvabad') was STICKY. Further complication: conversions between ASCII and EBCDIC !! So, the whole Unicode problem is thus even further afield.

    This illustrates the size of the problem with huge amounts of legacy code that runs well, is debugged, but is out of date in international markets of today. UNICODE would solve some of this with addresses being printed in the actual national language. Imagine delivering mail to Saudi Arabia with the address in (gasp) ARABIC !
    Americans presume that foreign postal workers read English characters.

    Mainframes that speak Linux and run RDBMSs are a first step to rewriting / converting this legacy code for the new international age. There's a lot of room for better service and greater efficiency - by not requiring non-U.S. postal workers to read English, our packages get delivered faster!

    Unicode will solve problems, but create them, too.

    1. Re:Mainframe Address Database - Unicode Problems by Tablizer · · Score: 1

      (quote) The mainframe code (5000+ mainly Cobol programs) presumed fixed-byte records with a 13 character City Name field (USPS standards abbreviate all U.S. cities to 13 chars) and 7 chars for postal code. ......The issue of International addresses (city names like 'Petrakalinosorvabad') was STICKY. Further complication: conversions between ASCII and EBCDIC !! So, the whole Unicode problem is thus even further afield. (end quote)

      Perhaps use those 13 characters to store a CityID instead of a name. Then you can create a new City table with Unicode or ASCII-XML names. It likely would require some modifications to the old system, but it would not require changing the width of the old city field.

  62. Use UTF8 by The+Panther! · · Score: 2, Insightful

    While I know XML is a favored silver bullet by the popular press and developers, I still haven't decided if the infatuation with a complicated packaging scheme is really worthwhile. It's nice in a sense that there are off the shelf readers that can interpret the data for you, sure, but ultimately it's still up to your code to pull out the data in a meaningful way. A good XML reader will do two things for you: 1) provide a regular format for all data, and 2) handle string conversions to and from various encoding schemes.

    It seems to me quite silly to bother dealing with all sorts of encoding schemes if you can control the data from the get-go. Convert from whatever your input data is to UTF8 as early as possible. With that, you immediately have support as if you wrote everything as wide characters, but don't have to change much, if any of your code. UTF8 is narrow, with reserved codes for multi-byte encoding. UTF8 doesn't require changing your string functions* that depend on a single terminating null, and you never really have to think about the encoding again. We've migrated from ASCII to UTF8 and now support whatever languages come in as an XML input format, but we immediately convert to UTF8 and forget the XML once we hit our database.

    * Caveat: Poorly encoded UTF8 can represent the same wide character in many ways. For this reason, a straight byte comparison of UTF8 strings is sometimes incorrect. Either you should test all strings at conversion time to see if they are minimally encoded, or convert to UCS2 and back again, just so all strings go through the same manipulative process, and give you the same byte stream. I learned this the hard way. With that out of the way, it's just like using normal ASCII.

    --
    Any connection between your reality and mine is purely coincidental.
    1. Re:Use UTF8 by Anonymous Coward · · Score: 0
      UTF-8 text that does not use the minimal encoding for all characters is not only considered poor encoding, but incorrect encoding. Ideally, if there is any way for such text to be generated in the system, a UTF-8 aware program should reject such input.

      Note also that using UTF-8 doesn't necessarily mean that you don't have to change any of your string functions, but that the need for such changes is reduced. For example, anything that looks at a single "character" (eg, in C, the routines toupper(), strchr(), etc) may not work correctly on UTF-8 text.
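
      Python's bytes type behaves much like a C char buffer here, so it makes a handy illustration of the byte-versus-character mismatch (the sample string is made up):

```python
text = "naïve café"
data = text.encode("utf-8")

# Like strlen(), len() on the raw bytes counts bytes, not characters.
print(len(text), len(data))  # 10 characters, 12 bytes

# A byte-oriented search for a non-ASCII "character" has to look for
# the whole multi-byte sequence, not a single byte.
print(data.find("é".encode("utf-8")))  # byte offset 10

# ASCII-only case mapping leaves the accented letters untouched.
print(data.upper().decode("utf-8"))  # NAïVE CAFé
```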

    2. Re:Use UTF8 by spitzak · · Score: 2
      All implementations reading UTF-8 should treat characters coded using more bytes than necessary as errors. Otherwise serious security vulnerabilities are possible due to disagreement between various pieces of software about the equality of characters.

      Normally these errors are turned into a single error Unicode character (U+FFFD, the replacement character). However I favor an implementation where the error is turned into the same number of characters as there are bytes in the error, with each character equal to the original byte. Due to the design of UTF-8 the resulting characters will be in the 0x80-0xFF range. The reason for this is to allow recovery of ISO-8859-1 text that is mistakenly put into a UTF-8 stream.
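
      Both behaviours can be sketched in Python. The strict decoder already rejects overlong forms, and a custom error handler (the name "bytes_as_latin1" is made up for this example) approximates the byte-preserving recovery described above:

```python
import codecs

def bytes_as_latin1(exc):
    """Map each undecodable byte to the code point with the same value."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return bad.decode("latin-1"), exc.end

codecs.register_error("bytes_as_latin1", bytes_as_latin1)

# 0xC0 0xAF is the classic overlong two-byte encoding of "/".
overlong_slash = b"\xc0\xaf"
try:
    overlong_slash.decode("utf-8")
except UnicodeDecodeError:
    print("overlong sequence rejected")

# ISO-8859-1 text that ended up in a stream labelled UTF-8.
mixed = b"caf\xe9 ok"
print(mixed.decode("utf-8", errors="bytes_as_latin1"))  # café ok
```

      Note that this recovery is a heuristic: it cannot tell stray Latin-1 bytes from genuinely corrupt UTF-8.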

  63. Tcl! by Anonymous Coward · · Score: 0

    Sure, Tcl isn't as buzzword compliant as Java, and it isn't popular with the "in" crowd, but it's a good workhorse, and Unicode and UTF-8 are two things it has been doing well for a while. Check it out at www.tcl-tk.net.

  64. Re:everyone should learn English by steveo777 · · Score: 1
    Every native English speaker I know who has learned Hebrew or Greek believes strongly that English is the worst language in the world. And I know quite a few of these people. Something about the way Greek is easier to understand and English sounds like gibberish.
    I've also heard it said that English is the hardest language in the world to learn. I'm not saying it is, but that's what I'm told by exchange students and their ilk.

    Just my 3.14159 cents.

    --
    This sig isn't original enough, it's time to come up with something witty...
  65. Re:everyone should learn English by mutende · · Score: 1
    Well, soon we won't need that pound currency symbol either. It'll be the Euro...

    Except the British aren't that keen on the €, are they? And both the € and the £ symbol require an "8bit clean" charset.

    --
    Unselfish actions pay back better
  66. Be like Mike (rosoft) by Anonymous Coward · · Score: 0

    Don't offer an upgrade path, just require users to buy the new software and create all new apps/databases for your new software.

  67. Re:Capability levels & preserving language tag by darkonc · · Score: 2

    For old clients that don't return a capability level, find a backwards-compatible way for them to indicate a capability level (it may simply be in the form of them doing a query in a specific form; those that don't do it are considered to be 'ancient').

    --
    Sometimes boldness is in fashion. Sometimes only the brave will be bold.
  68. IBM's International Components for Unicode by mughi · · Score: 1
    IBM's ICU is a good place to look. It's based on research originally from Taligent. The transliteration and message formatting capabilities might be of special interest, and the site might be a good jumping-off point.
    The International Components for Unicode(ICU) is a C and C++ library that provides robust and full-featured Unicode support on a wide variety of platforms. The library provides:
    • Calendar support
    • Character set conversions
    • Collation (language-sensitive)
    • Date & time formatting
    • Locales (170+ supported)
    • Resource Bundles
    • Message formatting
    • Normalization
    • Number & currency formatting
    • Time zones
    • Transliteration
    • Word, line & sentence breaks
  69. just in case... avoid #define UNICODE by mughi · · Score: 4, Informative

    Just in case any of this work is being done on Microsoft Windows, you should avoid "#define UNICODE", TCHAR, and _T(). These are mainly legacy tricks used to help Windows 3.1 developers cross-compile their code for NT. Microsoft itself doesn't use them, and instead goes with pure Unicode throughout the app. Even COM in Win32 since the first release of Windows 95 is all Unicode (BSTRs).

    Of course, this would preclude you from using MFC, but then again, many think that avoiding it is a good thing (again, Microsoft is among those who avoid using it). But aside from other benefits, you'd end up with not needing to build two separate binaries: one for Windows NT/2K and one for Win9X.

    Oh, and one other thing. If you are doing any portable code, remember that the Microsoft documentation lies and that wchar_t is not always 16-bit like they say. In fact, the spec recommends that it be 32-bit, and most other platforms (Linux included) define it thus.
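
    A quick way to see what wchar_t is on a given machine without writing any C is Python's ctypes module. The sketch below also shows why the width matters for characters outside the 16-bit range:

```python
import ctypes

# Width of the platform's wchar_t in bytes: typically 2 on Windows
# and 4 on Linux and macOS.
print(ctypes.sizeof(ctypes.c_wchar))

# A character beyond U+FFFF needs two 16-bit units (a surrogate pair)
# but only one 32-bit unit.
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF
utf16_units = len(clef.encode("utf-16-le")) // 2
utf32_units = len(clef.encode("utf-32-le")) // 4
print(utf16_units, utf32_units)
```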

    1. Re:just in case... avoid #define UNICODE by Malc · · Score: 1

      If you go with 100% Unicode under Windows, every system call will have to be intercepted, and translated if you're on Win9x. You say build two different binaries, one for Win9x, and one for NT... how do you propose doing that without using the TCHAR.H stuff? Maintain two separate sets of code? The TCHAR stuff isn't just for legacy Win3.1 people, it's also for all Win9x OSes too. Finally, just because COM objects only accept BSTRs doesn't make them Unicode... their coclass implementation can still be ANSI (probably translating all incoming and outgoing strings).

    2. Re:just in case... avoid #define UNICODE by mughi · · Score: 1
      If you go with 100% Unicode under Windows, every system call will have to be intercepted, and translated if you're on Win9x.

      That's right. Of course, you can go ANSI like most Windows apps, and when it runs on NT, 2K and XP every single API call will get the data translated... So you'd be hit by that problem on one of the two platforms.

      However, ever since Office 97, this is exactly what Microsoft has been doing with its entire Office suite. And it's been well worth it for them. The person in charge of all Office development gave a good presentation on all this at the 13th Unicode conference. So... if it's good enough for Microsoft's flagship product...

      You say build two different binaries, one for Win9x, and one for NT... how do you propose doing that without using the TCHAR.H stuff? Maintain two separate sets of code?

      You missed my point. I was saying that if you use #define UNICODE and TCHAR, then you will need to build two separate binaries. And quadruple your testing, QA, etc.

      My point is that because separate binaries is just not a good idea, one should avoid TCHAR, et al.

      The TCHAR stuff isn't just for legacy Win3.1 people, it's also for all Win9x OSes too.
      Well... kinda. It's for MFC. Besides, if you aren't building two different binaries, then you're just typing in a bunch of unneeded NOOPs. If you are only ever building a Win9X/ME version of a program, and never need to compile a NT build, then you have no need for TCHAR at all.
      Finally, just because COM objects only accept BSTRs doesn't make them Unicode... their coclass implementation can still be ANSI (probably translating all incoming and outgoing strings).
      Well, it could be. But then again, it might not. And so you'd need to convert your string data in Win9x programs at some point. You could choose to do it when you get input from an old-school Win32 API call, or you could choose to do it before feeding into the new-school COM/ActiveX call. The more programs get to COM/ActiveX/whatever, the more you'll be translating.

      So, given that it has to occur somewhere, doing it before it gets into your program is more and more of a good thing. (Remember, however, that I pointed out that MFC is a different thing, but I don't think MFC is all that great either).

    3. Re:just in case... avoid #define UNICODE by mughi · · Score: 1

      Oops. Don't think I specifically addressed these points:

      Finally, just because COM objects only accept BSTRs doesn't make them Unicode...

      Well... COM/ActiveX _is_ the API. As long as that is used, it doesn't matter what's on the back-end... VB, Java, C, C++... And the API is purely Unicode. So COM is Unicode. Implementations might do things differently on the back end, but that doesn't change the programming contract.

      their coclass implementation can still be ANSI (probably translating all incoming and outgoing strings).

      And if that were the case, sticking with the approach you suggest would make the problem worse. For every single time you'd call into COM, you'd first convert your ANSI [using Microsoft's questionable terminology] data to Unicode and make a new BSTR. Then you'd feed it through COM. Then the back-end would reverse that transformation and process data in ANSI. It would then have to convert from ANSI back to Unicode to make a new BSTR to pass back to you. Then you'd convert that BSTR from Unicode back into ANSI for further processing on your side. Whew!

      On the other hand, if your data were kept in Unicode, then your program would cut in half all the conversions on your data per call into COM. And if you happened to call something that did implement its code in Unicode (like any of the components from Office), then you'd end up with zero conversions per COM call.

    4. Re:just in case... avoid #define UNICODE by Anonymous Coward · · Score: 0

      "Of course, this would preclude you from using MFC..."

      MFC is a PITA. I love Linux, but my summer job was coding Windows apps. I chose Win32 over MFC, grabbed gcc and xemacs, popped up MSDN's (incredibly slow) library site, and started running. Used a free resource editor (forget the name now...)

      Win32 works pretty darn well, you avoid a lot of the bloat of MFC, etc. It's definitely wanting in modernity compared to glib/gdk/gtk+, but it works. You really *do not* need Visual Studio or friends to do great Windows development...and it'll especially be worth it when you actually know what the hell is going on if anything goes wrong, as opposed to looking blankly at a screen full of MSVS MFC forms, knowing that something's broken but not knowing where.

      "If you are doing any portable code"

      Portability is a *joke* on Windows platforms. UNIX has better portability, sad as it is. I needed to write an app that would run on 95 and upwards. The sheer amount of *stuff* that is missing in one or another MS OS or works differently depending on the OS is amazing. You don't notice this if you aren't worrying about backwards compatibility, but it's just plain idiotic. I mean, there isn't a *dithering* function guaranteed to be available on all copies of Windows! For chrissake, the MacOS was doing gorgeous Floyd-Steinberg when DOS had barely entered the world.

      This is NOT my homepage. I'm not joking, people.

  70. Re:everyone should learn English by gweihir · · Score: 1

    I've also heard it said that English is the hardest language in the world to learn. I'm not saying it is, but that's what I'm told by exchange students and their ilk.

    Try learning German. Then you will think that English is fairly easy to learn. Even German kids have a lot of problems with all the rules.

    I am a German native speaker, and frankly I like (British!) English a lot better than German.

    --
    Most ACs are not even worth the keystrokes to insult them. Be generically insulted and ignored otherwise.
  71. Re:everyone should learn English by gweihir · · Score: 1

    ...and a swiss-german keyboards at work...

    Ugh. When I saw the first of these I really had some problems using it. The next thing I did was to get an US (actually EU) one.

    Why is it that most non US/EU keyboard layouts are pretty bad for programming, i.e. []{}\/ and the like are hard to type?

    --
    Most ACs are not even worth the keystrokes to insult them. Be generically insulted and ignored otherwise.
  72. Unicode?! by sulli · · Score: 2
    --

    sulli
    RTFJ.
  73. Unicode -- not just for multilingual by mughi · · Score: 1

    There's another aspect to "going Unicode" that should be brought up, and it has a strong bearing on the business case for doing so.

    It's not just for 'going international' ("We're not going to be selling in Europe, so don't bother"), or even for multilingual ("We don't care if the Spanish-speaking market in the US is big, stick to English"). It's also for cross-platform support.

    Western versions of MS Windows use CodePage 1252, which is close enough to Latin-1 for most marketing types. However, Macs use (surprise, surprise) MacRoman, and the upper half of the character set is wildly different. Go Unicode, and any servers or applications can easily support Windows, Linux, Unix and Macintosh data. And especially with Mac OS X being how it is, suddenly you have an extra market in the US to sell into.
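
    To make the upper-half mismatch concrete, here is the same byte string read under both code pages in Python (the sample bytes are made up; "mac_roman" and "cp1252" are the standard Python codec names):

```python
raw = b"caf\x8e"  # "café" as a classic Mac application would have saved it

print(raw.decode("mac_roman"))  # café
print(raw.decode("cp1252"))     # cafŽ

# Once decoded with the right source encoding, the text can be stored as
# UTF-8 and read back identically on any platform.
stored = raw.decode("mac_roman").encode("utf-8")
print(stored.decode("utf-8"))
```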

  74. Advantages and Disadvantages of UTF-8 by Anonymous Coward · · Score: 3, Insightful
    There seem to be a lot of posts advocating the use of UTF-8 without explaining what the advantages and disadvantages are. Also, some of the posts are simply incorrect.

    Here are some of the advantages and disadvantages of UTF-8:

    • UTF-8 allows you to encode any character in the entire ISO-10646 character set (which is potentially much larger than Unicode since it is a 31-bit code, rather than Unicode which is only a little over 20 bits, or 17 * 65,536 code points). This is probably not of great interest since it is not expected that the ISO character set will ever need to define any characters outside the Unicode range.
    • Strings encoded in UTF-8 can be processed by standard C language routines. A binary 0 embedded in the string can be used as a string terminator just as in 1-byte character sets. Note that routines like strlen() will return the number of bytes rather than the number of characters in a string.
    • UTF-8 preserves the Unicode sorting order so that string comparisons work the way you'd expect without having to convert to Unicode to do the comparison (but note that the Unicode sorting order is not likely to be a useful "language sensitive" sorting order if that matters for your application, so you may still need some way to perform that kind of sort).
    • If you have an arbitrary byte in a string, it is possible to determine unambiguously whether it is the starting byte for a character, and if not you can probe backwards for the starting byte. This is not true of all multibyte character set encodings. This can be very useful for some applications and not at all for others of course.
    • Characters within the ASCII range (00-7f) are transmitted unchanged.
    • Most alphabetic characters (including Hebrew and Arabic characters) are transmitted with only 2 bytes - the same as if you'd stored them as UCS-2 or UTF-16, but not as compact as if you'd stored them with their corresponding ISO 8859-x character set.
    • Ideographic characters and the remaining rare alphabetics within Unicode Plane 0 are transmitted with 3 bytes, which is 50% larger than if they'd been stored with UCS-2 or UTF-16 or (often) with their native computer character set like Shift-JIS.
    • All other Unicode characters (mostly historical Chinese and Japanese characters and character sets for dead languages) can be transmitted in at most 4 bytes.
    • Depending on your display systems, you may need transformation routines to convert to and from other formats used by those systems. For example, many printers or computer fonts that support large character sets might be arranged for use as Shift-JIS or Big5 rather than for Unicode.
    • Because it preserves a certain degree of compatibility with 1-byte character streams, many existing programs and subsystems can coexist with UTF-8 with little or no modification. That does not mean you can count on UTF-8 being safe anywhere that ASCII is safe; you need to evaluate each system on its own merits. However it is quite likely to make your conversion easier.
    Even if you don't use UTF-8 for the external storage format, many projects have found that its advantages make it ideal for processing data in memory. Other times using a fixed-width format (16 or 32-bit) is desirable; fortunately the conversion between UTF-8 and the fixed-width Unicode formats is quite easy and quick.
  75. Re:everyone should learn English by dbremner · · Score: 1

    Um, no. There are three major Romanization standards for Chinese, and pinyin is the standard for the PRC. It's possible to use accent marks and such to indicate pronunciation. Changing keyboards isn't really necessary - there are several input methods. Programming languages don't necessarily have to change, but it's not impossible to do so - Algol68 supported several non-English languages.

    --

    Life is a psychology experiment gone awry.
  76. UTF-8 by Anonymous Coward · · Score: 0

    Use UTF-8 fuckwit. Why do STUPID questions get posted to slashdot?

  77. Re:everyone should learn English by Anonymous Coward · · Score: 0

    Why don't you take a trip to China or Columbia and see how you get on only speaking English?

    I don't know about China, but it's very easy to get along in Colombia while speaking only English.

    It's not as easy as Belgium, and perhaps not as easy as Germany, but one who speaks only English should have no trouble taking a trip in Colombia.

  78. Unicode not adequate for internationalization by Genus+Marmota · · Score: 1
    Is this really the way you want to go, especially given the high cost of porting? It seems unlikely that Unicode will emerge as the ultimate standard, at least for the web. It doesn't support a large enough character space to include all the characters needed by the various Chinese, Japanese, Korean et al. writing forms.

    There are some interesting articles on the subject, among them:

    http://www.hastingsresearch.com/net/04-unicode-lim itations.shtml

    1. Re:Unicode not adequate for internationalization by Genus+Marmota · · Score: 1

      Aaack! The url got munged. Can't seem to get it to appear correctly. Remove the extraneous space :-(

    2. Re:Unicode not adequate for internationalization by Anonymous Coward · · Score: 3, Informative
      The problem is that there aren't currently any reasonable alternatives for handling the problems that you mention. All of the various national character sets and vendor character sets are subsets of Unicode, so if you want to write something today you have little practical alternative.

      There are two basic problems with Unicode: Han unification and ideographic character variations. Essentially all of the various Asian national character sets imply some form of Han unification, and their internal structures are quite different. In either event you are left with having to indicate the original language in order to display the "best possible" glyph, with the added burden that if you use the national character sets you'd have to have multiple interpretation and display systems to handle the very different character set encoding structures.

      The other issue is that of character variations and nuances. Unfortunately there aren't any character coding standards (as opposed to ideas that have been kicked around) that address this at all; if you include the Plane 2 characters in Unicode then it comes closer to handling this than any one national standard.

      I agree that Unicode isn't ideal, but there's nothing on the immediate horizon that looks much better, especially if you need to be able to display text in any language. But if you can restrict yourself to a single language family (European, Hebrew, Arabic, Japanese, Chinese, etc) then there are already alternatives out there. Unicode is designed for applications where you don't have that luxury.

      If you have the need to handle multiple languages simultaneously, you're still probably better off converting to Unicode first and then converting to whatever "ultimate" encoding system emerges in 20 years or so.

    3. Re:Unicode not adequate for internationalization by BJH · · Score: 1

      All of the various national character sets and vendor character sets are subsets of Unicode, so if you want to write something today you have little practical alternative.

      Not true. TRON, for example, has defined its own character set that includes characters not available in Unicode (130,000 in total, currently). See here and here for more info.

  79. Migrating Applications from ASCII to Unicode by xnetinc · · Score: 2, Informative

    It sounds like part of your system uses code pages to communicate in various languages, like a web-based application, while the data portion is not linguistic text but just items that can be represented in ASCII. Some of your applications can only support ASCII, and all the data in your database is ASCII. If it is truly ASCII 0 - 127 (0x7F), i.e. 7-bit clean, then you can often just redefine the database to declare that it contains UTF-8 (Unicode) data. But you must be sure that it is 7-bit clean first. One of the best Unicode support packages for C/C++ code (I assume that this is C) is ICU: http://oss.software.ibm.com/icu/ ICU uses UTF-16, but there is xIUA, http://www.xnetinc.com/xiua/ which is also free open source software and adds UTF-8 support to ICU. Even better, it will allow you to add the support while still running in code page mode first, and then use the same code to support Unicode. It makes it easy to develop hybrid applications that may use Unicode in one part of the application and not in another. It will also allow you to use UTF-8 for database access, UTF-32 to interface with Linux Unicode wchar_t, and a mix of code page and UTF-8 requests to a browser.
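
    The "7-bit clean" check is simple enough to sketch in Python. The first_non_ascii() helper and the sample rows are hypothetical; in practice you would feed it the raw bytes of each text column as exported from the database:

```python
def first_non_ascii(data):
    """Return the offset of the first byte >= 0x80, or -1 if data is 7-bit clean."""
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            return offset
    return -1

# Made-up rows standing in for exported column values.
rows = [b"ACME Corp", b"Caf\xe9 Royal", b"Widgets Inc"]
for number, row in enumerate(rows):
    offset = first_non_ascii(row)
    if offset >= 0:
        print(f"row {number}: byte 0x{row[offset]:02X} at offset {offset} is not ASCII")
```

    Only if every row passes is it safe to simply relabel the data as UTF-8; anything flagged needs a real conversion from its actual code page.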

  80. Encodings by osolemirnix · · Score: 3, Informative
    There is an additional problem with Unicode: you can convert from any encoding to Unicode and back, but the encodings are not necessarily compatible with each other.

    E.g. we had that with two different Japanese kanji encodings (on Sun workstations and Windowze boxes). Both encodings converted to Unicode and back, but each had characters not present in the other. So if you created, say, a filename on one system, converted the string to Unicode and then to the other encoding on the other system, all you got was a lot of gibberish.

    So storing your data in Unicode alone doesn't solve all your problems. All the clients that access that data need to support the same encodings. (E.g. your American Windowze box cannot handle Unicode with kanji stuff unless you have the right language pack installed.)

    Essentially it boils down to: all your clients and servers must use the same encoding, whether you use Unicode or something else.
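
    A small sketch of the same failure, under a loudly stated substitution: instead of the two Japanese encodings from this comment, it uses ISO-8859-1 and ISO-8859-5, because their mappings fit in a few lines. The function names are made up for the example. Both encodings convert cleanly to Unicode, yet text still cannot travel from one to the other when the target lacks the character.

```c
#include <assert.h>
#include <stddef.h>

/* ISO-8859-1 bytes map one-to-one onto U+0000..U+00FF. */
unsigned int latin1_to_unicode(unsigned char byte)
{
    return byte;
}

/* Map a Unicode code point to its ISO-8859-5 (Latin/Cyrillic) byte,
   or return -1 when ISO-8859-5 has no such character. */
int unicode_to_iso8859_5(unsigned int cp)
{
    if (cp <= 0xA0 || cp == 0xAD) return (int)cp;  /* ASCII, C1, NBSP, soft hyphen */
    if (cp == 0xA7) return 0xFD;                   /* section sign */
    if (cp == 0x2116) return 0xF0;                 /* numero sign */
    if (cp >= 0x0401 && cp <= 0x045F
        && cp != 0x040D && cp != 0x0450 && cp != 0x045D) {
        return (int)(cp - 0x0360);                 /* Cyrillic letters */
    }
    return -1;
}

/* Convert n Latin-1 bytes to ISO-8859-5 by way of Unicode.
   Unrepresentable characters become '?'. Returns how many were lost. */
size_t latin1_to_iso8859_5(const unsigned char *src, size_t n,
                           unsigned char *dst)
{
    size_t lost = 0;
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        int mapped = unicode_to_iso8859_5(latin1_to_unicode(src[i]));
        if (mapped < 0) {
            dst[i] = '?';
            lost++;
        } else {
            dst[i] = (unsigned char)mapped;
        }
    }
    return lost;
}
```

    Counting the lost characters, rather than silently substituting, is the design choice that matters here: it lets the sender find out that the receiving system's encoding cannot hold the data before a filename turns into gibberish.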

    --

    Idempotent operation: Like MS software, wether you run it once or often, that doesn't make it any better.
  81. Don't forget this meta tag. by Mustang+Matt · · Score: 2



    We converted Bridge.com to Unicode a couple of years ago. I don't remember all the specifics. We had to change the encoding on a few characters. It wasn't that big of a deal. The only catch I remember is that for one of the Chinese translations we couldn't use Unicode for some weird reason.

    --
    The man who trades freedom for security does not deserve nor will he ever receive either. - Benjamin Franklin
  82. XML & Unicode libraries by melatonin · · Score: 2, Informative
    Apple's CoreFoundation does a great job of dealing with Unicode and XML. It's an OO library written in C, and as such it has string objects and an xml parser/generator that works with its array and dictionary objects. It does an excellent job of abstracting Unicode messiness when working with XML.

    I've found CF a bit cumbersome to use by itself. A wrapper in an OO language like C++ or Objective-C is very convenient. Your Objective-C wrapper is commonly called the Cocoa Foundation framework :)

    It's been ported to Linux and FreeBSD, and I'd recommend it to anyone doing Unicode or XML work. The parser is currently non-validating, but there are so many other 'gifts' that come with CF that makes it worthwhile.

    Hey, it was good enough to build an OS on.

    --
    Moderators should have to take a reading comprehension test.
  83. Re:everyone should learn English by Anonymous Coward · · Score: 0
    everyone would be spitting all over each other. That's just the way the language is.


    He's got a point. Ever watch the Spanish channel? They use so many words they never stop speaking. In English, inflection can be used to indicate all sorts of meaning. But when you are talking 100 words a second, you speak in a monotone. Also, using two or three words of Spanish to every word in English is bad for conceptualization. Everything is 'con' or 'de' of something else. None of the complex word constructions necessary in science and engineering. Like the physics thing. When your language inhibits higher conceptualization, you're disadvantaged.

  84. Re:everyone should learn English by Anonymous Coward · · Score: 0

    A good point. The lack of accent marks and similar stuff makes written English simple. The spelling is quite difficult, making it hard to learn, but it also makes it a very rich language. English has more words than any other language. Advantage? I think so. You can say more things more ways, and I like that a lot. I personally like to spell wonderful as wunderful. I think it's wonderfully wonderful, with some extra wonder on top.

  85. Hastings Article by forgetmenot · · Score: 1

    This whole discussion reminds me of that (inane?) Unicode article by Hastings Research that got flamed to death here on Slashdot some time ago. Here's the link (to the article) if you're interested in reading it again.

  86. Re:Been there, done that, still doing it by Dave_i18n · · Score: 1

    We make b2b2c ecommerce applications and frameworks. We are currently internationalizing all our CRM, SCM, etc. applications to enable them to run in any locale. Converting everything to Unicode is the smaller task, as we found out. To maintain compatibility with legacy character sets and code pages you need to update both ways, which is a big hassle. The chances of running into Unicode characters that have no equivalent in a legacy encoding are slim.
    The bigger challenges are database improvements to handle multiple concurrent languages and to have them sorted per locale. Each piece of data that gets transferred carries info about its locale, which is especially important when you have to handle multiple currencies at once (as with multiple supplier quotes from different countries). Not even to mention when and how the exchange rates get updated. Oh, and when does an offer expire, in the client timezone or in the server timezone?
    We have to support multiple platforms - MS NT & Win2000, Solaris, AIX, HP-UX, Oracle, DB2, IE 5x, Netscape ...... oh, and the 3rd party SW nightmares for webservers .... and Java is different everywhere too. C++ also has its issues. We are moving to ICU for Java and C++, since Java doesn't cut it for internationalization.
    Converting everything to Unicode and maintaining compatibility with legacy installations was the easiest task of them all.
    Dave

  87. Re:everyone should learn English by de+Selby · · Score: 1

    English is one of the hardest languages to learn CORRECTLY, but no one learns to use it correctly, so it doesn't matter.

    English is really easy because you don't have to obey _any_ of the rules to be understood. It's just getting everything correct to impress someone that's hard.

  88. Re:everyone should learn English by NeoSkandranon · · Score: 1

    I did try German, for two semesters anyhow. Not going back unless I have to... vocabulary and phonics weren't so bad for me, but I couldn't deal with all the gender-dependent stuff.

    --
    If you can't see the value in jet powered ants you should turn in your nerd card. - Dunbal (464142)
  89. Re:everyone should learn English by Anonymous Coward · · Score: 0
    It's very easy in Germany. My mother, brother and a few friends have already visited me here, and not one can speak the language.

    woof.

  90. Re:everyone should learn English by Anonymous Coward · · Score: 0
    I didn't say most. I tried to make a brief statement that there's a lot more Chinese speakers who do English than the other way 'round.

    You accept, however, that there are major differences and that people grouped as speaking "one language" do not do so. Why did I write "Putonghua" instead of "Mandarin"? How many people besides the few pedantic linguists like us know it? Three? Hell, they wouldn't use that one for the million-dollar question!

    "Given that a person has a good grasp..."
    Not quite given. China has a bit of a literacy problem outside of the demonstration/showcase cities. Almost anyone can read the "chu" char (box with piercing line meaning "through" or "centre") but reading the "xiao" character may be a bit beyond the average farmer, much less beyond the abilities of most non-Asians.

    woof.

    posted anon to save mods wasting their points. you're welcome.

  91. From UTF-8 to ASCII (ISO-8859-1) conversion by sumengen · · Score: 1

    How about converting UTF-8 to ASCII? Are there any free tools to do that?

    1. Re:From UTF-8 to ASCII (ISO-8859-1) conversion by Anonymous Coward · · Score: 0

      ok, I'll bite.
      ASCII is a very standard 7 bit encoding,
      ISO-8859-1 is one possible 8 bit encoding.
      If your text contains only characters in the ASCII range, then your UTF-8 string *is* an ASCII string.

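      If your text contains characters in the ISO-8859-1 range, then you need to convert your UTF-8 string to UCS-2, then keep only the low byte of each character. This works because ISO-8859-1 maps directly onto Unicode code points U+0000 through U+00FF. Any character above U+00FF has no ISO-8859-1 equivalent and has to be replaced or rejected, not truncated.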
      It fits in a dozen lines of code, and can be written easily by reading how UTF-8 works.
      (a search for it on the web would probably bring up a bunch of samples as well.)
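
      For what it's worth, here is roughly what that "dozen lines" could look like in C. It is a sketch, not a vetted converter: the name utf8_to_latin1 is made up, it decodes straight to ISO-8859-1 without an intermediate UCS-2 buffer, and the choice of '?' as the substitute for anything unrepresentable or malformed is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Decode UTF-8 from src and write ISO-8859-1 bytes to dst.
   Code points above U+00FF and malformed sequences become '?'.
   dst must have room for at least src_len bytes.
   Returns the number of bytes written. */
size_t utf8_to_latin1(const unsigned char *src, size_t src_len,
                      unsigned char *dst)
{
    size_t in = 0, out = 0;
    assert(src != NULL && dst != NULL);
    while (in < src_len) {
        unsigned char b = src[in];
        if (b < 0x80) {
            /* ASCII byte: identical in both encodings */
            dst[out++] = b;
            in += 1;
        } else if ((b == 0xC2 || b == 0xC3) && in + 1 < src_len
                   && (src[in + 1] & 0xC0) == 0x80) {
            /* two-byte sequence encoding U+0080..U+00FF */
            dst[out++] = (unsigned char)(((b & 0x1F) << 6)
                                         | (src[in + 1] & 0x3F));
            in += 2;
        } else {
            /* outside Latin-1 or malformed: substitute, skip the sequence */
            dst[out++] = '?';
            in += 1;
            while (in < src_len && (src[in] & 0xC0) == 0x80) {
                in++;
            }
        }
    }
    return out;
}
```

      The output is never longer than the input, so a destination buffer of src_len bytes is always enough.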

  92. Moderators, This should not be score0! by Yudit · · Score: 1


    + absolute ASCII compatibility.
    + ASCII is a 7 bit standard encoding. UTF-8 uses
    + those first seven bits EXACTLY the same as
    + regular ASCII.

    This IS informative.

  93. Unicode - new possibilities for trolling? by Anonymous Coward · · Score: 0
    Advanced Troll Research presents:

    &# 20006; &# 20016; &# 20026; &# 20036; &# 20046; &# 20056; &# 20066; &# 20076; &# 20086; &# 20096; &# 20106; &# 20116; &# 20126; &# 20136; &# 20146; &# 20156; &# 20166; &# 20176; &# 20186; &# 20196; &# 20206; &# 20216; &# 20226; &# 20236; &# 20246; &# 20256; &# 20266; &# 20276; &# 20286; &# 20296; &# 20306; &# 20316; &# 20326; &# 20336; &# 20346; &# 20356; &# 20366; &# 20376; &# 20386; &# 20396; &# 20406; &# 20416; &# 20426; &# 20436; &# 20446; &# 20456; &# 20466; &# 20476; &# 20486; &# 20496; &# 20506; &# 20516; &# 20526; &# 20536; &# 20546; &# 20556; &# 20566; &# 20576; &# 20586; &# 20596; &# 20606; &# 20616; &# 20626; &# 20636; &# 20646; &# 20656; &# 20666; &# 20676; &# 20686; &# 20696; &# 20706; &# 20716; &# 20726; &# 20736; &# 20746; &# 20756; &# 20766; &# 20776; &# 20786; &# 20796; &# 20806; &# 20816; &# 20826; &# 20836; &# 20846; &# 20856; &# 20866; &# 20876; &# 20886; &# 20896; &# 20906; &# 20916; &# 20926; &# 20936; &# 20946; &# 20956; &# 20966; &# 20976; &# 20986; &# 20996; &# 21006; &# 21016; &# 21026; &# 21036; &# 21046; &# 21056; &# 21066; &# 21076; &# 21086; &# 21096; &# 21106; &# 21116; &# 21126; &# 21136; &# 21146; &# 21156; &# 21166; &# 21176; &# 21186; &# 21196; &# 21206; &# 21216; &# 21226; &# 21236; &# 21246; &# 21256; &# 21266; &# 21276; &# 21286; &# 21296; &# 21306; &# 21316; &# 21326; &# 21336; &# 21346; &# 21356; &# 21366; &# 21376; &# 21386; &# 21396; &# 21406; &# 21416; &# 21426; &# 21436; &# 21446; &# 21456; &# 21466; &# 21476; &# 21486; &# 21496; &# 21506; &# 21516; &# 21526; &# 21536; &# 21546; &# 21556; &# 21566; &# 21576; &# 21586; &# 21596; &# 21606; &# 21616; &# 21626; &# 21636; &# 21646; &# 21656; &# 21666; &# 21676; &# 21686; &# 21696; &# 21706; &# 21716; &# 21726; &# 21736; &# 21746; &# 21756; &# 21766; &# 21776; &# 21786; &# 21796; &# 21806; &# 21816; &# 21826; &# 21836; &# 21846; &# 21856; &# 21866; &# 21876; &# 21886; &# 21896; &# 21906; &# 21916; &# 21926; &# 21936; &# 21946; &# 21956; &# 21966; &# 21976; &# 21986; &# 
21996; &# 22006; &# 22016; &# 22026; &# 22036; &# 22046; &# 22056; &# 22066; &# 22076; &# 22086; &# 22096; &# 22106; &# 22116; &# 22126; &# 22136; &# 22146; &# 22156; &# 22166; &# 22176; &# 22186; &# 22196; &# 22206; &# 22216; &# 22226; &# 22236; &# 22246; &# 22256; &# 22266; &# 22276; &# 22286; &# 22296; &# 22306; &# 22316; &# 22326; &# 22336; &# 22346; &# 22356; &# 22366; &# 22376; &# 22386; &# 22396; &# 22406; &# 22416; &# 22426; &# 22436; &# 22446; &# 22456; &# 22466; &# 22476; &# 22486; &# 22496; &# 22506; &# 22516; &# 22526; &# 22536; &# 22546; &# 22556; &# 22566; &# 22576; &# 22586; &# 22596; &# 22606; &# 22616; &# 22626; &# 22636; &# 22646; &# 22656; &# 22666; &# 22676; &# 22686; &# 22696; &# 22706; &# 22716; &# 22726; &# 22736; &# 22746; &# 22756; &# 22766; &# 22776; &# 22786; &# 22796; &# 22806; &# 22816; &# 22826; &# 22836; &# 22846; &# 22856; &# 22866; &# 22876; &# 22886; &# 22896; &# 22906; &# 22916; &# 22926; &# 22936; &# 22946; &# 22956; &# 22966; &# 22976; &# 22986; &# 22996; &# 23006; &# 23016; &# 23026; &# 23036; &# 23046; &# 23056; &# 23066; &# 23076; &# 23086; &# 23096; &# 23106; &# 23116; &# 23126; &# 23136; &# 23146; &# 23156; &# 23166; &# 23176; &# 23186; &# 23196; &# 23206; &# 23216; &# 23226; &# 23236; &# 23246; &# 23256; &# 23266; &# 23276; &# 23286; &# 23296; &# 23306; &# 23316; &# 23326; &# 23336; &# 23346; &# 23356; &# 23366; &# 23376; &# 23386; &# 23396; &# 23406; &# 23416; &# 23426; &# 23436; &# 23446; &# 23456; &# 23466; &# 23476; &# 23486; &# 23496; &# 23506; &# 23516; &# 23526; &# 23536; &# 23546; &# 23556; &# 23566; &# 23576; &# 23586; &# 23596; &# 23606; &# 23616; &# 23626; &# 23636; &# 23646; &# 23656; &# 23666; &# 23676; &# 23686; &# 23696; &# 23706; &# 23716; &# 23726; &# 23736; &# 23746; &# 23756; &# 23766; &# 23776; &# 23786; &# 23796; &# 23806; &# 23816; &# 23826; &# 23836; &# 23846; &# 23856; &# 23866; &# 23876; &# 23886; &# 23896; &# 23906; &# 23916; &# 23926; &# 23936; &# 23946; &# 23956; &# 23966; &# 23976; &# 23986; &# 
23996; &# 24006; &# 24016; &# 24026; &# 24036; &# 24046; &# 24056; &# 24066; &# 24076; &# 24086; &# 24096; &# 24106; &# 24116; &# 24126; &# 24136; &# 24146; &# 24156; &# 24166; &# 24176; &# 24186; &# 24196; &# 24206; &# 24216; &# 24226; &# 24236; &# 24246; &# 24256; &# 24266; &# 24276; &# 24286; &# 24296; &# 24306; &# 24316; &# 24326; &# 24336; &# 24346; &# 24356; &# 24366; &# 24376; &# 24386; &# 24396; &# 24406; &# 24416; &# 24426; &# 24436; &# 24446; &# 24456; &# 24466; &# 24476; &# 24486; &# 24496; &# 24506; &# 24516; &# 24526; &# 24536; &# 24546; &# 24556; &# 24566; &# 24576; &# 24586; &# 24596; &# 24606; &# 24616; &# 24626; &# 24636; &# 24646; &# 24656; &# 24666; &# 24676; &# 24686; &# 24696; &# 24706; &# 24716; &# 24726; &# 24736; &# 24746; &# 24756; &# 24766; &# 24776; &# 24786; &# 24796; &# 24806; &# 24816; &# 24826; &# 24836; &# 24846; &# 24856; &# 24866; &# 24876; &# 24886; &# 24896; &# 24906; &# 24916; &# 24926; &# 24936; &# 24946; &# 24956; &# 24966; &# 24976; &# 24986; &# 24996;

    Thank you.

    1. Re:Unicode - new possibilities for trolling? by Anonymous Coward · · Score: 0

      Wow.

  94. Re:everyone should learn English by Anonymous Coward · · Score: 0

    Why is it that most non US/EU keyboard layouts are pretty bad for programming

    Oh, it must be that the evil humanists are trying to convince programmers that they should be working to make other people's life easier and not turn it to a bobdamn nightmare.

    Who do you think uses keyboards more, programmers or non-programmers? Hm? Do they possibly need to type in ä or ö frequently?

    I really hope your name includes a foreign character, so that you would think about who's typing your name on your paycheck.

  95. Don't by Alex+Belits · · Score: 3, Informative

    Unicode does not solve any problems with multilingual text processing -- what it solves is not a problem (having a non-iso8859-1 native language, I am qualified to testify that displaying and representing data in various languages hasn't been a problem for at least 30 years already), and the real problems -- rules, matching, hyphenation, spell checking, etc. -- remain problems with Unicode just like they are without it.

    To make it possible to process, transfer and store the data in multiple languages one does not need Unicode -- in fact Unicode usually only adds additional step that requires some knowledge of language context that may be unknown, unavailable for some kind of processing, or simply not disclosed by end-users. What is necessary is byte-value transparency, so text in multiple languages at least will not be distorted by "too smart" procedures that cut the upper bits or make some other ASCII-centric assumptions. If/when users will care about marking languages in a way more advanced than iso 2022, they probably will find byte-value transparent channels to be suitable for whatever they will use.

    However if/when real usable languages-handling infrastructure that will solve those problems will be created, it won't need unicode because it will have language metadata attached to the text already, and without metadata, text, in unicode or in native charsets, is not usable for most of applications if it's not somehow already known what language it is supposed to be in.

    --
    Contrary to the popular belief, there indeed is no God.
  96. What about all those dirty malloc's? by chriscera · · Score: 1

    I know several people who do this to declare an array:

    buf = (char *) malloc (size_array);

    when it should be this:

    buf = (char *) malloc (size_array * sizeof(char) );

    Won't a switch to unicode break all the code that looks like scenario #1 above? Is there any way to switch this on a system-wide level? I guess redefining malloc as a variable-length argument function or an elaborate #define might work. I dunno, just a core dump here but I haven't seen any of the followups address this issue. Even if this isn't an issue, could somebody please followup with a solution.

    Now that I'm aware that a unicode switch in the future might be inevitable I try to do the right thing, but I've seen this in soooo many places, especially old-school C hacker types.

    --
    -- Who needs windows and gates in a world w/o walls and fences?
    1. Re:What about all those dirty malloc's? by Alex+Belits · · Score: 2

      sizeof(char) will always be 1, with Unicode, multibyte encodings, variable-length encodings (for example Unicoders' favorite monstrosity UTF-8) or anything else. One just shouldn't treat one char variable as one displayed glyph, and may have to use wide character type to describe them instead of char.
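
      To make that concrete, here is a small sketch in C, assuming the text is stored as valid UTF-8. The helper names utf8_codepoints and utf8_dup are invented for the example. Allocation is sized in bytes, which is what strlen reports, and counting characters is a separate operation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Count code points in a NUL-terminated UTF-8 string by counting every
   byte that is not a continuation byte (10xxxxxx).
   Assumes the input is valid UTF-8. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}

/* Allocate and copy a UTF-8 string. The size is measured in bytes, not
   characters; sizeof(char) is 1 by definition, so no multiplier is
   needed. Returns NULL on allocation failure. The caller frees. */
char *utf8_dup(const char *s)
{
    size_t bytes;
    char *copy;
    assert(s != NULL);
    bytes = strlen(s) + 1;      /* byte length plus terminating NUL */
    copy = malloc(bytes);
    if (copy != NULL) {
        memcpy(copy, s, bytes);
    }
    return copy;
}
```

      So the malloc calls in the parent comment keep working with UTF-8 as long as size_array is a byte count. What breaks is code that assumes a byte count equals a character count, for example when truncating a string to fit a fixed-width field.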

      --
      Contrary to the popular belief, there indeed is no God.
  97. really? by Anonymous Coward · · Score: 0

    Do you really need to update it?

  98. Re:everyone should learn English by Teratogen · · Score: 1

    I'll reply with a quote from Heinlein:

    "English is the largest of the human tongues, with several times the vocabulary of the second largest language - this alone made it inevitable that English would eventually become, as it did, the lingua franca of this planet, for it is thereby the richest and most flexible - despite its barbaric accretions ... Its very variety, subtlety, and utterly irrational, idiomatic complexity makes it possible to say things in English which simply cannot be said in any other language."

    --
    --- even the safest course is fraught with peril
  99. UTF-8 and Unicode FAQ for Unix/Linux by Anonymous Coward · · Score: 0
  100. Re:everyone should learn English by Anonymous Coward · · Score: 0

    [off-topic, but trying to make a point]

    "Everyone should program in BASIC. It's the language that everyone understands."

    "MS Paintbrush should be the standard, because almost everyone has it on their PC."

    Etc., etc.

    English is a terribly primitive language when you compare it with German, Portuguese, etc. In fact, that's why everyone can speak it, just like everyone can learn BASIC quickly. But there are some things you just can't do right in BASIC. I speak seven languages (four fluently) and I think that fact plays a major role in my ability to program computers. Everyone should learn to speak at the very least two different languages.

    As someone said, language gives shape to your thoughts and determines what you can (and cannot) think about.

  101. Re:everyone should learn English by Anonymous Coward · · Score: 0

    That's a brilliant idea. You can't cope with the multilingual environment you've put yourself in, the craftsmen don't want headaches, and the businessmen don't want to spend money on the markets they are going to suck all the money out of. The solution: Make everyone use English. We are really solving the world's problems here.

    Fucking brilliant. Especially if you consider the world only Europe and certain former British colonies...oh, and if you 'forget' that even in these countries people don't necessarily speak English.