Slashdot Mirror


Migrate a MySQL Database Preserving Special Characters

TomSlick writes "Michael Chu's blog provides a good solution for people migrating their MySQL databases and finding that special characters (like smart quotes) get mangled. He presents two practical solutions to migrating the database properly."

21 of 98 comments (clear)

  1. Migration by dfetter · · Score: 4, Informative

    Better still, install DBI-Link http://pgfoundry.org/projects/dbi-link/ inside PostgreSQL, migrate once and have done ;)

    --
    What part of "A well regulated militia" do you not understand?
    1. Re:Migration by alexandreracine · · Score: 2, Informative

      Yeah, but he's talking about mysql, not pg. Use my solution : http://gallery.menalto.com/node/61073#comment-2238 08

      --
      No sig for now.
  2. Pffft Easy... by Frogbert · · Score: 4, Funny

    First you get the names of every table in the old database

    Then you create these tables in the new one. Just so there are no problems with data types you should probably just make every field varchar(100) in the new one.

    Then you fire up MS Access, the older the better. I try to stick to Access 95.

    Then you create two ODBC links, one to your old one and one to the new one.

    Then you use the linked table manager to link each table to ms access.

    Then you open both the new table and the old table and select all, copy and paste the data into the new table.

    It's so simple even a child could do it!

    1. Re:Pffft Easy... by Negatyfus · · Score: 4, Funny

      Linux doesn't use databases. Flat text files and grep work just as well as this overly complicated "SQL" crap.

    2. Re:Pffft Easy... by Jacco+de+Leeuw · · Score: 2, Funny

      So very simple,
      that only a child can do it!

      --
      -------
      Warning: Slashdot may contain traces of nuts.
  3. Re:how about ... by Majik+Sheff · · Score: 3, Informative

    It usually works as long as you're staying on the same architecture. I successfully pulled this off when a client's DB server died horribly with no functional backups in sight. I salvaged the vast majority of the binary tables and dropped them into a fresh install of MySQL. After the migration of the binaries I renamed the tables to *_bak and told MySQL to dump the contents into freshly created data files. Then I set my client on the task of assessing damage to the data. I would only recommend this tactic if you're doing crash recovery on a borked system, as there are risks. P.S. Have you hugged your backups today?

    --
    Women are like electronics: you don't know how damaged they are until you try to turn them on.
  4. Unicode integration woes by DJ+Rubbie · · Score: 4, Interesting

    As I understand it, the problem arises from the fact that mysqldump uses utf8 for character encoding while, more often than not, mysql tables default to latin1 character encoding. (If you were smart enough to manually set the character encoding to utf8, then you'll have no problems - everyone running mysql 4.0 or early will be using latin1 since it didn't support any other encodings.) So lets say we have a database named example_db with tables that have varchar and text columns. If you have special characters that are really UTF-8 encoded characters stored in the db, it works just fine until you try to move the db to another server.

    That bit me one time when one of my live servers crashed and I had to load the data on the backup onto a different server. I remember wondering to myself, when was the good old days when a database was a dumb (smart, depending your POV) engine that only worries about a string of bytes (chars). Seriously, it only should become smarter and start talking in unicode only when I want it to.

    Issues with using unicode is not just limited to MySQL, as Python have similar issues. However they are almost always caused by poor programmers who mixes usage between the two, or not doing type checking on the proper type (basestring).
    --
    Please direct all bug reports to /dev/null
  5. mysqldump is too think by hpavc · · Score: 2, Interesting

    This guys mysqldump statement could use some args, too much is packed in his my.cnf defaults to make this truely useful as a how to. He could easily cause more problems than he is solving.

    --
    members are seeing something, your seeing an ad
  6. Re:Useful? This is damned awesome! by jamshid · · Score: 5, Informative

    Then send the wordpress developers this link:

    http://www.joelonsoftware.com/articles/Unicode.htm l
    The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

  7. Re:How is this news? by kestasjk · · Score: 5, Funny

    If there's a chance of starting a PostgreSQL vs MySQL flamewar, it's news.

    --
    // MD_Update(&m,buf,j);
  8. What's with the sudden influx of gnubies? by rylin · · Score: 4, Insightful

    Not a single day seems to go by without a gnubie or two posting things that are really basic knowledge.
    If you do insert unicode data into a latin1 table, you will get unexpected results.

    What you do is make sure that your:
    a) database(s) are set to utf8 by default
    b) table(s) are set to utf8 by default
    c) column(s) are set to utf8 by default
    d) connection defaults to utf8
    (provided, of course, that it's utf8 you're after)

    That way, it'll "Just Work"(tm)

    We've gone through upgrades from 3.23 -> 4.0 -> 4.1 -> 5.0 and never had a problem; and yes, our tables were all latin1 from the beginning.

    1. Re:What's with the sudden influx of gnubies? by cortana · · Score: 2, Insightful

      Not a single day seems to go by without a gnubie or two posting things that are really basic knowledge.
      If you do insert unicode data into a latin1 table, you will get unexpected results. Ah, I love MySQL. They should fix it so that if you insert unicode data where latin1 data is expected, you get an error instead of silent data corruption.
    2. Re:What's with the sudden influx of gnubies? by AaronLawrence · · Score: 2, Interesting

      The whole point of UTF-8 is that it can silently be inserted in places that were designed to handle ASCII. So no, there is no way for something which is handling latin1 to know that what you gave it is actually UTF8 and therefore not legal.

      --
      For every expert, there is an equal and opposite expert. - Arthur C. Clarke
  9. Re:How is this news? by arivanov · · Score: 2, Interesting

    True.

    As well as a chance of posting an arcane method of database transition involving MySQL to start an ACID war.

    As well as on the original subject of the article - the best way to migrate an application is to load all of the data from one datasource and dump it into another datasource. If the application fails this trivial test its database access libraries are broken. If the app sticks strictly to dynamic SQL, high level DBI functions and does no manual escaping - it just works. The escaping portion of the SQL libs take care of it and ensure it is mapped correctly both ways. If the app tries to escape by hand, sticks data into teh SQL statement itself, etc - it fails. Same for utf/latin transitions and the like.

    --
    Baker's Law: Misery no longer loves company. Nowadays it insists on it
    http://www.sigsegv.cx/
  10. Re:smart quotes? more like stupid fucked up quotes by frisket · · Score: 2, Funny

    Real IT professionals don't use Outlook.

  11. Re:Useful? This is damned awesome! by Hognoxious · · Score: 2, Interesting

    I'm not American, and I'm sitting here supporting a multinational IT system (Italy, Belgium, Netherlands, UK, Italy, Spain & Portugal) and it works fine without unicode. While I'm generally a fan of Joel I think he overstates the case here.

    --
    Confucius say, "Find worm in apple - bad. Find half a worm - worse."
  12. Big Trouble in Little China. Don't use UCS-2. by argent · · Score: 5, Informative

    UCS-2 only covers plane zero (the Basic Multilingual Plane, or BMP). It doesn't cover code points outside that. Unicode actually supports the entire UCS, all 1.1 million (and growing) code points.

    In other words, Joel has made the same mistake as the people who wrote software that only works in 7-bit ASCII or 8-bit UTF-8 or the IBM or Apple or Adobe 8-bit extended ASCII sets or the 9-bit extended ASCII set that ITS used, or...

    And it's already too late to try and cram everything into 2 bytes. After the Han Unification mess (the attempt to force Chinese and Japanese and everything else that used some variant of Chinese ideograms (Kanji, etc...) into a common subset of similar characters that fit in the 65535 available codes in the BMP) the People Republic of China decided to require their computers to support their national encoding anyway. As of 2000.

    So you have to support the full UCS encoding anyway.

    There's three storage formats that it's practical to use: UCS-4 (4 bytes per character, with the same byte-ordering problems as UCS-2), UTF-16 (2-4 bytes per character, same as UCS-2 for the BMP) or UTF-8 (1-4 bytes per character). Internally: you can use UCS-4 as your wide character type, and translate on the fly; use UTF-8 and use care to avoid breaking strings in the middle of glyphs or use UTF-16 and translate on the fly and use care to avoid breaking strings in the middle of glyphs.

    If Joel is lucky the libraries he's using are actually operating on UTF-16 strings instead of UCS-2 strings. If he's *really* lucky they're designed to avoid breaking up codes outside the BMP. If he's *super* lucky he's managed to avoid creating any code that just operates on strings as a stream of wchar_t anywhere.

    Personally, I think that UTF-16 gets you the worst of both worlds: your data is almost certainly less compact than if you use UTF-8; you still have to deal with multi-wchar_t strings so your code is no easier to write than if you used UTF-8... you're just less likely to find bugs in testing; and you get byte order issues in files just like you would with UCS-4. Unless you think UCS-2 is "good enough" and you just ignore everything outside the BMP and discover that people in China are suddenly getting hash when they use your program.

    1. Re:Big Trouble in Little China. Don't use UCS-2. by argent · · Score: 2, Informative

      Assuming he's ONLY using Windows string APIs.

      First, you need to be religious about it. But if you are, then the choice of internal encoding is really a performance issue only, and the choice of external encoding is a matter of following the principle of least astonishment. Your code shouldn't know nor care what encoding the string APIs use internally. The program should work the same whether wchar_t is (unsigned char), (unsigned short), (unsigned long), or even (double).

      Second, there's a lot of overhead in canned string code. Depending on the problem space that may be OK. For short strings or simple operations, or where there's significant per-token overhead otherwise, the overhead of dipping into the API for each character isn't significant. When performance matters, though, even inlined code can slow the critical path in the program down orders of magnitude. Getting decent performance requires a parser be character-set aware.

      So the advice should be:

      1. "Use a canned string handling library that's unicode-aware". That means using at least the ISO C wide character libraries and NOT depending on implementation details like whether they're UCS-2 or UTF-16 or UCS-4 or UTF-8 or for god's sake UTF-1 internally.
      2. "Only export data in UTF-8".
      3. "When performance matters, figure out what's most efficient for your problem space, and use that." And if you're not sure, performance probably doesn't matter as much as you think it does. And don't forget that cache matters and branches cost dozens of instructions.

    2. Re:Big Trouble in Little China. Don't use UCS-2. by tepples · · Score: 2, Informative

      Why can't you maintain characters in composed form internally? Because the set of composite characters that have a composed form differs from Unicode version to Unicode version.
    3. Re:Big Trouble in Little China. Don't use UCS-2. by argent · · Score: 2, Interesting

      Not to mention that if you did this, you suddenly need a whole bunch of code to take all incoming text and fix it up so that everything is precomposed

      That's no different from "you need a bunch of code to take the incoming text and convert it to UCS-2 (or UTF-8, or UTF-16, or UCS-4)".

      it's possible to create legal combinations which have no single unichar replacement.

      Are they meaningful as well as legal, or ar they like the "n with an umlaut" in "Spinal Tap"?

      I'm of the opinion that there should be *no* precomposed characters, or they should *all* be precomposed.

      In any case, you can always use a guaranteed unused code and use a lookup table on input and output.

  13. Small teapot in Big China. UCS-2 slices and dices. by epine · · Score: 3, Interesting
    That was a good post, but I don't understand your premise whatsoever. There seems to be two tactics at work here: arbitrary line drawing, and the belief that if you can't make everyone happy the best compromise is to make everyone unhappy. I read that post by Joel long ago, and I just read it again. I don't think he could have done a better job in the space devoted to it.

    My one criticism of Joel was passing himself the "get out of jail free" card. Before I get started, I should warn you that if you are one of those rare people who knows about internationalization, you are going to find my entire discussion a little bit oversimplified. This is a fair disclaimer, but it makes it impossible to judge where Joel was simplifying deliberately and where he simplified because he didn't know any better. The correction would be for Joel to state "I'm going to simplify issues X, Y, and Zed". Then mistakes in the middle of the alphabet would be entirely his own. Just as there is no such thing as a string without a coding system, there is no such thing as a useful disclaimer that doesn't specify precisely what it disclaims. It amused me to see Joel invoke the ASCII standard of accountability.

    Concerning the claim that Joel has made the same mistake [over again], this same claim comes up all the time concerning address arithmetic. How much existing code is portable to a 128 bit address size? We're sure to need this by 2050. Or perhaps not. People tend to neglect the observation that we're talking about a doubly exponential progression in codespace: (2^2^3)^2^N, with the values N=0,1,2,3,4 plausible in photolithographic solid state. On the current progression, for N=5 transistors would need to be subatomic. As far as the present transition from 32 bits to 64 bits of address space, it makes sense that operating systems and file systems are 64-bit native, while 99% of user space applications continue to run in less time and space compiled for 32 bits. Among the growing sliver of applications that do run better in 64-bits are a few applications of especially high importance.

    I worked extensively with CJK languages in the early 1990s, and my opinion at the time was that UCS-4 primarily catered to the SETI crowd, and potentially, belligerent Mandarins in mainland China. I recall more argument at the time about Korean, which is a syllabic script masquerading as ideographic blocks.

    http://en.wikipedia.org/wiki/Hangul

    I've always had a lot of trouble understanding the opposition to Han unification. Many distinctions in the origins of the English language were lost in the adoption of ASCII, such as the ae ligature and the old-English thorn (which causes many Hollywood sets to feature "Ye old saloon").

    http://en.wikipedia.org/wiki/Han_unification
    http://en.wikipedia.org/wiki/Thorn_(letter)

    ... Unicode now encodes far more [Han] characters than any other standard, and far more than were listed in any dictionary, with many more being processed for inclusion as fast as the scholars can agree on their identities.

    Some characters used only in names are not included in Unicode. This is not a form of cultural imperialism, as is sometimes feared. These characters are generally not included in their national character sets either.

    And all this fits quite nicely in UCS-2 as advocated by Joel.

    A slight difference in rendering characters might be considered a serious problem if it changes the meaning or reflects the wrong cultural tradition. Besides a simple nuisance like Japanese text looking like Chinese, names might be displayed with a different glyph -- the same character in the sense of encoding but a different character in the view of the users. This rendering problem is often employed to criticize Westerners for not being aware o