Migrate a MySQL Database Preserving Special Characters

Posted by kdawson on Sunday May 6, 2007 @05:30PM from the encoding-juju dept.

TomSlick writes "Michael Chu's blog provides a good solution for people migrating their MySQL databases and finding that special characters (like smart quotes) get mangled. He presents two practical solutions to migrating the database properly."

Migration by dfetter · 2007-05-06 17:38 · Score: 4, Informative

Better still, install DBI-Link http://pgfoundry.org/projects/dbi-link/ inside PostgreSQL, migrate once and have done ;)

--
What part of "A well regulated militia" do you not understand?
Pffft Easy... by Frogbert · 2007-05-06 17:53 · Score: 4, Funny

First you get the names of every table in the old database

Then you create these tables in the new one. Just so there are no problems with data types you should probably just make every field varchar(100) in the new one.

Then you fire up MS Access, the older the better. I try to stick to Access 95.

Then you create two ODBC links, one to your old one and one to the new one.

Then you use the linked table manager to link each table to ms access.

Then you open both the new table and the old table and select all, copy and paste the data into the new table.

It's so simple even a child could do it!
1. Re:Pffft Easy... by Negatyfus · 2007-05-06 18:43 · Score: 4, Funny
  
  Linux doesn't use databases. Flat text files and grep work just as well as this overly complicated "SQL" crap.
Unicode integration woes by DJ+Rubbie · 2007-05-06 18:15 · Score: 4, Interesting

As I understand it, the problem arises from the fact that mysqldump uses utf8 for character encoding while, more often than not, mysql tables default to latin1 character encoding. (If you were smart enough to manually set the character encoding to utf8, then you'll have no problems - everyone running mysql 4.0 or earlier will be using latin1 since it didn't support any other encodings.) So let's say we have a database named example_db with tables that have varchar and text columns. If you have special characters that are really UTF-8 encoded characters stored in the db, it works just fine until you try to move the db to another server.
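To make the failure concrete, here is a minimal Python sketch of the round trip described above: UTF-8 bytes get read as if they were latin1 and are then written out as UTF-8 again. It assumes MySQL's latin1 behaves like Windows-1252 (Python's cp1252 codec), and the function names mangle and repair are made up for illustration. The repair only works if every byte survived the bad round trip.

```python
def mangle(text: str) -> bytes:
    """Simulate the bug: UTF-8 bytes misread as cp1252, then re-encoded as UTF-8."""
    raw = text.encode("utf-8")        # what the application actually stored
    misread = raw.decode("cp1252")    # the bytes are treated as latin1 text
    return misread.encode("utf-8")    # the dump converts that "latin1" text to utf8


def repair(data: bytes) -> str:
    """Undo the double encoding by reversing the two steps above."""
    return data.decode("utf-8").encode("cp1252").decode("utf-8")


original = "It\u2019s"                # "It's" written with a smart quote
mangled = mangle(original)
print(mangled.decode("utf-8"))        # the smart quote is now three junk characters
print(repair(mangled) == original)
```

Some smart quotes (the right double quote, for one) contain a byte that Python's cp1252 codec refuses to decode, so a real repair script would need a more forgiving mapping than this sketch uses.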

That bit me one time when one of my live servers crashed and I had to load the data from the backup onto a different server. I remember wondering to myself what happened to the good old days when a database was a dumb (or smart, depending on your POV) engine that only worried about a string of bytes (chars). Seriously, it should only become smarter and start talking in Unicode when I want it to.

Issues with using Unicode are not limited to MySQL; Python has similar issues. However, they are almost always caused by poor programmers who mix the two string types (str and unicode), or who don't type-check against the proper type (basestring).
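A rough sketch of the "don't mix the two" advice, written for Python 3, where basestring no longer exists and the two types are bytes and str. The helper name to_text is hypothetical; the idea is simply to decode once at the boundary and work with text everywhere else.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding bytes exactly once at the boundary."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, str):
        return value
    # Refuse anything else instead of silently calling str() on it.
    raise TypeError(f"expected bytes or str, got {type(value).__name__}")


print(to_text(b"caf\xc3\xa9") == to_text("caf\u00e9"))
```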

--
Please direct all bug reports to /dev/null
Re:Useful? This is damned awesome! by jamshid · 2007-05-06 18:38 · Score: 5, Informative

Then send the wordpress developers this link:

http://www.joelonsoftware.com/articles/Unicode.html
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Re:How is this news? by kestasjk · 2007-05-06 18:39 · Score: 5, Funny

If there's a chance of starting a PostgreSQL vs MySQL flamewar, it's news.

--
// MD_Update(&m,buf,j);
What's with the sudden influx of gnubies? by rylin · 2007-05-06 19:12 · Score: 4, Insightful

Not a single day seems to go by without a gnubie or two posting things that are really basic knowledge.
If you do insert unicode data into a latin1 table, you will get unexpected results.

What you do is make sure that your:
a) database(s) are set to utf8 by default
b) table(s) are set to utf8 by default
c) column(s) are set to utf8 by default
d) connection defaults to utf8
(provided, of course, that it's utf8 you're after)

That way, it'll "Just Work"(tm)
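The checklist above could be turned into statements along these lines. This is only a sketch: the helper name utf8_setup_statements and the example database and table names are invented, and it just builds the SQL text without running it. Note that ALTER TABLE ... CONVERT TO CHARACTER SET rewrites existing columns as well as the table default, so it is worth trying on a copy first.

```python
def utf8_setup_statements(database, tables, charset="utf8"):
    """Build MySQL statements that set the charset at each level of the checklist."""

    def quote(name):
        # Backtick-quote an identifier, doubling any embedded backticks.
        return "`" + name.replace("`", "``") + "`"

    # a) database default
    statements = [f"ALTER DATABASE {quote(database)} CHARACTER SET {charset}"]
    # b) and c) table default plus existing columns
    for table in tables:
        statements.append(
            f"ALTER TABLE {quote(table)} CONVERT TO CHARACTER SET {charset}"
        )
    # d) connection
    statements.append(f"SET NAMES {charset}")
    return statements


for statement in utf8_setup_statements("example_db", ["posts", "comments"]):
    print(statement + ";")
```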

We've gone through upgrades from 3.23 -> 4.0 -> 4.1 -> 5.0 and never had a problem; and yes, our tables were all latin1 from the beginning.
Big Trouble in Little China. Don't use UCS-2. by argent · 2007-05-06 23:35 · Score: 5, Informative

UCS-2 only covers plane zero (the Basic Multilingual Plane, or BMP). It doesn't cover code points outside that. Unicode actually supports the entire UCS, all 1.1 million (and growing) code points.
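A quick way to see the plane-zero limit, as a sketch in Python (the helper names are made up for illustration): anything above U+FFFF is outside the BMP, and UTF-16 has to spend two 16-bit units on it, which UCS-2 cannot represent at all.

```python
BMP_MAX = 0xFFFF  # highest code point in plane zero


def in_bmp(ch: str) -> bool:
    """True if the single character ch lies in the Basic Multilingual Plane."""
    return ord(ch) <= BMP_MAX


def utf16_units(ch: str) -> int:
    """Number of 16-bit code units UTF-16 needs for ch."""
    return len(ch.encode("utf-16-le")) // 2


for ch in ("A", "\u4e2d", "\U00020000"):
    print(f"U+{ord(ch):04X}", in_bmp(ch), utf16_units(ch))
```

U+4E2D is an ordinary CJK ideograph inside the BMP; U+20000 is the first code point of CJK Extension B in plane 2.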

In other words, Joel has made the same mistake as the people who wrote software that only works in 7-bit ASCII or 8-bit UTF-8 or the IBM or Apple or Adobe 8-bit extended ASCII sets or the 9-bit extended ASCII set that ITS used, or...

And it's already too late to try and cram everything into 2 bytes. After the Han Unification mess (the attempt to force Chinese and Japanese and everything else that used some variant of Chinese ideograms (Kanji, etc...) into a common subset of similar characters that fit in the 65535 available codes in the BMP) the People's Republic of China decided to require their computers to support their national encoding anyway. As of 2000.

So you have to support the full UCS encoding anyway.

There are three storage formats that are practical to use: UCS-4 (4 bytes per character, with the same byte-ordering problems as UCS-2), UTF-16 (2 or 4 bytes per character, same as UCS-2 for the BMP) or UTF-8 (1-4 bytes per character). Internally, you can use UCS-4 as your wide character type and translate on the fly; use UTF-8 and take care to avoid breaking strings in the middle of glyphs; or use UTF-16, translate on the fly, and take care to avoid breaking strings in the middle of glyphs.
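The byte counts are easy to check. A small Python sketch follows; encoded_sizes is a made-up helper, and Python's utf-32-le codec stands in for UCS-4.

```python
CODECS = (("UTF-8", "utf-8"), ("UTF-16", "utf-16-le"), ("UCS-4", "utf-32-le"))


def encoded_sizes(ch: str) -> dict:
    """Bytes needed to store ch in each of the three formats (no byte-order mark)."""
    return {name: len(ch.encode(codec)) for name, codec in CODECS}


for ch in ("A", "\u00e9", "\u4e2d", "\U00020000"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The explicit little-endian codecs are used so that no byte-order mark is counted; the plain "utf-16" and "utf-32" codecs would prepend one.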

If Joel is lucky the libraries he's using are actually operating on UTF-16 strings instead of UCS-2 strings. If he's *really* lucky they're designed to avoid breaking up codes outside the BMP. If he's *super* lucky he's managed to avoid creating any code that just operates on strings as a stream of wchar_t anywhere.
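Here is what "breaking up codes outside the BMP" looks like in practice, as a Python sketch. The function truncate_utf16 is hypothetical; it cuts a string to a maximum number of 16-bit units but backs off by one unit if the cut would land between the two halves of a surrogate pair.

```python
def truncate_utf16(text: str, max_units: int) -> str:
    """Cut text to at most max_units UTF-16 code units without splitting a pair."""
    data = text.encode("utf-16-le")
    cut = data[: 2 * max_units]
    if len(cut) >= 2:
        last_unit = int.from_bytes(cut[-2:], "little")
        # 0xD800-0xDBFF is a high surrogate: the first half of a pair.
        if 0xD800 <= last_unit <= 0xDBFF:
            cut = cut[:-2]
    return cut.decode("utf-16-le")


sample = "a\U00020000b"  # 4 code units: 'a', a surrogate pair, 'b'
print(len(truncate_utf16(sample, 2)))  # the half pair is dropped, not emitted
print(len(truncate_utf16(sample, 3)))

# Code that treats the string as a plain stream of 16-bit units cuts mid-pair:
try:
    sample.encode("utf-16-le")[:4].decode("utf-16-le")
except UnicodeDecodeError as err:
    print("naive cut failed:", err.reason)
```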

Personally, I think that UTF-16 gets you the worst of both worlds: your data is almost certainly less compact than if you use UTF-8; you still have to deal with multi-wchar_t strings so your code is no easier to write than if you used UTF-8... you're just less likely to find bugs in testing; and you get byte order issues in files just like you would with UCS-4. Unless you think UCS-2 is "good enough" and you just ignore everything outside the BMP and discover that people in China are suddenly getting hash when they use your program.