Migrate a MySQL Database Preserving Special Characters
TomSlick writes "Michael Chu's blog provides a good solution for people migrating their MySQL databases and finding that special characters (like smart quotes) get mangled. He presents two practical solutions to migrating the database properly."
Better still, install DBI-Link http://pgfoundry.org/projects/dbi-link/ inside PostgreSQL, migrate once and have done ;)
What part of "A well regulated militia" do you not understand?
There's about 8,000 wordpress blog's out there that could use this. Pity I can't mod an article insightful
Me failed English...
FreeBSD over Linux. If my comments seem odd, this may explain...
First you get the names of every table in the old database
Then you create these tables in the new one. Just so there are no problems with data types you should probably just make every field varchar(100) in the new one.
Then you fire up MS Access, the older the better. I try to stick to Access 95.
Then you create two ODBC links, one to your old one and one to the new one.
Then you use the linked table manager to link each table to ms access.
Then you open both the new table and the old table and select all, copy and paste the data into the new table.
It's so simple even a child could do it!
+5 Insightful.
... simply moving/copying binary data files?
How is this news??? Later on today: "Data from POST/GET and special characters"?
That bit me one time when one of my live servers crashed and I had to load the data on the backup onto a different server. I remember wondering to myself, when was the good old days when a database was a dumb (smart, depending your POV) engine that only worries about a string of bytes (chars). Seriously, it only should become smarter and start talking in unicode only when I want it to.
Issues with using unicode is not just limited to MySQL, as Python have similar issues. However they are almost always caused by poor programmers who mixes usage between the two, or not doing type checking on the proper type (basestring).
Please direct all bug reports to
This guys mysqldump statement could use some args, too much is packed in his my.cnf defaults to make this truely useful as a how to. He could easily cause more problems than he is solving.
members are seeing something, your seeing an ad
Not a single day seems to go by without a gnubie or two posting things that are really basic knowledge.
If you do insert unicode data into a latin1 table, you will get unexpected results.
What you do is make sure that your:
a) database(s) are set to utf8 by default
b) table(s) are set to utf8 by default
c) column(s) are set to utf8 by default
d) connection defaults to utf8
(provided, of course, that it's utf8 you're after)
That way, it'll "Just Work"(tm)
We've gone through upgrades from 3.23 -> 4.0 -> 4.1 -> 5.0 and never had a problem; and yes, our tables were all latin1 from the beginning.
I share your attitude. Sometimes people where I work send snippets of code and whatnot over Outlook, and it gets corrupted with smiley faces.
Anyone noticing a strong similarity between this article and about 95% of the stuff posted on digg every day? What's next, "The Top 10 Photoshop Filters of All Time?" :) Tagged slashdigg.
Real IT professionals don't use Outlook.
Oh & you can change the encoding at anytime, so even if you initially forget you just change it later, then dump.
* Game Over * High Score: 264,846,927 -- Your Score: 14
I would say "always" but nobody's *always* anything... but often enough I can't pull up a specific example of a Joel On Software article that's not off the top of my head. :)
speaking of which, does anyone know where i can find info on the perl module named Audio:[smiley-sticking-his-toungue-out]lay? CPAN seems to be missing this one.
UCS-2 only covers plane zero (the Basic Multilingual Plane, or BMP). It doesn't cover code points outside that. Unicode actually supports the entire UCS, all 1.1 million (and growing) code points.
In other words, Joel has made the same mistake as the people who wrote software that only works in 7-bit ASCII or 8-bit UTF-8 or the IBM or Apple or Adobe 8-bit extended ASCII sets or the 9-bit extended ASCII set that ITS used, or...
And it's already too late to try and cram everything into 2 bytes. After the Han Unification mess (the attempt to force Chinese and Japanese and everything else that used some variant of Chinese ideograms (Kanji, etc...) into a common subset of similar characters that fit in the 65535 available codes in the BMP) the People Republic of China decided to require their computers to support their national encoding anyway. As of 2000.
So you have to support the full UCS encoding anyway.
There's three storage formats that it's practical to use: UCS-4 (4 bytes per character, with the same byte-ordering problems as UCS-2), UTF-16 (2-4 bytes per character, same as UCS-2 for the BMP) or UTF-8 (1-4 bytes per character). Internally: you can use UCS-4 as your wide character type, and translate on the fly; use UTF-8 and use care to avoid breaking strings in the middle of glyphs or use UTF-16 and translate on the fly and use care to avoid breaking strings in the middle of glyphs.
If Joel is lucky the libraries he's using are actually operating on UTF-16 strings instead of UCS-2 strings. If he's *really* lucky they're designed to avoid breaking up codes outside the BMP. If he's *super* lucky he's managed to avoid creating any code that just operates on strings as a stream of wchar_t anywhere.
Personally, I think that UTF-16 gets you the worst of both worlds: your data is almost certainly less compact than if you use UTF-8; you still have to deal with multi-wchar_t strings so your code is no easier to write than if you used UTF-8... you're just less likely to find bugs in testing; and you get byte order issues in files just like you would with UCS-4. Unless you think UCS-2 is "good enough" and you just ignore everything outside the BMP and discover that people in China are suddenly getting hash when they use your program.
Upgrade and copy the old datadir to new server
...
mysqldump --default-character-set=binary --compat=mysql40 olddb | mysql --default-character-set=cp1250 newdb
New default charset doesn't matter, the data is converted to whatever charet you set for the newdb when you created it. Important stuff is only the cp1250, that's the real charset of the data you had in the olddb. If it's different from yours, you should recejt it and substitute your own
“Freedom is untidy.”
—Donald Rumsfeld
Remember, kids: Violence and mayhem are symptoms of our nation's greatness. May the gun-control terrorists never rob us of our untidy freedoms, nor our foreign colonials of theirs.
> What you do is make sure that your:
> a) database(s) are set to utf8 by default
> b) table(s) are set to utf8 by default
> c) column(s) are set to utf8 by default
> d) connection defaults to utf8
Please tell me that if I specify a database is UTF8 that I don't also have to tell it that each column is utf8 as well!
And why should I have to tell the connections? Doesn't that get resolved automatically when the client connects to the server?
And why should I have to tell a utility what the target database codepage is? Can't they talk? Why the opportunity to manually do something trivial and screw it up?
And...'by default'? Doesn't mysql know how to automatically perform codepage conversions?
And, please tell me that MySQL wouldn't let somebody insert utf8, say asian characters, into a latin1 database!
> That way, it'll "Just Work"(tm)
ugh, i don't think that expression means what you think it means
does mysql have a list of prioritized missing basic features and functionality so that you get can a sense of when this kind of stupidity will be resolved?
So I have to remember to do EACH and EVERY one of those things so it "Just Works?"
Unbelievable.
It's been asked here several times and I beg the moderators: do not mod up any link pointing to Joel Spolsky up again. He's pretending to know something but he's basically wrong on everything.
/. ids of people posting links to Joel's lame wrong rantings.
It was stated here in a very rational argument that the guy knows nothing besides some Windows-centric VB and VC++ (and perhaps now C#).
He's full of himself because he worked on the MS Office codebase. Please, stop the insanity.
Once more a link to an article spreading misconception by Joel: this time it's about what Joel poorly understood about Unicode.
Such an article only highlights what several people here have been saying since quite some time now: Joel Spolsky is a mister nobody spreading misconception.
His article about Unicode is very flawed.
As argent (/. id 18001) pointed out: the explanation of Unicode by Joel is completely shortsighted. He's thinking that Unicode 3.0 (which only support 65 536 codepoints) is "the one true Unicode". This is a gigantic mistake. Moreover he's confusing character sets, code points and encodings.
The random babbling he's making only works for, as it has been stated, he's lucky to work using tools that are broken in the same ways: this is simply inexcusable.
Let me state it once and for all: anything that Joel writes is only half-truth. The guy completely lacks rigor and his writings are not interesting. And he always comes up with dramatic title like "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" though this should really be reworded to something like "The misconception Joel Spolsky has about Unicode in its outdated Windows-centric world". And this is not just about Unicode, it's like that with everything that this uninteresting person writes.
So moderators, please, mod down posts linking to Joel and mod -1
My one criticism of Joel was passing himself the "get out of jail free" card. Before I get started, I should warn you that if you are one of those rare people who knows about internationalization, you are going to find my entire discussion a little bit oversimplified. This is a fair disclaimer, but it makes it impossible to judge where Joel was simplifying deliberately and where he simplified because he didn't know any better. The correction would be for Joel to state "I'm going to simplify issues X, Y, and Zed". Then mistakes in the middle of the alphabet would be entirely his own. Just as there is no such thing as a string without a coding system, there is no such thing as a useful disclaimer that doesn't specify precisely what it disclaims. It amused me to see Joel invoke the ASCII standard of accountability.
Concerning the claim that Joel has made the same mistake [over again], this same claim comes up all the time concerning address arithmetic. How much existing code is portable to a 128 bit address size? We're sure to need this by 2050. Or perhaps not. People tend to neglect the observation that we're talking about a doubly exponential progression in codespace: (2^2^3)^2^N, with the values N=0,1,2,3,4 plausible in photolithographic solid state. On the current progression, for N=5 transistors would need to be subatomic. As far as the present transition from 32 bits to 64 bits of address space, it makes sense that operating systems and file systems are 64-bit native, while 99% of user space applications continue to run in less time and space compiled for 32 bits. Among the growing sliver of applications that do run better in 64-bits are a few applications of especially high importance.
I worked extensively with CJK languages in the early 1990s, and my opinion at the time was that UCS-4 primarily catered to the SETI crowd, and potentially, belligerent Mandarins in mainland China. I recall more argument at the time about Korean, which is a syllabic script masquerading as ideographic blocks.
http://en.wikipedia.org/wiki/Hangul
I've always had a lot of trouble understanding the opposition to Han unification. Many distinctions in the origins of the English language were lost in the adoption of ASCII, such as the ae ligature and the old-English thorn (which causes many Hollywood sets to feature "Ye old saloon").
http://en.wikipedia.org/wiki/Han_unification
http://en.wikipedia.org/wiki/Thorn_(letter)
And all this fits quite nicely in UCS-2 as advocated by Joel.
My objection is that he's saying "use UCS-2, it solves the problem" when UCS-2 doesn't solve the problem and it creates new ones.
What he should have said is "use a wide character library that supports Unicode-2, no matter what it uses internally, and make sure your code still works if they change the encoding behind your back".
I don't know why you're going on about "I have a tough time accepting the premise that the British Commonwealth was well served by ASCII". I didn't say that anyone was well served by ASCII.
Real IT professionals don't use Outlook.
I agree. They use Outlook Express, because it comes with every Pee Cee and is completely free.
Send as plain text and not HTML or Rich Text.
In any case, you can always use a guaranteed unused code and use a lookup table on input and output.
Until you run out of such codes, anyway.
You have planes 15 and 16 available. If you have a *single* object that is using more than 131072 unique combinations of composing characters I'll be impressed, and you might even have to start using the as-yet-unused 12 bits of UCS-4.
when doing (non-'real-time'-critical) migrations... I find it much less troublesome to just scp the files and then run mysqlcheck to repair the tables on the new server (if required).
http://www.seanodonnell.com/code/?id=66
That process prevents the syntax conversion issues, storage and bandwidth requirements, and processing-time requirements caused by using mysqldump.
the only permanence in existence, is the impermanence of existence.
Access is okay if you link it to a real SQL backend database and never link any "editable" fields to the table. I use all unbound controls and then populate everything in code, then at "save" time run a check of every control, and only if everything passes the check does an UPDATE or INSERT SQL statement get generated in my code then executed to the back-end. ADO and even DAO in VBA are your friends!
Locking is no problem. If needed, I implement it myself or use SQL direct to allow use of transactions.