[nycphp-talk] MySQL 4 vs. 5 and character encoding (?)
Mikko Rantalainen
mikko.rantalainen at peda.net
Thu May 11 04:26:54 EDT 2006
Rahmin Pavlovic wrote:
> On 5/10/06 11:53 AM, "Daniel Convissor" <danielc at analysisandsolutions.com>
> wrote:
>
>> On Wed, May 10, 2006 at 10:57:53AM -0400, Rahmin Pavlovic wrote:
>>> I already tried setting the charset / collation to utf-8 -- both on tables /
>>> fields -- but resaving data produces the same mangled results.
>> How did you do that? ALTER TABLE? If so, that's too late.
>>
>> Open up the SQL script created by mysqldump. What's the default charset
>> defined in the CREATE TABLE statements?
>>
>
> latin1_swedish_ci. When I ALTER TABLE, it successfully updates to
> utf8_general_ci. Why is it too late? Do I need to re-dump?
Since you say that you changed it to utf8, I guess the original data
was supposed to be with UTF-8 character set, but because MySQL5
dumped it with latin1_swedish_ci you never did tell MySQL5 about the
fact (this happens pretty much automatically if you upgrade from
MySQL4 which pretty much didn't have a clue about UTF-8).
However, MySQL5 in its infinite wishdom replaces invalid characters
in the mysqldump output with a literal "?" (U+003F) instead of
leaving those bytes as is. Latin1 doesn't define byte range 128-159
so MySQL5 replaces those bytes with literal "?" if you use mysqldump
without correct settings (UTF-8 encoded strings may include bytes in
range 128-159 and if you replace those bytes with a literal "?"
you'll end up with invalid UTF-8 encoding). You will not notice this
problem if you just do INSERT, UPDATE and SELECT from some other
application because this modification doesn't happen in that case.
In short, yes, you have to re-dump from MySQL5. Hopefully you have
the binary table data still available.
--
Mikko
More information about the talk
mailing list