[nycphp-talk] utf-8, iso-8859-1...
Anirudh Zala
arzala at gmail.com
Fri May 7 23:47:48 EDT 2010
On Thursday 06 May 2010 21:16:03 David Mintz wrote:
> I don't really have a good understanding of issues around character sets,
> encoding, what have you, though I am starting to work on it.
>
> My problem involves a MySQL database and accented characters such as those
> you find in Spanish and French. My web server sends a "content-type:
> text/html; charset=iso-8859-1" header and my docs have an equivalent meta
> tag. My mysql's config says
>
> default-character-set = latin1
> character_set_server = latin1
> collation_server = latin1_general_ci
Here, in MySQL's configuration files, you can permanently set utf8 as the
default character set and collation, so that new databases and tables pick it
up automatically. At this level, that solves the problem of storing and
retrieving utf-8 data.
>
> and my data tables "SHOW CREATE" typically look like
>
> CREATE TABLE `people` (
> `id` smallint(5) unsigned NOT NULL AUTO_INCREMENT,
> `lastname` varchar(40) COLLATE latin1_general_ci NOT NULL,
> `firstname` varchar(40) COLLATE latin1_general_ci NOT NULL,
> /* etc */
> ) ENGINE=MyISAM AUTO_INCREMENT=546 DEFAULT CHARSET=latin1
> COLLATE=latin1_general_ci
You will need to convert existing data from latin1_* to utf8_* so that old and
new data are stored and retrieved consistently.
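
As a rough sketch of that conversion step, the snippet below only builds the
ALTER TABLE ... CONVERT TO CHARACTER SET statements for a list of tables. The
helper name build_convert_statements and the utf8_general_ci collation are my
assumptions, not something taken from your setup. It also assumes the columns
really hold latin1 data; if utf-8 bytes were ever written into latin1 columns,
a plain conversion would double-encode them. Back up first and try it on a copy.

```python
def build_convert_statements(tables, charset="utf8", collation="utf8_general_ci"):
    """Return one ALTER TABLE statement per table name (hypothetical helper)."""
    statements = []
    for table in tables:
        # Backticks quote the identifier; double any embedded backtick.
        quoted = "`" + table.replace("`", "``") + "`"
        statements.append(
            f"ALTER TABLE {quoted} CONVERT TO CHARACTER SET {charset} "
            f"COLLATE {collation};"
        )
    return statements


# The table name comes from the SHOW CREATE output quoted above.
for statement in build_convert_statements(["people"]):
    print(statement)
```

The statements are printed rather than executed, so you can review them before
running anything against the real database.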
>
> So what's the problem? Generally there is none. Characters like ó and ñ
> render correctly. The snag I am hitting now is writing a regular expression
> to whitelist the characters I can accept in proper names. I would think
> that the regex
>
> /^[-a-zA-Z\xC0-\xFF ']+$/
>
> would test for anything that isn't a "letter" in most western european
> languages, or a space, or an apostrophe. But it is returning true (meaning
> yes there is an illegal character) in the name Barceló, where false is what
> I would like to hear.
The biggest problem with utf-8 data is text processing (sorting, searching,
validation, etc.), which is why full utf-8 support is lacking in many languages.
But there are extensions like mbstring, iconv and ICU that can help process
utf-8 data at a satisfactory level.
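
To illustrate why a byte-range class like \xC0-\xFF depends on the encoding,
here is a small sketch in Python (not PHP, purely to show the byte-level
behaviour). It assumes the name arrives either as latin1 bytes or as utf-8
bytes; the variable names are mine. In utf-8, ó is two bytes (0xC3 0xB3), and
the second one falls outside the range, so a byte-oriented pattern rejects it.
In PHP the analogous fix would be to match characters rather than bytes, for
example with the PCRE "u" modifier.

```python
import re

# Byte-oriented pattern: same character class as the regex quoted above.
BYTE_PATTERN = re.compile(rb"^[-a-zA-Z\xC0-\xFF ']+$")
# Character-oriented pattern: the range now means code points U+00C0..U+00FF.
TEXT_PATTERN = re.compile(r"^[-a-zA-Z\u00C0-\u00FF ']+$")

name = "Barcel\u00f3"  # Barceló

latin1_bytes = name.encode("latin-1")  # ó is the single byte 0xF3
utf8_bytes = name.encode("utf-8")      # ó is the two bytes 0xC3 0xB3

print(bool(BYTE_PATTERN.match(latin1_bytes)))  # True
print(bool(BYTE_PATTERN.match(utf8_bytes)))    # False: 0xB3 is outside C0-FF
print(bool(TEXT_PATTERN.match(name)))          # True: matching on characters
```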
>
> Would this regex work if the data were utf-8? Should I consider converting
> everything and working in utf-8, and if so, how painful is it to convert a
> MySQL database? My initial research suggests that it isn't painless.
Yes. Moreover, you also need to change the meta information in your web pages
to tell the browser and server that your text is utf-8 encoded and not
iso-8859... Finally, your editor must be set to write your code and content in
utf-8 only. Apart from the charset in the Content-Type header you mentioned, I
don't think you need to make any changes at the web server, OS or HTTP level,
since nowadays they have native support for utf-8 data.
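
A quick way to see why the declared charset has to match the actual bytes is
the sketch below (again Python, only as an illustration; the variable names are
mine). It decodes the same utf-8 bytes once with the right charset and once as
iso-8859-1, which is what a browser does when the header or meta tag is wrong.

```python
name = "Barcel\u00f3"  # Barceló

utf8_bytes = name.encode("utf-8")

# Correct declaration: the bytes are decoded with the charset they were written in.
as_utf8 = utf8_bytes.decode("utf-8")
# Mismatched declaration: utf-8 bytes interpreted as iso-8859-1.
as_latin1 = utf8_bytes.decode("latin-1")

print(as_utf8)    # Barceló
print(as_latin1)  # BarcelÃ³
```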
Thanks
Anirudh Zala