[nycphp-talk] utf-8, iso-8859-1...

Anirudh Zala arzala at gmail.com
Fri May 7 23:47:48 EDT 2010


On Thursday 06 May 2010 21:16:03 David Mintz wrote:
> I don't really have a good understanding of issues around character sets,
> encoding, what have you, though I am starting to work on it.
> 
> My problem involves a MySQL database and accented characters such as those
> you find in Spanish and French. My web server sends a "content-type:
> text/html; charset=iso-8859-1" header and my docs have an equivalent meta
> tag. My mysql's config says
> 
> default-character-set = latin1
> character_set_server = latin1
> collation_server     = latin1_general_ci

Here, in MySQL's configuration file, you can permanently set utf8 as the default
character set and collation, so that new databases and tables pick it up
automatically. At that level, this solves the problem of storing and retrieving
utf-8 data.

> 
> and my data tables "SHOW CREATE" typically look like
> 
> CREATE TABLE `people` (
>   `id` smallint(5) unsigned NOT NULL AUTO_INCREMENT,
>   `lastname` varchar(40) COLLATE latin1_general_ci NOT NULL,
>   `firstname` varchar(40) COLLATE latin1_general_ci NOT NULL,
>   /* etc */
> ) ENGINE=MyISAM AUTO_INCREMENT=546 DEFAULT CHARSET=latin1
> COLLATE=latin1_general_ci

You will need to convert existing data from latin1_* to utf8_* for consistent 
storage and retrieval of new and old data.
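
To make concrete what "converting" means for a single value, here is a minimal
sketch in Python (used only for illustration; it is not how MySQL performs the
conversion). The sample name and the helper latin1_to_utf8 are hypothetical.
The text stays the same, only its byte representation changes:

```python
# Hypothetical value as it would sit in a latin1 column: one byte per character.
latin1_bytes = b"Barcel\xf3"  # 0xF3 is "o with acute accent" in ISO-8859-1


def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8 without changing the text itself."""
    return raw.decode("iso-8859-1").encode("utf-8")


utf8_bytes = latin1_to_utf8(latin1_bytes)

# The accented letter now occupies two bytes (0xC3 0xB3) instead of one.
print(latin1_bytes, "->", utf8_bytes)
print(len(latin1_bytes), "bytes ->", len(utf8_bytes), "bytes")
```

Note that the accented character grows from one byte to two, so anything that
counts bytes rather than characters will see different lengths after the
conversion.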

> 
> So what's the problem? Generally there is none. Characters like ó and ñ
> render correctly. The snag I am hitting now is writing a regular expression
> to whitelist the characters I can accept in proper names. I would think
> that the regex
> 
>       /^[-a-zA-Z\xC0-\xFF ']+$/
> 
> would test for anything that isn't a "letter" in most western european
> languages, or a space, or an apostrophe. But it is returning true (meaning
> yes there is an illegal character) in the name Barceló, where false is what
> I would like to hear.

The biggest problem with utf-8 data is text processing (sorting, searching,
validation etc.), which is why full utf-8 support is lacking in many languages.
But there are extensions like mbstring, iconv and ICU which can help in
processing utf-8 data at a satisfactory level.
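
As a rough illustration of why validation is sensitive to encoding, here is a
sketch in Python (for illustration only; the names BYTE_PATTERN, TEXT_PATTERN,
is_valid_bytes and is_valid_text are made up for this example). It applies the
quoted character class once to raw bytes and once to decoded text. One possible
explanation for the behaviour described above, and only an assumption, is that
the name reaches the regex as utf-8 bytes rather than iso-8859-1:

```python
import re

# Byte-level version of the quoted pattern. It assumes one byte == one
# character, which only holds for single-byte encodings like ISO-8859-1.
BYTE_PATTERN = re.compile(rb"^[-a-zA-Z\xC0-\xFF ']+$")

# Character-level version: it runs on decoded text, so the byte encoding
# the data arrived in no longer matters once it has been decoded correctly.
# Caveat: U+00C0..U+00FF also contains the multiplication and division
# signs (U+00D7, U+00F7), which are not letters.
TEXT_PATTERN = re.compile(r"^[-a-zA-Z\u00C0-\u00FF ']+$")


def is_valid_bytes(raw: bytes) -> bool:
    """True if every byte of raw is in the whitelist."""
    return BYTE_PATTERN.match(raw) is not None


def is_valid_text(text: str) -> bool:
    """True if every character of text is in the whitelist."""
    return TEXT_PATTERN.match(text) is not None


name = "Barcel\u00f3"

print(is_valid_bytes(name.encode("iso-8859-1")))  # single byte 0xF3
print(is_valid_bytes(name.encode("utf-8")))       # bytes 0xC3 0xB3
print(is_valid_text(name))                        # decoded characters
```

The utf-8 case fails because the second byte of the accented letter, 0xB3,
falls outside the \xC0-\xFF range. In PHP's PCRE functions the "u" pattern
modifier serves a comparable purpose, treating pattern and subject as utf-8.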

> 
> Would this regex work if the data were utf-8? Should I consider converting
> everything and working in utf-8, and if so, how painful is it to convert a
> MySQL database? My initial research suggests that it isn't painless.

Yes. Moreover, you also need to change the meta information in your web pages to
tell the browser and server that your text is utf-8 encoded and not iso-8859...
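
To show what goes wrong when the declared charset and the actual bytes disagree,
here is a small sketch in Python (for illustration only; content_type_header and
render_as are hypothetical helpers, and render_as merely imitates a browser
decoding the response body with whatever charset it was told to use):

```python
def content_type_header(charset: str) -> str:
    """Build the HTTP header line that declares the page's charset."""
    return "Content-Type: text/html; charset=" + charset


def render_as(body: bytes, declared_charset: str) -> str:
    """Decode the body the way a client trusting the declaration would."""
    return body.decode(declared_charset)


# The page content is stored and sent as utf-8 bytes.
body = "Barcel\u00f3".encode("utf-8")

for charset in ("utf-8", "iso-8859-1"):
    print(content_type_header(charset))
    print("  renders as:", render_as(body, charset))
```

With the matching declaration the name comes out intact. With the stale
iso-8859-1 declaration, the two bytes of the accented letter are shown as two
separate characters, which is the familiar garbled output.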

Finally, your editor must be set to write your code/information in utf-8 format
only. I don't think you need to make any changes at the web server, OS or http
level, since nowadays they have native support for handling utf-8 data.

Thanks

Anirudh Zala


