[nycphp-talk] utf-8, iso-8859-1...
David Mintz
david at davidmintz.org
Thu May 6 11:46:03 EDT 2010
I don't really have a good understanding of issues around character sets,
encoding, what have you, though I am starting to work on it.
My problem involves a MySQL database and accented characters such as those
you find in Spanish and French. My web server sends a "content-type:
text/html; charset=iso-8859-1" header and my docs have an equivalent meta
tag. My MySQL configuration says
default-character-set = latin1
character_set_server = latin1
collation_server = latin1_general_ci
and the SHOW CREATE TABLE output for my data tables typically looks like
CREATE TABLE `people` (
`id` smallint(5) unsigned NOT NULL AUTO_INCREMENT,
`lastname` varchar(40) COLLATE latin1_general_ci NOT NULL,
`firstname` varchar(40) COLLATE latin1_general_ci NOT NULL,
/* etc */
) ENGINE=MyISAM AUTO_INCREMENT=546 DEFAULT CHARSET=latin1
COLLATE=latin1_general_ci
So what's the problem? Generally there is none. Characters like ó and ñ
render correctly. The snag I am hitting now is writing a regular expression
to whitelist the characters I can accept in proper names. I would think that
the regex
/^[-a-zA-Z\xC0-\xFF ']+$/
would test for anything that isn't a "letter" in most Western European
languages, or a hyphen, a space, or an apostrophe. But it is returning true (meaning
yes there is an illegal character) in the name Barceló, where false is what
I would like to hear.
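
One thing worth checking is which bytes the regex actually sees. The sketch below is in Python rather than PHP and is only an illustration, not a diagnosis of the setup described above: it applies the same character class to the raw bytes of "Barceló" encoded once as ISO-8859-1 and once as UTF-8. The helper name is_valid_name is hypothetical. If the string reaches the regex as UTF-8 (for example from a UTF-8 source file or form submission), ó arrives as two bytes, and the second one falls outside \xC0-\xFF.

```python
import re

# Byte-oriented version of the character class from the post.
# \xC0-\xFF covers the accented letters of ISO-8859-1 (plus the signs × and ÷).
NAME_RE = re.compile(rb"^[-a-zA-Z\xC0-\xFF ']+$")


def is_valid_name(raw: bytes) -> bool:
    """Return True when every byte of raw is in the whitelist (hypothetical helper)."""
    return NAME_RE.match(raw) is not None


name = "Barceló"
latin1_bytes = name.encode("iso-8859-1")  # ó is the single byte 0xF3
utf8_bytes = name.encode("utf-8")         # ó is the two bytes 0xC3 0xB3

print(is_valid_name(latin1_bytes))  # True: 0xF3 is inside \xC0-\xFF
print(is_valid_name(utf8_bytes))    # False: 0xB3 is below \xC0
```

Dumping the bytes of the failing value (bin2hex() in PHP) would show which of the two cases applies.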
Would this regex work if the data were utf-8? Should I consider converting
everything and working in utf-8, and if so, how painful is it to convert a
MySQL database? My initial research suggests that it isn't painless.
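
As a rough illustration of the UTF-8 question, again in Python and under the assumption that the encoding of the incoming bytes is known: decoding first and matching against code points makes the same whitelist behave identically for either encoding. The name is_valid_name_text is hypothetical. In PHP's PCRE functions the comparable switch is the u pattern modifier, which treats both pattern and subject as UTF-8.

```python
import re

# Same whitelist, applied to decoded text (code points) instead of raw bytes.
TEXT_NAME_RE = re.compile(r"^[-a-zA-Z\u00C0-\u00FF ']+$")


def is_valid_name_text(raw: bytes, encoding: str) -> bool:
    """Decode raw with the stated encoding, then test the whitelist (hypothetical helper)."""
    text = raw.decode(encoding)
    return TEXT_NAME_RE.match(text) is not None


name = "Barceló"
print(is_valid_name_text(name.encode("utf-8"), "utf-8"))            # True
print(is_valid_name_text(name.encode("iso-8859-1"), "iso-8859-1"))  # True
```

The whitelist itself does not change; only the point at which bytes become characters does.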
--
Support real health care reform:
http://phimg.org/
--
David Mintz
http://davidmintz.org/