[nycphp-talk] Any alternatives to mbstring for PHP+UTF-8?
Paul Houle
paul at devonianfarm.com
Thu May 10 21:31:28 EDT 2007
Jakob Buchgraber wrote:
> Hey!
>
> I was wondering whether there are alternatives to mbstring for
> handling UTF-8 encoded data with PHP?
> I am asking, because I'd like to play around with as many
> "technologies" as possible before I actually start developing.
> I somehow also looked at the way Joomla! did it, but I don't really
> like their solution.
>
Sometimes you can process UTF-8 without doing anything special. For
instance, if you want to pull some text out of a MySQL database and
display it on a web page, you can pass the UTF-8 text through without
using mbstring in PHP: the one thing you need to do is set the
character encoding of the HTML document to UTF-8.
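As a rough sketch of that pass-through case: the helper name render_page()
below is made up for illustration, and it assumes the text coming out of
the database is already valid UTF-8. It only declares the encoding and
escapes the HTML metacharacters; it never converts anything.

```php
<?php
// Hypothetical helper: wraps already-UTF-8 text in a minimal HTML page.
function render_page(string $utf8_text): string {
    // Declare the encoding in the HTTP header when that is still possible.
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=utf-8');
    }
    // Escape <, >, &, and quotes; multibyte sequences pass through untouched.
    $body = htmlspecialchars($utf8_text, ENT_QUOTES, 'UTF-8');
    // Declare the encoding in the document as well.
    return "<!DOCTYPE html>\n<html><head><meta charset=\"utf-8\"></head>"
         . "<body><p>$body</p></body></html>";
}
```

Something like `echo render_page($row['title']);` would then be enough,
where `$row` stands in for whatever your database layer returns.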
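A big strength of UTF-8 is that it is compatible with US-ASCII: all
US-ASCII characters are the same single bytes in UTF-8, and every byte
of a multibyte character has its high bit set, so an ASCII byte never
shows up inside another character. This means that you can explode on
",", "\t", "\n" or a space just like you always do.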
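A small example of that, using only the ordinary byte-oriented explode().
The sample string is invented for illustration and is written with byte
escapes so it does not depend on the encoding of the source file.

```php
<?php
// "Zürich,東京,São Paulo" spelled out as UTF-8 bytes.
$line = "Z\xC3\xBCrich,\xE6\x9D\xB1\xE4\xBA\xAC,S\xC3\xA3o Paulo";

// The comma is a single ASCII byte and cannot occur inside a multibyte
// character, so the plain byte-oriented explode() splits correctly.
$cities = explode(',', $line);
```

Each element of `$cities` is still a complete, valid UTF-8 string.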
Any regex on Unicode 'characters' can be translated to a regex that
works on UTF-8 bytes. This may be awkward sometimes, but it can be an
efficient way to do many operations, including those that "get under
the hood" of your language.
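To make the byte-level regex idea concrete, here is one possible sketch.
The constant UTF8_CHAR and the two function names are my own inventions,
and the pattern is deliberately simplified: it does not reject overlong
forms or surrogates, and stray bytes that don't fit it are just skipped.

```php
<?php
// Byte-level pattern for one UTF-8 character: an ASCII byte, or a lead
// byte followed by the right number of continuation bytes.
const UTF8_CHAR = '[\x00-\x7F]'
                . '|[\xC2-\xDF][\x80-\xBF]'
                . '|[\xE0-\xEF][\x80-\xBF]{2}'
                . '|[\xF0-\xF4][\x80-\xBF]{3}';

// Count characters by counting matches (no /u modifier, no mbstring).
function utf8_length(string $s): int {
    return preg_match_all('/' . UTF8_CHAR . '/', $s);
}

// Return the first $n characters of $s without cutting one in half.
function utf8_first_n(string $s, int $n): string {
    preg_match('/^(?:' . UTF8_CHAR . '){0,' . $n . '}/', $s, $m);
    return $m[0];
}
```

For the six-byte string "naïve" utf8_length() counts five characters, and
utf8_first_n() with 3 returns "naï" with the two-byte "ï" kept intact.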
Avoid unnecessary character conversions. If you can take UTF-8 in,
process it as UTF-8, and output UTF-8, that's really the best. People
who work with languages like Java, that do character conversions for
you, often find they're not in control of their character conversions.
Years ago I discovered that the contents of a PostgreSQL database were
double-encoded... The bytes that made up the first UTF-8 encoding were
treated as ISO-8859-1 (Latin-1) characters, and re-encoded as UTF-8...
If you're working with Unicode, you'll probably need to deal with
problems like this from time to time.
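The double-encoding accident is easy to reproduce, which helps when you
need to recognize it. The sketch below is pure PHP with made-up function
names. It assumes the wrong step treated the bytes as strict ISO-8859-1
(real cases often involve Windows-1252 instead, which this does not
handle), and the repair should only be run on data you know is
double-encoded, since it would damage correctly encoded text.

```php
<?php
// The bug: re-encode every byte as if it were an ISO-8859-1 character.
function latin1_to_utf8(string $bytes): string {
    $out = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $b = ord($bytes[$i]);
        $out .= $b < 0x80
            ? $bytes[$i]
            : chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
    }
    return $out;
}

// The repair: collapse each two-byte sequence for U+0080..U+00FF back
// into the single byte it came from, removing one layer of encoding.
function undo_double_encoding(string $s): string {
    return preg_replace_callback(
        '/[\xC2\xC3][\x80-\xBF]/',
        function ($m) {
            return chr(((ord($m[0][0]) & 0x03) << 6) | (ord($m[0][1]) & 0x3F));
        },
        $s
    );
}
```

Feeding "café" (bytes 63 61 66 C3 A9) through latin1_to_utf8() gives
63 61 66 C3 83 C2 A9, which displays as "cafÃ©"; undo_double_encoding()
turns it back into the original five bytes.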
The main weakness of UTF-8 is that it's a variable-length encoding.
That means it's hard to pick out the N'th character of a string.
mbstring has a function that lets you do this (mb_substr), but be
careful how you use it. Getting the N'th character of a UTF-8 string is
an O(N) operation, so walking a string by asking for character 0, 1, 2,
and so on in turn is O(N^2)... Ouch.
Efficient algorithms for UTF-8 tend to work sequentially -- and quite a
few of them can be translated to string algorithms over the bytes.
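One way to walk a string sequentially in a single O(N) pass is to read
the lead byte of each character to find out how long it is. This is only
a sketch: the function name is hypothetical, and it assumes mostly
well-formed input (an unexpected byte is passed through as a one-byte
"character" rather than reported).

```php
<?php
// Split a UTF-8 string into characters in one left-to-right pass.
function utf8_chars(string $s): array {
    $chars = [];
    $n = strlen($s);
    for ($i = 0; $i < $n; $i += $len) {
        $b = ord($s[$i]);
        if ($b < 0x80) {
            $len = 1;                      // ASCII
        } elseif (($b & 0xE0) === 0xC0) {
            $len = 2;                      // 110xxxxx lead byte
        } elseif (($b & 0xF0) === 0xE0) {
            $len = 3;                      // 1110xxxx lead byte
        } elseif (($b & 0xF8) === 0xF0) {
            $len = 4;                      // 11110xxx lead byte
        } else {
            $len = 1;                      // stray byte: pass it through
        }
        $chars[] = substr($s, $i, $len);
    }
    return $chars;
}
```

If you do need random access afterwards, `$chars[$k]` on the resulting
array is constant time, at the cost of the extra memory for the array.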
There's no substitute for understanding how Unicode and UTF-8 and
related representations work -- if you work with it enough, you'll see
all kinds of malformed text and you'll need to be able to deal with it.
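For spotting malformed text before it gets into your data, one common
trick that needs no mbstring is to lean on PCRE's own UTF-8 validation.
The wrapper name below is made up, and it assumes your PHP's PCRE was
built with UTF-8 support, which is the usual case.

```php
<?php
// With the /u modifier PCRE checks that the subject is valid UTF-8
// before matching. The empty pattern matches any valid string, so
// preg_match() returns 1 for good input and false for malformed input.
function is_valid_utf8(string $s): bool {
    return preg_match('//u', $s) === 1;
}
```

For example, "caf\xC3\xA9" passes, while the Latin-1 bytes "caf\xE9"
and the truncated sequence "\xE2\x82" do not.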