[nycphp-talk] (off-list) Re: enforcing Latin-1 input
Allen Shaw
ashaw at polymerdb.org
Tue Nov 29 16:40:13 EST 2005
Dear Mikko,
Thanks very much for your help figuring out this issue. Clearly I'm new
to this area.
Also, I guess it's obvious that I'm looking for solutions for a specific
problem in a specific project. It seems proper that I should ask you
before using your code. This particular project is an in-house
application for my non-profit employer; I'm also considering using it
within the Polymer Project (http://polymerdb.org), which is GPL'd, and
on which our in-house project is based. May I use a modified version of
the code below in that GPL'd project?
Best Regards,
Allen Shaw
Mikko Rantalainen wrote:
> I'm using the following function:
>
> function isValidUTF8String($Str)
> {
>     # correct UTF-8 stream has every character starting with zero bit
>     # or first byte has <length of encoding> high bits set and all
>     # following bytes have highest bits set to 10.
>     for ($i=0; $i<strlen($Str); $i++)
>     {
>         if (ord($Str[$i]) < 0x80) continue; # 0bbbbbbb
>         else if ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1; # 110bbbbb
>         else if ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2; # 1110bbbb
>         else if ((ord($Str[$i]) & 0xF8) == 0xF0) $n=3; # 11110bbb
>         else if ((ord($Str[$i]) & 0xFC) == 0xF8) $n=4; # 111110bb
>         else if ((ord($Str[$i]) & 0xFE) == 0xFC) $n=5; # 1111110b
>         else return false; # invalid byte
>         # verify that n continuation bytes matching bit sequence
>         # 10bbbbbb follow. Note: this tests only the byte pattern;
>         # it does not detect "overlong UTF-8 encoding", which is
>         # not allowed.
>         for ($j=0; $j<$n; $j++)
>             if ((++$i == strlen($Str))
>                 || ((ord($Str[$i]) & 0xC0) != 0x80))
>                 return false;
>     }
>     # couldn't find errors, it's probably valid UTF-8 data.
>     return true;
> }
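
For illustration only, here is a minimal sketch of a stricter variant of the check quoted above. The quoted function tests the lead-byte and continuation-byte bit patterns; this sketch additionally decodes each code point so that overlong forms, surrogates, and values above U+10FFFF can be rejected. The name isStrictUtf8 and the use of type declarations (PHP 7 or later) are assumptions made for this example and are not part of Mikko's code.

```php
<?php
// Hypothetical helper (not from the quoted message): validates UTF-8 by
// decoding each sequence and checking the resulting code point.
function isStrictUtf8(string $str): bool
{
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $byte = ord($str[$i]);
        if ($byte < 0x80) {
            continue;                                    // 0bbbbbbb: ASCII
        } elseif (($byte & 0xE0) === 0xC0) {
            $n = 1; $cp = $byte & 0x1F; $min = 0x80;     // 110bbbbb
        } elseif (($byte & 0xF0) === 0xE0) {
            $n = 2; $cp = $byte & 0x0F; $min = 0x800;    // 1110bbbb
        } elseif (($byte & 0xF8) === 0xF0) {
            $n = 3; $cp = $byte & 0x07; $min = 0x10000;  // 11110bbb
        } else {
            return false;  // stray continuation byte, or a 5/6-byte lead
        }
        // Read $n continuation bytes (10bbbbbb) and accumulate their bits.
        for ($j = 0; $j < $n; $j++) {
            if (++$i === $len) {
                return false;                            // truncated sequence
            }
            $next = ord($str[$i]);
            if (($next & 0xC0) !== 0x80) {
                return false;                            // not a continuation
            }
            $cp = ($cp << 6) | ($next & 0x3F);
        }
        if ($cp < $min) {
            return false;                                // overlong encoding
        }
        if ($cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
            return false;                                // out of range or surrogate
        }
    }
    return true;
}

// Example: a lone Latin-1 byte is rejected, its UTF-8 form is accepted.
var_dump(isStrictUtf8("caf\xE9"));      // Latin-1 "café"
var_dump(isStrictUtf8("caf\xC3\xA9"));  // UTF-8 "café"
```

Where the mbstring extension is available, mb_check_encoding($str, 'UTF-8') may be a simpler way to get a comparable check; the hand-written loop is shown only to make the individual rules visible.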
--
Allen Shaw
Polymer (http://polymerdb.org)