NYCPHP Meetup

Tue Nov 29 16:40:13 EST 2005

Dear Mikko,

Thanks very much for your help in figuring out this issue.  Clearly I'm 
new to this area.

Also, I guess it's obvious that I'm looking for a solution to a specific 
problem in a specific project, so it seems proper to ask you before 
using your code.  This particular project is an in-house application for 
my non-profit employer.  I'm also considering using the code within the 
Polymer Project (http://polymerdb.org), which is GPL'd and is the basis 
for our in-house project.  May I use a modified version of the code 
below in that GPL'd project?

Best Regards,
Allen Shaw

Mikko Rantalainen wrote:
> I'm using the following function:
> 
> function isValidUTF8String($Str)
> {
> # correct UTF-8 stream has every character starting with zero bit
> # or first byte has <length of encoding> high bits set and all
> # following bytes have highest bits set to 10.
> for ($i=0; $i<strlen($Str); $i++)
> {
> 	if (ord($Str[$i]) < 0x80) continue; # 0bbbbbbb
> 	else if ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1; # 110bbbbb
> 	else if ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2; # 1110bbbb
> 	else if ((ord($Str[$i]) & 0xF8) == 0xF0) $n=3; # 11110bbb
> 	else if ((ord($Str[$i]) & 0xFC) == 0xF8) $n=4; # 111110bb
> 	else if ((ord($Str[$i]) & 0xFE) == 0xFC) $n=5; # 1111110b
> 	else return false; # invalid byte
> 	# verify that n continuation bytes matching bit sequence
> 	# 10bbbbbb follow. Note: this does not reject "overlong UTF-8
> 	# encodings" (e.g. 0xC0 0x80), which are not allowed; that
> 	# needs a separate check on the decoded value.
> 	for ($j=0; $j<$n; $j++)
> 		if ((++$i == strlen($Str))
> 		    || ((ord($Str[$i]) & 0xC0) != 0x80))
> 			return false;
> }
> # couldn't find errors, it's probably valid UTF-8 data.
> return true;
> }
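
For reference, here is a minimal sketch of a stricter variant of the 
quoted function.  The quoted loop only checks that each continuation 
byte matches 10bbbbbb, so this version also decodes the code point and 
rejects overlong forms, surrogates, and values above U+10FFFF, which is 
my reading of what RFC 3629 requires.  The name isStrictUtf8 and the 
variables $min and $cp are hypothetical, not from Mikko's message.

```php
<?php
// Sketch only: a stricter take on the quoted isValidUTF8String().
// isStrictUtf8 is a hypothetical name chosen for illustration.
function isStrictUtf8($s)
{
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        $b = ord($s[$i]);
        if ($b < 0x80) {
            continue;                                 // 0bbbbbbb: ASCII
        } elseif (($b & 0xE0) === 0xC0) {             // 110bbbbb
            $n = 1; $min = 0x80;    $cp = $b & 0x1F;
        } elseif (($b & 0xF0) === 0xE0) {             // 1110bbbb
            $n = 2; $min = 0x800;   $cp = $b & 0x0F;
        } elseif (($b & 0xF8) === 0xF0) {             // 11110bbb
            $n = 3; $min = 0x10000; $cp = $b & 0x07;
        } else {
            return false;  // stray continuation byte or 5/6-byte lead
        }
        // Consume $n continuation bytes (10bbbbbb), building the code point.
        for ($j = 0; $j < $n; $j++) {
            if (++$i === $len) {
                return false;                         // truncated sequence
            }
            $c = ord($s[$i]);
            if (($c & 0xC0) !== 0x80) {
                return false;                         // not a continuation byte
            }
            $cp = ($cp << 6) | ($c & 0x3F);
        }
        // $cp < $min means a shorter encoding existed (overlong form).
        if ($cp < $min || $cp > 0x10FFFF
            || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
            return false;
        }
    }
    return true;
}
```

With this sketch, "\xC3\xA9" (UTF-8 "é") passes, while a lone "\xE9" 
(Latin-1 "é") and the overlong pair "\xC0\x80" are both rejected.  Where 
the mbstring extension is available, mb_check_encoding($s, 'UTF-8') may 
be a simpler alternative.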

-- 
Allen Shaw
Polymer (http://polymerdb.org)


[nycphp-talk] (off-list) Re: enforcing Latin-1 input