[nycphp-talk] enforcing Latin-1 input (follow-up)
Allen Shaw
ashaw at polymerdb.org
Tue Nov 29 16:16:34 EST 2005
Hi List,
With much thanks to Mikko for his help, I've finally figured this out
enough to come up with a way to test for valid Latin-1 input. If I am
right about this, then the following modification to Mikko's code will
report whether or not a particular string, assumed to be UTF-8 encoded,
is within the Latin-1 character set:
-----------8<-----------
function isValidLatin1String($Str) {
    $latinHex = array ('20', // (space)
                       '21', // !
                       '22', // "
                       // snip for brevity...
                       'c3bf' // ÿ
                       );
    # While checking for valid UTF-8 stream, compile each character
    # as hex codes and match with latinHex array;
    # correct UTF-8 stream has every character starting with zero bit
    # or first byte has <length of encoding> high bits set and all
    # following bytes have highest bits set to 10.
    for ($i=0; $i<strlen($Str); $i++)
    {
        # 0bbbbbbb: ASCII byte, accepted without a table lookup
        if (ord($Str[$i]) < 0x80) continue;
        else if ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1; # 110bbbbb
        else if ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2; # 1110bbbb
        else if ((ord($Str[$i]) & 0xF8) == 0xF0) $n=3; # 11110bbb
        else if ((ord($Str[$i]) & 0xFC) == 0xF8) $n=4; # 111110bb
        else if ((ord($Str[$i]) & 0xFE) == 0xFC) $n=5; # 1111110b
        else return false; # invalid byte
        # verify that n continuation bytes matching bit sequence 10bbbbbb
        # follow; a truncated sequence or a non-continuation byte means
        # the input is not valid UTF-8. Overlong encodings are not
        # detected here, but they never match $latinHex below.
        $char = bin2hex($Str[$i]);
        for ($j=0; $j<$n; $j++) {
            # check the bounds before reading the next byte
            if ((++$i == strlen($Str))
                || ((ord($Str[$i]) & 0xC0) != 0x80)) {
                return false;
            }
            $char .= bin2hex($Str[$i]);
        }
        if (!in_array($char, $latinHex)) {
            return false;
        }
    }
    # couldn't find errors, it's probably valid Latin-1 data.
    return true;
}
-----------8<-----------
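Since the $latinHex table above is snipped, here is a rough sketch of how
such a table could be generated instead of typed by hand. The function name
buildLatinHex() is made up for illustration, and the ranges (0x20-0x7E and
0xA0-0xFF, i.e. printable Latin-1 only) are my assumption about what the
full table contains. It also assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper, not part of the original function: builds the
// hex lookup table for printable Latin-1 characters.
// Assumption: the table covers 0x20-0x7E and 0xA0-0xFF only.
function buildLatinHex() {
    $table = array();
    foreach (array_merge(range(0x20, 0x7E), range(0xA0, 0xFF)) as $cp) {
        // chr($cp) is the single Latin-1 byte; convert it to UTF-8
        // (one byte for ASCII, two bytes for 0xA0-0xFF).
        $utf8 = mb_convert_encoding(chr($cp), 'UTF-8', 'ISO-8859-1');
        // Store the UTF-8 bytes as a lowercase hex string, e.g. 'c3bf'.
        $table[] = bin2hex($utf8);
    }
    return $table;
}
```

The result could then replace the hand-written array, e.g.
$latinHex = buildLatinHex(); at the top of the function.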
Initial testing seems to confirm what I think I've figured out
already. Thanks to Mikko for lots of clues and advice from his own
experience.
Mikko Rantalainen wrote:
> But the problem is that unless you're using UTF-8, you cannot always
> identify between iso-8859-1 and say windows-1255. The safest way I
> can think about is to require UTF-8 encoding and then check that the
> real data I'm getting only uses characters that can be represented
> with iso-8859-1.
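For comparison, the same idea (require UTF-8, then check that every
character fits in iso-8859-1) might also be sketched with a regular
expression. This is not the code from this thread, just an untested-in-
production alternative; the name isLatin1Representable() is made up, and
the accepted ranges (U+0020-U+007E and U+00A0-U+00FF) are my assumption
about what counts as valid input.

```php
<?php
// Hypothetical alternative, not the function posted above: let PCRE do
// the UTF-8 validation and the character range check in one step.
function isLatin1Representable($str) {
    // The /u modifier makes preg_match() treat the subject as UTF-8;
    // it returns false (rather than 0) when the subject is not valid
    // UTF-8, so invalid input fails the === 1 comparison.
    // \z anchors at the very end, so a trailing newline is not ignored.
    // Assumption: only printable Latin-1 code points are allowed.
    return preg_match('/^[\x{20}-\x{7E}\x{A0}-\x{FF}]*\z/u', $str) === 1;
}
```

With this sketch, "caf\xC3\xA9" (café in UTF-8) passes, while
"\xE2\x82\xAC" (the euro sign, outside Latin-1) and a bare "\xE9"
(a raw Latin-1 byte, not valid UTF-8) are both rejected.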
--
Allen Shaw
Polymer (http://polymerdb.org)