[nycphp-talk] enforcing Latin-1 input
Mikko Rantalainen
mikko.rantalainen at peda.net
Wed Nov 23 10:39:07 EST 2005
Allen Shaw wrote:
> Mikko Rantalainen wrote:
>
>>The problem is that you cannot accurately distinguish different 8 bit
>>encodings from each other. Latin-1 (iso-8859-1) and Latin-9
>>(iso-8859-15) text may contain identical byte sequences yet represent
>>different content, so you have no way to know which encoding the user
>>intended to use.
>
>>Some 8 bit encodings have different *probabilities* for different
>>byte sequences and you could make an educated guess which encoding
>>the user agent really used. That would still be just a guess.
>>
>>The way I do it is that I send the html with UTF-8 encoding (I also
>>have <form accept-charset="UTF-8" ...> in case some user agent
>>supports that; most user agents just use the same encoding as the page
>>with the form) and I check that the user input is a valid UTF-8
>>byte sequence. [snip...]
>
> I'm very curious how you test this.
I'm using the following function:
function isValidUTF8String($Str)
{
    # A correct UTF-8 stream has every character starting either with a
    # zero high bit (plain US-ASCII) or with a first byte whose
    # <length of encoding> high bits are set; all following bytes of
    # the character have their highest bits set to 10.
    for ($i = 0; $i < strlen($Str); $i++)
    {
        if (ord($Str[$i]) < 0x80) continue;              # 0bbbbbbb
        else if ((ord($Str[$i]) & 0xE0) == 0xC0) $n = 1; # 110bbbbb
        else if ((ord($Str[$i]) & 0xF0) == 0xE0) $n = 2; # 1110bbbb
        else if ((ord($Str[$i]) & 0xF8) == 0xF0) $n = 3; # 11110bbb
        else if ((ord($Str[$i]) & 0xFC) == 0xF8) $n = 4; # 111110bb
        else if ((ord($Str[$i]) & 0xFE) == 0xFC) $n = 5; # 1111110b
        else return false;                               # invalid lead byte
        # Verify that $n continuation bytes matching the bit pattern
        # 10bbbbbb follow. Note that this check alone does not reject
        # "overlong UTF-8 encodings" (a character encoded with more
        # bytes than necessary), which are not allowed either; a
        # strict validator would also check the decoded value's range.
        for ($j = 0; $j < $n; $j++)
            if ((++$i == strlen($Str))
                || ((ord($Str[$i]) & 0xC0) != 0x80))
                return false;
    }
    # Couldn't find errors; it's probably valid UTF-8 data.
    return true;
}
I put all user input through that function, and if the input isn't a
valid UTF-8 string, then all of it goes through a
Latin-1 -> UTF-8 conversion (utf8_encode() helps here).
You could do additional checking for encodings other than Latin-1 when
the UTF-8 test fails, but I don't think it's worth the effort.
> Also, I'm continuing to read more on all of this (and cripes, there's a
> lot to read...), but just so I don't lose momentum here, I want to ask
> what you think of this half-baked idea:
>
> A form on a document with iso-8859-1 encoding will apparently (according
> to a few quick tests) encode its user input into Latin-1 also. If I put
> something else in there, say that Japanese string I gave you, it gets
> encoded into
> "&#22823;&#38442;&#24066;&#28010;&#36895;&#21306;&#12398;&#12510;&#12531;&#12471;&#12519;&#12531;"
You cannot trust that behavior. The specification only says (IIRC)
that the user agent MUST NOT send characters outside iso-8859-1 on
such a form. MSIE is known to automagically convert from its internal
character mapping to this SGML entity representation, but there's one
major problem with it -- it doesn't differentiate in any way between
data the user inputted verbatim and the result of automatic
conversion by the user agent. The only upside is that numeric
entities always use only US-ASCII, and that should always be safe.
The problem that you cannot differentiate between user input and the
automatic conversion is just one reason why this conversion is not a
good idea. Another is that the user agent has no idea whether the
inputted content is going to be printed through HTML. If it goes into
a database and then gets printed on a ticket, for example, the user
will be really surprised when he sees such code inside his surname.
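For completeness, decoding such entity soup on the server side is
simple, but note that it cannot undo the ambiguity just described --
a literal "&#22823;" typed by the user decodes exactly the same way.
The helper name here is mine:

```php
# Hypothetical helper: turn MSIE-style numeric character references
# (&#NNNN;) into real UTF-8 characters. html_entity_decode() with a
# UTF-8 target charset does this (it also decodes named entities).
function decodeNumericEntities($s)
{
    return html_entity_decode($s, ENT_QUOTES, 'UTF-8');
}
```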
User agents that behave according to the spec are expected to send a
literal "?" for every character that cannot be sent with the current
encoding, or to prompt the user to decide what to do, or to disallow
input of any character at all that cannot be transferred.
The only reasonably safe way to get the input to the server the way
the user intended is to use a UTF-8 encoded form. If somebody knows a
still better way, I'd be interested to hear about it, too.
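In practice that means serving the page with a UTF-8 charset in the
Content-Type header and declaring the same on the form. A minimal
sketch (the action and field names are illustrative):

```
Content-Type: text/html; charset=UTF-8

<form method="post" action="handle.php" accept-charset="UTF-8">
  <input type="text" name="surname" />
  <input type="submit" />
</form>
```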
As a side note, I'd like to add that in some sources it has been
suggested that one should embed a hidden field that contains a known
payload and the server then examines how that payload has been
encoded. For example:
<input type="hidden" name="test" value="x&#12531;x&#228;x&#28010;x" />
Note that the user agent is *supposed* to convert the above
numerical references to real characters and then submit those
characters with the encoding it *really* uses. The server then
proceeds to check whether the value of test is "xンxäx浪x". If not,
it proceeds to test whether the value matches any other (incorrect)
encoding of that string it knows how to fix.
In the real world, user agents that don't support UTF-8 usually don't
know how to represent characters outside iso-8859-1 even
incorrectly.
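A sketch of the server-side half of that payload check, assuming the
form contained the hidden field shown above. The function name and the
return values are mine, and the Latin-1 branch is just one plausible
mangling to test for:

```php
# Hypothetical check of the hidden "test" field the browser submitted.
function detectFormEncoding($testValue)
{
    # "x&#12531;x&#228;x&#28010;x" decoded and encoded as UTF-8:
    $expected = "x\xE3\x83\xB3x\xC3\xA4x\xE6\xB5\xAAx";   # xンxäx浪x
    if ($testValue === $expected)
        return 'UTF-8';
    # What a Latin-1 browser would plausibly send instead: "ä" as the
    # single byte E4, characters outside Latin-1 replaced by "?"
    # (utf8_decode() performs exactly that mapping).
    if ($testValue === utf8_decode($expected))
        return 'ISO-8859-1';
    return 'unknown';
}
```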
I guess what I'm trying to tell you is that to *force* iso-8859-1
input only, you're going to have to use UTF-8 for the form and
you're going to have to use UTF-8 internally. That's the only way
you can really get, in iso-8859-1 encoding, the same data the user
really tried to input.
--
Mikko