[nycphp-talk] enforcing Latin-1 input
Allen Shaw
ashaw at polymerdb.org
Tue Nov 22 13:57:25 EST 2005
Mikko Rantalainen wrote:
> The problem is that you cannot reliably distinguish different 8-bit
> encodings from each other. Latin-1 (iso-8859-1) and Latin-9
> (iso-8859-15) text may contain identical byte sequences that stand for
> different content, so you have no way to know which one the user
> intended to use.
> Some 8-bit encodings have different *probabilities* for different
> byte sequences, so you could make an educated guess at which encoding
> the user agent really used. That would still be just a guess.
>
> The way I do it is that I send the HTML with UTF-8 encoding (I also
> have <form accept-charset="UTF-8" ...> in case some user agent
> supports that; most user agents just use the same encoding as the page
> containing the form) and I check that the user input is a valid UTF-8
> byte sequence. [snip...]
I'm very curious how you test this.
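One common way to run such a check is to attempt a strict decode and treat any failure as invalid input. The sketch below is in Python purely for illustration and is an assumption, not necessarily what Mikko does; the helper name is_valid_utf8 is hypothetical.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# "cafe" with an e-acute: two bytes for the accented letter in UTF-8,
# but the single byte 0xE9 in Latin-1, which is not valid UTF-8 by itself.
utf8_input = "caf\u00e9".encode("utf-8")
latin1_input = "caf\u00e9".encode("iso-8859-1")

print(is_valid_utf8(utf8_input))
print(is_valid_utf8(latin1_input))

# The same trick cannot separate Latin-1 from Latin-9: byte 0xA4 is legal
# in both, but is the currency sign in one and the euro sign in the other.
same_byte = b"\xa4"
print(same_byte.decode("iso-8859-1") != same_byte.decode("iso-8859-15"))
```

Note that pure ASCII input passes this check too, since ASCII is a subset of UTF-8, so passing only means the bytes are well-formed UTF-8, not that the user agent definitely sent UTF-8.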
Also, I'm continuing to read more on all of this (and cripes, there's a
lot to read...), but just so I don't lose momentum here, I want to ask
what you think of this half-baked idea:
A form on a document with iso-8859-1 encoding will apparently (according
to a few quick tests) encode its user input into Latin-1 also. If I put
something else in there, say that Japanese string I gave you, it gets
encoded into
"&#22823;&#38442;&#24066;&#28010;&#36895;&#21306;&#12398;&#12510;
&#12531;&#12471;&#12519;&#12531;"
So, if I can find user input matching a regex pattern like '&#\d+;', I
know the user is either intentionally typing HTML numeric entities into
my form, or trying to enter some non-Latin-1 characters. Of course,
what to do with that information is an app-level question, but the
question here is: is this really a valid test, or will I be getting
false positives and false negatives?
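A rough sketch of that test, in Python purely for illustration. The function names are hypothetical, and simulate_latin1_submission only approximates the browser behaviour seen in the quick tests above by using Python's "xmlcharrefreplace" error handler. The last case assumes a browser that treats an iso-8859-1 page as windows-1252, which many reportedly do.

```python
import re

# Decimal numeric character references, e.g. "&#22823;"
NUMERIC_ENTITY = re.compile(r"&#\d+;")


def has_numeric_entity(text: str) -> bool:
    """Return True if the submitted text contains a decimal entity."""
    return NUMERIC_ENTITY.search(text) is not None


def simulate_latin1_submission(text: str) -> bytes:
    """Approximate a Latin-1 form post: unencodable chars become &#NNNN;."""
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


# Non-Latin-1 input (first two characters of the Japanese string).
posted = simulate_latin1_submission("\u5927\u962a").decode("iso-8859-1")
print(posted, has_numeric_entity(posted))

# Plain Latin-1 input is not flagged.
print(has_numeric_entity("caf\u00e9"))

# False positive: the user typed the entity by hand; the posted bytes look
# exactly like an auto-converted character.
print(has_numeric_entity("&#169; 2005"))

# Possible false negative (assumption: the browser encodes as windows-1252):
# a euro sign arrives as byte 0x80, with no entity to match.
euro_bytes = "\u20ac".encode("cp1252")
print(has_numeric_entity(euro_bytes.decode("iso-8859-1")))
```

Under these assumptions the pattern does catch converted characters, but it cannot tell them apart from entities typed by hand, and it misses characters that the browser manages to squeeze into a single byte.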
Any thoughts?
--
Allen Shaw
Polymer (http://polymerdb.org)