Facebook charset detection engine?

Today I looked through the facebook.com HTML code and found something like this:

<input type="hidden" value="&euro;,&acute;,€,´,水,,Є" name="charset_test"/>

It appears twice inside the <form>...</form>.

Any idea what this code might be useful for? Perhaps some kind of server-side detection of the client's encoding? As far as I know, the browser's encoding is transmitted in the HTTP request anyway (the "Accept-Charset" header).

+6
html php facebook forms character-encoding
4 answers

Any idea what this code might be useful for? Perhaps some kind of server-side detection of the client's encoding?

Apparently so.

The Euro sign is useful for detecting a character set because there are so many ways to encode it:

  • E2 82 AC in UTF-8
  • 88 in windows-1251
  • 80 in the other windows-125x encodings
  • A4 in ISO-8859-7, -15 and -16
  • A2 E3 in GB18030
  • 85 40 in Shift-JIS
  • and so on.
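
The byte values in the list above can be checked with a few lines of Python. This is only a sketch: the codec names are Python's own aliases, the helper name `euro_bytes` is made up for illustration, and only a few of the encodings from the list are covered.

```python
# Show how the Euro sign (U+20AC) is encoded in several character sets.
EURO = "\u20ac"

# Python codec names; this selection is an assumption, not an exhaustive list.
ENCODINGS = ["utf-8", "cp1251", "cp1252", "iso8859_15", "gb18030"]


def euro_bytes(encoding):
    """Return the Euro sign's bytes in `encoding` as hex, or None if the
    encoding has no Euro sign at all."""
    try:
        encoded = EURO.encode(encoding)
    except UnicodeEncodeError:
        return None
    return " ".join(f"{byte:02X}" for byte in encoded)


for name in ENCODINGS:
    print(f"{name:12} {euro_bytes(name)}")
```

Because the same character yields such different byte sequences, the bytes a server receives for it say a lot about which encoding the browser actually used.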

As far as I know, the browser's encoding is transmitted in the HTTP request anyway (the "Accept-Charset" header).

It is supposed to be sent in the HTTP Content-Type header, but that does not mean user agents actually get it right.

+4

I think they compare this value in the receiving script to make sure the client sent the request correctly encoded as UTF-8. Since the script knows which characters to expect, it could perhaps even determine the actual encoding on the fly.

If I remember correctly - I had to deal with it once - in some cases there were problems with the encoding of the form in IE6.
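
A minimal sketch of such a server-side check, assuming the script has access to the raw bytes of the charset_test field. The function name, the candidate list and the way the raw bytes are obtained are all hypothetical; nothing here is known about Facebook's actual code.

```python
# Hypothetical sketch: guess the encoding of a submitted form by decoding the
# raw charset_test bytes with several candidate encodings and comparing the
# result with the value the server originally put into the page.

EXPECTED = "€,´,€,´,水,,Є"  # the field value as it appears in the question

# Candidate encodings to try, in order of preference (an assumption).
CANDIDATES = ["utf-8", "cp1252", "cp1251", "iso8859_15"]


def detect_form_encoding(raw, expected=EXPECTED, candidates=CANDIDATES):
    """Return the first candidate encoding under which `raw` decodes to
    `expected`, or None if none of them matches."""
    for name in candidates:
        try:
            if raw.decode(name) == expected:
                return name
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
    return None


print(detect_form_encoding(EXPECTED.encode("utf-8")))
```

A real implementation would also have to cope with characters the browser's encoding cannot represent, such as 水 in a single-byte Western encoding; this sketch simply reports no match in that case.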

+3
 &euro;,&acute;,€,´,水,,Є 

I assume some browsers send &euro; the same as € and &acute; the same as ´.

So they can check whether charset_test[0] == charset_test[2] and charset_test[1] == charset_test[3].

For the other characters, I have no idea. 水 probably checks CJK support.
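
The comparison suggested in this answer could look roughly like this, assuming the server has already decoded the submitted field into a string. The function name is made up for illustration.

```python
def entities_match_literals(charset_test):
    """Check that the first two items (sent as &euro; and &acute; in the HTML)
    came back identical to the third and fourth items (sent as literal
    characters)."""
    parts = charset_test.split(",")
    if len(parts) < 4:
        return False  # field truncated or mangled beyond recognition
    return parts[0] == parts[2] and parts[1] == parts[3]


print(entities_match_literals("€,´,€,´,水,,Є"))
```

If the two pairs differ, the literal characters were presumably damaged somewhere between the page and the submitted request.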

0

As Pekka says, this is so that the encoding of the request can be detected. HTTP does not provide a way to specify the encoding of the request, so you have to rely on conventions outside the protocol. Browsers are usually predictable, but this trick is the only way to be 100% sure.

See also: http://www.phpwact.org/php/i18n/charsets

0
