Is a Unicode user agent legal in an HTTP header?

The application that I support loads user agents retrieved from weblogs into a column of a MySQL table using the encoding "latin1". Sometimes it cannot load a user agent that looks like this:

Mozilla/5.0 (Iâ?; CPU iPhone OS 5_0_1 like Mac OS X) AppleWebKit/534.46 (KHTML^C like Gecko) Version

I suspect it is choking on Iâ? . I am trying to find out whether this should be supported, or whether the corruption is introduced by the upstream logging system. Is this a legitimate user agent in an HTTP header?

+7
3 answers

RFC 2616 (HTTP 1.1) says that the content of a message header field consists of "either *TEXT or combinations of token, separators, and quoted-string." If you look at the definitions of TEXT, etc., you will find that the legal characters are those with byte values not in the range [0, 31] and not equal to 127. So, as far as I can tell, characters such as â are legal according to the specification.
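A minimal Python sketch of that check, assuming the header value is available as raw bytes. The function name `is_legal_text` is made up for illustration. Horizontal tab is allowed here because RFC 2616 counts it as linear whitespace; CRLF line folding is deliberately not handled.

```python
def is_legal_text(value: bytes) -> bool:
    """Return True if every octet fits RFC 2616's TEXT rule (simplified)."""
    for octet in value:
        # Horizontal tab (9) is a control character, but it is permitted
        # as part of linear whitespace. Header folding is ignored here.
        if octet == 9:
            continue
        # CTL = octets 0-31 and DEL (127); these are not allowed in TEXT.
        if octet < 32 or octet == 127:
            return False
    return True
```

For example, `is_legal_text(b"Mozilla/5.0 (I\xe2?; CPU iPhone OS 5_0_1)")` returns `True`, because the octet 0xE2 (â in ISO-8859-1) is above 127 and therefore not a control character. A value containing a raw control octet such as 0x03 would be rejected.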

+13

Technically, octets > 127 are allowed in comments. RFC 2616 makes them ISO-8859-1 by default, but HTTPbis (the upcoming revision of RFC 2616) has removed this rule so that, sometime in the distant future, we might move on to a reasonable encoding.

Recommendation: strip all octets > 127.
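Reading this recommendation as "remove every octet above 127 before storing the value", a rough Python sketch could look like the following. The function name `strip_high_octets` is hypothetical, and the input is assumed to be the raw header bytes from the log.

```python
def strip_high_octets(raw: bytes) -> str:
    """Drop every octet above 127 and return the remaining ASCII text."""
    kept = bytes(octet for octet in raw if octet <= 127)
    # Only octets 0-127 remain, so decoding as ASCII cannot fail.
    return kept.decode("ascii")
```

With the user agent from the question, `strip_high_octets(b"Mozilla/5.0 (I\xe2?; CPU)")` yields `"Mozilla/5.0 (I?; CPU)"`. This loses information, but the result loads into any ASCII-compatible column.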

+3

HTTP 1.1 (RFC 2616) refers to ISO-8859-1, which is a single-byte character set based on the Latin alphabet.

Given that HTTP traffic should be single-byte, I also use the latin1 character set for my similar logs. I chose it simply to keep my indexes smaller.

If you use UTF8 with VARCHAR, only characters that are actually multibyte take extra bytes, so the table itself needs little additional space. However, indexes are stored at a fixed width, so they are padded out in case the extra bytes are needed (UTF8 indexes are three times larger than latin1 indexes).
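As a back-of-the-envelope illustration of that factor of three, here is a small Python calculation. It assumes MySQL's `utf8` reserves up to 3 bytes per character and `latin1` reserves 1, ignores any length-prefix bytes, and uses a hypothetical 255-character column.

```python
# Maximum bytes reserved per character for each character set
# (MySQL's "utf8" is the 3-byte variant; latin1 is single-byte).
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3}

def max_key_bytes(column_chars: int, charset: str) -> int:
    """Worst-case bytes for one index entry on a column of this length."""
    return column_chars * MAX_BYTES_PER_CHAR[charset]

latin1_bytes = max_key_bytes(255, "latin1")
utf8_bytes = max_key_bytes(255, "utf8")
print(latin1_bytes, utf8_bytes)  # prints: 255 765
```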

It does not bother me if the occasional odd header is unreadable. However, if you do not index the column, you may as well use UTF8.

+2
