Handling character encoding in Tomcat URIs

On the website I'm helping with, users can type a URL directly into the browser's address bar, for example one containing Chinese characters:

  http://localhost:8080?a=测试

On the server we receive:

  GET /?a=%E6%B5%8B%E8%AF%95 HTTP/1.1

As you can see, the value is encoded as UTF-8 and then URL-encoded (percent-encoded). We can handle this correctly by setting the URI encoding to UTF-8 in Tomcat.

However, some browsers sometimes send Latin-1 instead:

  http://localhost:8080?a=ß 

turns into

  GET /?a=%DF HTTP/1.1
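
To illustrate the mismatch, here is a small standalone sketch using `java.net.URLDecoder`. The class name `DecodeDemo` and its `decode` helper are made up for this illustration and are not part of Tomcat. It shows that the first value only decodes correctly as UTF-8, while the second only decodes correctly as ISO-8859-1.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class, not part of Tomcat or the site's code.
final class DecodeDemo {

    // Decodes a percent-encoded value using the given charset name.
    static String decode(String raw, String charset) {
        try {
            return URLDecoder.decode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // The six bytes E6 B5 8B E8 AF 95 are the UTF-8 form of the two Chinese characters.
        System.out.println(decode("%E6%B5%8B%E8%AF%95", "UTF-8"));
        // The single byte DF is the sharp s in ISO-8859-1 ...
        System.out.println(decode("%DF", "ISO-8859-1"));
        // ... but it is not a complete UTF-8 sequence, so this does not yield the sharp s.
        System.out.println(decode("%DF", "UTF-8"));
    }
}
```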

Is there any way to handle this correctly in Tomcat? It looks like the server has to apply some kind of heuristic. We don't expect to handle Latin-1 correctly 100% of the time, but anything would be better than what we do now, which is to assume that everything is UTF-8.

The server is Tomcat 5.5. Supported browsers are IE 6+, Firefox 2+, and Safari on the iPhone.

+11
java encoding tomcat internationalization servlets
Aug 05 '09 at 12:55
1 answer

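Unfortunately, UTF-8 encoding is only a "should" in the URI specification, not a requirement, so the server cannot rely on it. One heuristic is to examine the decoded bytes for valid UTF-8 sequences and fall back to Latin-1 when they are not valid. The Wikipedia entry on UTF-8 has a good table of valid and invalid bytes.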
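
One way to put that table of valid and invalid bytes to work is to decode strictly as UTF-8 and fall back to Latin-1 only when the bytes are not well-formed UTF-8. The following is a minimal sketch under that assumption, not a Tomcat feature. The class `QueryValueDecoder` and its methods are hypothetical names, and it expects the raw, still percent-encoded value (for example, a piece of `request.getQueryString()` that you have split yourself).

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper: guesses between UTF-8 and ISO-8859-1 for one query value.
final class QueryValueDecoder {
    private static final Charset UTF_8 = Charset.forName("UTF-8");
    private static final Charset LATIN_1 = Charset.forName("ISO-8859-1");

    // Turns a raw value such as "%DF" into the bytes the browser sent.
    // Assumes everything outside the %XX escapes is plain ASCII.
    static byte[] percentDecode(String raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '%') {
                if (i + 2 >= raw.length()) {
                    throw new IllegalArgumentException("Truncated escape in: " + raw);
                }
                int hi = Character.digit(raw.charAt(i + 1), 16);
                int lo = Character.digit(raw.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("Bad escape in: " + raw);
                }
                out.write((hi << 4) | lo);
                i += 2;
            } else if (c == '+') {
                out.write(' '); // form-style encoding of a space
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    // Strict UTF-8 first; if the bytes are malformed, reinterpret them as Latin-1.
    static String decode(String raw) {
        byte[] bytes = percentDecode(raw);
        CharsetDecoder strict = UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strict.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return new String(bytes, LATIN_1);
        }
    }
}
```

This is still guesswork: a Latin-1 string whose bytes happen to form a valid UTF-8 sequence will be read as UTF-8. For typical Western European text that is uncommon, since it needs an accented character followed by one of a narrow range of symbols.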

Less reliable would be to look at the "Accept-Charset" header in the request. I don't think this header is required (I haven't checked the HTTP specification to verify), and I know that Firefox, at least, sends a whole list of acceptable values. Picking the first value in the list might work, or it might not.
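
If you want to experiment with the header approach, a rough sketch could look like the following. `AcceptCharsetGuess` and `firstSupported` are hypothetical names, the header value would come from `HttpServletRequest.getHeader("Accept-Charset")`, and the sample value in the usage note is only illustrative. The sketch ignores q-values and simply takes the first charset the JVM supports, which is exactly the "might work, or might not" strategy described above.

```java
import java.nio.charset.Charset;

// Hypothetical helper: picks the first usable charset from an Accept-Charset header.
final class AcceptCharsetGuess {

    // Returns the first listed charset that this JVM supports, or the fallback
    // when the header is missing, empty, or lists only wildcards/unknown names.
    static Charset firstSupported(String header, Charset fallback) {
        if (header == null) {
            return fallback;
        }
        for (String part : header.split(",")) {
            // Drop any parameters such as ";q=0.7".
            int semi = part.indexOf(';');
            String name = (semi >= 0 ? part.substring(0, semi) : part).trim();
            if (name.length() == 0 || name.equals("*")) {
                continue;
            }
            try {
                if (Charset.isSupported(name)) {
                    return Charset.forName(name);
                }
            } catch (IllegalArgumentException e) {
                // Syntactically illegal charset name: skip it.
            }
        }
        return fallback;
    }
}
```

For example, `firstSupported("ISO-8859-1,utf-8;q=0.7,*;q=0.7", Charset.forName("UTF-8"))` returns the ISO-8859-1 charset. Keep in mind that the header describes what the browser accepts in responses, so it is only a weak hint about how the URL was encoded.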

Finally, have you done any log analysis to see whether a particular user agent consistently uses the same encoding?

+5
Aug 05 '09 at 13:49