HTML: form does not send UTF-8 input

I have gone through the existing questions about UTF-8 encoding in HTML, and nothing seems to make it work as expected.

I added the meta tag: nothing has changed.
I added the accept-charset attribute to the form: nothing has changed.


JSP file

    <%@ page pageEncoding="UTF-8" %>
    <%@ taglib uri="http://java.sun.com/jsp/jstl/core" prefix="c" %>
    <!DOCTYPE html>
    <html>
    <head>
        <meta charset="UTF-8" />
        <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
        <title>Editer les sous-titres</title>
    </head>
    <body>
        <form method="post" action="/Subtitlor/edit" accept-charset="UTF-8">
            <h3 name="nameOfFile"><c:out value="${ nameOfFile }"/></h3>
            <input type="hidden" name="nameOfFile" id="nameOfFile" value="${ nameOfFile }"/>
            <c:if test="${ !saved }">
                <input value ="Enregistrer le travail" type="submit" style="position:fixed; top: 10px; right: 10px;" />
            </c:if>
            <a href="/Subtitlor/" style="position:fixed; top: 50px; right: 10px;">Retour à la page d'accueil</a>
            <c:if test="${ saved }">
                <div style="position:fixed; top: 90px; right: 10px;">
                    <c:out value="Travail enregistré dans la base de donnée"/>
                </div>
            </c:if>
            <table border="1">
                <c:if test="${ !saved }">
                    <thead>
                        <th style="weight:bold">Original Line</th>
                        <th style="weight:bold">Translation</th>
                        <th style="weight:bold">Already translated</th>
                    </thead>
                </c:if>
                <c:forEach items="${ subtitles }" var="line" varStatus="status">
                    <tr>
                        <td style="text-align:right;"><c:out value="${ line }" /></td>
                        <td><input type="text" name="line${ status.index }" id="line${ status.index }" size="35" /></td>
                        <td style="text-align:right"><c:out value="${ lines[status.index].content }"/></td>
                    </tr>
                </c:forEach>
            </table>
        </form>
    </body>
    </html>

Servlet

    for (int i = 0; i < 2; i++) {
        System.out.println(request.getParameter("line" + i));
    }

Output

    Et ton pÃ¨re et sa soeur
    Il ne sera jamais parti.
+7
6 answers

I added the meta tag: nothing has changed.

This indeed has no effect when the page is served over HTTP rather than, for example, from the local disk file system (i.e. the page's URL is http://... instead of file://...). Over HTTP, the charset in the HTTP response header is used. You have already set it as shown below:

 <%@page pageEncoding="UTF-8"%> 

This not only writes the HTTP response using UTF-8, but also sets the charset attribute in the Content-Type response header.

The web browser uses that charset to interpret the response and to encode any HTML form parameters.
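To see why the page charset matters for form submission, here is a minimal sketch of how the same value ends up as different bytes on the wire depending on the charset used. The class and method names are only illustrative; `URLEncoder` stands in for what the browser does when it builds an application/x-www-form-urlencoded body.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Illustrative helper, not part of the original application.
class FormEncodingDemo {

    // Percent-encodes a value the way a form body is encoded for the given charset.
    static String encode(String value, String charset) throws UnsupportedEncodingException {
        return URLEncoder.encode(value, charset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String value = "p\u00E8re"; // "père"
        System.out.println(encode(value, "UTF-8"));      // p%C3%A8re
        System.out.println(encode(value, "ISO-8859-1")); // p%E8re
    }
}
```

The server has to decode with the same charset the browser encoded with, otherwise the two bytes C3 A8 are read as two separate characters.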


I added the accept-charset attribute to the form: nothing has changed.

It has an effect only in Microsoft Internet Explorer, and even then it does it wrong. Never use it. All real web browsers use the charset attribute specified in the Content-Type header of the response instead. Even MSIE does it right as long as you do not specify the accept-charset attribute. As said before, you have already set it properly via pageEncoding.


Get rid of both the meta tag and the accept-charset attribute. They have no useful effect; they will only confuse you in the long run and even make things worse when the end user uses MSIE. Just stick to pageEncoding. Instead of repeating pageEncoding in every JSP page, you can also set it globally in web.xml, as shown below:

    <jsp-config>
        <jsp-property-group>
            <url-pattern>*.jsp</url-pattern>
            <page-encoding>UTF-8</page-encoding>
        </jsp-property-group>
    </jsp-config>

As said, this tells the JSP engine to write the HTTP response using UTF-8 and to set it in the HTTP response header as well. The web browser will use the same charset to encode the HTTP request parameters before sending them back to the server.

The only missing step is to tell the server that it must use UTF-8 to decode the HTTP request parameters before returning them from getParameterXxx() calls. How to do that globally depends on the HTTP request method. Given that you are using the POST method, this is relatively easy to achieve with the servlet filter class below, which automatically hooks into all requests:

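    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.annotation.WebFilter;

    @WebFilter("/*")
    public class CharacterEncodingFilter implements Filter {

        @Override
        public void init(FilterConfig config) throws ServletException {
            // NOOP.
        }

        @Override
        public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
                throws IOException, ServletException {
            // Must run before anything reads the request parameters.
            request.setCharacterEncoding("UTF-8");
            chain.doFilter(request, response);
        }

        @Override
        public void destroy() {
            // NOOP.
        }
    }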

That's all. In Servlet 3.0+ (Tomcat 7 and newer) you do not need any extra web.xml configuration.

You only need to keep in mind that setCharacterEncoding() must be called before the POST request parameters are obtained for the first time through any of the getParameterXxx() methods. This is because they are parsed only once, on first access, and then cached in server memory.

So, for example, the sequence below is wrong:

    String foo = request.getParameter("foo"); // Wrong encoding.
    // ...
    request.setCharacterEncoding("UTF-8"); // Attempt to set it.
    String bar = request.getParameter("bar"); // STILL wrong encoding!

Doing the setCharacterEncoding() job in a servlet filter guarantees that it runs in time (at least, before any servlet).
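The parse-once-then-cache behaviour can be illustrated without a servlet container. The sketch below uses a hypothetical `LazyRequest` class that only mimics the idea; it is not how any particular server is implemented, it skips percent-decoding, and it assumes ISO-8859-1 as the default when no encoding has been set.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a servlet request; not a real container class.
class LazyRequest {

    private final byte[] body;                              // raw request body bytes
    private Charset encoding = StandardCharsets.ISO_8859_1; // assumed default when none is set
    private Map<String, String> parameters;                 // null until first access

    LazyRequest(byte[] body) {
        this.body = body;
    }

    void setCharacterEncoding(Charset encoding) {
        this.encoding = encoding;
    }

    String getParameter(String name) {
        if (parameters == null) {
            // Parsed exactly once, with whatever encoding is set at this moment.
            parameters = new HashMap<>();
            for (String pair : new String(body, encoding).split("&")) {
                String[] keyValue = pair.split("=", 2);
                parameters.put(keyValue[0], keyValue.length > 1 ? keyValue[1] : "");
            }
        }
        return parameters.get(name);
    }

    public static void main(String[] args) {
        byte[] body = "foo=\u00E8&bar=\u00E9".getBytes(StandardCharsets.UTF_8);

        LazyRequest late = new LazyRequest(body);
        System.out.println(late.getParameter("foo"));       // decoded as ISO-8859-1
        late.setCharacterEncoding(StandardCharsets.UTF_8);  // too late, already cached
        System.out.println(late.getParameter("bar"));       // still decoded as ISO-8859-1

        LazyRequest early = new LazyRequest(body);
        early.setCharacterEncoding(StandardCharsets.UTF_8); // set before first access
        System.out.println(early.getParameter("foo"));
        System.out.println(early.getParameter("bar"));
    }
}
```

In the `late` case both parameters come back garbled even though the encoding was changed before `bar` was read, which is the situation a filter avoids.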


If you also want to instruct the server to decode GET (not POST) request parameters using UTF-8 (the parameters you see after the ? character in the URL), then you basically need to configure it on the server side. It is not possible to configure this via the Servlet API. If you are using, for example, Tomcat as the server, then it is a matter of adding the attribute URIEncoding="UTF-8" to the <Connector> element in Tomcat's own /conf/server.xml.
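As a rough illustration of what the connector setting changes, the sketch below decodes the same percent-encoded query value with two charsets. `URLDecoder` is used here only as a stand-in for the server's own query string decoding, and the class name is made up for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Illustrative helper, not part of the original application.
class QueryDecodingDemo {

    // Decodes a percent-encoded query value using the given charset.
    static String decode(String raw, String charset) throws UnsupportedEncodingException {
        return URLDecoder.decode(raw, charset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String raw = "p%C3%A8re"; // "père" as a browser sends it from a UTF-8 page
        System.out.println(decode(raw, "UTF-8"));      // père
        System.out.println(decode(raw, "ISO-8859-1")); // pÃ¨re
    }
}
```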

If you still see mojibake in the console output of System.out.println() calls, then chances are that stdout itself is not configured to use UTF-8. How to do that depends on who is responsible for interpreting and presenting stdout. If you are using, for example, Eclipse as the IDE, then it is a matter of setting Window > Preferences > General > Workspace > Text File Encoding to UTF-8.
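To rule out the JVM side of the console problem while debugging, one option is to write through a PrintStream that is explicitly set to UTF-8. This is only a sketch with a made-up class name; the console or IDE that displays the bytes still has to interpret them as UTF-8.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Illustrative helper, not part of the original application.
class Utf8Console {

    // Wraps a target stream so that text is always written as UTF-8 bytes.
    static PrintStream utf8(OutputStream target) throws UnsupportedEncodingException {
        return new PrintStream(target, true, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        PrintStream out = utf8(System.out);
        out.println("p\u00E8re"); // written as UTF-8 regardless of the platform default
    }
}
```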


+26

Warm up

Let me start with a universal fact we all know: a computer understands nothing but bits, 0s and 1s.

Now, when you submit an HTML form over HTTP and the values travel over the wire to reach the destination server, what is actually being passed is a whole lot of bits, 0s and 1s.

  • Before sending data to the server, the HTTP client (browser, curl, etc.) encodes it using some encoding scheme and expects the server to decode it using the same scheme, so that the server knows exactly what the client sent.
  • Before sending a response to the client, the server encodes it using some encoding scheme and expects the client to decode it using the same scheme, so that the client knows exactly what the server sent.

An analogy: I send you a letter and tell you whether it is written in English, French or Dutch, so that you get exactly the message I intended to send. And when replying, you likewise tell me in which language I should read your answer.

The important takeaway is that data is encoded when it leaves the client and must be decoded the same way on the server side, and vice versa. If you do not specify anything, the content is encoded as application/x-www-form-urlencoded before it goes from the client to the server.

Core concept

Reading the warm-up is important. There are a few things you need to get right to obtain the expected results.

  • The correct encoding set before the data is sent from the client to the server.
  • The correct decoding and encoding set on the server side to read the request and write the response back to the client (this is the reason you did not get the expected results).
  • Make sure the same encoding scheme is used everywhere. It must not happen that the client encodes with ISO-8859-1 while the server decodes with UTF-8, otherwise you get garbage (in terms of my analogy: I write to you in English and you read it as French).
  • The correct encoding set for your log viewer, if you are checking log output in the Windows command line, the Eclipse log viewer, etc. (this was a contributing reason for your problem, but not the primary one, because first of all the data read from the request object was decoded incorrectly; the encoding of Windows cmd or the Eclipse log viewer also matters).
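The mismatch described in the list above can be reproduced in a few lines. This is only a sketch with an illustrative class name: the "client" turns text into bytes with one charset and the "server" turns the bytes back into text with another.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative helper, not part of the original application.
class CharsetMismatchDemo {

    // Encodes text with the client's charset, then decodes the bytes with the server's charset.
    static String roundTrip(String text, Charset clientCharset, Charset serverCharset) {
        byte[] wire = text.getBytes(clientCharset);
        return new String(wire, serverCharset);
    }

    public static void main(String[] args) {
        String text = "\u00E9"; // "é"
        // Same scheme on both sides: the text survives.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8));
        // Client UTF-8, server ISO-8859-1: two wrong characters instead of one.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
        // Client ISO-8859-1, server UTF-8: the byte is not valid UTF-8 on its own.
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));
    }
}
```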

The correct encoding set before the data is sent from the client to the server

There are several ways to ensure this, but I would say use the HTTP Accept-Charset request header field. According to the code snippet you provided, you are already using it, and using it correctly, so you are fine on this front.

Some people will tell you not to use it, or that it is not implemented, but I would very humbly disagree with them. Accept-Charset is part of the HTTP 1.1 specification (I provided the link), and a browser that implements HTTP 1.1 will implement it. They may also argue that you should use the charset attribute of the Accept request-header field instead, but

  • Actually, it is not there; check the link for the Accept request-header field that I provided.

I am giving you data and facts, not just words, but if you are still not convinced, run the following tests with different browsers.

  • Set accept-charset="ISO-8859-1" in your HTML form and submit the form via POST/GET to the server with Chinese or accented French characters.
  • On the server, decode the data using UTF-8.
  • Now repeat the same test with the client and server encodings swapped.

You will see that you never get the expected characters on the server. But if you use the same encoding scheme on both sides, you will see the expected characters. So browsers do implement accept-charset, and it does take effect.

Having the correct decoding and encoding set on the server side to read the request and write the response back to the client

There are several ways to achieve this (some configuration may occasionally be needed for a specific scenario, but the options below cover 95% of cases and fit yours well). For example:

  • Use a character encoding filter to set the encoding on the request and response.
  • Use setCharacterEncoding on the request and response.
  • Configure the web server or application server for the correct character encoding with -Dfile.encoding=utf8 etc.
  • Etc.

My favorite, and the one that will solve your problem, is the character encoding filter, for the following reasons:

  • All the encoding handling logic is in one place.
  • You have full control through configuration: change it in one place and everyone is happy.
  • You do not have to worry about some other code reading the request stream or flushing the response stream before you get the chance to set the character encoding.

1. Character encoding filter

You can do the following to implement your own character encoding filter. If you use a framework such as Spring, you do not need to write your own class; just configure the one it provides in web.xml.

The basic logic below is very similar to what Spring does, minus a lot of the dependency and bean-aware machinery that it adds.

web.xml (configuration)

    <filter>
        <filter-name>EncodingFilter</filter-name>
        <filter-class>com.sks.hagrawal.EncodingFilter</filter-class>
        <init-param>
            <param-name>encoding</param-name>
            <param-value>UTF-8</param-value>
        </init-param>
        <init-param>
            <param-name>forceEncoding</param-name>
            <param-value>true</param-value>
        </init-param>
    </filter>
    <filter-mapping>
        <filter-name>EncodingFilter</filter-name>
        <url-pattern>/*</url-pattern>
    </filter-mapping>

EncodingFilter (character encoding implementation class)

    package com.sks.hagrawal;

    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;

    public class EncodingFilter implements Filter {

        private String encoding = "UTF-8";
        private boolean forceEncoding = false;

        @Override
        public void doFilter(ServletRequest request, ServletResponse response, FilterChain filterChain)
                throws IOException, ServletException {
            // Set the request encoding before anything reads the parameters.
            request.setCharacterEncoding(encoding);
            if (forceEncoding) {
                // If forceEncoding is set, also set the response stream encoding.
                response.setCharacterEncoding(encoding);
            }
            filterChain.doFilter(request, response);
        }

        @Override
        public void init(FilterConfig filterConfig) throws ServletException {
            // Read the optional init parameters declared in web.xml.
            String encodingParam = filterConfig.getInitParameter("encoding");
            String forceEncodingParam = filterConfig.getInitParameter("forceEncoding");
            if (encodingParam != null) {
                encoding = encodingParam;
            }
            if (forceEncodingParam != null) {
                this.forceEncoding = Boolean.parseBoolean(forceEncodingParam);
            }
        }

        @Override
        public void destroy() {
            // Nothing to clean up.
        }
    }

2. ServletRequest.setCharacterEncoding()

This is essentially the same code as in the character encoding filter, but instead of doing it in a filter, you do it in your servlet or controller class.

The idea, again, is to call request.setCharacterEncoding("UTF-8"); to set the encoding of the HTTP request stream before you start reading it.

Try the code below, and you will see that if no filter sets the encoding on the request object, the first line logs null and the second logs "UTF-8".

    System.out.println("CharacterEncoding = " + request.getCharacterEncoding());
    request.setCharacterEncoding("UTF-8");
    System.out.println("CharacterEncoding = " + request.getCharacterEncoding());

Below is an important excerpt from the setCharacterEncoding Javadoc. One more thing to note: you must provide a valid encoding scheme, otherwise you will get an UnsupportedEncodingException.

Overrides the name of the character encoding used in the body of this request. This method must be called prior to reading request parameters or reading input using getReader(). Otherwise, it has no effect.
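Regarding the valid-encoding requirement mentioned above: if the encoding name comes from configuration, it can be checked up front with the standard `Charset.isSupported` call. This is just a sketch of such a pre-check with a made-up class name, not something the servlet container requires you to do.

```java
import java.nio.charset.Charset;

// Illustrative helper, not part of the original application.
class EncodingNameCheck {

    // Returns true when the JVM knows a charset with this name.
    static boolean isUsable(String name) {
        return Charset.isSupported(name);
    }

    public static void main(String[] args) {
        System.out.println(isUsable("UTF-8"));  // true
        System.out.println(isUsable("UTF-88")); // false: legal name, but no such charset
    }
}
```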

Wherever necessary, I have tried to point you to official links or accepted Stack Overflow answers, so that you can trust what I say.

+5

Based on the output you posted, it seems that the parameter is sent as UTF-8 and that the UTF-8 bytes of the string are later interpreted as ISO-8859-1.

The following snippet demonstrates your observed behavior.

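    // requires: import java.nio.charset.StandardCharsets;
    String eGrave = "\u00E8"; // the letter è
    System.out.printf("letter UTF8 : %s%n", eGrave);
    // Encode the letter as UTF-8 and show the two resulting bytes.
    byte[] bytes = eGrave.getBytes(StandardCharsets.UTF_8);
    System.out.printf("UTF-8 hex : %X %X%n", bytes[0], bytes[1]);
    // Decode the same bytes as ISO-8859-1, as the server apparently does.
    System.out.printf("letter ISO-8859-1: %s%n", new String(bytes, StandardCharsets.ISO_8859_1));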

Output

    letter UTF8 : è
    UTF-8 hex : C3 A8
    letter ISO-8859-1: Ã¨

To me, it looks like the form sends correctly UTF-8 encoded data, but this data is later not treated as UTF-8.
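One way to confirm this diagnosis is to reverse the wrong step: take the garbled string, get its ISO-8859-1 bytes back and decode them as UTF-8. The sketch below, with an illustrative class name, is meant as a diagnostic only; the real fix is to decode the request correctly in the first place.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper, not part of the original application.
class MojibakeRepair {

    // Undoes a UTF-8 string that was wrongly decoded as ISO-8859-1.
    static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String garbled = "p\u00C3\u00A8re";   // "pÃ¨re", as printed by the servlet
        System.out.println(repair(garbled));  // père
    }
}
```

If the repaired value matches what was typed into the form, the browser sent UTF-8 and the server decoded it as ISO-8859-1.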

Edit: some other things to try:

display the character encoding of your request

    System.out.println(request.getCharacterEncoding());

force UTF-8 to retrieve a parameter (untested, just an idea)

    request.setCharacterEncoding("UTF-8");
    ...
    request.getParameter(...);
+2

You can use ISO-based values for the charset and encoding declarations in your JSP code.

Like charset="ISO-8859-1" and pageEncoding="ISO-8859-1".

0

There is a bug in Tomcat that may bite you here. The first filter determines the encoding that is used for the request.

Any other filter or servlet after the first filter can no longer change the encoding of the request.

I do not think this bug will be fixed in the future, because existing applications may rely on the current behavior.

0

You can try writing this in the .jsp file:

    <%@ page language="java" contentType="text/html; charset=ISO-8859-1" pageEncoding="UTF-8" %>

This solved the problem for me.

0