How is character encoding specified in a multipart/form-data HTTP POST request?

Question

The HTML 5 specification describes an algorithm for selecting the character encoding to be used in a multipart/form-data submission (e.g. UTF-8). However, it is unclear how the selected encoding should be relayed to the server so that the content can be decoded correctly on the receiving side.

Often, the character encoding is conveyed by appending a "charset" parameter to the value of the Content-Type request header. However, this parameter is not defined for the multipart/form-data MIME type:

https://tools.ietf.org/html/rfc7578#section-8
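
To make the point about the missing parameter concrete, here is a minimal sketch using Python's standard `email.message.Message` to read the charset parameter from a Content-Type value. The helper name `content_type_charset` and the boundary string are made up for illustration; real browsers generate their own boundaries.

```python
from email.message import Message


def content_type_charset(header_value):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


# A text/plain header can carry its encoding in the charset parameter.
print(content_type_charset("text/plain; charset=UTF-8"))

# A typical multipart/form-data header carries only a boundary (hypothetical value).
print(content_type_charset("multipart/form-data; boundary=----sketch"))
```

The first call yields the charset name in lowercase; the second yields `None`, which is the situation described above.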

Each part in a multipart form submission may provide its own Content-Type header; however, RFC 7578 notes that "in practice, many widely deployed implementations do not supply a charset parameter in each part, but rather, they rely on the notion of a "default charset" for a multipart/form-data instance."

RFC 7578 goes on to suggest that a hidden "_charset_" form field can be used for this purpose. However, neither Safari (9.1) nor Chrome (51) appears to populate this field, nor do they provide any per-part encoding information.

I looked at the request headers created by both browsers and I don't see any obvious character encoding information. Does anyone know how the browser passes this information to the server?

html post multipartform-data utf-8

Greg Brown Jun 23 '16 at 18:07

1 answer

Remy Lebeau · Accepted Answer · 2016-06-24T20:35:35+0000

HTML 5 uses RFC 2388 (obsoleted by RFC 7578); however, HTML 5 explicitly removes the Content-Type header from non-file fields, whereas the RFCs do not:

The parts of the generated multipart/form-data resource that correspond to non-file fields must not have a Content-Type header specified. Their names and values must be encoded using the character encoding selected above (field names in particular do not get converted to a 7-bit safe encoding as suggested in RFC 2388).

The RFCs are designed to allow multipart/form-data to be used in contexts other than HTML (although HTML is its most common use). In those other contexts, the Content-Type header is allowed. It is only disallowed in HTML 5 (it is allowed in HTML 4).

Without a Content-Type header, the hidden _charset_ form field, if present, is the only way an HTML 5 <form> submitter can explicitly state which encoding is used.
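
As a rough illustration of what such a submission looks like on the wire, the sketch below builds a multipart/form-data body for text fields only: no part gets a Content-Type header, names and values are encoded with the chosen encoding, and a field named _charset_ has its value replaced by the encoding name. The function name `build_form_body` and the boundary are hypothetical, and the sketch assumes field names contain no quotes or line breaks that would need escaping.

```python
def build_form_body(fields, encoding, boundary):
    """Encode text fields as multipart/form-data without per-part Content-Type.

    fields is a list of (name, value) string pairs. This is an illustrative
    sketch, not a complete encoder: file fields and name escaping are omitted.
    """
    delimiter = b"--" + boundary.encode("ascii")
    lines = []
    for name, value in fields:
        if name == "_charset_":
            # The sender fills this field with the name of the encoding it used.
            value = encoding
        lines.append(delimiter)
        lines.append(
            b'Content-Disposition: form-data; name="' + name.encode(encoding) + b'"'
        )
        lines.append(b"")  # blank line separates part headers from the part body
        lines.append(value.encode(encoding))
    lines.append(delimiter + b"--")  # closing delimiter
    lines.append(b"")
    return b"\r\n".join(lines)


body = build_form_body(
    [("_charset_", ""), ("city", "Z\u00fcrich")],
    encoding="iso-8859-1",
    boundary="sketch-boundary",
)
print(body.decode("iso-8859-1"))
```

Apart from the _charset_ part, nothing in the body says which encoding was used for the bytes of the "city" value.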

According to the HTML 5 algorithm specification, the selected encoding MUST be taken from the <form> element's accept-charset attribute, if present; otherwise it is the encoding used by the HTML document itself, if that encoding is ASCII-compatible; otherwise it is UTF-8. This is explicitly stated in the algorithm specification, as well as in RFC 7578 Section 5.1.2 when referring to HTML 5.
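
The selection order just described can be sketched as follows. This is an approximation under stated assumptions, not the specification's exact algorithm: `select_form_encoding` and `is_ascii_compatible` are hypothetical helpers, the ASCII-compatibility test is a crude check against Python's codec registry, and falling back to UTF-8 when accept-charset lists nothing usable is an assumption of this sketch. Returned names are Python codec names, which can differ from the labels browsers use.

```python
import codecs


def is_ascii_compatible(label):
    """Crude check: the codec exists and encodes ASCII characters to themselves."""
    try:
        return "Az09".encode(label) == b"Az09"
    except LookupError:
        return False


def select_form_encoding(accept_charset, document_encoding):
    """Pick the encoding for a form submission.

    accept_charset is the attribute value (space-separated labels) or None.
    """
    if accept_charset:
        for label in accept_charset.split():
            if is_ascii_compatible(label):
                return codecs.lookup(label).name
        return "utf-8"  # assumed fallback when no listed label is usable
    if is_ascii_compatible(document_encoding):
        return codecs.lookup(document_encoding).name
    return "utf-8"


print(select_form_encoding("utf-8", "iso-8859-1"))  # attribute takes precedence
print(select_form_encoding(None, "iso-8859-1"))     # document encoding is used
print(select_form_encoding(None, "utf-16"))         # not ASCII-compatible, so UTF-8
```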

Thus, there is no need for the web browser to state the encoding explicitly, since the receiver of the form submission should know which encoding to expect by virtue of how the <form> was created, and can therefore check for that encoding (or those encodings) when parsing the submission. If the receiver wants to know the specific encoding used, it has to include a _charset_ hidden field in the <form>.
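
On the receiving side, that reasoning might look like the sketch below. It assumes the multipart body has already been split into raw (name, value) byte pairs by some parser; `decode_fields` and its `expected_encodings` parameter are hypothetical names. A _charset_ field, when present and non-empty, is tried first; otherwise the encodings the receiver expects from its own <form> are tried in order.

```python
def decode_fields(raw_fields, expected_encodings=("utf-8",)):
    """Decode raw (name, value) byte pairs from a form submission.

    Returns (decoded_fields, encoding_used). Raises ValueError if no
    candidate encoding decodes every field.
    """
    declared = None
    for name, value in raw_fields:
        if name == b"_charset_":
            # The field name and an encoding label are plain ASCII.
            declared = value.decode("ascii", errors="replace").strip()
    candidates = ([declared] if declared else []) + list(expected_encodings)
    for encoding in candidates:
        try:
            decoded = {n.decode(encoding): v.decode(encoding) for n, v in raw_fields}
            return decoded, encoding
        except (LookupError, UnicodeDecodeError):
            continue  # unknown label or invalid bytes: try the next candidate
    raise ValueError("no candidate encoding could decode the submission")


raw = [(b"_charset_", b"iso-8859-1"), (b"city", "Z\u00fcrich".encode("iso-8859-1"))]
fields, used = decode_fields(raw)
print(used, fields["city"])
```

Note that single-byte encodings such as ISO-8859-1 accept any byte sequence, so a trial-decoding order only helps when stricter encodings such as UTF-8 are listed first.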
