Does Nginx support raw unicode in paths?

Question


By default, browsers percent-encode Unicode characters in URLs as %XX sequences.

However, I can make a request with curl to http://localhost:8080/与, and nginx sees the path as "与". How is this possible? Does nginx allow arbitrary Unicode in this way?

For example, with this configuration, I can set an additional header to see what nginx saw:

 location ~* "(*UTF8)([^\w/\.\-\\% ])" { add_header "response" $1; return 200; }

Request:

 * Connected to localhost (127.0.0.1) port 8080 (#0)
 > GET /与 HTTP/1.1
 > User-Agent: curl/7.30.0
 > Host: localhost:8080
 > Accept: */*
 >
 < HTTP/1.1 200 OK
 * Server nginx/1.4.6 (Ubuntu) is not blacklisted
 < Server: nginx/1.4.6 (Ubuntu)
 < Date: Tue, 20 Jan 2015 21:44:51 GMT
 < Content-Type: application/octet-stream
 < Content-Length: 0
 < Connection: keep-alive
 < response: 与 <--- SEE THIS?
 <
 * Connection #0 to host localhost left intact

However, when I remove the UTF8 token, the header contains "?", as if nginx could not understand the character (or were only reading the first byte).

 location ~* "([^\w/\.\-\\% ])" { add_header "response" $1; return 200; }

Request:

 * Connected to localhost (127.0.0.1) port 8080 (#0)
 > GET /与 HTTP/1.1
 > User-Agent: curl/7.30.0
 > Host: localhost:8080
 > Accept: */*
 >
 < HTTP/1.1 200 OK
 * Server nginx/1.4.6 (Ubuntu) is not blacklisted
 < Server: nginx/1.4.6 (Ubuntu)
 < Date: Tue, 20 Jan 2015 21:45:35 GMT
 < Content-Type: application/octet-stream
 < Content-Length: 0
 < Connection: keep-alive
 < response: ?
 <
 * Connection #0 to host localhost left intact

Note: changing this non-UTF-8 regex to capture one or more characters, ([^...]+), also produces the header response: 与 (bytes versus multibyte strings?).

With the regex in place, the request is written to the log file like this:

 GET /\xE4\xB8\x8E HTTP/1.1

+7

url unicode nginx

Xeoncross Jan 20 '15 at 21:58


2 answers

Doesn't your own testing already answer your question?

Yes, nginx supports Unicode in paths.

As a point of discussion, nginx normalizes the URL before matching it against locations, as indicated in the documentation at http://nginx.org/r/location . This is why various "strange" requests (for example, those containing ../ , or those encoding ? as %3F , which makes it part of the file name rather than the start of the parameters known as $args ) can still be served by a location that does not look like a match at first glance.

This normalization can also explain why the "same" string is displayed differently in access_log (logged before normalization) and error_log (logged after normalization).
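The kind of normalization described above can be approximated in a few lines of Python. This is only a rough sketch, not nginx's actual algorithm: the function name normalize_path is hypothetical, and posixpath.normpath differs from nginx in details such as trailing slashes. It shows why %3F ends up inside the path while a literal ? starts the arguments.

```python
import posixpath
from urllib.parse import unquote


def normalize_path(raw_uri):
    """Rough approximation of URI normalization (not nginx's real code)."""
    # A literal "?" separates the path from the arguments BEFORE decoding,
    # so an encoded %3F can never start the argument string.
    path, _, args = raw_uri.partition("?")
    # Decode %XX sequences, then resolve "." and ".." segments.
    decoded = unquote(path)
    return posixpath.normpath(decoded), args


print(normalize_path("/a/b/../c%3Fd?x=1"))
print(normalize_path("/%E4%B8%8E"))
```

In this sketch, a percent-encoded request and a raw-byte request for the same character normalize to the same path, which is why one location can serve both.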

+3

cnst Jan 26 '15 at 2:40


nwellnhof · Accepted Answer · 2015-01-23T17:37:50+0000

Apart from the regular expressions and your terminal configuration, this has nothing to do with Unicode. The short answer to your question: nginx does not care about Unicode encodings, but it accepts non-ASCII bytes in URLs.

Here is a long answer that explains what you see. If you enter the command

 curl http://localhost:8080/与

and your terminal uses UTF-8 as its encoding, it will encode the character 与 (U+4E0E) as the three-byte UTF-8 sequence

 0xE4 0xB8 0x8E
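To check the byte sequence yourself, here is a minimal Python sketch that performs the same encoding step the terminal does. The variable names are only illustrative.

```python
# Encode U+4E0E the way a UTF-8 terminal would before handing it to curl.
char = "\u4e0e"  # the character 与
encoded = char.encode("utf-8")

# Format each byte the same way as in the text above.
hex_bytes = " ".join(f"0x{b:02X}" for b in encoded)
print(hex_bytes)  # 0xE4 0xB8 0x8E
```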

curl seems to accept non-ASCII bytes in URLs, although they are technically illegal. It then sends an HTTP request containing these non-ASCII bytes. Since there is no standard way to display these bytes, I will use C-style hexadecimal escape sequences such as \x00 to represent them. So the request line sent by curl looks like this:

 GET /\xE4\xB8\x8E HTTP/1.1

That is, three bytes after the first / . If the terminal on which you view your logs also supports UTF-8, this will be displayed on your screen as

 GET /与 HTTP/1.1

But this does not mean that your HTTP request has Unicode characters. At the HTTP level, we are dealing only with bytes.
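As an illustration of the difference between the bytes on the wire and what a terminal shows, the following Python sketch builds the request line as raw bytes, decodes it for display, and also produces the percent-encoded form a browser would typically send. The variable names are assumptions made for this example.

```python
from urllib.parse import quote

# The request line as raw bytes, as curl sends it from a UTF-8 terminal.
request_line = b"GET /\xe4\xb8\x8e HTTP/1.1"

# A UTF-8 terminal decodes those bytes purely for display.
displayed = request_line.decode("utf-8")

# The same three bytes, percent-encoded as a browser would send them.
percent_encoded = quote("/\u4e0e")

print(displayed)
print(percent_encoded)
```

Both forms carry the same three bytes; only their representation in the request line differs.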

nginx also seems to happily accept non-ASCII bytes in URLs. Then the following regular expression

 (*UTF8)([^\w/\.\-\\% ])

running in UTF-8 mode treats the sequence of bytes \xE4\xB8\x8E as the single character 与, which does not match \w , so the header will be

 response: \xE4\xB8\x8E

which your terminal displays as

 response: 与
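The character-mode behaviour can be imitated with Python's re module, as a sketch rather than a reproduction of nginx's PCRE engine. The bytes are decoded first so that the engine sees characters, and re.ASCII restricts \w to [A-Za-z0-9_], which is an assumption about how PCRE treats \w here when Unicode properties are not enabled.

```python
import re

path_bytes = b"/\xe4\xb8\x8e"

# Decode first, so the regex engine works on characters instead of bytes.
path_text = path_bytes.decode("utf-8")

# re.ASCII limits \w to ASCII word characters (assumed to mirror PCRE
# in UTF-8 mode without Unicode properties); IGNORECASE mirrors "~*".
pattern = re.compile(r"([^\w/\.\-\\% ])", re.ASCII | re.IGNORECASE)
captured = pattern.search(path_text).group(1)

# The header value is sent as bytes again.
header_value = captured.encode("utf-8")
print(captured, header_value)
```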

Regex

 ([^\w/\.\-\\% ])

works directly with bytes, so it will only ever match exactly one byte of your path or nothing at all. For some reason, it considers that the first byte of the sequence \xE4\xB8\x8E does not match \w (possibly because it assumes Latin1 or Windows-1252 strings), so the header will be:

 response: \xE4

which your terminal decides to display as

 response: ?

because the byte \xE4 followed by a newline is not valid UTF-8. The regular expression ([^\w/\.\-\\% ]+) matches the entire sequence of bytes, so it gives the same result as the UTF-8 regular expression.
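The byte-mode behaviour can be sketched with Python bytes patterns, where \w only matches ASCII word characters. This is an analogy to, not a copy of, what PCRE does inside nginx. The second pattern puts the + inside the capture group, as in the question, so the group holds the whole run.

```python
import re

path_bytes = b"/\xe4\xb8\x8e"

# Byte mode: the class can only ever capture a single byte.
single = re.search(rb"([^\w/\.\-\\% ])", path_bytes).group(1)

# With "+" inside the group, the capture covers all three bytes.
run = re.search(rb"([^\w/\.\-\\% ]+)", path_bytes).group(1)

# A lone lead byte followed by a newline cannot be decoded as UTF-8,
# so a UTF-8 decoder substitutes a replacement character.
shown = (b"response: " + single + b"\n").decode("utf-8", errors="replace")

print(single, run, repr(shown))
```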

If you see something like

 GET /\xE4\xB8\x8E HTTP/1.1

in your logs, it is because the authors of the logging code decided to use escape sequences for non-ASCII bytes. In general, this is a good idea, because it always produces the same output regardless of the terminal configuration and shows what is really going on: your HTTP request simply contains non-ASCII bytes.
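A minimal sketch of this style of log escaping in Python follows. The helper escape_non_ascii is hypothetical and is not nginx's implementation, which also has its own rules for characters such as quotes and backslashes.

```python
def escape_non_ascii(data: bytes) -> str:
    """Keep printable ASCII as-is and render every other byte as \\xHH."""
    return "".join(
        chr(b) if 0x20 <= b <= 0x7E else f"\\x{b:02X}"
        for b in data
    )


logged = escape_non_ascii(b"GET /\xe4\xb8\x8e HTTP/1.1")
print(logged)  # GET /\xE4\xB8\x8E HTTP/1.1
```

Because the output is plain ASCII, it looks identical on every terminal, whatever encoding that terminal uses.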
