Apart from the regexes and your terminal configuration, this has nothing to do with Unicode. The short answer to your question: nginx does not care about Unicode encodings, but it accepts non-ASCII bytes in URLs.
Here is the long answer, which explains what you are seeing. If you enter the command
curl http:
and your terminal uses UTF-8 as its encoding, it will encode the character 与 (U+4E0E) as the three-byte UTF-8 sequence
0xE4 0xB8 0x8E
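The encoding step can be checked independently of curl or nginx. This is a minimal Python sketch; the variable names are only illustrative.

```python
# Encode the character the way a UTF-8 terminal would before handing it to curl.
ch = "与"
encoded = ch.encode("utf-8")

print(f"U+{ord(ch):04X}")                          # the code point
print(" ".join(f"0x{b:02X}" for b in encoded))     # the individual bytes
```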
curl appears to accept non-ASCII bytes in URLs, although they are technically illegal, and sends an HTTP request containing these bytes unchanged. Since there is no standard way to display such bytes, I will use C-style hexadecimal escape sequences such as \x00 to represent them. The request line sent by curl then looks like this:
GET /\xE4\xB8\x8E HTTP/1.1
That is three bytes after the first /. If the terminal on which you view your logs also uses UTF-8, this will be displayed on your screen as
GET /与 HTTP/1.1
But this does not mean that your HTTP request has Unicode characters. At the HTTP level, we are dealing only with bytes.
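To illustrate that the request line is only bytes, the sketch below decodes the same byte string in two different ways, roughly as terminals with different encodings would. It also shows the percent-encoded form that a strictly legal URL would use. The byte string is hand-written here, not captured from a real request, and the variable names are hypothetical.

```python
from urllib.parse import quote

# The request line as bytes on the wire (written by hand for illustration).
raw = b"GET /\xe4\xb8\x8e HTTP/1.1"

# The same bytes, interpreted under two different terminal encodings.
utf8_view = raw.decode("utf-8")      # three bytes collapse into one character
latin1_view = raw.decode("latin-1")  # every byte becomes its own character

# What the path would look like if the client percent-encoded it.
percent_encoded = quote("/与")

print(utf8_view)
print(repr(latin1_view))
print(percent_encoded)
```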
nginx also seems to accept non-ASCII bytes in URLs without complaint. The following regex
(*UTF8)([^\w/\.\-\\% ])
running in UTF-8 mode treats the byte sequence \xE4\xB8\x8E as the single character 与, which is not matched by \w and therefore matches the negated character class, so the header will be
answer: \xE4\xB8\x8E
which your terminal displays as
answer: 与
The regex
([^\w/\.\-\\% ])
works directly on bytes, so it matches exactly one byte of your path or nothing at all. Evidently the first byte of the sequence, \xE4, is not treated as a \w character (it would be a letter only under a Latin1 or Windows-1252 interpretation), so the negated class matches it and the header will be
answer: \xE4
which your terminal decides to display as
answer: ?
because the byte \xE4 followed by a newline is not valid UTF-8. The regex ([^\w/\.\-\\% ])+ matches the whole byte sequence, so it gives the same result as the UTF-8 regex.
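The difference between character mode and byte mode can be reproduced outside nginx. The sketch below uses Python's re module as a stand-in for PCRE, which is an assumption: the two engines are not identical. The re.ASCII flag is used so that \w covers only ASCII word characters, which approximates PCRE's default behaviour without Unicode properties. The last pattern puts the + outside the group and reads the whole match, so it shows the full run rather than a single capture.

```python
import re

path_bytes = "/与".encode("utf-8")   # b"/\xe4\xb8\x8e"
CLASS = r"[^\w/\.\-\\% ]"            # the character class from the answer

# Character mode: the three bytes are decoded first and seen as one character.
char_match = re.search(f"({CLASS})", path_bytes.decode("utf-8"), re.ASCII)

# Byte mode: the pattern is applied to raw bytes, one byte at a time.
byte_match = re.search(f"({CLASS})".encode("ascii"), path_bytes)

# Byte mode with a repeat: the whole run of non-ASCII bytes is matched.
byte_run = re.search(f"({CLASS})+".encode("ascii"), path_bytes)

# A lone lead byte followed by a newline is not valid UTF-8, so a UTF-8
# terminal can only show a replacement character for it.
shown = (byte_match.group(1) + b"\n").decode("utf-8", errors="replace")

print(char_match.group(1))
print(byte_match.group(1))
print(byte_run.group(0))
print(repr(shown))
```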
If you see something like
GET /\xE4\xB8\x8E HTTP/1.1
in your logs, it is because the authors of the logging code decided to use escape sequences for non-ASCII bytes. In general, this is a good idea, because it produces the same output regardless of the terminal configuration and shows what is really going on: your HTTP request simply contains non-ASCII bytes.
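The escaping idea is easy to sketch. The helper below is hypothetical and is not nginx's actual logging code: it keeps printable ASCII as it is and writes every other byte as a \xHH escape.

```python
def escape_non_ascii(data: bytes) -> str:
    """Render bytes for a log line: printable ASCII stays, the rest becomes \\xHH."""
    return "".join(
        chr(b) if 0x20 <= b <= 0x7E else f"\\x{b:02X}"
        for b in data
    )


print(escape_non_ascii(b"GET /\xe4\xb8\x8e HTTP/1.1"))
```

The output no longer depends on the encoding of the terminal that displays the log, because it consists of ASCII characters only.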