Unicode in usernames (and passwords)?

After thinking this issue over, I still have a few questions on the topic.

Are there any characters that should be left out (disallowed) for legitimate security reasons? I mean any characters at all, including brackets, commas, apostrophes, and parentheses.

While on this subject, I admit I do not understand why admins seem to enjoy enforcing the rule "you can only use the alphabet, numbers and spaces". Is there anything else that could be a security flaw or break something that I don’t know about (even in ASCII)? In all my time coding, I have seen no reason why any character should be forbidden in a username.

+7
6 answers

Often, these characters can be used to inject malicious code into your program. Examples are SQL injection (quotation marks, dashes, etc.), XSS / CSRF (quotes, angle brackets, etc.), or even programming-language injection when eval() is used somewhere in your code.

These characters usually do no harm when you, as a developer, properly sanitize user-controlled input, that is, everything that comes in with the HTTP request: headers, parameters, and body. For example, use parameterized queries or mysql_real_escape_string() when embedding the values in an SQL query to prevent SQL injection, and htmlspecialchars() when embedding them in HTML to prevent XSS. But I can imagine that administrators do not trust all developers, so they add these restrictions.
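
A minimal Python sketch of the same two defenses. The answer names PHP functions; here sqlite3 placeholders and html.escape stand in for them. The table layout, function names, and sample username are hypothetical and only for illustration.

```python
import html
import sqlite3


def find_user(conn: sqlite3.Connection, username: str):
    # The "?" placeholder passes the value separately from the SQL text,
    # so quotes or dashes in the username cannot alter the query.
    cur = conn.execute("SELECT id, name FROM users WHERE name = ?", (username,))
    return cur.fetchone()


def render_greeting(username: str) -> str:
    # html.escape converts <, >, & and quotes to entities before the
    # value is embedded in markup.
    return "<p>Hello, " + html.escape(username) + "</p>"


# Hypothetical in-memory table with a deliberately hostile username.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
hostile = "Robert'); DROP TABLE users;--"
conn.execute("INSERT INTO users (name) VALUES (?)", (hostile,))
```

With both in place, the hostile name is stored and looked up as ordinary data, and a name containing markup is displayed as text instead of being interpreted by the browser.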

+2

There is no reason to forbid particular characters. If you process all input correctly, it doesn't matter whether you are handling only alphanumeric characters or Chinese.

It is easier to handle alphanumeric-only usernames. You do not need to think about ambiguities with collations in your database, encoding usernames in URLs, and the like. But then again, if you handle this correctly, there are no technical reasons against it.

For practical reasons, passwords are often alphanumeric only. Most password inputs, for example, do not accept an IME, so it is almost impossible to have a Japanese password. However, there is no reason to prohibit non-alphanumeric characters. On the contrary, the larger the alphabet used, the better.
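
As a rough illustration of "the larger the alphabet, the better", the sketch below computes the entropy of a password under the assumption that every character is chosen uniformly at random. That assumption rarely holds for human-chosen passwords. The function name is hypothetical.

```python
import math


def entropy_bits(length: int, alphabet_size: int) -> float:
    # Entropy of a password whose characters are drawn uniformly at
    # random from an alphabet of the given size: length * log2(size).
    return length * math.log2(alphabet_size)


alnum = entropy_bits(8, 62)            # a-z, A-Z, 0-9
ascii_printable = entropy_bits(8, 95)  # all printable ASCII, incl. space
```

Under that assumption, an 8-character password gains a few bits just from allowing punctuation, and a Unicode-sized alphabet would add considerably more per character.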

+4

If your application handles Unicode input properly everywhere, I would of course allow non-ASCII characters in usernames and passwords with a few caveats:

  • If you use HTTP Basic authentication, you cannot correctly support non-ASCII characters in usernames and passwords, because transferring the credentials involves an encode-to-bytes step before base64, and browsers currently do not agree on the encoding:

    • Safari uses ISO-8859-1 and breaks if characters outside 8859-1 are present;
    • Mozilla uses the low byte of each UTF-16 code unit (the same as ISO-8859-1 for characters in that range);
    • Opera and Chrome use UTF-8;
    • IE uses the ANSI code page of the system on which it is installed, which can be anything, but is not ISO-8859-1 or UTF-8. Characters that do not fit the code page are arbitrarily mangled.
  • If you use cookies, you must make sure that any Unicode characters are encoded in some way (e.g. URL encoding), since trying to send raw non-ASCII characters gives completely different results in different browsers.

"you can only use the alphabet, numbers and spaces"

Do you get spaces? Luxury!

+4

I don’t think there is a reason not to allow Unicode in usernames. Passwords are a different story: because you usually can’t see a password while typing it into a form, allowing only ASCII makes sense to prevent possible confusion.

I think it makes sense to use the email address as the login credentials rather than requiring the creation of a new username. Then the user can select any nickname using any Unicode characters and show this nickname next to user posts and comments.

Isn't that the way it is on Facebook?

+2

I think that most of the time when things (usernames or passwords) are restricted to ASCII, it is because someone is afraid that more complex character sets will cause a failure in some unknown component. Whether this fear is justified depends on the case, but making sure that your entire stack handles Unicode correctly in all cases can be difficult. It gets better every day, but you can still find Unicode issues in some places.

I personally keep my usernames and passwords to ASCII, and I even try not to use too much punctuation. One reason is that some input devices (such as some mobile phones) make it difficult to reach the more esoteric characters. Another reason is that I have often come across systems that stated no restrictions on password contents but then broke if you actually used anything other than letters or numbers.

+1

There is a risk if some parts of your program assume that strings with different bytes are different, while other parts compare strings according to Unicode semantics and treat them as the same.

For example, file systems on Mac OS X enforce a normalized representation of Unicode characters, so the two different file names Ą ("A with ogonek") and A + ̨ (Latin A followed by "combining ogonek") refer to the same file.
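
The mismatch can be reproduced with Python's standard unicodedata module. This is only an illustration of the two spellings, not of how any particular file system is implemented; the variable names are hypothetical.

```python
import unicodedata

precomposed = "\u0104"   # Ą as a single code point
decomposed = "A\u0328"   # A followed by COMBINING OGONEK

# A code-point (or byte) comparison treats them as different strings...
differ_raw = precomposed != decomposed

# ...while a comparison after normalization treats them as the same.
same_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
same_nfd = unicodedata.normalize("NFD", precomposed) == decomposed
```

Code that checks uniqueness with the raw comparison while storage or lookup uses the normalized one could let two "different" usernames collide.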

Similarly, you can create invalid UTF-8 byte sequences in which 1-byte code points are encoded using multiple bytes (called overlong sequences). If you normalize or reject UTF-8 input before processing it, you are safe, but if, for example, you use a Unicode-unaware programming language with a Unicode-aware database, the two will see different inputs.

To avoid this:

  • You must filter UTF-8 input as early as possible. Reject invalid / overlong sequences.

  • When comparing Unicode strings, always convert both sides of the comparison into the same Unicode normal form. For usernames, you might want NFKD to reduce the number of homograph attacks.
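
The two rules above can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes usernames arrive as raw bytes from the request.

```python
import unicodedata


def canonical_username(raw: bytes) -> str:
    # Strict decoding raises UnicodeDecodeError on invalid input,
    # including overlong encodings, so bad bytes are rejected early.
    text = raw.decode("utf-8", errors="strict")
    # NFKD folds compatibility characters (ligatures, fullwidth forms)
    # into their plain equivalents before any comparison.
    return unicodedata.normalize("NFKD", text)


def same_username(a: bytes, b: bytes) -> bool:
    # Both sides go through the same decoding and normal form.
    return canonical_username(a) == canonical_username(b)
```

For instance, a name spelled with the "ﬁ" ligature (U+FB01) compares equal to the one spelled with plain "fi", and the overlong byte pair C0 AF is rejected instead of being read as "/". Note that normalization alone does not catch look-alikes from different scripts, such as Cyrillic "а" versus Latin "a".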

+1
