How do you know what encoding the user's browser input is in?

I read Joel's article on character sets and took his advice to use UTF-8 on my web page and in my database, but I cannot figure out what to do with user input. Joel says: "It doesn't make sense to have a string without knowing what encoding it uses." But how do I know which encoding the user's input string uses? If I have

<input type="text" name="atextfield" >

on my page, how do I know what encoding I get from the user? What if the user enters a special (non-ASCII) character such as ♣ or ™, or something else? Is there a way to detect that the user input is not valid UTF-8? Is there any standard for how to handle such things?

3 answers

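Check the HTTP request headers for the character encoding: the Content-Type header may carry a charset parameter, although browsers often omit it for form submissions.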
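As a rough illustration of checking the header, the sketch below pulls a `charset` parameter out of a Content-Type value. The helper name `charset_from_content_type` is hypothetical, and the sketch assumes the header value is available as a plain string (in PHP it would typically come from `$_SERVER['CONTENT_TYPE']`). It returns null when no charset is declared, which is something the caller has to be prepared for.

```php
<?php
// Hypothetical helper (not part of PHP): extract the charset parameter from a
// Content-Type header value, or return null when none is declared.
function charset_from_content_type(?string $contentType): ?string
{
    if ($contentType === null) {
        return null;
    }
    // Match "; charset=value", with or without double quotes around the value.
    if (preg_match('/;\s*charset\s*=\s*"?([^";\s]+)"?/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// In a real request the value would come from $_SERVER['CONTENT_TYPE'].
var_dump(charset_from_content_type('application/x-www-form-urlencoded; charset=utf-8'));
var_dump(charset_from_content_type('application/x-www-form-urlencoded'));
```

A null result only means the header did not state an encoding; it does not tell you what the bytes actually are.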


If your web page is served as UTF-8, the browser will submit the form data as UTF-8 for you. Thus, even special characters such as ♣ or ™ (which are not actually ASCII) will arrive encoded as UTF-8.

However, you can never rule out a meddling user who manually switches the page encoding to ISO-8859-*.

You can use mb_detect_encoding, but it is not 100% bulletproof.

/* Example input; in practice this would be a submitted form value */
$str = "♣ ™";

/* Detect character encoding with current detect_order */
echo mb_detect_encoding($str);

/* "auto" is expanded according to mbstring.language */
echo mb_detect_encoding($str, "auto");

/* Specify encoding_list as a comma-separated list */
echo mb_detect_encoding($str, "JIS, eucjp-win, sjis-win");

/* Use an array to specify encoding_list */
$ary = ["ASCII", "JIS", "EUC-JP"];
echo mb_detect_encoding($str, $ary);

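If your page is served as UTF-8, the browser should send the form data back as UTF-8. To state this explicitly in the HTML, set the accept-charset attribute on the form: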

<form action="/index.php" method="post" accept-charset="UTF-8"></form>

Unless detecting the encoding of user input is the whole goal of your application, you should err on the side of caution. Assume the encoding is wrong and convert the input to UTF-8 in your application, just as you should assume that user input is malicious and sanitize it before trying to insert it into your database.

In most languages that handle UTF-8 correctly, ASCII characters survive the conversion unchanged, so don't worry about that.
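The "assume it is wrong and convert" advice could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the helper name `to_utf8` is hypothetical, the mbstring extension is assumed to be enabled, and the choice of ISO-8859-1 as the fallback source encoding is purely an assumption (the real fallback depends on who your users are).

```php
<?php
// Hypothetical helper (not part of PHP): return $input as valid UTF-8.
// ASSUMPTION: any input that is not valid UTF-8 is treated as $fallback
// (ISO-8859-1 by default), which is a guess rather than a detection.
function to_utf8(string $input, string $fallback = 'ISO-8859-1'): string
{
    // Valid UTF-8 (which includes pure ASCII) is passed through untouched.
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }
    // Otherwise reinterpret the bytes as the fallback encoding and convert.
    return mb_convert_encoding($input, 'UTF-8', $fallback);
}

// "\xE9" is "é" in ISO-8859-1 and is not valid UTF-8 on its own.
echo bin2hex(to_utf8("caf\xE9")), "\n";
// ASCII-only input comes back byte-for-byte identical.
echo to_utf8("plain ASCII"), "\n";
```

Checking validity first avoids double-encoding input that is already UTF-8, and it also shows why ASCII is not a concern: ASCII bytes are valid UTF-8 as they stand.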
