I have a form served in a non-UTF-8 encoding (it's actually Windows-1251). People, of course, type any characters they like into it. The browser helpfully converts characters that cannot be represented in Windows-1251 into HTML entities so that I can recognize them. For example, if the user types →, I receive &#8594;. This is partly great: if I just echo it back, the browser will correctly display → no matter what.
The problem is that I actually run htmlspecialchars() on the text before displaying it (it's the PHP function for converting special characters to HTML entities; for example, & becomes &amp;). My users sometimes type things like &mdash; or &copy;, and I want to display them as the literal text &mdash; or &copy;, not as — and ©.
I have no way to distinguish a typed → from a typed &#8594;, because I receive both of them as &#8594;. And since I run htmlspecialchars() on the text, and the browser has already sent me &#8594; in place of →, I echo back &amp;#8594;, which the browser displays as &#8594;. Thus, the user's input gets corrupted.
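To make the corruption concrete, here is a minimal sketch of what happens on the PHP side. It assumes the browser sends the arrow as the numeric reference &#8594;. The variable names are made up for illustration, and my real code is more involved:

```php
<?php
// Assumption: this is what PHP receives after the browser has replaced a typed
// "→" (not representable in Windows-1251) with a numeric character reference.
$fromTypedArrow = '&#8594;';

// A user who literally typed the seven characters "&#8594;" sends the same bytes,
// so the two inputs cannot be told apart on the server.
$fromTypedEntity = '&#8594;';

// Escape before output. The charset is passed explicitly because the default
// differs between PHP versions.
$escaped = htmlspecialchars($fromTypedArrow, ENT_QUOTES, 'cp1251');

echo $escaped, "\n"; // prints: &amp;#8594;
```

The browser renders &amp;#8594; as the text &#8594;, so a user who typed an arrow sees the entity instead.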
Is there a way to say, “Okay, I serve this form in Windows-1251, but could you just send me the input as UTF-8 and let me handle it myself?”
Oh, I know that switching all the software to UTF-8 is a good idea, but that is too much work, and I would be happy with a quick fix. If it matters, the form's enctype is "multipart/form-data" (it includes a file uploader, so no other enctype can be used). I am using Apache and PHP.
Thanks!