I am launching a forum designed to support an international math group. I recently switched it to Unicode for better support of international characters. While debugging this conversion, I found that not all Unicode characters are considered valid XHTML (the relevant W3C page appears to be http://www.w3.org/TR/unicode-xml/ ). One of the steps the forum software goes through before sending messages to the browser is an XHTML validation / sanitization step. It seems like a reasonable idea that, at this point, it should remove any Unicode characters that XHTML doesn't like.
So my question is:
Is there a standard (or better) way to do this in PHP?
(By the way, the forum is written in PHP.)
I think the fail-safe would be a simple str_replace (if that is also the best way, do I need to do anything extra to make sure it works correctly with Unicode?). But this would require me to go through the XHTML DTD (or the W3C page above) to figure out which characters to list in the search part of the str_replace. So if this is the best way, has someone already done this so that I can steal (er, copy) it?
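To make the idea concrete, here is a minimal sketch of the kind of filter I have in mind. It is an assumption on my part, not something taken from the forum software: it uses preg_replace with the /u modifier instead of str_replace, the function name strip_invalid_xml_chars is made up for illustration, and the character ranges come from the "Char" production of XML 1.0 rather than from the longer list of discouraged characters on the W3C page.

```php
<?php
// Hypothetical helper (the name is invented for this example): removes every
// code point that falls outside the XML 1.0 "Char" production.
function strip_invalid_xml_chars(string $text): string
{
    // Allowed by XML 1.0: tab, line feed, carriage return,
    // U+0020-U+D7FF, U+E000-U+FFFD and U+10000-U+10FFFF.
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $text
    );

    // preg_replace returns null when the subject is not valid UTF-8.
    // Returning an empty string in that case is an assumption of this sketch;
    // a real forum might prefer to reject the message instead.
    return $clean ?? '';
}
```

Called as strip_invalid_xml_chars($message) in the sanitization step, this should drop control characters such as U+000C while leaving tabs, newlines and ordinary non-ASCII text alone. Whether this range list is complete enough for XHTML is exactly what I am unsure about.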
(By the way, the character that caused the problem was U+000C, "form feed", which (according to the W3C page) is valid in HTML but invalid in XHTML!)
php xhtml unicode
Loop space