What is the best way to remove Unicode characters that XHTML considers invalid, using PHP?

I am running a forum designed to support an international math group. I recently switched it to Unicode for better support of international characters. While debugging this conversion, I found that not all Unicode characters are considered valid XHTML (the relevant W3C page appears to be http://www.w3.org/TR/unicode-xml/). One of the steps the forum software goes through before sending messages to the browser is an XHTML validation/sanitization step. It seems reasonable that this step should also remove any Unicode characters that XHTML doesn't allow.

So my question is:

Is there a standard (or better) way to do this in PHP?

(By the way, the forum is written in PHP.)

I think the most fault-tolerant approach would be a simple str_replace (if that is also the best approach, do I need to do anything else to make sure it works correctly with Unicode?). But this would require me to go through the XHTML DTD (or the W3C page above) to work out which characters to list in the search argument of str_replace. So if this is the best way, has someone already done this so that I can borrow it?

(By the way, the character that caused the problem was U+000C, "form feed", which (according to the W3C page) is valid in HTML but invalid in XHTML!)

+7
php xhtml unicode
2 answers

I found a function on phpedit.net that may do what you want.

I'm posting the function here for the archive; credit goes to PHPEdit.net:

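    /**
     * Removes invalid XML
     *
     * @access public
     * @param string $value
     * @return string
     */
    function stripInvalidXml($value)
    {
        $ret = "";
        // Compare explicitly: empty() would also treat the string "0" as empty.
        if ($value === null || $value === "") {
            return $ret;
        }

        $length = strlen($value);
        for ($i = 0; $i < $length; $i++) {
            // ord() returns one byte (0-255), so this loop is byte-wise.
            // Bytes of multibyte UTF-8 sequences (0x80-0xFF) pass the
            // 0x20-0xD7FF test and are copied through unchanged.
            $current = ord($value[$i]);
            if (($current == 0x9) ||
                ($current == 0xA) ||
                ($current == 0xD) ||
                (($current >= 0x20) && ($current <= 0xD7FF)) ||
                (($current >= 0xE000) && ($current <= 0xFFFD)) ||
                (($current >= 0x10000) && ($current <= 0x10FFFF))) {
                $ret .= chr($current);
            } else {
                // Anything else (e.g. form feed, 0x0C) becomes a space.
                $ret .= " ";
            }
        }
        return $ret;
    }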
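The ranges tested in that function match the Char production of XML 1.0 (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). Since the function works on bytes rather than code points, a code-point-aware variant might look like the sketch below. It assumes the input is UTF-8; the name stripInvalidXmlChars and the $replacement parameter are made up for illustration and are not from PHPEdit.net.

```php
<?php
/**
 * Hypothetical helper: replaces every code point outside the XML 1.0
 * Char production. Assumes $value is UTF-8 encoded.
 */
function stripInvalidXmlChars(string $value, string $replacement = ''): string
{
    // Negated class of the allowed ranges; the /u modifier makes PCRE
    // treat both pattern and subject as UTF-8.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

    $result = preg_replace($pattern, $replacement, $value);

    // preg_replace() returns null on failure, for example when the
    // subject is not valid UTF-8.
    if ($result === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }

    return $result;
}
```

Called as stripInvalidXmlChars($message), this drops a form feed (U+000C) but leaves tabs, line breaks and non-ASCII text alone. Passing ' ' as the second argument mimics the space replacement used above.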
+2

Assuming your input is UTF-8, you can remove Unicode ranges with something like

  preg_replace('~[\x{17A3}-\x{17D3}]~u', '', $input); 

Another, better approach is to delete everything by default and whitelist only what you want to keep. Unicode properties (\p) are very practical for this. For example, this deletes everything except (Unicode) letters and numbers:

  preg_replace('~[^\p{L}\p{N}]~u', '', $input);
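Note that the letters-and-numbers pattern also strips spaces and punctuation, which is probably too strict for forum posts. A looser whitelist might look like the following sketch. The function name keepForumText and the choice of categories are assumptions for illustration, not something from the answer, and the input is again assumed to be UTF-8.

```php
<?php
/**
 * Hypothetical whitelist filter: keeps letters, combining marks, numbers,
 * punctuation, symbols (including math symbols), space separators, and
 * tab / line feed / carriage return. Everything else is removed.
 */
function keepForumText(string $input): string
{
    // \t, \n and \r are interpreted by PCRE inside the character class.
    $pattern = '~[^\p{L}\p{M}\p{N}\p{P}\p{S}\p{Zs}\t\n\r]~u';

    $result = preg_replace($pattern, '', $input);

    // null signals a PCRE error such as malformed UTF-8 input.
    if ($result === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }

    return $result;
}
```

With this filter, control characters such as form feed (category Cc) are dropped, while text like "x² ≤ 4" survives intact.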
+1
