HTMLPurifier strips almost all content from a third-party page

UPDATE 2: the author has already identified the problem; see http://htmlpurifier.org/phorum/read.php?3,5088,5113

UPDATE: the problem seems specific to version 4.2.0. I downgraded to 4.1.0 and it works. Thanks for your help. Package author notified.

I am fetching and parsing a few pages, for example:

http://form.horseracing.betfair.com/horse-racing/010108/Catterick_Bridge-GB-Cat/1215

According to the W3C validator, the page really is valid XHTML Strict.

Then I use http://htmlpurifier.org/ to clean the HTML before loading it into a DOMDocument. However, it returns only one line of content.

Output:

12:15 Catterick Bridge - Tuesday 1st January 2008 - Timeform | Betfair 

Code:

 echo $content; # all good

 $purifier = new \HTMLPurifier();
 $content = $purifier->purify($content);

 echo $content; # all bad
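For context, here is a minimal self-contained sketch of the failing flow; the fetch call, the include path, and the explicit doctype directive are assumptions added for illustration, not part of the original snippet:

 <?php
 require_once 'HTMLPurifier.auto.php'; // assumed include path; adjust to your install

 // Assumption: $content is fetched with a plain file_get_contents().
 $content = file_get_contents('http://form.horseracing.betfair.com/horse-racing/010108/Catterick_Bridge-GB-Cat/1215');

 echo $content; # all good: the full page prints

 // Explicit XHTML 1.0 Strict doctype; the original snippet used the default
 // config, so this directive is only a variation worth trying.
 $config = HTMLPurifier_Config::createDefault();
 $config->set('HTML.Doctype', 'XHTML 1.0 Strict');
 $purifier = new HTMLPurifier($config);

 $content = $purifier->purify($content);

 echo $content; # all bad under 4.2.0: only the page title survives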

BTW, it works for data fetched from other sites; it is only for pages from this domain that it leaves nothing but the title.

1 answer

You do not need an HTML purifier at all; the DOMDocument class takes care of everything for you. Loading invalid HTML will trigger warnings, so just suppress them:

 $doc = new DOMDocument();
 @$doc->loadHTML($content);

The warnings are then silenced, and you can work with the HTML however you want.
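If you prefer not to rely on the @ suppression operator, libxml's own error handling does the same job; this is a sketch using standard PHP functions, not something from the original answer:

 $doc = new DOMDocument();

 // Collect libxml warnings instead of silencing them with @.
 libxml_use_internal_errors(true);
 $doc->loadHTML($content);
 $errors = libxml_get_errors(); // inspect or log these if needed
 libxml_clear_errors();

 // Example: pull every link out of the parsed document.
 $xpath = new DOMXPath($doc);
 foreach ($xpath->query('//a/@href') as $href) {
     echo $href->nodeValue, "\n";
 }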

If you are scraping links, I would recommend using SimpleXMLElement::xpath(); it is much easier than working with DOMDocument directly. An example:

 $xml = new SimpleXMLElement($content);
 // XHTML pages sit in a default namespace, so register it before querying.
 $xml->registerXPathNamespace('x', 'http://www.w3.org/1999/xhtml');
 $result = $xml->xpath('//x:a/@href');
 print_r($result);

You can write much more complex XPath expressions that select by class name, id, and other attributes. This is much more convenient than walking the DOMDocument tree by hand.
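As an illustration of such a richer expression, here is a hedged sketch; the div id and class values are invented for the example and do not come from the Betfair page:

 $xml = new SimpleXMLElement($content);
 $xml->registerXPathNamespace('x', 'http://www.w3.org/1999/xhtml');

 // Hypothetical selectors: anchors inside the element with id="content"
 // whose class attribute contains "runner".
 $links = $xml->xpath('//x:div[@id="content"]//x:a[contains(@class, "runner")]/@href');

 foreach ($links as $href) {
     echo (string) $href, "\n";
 }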
