Which pure-Python library should I use to scrape a website?

I currently have some Ruby code that scrapes a few websites. I used Ruby because I was building the site with Ruby on Rails at the time, so it made sense.

Now I'm trying to port this to Google App Engine and keep getting stuck.

I ported Python Mechanize to work with Google App Engine, but it does not support inspecting the DOM with XPath.

I tried the built-in ElementTree, but it choked on the first block of HTML I gave it when it came across "&mdash;".
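For readers who want to reproduce the failure, here is a minimal sketch using only the standard library. It assumes Python 3, and the sample markup and the helper names (`SNIPPET`, `parses_as_xml`, `TextCollector`, `extract_text`) are made up for illustration. ElementTree uses a strict XML parser, and `&mdash;` is an HTML entity that plain XML does not define, so parsing fails. The lenient `html.parser` module decodes the entity instead.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample markup containing an HTML-only entity.
SNIPPET = "<p>Scraping &mdash; the hard way</p>"


def parses_as_xml(markup):
    """Return True if ElementTree's strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # Raised for undefined entities such as &mdash; and for malformed tags.
        return False


class TextCollector(HTMLParser):
    """Collects text nodes; HTML entities are decoded by default."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(markup):
    """Return the concatenated text content of an HTML fragment."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)


if __name__ == "__main__":
    print(parses_as_xml(SNIPPET))  # the strict XML parser rejects the entity
    print(extract_text(SNIPPET))   # the HTML parser decodes it
```

This only shows why a strict XML parser is a poor fit for real-world HTML; it is not a replacement for the scraping libraries suggested in the answers.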

Do I keep trying to hack ElementTree into working, or should I use something else?

Thanks, Mark

+2
5 answers

Beautiful Soup.

+11

lxml: 100x better than ElementTree.

+6
+4

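Another option is pyparsing, which handles simple scraping jobs (extracting URLs from yahoo.com, reading the time from NIST NTP servers). pyparsing's makeHTMLTags helper is more robust than writing "<" + Literal(tagname) + ">" yourself: makeHTMLTags handles attributes, whitespace, upper/lower case, and so on. Pyparsing is pure Python, so it should work on GAE.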

0

BeautifulSoup is an option, but it comes with its own API. Alternatively, ElementSoup gives you the ElementTree interface on top of BeautifulSoup.

0
