I need to download and parse a web page with lxml and build UTF-8 XML output. I think pseudo-code shows the scheme best:
    from lxml import etree

    webfile = urllib2.urlopen(url)
    root = etree.parse(webfile.read(), parser=etree.HTMLParser(recover=True))
    txt = my_process_text(etree.tostring(root.xpath('/html/body'), encoding=utf8))
    output = etree.Element("out")
    output.text = txt
    outputfile.write(etree.tostring(output, encoding=utf8))
Thus, the web file can be in any encoding (lxml should handle this). The output file should be in UTF-8. I am not sure where to encode and where to decode. Is this scheme OK? (I cannot find a good tutorial on lxml and encodings, but I can find many reports of problems with this...) I need a reliable solution.
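One way the scheme could look in runnable form, as a minimal sketch rather than a definitive answer: it assumes Python 3, feeds the raw bytes to lxml so the parser can pick up a charset declared in the page, and encodes only once, when serializing. The function name `html_bytes_to_utf8_xml`, the sample page, and the text extraction that stands in for `my_process_text` are made up for illustration; the download step is left out so the sketch runs offline.

```python
from lxml import etree


def html_bytes_to_utf8_xml(raw_html):
    """Parse raw HTML bytes and return a UTF-8 encoded XML document (bytes)."""
    # Passing bytes lets the parser read the encoding from a charset
    # declaration in the page, if there is one. recover=True tolerates
    # broken markup.
    parser = etree.HTMLParser(recover=True)
    root = etree.fromstring(raw_html, parser=parser)

    # xpath() returns a list, so take the first <body> match.
    body = root.xpath('/html/body')[0]

    # Hypothetical stand-in for my_process_text: collect the body text as str.
    text = ''.join(body.itertext()).strip()

    output = etree.Element('out')
    output.text = text  # .text takes Unicode text, not encoded bytes
    return etree.tostring(output, encoding='utf-8', xml_declaration=True)


# Made-up sample: a Latin-2 page that declares its own charset.
sample = ('<html><head><meta http-equiv="Content-Type" '
          'content="text/html; charset=iso-8859-2"></head>'
          '<body><p>Žluťoučký kůň</p></body></html>').encode('iso-8859-2')
result = html_bytes_to_utf8_xml(sample)
print(result.decode('utf-8'))
```

The idea is to keep everything as Unicode text between parsing and serializing, and to name the output encoding only in the final `etree.tostring` call.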
Edit:
So, to send UTF-8 to lxml I use:

    from BeautifulSoup import UnicodeDammit  # BeautifulSoup 3

    # (this runs inside a loop over the pages, hence the `continue`)
    converted = UnicodeDammit(webfile, isHTML=True)
    if not converted.unicode:
        print "ERR. UnicodeDammit failed to detect encoding, tried [%s]" % \
            ', '.join(converted.triedEncodings)
        continue
    webfile = converted.unicode.encode('utf-8')
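The same "decode first, then re-encode as UTF-8" step can be sketched with the standard library alone. This is a deliberately crude substitute for UnicodeDammit, not its algorithm: it assumes Python 3 and simply tries a fixed list of candidate encodings in order. The helper name `to_utf8` and the candidate list are assumptions chosen for illustration.

```python
def to_utf8(raw, candidates=('utf-8', 'cp1250')):
    """Return (utf8_bytes, encoding_used), or (None, None) if nothing fits."""
    for name in candidates:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the bytes, try the next
        return text.encode('utf-8'), name
    return None, None


# Made-up sample: bytes as they might arrive from a Windows-1250 page.
page = 'Žluťoučký kůň'.encode('cp1250')
utf8_page, used = to_utf8(page)
if utf8_page is None:
    print('ERR. No candidate encoding fits')
else:
    print(used)
```

A first-match loop like this only proves that the bytes are decodable in some encoding, not that the guess is right, so the order of the candidates matters. If the page still carries its original charset declaration after re-encoding, the parser may follow that declaration instead; naming the encoding explicitly, for example `etree.HTMLParser(encoding='utf-8')`, is one way to avoid relying on it.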
python lxml
Vojta Rylko Apr 21 '10 at 21:30