Python encoding with lxml - complete solution

I need to download and parse a webpage with lxml and build UTF-8 XML output. I think a scheme in pseudo-code is more illustrative:

    import urllib2

    from lxml import etree

    webfile = urllib2.urlopen(url)
    # parse() takes a file-like object; HTMLParser(recover=True) tolerates broken markup
    root = etree.parse(webfile, parser=etree.HTMLParser(recover=True))
    body = root.xpath('/html/body')[0]   # xpath() returns a list
    txt = my_process_text(etree.tostring(body, encoding='utf-8'))
    output = etree.Element("out")
    output.text = txt
    outputfile.write(etree.tostring(output, encoding='utf-8'))

So the web page can be in any encoding (lxml should handle this), and the output file must be UTF-8. I am not sure where I need to decode or encode. Is this scheme ok? (I cannot find a good tutorial on lxml and encoding, but I can find many questions about problems with it...) I need a reliable solution.

Edit:

So, to send utf-8 to lxml I use

    from BeautifulSoup import UnicodeDammit   # BeautifulSoup 3

    converted = UnicodeDammit(webfile, isHTML=True)
    if not converted.unicode:
        print "ERR. UnicodeDammit failed to detect encoding, tried [%s]" % \
              ', '.join(converted.triedEncodings)
        continue                               # assumes this runs inside a loop over pages
    webfile = converted.unicode.encode('utf-8')
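Once webfile holds UTF-8 bytes, the encoding can be stated to lxml explicitly so the parser does not try to guess again. A minimal sketch under that assumption (variable names taken from the snippet above):

    from lxml import etree

    # recover=True tolerates broken markup; encoding='utf-8' matches the bytes produced above
    utf8_parser = etree.HTMLParser(recover=True, encoding='utf-8')
    root = etree.fromstring(webfile, utf8_parser)
    body = root.xpath('/html/body')[0]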
+10
python lxml
Apr 21 '10 at 21:30
2 answers

lxml can be a bit finicky about input encodings. It is best to feed it UTF-8 and get UTF-8 back out.

You might want to use the chardet module or UnicodeDammit to detect the encoding of the actual data.

You'd want to do something roughly like this:

    import urllib2

    import chardet
    from lxml import html

    content = urllib2.urlopen(url).read()
    encoding = chardet.detect(content)['encoding']   # guess the byte encoding
    if encoding != 'utf-8':
        # re-encode anything that is not already UTF-8, replacing undecodable bytes
        content = content.decode(encoding, 'replace').encode('utf-8')
    doc = html.fromstring(content, base_url=url)

I'm not sure why you are moving between lxml and etree, unless you are interacting with another library that already uses etree?

+18
Apr 22 '10 at 6:21

lxml's own encoding detection is weak.

However, note that the most common problem with web pages is missing (or incorrect) encoding declarations. It is therefore often sufficient to use only BeautifulSoup's encoding detection, called UnicodeDammit, and to leave the rest to lxml's own HTML parser, which is several times faster.

I recommend detecting the encoding with UnicodeDammit and parsing with lxml. You can also use the Content-Type HTTP header (you need to extract charset=ENCODING_NAME) to determine the encoding more precisely.

In this example I use BeautifulSoup4 (you also have to install chardet for better autodetection, because UnicodeDammit uses it internally):

    import lxml.html
    from bs4 import UnicodeDammit

    if http_charset == "":
        ud = UnicodeDammit(content, is_html=True)
    else:
        # prefer the charset announced in the Content-Type header, if any
        ud = UnicodeDammit(content, override_encodings=[http_charset], is_html=True)
    root = lxml.html.fromstring(ud.unicode_markup)
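The http_charset variable above is assumed to come from the HTTP response. A minimal sketch of how it might be extracted from the Content-Type header, assuming urllib2 and a hypothetical helper named get_http_charset:

    import urllib2

    def get_http_charset(response):
        # response.info() is a mimetools.Message subclass in Python 2,
        # so getparam('charset') returns the declared charset or None
        return response.info().getparam('charset') or ""

    response = urllib2.urlopen(url)
    content = response.read()
    http_charset = get_http_charset(response)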

OR, to make the previous answer more complete, you can change it to:

    if ud.original_encoding != 'utf-8':
        content = content.decode(ud.original_encoding, 'replace').encode('utf-8')

Why is this better than just using chardet?

  • You do not ignore the Content-Type HTTP header

    Content-Type: text/html; charset=UTF-8

  • You do not ignore the http-equiv meta tag. Example:

    ... http-equiv="Content-Type" content="text/html; charset=UTF-8" ...

  • In addition, you use all the power of chardet, the cjkcodecs and iconvcodec codecs, and much more.
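As a small illustration of the meta tag point (assuming bs4 is installed; the sample markup is made up), UnicodeDammit picks up the declared charset even without an HTTP header:

    from bs4 import UnicodeDammit

    sample = ('<html><head><meta http-equiv="Content-Type" '
              'content="text/html; charset=ISO-8859-1"></head>'
              '<body>caf\xe9</body></html>')

    ud = UnicodeDammit(sample, is_html=True)
    print ud.original_encoding    # expected: 'iso-8859-1', taken from the meta declaration
    print ud.unicode_markup[-25:] # decoded unicode, ready for lxml.html.fromstring()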

+2
Jul 06 '18