HTML encoding and lxml parsing

I am trying to finally settle some encoding problems that come up when parsing HTML with lxml. Here are three examples of HTML documents I came across:

1.

    <!DOCTYPE html>
    <html lang='en'>
    <head>
      <title>Unicode Chars: 은 —'</title>
      <meta charset='utf-8'>
    </head>
    <body></body>
    </html>

2.

    <!DOCTYPE html>
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ko-KR" lang="ko-KR">
    <head>
      <title>Unicode Chars: 은 —'</title>
      <meta http-equiv="content-type" content="text/html; charset=utf-8" />
    </head>
    <body></body>
    </html>

3.

    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head>
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
      <title>Unicode Chars: 은 —'</title>
    </head>
    <body></body>
    </html>

My main script:

    from lxml.html import fromstring
    ...
    doc = fromstring(raw_html)
    title = doc.xpath('//title/text()')[0]
    print title

Results:

    Unicode Chars: ì ââ
    Unicode Chars: 은 —'
    Unicode Chars: 은 —'

So the problem is clearly with sample 1, which declares its encoding only through the newer <meta charset='utf-8'> tag and lacks the older <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> form. The charset-detection solution from here does recognize sample 1 as utf-8, so by itself it is functionally equivalent to my original code.
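For illustration, here is a minimal sketch of that kind of meta-tag sniffing. This is my own approximation, not the linked code; detect_charset is a hypothetical helper that only covers the two meta forms shown in the samples:

    import re

    def detect_charset(raw_html, default='utf-8'):
        # Look for charset=... inside any <meta> tag in the raw bytes,
        # covering both <meta charset="..."> and the http-equiv form.
        match = re.search(br'<meta[^>]+charset=["\']?([\w-]+)',
                          raw_html, re.IGNORECASE)
        return match.group(1).decode('ascii') if match else default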

The lxml docs seem to contradict themselves:

From here, the example seems to suggest using UnicodeDammit to decode the markup to unicode first:

    from BeautifulSoup import UnicodeDammit

    def decode_html(html_string):
        converted = UnicodeDammit(html_string, isHTML=True)
        if not converted.unicode:
            raise UnicodeDecodeError(
                "Failed to detect encoding, tried [%s]",
                ', '.join(converted.triedEncodings))
        # print converted.originalEncoding
        return converted.unicode

    root = lxml.html.fromstring(decode_html(tag_soup))

However, it also says:

[Y]ou will get errors when you try to [parse] HTML data in a unicode string that specifies the charset in a meta tag of the header. In general, you should avoid converting XML/HTML data to unicode before passing it into the parsers. It is both slower and error prone.

If I follow the first suggestion from the lxml docs, my code becomes:

    from lxml.html import fromstring
    from bs4 import UnicodeDammit
    ...
    dammit = UnicodeDammit(raw_html)
    doc = fromstring(dammit.unicode_markup)
    title = doc.xpath('//title/text()')[0]
    print title

Now I get the following results:

 Unicode Chars: ์€ โ€”' Unicode Chars: ์€ โ€”' ValueError: Unicode strings with encoding declaration are not supported. 

Sample 1 now works correctly, but sample 3 raises an error because of its <?xml version="1.0" encoding="utf-8"?> declaration.
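One workaround would be to strip that declaration before parsing. This is only a sketch of my own, not something from the lxml docs, and strip_declaration is a hypothetical helper:

    import re

    def strip_declaration(unicode_markup):
        # Drop a leading <?xml ... ?> declaration so lxml will accept
        # the already-decoded unicode string.
        return re.sub(r'^\s*<\?xml[^>]*\?>', '', unicode_markup, count=1)

    doc = fromstring(strip_declaration(dammit.unicode_markup))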

Is there a proper way to handle all of these cases? Is there a better solution than the following?

    dammit = UnicodeDammit(raw_html)
    try:
        doc = fromstring(dammit.unicode_markup)
    except ValueError:
        doc = fromstring(raw_html)
+7
python unicode web-scraping lxml beautifulsoup
Mar 08 '13 at 19:50
2 answers

lxml has several issues related to handling Unicode. It might be best to use bytes (for now) while specifying the character encoding explicitly:

    #!/usr/bin/env python
    import glob
    from lxml import html
    from bs4 import UnicodeDammit

    for filename in glob.glob('*.html'):
        with open(filename, 'rb') as file:
            content = file.read()

        doc = UnicodeDammit(content, is_html=True)

        parser = html.HTMLParser(encoding=doc.original_encoding)
        root = html.document_fromstring(content, parser=parser)
        title = root.find('.//title').text_content()
        print(title)

Output

 Unicode Chars: ์€ โ€”' Unicode Chars: ์€ โ€”' Unicode Chars: ์€ โ€”' 
+15
Mar 08 '13 at 23:44

The problem probably stems from the fact that <meta charset> is a relatively new standard (HTML5, if I'm not mistaken; it wasn't in use before that).

Until the lxml.html library is updated to reflect it, you will need to handle this case specially.

If you only care about ISO-8859-* and UTF-8, and can afford to throw away encodings that are not ASCII supersets (such as UTF-16 or the traditional East Asian encodings), you can do a regular-expression replacement on the byte string, rewriting the newer <meta charset> form into the older http-equiv form, as in the sketch below.
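A minimal sketch of that rewrite, assuming the charset value is unquoted or wrapped in single or double quotes; normalize_meta_charset is a made-up name:

    import re

    META_CHARSET = re.compile(
        br'<meta\s+charset=["\']?([\w-]+)["\']?\s*/?>',
        re.IGNORECASE)

    def normalize_meta_charset(raw_bytes):
        # Rewrite <meta charset="..."> into the pre-HTML5 http-equiv
        # form that older parsers know how to read.
        return META_CHARSET.sub(
            br'<meta http-equiv="Content-Type" '
            br'content="text/html; charset=\1" />',
            raw_bytes)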

Otherwise, if you want a proper solution, your best bet is to fix the library itself (and contribute the fix back while you're at it). You might first want to ask the lxml developers whether they already have any half-baked code around this particular shortcoming, or whether it is tracked in their bug tracker.

+3
Mar 08 '13 at 23:34
