I am trying to finally solve some encoding problems that occur when parsing HTML with lxml. Here are three sample HTML documents that I have come across:
1.
<!DOCTYPE html> <html lang='en'> <head> <title>Unicode Chars: ์ โ'</title> <meta charset='utf-8'> </head> <body></body> </html>
2.
<!DOCTYPE html> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ko-KR" lang="ko-KR"> <head> <title>Unicode Chars: ์ โ'</title> <meta http-equiv="content-type" content="text/html; charset=utf-8" /> </head> <body></body> </html>
3.
<?xml version="1.0" encoding="utf-8"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <title>Unicode Chars: ์ โ'</title> </head> <body></body> </html>
My main script:
from lxml.html import fromstring
...
doc = fromstring(raw_html)
title = doc.xpath('//title/text()')[0]
print title
Results:
Unicode Chars: รฌ รขรข
Unicode Chars: ์ โ'
Unicode Chars: ์ โ'
So the problem is clearly with sample 1, which lacks the <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> tag. The solution from here will correctly recognize sample 1 as utf-8, and so it is functionally equivalent to my original code.
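To illustrate the kind of garbling seen for sample 1, here is a minimal sketch of what happens when UTF-8 bytes are decoded with a single-byte encoding. It assumes the parser fell back to something like ISO-8859-1 because it found no encoding declaration it recognized, which I have not confirmed. The character é is only a stand-in chosen for the illustration and is not taken from the samples.

```python
# -*- coding: utf-8 -*-
# Assumption: the fallback codec is ISO-8859-1 (Latin-1); the parser's
# actual default is not confirmed here.
original = u'\u00e9'                    # 'é', an illustrative stand-in character
utf8_bytes = original.encode('utf-8')   # two bytes: 0xC3 0xA9

# Decoding those bytes with the wrong codec yields one character per byte.
garbled = utf8_bytes.decode('latin-1')

# The damage is reversible as long as the wrong codec is known.
recovered = garbled.encode('latin-1').decode('utf-8')

print(repr(garbled))
print(recovered == original)
```

One character turning into two is the same pattern as in the first result above, where each title character became several unrelated ones.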
The lxml docs seem to contradict themselves:
From here, the example seems to suggest that we should use UnicodeDammit to convert the markup to unicode.
from BeautifulSoup import UnicodeDammit

def decode_html(html_string):
    # Let UnicodeDammit guess the encoding and decode the markup.
    converted = UnicodeDammit(html_string, isHTML=True)
    if not converted.unicode:
        raise UnicodeDecodeError(
            "Failed to detect encoding, tried [%s]",
            ', '.join(converted.triedEncodings))
    return converted.unicode
However, elsewhere the docs say:
[Y]ou will receive errors when trying to [parse] HTML data in a Unicode string that indicates the encoding in the header meta tag. In general, you should avoid converting XML/HTML data to unicode before passing it to the parsers. It is slower and error prone.
If I follow the first suggestion in the lxml docs, my code becomes:
from lxml.html import fromstring
from bs4 import UnicodeDammit
...
dammit = UnicodeDammit(raw_html)
doc = fromstring(dammit.unicode_markup)
title = doc.xpath('//title/text()')[0]
print title
Now I get the following results:
Unicode Chars: ์ โ'
Unicode Chars: ์ โ'
ValueError: Unicode strings with encoding declaration are not supported.
Sample 1 now works correctly, but sample 3 raises an error because of the <?xml version="1.0" encoding="utf-8"?> declaration.
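One workaround I could imagine is to remove the XML declaration from the already-decoded string before handing it to the parser. The following is only a sketch using the standard library. The helper name strip_xml_declaration and the shortened sample string are made up for illustration, and I am assuming that the ValueError is triggered solely by a leading declaration of this form.

```python
import re

# Matches a leading XML declaration such as
# <?xml version="1.0" encoding="utf-8"?>
XML_DECLARATION = re.compile(r'^\s*<\?xml[^>]*\?>', re.IGNORECASE)

def strip_xml_declaration(markup):
    """Return the markup without its leading XML declaration, if any.

    Hypothetical helper: it expects an already-decoded (unicode) string.
    """
    return XML_DECLARATION.sub('', markup, count=1)

# Shortened stand-in for sample 3.
sample = ('<?xml version="1.0" encoding="utf-8"?> '
          '<!DOCTYPE html> <html><head><title>t</title></head></html>')
cleaned = strip_xml_declaration(sample)
print(cleaned)
```

Markup without a declaration passes through unchanged, so the helper could be applied to all three samples. Whether this is any cleaner than catching the exception is part of what I am asking.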
Is there a proper way to handle all of these cases? Is there a better solution than the following?
dammit = UnicodeDammit(raw_html)
try:
    doc = fromstring(dammit.unicode_markup)
except ValueError:
    doc = fromstring(raw_html)