Please consider the following:
import xml.etree.ElementTree as ET
xhtml = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head><title>XHTML sample</title></head>
<body>
<p>&nbsp;Sample text</p>
</body>
</html>
'''
parser = ET.XMLParser()
parser.entity['nbsp'] = '\u00a0'  # map &nbsp; to U+00A0 (no-break space)
tree = ET.fromstring(xhtml, parser=parser)
print(ET.tostring(tree, method='xml'))
which prints a clean serialization of the xhtml string.
But for the same XHTML document with an HTML5 doctype:
xhtml = '''<!DOCTYPE html>
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head><title>XHTML sample</title></head>
<body>
<p>&nbsp;Sample text</p>
</body>
</html>
'''
I get an exception:
xml.etree.ElementTree.ParseError: undefined entity: line 5, column 19
So the parser cannot handle this, even though I added nbsp to the parser's entity dict.
The same thing happens if I use lxml:
from lxml import etree
parser = etree.XMLParser(resolve_entities=False)
tree = etree.fromstring(xhtml, parser=parser)
print(etree.tostring(tree, method='xml'))
raises:
lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 5, column 26
although I configured the parser not to resolve entities.
Why is this, and how can I parse XHTML files with an HTML5 doctype declaration?
A partial workaround for lxml is to use the recover option:
parser = etree.XMLParser(resolve_entities=False, recover=True)
but I'm still hoping for a better solution.
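Another possible workaround for the standard-library parser, given here only as a sketch under assumptions, is to declare the entity in an internal DTD subset before parsing. The helper name `declare_nbsp` is hypothetical, and the plain string replacement assumes the doctype appears exactly as `<!DOCTYPE html>`:

```python
import xml.etree.ElementTree as ET

XHTML_NS = '{http://www.w3.org/1999/xhtml}'

xhtml = '''<!DOCTYPE html>
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head><title>XHTML sample</title></head>
<body>
<p>&nbsp;Sample text</p>
</body>
</html>
'''

def declare_nbsp(doc):
    """Hypothetical helper: add an internal subset that defines &nbsp;.

    Assumes the document starts with the literal HTML5 doctype.
    """
    return doc.replace(
        '<!DOCTYPE html>',
        '<!DOCTYPE html [<!ENTITY nbsp "&#160;">]>',
        1,
    )

# With the entity declared in the document itself, no parser.entity
# entry is needed for this sample.
tree = ET.fromstring(declare_nbsp(xhtml))
p = tree.find('.//' + XHTML_NS + 'p')
print(repr(p.text))
```

This only covers `nbsp`; other HTML named entities would each need their own declaration, so it is not a general answer.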