How to preserve namespace prefixes when parsing HTML with lxml?

>>> from lxml.etree import HTML, tostring
>>> tostring(HTML('<fb:like>'))
'<html><body><like/></body></html>'

Notice how the tag changes from <fb:like> to just <like>.

This makes processing pages that contain XFBML with lxml much more complicated. (The same thing happens with <g:plusone></g:plusone>.)

Any help is appreciated.

python html lxml xml-namespaces facebook-like
2 answers

Try adding the namespace prefix declarations that are missing. Otherwise lxml drops the prefixes, presumably to make things easier for you.

Most likely, the sites you are trying to parse do not contain these namespace declarations, so you will have to add them yourself.

Something like this: xmlns:adlcp="http://xxx/yy/zzz"
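A minimal sketch of this idea, under two assumptions that are not in the answer itself: the markup is well-formed enough to go through an XML parser rather than the HTML parser, and the `fb` prefix maps to the URI shown (the one commonly used for XFBML). The helper name `parse_with_declared_prefix` is hypothetical. The import falls back to the standard library's `xml.etree.ElementTree`, which exposes the same `fromstring` call used here.

```python
try:
    from lxml import etree  # used if lxml is installed
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same calls for this sketch

# Assumed namespace URI for the "fb" prefix; replace it with whatever your pages use.
FB_NS = "http://www.facebook.com/2008/fbml"


def parse_with_declared_prefix(fragment):
    """Wrap a fragment in a root element that declares the missing prefix."""
    wrapped = '<root xmlns:fb="%s">%s</root>' % (FB_NS, fragment)
    # XML parsing: this raises an error if the fragment is not well-formed.
    return etree.fromstring(wrapped)


root = parse_with_declared_prefix('<fb:like></fb:like>')
like = root[0]
print(like.tag)  # {http://www.facebook.com/2008/fbml}like
```

Note that the prefix is resolved to its URI, so the element is addressed as `{uri}like` rather than `fb:like`. Real-world HTML is often not well-formed, in which case the XML parser will reject it and this approach does not apply.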


One way to fix this problem is to patch libxml2.

Looking at the source code of libxml2 v2.9.2 (https://git.gnome.org/browse/libxml2/tree/?id=v2.9.2): in SAX2.c (https://git.gnome.org/browse/libxml2/tree/SAX2.c?id=v2.9.2), the internal SAX parser used to build the DOM tree, attributes using xmlns are not parsed as namespace declarations in HTML mode at line 1699; they are parsed like any other attribute at line 1740. It therefore makes sense to adjust line 1622, which splits the name into a prefix and a local part. Change:

 name = xmlSplitQName(ctxt, fullname, &prefix); 

to

 if (!ctxt->html) {
     name = xmlSplitQName(ctxt, fullname, &prefix);
 } else {
     name = xmlStrdup(fullname);
     prefix = NULL;
 }

Then libxml2 will treat tags such as <o:p> as elements named o:p; that is, the colon is part of the element name and has no special meaning. This is the correct interpretation in HTML. For example, the HTML5 specification says:

In the HTML syntax, namespace prefixes and namespace declarations do not have the same effect as they do in XML. For instance, the colon has no special meaning in HTML element names.

Hopefully this change will be accepted into a future version of libxml2. There is an open bug report (https://bugzilla.gnome.org/show_bug.cgi?id=654146).
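The HTML rule quoted above (the colon is just another character in an element name) can be observed without patching anything. This is only an illustrative sketch using Python's standard-library `html.parser`, not lxml or libxml2; the class name `TagCollector` is hypothetical.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records start-tag names exactly as the tokenizer reports them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # The tag name arrives lowercased; a colon in it is left untouched.
        self.tags.append(tag)


collector = TagCollector()
collector.feed('<p><fb:like></fb:like><g:plusone></g:plusone></p>')
collector.close()
print(collector.tags)  # ['p', 'fb:like', 'g:plusone']
```

This parser is event-based and builds no tree, so it is not a drop-in replacement for lxml. It only shows the element names the patched libxml2 is meant to produce.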
