How to create an html5lib parser to handle a mix of XML and HTML tags

I am new to BeautifulSoup and I am teaching myself how to solve my parsing problems. My HTML file consists of many separate documents downloaded as a package from LexisNexis (a legal database). My first task is to split the HTML file into its constituent documents. I thought this would be easy, since each document is wrapped in markers such as <DOC NUMBER=1>body of the 1st document</DOC>. However, <DOC> is an XML tag, not an HTML tag (all other tags in the file are HTML), so it does not show up in the tree when I use a regular HTML parser. How can I build a parser in bs4 that picks up this XML tag? Here is the relevant section of the HTML file:

<!-- Hide XML section from browser <DOC NUMBER=1> <DOCFULL> --> BODY <!-- Hide XML section from browser </DOCFULL> </DOC> -->

Best, Marion

1 answer

You can select the XML parser in bs4 by passing 'xml' when instantiating the BeautifulSoup object (this parser requires lxml to be installed):

 from bs4 import BeautifulSoup
 xml_soup = BeautifulSoup(xml_object, 'xml')

This should take care of your problem. You can use the xml_soup object to parse the remaining HTML; however, I would recommend creating a separate soup object specifically for the HTML:

 soup = BeautifulSoup(html_object, 'html.parser')  # name the parser explicitly
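One caveat worth checking against the sample in the question: there the <DOC> markers sit inside HTML comments, so a parser will see them as comment text rather than as elements. If that holds for the whole file, one possible workaround (not part of the answer above) is to split the raw text on those comment markers first and parse each piece afterwards. The following is only a rough sketch under that assumption; `raw_html`, `DOC_PATTERN`, and `split_documents` are hypothetical names, and the sample string is invented to mirror the snippet in the question.

```python
import re

# Hypothetical sample mirroring the snippet in the question; a real export
# may differ in whitespace, line breaks, or attribute quoting.
raw_html = (
    "<!-- Hide XML section from browser <DOC NUMBER=1> <DOCFULL> -->"
    "<p>first body</p>"
    "<!-- Hide XML section from browser </DOCFULL> </DOC> -->"
    "<!-- Hide XML section from browser <DOC NUMBER=2> <DOCFULL> -->"
    "<p>second body</p>"
    "<!-- Hide XML section from browser </DOCFULL> </DOC> -->"
)

# Opening comment with <DOC NUMBER=n>, then the body (captured lazily),
# then the closing comment that contains </DOCFULL> </DOC>.
DOC_PATTERN = re.compile(
    r"<!--[^<]*<DOC NUMBER=(\d+)>.*?-->"    # opening marker comment
    r"(.*?)"                                # document body
    r"<!--[^<]*</DOCFULL>\s*</DOC>\s*-->",  # closing marker comment
    re.DOTALL,
)


def split_documents(raw):
    """Return {doc_number: body_html} for every marked document in raw."""
    return {int(num): body.strip() for num, body in DOC_PATTERN.findall(raw)}


documents = split_documents(raw_html)
for number, body in documents.items():
    print(number, body)
```

Each body string could then be handed to BeautifulSoup on its own, for example with `BeautifulSoup(body, 'html.parser')`, so that every document gets a separate tree. The pattern assumes the markers always look like the ones in the sample and would need adjusting if they do not.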