How to create an html5lib parser to handle mixes of xml and html tags

Question


I am new to BeautifulSoup and I am teaching myself how to solve my parsing problems. My HTML file consists of many separate documents downloaded as a package from LexisNexis (a legal database). My first task is to split the HTML file into its constituent documents. I thought it would be easy, since each document is wrapped in its own tag: <DOC NUMBER=1>body of the 1st document</DOC> and so on. However, this <DOC> tag is an XML tag, not an HTML tag (all other tags in the file are HTML). Because of this, when I use a regular HTML parser, this tag is not available in the tree. How can I build a parser in bs4 that picks up this XML tag? I am attaching the corresponding section of the HTML file:

 BODY 

Best, Marion


python parsing beautifulsoup

user2054545 Mar 19 '13 at 19:57


1 answer

That1guy · Answer 1 · 2013-03-25T20:43:19+0000

You can tell bs4 to use its XML parser when instantiating the BeautifulSoup object:

 from bs4 import BeautifulSoup
 xml_soup = BeautifulSoup(xml_object, 'xml')  # the 'xml' parser requires lxml to be installed

This should take care of your problem. You can use the xml_soup object to parse the remaining HTML as well; however, I would recommend creating a separate soup object specifically for the HTML:

 soup = BeautifulSoup(html_object, 'html.parser')  # name the parser explicitly to avoid bs4's warning
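For the splitting task in the question, here is a minimal sketch of one possible approach. Note that it swaps in Python's built-in 'html.parser' rather than the 'xml' parser shown above, on the assumption that the file is mostly HTML with unquoted attributes such as NUMBER=1. That parser keeps unknown tags in the tree but lowercases tag and attribute names, so the search is for "doc" and "number". The sample markup and the split_documents helper are made up for illustration; they are not taken from the actual LexisNexis file.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the LexisNexis download.
sample = """<html><body>
<DOC NUMBER=1><p>body of the 1st document</p></DOC>
<DOC NUMBER=2><p>body of the 2nd document</p></DOC>
</body></html>"""


def split_documents(markup):
    """Return a dict mapping each DOC number (as a string) to its Tag."""
    # html.parser lowercases names: <DOC NUMBER=1> becomes <doc number="1">.
    soup = BeautifulSoup(markup, "html.parser")
    return {doc.get("number"): doc for doc in soup.find_all("doc")}


docs = split_documents(sample)
for number, doc in docs.items():
    print(number, doc.get_text(strip=True))
```

Each value in docs is a regular bs4 Tag, so the usual find and find_all calls can be used on one document at a time.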
