How to prevent BeautifulSoup from closing tags in bad HTML (Python)?

I am automatically translating the contents of HTML pages into another language, so I need to extract all the text nodes from various HTML pages, some of which are poorly written (I have no way to edit these pages).

Using BeautifulSoup, I can easily extract these texts and replace them with their translations, but when I output the HTML after this operation (html = BeautifulSoup(source_html)), it is sometimes broken because BeautifulSoup automatically closes tags (for example, the table tag gets closed in the wrong place).

Is there a way to prevent BeautifulSoup from closing these tags?

For example, this is my input:

    html = "<table><tr><td>some text</td></table>"

(the closing </tr> is missing)

After soup = BeautifulSoup(html) I get:

    "<table><tr><td>some text</td></tr></table>"

and I want to get the same HTML as the input.

Is it possible at all?

1 answer

BeautifulSoup excels at parsing and extracting data from poorly formatted HTML/XML, but when the broken HTML is ambiguous it applies a set of rules to interpret the tags (which may not be what you want). See the section on parsing HTML in the documentation for an example that is very similar to your situation.

If you know what is wrong with your tags and understand the rules that BeautifulSoup uses, you can tweak your HTML slightly (perhaps removing or moving certain tags) so that BeautifulSoup returns the desired output.
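To find out what is wrong with the tags in the first place, a small diagnostic pass can help. The following is only a rough sketch, assuming Python 3 and the standard library's html.parser rather than BeautifulSoup; the names UnclosedTagFinder and find_unclosed are made up for illustration, and the simple stack logic does not reproduce BeautifulSoup's actual nesting rules.

```python
from html.parser import HTMLParser

# Elements that never take an end tag, so they are not tracked.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class UnclosedTagFinder(HTMLParser):
    """Hypothetical helper: report start tags that have no matching end tag."""

    def __init__(self):
        super().__init__()
        self.stack = []     # (tag, line, column) of elements still open
        self.problems = []  # human-readable findings

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append((tag,) + self.getpos())

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing tags such as <br/> open nothing

    def handle_endtag(self, tag):
        if tag not in [name for name, _, _ in self.stack]:
            self.problems.append(
                "stray </%s> at line %d" % (tag, self.getpos()[0]))
            return
        # Everything opened after the matching start tag was never closed.
        while self.stack:
            name, line, col = self.stack.pop()
            if name == tag:
                break
            self.problems.append(
                "<%s> opened at line %d, column %d is closed implicitly by </%s>"
                % (name, line, col, tag))


def find_unclosed(html):
    finder = UnclosedTagFinder()
    finder.feed(html)
    finder.close()
    # Whatever is still on the stack was open when the document ended.
    for name, line, col in reversed(finder.stack):
        finder.problems.append(
            "<%s> opened at line %d, column %d is never closed"
            % (name, line, col))
    return finder.problems


for problem in find_unclosed("<table><tr><td>some text</td></table>"):
    print(problem)
```

For the input from the question this reports the <tr> that is closed implicitly by </table>. HTML allows some end tags (for example </p> or </li>) to be omitted, so the findings are hints about where a parser has to guess, not a list of errors.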

If you can post a short example, someone can provide you with more specific help.


Update (some examples)

For example, consider the example given in the documentation (see above):

    from BeautifulSoup import BeautifulSoup
    html = """
    <html>
    <form>
    <table>
    <td><input name="input1">Row 1 cell 1
    <tr><td>Row 2 cell 1
    </form>
    <td>Row 2 cell 2<br>This</br> sure is a long cell
    </body>
    </html>"""
    print BeautifulSoup(html).prettify()

The <table> will be closed before </form> to make sure that the table is correctly nested in the form, leaving the last <td> hanging.

If we understand the problem, we can get the closing tag (</table>) in the right place by removing "<form>" before parsing:

    >>> html = html.replace("<form>", "")
    >>> soup = BeautifulSoup(html)
    >>> print soup.prettify()
    <html>
     <table>
      <td>
       <input name="input1" />
       Row 1 cell 1
      </td>
      <tr>
       <td>
        Row 2 cell 1
       </td>
       <td>
        Row 2 cell 2
        <br />
        This sure is a long cell
       </td>
      </tr>
     </table>
    </html>

If the <form> is important, you can add it after parsing. For instance:

    >>> from BeautifulSoup import Tag
    >>> new_form = Tag(soup, "form")  # create form element
    >>> soup.html.insert(0, new_form)  # insert form as child of html
    >>> new_form.insert(0, soup.table.extract())  # move table into form
    >>> print soup.prettify()
    <html>
     <form>
      <table>
       <td>
        <input name="input1" />
        Row 1 cell 1
       </td>
       <tr>
        <td>
         Row 2 cell 1
        </td>
        <td>
         Row 2 cell 2
         <br />
         This sure is a long cell
        </td>
       </tr>
      </table>
     </form>
    </html>
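If the goal is only to replace text nodes and leave the markup as it was, another option is to skip tree building altogether and work on the token stream. The following is a minimal sketch of that swapped-in technique, assuming Python 3 and the standard library's html.parser instead of BeautifulSoup; TextReplacer and replace_text are hypothetical names, and str.upper merely stands in for a real translation function.

```python
from html.parser import HTMLParser


class TextReplacer(HTMLParser):
    """Re-emit markup as found and pass only text nodes through `translate`."""

    def __init__(self, translate):
        # convert_charrefs=False keeps entities such as &amp; as separate
        # events, so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.translate = translate
        self.out = []
        self.in_raw = False  # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # original tag source
        if tag in ("script", "style"):
            self.in_raw = True

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # The parser only exposes the lowercased name, so end tags are rebuilt.
        self.out.append("</%s>" % tag)
        if tag in ("script", "style"):
            self.in_raw = False

    def handle_data(self, data):
        # Leave whitespace-only runs and script/style bodies untouched.
        if self.in_raw or not data.strip():
            self.out.append(data)
        else:
            self.out.append(self.translate(data))

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def replace_text(html, translate):
    parser = TextReplacer(translate)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


broken = "<table><tr><td>some text</td></table>"  # input from the question
print(replace_text(broken, str.upper))
```

For the input from the question this prints <table><tr><td>SOME TEXT</td></table>, with no </tr> added. Because no tree is built, nothing is closed or moved, but you also lose BeautifulSoup's navigation and search; text is handed to the translation function in fragments split at tags and entities, and less common constructs such as CDATA sections are not handled by this sketch.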