BeautifulSoup does an excellent job of parsing and retrieving data from poorly formatted HTML/XML, but when the broken HTML is ambiguous it falls back on a set of rules to interpret the tags, and the result may not be what you want. See the "Parsing HTML" section of the documentation for an example that is very similar to your situation.
If you know what's wrong with your tags and understand the rules that BeautifulSoup uses, you can fix up your HTML slightly (perhaps by removing or moving certain tags) so that BeautifulSoup returns the desired result.
If you can post a short example, someone can provide you with more specific help.
Update (some examples)
For example, consider the sample from the documentation (see above):

    from BeautifulSoup import BeautifulSoup

    html = """
    <html>
    <form>
    <table>
    <td><input name="input1">Row 1 cell 1
    <tr><td>Row 2 cell 1
    </form>
    <td>Row 2 cell 2<br>This</br> sure is a long cell
    </body>
    </html>"""
    print BeautifulSoup(html).prettify()

The <table> will be closed before </form> to make sure that the table is correctly nested in the form, leaving the last <td> hanging.
If we understand the problem, we can get the closing tag (</table>) in the right place by removing "<form>" before parsing:

    >>> html = html.replace("<form>", "")
    >>> soup = BeautifulSoup(html)
    >>> print soup.prettify()
    <html>
     <table>
      <td>
       <input name="input1" />
       Row 1 cell 1
      </td>
      <tr>
       <td>
        Row 2 cell 1
       </td>
       <td>
        Row 2 cell 2
        <br />
        This
        sure is a long cell
       </td>
      </tr>
     </table>
    </html>

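The replace() call above only removes the literal string "<form>". If the real page has attributes on the tag, or you also want to drop the stray "</form>", a regular expression is one possible alternative. The following is a minimal sketch using only the standard re module; strip_tag is a hypothetical helper name, not part of BeautifulSoup, and it assumes attribute values do not contain a ">" character.

```python
import re


def strip_tag(html, tag_name):
    """Remove the opening and closing tags named tag_name, keeping their contents."""
    # Matches <form>, <form action="...">, </form> and </FORM >,
    # but not longer names such as <format>.
    pattern = re.compile(
        r"</?%s(?:\s[^>]*)?>" % re.escape(tag_name),
        re.IGNORECASE,
    )
    return pattern.sub("", html)


# Hypothetical input, shortened from the example above.
sample = '<html><form action="/go"><table><td>cell</form></html>'
cleaned = strip_tag(sample, "form")
# cleaned == '<html><table><td>cell</html>'
```

The cleaned string can then be passed to BeautifulSoup exactly as in the session above.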
If the <form> is important, you can add it after parsing. For instance:

    >>> from BeautifulSoup import Tag
    >>> new_form = Tag(soup, "form")                # create form element
    >>> soup.html.insert(0, new_form)               # insert form as child of html
    >>> new_form.insert(0, soup.table.extract())    # move table into form
    >>> print soup.prettify()
    <html>
     <form>
      <table>
       <td>
        <input name="input1" />
        Row 1 cell 1
       </td>
       <tr>
        <td>
         Row 2 cell 1
        </td>
        <td>
         Row 2 cell 2
         <br />
         This
         sure is a long cell
        </td>
       </tr>
      </table>
     </form>
    </html>