How to convert <p> into line breaks?

Say I have HTML with <p> and <br> tags. Afterwards, I am going to strip the HTML to clean up the tags. How can I turn the <p> and <br> tags into line breaks before that happens?

I am using Python's BeautifulSoup, if that helps at all.

+8
python html xml regex
4 answers

Without any specifics, it's hard to be sure that this does exactly what you want, but it should give you the idea. It assumes your br tags are wrapped inside p elements.

    from BeautifulSoup import BeautifulSoup
    import types

    def replace_with_newlines(element):
        text = ''
        for elem in element.recursiveChildGenerator():
            if isinstance(elem, types.StringTypes):
                text += elem.strip()
            elif elem.name == 'br':
                text += '\n'
        return text

    page = """<html>
    <body>
    <p>America,<br>
    Now is the<br>time for all good men to come to the aid<br>of their country.</p>
    <p>pile on taxpayer debt<br></p>
    <p>Now is the<br>time for all good men to come to the aid<br>of their country.</p>
    </body>
    </html>
    """

    soup = BeautifulSoup(page)
    lines = soup.find("body")
    for line in lines.findAll('p'):
        line = replace_with_newlines(line)
        print line

Running this results in:

    (py26_default)[mpenning@Bucksnort ~]$ python thing.py
    America,
    Now is the
    time for all good men to come to the aid
    of their country.
    pile on taxpayer debt

    Now is the
    time for all good men to come to the aid
    of their country.
    (py26_default)[mpenning@Bucksnort ~]$
+13

get_text seems to do what you need

    >>> from bs4 import BeautifulSoup
    >>> doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
    >>> soup = BeautifulSoup(doc)
    >>> soup.get_text(separator="\n")
    u'This is a paragraph.\nThis is another paragraph.'
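If installing bs4 is not an option, a similar result can probably be had from the standard library alone. The following is only a rough sketch, not part of the answer above: it uses html.parser.HTMLParser, and the names LineBreakExtractor and html_to_text are made up for illustration. It assumes that a newline is wanted for every <br> and at the end of every <p>, and that whitespace in the source HTML is not significant.

```python
import re
from html.parser import HTMLParser


class LineBreakExtractor(HTMLParser):
    """Hypothetical helper: collects text, adding '\n' for <br> and after </p>."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # <br> and <br /> both arrive here (self-closing tags are forwarded
        # to handle_starttag by the default handle_startendtag).
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        # Treat the end of a paragraph as a line break.
        if tag == "p":
            self.parts.append("\n")

    def handle_data(self, data):
        # Collapse runs of source whitespace (including newlines) to one space.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(html):
    parser = LineBreakExtractor()
    parser.feed(html)
    parser.close()
    # Trim each line, then trim leading/trailing blank lines.
    lines = [line.strip() for line in "".join(parser.parts).split("\n")]
    return "\n".join(lines).strip()


if __name__ == "__main__":
    print(html_to_text("<p>One<br>two</p><p>Three</p>"))
```

Unlike get_text(separator="\n"), this does not break a line at inline tags such as <b>, since only <br> and </p> produce newlines.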
+3

This is a Python 3 version of @Mike Pennington's answer (which really helped). I made a little repository for it.

    def replace_with_newlines(element):
        text = ''
        for elem in element.recursiveChildGenerator():
            if isinstance(elem, str):
                text += elem.strip()
            elif elem.name == 'br':
                text += '\n'
        return text

    def get_plain_text(soup):
        plain_text = ''
        lines = soup.find("body")
        for line in lines.findAll('p'):
            line = replace_with_newlines(line)
            plain_text += line
        return plain_text

To use this, simply pass the BeautifulSoup object to the get_plain_text function.

    soup = BeautifulSoup(page)
    plain_text = get_plain_text(soup)
+1

I'm not quite sure what you are trying to accomplish, but if you are just trying to remove HTML elements, I would use a program like Notepad2 and its Replace All function. I think you can also insert a new line using Replace All. Make sure that when you replace the <p> element you also remove the closing tag ( </p> ). Also, just FYI, XHTML requires <br /> rather than <br> (HTML5 accepts either), but that doesn't really matter here. Python would not be my first choice for this, so it is a bit outside my area of expertise. Sorry I couldn't be of more help.
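For what it's worth, the same Replace All idea can be sketched in Python with the re module, since the question is also tagged regex. This is only an illustration under simple assumptions: the function name tags_to_newlines is made up, and regular expressions are fragile on real-world HTML (comments, scripts, or attribute values containing ">" will confuse it), so a parser is usually the safer choice.

```python
import re
from html import unescape


def tags_to_newlines(html):
    """Hypothetical helper: regex-based 'Replace All' for <br> and </p>."""
    # <br>, <br/> and <br /> each become a newline.
    text = re.sub(r"<br\s*/?>", "\n", html, flags=re.IGNORECASE)
    # A closing </p> ends the line; the opening <p> is dropped below.
    text = re.sub(r"</p\s*>", "\n", text, flags=re.IGNORECASE)
    # Remove every remaining tag.
    text = re.sub(r"<[^>]+>", "", text)
    # Decode entities such as &amp; and trim surrounding whitespace.
    return unescape(text).strip()


if __name__ == "__main__":
    print(tags_to_newlines("<p>One<br />two</p><p>Three &amp; four</p>"))
```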

-5
