How to convert <br> and <p> into line breaks?

Question

How to convert <br> and <p> into line breaks?

Say I have HTML with <p> and <br> tags. Afterwards, I am going to strip the HTML to clean up the tags. How can I turn them into line breaks?

I am using Python's BeautifulSoup library, if that helps at all.

+8

python html xml regex

TIMEX May 08 '12 at 1:10


4 answers

Mike Pennington · Answer 1 · 2012-05-08T01:42:21+0000

Without any specifics, it's hard to be sure this does exactly what you want, but it should give you the idea. It assumes your br tags are wrapped inside p elements.

    from BeautifulSoup import BeautifulSoup
    import types

    def replace_with_newlines(element):
        text = ''
        for elem in element.recursiveChildGenerator():
            if isinstance(elem, types.StringTypes):
                text += elem.strip()
            elif elem.name == 'br':
                text += '\n'
        return text

    page = """<html>
    <body>
    <p>America,<br>
    Now is the<br>time for all good men to come to the aid<br>of their country.</p>
    <p>pile on taxpayer debt<br></p>
    <p>Now is the<br>time for all good men to come to the aid<br>of their country.</p>
    </body>
    </html>
    """

    soup = BeautifulSoup(page)
    lines = soup.find("body")
    for line in lines.findAll('p'):
        line = replace_with_newlines(line)
        print line

Running this results in:

    (py26_default)[mpenning@Bucksnort ~]$ python thing.py
    America,
    Now is the
    time for all good men to come to the aid
    of their country.
    pile on taxpayer debt

    Now is the
    time for all good men to come to the aid
    of their country.
    (py26_default)[mpenning@Bucksnort ~]$

naoko · Answer 2 · 2016-08-09T22:12:19+0000

get_text seems to do what you need:

    >>> from bs4 import BeautifulSoup
    >>> doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
    >>> soup = BeautifulSoup(doc)
    >>> soup.get_text(separator="\n")
    u'This is a paragraph.\nThis is another paragraph.'

Geng jiawen · Answer 3 · 2015-10-18T10:38:35+0000

This is a Python 3 version of @Mike Pennington's answer (which really helped), with a small fix.

    def replace_with_newlines(element):
        # Concatenate the element's text nodes, emitting '\n' for each <br>.
        text = ''
        for elem in element.recursiveChildGenerator():
            if isinstance(elem, str):
                text += elem.strip()
            elif elem.name == 'br':
                text += '\n'
        return text

    def get_plain_text(soup):
        # Join the converted text of every <p> found inside <body>.
        plain_text = ''
        lines = soup.find("body")
        for line in lines.findAll('p'):
            line = replace_with_newlines(line)
            plain_text += line
        return plain_text

To use this, simply pass the BeautifulSoup object to the get_plain_text function.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(page)
    plain_text = get_plain_text(soup)
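For comparison, the same idea can be sketched without BeautifulSoup, using only the standard library's html.parser module. This is an illustrative alternative, not part of either answer above: the names LineBreakExtractor and html_to_text are made up for the example. Unlike get_plain_text, which joins paragraphs with no separator, this sketch also emits a newline at each closing </p>, so consecutive paragraphs do not run together.

```python
from html.parser import HTMLParser


class LineBreakExtractor(HTMLParser):
    """Collect text, adding a newline for each <br> and after each </p>.

    Hypothetical helper class for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # <br>, <br/> and <br /> all arrive here with tag == "br".
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        # Close each paragraph with a newline.
        if tag == "p":
            self.parts.append("\n")

    def handle_data(self, data):
        # As in the answers above, whitespace around each text node is dropped.
        self.parts.append(data.strip())


def html_to_text(html):
    """Return the text of `html` with <br> and </p> turned into newlines."""
    parser = LineBreakExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Calling html_to_text("<p>Now is the<br>time</p>") returns 'Now is the\ntime\n'. Because every text node is stripped, inline markup such as "Hello <b>world</b>" loses the space between words, which is the same limitation the replace_with_newlines functions above have.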

Andrey · Answer 4 · 2012-05-08T01:42:12+0000

I'm not quite sure what you are trying to accomplish, but if you are just trying to remove HTML elements, I would use an editor like Notepad2 and its Replace All function; I think you can also insert a new line using Replace All. Make sure that when you replace the <p> element you also remove its closing tag (</p>). Also, just FYI, the proper HTML5 is <br /> instead of <br>, but that doesn't really matter. Python would not be my first choice for this, so it is a bit outside my area of expertise; sorry I couldn't be of more help.
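Since the question is tagged regex, the Replace All idea can be approximated in Python with the standard re module. This is only a rough sketch for simple, well-formed markup; regular expressions are not a general HTML parser, and the function name tags_to_newlines is made up for the example.

```python
import re


def tags_to_newlines(html):
    """Replace <br> and </p> with newlines, then drop all remaining tags.

    Hypothetical helper; assumes simple markup with no '>' inside attributes.
    """
    # <br>, <br/> and <br /> each become a newline.
    text = re.sub(r"<br\s*/?>", "\n", html, flags=re.IGNORECASE)
    # Each closing </p> becomes a newline as well.
    text = re.sub(r"</p\s*>", "\n", text, flags=re.IGNORECASE)
    # Remove whatever tags are left (opening <p>, <html>, <body>, ...).
    return re.sub(r"<[^>]+>", "", text)
```

For example, tags_to_newlines("<p>Now is the<br>time</p>") returns 'Now is the\ntime\n'. Unlike the BeautifulSoup answers, this leaves whitespace in the text untouched and does not decode entities such as &amp;.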
