How to remove spaces in BeautifulSoup

Question

I have a bunch of HTML that I am parsing with BeautifulSoup, and everything is going fine with one minor exception. I want to save the output as a single-line string, but my current output looks like this:

    <li><span class="plaincharacterwrap break">
                Zazzafooky but one two three!
            </span></li>
    <li><span class="plaincharacterwrap break">
                Zazzafooky2
            </span></li>
    <li><span class="plaincharacterwrap break">
                Zazzafooky3
            </span></li>

Ideally, I would like

 <li><span class="plaincharacterwrap break">Zazzafooky but one two three!</span></li><li><span class="plaincharacterwrap break">Zazzafooky2</span></li>

There is a lot of unneeded whitespace that I would like to get rid of, but it cannot necessarily be removed with strip(), and I cannot simply remove all spaces because I need to keep the text intact. How can I do it? It seems like a common enough problem that regex would be overkill, but is that the only way?

I don't have any <pre> tags, so I can be a little more aggressive.

Thanks again!

+6

python regex html-parsing beautifulsoup

Rio Nov 24 '10 at 19:31

3 answers

An old question, I know, but beautifulsoup4 has this helper called stripped_strings.

Try the following:

    # `about` is assumed to be an already-parsed BeautifulSoup tag from the answerer's own page
    description_el = about.find('p', {"class": "description"})
    descriptions = list(description_el.stripped_strings)
    description = "\n\n".join(descriptions) if descriptions else ""
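Applied to the markup from the question, this might look like the sketch below. Note that stripped_strings yields only the text, not the tags, so the second half shows one possible way to keep the markup by stripping each text node in place. The names SAMPLE and compact_html are made up for illustration, and the sketch assumes bs4 is installed, the built-in "html.parser" is used, and there are no <pre> tags whose whitespace matters.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample input modelled on the question (two items, arbitrary indentation).
SAMPLE = """
<li><span class="plaincharacterwrap break">
        Zazzafooky but one two three!
    </span></li>
<li><span class="plaincharacterwrap break">
        Zazzafooky2
    </span></li>
"""

# 1) Text only: stripped_strings skips whitespace-only strings and strips the rest.
soup = BeautifulSoup(SAMPLE, "html.parser")
texts = list(soup.stripped_strings)


def compact_html(html: str) -> str:
    """Hypothetical helper: strip every plain text node but keep the tags."""
    tree = BeautifulSoup(html, "html.parser")
    for node in tree.find_all(string=True):
        # Leave comments, doctypes and other special string types untouched.
        if type(node) is not NavigableString:
            continue
        stripped = node.strip()
        if stripped:
            node.replace_with(NavigableString(stripped))
        else:
            node.extract()  # whitespace-only node between tags: drop it
    return str(tree)


compacted = compact_html(SAMPLE)
print(texts)
print(compacted)
```

Stripping node by node leaves whitespace inside the text (such as "one two three") alone, but it also removes spaces that separate adjacent inline elements, so it is only safe when that spacing is not significant.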

+7

twig 15 Sep '13 at 13:24

 re.sub(r'[\ \n]{2,}', '', yourstring)

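The regex [\ \n]{2,} matches any run of two or more spaces and newlines. (The backslash before the space is optional inside a character class.) A more thorough implementation does the two replacements separately:

    import re

    yourstring = re.sub(r' {2,}', '', yourstring)  # drop runs of two or more spaces
    yourstring = re.sub(r'\n+', '', yourstring)    # drop all newlines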

I would have expected the first pattern to replace only repeated newlines, but it seems (at least for me) to work fine.
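A variation on the regex idea, offered only as a sketch: rather than counting spaces, remove whatever whitespace sits directly next to a tag boundary. This also covers the case where a single space or newline separates a tag from its text. The function name strip_space_around_tags and the SAMPLE string are made up for illustration, and the approach assumes there are no <pre> tags, that whitespace next to a tag is never significant, and that no literal "<" or ">" appears inside text or attribute values.

```python
import re

# Sample input modelled on the question (two items, arbitrary indentation).
SAMPLE = """
<li><span class="plaincharacterwrap break">
        Zazzafooky but one two three!
    </span></li>
<li><span class="plaincharacterwrap break">
        Zazzafooky2
    </span></li>
"""


def strip_space_around_tags(html: str) -> str:
    """Hypothetical helper: delete whitespace that touches a '<' or '>'."""
    html = re.sub(r">\s+", ">", html)  # whitespace right after a tag
    html = re.sub(r"\s+<", "<", html)  # whitespace right before a tag
    return html


result = strip_space_around_tags(SAMPLE)
print(result)
```

Spaces inside the text itself are untouched, because they are never adjacent to an angle bracket.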

0

Raffettler Nov 24 '10 at 19:42

Andrew Clark · Accepted Answer · 2010-11-24T19:49:03+0000

Here's how you can do it without regular expressions:

    >>> html = """ <li><span class="plaincharacterwrap break">
    ...                     Zazzafooky but one two three!
    ...                 </span></li>
    ... <li><span class="plaincharacterwrap break">
    ...                     Zazzafooky2
    ...                 </span></li>
    ... <li><span class="plaincharacterwrap break">
    ...                     Zazzafooky3
    ...                 </span></li>
    ... """
    >>> html = "".join(line.strip() for line in html.split("\n"))
    >>> html
    '<li><span class="plaincharacterwrap break">Zazzafooky but one two three!</span></li><li><span class="plaincharacterwrap break">Zazzafooky2</span></li><li><span class="plaincharacterwrap break">Zazzafooky3</span></li>'
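The same idea can be wrapped in a small function, sketched here with one caveat worth knowing. The name join_stripped_lines is made up for illustration, and str.splitlines() is swapped in for split("\n") so that "\r\n" line endings are handled too. Because lines are joined with no separator, text that wraps across lines loses the space at the break.

```python
def join_stripped_lines(html: str) -> str:
    """Hypothetical helper: strip each line, then concatenate with no separator."""
    return "".join(line.strip() for line in html.splitlines())


# Sample input modelled on the question (arbitrary indentation).
SAMPLE = """
<li><span class="plaincharacterwrap break">
        Zazzafooky2
    </span></li>
"""

one_line = join_stripped_lines(SAMPLE)

# Caveat: a sentence broken across two lines is glued together at the break.
wrapped = join_stripped_lines("<p>one two\n    three</p>")

print(one_line)
print(wrapped)
```

If wrapped text can occur in the input, joining with a single space and then cleaning up around the tags may be a better fit than joining with an empty string.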
