How to remove spaces in BeautifulSoup

I have a bunch of HTML that I am parsing with BeautifulSoup, and everything is going fine except for one minor problem. I want to save the output as a single-line string. My current output is:

<li><span class="plaincharacterwrap break"> Zazzafooky but one two three! </span></li> <li><span class="plaincharacterwrap break"> Zazzafooky2 </span></li> <li><span class="plaincharacterwrap break"> Zazzafooky3 </span></li> 

Ideally, I would like

 <li><span class="plaincharacterwrap break">Zazzafooky but one two three!</span></li><li><span class="plaincharacterwrap break">Zazzafooky2</span></li> 

There are a lot of redundant spaces that I would like to get rid of, but strip() does not necessarily remove them, and I cannot simply remove every space because I need to keep the text intact. How can I do it? This seems like a common enough problem that a regex would be overkill, but is that the only way?

I don't have any <pre> tags, so I can be a little more aggressive.

Thanks again!

+6
python regex html-parsing beautifulsoup
3 answers

Here's how you can do it without regular expressions:

    >>> html = """<li><span class="plaincharacterwrap break">
    ...     Zazzafooky but one two three!
    ... </span></li>
    ... <li><span class="plaincharacterwrap break">
    ...     Zazzafooky2
    ... </span></li>
    ... <li><span class="plaincharacterwrap break">
    ...     Zazzafooky3
    ... </span></li>
    ... """
    >>> html = "".join(line.strip() for line in html.split("\n"))
    >>> html
    '<li><span class="plaincharacterwrap break">Zazzafooky but one two three!</span></li><li><span class="plaincharacterwrap break">Zazzafooky2</span></li><li><span class="plaincharacterwrap break">Zazzafooky3</span></li>'
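
The same idea can be wrapped in a small function. This is only a sketch: the name strip_line_whitespace and the sample input are made up for illustration. It also shows one caveat worth knowing: because the lines are joined with no separator, text that wraps across lines is glued together.

```python
def strip_line_whitespace(html: str) -> str:
    """Strip leading/trailing whitespace from every line, then join with no separator."""
    return "".join(line.strip() for line in html.splitlines())


# Hypothetical sample input, modelled on the markup in the question.
sample = """
<li><span class="plaincharacterwrap break">
    Zazzafooky but one two three!
</span></li>
<li><span class="plaincharacterwrap break">
    Zazzafooky2
</span></li>
"""

result = strip_line_whitespace(sample)
print(result)

# Caveat: text wrapped across lines loses the space that the newline stood for.
print(strip_line_whitespace("one\ntwo"))  # prints "onetwo"
```

Whitespace inside a single line (for example "> text <" all on one line) is left untouched, so this only helps when the extra whitespace comes from line breaks and indentation.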
+9

An old question, I know, but beautifulsoup4 has this helper called stripped_strings.

Try the following:

    description_el = about.find('p', {"class": "description"})
    descriptions = list(description_el.stripped_strings)
    description = "\n\n".join(descriptions) if descriptions else ""
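
Note that stripped_strings yields only the text, not the tags. Here is a self-contained sketch using the markup from the question (the surrounding <ul> and the variable names are assumptions for illustration). The second half is not part of this answer: it is one possible way to keep the markup, by stripping each text node in place and serializing the tree again.

```python
from bs4 import BeautifulSoup

# Hypothetical input, modelled on the markup in the question.
html = """
<ul>
  <li><span class="plaincharacterwrap break"> Zazzafooky but one two three! </span></li>
  <li><span class="plaincharacterwrap break"> Zazzafooky2 </span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# stripped_strings skips whitespace-only strings and strips the rest.
texts = list(soup.stripped_strings)
print(texts)  # ['Zazzafooky but one two three!', 'Zazzafooky2']

# Alternative sketch that keeps the tags: trim every text node in the tree.
# Assumes there is no <pre> content and no comments that need preserving.
for node in soup.find_all(string=True):
    stripped = node.strip()
    if stripped:
        node.replace_with(stripped)  # keep the text, minus surrounding whitespace
    else:
        node.extract()               # drop whitespace-only nodes between tags

compact = str(soup)
print(compact)
```

Dropping whitespace-only nodes also removes spaces between adjacent inline elements, which can change how the page renders, so this fits best when the output is stored or compared rather than displayed.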
+7
 re.sub(r'[\ \n]{2,}', '', yourstring) 

The regex [\ \n]{2,} matches any run of two or more spaces and newlines (the space is escaped here). A more thorough implementation is as follows:

    yourstring = re.sub(r' {2,}', '', yourstring)
    yourstring = re.sub(r'\n+', '', yourstring)

I would have expected the first version to remove only runs of multiple newlines, but it seems (at least for me) to work fine.
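
To see what the pattern does and does not remove, here is a short sketch. The second half is a different technique from the one in this answer: it removes whitespace only where it touches a tag boundary, which leaves single spaces between words alone. It assumes that ">" and "<" appear only as tag delimiters and that there is no <pre> content; the variable names are made up for illustration.

```python
import re

# The pattern from this answer: runs of two or more spaces/newlines are deleted,
# which also glues the neighbouring words together. Single spaces survive.
print(re.sub(r'[ \n]{2,}', '', 'a  b\n\nc d'))  # prints "abc d"

# Current output from the question, where every gap is a single space.
current = ('<li><span class="plaincharacterwrap break"> Zazzafooky but one two three! </span></li> '
           '<li><span class="plaincharacterwrap break"> Zazzafooky2 </span></li> ')

# The {2,} pattern finds nothing to remove here.
unchanged = re.sub(r'[ \n]{2,}', '', current)

# Alternative sketch: drop whitespace directly after ">" and directly before "<".
compact = re.sub(r'>\s+', '>', current)
compact = re.sub(r'\s+<', '<', compact)
print(compact)
```

Like any regex over HTML, this is fragile: a literal ">" inside text or an attribute value would be treated as a tag boundary, and spaces between adjacent inline elements are lost.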

0