How to remove "&nbsp" from HTML content?

Question

I have an HTML page:

<div class="theater"> <div class="desc" id="theater_16109207495969942346"> <h2 class="name"><a href="/movies?near=pune&amp;tid=df8f66de0a592b4a" id="link_1_theater_16109207495969942346">Esquare Victory Camp</a></h2> <div class="info">site no 2429,general thimayya road, camp contonment,oppositekayani bakery, Pune - 020 2613 2975 <a class="fl" href="" target="_top"></a> </div> </div> <div class="showtimes"> <div class="show_left"> <div class="movie"> <div class="name"><a href="/movies?near=pune&amp;mid=1cdcf90092189400">Hawaa Hawaai</a> </div><span class="info">Drama - Hindi</span> <div class="times"><span style="color:#666"><span style="padding:0 "></span> <!-- -->10:30am</span><span style="color:#666"><span style="padding:0 "> &amp;nbsp</span> <!-- -->3:45</span><span style="color:#666"><span style="padding:0 "> &amp;nbsp</span> <!-- -->6:00</span><span style="color:"><span style="padding:0 "> &amp;nbsp</span> <!-- -->8:30pm</span> </div> </div> </div> <div class="show_right"> <div class="movie"> <div class="name"><a href="/movies?near=pune&amp;mid=6b59ad39004d895b">The Amazing Spider Man 2</a> </div><span class="info">Action/Adventure/Thriller - English - <a class="fl" href="/url?q=http://www.youtube.com/watch%3Fv%3DSCjCk59PIzw&amp;sa=X&amp;oi=movies&amp;ii=0&amp;usg=AFQjCNGpVM5U04h0acABA7eApb6EIO4Ejw">Trailer</a></span> <div class="times"><span style="color:#666"><span style="padding:0 "></span> <!-- -->1:00</span><span style="color:"><span style="padding:0 "> &amp;nbsp</span> <!-- -->10:45pm</span> </div> </div> </div> <p class="clear"></p> </div> </div>

As you can see, &nbsp appears in many places, and the page also contains various other Unicode characters. I want to extract the text content of this page. This is what I'm doing:

    def removeNonAscii(s):
        return "".join(i for i in s if ord(i) < 128)

    myName = soup.findAll("div", {"class": "theater"})
    for x in myName:
        xt = str(x)
        print removeNonAscii(xt)
        print "<br>"

Result:

 Esquare Victory Camp site no 2429,general thimayya road, camp contonment,oppositekayani bakery, Pune - 020 2613 2975 Hawaa Hawaai Drama - Hindi 10:30am &nbsp3:45 &nbsp6:00 &nbsp8:30pm The Amazing Spider Man 2 Action/Adventure/Thriller - English - Trailer 1:00 &nbsp10:45pm

Everything looks fine except for the &nbsp. I tried replacing &nbsp and looked for other solutions, but nothing has worked so far. I think the missing ; is what causes the problem. How can I remove the &nbsp?

python string html unicode beautifulsoup

impossible May 12, '14 at 15:01

2 answers

lxml.html might be a better library for you: it replaces "&nbsp;" and other HTML entities with the correct characters.

    import lxml.html
    import lxml.html.clean

    html = """your HTML"""
    doc = lxml.html.fromstring(html)
    cleaner = lxml.html.clean.Cleaner(style=True)
    doc = cleaner.clean_html(doc)
    text = doc.text_content()

Azure Mar 17 '18 at 9:58

ofrommel · Accepted Answer · 2014-05-13T08:39:49+0000

Depending on the stage of processing at which you want to remove the non-breaking space, this can be quite simple. For example, when processing an HTML fragment, you can simply remove the string "&nbsp" from the text elements:

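    from bs4 import BeautifulSoup

    s = """your HTML"""
    # Name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(s, "html.parser")
    # Collect every text node in the document
    texts = soup.find_all(text=True)
    for t in texts:
        # Strip the literal "&nbsp" left over from the double-escaped "&amp;nbsp"
        newtext = t.replace("&nbsp", "")
        t.replace_with(newtext)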
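
If you only need the cleaned text after extraction, something like the following may also work. It is a minimal sketch using only the standard library, not part of the original answer. The helper name clean_text is hypothetical, and it assumes you already have the extracted string (for example the output shown in the question). The source contains "&amp;nbsp", so the parser hands you the literal text "&nbsp"; a second decoding pass with html.unescape turns that into a real non-breaking space, even without the trailing semicolon, and the whitespace is then normalized.

```python
import html


def clean_text(text):
    """Decode leftover entity text such as '&nbsp' and collapse whitespace.

    Hypothetical helper for illustration; assumes `text` was already
    extracted from the page.
    """
    # Second decoding pass: "&nbsp" (with or without ";") becomes U+00A0
    decoded = html.unescape(text)
    # Turn non-breaking spaces into ordinary spaces
    decoded = decoded.replace("\u00a0", " ")
    # Collapse runs of whitespace into single spaces and trim the ends
    return " ".join(decoded.split())


if __name__ == "__main__":
    print(clean_text("10:30am &nbsp3:45 &nbsp6:00 &nbsp8:30pm"))
```

Note that html.unescape decodes every entity-like sequence it recognizes, not just "&nbsp", so use a plain replace("&nbsp", "") as in the answer above if the text may legitimately contain strings such as "&copy".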
