How to remove "&nbsp" from HTML content?
I have an HTML page:
<div class="theater">
  <div class="desc" id="theater_16109207495969942346">
    <h2 class="name"><a href="/movies?near=pune&tid=df8f66de0a592b4a" id="link_1_theater_16109207495969942346">Esquare Victory Camp</a></h2>
    <div class="info">site no 2429,general thimayya road, camp contonment,oppositekayani bakery, Pune - 020 2613 2975 <a class="fl" href="" target="_top"></a> </div>
  </div>
  <div class="showtimes">
    <div class="show_left">
      <div class="movie">
        <div class="name"><a href="/movies?near=pune&mid=1cdcf90092189400">Hawaa Hawaai</a> </div><span class="info">Drama - Hindi</span>
        <div class="times"><span style="color:#666"><span style="padding:0 "></span> <!-- -->10:30am</span><span style="color:#666"><span style="padding:0 "> &nbsp</span> <!-- -->3:45</span><span style="color:#666"><span style="padding:0 "> &nbsp</span> <!-- -->6:00</span><span style="color:"><span style="padding:0 "> &nbsp</span> <!-- -->8:30pm</span> </div>
      </div>
    </div>
    <div class="show_right">
      <div class="movie">
        <div class="name"><a href="/movies?near=pune&mid=6b59ad39004d895b">The Amazing Spider Man 2</a> </div><span class="info">Action/Adventure/Thriller - English - <a class="fl" href="/url?q=http://www.youtube.com/watch%3Fv%3DSCjCk59PIzw&sa=X&oi=movies&ii=0&usg=AFQjCNGpVM5U04h0acABA7eApb6EIO4Ejw">Trailer</a></span>
        <div class="times"><span style="color:#666"><span style="padding:0 "></span> <!-- -->1:00</span><span style="color:"><span style="padding:0 "> &nbsp</span> <!-- -->10:45pm</span> </div>
      </div>
    </div>
    <p class="clear"></p>
  </div>
</div>

As you can see, &nbsp appears in many places, and there are many other Unicode characters as well. I want to extract the text content of this page. This is what I am doing:
def removeNonAscii(s):
    return "".join(i for i in s if ord(i)<128)

myName = soup.findAll("div", {"class" : "theater"})
for x in myName:
    xt = str(x)
    print removeNonAscii(xt)
    print "<br>"

Result:
Esquare Victory Camp site no 2429,general thimayya road, camp contonment,oppositekayani bakery, Pune - 020 2613 2975 Hawaa Hawaai Drama - Hindi 10:30am &nbsp 3:45 &nbsp 6:00 &nbsp 8:30pm The Amazing Spider Man 2 Action/Adventure/Thriller - English - Trailer 1:00 &nbsp 10:45pm

Everything looks good except the &nbsp. I tried replacing &nbsp and looked for other solutions, but nothing has worked so far. I think the &nbsp without the trailing ; is what causes the problem. How can I remove the &nbsp?
impossible
2 answers
Depending on the processing stage at which you want to remove the non-breaking space, this can be quite simple. For example, when processing an HTML fragment, you can simply remove the string "&nbsp" from the text elements:
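from bs4 import BeautifulSoup

s = """your HTML"""
soup = BeautifulSoup(s, "html.parser")
# collect every text node in the document
texts = soup.find_all(text=True)
for t in texts:
    # drop the literal entity; some parsers have already decoded it to U+00A0
    newtext = t.replace("&nbsp", "").replace(u"\xa0", "")
    t.replace_with(newtext)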
ofrommel
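A note on why the filter in the question leaves the entity behind: the text &nbsp is made up only of ASCII characters, so a check like ord(i) < 128 never drops it. Below is a minimal Python 3 sketch using only the standard library. The names NBSP_ENTITY and strip_nbsp are hypothetical helpers chosen for illustration, not part of BeautifulSoup. It removes the entity with or without its semicolon, as well as the decoded U+00A0 character.

```python
import re

# Matches the entity with or without the trailing semicolon.
NBSP_ENTITY = re.compile(r"&nbsp;?")


def remove_non_ascii(s):
    """Python 3 version of the filter from the question."""
    return "".join(ch for ch in s if ord(ch) < 128)


def strip_nbsp(s):
    """Remove the literal entity and the decoded non-breaking space."""
    return NBSP_ENTITY.sub("", s).replace("\xa0", "")


sample = "6:00 &nbsp8:30pm"
# The ASCII filter returns the string unchanged: every character of
# "&nbsp" has a code point below 128.
print(remove_non_ascii(sample))
# Removing the entity explicitly is what actually cleans the text.
print(strip_nbsp(sample))
```

If the parser has already decoded the entity, only the second replacement in strip_nbsp has any effect.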
lxml.html might be a better library for you; it replaces "&nbsp" and other HTML entities with the correct characters.
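import lxml.html
import lxml.html.clean

html = """your HTML"""
doc = lxml.html.fromstring(html)
# style=True also removes style elements and attributes
cleaner = lxml.html.clean.Cleaner(style=True)
doc = cleaner.clean_html(doc)
# text of the whole tree, with the tags stripped
text = doc.text_content()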
Azure
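If lxml is not available, the same idea (let the parser decode the entities, then normalize the result) can be sketched with Python 3's standard-library html.parser. This is only an illustration under that assumption, not what lxml does internally. TextExtractor and extract_text are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes; tags and comments are ignored."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # "&nbsp" (even without the semicolon) before calling handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(html_fragment):
    parser = TextExtractor()
    parser.feed(html_fragment)
    parser.close()
    # Turn the decoded non-breaking spaces into ordinary spaces,
    # then collapse runs of whitespace.
    text = "".join(parser.parts).replace("\xa0", " ")
    return " ".join(text.split())


fragment = (
    '<div class="times"><span style="color:#666"><span style="padding:0 "> &nbsp</span>'
    ' <!-- -->3:45</span><span style="color:"><span style="padding:0 "> &nbsp</span>'
    ' <!-- -->8:30pm</span> </div>'
)
print(extract_text(fragment))  # prints: 3:45 8:30pm
```

Replacing U+00A0 with a space rather than deleting it keeps neighbouring words from running together.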