How to remove all script tags in BeautifulSoup?

Question

I am scraping a table from a web page and would like to rebuild the table with all script tags removed. Here is the source code:

    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url)
    soup = BeautifulSoup(response.text)
    table = soup.find('table')
    for row in table.find_all('tr'):
        for col in row.find_all('td'):
            # remove all different script tags
            # col.replace_with('')
            # col.decompose()
            # col.extract()
            col = col.contents

How can I remove all script tags? Take the following cell as an example, which includes the tags a, br and td.

 <td><a href="http://www.irit.fr/SC">Signal et Communication</a> <br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a> </td>

Expected Result:

 Signal et Communication Ingénierie Réseaux et Télécommunications

+5

python html html-parsing beautifulsoup

Sparkandndine Jul 18 '15 at 17:44

2 answers

Try calling col.string. This will give you only text.

+1

blasko Jul 18 '15 at 17:46

alecxe · Accepted Answer · 2015-07-18T17:52:49+0000

You are asking about get_text():

If you only need the text part of the document or tag, you can use get_text(). It returns all the text in the document or under the tag as a single Unicode string.

    td = soup.find("td")
    td.get_text()
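To get exactly the single-line result the question asks for, get_text() also accepts a separator and a strip flag. The following is a minimal sketch rather than a definitive recipe: it assumes the cell from the question is parsed on its own with the built-in "html.parser", and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# The example cell from the question, parsed on its own for illustration.
html = (
    '<td><a href="http://www.irit.fr/SC">Signal et Communication</a> '
    '<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et '
    'Télécommunications</a> </td>'
)

soup = BeautifulSoup(html, "html.parser")
td = soup.find("td")

# The first argument is the separator placed between text fragments;
# strip=True trims each fragment and drops whitespace-only ones.
text = td.get_text(" ", strip=True)
print(text)
```

Without the separator and strip arguments, the whitespace between the tags is kept as it appears in the markup.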

Note that .string will return None in this case, since td has several children:

If the tag contains several things, then it is not clear what .string should refer to, so .string is defined to be None.

Demo:

    >>> from bs4 import BeautifulSoup
    >>>
    >>> soup = BeautifulSoup(u"""
    ... <td><a href="http://www.irit.fr/SC">Signal et Communication</a>
    ... <br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
    ... </td>
    ... """)
    >>>
    >>> td = soup.td
    >>> print(td.string)
    None
    >>> print(td.get_text())
    Signal et Communication
    Ingénierie Réseaux et Télécommunications
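Applied to the loop from the question, the same call can rebuild the table as plain text. This is only a sketch under stated assumptions: the inline HTML stands in for the page fetched with requests, the second cell is made up for illustration, and table_to_rows is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample table standing in for response.text from the question.
html = """
<table>
  <tr>
    <td><a href="http://www.irit.fr/SC">Signal et Communication</a> <br/>
        <a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a> </td>
    <td><b>Example</b> cell</td>
  </tr>
</table>
"""


def table_to_rows(table):
    """Return the table as a list of rows, each a list of cell texts."""
    rows = []
    for row in table.find_all("tr"):
        # get_text() discards the markup inside each cell and keeps the text.
        cells = [col.get_text(" ", strip=True) for col in row.find_all("td")]
        rows.append(cells)
    return rows


soup = BeautifulSoup(html, "html.parser")
rows = table_to_rows(soup.find("table"))
print(rows)
```

Collecting the text into a new list, instead of reassigning col inside the loop as in the question, leaves the parsed tree untouched and gives a structure that is easy to write out again.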
