How to make beautifulsoup encode and decode the contents of a script tag

I am trying to use BeautifulSoup to parse HTML, but whenever I hit a page with an inline script tag, BeautifulSoup escapes the contents of the script (for example, `<` becomes `&lt;`) and does not unescape them again when the document is written out.

This is the code I'm using:

    from bs4 import BeautifulSoup

    if __name__ == '__main__':
        htmlData = (
            '<html> <head> <script type="text/javascript"> '
            'console.log("< < not able to write these & also these >> "); '
            '</script> </head> <body> <div> start of div </div> </body> </html>'
        )
        soup = BeautifulSoup(htmlData)
        # ... using BeautifulSoup ...
        print(soup.prettify())

I want this output:

 <html> <head> <script type="text/javascript"> console.log("< < not able to write these & also these >> "); </script> </head> <body> <div> start of div </div> </body> </html> 

But I get this output:

 <html> <head> <script type="text/javascript"> console.log("&lt; &lt; not able to write these &amp; also these &gt;&gt; "); </script> </head> <body> <div> start of div </div> </body> </html> 
2 answers

You can try lxml, which leaves the contents of script tags unescaped when it serializes HTML:

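    import lxml.html as LH

    if __name__ == '__main__':
        htmlData = (
            '<html> <head> <script type="text/javascript"> '
            'console.log("< < not able to write these & also these >> "); '
            '</script> </head> <body> <div> start of div </div> </body> </html>'
        )
        doc = LH.fromstring(htmlData)
        # encoding="unicode" returns str instead of bytes, so print() shows
        # the markup itself on Python 3 rather than a b'...' literal.
        print(LH.tostring(doc, pretty_print=True, encoding="unicode"))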

gives

 <html> <head><script type="text/javascript"> console.log("< < not able to write these & also these >> "); </script></head> <body> <div> start of div </div> </body> </html> 
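If switching parsers is not an option, BeautifulSoup itself accepts a `formatter` argument on output, and passing `formatter=None` asks it not to substitute entities at all. The snippet below is only a minimal sketch, assuming the bs4 package is installed and using the standard library `html.parser` backend. Note that it also leaves `&` and `<` unescaped in ordinary text, so the result may not be valid HTML if the page body contains those characters.

```python
from bs4 import BeautifulSoup

html_data = (
    '<html> <head> <script type="text/javascript"> '
    'console.log("< < not able to write these & also these >> "); '
    '</script> </head> <body> <div> start of div </div> </body> </html>'
)

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html_data, "html.parser")

# formatter=None turns off entity substitution for the whole document on output.
output = soup.prettify(formatter=None)
print(output)
```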

You can do something like this:

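    # Replace '&amp;' last so that text such as '&amp;lt;' is not unescaped twice.
    htmlCodes = (
        ('<', '&lt;'),
        ('>', '&gt;'),
        ('"', '&quot;'),
        ("'", '&#39;'),
        ('&', '&amp;'),
    )

    # str.replace returns a new string, so the result has to be kept.
    # Note: this unescapes entities in the whole document, not only in <script>.
    html = soup.prettify()
    for char, entity in htmlCodes:
        html = html.replace(entity, char)
    print(html)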
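The same idea can be narrowed to script elements only, so that entities in the rest of the page stay escaped. The following is a rough sketch rather than a robust solution: it assumes the script blocks are simple enough to match with a regular expression, and the names `SCRIPT_RE` and `unescape_scripts` are made up for illustration. It uses only the standard library (`html.unescape` and `re`).

```python
import html
import re

# Captures the opening <script ...> tag, the script body, and the closing tag.
SCRIPT_RE = re.compile(
    r"(<script\b[^>]*>)(.*?)(</script>)",
    re.IGNORECASE | re.DOTALL,
)


def unescape_scripts(markup: str) -> str:
    """Unescape HTML entities inside <script> bodies and nowhere else."""

    def _fix(match):
        open_tag, body, close_tag = match.groups()
        return open_tag + html.unescape(body) + close_tag

    return SCRIPT_RE.sub(_fix, markup)


escaped = (
    '<script type="text/javascript"> '
    'console.log("&lt; &lt; not able to write these &amp; also these &gt;&gt; "); '
    '</script> <div> a &lt; b </div>'
)
print(unescape_scripts(escaped))
```

Here the `&lt;` inside the `div` is left alone, while the script body gets its original characters back.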
