How to make beautifulsoup encode and decode the contents of a script tag

Question


I am using BeautifulSoup to parse HTML. Whenever a page contains an inline script tag, BeautifulSoup escapes the special characters in the script's content as HTML entities and does not convert them back when the document is written out.

This is the code I'm using:

    from bs4 import BeautifulSoup

    if __name__ == '__main__':
        htmlData = '<html> <head> <script type="text/javascript"> console.log("< < not able to write these & also these >> "); </script> </head> <body> <div> start of div </div> </body> </html>'
        soup = BeautifulSoup(htmlData)
        # ... using BeautifulSoup ...
        print(soup.prettify())

I want this output:

    <html>
     <head>
      <script type="text/javascript">
       console.log("< < not able to write these & also these >> ");
      </script>
     </head>
     <body>
      <div>
       start of div
      </div>
     </body>
    </html>

But I get this output:

    <html>
     <head>
      <script type="text/javascript">
       console.log("&lt; &lt; not able to write these &amp; also these &gt;&gt; ");
      </script>
     </head>
     <body>
      <div>
       start of div
      </div>
     </body>
    </html>
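For context, HTML treats the body of a script element as raw text, which is why the unescaped form is the expected one. The following is a minimal sketch that uses only Python's standard-library html.parser, not BeautifulSoup; the class name ScriptTextCollector and the variable names are hypothetical and chosen for illustration. It shows the script body reaching the parser callbacks exactly as written in the source.

```python
from html.parser import HTMLParser


class ScriptTextCollector(HTMLParser):
    """Hypothetical helper: collects the text found inside <script> elements."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script bodies arrive as raw text, possibly in several chunks.
        if self.in_script:
            self.chunks.append(data)


html_data = (
    '<html> <head> <script type="text/javascript"> '
    'console.log("< < not able to write these & also these >> "); '
    '</script> </head> <body> <div> start of div </div> </body> </html>'
)

collector = ScriptTextCollector()
collector.feed(html_data)
collector.close()

# Join the chunks so the result does not depend on how the parser split the data.
script_text = "".join(collector.chunks)
print(script_text)
```

The characters <, & and > come through untouched here, so any escaping seen in the final output is introduced at serialization time rather than being required by the markup.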


Tags: python, beautifulsoup

user1557858 Dec 02 '12 at 18:13


2 answers

unutbu · Answer 1 · 2012-12-02T18:37:19+0000

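You can try lxml, which leaves the script content unescaped when it serializes the document:

    import lxml.html as LH

    if __name__ == '__main__':
        htmlData = '<html> <head> <script type="text/javascript"> console.log("< < not able to write these & also these >> "); </script> </head> <body> <div> start of div </div> </body> </html>'
        doc = LH.fromstring(htmlData)
        # encoding='unicode' makes tostring return text rather than bytes,
        # so print() shows the markup itself
        print(LH.tostring(doc, pretty_print=True, encoding='unicode'))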

gives

 <html> <head><script type="text/javascript"> console.log("< < not able to write these & also these >> "); </script></head> <body> <div> start of div </div> </body> </html>

rofls · Answer 2 · 2012-12-02T18:23:28+0000

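You can convert the entities back yourself after serializing. Note that str.replace returns a new string, so the result has to be assigned back, and &amp; should be handled last so that text such as &amp;lt; is not unescaped twice. Be aware that this unescapes entities everywhere in the output, not only inside the script tag:

    htmlCodes = (
        ('<', '&lt;'),
        ('>', '&gt;'),
        ('"', '&quot;'),
        ("'", '&#39;'),
        ('&', '&amp;'),  # last, to avoid double unescaping
    )

    output = soup.prettify()
    for char, entity in htmlCodes:
        # replace() does not modify the string in place; keep the result
        output = output.replace(entity, char)
    print(output)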
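A narrower variant of the replacement idea above is to unescape entities only inside script elements, so that escaped text elsewhere in the document stays valid. The following is a rough sketch using only the standard library (html.unescape and re), applied to the already serialized markup; the function name unescape_script_bodies and the sample string are hypothetical, and the regular expression assumes simple, non-nested script tags.

```python
import html
import re

# Matches one <script ...> element: opening tag, body, closing tag.
SCRIPT_RE = re.compile(
    r"(<script\b[^>]*>)(.*?)(</script>)",
    re.IGNORECASE | re.DOTALL,
)


def unescape_script_bodies(markup: str) -> str:
    """Unescape HTML entities inside <script> elements only."""

    def fix(match):
        open_tag, body, close_tag = match.groups()
        return open_tag + html.unescape(body) + close_tag

    return SCRIPT_RE.sub(fix, markup)


# Sample serialized output with escaped text both inside and outside the script.
escaped = (
    '<script type="text/javascript"> '
    'console.log("&lt; &lt; not able to write these &amp; also these &gt;&gt; "); '
    '</script> <div> a &lt; b </div>'
)

result = unescape_script_bodies(escaped)
print(result)
```

With BeautifulSoup this would be applied to the string returned by soup.prettify(). A regular expression is not a full HTML parser, so this is best treated as a post-processing workaround for simple documents.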
