<> changed to &lt; and &gt; when parsing HTML using BeautifulSoup in Python

Question


When processing HTML with BeautifulSoup, < and > were converted to &lt; and &gt;. Since the tag brackets have been converted, the whole soup has lost its structure. Any suggestions?

+6

python html parsing beautifulsoup

flyingfoxlee Feb 03 '13 at 3:42


2 answers

rkday · Answer 1 · 2013-02-03T10:38:08+0000

Setting formatter=None when you output the soup (for example in prettify() or decode()) may help ( http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters ), but the problem may also be a sign that your HTML is not valid.

If this does not work, can you provide sample code and HTML that reproduces the problem?
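
To illustrate the formatter suggestion, here is a minimal sketch. The sample markup is hypothetical (the question did not include any), and it assumes bs4 is installed and uses the built-in "html.parser" builder. It shows that the default output formatter re-escapes < in text, while formatter=None writes the text as-is:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the original question did not include any.
html = "<p>a &lt; b</p>"
soup = BeautifulSoup(html, "html.parser")

# While parsing, the entity &lt; becomes a literal "<" in the text node.
text = soup.p.get_text()

# The default formatter ("minimal") escapes it again on output.
escaped = soup.p.decode()

# formatter=None writes strings unchanged; the result may be invalid HTML.
unescaped = soup.p.decode(formatter=None)

print(text)
print(escaped)
print(unescaped)
```

Note that formatter=None only affects how text is written out. If the tags themselves come back as &lt; and &gt;, the input was most likely already escaped (or otherwise broken) before BeautifulSoup saw it.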

sonique · Answer 2 · 2019-04-06T00:44:04+0000

This may be caused by invalid characters introduced by a manual encode/decode step, which leaves BeautifulSoup unable to parse the input properly. I solved it by passing my string directly to BeautifulSoup without any encoding or decoding. In my case, I had been trying to convert UTF-16 to UTF-8 myself.
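
A minimal sketch of that approach, assuming the input arrives as UTF-16 bytes that start with a byte order mark (the sample bytes here are made up for illustration). The raw bytes are handed to BeautifulSoup unchanged, and its built-in encoding detection does the decoding:

```python
from bs4 import BeautifulSoup

# Hypothetical input: UTF-16 bytes as they might arrive from a file or socket.
# Python's "utf-16" codec prepends a byte order mark (BOM) when encoding.
raw_bytes = "<html><body><p>héllo</p></body></html>".encode("utf-16")

# Pass the bytes as-is, with no manual decode/encode round trip.
soup = BeautifulSoup(raw_bytes, "html.parser")

paragraph_text = soup.p.get_text()
print(paragraph_text)
print(soup.original_encoding)  # the encoding BeautifulSoup settled on
```

If the detection guesses wrong for your data, the BeautifulSoup constructor also accepts a from_encoding argument to name the encoding explicitly.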
