Node.js Cheerio parser breaks UTF-8 encoding

I parse the response of my request with Cheerio as follows:

 var request = require("request");
 var cheerio = require("cheerio");

 // The URL must be a quoted string
 var url = "http://shop.nag.ru/catalog/16939.IP-videonablyudenie-OMNY/16944.IP-kamery-OMNY-c-vario-obektivom/16704.OMNY-1000-PRO";

 request.get(url, function (err, response, body) {
   console.log(body);
   var $ = cheerio.load(body);
   console.log($(".description").html());
 });

In the output I see the contents, but in an unreadable encoding:

 // Plain body, console.log(body) (P.S. Russian characters):
 <h1><span style="font-size: 16px;">Уличная 3Мп IP HD камера OMNY - попробуйте найти лучше</span></h1><p style

 // Cheerio, console.log($(".description").html()):
 <h1><span style="font-size: 16px;">&#x423;&#x43B;&#x438;&#x447;&#x43D;&#x430;&#x44F; 3&#x41C;&#x43F; IP HD &#x43A;&#x430;&#x43C;&#x435;&#x440;&#x430; OMNY 

The target page is encoded in UTF-8. So why is Cheerio breaking my encoding?

I tried using iconv to decode the body:

 var body1 = iconv.decode(body, "utf-8"); 

but console.log($(".description").html()); still returns the same unreadable text.

2 answers

Cheerio did not break anything. What you see in the output are HTML entities, which any browser renders exactly like the original HTML input. Run this snippet to see what I mean:

 <h1><span style="font-size: 16px;">Уличная 3Мп IP HD камера OMNY - попробуйте найти лучше</span></h1>
 <h1><span style="font-size: 16px;">&#x423;&#x43B;&#x438;&#x447;&#x43D;&#x430;&#x44F; 3&#x41C;&#x43F; IP HD &#x43A;&#x430;&#x43C;&#x435;&#x440;&#x430; OMNY - &#x43F;&#x43E;&#x43F;&#x440;&#x43E;&#x431;&#x443;&#x439;&#x442;&#x435; &#x43D;&#x430;&#x439;&#x442;&#x438; &#x43B;&#x443;&#x447;&#x448;&#x435;</span></h1>

&#x423;, for example, is the character У encoded as an HTML entity, in the same way that &gt; represents > .
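
To make the equivalence concrete, here is a minimal sketch that turns numeric character references back into the characters they stand for. The helper name decodeNumericEntities is hypothetical and is not part of cheerio; it handles only numeric references (hex and decimal), not named ones such as &gt;.

```javascript
// Hypothetical helper, not part of cheerio: decode numeric character
// references such as &#x423; (hex) or &#1059; (decimal).
function decodeNumericEntities(html) {
  return html.replace(/&#(x[0-9a-f]+|\d+);/gi, (match, code) => {
    const isHex = code[0] === 'x' || code[0] === 'X';
    const codePoint = parseInt(isHex ? code.slice(1) : code, isHex ? 16 : 10);
    // Note: String.fromCodePoint throws a RangeError for values above 0x10FFFF.
    return String.fromCodePoint(codePoint);
  });
}

const encoded =
  '&#x423;&#x43B;&#x438;&#x447;&#x43D;&#x430;&#x44F; 3&#x41C;&#x43F; IP HD ' +
  '&#x43A;&#x430;&#x43C;&#x435;&#x440;&#x430; OMNY';
const decoded = decodeNumericEntities(encoded);
console.log(decoded); // => Уличная 3Мп IP HD камера OMNY
```

The entity string and the Cyrillic string carry the same text; only the serialization differs.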

However, if you want to get the unencoded text, you can set the decodeEntities option to false:

 const $ = cheerio.load(
   '<h1><span style="font-size: 16px;">Уличная 3Мп IP HD камера OMNY - попробуйте найти лучше</span></h1>',
   { decodeEntities: false }
 );
 console.log($('span').html()); // => Уличная 3Мп IP HD камера OMNY - попробуйте найти лучше
 <script src="https://bundle.run/cheerio@1.0.0-rc.3"></script>

Today I had the same problem when I tried to load a page with cheerio that contained special characters such as ç, á, é, etc.

By default, cheerio decodes entities while parsing and, when it serializes the document again, writes non-ASCII characters out as numeric HTML character references.

For example, instead of ç it gives us &#xE7; .
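
The effect can be imitated with a few lines of plain JavaScript. This is only a sketch of the idea, assuming that every non-ASCII character is replaced by a hexadecimal reference; the helper encodeNonAscii is hypothetical and is not cheerio's actual implementation.

```javascript
// Hypothetical helper illustrating the idea; not cheerio's own code.
function encodeNonAscii(text) {
  // The "u" flag makes the regex match whole code points, so characters
  // outside the Basic Multilingual Plane become a single reference.
  return text.replace(/[^\x00-\x7F]/gu, (ch) =>
    '&#x' + ch.codePointAt(0).toString(16).toUpperCase() + ';'
  );
}

console.log(encodeNonAscii('ç á é')); // => &#xE7; &#xE1; &#xE9;
```

A browser renders the encoded form exactly like the original characters, so the result is valid HTML even though it is hard to read in a console.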

To deal with this problem, I just had to turn this behavior off by passing decodeEntities: false as an option to cheerio.load:

 const $ = cheerio.load(body, { decodeEntities: false }); 