How to convert UTF-8 to text in an HTML object?

I have a downloader program that fetches pages from the Internet. The encoding of each page is different: some are in UTF-8 and some are Unicode. For example: &#97;, which shows the character 'a'; the pages are filled with these sequences. We must convert these encodings to plain text.

I used the UnicodeEncoding class in C#, but it does not help me.

How can I decode these encodings to real characters? Is there a class or method that converts this?

Thanks.

+4
3 answers

This is HTML-encoded; have you tried HtmlDecode? (You will need a reference to System.Web.dll.)

+6

Text in HTML pages that starts with &# and ends with ; is HTML-encoded.

You can decode them using:

 string html = ...; // your html
 string decoded = System.Web.HttpUtility.HtmlDecode(html);

Also see "The characters in the line changed after downloading HTML from the Internet" for code that makes sure you are loading the page with the correct character set.
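
As a rough illustration of the character-set point, here is a minimal sketch that decodes the raw response bytes using the charset named in the Content-Type header. The class name PageDecoder and the UTF-8 fallback are assumptions made for this example, not code from the linked question.

```csharp
using System;
using System.Net.Mime;
using System.Text;

public static class PageDecoder
{
    // Decodes raw response bytes using the charset from a Content-Type
    // header value such as "text/html; charset=utf-8".
    // Falls back to UTF-8 (an assumption) when no usable charset is present.
    public static string Decode(byte[] body, string contentTypeHeader)
    {
        Encoding encoding = Encoding.UTF8;
        if (!string.IsNullOrEmpty(contentTypeHeader))
        {
            try
            {
                var contentType = new ContentType(contentTypeHeader);
                if (!string.IsNullOrEmpty(contentType.CharSet))
                {
                    encoding = Encoding.GetEncoding(contentType.CharSet);
                }
            }
            catch (FormatException)
            {
                // Malformed header: keep the fallback encoding.
            }
            catch (ArgumentException)
            {
                // Unknown charset name: keep the fallback encoding.
            }
        }
        return encoding.GetString(body);
    }
}
```

The bytes and the header would typically come from whatever HTTP client you already use; once the string is decoded correctly, HtmlDecode can be applied to it.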

+5

You are confusing HTML/XML escaping with UTF-8/Unicode encoding.
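
To make the distinction concrete, here is a small sketch of the two separate steps. The class and method names are hypothetical, and it uses System.Net.WebUtility.HtmlDecode rather than the HttpUtility method from the other answers, only because it does not need a reference to System.Web.dll.

```csharp
using System.Net;
using System.Text;

public static class TwoLayers
{
    // Layer 1: character encoding. Turns the bytes received from the
    // network into a .NET string (here assuming the page is UTF-8).
    public static string BytesToText(byte[] raw)
    {
        return Encoding.UTF8.GetString(raw);
    }

    // Layer 2: HTML escaping. Turns character references such as "&#97;"
    // or "&amp;" inside that string into the characters they stand for.
    public static string Unescape(string markup)
    {
        return WebUtility.HtmlDecode(markup);
    }
}
```

Passing the UTF-8 bytes of "&#97;" through BytesToText still gives the five-character string "&#97;"; only Unescape turns it into "a". That is why an Encoding class alone does not help here.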

If the page is valid XML, life is simpler: you can parse it like any other XML document and then just fetch the relevant text nodes. All XML escaping will be "unescaped" by the time you get the text.
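
A minimal sketch of that approach with LINQ to XML, assuming the page really is well-formed XML (the helper name XmlText is made up for this example). Note that XML itself only predefines amp, lt, gt, quot and apos, so a page that uses other named entities without a DTD will not parse this way.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class XmlText
{
    // Parses the markup as XML and returns the value of every text node.
    // The parser resolves character references and the predefined entities,
    // so the returned strings are already unescaped.
    public static string[] TextNodes(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        return doc.DescendantNodes()
                  .OfType<XText>()
                  .Select(t => t.Value)
                  .ToArray();
    }
}
```

For the input `<p>&#97; &amp; b<b>c</b></p>` this yields two text nodes, "a & b" and "c".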

If it is arbitrary, and possibly invalid, HTML, then life is a little more complicated. You might want to normalize it into valid HTML first, then parse it and query the text nodes again.

If you can give us a more concrete example, it will be easier to advise you.

The HtmlDecode method suggested in the other answers may well be useful to you, but you definitely need to understand what is going on first. For example, you may want to decode only some fragments of the HTML: if you decode the entire document, you can end up with text that looks like HTML tags but was really just literal text in the original document.
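
The pitfall can be seen in a short sketch. The names are hypothetical, WebUtility.HtmlDecode stands in for HttpUtility.HtmlDecode, and the second method assumes the fragment is well-formed XML.

```csharp
using System.Net;
using System.Xml.Linq;

public static class DecodeScope
{
    // Decodes the whole document in one pass. Escaped text such as
    // "&lt;b&gt;" becomes "<b>" and can no longer be told apart from markup.
    public static string DecodeWholeDocument(string html)
    {
        return WebUtility.HtmlDecode(html);
    }

    // Parses first, then reads only the element's text content, so real
    // tags and unescaped text stay separate. Assumes well-formed input.
    public static string TextOfRoot(string wellFormedHtml)
    {
        return XElement.Parse(wellFormedHtml).Value;
    }
}
```

With the input `<p>Use &lt;b&gt; for bold</p>`, DecodeWholeDocument returns `<p>Use <b> for bold</p>`, where the `<b>` now looks like an unclosed tag, while TextOfRoot returns just the text `Use <b> for bold`.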

+1

Source: https://habr.com/ru/post/1312641/