How to read website content in C#?

Question

How to read website content in C#?

I want to read the text of the site without html tags and headers. I just need the text displayed in the web browser.

I do not need this:

<html> <body> bla bla </td><td> bla bla <body> <html>

I just need the text "bla bla bla bla".

I used WebClient and HttpWebRequest to retrieve the HTML content and then split the resulting data, but this approach does not hold up, because the tags can change whenever I switch to a different website.

So is it possible in any way to get only the displayed text on the website?

+7

html c# httpwebrequest webclient streamreader

Azeem akram May 14 '12 at 7:44

5 answers

You need to use a dedicated HTML parser. That is the only reliable way to extract content from a non-regular language such as HTML.

See: What is the best way to parse HTML in C#?

+5

Tigran May 14 '12 at 7:48

    using System.Net;

    public string GetwebContent(string urlForGet)
    {
        // WebClient is disposable, so release it when done
        using (var client = new WebClient())
        {
            // Download the raw response body as a string (HTML markup included)
            return client.DownloadString(urlForGet);
        }
    }

0

user3059036 Jan 4 '14 at 15:40

I think this link may help you.

    /// <summary>
    /// Remove HTML tags from string using char array.
    /// </summary>
    public static string StripTagsCharArray(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;

        for (int i = 0; i < source.Length; i++)
        {
            char let = source[i];
            if (let == '<')
            {
                inside = true;
                continue;
            }
            if (let == '>')
            {
                inside = false;
                continue;
            }
            if (!inside)
            {
                array[arrayIndex] = let;
                arrayIndex++;
            }
        }
        return new string(array, 0, arrayIndex);
    }
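A short usage sketch of the method above, repeated here so the example compiles on its own. The class name `HtmlText` and the helper `ToPlainText` are hypothetical names introduced for illustration. The helper also decodes entities with `System.Net.WebUtility.HtmlDecode`, which the original method does not do. Note that a character-level stripper like this keeps the contents of script and style elements, so it is only a rough approximation of what a browser displays.

```csharp
using System;
using System.Net;

public static class HtmlText
{
    // Same algorithm as the answer: copy every character that lies outside <...>.
    public static string StripTagsCharArray(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;

        foreach (char c in source)
        {
            if (c == '<') { inside = true; continue; }
            if (c == '>') { inside = false; continue; }
            if (!inside) { array[arrayIndex++] = c; }
        }
        return new string(array, 0, arrayIndex);
    }

    // Hypothetical helper (not part of the answer): strip tags, then decode
    // entities such as &amp; into the characters they stand for.
    public static string ToPlainText(string html)
    {
        return WebUtility.HtmlDecode(StripTagsCharArray(html));
    }

    public static void Main()
    {
        // The sample markup from the question
        string html = "<html> <body> bla bla </td><td> bla bla <body> <html>";
        Console.WriteLine(StripTagsCharArray(html).Trim());

        Console.WriteLine(ToPlainText("<p>Tom &amp; Jerry</p>")); // prints "Tom & Jerry"
    }
}
```

The whitespace between the removed tags is left in place, so the result usually needs trimming or collapsing before use.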

-1

R4j May 14 '12 at 8:09

    using System;
    using System.IO;
    using System.Net;

    // Read web page content in a C# program.
    // Specify the web page to read
    WebRequest request = WebRequest.Create("http://aspspider.info/snallathambi/default.aspx");

    // Get the response and wrap its stream in a reader; dispose both when finished
    using (WebResponse response = request.GetResponse())
    using (StreamReader reader = new StreamReader(response.GetResponseStream()))
    {
        // Read the whole body rather than a fixed number of lines
        string str = reader.ReadToEnd();
        Console.Write(str);
    }

-2

Jaiff May 14 '12 at 7:47

yamen · Accepted Answer · 2012-05-14T08:10:39+0000

Here's how you do it using HtmlAgilityPack.

First your HTML sample:

 var html = "<html>\r\n<body>\r\nbla bla </td><td>\r\nbla bla \r\n<body>\r\n<html>";

Load it (as a string in this case):

    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);

If you are getting it from the web, it is similar:

    var web = new HtmlWeb();
    var doc = web.Load(url);

Now select only the text nodes that contain non-whitespace characters, and trim them.

    var text = doc.DocumentNode.Descendants()
        .Where(x => x.NodeType == HtmlNodeType.Text && x.InnerText.Trim().Length > 0)
        .Select(x => x.InnerText.Trim());

You can get this as one concatenated string if you want:

 String.Join(" ", text)

Of course, this will only work on simple web pages. Anything more complex will also return nodes with data that you clearly don't want, such as JavaScript functions.
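One possible way to reduce that noise is to skip text nodes whose parent is a script or style element. The following is a minimal sketch under stated assumptions, not part of the original answer: the class `VisibleText` and method `Extract` are hypothetical names, and it assumes HtmlAgilityPack exposes script and style contents as text nodes under lower-case element names, which is its default behavior as far as I know. It will still not match what a browser renders for pages that build their content with JavaScript or hide elements with CSS.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class VisibleText
{
    // Hypothetical helper: collect text nodes, skipping script and style content.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var parts = doc.DocumentNode.Descendants()
            // Keep text nodes only (comments and elements are other node types)
            .Where(n => n.NodeType == HtmlNodeType.Text)
            // Drop text that belongs to <script> or <style> elements
            .Where(n => n.ParentNode == null ||
                        (n.ParentNode.Name != "script" && n.ParentNode.Name != "style"))
            // Decode entities such as &amp; and trim surrounding whitespace
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(t => t.Length > 0);

        return string.Join(" ", parts);
    }

    public static void Main()
    {
        var html = "<html><body><script>var x = 1;</script><p>bla bla</p><p>bla bla</p></body></html>";
        Console.WriteLine(Extract(html));
    }
}
```

Filtering by parent element keeps the query from the answer intact and only adds one extra condition, so it can be dropped into the same LINQ chain.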
