How to read website content in C#?

I want to read the text of a site without the HTML tags and headers. I just need the text displayed in the web browser.

I do not need this:

<html> <body> bla bla </td><td> bla bla <body> <html> 

I just need the text "bla bla bla bla".

I used WebClient and HttpWebRequest to retrieve the HTML content and then split the resulting data, but that approach does not hold up, because if I switch to a different website the tags may change.

So is it possible in any way to get only the displayed text on the website?

+7
5 answers

Here's how you do it using HtmlAgilityPack.

First your HTML sample:

 var html = "<html>\r\n<body>\r\nbla bla </td><td>\r\nbla bla \r\n<body>\r\n<html>"; 

Load it (as a string in this case):

 var doc = new HtmlAgilityPack.HtmlDocument();
 doc.LoadHtml(html);

If you are getting it from the web, it is similar:

 var web = new HtmlAgilityPack.HtmlWeb();
 var doc = web.Load(url);

Now select only the text nodes that contain something other than whitespace, and trim them.

 // Requires: using System.Linq; using HtmlAgilityPack;
 var text = doc.DocumentNode.Descendants()
     .Where(x => x.NodeType == HtmlNodeType.Text && x.InnerText.Trim().Length > 0)
     .Select(x => x.InnerText.Trim());

You can get this as one concatenated string if you want:

 String.Join(" ", text) 

Of course, this will only work on simple web pages. Anything more complex will also return nodes with data that you clearly don't want, such as JavaScript functions.
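One way to cut down on that unwanted data is to skip text nodes whose parent is a `script` or `style` element. The following is only a sketch of that idea, not a complete "visible text" extractor: the names `VisibleTextDemo` and `GetVisibleText` are made up for illustration, the inline HTML string is a test input of my own, and it assumes the HtmlAgilityPack NuGet package is referenced. It also decodes entities with `HtmlEntity.DeEntitize`, which the snippets above do not do.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class VisibleTextDemo
{
    // Hypothetical helper: joins the non-empty text nodes of a document,
    // ignoring text that sits directly inside <script> or <style>.
    public static string GetVisibleText(HtmlDocument doc)
    {
        var skipped = new[] { "script", "style" };

        var parts = doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Text
                        && n.ParentNode != null
                        && !skipped.Contains(n.ParentNode.Name)
                        && n.InnerText.Trim().Length > 0)
            // Turn entities such as &amp; back into plain characters.
            .Select(n => HtmlEntity.DeEntitize(n.InnerText.Trim()));

        return string.Join(" ", parts);
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><head><script>var x = 1;</script></head>"
                     + "<body>bla bla <b>bla bla</b></body></html>");
        Console.WriteLine(GetVisibleText(doc));
    }
}
```

This still does not account for elements hidden by CSS or content generated by JavaScript at run time, so it is an approximation of what a browser displays.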

+4

You need to use a dedicated HTML parser. That is the only way to get the content out of such a non-regular language.

See: What is the best way to parse HTML in C#?

+5
 public string GetwebContent(string urlForGet)
 {
     // Create the WebClient and dispose of it when done
     using (var client = new WebClient())
     {
         // Download the page as a string (this is the raw HTML, tags included)
         return client.DownloadString(urlForGet);
     }
 }
0

I think this link may help you.

 /// <summary>
 /// Remove HTML tags from string using char array.
 /// </summary>
 public static string StripTagsCharArray(string source)
 {
     char[] array = new char[source.Length];
     int arrayIndex = 0;
     bool inside = false; // true while we are between '<' and '>'

     for (int i = 0; i < source.Length; i++)
     {
         char let = source[i];
         if (let == '<')
         {
             inside = true;
             continue;
         }
         if (let == '>')
         {
             inside = false;
             continue;
         }
         if (!inside)
         {
             // Keep only characters that are outside of a tag
             array[arrayIndex] = let;
             arrayIndex++;
         }
     }
     return new string(array, 0, arrayIndex);
 }
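A tag stripper like this leaves line breaks, runs of spaces, and entities such as `&amp;` in the result, and it keeps the contents of script and style blocks. As a rough sketch of one way to tidy the first two, the variant below runs the same state machine, then decodes entities with `System.Net.WebUtility.HtmlDecode` and collapses whitespace. The names `TagStripDemo` and `ToPlainText` are hypothetical; the input in `Main` is the sample from the question.

```csharp
using System;
using System.Net;
using System.Text;

public static class TagStripDemo
{
    // Hypothetical helper: strip tags, decode entities, collapse whitespace.
    public static string ToPlainText(string html)
    {
        var sb = new StringBuilder(html.Length);
        bool inside = false; // true while between '<' and '>'

        foreach (char c in html)
        {
            if (c == '<') { inside = true; continue; }
            if (c == '>') { inside = false; continue; }
            if (!inside) sb.Append(c);
        }

        // Decode entities such as &amp; into plain characters.
        string decoded = WebUtility.HtmlDecode(sb.ToString());

        // Split on whitespace and rejoin with single spaces.
        string[] words = decoded.Split(
            new[] { ' ', '\t', '\r', '\n' },
            StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }

    public static void Main()
    {
        string html = "<html>\r\n<body>\r\nbla bla </td><td>\r\nbla bla \r\n<body>\r\n<html>";
        Console.WriteLine(ToPlainText(html)); // prints "bla bla bla bla"
    }
}
```

Because it only tracks angle brackets, this approach is fragile on real pages (for example, a `<` inside a script or an attribute value confuses it), which is why a parser is the safer choice.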
-1
 // Reading web page content in a C# program
 // Specify the web page to read
 WebRequest request = WebRequest.Create("http://aspspider.info/snallathambi/default.aspx");
 // Get the response and read the stream from it
 using (WebResponse response = request.GetResponse())
 using (StreamReader reader = new StreamReader(response.GetResponseStream()))
 {
     // Read the whole body instead of stopping after a fixed number of lines
     string str = reader.ReadToEnd();
     Console.Write(str);
 }
-2
