How to check for a 404 error page (page does not exist) using HtmlAgilityPack

I am reading URLs and collecting the images on each page. I need to skip a page when it returns a 404, so that I do not collect images from the error page. How can I do this with HtmlAgilityPack? Here is my code:

    var document = new HtmlWeb().Load(completeurl);
    var urls = document.DocumentNode.Descendants("img")
        .Select(e => e.GetAttributeValue("src", null))
        .Where(s => !String.IsNullOrEmpty(s))
        .ToList();
2 answers

You need to register a PostRequestHandler handler on the HtmlWeb instance. It is raised after each document is loaded and gives you access to the HttpWebResponse object, which has a StatusCode property.

    HtmlWeb web = new HtmlWeb();
    HttpStatusCode statusCode = HttpStatusCode.OK;

    // Record the status code of the most recent response.
    web.PostRequestHandler += (request, response) =>
    {
        if (response != null)
        {
            statusCode = response.StatusCode;
        }
    };

    var doc = web.Load(completeUrl);
    if (statusCode == HttpStatusCode.OK)
    {
        // received a real document
    }

Looking at the HtmlAgilityPack code on GitHub, it is even simpler: HtmlWeb has a StatusCode property, which holds the status code of the last request:

    var web = new HtmlWeb();
    var document = web.Load(completeurl);
    if (web.StatusCode == HttpStatusCode.OK)
    {
        var urls = document.DocumentNode.Descendants("img")
            .Select(e => e.GetAttributeValue("src", null))
            .Where(s => !String.IsNullOrEmpty(s))
            .ToList();
    }
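Since the question mentions reading several URLs, the StatusCode check can be wrapped in a small helper. The following is only a sketch: the class name ImageScraper and the methods ExtractImageUrls and GetImageUrls are hypothetical names chosen for illustration, and it assumes that any status other than 200 OK should yield no images.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

// Hypothetical helper class, not part of HtmlAgilityPack.
public static class ImageScraper
{
    // Collects the non-empty src attribute of every <img> element in a parsed document.
    public static List<string> ExtractImageUrls(HtmlDocument document)
    {
        return document.DocumentNode.Descendants("img")
            .Select(e => e.GetAttributeValue("src", null))
            .Where(s => !String.IsNullOrEmpty(s))
            .ToList();
    }

    // Returns the image URLs of a page, or an empty list when the server
    // answered with anything other than 200 OK (for example 404).
    public static List<string> GetImageUrls(string pageUrl)
    {
        var web = new HtmlWeb();
        var document = web.Load(pageUrl);
        return web.StatusCode == HttpStatusCode.OK
            ? ExtractImageUrls(document)
            : new List<string>();
    }
}
```

Calling ImageScraper.GetImageUrls(completeurl) for each URL then gives an empty list for error pages. Failures that produce no HTTP response at all (for example a DNS error) may still surface as exceptions from Load, so a caller might want a try/catch around the call.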

Pay attention to the version you are using!

I am using HtmlAgilityPack v1.5.1 and there is no PostRequestHandler event.

In v1.5.1 you need to use the PostResponse field. See the example below.

    var htmlWeb = new HtmlWeb();
    var lastStatusCode = HttpStatusCode.OK;
    htmlWeb.PostResponse = (request, response) =>
    {
        if (response != null)
        {
            lastStatusCode = response.StatusCode;
        }
    };

The differences between versions are small, but they are enough to break code written for another version.
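To show how the captured status might be used once the page is loaded, here is a minimal sketch built on the PostResponse field. The wrapper class StatusTrackingWeb and its members LastStatusCode and LoadOrNull are hypothetical names for illustration, and the sketch assumes the handler is invoked for error responses such as 404 as well as successful ones.

```csharp
using System;
using System.Net;
using HtmlAgilityPack;

// Hypothetical wrapper around HtmlWeb, not part of HtmlAgilityPack.
public sealed class StatusTrackingWeb
{
    private readonly HtmlWeb _web = new HtmlWeb();

    // Status code of the most recent response seen by the handler.
    public HttpStatusCode LastStatusCode { get; private set; } = HttpStatusCode.OK;

    public StatusTrackingWeb()
    {
        // PostResponse is assigned as a delegate field, as in the answer above.
        _web.PostResponse = (request, response) =>
        {
            if (response != null)
            {
                LastStatusCode = response.StatusCode;
            }
        };
    }

    // Loads the page and returns null unless the response status was 200 OK.
    public HtmlDocument LoadOrNull(string url)
    {
        var document = _web.Load(url);
        return LastStatusCode == HttpStatusCode.OK ? document : null;
    }
}
```

A caller would then skip any URL for which LoadOrNull returns null and extract images only from the documents that come back.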

Hope this saves some time for someone.
