How to extract the full URL using HtmlAgilityPack - C#

Question


The method below only extracts the relative URL, exactly as it is written in the href attribute.

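Extraction code:

 // Note: SelectNodes returns null when no node matches the XPath
 foreach (HtmlNode link in hdDoc.DocumentNode.SelectNodes("//a[@href]"))
 {
     // Value is already a string, so no ToString() call is needed
     lsLinks.Add(link.Attributes["href"].Value);
 }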

HTML source:

 <a href="Login.aspx">Login</a>

Extracted URL:

 Login.aspx

But I want the full link, as a browser would resolve it, for example:

 http://www.monstermmorpg.com/Login.aspx

I could do this by checking whether the URL contains http and, if it does not, prepending the domain, but this can cause problems in some cases, and I don't think it is a very sound approach.

C# 4.0, HtmlAgilityPack 1.4.0


c# hyperlink extraction html-agility-pack

MonsterMMORPG Oct 13 '11 at 20:52


2 answers

"I could do this by checking whether the URL contains http and, if it does not, prepending the domain"

That is what you have to do. Html Agility Pack cannot help you with this:

 // baseUrl is the address (as a string) of the page that was parsed
 var url = new Uri(
     new Uri(new Uri(baseUrl).GetLeftPart(UriPartial.Path)),
     link.Attributes["href"].Value
 );


Darin Dimitrov Oct 13 '11 at 20:58


Duncan Smart · Accepted Answer · 2011-10-13T21:01:50+0000

Assuming you have the URL of the source page, you can combine it with the parsed href like this:

 // The address of the page you crawled
 var baseUrl = new Uri("http://example.com/path/to-page/here.aspx");

 // root relative
 var url = new Uri(baseUrl, "/Login.aspx");
 Console.WriteLine(url.AbsoluteUri); // prints 'http://example.com/Login.aspx'

 // relative
 url = new Uri(baseUrl, "../foo.aspx?q=1");
 Console.WriteLine(url.AbsoluteUri); // prints 'http://example.com/path/foo.aspx?q=1'

 // absolute
 url = new Uri(baseUrl, "http://stackoverflow.com/questions/7760286/");
 Console.WriteLine(url.AbsoluteUri); // prints 'http://stackoverflow.com/questions/7760286/'

 // other...
 url = new Uri(baseUrl, "javascript:void(0)");
 Console.WriteLine(url.AbsoluteUri); // prints 'javascript:void(0)'

Note the use of AbsoluteUri rather than ToString(): ToString() unescapes the URL (to make it more "human-readable"), which is usually not what you want.
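To tie this back to the extraction loop in the question, here is a minimal sketch of a helper that resolves a list of raw href values against the page address. The names LinkResolver and ToAbsoluteUrls are hypothetical and not part of Html Agility Pack or the .NET Framework. Keeping only http and https results is an assumption about what a crawler wants. It uses Uri.TryCreate so that a malformed href is skipped instead of throwing.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Html Agility Pack.
public static class LinkResolver
{
    // Resolves each raw href against the address of the page it came from
    // and returns the absolute form of every http/https link.
    public static List<string> ToAbsoluteUrls(string pageUrl, IEnumerable<string> hrefs)
    {
        var baseUri = new Uri(pageUrl);
        var result = new List<string>();

        foreach (string href in hrefs)
        {
            Uri resolved;

            // TryCreate returns false instead of throwing on a malformed href.
            if (!Uri.TryCreate(baseUri, href, out resolved))
            {
                continue;
            }

            // Assumption: links such as javascript: or mailto: are not wanted.
            if (resolved.Scheme == Uri.UriSchemeHttp || resolved.Scheme == Uri.UriSchemeHttps)
            {
                // AbsoluteUri keeps the URL escaped, unlike ToString().
                result.Add(resolved.AbsoluteUri);
            }
        }

        return result;
    }
}
```

With the question's variables, this could be called as LinkResolver.ToAbsoluteUrls(pageUrl, lsLinks), where pageUrl is assumed to hold the address the document was downloaded from.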
