Other users may tell you, "No, stop! Regular expressions should not be mixed with HTML! This is like mixing bleach and ammonia!" There is a lot of wisdom in this advice, but it is not the whole story.
The truth is, regular expressions are fine for collecting consistently formatted links. For a job like this, though, it's better to use a dedicated tool such as HtmlAgilityPack.
With regular expressions you can match 99.9% of the links, but you may miss rare, unforeseen corner cases or malformed HTML.
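To make the corner-case risk concrete, here is a minimal sketch (both the pattern and the HTML snippet are invented for illustration) of a naive regex that extracts double-quoted href values but silently skips an equally valid single-quoted one:

using System;
using System.Text.RegularExpressions;

class RegexCornerCase
{
    static void Main()
    {
        // Naive pattern: only matches href values wrapped in double quotes.
        var naive = new Regex(@"<a\s+[^>]*href=""([^""]*)""", RegexOptions.IgnoreCase);

        // Both anchors are valid HTML, but only the first one is matched.
        string html = "<a href=\"/works\">ok</a> <a href='/missed'>oops</a>";

        foreach (Match m in naive.Matches(html))
            Console.WriteLine(m.Groups[1].Value); // prints only "/works"
    }
}

A real HTML parser does not care which quote style (or none at all) the document uses, which is exactly the class of corner case the warning above is about.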
Here is a function that uses HtmlAgilityPack to do what you need:
// Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
private static IEnumerable<string> DocumentLinks(string sourceHtml)
{
    HtmlDocument sourceDocument = new HtmlDocument();
    sourceDocument.LoadHtml(sourceHtml);

    // SelectNodes returns null when no node matches, so guard against that.
    HtmlNodeCollection anchors =
        sourceDocument.DocumentNode.SelectNodes("//a[@href!='#']");

    return anchors == null
        ? Enumerable.Empty<string>()
        : anchors.Select(n => n.GetAttributeValue("href", ""));
}
This function creates a new HtmlAgilityPack.HtmlDocument, loads the HTML string into it, and then uses the XPath query "//a[@href!='#']" to select all links on the page that do not point to "#". The LINQ Select extension then projects the HtmlNodeCollection into a sequence of strings containing each href attribute value, i.e. where each link points. Note that SelectNodes returns null when nothing matches, which is why the function guards against that case.
You can use it like this:
// Requires: using System.Collections.Generic; using System.Diagnostics;
//           using System.Linq; using System.Net;
using (var client = new WebClient())   // WebClient is IDisposable, so dispose it
{
    List<string> links =
        DocumentLinks(client.DownloadString("http://google.com")).ToList();
    Debugger.Break();                  // stop here to inspect the collected links
}
This should be much more robust than regular expressions.
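As a side note, WebClient is marked obsolete in current .NET releases. On a recent runtime the same download can be done with HttpClient; here is a minimal sketch (the DownloadLinksAsync name is my own, and it reuses the DocumentLinks function above):

// Requires: using System.Collections.Generic; using System.Linq;
//           using System.Net.Http; using System.Threading.Tasks;
private static async Task<List<string>> DownloadLinksAsync(string url)
{
    using (var client = new HttpClient())
    {
        string html = await client.GetStringAsync(url);
        return DocumentLinks(html).ToList();
    }
}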