Regular expression for parsing links from html code

Possible duplicate:
Regex to get the link in href. [asp.net]

I am working on a method that takes a string (HTML code) and returns an array containing all the links found in it.

I have seen several options such as the HTML Agility Pack, but they seem a bit more complicated than this project requires.

I am also interested in using regex because I don't have much experience with it overall, and I think this would be a good opportunity to learn.

My code so far is

    WebClient client = new WebClient();
    string htmlCode = client.DownloadString(p);
    Regex exp = new Regex(@"http://(www\.)?([^\.]+)\.com", RegexOptions.IgnoreCase);
    string[] test = exp.Split(htmlCode);

but I don't get the results I want because I am still working on the regex.

Pseudocode for what I'm looking for is "

+4
4 answers

If you are looking for a foolproof solution, regular expressions are not your answer. They are fundamentally limited and cannot reliably parse links, or any other tags for that matter, out of an HTML file, because of the complexity of the HTML language.

Instead, you will need to use an actual DOM API to extract the links.

+3

Regular expressions are not a good idea for HTML.

See these previous questions:

  • When is it appropriate to use regular expressions with HTML? (/questions/639683/when-is-it-wise-to-use-regular-expressions-with-html)
  • Regexp that matches all the text content of an HTML input (/questions/1296116/regexp-that-matches-all-the-text-content-of-a-html-input)

Rather, you want something that already knows how to parse the DOM; otherwise, you are reinventing the wheel.

+2

Other users may tell you, "No, stop! Regular expressions should not be mixed with HTML! This is like mixing bleach and ammonia!" There is a lot of wisdom in that advice, but it is not the whole story.

The truth is, regular expressions work well for collecting commonly formatted links. However, it is better to use a purpose-built tool such as HtmlAgilityPack.

If you use regular expressions, you may match 99.9% of the links, but you can miss rare, unforeseen corner cases or malformed HTML.
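To make the regex route concrete, here is a minimal sketch using only the .NET base class library. The names LinkExtractor and ExtractLinks are hypothetical, and the pattern is one assumption about what a "commonly formatted" link looks like: an `<a>` tag whose href value is wrapped in single or double quotes. Unquoted attribute values, anchors inside comments or scripts, and similar cases are exactly the corner cases it will get wrong.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Looks for an <a ...> tag, then an href attribute preceded by whitespace.
    // The opening quote is captured in "q" and must be repeated to close the value.
    private static readonly Regex HrefPattern = new Regex(
        @"<a\b[^>]*?\shref\s*=\s*(?<q>[""'])(?<url>.*?)\k<q>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        // Matches (not Split) returns every occurrence of the pattern.
        foreach (Match m in HrefPattern.Matches(html))
        {
            links.Add(m.Groups["url"].Value);
        }
        return links;
    }
}
```

Calling `LinkExtractor.ExtractLinks(htmlCode).ToArray()` would give the array the question asks for. Note that the values are returned as written in the page, so relative links such as "/about" stay relative.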

Here is a function I put together that uses HtmlAgilityPack to meet your requirements:

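    // Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
    private static IEnumerable<string> DocumentLinks(string sourceHtml)
    {
        HtmlDocument sourceDocument = new HtmlDocument();
        sourceDocument.LoadHtml(sourceHtml);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection anchors = sourceDocument.DocumentNode.SelectNodes("//a[@href!='#']");
        if (anchors == null)
            return Enumerable.Empty<string>();

        return anchors.Select(n => n.GetAttributeValue("href", ""));
    }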

This function creates a new HtmlAgilityPack.HtmlDocument, loads the HTML string into it, and then uses the XPath query "//a[@href!='#']" to select every link on the page that does not point to "#". The LINQ Select extension then projects the HtmlNodeCollection into a sequence of strings holding the value of each href attribute, i.e. where the link points.

Example usage:

    List<string> links = DocumentLinks(
        new WebClient().DownloadString("http://google.com")).ToList();
    Debugger.Break();

This should be much more effective than regular expressions.

+2

You can search for anything that looks like a URL with the http or https scheme. This is not bulletproof HTML parsing, but I suspect it will get you what look like the http addresses you need. You can add more schemes and domains.
The regular expression searches for things that look like URLs in href attributes (not strictly).

    using System;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main(string[] args)
        {
            // The "url" group captures scheme and host only; a trailing path is matched but not captured.
            const string pattern = @"href=[""'](?<url>(http|https)://[^/]*?\.(com|org|net|gov))(/.*)?[""']";
            var regex = new Regex(pattern);
            var urls = new string[]
            {
                "href='http://company.com'",
                "href=\"https://company.com\"",
                "href='http://company.org'",
                "href='http://company.org/'",
                "href='http://company.org/path'",
            };
            foreach (var url in urls)
            {
                Match match = regex.Match(url);
                if (match.Success)
                {
                    Console.WriteLine("{0} -> {1}", url, match.Groups["url"].Value);
                }
            }
        }
    }

Output:

    href='http://company.com' -> http://company.com
    href="https://company.com" -> https://company.com
    href='http://company.org' -> http://company.org
    href='http://company.org/' -> http://company.org
    href='http://company.org/path' -> http://company.org
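The program above tests one attribute string at a time and keeps only the scheme and host. As a rough sketch of the same idea applied to a whole HTML string, the variant below keeps the full URL including the path and drops the fixed list of top-level domains. AbsoluteUrlFinder and FindHttpUrls are hypothetical names, and the pattern is an assumption rather than a complete URL grammar; Uri.TryCreate is used as an extra sanity check on each candidate.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AbsoluteUrlFinder
{
    // [^""'\s]+ stops at the first quote or whitespace, so two href
    // attributes on the same line are never merged into one match.
    private static readonly Regex UrlPattern = new Regex(
        @"href\s*=\s*[""'](?<url>https?://[^""'\s]+)[""']",
        RegexOptions.IgnoreCase);

    public static List<string> FindHttpUrls(string html)
    {
        var urls = new List<string>();
        foreach (Match m in UrlPattern.Matches(html))
        {
            string candidate = m.Groups["url"].Value;
            // Keep only candidates that parse as absolute http or https URIs.
            if (Uri.TryCreate(candidate, UriKind.Absolute, out Uri uri)
                && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps))
            {
                urls.Add(candidate);
            }
        }
        return urls;
    }
}
```

Relative links are skipped by design here, since the goal in this answer is to find things that look like http addresses.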

0