How can I match the first subpattern in C#?

Question

I wrote this pattern to match nested div elements:

(<div[^>]*>(?:\g<1>|.)*?<\/div>)

This works well, as you can see on regex101.

However, when I write the code below in C#:

 Regex findDivs = new Regex("(<div[^>]*>(?:\\g<1>|.)*?<\\/div>)", RegexOptions.Singleline);

It gives me an error message:

 Additional information: parsing "(<div[^>]*>(?:\g<1>|.)*?<\/div>)" - Unrecognized escape sequence \g.

As you can see, \g does not work in C#. How can I match the first subpattern?

c# regex

João Ferreira May 24 '16 at 18:26

2 answers

What you want to do is iterate over capture groups. Here is an example:

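    // "test" (a collection of input strings) and "regex" (a Regex instance)
    // are assumed to be defined elsewhere.
    foreach (var s in test)
    {
        Match match = regex.Match(s);
        foreach (Capture capture in match.Captures)
        {
            Console.WriteLine("Index={0}, Value={1}", capture.Index, capture.Value);
            Console.WriteLine(match.Groups[1].Value);
        }
    }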

user5684647 May 24, '16 at 19:18

Wiktor Stribiżew · Accepted Answer · 2016-05-24T19:50:04+0000

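What you are looking for are balancing groups. The .NET regex engine does not support subroutine calls such as \g<1>, so nesting has to be tracked with a group stack instead. Here is the equivalent of your regex in .NET: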

    (?sx)<div[^>]*>                 # Opening DIV
      (?>                           # Start of atomic group
        (?:(?!</?div[^>]*>).)+      # (1) Any text other than open/close DIV
        | <div[^>]*> (?<tag>)       # Add 1 "tag" value to stack if opening DIV found
        | </div> (?<-tag>)          # Remove 1 "tag" value from stack when closing DIV tag is found
      )*
      (?(tag)(?!))                  # Check if "tag" stack is not empty (then fail)
    </div>

See the regex demo.
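As a minimal sketch of how this balancing-group pattern might be called from C#: the class name `DivMatcher`, the method `FindTopLevelDivs`, and the sample HTML are hypothetical and only serve as illustration. The inline `(?sx)` flags turn on Singleline and IgnorePatternWhitespace, so the pattern can be spread over several lines inside a verbatim string.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DivMatcher
{
    // Balancing-group pattern: "tag" acts as a stack that tracks nested DIVs.
    private static readonly Regex TopLevelDiv = new Regex(@"(?sx)
        <div[^>]*>
        (?>
            (?:(?!</?div[^>]*>).)+
          | <div[^>]*> (?<tag>)
          | </div> (?<-tag>)
        )*
        (?(tag)(?!))
        </div>");

    // Hypothetical helper: returns every outermost <div>...</div> block.
    public static List<string> FindTopLevelDivs(string html)
    {
        var result = new List<string>();
        foreach (Match m in TopLevelDiv.Matches(html))
            result.Add(m.Value);
        return result;
    }

    public static void Main()
    {
        // Sample input made up for this sketch.
        string html = "<div id=\"a\"><div>inner</div> text</div><p>x</p><div>b</div>";
        foreach (string div in FindTopLevelDivs(html))
            Console.WriteLine(div);
        // Prints:
        // <div id="a"><div>inner</div> text</div>
        // <div>b</div>
    }
}
```

Each match is one outermost DIV with its nested DIVs included, which is what the recursive `\g<1>` pattern was doing in PCRE.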

However, you really should use HtmlAgilityPack to parse HTML.

The key is an XPath expression that matches all DIV tags that have no ancestor with the same name. You might need something like this (untested):

    // Requires "using System.Linq;" for Select/ToList.
    private List<string> GetTopmostDivs(string html)
    {
        HtmlAgilityPack.HtmlDocument hap;
        Uri uriResult;
        if (Uri.TryCreate(html, UriKind.Absolute, out uriResult) && uriResult.Scheme == Uri.UriSchemeHttp)
        {
            // html is a URL
            var doc = new HtmlAgilityPack.HtmlWeb();
            hap = doc.Load(uriResult.AbsoluteUri);
        }
        else
        {
            // html is a string
            hap = new HtmlAgilityPack.HtmlDocument();
            hap.LoadHtml(html);
        }
        // Select every DIV that has no DIV ancestor
        var nodes = hap.DocumentNode.SelectNodes("//div[not(ancestor::div)]");
        if (nodes != null)
            return nodes.Select(p => p.OuterHtml).ToList();
        else
            return new List<string>();
    }
