What you are looking for are balancing groups. Here is a one-to-one conversion of your regex to .NET:
    (?sx)<div[^>]*>                  # Opening DIV
    (?>                              # Start of atomic group
        (?:(?!</?div[^>]*>).)+       # (1) Any text other than open/close DIV
      |
        <div[^>]*> (?<tag>)          # Add 1 "tag" value to stack if opening DIV found
      |
        </div> (?<-tag>)             # Remove 1 "tag" value from stack when closing DIV tag is found
    )*
    (?(tag)(?!))                     # Check if "tag" stack is not empty (then fail)
    </div>
See the regex demo.
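For completeness, here is a minimal sketch of how that pattern could be used from C#. The class name `TopmostDivRegex` and the method name `FindTopmostDivs` are made up for this illustration. Only `System.Text.RegularExpressions` is assumed, and the comments use single quotes so the pattern can live in a verbatim string.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TopmostDivRegex
{
    // Same pattern as above: (?s) lets '.' match newlines,
    // (?x) ignores pattern whitespace and allows # comments.
    private static readonly Regex DivPattern = new Regex(@"(?sx)
        <div[^>]*>                   # Opening DIV
        (?>                          # Start of atomic group
            (?:(?!</?div[^>]*>).)+   # Any text other than open/close DIV
          |
            <div[^>]*> (?<tag>)      # Push one 'tag' value for a nested opening DIV
          |
            </div> (?<-tag>)         # Pop one 'tag' value for a nested closing DIV
        )*
        (?(tag)(?!))                 # Fail if the 'tag' stack is not empty
        </div>");

    // Hypothetical helper: returns the text of every outermost <div>...</div> match.
    public static List<string> FindTopmostDivs(string html)
    {
        return DivPattern.Matches(html)
                         .Cast<Match>()
                         .Select(m => m.Value)
                         .ToList();
    }
}
```

Calling `TopmostDivRegex.FindTopmostDivs("<div>a<div>b</div>c</div><p>x</p><div>d</div>")` should give two items: the whole outer `<div>a<div>b</div>c</div>` and then `<div>d</div>`. The nested DIV is not returned separately, because the regex engine continues searching after the end of the previous match.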
However, you should really use HtmlAgilityPack to parse HTML.
The main thing is to get an XPath expression that matches all DIV tags that have no ancestor with the same name. You might need something like this (untested):
    // Requires: using System; using System.Collections.Generic; using System.Linq;
    // plus the HtmlAgilityPack NuGet package.
    private List<string> GetTopmostDivs(string html)
    {
        HtmlAgilityPack.HtmlDocument hap;
        Uri uriResult;
        if (Uri.TryCreate(html, UriKind.Absolute, out uriResult)
            && (uriResult.Scheme == Uri.UriSchemeHttp || uriResult.Scheme == Uri.UriSchemeHttps))
        {
            // html is a URL (https added here; the original check only covered http)
            var doc = new HtmlAgilityPack.HtmlWeb();
            hap = doc.Load(uriResult.AbsoluteUri);
        }
        else
        {
            // html is a string
            hap = new HtmlAgilityPack.HtmlDocument();
            hap.LoadHtml(html);
        }

        // Select every DIV that has no DIV ancestor, i.e. the topmost ones
        var nodes = hap.DocumentNode.SelectNodes("//div[not(ancestor::div)]");
        if (nodes != null)
            return nodes.Select(p => p.OuterHtml).ToList();
        else
            return new List<string>();   // SelectNodes returns null when nothing matches
    }
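If you just want to see what `//div[not(ancestor::div)]` selects without installing anything, the same XPath can be tried with the built-in `System.Xml.Linq` and `System.Xml.XPath` namespaces. This is only a sketch and it assumes the input is well-formed XML, which real-world HTML often is not (that is the reason to use HtmlAgilityPack). `XPathDemo` and `TopmostDivs` are made-up names.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class XPathDemo
{
    // Hypothetical helper: parses well-formed markup and returns the
    // serialized form of every <div> that has no <div> ancestor.
    public static List<string> TopmostDivs(string xhtml)
    {
        var doc = XDocument.Parse(xhtml);   // throws XmlException on malformed input
        return doc.XPathSelectElements("//div[not(ancestor::div)]")
                  .Select(e => e.ToString(SaveOptions.DisableFormatting))
                  .ToList();
    }
}
```

For `<body><div>a<div>b</div>c</div><p>x</p><div>d</div></body>` this should return the two topmost DIV elements and skip the nested one, which is the same selection the HtmlAgilityPack version relies on.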