RegEx needed to return the first paragraph or the first n words

I'm looking for a RegEx that returns either the first [n] words of a paragraph or, if the paragraph contains fewer than [n] words, the whole paragraph.

For example, if I need, at most, the first 7 words:

<p>one two <tag>three</tag> four five, six seven eight nine ten.</p><p>ignore</p> 

I would get:

 one two <tag>three</tag> four five, six seven 

And the same RegEx, applied to a paragraph containing fewer than the requested number of words:

 <p>one two <tag>three</tag> four five.</p><p>ignore</p> 

would just return:

 one two <tag>three</tag> four five. 

My attempt at the problem led to the following RegEx:

 ^(?:\<p.*?\>)((?:\w+\b.*?){1,7}).*(?:\</p\>) 

However, this returns only the first word, "one", so it does not work. I think the .*? (after the \w+\b) is causing the problem.

Where am I going wrong? Can anyone suggest a RegEx that will work?

FYI, I am using the .NET 3.5 Regex engine (via C#).

Thank you very much

+4
3 answers

OK, a complete rewrite to reflect the new "spec" :)

I am sure you cannot do this with one regex. The best tool is an HTML parser. The closest I can get with regular expressions is a two-step approach.

First, extract the contents of each paragraph as follows:

 <p>(.*?)</p> 

You need to set RegexOptions.Singleline if paragraphs can span multiple lines.
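As a rough sketch of this first step, the paragraph contents could be collected as follows. The class and method names (ParagraphExtractor, GetParagraphContents) are illustrative assumptions and not part of the original answer; only the pattern and RegexOptions.Singleline come from the text above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, named here only for illustration.
public static class ParagraphExtractor
{
    // Singleline makes '.' match newlines too, so a paragraph may span lines.
    private static readonly Regex ParagraphRegex =
        new Regex("<p>(.*?)</p>", RegexOptions.Singleline);

    // Returns the inner content of every <p>...</p> pair, in document order.
    // The pattern only matches a bare <p> tag without attributes.
    public static List<string> GetParagraphContents(string html)
    {
        var contents = new List<string>();
        foreach (Match match in ParagraphRegex.Matches(html))
        {
            contents.Add(match.Groups[1].Value);
        }
        return contents;
    }
}
```

Because the quantifier is lazy, each match stops at the nearest closing tag, so adjacent paragraphs come back as separate entries rather than as one long match.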

Then, in the next step, iterate over your matches and apply the following regex once to each match's Groups[1].Value:

 ((?:(\S+\s+){1,6})\w+) 

This will match the first seven items separated by spaces/tabs/newlines, ignoring any trailing punctuation or non-word characters.

BUT it will treat a tag that is delimited by spaces as one of these items, i.e. in

 One, two three <br\> four five six seven 

it will only match up to six. I think that with regex there is no way around this.
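To make the behaviour of the second regex concrete, here is a minimal sketch that applies it to a single paragraph's content. The FirstWords class and Truncate method are hypothetical names, and the fallback to the whole input when nothing matches (for example, a one-word paragraph) is an assumption added here, not something the answer specifies.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, named here only for illustration.
public static class FirstWords
{
    // Up to six whitespace-terminated items, followed by one final word.
    private static readonly Regex SevenItemsRegex =
        new Regex(@"((?:(\S+\s+){1,6})\w+)");

    // Returns the matched prefix, or the whole input when nothing matches
    // (assumed fallback, e.g. for a paragraph consisting of a single word).
    public static string Truncate(string paragraphContent)
    {
        Match match = SevenItemsRegex.Match(paragraphContent);
        return match.Success ? match.Groups[1].Value : paragraphContent;
    }
}
```

With the question's first example this yields "one two <tag>three</tag> four five, six seven". For the shorter paragraph it yields "one two <tag>three</tag> four five", without the trailing period, because the pattern must end on \w+. For the <br\> example it stops at "six", since the tag is counted as an item.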

+7
  • Use an HTML parser to get the first paragraph, flattening its structure (i.e. remove the HTML tags that decorate the text inside the paragraph).
  • Find the position of the nth whitespace character.
  • Take the substring from 0 to this position.

Edit: I removed the regex suggestion for steps 2 and 3, as it was wrong (thanks to the commenter). Also, the HTML structure must be flattened.
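Steps 2 and 3 can be sketched without any regex. The example below assumes the paragraph has already been flattened to plain text by an HTML parser (that step is not shown, since it depends on the parser library) and that words are separated by single whitespace characters. TextTruncator and TakeWords are hypothetical names.

```csharp
using System;

// Hypothetical helper, named here only for illustration.
public static class TextTruncator
{
    // Returns the text up to (not including) the n-th whitespace character,
    // or the whole text if it contains fewer than n whitespace characters.
    // Assumes 'flatText' contains no tags and no runs of whitespace.
    public static string TakeWords(string flatText, int n)
    {
        if (n <= 0) return string.Empty;

        int seen = 0;
        for (int i = 0; i < flatText.Length; i++)
        {
            if (char.IsWhiteSpace(flatText[i]) && ++seen == n)
            {
                return flatText.Substring(0, i);
            }
        }
        return flatText;
    }
}
```

Returning the whole text when fewer than n whitespace characters are found covers the "short paragraph" case from the question, including its trailing punctuation.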

0

I had the same problem and combined several answers into this class. It uses HtmlAgilityPack, which is a better tool for the job. Call:

  Words(string html, int n) 

to get the first n words.

    using HtmlAgilityPack;
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;

    namespace UmbracoUtilities
    {
        public class Text
        {
            /// <summary>
            /// Returns the first n words in the html
            /// </summary>
            /// <param name="html"></param>
            /// <param name="n"></param>
            /// <returns></returns>
            public static string Words(string html, int n)
            {
                string words = StripHtml(html);
                return GetNWords(words, n);
            }

            /// <summary>
            /// Returns the first n words in text
            /// Assumes text is not a html string
            /// http://stackoverflow.com/questions/13368345/get-first-250-words-of-a-string
            /// </summary>
            /// <param name="text"></param>
            /// <param name="n"></param>
            /// <returns></returns>
            public static string GetNWords(string text, int n)
            {
                // Collapse runs of whitespace and trim the ends, so that
                // splitting on a space yields no empty entries
                // http://stackoverflow.com/questions/1279859/how-to-replace-multiple-white-spaces-with-one-white-space
                string cleanedString = System.Text.RegularExpressions.Regex.Replace(text, @"\s+", " ").Trim();
                IEnumerable<string> words = cleanedString.Split(' ').Take(n);
                return string.Join(" ", words.ToArray());
            }

            /// <summary>
            /// Returns a string of html with tags removed
            /// </summary>
            /// <param name="html"></param>
            /// <returns></returns>
            public static string StripHtml(string html)
            {
                HtmlDocument document = new HtmlDocument();
                document.LoadHtml(html);
                var root = document.DocumentNode;
                var stringBuilder = new StringBuilder();
                foreach (var node in root.DescendantsAndSelf())
                {
                    // Only leaf nodes are read, so text is not appended
                    // once per enclosing element
                    if (!node.HasChildNodes)
                    {
                        string text = node.InnerText;
                        if (!string.IsNullOrEmpty(text))
                            stringBuilder.Append(" " + text.Trim());
                    }
                }
                return stringBuilder.ToString();
            }
        }
    }

Merry Christmas!

0
