RegEx needs to return the first paragraph or first n words

Question

I'm looking for a RegEx that returns either the first [n] words in a paragraph or, if the paragraph contains fewer than [n] words, the full paragraph.

For example, if I need, at most, the first 7 words:

 <p>one two <tag>three</tag> four five, six seven eight nine ten.</p><p>ignore</p>

I would get:

 one two <tag>three</tag> four five, six seven

And the same RegEx applied to a paragraph containing fewer than the requested number of words:

 <p>one two <tag>three</tag> four five.</p><p>ignore</p>

Would just return:

 one two <tag>three</tag> four five.

My attempt at the problem resulted in the following RegEx:

 ^(?:\<p.*?\>)((?:\w+\b.*?){1,7}).*(?:\</p\>)

However, this returns only the first word, "one". It doesn't work. I think the .*? (after the \w+\b) is causing problems.

Where am I going wrong? Can anyone suggest a RegEx that will work?

FYI, I'm using the .NET 3.5 Regex engine (via C#).

Thank you very much

+4

c# regex

Leigh Bowers May 07, '09 at 12:03

3 answers

1. Use an HTML parser to get the first paragraph, flattening its structure (i.e. removing the HTML tags inside the paragraph).
2. Find the position of the nth space character.
3. Take the substring from 0 to this position.

edit: I removed the regex suggestion for steps 2 and 3, as it was wrong (thanks to the commenter). Also, the HTML structure needs to be flattened.
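
Steps 2 and 3 above could be sketched roughly as follows. This is only an illustration, not code from the answer: the class and method names (`ParagraphTruncator`, `FirstWords`) are made up, and it assumes the paragraph has already been extracted and flattened to plain text with single spaces between words.

```csharp
using System;

public static class ParagraphTruncator
{
    // Hypothetical helper: returns the first n space-separated words of text,
    // or the whole text when it contains fewer than n words.
    // Assumes text is already flattened (no tags) with single spaces between words.
    public static string FirstWords(string text, int n)
    {
        if (n <= 0) return string.Empty;

        int position = -1;
        for (int i = 0; i < n; i++)
        {
            // Step 2: find the next space after the previous one.
            position = text.IndexOf(' ', position + 1);

            // No n-th space: the text has at most n words, so return all of it.
            if (position < 0) return text;
        }

        // Step 3: take the substring from 0 up to the n-th space.
        return text.Substring(0, position);
    }
}
```

For example, `ParagraphTruncator.FirstWords("one two three four five, six seven eight nine ten.", 7)` gives `one two three four five, six seven`, while a five-word input comes back unchanged.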

0

Svante May 07, '09 at 12:42

I had the same problem and combined several answers into this class. It uses HtmlAgilityPack, which is the best tool for the job. Call:

  Words(string html, int n)

to get the first n words:

    using HtmlAgilityPack;
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.Threading.Tasks;

    namespace UmbracoUtilities
    {
        public class Text
        {
            /// <summary>
            /// Returns the first n words in the html
            /// </summary>
            /// <param name="html"></param>
            /// <param name="n"></param>
            /// <returns></returns>
            public static string Words(string html, int n)
            {
                string words = StripHtml(html);
                return GetNWords(words, n);
            }

            /// <summary>
            /// Returns the first n words in text
            /// Assumes text is not a html string
            /// http://stackoverflow.com/questions/13368345/get-first-250-words-of-a-string
            /// </summary>
            /// <param name="text"></param>
            /// <param name="n"></param>
            /// <returns></returns>
            public static string GetNWords(string text, int n)
            {
                // Collapse runs of whitespace into a single space
                // http://stackoverflow.com/questions/1279859/how-to-replace-multiple-white-spaces-with-one-white-space
                string cleanedString = Regex.Replace(text, @"\s+", " ");

                // Drop empty entries (e.g. from a leading space) so that exactly n words are taken
                IEnumerable<string> words = cleanedString
                    .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                    .Take(n);

                return string.Join(" ", words);
            }

            /// <summary>
            /// Returns a string of html with tags removed
            /// </summary>
            /// <param name="html"></param>
            /// <returns></returns>
            public static string StripHtml(string html)
            {
                HtmlDocument document = new HtmlDocument();
                document.LoadHtml(html);
                var root = document.DocumentNode;
                var stringBuilder = new StringBuilder();

                // Only leaf nodes carry text; prefix each with a space so words from adjacent nodes stay separated
                foreach (var node in root.DescendantsAndSelf())
                {
                    if (!node.HasChildNodes)
                    {
                        string text = node.InnerText;
                        if (!string.IsNullOrEmpty(text))
                            stringBuilder.Append(" " + text.Trim());
                    }
                }
                return stringBuilder.ToString();
            }
        }
    }

Merry Christmas!

0

Petras Dec 25 '13 at 8:38

Tim Pietzcker · Accepted Answer · 2009-05-07T12:47:10+0000

OK, a complete rewrite to accommodate the new "spec" :)

I am sure you cannot do this with one regex. The best tool is an HTML parser. The closest I can get with regular expressions is a two-step approach.

First, extract the contents of each paragraph as follows:

 <p>(.*?)</p>

You need to set RegexOptions.Singleline if paragraphs can span multiple lines.
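
As a rough sketch of this first step (the class and method names are hypothetical, and like the pattern itself it assumes plain `<p>` tags without attributes):

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ParagraphExtractor
{
    // Hypothetical helper: returns the inner contents of every <p>...</p> pair.
    public static List<string> GetParagraphContents(string html)
    {
        var contents = new List<string>();

        // Singleline makes '.' match newlines too, so a paragraph may span several lines.
        foreach (Match match in Regex.Matches(html, "<p>(.*?)</p>", RegexOptions.Singleline))
        {
            // Group 1 is the lazily matched text between the tags.
            contents.Add(match.Groups[1].Value);
        }

        return contents;
    }
}
```

The lazy `.*?` keeps each match inside a single paragraph instead of running on to the last `</p>` in the input.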

Then, in the next step, iterate over your matches and apply the following regex once to each match's Groups[1].Value:

 ((?:(\S+\s+){1,6})\w+)

This will match the first seven elements, separated by spaces/tabs/newlines, ignoring any trailing punctuation or non-word characters.

BUT it will treat a tag delimited by spaces as one of these elements, i.e. in

 One, two three <br\> four five six seven

it will only match up to six. I think there is no way around this with regex.
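
A minimal sketch of the second step, applied to the contents of one paragraph. The names `WordLimiter` and `FirstElements` are made up for illustration, and the fallback to the whole content when the pattern does not match (e.g. a one-word paragraph) is my assumption, not part of the regex above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordLimiter
{
    // Hypothetical helper: returns at most maxWords whitespace-separated elements
    // from the start of paragraphContent. Assumes maxWords >= 2, since the
    // quantifier {1,maxWords-1} would be invalid otherwise.
    public static string FirstElements(string paragraphContent, int maxWords)
    {
        if (maxWords < 2) throw new ArgumentOutOfRangeException("maxWords");

        // Up to (maxWords - 1) whitespace-terminated tokens, then one final run of word characters.
        string pattern = @"((?:(\S+\s+){1," + (maxWords - 1) + @"})\w+)";
        Match match = Regex.Match(paragraphContent, pattern);

        // Assumption: return the content unchanged when nothing matches.
        return match.Success ? match.Groups[1].Value : paragraphContent;
    }
}
```

With `maxWords` set to 7 this builds the same pattern as above. On the question's first example it returns `one two <tag>three</tag> four five, six seven`; on the `<br\>` example it stops at `six`, which shows the limitation just described. Note that for a short paragraph the final `\w+` drops trailing punctuation, so `four five.` ends at `five`.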
