Regex to extract the first three words from a string

I am trying to remove all words except the first three from a string (using TextPad).

Example value: This is the string for testing.

I want to keep only the first 3 words of the line above (This is the) and delete all the other words.

I worked out a regular expression that matches the first three words, (\w+\s+){3}, but I need to match everything except the first three words so that I can delete it. Can anybody help me?

3 answers

Exactly how depends on the regex flavor, but to remove everything except the first three words, you can use:

 ^((?:\S+\s+){2}\S+).* 

which captures the first three words in capture group 1 and also matches the rest of the line. As the replacement string, use a backreference to group 1. In C#, it might look like this:

 resultString = Regex.Replace(subjectString, @"^((?:\S+\s+){2}\S+).*", "${1}", RegexOptions.Multiline); 
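
For comparison, here is a minimal sketch of the same replacement in Python, assuming Python's re module is an acceptable stand-in for the .NET engine. The helper name keep_first_three is hypothetical and used only for illustration.

```python
import re

# Same pattern as above: group 1 holds the first three whitespace-separated
# words, and .* consumes the rest of the line so that it gets replaced away.
FIRST_THREE = re.compile(r"^((?:\S+\s+){2}\S+).*", re.MULTILINE)


def keep_first_three(text):
    """Replace every line of text with its first three words (hypothetical helper)."""
    return FIRST_THREE.sub(r"\1", text)


print(keep_first_three("This is the string for testing."))
```

Lines with fewer than three words do not match the pattern and are left unchanged.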

EDIT: added a start-of-line anchor to each regular expression and added the TextPad-specific case modifiers.

If you want to drop the first three words and capture the rest,

 ^(?:\w+\s+){3}([^\n\r]+)$ 

?: makes the group that matches the first three words non-capturing; the second group captures everything after them.

Is this what you are looking for? I do not quite understand your question or your purpose.
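
A minimal sketch of that "drop the first three words" pattern, written in Python purely for illustration (TextPad's engine may differ in details). The helper name drop_first_three is hypothetical.

```python
import re

# Non-capturing group skips the first three words; group 1 captures the rest.
REST_OF_LINE = re.compile(r"^(?:\w+\s+){3}([^\n\r]+)$")


def drop_first_three(line):
    """Return what follows the first three words, or the line itself if there is no match."""
    match = REST_OF_LINE.match(line)
    return match.group(1) if match else line


print(drop_first_three("This is the string for testing."))
```

Because \w+ matches only word characters, a line whose first three words contain punctuation will not match and is returned unchanged.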

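As requested, here is the opposite: capture only the first three words and drop the rest:

 ^((?:\w+\s+){3})(?:[^\n\r]+)$ 

The ?: moves from the first to the second grouping, and an extra pair of capturing parentheses wraps the whole repetition: a quantified capturing group such as (\w+\s+){3} would keep only its last repetition (the third word), not all three. Group 1 includes the whitespace after the third word.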

As for replacing this captured group, what do you want to replace it with? To transform each word individually, you need to capture each word separately:

 ^(\w+)\s+(\w+)\s+(\w+)\s+(?:[^\n\r]+)$ 

Then, for example, you can capitalize the first letter of each word:

Replace with: \u$1 \u$2 \u$3

Result: This Is The

In TextPad, a lowercase \u in the replacement changes only the next letter to upper case. An uppercase \U changes everything after it (until the next case modifier).
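
Python's re module has no \u modifier in replacement strings, so the same idea can be sketched there with a replacement function instead. This is only an illustration of capturing each word separately; the helper name capitalize_first_three is hypothetical.

```python
import re

# Each of the first three words gets its own capture group; the rest of the
# line is matched by a non-capturing group and therefore dropped.
THREE_WORDS = re.compile(r"^(\w+)\s+(\w+)\s+(\w+)\s+(?:[^\n\r]+)$")


def capitalize_first_three(line):
    """Keep the first three words, upper-casing the first letter of each."""
    def repl(match):
        return " ".join(word[0].upper() + word[1:] for word in match.groups())
    return THREE_WORDS.sub(repl, line)


print(capitalize_first_three("this is the string for testing."))
```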

Try:

http://fiddle.re/f3hgv

(Click the [Java] button, or whichever language matters most to you. Note that \u is not supported by RegexPlanet.)


Based on a duplicate question, I will post a solution that works for "traditional" regular expression implementations that do not support the Perl extensions \s, \w, etc. Beginners who are not even aware that there are different dialects (aka flavors) of regular expressions are encouraged to read, for example, Why are there so many different dialects of regular expressions?

If you have POSIX class support, you can use [[:alpha:]] for \w (roughly; \w also matches digits and underscore, so [[:alnum:]_] is closer), [^[:alpha:]] for \W, [[:space:]] for \s, etc. But if we assume that the whitespace will always be a plain space and you want to extract the first three space-separated tokens, you don't really need this.

 [^ ]+[ ]+[^ ]+[ ]+[^ ]+ 

matches three tokens, separated by spaces. (I put the spaces in square brackets to make them stand out, and so that they are easy to extend if you want to include characters other than a single regular ASCII space in the set of token separators. For example, if your regex dialect accepts \t for tab, or you can insert a literal tab in its place, you can extend it to

 [^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+ 

In most shells, you can enter a literal tab with ctrl + v tab, that is, by prefixing it with an escape code, which is typically typed by holding down the ctrl key and pressing v.)
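
To see the bracket-expression version in action without a shell, here is a small sketch in Python (chosen only for illustration; the pattern itself uses nothing beyond bracket expressions and +). The helper name first_three_tokens is hypothetical.

```python
import re

# Three tokens separated by runs of spaces or tabs, with no \s or \w needed.
TOKENS = re.compile(r"[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+")


def first_three_tokens(line):
    """Return the first three space/tab-separated tokens, or None if there are fewer."""
    match = TOKENS.search(line)
    return match.group(0) if match else None


print(first_three_tokens("This\tis  the string"))
```

The separators between the tokens are returned exactly as they appear in the input.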

To use this, you might run

 grep -Eo '^[^ ]+[ ]+[^ ]+[ ]+[^ ]+' file 

where the single quotes are necessary to protect the regular expression from the shell (double quotes would work here but are weaker; the other alternative is to backslash-escape every character in the regular expression that the shell treats as a metacharacter). The leading ^ is there because grep -o prints every non-overlapping match on a line, so without it a six-word line would yield two results. Or, perhaps,

 sed -r 's/([^ ]+[ ]+[^ ]+[ ]+[^ ]+).*/\1/' file 

to replace each line with just the captured expression (the parentheses form a capture group that you can refer back to with \1 in the replacement part of sed's s command). The -r option selects a slightly more featureful regular expression dialect than that of traditional bare-bones sed; if your sed doesn't have it, try -E, or put a backslash in front of each parenthesis and plus sign (\+ is a GNU extension and is not available in every sed).

Because of how regular expressions work, the first three tokens are easy, since the regular expression engine always returns the first possible match in a string. If you want three tokens starting from the second, you have to add an expression that skips the first. Adapting the sed script above, that would be

 sed -r 's/[^ ]+[ ]+([^ ]+[ ]+[^ ]+[ ]+[^ ]+).*/\1/' 

where you will notice that I put a token followed by a non-token (separator) group before the capture. (This cannot be done with grep -o unless you have grep -P, in which case the full gamut of Perl extensions is available to you.)

If your regex dialect supports {m,n} repetition, you can of course refactor the regular expression to use that. If you need a large number of repetitions, it is certainly more readable and more maintainable. Just make sure you do not add parentheses in a way that disturbs the numbering of the backreferences (the first opening parenthesis creates the first group \1, the second \2, and so on).
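
The skip-then-capture idea can be tried out with a short Python sketch, assuming Python's re.sub as a stand-in for sed's s command. The helper name tokens_two_to_four is hypothetical.

```python
import re

# One token plus its separator is matched but not captured; the next three
# tokens land in group 1, and .* consumes the remainder of the line.
SKIP_ONE = re.compile(r"[^ ]+[ ]+([^ ]+[ ]+[^ ]+[ ]+[^ ]+).*")


def tokens_two_to_four(line):
    """Replace the line with its second, third and fourth tokens."""
    return SKIP_ONE.sub(r"\1", line, count=1)


print(tokens_two_to_four("This is the string for testing."))
```

A line with fewer than four tokens does not match and is returned unchanged.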

 sed -r 's/([^ ]+([ ]+[^ ]+){2}).*/\1/' file 

Notice how the second parenthesized group is required to delimit the scope of the {2} repetition (we want to repeat more than just the single character immediately before the opening curly brace). The OP's attempt had an error in that the repetition was specified outside the last parenthesis; the backreference \1 (or whatever it is called in your dialect; TextMate seems to use $1, just like Perl) will then refer to the last single match of the parenthesized expression, since a repetition outside the capturing parentheses is not part of the capture.
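
The difference between the two group placements can be checked with a small Python sketch. Python's re module is only a stand-in here; most backtracking engines behave the same way, but that is an assumption worth verifying in your own tool. The function names are hypothetical.

```python
import re


def last_repetition(line):
    """Quantifier outside the group: group 1 is overwritten on every repetition."""
    match = re.match(r"(\w+\s+){3}", line)
    return match.group(1)


def outer_and_inner(line):
    """Outer group wraps the repetition; groups are numbered by opening parenthesis."""
    match = re.match(r"([^ ]+([ ]+[^ ]+){2})", line)
    return match.group(1), match.group(2)


sample = "This is the string for testing."
print(repr(last_repetition(sample)))   # only the third word and its trailing space
print(outer_and_inner(sample))         # all three tokens, then the last repetition
```

Group 2 in the second pattern shows the same overwrite effect on a smaller scale: it ends up holding only the final repetition of the inner group.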
