REGEX in R: extract words from a string

I think this is a common problem, and I found quite a few web pages, including some from SO, but I could not figure out how to implement it.

I am new to REGEX and I would like to use it in R to extract the first few words from a sentence.

For example, if my sentence is

 z = "I love stack overflow it is such a cool site"

I'd like to have my output as (if I need the first four words)

 [1] "I love stack overflow" 

or (if I need the last four words)

 [1] "such a cool site" 

Of course, the following works:

 paste(strsplit(z, " ")[[1]][1:4], collapse = " ")
 paste(strsplit(z, " ")[[1]][7:10], collapse = " ")

but I would like to try a regex for performance reasons, since I need to deal with very large files (and also to learn more about regex).

I looked through several links, including "Regex for extracting the first three words from a string" and http://osherove.com/blog/2005/1/7/using-regex-to-return-the-first-n-words-in-a-string.html

So I tried things like

 gsub("^((?:\S+\s+){2}\S+).*",z,perl=TRUE)
 Error: '\S' is an unrecognized escape in character string starting ""^((?:\S"

I tried other things, but they usually returned either the whole string or an empty string.

Another problem with strsplit is that it returns a list. It may be that the [[ ]] operator slows things down a little (??) when working with large files and apply-style calls.

It seems the regex syntax used in R is slightly different? Thanks!

2 answers

You have already accepted an answer, but I am going to share this to help you understand a little more about regular expressions in R, since you were really very close to getting the answer yourself.


There are two problems in your gsub approach:

  • You used a single backslash ( \ ). R requires you to escape backslashes inside string literals because they are special characters there. You escape one by adding another backslash ( \\ ). If you run nchar("\\") , you will see that it returns 1.

  • You did not specify what the replacement should be. Here we do not want to replace anything with new text; we want to capture a specific part of the string. You capture groups with parentheses (...) , and then you can refer to them by group number. Here we have only one group, so we refer to it as "\\1" .
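
As a small sketch of the first point (the variable name pat is only for illustration), you can check that the doubled backslash typed in R source ends up as a single backslash in the string that the regex engine actually sees:

```r
# "\\S+" is typed with two backslashes, but the string holds only one.
pat <- "\\S+"

nchar(pat)       # counts the backslash, the S and the plus sign
cat(pat, "\n")   # cat() shows the raw characters handed to the regex engine

# The pattern matches a run of non-whitespace characters.
has_word <- grepl(pat, "hello world", perl = TRUE)
```

print() would show the escaped form again, which is why cat() is used here to look at the real contents.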

You should have tried something like:

 sub("^((?:\\S+\\s+){2}\\S+).*", "\\1", z, perl = TRUE)
 # [1] "I love stack"

It basically says:

  • Start at the beginning of the contents of "z" ( ^ ).
  • Start creating group 1.
  • Find a run of non-whitespace characters (a word) followed by whitespace ( \S+\s+ ), two times ( {2} ), and then the next run of non-whitespace characters ( \S+ ). This gives us 3 words without capturing the whitespace after the third word. Thus, if you want a different number of words, change the {2} to one less than the number of words you are actually after.
  • This is where group 1 ends.
  • Match whatever is left of the string ( .* ), so that the whole string is part of the match.
  • Then replace the whole match with the contents of group 1 ( \1 ), which leaves only the captured words from "z".
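
The "one less than the number of words" rule above can be wrapped in a small function. This is only a sketch: first_n_words is a hypothetical helper name, not something from base R, and it assumes words are separated by whitespace.

```r
# Hypothetical helper: keep the first n whitespace-separated words of s.
first_n_words <- function(s, n) {
  stopifnot(n >= 1)
  # (n - 1) repetitions of "word + whitespace", then one final word;
  # the trailing .* swallows the rest so sub() replaces the whole string.
  pattern <- sprintf("^((?:\\S+\\s+){%d}\\S+).*", n - 1L)
  sub(pattern, "\\1", s, perl = TRUE)
}

z <- "I love stack overflow it is such a cool site"
first_n_words(z, 4)
# [1] "I love stack overflow"
```

If the string has fewer than n words, the pattern does not match and sub() returns the input unchanged.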

To get the last three words, simply move the capture group to the end of the pattern and anchor it to the end of the string:

 sub("^.*\\s+((?:\\S+\\s+){2}\\S+)$", "\\1", z, perl = TRUE)
 # [1] "a cool site"
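
The same idea can be parameterised for the last n words. Again, this is just a sketch: last_n_words is a hypothetical helper name, and it assumes the string has no trailing whitespace, because the pattern is anchored with $ right after the last word.

```r
# Hypothetical helper: keep the last n whitespace-separated words of s.
# Assumes s does not end in whitespace.
last_n_words <- function(s, n) {
  stopifnot(n >= 1)
  # Greedy .* consumes as much as it can, then backtracks until exactly
  # n words remain between the preceding whitespace and the end.
  pattern <- sprintf("^.*\\s+((?:\\S+\\s+){%d}\\S+)$", n - 1L)
  sub(pattern, "\\1", s, perl = TRUE)
}

z <- "I love stack overflow it is such a cool site"
last_n_words(z, 4)
# [1] "such a cool site"
```

Because the pattern needs whitespace before the captured group, a string with n words or fewer does not match and is returned unchanged.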

For the first four words:

 library(stringr)
 str_extract(x, "^\\s*(?:\\S+\\s+){3}\\S+")

To get the last four.

 str_extract(x, "(?:\\S+\\s+){3}\\S+(?=\\s*$)") 
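
If you would rather not depend on stringr, a rough base R equivalent is regmatches() combined with regexpr(). This is a sketch, not part of the answer above; x is assumed to hold the sentence from the question, and perl = TRUE is needed for the lookahead.

```r
# Assumed input: the sentence from the question.
x <- "I love stack overflow it is such a cool site"

# First four words: extract the matched text instead of substituting.
first4 <- regmatches(x, regexpr("^\\s*(?:\\S+\\s+){3}\\S+", x, perl = TRUE))

# Last four words: the lookahead (?=\\s*$) requires the match to end at
# the end of the string, ignoring any trailing whitespace.
last4 <- regmatches(x, regexpr("(?:\\S+\\s+){3}\\S+(?=\\s*$)", x, perl = TRUE))

first4
# [1] "I love stack overflow"
last4
# [1] "such a cool site"
```

One difference to keep in mind: for a vector input, regmatches() drops the elements that do not match, whereas str_extract() returns NA for them.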