I think this is a common problem, and I found quite a few web pages, including some from SO, but I could not figure out how to implement it.
I am new to regex and I would like to use it in R to extract the first few words from a sentence.
For example, if my sentence is
z = "I love stack overflow it is such a cool site"
I'd like to have my output as (if I need the first four words)
[1] "I love stack overflow"
or (if I need the last four words)
[1] "such a cool site"
Of course, the following works:
paste(strsplit(z," ")[[1]][1:4],collapse=" ")
paste(strsplit(z," ")[[1]][7:10],collapse=" ")
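The same split-and-paste idea can be written so that the word positions are not hard-coded and a whole character vector is handled at once. This is only a sketch: the helper names `first_words` and `last_words` are made up for illustration, and it assumes words are separated by single spaces.

```r
z <- "I love stack overflow it is such a cool site"

# Hypothetical helpers, not from any package.
# strsplit() returns a list with one character vector per input string;
# vapply() collapses each element back into a single string.
first_words <- function(x, n) {
  vapply(strsplit(x, " ", fixed = TRUE),
         function(w) paste(head(w, n), collapse = " "),
         character(1))
}

last_words <- function(x, n) {
  vapply(strsplit(x, " ", fixed = TRUE),
         function(w) paste(tail(w, n), collapse = " "),
         character(1))
}

first_words(z, 4)  # "I love stack overflow"
last_words(z, 4)   # "such a cool site"
```

Using head() and tail() avoids having to know that the sentence has ten words.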
but I would like to try regex for performance reasons, since I need to deal with very large files (and also in order to learn about it).
I looked through several links, including "Regex for extracting the first three words from a string" and http://osherove.com/blog/2005/1/7/using-regex-to-return-the-first-n-words-in-a-string.html
So I tried things like
gsub("^((?:\S+\s+){2}\S+).*",z,perl=TRUE)
Error: '\S' is an unrecognized escape in character string starting ""^((?:\S"
I tried other things, but they usually returned either the whole string or an empty string.
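For what it is worth, the error message points at R's string parser rather than at the regex engine: a single backslash is read as a string escape, so each backslash in the pattern needs to be doubled inside the string literal. A minimal sketch under that assumption follows. The helper names `first_n_words` and `last_n_words` are made up for illustration, and `sub` is used instead of `gsub` because only one match per string is needed.

```r
z <- "I love stack overflow it is such a cool site"

# Hypothetical helpers, not from any package.
# "\\S" in the R source becomes \S by the time the regex engine sees it.
first_n_words <- function(x, n) {
  # (n - 1) "word + whitespace" groups, then one more word; drop the rest
  pattern <- sprintf("^((?:\\S+\\s+){%d}\\S+).*$", n - 1)
  sub(pattern, "\\1", x, perl = TRUE)
}

last_n_words <- function(x, n) {
  # lazily skip the start, then capture n words anchored at the end
  pattern <- sprintf("^.*?((?:\\S+\\s+){%d}\\S+)\\s*$", n - 1)
  sub(pattern, "\\1", x, perl = TRUE)
}

first_n_words(z, 4)  # "I love stack overflow"
last_n_words(z, 4)   # "such a cool site"
```

sub() is vectorised over x and returns a character vector, so no list indexing is needed. If a string has fewer than n words, the pattern does not match and the string comes back unchanged.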
Another problem with strsplit is that it returns a list. It may be that the [[ ]] operator slows things down a bit (??) when dealing with large files and using the apply functions.
Is the regex syntax used in R slightly different? Thanks!