regmatches. This approach uses regexpr / gregexpr and regmatches. I expanded the test data to include more examples.
text1 <- c(" s@1212a www.abcd.com www.cats.com", "www.boo.com", "asdf", "blargwww.test.comasdf")
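With pattern defined as in the regular expression section below and m <- gregexpr(pattern, text1), this gives: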
> regmatches(text1, m) ##
[[1]]
[1] "www.abcd.com" "www.cats.com"

[[2]]
[1] "www.boo.com"

[[3]]
character(0)

[[4]]
[1] "www.test.com"
Note that it returns a list. If we want a vector, we can simply call unlist on the result. It returns a list because we used gregexpr, which allows for several matches in each string. If we know that there is at most one match per string, we could use regexpr instead:
> m <- regexpr(pattern, text1)
> regmatches(text1, m)
[1] "www.abcd.com" "www.boo.com" "www.test.com"
Note that this returns all results as a single vector and returns at most one result from each string (note that www.cats.com is not in the results). In general, however, I think that either of these two methods is preferable to the gsub method, because gsub returns the entire input when no match is found. For example, see:
> gsub(text1, pattern=".*(www\\..*?\\.com).*", replace="\\1")
[1] "www.abcd.com" "www.boo.com" "asdf" "www.test.com"
And this is even after modifying the pattern to be a little more robust. We still get "asdf" in the results, although it clearly does not match the pattern.
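To make the difference concrete, here is a minimal sketch using the same text1 and pattern as above. The names extracted, has_match, guarded and all_matches are hypothetical, and the grepl guard is only one possible way to work around the gsub behavior, not something the answer above prescribes.

```r
text1 <- c(" s@1212a www.abcd.com www.cats.com", "www.boo.com",
           "asdf", "blargwww.test.comasdf")
pattern <- "www\\..*?\\.com"

# gsub leaves non-matching input unchanged, so "asdf" passes straight through.
extracted <- gsub(".*(www\\..*?\\.com).*", "\\1", text1)

# Hypothetical guard: keep only the elements where the pattern actually matched.
has_match <- grepl(pattern, text1)
guarded <- extracted[has_match]

# regmatches needs no such guard; unlist flattens the gregexpr result to a vector.
all_matches <- unlist(regmatches(text1, gregexpr(pattern, text1)))
```

With regmatches the non-matching element simply disappears from the result, whereas the gsub route needs the extra filtering step.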
Shameless silly self-promotion: regmatches was introduced with R 2.14, so if you are stuck with an earlier version of R, you might be out of luck, unless you are able to install the future2.14 package from my GitHub repo, which provides some support for the features introduced in 2.14 in earlier versions of R.
strapplyc. An alternative that gives the same result as the line marked ## above is:
library(gsubfn)
strapplyc(text1, pattern)
Regular expression. Here is some explanation of how to decipher the regular expression:
pattern <- "www\\..*?\\.com"
Explanation:
www matches the literal www portion.
\\. We need to escape the actual dot character using \\, because a plain . represents "any character" in regular expressions.
.*? Here . represents any character, * says to match it 0 or more times, and the ? following * makes the match non-greedy. Otherwise, "asdf www.cats.com www.dogs.com asdf" would yield "www.cats.com www.dogs.com" as a single match instead of recognizing that there are two matches.
\\. Once again, we need to escape the actual dot character.
com This part matches the final 'com' that we want to match.
Putting it all together, it says: start with www., then match any characters until you reach the first ".com".
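The effect of the ? after * can be checked directly. This is a small sketch using the example string from the explanation above. The variable names x, lazy and greedy are hypothetical, and the greedy pattern is included only for contrast.

```r
x <- "asdf www.cats.com www.dogs.com asdf"

# Non-greedy: .*? stops at the first ".com", so each URL is its own match.
lazy <- regmatches(x, gregexpr("www\\..*?\\.com", x))[[1]]

# Greedy: .* runs on to the last ".com", merging both URLs into one match.
greedy <- regmatches(x, gregexpr("www\\..*\\.com", x))[[1]]

print(lazy)
print(greedy)
```

Here lazy should hold two separate matches, while greedy holds a single string spanning both URLs.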