Extract links to sites from text in R

I have several texts, each of which may contain one or more web links, e.g.:

    text1 <- " s@1212a as www.abcd.com asasa11"

How can I extract

    "www.abcd.com"

from this text in R? In other words, I want to extract patterns that start with www and end with .com.

3 answers

regmatches. This approach uses regexpr/gregexpr together with regmatches. I expanded the test data to include more examples.

    text1 <- c(" s@1212a www.abcd.com www.cats.com",
               "www.boo.com",
               "asdf",
               "blargwww.test.comasdf")

    # Regular expressions take some practice.
    # Check out ?regex or the Wikipedia page on regular expressions
    # for more info on creating them yourself.
    pattern <- "www\\..*?\\.com"

    # Get information about where the pattern matches text1
    m <- gregexpr(pattern, text1)

    # Extract the matches from text1
    regmatches(text1, m)

This gives:

    > regmatches(text1, m)
    [[1]]
    [1] "www.abcd.com" "www.cats.com"

    [[2]]
    [1] "www.boo.com"

    [[3]]
    character(0)

    [[4]]
    [1] "www.test.com"

Note that this returns a list. If we want a vector, we can simply call unlist on the result. We get a list because we used gregexpr, which allows for several matches in each string. If we know that there is at most one match per string, we can use regexpr instead:

    > m <- regexpr(pattern, text1)
    > regmatches(text1, m)
    [1] "www.abcd.com" "www.boo.com"  "www.test.com"

Note that this returns all results as a single vector and only the first match from each string (www.cats.com is not in the results). In general, though, I think either of these two methods is preferable to the gsub method, because gsub returns the entire input when no match is found. For example:

    > gsub(text1, pattern=".*(www\\..*?\\.com).*", replace="\\1")
    [1] "www.abcd.com" "www.boo.com"  "asdf"         "www.test.com"

And that is even after changing the pattern to be a little more robust. We still get "asdf" in the results, even though it clearly does not match the pattern.
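As a minimal sketch of the two points above (flattening the gregexpr result, and keeping track of inputs that have no match), something like the following should work in base R. The names all_links, hit and first_link are just illustrative, not part of the original answer:

```r
text1 <- c(" s@1212a www.abcd.com www.cats.com",
           "www.boo.com",
           "asdf",
           "blargwww.test.comasdf")
pattern <- "www\\..*?\\.com"

# All matches from all strings, flattened into one character vector
all_links <- unlist(regmatches(text1, gregexpr(pattern, text1)))

# First match per string, with NA where nothing matched,
# so the result stays aligned with the input
m <- regexpr(pattern, text1)
hit <- as.integer(m) != -1L          # regexpr returns -1 for "no match"
first_link <- rep(NA_character_, length(text1))
first_link[hit] <- regmatches(text1, m)

print(all_links)
print(first_link)
```

Unlike the gsub approach, the non-matching "asdf" shows up here as NA rather than being passed through unchanged.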

Shameless self-promotion: regmatches was introduced in R 2.14, so if you are stuck with an earlier version of R you might be out of luck. That is, unless you can install the future2.14 package from my GitHub repo, which provides some support for features introduced in 2.14 in earlier versions of R.

strapplyc. An alternative that gives the same result as the gregexpr/regmatches approach above is:

    library(gsubfn)
    strapplyc(text1, pattern)

Regular expression. Here is some explanation of how to decipher the regular expression:

    pattern <- "www\\..*?\\.com"

Explanation:

www matches the literal "www" part.

\\. We need to escape the dot character with \\ because a plain . means "any character" in regular expressions.

.*? Here . stands for any character, * says to match it 0 or more times, and the ? following the * makes the match non-greedy. Otherwise "asdf www.cats.com www.dogs.com asdf" would match all of "www.cats.com www.dogs.com" as a single match instead of recognizing that there are two matches.

\\. Once again we need to escape the literal dot character.

com This part matches the final "com" that we want to match.

Putting it all together, the pattern says: start with "www.", then match any characters until you reach the first ".com".
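To see the greedy versus non-greedy difference concretely, here is a small sketch using the example string from the explanation. The variable names are illustrative, and perl = TRUE is my own choice here to make the quantifier semantics unambiguous; the original answer uses the default engine:

```r
x <- "asdf www.cats.com www.dogs.com asdf"

# Non-greedy: .*? stops at the first ".com" it can reach
lazy <- regmatches(x, gregexpr("www\\..*?\\.com", x, perl = TRUE))[[1]]

# Greedy: .* runs on to the last ".com" in the string
greedy <- regmatches(x, gregexpr("www\\..*\\.com", x, perl = TRUE))[[1]]

print(lazy)    # "www.cats.com" "www.dogs.com"
print(greedy)  # "www.cats.com www.dogs.com"
```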


Check out the gsub function:

    x = " s@1212a as www.abcd.com asasa11"
    gsub(x=x, pattern=".*(www.*com).*", replace="\\1")

The main idea is to wrap the text you want to keep in parentheses and then replace the entire string with it. In the replacement argument of gsub, "\\1" refers to whatever was matched inside the parentheses.
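One caveat with this approach is that a string without a match comes back unchanged. A possible workaround, sketched here under my own assumptions (the escaped, non-greedy pattern, the grepl guard, perl = TRUE and the variable names are not from the answer above), is to check for a match first:

```r
x <- c(" s@1212a as www.abcd.com asasa11", "asdf")
pattern <- ".*(www\\..*?\\.com).*"

# Only substitute where the pattern actually matches; use NA elsewhere
hit <- grepl(pattern, x, perl = TRUE)
links <- ifelse(hit, sub(pattern, "\\1", x, perl = TRUE), NA_character_)

print(links)  # "www.abcd.com" NA
```

Note that this still extracts at most one link per string, so it is only a fit when each text contains a single link.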


The solutions here are great and use only base R. For those who want a quick solution, you can use genXtract from the qdap package. This function basically takes left and right element(s) and extracts everything in between. By setting with = TRUE, it will include those elements as well:

    text1 <- c(" s@1212a www.abcd.com www.cats.com",
               "www.boo.com",
               "asdf",
               "http://www.talkstats.com/ and http://stackoverflow.com/",
               "blargwww.test.comasdf")

    library(qdap)
    genXtract(text1, "www.", ".com", with=TRUE)

    ## > genXtract(text1, "www.", ".com", with=TRUE)
    ## $`www. : .com1`
    ## [1] "www.abcd.com" "www.cats.com"
    ##
    ## $`www. : .com2`
    ## [1] "www.boo.com"
    ##
    ## $`www. : .com3`
    ## character(0)
    ##
    ## $`www. : .com4`
    ## [1] "www.talkstats.com"
    ##
    ## $`www. : .com5`
    ## [1] "www.test.com"

PS: if you look at the code for the function, it is a wrapper around Dason's solution.
