Extract all words between two specific words in a character vector

Question


Is there a more efficient way to do this? And how can I do it without stringr?

    txt <- "I want to extract the words between this and that, this goes with that, this is a long way from that"
    library(stringr)
    w_start <- "this"
    w_end <- "that"
    pattern <- paste0(w_start, "(.*?)", w_end)
    wordsbetween <- unlist(str_extract_all(txt, pattern))
    gsub("^\\s+|\\s+$", "", str_sub(wordsbetween, nchar(w_start)+1, -nchar(w_end)-1))
    [1] "and"                "goes with"          "is a long way from"

+7

string regex r

Ben Apr 23 '13 at 5:23


2 answers

Here is another rough attempt using strsplit, although it can probably be refined further:

    txtspl <- unlist(strsplit(gsub("[[:punct:]]","",txt),"this|that"))
    txtspl[txtspl!=" "][-1]
    #[1] " and "                " goes with "          " is a long way from "

+1

thelatemail Apr 23 '13 at 5:50


Tyler Rinker · Accepted Answer · 2013-04-23T05:32:42+0000

This is the approach I use in qdap:

Using qdap:

    library(qdap)
    genXtract(txt, "this", "that")
    ## > genXtract(txt, "this", "that")
    ##           this : that1           this : that2           this : that3
    ##                " and "          " goes with " " is a long way from "

Without adding a package:

    regmatches(txt, gregexpr("(?<=this).*?(?=that)", txt, perl=TRUE))
    ## > regmatches(txt, gregexpr("(?<=this).*?(?=that)", txt, perl=TRUE))
    ## [[1]]
    ## [1] " and " " goes with " " is a long way from "
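The base R call above keeps the surrounding spaces, whereas the output in the question is trimmed. A minimal sketch that wraps the same lookaround pattern and trims each match might look like the following. The function name `words_between` is made up for illustration, and it assumes the two delimiter words contain no regex metacharacters and that `trimws()` is available (R 3.2.0 or later); the `gsub("^\\s+|\\s+$", "", ...)` call from the question would serve the same purpose on older versions.

```r
# Hypothetical helper (not part of the original answers): extract the text
# between two plain words using base R only, then trim whitespace.
# Assumes w_start and w_end contain no regex metacharacters.
words_between <- function(x, w_start, w_end) {
  # Lookbehind/lookahead keep the delimiters out of the match;
  # .*? is lazy so each match stops at the nearest w_end
  pattern <- paste0("(?<=", w_start, ").*?(?=", w_end, ")")
  matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  # One character vector per element of x, with outer whitespace removed
  lapply(matches, trimws)
}

txt <- "I want to extract the words between this and that, this goes with that, this is a long way from that"
result <- words_between(txt, "this", "that")
print(result[[1]])
```

Because `regmatches()` returns a list with one element per input string, the helper also works on a character vector of length greater than one; use `unlist()` if a flat vector is preferred.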
