How can I make gsub replace only whole words?

(I am using R.) I have a list of words called "goodwords.corpus". I look through the documents in a corpus and replace each occurrence of a word from "goodwords.corpus" with the word plus a number.

So, for example, if the word "good" is in the list and "goodnight" is NOT in the list, then this document:

I am having a good time goodnight 

will turn into:

 I am having a good 1234 time goodnight 

I am using this code (EDIT - made it reproducible):

    goodwords.corpus <- c("good")
    test <- "I am having a good time goodnight"
    for (i in 1:length(goodwords.corpus)){
      test <- gsub(goodwords.corpus[[i]], paste(goodwords.corpus[[i]], "1234"), test)
    }

However, the problem is that I want gsub to replace only ENTIRE words. Here "good" is in "goodwords.corpus", but it is also part of "goodnight", which is NOT in the list. So I get the following:

 I am having a good 1234 time good 1234night 

Is there any way I can tell gsub to replace only ENTIRE words, and not words that are part of other words?

I want to use this:

 test <-gsub("\\<goodwords.corpus[[i]]\\>", paste(goodwords.corpus[[i]], "1234"), test) } 

I read that \< and \> tell gsub to match only whole words. But obviously this does not work, because goodwords.corpus[[i]] is taken literally rather than evaluated when it is inside quotation marks.

Any suggestions?

r topic-modeling gsub
2 answers

You are very close. You are already using paste to build the replacement string, so why not use it to build the pattern string as well?

    goodwords.corpus <- c("good")
    test <- "I am having a good time goodnight"
    for (i in 1:length(goodwords.corpus)){
      test <- gsub(paste0('\\<', goodwords.corpus[[i]], '\\>'), paste(goodwords.corpus[[i]], "1234"), test)
    }
    test
    # [1] "I am having a good 1234 time goodnight"

(paste0 is just paste(..., sep='').)
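To see why paste0 rather than plain paste matters for the pattern, here is a small sketch. The variable names word, patt0 and patt_spaced are mine, chosen only for illustration.

```r
word <- "good"

# paste0() concatenates with no separator, which is what a regex pattern needs
patt0 <- paste0("\\<", word, "\\>")

# paste() with its default sep = " " puts spaces inside the pattern
patt_spaced <- paste("\\<", word, "\\>")

patt0        # "\\<good\\>"
patt_spaced  # "\\< good \\>"
```

The spaced version would not match "good" as a whole word, because the pattern now contains literal spaces between the anchors and the word. For the replacement string the space is wanted, which is why paste is fine there.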

(I posted this at the same time as @MatthewLundberg, and he is right too. I am more familiar with using \b rather than \<, but I thought I would stay close to your code.)
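As a quick check that the two notations agree on this example, here is a minimal sketch. The names out_angle and out_b are mine and not from the question.

```r
test <- "I am having a good time goodnight"
word <- "good"

# Pattern with the word-start / word-end anchors \< and \>
out_angle <- gsub(paste0("\\<", word, "\\>"), paste(word, "1234"), test)

# Pattern with the word-boundary assertion \b on both sides
out_b <- gsub(paste0("\\b", word, "\\b"), paste(word, "1234"), test)

out_angle                  # "I am having a good 1234 time goodnight"
identical(out_angle, out_b)  # TRUE
```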


Use \b to specify the word boundary:

    > text <- "good night goodnight"
    > gsub("\\bgood\\b", paste("good", 1234), text)
    [1] "good 1234 night goodnight"

In your loop, something like this:

    for (word in goodwords.corpus){
      patt <- paste0('\\b', word, '\\b')
      repl <- paste(word, "1234")
      test <- gsub(patt, repl, test)
    }
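
If the list is long, one possible variation (a sketch, not something from the question) is to build a single alternation pattern and make one gsub call with a backreference. This assumes the words contain no regex metacharacters. The extra word "time" is only there for illustration, and perl = TRUE is my choice so that \b follows PCRE rules.

```r
goodwords.corpus <- c("good", "time")   # "time" added only for illustration
test <- "I am having a good time goodnight"

# Build one pattern of the form \b(good|time)\b
patt <- paste0("\\b(", paste(goodwords.corpus, collapse = "|"), ")\\b")

# \\1 in the replacement stands for whichever word matched
result <- gsub(patt, "\\1 1234", test, perl = TRUE)
result
# [1] "I am having a good 1234 time 1234 goodnight"
```

Because everything happens in one pass, a later word in the list cannot accidentally match inside text that an earlier replacement inserted, which can happen with the loop.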
