Self-conflicting stop words in R tm text mining

I am cleaning data for text mining. This involves removing numbers, punctuation, and stop words (common words that would just be noise in the data mining), and later stemming the words.

Using the tm package in R, you can remove stop words with, for example, tm_map(myCorpus, removeWords, stopwords('english')). The tm manual itself demonstrates using stopwords("english"). This word list contains contractions such as "i'd", as well as the very common word "i":

> library(tm)
> which(stopwords('english') == "i")
[1] 1
> which(stopwords('english') == "i'd")
[1] 69

(The text is assumed to have been lowercased before stop words are removed.)

But (presumably) because "i" comes first in the list, the contractions are never removed:

> removeWords("i'd like a soda, please", stopwords('english'))
[1] "'d like  soda, please"
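
The behavior above can be reproduced in base R without tm. This is a minimal sketch that assumes removeWords joins the words into a single alternation pattern wrapped in word boundaries; the three-word list and the variable names are made up for illustration. With Perl-style regular expressions the first alternative that matches wins, and the apostrophe counts as a word boundary, so "i" matches before "i'd" is ever tried.

```r
# Hypothetical stop list; "i" comes before "i'd", as in stopwords('english').
stop_words <- c("i", "a", "i'd")

# Assumed mechanism: one alternation regex bounded by \b on each side.
pattern <- sprintf("\\b(%s)\\b", paste(stop_words, collapse = "|"))

# "i" matches first because \b also holds between "i" and the apostrophe.
result <- gsub(pattern, "", "i'd like a soda, please", perl = TRUE)
print(result)
# [1] "'d like  soda, please"
```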

A quick hack is to reverse the word list:

> removeWords("i'd like a soda, please", rev.default(stopwords('english')))
[1] " like  soda, please"
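
Reversing only works because the contractions happen to sit later in the list. A slightly more explicit variant, sketched here in base R under the same assumption about how the pattern is built, is to order the words from longest to shortest so that a contraction is always tried before its prefix. The word list and names are illustrative only.

```r
# Hypothetical stop list, in the problematic order.
stop_words <- c("i", "a", "i'd")

# Longest words first, so "i'd" is tried before "i".
by_length <- stop_words[order(nchar(stop_words), decreasing = TRUE)]

pattern <- sprintf("\\b(%s)\\b", paste(by_length, collapse = "|"))
cleaned <- gsub(pattern, "", "i'd like a soda, please", perl = TRUE)
print(cleaned)
# [1] " like  soda, please"
```

With tm, the same reordering could presumably be applied to stopwords('english') before passing it to removeWords.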



Tokenizing the text first and then matching whole tokens avoids stripping i out of i'd or i'm. For example, with quanteda:

require(quanteda)
removeFeatures(tokenize("i'd like a soda, please"), c("i'd", "a"))
# tokenizedText object from 1 document.
# Component 1 :
# [1] "like"   "soda"   ","      "please"

The same approach works with a full English stop word list (here also dropping punctuation):

removeFeatures(tokenize("i'd like a soda, please", removePunct = TRUE),
               stopwords("english"))
# tokenizedText object from 1 document.
# Component 1 :
# [1] "like"   "soda"   "please"
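
The tokenize-then-filter idea does not depend on any particular package. The following base R sketch splits on whitespace, trims punctuation from the token edges, and then drops tokens by exact match. The stop list is a made-up three-word example, and the whitespace split is a deliberately crude stand-in for a real tokenizer.

```r
# Hypothetical stop list; order no longer matters with exact token matching.
stop_words <- c("i", "i'd", "a")

text <- "i'd like a soda, please"

# Crude tokenizer: split on runs of whitespace.
tokens <- strsplit(text, "[[:space:]]+")[[1]]

# Trim leading/trailing punctuation but keep apostrophes inside a token.
tokens <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", tokens)

# Keep only non-empty tokens that are not stop words.
kept <- tokens[nzchar(tokens) & !(tokens %in% stop_words)]
print(kept)
# [1] "like"   "soda"   "please"
```

Because each token is compared as a whole, "i" can never match part of "i'd".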

