I am currently preprocessing data for text mining. This includes removing numbers, punctuation, and stop words (common words that would just be noise in the analysis), and then stemming the words.
Using the tm package in R, you can remove stop words with, for example, tm_map(myCorpus, removeWords, stopwords('english')). The tm manual itself demonstrates using stopwords("english"). This word list contains contractions such as "i'd", as well as the very common word "i":
> library(tm)
> which(stopwords('english') == "i")
[1] 1
> which(stopwords('english') == "i'd")
[1] 69
(The text is assumed to be lowercase before stop words are removed.)
But (presumably) because "i" comes first in the list, the contractions are never removed in full:
> removeWords("i'd like a soda, please", stopwords('english'))
[1] "'d like soda, please"
A quick hack is to reverse the word list:
> removeWords("i'd like a soda, please", rev.default(stopwords('english')))
[1] " like soda, please"
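The ordering effect can be reproduced in base R without tm. This is only a minimal sketch, assuming that removeWords joins the word list into a single regular-expression alternation evaluated with perl = TRUE (an assumption about tm's internals, not something shown above). The three-word list and the build_pattern helper are hypothetical stand-ins for illustration.

```r
# Hypothetical mini stop-word list; "i" comes before "i'd", as in stopwords('english').
words <- c("i", "a", "i'd")
text <- "i'd like a soda, please"

# Hypothetical helper: join the words into one alternation wrapped in word boundaries.
# Assumption: this mirrors how removeWords builds its pattern.
build_pattern <- function(w) sprintf("\\b(%s)\\b", paste(w, collapse = "|"))

# In list order, the first alternative that matches wins, so "i" matches
# before "i'd" is ever tried and the "'d" fragment is left behind.
in_order <- gsub(build_pattern(words), "", text, perl = TRUE)

# Longest words first, so "i'd" is tried before "i".
by_length <- words[order(nchar(words), decreasing = TRUE)]
longest_first <- gsub(build_pattern(by_length), "", text, perl = TRUE)

print(in_order)
print(longest_first)
```

If that assumption holds, sorting by length is a slightly more robust variant of the reversal hack, since it does not depend on where the contractions happen to sit in the list.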