I'm doing some text analysis using tm_map in R. I run the following code (without errors) to produce a DocumentTermMatrix of (stemmed and otherwise pre-processed) words.
corpus = Corpus(VectorSource(textVector))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, PlainTextDocument)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c(stopwords("english")))
corpus = tm_map(corpus, stemDocument, language="english")
dtm = DocumentTermMatrix(corpus)
mostFreqTerms = findFreqTerms(dtm, lowfreq=125)
But when I look at my (stemmed) mostFreqTerms, I see a couple that make me think: "Hmm, what words were stemmed to produce that?" In addition, there may be stemmed terms that make sense to me at first glance, but maybe I'm missing the fact that they actually merge words with different meanings.
I would like to apply the strategy / technique described in this SO answer about preserving certain terms during stemming (for example, keeping “natural” and “naturalized” from becoming the same stemmed term).
Text-mining with the tm-package - word stemming
But to do this thoroughly, I would like to see a list of all the distinct words that were stemmed to my most frequent terms. Is there any way to find the words that, when stemmed, produced my mostFreqTerms list?
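To illustrate the kind of stem-to-word mapping being asked for, here is a minimal sketch. It assumes the SnowballC package is installed and that its English stemmer behaves like the one applied by stemDocument; the word list and the name stemToWords are hypothetical and only serve the illustration.

```r
# Sketch only: assumes SnowballC is installed and that its English stemmer
# matches the one applied by stemDocument in the pipeline above.
library(SnowballC)

# Hypothetical word list; in practice this would be the unique words of the
# pre-processed (but not yet stemmed) text.
words <- c("natural", "naturalized", "naturally", "adam", "adamant", "west")

# Stem every word, then group the original words by the stem they map to.
stems <- wordStem(words, language = "english")
stemToWords <- split(words, stems)

# Look up the original words behind one stem of interest.
stemToWords[["adam"]]
```

Grouping with split keeps every original word visible under its stem, so terms that collapse words with different meanings can be spotted by eye.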
EDIT: REPRODUCIBLE EXAMPLE
textVector = c("Trisha Takinawa: Here comes Mayor Adam West
himself. Mr. West do you have any words
for our viewers?Mayor Adam West: Box toaster
aluminum maple syrup... no I take that one
back. Im gonna hold onto that one.
Now MaxPower is adding adamant
so this example works")
corpus = Corpus(VectorSource(textVector))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, PlainTextDocument)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c(stopwords("english")))
corpus = tm_map(corpus, stemDocument, language="english")
dtm = DocumentTermMatrix(corpus)
mostFreqTerms = findFreqTerms(dtm, lowfreq=2)
mostFreqTerms
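... The above outputs mostFreqTerms
[1] "adam" "one" "west"
I am looking for a programmatic way to determine that the stem "adam" came from the original words "adam" and "adamant".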