How to see original words that are mapped to a specific phrase

Question

I am doing some text analysis using tm_map in R. I run the following code (without errors) to create a document-term matrix of stemmed and otherwise pre-processed words.

  corpus = Corpus(VectorSource(textVector))
  corpus = tm_map(corpus, tolower)
  corpus = tm_map(corpus, PlainTextDocument) 
  corpus = tm_map(corpus, removePunctuation)
  corpus = tm_map(corpus, removeWords, c(stopwords("english")))
  corpus = tm_map(corpus, stemDocument, language="english")

  dtm = DocumentTermMatrix(corpus)
  mostFreqTerms = findFreqTerms(dtm, lowfreq=125)

But when I look at my (final) mostFreqTerms, I see a couple that make me think: "Hmm, which words were stemmed to produce that?" In addition, there may be stems that make sense to me at first glance, but perhaps I am missing the fact that they actually contain words with different meanings.

I would like to apply the strategy/technique described in this SO answer about preserving certain terms during stemming (for example, keeping "natural" and "naturalized" from becoming the same stemmed term): Text-mining with the tm-package - word stemming

But to do this most thoroughly, I would like to see a list of all the individual words that were stemmed to my most frequent terms. Is there any way to find the words that, when stemmed, made up my mostFreqTerms list?

EDIT: REPRODUCIBLE EXAMPLE

textVector = c("Trisha Takinawa: Here comes Mayor Adam West 
               himself. Mr. West do you have any words 
               for our viewers?Mayor Adam West: Box toaster
               aluminum maple syrup... no I take that one 
               back. Im gonna hold onto that one. 
               Now MaxPower is adding adamant
               so this example works")

      corpus = Corpus(VectorSource(textVector))
      corpus = tm_map(corpus, tolower)
      corpus = tm_map(corpus, PlainTextDocument) 
      corpus = tm_map(corpus, removePunctuation)
      corpus = tm_map(corpus, removeWords, c(stopwords("english")))
      corpus = tm_map(corpus, stemDocument, language="english")

      dtm = DocumentTermMatrix(corpus)
      mostFreqTerms = findFreqTerms(dtm, lowfreq=2) 
      mostFreqTerms

... The above outputs mostFreqTerms

[1] "adam" "one" "west"

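I am looking for a programmatic way to determine that the stem "adam" came from the original words "adam" and "adamant".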

r nlp tm

Max Power 02 '15 16:59

Vinicius Woloszyn · Answer 1 · 2016-12-03T19:45:50+0000

The following Python script, using NLTK, prints the original words behind each of the most frequent stems:

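import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import RSLPStemmer
from nltk.tokenize import word_tokenize

# Requires the NLTK tokenizer, stop-word and RSLP data (see nltk.download).
# Note: RSLPStemmer is a Portuguese stemmer; for English text,
# nltk.stem.SnowballStemmer("english") is closer to tm's stemDocument.
st = RSLPStemmer()
punctuations = set(string.punctuation)
english_stopwords = set(stopwords.words('english'))
textVector = "Trisha Takinawa: Here comes Mayor adams West himself. Mr. \
            West do you have any words for our viewers?Mayor Adam Wester: \
    Box toaster aluminum maple syrup... no I take that one back. Im gonna hold \
    onto that one. Now MaxPower is adding adamant so this example works"

# Lowercase and tokenize, then drop punctuation and stop words.
tokens = word_tokenize(textVector.lower())
tokens = [w for w in tokens if w not in punctuations]
filtered_words = [w for w in tokens if w not in english_stopwords]
# stemmed_words[i] is the stem of filtered_words[i].
stemmed_words = [st.stem(w) for w in filtered_words]

allWordDist = nltk.FreqDist(stemmed_words)

# For each of the two most frequent stems, print every original word mapped to it.
for stem, _count in allWordDist.most_common(2):
    for original, stemmed in zip(filtered_words, stemmed_words):
        if stemmed == stem:
            print(stem + "=" + original)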

=

= Wester

=
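
The same idea can be sketched without printing one line per occurrence, by grouping the original words under each stem as they are processed. This is only a minimal illustration, not part of the answer above: `toy_stem` and `stems_to_originals` are hypothetical names, and `toy_stem` is a deliberately crude stand-in so the example runs with the standard library alone. In practice a real stemmer could be passed instead, for example `nltk.stem.SnowballStemmer("english").stem`.

```python
from collections import Counter, defaultdict
import re


def toy_stem(word):
    """Hypothetical stand-in stemmer: strips a few common English suffixes."""
    for suffix in ("ant", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stems_to_originals(tokens, stem=toy_stem):
    """Return (stem counts, mapping from each stem to the original words behind it)."""
    counts = Counter()
    originals = defaultdict(Counter)
    for token in tokens:
        s = stem(token)
        counts[s] += 1            # how often the stem occurs overall
        originals[s][token] += 1  # which original word produced it
    return counts, originals


# Shortened sample text (an assumption for the demo, not the full example above).
text = "Mayor Adam West himself. Mr. West, any words? Mayor Adam West is adamant."
tokens = re.findall(r"[a-z]+", text.lower())
counts, originals = stems_to_originals(tokens)

# Keep every stem that occurs at least twice, similar to findFreqTerms(dtm, lowfreq=2).
frequent = {s: dict(originals[s]) for s, n in counts.items() if n >= 2}
for s, words in sorted(frequent.items()):
    print(s, "<-", words)
```

With the toy stemmer, the stem "adam" is reported as coming from both "adam" and "adamant", which is the kind of merge the question wants to detect before deciding which terms to protect from stemming.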
