Word stemming using tm package in R not working properly?

Question

I am doing some text mining (PCA, HC, K-Means), and so far I have managed to code everything correctly. However, there is a small flaw that I would like to fix.

When I try to stem my corpus, it does not work properly: different words with the same root are not identified correctly. These are the words that interest me most (they are Spanish, and they mean "children" or something related):

niñera, niños, niñas, niña, niño

But when I run the code, all of these words stay unchanged except for:

 niña, niño --> niñ

The others are left as they are, so stemming ends up working only for niña / niño and not for the rest.

This is my code to create the corpus:

 corp <- Corpus(DataframeSource(data.frame(x$service_name)))
 docs <- tm_map(corp, removePunctuation)
 docs <- tm_map(docs, removeNumbers)
 docs <- tm_map(docs, tolower)
 docs <- tm_map(docs, removeWords, stopwords("spanish"))
 docs <- tm_map(docs, stemDocument, language = "spanish")
 docs <- tm_map(docs, PlainTextDocument)
 dtm <- DocumentTermMatrix(docs)
 dtm

I would really appreciate some suggestions. Thank you!

+6

r text-mining corpus

adrian1121 May 01 '16 at 14:08

2 answers

I fixed this by changing

 corpus <- tm_map(corpus, tolower)

to

 corpus <- tm_map(corpus, content_transformer(tolower))

and then applying stemDocument.
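For context, here is a minimal sketch of the whole pipeline with that change applied. It assumes a recent tm release with the SnowballC package installed. The use of `VCorpus` and the sample sentence (borrowed from the accepted answer) are choices made for illustration, not part of the original code.

```r
library(tm)  # stemDocument() relies on the SnowballC package being installed

# Sample text borrowed from the accepted answer (illustrative only)
corpus <- VCorpus(VectorSource("la niñera. los niños. las niñas. la niña. el niño."))

corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
# Wrapping tolower keeps each document a PlainTextDocument instead of
# turning it into a bare character vector
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("spanish"))
corpus <- tm_map(corpus, stemDocument, language = "spanish")
corpus <- tm_map(corpus, stripWhitespace)

# Split the stemmed document into its individual stems for inspection
stems <- strsplit(trimws(content(corpus[[1]])), " ", fixed = TRUE)[[1]]
print(stems)
```

If the stemming works as in the accepted answer, the printed stems should collapse the niño/niña variants into a single form, with niñera stemmed separately.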

+1

ResearchBigD Jan 30 '17 at 17:30

Ryan walker · Accepted Answer · 2016-05-01T19:05:56+0000

It seems that stemming can only be applied to PlainTextDocument types. See ?stemDocument.

 sp.corpus = Corpus(VectorSource(c("la niñera. los niños. las niñas. la niña. el niño.")))
 docs <- tm_map(sp.corpus, removePunctuation)
 docs <- tm_map(docs, removeNumbers)
 docs <- tm_map(docs, tolower)
 docs <- tm_map(docs, removeWords, stopwords("spanish"))
 docs <- tm_map(docs, PlainTextDocument)  # needs to come before stemming
 docs <- tm_map(docs, stemDocument, "spanish")
 print(docs[[1]]$content)
 # " niñer niñ niñ niñ niñ"

vs

 # WRONG
 sp.corpus = Corpus(VectorSource(c("la niñera. los niños. las niñas. la niña. el niño.")))
 docs <- tm_map(sp.corpus, removePunctuation)
 docs <- tm_map(docs, removeNumbers)
 docs <- tm_map(docs, tolower)
 docs <- tm_map(docs, removeWords, stopwords("spanish"))
 docs <- tm_map(docs, stemDocument, "spanish")  # WRONG: apply PlainTextDocument first
 docs <- tm_map(docs, PlainTextDocument)
 print(docs[[1]]$content)
 # " niñera niños niñas niña niñ"

In my opinion, this detail is not obvious, and it would be nice to get at least a warning when stemDocument is called on a non-PlainTextDocument.
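
One plausible explanation for the " niñera niños niñas niña niñ" result (this is an assumption, not something stated in the answer) is that once a document has been reduced to a bare character string, the whole string reaches the Snowball stemmer as if it were a single word, so only its tail gets stemmed. The sketch below calls `SnowballC::wordStem` directly to show the difference between stemming one long string and stemming individual tokens. It does not reproduce tm's internals.

```r
library(SnowballC)

sentence <- "niñera niños niñas niña niño"

# wordStem() expects one word per vector element, so a whole sentence
# passed as a single element is handled as one (very long) word
whole <- wordStem(sentence, language = "spanish")

# Splitting into tokens first lets every word be stemmed on its own
tokens <- strsplit(sentence, " ", fixed = TRUE)[[1]]
per_word <- wordStem(tokens, language = "spanish")

print(whole)
print(per_word)
```

Comparing the two printed results shows why the order of the transformations matters: the stemmer only does useful work when it sees one word at a time.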
