Word stemming using tm package in R not working properly?

I am doing text mining (PCA, HC, K-Means), and so far I have managed to code everything correctly. However, there is one small flaw that I would like to fix.

When I try to stem my corpus, it does not work properly: words that share the same root are not all reduced to it. These are the words that interest me most (they are Spanish and mean "children" or something related):

niñera, niños, niñas, niña, niño 

But when I run the code, only two of these words are stemmed:

 niña, niño --> niñ 

The others remain unchanged, so the stem only covers niña / niño and not the rest.

This is my code to create the corpus and the document-term matrix:

    corp <- Corpus(DataframeSource(data.frame(x$service_name)))
    docs <- tm_map(corp, removePunctuation)
    docs <- tm_map(docs, removeNumbers)
    docs <- tm_map(docs, tolower)
    docs <- tm_map(docs, removeWords, stopwords("spanish"))
    docs <- tm_map(docs, stemDocument, language = "spanish")
    docs <- tm_map(docs, PlainTextDocument)
    dtm <- DocumentTermMatrix(docs)
    dtm

I would really appreciate some suggestions! Thank you.

2 answers

It seems that stemming can only be applied to PlainTextDocument objects. See ?stemDocument.

    sp.corpus = Corpus(VectorSource(c("la niñera. los niños. las niñas. la niña. el niño.")))
    docs <- tm_map(sp.corpus, removePunctuation)
    docs <- tm_map(docs, removeNumbers)
    docs <- tm_map(docs, tolower)
    docs <- tm_map(docs, removeWords, stopwords("spanish"))
    docs <- tm_map(docs, PlainTextDocument)  # needs to come before stemming
    docs <- tm_map(docs, stemDocument, "spanish")
    print(docs[[1]]$content)
    # " niñer niñ niñ niñ niñ"

vs

    # WRONG
    sp.corpus = Corpus(VectorSource(c("la niñera. los niños. las niñas. la niña. el niño.")))
    docs <- tm_map(sp.corpus, removePunctuation)
    docs <- tm_map(docs, removeNumbers)
    docs <- tm_map(docs, tolower)
    docs <- tm_map(docs, removeWords, stopwords("spanish"))
    docs <- tm_map(docs, stemDocument, "spanish")  # WRONG: apply PlainTextDocument first
    docs <- tm_map(docs, PlainTextDocument)
    print(docs[[1]]$content)
    # " niñera niños niñas niña niñ"

In my opinion, this detail is not obvious, and it would be nice to get at least a warning when stemDocument is called on a non-PlainTextDocument.
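
One possible explanation for why only the last word changes, offered as a sketch and not as a description of tm internals: if the stemmer receives the whole document as one string, it treats that string as a single "word" and can only strip a suffix from its tail. The snippet below calls SnowballC::wordStem directly (the Snowball stemmer package that tm relies on) to compare stemming the whole string against stemming each word separately. The variable names are illustrative, and how a given tm version dispatches stemDocument on plain character data is an assumption here.

```r
library(SnowballC)

text <- "niñera niños niñas niña niño"

# Whole string passed as one "word": only the end of the string can be stemmed.
whole <- wordStem(text, language = "spanish")

# Split on spaces first, stem each word on its own, then rejoin.
tokens <- strsplit(text, " ", fixed = TRUE)[[1]]
per_word <- paste(wordStem(tokens, language = "spanish"), collapse = " ")

print(whole)
print(per_word)
```

Comparing the two printed strings against the two outputs in this answer should show which situation a given pipeline is in.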


I changed

 corpus <- tm_map(corpus, tolower) 

to

 corpus <- tm_map(corpus, content_transformer(tolower)) 

and then applied stemDocument.
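
Putting this together, a minimal sketch of the whole pipeline might look like the following. It assumes a recent tm release with SnowballC installed and uses VCorpus with the sample sentence from the other answer in place of the asker's data frame; wrapping tolower in content_transformer keeps each document a PlainTextDocument, so no separate conversion step should be needed before stemming.

```r
library(tm)

# Sample text borrowed from the other answer; replace with your own column.
texts <- c("la niñera. los niños. las niñas. la niña. el niño.")

corpus <- VCorpus(VectorSource(texts))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
# content_transformer() returns a document object, not a bare character vector.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("spanish"))
corpus <- tm_map(corpus, stemDocument, language = "spanish")

dtm <- DocumentTermMatrix(corpus)
print(Terms(dtm))
```

Printing Terms(dtm) is a quick way to check whether the niñ* variants collapsed into shared stems.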
