I am doing some text mining (PCA, HC, K-Means), and so far I have managed to code everything correctly. However, there is a small flaw that I would like to fix.
When I try to stem my corpus, it does not work properly: different words with the same root are not reduced to the same stem. These are the words I am most interested in (they are Spanish and mean “children” or something related):
niñera, niños, niñas, niña, niño
But when I run the code, these words are left unchanged, except for
niña, niño
The others remain the same, so I end up stemming only niña / niño, but not the rest.
This is my code to create the corpus:
    corp <- Corpus(DataframeSource(data.frame(x$service_name)))
    docs <- tm_map(corp, removePunctuation)
    docs <- tm_map(docs, removeNumbers)
    docs <- tm_map(docs, tolower)
    docs <- tm_map(docs, removeWords, stopwords("spanish"))
    docs <- tm_map(docs, stemDocument, language = "spanish")
    docs <- tm_map(docs, PlainTextDocument)
    dtm <- DocumentTermMatrix(docs)
    dtm
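To check the stemmer separately from the rest of the pipeline, the five words can be stemmed on their own. This is only a minimal sketch, assuming the SnowballC package is installed (as far as I know, it is what stemDocument relies on); the names `words` and `stems` are just for illustration, and the words are written with Unicode escapes so the check does not depend on the encoding of the script file:

```r
library(SnowballC)

# The five words of interest; \u00f1 is the escape for the letter ñ.
words <- c("ni\u00f1era", "ni\u00f1os", "ni\u00f1as", "ni\u00f1a", "ni\u00f1o")

# Stem each word directly with the Spanish Snowball stemmer.
stems <- wordStem(words, language = "spanish")

# Show each word next to its stem and the declared encoding of the input.
print(data.frame(word = words, stem = stems, encoding = Encoding(words)))
```

If the stems come out as expected here but not in the document-term matrix, the difference would have to come from an earlier step of the pipeline (for example, how the text in x$service_name is encoded) rather than from the stemmer itself.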
I would really appreciate any suggestions! Thank you.