Snowball stemmer only stems the last word

I want to stem the documents in a corpus of plain-text documents using the tm package in R. When I apply the SnowballStemmer function to all documents in the corpus, only the last word of each document is stemmed.

    library(tm)
    library(Snowball)
    library(RWeka)
    library(rJava)
    path <- c("C:/path/to/diretory")
    corp <- Corpus(DirSource(path), readerControl = list(reader = readPlain, language = "en_US", load = TRUE))
    tm_map(corp, SnowballStemmer)  # stemDocument has the same problem

I think this is due to the way the documents are read into the corpus. To illustrate this with a few simple examples:

    > vec<-c("running runner runs","happyness happies")
    > stemDocument(vec)
    [1] "running runner run" "happyness happi"

    > vec2<-c("running","runner","runs","happyness","happies")
    > stemDocument(vec2)
    [1] "run"    "runner" "run"    "happy"  "happi"

    > corp<-Corpus(VectorSource(vec))
    > corp<-tm_map(corp, stemDocument)
    > inspect(corp)
    A corpus with 2 text documents

    The metadata consists of 2 tag-value pairs and a data frame
    Available tags are:
      create_date creator
    Available variables in the data frame are:
      MetaID

    [[1]]
    run runner run

    [[2]]
    happy happi

    > corp2<-Corpus(DirSource(path),readerControl=list(reader=readPlain,language="en_US" , load=T))
    > corp2<-tm_map(corp2, stemDocument)
    > inspect(corp2)
    A corpus with 2 text documents

    The metadata consists of 2 tag-value pairs and a data frame
    Available tags are:
      create_date creator
    Available variables in the data frame are:
      MetaID

    $`1.txt`
    running runner runs

    $`2.txt`
    happyness happies
2 answers

Load the required libraries:

    library(tm)
    library(Snowball)

Create a vector:

    vec<-c("running runner runs","happyness happies")

Create a corpus from the vector:

    vec<-Corpus(VectorSource(vec))

It is very important to check the class of the documents in our corpus and to preserve it, because we want a standard corpus that R functions understand.

    class(vec[[1]])
    vec[[1]]
    <<PlainTextDocument (metadata: 7)>>
    running runner runs

This will most likely tell you that it is a PlainTextDocument.

Now we work around the faulty stemDocument function. First we convert the plain-text document to a character string, split it into words, apply stemDocument (which works fine on single words), and paste the stemmed words back together. Most importantly, we convert the output back to a PlainTextDocument, the class defined by the tm package.

    stemDocumentfix <- function(x) {
      PlainTextDocument(paste(stemDocument(unlist(strsplit(as.character(x), " "))), collapse = " "))
    }

Now we can use the standard tm_map on our corpus:

    vec1 = tm_map(vec, stemDocumentfix)

Result:

    vec1[[1]]
    <<PlainTextDocument (metadata: 7)>>
    run runner run
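To see why splitting first makes the difference, here is a minimal base R sketch of the split, stem, and paste steps. It does not use tm or Snowball: `toy_stem` is a hypothetical stand-in that just strips a few suffixes, and `stem_each_word` is an illustrative helper name, not part of any package. The point is only that a word-level function treats each vector element as one word, so it looks at the end of the whole string.

```r
# Toy suffix stripper standing in for a real stemmer (hypothetical, illustration only).
# Like a word-level stemmer, it treats each vector element as ONE word,
# so it only ever looks at the end of the string.
toy_stem <- function(words) sub("(ning|ing|s)$", "", words)

# Split each string on whitespace, stem word by word, then paste the words back together.
stem_each_word <- function(x, stem_fun = toy_stem) {
  vapply(
    strsplit(x, "[[:space:]]+"),
    function(words) paste(stem_fun(words[nzchar(words)]), collapse = " "),
    character(1)
  )
}

vec <- c("running runner runs", "jumping jumps")

whole <- toy_stem(vec)            # "running runner run" "jumping jump"
per_word <- stem_each_word(vec)   # "run runner run"     "jump jump"
```

Applied to the whole string, only the last word changes, which matches the symptom in the question. Splitting first gives the stemmer one word per element.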

The most important thing to remember is to always preserve the class of the documents in the corpus. I hope this is a simple solution to your problem that uses only functions from the 2 loaded libraries.


The problem I see is that wordStem accepts a vector of words, but the Corpus plainTextReader assumes that each word in the documents it reads is on its own line. In other words, the following would confuse plainTextReader, because you would end up with 3 "words" in your document:

    From ancient grudge break to new mutiny,
    Where civil blood makes civil hands unclean.
    From forth the fatal loins of these two foes

Instead, the document should be:

    From
    ancient
    grudge
    break
    to
    new
    mutiny
    where
    civil
    ...etc...

Note also that punctuation confuses wordStem, so you will have to strip it out as well.
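If you do want to convert the files, a rough base R sketch of that reshaping might look like the following. The sample text is the verse from above, the temporary file is only for illustration, and removing every punctuation character is an assumption about what "strip it out" should mean for your data.

```r
# Sample text from the example above; replace with the contents of a real document.
text <- c(
  "From ancient grudge break to new mutiny,",
  "Where civil blood makes civil hands unclean."
)

# Split on any whitespace, then drop punctuation and empty leftovers.
words <- unlist(strsplit(text, "[[:space:]]+"))
words <- gsub("[[:punct:]]+", "", words)
words <- words[nzchar(words)]

# Write one word per line to a temporary file (illustration only).
one_per_line <- tempfile(fileext = ".txt")
writeLines(words, one_per_line)
```

Reading the file back with `readLines(one_per_line)` should give one word per element.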

Another way to do this without changing your actual documents is to define a function that splits the text into words and strips non-alphanumeric characters that appear before or after each word. Here is a simple one:

    wordStem2 <- function(x) {
      # split the document text into words on spaces
      mywords <- unlist(strsplit(x, " "))
      # strip non-word characters from the start and end of each word
      mycleanwords <- gsub("^\\W+|\\W+$", "", mywords, perl = TRUE)
      # drop anything that is now empty
      mycleanwords <- mycleanwords[mycleanwords != ""]
      wordStem(mycleanwords)
    }
    corpA <- tm_map(mycorpus, wordStem2)
    corpB <- Corpus(VectorSource(corpA))

Now just use corpB as a regular Corpus.
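As a quick check of what the cleaning regex does and does not remove, here is a small base R sketch. The sample tokens are made up for illustration. It also shows one thing that may matter for multi-line documents: splitting on a single space does not split at line breaks, so splitting on whitespace in general may be the safer choice.

```r
# Made-up tokens: trailing comma, wrapping punctuation, internal punctuation, punctuation only.
tokens <- c("mutiny,", "(unclean.)", "star-cross'd", "--")

# Same pattern as in wordStem2: trim non-word characters at the edges only.
trimmed <- gsub("^\\W+|\\W+$", "", tokens, perl = TRUE)
trimmed <- trimmed[trimmed != ""]
# "mutiny" "unclean" "star-cross'd"

# A line break is not a space, so " " leaves the two words glued together.
two_lines <- "mutiny,\nWhere"
by_space <- unlist(strsplit(two_lines, " "))
by_whitespace <- unlist(strsplit(two_lines, "[[:space:]]+"))
```

Here `by_space` has one element and `by_whitespace` has two. Internal punctuation such as hyphens and apostrophes is left alone by the edge-only pattern.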
