StemCompletion does not work

I am using the tm package for text analysis: I read the data into a data frame, convert it to a Corpus object, and apply various cleaning steps (lowercasing, stripping whitespace, removing words, and so on).

I kept a backup copy of the Corpus object to use as the dictionary for stemCompletion.

I then ran stemDocument through the tm_map function; the words in my object were stemmed and I got the expected results.

When I run stemCompletion through the tm_map function, it does not work and I get the error below:

    Error in UseMethod("words") : no applicable method for 'words' applied to an object of class "character"

Running traceback() shows the following steps:

    > traceback()
    9: FUN(X[[1L]], ...)
    8: lapply(dictionary, words)
    7: unlist(lapply(dictionary, words))
    6: unique(unlist(lapply(dictionary, words)))
    5: FUN(X[[1L]], ...)
    4: lapply(X, FUN, ...)
    3: mclapply(content(x), FUN, ...)
    2: tm_map.VCorpus(c, stemCompletion, dictionary = c_orig)
    1: tm_map(c, stemCompletion, dictionary = c_orig)

How can I solve this error?

4 answers

I got the same error when using tm v0.6. I suspect this happens because stemCompletion is not among the default transformations in this version of the tm package:

    > getTransformations
    function ()
    c("removeNumbers", "removePunctuation", "removeWords", "stemDocument", "stripWhitespace")
    <environment: namespace:tm>

The tolower function now has the same problem, but it can be made to work by wrapping it in content_transformer. I tried a similar approach for stemCompletion but was not successful.

Note that even though stemCompletion is not a default transformation, it still works when stemmed words are passed to it manually:

    > stemCompletion("compani", dictCorpus)
        compani
    "companies"

So that I could carry on with my work, I manually split each document in the corpus on single spaces, passed the words through stemCompletion, and pasted them back together with the following (clunky and not elegant!) function:

    stemCompletion_mod <- function(x, dict=dictCorpus) {
      PlainTextDocument(stripWhitespace(paste(stemCompletion(unlist(strsplit(as.character(x), " ")), dictionary=dict, type="shortest"), sep="", collapse=" ")))
    }

where dictCorpus is just a copy of the cleaned corpus, taken before it was stemmed. The extra stripWhitespace is specific to my corpus, but is most likely harmless for a corpus in general. You can change the type parameter from "shortest" as needed.

As a complete example, let's set up a dummy corpus using the crude data in the tm package:

    > data("crude")
    > docs = Corpus(VectorSource(crude))
    > docs <- tm_map(docs, content_transformer(tolower))
    > docs <- tm_map(docs, removeNumbers)
    > docs <- tm_map(docs, removeWords, stopwords("english"))
    > docs <- tm_map(docs, removePunctuation)
    > docs <- tm_map(docs, stripWhitespace)
    > docs <- tm_map(docs, PlainTextDocument)
    > dictCorpus <- docs
    > docs <- tm_map(docs, stemDocument)

    > # Define modified stemCompletion function
    > stemCompletion_mod <- function(x, dict=dictCorpus) {
    +   PlainTextDocument(stripWhitespace(paste(stemCompletion(unlist(strsplit(as.character(x), " ")), dictionary=dict, type="shortest"), sep="", collapse=" ")))
    + }

    > # Original doc in crude data
    > crude[[1]]
    <<PlainTextDocument (metadata: 15)>>
    Diamond Shamrock Corp said that effective today it had cut its contract prices for crude oil by 1.50 dlrs a barrel.
    The reduction brings its posted price for West Texas Intermediate to 16.00 dlrs a barrel, the copany said.
    "The price reduction today was made in the light of falling oil product prices and a weak crude oil market," a company spokeswoman said.
    Diamond is the latest in a line of US oil companies that have cut its contract, or posted, prices over the last two days citing weak oil markets.
    Reuter

    > # Stemmed example in crude data
    > docs[[1]]
    <<PlainTextDocument (metadata: 7)>>
    diamond shamrock corp said effect today cut contract price crude oil dlrs barrel reduct bring post price west texa intermedi dlrs barrel copani said price reduct today made light fall oil product price weak crude oil market compani spokeswoman said diamond latest line us oil compani cut contract post price last two day cite weak oil market reuter

    > # Stem completed example in crude data
    > stemCompletion_mod(docs[[1]], dictCorpus)
    <<PlainTextDocument (metadata: 7)>>
    diamond shamrock corp said effect today cut contract price crude oil dlrs barrel reduction brings posted price west texas intermediate dlrs barrel NA said price reduction today made light fall oil product price weak crude oil market companies spokeswoman said diamond latest line us oil companies cut contract posted price last two day cited weak oil market reuter

Note: this example is a little odd, because the misspelled word in the original text goes through the process as "copany" → "copani" → "NA". Not sure how to fix that...
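One possible workaround for the NA, offered as a sketch rather than a tested fix: keep the original stem wherever completion finds no match. The helper name fill_missing is hypothetical (it is not part of tm), and the completed vector below is a hand-written stand-in for what stemCompletion returned in the example above, so the snippet runs without tm.

```r
# Hypothetical helper (not part of tm): fall back to the stem
# wherever the completion step produced NA.
fill_missing <- function(completed, stems) {
  completed <- unname(as.character(completed))
  ifelse(is.na(completed), stems, completed)
}

# Stand-in values taken from the example above; in real use,
# `completed` would be the result of stemCompletion(stems, dictCorpus).
stems     <- c("reduct", "copani", "compani")
completed <- c("reduction", NA, "companies")

fixed <- fill_missing(completed, stems)
print(fixed)
```

With this in place the misspelled word stays as its stem "copani" instead of turning into NA; whether that is preferable depends on what you do with the text afterwards.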

To run stemCompletion_mod over the entire corpus, I just use sapply (or parSapply with the snow package).

Perhaps someone with more experience than me can suggest a simpler change to get stemCompletion to work in v0.6 of the tm package.


I had success with the following workflow:

  • use content_transformer to apply an anonymous function to each corpus document,
  • split the document into words on spaces,
  • call stemCompletion on the words using a dictionary,
  • and combine the individual words into a document again using paste.

POC demo code:

    tm_map(c, content_transformer(function(x, d)
      paste(stemCompletion(strsplit(stemDocument(x), ' ')[[1]], d), collapse = ' ')), d)

PS: using c as a variable name to store the corpus is not a good idea because of base::c


Thanks, cdxsza. Your method worked for me.

A note to anyone who is going to use stemCompletion:

The function completes an empty string to a word in the dictionary, which is unexpected. See the example below, where the first "monday" was produced for the empty string created by the space at the beginning of the input.

    stemCompletion(unlist(strsplit(" mond tues ", " ")), dict=c("monday", "tuesday"))
    [1] "monday" "monday" "tuesday"

This can easily be fixed by removing the empty string "" before calling stemCompletion, as shown below.

    stemCompletion2 <- function(x, dictionary) {
      x <- unlist(strsplit(as.character(x), " "))
      x <- x[x != ""]
      x <- stemCompletion(x, dictionary=dictionary)
      x <- paste(x, sep="", collapse=" ")
      PlainTextDocument(stripWhitespace(x))
    }
    myCorpus <- lapply(myCorpus, stemCompletion2, dictionary=myCorpusCopy)
    myCorpus <- Corpus(VectorSource(myCorpus))
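To see where the empty string comes from, here is a small base-R sketch that does not need tm. It shows that splitting on a single space turns a leading space into an empty first token, and compares two ways of avoiding it. The variable names are only illustrative.

```r
x <- " mond tues "

# Splitting on a single space: the leading space yields an empty first token.
tokens_raw <- strsplit(x, " ")[[1]]
print(tokens_raw)

# Option 1: drop empty tokens after splitting (same idea as x[x != ""]).
tokens_clean <- tokens_raw[nzchar(tokens_raw)]

# Option 2: trim the string first, then split on runs of whitespace.
tokens_ws <- strsplit(trimws(x), "[[:space:]]+")[[1]]

print(tokens_clean)
print(tokens_ws)
```

Either cleaned vector can then be passed to stemCompletion; the second option also copes with repeated spaces inside the text.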

See a detailed example on page 12 of the slides at http://www.rdatamining.com/docs/RDataMining-slides-text-mining.pdf

Regards,

Yangchang Zhao

RdataMining.com


The problem is that using tolower directly (for example, myCorpus <- tm_map(myCorpus, tolower)) converts the documents to plain character values, which tm version 0.6 does not accept for use with tm_map.

If you instead apply your original tolower like this

myCorpus <- tm_map(myCorpus, content_transformer(tolower))

then the data will be in the correct format when you need stemCompletion.

Other functions, such as removePunctuation and removeNumbers, are used with tm_map as usual, i.e. without content_transformer.
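To illustrate why a plain character value triggers the error from the question, here is a toy sketch in base R. It only mimics the S3 dispatch involved: the words generic and its PlainTextDocument method below are simplified stand-ins written for this example, not tm's actual implementation.

```r
# Toy generic and method, standing in for tm's (assumption: simplified).
words <- function(x, ...) UseMethod("words")
words.PlainTextDocument <- function(x, ...) strsplit(x$content, " ")[[1]]

# A document object with the expected class dispatches fine.
doc <- structure(list(content = "crude oil"), class = "PlainTextDocument")
ok <- words(doc)
print(ok)

# A bare character string has no matching method, so dispatch fails
# with a "no applicable method" error, as in the question.
res <- tryCatch(words("crude oil"), error = function(e) e)
print(conditionMessage(res))
```

This matches the traceback in the question, where lapply(dictionary, words) fails as soon as an element of the dictionary is a plain character value.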

Reference: Stack Overflow
