How does tm interact with snow?

In a presentation on high-performance computing (High-Performance and Parallel Computing with R), it is noted that tm can use snow for parallel text mining. However, I did not find any examples demonstrating how to do this, although I did find some discussion of parallel computing with tm (R/Finance 2012). Can anyone shed some light on how tm interacts with a cluster created by snow?

EDIT: See BenBarnes's comment below. In particular:

According to ?tm_startCluster, this function searches for an MPI cluster (not a SOCK cluster) and "allow[s]" tm "to use the cluster." Perhaps this would be an alternative to Hadoop, since, given several prerequisites, snow can set up an MPI cluster.
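A minimal sketch of that route, under two assumptions I have not verified here: that the Rmpi package is installed (snow needs it to create an MPI cluster), and that your tm version still exports tm_startCluster()/tm_stopCluster() (these were present only in older releases):

```r
## Hedged sketch: have snow create an MPI cluster, then let tm find it.
## Assumes Rmpi is installed and tm is an older version that still
## provides tm_startCluster()/tm_stopCluster().
library(snow)
library(tm)

cl <- makeCluster(4, type = "MPI")   # MPI, not SOCK -- tm looks for MPI

tm_startCluster()                    # per ?tm_startCluster: finds the MPI
                                     # cluster and allows tm to use it

## ... transformations such as tm_map() may now run on the cluster ...

tm_stopCluster()
stopCluster(cl)
```

This is only an outline of how the pieces are documented to fit together, not a tested recipe; current tm versions use the parallel package instead.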

1 answer

LMGTFY: searching for "r-project tm parallel" produces this as the third hit:

Distributed Text Mining with tm

Copying directly from the slides:

Solution:

1. Distributed storage
   - Data set copied to the DFS (DistributedCorpus)
   - Only meta information about the corpus remains in memory
2. Parallel computation
   - Computational operations (Map) on all elements in parallel
   - MapReduce paradigm
   - Workhorses: tm_map() and TermDocumentMatrix()
   - Processed documents (revisions) can be retrieved on demand

Implemented in the tm plugin package tm.plugin.dc:

    ## Distributed Text Mining in R
    > library("tm.plugin.dc")
    > dc <- DistributedCorpus(DirSource("Data/reuters"),
    +                         list(reader = readReut21578XML))
    > dc <- as.DistributedCorpus(Reuters21578)
    > summary(dc)
    A corpus with 21578 text documents

    The metadata consists of 2 tag-value pairs and a data frame
    Available tags are:
      create_date creator
    Available variables in the data frame are:
      MetaID

    --- Distributed Corpus ---
    Available revisions:
      20100417144823
    Active revision: 20100417144823
    DistributedCorpus: Storage
    - Description: Local Disk Storage
    - Base directory on storage: /tmp/RtmpuxX3W7/file5bd062c2
    - Current chunk size [bytes]: 10485760
    > dc <- tm_map(dc, stemDocument)
    > print(object.size(Reuters21578), units = "Mb")
    109.5 Mb
    > dc
    A corpus with 21578 text documents
    > dc_storage(dc)
    DistributedCorpus: Storage
    - Description: Local Disk Storage
    - Base directory on storage: /tmp/RtmpuxX3W7/file5bd062c2
    - Current chunk size [bytes]: 10485760
    > dc[[3]]
    Texas Commerce Bancshares Inc's Texas Commerce Bank-Houston said it
    filed an application with the Comptroller of the Currency in an effort
    to create the largest banking network in Harris County. The bank said
    the network would link 31 banks having 13.5 billion dlrs in assets and
    7.5 billion dlrs in deposits. Reuter
    > print(object.size(dc), units = "Mb")
    0.6 Mb

A further search using the terms tm, snow, parLapply turns up this link:

With this code:

    library(snow)
    cl <- makeCluster(4, type = "SOCK")
    par(ask = TRUE)
    bigsleep <- function(sleeptime, mat) Sys.sleep(sleeptime)
    bigmatrix <- matrix(0, 2000, 2000)
    sleeptime <- rep(1, 100)
    tm <- snow.time(clusterApply(cl, sleeptime, bigsleep, bigmatrix))
    plot(tm)
    cat(sprintf("Elapsed time for clusterApply: %f\n", tm$elapsed))
    tm <- snow.time(parLapply(cl, sleeptime, bigsleep, bigmatrix))
    plot(tm)
    cat(sprintf("Elapsed time for parLapply: %f\n", tm$elapsed))
    stopCluster(cl)
