Manipulating character vectors inside a data.table object in R

I am fairly new to data.table and don't yet understand all its subtleties. I have looked through the documentation and other examples on SO as well, but could not find what I want, so please help!

I have a data.table which is basically a character vector (each record is a sentence):

DT=c("I love you","she loves me") DT=as.data.table(DT) colnames(DT) <- "text" setkey(DT,text) # > DT # text # 1: I love you # 2: she loves me 

What I would like to do is perform some basic string operations inside the data.table object. For example, add a new column in which each record is a character vector whose elements are the WORDS of the corresponding row of the text column.

So I would like to have, for example, a new charvec column, where:

    > DT[1]$charvec
    [1] "I"    "love" "you"
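For reference, the closest I get with plain strsplit is a list column; note that each cell then wraps the vector in a list, which is exactly the problem I describe below (a minimal sketch of what I mean):

    ## strsplit returns a list of character vectors, which data.table
    ## stores as a list column -- each cell is list-wrapped, not a bare vector.
    DT[, charvec := strsplit(text, " ")]
    DT[1]$charvec
    # [[1]]
    # [1] "I"    "love" "you"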

Of course, I would like to do this the data.table way, superfast, because I need to do such things on files that are > 1 GB, and to apply more complex and computationally heavy functions. Therefore, I want to avoid apply, lapply and mapply.

My closest attempt so far looks like this:

    myfun1 <- function(sentence){ strsplit(sentence, " ") }
    DU1 <- DT[, myfun1(text), by = text]
    DU2 <- DU1[, list(charvec = list(V1)), by = text]
    # > DU2
    #            text      charvec
    # 1:   I love you   I,love,you
    # 2: she loves me she,loves,me

For example, to remove the first word of each sentence, I did this:

    myfun2 <- function(l){ l[[1]][-1] }
    DV1 <- DU2[, myfun2(charvec), by = text]
    DV2 <- DV1[, list(charvec = list(V1)), by = text]
    # > DV2
    #            text  charvec
    # 1:   I love you love,you
    # 2: she loves me loves,me

The problem is that in the charvec column I have a list, not a vector...

    > str(DU2[1]$charvec)
    # List of 1
    #  $ : chr [1:3] "I" "love" "you"
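For clarity, I know I can unwrap each cell manually, but I would rather not do that on every access (a sketch of the workaround I want to avoid):

    ## Manually unwrapping does recover the bare character vector:
    DU2[1]$charvec[[1]]
    # [1] "I"    "love" "you"
    unlist(DU2[1]$charvec)
    # [1] "I"    "love" "you"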

1) How can I do what I want? The other functions I am going to use will subset the character vector, apply some hash to it, etc.

2) By the way, can I get to DU2 or DV2 in one line instead of two?

3) I do not understand the data.table syntax very well. Why does the V1 column disappear when list() is used inside [...]?

4) In another thread, I read a little about the cSplit function. Is it good? Is it well suited to data.table objects?

Thank you very much

UPDATE

Thanks to @Ananda Mahto. Perhaps I should be clearer about my final goal. I have a huge file of 10,000,000 sentences stored as strings. As a first step of this project, I want to hash the first 5 words of each sentence. 10,000,000 sentences will not even fit in my memory, so I first split them into 10 files of 1,000,000 sentences each, giving roughly ten 1 GB files. The following code takes several minutes on my laptop for just one such file.

    library(data.table)
    library(digest)
    num_row = 1000000
    DT <- fread("sentences.txt", nrows = num_row, header = FALSE,
                sep = "\t", colClasses = "character")
    DT = as.data.table(DT)
    colnames(DT) <- "text"
    setkey(DT, text)
    rawdata <- DT
    hash2 <- function(word){  # using library(digest)
      as.numeric(paste("0x", digest(word, algo = "murmur32"), sep = ""))
    }
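As a sanity check, hash2 parses the hex digest into a numeric scalar; here is a quick illustrative call (I don't show a value, since the exact number depends on digest's murmur32 implementation):

    ## hash2 parses the 8-hex-digit murmur32 digest as a number,
    ## so it returns a single numeric value per input string.
    h <- hash2("I love you so much")
    is.numeric(h) && length(h) == 1
    # [1] TRUE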

Then:

    print(system.time({
      colnames(rawdata) <- "sentence"
      rawdata <- lapply(rawdata, strsplit, " ")
      sentences_begin <- lapply(rawdata$sentence, function(x){ x[2:6] })
      hash_list <- sapply(sentences_begin, hash2)
      # remove(rawdata)
    }))  ## end of system.time for loading the data

I know I'm pushing R to its limits here, but I'm struggling to find a faster implementation, and I was thinking about the features of data.table... hence all my questions.

Here is an implementation that avoids lapply, but it's actually slower!

    print(system.time({
      myfun1 <- function(sentence){ strsplit(sentence, " ") }
      DU1 <- DT[, myfun1(text), by = text]
      DU2 <- DU1[, list(charvec = list(V1)), by = text]

      myfun2 <- function(l){ l[[1]][2:6] }
      DV1 <- DU2[, myfun2(charvec), by = text]
      DV2 <- DV1[, list(charvec = list(V1)), by = text]

      rebuildsentence <- function(S){
        paste(S, collapse = " ")
      }
      myfun3 <- function(l){ hash2(rebuildsentence(l[[1]])) }
      DW1 <- DV2[, myfun3(charvec), by = text]
    }))  # end of system.time

There is no lapply in this implementation, so I was hoping the hashing would be faster. However, since each cell of the column holds a list instead of a character vector, that may (?) be slowing everything down significantly.

Using the first code above (with lapply/sapply) took more than an hour on my laptop. I was hoping to speed it up with a more efficient data structure. People using Python, Java, etc. do similar work in a few seconds.

Of course, another way would be to find a faster hash function, but I assumed the one in the digest package is already optimized.

1 answer

I'm not quite sure exactly what you need, but you can try cSplit_l from my splitstackshape package to get a list column:

    library(splitstackshape)
    DU <- cSplit_l(DT, "DT", " ")
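If I recall the naming convention correctly, cSplit_l appends "_list" to the split column's name, so DU should now hold the original DT column plus a DT_list list column:

    ## Each element of the list column is one sentence's word vector.
    str(DU$DT_list)
    # List of 2
    #  $ : chr [1:3] "I" "love" "you"
    #  $ : chr [1:3] "she" "loves" "me"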

Then you can write a function like the one below to remove values from the list column:

    RemovePos <- function(inList, pos = 1) {
      lapply(inList, function(x) x[-c(pos[pos <= length(x)])])
    }

Usage example:

    DU[, list(RemovePos(DT_list, 1)), by = DT]
    #              DT       V1
    # 1:   I love you love,you
    # 2: she loves me loves,me

    DU[, list(RemovePos(DT_list, 2)), by = DT]
    #              DT     V1
    # 1:   I love you  I,you
    # 2: she loves me she,me

    DU[, list(RemovePos(DT_list, c(1, 2))), by = DT]
    #              DT  V1
    # 1:   I love you you
    # 2: she loves me  me

Update

Based on your `lapply` hatred, maybe you can try something like the following:

    ## Make a copy of your "text" column
    DT[, vals := text]

    ## Use `cSplit` to create a "long" dataset.
    ## Add a column to indicate the word position in the text.
    DTL <- cSplit(DT, "vals", " ", "long")[, ind := sequence(.N), by = text][]
    DTL
    #            text  vals ind
    # 1:   I love you     I   1
    # 2:   I love you  love   2
    # 3:   I love you   you   3
    # 4: she loves me   she   1
    # 5: she loves me loves   2
    # 6: she loves me    me   3

    ## Now you can extract values easily
    DTL[ind == 1]
    #            text vals ind
    # 1:   I love you    I   1
    # 2: she loves me  she   1

    DTL[ind %in% c(1, 3)]
    #            text vals ind
    # 1:   I love you    I   1
    # 2:   I love you  you   3
    # 3: she loves me  she   1
    # 4: she loves me   me   3
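And since your real goal involves the first five words of each sentence, the long format aggregates back easily; a sketch, assuming ind preserves the original word order within each sentence (which is what the sequence(.N) step above already relies on):

    ## Rebuild the first-five-word prefix per sentence from the long form.
    ## The toy sentences here have fewer than five words, so the "prefix"
    ## is just the whole sentence.
    DTL[ind <= 5, list(firstFive = paste(vals, collapse = " ")), by = text]
    #            text    firstFive
    # 1:   I love you   I love you
    # 2: she loves me she loves me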

Update 2

I don't know what type of timings you are getting, but as I mentioned in a comment, you can try using regular expressions so that you don't have to split the string and then paste it back together.

Here's a sample ....

Set some data to play with:

    library(data.table)
    DT <- data.table(
      text = c("This is a sentence with a lot of words.",
               "This is a sentence with some more words.",
               "Words and words and even some more words.",
               "But, I don't know how you want to deal with punctuation...",
               "Just one more sentence, for easy multiplication.")
    )
    DT2 <- rbindlist(replicate(10000/nrow(DT), DT, FALSE))
    DT3 <- rbindlist(replicate(1000000/nrow(DT), DT, FALSE))

Check out this gsub approach for extracting the first 5 words from each sentence....

    ## Regex to extract the first five words -- this should work....
    patt <- "^((?:\\S+\\s+){4}\\S+).*"

    ## Check out some of the timings
    system.time(temp <- DT2[, gsub(patt, "\\1", text)])
    #  user  system elapsed
    #  0.03    0.00    0.03
    system.time(temp2 <- DT3[, gsub(patt, "\\1", text)])
    #  user  system elapsed
    #     3       0       3

    head(temp)
    # [1] "This is a sentence with"  "This is a sentence with"     "Words and words and even"
    # [4] "But, I don't know how"    "Just one more sentence, for" "This is a sentence with"
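One caveat worth checking against your real data: if a sentence has fewer than five words, the pattern doesn't match, and gsub then returns the string unchanged, so short sentences pass through as-is:

    ## No match for sentences under five words -> string returned untouched.
    gsub(patt, "\\1", "too short")
    # [1] "too short"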

My guess at what you want to do....

    ## I'm assuming you want something like this....
    ## Takes about a minute on my system.
    ## ... but note the system time for the creation of "temp2" (without digest).
    ## Not sure if I interpreted your hash requirement correctly....
    system.time(out <- DT3[
      , firstFive := gsub(patt, "\\1", text)][
      , firstFiveHash := hash2(firstFive), by = 1:nrow(DT3)][])
    #   user  system elapsed
    #  62.14    0.05   62.20

    head(out)
    #                                                           text                   firstFive firstFiveHash
    # 1:                    This is a sentence with a lot of words.     This is a sentence with    4179639471
    # 2:                   This is a sentence with some more words.     This is a sentence with    4179639471
    # 3:                  Words and words and even some more words.    Words and words and even    2556713080
    # 4: But, I don't know how you want to deal with punctuation...        But, I don't know how   3765680401
    # 5:           Just one more sentence, for easy multiplication. Just one more sentence, for     298317689
    # 6:                    This is a sentence with a lot of words.     This is a sentence with    4179639471
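One further thought, not benchmarked here: this replicated sample has many duplicated prefixes, so you could compute the digest once per unique firstFive value instead of once per row by grouping on it. Whether this helps on your real data depends on how many duplicate prefixes you actually have:

    ## Hash each distinct prefix once; rows sharing a prefix reuse the value.
    ## With heavily duplicated prefixes this cuts the digest calls sharply;
    ## with mostly unique sentences it changes little.
    out2 <- DT3[, firstFive := gsub(patt, "\\1", text)][
                , firstFiveHash := hash2(firstFive[1]), by = firstFive][]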