I am a little new to data.table and don't yet understand all its subtleties. I looked through the documentation and other examples on SO as well, but could not find what I want, so please help!
I have a data.table which is basically a character vector (each record is a sentence):
DT = c("I love you", "she loves me")
DT = as.data.table(DT)
colnames(DT) <- "text"
setkey(DT, text)
What I would like to do is perform some basic string operations inside the data.table object. For example, add a new column in which each record is a character vector whose elements are the WORDS of the sentence in the text column.
So I would like to have, for example, a new charvec column where

> DT[1]$charvec
[1] "I"    "love" "you"
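A minimal sketch of the structure I have in mind (assuming a list column is the right container for a per-row character vector; I do not know if this is the idiomatic data.table way):

DT[, charvec := strsplit(text, " ")]   # strsplit() gives one char vector per row, stored as a list column
DT[1, charvec[[1]]]
# [1] "I"    "love" "you"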
Of course, I would like to do this the data.table way, superfast, because I need to do such things on files that are > 1 GB each and to apply more complex and computationally heavy functions. So no apply, lapply or mapply, please.
My closest attempt so far looks like this:
myfun1 <- function(sentence){strsplit(sentence," ")}
DU1 <- DT[,myfun1(text),by=text]
DU2 <- DU1[,list(charvec=list(V1)),by=text]
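(For my question 2) below: it seems the two steps can be collapsed into one line; this is just a sketch, if I read the j expression correctly:)

DU2 <- DT[, list(charvec = list(strsplit(text, " ")[[1]])), by = text]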
For example, to remove the first word of each sentence, I did this:
myfun2 <- function(l){l[[1]][-1]}
DV1 <- DU2[,myfun2(charvec),by=text]
DV2 <- DV1[,list(charvec=list(V1)),by=text]
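Again just a sketch: I think DV2 could also be built in a single step, by unwrapping and re-wrapping the list inside j:

DV2 <- DU2[, list(charvec = list(charvec[[1]][-1])), by = text]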
The problem is that in the charvec column I have a list, not a vector...
> str(DU2[1]$charvec)
# List of 1
#  $ : chr [1:3] "I" "love" "you"
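The only way I have found to get at the underlying character vector is to index into the list explicitly:

> DU2[1]$charvec[[1]]
# [1] "I"    "love" "you"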
1) How can I do what I want? The other functions I am going to use will take subsets of the char vector, apply some hash to it, etc.
2) By the way, can I get to DU2 or DV2 in one line instead of two?
3) I do not understand the data.table syntax very well. Why does the V1 column disappear with the list() command inside [...]?
4) In another thread, I read a little about the cSplit function. Is it good? Is it a function adapted to data.table objects?
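From what I read, cSplit comes from the splitstackshape package; if I understand correctly (untested on my data), usage would be roughly:

library(splitstackshape)
cSplit(DT, "text", sep = " ", direction = "wide")   # one column per word; direction = "long" gives one row per word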
Thank you very much
UPDATE
Thanks to @Ananda Mahto. Perhaps I should be clearer about my final goal. I have a huge file of 10,000,000 sentences stored as strings. As a first step of this project, I want to hash the first 5 words of each sentence. The 10,000,000 sentences will not even fit into my memory, so I first split them into 10 files of 1,000,000 sentences each, which gives around 10 x 1 GB files. The following code takes several minutes on my laptop for just one file.
library(data.table); library(digest)
num_row=1000000
DT <- fread("sentences.txt",nrows=num_row,header=FALSE,sep="\t",colClasses="character")
DT=as.data.table(DT)
colnames(DT) <- "text"
setkey(DT,text)
rawdata <- DT
hash2 <- function(word){
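(hash2 got truncated above; it is presumably just a wrapper around digest(). A hypothetical stand-in:)

hash2 <- function(word){
  # hypothetical reconstruction: digest() hashes any R object, md5 by default
  digest(word)
}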
then
print(system.time({
  colnames(rawdata) <- "sentence"
  rawdata <- lapply(rawdata,strsplit," ")
  sentences_begin <- lapply(rawdata$sentence,function(x){x[2:6]})
  hash_list <- sapply(sentences_begin,hash2)
}))
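One variation I am experimenting with, assuming the separator really is a single literal space, is fixed = TRUE, which skips the regex engine and is usually faster on large inputs:

rawdata <- lapply(rawdata, strsplit, " ", fixed = TRUE)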
I know that I am pushing R to its limits here, but I am struggling to find a faster implementation, and I was thinking about the features of data.table... hence all my questions.
Here is an implementation that avoids lapply, but it is actually slower!
print(system.time({
  myfun1 <- function(sentence){strsplit(sentence," ")}
  DU1 <- DT[,myfun1(text),by=text]
  DU2 <- DU1[,list(charvec=list(V1)),by=text]
  myfun2 <- function(l){l[[1]][2:6]}
  DV1 <- DU2[,myfun2(charvec),by=text]
  DV2 <- DV1[,list(charvec=list(V1)),by=text]
  rebuildsentence <- function(S){ paste(S,collapse=" ") }
  myfun3 <- function(l){hash2(rebuildsentence(l[[1]]))}
  DW1 <- DV2[,myfun3(charvec),by=text]
}))
There is no lapply in the data.table implementation above, so I was hoping the hashing would be faster. However, since in each column I have a list instead of a char vector, this may (?) slow everything down significantly.
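I also wonder whether the by=text grouping itself is part of the cost. A sketch that works on whole columns at once instead (though it reintroduces *apply, column-wise rather than group-wise):

print(system.time({
  DT[, charvec := strsplit(text, " ", fixed = TRUE)]   # one split per row, no grouping
  DT[, begin := lapply(charvec, `[`, 2:6)]             # words 2..6, as in myfun2
  DT[, hash := vapply(begin, hash2, character(1))]     # one hash per row
}))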
Using the first code above (with lapply/sapply) took more than 1 hour on my laptop. I was hoping to speed it up with a more efficient data structure; people using Python, Java, etc. do similar work in a few seconds.
Of course, another way would be to find a faster hash function, but I assumed that the functions in the digest package are already optimized.
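If it comes to that, digest() does expose several algorithms via its algo argument; a quick comparison sketch (assuming the non-cryptographic ones like xxhash64 are available in my version of digest):

s <- "I love you she loves me"
system.time(for (i in 1:100000) digest(s, algo = "md5"))       # the default
system.time(for (i in 1:100000) digest(s, algo = "xxhash64"))  # non-cryptographic, usually faster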