Splitting a string column row by row in a data.table

Question


This is my first SO question, so let me know if it can be improved. I am working on a natural language processing project in R and trying to create a data table containing test cases. Here I will build a simplified example:

    library(data.table)

    texts.dt <- data.table(
      string = c("one",
                 "two words",
                 "three words here",
                 "four useless words here",
                 "five useless meaningless words here",
                 "six useless meaningless words here just",
                 "seven useless meaningless words here just to",
                 "eigth useless meaningless words here just to fill",
                 "nine useless meaningless words here just to fill up",
                 "ten useless meaningless words here just to fill up space"),
      word.count = 1:10,
      stop.at.word = c(0, 1, 2, 2, 4, 3, 3, 6, 7, 5)
    )

This returns the data table we will work on:

                                                          string word.count stop.at.word
     1:                                                      one          1            0
     2:                                                two words          2            1
     3:                                         three words here          3            2
     4:                                  four useless words here          4            2
     5:                      five useless meaningless words here          5            4
     6:                  six useless meaningless words here just          6            3
     7:             seven useless meaningless words here just to          7            3
     8:        eigth useless meaningless words here just to fill          8            6
     9:      nine useless meaningless words here just to fill up          9            7
    10: ten useless meaningless words here just to fill up space         10            5

In the real application, the values in the stop.at.word column are determined randomly (with upper bound = word.count - 1). Also, the rows are not ordered by length, but that should not change anything.

The code should add two columns, input and output, where input contains the substring from the first word up to word number stop.at.word, and output contains the word that follows (a single word):

    > desired_result
                                                          string word.count stop.at.word
     1:                                                      one          1            0
     2:                                                two words          2            1
     3:                                         three words here          3            2
     4:                                  four useless words here          4            2
     5:                      five useless meaningless words here          5            4
     6:                  six useless meaningless words here just          6            3
     7:             seven useless meaningless words here just to          7            3
     8:        eigth useless meaningless words here just to fill          8            6
     9:      nine useless meaningless words here just to fill up          9            7
    10: ten useless meaningless words here just to fill up space         10            5
                                              input output
     1:
     2:                                         two  words
     3:                                 three words   here
     4:                                four useless  words
     5:              five useless meaningless words   here
     6:                     six useless meaningless  words
     7:                   seven useless meaningless  words
     8:   eigth useless meaningless words here just     to
     9: nine useless meaningless words here just to   fill
    10:          ten useless meaningless words here   just

Unfortunately, instead I get the following:

                                                          string word.count stop.at.word
     1:                                                      one          1            0
     2:                                                two words          2            1
     3:                                         three words here          3            2
     4:                                  four useless words here          4            2
     5:                      five useless meaningless words here          5            4
     6:                  six useless meaningless words here just          6            3
     7:             seven useless meaningless words here just to          7            3
     8:        eigth useless meaningless words here just to fill          8            6
     9:      nine useless meaningless words here just to fill up          9            7
    10: ten useless meaningless words here just to fill up space         10            5
        input output
     1:
     2:    NA     NA
     3:    NA     NA
     4:    NA     NA
     5:    NA     NA
     6:    NA     NA
     7:    NA     NA
     8:    NA     NA
     9:    NA     NA
    10:   ten     NA

Note the inconsistent results, with empty strings in row 1 and "ten" returned in row 10.

Here is the code I'm using:

    texts.dt[, c("input", "output") := .(
      substr(string,
             1,
             sapply(gregexpr(" ", string), "[", stop.at.word) - 1),
      substr(string,
             sapply(gregexpr(" ", string), "[", stop.at.word),
             sapply(gregexpr(" ", string), "[", stop.at.word + 1) - 1)
    )]

I have run many tests, and the substr calls work fine when I try them on individual rows in the console, but they fail when applied to the data table. I suspect I am missing something about scoping in data.table, but I have not been using this package for long, so I am quite confused.

I really appreciate the help. Thanks in advance!

+7

string r text-processing data.table

Luc frachon Apr 15 '16 at 15:21


3 answers

An alternative to @Frank's mapply approach is to use by = 1:nrow(texts.dt) with strsplit and paste:

    library(data.table)
    texts.dt[, `:=` (input = paste(strsplit(string, ' ')[[1]][1:stop.at.word][stop.at.word>0], collapse = " "),
                     output = strsplit(string, ' ')[[1]][stop.at.word + 1]),
             by = 1:nrow(texts.dt)]

which gives:

    > texts.dt
                                                          string word.count stop.at.word
     1:                                                      one          1            0
     2:                                                two words          2            1
     3:                                         three words here          3            2
     4:                                  four useless words here          4            2
     5:                      five useless meaningless words here          5            4
     6:                  six useless meaningless words here just          6            3
     7:             seven useless meaningless words here just to          7            3
     8:        eigth useless meaningless words here just to fill          8            6
     9:      nine useless meaningless words here just to fill up          9            7
    10: ten useless meaningless words here just to fill up space         10            5
                                              input output
     1:                                                one
     2:                                         two  words
     3:                                 three words   here
     4:                                four useless  words
     5:              five useless meaningless words   here
     6:                     six useless meaningless  words
     7:                   seven useless meaningless  words
     8:   eigth useless meaningless words here just     to
     9: nine useless meaningless words here just to   fill
    10:          ten useless meaningless words here   just

Instead of using [[1]], you can also wrap strsplit in unlist: unlist(strsplit(string, ' ')) rather than strsplit(string, ' ')[[1]]. This gives the same result.
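To make that equivalence concrete, here is a small base R sketch (the variable names are mine, purely for illustration). It also shows that the two forms only agree when strsplit receives a single string, which is what grouping by 1:nrow(texts.dt) ensures:

```r
# One string, as seen inside a by-row group
one_row <- "three words here"
via_index  <- strsplit(one_row, " ")[[1]]
via_unlist <- unlist(strsplit(one_row, " "))
same_for_one <- identical(via_index, via_unlist)

# Two strings at once: [[1]] keeps only the first string's words,
# while unlist() pools the words of all strings together
two_rows <- c("two words", "three words here")
first_only <- strsplit(two_rows, " ")[[1]]
pooled     <- unlist(strsplit(two_rows, " "))
```

Without the by-row grouping, the whole column would reach strsplit at once, and the two forms would no longer be interchangeable.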

Two other options:

1) with the stringi package:

    library(stringi)
    texts.dt[, `:=`(input = paste(stri_extract_all_words(string[stop.at.word>0], simplify = TRUE)[1:stop.at.word], collapse = " "),
                    output = stri_extract_all_words(string[stop.at.word>0], simplify = TRUE)[stop.at.word+1]),
             1:nrow(texts.dt)]

2) or an adaptation of this:

    texts.dt[stop.at.word>0,
             c('input','output') := tstrsplit(string,
                                              split = paste0("(?=(?>\\s+\\S*){", word.count - stop.at.word, "}$)\\s"),
                                              perl = TRUE)
             ][, output := sub('(\\w+).*','\\1',output)]

which both give:

    > texts.dt
                                                          string word.count stop.at.word
     1:                                                      one          1            0
     2:                                                two words          2            1
     3:                                         three words here          3            2
     4:                                  four useless words here          4            2
     5:                      five useless meaningless words here          5            4
     6:                  six useless meaningless words here just          6            3
     7:             seven useless meaningless words here just to          7            3
     8:        eigth useless meaningless words here just to fill          8            6
     9:      nine useless meaningless words here just to fill up          9            7
    10: ten useless meaningless words here just to fill up space         10            5
                                              input output
     1:                                          NA     NA
     2:                                         two  words
     3:                                 three words   here
     4:                                four useless  words
     5:              five useless meaningless words   here
     6:                     six useless meaningless  words
     7:                   seven useless meaningless  words
     8:   eigth useless meaningless words here just     to
     9: nine useless meaningless words here just to   fill
    10:          ten useless meaningless words here   just

+5

Jaap Apr 15 '16 at 16:09


    dt[, `:=`(input = sub(paste0('((\\s*\\w+){', stop.at.word, '}).*'), '\\1', string),
              output = sub(paste0('(\\s*\\w+){', stop.at.word, '}\\s*(\\w+).*'), '\\2', string))
       , by = stop.at.word][]
    #                                                      string word.count stop.at.word
    # 1:                                                      one          1            0
    # 2:                                                two words          2            1
    # 3:                                         three words here          3            2
    # 4:                                  four useless words here          4            2
    # 5:                      five useless meaningless words here          5            4
    # 6:                  six useless meaningless words here just          6            3
    # 7:             seven useless meaningless words here just to          7            3
    # 8:        eigth useless meaningless words here just to fill          8            6
    # 9:      nine useless meaningless words here just to fill up          9            7
    #10: ten useless meaningless words here just to fill up space         10            5
    #                                          input output
    # 1:                                                one
    # 2:                                         two  words
    # 3:                                 three words   here
    # 4:                                four useless  words
    # 5:              five useless meaningless words   here
    # 6:                     six useless meaningless  words
    # 7:                   seven useless meaningless  words
    # 8:   eigth useless meaningless words here just     to
    # 9: nine useless meaningless words here just to   fill
    #10:          ten useless meaningless words here   just

I'm not sure I understand the logic behind the empty output for the first row, but the trivial fix, if really needed, is left to the OP.
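To see what the two patterns do, here is a minimal base R sketch that applies them to a single row of the example (row 4, with stop.at.word = 2). The variable names s and n are mine, for illustration only. As far as I can tell, the grouping by stop.at.word is there because sub accepts only one pattern per call, so each distinct count needs its own pattern:

```r
s <- "four useless words here"   # row 4 of the example data
n <- 2                           # its stop.at.word value

# Group 1 captures the first n words; the trailing .* swallows the rest,
# so replacing the whole match with \\1 leaves just those n words
input <- sub(paste0('((\\s*\\w+){', n, '}).*'), '\\1', s)

# Skip n words, then group 2 captures the single word that follows
output <- sub(paste0('(\\s*\\w+){', n, '}\\s*(\\w+).*'), '\\2', s)

print(input)
print(output)
```

For this row the results match the table above: "four useless" for input and "words" for output.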

+5

eddi Apr 15 '16 at 16:23


Frank · Accepted Answer · 2016-04-15T15:42:34+0000

I would probably do

    texts.dt[stop.at.word > 0, c("input","output") := {
      sp = strsplit(string, " ")
      list(
        mapply(function(p,n) paste(p[seq_len(n)], collapse = " "), sp, stop.at.word),
        mapply(`[`, sp, stop.at.word+1L)
      )
    }]

    # partial result
    head(texts.dt, 4)

                        string word.count stop.at.word        input output
    1:                     one          1            0           NA     NA
    2:               two words          2            1          two  words
    3:        three words here          3            2  three words   here
    4: four useless words here          4            2 four useless  words

As an alternative:

    library(stringi)
    texts.dt[stop.at.word > 0, c("input","output") := {
      patt = paste0("((\\w+ ){", stop.at.word-1, "}\\w+) (.*)")
      m = stri_match(string, regex = patt)
      list(m[, 2], m[, 4])
    }]
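As for why the code in the question returns NA: my reading (it is not stated anywhere in the thread) is that sapply(gregexpr(" ", string), "[", stop.at.word) passes the whole stop.at.word vector as the index for every string, rather than one value per row. The result is a matrix, and substr then appears to use only its first length(string) entries, most of which come from the first string. A minimal base R sketch on three of the example rows, with names of my own choosing:

```r
# Toy data: three rows of the example where stop.at.word > 0
string <- c("two words", "three words here", "four useless words here")
stop.at.word <- c(1, 2, 2)

# Positions of the spaces in each string (a list, one element per string)
spaces <- gregexpr(" ", string)

# sapply() hands the WHOLE stop.at.word vector to `[` for every string,
# so each string yields three values and the result is a 3 x 3 matrix
whole_vector <- sapply(spaces, "[", stop.at.word)

# mapply() walks both arguments in parallel: one index per string
per_row <- mapply(function(pos, n) pos[n], spaces, stop.at.word)

# The input ends at the character just before the n-th space
input <- substr(string, 1, per_row - 1)
print(input)
```

This is also why the substr calls behave when tried on one row at a time: with a single string and a single stop.at.word value, the two forms coincide.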
