This is my first SO question, so let me know if it can be improved. I am working on a natural language processing project in R and trying to create a data table containing test cases. Here I will build a simplified example:
library(data.table)

texts.dt <- data.table(
  string = c(
    "one",
    "two words",
    "three words here",
    "four useless words here",
    "five useless meaningless words here",
    "six useless meaningless words here just",
    "seven useless meaningless words here just to",
    "eigth useless meaningless words here just to fill",
    "nine useless meaningless words here just to fill up",
    "ten useless meaningless words here just to fill up space"
  ),
  word.count = 1:10,
  stop.at.word = c(0, 1, 2, 2, 4, 3, 3, 6, 7, 5)
)
This produces the data table we will work with:
                                                      string word.count stop.at.word
 1:                                                      one          1            0
 2:                                                two words          2            1
 3:                                         three words here          3            2
 4:                                  four useless words here          4            2
 5:                      five useless meaningless words here          5            4
 6:                  six useless meaningless words here just          6            3
 7:             seven useless meaningless words here just to          7            3
 8:        eigth useless meaningless words here just to fill          8            6
 9:      nine useless meaningless words here just to fill up          9            7
10: ten useless meaningless words here just to fill up space         10            5
In the real application, the values in the stop.at.word column are determined randomly (with upper bound = word.count - 1). In addition, the rows are not ordered by length, but that should not change anything.
The code should add two columns, input and output, where input contains the first stop.at.word words of string and output contains the word that follows (a single word):
> desired_result
                                                      string word.count stop.at.word                                       input
 1:                                                      one          1            0
 2:                                                two words          2            1                                         two
 3:                                         three words here          3            2                                 three words
 4:                                  four useless words here          4            2                                four useless
 5:                      five useless meaningless words here          5            4              five useless meaningless words
 6:                  six useless meaningless words here just          6            3                     six useless meaningless
 7:             seven useless meaningless words here just to          7            3                   seven useless meaningless
 8:        eigth useless meaningless words here just to fill          8            6   eigth useless meaningless words here just
 9:      nine useless meaningless words here just to fill up          9            7 nine useless meaningless words here just to
10: ten useless meaningless words here just to fill up space         10            5          ten useless meaningless words here
         output
 1:
 2:       words
 3:        here
 4:       words
 5:        here
 6:       words
 7:       words
 8:          to
 9:        fill
10:        just
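To make the rule concrete, here is a minimal base R sketch of what I mean for a single string. The helper name split_at_word is hypothetical and is not part of my actual code. It assumes words are separated by exactly one space, and it leaves aside the stop.at.word = 0 case from row 1.

```r
# Hypothetical helper (illustration only): split one string after its
# first n words. Assumes single-space separators and n >= 1.
split_at_word <- function(string, n) {
  words <- strsplit(string, " ", fixed = TRUE)[[1]]
  list(
    input  = paste(words[seq_len(n)], collapse = " "),  # first n words
    output = words[n + 1]                               # the next word
  )
}

res <- split_at_word("four useless words here", 2)
res$input   # "four useless"
res$output  # "words"
```

This only describes the target for one row at a time. It is not the vectorised data.table version I am after.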
Unfortunately, instead I get the following:
                                                      string word.count stop.at.word input output
 1:                                                      one          1            0
 2:                                                two words          2            1    NA     NA
 3:                                         three words here          3            2    NA     NA
 4:                                  four useless words here          4            2    NA     NA
 5:                      five useless meaningless words here          5            4    NA     NA
 6:                  six useless meaningless words here just          6            3    NA     NA
 7:             seven useless meaningless words here just to          7            3    NA     NA
 8:        eigth useless meaningless words here just to fill          8            6    NA     NA
 9:      nine useless meaningless words here just to fill up          9            7    NA     NA
10: ten useless meaningless words here just to fill up space         10            5   ten     NA
Note the inconsistent results: empty strings in row 1, NA in rows 2 to 9, and "ten" returned as input in row 10.
Here is the code I'm using:
texts.dt[, c("input", "output") := .(
  substr(string, 1, sapply(gregexpr(" ", string), "[", stop.at.word) - 1),
  substr(string,
         sapply(gregexpr(" ", string), "[", stop.at.word),
         sapply(gregexpr(" ", string), "[", stop.at.word + 1) - 1)
)]
I have done many tests, and the substr calls work well when I try them on individual rows in the console, but they do not give the expected result when applied to the whole data table. I suspect that I am missing something related to the scope of the data table. But I have not used this package for a long time, so I am very confused.
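For reference, the single-row checks look roughly like the sketch below, in base R only. The names s, n, strings, stops and idx are just for illustration. I also included a call with two strings at once, because there the same sapply expression seems to return a matrix rather than one position per string, which may or may not be related to what I am seeing.

```r
# Single string, single stop position: same expressions as in the
# data.table call, but with scalar inputs.
s <- "four useless words here"
n <- 2

input  <- substr(s, 1, sapply(gregexpr(" ", s), "[", n) - 1)
output <- substr(s,
                 sapply(gregexpr(" ", s), "[", n),
                 sapply(gregexpr(" ", s), "[", n + 1) - 1)
input   # "four useless"
output  # " words" (with a leading space)

# Two strings and two stop positions at once: the whole stops vector is
# passed to "[" for every string.
strings <- c("two words", "three words here")
stops   <- c(1, 2)
idx <- sapply(gregexpr(" ", strings), "[", stops)
dim(idx)  # 2 2: one column per string, one row per element of stops
```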
I really appreciate the help. Thanks in advance!