Split line and transpose result

I have a dataset that records a width at each pixel position along the central skeleton of a fruit. The widths for each fruit are stored as a single comma-separated string.

 cukeDatatest <- read.delim("https://gist.githubusercontent.com/bhive01/e7508f552db0415fec1749d0a390c8e5/raw/a12386d43c936c2f73d550dfdaecb8e453d19cfe/widthtest.tsv")
 str(cukeDatatest) # or dplyr::glimpse(cukeDatatest)

I need to keep the File and FruitNum identifiers with each width.

The output I want has three columns, File, FruitNum, and ObjectWidth, with File and FruitNum repeated once per width for each fruit. Position matters, so sorting these vectors would be very bad. Also, each fruit has a different number of widths (in case that matters to your method).
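For illustration, here is a tiny mock version of the problem (made-up values, not the real gist data) showing the input shape and the long output I'm after, built with base R:

```r
# Toy stand-in for the real data: one comma-separated width string per fruit
toy <- data.frame(
  File        = c("a.jpg", "a.jpg", "b.jpg"),
  FruitNum    = c(1L, 2L, 1L),
  ObjectWidth = c("4,10,14", "22,26", "5,7,9,11"),
  stringsAsFactors = FALSE
)

# Desired result: one row per width, identifiers repeated, order preserved
widths <- strsplit(toy$ObjectWidth, ",", fixed = TRUE)
long <- data.frame(
  File        = rep(toy$File, lengths(widths)),
  FruitNum    = rep(toy$FruitNum, lengths(widths)),
  ObjectWidth = as.integer(unlist(widths)),
  stringsAsFactors = FALSE
)
head(long)
```

This base-R sketch works for the toy case, but I'm hoping for something cleaner that scales to 8000 fruits.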

I've used str_split() before to pull a few elements out of a string, but never anything this big or this numerous (I have 8000 of them). Processing time is a concern, but I'd rather wait for a correct result.

I'm more used to dplyr than data.table, but I see Arun has tackled something similar: split a text string into data.table columns.

3 answers

Using splitstackshape

 library(splitstackshape)
 res <- cSplit(cukeDatatest, splitCols = "ObjectWidth", sep = ",", direction = "long")

 # result
 head(res)
 #                            File FruitNum ObjectWidth
 # 1: IMG_7888.JPGcolcorrected.jpg        1           4
 # 2: IMG_7888.JPGcolcorrected.jpg        1          10
 # 3: IMG_7888.JPGcolcorrected.jpg        1          14
 # 4: IMG_7888.JPGcolcorrected.jpg        1          15
 # 5: IMG_7888.JPGcolcorrected.jpg        1          22
 # 6: IMG_7888.JPGcolcorrected.jpg        1          26

A Hadleyverse variant with some reasonable type conversion:

 library(dplyr)
 library(tidyr)

 cukeDatatest %>%
   # split ObjectWidth into a nested column containing a vector
   mutate(ObjectWidth = strsplit(as.character(.$ObjectWidth), ',')) %>%
   # unnest nested column, melting data to long form
   unnest() %>%
   # convert data to integer
   mutate(ObjectWidth = as.integer(ObjectWidth))

 # Source: local data frame [39,830 x 3]
 #
 #                            File FruitNum ObjectWidth
 #                          (fctr)    (int)       (int)
 # 1  IMG_7888.JPGcolcorrected.jpg        1           4
 # 2  IMG_7888.JPGcolcorrected.jpg        1          10
 # 3  IMG_7888.JPGcolcorrected.jpg        1          14
 # 4  IMG_7888.JPGcolcorrected.jpg        1          15
 # 5  IMG_7888.JPGcolcorrected.jpg        1          22
 # 6  IMG_7888.JPGcolcorrected.jpg        1          26
 # 7  IMG_7888.JPGcolcorrected.jpg        1          26
 # 8  IMG_7888.JPGcolcorrected.jpg        1          28
 # 9  IMG_7888.JPGcolcorrected.jpg        1          34
 # 10 IMG_7888.JPGcolcorrected.jpg        1          35
 # ..                          ...      ...         ...

Edit

Here's an equivalent version with a more typical tidyr approach. One wrinkle is the irregular number of terms in ObjectWidth, which makes it awkward to create column names, since separate() annoyingly has no default for its into parameter.

A simple workaround is to deliberately create more columns than you need (the extras are filled with NAs, which gather() removes afterwards). Though less efficient, the code still runs essentially instantly, so the overshoot is unlikely to hurt performance. If it does, find the length of the longest string with max(sapply(strsplit(as.character(cukeDatatest$ObjectWidth), ','), length)) and use that instead of a guess.

 cukeDatatest %>%
   # tbl_df conversion is unnecessary, but nice for printing purposes
   tbl_df() %>%
   # split ObjectWidth on commas into individual columns
   separate(ObjectWidth, into = paste0('X', 1:2500), sep = ',',
            fill = 'right', convert = TRUE) %>%
   # gather into long form
   gather(var, ObjectWidth, starts_with('X'), na.rm = TRUE) %>%
   # remove key column identifying term number within initial ObjectWidth string
   select(-var)

If you have a fixed number of terms in each ObjectWidth string, plain old read.csv called on the strings pasted together is a good route. read.csv guesses the number of columns from the first five rows, which is fine if the number is constant.

If it isn't (as with this data, where the longest row is the seventh), you run into the same issue as above, which can be sorted by passing col.names a set of names of the appropriate length. The same workaround as above applies here.

 read.csv(text = paste(as.character(cukeDatatest$ObjectWidth), collapse = '\n'),
          header = FALSE, col.names = paste0('V', 1:2179)) %>%
   bind_cols(cukeDatatest[,-3]) %>%
   gather(var, ObjectWidth, starts_with('V'), na.rm = TRUE) %>%
   select(-var)

Both approaches return a tbl_df exactly equivalent to the result of the first approach above.
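As an aside, the five-row look-ahead can be seen in isolation with a toy example (made-up strings, not the cucumber data). Here the widest row only appears at line 7, past read.csv's default scan, so col.names of the right length is supplied up front:

```r
# Six short rows, then one wider row at line 7 (beyond the 5-row look-ahead)
txt <- paste(c(rep("1,2", 6), "3,4,5"), collapse = "\n")

# Forcing three columns via col.names; short rows are padded with NA
res <- read.csv(text = txt, header = FALSE, col.names = paste0("V", 1:3))
dim(res)
```

Without the col.names argument, read.csv would settle on two columns from the first five rows and choke on the seventh.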


I usually start with a simple strsplit :

 dt[, strsplit(ObjectWidth, ",", fixed = T)[[1]], by = .(File, FruitNum)] 

If this is too slow, I would run strsplit on the entire column and then rebuild the data to my liking:

 l = strsplit(dt$ObjectWidth, ",", fixed = T)
 dt[inverse.rle(list(lengths = lengths(l), values = seq_along(l))),
    .(File, FruitNum)][, col := unlist(l)][]
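The inverse.rle() call is doing the heavy lifting: it builds a row-index vector that repeats each original row once per split-out width, which is what lets the File/FruitNum columns line up with unlist(l). A minimal sketch of just that step (toy widths, not the real data):

```r
# Widths for two hypothetical fruits, already split into a list
l <- list(c(4L, 10L), c(22L, 26L, 28L))

# Repeat row index i once per element of l[[i]]
idx <- inverse.rle(list(lengths = lengths(l), values = seq_along(l)))
idx  # 1 1 2 2 2
```

Indexing the original table with idx then pairs each identifier row with the matching entry of unlist(l), preserving order.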
