Split a string into several rows

Question

I have a data frame that looks like this:

    V1                      V2
    peanut butter sandwich  2 slices of bread 1 tablespoon peanut butter

What I want to get is:

    V1                      V2
    peanut butter sandwich  2 slices of bread
    peanut butter sandwich  1 tablespoon peanut butter

I tried to split the string using strsplit(df$V2, " "), but that splits on every space. I'm not sure how to split the string only at each number, so that each piece runs from one number up to the next.

+6

string split regex r strsplit

yokota Dec 21 '15 at 2:00


2 answers

Suppose you are dealing with something like:

    mydf <- data.frame(
      V1 = c("peanut butter sandwich", "peanut butter and jam sandwich"),
      V2 = c("2 slices of bread 1 tablespoon peanut butter",
             "2 slices of bread 1 tablespoon peanut butter 1 tablespoon jam"))
    mydf
    ##                               V1
    ## 1         peanut butter sandwich
    ## 2 peanut butter and jam sandwich
    ##                                                              V2
    ## 1                  2 slices of bread 1 tablespoon peanut butter
    ## 2 2 slices of bread 1 tablespoon peanut butter 1 tablespoon jam

You can first insert a separator that you don't expect to occur in "V2", and then use cSplit from my "splitstackshape" package to get the data into a "long" format.

    library(splitstackshape)
    mydf$V2 <- gsub(" (\\d+)", "|\\1", mydf$V2)
    cSplit(mydf, "V2", "|", "long")
    ##                                V1                         V2
    ## 1:         peanut butter sandwich          2 slices of bread
    ## 2:         peanut butter sandwich 1 tablespoon peanut butter
    ## 3: peanut butter and jam sandwich          2 slices of bread
    ## 4: peanut butter and jam sandwich 1 tablespoon peanut butter
    ## 5: peanut butter and jam sandwich           1 tablespoon jam
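To see what the gsub step does on its own, here is a minimal base R sketch on a single string. The variable names (x, marked, pieces) are only illustrative, and the sketch assumes that "|" never occurs in the ingredient text.

```r
x <- "2 slices of bread 1 tablespoon peanut butter 1 tablespoon jam"

# Replace each space that precedes a run of digits with "|",
# putting the digits back through the \\1 backreference.
marked <- gsub(" (\\d+)", "|\\1", x)
print(marked)

# "|" means alternation in a regular expression, so split on it literally.
pieces <- strsplit(marked, "|", fixed = TRUE)[[1]]
print(pieces)
```

The leading "2" is left alone because no space comes before it, so the first piece keeps its quantity.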

These aren't different enough to post as separate answers, since they are variations on @Jota's approach, but I'm including them here for completeness:

`strsplit` inside "data.table"

The list returned by strsplit is flattened into a single column with unlist ....

    library(data.table)
    as.data.table(mydf)[, list(
      V2 = unlist(strsplit(as.character(V2), '\\s(?=\\d)', perl=TRUE))),
      by = V1]

"dplyr" + "tidyr"

You can use unnest from "tidyr" to expand the list column into a long form ....

    library(dplyr)
    library(tidyr)
    mydf %>%
      mutate(V2 = strsplit(as.character(V2), " (?=\\d)", perl=TRUE)) %>%
      unnest(V2)

+5

A5C1D2H2I1M1N2O1R2T1 Dec 21 '15 at 2:15


Jota · Accepted Answer · 2015-12-21T02:09:08+0000

You can split the string as follows:

    txt <- "2 slices of bread 1 tablespoon peanut butter"
    strsplit(txt, " (?=\\d)", perl=TRUE)[[1]]
    #[1] "2 slices of bread"          "1 tablespoon peanut butter"

The regular expression used here looks for spaces that are followed by a digit. It uses a zero-width positive lookahead, (?=), to say that if the space is followed by a digit ( \\d ), then this is the kind of space we want to split on. Why a zero-width lookahead? Because we don't want the digit to be consumed as part of the separator; we just want to match any space that is followed by a digit.
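A small base R comparison may make the point about zero width concrete. This is only a sketch; the names consumed and kept are illustrative.

```r
txt <- "2 slices of bread 1 tablespoon peanut butter"

# Without the lookahead, the digit is part of the match, so strsplit
# throws it away together with the space.
consumed <- strsplit(txt, " \\d")[[1]]
print(consumed)

# With the lookahead, only the space is matched; the digit is merely
# checked for, so it stays at the start of the next piece.
kept <- strsplit(txt, " (?=\\d)", perl = TRUE)[[1]]
print(kept)
```

In the first call the second piece comes back as " tablespoon peanut butter", with the quantity lost; in the second call it is "1 tablespoon peanut butter". Note that perl = TRUE is needed for the lookahead syntax.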

To use this idea and build your data frame, see this example:

    item <- c("peanut butter sandwich", "onion carrot mix", "hash browns")
    txt <- c("2 slices of bread 1 tablespoon peanut butter",
             "1 onion 3 carrots", "potato")
    df <- data.frame(item, txt, stringsAsFactors=FALSE)

    # thanks to Ananda for recommending setNames
    split.strings <- setNames(strsplit(df$txt, " (?=\\d)", perl=TRUE), df$item)
    # alternately:
    #split.strings <- strsplit(df$txt, " (?=\\d)", perl=TRUE)
    #names(split.strings) <- df$item

    stack(split.strings)
    #                      values                    ind
    #1          2 slices of bread peanut butter sandwich
    #2 1 tablespoon peanut butter peanut butter sandwich
    #3                    1 onion       onion carrot mix
    #4                  3 carrots       onion carrot mix
    #5                     potato            hash browns
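The stack() call above returns columns named values and ind. If you would rather keep the original column names and order, one possible base R alternative (a sketch, not part of the answer above) is to repeat each item once per piece with rep() and lengths(). The names parts and long are illustrative, and lengths() assumes R 3.2.0 or later.

```r
df <- data.frame(
  item = c("peanut butter sandwich", "onion carrot mix", "hash browns"),
  txt  = c("2 slices of bread 1 tablespoon peanut butter",
           "1 onion 3 carrots",
           "potato"),
  stringsAsFactors = FALSE
)

# One character vector of pieces per row of df.
parts <- strsplit(df$txt, " (?=\\d)", perl = TRUE)

# Repeat each item as many times as its row produced pieces,
# then flatten the pieces into a single column.
long <- data.frame(
  item = rep(df$item, times = lengths(parts)),
  txt  = unlist(parts),
  stringsAsFactors = FALSE
)
print(long)
```

Rows with no digit after a space, such as "potato", yield a single piece and are carried through unchanged.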
