A quick way to check, row by row in an R data frame, whether the value in one column is contained in another column

I have a marketing data frame with 22 thousand records and 6 columns, 2 of which are of interest.

  • variable
  • FO.variable

Here is a link with the dput output of a sample of the data frame: http://dpaste.com/2SJ6DPX

Please let me know if there is a better way to share this data.

All I want to do is create an extra binary column, which should be:

  • 1 if FO.variable is inside variable
  • 0 if FO.variable is not inside variable

This seems like a simple thing. In Excel, I would just add another column with an IF formula and then fill the formula down. I have spent the last few hours trying to do this in R and failed.

Here is what I tried:

  • Using grepl for pattern matching. I have used grepl before, but this time I am trying to pass columns instead of a single string. My early attempts failed because grepl, even when wrapped in ifelse, used only the first value in the column as the pattern rather than the whole column.

  • My next attempt was to use transform and grep, based on another post on SO. I did not think this would give me my exact answer, but I thought it would get me close enough to figure out the rest. The code ran for a while and then stopped with an error about an invalid subscript.

    transform(dd, Keep = FO.variable[sapply(variable, grep, FO.variable)])

  • Next, I tried str_detect:

    kk <- sapply(dd$variable, function(x) any(sapply(dd$FO.variable, str_detect, string = x)))

  • EDIT: I also tried a for loop:

for(i in 1:nrow(dd)){ if(dd[i,4] %in% dd[i,2]) dd$test[i] <- 1 }
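One likely reason this loop misses rows (an assumption based on the sample output, where variable holds several comma-separated values): %in% tests whether the left-hand value equals an element of the right-hand vector, not whether it occurs inside a string. A minimal illustration with values in the style of the sample:

```r
fo  <- "Direct / Unknown"
var <- "Direct / Unknown,TV"

# var is a single string, so %in% only checks fo == var
whole_string <- fo %in% var                                    # FALSE

# Splitting on the comma first lets %in% compare against each part
split_parts  <- fo %in% strsplit(var, ",", fixed = TRUE)[[1]]  # TRUE

# grepl with fixed = TRUE does a literal substring search instead
substring    <- grepl(fo, var, fixed = TRUE)                   # TRUE
```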



+4
3

df = dget("http://dpaste.com/2SJ6DPX.txt")

Split the variable column into its comma-separated parts:

v = strsplit(as.character(df$variable), ",", fixed=TRUE)
len = lengths(v)    ## sapply(v, length) in R-3.1.3

Unlist v, keeping track of which row each element of v came from:

uv = unlist(v)
idx = rep(seq_along(v), len)

Then find where each element of uv equals the FO.variable value of its own row:

test = (uv == as.character(df$FO.variable)[idx])
df$Keep = FALSE
df$Keep[ idx[test] ] = TRUE
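To see these steps on a small scale, here is a minimal sketch on a made-up three-row data frame. The name toy and its rows are hypothetical (written in the style of the sample, not copied from it), and lengths() assumes R 3.2.0 or later:

```r
# Hypothetical toy data, not the linked sample
toy <- data.frame(
    variable    = c("Direct / Unknown,TV", "Organic Search,TV", "System Email,TV"),
    FO.variable = c("Direct / Unknown", "Direct / Unknown", "TV"),
    stringsAsFactors = FALSE
)

v   <- strsplit(toy$variable, ",", fixed = TRUE)  # one character vector per row
uv  <- unlist(v)                                  # all parts in a single vector
idx <- rep(seq_along(v), lengths(v))              # row number of each part

# TRUE where a part equals the FO.variable value of its own row
test <- uv == toy$FO.variable[idx]

toy$Keep <- FALSE
toy$Keep[idx[test]] <- TRUE
toy$Keep
```

Rows 1 and 3 contain their FO.variable value as one of the parts, so they are the ones flagged.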

(As a function that takes a data.frame, used as dd$Keep = f0(dd))

f0 = function(dd) {
    v = strsplit(as.character(dd$variable), ",", fixed=TRUE)
    len = lengths(v)
    uv = unlist(v)
    idx = rep(seq_along(v), len)

    keep = logical(nrow(dd))
    keep[ idx[uv == as.character(dd$FO.variable)[idx]] ] = TRUE
    keep
}

(For comparison, some alternatives based on grepl, followed by timings)

f1 = function(dd) 
    mapply(grepl, dd$FO.variable, dd$variable, fixed=TRUE)

f1a = function(dd)
    mapply(grepl, as.character(dd$FO.variable), 
           as.character(dd$variable), fixed=TRUE)

f2 = function(dd)
    apply(dd, 1, function(x) grepl(x[4], x[2], fixed=TRUE))

> library(microbenchmark)
> identical(f0(df), f1(df))
[1] TRUE
> identical(f0(df), unname(f2(df)))
[1] TRUE
> microbenchmark(f0(df), f1(df), f1a(df), f2(df))
Unit: microseconds
    expr     min       lq      mean   median       uq     max neval
  f0(df)  57.559  64.6940  70.26804  69.4455  74.1035  98.322   100
  f1(df) 573.302 603.4635 625.32744 624.8670 637.1810 766.183   100
 f1a(df) 138.527 148.5280 156.47055 153.7455 160.3925 246.115   100
  f2(df) 494.447 518.7110 543.41201 539.1655 561.4490 677.704   100

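In the grepl versions, fixed = TRUE makes the pattern match as a literal string rather than as a regular expression.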
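One caveat worth checking on real data: f0 compares whole comma-separated parts, while the grepl versions search for a substring, so the two can disagree when one channel name is contained in another. The sketch below uses a made-up channel name ("Cable TV") that does not appear in the sample, and writes both comparisons inline so that it is self-contained:

```r
# Hypothetical data: "TV" is a substring of the made-up channel "Cable TV"
dd <- data.frame(
    variable    = c("Cable TV,System Email", "Organic Search,TV"),
    FO.variable = c("TV", "TV"),
    stringsAsFactors = FALSE
)

# Substring match, as in the mapply/grepl versions
by_substring <- unname(mapply(grepl, dd$FO.variable, dd$variable, fixed = TRUE))

# Whole-part match, the same kind of comparison f0 performs
parts   <- strsplit(dd$variable, ",", fixed = TRUE)
by_part <- mapply(function(fo, p) fo %in% p, dd$FO.variable, parts,
                  USE.NAMES = FALSE)

by_substring  # TRUE TRUE
by_part       # FALSE TRUE
```

If no channel name in the data is contained in another, the two approaches agree, as the identical() checks on the sample show.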

+3

You can use mapply with grepl here, which works row by row. For better performance, use fixed = TRUE and convert the columns to character a priori.

transform(dd, Keep = mapply(grepl, 
                            as.character(FO.variable), 
                            as.character(variable), 
                            fixed = TRUE))

#    VisitorIDTrue                        variable value      FO.variable FO.value  Keep
# 22      44888657 Direct / Unknown,Organic Search     1 Direct / Unknown        1  TRUE
# 2       44888657   Direct / Unknown,System Email     1 Direct / Unknown        1  TRUE
# 6       44888657             Direct / Unknown,TV     1 Direct / Unknown        1  TRUE
# 10      44888657     Organic Search,System Email     1 Direct / Unknown        1 FALSE
# 18      44888657               Organic Search,TV     1 Direct / Unknown        1 FALSE
# 14      44888657                 System Email,TV     1 Direct / Unknown        1 FALSE
# 24      44888657 Direct / Unknown,Organic Search     1   Organic Search        1  TRUE
# 4       44888657   Direct / Unknown,System Email     1   Organic Search        1 FALSE
...
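The fixed = TRUE argument also matters for correctness when a value contains regular-expression metacharacters. A small sketch with a hypothetical channel name containing parentheses (no such value appears in the sample):

```r
fo  <- "Paid (Search)"       # hypothetical channel name
var <- "Paid (Search),TV"

# As a regular expression, the parentheses form a group, so the pattern
# looks for the text "Paid Search", which is not present
as_regex <- grepl(fo, var)                 # FALSE

# As a fixed string, the pattern is matched character for character
as_fixed <- grepl(fo, var, fixed = TRUE)   # TRUE
```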
+3

Here is an approach based on data.table, which, in my opinion, is very similar in spirit to Martin's:

require(data.table)

dt <- data.table(df)
dt[,`:=`(
    fch = as.character(FO.variable),
    rn  = 1:.N
)]

dt[,keep:=FALSE]
dtvars <- dt[,strsplit(as.character(variable),',',fixed=TRUE),by=rn]
setkey(dt,rn,fch)
dt[dtvars,keep:=TRUE]

dt[,c("fch","rn"):=NULL]

The idea is to

  • identify all pairs of rn and variable (stored in dtvars) and
  • look up which of these pairs match the pairs of rn and FO.variable (in the source table dt).
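For readers without data.table, the same pair-matching idea can be sketched in base R with merge(). This is only an illustration of the idea, not the code from the answer; the toy data and the names pairs, target and matched are hypothetical:

```r
# Hypothetical toy data
df <- data.frame(
    variable    = c("Direct / Unknown,TV", "Organic Search,TV", "TV"),
    FO.variable = c("Direct / Unknown", "Direct / Unknown", "TV"),
    stringsAsFactors = FALSE
)

parts <- strsplit(df$variable, ",", fixed = TRUE)

# All (row number, part) pairs found in variable
pairs <- data.frame(rn    = rep(seq_along(parts), lengths(parts)),
                    token = unlist(parts),
                    stringsAsFactors = FALSE)

# One (row number, FO.variable) pair per row
target <- data.frame(rn    = seq_len(nrow(df)),
                     token = df$FO.variable,
                     stringsAsFactors = FALSE)

# Rows whose target pair also occurs among the variable pairs
matched <- merge(pairs, target, by = c("rn", "token"))
df$keep <- seq_len(nrow(df)) %in% matched$rn
df$keep
```

The keyed join in data.table does the same lookup without building the intermediate merged table.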
+2