Search for rows in a data frame in R

I have sequences of numbers that do not necessarily have the same length, for example:

0,0,1,2,1,0,0,0

1,1,0,1

2,1,2,0,1,0

I imported them into a data frame in R; for example, the three lines above give the following three rows (which I will call df):

[Image of the data frame df, not reproduced here.]
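Since the data frame is not available as text, here is a minimal sketch of how it might be built from the three example lines. It assumes that shorter rows are padded with NA on the right, which is consistent with the NA examples in the answers; the name dat follows the answers rather than the question, and the intermediate names are illustrative.

```r
# Assumption: each input line becomes one row, padded with NA on the right.
lines <- c("0,0,1,2,1,0,0,0", "1,1,0,1", "2,1,2,0,1,0")
rows  <- lapply(strsplit(lines, ","), as.numeric)
width <- max(lengths(rows))

# Indexing past the end of a vector yields NA, which pads the short rows.
padded <- t(sapply(rows, function(r) r[seq_len(width)]))
dat <- as.data.frame(padded)
dat
```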

I want to write some functions that will help me understand the data. As a starting point, given a numeric vector x, I need a "process" P that determines the number of rows that contain x as a subvector. For example, if x = c(2,1) then P(x) = 2, if x = c(0,0,0) then P(x) = 1, and if x = c(1,3) then P(x) = 0. I have many similar questions, but I hope I can take the logic from this one and work out the others myself.

2 answers

Edit: a regex approach:

    match.regex <- function(x,data){
      xs <- paste(x,collapse="_")
      dats <- apply(data,1,paste,collapse="_")
      sum(grepl(xs,dats))
    }

    > match.regex(c(1),dat)
    [1] 3
    > match.regex(c(0,0,0),dat)
    [1] 1
    > match.regex(c(1,2),dat)
    [1] 2
    > match.regex(5,dat)
    [1] 0

Surprisingly, this is faster than the other methods given here, and roughly twice as fast as my solution below (2.4 times in the benchmark shown), on both small and large datasets. Regular expressions seem to be well optimized:

    > benchmark(matching(c(1,2),dat),match.regex(c(1,2),dat),replications=1000)
                           test replications elapsed relative
    2 match.regex(c(1, 2), dat)         1000    0.15      1.0
    1    matching(c(1, 2), dat)         1000    0.36      2.4
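One caveat with the regex approach: because the values are pasted together and searched as a substring, a multi-digit value can match partially (searching for 1 would also hit a row containing 11). The following is a sketch of a variant that wraps both the pattern and each row in delimiters and matches literally. The name match.regex.safe and the example data dat2 are hypothetical and not part of the original answer.

```r
match.regex.safe <- function(x, data) {
  # Surround every value with "_" so that 1 cannot match inside 11 or 21.
  xs <- paste0("_", paste(x, collapse = "_"), "_")
  dats <- apply(data, 1, function(r) paste0("_", paste(r, collapse = "_"), "_"))
  # fixed = TRUE treats the pattern as a literal string, not as a regex.
  sum(grepl(xs, dats, fixed = TRUE))
}

# Hypothetical data: the value 11 is chosen to show the difference.
dat2 <- data.frame(a = c(11, 1), b = c(2, 2), c = c(0, 0))
match.regex.safe(c(1, 2), dat2)   # 1: only the second row contains 1 followed by 2
```

The extra paste0 calls add little work, so this variant should not change the timing picture much, although that is not measured here.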

An approach that gives you the count directly and works in a more vectorized way looks like this:

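    matching.row <- function(x,row){
      nx <- length(x)
      # candidate start positions: where the first element of x occurs
      sid <- which(x[1]==row)
      # compare the window starting at each candidate with x
      any(sapply(sid,function(i) all(row[seq(i,i+nx-1)]==x)))
    }

    # count the rows that contain x; NA results are dropped
    matching <- function(x,data)
      sum(apply(data,1,function(i) matching.row(x,i)),na.rm=TRUE)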

For each row, you first find the positions where the first element of the vector occurs. From each of these positions you take a window of the same length as the vector you want to match and compare it with that vector. This is applied to every row, and the number of rows returning TRUE is what you want.
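The same window comparison can also be written by materializing every window as a matrix row. The sketch below uses embed() from the stats package for that; it is an alternative formulation, not the code from this answer. The names contains.window and count.rows are hypothetical, and the example data assumes NA padding.

```r
contains.window <- function(x, row) {
  nx <- length(x)
  if (nx > length(row)) return(FALSE)
  # embed() returns one window per matrix row, in reversed order,
  # so the columns are flipped back with nx:1.
  windows <- embed(row, nx)[, nx:1, drop = FALSE]
  # A window matches when all of its elements equal x;
  # comparisons involving NA are treated as non-matches.
  hits <- apply(windows, 1, function(w) isTRUE(all(w == x)))
  any(hits)
}

count.rows <- function(x, data) {
  sum(apply(data, 1, function(r) contains.window(x, r)))
}

# Example data from the question, assuming shorter rows are padded with NA.
dat <- rbind(c(0, 0, 1, 2, 1, 0, 0, 0),
             c(1, 1, 0, 1, NA, NA, NA, NA),
             c(2, 1, 2, 0, 1, 0, NA, NA))
count.rows(c(2, 1), dat)      # 2
count.rows(c(0, 0, 0), dat)   # 1
count.rows(c(1, 3), dat)      # 0
```

Building all windows costs more memory than checking only the candidate start positions, so this is meant as an illustration rather than a faster method.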

    > matching(c(1),dat)
    [1] 3
    > matching(c(0,0,0),dat)
    [1] 1
    > matching(c(1,2),dat)
    [1] 2
    > matching(5,dat)
    [1] 0

You need to apply a function to the rows of your data:

 apply(dat, MARGIN = 1, FUN = is.sub.array, x = c(2,1)) 

where dat is your data.frame and is.sub.array is a function that checks whether x is contained in a larger vector (in practice, the rows of your data.frame).

I am not aware of any existing is.sub.array function, so here is how I wrote it:

    is.sub.array <- function(x, y) {
      j <- rep(TRUE, length(y))
      for (i in seq_along(x)) {
        if (i > 1) j <- c(FALSE, head(j, -1))
        j <- j & vapply(y, FUN = function(a,b) isTRUE(all.equal(a, b)),
                        FUN.VALUE = logical(1), b = x[i])
      }
      return(sum(j, na.rm = TRUE) > 0L)
    }

(The advantage of using all.equal is that it can be used to compare numeric values, which regular expressions cannot do.)
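As a small illustration of what a numeric comparison with all.equal offers: it compares numbers up to a small tolerance, whereas == requires exact equality. This is a minimal sketch in base R with made-up values.

```r
a <- 0.1 + 0.2
b <- 0.3

a == b                    # FALSE: the two doubles differ in the last bits
isTRUE(all.equal(a, b))   # TRUE: equal within the default tolerance

# all.equal returns a description rather than FALSE on a mismatch,
# which is why it is wrapped in isTRUE().
isTRUE(all.equal(1, 2))   # FALSE
```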

Here are some examples:

    apply(dat, 1, is.sub.array, x = c(1, 2))
    # [1] TRUE FALSE TRUE
    apply(dat, 1, is.sub.array, x = c(0, 0, 0))
    # [1] TRUE FALSE FALSE
    apply(dat, 1, is.sub.array, x = as.numeric(c(NA, NA)))
    # [1] FALSE TRUE TRUE

Note: all.equal is sensitive to data type, so be careful to use an x of the same type as your data (integer or numeric).
