Search for rows in a data frame in R

Question


I have strings of numbers that do not necessarily have the same length, for example:

0,0,1,2,1,0,0,0

1,1,0,1

2,1,2,0,1,0

I imported them into a data frame in R. For example, the three lines above give the following three rows (which I will call df):

[Image not preserved: the data frame df]
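Since the original screenshot is not available, here is a minimal sketch of how such a data frame might be built. It assumes the shorter rows are padded with NA up to the longest length, which is consistent with the NA example in the first answer but is not confirmed by the question itself.

```r
# The three example lines from the question, as numeric vectors.
rows <- list(
  c(0, 0, 1, 2, 1, 0, 0, 0),
  c(1, 1, 0, 1),
  c(2, 1, 2, 0, 1, 0)
)

# Assumption: shorter rows are padded with NA up to the longest row.
width  <- max(lengths(rows))
padded <- lapply(rows, function(r) c(r, rep(NA_real_, width - length(r))))

# Stack the padded rows into a 3 x 8 data frame (columns V1 ... V8).
df <- as.data.frame(do.call(rbind, padded))
print(df)
```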

I want to write some functions that will help me understand the data. As a starting point, given a numeric vector x, I need a "process" P that determines the number of rows containing x as a subvector. For example, if x = c(2,1) then P(x) = 2, if x = c(0,0,0) then P(x) = 1, and if x = c(1,3) then P(x) = 0. I have many similar questions, but I hope I can take the logic from this one and work out the others myself.

+7

r

user1873334 Dec 19 '12 at 11:16


2 answers

You need an apply function for your data rows:

 apply(dat, MARGIN = 1, FUN = is.sub.array, x = c(2,1))

where dat is your data.frame and is.sub.array is a function that checks whether x is contained in a larger vector (in practice, a row of your data.frame).

I am not aware of any existing is.sub.array function, so here is how I wrote it:

 is.sub.array <- function(x, y) {
   j <- rep(TRUE, length(y))
   for (i in seq_along(x)) {
     if (i > 1) j <- c(FALSE, head(j, -1))
     j <- j & vapply(y, FUN = function(a,b) isTRUE(all.equal(a, b)),
                     FUN.VALUE = logical(1), b = x[i])
   }
   return(sum(j, na.rm = TRUE) > 0L)
 }

(The advantage of using all.equal is that it compares numeric values within a tolerance, which == cannot do reliably for floating-point numbers.)
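As a small illustration of that point, the following sketch compares == with all.equal on a classic floating-point case and on missing values. The variable names are only for illustration.

```r
# Two values that are mathematically equal but differ in floating point.
a <- 0.1 + 0.2
b <- 0.3

exact    <- a == b                    # FALSE: exact comparison fails
tolerant <- isTRUE(all.equal(a, b))   # TRUE: equal within tolerance

# Missing values: == yields NA, while all.equal treats two NAs as equal.
na.exact    <- NA_real_ == NA_real_                    # NA
na.tolerant <- isTRUE(all.equal(NA_real_, NA_real_))   # TRUE

print(c(exact = exact, tolerant = tolerant,
        na.exact = na.exact, na.tolerant = na.tolerant))
```

Wrapping the call in isTRUE matters because all.equal returns a character description of the difference, not FALSE, when the values differ.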

Here are some examples:

 apply(dat, 1, is.sub.array, x = c(1, 2))
 # [1]  TRUE FALSE  TRUE
 apply(dat, 1, is.sub.array, x = c(0, 0, 0))
 # [1]  TRUE FALSE FALSE
 apply(dat, 1, is.sub.array, x = as.numeric(c(NA, NA)))
 # [1] FALSE  TRUE  TRUE

Note: all.equal is sensitive to data type, so take care to pass an x of the same type as your data (integer or numeric).
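To get the count P(x) that the question asks for, the logical vector can simply be summed. The sketch below is self-contained: it repeats is.sub.array from this answer, rebuilds the example data as a matrix (the NA padding is an assumption, since the original screenshot is not available), and wraps the call in a hypothetical helper named P.

```r
# Example data from the question; the NA padding is an assumption.
dat <- rbind(
  c(0, 0, 1, 2, 1, 0, 0, 0),
  c(1, 1, 0, 1, NA, NA, NA, NA),
  c(2, 1, 2, 0, 1, 0, NA, NA)
)

# is.sub.array as given in this answer: TRUE if x occurs as a run inside y.
is.sub.array <- function(x, y) {
  j <- rep(TRUE, length(y))
  for (i in seq_along(x)) {
    if (i > 1) j <- c(FALSE, head(j, -1))
    j <- j & vapply(y, FUN = function(a, b) isTRUE(all.equal(a, b)),
                    FUN.VALUE = logical(1), b = x[i])
  }
  sum(j, na.rm = TRUE) > 0L
}

# Hypothetical wrapper: P(x) is the number of rows that contain x.
P <- function(x, data) sum(apply(data, 1, is.sub.array, x = x))

P(c(2, 1), dat)     # 2
P(c(0, 0, 0), dat)  # 1
P(c(1, 3), dat)     # 0
```

These are the three values given in the question.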

+3

flodel Dec 19 '12 at 12:36


Joris Meys · Accepted Answer · 2012-12-19T12:54:56+0000

Edit: the regex way:

 match.regex <- function(x,data){
   xs <- paste(x,collapse="_")
   dats <- apply(data,1,paste,collapse="_")
   sum(grepl(xs,dats))
 }

 > match.regex(c(1),dat)
 [1] 3
 > match.regex(c(0,0,0),dat)
 [1] 1
 > match.regex(c(1,2),dat)
 [1] 2
 > match.regex(5,dat)
 [1] 0

Surprisingly, this is faster than the other methods given here and about twice as fast as my solution below, on both small and large datasets. Regular expressions seem to be well optimized:

 > benchmark(matching(c(1,2),dat),match.regex(c(1,2),dat),replications=1000)
                        test replications elapsed relative
 2 match.regex(c(1, 2), dat)         1000    0.15      1.0
 1    matching(c(1, 2), dat)         1000    0.36      2.4
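One caveat with the regex approach: because the pattern is matched anywhere in the pasted string, a value such as 1 would also match inside 11 or 21 if the data held multi-digit numbers. The sketch below is one possible guard, not part of the original answer: it wraps both the pattern and the row strings in the separator and uses fixed = TRUE so that characters like "." are taken literally. The name match.delimited and the small matrix m are hypothetical.

```r
# Hypothetical variant of match.regex that only matches whole elements.
match.delimited <- function(x, data, sep = "_") {
  # Surround the pattern and every row string with the separator,
  # so "_1_" cannot match inside "_11_".
  xs   <- paste0(sep, paste(x, collapse = sep), sep)
  dats <- paste0(sep, apply(data, 1, paste, collapse = sep), sep)
  sum(grepl(xs, dats, fixed = TRUE))
}

# Illustrative data with a multi-digit value.
m <- rbind(c(11, 2), c(1, 2))

# Plain pattern "1" also matches inside "11".
naive <- sum(grepl("1", apply(m, 1, paste, collapse = "_")))  # 2
# Delimited pattern matches only the row that really contains 1.
safe  <- match.delimited(1, m)                                # 1
print(c(naive = naive, safe = safe))
```

For single-digit data like the example in the question, both versions return the same counts.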

An approach that gives you the count right away and works in a more vectorized way looks like this:

 matching.row <- function(x,row){
   nx <- length(x)
   sid <- which(x[1]==row)
   any(sapply(sid,function(i) all(row[seq(i,i+nx-1)]==x)))
 }

 matching <- function(x,data)
   sum(apply(data,1,function(i) matching.row(x,i)),na.rm=TRUE)

Here you first find the positions in a row where the first element of x occurs. Each such position starts a window of the same length as the vector you want to match, and every window is compared with that vector. This is applied to each row, and the number of rows returning TRUE is what you want.

 > matching(c(1),dat)
 [1] 3
 > matching(c(0,0,0),dat)
 [1] 1
 > matching(c(1,2),dat)
 [1] 2
 > matching(5,dat)
 [1] 0
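The sliding-window idea can also be written with an explicit window matrix. The following is an alternative sketch using base R's embed(), not the code from this answer. The function name count.rows.embed is hypothetical, and the example data assumes NA padding for the shorter rows.

```r
# Example data from the question; the NA padding is an assumption.
dat <- rbind(
  c(0, 0, 1, 2, 1, 0, 0, 0),
  c(1, 1, 0, 1, NA, NA, NA, NA),
  c(2, 1, 2, 0, 1, 0, NA, NA)
)

# Hypothetical helper: count rows that contain x as a run of adjacent values.
count.rows.embed <- function(x, data) {
  nx <- length(x)
  hits <- apply(data, 1, function(row) {
    if (length(row) < nx) return(FALSE)
    # embed() returns each window with its elements reversed,
    # so flip the columns to restore the original order.
    windows <- embed(row, nx)[, nx:1, drop = FALSE]
    # isTRUE() turns NA comparisons (from padded cells) into FALSE.
    any(apply(windows, 1, function(w) isTRUE(all(w == x))))
  })
  sum(hits)
}

count.rows.embed(c(1), dat)        # 3
count.rows.embed(c(0, 0, 0), dat)  # 1
count.rows.embed(c(1, 2), dat)     # 2
count.rows.embed(5, dat)           # 0
```

Unlike matching.row, this version checks every window rather than only those starting at a match of x[1], so it trades some speed for a simpler structure.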
