R - searching several columns with the same conditions and counting rows by column and condition

My problem is best explained as three questions.

1) Is there a way to search across multiple columns by index, using the same conditions for each? (I use the column names in the example below.) I am wondering whether there is a more elegant way to implement this (my other approach is further down).

    sepsis <- subset(allhospitals,
                     diag_p %in% c(78552, 99592) |
                     odiag1 %in% c(78552, 99592) |
                     odiag2 %in% c(78552, 99592) |
                     odiag3 %in% c(78552, 99592))  ## etc. etc.

2) After I subset my data, I would like to count, for each column, the number of rows in which each condition is present (i.e. how many times 78552 and 99592 each occur in diag_p, odiag1, odiag2, etc.).

3) Finally, I would like to do the calculation above broken down by the levels of a factor in another column.

My strategy (which is terrible) was to: a) create a vector of column indices; b) apply two functions (one per condition) to subsets of the data and count the rows; c) create a new data frame (one per condition) with the column indices as its only column; and finally d) use "apply" with the function I wrote on the column indices (that is, on the single column of the new data frame).

    ## indices for all columns of interest
    ind <- c(35, seq(from=39, to=85, by=2))

    ## create one data frame and function per ICD-9 code (i.e., condition)
    f7 <- function(x) nrow(subset(allhospitals, allhospitals[x]=="78552"))
    t.7 <- data.frame("diag"=ind)
    t.7$freq <- apply(t.7,1,f7)

    f9 <- function(x) nrow(subset(allhospitals, allhospitals[x]=="99592"))
    t.9 <- data.frame("diag"=ind)
    t.9$freq <- apply(t.9,1,f9)

Then I would combine all of this and get the cumulative value for the entire data set. The problem is that I need to do this for several separate factors, which makes my approach above very tedious. All my attempts with the plyr package were fruitless, although I'm pretty new to R, so maybe there is a solution there.

UPDATE:

I tried the plyr package again and got something close to what I want, although I need to do one condition ("99592") and one column ("odiag1") at a time, since I need the number of rows for each condition, not for all conditions combined. As you can see, my code is still ugly. In any case, I get back a data frame that I should probably convert to "long" format, since my data set is so wide and cumbersome to work with. Here are some representative data and my updated ddply approach:

Sample data:

    id      patzip  adm_yr  diag_p  odiag1  odiag2  odiag3  odiag4  etc. etc. etc.
    Hosp A  93077   2010    99592   16932   22107   78552   NA
    Hosp B  99804   2011    16932   99592   78552   12988   NA
    Hosp B  94503   2010    22107   78552   12988   99592   16932
    Hosp A  93013   2010    12988   22107   12988   NA      NA
    Hosp C  93112   2009    99592   78552   22107   NA      NA

My new approach:

    library(plyr)

    ## count rows per hospital where each diagnosis column equals 99592
    df <- ddply(allhospitals, .(id), summarize,
                diag_p  = length(id[diag_p  == 99592]),
                odiag1  = length(id[odiag1  == 99592]),
                odiag2  = length(id[odiag2  == 99592]),
                odiag3  = length(id[odiag3  == 99592]),
                odiag4  = length(id[odiag4  == 99592]),
                odiag5  = length(id[odiag5  == 99592]),
                odiag6  = length(id[odiag6  == 99592]),
                odiag7  = length(id[odiag7  == 99592]),
                odiag8  = length(id[odiag8  == 99592]),
                odiag9  = length(id[odiag9  == 99592]),
                odiag10 = length(id[odiag10 == 99592]),
                odiag11 = length(id[odiag11 == 99592]),
                odiag12 = length(id[odiag12 == 99592]),
                odiag13 = length(id[odiag13 == 99592]),
                odiag14 = length(id[odiag14 == 99592]),
                odiag15 = length(id[odiag15 == 99592]),
                odiag16 = length(id[odiag16 == 99592]),
                odiag17 = length(id[odiag17 == 99592]),
                odiag18 = length(id[odiag18 == 99592]),
                odiag19 = length(id[odiag19 == 99592]),
                odiag20 = length(id[odiag20 == 99592]),
                odiag21 = length(id[odiag21 == 99592]),
                odiag22 = length(id[odiag22 == 99592]),
                odiag23 = length(id[odiag23 == 99592]),
                odiag24 = length(id[odiag24 == 99592]))

UPDATE 2:

Here is one way the expected result might look:

    id      diag    Count.78552  Count.99592
    Hosp A  diag_p  4            0
    Hosp A  odiag1  10           8
    Hosp A  odiag2  17           16
    Hosp A  odiag3  9            10
    Hosp B  diag_p  5            8
    Hosp B  odiag1  1            3
    Hosp B  odiag2  0            1
    Hosp B  odiag3  0            0
1 answer

The same condition for multiple columns.

    vn_cond <- c("diag_p","odiag1","odiag2","odiag3")  # columns to meet condition
    cond_set <- c(78552, 99592)                        # values in condition set
    #
    # sapply - repeats conditions
    # rowSums(...)>0 - at least one TRUE in row
    #
    sepsis <- allhospitals[rowSums(sapply(allhospitals[vn_cond], "%in%", cond_set))>0,]
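To see what the filter above does, here is a minimal sketch on a toy data frame. The toy values are an assumption, loosely modelled on the sample rows in the question, and the `hits` and `counts` names are introduced only for illustration. It also shows one possible way to get the per-column totals asked for in question 2, using base R only.

```r
# Toy data loosely modelled on the question's sample rows (illustrative only)
allhospitals <- data.frame(
  id     = c("Hosp A", "Hosp B", "Hosp B", "Hosp A", "Hosp C"),
  diag_p = c(99592, 16932, 22107, 12988, 99592),
  odiag1 = c(16932, 99592, 78552, 22107, 78552),
  odiag2 = c(22107, 78552, 12988, 12988, 22107),
  odiag3 = c(78552, 12988, 99592, NA, NA),
  stringsAsFactors = FALSE
)

vn_cond  <- c("diag_p", "odiag1", "odiag2", "odiag3")  # columns to search
cond_set <- c(78552, 99592)                            # codes of interest

# Logical matrix: one row per record, one column per diagnosis column.
# %in% returns FALSE (not NA) for missing diagnoses.
hits <- sapply(allhospitals[vn_cond], "%in%", cond_set)

# Keep records with at least one matching code in any of the columns
sepsis <- allhospitals[rowSums(hits) > 0, ]

# Per-column totals for each code over the whole data set
counts <- sapply(allhospitals[vn_cond], function(col)
  c(n_78552 = sum(col == 78552, na.rm = TRUE),
    n_99592 = sum(col == 99592, na.rm = TRUE)))

print(sepsis)
print(counts)
```

Here `counts` is a small matrix with one row per code and one column per diagnosis column. Note that `sapply` only returns a matrix when the data frame has more than one row, so a single-row input would need separate handling.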

Edit

    require(reshape2)

    # long format: one row per (id, diagnosis column), dropping NA diagnoses
    hosp_long <- melt(allhospitals[c("id",vn_cond)], id.vars="id",
                      na.rm=TRUE, variable.name="var_diag")

    # 0/1 indicator for each code
    hosp_long <- transform(hosp_long,
                           diag_78552 = 0L+(value == 78552),
                           diag_99592 = 0L+(value == 99592))

    # melt the indicators, then sum them per hospital and diagnosis column
    hosp_long <- melt(subset(hosp_long,select=-value),
                      id.vars=c("id","var_diag"), variable.name="var_cond")
    out <- dcast(hosp_long, id+var_diag~var_cond, sum)
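If reshape2 is not available, a similar per-hospital, per-column count table can probably be built in base R. The following is only a sketch of that alternative, not the method used above: the toy data frame is an assumption based on the question's sample rows, and the stacking step with `rep` and `unlist` stands in for `melt`.

```r
# Toy data loosely modelled on the question's sample rows (illustrative only)
allhospitals <- data.frame(
  id     = c("Hosp A", "Hosp B", "Hosp B", "Hosp A", "Hosp C"),
  diag_p = c(99592, 16932, 22107, 12988, 99592),
  odiag1 = c(16932, 99592, 78552, 22107, 78552),
  odiag2 = c(22107, 78552, 12988, 12988, 22107),
  odiag3 = c(78552, 12988, 99592, NA, NA),
  stringsAsFactors = FALSE
)

vn_cond  <- c("diag_p", "odiag1", "odiag2", "odiag3")
cond_set <- c(78552, 99592)

# Stack the diagnosis columns: one row per (record, diagnosis column).
# unlist() goes column by column, which matches the rep() patterns below.
hosp_long <- data.frame(
  id       = rep(allhospitals$id, times = length(vn_cond)),
  var_diag = rep(vn_cond, each = nrow(allhospitals)),
  value    = unlist(allhospitals[vn_cond], use.names = FALSE),
  stringsAsFactors = FALSE
)

# One 0/1 indicator column per code; missing diagnoses count as 0
for (code in cond_set) {
  hosp_long[[paste0("diag_", code)]] <- as.integer(hosp_long$value %in% code)
}

# Sum the indicators within each hospital and diagnosis column
out <- aggregate(cbind(diag_78552, diag_99592) ~ id + var_diag,
                 data = hosp_long, FUN = sum)
out <- out[order(out$id, out$var_diag), ]

print(out)
```

One difference from the reshape2 version: because missing diagnoses are kept and counted as 0 here, a hospital whose values in a column are all NA still gets a row with zero counts, whereas `melt(..., na.rm=TRUE)` would drop it.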