How do you run chisq.test on two different dplyr outputs and then sum them up in a table?

Question

How do you run chisq.test on two different dplyr outputs and then sum them up in a table?

The question I have is related to the one I posted some time ago here

Jaap was fantastic because it helped me create this wonderful output of pivot tables of counts and frequencies (in percent) of categorical variables.

The “real data” that I am analyzing is two different hospitals, each of which has a different frequency of drug use, but not always the same drugs.

The summary of the Jaap func function from here is as follows, but in general the data.frame shown below (hospitals number one and two):

  id AB1 AB2 AB3 AB4 AB5 AB6 AB7 AB8 AB9 AB10 AB11 AB12 AB13 total perc 1 1st gen Cephalosporin 4 0 0 1 1 0 0 0 0 0 0 0 0 6 1.9 2 3rd gen Cephalosporin 44 7 8 1 3 2 0 0 0 0 0 0 0 65 20.5 3 4th gen Cephalosporin 3 3 0 1 2 1 0 0 0 0 0 0 0 10 3.2

Now I would like to run chisq.test (or Fisher's , if the frequency is below 5) of the entire variable (names) found in the id column, using the total frequency found in the total column, by comparing hospital one against hospital two.

So, in layman's terms, I want to answer the following question: "If the 1st generation of cephalosporins were more often prescribed in the hospital compared to the hospital two?" etc.

Since some variable identifier may not be identical between hospitals, I expect this may return a NULL calculation.

Ideally, I would like to summarize all of these findings in a table with the corresponding p-value, to look like this:

 id Hospital One Total Frequency Hospital Two Total Frequency p-value xyz 15 30 0.01

Many thanks for your help.

All data can be found below.

Greetings

EDIT the following points raised:

This is just mock output (ideally, what I would like).

 id Hospital One Total Frequency Hospital Two Total Frequency p-value xyz ni x.xx

As already mentioned, the p value must be obtained from a chisq.test or fisher.test .

I'm going, the output must be somehow generated this way, with hospital no. 1 called hosp1 and no. 2 hospital called hosp2

 # first take those columns of the dplyr output your interested in hosp1_sel<-hosp1[,c("id","total")] hosp2_sel<-hosp2[,c("id","total")] #then merge the data.frames to one so you can perform analysis on one dataframe new_df <- merge(hosp1_sel, hosp2_sel, by=0) #this looks like this > new_df Row.names id.x total.x id.y total.y 1 1 1st gen Cephalosporin 6 3rd gen Cephalosporin 19 2 10 Trimethoprim 2 Polypeptide 1 3 11 Ureidopenicillin 46 Rifamycin 1 4 12 Carbapenem 19 Tetracycline 1 5 13 Fluorquinolone 17 Lincosamide 1 6 14 Nitromidazole 12 Quinolone 2 7 15 Antifungal 6 Sulfonamides 2 8 16 Oxazolidinone 2 Nitroimidazole 1 9 17 Rifamycin 1 Polymyxine 1 10 18 Polypeptide 1 Colistin 1 11 2 3rd gen Cephalosporin 65 Carbapenem 37 12 3 4th gen Cephalosporin 10 Fluoroquinolone 24 13 4 Aminoglycoside 31 Glycopeptide 32 14 5 Clindamycin 2 Penicillin 29 15 6 Glycopeptide 55 Ureidopenicillin 36 16 7 Macrolide 3 Lipopeptide 4 17 8 Penicillin 36 Macrolid 2 18 9 Tetracycline 2 Aminoglycoside 9

This is where I am stuck. In my opinion, I would have to make this data.frame wider, then to run something like:

 chisq.test(hosp1$Ureidopenicillin, hosp2$Ureidopenicillin)

To determine if there were more frequent cases of "ureidopenicillins" in Hospital No. 1 compared to Hospital No. 2, etc.

The problem is that this is actually a comparison of "counts", not the "proportions" from the contingency table, though ...

Any ideas?

ABOUT.

Hospital No. 1 data.frame :

 structure(list(id = structure(1:19, .Label = c("1st gen Cephalosporin", "3rd gen Cephalosporin", "4th gen Cephalosporin", "Aminoglycoside", "Clindamycin", "Glycopeptide", "Macrolide", "Penicillin", "Tetracycline", "Trimethoprim", "Ureidopenicillin", "Carbapenem", "Fluorquinolone", "Nitromidazole", "Antifungal", "Oxazolidinone", "Rifamycin", "Polypeptide", "Lipopeptide "), class = "factor"), AB1 = c(4L, 44L, 3L, 1L, 1L, 7L, 1L, 7L, 2L, 1L, 12L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), AB2 = c(0L, 7L, 3L, 7L, 0L, 16L, 2L, 9L, 0L, 0L, 9L, 1L, 2L, 6L, 0L, 0L, 0L, 0L, 0L), AB3 = c(0L, 8L, 0L, 5L, 1L, 13L, 0L, 5L, 0L, 0L, 12L, 4L, 1L, 2L, 0L, 0L, 0L, 0L, 0L), AB4 = c(1L, 1L, 1L, 6L, 0L, 5L, 0L, 8L, 0L, 0L, 5L, 3L, 4L, 1L, 1L, 1L, 1L, 0L, 0L), AB5 = c(1L, 3L, 2L, 2L, 0L, 4L, 0L, 1L, 0L, 0L, 2L, 4L, 1L, 1L, 2L, 0L, 0L, 0L, 0L), AB6 = c(0L, 2L, 1L, 3L, 0L, 5L, 0L, 1L, 0L, 0L, 2L, 1L, 1L, 2L, 1L, 0L, 0L, 0L, 0L), AB7 = c(0L, 0L, 0L, 1L, 0L, 0L, 0L, 3L, 0L, 0L, 2L, 2L, 2L, 0L, 0L, 1L, 0L, 1L, 0L), AB8 = c(0L, 0L, 0L, 3L, 0L, 1L, 0L, 0L, 0L, 0L, 2L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 1L), AB9 = c(0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 2L, 2L, 0L, 0L, 0L, 0L, 0L, 0L), AB10 = c(0L, 0L, 0L, 1L, 0L, 2L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), AB11 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 0L, 1L, 0L, 0L, 0L, 0L), AB12 = c(0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L ), AB13 = c(0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L), total = c(6, 65, 10, 31, 2, 55, 3, 36, 2, 2, 46, 19, 17, 12, 6, 2, 1, 1, 1), perc = c(1.9, 20.5, 3.2, 9.8, 0.6, 17.4, 0.9, 11.4, 0.6, 0.6, 14.5, 6, 5.4, 3.8, 1.9, 0.6, 0.3, 0.3, 0.3)), class = "data.frame", .Names = c("id", "AB1", "AB2", "AB3", "AB4", "AB5", "AB6", "AB7", "AB8", "AB9", "AB10", "AB11", "AB12", "AB13", "total", "perc"), row.names = c(NA, -19L))

Hospital No. 2 data.frame :

 structure(list(id = structure(1:18, .Label = c("3rd gen Cephalosporin", "Carbapenem", "Fluoroquinolone", "Glycopeptide", "Penicillin", "Ureidopenicillin", "Lipopeptide", "Macrolid", "Aminoglycoside", "Polypeptide", "Rifamycin", "Tetracycline", "Lincosamide", "Quinolone", "Sulfonamides", "Nitroimidazole", "Polymyxine", "Colistin"), class = "factor"), AB1 = c(9L, 3L, 1L, 7L, 16L, 22L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), AB2 = c(2L, 17L, 5L, 8L, 2L, 9L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), AB3 = c(1L, 9L, 4L, 5L, 3L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L), AB4 = c(1L, 3L, 3L, 7L, 4L, 3L, 0L, 0L, 2L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L), AB5 = c(3L, 1L, 4L, 1L, 4L, 1L, 2L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 2L, 0L, 0L, 0L), AB6 = c(3L, 2L, 4L, 1L, 0L, 0L, 0L, 0L, 3L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L), AB7 = c(0L, 2L, 3L, 3L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L), total = c(19, 37, 24, 32, 29, 36, 4, 2, 9, 1, 1, 1, 1, 2, 2, 1, 1, 1), perc = c(9.4, 18.2, 11.8, 15.8, 14.3, 17.7, 2, 1, 4.4, 0.5, 0.5, 0.5, 0.5, 1, 1, 0.5, 0.5, 0.5)), class = "data.frame", .Names = c("id", "AB1", "AB2", "AB3", "AB4", "AB5", "AB6", "AB7", "total", "perc" ), row.names = c(NA, -18L))

0

function r dplyr

Ofish Dec 10 '14 at 6:32

source share

1 answer

Khashaa · Accepted Answer · 2014-12-11T01:55:03+0000

Merging with hospital <- left_join(hosp1, hosp2, by = "id") %>% select(id, total.x, total.y) led to

  #id total.x total.y #1 1st gen Cephalosporin 6 NA #2 3rd gen Cephalosporin 65 19 #3 4th gen Cephalosporin 10 NA #4 Aminoglycoside 31 9 #5 Clindamycin 2 NA #6 Glycopeptide 55 32 #7 Macrolide 3 NA #8 Penicillin 36 29 #9 Tetracycline 2 1 #10 Trimethoprim 2 NA #11 Ureidopenicillin 46 36 #12 Carbapenem 19 37 #13 Fluorquinolone 17 NA #14 Nitromidazole 12 NA #15 Antifungal 6 NA #16 Oxazolidinone 2 NA #17 Rifamycin 1 1 #18 Polypeptide 1 1 #19 Lipopeptide 1 NA

It is strange that too much NA created for hosp2 . On closer inspection, there are inconsistencies between the id variables. For example, the 14th line in hosp1 is Nitromidazole , while the 16th line in hosp2 is Nitroimidazole , and I'm not sure if they indicate the same medicine.

In any case, although I have some doubts regarding the use of chisq.test , the desired result can be produced as follows

 pval <- function(x, y){ ifelse(!is.na(x) & !is.na(y), chisq.test(c(x, y))$p.value, NA) } p <- lapply(1:length(hospital$total.x), function(i){ pval(hospital$total.x[i],hospital$total.y[i]) } ) hospital$p_value <- unlist(p) colnames(hospital) <- c("id", "Hospital One Total Frequency", "Hospital Two Total Frequency", "p-value")

Final conclusion looks

 > hospital # id Hospital One Total Frequency Hospital Two Total Frequency p-value #1 1st gen Cephalosporin 6 NA NA #2 3rd gen Cephalosporin 65 19 5.193805e-07 #3 4th gen Cephalosporin 10 NA NA #4 Aminoglycoside 31 9 5.042182e-04 #5 Clindamycin 2 NA NA #6 Glycopeptide 55 32 1.366852e-02 #7 Macrolide 3 NA NA #8 Penicillin 36 29 3.852612e-01 #9 Tetracycline 2 1 5.637029e-01 #10 Trimethoprim 2 NA NA #11 Ureidopenicillin 46 36 2.694564e-01 #12 Carbapenem 19 37 1.615693e-02 #13 Fluorquinolone 17 NA NA #14 Nitromidazole 12 NA NA #15 Antifungal 6 NA NA #16 Oxazolidinone 2 NA NA #17 Rifamycin 1 1 1.000000e+00 #18 Polypeptide 1 1 1.000000e+00 #19 Lipopeptide 1 NA NA

How do you run chisq.test on two different dplyr outputs and then sum them up in a table?

More articles: