Can someone tell me why R is not using the whole data.frame for this chisq.test?

I can't figure out a problem I ran into while building my own data.frame and running a statistical test (e.g. chisq.test) on it.

The background is as follows: I summarized data that I obtained from two hospitals. Both measured the same categorical variable n times. In this case, it is how often healthcare-associated bacteria were detected during a certain observation period.

In table form, the summarized data look as follows, where % is the percentage of all measurements made during the period.

                                    n Hospital 1 (%)   n Hospital 2 (%)
    Healthcare associated bacteria       829 (59.4)         578 (57.6)
    Community associated bacteria        473 (33.9)         372 (37.1)
    Contaminants                          94  (6.7)          53  (5.3)
    Total                               1396 (100.0)       1003 (100.0)

Looking at the percentages, it is obvious that the proportions are very similar, and you may wonder why on earth I want to compare the two hospitals statistically. But I have other data where the proportions differ, hence the question:

How can I compare hospital 1 with hospital 2 with respect to the measured categories?

Since the data come in summarized, tabular form, I decided to make a data.frame for each of the individual variables/categories.

    hosp1 <- rep(c("Yes", "No"), times=c(829,567))
    hosp2 <- rep(c("Yes", "No"), times=c(578,425))
    all <- cbind(hosp1, c(hosp2,rep(NA, length(hosp1)-length(hosp2))))
    all <- data.frame(all)
    names(all)[2]<-"hosp2"
    summary(all)

So far so good: the summary looks right, so I should be able to run chisq.test(). But now things get strange.

    with(all, chisq.test(hosp1, hosp2, correct=F))

        Pearson's Chi-squared test

    data:  hosp1 and hosp2
    X-squared = 286.3087, df = 1, p-value < 2.2e-16

The result seems to indicate a significant difference. But if you cross-tabulate the data, you see that R summarizes it in a very strange way:

    with(all, table(hosp1, hosp2))

          No Yes
      No  174   0
      Yes 251 578

So of course, if the data are tabulated this way, the result will be statistically significant, because one cell is counted as having no observations at all. Why is this happening, and what can I do to fix it? Finally, instead of making a separate data.frame for each category, is there an obvious way to loop over the categories? I can't think of one.

Thanks for your help!

UPDATE BASED ON THELATEMAIL'S REQUEST FOR THE RAW DATA.FRAME

    dput(SO_Example_v1)
    structure(list(
      Type = structure(c(3L, 1L, 2L),
                       .Label = c("Community", "Contaminant", "Healthcare"),
                       class = "factor"),
      hosp1_WoundAssocType = c(464L, 285L, 24L),
      hosp1_BloodAssocType = c(73L, 40L, 26L),
      hosp1_UrineAssocType = c(75L, 37L, 18L),
      hosp1_RespAssocType = c(137L, 77L, 2L),
      hosp1_CathAssocType = c(80L, 34L, 24L),
      hosp2_WoundAssocType = c(171L, 115L, 17L),
      hosp2_BloodAssocType = c(127L, 62L, 12L),
      hosp2_UrineAssocType = c(50L, 29L, 6L),
      hosp2_RespAssocType = c(135L, 142L, 6L),
      hosp2_CathAssocType = c(95L, 24L, 12L)),
      .Names = c("Type", "hosp1_WoundAssocType", "hosp1_BloodAssocType",
                 "hosp1_UrineAssocType", "hosp1_RespAssocType",
                 "hosp1_CathAssocType", "hosp2_WoundAssocType",
                 "hosp2_BloodAssocType", "hosp2_UrineAssocType",
                 "hosp2_RespAssocType", "hosp2_CathAssocType"),
      class = "data.frame", row.names = c(NA, -3L))

Explanation: This data.frame is actually more detailed than what is summarized in the table above, since it also records where each type of bacteria was cultured (e.g. wounds, blood cultures, catheters, etc.). So the table I am creating looks like this:

    All locations                   n Hospital 1 (%)   n Hospital 2 (%)   p-val
    Healthcare associated bacteria       829 (59.4)         578 (57.6)    0.39
    Community associated bacteria        473 (33.9)         372 (37.1)    ...
    Contaminants                          94  (6.7)          53  (5.3)    ...
    Total                               1396 (100.0)       1003 (100.0)   -

The heading "All locations" will subsequently be replaced by Wound, Blood, Urine, Catheter, etc.

1 answer

The answer to the question of how to get the p-values is fairly simple. You can get the other two p-values you are looking for using the same syntax that @thelatemail used, as follows:

    #community (p = 0.1049)
    chisq.test(cbind(c(473,923),c(372,631)),correct=FALSE)
    #contaminants (p = 0.1443)
    chisq.test(cbind(c(94,1302),c(53,950)),correct=FALSE)
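As for why the cross-tabulation in the question looked strange: `cbind(hosp1, hosp2)` puts the two vectors side by side, so `table(hosp1, hosp2)` treats row i of one hospital as paired with row i of the other, and the rows padded with `NA` are dropped by default. The measurements are not paired, so the observations need to be stacked with a separate hospital label instead. Below is a minimal sketch of that, assuming the counts from the summary table in the question; the names `long` and `healthcare` are made up for the illustration.

```r
# Counts taken from the summary table in the question
hosp1 <- rep(c("Yes", "No"), times = c(829, 567))
hosp2 <- rep(c("Yes", "No"), times = c(578, 425))

# Stack the observations and record which hospital each one came from,
# instead of cbind()-ing them side by side as if they were paired.
# `long` and `healthcare` are illustrative names, not from the question.
long <- data.frame(
  hospital   = rep(c("hosp1", "hosp2"), times = c(length(hosp1), length(hosp2))),
  healthcare = c(hosp1, hosp2)
)

# 2 x 2 contingency table: healthcare-associated (No/Yes) by hospital
tab <- with(long, table(healthcare, hospital))
print(tab)

res <- chisq.test(tab, correct = FALSE)
print(res$p.value)
```

The p-value from this table should agree with the 0.39 shown for healthcare-associated bacteria in the question's table.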

You can get these answers more programmatically as follows:

    out <- cbind(rowSums(SO_Example_v1[,2:6]),rowSums(SO_Example_v1[,7:11]))
    chisq.test(rbind(out[1,],colSums(out[2:3,])),correct=FALSE)
    chisq.test(rbind(out[2,],colSums(out[c(1,3),])),correct=FALSE)
    chisq.test(rbind(out[3,],colSums(out[1:2,])),correct=FALSE)
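The question also asks whether the per-category tests can be looped. One possible sketch is below. To keep it self-contained it types in the category totals by hand rather than deriving them from `SO_Example_v1`, and the helper `one_vs_rest` is a hypothetical name for this illustration, not an existing function.

```r
# Category totals per hospital, typed in from the question's summary table
out <- cbind(hosp1 = c(829, 473, 94), hosp2 = c(578, 372, 53))
rownames(out) <- c("Healthcare", "Community", "Contaminant")

# Hypothetical helper: test category i against all other categories combined
one_vs_rest <- function(i) {
  tab <- rbind(out[i, ], colSums(out[-i, , drop = FALSE]))
  chisq.test(tab, correct = FALSE)$p.value
}

# One p-value per category
pvals <- sapply(seq_len(nrow(out)), one_vs_rest)
names(pvals) <- rownames(out)
print(round(pvals, 4))
```

Negative indexing (`out[-i, ]`) replaces the hand-written `out[2:3,]`, `out[c(1,3),]` and `out[1:2,]`, so the same function would keep working if more categories were added.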

Of course, at this point we are going beyond the scope of SO, but given the nature of the data, perhaps a more pertinent question is whether there is a difference between the hospitals overall. You can answer that with a chi-square test based on all three types:

 chisq.test(out,correct=FALSE) 
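The same overall test could be repeated for each culture site (wound, blood, urine, etc.), which is what the question's "All locations" heading hints at. The following is only a sketch under assumptions: it rebuilds the counts from the `dput()` output as a plain data.frame called `counts` (a name chosen here), and it relies on the column names following the `hospN_<Site>AssocType` pattern.

```r
# Counts copied from the dput() output in the question
counts <- data.frame(
  Type = c("Healthcare", "Community", "Contaminant"),
  hosp1_WoundAssocType = c(464, 285, 24), hosp1_BloodAssocType = c(73, 40, 26),
  hosp1_UrineAssocType = c(75, 37, 18),   hosp1_RespAssocType  = c(137, 77, 2),
  hosp1_CathAssocType  = c(80, 34, 24),
  hosp2_WoundAssocType = c(171, 115, 17), hosp2_BloodAssocType = c(127, 62, 12),
  hosp2_UrineAssocType = c(50, 29, 6),    hosp2_RespAssocType  = c(135, 142, 6),
  hosp2_CathAssocType  = c(95, 24, 12)
)

sites <- c("Wound", "Blood", "Urine", "Resp", "Cath")

# Build one 3 x 2 table (type by hospital) per culture site
site_tables <- lapply(sites, function(s) {
  cols <- paste0(c("hosp1_", "hosp2_"), s, "AssocType")
  tab <- as.matrix(counts[, cols])
  dimnames(tab) <- list(counts$Type, c("hosp1", "hosp2"))
  tab
})
names(site_tables) <- sites

# Overall chi-square test per site; returns a named vector of p-values
site_p <- sapply(site_tables, function(tab) {
  chisq.test(tab, correct = FALSE)$p.value
})
print(site_p)
```

Some sites have very small contaminant counts, so `chisq.test` may warn that the chi-squared approximation could be inaccurate; in that case `fisher.test` on the same table might be worth considering.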

Source: https://habr.com/ru/post/1215113/

