I can't find a solution to a problem that came up while building my own data.frame and running a statistical test (e.g. chisq.test()) on it.
The background is as follows: I have summarized data that I collected in two hospitals. Both measured the same categorical variable n times. In this case, the variable is how often healthcare-associated bacteria were found during a certain observation period.
In table form, the summarized data look as follows, where % is the percentage of all measurements made during that period.
                                    Hospital 1      Hospital 2
                                    n (%)           n (%)
    Healthcare associated bacteria  829 (59.4)      578 (57.6)
    Community associated bacteria   473 (33.9)      372 (37.1)
    Contaminants                    94 (6.7)        53 (5.3)
    Total                           1396 (100.0)    1003 (100.0)
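For reference, the percentages can be reproduced from the raw counts. This is only a minimal sketch: the object names `counts` and `pct` are made up for illustration, and the numbers are typed in by hand from the table above.

```r
# Counts copied from the table above; `counts` is an illustrative name.
# matrix() fills column by column, so the first three values are Hospital 1.
counts <- matrix(
  c(829, 473, 94,
    578, 372, 53),
  ncol = 2,
  dimnames = list(
    Type     = c("Healthcare", "Community", "Contaminant"),
    Hospital = c("hosp1", "hosp2")
  )
)

# Column-wise percentages: each count divided by its hospital's total.
pct <- round(100 * prop.table(counts, margin = 2), 1)
pct

# Totals per hospital (1396 and 1003).
colSums(counts)
```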
Looking at the percentages, it is obvious that the proportions are very similar, and you may wonder why on earth I want to compare the two hospitals statistically. But I have other data where the proportions differ, hence the question:
How can I compare hospital 1 with hospital 2 with respect to the measured categories?
Since the data are only available in summarized (aggregated) form, I decided to build a data.frame for each single variable/category.
    hosp1 <- rep(c("Yes", "No"), times=c(829,567))
    hosp2 <- rep(c("Yes", "No"), times=c(578,425))
    all <- cbind(hosp1, c(hosp2,rep(NA, length(hosp1)-length(hosp2))))
    all <- data.frame(all)
    names(all)[2]<-"hosp2"
    summary(all)
So far so good: the summary looks right, so it should be possible to run chisq.test(). But this is where things get strange.
    with(all, chisq.test(hosp1, hosp2, correct=F))

            Pearson's Chi-squared test

    data:  hosp1 and hosp2
    X-squared = 286.3087, df = 1, p-value < 2.2e-16
The result seems to indicate a significant difference. If you cross-tabulate the data (rows are hosp1, columns are hosp2), you will see that R summarizes them in a very strange way:
    with(all, table(hosp1, hosp2))

         No Yes
    No  174   0
    Yes 251 578
So, of course, if the data are tabulated this way, the test will come out statistically significant, because one cell is counted as having no observations at all. Why is this happening, and what can I do to fix it? Finally, instead of building a separate data.frame for each category, is there an obvious way to loop over the categories? I can't figure this out.
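A possible reading of what goes wrong, offered as a sketch rather than a definitive answer: when chisq.test() is given two vectors, it cross-tabulates them position by position, as if each row were one unit observed in both hospitals. Because hosp2 is padded with NA and both vectors start with their "Yes" values, the first 578 rows pair up as Yes/Yes and no row can ever be No/Yes. Since the two hospitals are independent samples, the counts probably need to go in as a contingency table with one column per hospital. The object names `tab2x2`, `tab3x2`, `res2x2` and `res3x2` are made up for illustration.

```r
# Healthcare-associated vs. everything else, one column per hospital.
# matrix() fills column by column.
tab2x2 <- matrix(
  c(829, 567,    # hosp1: Yes, No
    578, 425),   # hosp2: Yes, No
  nrow = 2,
  dimnames = list(Healthcare = c("Yes", "No"),
                  Hospital   = c("hosp1", "hosp2"))
)
res2x2 <- chisq.test(tab2x2, correct = FALSE)
res2x2$p.value   # close to 0.39

# All three categories at once (a 3 x 2 table, 2 degrees of freedom).
tab3x2 <- matrix(
  c(829, 473, 94,    # hosp1
    578, 372, 53),   # hosp2
  nrow = 3,
  dimnames = list(Type     = c("Healthcare", "Community", "Contaminant"),
                  Hospital = c("hosp1", "hosp2"))
)
res3x2 <- chisq.test(tab3x2)
res3x2
```

The 2 x 2 version answers "does the share of this one category differ between hospitals?", while the 3 x 2 version tests the whole distribution of categories in one go.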
Thanks for your help!
UPDATE BASED ON TELERO'S REQUEST FOR THE RAW DATA.FRAME
    dput(SO_Example_v1)
    structure(list(Type = structure(c(3L, 1L, 2L), .Label = c("Community",
    "Contaminant", "Healthcare"), class = "factor"),
        hosp1_WoundAssocType = c(464L, 285L, 24L),
        hosp1_BloodAssocType = c(73L, 40L, 26L),
        hosp1_UrineAssocType = c(75L, 37L, 18L),
        hosp1_RespAssocType = c(137L, 77L, 2L),
        hosp1_CathAssocType = c(80L, 34L, 24L),
        hosp2_WoundAssocType = c(171L, 115L, 17L),
        hosp2_BloodAssocType = c(127L, 62L, 12L),
        hosp2_UrineAssocType = c(50L, 29L, 6L),
        hosp2_RespAssocType = c(135L, 142L, 6L),
        hosp2_CathAssocType = c(95L, 24L, 12L)),
        .Names = c("Type", "hosp1_WoundAssocType", "hosp1_BloodAssocType",
        "hosp1_UrineAssocType", "hosp1_RespAssocType", "hosp1_CathAssocType",
        "hosp2_WoundAssocType", "hosp2_BloodAssocType", "hosp2_UrineAssocType",
        "hosp2_RespAssocType", "hosp2_CathAssocType"),
        class = "data.frame", row.names = c(NA, -3L))
Explanation: This data.frame is actually more detailed than what is summarized in the table above, since it also records the site from which each type of bacteria was cultured (for example, wounds, blood cultures, catheters, etc.). So the table that I want to create looks like this:
    All locations                   Hospital 1      Hospital 2      p-val
                                    n (%)           n (%)
    Healthcare associated bacteria  829 (59.4)      578 (57.6)      0.39
    Community associated bacteria   473 (33.9)      372 (37.1)      ...
    Contaminants                    94 (6.7)        53 (5.3)        ...
    Total                           1396 (100.0)    1003 (100.0)    -
The heading "All locations" would subsequently be replaced by wound, blood, urine, catheter, etc.
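To make the goal concrete, here is a rough sketch of the kind of loop I have in mind. It assumes that each category should be tested against the other two combined, separately for each site. The data.frame is retyped compactly with the same column names and counts as in the dput() output; the helper `compare_site` and the objects `sites` and `results` are hypothetical names, not an existing function.

```r
# Same counts as in the dput() output above, typed in compactly.
SO_Example_v1 <- data.frame(
  Type = factor(c("Healthcare", "Community", "Contaminant")),
  hosp1_WoundAssocType = c(464L, 285L, 24L),
  hosp1_BloodAssocType = c(73L, 40L, 26L),
  hosp1_UrineAssocType = c(75L, 37L, 18L),
  hosp1_RespAssocType  = c(137L, 77L, 2L),
  hosp1_CathAssocType  = c(80L, 34L, 24L),
  hosp2_WoundAssocType = c(171L, 115L, 17L),
  hosp2_BloodAssocType = c(127L, 62L, 12L),
  hosp2_UrineAssocType = c(50L, 29L, 6L),
  hosp2_RespAssocType  = c(135L, 142L, 6L),
  hosp2_CathAssocType  = c(95L, 24L, 12L)
)

# Site labels as they appear inside the column names.
sites <- c("Wound", "Blood", "Urine", "Resp", "Cath")

# Hypothetical helper: for one site, test each category against the rest
# using a 2 x 2 table with one column per hospital.
compare_site <- function(df, site) {
  h1 <- df[[paste0("hosp1_", site, "AssocType")]]
  h2 <- df[[paste0("hosp2_", site, "AssocType")]]
  p <- sapply(seq_along(h1), function(i) {
    tab <- rbind(c(h1[i], h2[i]),                       # this category
                 c(sum(h1) - h1[i], sum(h2) - h2[i]))   # all other categories
    chisq.test(tab, correct = FALSE)$p.value
  })
  data.frame(Site = site, Type = df$Type, hosp1 = h1, hosp2 = h2, p_val = p)
}

# One row per site and category.
results <- do.call(rbind, lapply(sites, compare_site, df = SO_Example_v1))
results
```

Some cells are small (for example, contaminants in respiratory samples), so chisq.test() may warn that the approximation could be inaccurate; fisher.test() on the same 2 x 2 table might be a safer choice for those rows.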