Creating a cross-frequency table in R or MySQL

Question

I have a table of (user_id, category) pairs. A user can fall into several categories. I am trying to get a count for every pair of categories, i.e. the number of users who are in both category A and category C, and so on.

My source data is structured as follows:

[image: example results]

I would like the results to look like this, showing the cross-category counts:

[image: example results]

How can this be done in R or MySQL? The data is pretty big.

Here is an example of data:

data <- structure(list(category = structure(c(1L, 2L, 2L, 1L, 3L, 3L, 
2L, 1L, 3L, 2L, 2L, 2L, 3L, 1L, 1L, 3L), .Label = c("A", "B", 
"C"), class = "factor"), user_id = c(464L, 345L, 342L, 312L, 
345L, 234L, 423L, 464L, 756L, 756L, 345L, 345L, 464L, 345L, 234L, 
312L)), .Names = c("category", "user_id"), class = "data.frame", row.names = c(NA, 
-16L))

Any code snippets, thoughts on approaches, functions, or package recommendations are welcome. Thanks! -John

+4

mysql r

Super_john Jun 08 '15 at 1:11

4 answers

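In R, building on @josliber's answer, one option that should scale to large data is a sparse category-by-user matrix from the Matrix package (igraph is another package that handles this kind of category/user structure). In R: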

library('Matrix')

# Sparse matrix with one row per category and one column per user
mat <- spMatrix(nrow=length(unique(data$category)),
    ncol=length(unique(data$user_id)),
    i = as.numeric(factor(data$category)),  # row index = category
    j = as.numeric(factor(data$user_id)),   # column index = user
    x = rep(1, nrow(data))                  # one unit per input row
)
rownames(mat) <- levels(factor(data$category))
colnames(mat) <- levels(factor(data$user_id))
mat

#mat_row <- mat %*% t(mat)

##  Based on @user20650 comment this is even more efficient than
##    the multiplication above:
mat_row <- tcrossprod(mat)

This yields the following category-by-category matrix:

> mat_row
3 x 3 sparse Matrix of class "dgCMatrix"
  A  B C
A 7  3 5
B 3 12 4
C 5  4 5
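
Note that the input contains repeated (category, user_id) rows (user 345 appears three times in category B), and those repeats are summed into the matrix, so the numbers above count records rather than distinct users. If distinct users are what is wanted, a minimal sketch is to deduplicate the pairs first. The names `pairs_unique`, `inc` and `shared` are illustrative and not from the original answer, and the data frame is a compact rebuild of the question's example:

```r
library(Matrix)

# Example data from the question, rebuilt compactly
data <- data.frame(
  category = factor(c("A","B","B","A","C","C","B","A",
                      "C","B","B","B","C","A","A","C")),
  user_id  = c(464L,345L,342L,312L,345L,234L,423L,464L,
               756L,756L,345L,345L,464L,345L,234L,312L)
)

# Keep one row per (category, user_id) pair so each user counts once
pairs_unique <- unique(data[, c("category", "user_id")])
cat_f  <- factor(pairs_unique$category)
user_f <- factor(pairs_unique$user_id)

# Sparse 0/1 incidence matrix: rows = categories, columns = users
inc <- sparseMatrix(
  i = as.integer(cat_f),
  j = as.integer(user_f),
  x = rep(1, nrow(pairs_unique)),
  dims = c(nlevels(cat_f), nlevels(user_f)),
  dimnames = list(levels(cat_f), levels(user_f))
)

# Entry [a, b] = number of distinct users in both category a and category b
shared <- tcrossprod(inc)
shared
```

On the example data the off-diagonal entries come out as A/B = 1, A/C = 4 and B/C = 2, and the diagonal holds the number of distinct users per category.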

+1

Forrest R. Stevens Jun 08 '15 at 2:24

In MySQL, you can do this with a self-join:

select a.category, b.category, count(*)
from pairs a join
     pairs b
     on a.user_id = b.user_id
group by a.category, b.category;

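The query returns one row per pair of categories, with the count in the third column. Reshaping those rows into a grid in SQL takes extra work (google: "… mysql").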
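
For comparison, the same self-join can be sketched in R, where `table()` handles the reshaping into a grid. This is an illustration, not part of the original answer: it assumes the pairs fit in memory, and it deduplicates them first so that each user counts once (the SQL above counts repeated rows separately). The names `pairs`, `joined` and `counts` are hypothetical.

```r
# Example data from the question, rebuilt compactly
data <- data.frame(
  category = c("A","B","B","A","C","C","B","A",
               "C","B","B","B","C","A","A","C"),
  user_id  = c(464L,345L,342L,312L,345L,234L,423L,464L,
               756L,756L,345L,345L,464L,345L,234L,312L),
  stringsAsFactors = FALSE
)

# One row per (category, user_id) pair
pairs <- unique(data)

# Self-join on user_id: every pair of categories that share a user
joined <- merge(pairs, pairs, by = "user_id", suffixes = c("_a", "_b"))

# Cross-tabulate the two category columns into a grid of user counts
counts <- table(joined$category_a, joined$category_b)
counts
```

The self-join grows with the square of the number of categories per user, so for very large data the sparse-matrix route is likely the safer choice.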

0

Gordon Linoff Jun 08 '15 at 1:32

You can use dplyr to build the list of unique (user_id, category) pairs, then crossprod() to count the users that each pair of categories has in common.

> library(dplyr)
> data <- data %>% group_by(user_id, category) %>% summarize(records = sign(n()))
> crossprod(table(data$user_id, data$category))

    A B C
  A 4 1 4
  B 1 4 2
  C 4 2 5
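
If dplyr is not available, the deduplication step can be sketched in base R with `unique()`; the rest is the same `crossprod(table(...))` call. `pairs_unique`, `incidence` and `shared` are illustrative names, and the data frame is rebuilt from the question's example:

```r
# Example data from the question, rebuilt compactly
data <- data.frame(
  category = c("A","B","B","A","C","C","B","A",
               "C","B","B","B","C","A","A","C"),
  user_id  = c(464L,345L,342L,312L,345L,234L,423L,464L,
               756L,756L,345L,345L,464L,345L,234L,312L)
)

# Drop repeated (user_id, category) rows
pairs_unique <- unique(data[, c("user_id", "category")])

# User-by-category 0/1 table, then t(X) %*% X via crossprod()
incidence <- table(pairs_unique$user_id, pairs_unique$category)
shared <- crossprod(incidence)
shared
```

The diagonal holds the number of distinct users in each category; the off-diagonal entries are the users shared by the two categories.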

0

Alex woolford Jun 08 '15 at 4:33

josliber · Accepted Answer · 2015-06-08T01:39:03+0000

In R, you can first collect, for each user, every pair of distinct categories that user belongs to:

data$category <- as.character(data$category)
(combos <- do.call(rbind, tapply(data$category, data$user_id, function(x) {
  u <- unique(x)
  if (length(u) > 1) t(combn(u, 2))
  else NULL
})))
#      [,1] [,2]
# [1,] "C"  "A" 
# [2,] "A"  "C" 
# [3,] "B"  "C" 
# [4,] "B"  "A" 
# [5,] "C"  "A" 
# [6,] "A"  "C" 
# [7,] "C"  "B"

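Then count the pairs with R's table function. Adding the table of (a, b) pairs to the table of (b, a) pairs gives a symmetric count for each a and b: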

table(combos[,1], combos[,2]) + table(combos[,2], combos[,1])
#     A B C
#   A 0 1 4
#   B 1 0 2
#   C 4 2 0
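
One caveat: adding the two tables relies on every category appearing in both columns of `combos`, otherwise the tables have different shapes and cannot be added. A hedged sketch that avoids this by giving both columns the same factor levels (the `combos` matrix here is copied from the output above; `lev`, `tab` and `sym` are illustrative names):

```r
# Pairs copied from the combos output shown above
combos <- rbind(c("C", "A"), c("A", "C"), c("B", "C"), c("B", "A"),
                c("C", "A"), c("A", "C"), c("C", "B"))

# Shared levels guarantee a square table even if a category
# happens to be missing from one of the two columns
lev    <- sort(unique(c(combos)))
first  <- factor(combos[, 1], levels = lev)
second <- factor(combos[, 2], levels = lev)

# Count ordered pairs, then add the transpose to make it symmetric
tab <- table(first, second)
sym <- tab + t(tab)
sym
```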
