Given my input in (userid, itemid) format:
    raw: {userid: bytearray,itemid: bytearray}

    dump raw;
    (A,1)
    (A,2)
    (A,4)
    (A,5)
    (B,2)
    (B,3)
    (B,5)
    (C,1)
    (C,5)

    grpd = GROUP raw BY userid;
    dump grpd;
    (A,{(A,1),(A,2),(A,4),(A,5)})
    (B,{(B,2),(B,3),(B,5)})
    (C,{(C,1),(C,5)})
I would like to generate all combinations (order is not important) of the items within each group. Eventually, I intend to compute Jaccard similarity on the items in each group.
Ideally, the bigrams would be generated, and then I would FLATTEN the output so that it looks like this:
(A, (1,2)) (A, (1,4)) (A, (1,5)) (A, (2,4)) (A, (2,5)) (A, (4,5))
(B, (2,3)) (B, (2,5)) (B, (3,5))
(C, (1,5))
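To make the goal concrete, here is a minimal Python sketch of the pair generation I have in mind. It is not Pig code; it only mimics the GROUP and FLATTEN steps in memory, and the names `group_items` and `item_pairs` are made up for illustration.

```python
from itertools import combinations

# Sample (userid, itemid) rows, copied from the dump of `raw`.
raw = [("A", 1), ("A", 2), ("A", 4), ("A", 5),
       ("B", 2), ("B", 3), ("B", 5),
       ("C", 1), ("C", 5)]

def group_items(rows):
    """Mimic GROUP raw BY userid: map each userid to its sorted, unique items."""
    groups = {}
    for userid, itemid in rows:
        groups.setdefault(userid, set()).add(itemid)
    return {userid: sorted(items) for userid, items in groups.items()}

def item_pairs(rows):
    """Return flattened (userid, (item1, item2)) tuples, one per unordered pair."""
    out = []
    for userid, items in sorted(group_items(rows).items()):
        # combinations() emits each unordered pair exactly once.
        out.extend((userid, pair) for pair in combinations(items, 2))
    return out

pairs = item_pairs(raw)
for userid, pair in pairs:
    print(userid, pair)
```

For the sample rows this yields 6 pairs for A, 3 for B, and 1 for C. A user with n items produces n*(n-1)/2 pairs, so large groups get expensive quickly.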
The letters A, B, and C, which represent the user ID, are not really needed in the output; I show them only for illustration. From there, I would count the number of occurrences of each bigram to calculate Jaccard. I would really like to know whether anyone else uses Pig for this kind of similarity calculation and has already run into this.
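The counting step could be sketched as follows, again in plain Python rather than Pig. It assumes the usual item-to-item Jaccard definition: the number of users who have both items, divided by the number of users who have either one. `jaccard_by_pair` is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations

# Sample (userid, itemid) rows, copied from the dump of `raw`.
raw = [("A", 1), ("A", 2), ("A", 4), ("A", 5),
       ("B", 2), ("B", 3), ("B", 5),
       ("C", 1), ("C", 5)]

def jaccard_by_pair(rows):
    """Return {(item1, item2): jaccard} for every pair that co-occurs for a user."""
    items_by_user = {}
    for userid, itemid in rows:
        items_by_user.setdefault(userid, set()).add(itemid)

    item_count = Counter()  # number of users per item
    pair_count = Counter()  # number of users per unordered item pair
    for items in items_by_user.values():
        ordered = sorted(items)
        item_count.update(ordered)
        pair_count.update(combinations(ordered, 2))

    # |U(a) & U(b)| / |U(a) | U(b)|, using |union| = n_a + n_b - n_both
    return {
        (a, b): both / (item_count[a] + item_count[b] - both)
        for (a, b), both in pair_count.items()
    }

scores = jaccard_by_pair(raw)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 3))
```

Pairs that never co-occur for any user do not appear in the result; their Jaccard value is simply 0.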
I looked at the NGramGenerator that comes with the Pig tutorials, but it doesn't really match what I'm trying to accomplish. I am wondering whether a Python streaming UDF might be the way to go.
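If streaming turns out to be the answer, the script might look roughly like this. It is a sketch under two assumptions: rows reach the script as tab-separated `userid<TAB>itemid` lines, and all rows for one userid arrive contiguously on the same stream, which the Pig job would have to guarantee. The function names are made up.

```python
import sys
from itertools import combinations, groupby

def pairs_from_lines(lines):
    """Yield 'item1<TAB>item2' for each unordered item pair within one user.

    Assumes tab-separated 'userid<TAB>itemid' lines, with all lines for a
    given userid adjacent to each other (groupby only merges neighbours).
    """
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for _userid, user_rows in groupby(rows, key=lambda row: row[0]):
        # Item ids are strings here, so the sort is lexicographic.
        items = sorted({row[1] for row in user_rows})
        for a, b in combinations(items, 2):
            yield f"{a}\t{b}"

def main(stream_in, stream_out):
    """Copy every generated pair to the output stream, one per line."""
    for out_line in pairs_from_lines(stream_in):
        stream_out.write(out_line + "\n")
```

To use it as a streaming script, call `main(sys.stdin, sys.stdout)` at the bottom of the file. The userid is dropped from the output because only the pair counts matter for Jaccard.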