Generating bigram combinations from grouped data in Pig

Given my input in (userid, itemid) format:

    raw: {userid: bytearray,itemid: bytearray}

    dump raw;
    (A,1)
    (A,2)
    (A,4)
    (A,5)
    (B,2)
    (B,3)
    (B,5)
    (C,1)
    (C,5)

    grpd = GROUP raw BY userid;
    dump grpd;
    (A,{(A,1),(A,2),(A,4),(A,5)})
    (B,{(B,2),(B,3),(B,5)})
    (C,{(C,1),(C,5)})

I would like to generate all pairwise combinations (order is not important) of the items within each group. Ultimately, I intend to calculate the Jaccard similarity between the items in my groups.

Ideally, the bigrams would be generated and I would then FLATTEN the output so that it looks like this:

    (A, (1,2))
    (A, (1,3))
    (A, (1,4))
    (A, (2,3))
    (A, (2,4))
    (A, (3,4))
    (B, (1,2))
    (B, (2,3))
    (B, (3,5))
    (C, (1,5))

The letters A, B, and C, which represent the user IDs, are not really needed in the output; I show them only for illustration. From there, I would count the number of occurrences of each bigram in order to calculate the Jaccard similarity. I would really like to know whether anyone else uses Pig for this kind of similarity calculation and has already run into this.
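To make the counting step concrete, here is a minimal Python sketch of how pair counts could be turned into Jaccard scores. It runs in memory, outside Pig, and the function name `jaccard_from_baskets` and its dict input are assumptions made for illustration only. For two items a and b, the score is the number of users who have both, divided by the number of users who have either.

```python
from collections import defaultdict


def jaccard_from_baskets(baskets):
    """Compute Jaccard similarity for every item pair that co-occurs.

    `baskets` is a hypothetical in-memory stand-in for the grouped data:
    a dict mapping userid -> iterable of itemids.
    """
    item_users = defaultdict(int)  # number of users who have each item
    pair_users = defaultdict(int)  # number of users who have both items

    for items in baskets.values():
        unique = sorted(set(items))  # de-duplicate, fix an order
        for i in range(len(unique)):
            item_users[unique[i]] += 1
            # j > i: each unordered pair is counted once per user
            for j in range(i + 1, len(unique)):
                pair_users[(unique[i], unique[j])] += 1

    scores = {}
    for (a, b), both in pair_users.items():
        # size of union = users(a) + users(b) - users with both
        union = item_users[a] + item_users[b] - both
        scores[(a, b)] = both / float(union)
    return scores


if __name__ == "__main__":
    data = {"A": [1, 2, 4, 5], "B": [2, 3, 5], "C": [1, 5]}
    for pair, score in sorted(jaccard_from_baskets(data).items()):
        print(pair, round(score, 3))
```

Pairs that never co-occur for any user do not appear in the result, which corresponds to a similarity of zero.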

I looked at the NGramGenerator that is supplied with the Pig tutorials, but it doesn't really match what I'm trying to accomplish. I am wondering whether a Python streaming UDF might be the way to go.

1 answer

You will definitely have to write a UDF (in Python or Java; either is fine). You would want it to take a bag as input and return a bag: if you FLATTEN a bag of tuples, you get one output row per tuple, which gives you the output you want.

The UDF itself will not be terribly difficult. Something like:

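    # input_tuples: one user's bag, e.g. [('A', 1), ('A', 2), ('A', 4)]
    letter, number = zip(*input_tuples)
    number = sorted(set(number))  # drop duplicate items, fix an order
    res = []
    for i in range(len(number)):
        for j in range(i + 1, len(number)):  # j > i: no self-pairs, no mirrored pairs
            res.append((number[i], number[j]))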

and then just convert the results to the types Pig expects and return them.
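As a slightly fuller sketch, the fragment above could be wrapped in a function like the one below. The name `item_pairs` is hypothetical, and the sketch assumes the bag reaches the UDF as a list of (userid, itemid) tuples. Registering the function with Pig and declaring its output schema are deliberately left out. Nested loops are used instead of `itertools.combinations` so that the code does not depend on a newer Python version.

```python
def item_pairs(bag):
    """Return every unordered pair of distinct item ids in one user's bag.

    `bag` is assumed to be a list of (userid, itemid) tuples, i.e. the
    contents of one group produced by GROUP ... BY userid.
    """
    # Keep only the item ids, drop duplicates, and sort for a stable order.
    items = sorted(set([t[1] for t in bag]))
    pairs = []
    for i in range(len(items)):
        # Start at i + 1 so an item is never paired with itself and
        # (a, b) / (b, a) is emitted only once.
        for j in range(i + 1, len(items)):
            pairs.append((items[i], items[j]))
    return pairs


if __name__ == "__main__":
    group_a = [("A", 1), ("A", 2), ("A", 4), ("A", 5)]
    print(item_pairs(group_a))
    # [(1, 2), (1, 4), (1, 5), (2, 4), (2, 5), (4, 5)]
```

A group with n distinct items yields n(n-1)/2 pairs, so a few very large groups can dominate the output size.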

If you need help creating a simple Python UDF, it's not too bad. See the documentation here: http://pig.apache.org/docs/r0.8.0/udf.html

And, of course, feel free to ask for further help here.
