Given my input data in userid,itemid format:
raw: {userid: bytearray,itemid: bytearray}
dump raw;
(A,1)
(A,2)
(A,4)
(A,5)
(B,2)
(B,3)
(B,5)
(C,1)
(C,5)
grpd = GROUP raw BY userid;
dump grpd;
(A,{(A,1),(A,2),(A,4),(A,5)})
(B,{(B,2),(B,3),(B,5)})
(C,{(C,1),(C,5)})
I'd like to generate all of the combinations (order not important) of items within each group. I eventually intend to compute Jaccard similarity on the items in my group.
Ideally, the bigrams would be generated and then I'd FLATTEN the output to look like:
(A, (1,2))
(A, (1,4))
(A, (1,5))
(A, (2,4))
(A, (2,5))
(A, (4,5))
(B, (2,3))
(B, (2,5))
(B, (3,5))
(C, (1,5))
The letters A, B, C, which represent the userid, are not really necessary for the output; I'm just showing them for illustrative purposes. From there, I would count the number of occurrences of each bigram in order to compute Jaccard. I'd love to know if anyone else is using Pig for similar similarity calcs (sorry!) and has encountered this already.
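To make the end goal concrete, here is a rough sketch in plain Python (not Pig) of the calculation I'm after. It assumes item-to-item Jaccard, i.e. the number of users who have both items divided by the number of users who have either one. The function name item_jaccard is only illustrative, and this is meant to show the arithmetic on the sample data, not a scalable solution.

```python
from collections import Counter, defaultdict
from itertools import combinations


def item_jaccard(rows):
    """Return {(item_a, item_b): jaccard} from (userid, itemid) rows.

    Assumption: Jaccard is taken between items, over the sets of users
    that have each item. Pairs that never co-occur are simply absent.
    """
    items_by_user = defaultdict(set)
    for userid, itemid in rows:
        items_by_user[userid].add(itemid)

    item_counts = Counter()  # number of users that have each item
    pair_counts = Counter()  # number of users that have both items of a pair
    for items in items_by_user.values():
        item_counts.update(items)
        # sorted() gives each unordered pair one canonical form (a < b)
        pair_counts.update(combinations(sorted(items), 2))

    # intersection / union, where union = count(a) + count(b) - intersection
    return {
        (a, b): both / (item_counts[a] + item_counts[b] - both)
        for (a, b), both in pair_counts.items()
    }


rows = [("A", 1), ("A", 2), ("A", 4), ("A", 5),
        ("B", 2), ("B", 3), ("B", 5),
        ("C", 1), ("C", 5)]
scores = item_jaccard(rows)
for pair in sorted(scores):
    print(pair, round(scores[pair], 3))
```

In Pig terms, pair_counts corresponds to the bigram counts I describe above, and item_counts would come from a separate GROUP BY itemid.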
I've looked at the NGramGenerator that's supplied with the Pig tutorials, but it doesn't really match what I'm trying to accomplish. I'm wondering if perhaps a Python streaming UDF is the way to go.
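For the streaming idea, something along these lines is what I have in mind. It is only a sketch under assumptions: each input line is "userid<TAB>itemid", all lines for a given user arrive contiguously (for example because the data was sorted by userid first), and emit_pairs is a name I made up. I haven't checked how this would behave once Pig splits the input across tasks.

```python
from itertools import combinations, groupby


def emit_pairs(lines):
    """Yield (userid, item_a, item_b) for every unordered item pair per user.

    Assumes tab-separated "userid<TAB>itemid" lines, with all lines for
    one user next to each other. Items are compared as strings.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for userid, group in groupby(rows, key=lambda row: row[0]):
        # De-duplicate and sort so each pair is emitted once, smaller item first.
        items = sorted({row[1] for row in group})
        for a, b in combinations(items, 2):
            yield userid, a, b


# Small demo on a slice of the sample data.
sample = ["A\t1\n", "A\t2\n", "A\t4\n", "B\t2\n", "B\t3\n"]
for userid, a, b in emit_pairs(sample):
    print(f"{userid}\t{a}\t{b}")
```

In an actual streaming script the same function would be fed sys.stdin instead of the sample list, with each result printed as one tab-separated line.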