Question

Given my input data in userid,itemid format:

raw: {userid: bytearray,itemid: bytearray}

dump raw;
(A,1)
(A,2)
(A,4)
(A,5)
(B,2)
(B,3)
(B,5)
(C,1)
(C,5)

grpd = GROUP raw BY userid;

dump grpd;

(A,{(A,1),(A,2),(A,4),(A,5)})
(B,{(B,2),(B,3),(B,5)})
(C,{(C,1),(C,5)})

I'd like to generate all of the combinations (order not important) of items within each group. I eventually intend to compute Jaccard similarity on the items in my group.

Ideally the bigrams would be generated and then I'd FLATTEN the output to look like:

(A, (1,2))
(A, (1,4))
(A, (1,5))
(A, (2,4))
(A, (2,5))
(A, (4,5))
(B, (2,3))
(B, (2,5))
(B, (3,5))
(C, (1,5))

The letters A, B and C, which represent the userid, are not really necessary for the output; I'm just showing them for illustrative purposes. From there, I would count the number of occurrences of each bigram in order to compute Jaccard. I'd love to know if anyone else is using Pig for similar similarity calcs (sorry!) and has encountered this already.

I've looked at the NGramGenerator that's supplied with the Pig tutorials, but it doesn't really match what I'm trying to accomplish. I'm wondering if perhaps a Python streaming UDF is the way to go.

Answer 1

You are definitely going to have to write a UDF (in Python or Java, either would be fine). You would want it to take a bag and output a bag: if you FLATTEN a bag of tuples, you get one output row per tuple, which gives you the output that you want.

The UDF itself would not be terribly difficult. Something like:

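# input_tuples is the bag for one user: a list of (userid, itemid) tuples
letters, numbers = zip(*input_tuples)
numbers = sorted(set(numbers))  # drop duplicate items, fix the order

res = []
for i in range(len(numbers)):
    for j in range(i + 1, len(numbers)):  # start at i + 1 to skip (x, x) self-pairs
        res.append((numbers[i], numbers[j]))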

and then just cast things and return them appropriately.
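
The same idea can be written more compactly with itertools.combinations. This is only a sketch in plain Python, outside of Pig: the function name item_pairs is made up for illustration, and it assumes the bag arrives as a list of (userid, itemid) tuples. Wiring it into Pig as a UDF (output schema, type casts) is left out here.

```python
from itertools import combinations


def item_pairs(bag):
    """Return every unordered pair of distinct item ids in one user's bag.

    `bag` is assumed to be a list of (userid, itemid) tuples, i.e. the
    contents of one group after GROUP raw BY userid.
    """
    # Deduplicate and sort so each pair appears once, smaller id first.
    items = sorted(set(itemid for _userid, itemid in bag))
    return list(combinations(items, 2))


# User A from the question has items 1, 2, 4 and 5.
print(item_pairs([("A", 1), ("A", 2), ("A", 4), ("A", 5)]))
# prints [(1, 2), (1, 4), (1, 5), (2, 4), (2, 5), (4, 5)]
```

Sorting before pairing means (1,2) and (2,1) can never both show up for different users, which matters once you start counting pairs across groups.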

Writing a simple Python UDF is not too bad. If you need any help, check here: http://pig.apache.org/docs/r0.8.0/udf.html

And of course feel free to ask for more help here.
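
For the follow-on step mentioned in the question (counting how often each pair occurs in order to compute Jaccard), here is a rough sketch of the arithmetic in plain Python rather than Pig. It assumes Jaccard for two items means the number of users who have both, divided by the number of users who have either. The name jaccard_by_pair is hypothetical, and in Pig the two counts would more likely come from GROUP and COUNT than from in-memory dictionaries.

```python
from collections import Counter
from itertools import combinations


def jaccard_by_pair(rows):
    """Map each co-occurring item pair to its Jaccard similarity.

    `rows` is assumed to be an iterable of (userid, itemid) tuples.
    """
    # Collect the distinct items seen for each user.
    items_by_user = {}
    for userid, itemid in rows:
        items_by_user.setdefault(userid, set()).add(itemid)

    item_counts = Counter()  # number of users per item
    pair_counts = Counter()  # number of users per unordered item pair
    for items in items_by_user.values():
        item_counts.update(items)
        pair_counts.update(combinations(sorted(items), 2))

    # |users(a) & users(b)| / |users(a) | users(b)|
    return {
        (a, b): both / (item_counts[a] + item_counts[b] - both)
        for (a, b), both in pair_counts.items()
    }


rows = [("A", 1), ("A", 2), ("A", 4), ("A", 5),
        ("B", 2), ("B", 3), ("B", 5),
        ("C", 1), ("C", 5)]
scores = jaccard_by_pair(rows)
print(scores[(2, 3)])  # prints 0.5
```

With the sample data, items 2 and 3 share one user (B) out of two users who have either item (A and B), hence 0.5. Pairs that never co-occur are simply absent from the result, which corresponds to a similarity of 0.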
