Count the number of distinct field values using a Pig script

Given a tab-separated file of the form

    A    B    user1
    C    D    user2
    A    D    user3
    A    D    user1

I want to count the distinct values of field 3, i.e. count(distinct(user1, user2, user3, user1)) = 3
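As a sanity check on the expected result, here is a minimal Python sketch (not Pig) that counts the distinct values of the third column. The `SAMPLE` string and the `count_distinct` helper are hypothetical names introduced for illustration, and the sketch assumes the sample rows are tab-separated with three fields each.

```python
import csv
import io

# Sample rows from the question, assumed tab-separated as (a1, a2, a3).
SAMPLE = "A\tB\tuser1\nC\tD\tuser2\nA\tD\tuser3\nA\tD\tuser1\n"


def count_distinct(text, field_index=2, delimiter="\t"):
    """Count the distinct values in one column of delimited text."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    # Project one column (like FOREACH ... GENERATE $2), then
    # deduplicate with a set (like DISTINCT) and count the result.
    values = {row[field_index] for row in reader if row}
    return len(values)


print(count_distinct(SAMPLE))  # prints 3
```

This only confirms what the Pig script should produce on the sample data. It holds all distinct values in memory, so it is not a substitute for Pig on large inputs.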

I do this using the following Pig script:

    A = load 'myTestData' using PigStorage('\t') as (a1,a2,a3);
    user_list = foreach A GENERATE $2;
    unique_users = DISTINCT user_list;
    unique_users_group = GROUP unique_users ALL;
    uu_count = FOREACH unique_users_group GENERATE COUNT(unique_users);
    store uu_count into 'output';

Is there a better way to get the number of distinct field values?

2 answers

A more modern way to do this:

    user_data = LOAD 'myTestData' USING PigStorage('\t') AS (a1,a2,a3);
    users = FOREACH user_data GENERATE a3;
    uniq_users = DISTINCT users;
    grouped_users = GROUP uniq_users ALL;
    uniq_user_count = FOREACH grouped_users GENERATE COUNT(uniq_users);
    DUMP uniq_user_count;

DUMP prints the result, (3), to your console output instead of storing it in a file.


I have one here that is a bit more concise. You might want to check which one is faster.

    A = LOAD 'myTestData' USING PigStorage('\t') AS (a1,a2,a3);
    unique_users_group = GROUP A ALL;
    uu_count = FOREACH unique_users_group {
        user = A.a3;
        uniq = DISTINCT user;
        GENERATE COUNT(uniq);
    };
    STORE uu_count INTO 'output';