Using the Apache Pig Rank Function

I am using the RANK function in Pig 0.11.0 to generate ranks for each id in my data. I need the ranking to work in a particular way: the rank should reset and start again at 1 for each new id.

Is it possible to do this with the RANK function directly? Any advice would be appreciated.

Data:

id,rating
X001, 9
X001, 9
X001, 8
X002, 9
X002, 7
X002, 6
X002, 5
X003, 8
X004, 8
X004, 7
X004, 7
X004, 4

When I use the RANK function, for example:

op = rank data by id, rating;

I get this output:

rank,id,rating
1, X001, 9
1, X001, 9
2, X001, 8
3, X002, 9
4, X002, 7
5, X002, 6
6, X002, 5
7, X003, 8
8, X004, 8
9, X004, 7
9, X004, 7
10, X004, 4

Desired output:

rank,id,rating
1, X001, 9
1, X001, 9
2, X001, 8
1, X002, 9
2, X002, 7
3, X002, 6
4, X002, 5
1, X003, 8
1, X004, 8
2, X004, 7
2, X004, 7
3, X004, 4
2 answers

You can group your data by id and then use the Enumerate UDF (from DataFu) to append an index to each tuple in each bag.

register datafu-1.1.0.jar;
define Enumerate datafu.pig.bags.Enumerate('1');

data = load 'data' using PigStorage(',') as (id:chararray, rating:int);
data = group data by id;
data = foreach data {
    sorted = order data by rating DESC;
    generate group, sorted;
}
data = foreach data generate FLATTEN(Enumerate(sorted));
data = foreach data generate $2, $0, $1;
dump data;

The DataFu jar file can be downloaded from the Maven Central repository: http://search.maven.org/#search|ga|1|g%3A%22com.linkedin.datafu%22
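
One caveat: Enumerate appends a consecutive position index, so two rows with the same id and rating (for example the two X001, 9 rows) would get 1 and 2 rather than the tied rank 1 shown in the desired output. If ties should share a rank, one possible variant is to enumerate only the distinct (id, rating) pairs and then join the index back onto the original rows. The following is an untested sketch under that assumption; the aliases pairs, by_id, ranked, joined, result and the field name rnk are made up for illustration and are not part of the answer above.

```pig
register datafu-1.1.0.jar;
define Enumerate datafu.pig.bags.Enumerate('1');

data = load 'data' using PigStorage(',') as (id:chararray, rating:int);

-- Keep one row per distinct (id, rating) pair so ties collapse to one entry.
pairs = distinct data;

-- Number the distinct ratings within each id, highest rating first.
by_id = group pairs by id;
ranked = foreach by_id {
    sorted = order pairs by rating DESC;
    generate FLATTEN(Enumerate(sorted)) as (id, rating, rnk);
}

-- Attach the per-id rank back to every original row, duplicates included.
joined = join data by (id, rating), ranked by (id, rating);
result = foreach joined generate ranked::rnk as rnk, data::id as id, data::rating as rating;

-- A join does not guarantee row order, so sort explicitly before dumping.
result = order result by id, rnk;
dump result;
```

The extra join costs another MapReduce pass, so if distinct positions per row are acceptable, the simpler Enumerate version is likely the better choice.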


You can use the RANK function as shown below:

B = rank A by rating DESC;
dump B;

Note: this assumes that A has the (id, rating) schema from your example.
