Let's say I have a set of restaurant reviews:
    User,City,Restaurant,Rating
    Jim,New York,Mecurials,3
    Jim,New York,Whapme,4.5
    Jim,London,Pint Size,2
    Lisa,London,Pint Size,4
    Lisa,London,Rabbit Whole,3.5
I want to compute the average rating for each (user, city) pair. That is, the output should be:
    User,City,AverageRating
    Jim,New York,3.75
    Jim,London,2
    Lisa,London,3.75
I could write a Pig script as follows:
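    -- Load the comma-separated reviews with an explicit schema
    Data = LOAD 'data.txt' USING PigStorage(',') AS (
        user:chararray, city:chararray, restaurant:chararray, rating:float
    );
    -- Group on the composite key, then average the ratings in each group
    PerUserCity = GROUP Data BY (user, city);
    ResultSet = FOREACH PerUserCity GENERATE group.user, group.city, AVG(Data.rating);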
However, I am curious whether I can group at the higher level (users) first, and then group at the next level (cities) inside each user group, i.e.:
    PerUser = GROUP Data BY user;
    Intermediate = FOREACH PerUser {
        B = GROUP Data BY city;
        GENERATE group AS user, B;
    }
I get:
    Error during parsing. Invalid alias: GROUP in { group: chararray, Data: { user: chararray, city: chararray, restaurant: chararray, rating: float } }
Has anyone done this successfully? Or is GROUP simply not allowed inside a nested FOREACH?
My goal is to do something like:
    ResultSet = FOREACH PerUser {
        FOREACH City {
            GENERATE user, city, AVG(City.rating)
        }
    }
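
To make the intent concrete, here is a minimal sketch of the two-level grouping I have in mind, written in plain Python rather than Pig. The function name `average_by_user_then_city` and the in-memory `reviews` list are only illustrative (they are not part of my script), and the rows are the sample data from above.

```python
from collections import defaultdict

# Sample reviews from above: (user, city, restaurant, rating)
reviews = [
    ("Jim", "New York", "Mecurials", 3.0),
    ("Jim", "New York", "Whapme", 4.5),
    ("Jim", "London", "Pint Size", 2.0),
    ("Lisa", "London", "Pint Size", 4.0),
    ("Lisa", "London", "Rabbit Whole", 3.5),
]


def average_by_user_then_city(rows):
    """Group by user first, then by city within each user, and average ratings."""
    # Level 1: one bag of (city, rating) pairs per user
    per_user = defaultdict(list)
    for user, city, _restaurant, rating in rows:
        per_user[user].append((city, rating))

    result = []
    for user, user_rows in per_user.items():
        # Level 2: inside a single user's bag, group the ratings by city
        per_city = defaultdict(list)
        for city, rating in user_rows:
            per_city[city].append(rating)
        for city, ratings in per_city.items():
            result.append((user, city, sum(ratings) / len(ratings)))
    return result


result_set = average_by_user_then_city(reviews)
for row in result_set:
    print(",".join(str(field) for field in row))
```

The outer loop corresponds to `FOREACH PerUser` and the inner grouping to the `GROUP Data BY city` step that Pig rejects; the end result should match what grouping on `(user, city)` gives directly.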