Can I generate nested bags using nested FOREACH statements in Pig Latin?

Let's say I have a set of restaurant reviews:

    User,City,Restaurant,Rating
    Jim,New York,Mecurials,3
    Jim,New York,Whapme,4.5
    Jim,London,Pint Size,2
    Lisa,London,Pint Size,4
    Lisa,London,Rabbit Whole,3.5

I want to compute the average rating for each user and city. That is, the output should be:

    User,City,AverageRating
    Jim,New York,3.75
    Jim,London,2
    Lisa,London,3.75

I could write a Pig script as follows:

    Data = LOAD 'data.txt' USING PigStorage(',') AS (
        user:chararray, city:chararray, restaurant:chararray, rating:float
    );
    PerUserCity = GROUP Data BY (user, city);
    ResultSet = FOREACH PerUserCity {
        GENERATE group.user, group.city, AVG(Data.rating);
    }

However, I am curious whether I can first group at the higher level (users) and then group at the next level (cities) afterwards, i.e.:

    PerUser = GROUP Data BY user;
    Intermediate = FOREACH PerUser {
        B = GROUP Data BY city;
        GENERATE group AS user, B;
    }

I get:

 Error during parsing. Invalid alias: GROUP in { group: chararray, Data: { user: chararray, city: chararray, restaurant: chararray, rating: float } } 

Has anyone tried this with success? Is GROUP simply not allowed inside FOREACH?

My goal is to do something like:

    ResultSet = FOREACH PerUser {
        FOREACH City {
            GENERATE user, city, AVG(City.rating)
        }
    }

5 answers

Currently, the only operations allowed inside FOREACH are DISTINCT, FILTER, LIMIT and ORDER BY.

For now, grouping directly by (user, city) is a good way to do what you describe.
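
The nested operations listed above can be sketched roughly as follows. This is only an illustration, assuming the schema from the question and a data.txt without a header row; the aliases Summary, city_names, uniq_cities, good, sorted and best are made up for the example.

```pig
-- Load the reviews with the schema from the question.
Data = LOAD 'data.txt' USING PigStorage(',')
       AS (user:chararray, city:chararray, restaurant:chararray, rating:float);

PerUser = GROUP Data BY user;

Summary = FOREACH PerUser {
    -- DISTINCT: the set of cities this user has reviewed in
    city_names  = Data.city;
    uniq_cities = DISTINCT city_names;
    -- FILTER: only this user's reviews rated 3.5 or higher
    good        = FILTER Data BY rating >= 3.5;
    -- ORDER BY + LIMIT: this user's single highest-rated review
    sorted      = ORDER Data BY rating DESC;
    best        = LIMIT sorted 1;
    GENERATE group AS user,
             COUNT(uniq_cities) AS num_cities,
             COUNT(good) AS num_good,
             FLATTEN(best.restaurant) AS best_restaurant;
}
```

Each nested statement works on the bag of rows for one user, which is why no further GROUP is needed for these cases.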


The release notes for Pig 0.10 indicate that nested FOREACH operations are now supported.
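
A minimal sketch of what that might look like, assuming Pig 0.10 or later and the schema from the question (Projected and city_ratings are illustrative names). As far as I know, the inner FOREACH can project fields from the grouped bag, but GROUP is still not among the operations accepted inside the nested block, so this does not regroup by city on its own.

```pig
-- Load the reviews with the schema from the question.
Data = LOAD 'data.txt' USING PigStorage(',')
       AS (user:chararray, city:chararray, restaurant:chararray, rating:float);

PerUser = GROUP Data BY user;

Projected = FOREACH PerUser {
    -- Nested FOREACH: keep only (city, rating) from each user's bag
    city_ratings = FOREACH Data GENERATE city, rating;
    GENERATE group AS user, city_ratings;
}
```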


Try the following:

    Records = load 'data_rating.txt' using PigStorage(',')
        as (user:chararray, city:chararray, restaurant:chararray, rating:float);
    grpRecs = group Records by (user, city);
    -- keep group in the projection so it can be flattened in the next step
    avgRating_Byuser_perCity = foreach grpRecs generate group, AVG(Records.rating) as average;
    Result = foreach avgRating_Byuser_perCity generate flatten(group), average;

Alternatively, filter out the header row before grouping:

    rawdata = load 'data' using PigStorage(',')
        as (user:chararray, city:chararray, restaurant:chararray, rating:float);
    -- drop the header line (User,City,Restaurant,Rating)
    data = filter rawdata by user != 'User';
    groupbyusercity = group data by (user, city);
    --describe groupbyusercity;
    --groupbyusercity: {group: (user: chararray,city: chararray),data: {(user: chararray,city: chararray,restaurant: chararray,rating: float)}}
    average = foreach groupbyusercity {
        generate group.user, group.city, AVG(data.rating);
    }
    dump average;
0
source

Grouping by two keys and then flattening the structure produces the same result.

Load the data as you did:

    Data = LOAD 'data.txt' USING PigStorage(',') AS (
        user:chararray, city:chararray, restaurant:chararray, rating:float
    );

Group by user and city:

  ByUserByCity = GROUP Data BY (user, city); 

Compute the average rating for each group (you can add more aggregates, for example COUNT(Data) AS count_res), then flatten the group tuple back into the original columns:

    ByUserByCityAvg = FOREACH ByUserByCity GENERATE
        FLATTEN(group) AS (user, city),
        AVG(Data.rating) AS user_city_avg;

Results in:

    Jim,London,2.0
    Jim,New York,3.75
    Lisa,London,3.75
    User,City,
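
If what you actually want is a nested bag of per-city averages for each user, one possible follow-up is to group the flattened result again by user. This is a sketch built on the steps above, not something from the original script; PerUser, Nested and city_avgs are illustrative names.

```pig
-- Steps from above: load, group by (user, city), average and flatten.
Data = LOAD 'data.txt' USING PigStorage(',')
       AS (user:chararray, city:chararray, restaurant:chararray, rating:float);
ByUserByCity = GROUP Data BY (user, city);
ByUserByCityAvg = FOREACH ByUserByCity GENERATE
    FLATTEN(group) AS (user, city),
    AVG(Data.rating) AS user_city_avg;

-- Regroup the flat (user, city, user_city_avg) rows by user.
PerUser = GROUP ByUserByCityAvg BY user;

-- Each output row holds the user plus a bag of (city, user_city_avg) tuples.
Nested = FOREACH PerUser GENERATE
    group AS user,
    ByUserByCityAvg.(city, user_city_avg) AS city_avgs;
```

Grouping in two passes like this gives the user-then-city nesting without needing GROUP inside FOREACH.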