I have more experience with R's data.table, but I'm trying to learn pandas. In data.table I can do something like this:
    > head(dt_m)
       event_id           device_id longitude latitude               time_ category
    1:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
    2:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
    3:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
    4:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
    5:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
    6:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
                     app_id is_active
    1: -5305696816021977482         0
    2: -7164737313972860089         0
    3: -8504475857937456387         0
    4: -8807740666788515175         0
    5:  5302560163370202064         0
    6:  5521284031585796822         0

    dt_m_summary <- dt_m[
        , .( mean_active = mean(is_active, na.rm = TRUE)
           , median_lat = median(latitude, na.rm = TRUE)
           , median_lon = median(longitude, na.rm = TRUE)
           , mean_time = mean(time_)
           , new_col = your_function(latitude, longitude, time_)
           )
        , by = list(device_id, category)
    ]
The new columns (mean_active through new_col), as well as device_id and category, will appear in dt_m_summary. I could also do a similar by-group transformation on the original table if I want a new column that holds the groupby-apply results:
    dt_m[, mean_active := mean(is_active, na.rm = TRUE), by = list(device_id, category)]
(in case I wanted, for example, to select rows where mean_active is greater than some threshold, or to do something else with it).
I know there is groupby in pandas, but I have not found a way to express simple transformations like the ones above. The best I could come up with is to run a series of groupby calls and then combine the results into a single DataFrame, but that seems very awkward. Is there a better way to do this?
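
For reference, here is a minimal sketch of how the two data.table patterns above might map onto pandas. It assumes named aggregation via `groupby(...).agg` for the summary table and `transform` for the `:=`-style column. The toy data and the body of `your_function` are made up for illustration; only the column names come from the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for dt_m (values are invented).
df = pd.DataFrame({
    "device_id": [1, 1, 2, 2],
    "category": ["a", "a", "b", "b"],
    "latitude": [10.0, 20.0, np.nan, 40.0],
    "longitude": [1.0, 3.0, 5.0, np.nan],
    "is_active": [0, 1, 1, 1],
})

keys = ["device_id", "category"]

# Rough analogue of dt_m[, .(...), by = list(device_id, category)].
# pandas' mean and median skip NaN by default, similar to na.rm = TRUE.
summary = df.groupby(keys, as_index=False).agg(
    mean_active=("is_active", "mean"),
    median_lat=("latitude", "median"),
    median_lon=("longitude", "median"),
)

# A function that needs several columns at once cannot go through named
# aggregation, so this sketch uses apply on the selected columns instead.
# The body is a placeholder, not the real your_function.
def your_function(group):
    return (group["latitude"] * group["longitude"]).sum()

new_col = (
    df.groupby(keys)[["latitude", "longitude"]]
    .apply(your_function)
    .rename("new_col")
    .reset_index()
)
summary = summary.merge(new_col, on=keys)

# Rough analogue of dt_m[, mean_active := mean(is_active), by = ...]:
# transform returns a result aligned with the original rows.
df["mean_active"] = df.groupby(keys)["is_active"].transform("mean")

print(summary)
print(df)
```

With the column added by `transform`, filtering on a threshold would then be an ordinary boolean mask such as `df[df["mean_active"] > 0.5]`.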