The pandas.DataFrame.query() method is very useful for (pre/post-)filtering data when loading or plotting. It is especially convenient for method chaining.
I often find that I want to apply the same logic to a pandas.Series, for example after calling a method such as df.value_counts that returns a pandas.Series.
Example
Suppose there is a huge table with columns Player, Game, Points, and I want to plot a histogram of the players who scored 3 points more than 14 times. First I have to sum up the points of each player (groupby -> agg), which returns a Series of 1000 players and their total points. With .query logic applied to a Series, it would look something like this:
    import random
    import pandas as pd

    df = pd.DataFrame({
        'Points': [random.choice([1, 3]) for x in range(100)],
        'Player': [random.choice(["A", "B", "C"]) for x in range(100)]})

    (df
        .query("Points == 3")
        .Player.value_counts()
        .query("> 14")  # wished-for syntax: pandas.Series has no .query method
        .hist())
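To make the aggregation step concrete, here is a minimal sketch on a small hand-written table. The data and the names total_points and three_pointers are made up for illustration. Both groupby -> agg and groupby -> size return a pandas.Series, which is the point where the chained filtering would have to continue:

```python
import pandas as pd

# Hypothetical toy data with the same columns as in the example above.
df = pd.DataFrame({
    "Player": ["A", "B", "A", "C", "B", "A"],
    "Points": [3, 1, 3, 3, 1, 1],
})

# Total points per player: groupby -> agg returns a Series indexed by Player.
total_points = df.groupby("Player")["Points"].agg("sum")

# Number of 3-point scores per player: size() also returns a Series.
three_pointers = df.query("Points == 3").groupby("Player").size()

print(type(total_points).__name__)
print(total_points.to_dict())
print(three_pointers.to_dict())
```

Player B never scores 3 points here, so B does not appear in three_pointers at all.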
The only solutions I have found force me to make an unnecessary assignment and break the method chain:
    points_series = (df
        .query("Points == 3")
        .groupby("Player").size())
    points_series[points_series > 14].hist()
Method chaining and the query method both help keep the code legible, whereas filtering by subsetting can get quite messy:
    # just to make my point :)
    series_bestplayers_under_100[series_prefiltered_under_100 > 0].shape
Please help me out of my dilemma! Thanks
python pandas dataframe method-chaining series
dmeu