Is there a query method or similar for pandas Series (pandas.Series.query())?

The pandas.DataFrame.query() method is very useful for pre- or post-filtering data when loading or plotting it. It is especially convenient for method chaining.

I often find that I want to apply the same logic to a pandas.Series, for example after calling a method such as df.value_counts that returns a pandas.Series.

Example

Suppose there is a huge table with the columns Player, Game, and Points, and I want to plot a histogram of the players who scored 3 points more than 14 times. First I have to aggregate the points per player (groupby -> agg), which returns a Series of about 1000 players and their totals. Applying .query logic, it would look something like this:

    import random
    import pandas as pd

    df = pd.DataFrame({
        'Points': [random.choice([1, 3]) for x in range(100)],
        'Player': [random.choice(["A", "B", "C"]) for x in range(100)]})

    (df
        .query("Points == 3")
        .Player.value_counts()
        .query("> 14")    # wished-for syntax: Series has no .query method
        .hist())

The only solutions I have found force me to make an unnecessary assignment and break the method chain:

    points_series = (df
        .query("Points == 3")
        .groupby("Player").size())
    points_series[points_series > 100].hist()

Method chaining, like the query method, helps keep the code legible, whereas subset filtering with boolean indexing can get quite confusing.

    # just to make my point :)
    series_bestplayers_under_100[series_prefiltered_under_100 > 0].shape

Please help me out of my dilemma! Thanks

python pandas dataframe method-chaining series
3 answers

IIUC, you can add query("Points > 100"):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'Points': [50, 20, 38, 90, 0, np.inf],
                       'Player': ['a', 'a', 'a', 's', 's', 's']})
    print(df)
      Player     Points
    0      a  50.000000
    1      a  20.000000
    2      a  38.000000
    3      s  90.000000
    4      s   0.000000
    5      s        inf

    # Boolean indexing on the aggregated Series (breaks the chain)
    points_series = df.query("Points < inf").groupby("Player").agg({"Points": "sum"})['Points']
    print(points_series)
    a = points_series[points_series > 100]
    print(a)
    Player
    a    108.0
    Name: Points, dtype: float64

    # Staying in a DataFrame keeps .query available (result is a DataFrame)
    points_series = (df.query("Points < inf")
                       .groupby("Player")
                       .agg({"Points": "sum"})
                       .query("Points > 100"))
    print(points_series)
            Points
    Player
    a        108.0

Another solution is selection by callable:

    points_series = (df.query("Points < inf")
                       .groupby("Player")
                       .agg({"Points": "sum"})['Points']
                       .loc[lambda x: x > 100])
    print(points_series)
    Player
    a    108.0
    Name: Points, dtype: float64

Edited answer for the edited question:

    np.random.seed(1234)
    df = pd.DataFrame({
        'Points': [np.random.choice([1, 3]) for x in range(100)],
        'Player': [np.random.choice(["A", "B", "C"]) for x in range(100)]})

    print(df.query("Points == 3").Player.value_counts().loc[lambda x: x > 15])
    C    19
    B    16
    Name: Player, dtype: int64

    print(df.query("Points == 3").groupby("Player").size().loc[lambda x: x > 15])
    Player
    B    16
    C    19
    dtype: int64

Why not convert the Series to a DataFrame, run the query, and then convert back?

 df["Points"] = df["Points"].to_frame().query('Points > 100')["Points"] 

Here, .to_frame() converts the Series to a DataFrame, and the trailing ["Points"] converts the result back to a Series.

This way, the .query() method can be used consistently whether the pandas object has one column or several.
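
To show how this round trip fits into a method chain like the one in the question, here is a minimal sketch. The toy frame, the column name "n", and the threshold are assumptions made up for illustration. The sketch uses groupby(...).size() and names the column explicitly via to_frame("n"), so the query string does not depend on how a given pandas version names the result.

```python
import pandas as pd

# Toy data (hypothetical): one row per basket scored.
df = pd.DataFrame({
    "Player": ["A", "A", "A", "B", "B", "C"],
    "Points": [3, 3, 3, 3, 1, 3],
})

threes = (
    df.query("Points == 3")
      .groupby("Player").size()   # Series indexed by Player
      .to_frame("n")              # one-column DataFrame, column named "n"
      .query("n > 1")             # DataFrame.query is available again
      ["n"]                       # select the column to get a Series back
)
print(threes)
```

The chain stays unbroken, at the cost of two extra conversion steps compared with .loc and a callable.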

+3
source share

Instead of query, you can use pipe:

 s.pipe(lambda x: x[x>0]).pipe(lambda x: x[x<10]) 
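
If the lambdas feel terse, the same filter can be sketched with a small named function. Note that keep_between is a hypothetical helper written for this example, not part of pandas, and the sample Series is invented.

```python
import pandas as pd

def keep_between(s, low, high):
    """Keep values strictly between low and high (hypothetical helper)."""
    return s[(s > low) & (s < high)]

s = pd.Series([-5, 0, 3, 7, 10, 42])

# Series.pipe passes the Series as the first argument and forwards the rest.
filtered = s.pipe(keep_between, 0, 10)
print(filtered.tolist())  # [3, 7]
```

A named helper can be reused across chains, while the two-lambda version needs no extra definition.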
