Pandas string matching is slow

Basically, I want to know a faster way to slice a Pandas DataFrame with a boolean condition based on a regular expression. For example, take the following df (there are more than 4 variations in the string columns; these are for illustrative purposes only):

    index   string_col1                string_col2       value
    0       'apple'                    'this'            10
    1       'pen'                      'is'              123
    2       'pineapple'                'sparta'          20
    3       'pen pineapple apple pen'  'this'            234
    4       'apple'                    'is'              212
    5       'pen'                      'sparta'          50
    6       'pineapple'                'this'            69
    7       'pen pineapple apple pen'  'is'              79
    8       'apple pen'                'sparta again'    78
    ...
    100000  'pen pineapple apple pen'  'this is sparta'  392
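(For anyone who wants to reproduce this, the rows shown above can be built as follows — the '...' rows are left out, since their contents aren't given:)

    import pandas as pd

    df = pd.DataFrame({
        'string_col1': ['apple', 'pen', 'pineapple', 'pen pineapple apple pen',
                        'apple', 'pen', 'pineapple', 'pen pineapple apple pen',
                        'apple pen'],
        'string_col2': ['this', 'is', 'sparta', 'this', 'is', 'sparta',
                        'this', 'is', 'sparta again'],
        'value': [10, 123, 20, 234, 212, 50, 69, 79, 78],
    })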

I need to do a boolean conditional slice on the string columns using regular expressions, find the indexes of the minimum and maximum in the value column, and then take the difference between those two values. I do it the following way, but it is SUPER SLOW when I have to match many different regex patterns:

    import re

    pat1 = re.compile('apple')
    pat2 = re.compile('sparta')
    mask = (df['string_col1'].str.contains(pat1)) & (df['string_col2'].str.contains(pat2))
    max_idx = df[mask].idxmax()
    min_idx = df[mask].idxmin()
    difference = df['value'].loc[max_idx] - df['value'].loc[min_idx]

I think that in order to get one "difference" answer I am slicing df too many times, but I cannot figure out how to do it with fewer steps. Also, is there a faster way to slice it?

This is purely an optimization question, since I know my code already gets me what I need. Any advice would be appreciated!

+7
optimization python numpy pandas
4 answers

You can speed up the logical comparison about 50x by using scipy.logical_and() on the underlying arrays instead of the pandas & operator:

    import pandas as pd
    import scipy as sp

    a = pd.Series(sp.rand(10000) > 0.5)
    b = pd.Series(sp.rand(10000) > 0.5)

    %timeit sp.logical_and(a.values, b.values)
    100000 loops, best of 3: 6.31 µs per loop

    %timeit a & b
    1000 loops, best of 3: 390 µs per loop
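Applied to the question's frame, that would look something like the sketch below (assuming the same pat1/pat2 as in the question; np.logical_and is the same function scipy re-exports). Note that the .str.contains() calls themselves will still dominate the runtime:

    import numpy as np

    m1 = df['string_col1'].str.contains(pat1).values  # plain NumPy bool arrays,
    m2 = df['string_col2'].str.contains(pat2).values  # no index attached
    mask = np.logical_and(m1, m2)  # skips pandas' alignment overhead
    result = df.loc[mask, 'value']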
+2

Pass each mask to the next subset of the data frame, so that each new filter runs on a smaller subset of the original frame:

    pat1 = re.compile('apple')
    pat2 = re.compile('sparta')
    mask1 = df['string_col1'].str.contains(pat1)
    mask = df[mask1]['string_col2'].str.contains(pat2)
    df1 = df[mask1][mask]
    max_idx = df1['value'].idxmax()
    min_idx = df1['value'].idxmin()
    a, b = df1['value'].loc[max_idx], df1['value'].loc[min_idx]
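The same idea extends to any number of (column, pattern) conditions — a sketch of my own, assuming the conditions can be applied in any order (putting the most selective ones first shrinks the frame fastest):

    import re

    # hypothetical list of conditions to apply in sequence
    conditions = [('string_col1', re.compile('apple')),
                  ('string_col2', re.compile('sparta'))]

    subset = df
    for col, pat in conditions:
        subset = subset[subset[col].str.contains(pat)]  # each pass scans fewer rows

    difference = subset['value'].max() - subset['value'].min()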
+1

I tried to profile your example, but I actually get very good performance on synthetic data, so I may need some clarification. (Also, for some reason .idxmax() breaks for me whenever I have a string column in my frame.)

Here is my test code:

    import pandas as pd
    import re
    import numpy as np
    import random
    import IPython
    from timeit import default_timer as timer

    possibilities_col1 = ['apple', 'pen', 'pineapple', 'joseph', 'cauliflower']
    possibilities_col2 = ['sparta', 'this', 'is', 'again']
    entries = 100000
    potential_words_col1 = 4
    potential_words_col2 = 3

    def create_function_col1():
        result = []
        for x in range(random.randint(1, potential_words_col1)):
            result.append(random.choice(possibilities_col1))
        return " ".join(result)

    def create_function_col2():
        result = []
        for x in range(random.randint(1, potential_words_col2)):
            result.append(random.choice(possibilities_col2))
        return " ".join(result)

    data = {'string_col1': pd.Series([create_function_col1() for _ in range(entries)]),
            'string_col2': pd.Series([create_function_col2() for _ in range(entries)]),
            'value': pd.Series([random.randint(1, 500) for _ in range(entries)])}

    df = pd.DataFrame(data)

    pat1 = re.compile('apple')
    pat2 = re.compile('sparta')
    pat3 = re.compile('pineapple')
    pat4 = re.compile('this')
    #IPython.embed()

    start = timer()

    mask = df['string_col1'].str.contains(pat1) & \
           df['string_col1'].str.contains(pat3) & \
           df['string_col2'].str.contains(pat2) & \
           df['string_col2'].str.contains(pat4)

    valid = df[mask]
    max_idx = valid['value'].argmax()
    min_idx = valid['value'].argmin()
    #max_idx = result['max']
    #min_idx = result['min']
    difference = df.loc[max_idx, 'value'] - df.loc[min_idx, 'value']

    end = timer()
    print("Difference: {}".format(difference))
    print("# Valid: {}".format(len(valid)))
    print("Time Elapsed: {}".format(end - start))

Can you explain how many conditions you are applying? Each added regular expression adds an approximately linear increase in time (i.e. going from 2 to 3 regexes means roughly a 1.5x increase in runtime). I also see linear scaling in the number of entries and in both possible string lengths (the potential_words variables).
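Since every .str.contains() pass rescans all the strings, one way to attack that linear growth — my own suggestion, not part of this answer's benchmark — is to fold several patterns on the same column into a single regex using lookaheads; whether it actually wins depends on the patterns, so time it on real data:

    import re

    # one scan instead of two: the row must contain both words somewhere
    both = re.compile('(?=.*apple)(?=.*pineapple)')
    mask_col1 = df['string_col1'].str.contains(both)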

For reference, this code runs in ~1.5 seconds on my machine (1 million records takes ~15 seconds).

Edit: I was an idiot and wasn't doing quite the same thing as you (I took the difference between the values at the smallest and largest indices in the dataset, rather than the difference between the smallest and largest values), but correcting that didn't noticeably affect the execution time.

Edit 2: How does idxmax() know which column to take the maximum from in your code example?
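(For what it's worth: DataFrame.idxmax() doesn't pick one column — it returns the index of the maximum separately for every column, and it raises on non-numeric columns, which is presumably why it broke on string data above. A sketch of what the question's code probably meant:)

    max_idx = df.loc[mask, 'value'].idxmax()  # label of the row with the largest value
    min_idx = df.loc[mask, 'value'].idxmin()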

+1

I think using your mask to condense your data frame first, and then performing the smaller set of operations on that reduced frame, will help a lot. Looking up the indexes only to use them for another lookup isn't necessary — just take the max/min directly:

    pat1 = re.compile('apple')
    pat2 = re.compile('sparta')
    mask = (df['string_col1'].str.contains(pat1)) & (df['string_col2'].str.contains(pat2))
    result = df.loc[mask, 'value']
    difference = result.max() - result.min()
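Since the question mentions matching many different pattern pairs, this wraps naturally into a helper — a sketch, with the function name my own invention:

    import re

    def value_spread(df, pat_col1, pat_col2):
        # max - min of 'value' over rows where both patterns match
        mask = (df['string_col1'].str.contains(pat_col1)
                & df['string_col2'].str.contains(pat_col2))
        matched = df.loc[mask, 'value']
        if matched.empty:  # no rows matched this pattern pair
            return None
        return matched.max() - matched.min()

    # e.g.:
    # value_spread(df, re.compile('apple'), re.compile('sparta'))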
0
