Basically, I want to know a faster way to slice a pandas DataFrame with a boolean condition based on regular expressions. For example, take the following df (the string columns contain more than the 4 variations shown; these are for illustrative purposes only):
    index, string_col1, string_col2, value
    0, 'apple', 'this', 10
    1, 'pen', 'is', 123
    2, 'pineapple', 'sparta', 20
    3, 'pen pineapple apple pen', 'this', 234
    4, 'apple', 'is', 212
    5, 'pen', 'sparta', 50
    6, 'pineapple', 'this', 69
    7, 'pen pineapple apple pen', 'is', 79
    8, 'apple pen', 'sparta again', 78
    ...
    100000, 'pen pineapple apple pen', 'this is sparta', 392
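To make this easier to reproduce, here is a minimal sketch that builds only the first nine rows shown above. The elided rows and the final row at index 100000 are left out, so this is only a small stand-in for the real frame and says nothing about its performance.

```python
import pandas as pd

# Only rows 0-8 of the sample; the rows hidden behind "..." are not reproduced.
df = pd.DataFrame(
    {
        "string_col1": [
            "apple",
            "pen",
            "pineapple",
            "pen pineapple apple pen",
            "apple",
            "pen",
            "pineapple",
            "pen pineapple apple pen",
            "apple pen",
        ],
        "string_col2": [
            "this",
            "is",
            "sparta",
            "this",
            "is",
            "sparta",
            "this",
            "is",
            "sparta again",
        ],
        "value": [10, 123, 20, 234, 212, 50, 69, 79, 78],
    }
)

print(df)
```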
I need to build a boolean mask over the string columns using regular expressions, find the indexes of the minimum and maximum of the value column among the matching rows, and finally compute the difference between that minimum and maximum. I do it in the following way, but it is SUPER SLOW when I have to match many different regex patterns:
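    import re

    pat1 = re.compile('apple')
    pat2 = re.compile('sparta')
    # Rows where both string columns match their pattern
    mask = (df['string_col1'].str.contains(pat1)) & (df['string_col2'].str.contains(pat2))
    # Index labels of the largest and smallest 'value' among the matching rows
    max_idx = df.loc[mask, 'value'].idxmax()
    min_idx = df.loc[mask, 'value'].idxmin()
    difference = df['value'].loc[max_idx] - df['value'].loc[min_idx]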
I suspect that I slice df too many times just to get a single "difference" value, but I cannot figure out how to slice it fewer times. Also, is there a faster way to do the slicing itself?
This is purely an optimization question, since my code already produces the result I need. Any advice would be appreciated!