Look for matching rows in a column of a data frame from a list - Pandas - Python

Question

I have a list:

things = ['A1','B2','C3']

I have a pandas data frame with a column containing values separated by semicolons. Some of the rows will contain a match with one of the elements in the list above. It will not be an exact match, since the cell also holds other parts of the row; for example, a row in this column may contain "Wow;Here;This=A1;10001;0".

I want to save the rows containing the match with the elements from the list, and then create a new data frame with the selected rows (should have the same headers). This is what I tried:

    import re
    for_new_df = []
    for x in df['COLUMN']:
        for mp in things:
            if df[df['COLUMN'].str.contains(mp)]:
                for_new_df.append(mp)  # This won't save the whole row - help here too, please.

This code gave me an error:

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I am very new to coding, so the more explanations and details in your answer, the better! Thanks in advance.

+5

python pandas

Eric Coy Jul 12 '16 at 15:48

2 answers

Pandas is really awesome, but I don't find it very easy to use. However, it does have many features designed to make life easier, including tools to search huge data frames.

Although this may not be the complete solution to your problem, it may help you start off on the right foot. I have assumed that you know which column your data is in (column A in my example).

    import pandas as pd

    df = pd.DataFrame({'A': pd.Categorical(['Wow;Here;This=A1;10001;0',
                                            'Another;C3;Row=Great;100',
                                            'This;D6;Row=bad100']),
                       'B': 'foo'})

    print(df)  # Original data frame
    print()
    print(df['A'].str.contains('A1|B2|C3'))  # Boolean array showing matches for col A
    print()
    print(df[df['A'].str.contains('A1|B2|C3')])  # Matching rows

Output:

                              A    B
    0  Wow;Here;This=A1;10001;0  foo
    1  Another;C3;Row=Great;100  foo
    2        This;D6;Row=bad100  foo

    0     True
    1     True
    2    False
    Name: A, dtype: bool

                              A    B
    0  Wow;Here;This=A1;10001;0  foo
    1  Another;C3;Row=Great;100  foo
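If the column can contain missing values, str.contains returns NaN for those rows instead of True or False, and such a mask is awkward to use for row selection. A minimal sketch, assuming missing cells should simply count as non-matches, passes na=False (the sample data and the names df_demo, mask, and result are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing cell in column A
df_demo = pd.DataFrame({'A': ['Wow;Here;This=A1;10001;0', np.nan, 'This;D6;Row=bad100'],
                        'B': 'foo'})

# na=False treats missing cells as "no match", so the mask is purely boolean
mask = df_demo['A'].str.contains('A1|B2|C3', na=False)

# Selecting with the mask keeps whole rows and the original headers
result = df_demo[mask]
print(result)
```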

+2

emmalg Jul 12 '16 at 16:36

Edchum · Accepted Answer · 2016-07-12T15:50:46+0000

You can avoid the loop by joining your list of words to create a regular expression and passing it to str.contains:

    pat = '|'.join(things)
    for_new_df = df[df['COLUMN'].str.contains(pat)]

should just work

So the regex pattern becomes 'A1|B2|C3', and it will match any string in the column that contains any of these substrings, wherever they occur.

Example:

    In [65]:
    things = ['A1','B2','C3']
    pat = '|'.join(things)
    df = pd.DataFrame({'a':['Wow;Here;This=A1;10001;0', 'B2', 'asdasda', 'asdas']})
    df[df['a'].str.contains(pat)]

    Out[65]:
                              a
    0  Wow;Here;This=A1;10001;0
    1                        B2
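One caveat: str.contains treats the pattern as a regular expression by default, so a list entry containing characters such as '.' or '+' would be read as regex syntax rather than literal text. A minimal sketch, assuming the entries should be matched literally, escapes each one with re.escape before joining (the data and the names things_demo, df_demo, and matches are invented for illustration):

```python
import re
import pandas as pd

# Hypothetical list: 'A.1' contains the regex metacharacter '.'
things_demo = ['A.1', 'B2']
df_demo = pd.DataFrame({'COLUMN': ['x;This=A.1;0', 'x;This=AX1;0', 'B2', 'none'],
                        'OTHER': [1, 2, 3, 4]})

# Escape each entry so it is matched literally, then join with '|'
pat = '|'.join(re.escape(t) for t in things_demo)

# The boolean mask selects whole rows; every column is kept
matches = df_demo[df_demo['COLUMN'].str.contains(pat)]
print(matches)
```

Without the escaping step, the unescaped pattern 'A.1|B2' would also match the 'AX1' row, because '.' matches any character.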

As for what went wrong in your attempt:

 if df[df['COLUMN'].str.contains(mp)]

this line:

 df[df['COLUMN'].str.contains(mp)]

returns the df masked by the boolean array from the inner str.contains. The if statement does not know how to evaluate an array of booleans, hence the error. If you think about it, what should it do if only one value is True, or all but one are True? It expects a scalar, not an array-like value.
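To make this concrete, here is a small sketch (the data and variable names are invented for illustration) showing that using the filtered frame directly as a condition raises the error, while reducing the mask with .any(), or checking .empty on the filtered frame, gives a single boolean that if can evaluate:

```python
import pandas as pd

# Hypothetical two-row frame; only the first row contains 'A1'
df_demo = pd.DataFrame({'COLUMN': ['Wow;Here;This=A1;10001;0', 'nothing here']})

mask = df_demo['COLUMN'].str.contains('A1')  # Series of booleans, one per row
filtered = df_demo[mask]                     # DataFrame holding the matching rows

# Using the DataFrame itself as a condition raises ValueError
try:
    if filtered:
        pass
    raised = False
except ValueError:
    raised = True

# Reduce to a single boolean instead
has_match = mask.any()          # True if at least one row matches
not_empty = not filtered.empty  # the same check, made on the filtered frame
print(raised, has_match, not_empty)
```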
