How to compare two string variables in pandas?

I have two string columns in my pandas DataFrame:

name1     name2
John Doe  John Doe
AleX T    Franz K

and I need to check whether name1 equals name2. The naive approach I'm using now is a simple mask:

mask = df.name1 == df.name2

The problem is that the strings may be mislabeled (in ways that are not predictable, since there is too much data), which prevents an exact match.

For example, "John Doe" and "John  Doe" (with an extra space) do not match. Of course, I stripped and lower-cased my strings, but other possibilities remain.

One idea would be to check whether name1 is contained in name2. But it looks like I cannot use str.contains with another column as an argument. Any other ideas?

Many thanks!

EDIT: using isin does not give the result I need:

test = pd.DataFrame({'A': ["john doe", " john doe", 'John'], 'B': [' john doe', 'eddie murphy', 'batman']})

test
Out[6]: 
           A             B
0   john doe      john doe
1   john doe  eddie murphy
2       John        batman

test['A'].isin(test['B'])
Out[7]: 
0    False
1     True
2    False
Name: A, dtype: bool

You can use str.lower and str.replace with the regex \s+ to remove all whitespace:

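import pandas as pd

test = pd.DataFrame({'A': ["john  doe", " john doe", 'John'],
                     'B': [' john doe', 'eddie murphy', 'batman']})

# lower-case both columns and remove all whitespace before comparing
print(test['A'].str.lower().str.replace(r'\s+', '', regex=True) ==
      test['B'].str.lower().str.replace(r'\s+', '', regex=True))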


0     True
1    False
2    False
dtype: bool
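If deleting every space feels too aggressive (it would also make "john doe" equal to "johndoe"), a gentler variant is to collapse runs of whitespace into a single space. This is only a sketch; the helper name normalize is my own and not part of pandas:

```python
import pandas as pd

def normalize(s: pd.Series) -> pd.Series:
    """Lower-case, trim, and collapse internal whitespace runs to one space."""
    return (s.str.lower()
             .str.strip()
             .str.replace(r'\s+', ' ', regex=True))

test = pd.DataFrame({'A': ["john  doe", " john doe", 'John'],
                     'B': [' john doe', 'eddie murphy', 'batman']})

# compare the normalized columns row by row
mask = normalize(test['A']) == normalize(test['B'])
print(mask)
```

Keeping one space between words preserves word boundaries, which may or may not be what you want for your data.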

Strip the whitespace and lower the case:

In [414]:
test['A'].str.strip().str.lower() == test['B'].str.strip().str.lower()

Out[414]:
0     True
1    False
2    False
dtype: bool
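As for the containment idea from the question: str.contains takes a single pattern for the whole column, so a row-by-row check against another column needs a different approach. One possible sketch, assuming neither column contains missing values, is a plain list comprehension over the paired values:

```python
import pandas as pd

test = pd.DataFrame({'A': ["john doe", " john doe", 'John'],
                     'B': [' john doe', 'eddie murphy', 'batman']})

a = test['A'].str.strip().str.lower()
b = test['B'].str.strip().str.lower()

# for each row, test whether the cleaned A value is a substring of the cleaned B value
contained = pd.Series([x in y for x, y in zip(a, b)], index=test.index)
print(contained)
```

This is not vectorized, so it may be slow on very large frames, but it is easy to read and to adapt.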

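What you probably want is a string distance function, say distance(s1, s2), based for example on the edit distance of the strings. Then you could do something like: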

df['distance_s'] = df.apply(lambda r: distance(r['name1'], r['name2']), axis=1)  # axis=1 applies the function per row
filtered = df[df['distance_s'] < eps] # you define eps
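If you do not want to install anything, distance could be sketched in pure Python as the classic Levenshtein distance. This is my own illustration, not the implementation of any particular library, and both the eps value and the extra space in the sample data are assumptions:

```python
import pandas as pd

def distance(s1: str, s2: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning s1 into s2."""
    prev = list(range(len(s2) + 1))  # distances from the empty prefix of s1
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1,                # delete c1
                            curr[j - 1] + 1,            # insert c2
                            prev[j - 1] + (c1 != c2)))  # substitute, or match at no cost
        prev = curr
    return prev[-1]

df = pd.DataFrame({'name1': ['John Doe', 'AleX T'],
                   'name2': ['John  Doe', 'Franz K']})
df['distance_s'] = df.apply(lambda r: distance(r['name1'], r['name2']), axis=1)

eps = 2  # hypothetical threshold: keep rows that differ by at most one edit
filtered = df[df['distance_s'] < eps]
print(filtered)
```

The double loop is quadratic in the string lengths, so for a lot of data a compiled library is likely the better choice.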

A Google search turns up this package:

https://pypi.python.org/pypi/editdistance



Another option is difflib from the standard library:

import difflib as dfl
dfl.SequenceMatcher(None,'John Doe', 'John doe').ratio()

Edit: an example with pandas:

import pandas as pd
import difflib as dfl
df = pd.DataFrame({'A': ["john doe", " john doe", 'John'], 'B': [' john doe', 'eddie murphy', 'batman']})
df['VAR1'] = df.apply(lambda x : dfl.SequenceMatcher(None, x['A'], x['B']).ratio(),axis=1)
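To turn the similarity ratio into a mask like the one in the question, you could compare it against a cutoff. A rough sketch; the 0.9 threshold is only a guess on my part and would need tuning on real data:

```python
import difflib as dfl
import pandas as pd

df = pd.DataFrame({'A': ["john doe", " john doe", 'John'],
                   'B': [' john doe', 'eddie murphy', 'batman']})

# ratio() returns a similarity score between 0.0 and 1.0 for each row
df['VAR1'] = df.apply(lambda x: dfl.SequenceMatcher(None, x['A'], x['B']).ratio(), axis=1)

threshold = 0.9  # hypothetical cutoff, tune it for your data
mask = df['VAR1'] >= threshold
print(df[mask])
```

A higher threshold gives fewer false matches but misses more misspellings, so it is worth inspecting a sample of rows near the cutoff.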