Python Pandas: something like isin, but with "contains" rather than "exact" matching

I am using Python Pandas to work with two data frames. The first data frame contains entries from a customer database (first name, last name, email address, etc.). The second data frame contains a list of domain names, for example gmail.com, hotmail.com, etc.

I am trying to exclude entries from the customer data frame when the email address contains a domain name from the second list. In other words, I need to remove a customer when the domain of their email address appears in the domain blacklist.

Here is an example of data:

>>> customer = pd.DataFrame({'Email': [
    "bob@example.com", 
    "jim@example.com", 
    "joe@gmail.com"], 'First Name': [
    "Bob", 
    "Jim", 
    "Joe"]})

>>> blacklist = pd.DataFrame({'Domain': ["gmail.com", "outlook.com"]})

>>> customer
             Email First Name
0  bob@example.com        Bob
1  jim@example.com        Jim
2    joe@gmail.com        Joe
>>> blacklist
        Domain
0    gmail.com
1  outlook.com

My desired result:

>>> filtered_list = magic_happens_here(customer, blacklist)
>>> filtered_list
             Email First Name
0  bob@example.com        Bob
1  jim@example.com        Jim

What I have tried so far:

  • isin, along the lines of df1[df1['email'].isin(~df2['email'])..., but isin only matches whole values exactly, not substrings.
  • df.apply with a lambda: df1['Email'].apply(lambda x: x for i in ['gmail.com', 'outlook.com'] if i in x). This fails with TypeError: 'generator' object is not callable.
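The TypeError in the second attempt comes from how Python parses the argument: inside the call parentheses, `lambda x: x for i in [...] if i in x` is a generator expression that yields lambdas, so apply receives a generator instead of a function. A minimal sketch of what that attempt was presumably aiming for, with the substring check wrapped in any() (the names `blocked` and `is_blocked` are illustrative, not from the original post):

```python
import pandas as pd

customer = pd.DataFrame({
    "Email": ["bob@example.com", "jim@example.com", "joe@gmail.com"],
    "First Name": ["Bob", "Jim", "Joe"],
})
blocked = ["gmail.com", "outlook.com"]  # illustrative name

# The lambda is now a real function: it returns True when the address
# contains any of the blocked domains as a substring.
is_blocked = customer["Email"].apply(lambda x: any(d in x for d in blocked))

# Keep only the rows that are not blocked.
filtered_list = customer[~is_blocked]
print(filtered_list)
```

This is a plain substring test, so it would also match a blocked domain that appears anywhere else in the address.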

My question:

  • How can I drop the customers whose email address contains a domain from the blacklist?

One approach is to pass a tuple of the blacklisted domains to str.endswith:

import pandas as pd


customer = pd.DataFrame({'Email': [
    "bob@example.com",
    "jim@example.com", 
    "joe@gmail.com"], 'First Name': [
    "Bob", 
    "Jim", 
    "Joe"]})

blacklist = pd.DataFrame({'Domain': ["gmail.com", "outlook.com"]})

invalid_emails = tuple(blacklist['Domain'])

df = customer[customer['Email'].apply(lambda s: not s.endswith(invalid_emails))]

print(df)

Output:

             Email First Name
0  bob@example.com        Bob
1  jim@example.com        Jim
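One caveat with a bare suffix test is that it also matches domains that merely end with a blacklisted name. A hedged sketch of a stricter variant, assuming the blacklist should match the whole domain part only: each domain is prefixed with "@" before the endswith check. The address ann@notgmail.com and the names `invalid_suffixes`, `keep` and `strict` are made up for illustration.

```python
import pandas as pd

customer = pd.DataFrame({
    # "ann@notgmail.com" is a hypothetical address added to show the edge case.
    "Email": ["bob@example.com", "ann@notgmail.com", "joe@gmail.com"],
    "First Name": ["Bob", "Ann", "Joe"],
})
blacklist = pd.DataFrame({"Domain": ["gmail.com", "outlook.com"]})

# Prefix every domain with "@" so that only the complete domain part matches.
invalid_suffixes = tuple("@" + d for d in blacklist["Domain"])

# str.endswith accepts a tuple and returns True if any suffix matches.
keep = customer["Email"].apply(lambda s: not s.endswith(invalid_suffixes))
strict = customer[keep]
print(strict)
```

With the unprefixed tuple, ann@notgmail.com would have been dropped as well. Note that this stricter form no longer matches subdomains such as mail.gmail.com, which may or may not be what you want.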

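Two vectorized alternatives, one using str.endswith and one that strips everything up to the "@" and then uses isin: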

customer[~customer.Email.str.endswith(invalid_emails)]

customer[~customer.Email.str.replace(r'^[^@]*\@', '', regex=True).isin(blacklist.Domain)]

In [399]: filtered_list
Out[399]:
             Email First Name
0  bob@example.com        Bob
1  jim@example.com        Jim

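Explanation (the session below was recorded on an older pandas release, where str.replace treated the pattern as a regular expression by default; on pandas 2.0 and later, pass regex=True):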

In [395]: customer.Email.str.replace(r'^[^@]*\@', '')
Out[395]:
0    example.com
1    example.com
2      gmail.com
Name: Email, dtype: object

In [396]: customer.Email.str.replace(r'^[^@]*\@', '').isin(blacklist.Domain)
Out[396]:
0    False
1    False
2     True
Name: Email, dtype: bool
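The same "extract the domain, then isin" idea can also be sketched without a regular expression. This is only an alternative under stated assumptions, not the approach used in the session above: it splits on the last "@" and lowercases the result so that mixed-case addresses still match. The mixed-case address Joe@GMail.com and the names `domains`, `is_blocked` and `filtered` are illustrative.

```python
import pandas as pd

customer = pd.DataFrame({
    # "Joe@GMail.com" uses mixed case on purpose to show the normalization.
    "Email": ["bob@example.com", "jim@example.com", "Joe@GMail.com"],
    "First Name": ["Bob", "Jim", "Joe"],
})
blacklist = pd.DataFrame({"Domain": ["gmail.com", "outlook.com"]})

# Split once from the right on "@", take the last piece, and lowercase it.
domains = customer["Email"].str.rsplit("@", n=1).str[-1].str.lower()

# Exact membership test against the blacklist, as isin is designed for.
is_blocked = domains.isin(blacklist["Domain"])
filtered = customer[~is_blocked]
print(filtered)
```

Because the comparison is an exact match on the domain, this variant does not treat notgmail.com as gmail.com.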

Timing on a 300K-row DataFrame:

In [401]: customer = pd.concat([customer] * 10**5)

In [402]: customer.shape
Out[402]: (300000, 2)

In [420]: %timeit customer[~customer.Email.str.endswith(invalid_emails)]
10 loops, best of 3: 136 ms per loop

In [421]: %timeit customer[customer['Email'].apply(lambda s: not s.endswith(invalid_emails))]
10 loops, best of 3: 151 ms per loop

In [422]: %timeit customer[~customer.Email.str.replace(r'^[^@]*\@', '').isin(blacklist.Domain)]
1 loop, best of 3: 642 ms per loop

In summary, from fastest to slowest in the timings above:

  • customer[~customer.Email.str.endswith(invalid_emails)]
  • customer[customer['Email'].apply(lambda s: not s.endswith(invalid_emails))]
  • customer[~customer.Email.str.replace(r'^[^@]*\@', '', regex=True).isin(blacklist.Domain)]
