Iterate pandas column containing lists and retrieve only unique values

I have three related questions that I can't figure out. I hope someone can help.

    import pandas as pd

    data = {'Col1': ['ONE, ONE, NULL', 'ONE, TWO, THREE', 'TWO, NULL, TEN']}
    index = pd.Index(['d1', 'd2', 'd3'])
    data = pd.DataFrame(data, index=index)

    pattern = 'ONE|TWO'                  # <---- QUESTION1
    data['Col1'].str.findall(pattern)    # <---- QUESTION2

Question1: How can I change this regular expression so that 'ONE' is returned only once for d1? Right now every occurrence of ONE is returned, as shown below.

    d1    [ONE, ONE]
    d2    [ONE, TWO]
    d3         [TWO]

What I want instead:

    d1         [ONE]
    d2    [ONE, TWO]
    d3         [TWO]

Question2:
I want to combine the lists for d1, d2 and d3 into a single list containing only unique values, something like this:

 set(d1 + d2 + d3) ---> ['ONE', 'TWO'] 


Question3:
If I did something like this:

 data['Col2'] = data['Col1'].str.findall(pattern) 

How can I iterate over each row in Col2 to get the same results as in Question2?

2 answers

You can use reduce (over set.union). On Python 3, reduce has to be imported from functools first:

    In [10]: from functools import reduce

    In [11]: reduce(set.union, data['Col1'].str.findall(pattern), set())
    Out[11]: {'ONE', 'TWO'}

Another option is to use a list comprehension:

    In [12]: [w for w in ['ONE', 'TWO'] if data['Col1'].str.contains(w).any()]
    Out[12]: ['ONE', 'TWO']

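For Question 1, try the following, which turns each row's list of matches into a set (duplicates are dropped, but the order of the matches is not guaranteed):

    data['Col1'].str.findall(pattern).apply(set)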
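If the result should stay a list in first-seen order, as in the desired output in the question, one possible variation is to deduplicate with dict.fromkeys instead of set. This is only a sketch; the helper name unique_in_order is made up for illustration and is not part of pandas.

```python
import pandas as pd

data = pd.DataFrame(
    {'Col1': ['ONE, ONE, NULL', 'ONE, TWO, THREE', 'TWO, NULL, TEN']},
    index=pd.Index(['d1', 'd2', 'd3']),
)
pattern = 'ONE|TWO'


def unique_in_order(matches):
    """Hypothetical helper: drop repeated matches, keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(matches))


# findall returns one list of matches per row; deduplicate each list
per_row = data['Col1'].str.findall(pattern).apply(unique_in_order)
print(per_row)
```

Each row ends up as a plain list, so d1 becomes ['ONE'] rather than {'ONE'}.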

For Questions 2 and 3, try the following set comprehension, which collects the values from every row into one set:

    {x for s in data['Col1'].str.findall(pattern).apply(set) for x in s}
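For Question 3 specifically, where the matches are already stored in Col2, the same idea can be applied to that column directly. The sketch below shows two possible ways: flattening with itertools.chain, or using Series.explode, which assumes pandas 0.25 or newer. The variable names are illustrative only.

```python
from itertools import chain

import pandas as pd

data = pd.DataFrame(
    {'Col1': ['ONE, ONE, NULL', 'ONE, TWO, THREE', 'TWO, NULL, TEN']},
    index=pd.Index(['d1', 'd2', 'd3']),
)
pattern = 'ONE|TWO'
data['Col2'] = data['Col1'].str.findall(pattern)

# Option A: iterate over the lists in Col2 and flatten them into one set
unique_values = set(chain.from_iterable(data['Col2']))

# Option B: one row per list element, then keep the distinct values
# (dropna discards the NaN that explode produces for empty lists)
unique_ordered = data['Col2'].explode().dropna().unique().tolist()

print(unique_values)
print(unique_ordered)
```

Option A returns a set, so the order is arbitrary; Option B returns the values in order of first appearance.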
