Iterate pandas column containing lists and retrieve only unique values

Question


I have three questions that I can't work out, and I hope someone can help me.

 import pandas as pd

 data = {'Col1': ['ONE, ONE, NULL', 'ONE, TWO, THREE', 'TWO, NULL, TEN']}
 index = pd.Index(['d1', 'd2', 'd3'])
 data = pd.DataFrame(data, index=index)

 pattern = 'ONE|TWO'                # <---- QUESTION1
 data['Col1'].str.findall(pattern)  # <---- QUESTION2

Question1: How can I change this regular expression so that 'ONE' is returned only once for d1? At the moment every occurrence of ONE is returned, as shown below.

 d1    [ONE, ONE]
 d2    [ONE, TWO]
 d3    [TWO]

I want this instead:

 d1    [ONE]
 d2    [ONE, TWO]
 d3    [TWO]

Question2:
I want to combine the lists for d1, d2 and d3 into a single list containing only the unique values, something like this:

 set(d1 + d2 + d3) ---> ['ONE', 'TWO']

Question3:
If I did something like this:

 data['Col2'] = data['Col1'].str.findall(pattern)

How can I iterate over each row in Col2 to get the same results as in Question2?


python pandas regex

user3139545 Jan 21 '14 at 18:49


2 answers

Andy hayden · Answer 1 · 2014-01-21T19:03:08+0000

You can use reduce (over set.union); in Python 3, reduce has to be imported from functools first:

 In [11]: reduce(set.union, data['Col1'].str.findall(pattern), set())
 Out[11]: {'ONE', 'TWO'}
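For anyone running this outside an IPython session, here is a minimal self-contained sketch of the same idea, assuming Python 3 and the sample frame from the question. The name `unique_values` is only illustrative.

```python
from functools import reduce  # reduce is not a builtin in Python 3

import pandas as pd

# Sample data copied from the question.
data = pd.DataFrame(
    {'Col1': ['ONE, ONE, NULL', 'ONE, TWO, THREE', 'TWO, NULL, TEN']},
    index=pd.Index(['d1', 'd2', 'd3']),
)
pattern = 'ONE|TWO'

# findall yields one list of matches per row. set.union accepts any
# iterable, so each list is folded into the running set, which starts empty.
unique_values = reduce(set.union, data['Col1'].str.findall(pattern), set())
print(sorted(unique_values))
```

The empty `set()` passed as the initial value matters: without it, reduce would start from the first row's list, and `set.union` would fail because its first argument must be a set.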

Another option is to use a list comprehension:

 In [12]: [w for w in ['ONE', 'TWO'] if data['Col1'].str.contains(w).any()]
 Out[12]: ['ONE', 'TWO']

Alvaro fuentes · Answer 2 · 2014-01-21T19:15:46+0000

For Question 1, try the following:

 data['Col1'].str.findall(pattern).apply(set)
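Question 1 asks about the regular expression itself, whereas the approach above removes duplicates after matching. As a rough sketch of a regex-only alternative (not part of either answer), a capture group combined with a negative lookahead can reject a word that occurs again later in the same string. The name `pattern_once` is hypothetical, and the example uses the standard `re` module directly; since pandas documents `str.findall` as applying `re.findall` to each element, the same pattern should behave the same way there.

```python
import re

# Hypothetical variant of the question's pattern: the negative lookahead
# (?!.*\1) rejects a match when the same word appears again later on,
# so only the last occurrence of each word is returned.
pattern_once = r'(ONE|TWO)(?!.*\1)'

rows = ['ONE, ONE, NULL', 'ONE, TWO, THREE', 'TWO, NULL, TEN']
matches = [re.findall(pattern_once, row) for row in rows]
print(matches)  # [['ONE'], ['ONE', 'TWO'], ['TWO']]
```

One caveat: because only the last occurrence survives, the words come back ordered by where they last appear, so 'ONE, TWO, ONE' would give ['TWO', 'ONE'].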

For Questions 2 and 3, try the following:

 {x for s in data['Col1'].str.findall(pattern).apply(set) for x in s}
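Note that `apply(set)` turns each row into a set, so the order of the matches is lost. If lists like those in the question are wanted, one possible sketch is to deduplicate with `dict.fromkeys`, store the result in Col2 as in Question 3, and then iterate over that column. The variable names here are illustrative assumptions, not from the original answers.

```python
import pandas as pd

# Sample data copied from the question.
data = pd.DataFrame(
    {'Col1': ['ONE, ONE, NULL', 'ONE, TWO, THREE', 'TWO, NULL, TEN']},
    index=pd.Index(['d1', 'd2', 'd3']),
)
pattern = 'ONE|TWO'

# Question 1: dict.fromkeys drops repeats while keeping first-seen order.
data['Col2'] = data['Col1'].str.findall(pattern).apply(
    lambda found: list(dict.fromkeys(found))
)

# Questions 2 and 3: walk every row of Col2 and collect the distinct values.
unique_values = sorted({value for row in data['Col2'] for value in row})

print(data['Col2'].tolist())
print(unique_values)
```

Sorting the final set is optional; it is only there to make the output order predictable.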
