Comparing rows of a pandas DataFrame (rows have some overlapping values)

I have a pandas DataFrame with 21 columns. I am focusing on a subset of rows that have exactly the same values in every column except for 6, which are unique to each row. I do not know a priori which column headings these 6 values correspond to.

I tried converting each row into an Index object and performing a set operation on two rows. For example:

    row1 = pd.Index(sample_data[0])
    row2 = pd.Index(sample_data[1])
    row1.difference(row2)  # set difference; older pandas versions spelled this row1 - row2

which returns an Index object containing the values unique to row1. Then I can manually infer which columns have unique values.

How can I programmatically retrieve the column headings that correspond to these values in the original DataFrame? Or is there a way to compare two or more rows and extract the 6 differing column values for each row, along with the corresponding headers? Ideally, I would like to create a new DataFrame containing only the differing columns.

In particular, is there a way to do this using set operations?

Thanks.

python pandas dataframe
2 answers

Here's a quick solution that returns only the columns in which two rows differ.

    In [13]: df = pd.DataFrame(zip(*[range(5), list('abcde'), list('aaaaa'),
       ....:                         list('bbbbb')]), columns=list('ABCD'))

    In [14]: df
    Out[14]:
       A  B  C  D
    0  0  a  a  b
    1  1  b  a  b
    2  2  c  a  b
    3  3  d  a  b
    4  4  e  a  b

    In [15]: df[df.columns[df.iloc[0] != df.iloc[1]]]
    Out[15]:
       A  B
    0  0  a
    1  1  b
    2  2  c
    3  3  d
    4  4  e

And a solution that finds all columns with more than one unique value across the entire frame.

    In [33]: df[df.columns[df.apply(lambda s: len(s.unique()) > 1)]]
    Out[33]:
       A  B
    0  0  a
    1  1  b
    2  2  c
    3  3  d
    4  4  e
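To apply the same idea to just a subset of rows and get the differing columns back as a new frame, headers included, one option is `DataFrame.nunique`. This is only a minimal sketch: the helper name `differing_columns` and the sample data are made up for illustration and are not part of pandas or of the question.

```python
import pandas as pd


def differing_columns(frame, rows):
    """Return the given rows restricted to columns whose values are not all equal.

    `differing_columns` is a hypothetical helper, not a pandas function.
    """
    subset = frame.loc[rows]
    # nunique() counts distinct values per column (NaN is ignored by default).
    mask = subset.nunique() > 1
    return subset.loc[:, mask]


# Illustrative data: columns C and D are constant, A and B vary.
df = pd.DataFrame({
    "A": range(5),
    "B": list("abcde"),
    "C": list("aaaaa"),
    "D": list("bbbbb"),
})

result = differing_columns(df, [0, 1])
print(list(result.columns))
```

Because the mask is computed on the selected rows only, a column that varies elsewhere in the frame but is constant within the subset is dropped.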

You don't really need an Index; you can simply compare the two rows and use the result to filter the list of columns.

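    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"col1": np.ones(10), "col2": np.ones(10), "col3": range(2, 12)})
    row1 = df.iloc[0]  # irow() has been removed from pandas; iloc replaces it
    row2 = df.iloc[1]
    unique_columns = row1 != row2  # boolean Series, True where the rows differ
    cols = [colname for colname, unique_column in zip(df.columns, unique_columns) if unique_column]
    print(cols)  # ['col3']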

If you know the standard value for each column, you can convert all rows to boolean values, i.e.:

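    standard_row = np.ones(3)  # the standard value for each of the 3 columns
    columns = df.columns
    # Boolean frame: True wherever a cell differs from its standard value
    unique_columns = df.apply(lambda x: x != standard_row, axis=1)
    # For each row, list the names of the columns that differ
    unique_columns.apply(lambda x: [col for col, unique_column in zip(columns, x) if unique_column], axis=1)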
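For the set-operation route the question asks about, one possible sketch is to turn each row into a set of (column, value) pairs and take the symmetric difference. The small frame below is illustrative only, and the approach assumes that every cell value is hashable; missing values (NaN) may show up as differences even when both rows have them.

```python
import pandas as pd

# Illustrative two-row frame: only col3 differs between the rows.
df = pd.DataFrame({
    "col1": [1.0, 1.0],
    "col2": [1.0, 1.0],
    "col3": [2, 3],
})

# Each row becomes a set of (column, value) pairs; values must be hashable.
pairs0 = set(df.iloc[0].items())
pairs1 = set(df.iloc[1].items())

# The symmetric difference keeps the pairs present in exactly one of the rows.
diff_pairs = pairs0 ^ pairs1
diff_cols = sorted({col for col, _ in diff_pairs})

# New frame holding only the columns where the two rows disagree.
diff_frame = df.loc[[0, 1], diff_cols]
print(diff_cols)
```

Keeping the column name inside each pair is what lets the headers be recovered afterwards; a set of bare values would lose that link.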