How to discretize values in a pandas DataFrame and convert to a binary matrix?

I mean something like this:

I have a DataFrame with columns that can be categorical or nominal. For each observation (row), I want to generate a new row, where each possible value for the variables is now its own binary variable. For example, this matrix (the first row is the column labels)

    'a'    'b'  'c'
    one    0.2  0
    two    0.4  1
    two    0.9  0
    three  0.1  2
    one    0.0  4
    two    0.2  5

will be converted to something like this:

    'a'                'b'                                                      'c'
    one  two  three    [0.0,0.2)  [0.2,0.4)  [0.4,0.6)  [0.6,0.8)  [0.8,1.0]    0  1  2  3  4  5
    1    0    0        0          1          0          0          0            1  0  0  0  0  0
    0    1    0        0          0          0          0          1            0  1  0  0  0  0
    0    1    0        0          0          0          0          1            1  0  0  0  0  0
    0    0    1        1          0          0          0          0            0  0  1  0  0  0
    1    0    0        1          0          0          0          0            0  0  0  0  1  0
    0    1    0        0          1          0          0          0            0  0  0  0  0  1

Each variable (column) in the original matrix is expanded into all of its possible values. If it is categorical, each possible value becomes a new column. If it is a float, the values are binned in some way (say, always split into 10 bins). If it is an int, each distinct int value can become its own column, or the values can also be binned.

FYI: in my real application, the table has up to 2 million rows, and the full "extended" matrix can have hundreds of columns.

Is there an easy way to perform this operation?

As an aside, I would also be happy to skip this step, since what I am really trying to compute is a Burt table (a symmetric matrix of cross-tabulations). Is there an easy way to do something similar with the crosstab function? Otherwise, computing the cross-tabulation is just a simple matrix multiplication.

+7
5 answers

You can use NumPy broadcasting:

    In [58]: df
    Out[58]:
           a    b  c
    0    one  0.2  0
    1    two  0.4  1
    2    two  0.9  0
    3  three  0.1  2
    4    one  0.0  4
    5    two  0.2  5

    In [41]: (df.a.values[:,numpy.newaxis] == df.a.unique()).astype(int)
    Out[41]:
    array([[1, 0, 0],
           [0, 1, 0],
           [0, 1, 0],
           [0, 0, 1],
           [1, 0, 0],
           [0, 1, 0]])

    In [54]: ((0 <= df.b.values[:,numpy.newaxis]) & (df.b.values[:,numpy.newaxis] < 0.2)).astype(int)
    Out[54]:
    array([[0],
           [0],
           [0],
           [1],
           [1],
           [0]])

    In [59]: (df.c.values[:,numpy.newaxis] == df.c.unique()).astype(int)
    Out[59]:
    array([[1, 0, 0, 0, 0],
           [0, 1, 0, 0, 0],
           [1, 0, 0, 0, 0],
           [0, 0, 1, 0, 0],
           [0, 0, 0, 1, 0],
           [0, 0, 0, 0, 1]])

Then join all the pieces together with pandas.concat or similar.
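
A minimal sketch of how the pieces might be assembled, using the sample data from the question. The helper names indicator and binned are hypothetical (they are not part of NumPy or pandas), and the bin edges are an arbitrary choice for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": ["one", "two", "two", "three", "one", "two"],
    "b": [0.2, 0.4, 0.9, 0.1, 0.0, 0.2],
    "c": [0, 1, 0, 2, 4, 5],
})

def indicator(series):
    """Hypothetical helper: one 0/1 column per distinct value."""
    vals = series.to_numpy()
    levels = pd.unique(vals)  # distinct values, in order of appearance
    # (n, 1) == (k,) broadcasts to an (n, k) boolean matrix
    hits = vals[:, np.newaxis] == levels
    return pd.DataFrame(hits.astype(int), index=series.index, columns=levels)

def binned(series, edges):
    """Hypothetical helper: one 0/1 column per half-open bin [lo, hi)."""
    vals = series.to_numpy()[:, np.newaxis]
    lo = np.asarray(edges[:-1])
    hi = np.asarray(edges[1:])
    hits = (lo <= vals) & (vals < hi)
    labels = [f"[{a}, {b})" for a, b in zip(edges[:-1], edges[1:])]
    return pd.DataFrame(hits.astype(int), index=series.index, columns=labels)

parts = [
    indicator(df.a),
    binned(df.b, [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]),
    indicator(df.c),
]
# Glue the blocks side by side; rows stay aligned on the original index.
wide = pd.concat(parts, axis=1)
print(wide)
```

Because the bins are half-open, a value exactly equal to the last edge (1.0 here) would fall into no bin; widening the last edge slightly is one way around that.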

+4

Note that I have implemented new cut and qcut functions for discretizing continuous data:

http://pandas-docs.imtqy.com/pandas-docs-travis/basics.html#discretization-and-quantiling
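
As a rough illustration of how these two functions could be applied to column b from the question (the bin edges and the number of quantiles are assumptions chosen for the example):

```python
import pandas as pd

b = pd.Series([0.2, 0.4, 0.9, 0.1, 0.0, 0.2], name="b")

# Fixed edges, closed on the left: [0.0, 0.2), [0.2, 0.4), ...
# Note: with right=False a value of exactly 1.0 would end up as NaN.
edges = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
fixed = pd.cut(b, bins=edges, right=False)

# Quantile-based bins: edges are chosen from the data so that each
# bin holds roughly the same number of rows.
quantile = pd.qcut(b, q=3)

# One 0/1 column per bin. Because the result of cut is categorical,
# bins with no observations (here [0.6, 0.8)) still get a column.
fixed_cols = pd.get_dummies(fixed, dtype=int)
print(fixed_cols)
print(quantile.value_counts())
```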

+29

For label-valued columns, such as columns a and c in your example, you can use the built-in pandas get_dummies() function.

Example:

    import numpy as np
    import pandas as pd

    s1 = ['a', 'b', np.nan]
    pd.get_dummies(s1)

       a  b
    0  1  0
    1  0  1
    2  0  0
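
A short sketch of two related options, assuming a reasonably recent pandas (the dtype argument is not available in very old versions, and the small DataFrame is made up for the example):

```python
import numpy as np
import pandas as pd

s1 = pd.Series(["a", "b", np.nan])

# Default: the missing value becomes an all-zero row.
plain = pd.get_dummies(s1, dtype=int)

# dummy_na=True adds an explicit indicator column for missing values.
with_na = pd.get_dummies(s1, dummy_na=True, dtype=int)

# On a DataFrame, columns= selects what to encode. Integer columns
# such as c are only expanded when they are listed explicitly.
df = pd.DataFrame({"a": ["one", "two", "two"], "c": [0, 1, 0]})
wide = pd.get_dummies(df, columns=["a", "c"], dtype=int)
print(wide.columns.tolist())
```
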
+5

I doubt you will beat patsy for simplicity. It was designed precisely for this task:

    >>> from patsy import dmatrix
    >>> dmatrix('C(a) + C(b) + C(c) - 1', df, return_type='dataframe')
       C(a)[one]  C(a)[three]  C(a)[two]  C(b)[T.0.1]  C(b)[T.0.2]  C(b)[T.0.4]  C(b)[T.0.9]  C(c)[T.1]  C(c)[T.2]  C(c)[T.4]  C(c)[T.5]
    0          1            0          0            0            1            0            0          0          0          0          0
    1          0            0          1            0            0            1            0          1          0          0          0
    2          0            0          1            0            0            0            1          0          0          0          0
    3          0            1          0            1            0            0            0          0          1          0          0
    4          1            0          0            0            0            0            0          0          0          1          0
    5          0            0          1            0            1            0            0          0          0          0          1

Here, C(a) means "treat the variable as categorical", and -1 removes the intercept column from the output.

+3

Combining several of the other answers into one that addresses the OP's questions:

    import pandas as pd

    d = {'a': pd.Series(['one', 'two', 'two', 'three', 'one', 'two']),
         'b': pd.Series([0.2, 0.4, 0.9, 0.1, 0.0, 0.2]),
         'c': pd.Series([0, 1, 0, 2, 4, 5])}
    data = pd.DataFrame(d)

    # Cross-tabulating against the row index gives one 0/1 column per value
    a_cols = pd.crosstab(data.index, [data.a])
    # Bin b into half-open intervals [lo, hi) before cross-tabulating
    b_bins = pd.cut(data.b, [0.0, 0.2, 0.4, 0.6, 0.8, 1.0], right=False)
    b_cols = pd.crosstab(data.index, b_bins)
    c_cols = pd.crosstab(data.index, [data.c])

    new_data = a_cols.join(b_cols).join(c_cols)
    new_data.index.names = ['']
    print(new_data.to_string())
    """
       one  three  two  [0, 0.2)  [0.2, 0.4)  [0.4, 0.6)  [0.8, 1)  0  1  2  4  5
    0    1      0    0         0           1           0         0  1  0  0  0  0
    1    0      0    1         0           0           1         0  0  1  0  0  0
    2    0      0    1         0           0           0         1  1  0  0  0  0
    3    0      1    0         1           0           0         0  0  0  1  0  0
    4    1      0    0         1           0           0         0  0  0  0  1  0
    5    0      0    1         0           1           0         0  0  0  0  0  1
    """
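
As for the Burt table mentioned in the question: once an indicator matrix Z like the one above exists, the Burt table is the product Z'Z. The following is only a sketch under that definition. It builds Z with get_dummies rather than crosstab, which is a substitution made for brevity, and the variable names are illustrative:

```python
import pandas as pd

data = pd.DataFrame({
    "a": ["one", "two", "two", "three", "one", "two"],
    "b": [0.2, 0.4, 0.9, 0.1, 0.0, 0.2],
    "c": [0, 1, 0, 2, 4, 5],
})

# Discretize the float column first so that every variable is categorical.
binned = data.assign(
    b=pd.cut(data.b, [0.0, 0.2, 0.4, 0.6, 0.8, 1.0], right=False)
)

# Indicator matrix Z: one 0/1 column per (variable, value) pair.
Z = pd.get_dummies(binned, columns=["a", "b", "c"], dtype=int)

# Burt table Z'Z: diagonal blocks hold the value counts of each variable,
# off-diagonal blocks hold the pairwise cross-tabulations.
burt = Z.T.dot(Z)
print(burt)
```

For example, the entry at row a_two, column c_0 should match the corresponding cell of pd.crosstab(data.a, data.c). With millions of rows, a sparse indicator matrix may be worth considering, although that is not shown here.
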
+1
