How to discretize values in pandas DataFrame and convert to binary matrix?

Question

I mean something like this:

I have a DataFrame with columns that can be categorical or nominal. For each observation (row), I want to generate a new row, where each possible value for the variables is now its own binary variable. For example, this matrix (the first row is the column labels)

    'a'    'b'  'c'
    one    0.2  0
    two    0.4  1
    two    0.9  0
    three  0.1  2
    one    0.0  4
    two    0.2  5

will be converted to something like this:

    'a'                'b'                                                  'c'
    one  two  three    [0.0,0.2) [0.2,0.4) [0.4,0.6) [0.6,0.8) [0.8,1.0]    0 1 2 3 4 5
    1    0    0        0         1         0         0         0            1 0 0 0 0 0
    0    1    0        0         0         0         0         1            0 1 0 0 0 0
    0    1    0        0         0         0         0         1            1 0 0 0 0 0
    0    0    1        1         0         0         0         0            0 0 1 0 0 0
    1    0    0        1         0         0         0         0            0 0 0 0 1 0
    0    1    0        0         1         0         0         0            0 0 0 0 0 1

Each variable (column) in the original matrix is expanded into all of its possible values. If it is categorical, then each possible value becomes a new column. If it is a float, then the values are binned in some way (say, always split into 10 bins). If it is an int, then each distinct int value can become its own column, or the values can also be binned.

FYI: in my real application, a table has up to 2 million rows, and a full “extended” matrix can have hundreds of columns.

Is there an easy way to perform this operation?

Separately, I would ideally like to skip this step altogether, since what I am really trying to compute is a Burt table (a symmetric matrix of cross-tabulations). Is there an easy way to do something like that with the crosstab function? Otherwise, once I have the binary matrix, computing the cross-tabulations is just a matrix multiplication.

+7

python pandas dataframe

Uri laserson May 29 '12 at 12:06

5 answers

Note that I have added new cut and qcut functions for discretizing continuous data:

http://pandas-docs.imtqy.com/pandas-docs-travis/basics.html#discretization-and-quantiling
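As a rough illustration of what cut and qcut do, here is a minimal sketch on the b column from the question. The bin edges, the choice of q=2, and the variable names are assumptions made for this example, not something taken from the answer or the linked documentation.

```python
import pandas as pd

# Column "b" from the question's example data.
b = pd.Series([0.2, 0.4, 0.9, 0.1, 0.0, 0.2], name="b")

# Fixed-width bins; right=False makes each bin closed on the left: [lo, hi).
edges = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
b_fixed = pd.cut(b, bins=edges, right=False)

# Quantile-based bins: edges are chosen from the data so that the bins hold
# roughly equal numbers of observations (ties can make them uneven).
b_quant = pd.qcut(b, q=2)

# One 0/1 column per bin. Because b_fixed is categorical, bins with no
# observations still get a column.
b_dummies = pd.get_dummies(b_fixed, dtype=int)
print(b_dummies)
```

Note that with right=False the last bin is [0.8, 1.0), so a value of exactly 1.0 would fall outside every bin and come back as NaN.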

+29

Wes mckinney Jun 12 '12 at 21:52

For categorical columns, such as columns a and c in your example, you can use the built-in pandas function get_dummies().

Example:

    import numpy as np
    import pandas as pd

    s1 = ['a', 'b', np.nan]
    pd.get_dummies(s1)

    # Output (recent pandas versions print True/False unless dtype=int is passed):
       a  b
    0  1  0
    1  0  1
    2  0  0
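The same function also accepts a whole DataFrame. The sketch below applies it to the question's example data; the variable names and the choice to expand only a and c are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["one", "two", "two", "three", "one", "two"],
    "b": [0.2, 0.4, 0.9, 0.1, 0.0, 0.2],
    "c": [0, 1, 0, 2, 4, 5],
})

# Only the columns listed in `columns` are expanded; "b" is passed through
# unchanged. The integer column "c" is expanded because it is named explicitly.
dummies = pd.get_dummies(df, columns=["a", "c"], dtype=int)
print(dummies.columns.tolist())
```

The float column b would still need to be binned first (for example with cut) before it can be expanded in the same way.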

+5

wonderkid2 Mar 22 '15 at 12:13

I doubt you will beat patsy for simplicity. It was designed for exactly this task:

    >>> from patsy import dmatrix
    >>> dmatrix('C(a) + C(b) + C(c) - 1', df, return_type='dataframe')
       C(a)[one]  C(a)[three]  C(a)[two]  C(b)[T.0.1]  C(b)[T.0.2]  C(b)[T.0.4]  C(b)[T.0.9]  C(c)[T.1]  C(c)[T.2]  C(c)[T.4]  C(c)[T.5]
    0          1            0          0            0            1            0            0          0          0          0          0
    1          0            0          1            0            0            1            0          1          0          0          0
    2          0            0          1            0            0            0            1          0          0          0          0
    3          0            1          0            1            0            0            0          0          1          0          0
    4          1            0          0            0            0            0            0          0          0          1          0
    5          0            0          1            0            1            0            0          0          0          0          1

Here C(a) means "treat the variable a as categorical", and the -1 suppresses the intercept column in the output.

+3

elyase Aug 2 '13 at 14:37

Combining several of the other comments into a single answer to the OP's question:

    import pandas as pd

    # Example data from the question
    d = {'a': pd.Series(['one', 'two', 'two', 'three', 'one', 'two']),
         'b': pd.Series([0.2, 0.4, 0.9, 0.1, 0.0, 0.2]),
         'c': pd.Series([0, 1, 0, 2, 4, 5])}
    data = pd.DataFrame(d)

    # One indicator column per distinct value of a
    a_cols = pd.crosstab(data.index, [data.a])

    # Bin b into half-open intervals [lo, hi), then one indicator column per bin
    b_bins = pd.cut(data.b, [0.0, 0.2, 0.4, 0.6, 0.8, 1.0], right=False)
    b_cols = pd.crosstab(data.index, b_bins)

    # One indicator column per distinct value of c
    c_cols = pd.crosstab(data.index, [data.c])

    new_data = a_cols.join(b_cols).join(c_cols)
    new_data.index.names = ['']
    print(new_data.to_string())

    """
       one  three  two  [0, 0.2)  [0.2, 0.4)  [0.4, 0.6)  [0.8, 1)  0  1  2  4  5
    0    1      0    0         0           1           0         0  1  0  0  0  0
    1    0      0    1         0           0           1         0  0  1  0  0  0
    2    0      0    1         0           0           0         1  1  0  0  0  0
    3    0      1    0         1           0           0         0  0  0  1  0  0
    4    1      0    0         1           0           0         0  0  0  0  1  0
    5    0      0    1         0           1           0         0  0  0  0  0  1
    """
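The question also mentions a Burt table. Given a 0/1 indicator matrix like the one above, a minimal sketch of that step is the product of the matrix's transpose with itself. The names Z and burt, and the use of get_dummies to build the indicator matrix, are choices made for this example rather than part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["one", "two", "two", "three", "one", "two"],
    "b": [0.2, 0.4, 0.9, 0.1, 0.0, 0.2],
    "c": [0, 1, 0, 2, 4, 5],
})
b_bins = pd.cut(df["b"], [0.0, 0.2, 0.4, 0.6, 0.8, 1.0], right=False)

# Indicator matrix Z: one 0/1 column per category (a, c) or bin (b).
Z = pd.concat(
    [
        pd.get_dummies(df["a"], prefix="a", dtype=int),
        pd.get_dummies(b_bins, prefix="b", dtype=int),
        pd.get_dummies(df["c"], prefix="c", dtype=int),
    ],
    axis=1,
)

# Burt table: Z'Z. Diagonal entries are the counts of each level; each
# off-diagonal entry counts the rows where both levels occur together.
burt = Z.T.dot(Z)
print(burt.loc["a_two", "a_two"])  # number of rows where a == "two"
```

For tables with millions of rows, a dense Z may be too large to hold in memory; a sparse representation is one option, but it is not shown here.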

+1

Tim Jul 05 '13 at 4:40

lbolla · Accepted Answer · 2012-05-29T08:13:14+0000

You can use some kind of broadcasting:

    In [58]: df
    Out[58]:
           a    b  c
    0    one  0.2  0
    1    two  0.4  1
    2    two  0.9  0
    3  three  0.1  2
    4    one  0.0  4
    5    two  0.2  5

    In [41]: (df.a.values[:,numpy.newaxis] == df.a.unique()).astype(int)
    Out[41]:
    array([[1, 0, 0],
           [0, 1, 0],
           [0, 1, 0],
           [0, 0, 1],
           [1, 0, 0],
           [0, 1, 0]])

    In [54]: ((0 <= df.b.values[:,numpy.newaxis]) & (df.b.values[:,numpy.newaxis] < 0.2)).astype(int)
    Out[54]:
    array([[0],
           [0],
           [0],
           [1],
           [1],
           [0]])

    In [59]: (df.c.values[:,numpy.newaxis] == df.c.unique()).astype(int)
    Out[59]:
    array([[1, 0, 0, 0, 0],
           [0, 1, 0, 0, 0],
           [1, 0, 0, 0, 0],
           [0, 0, 1, 0, 0],
           [0, 0, 0, 1, 0],
           [0, 0, 0, 0, 1]])

Then join all the pieces together with pandas.concat or similar.
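Put together, the broadcasting pieces and the final concatenation might look like the sketch below. The helper names indicator_block and bin_block, the column-label format, and the bin edges are hypothetical choices for this example, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": ["one", "two", "two", "three", "one", "two"],
    "b": [0.2, 0.4, 0.9, 0.1, 0.0, 0.2],
    "c": [0, 1, 0, 2, 4, 5],
})

def indicator_block(values, levels, prefix):
    """Compare an (n, 1) column against k levels by broadcasting -> (n, k)."""
    values = np.asarray(values)
    levels = np.asarray(levels)
    block = (values[:, np.newaxis] == levels).astype(int)
    return pd.DataFrame(block, columns=[f"{prefix}={lv}" for lv in levels])

def bin_block(values, edges, prefix):
    """Half-open bins [lo, hi) from consecutive edges, again by broadcasting."""
    values = np.asarray(values)[:, np.newaxis]
    lo = np.asarray(edges[:-1])
    hi = np.asarray(edges[1:])
    # A value equal to the last edge falls outside every bin.
    block = ((lo <= values) & (values < hi)).astype(int)
    cols = [f"{prefix} in [{lo_i}, {hi_i})" for lo_i, hi_i in zip(lo, hi)]
    return pd.DataFrame(block, columns=cols)

parts = [
    indicator_block(df["a"], df["a"].unique(), "a"),
    bin_block(df["b"], [0.0, 0.2, 0.4, 0.6, 0.8, 1.0], "b"),
    indicator_block(df["c"], df["c"].unique(), "c"),
]

# Stack the blocks side by side into a single binary matrix.
binary = pd.concat(parts, axis=1)
print(binary.shape)
```

Each block is built independently, so a column can be given different levels or bin edges without touching the others.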
