Drop columns with low standard deviation in Pandas Dataframe

Question

Is there a way to do this without writing a for loop?

Suppose we have the following data:

    import pandas as pd

    d = {
        'A': {-1: 0.19052041339798062, 0: -0.0052531481871952871, 1: -0.0022017467720961644, 2: -0.051109629013311737, 3: 0.18569441222621336},
        'B': {-1: 0.029181417300734112, 0: -0.0031021862533310743, 1: -0.014358516787430284, 2: 0.0046386615308068877, 3: 0.056676322314857898},
        'C': {-1: 0.071883343375205785, 0: -0.011930096520251999, 1: -0.011836365865654104, 2: -0.0033930358388315237, 3: 0.11812543193496111},
        'D': {-1: 0.17670604006475121, 0: -0.088756293654161142, 1: -0.093383245649534194, 2: 0.095649943383654359, 3: 0.51030339029516592},
        'E': {-1: 0.30273513342295627, 0: -0.30640233455497284, 1: -0.32698263145105921, 2: 0.60257484810641992, 3: 0.36859978928328413},
        'F': {-1: 0.25328469046380131, 0: -0.063890702001567143, 1: -0.10007720832198815, 2: 0.08153164759036724, 3: 0.36606175240021183},
        'G': {-1: 0.28764606940509913, 0: -0.11022209861109525, 1: -0.1264164305949009, 2: 0.17030074112227081, 3: 0.30100292424380881},
    }
    df = pd.DataFrame(d)

I know that I can get the standard deviation of each column with std_vals = df.std(), which gives the result below, and then use those values to delete columns one by one.

    In[]: pd.DataFrame(d).std()
    Out[]:
    A    0.115374
    B    0.028435
    C    0.059394
    D    0.247617
    E    0.421117
    F    0.200776
    G    0.209710
    dtype: float64

However, I do not know how to use Pandas indexing to remove columns with low std values directly.

Is there a way to do this, or do I need to loop over each column?

+7

python pandas

Ashkan Aug 4 '15 at 1:17


2 answers

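To remove columns with drop, you need their names, so select the labels of the columns whose standard deviation falls below the threshold and pass them to drop: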

    threshold = 0.2
    df.drop(df.std()[df.std() < threshold].index.values, axis=1)

             D       E       F       G
    -1  0.1767  0.3027  0.2533  0.2876
     0 -0.0888 -0.3064 -0.0639 -0.1102
     1 -0.0934 -0.3270 -0.1001 -0.1264
     2  0.0956  0.6026  0.0815  0.1703
     3  0.5103  0.3686  0.3661  0.3010
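The drop call above evaluates df.std() twice. A minimal sketch of the same idea that computes the standard deviations once is shown here; the toy frame and the names stds, low_cols and trimmed are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Hypothetical toy data, not the frame from the question.
toy = pd.DataFrame({
    "low": [1.0, 1.0, 1.0, 1.0],   # constant column, std is 0
    "mid": [0.0, 0.1, 0.0, 0.1],   # std is roughly 0.058
    "high": [0.0, 1.0, 0.0, 1.0],  # std is roughly 0.577
})

threshold = 0.2
stds = toy.std()                          # per-column sample std (ddof=1 by default)
low_cols = stds[stds < threshold].index   # labels of the columns to drop
trimmed = toy.drop(columns=low_cols)      # returns a new frame; toy is unchanged
```

With this toy data only the "high" column survives. drop returns a copy by default, so assign the result if you want to keep it.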

+3

Jianxun li Aug 4 '15 at 1:22


maxymoo · Accepted Answer · 2015-08-04T01:23:24+0000

You can use the loc indexer on a DataFrame to select columns with a boolean mask. Build the mask like this (the comparison is applied element-wise, NumPy-style, to the Series of per-column standard deviations):

    df.std() > 0.3
    Out[84]:
    A    False
    B    False
    C    False
    D    False
    E     True
    F    False
    G    False
    dtype: bool

Then index with loc, putting : in the first position to indicate that you want all the rows, and the mask in the second position to pick the columns:

    df.loc[:, df.std() > .3]
    Out[85]:
               E
    -1  0.302735
     0 -0.306402
     1 -0.326983
     2  0.602575
     3  0.368600
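The boolean-mask approach can be wrapped in a small function if you need it in more than one place. This is only a sketch: keep_high_std is a hypothetical helper name, the toy frame is invented, and it assumes every column is numeric so that the mask lines up with the frame's columns.

```python
import pandas as pd

def keep_high_std(frame, threshold):
    """Hypothetical helper: keep columns whose sample std exceeds threshold.

    Assumes all columns are numeric, so frame.std() has one entry per column.
    """
    mask = frame.std() > threshold   # boolean Series indexed by column label
    return frame.loc[:, mask]        # all rows, only the columns where mask is True

# Hypothetical toy data, not the frame from the question.
toy = pd.DataFrame({"flat": [2.0, 2.0, 2.0], "wide": [0.0, 5.0, 10.0]})
result = keep_high_std(toy, 0.3)
```

Here the constant "flat" column is filtered out and "wide" is kept. Selecting with a mask and dropping by label are two routes to the same result; which one reads better is largely a matter of taste.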
