Normalize data in pandas

Suppose I have a pandas data frame df:

I want to calculate the column-wise mean of the data frame.

This is easy:

    df.apply(np.mean)  # assuming numpy is imported as np; df.mean() also works

and then the column-wise range, max(col) - min(col). This is easy again:

    df.apply(max) - df.apply(min)

Now, for each element, I want to subtract its column's mean and divide by its column's range. I'm not sure how to do this cleanly.
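The best I have managed is an explicit loop over the columns (a sketch, assuming every column is numeric), which feels clumsy:

    # subtract each column's mean and divide by its range, one column at a time
    for col in df.columns:
        df[col] = (df[col] - df[col].mean()) / (df[col].max() - df[col].min())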

Any help / pointers are greatly appreciated.

+125
python numpy pandas
Sep 21
6 answers
    In [92]: df
    Out[92]:
                a         b          c         d
    A   -0.488816  0.863769   4.325608 -4.721202
    B  -11.937097  2.993993 -12.916784 -1.086236
    C   -5.569493  4.672679  -2.168464 -9.315900
    D    8.892368  0.932785   4.535396  0.598124

    In [93]: df_norm = (df - df.mean()) / (df.max() - df.min())

    In [94]: df_norm
    Out[94]:
              a         b         c         d
    A  0.085789 -0.394348  0.337016 -0.109935
    B -0.463830  0.164926 -0.650963  0.256714
    C -0.158129  0.605652 -0.035090 -0.573389
    D  0.536170 -0.376229  0.349037  0.426611

    In [95]: df_norm.mean()
    Out[95]:
    a   -2.081668e-17
    b    4.857226e-17
    c    1.734723e-17
    d   -1.040834e-17

    In [96]: df_norm.max() - df_norm.min()
    Out[96]:
    a    1
    b    1
    c    1
    d    1
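For anyone reproducing this, a self-contained sketch of the same idea (the DataFrame here is filled with arbitrary random numbers, not the values above):

    import numpy as np
    import pandas as pd

    np.random.seed(0)
    df = pd.DataFrame(np.random.randn(4, 4) * 5,
                      index=list('ABCD'), columns=list('abcd'))

    # mean(), max() and min() reduce column-wise, and the arithmetic
    # broadcasts each column's statistics down that column's rows
    df_norm = (df - df.mean()) / (df.max() - df.min())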
+213
Sep 21 '12 at 7:14

If you don't mind importing the sklearn library, I would recommend the method described in this blog.

    import pandas as pd
    from sklearn import preprocessing

    data = {'score': [234, 24, 14, 27, -74, 46, 73, -18, 59, 160]}
    df = pd.DataFrame(data)
    cols = df.columns  # take the columns from the DataFrame (a dict has no .columns)
    df

    min_max_scaler = preprocessing.MinMaxScaler()
    np_scaled = min_max_scaler.fit_transform(df)
    df_normalized = pd.DataFrame(np_scaled, columns=cols)
    df_normalized
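One advantage of keeping the scaler object around is that its fitted min/max can be reapplied to later data. A minimal sketch, assuming the new frame has the same columns (new_df and its values are made up here):

    # transform() reuses the min/max learned by fit_transform() above;
    # values outside the fitted range will map outside [0, 1]
    new_df = pd.DataFrame({'score': [50, 300]})
    new_normalized = pd.DataFrame(min_max_scaler.transform(new_df), columns=cols)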
+68
May 13 '16

You can use apply for this, and it is a bit neater:

    import numpy as np
    import pandas as pd

    np.random.seed(1)
    df = pd.DataFrame(np.random.randn(4, 4) * 4 + 3)

              0         1         2         3
    0  9.497381  0.552974  0.887313 -1.291874
    1  6.461631 -6.206155  9.979247 -0.044828
    2  4.276156  2.002518  8.848432 -5.240563
    3  1.710331  1.463783  7.535078 -1.399565

    df.apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))

              0         1         2         3
    0  0.515087  0.133967 -0.651699  0.135175
    1  0.125241 -0.689446  0.348301  0.375188
    2 -0.155414  0.310554  0.223925 -0.624812
    3 -0.484913  0.244924  0.079473  0.114448

Also, it works great with groupby if you select the appropriate columns:

    df['grp'] = ['A', 'A', 'B', 'B']

              0         1         2         3 grp
    0  9.497381  0.552974  0.887313 -1.291874   A
    1  6.461631 -6.206155  9.979247 -0.044828   A
    2  4.276156  2.002518  8.848432 -5.240563   B
    3  1.710331  1.463783  7.535078 -1.399565   B

    df.groupby(['grp'])[[0, 1, 2, 3]].apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))

         0    1    2    3
    0  0.5  0.5 -0.5 -0.5
    1 -0.5 -0.5  0.5  0.5
    2  0.5  0.5  0.5 -0.5
    3 -0.5 -0.5 -0.5  0.5
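As an aside, the same per-group normalization can be written with transform, which returns a result aligned with the original rows; a small sketch under the same setup:

    # transform applies the function to each column within each group
    # and preserves the original row order
    df_norm = df.groupby('grp')[[0, 1, 2, 3]].transform(
        lambda x: (x - x.mean()) / (x.max() - x.min())
    )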
+32
Oct 21 '15 at 3:10

Slightly modified from: Python Pandas Dataframe: normalize data between 0.01 and 0.99?, but some of the comments suggested it was relevant here (sorry if this counts as a repost, though...).

I needed customized normalization, where the regular percentile or z-score was not adequate. Sometimes I knew what the feasible max and min of the population were, and therefore wanted to define them by something other than my sample, or a different midpoint, or something else! This can often be useful for rescaling and normalizing data for neural nets, where you may want all inputs between 0 and 1, but some of your data may need to be scaled in a more customized way... because percentiles and stdevs assume your sample covers the population, but sometimes we know this is not true. It was also very useful for me when visualizing data in heatmaps. So I built a custom function (with extra steps in the code to make it as readable as possible):

    from statistics import mean, median  # needed for the 'avg' and 'median' options

    def NormData(s, low='min', center='mid', hi='max', insideout=False, shrinkfactor=0.):
        if low == 'min':
            low = min(s)
        elif low == 'abs':
            low = max(abs(min(s)), abs(max(s))) * -1.  # *sign(min(s))
        if hi == 'max':
            hi = max(s)
        elif hi == 'abs':
            hi = max(abs(min(s)), abs(max(s))) * 1.  # *sign(max(s))

        if center == 'mid':
            center = (max(s) + min(s)) / 2
        elif center == 'avg':
            center = mean(s)
        elif center == 'median':
            center = median(s)

        # shift everything so the chosen center sits at 0
        s2 = [x - center for x in s]
        hi = hi - center
        low = low - center
        center = 0.

        # map [low, center] to [0, 0.5] and [center, hi] to [0.5, 1],
        # clipping anything outside the range
        r = []
        for x in s2:
            if x < low:
                r.append(0.)
            elif x > hi:
                r.append(1.)
            elif x >= center:
                r.append((x - center) / (hi - center) * 0.5 + 0.5)
            else:
                r.append((x - low) / (center - low) * 0.5 + 0.)

        # optionally flip so values nearest the center map highest
        if insideout:
            r = [(1. - abs(z - 0.5) * 2.) for z in r]

        # pull everything toward 0.5 by the shrink factor
        rr = [x - (x - 0.5) * shrinkfactor for x in r]
        return rr

It takes a pandas Series, or even just a list, and normalizes it to your specified low, center, and high points. There is also a shrink factor! This lets you scale the data away from the endpoints 0 and 1 (I had to do this when combining colormaps in matplotlib: A single pcolormesh with more than one colormap using Matplotlib). This way you can see how the code works, but basically say you have the values [-5, 2, 10] in a sample, but want to normalize based on a range of -7 to 7 (so anything above 7, e.g. our "10", is effectively treated as a 7) with a midpoint of 1, shrunk to fit a 256-color colormap:

    #In[1]
    NormData([-5, 2, 10], low=-7, center=1, hi=7, shrinkfactor=2./256)
    #Out[1]
    [0.1279296875, 0.5826822916666667, 0.99609375]

It can also turn your data inside out... this may seem odd, but I found it useful for heatmap stuff. Say you want a darker color for values closer to 0 rather than to hi/low. You could build the heatmap from data normalized with insideout=True:

    #In[2]
    NormData([-5, 2, 10], low=-7, center=1, hi=7, insideout=True, shrinkfactor=2./256)
    #Out[2]
    [0.251953125, 0.8307291666666666, 0.00390625]

So now "2", which is closest to the center (which we defined as 1), is mapped to the highest value.
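For instance, a minimal matplotlib sketch of that heatmap idea (the data here are the same toy values; 'gray_r' is just one colormap choice):

    import matplotlib.pyplot as plt

    vals = NormData([-5, 2, 10], low=-7, center=1, hi=7,
                    insideout=True, shrinkfactor=2./256)
    # insideout=True maps values near the center to the top of the scale,
    # so a reversed colormap like 'gray_r' renders them darkest
    plt.imshow([vals], cmap='gray_r', vmin=0., vmax=1., aspect='auto')
    plt.colorbar()
    plt.show()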

Anyway, I thought I'd add my use case in case you want to rescale data in other ways that might have useful applications for you.

+2
May 05 '17 at 18:27

If you want to normalize the data, you can use this simple solution:

    df = (df - df.min()) / (df.max() - df.min())
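One caveat: if the frame mixes dtypes, the line above will choke on non-numeric columns. A sketch of restricting it to the numeric columns first:

    # normalize only the numeric columns, leaving the rest untouched
    num_cols = df.select_dtypes(include='number').columns
    df[num_cols] = (df[num_cols] - df[num_cols].min()) / (
        df[num_cols].max() - df[num_cols].min())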
0
Dec 01 '18 at 23:02

Here's how you do it column by column:

    [df[col].update((df[col] - df[col].min()) / (df[col].max() - df[col].min()))
     for col in df.columns]
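Since the comprehension above is used only for its side effects, an explicit loop arguably reads more clearly; a sketch of the same column-by-column scaling:

    # same min-max scaling, one column at a time, as a plain loop
    for col in df.columns:
        df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())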
0
Aug 01 '19 at 21:58


