Python Pandas Dataframe: normalize data between 0.01 and 0.99?

I am trying to bound each value in a data frame between 0.01 and 0.99.

I successfully normalized the data between 0 and 1 using: .apply(lambda x: (x - x.min()) / (x.max() - x.min())) as follows:

    df = pd.DataFrame({'one': ['AAL', 'AAL', 'AAPL', 'AAPL'], 'two': [1, 1, 5, 5], 'three': [4, 4, 2, 2]})
    df[['two', 'three']].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
    df

Now I want to bound all the values between 0.01 and 0.99.

Here is what I tried:

    def bound_x(x):
        if x == 1:
            return x - 0.01
        elif x < 0.99:
            return x + 0.01

    df[['two', 'three']].apply(bound_x)
    df

But I get the following error:

 ValueError: ('The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', u'occurred at index two') 
+2
python pandas dataframe normalization
Mar 19 '16 at 13:28
2 answers

There is a method for exactly this, clip:

    import pandas as pd

    df = pd.DataFrame({'one': ['AAL', 'AAL', 'AAPL', 'AAPL'], 'two': [1, 1, 5, 5], 'three': [4, 4, 2, 2]})
    df = df[['two', 'three']].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
    df = df.clip(lower=0.01, upper=0.99)

gives

        two  three
    0  0.01   0.99
    1  0.01   0.99
    2  0.99   0.01
    3  0.99   0.01
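Note that clip only changes values that fall outside the bounds. In this data every normalized value is exactly 0 or 1, so they all end up at 0.01 or 0.99. If intermediate values should also be rescaled linearly onto the target interval, one option is to stretch the min-max result. A minimal sketch, assuming numeric columns with more than one distinct value; the names `rescale`, `lo`, `hi` and `spread` are hypothetical and not part of the original answer:

```python
import pandas as pd

def rescale(col, lo=0.01, hi=0.99):
    """Linearly map a numeric column onto [lo, hi].

    Assumes col.max() != col.min(); otherwise the division yields NaN.
    """
    unit = (col - col.min()) / (col.max() - col.min())  # min-max to [0, 1]
    return lo + unit * (hi - lo)                         # stretch to [lo, hi]

df = pd.DataFrame({'one': ['AAL', 'AAL', 'AAPL', 'AAPL'],
                   'two': [1, 1, 5, 5],
                   'three': [4, 4, 2, 2]})
scaled = df[['two', 'three']].apply(rescale)

# A column with an intermediate value: 5 lands at the middle of the interval.
spread = rescale(pd.Series([0, 5, 10]))
```

Unlike clip, this moves every value, so it is only appropriate when the whole column should be compressed rather than just capped at the ends.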



The problem with

    df[['two', 'three']].apply(bound_x)

is that bound_x is passed a Series such as df['two'], and then if x == 1 requires x == 1 to be evaluated in a boolean context. x == 1 is a boolean Series, for example,

    In [44]: df['two'] == 1
    Out[44]:
    0    False
    1    False
    2     True
    3     True
    Name: two, dtype: bool

Python tries to reduce this Series to a single boolean value, True or False. Pandas follows the NumPy convention of raising an error when you try to convert a Series (or array) to a bool.
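The behavior described above can be reproduced in a few lines. The sketch below first triggers the error, then shows explicit reductions, then an elementwise version of the branching in bound_x built with np.where. The variable names are hypothetical, and leaving values between 0.99 and 1 unchanged is an assumption (the original bound_x returns None for them):

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 0.0, 1.0, 1.0], name='two')

# Using a Series as an `if` condition raises ValueError: pandas will not
# guess whether "any" or "all" elements should decide the branch.
try:
    if s == 1:
        pass
    raised = False
except ValueError:
    raised = True

# Explicit reductions state the intent.
any_one = (s == 1).any()   # True: at least one element equals 1
all_one = (s == 1).all()   # False: not every element equals 1

# Elementwise branching: np.where chooses per element instead of
# collapsing the whole Series to one boolean.
# Assumption: values in [0.99, 1) are passed through unchanged.
bounded = pd.Series(
    np.where(s == 1, s - 0.01, np.where(s < 0.99, s + 0.01, s)),
    index=s.index, name=s.name)
```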

+10
Mar 19 '16 at 13:32

So I had a similar problem where I wanted a custom normalization, because the usual percentile or z-score normalization was not adequate. Sometimes I knew what the feasible maximum and minimum of the population were and wanted to define them independently of my sample, or use a different midpoint, or whatever! So I built a custom function (with extra steps in the code to make it as readable as possible):

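    # Assumption: the original did not show where mean/median came from;
    # the standard-library versions are used here.
    from statistics import mean, median

    def NormData(s, low='min', center='mid', hi='max', insideout=False, shrinkfactor=0.):
        # Resolve the low and high anchors: sample extremes, or a range
        # symmetric around zero ('abs').
        if low == 'min':
            low = min(s)
        elif low == 'abs':
            low = max(abs(min(s)), abs(max(s))) * -1.
        if hi == 'max':
            hi = max(s)
        elif hi == 'abs':
            hi = max(abs(min(s)), abs(max(s))) * 1.

        # Resolve the center: midrange, mean or median of the sample.
        if center == 'mid':
            center = (max(s) + min(s)) / 2
        elif center == 'avg':
            center = mean(s)
        elif center == 'median':
            center = median(s)

        # Shift everything so that the center sits at 0.
        s2 = [x - center for x in s]
        hi = hi - center
        low = low - center
        center = 0.

        r = []
        for x in s2:
            if x < low:
                r.append(0.)      # below the low anchor: clamp to 0
            elif x > hi:
                r.append(1.)      # above the high anchor: clamp to 1
            else:
                if x >= center:
                    # upper half maps linearly onto [0.5, 1]
                    r.append((x - center) / (hi - center) * 0.5 + 0.5)
                else:
                    # lower half maps linearly onto [0, 0.5]
                    r.append((x - low) / (center - low) * 0.5 + 0.)

        # Optionally flip so the center becomes 1 and the extremes become 0.
        if insideout:
            r = [(1. - abs(z - 0.5) * 2.) for z in r]

        # Pull the results toward 0.5, away from the endpoints 0 and 1.
        rr = [x - (x - 0.5) * shrinkfactor for x in r]
        return rr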

It takes a pandas Series, or even just a list, and normalizes it to your specified low, center and high points. There is also a shrink factor, which lets you scale the data away from the endpoints 0 and 1. (I had to do this when combining colormaps in matplotlib: Single pcolormesh with more than one colormap using Matplotlib.)

You can probably see how the code works, but say you have the values [-5, 2, 10] in a sample and want to normalize based on a range of -7 to 7 (so anything above 7, such as our "10", is effectively treated as 7) with a midpoint of 1, and then shrink the result to fit a 256-color RGB colormap:

    #In[1]
    NormData([-5,2,10],low=-7,center=1,hi=7,shrinkfactor=2./256)
    #Out[1]
    [0.1279296875, 0.5826822916666667, 0.99609375]

It can also turn your data inside out. This may seem odd, but I found it useful for heatmaps. Say you want a darker color for values close to the center rather than for the high and low extremes. You can build the heatmap from data normalized with insideout=True:

    #In[2]
    NormData([-5,2,10],low=-7,center=1,hi=7,insideout=True,shrinkfactor=2./256)
    #Out[2]
    [0.251953125, 0.8307291666666666, 0.00390625]

So now "2", which is closest to the center, defined as "1", is the highest value.

Anyway, I thought my problem was very similar to yours, and this function could be useful to you.
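To tie this back to the question: the final step of the function, x - (x - 0.5) * shrinkfactor, pulls values in [0, 1] toward 0.5, so a shrink factor of 0.02 sends 0 to 0.01 and 1 to 0.99. A minimal sketch of that step in isolation (the helper name `shrink` is hypothetical and not part of the function above):

```python
def shrink(values, shrinkfactor=0.02):
    """Pull values in [0, 1] toward 0.5 by the given factor."""
    return [x - (x - 0.5) * shrinkfactor for x in values]

# 0 moves up by 0.01, 0.5 stays put, 1 moves down by 0.01.
result = shrink([0.0, 0.5, 1.0])
```

The shift is proportional to the distance from 0.5, so the ordering of the values is preserved and only the overall range narrows.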

+1
May 05 '17 at 18:13