Define an algorithm that takes a number and a list and returns a scalar based on the number's distance from the list's average

Suppose we have a list to which an integer between 15 and 32 is appended at each iteration (call this integer rand ). I want to develop an algorithm that assigns each rand a reward around 1 (from 0.75 to 1.25). The rule for assigning the reward is as follows.

First, we calculate the average value of the list. Then, if rand is greater than the average, we expect the reward to be less than 1, and if rand is less than the average, the reward should be greater than 1. The greater the distance between the average and rand , the greater the increase or decrease in the reward. For example:

rand = 15, avg = 23 then reward = 1.25

rand = 32, avg = 23 then reward = 0.75

rand = 23, avg = 23 then reward = 1, and so on.

I developed the code below for this algorithm:

    import numpy as np

    rollouts = np.array([])
    i = 0

    def modify_reward(lst, rand):
        reward = 1
        constant1 = 0.25
        constant2 = 1
        std = np.std(lst)
        global avg
        avg = np.mean(lst)
        sub = np.subtract(avg, rand)
        landa = sub / std if std != 0 else 0
        coefficient = -1 + (2 / (1 + np.exp(-constant2 * landa)))
        md_reward = reward + (reward * constant1 * coefficient)
        return md_reward

    while i < 100:
        rand = np.random.randint(15, 33)
        rollouts = np.append(rollouts, rand)
        modified_reward = modify_reward(rollouts, rand)
        i += 1
        print([i, rand, avg, modified_reward])

    # test the reward at the boundaries (15 is the lower bound, 32 the upper)
    rand1, rand2 = 15, 32
    reward1, reward2 = modify_reward(rollouts, rand1), modify_reward(rollouts, rand2)
    print(['reward for lower bound', rand1, avg, reward1])
    print(['reward for upper bound', rand2, avg, reward2])

The algorithm works pretty well, but if you look at the examples below, you will notice a problem with the algorithm.

rand = 15, avg = 23.94 then reward = 1.17 # should be 1.25

rand = 32, avg = 23.94 then reward = 0.84 # should be 0.75

rand = 15, avg = 27.38 then reward = 1.15 # should be 1.25

rand = 32, avg = 27.38 then reward = 0.93 # should be 0.75

As you can see, the algorithm does not take into account the distance between avg and the boundaries (15, 32). The closer avg moves to the lower or upper boundary, the more unbalanced modified_reward becomes.

I need modified_reward to be assigned evenly, regardless of whether avg drifts towards the upper or lower boundary. Could someone suggest a modification of this algorithm that takes into account the distance between avg and the boundaries of the list?

4 answers

I do not understand why you are calculating md_reward this way; it would help if you explained the logic behind it. In any case,

    landa = sub / std if std != 0 else 0
    coefficient = -1 + (2 / (1 + np.exp(-constant2 * landa)))
    md_reward = reward + (reward * constant1 * coefficient)

will not give you what you are looking for. Consider the cases below:

    for md_reward to be 0.75 --> coefficient should be -1 --> landa == -infinity
        (a large negative value, i.e. rand would have to be much larger than 32)
    for md_reward to be 1    --> coefficient should be 0  --> landa == 0
        (std == 0 or sub == 0)  # which is possible
    for md_reward to be 1.25 --> coefficient should be 1  --> landa == +infinity
        (a large positive value, i.e. rand would have to be much smaller than 15)
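To see this numerically, here is a quick check using the first example from the question (the std value is an assumption, roughly what a uniform sample from [15, 32] would give):

    import numpy as np

    avg, std, rand = 23.94, 5.0, 15                # std assumed for illustration
    landa = (avg - rand) / std                     # ~1.79 -- finite, not infinite
    coefficient = -1 + (2 / (1 + np.exp(-landa)))  # ~0.71, well short of 1
    print(1 + 0.25 * coefficient)                  # ~1.18, not the desired 1.25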

If you want to normalize the rewards from avg to max and from avg to min, check the links below: https://stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range and https://stats.stackexchange.com/questions/70553/what-does-normalization-mean-and-how-to-verify-that-a-sample-or-a-distribution

Now change your function to something like this:

    import numpy as np

    def modify_reward(lst, rand):
        reward = 1
        constant1 = 0.25
        min_value = 15
        max_value = 32
        avg = np.mean(lst)
        if rand >= avg:
            # normalize rand from avg to max
            md_reward = reward - constant1 * (rand - avg) / (max_value - avg)
        else:
            # normalize rand from min to avg
            md_reward = reward + constant1 * (1 - (rand - min_value) / (avg - min_value))
        return md_reward

I used the following normalization method:

    Normalized = (X - min(X)) / (max(X) - min(X))

For the case rand >= avg, min(X) is avg and max(X) is max_value; for the case rand < avg, min(X) is min_value and max(X) is avg.
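A quick check of the revised function (the sample list below is made up for illustration):

    rollouts = np.array([20, 25, 30, 18, 26])  # hypothetical data; avg = 23.8
    print(modify_reward(rollouts, 15))    # 1.25 at the lower bound
    print(modify_reward(rollouts, 32))    # 0.75 at the upper bound
    print(modify_reward(rollouts, 23.8))  # 1.0 at the average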

Hope this helps.


Combining these two requirements:

if rand is greater than the average, we expect the reward to be less than 1, and if rand is less than the average, the reward will be greater than 1.

I need modified_reward to be assigned evenly, regardless of whether avg moves towards the upper or lower boundary.

is a little harder, depending on what you mean by "evenly".

If you want 15 to always be rewarded with 1.25 and 32 to always be rewarded with 0.75, you cannot have a single linear relationship and also satisfy the first requirement.

If you are happy with two linear relationships, you can aim for a situation where modified_reward depends on rand like this:

[Plot: modified_reward as a piecewise-linear function of rand, falling from 1.25 at the lower bound to 0.75 at the upper bound, with a kink at avg]

which I made with this Wolfram Alpha query. As you can see, it consists of two linear relationships with a "knee" at avg . I expect you can derive the formula for each piece without any problems.
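For completeness, a minimal sketch of those two pieces (assuming the 15/32 bounds and the 1.25/0.75 targets from the question; the function name is mine):

    def knee_reward(rand, avg, lo=15, hi=32):
        if rand < avg:
            # left piece: the line through (lo, 1.25) and (avg, 1.0)
            return 1.0 + 0.25 * (avg - rand) / (avg - lo)
        # right piece: the line through (avg, 1.0) and (hi, 0.75)
        return 1.0 - 0.25 * (rand - avg) / (hi - avg)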


This code implements a linear distribution of weights proportional to the distance from the average to your specified limits.

    import numpy as np

    class Rewarder(object):
        lo = 15
        hi = 32
        weight = 0.25

        def __init__(self):
            self.lst = np.array([])

        def append(self, x):
            self.lst = np.append(self.lst, [x])

        def average(self):
            return np.mean(self.lst)

        def distribution(self, a, x, b):
            '''
            Return a number between 0 and 1 proportional to
            the distance of x from a towards b.

            Note: Modify this fraction if you want a normal
            distribution or quadratic etc.
            '''
            return (x - a) / (b - a)

        def reward(self, x):
            avg = self.average()
            if x > avg:
                w = self.distribution(avg, x, self.hi)
            else:
                w = -self.distribution(avg, x, self.lo)
            return 1 - self.weight * w

    rollouts = Rewarder()
    rollouts.append(23)
    print(rollouts.reward(15))
    print(rollouts.reward(32))
    print(rollouts.reward(23))

Output:

    1.25
    0.75
    1.0

The code in your question seems to be using np.std , which I assume is an attempt to get a normal distribution. Remember that the normal distribution never reaches zero.

If you tell me which form you want to distribute, we can modify Rewarder.distribution to fit.
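As a hypothetical example of such a modification, a quadratic shape could be dropped in like this (the subclass name is mine):

    class RewarderQuad(Rewarder):
        def distribution(self, a, x, b):
            # Squaring the 0..1 fraction keeps the endpoints fixed
            # but makes the reward change more slowly near avg.
            return super(RewarderQuad, self).distribution(a, x, b) ** 2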

Edit:

I can't access the paper you are referring to, but the conclusion is that you want a sigmoid-style distribution of rewards, giving 0 at the average and approximately +/-0.25 at min and max. Using the error function as the weighting, if we scale by 2 we get approximately 0.995 at min and max.

Override Rewarder.distribution:

    import math

    class RewarderERF(Rewarder):
        def distribution(self, a, x, b):
            """
            Return an error function (sigmoid) weighting of the distance from a.

            Note: scaled to reduce the error at max to ~0.003
            ref: https://en.wikipedia.org/wiki/Sigmoid_function
            """
            return math.erf(2.0 * super(RewarderERF, self).distribution(a, x, b))

    rollouts = RewarderERF()
    rollouts.append(23)
    print(rollouts.reward(15))
    print(rollouts.reward(32))
    print(rollouts.reward(23))

leads to:

    1.24878131454
    0.75121868546
    1.0

You can choose whichever error function suits your application and how much error you can accept at min and max. I would also expect you to fold all these functions into your class; I split everything out here so we can see the details.

As for calculating the average: do you need to keep a list of values and recompute it each time, or can you just keep a running count and total? Then you would not need numpy for this calculation.
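A minimal sketch of that idea (the class and method names are mine):

    class RunningAverage:
        """Track a running count and total instead of storing every value."""
        def __init__(self):
            self.count = 0
            self.total = 0.0

        def append(self, x):
            self.count += 1
            self.total += x

        def average(self):
            return self.total / self.count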


Try this:

    import numpy as np

    def modify_reward(lst, rand):
        reward = 1
        constant = 0.25  # think of this as the +/- amount from the initial reward
        global avg
        avg = np.mean(lst)
        sub = np.subtract(avg, rand)
        dreward = 0
        if sub > 0:
            dreward = sub / (avg - 15)  # put your lower boundary instead of 15
        elif sub < 0:
            dreward = sub / (32 - avg)  # put your upper boundary instead of 32
        md_reward = reward + (dreward * constant)
        return md_reward
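A quick check that the rewards stay balanced even when the average sits close to a boundary (the rollout history below is made up for illustration):

    rollouts = np.array([26, 28, 30, 27, 29])  # hypothetical data; avg = 28
    print(modify_reward(rollouts, 15))  # 1.25 at the lower boundary
    print(modify_reward(rollouts, 32))  # 0.75 at the upper boundary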

This is a linear solution based on @AakashM's answer. I don't know if this is exactly what you were looking for, but it fits your description.

