I am looking to build or adapt a model, preferably based on RL theory, that can solve the following problem. I would really appreciate any directions or pointers.
I have a continuous action space: actions can be selected from the range 10-100 (inclusive). Each action is associated with a reinforcement value between 0 and 1 (also inclusive), according to a value function. So far, so good. Here's where it starts to get over my head:
Complication 1:
The value function V assigns reinforcement to actions according to the distance between a given action x and a target action A: the smaller the distance, the greater the reinforcement (i.e., reinforcement is inversely proportional to abs(A - x)). However, the value function is non-zero only for actions close to A (abs(A - x) smaller than some epsilon) and zero everywhere else. So:
**V** proportional to 1 / abs(**A** - **x**) for abs(**A** - **x**) < epsilon, and
**V** = 0 for abs(**A** - **x**) >= epsilon.
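For concreteness, here is a minimal Python sketch of such a value function. The clamp to 1.0 near the target and the default eps=5.0 are my own illustrative choices (the pure inverse would exceed the stated 0-1 range as x approaches A):

```python
def value(x, A, eps=5.0):
    """Reinforcement for action x given target A: inversely
    proportional to abs(A - x) inside the eps window, zero outside.
    Clamped to 1.0 because 1/abs(A - x) blows up as x -> A; the
    clamp and eps=5.0 are assumptions, not part of the spec."""
    d = abs(A - x)
    if d >= eps:
        return 0.0
    return min(1.0, 1.0 / max(d, 1e-9))
```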
Complication 2:
I do not know exactly which action was taken at each step. I only know that it lies within a range x +/- sigma, so I cannot precisely attribute the reinforcement I receive to any single action.
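In code, one way to picture this observation model is below. The uniform draw of the true action and the value sigma=7.5 (chosen to match the roughly 15-unit-wide ranges in the examples that follow) are placeholders of mine:

```python
import numpy as np

rng = np.random.default_rng(0)

def run_trial(A, value_fn, sigma=7.5):
    """One trial: the true action is hidden from the observer; all
    that gets recorded is an interval of half-width sigma around it,
    plus the exact reward it earned."""
    x_true = rng.uniform(10, 100)      # action actually taken (unobserved)
    reward = value_fn(x_true, A)       # reinforcement is observed exactly
    return (x_true - sigma, x_true + sigma), reward
```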
The exact problem I would like to solve is the following: I have a series of noisy estimates of actions paired with exact reinforcement values (for example, in trial 1 I might have x ~ 15-30 and reward 0; in trial 2, x ~ 25-40 and reward 0; in trial 3, x ~ 80-95 and reward 0.6). I would like to build a model that maintains an estimate of the most likely location of the target action A after each trial, possibly weighting new information according to some learning-rate parameter (since certainty should increase as more samples come in).
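One direction that seems to fit: treat this as Bayesian inference over A rather than classical RL. Maintain a discretised posterior over candidate targets, and for each trial score every candidate by how well it predicts the observed reward, marginalising the unknown action uniformly over its reported interval. Everything in the sketch below is a modelling choice of mine, not something the problem dictates: the grid resolution, the Gaussian reward tolerance sigma_r, the uniform-over-interval assumption, and the value function (repeated from the earlier sketch).

```python
import numpy as np

def value(x, A, eps=5.0):
    # Same illustrative value function as in the earlier sketch.
    d = abs(A - x)
    return 0.0 if d >= eps else min(1.0, 1.0 / max(d, 1e-9))

def posterior_update(belief, grid, interval, reward,
                     eps=5.0, sigma_r=0.05, n_x=50):
    """One Bayesian step: P(A | data) is proportional to
    P(reward | A, interval) * P(A). The likelihood averages over the
    unknown action x, assumed uniform on its interval, with a
    Gaussian tolerance sigma_r on the reward match."""
    lo, hi = interval
    xs = np.linspace(lo, hi, n_x)                 # plausible true actions
    like = np.empty_like(grid)
    for i, A in enumerate(grid):
        pred = np.array([value(x, A, eps) for x in xs])
        like[i] = np.mean(np.exp(-0.5 * ((reward - pred) / sigma_r) ** 2))
    post = belief * like
    return post / post.sum()

grid = np.linspace(10, 100, 901)                  # candidate targets A
belief = np.full(grid.size, 1.0 / grid.size)      # uniform prior

# The three example trials from the question: (interval, reward).
trials = [((15, 30), 0.0), ((25, 40), 0.0), ((80, 95), 0.6)]
for interval, r in trials:
    belief = posterior_update(belief, grid, interval, r)

print("MAP estimate of A:", grid[np.argmax(belief)])
```

With this framing there is no hand-tuned learning rate: the posterior sharpens on its own as trials accumulate, which matches the "certainty increases with samples" intuition. If a recursive update with an explicit gain is preferred, a Kalman filter or particle filter over A would give much the same behaviour.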