Learning the outcome space given noisy actions and non-monotonic reinforcement

I am looking to build or adapt a model, preferably based on RL theory, that can solve the following problem. I would really appreciate any directions or pointers.

I have a continuous action space where actions can be selected from the range 10-100 (inclusive). Each action is associated with a specific reinforcement value from 0 to 1 (also inclusive) according to a value function. So far, so good. Here is where it starts to get over my head:

Complication 1:

The value function V maps actions to reinforcement according to the distance between the given action x and the target action A: the smaller the distance, the greater the reinforcement (i.e., reinforcement is inversely proportional to abs(A - x)). However, the value function is non-zero only for actions close to A (abs(A - x) smaller than some epsilon) and zero elsewhere. So:

**V** proportional to 1 / abs(**A** - **x**) for abs(**A** - **x**) < epsilon, and

**V** = 0 for abs(**A** - **x**) >= epsilon.
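
To make the shape concrete, here is a minimal Python sketch of such a value function; the epsilon value, the scale constant, and the clipping to [0, 1] are illustrative assumptions, not givens of the problem:

```python
def value_function(x, A, epsilon=5.0, scale=1.0):
    """Reinforcement proportional to 1 / |A - x| inside a window of
    half-width epsilon around the target A, and zero outside it.
    Clipped to [0, 1] to match the stated reinforcement range."""
    dist = abs(A - x)
    if dist >= epsilon:
        return 0.0
    return 1.0 if dist == 0 else min(1.0, scale / dist)
```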

Complication 2:

I do not know exactly which action was taken at each step. I only know that it lies within the range x +/- sigma, but I cannot precisely associate any one action with the reinforcement I receive.

The exact problem I would like to solve is as follows: I have a series of noisy estimates of actions and exact reinforcement values (for example, in trial 1 I might have x ~ 15-30 and reward 0; in trial 2, x ~ 25-40 and reward 0; in trial 3, x ~ 80-95 and reward 0.6). I would like to build a model that produces an estimate of the most likely location of the target action A after each step, possibly weighting new information according to some learning-rate parameter (since certainty should increase as more samples come in).
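
To make the input data and the kind of output I am after concrete, here is a rough sketch using a discretized Bayesian update over candidate locations of A as a stand-in for whatever model turns out to be appropriate (the epsilon value, the uniform-noise assumption on x, and the grid resolution are all placeholders):

```python
import numpy as np

# Illustrative sketch only: a discretized Bayesian update over the target A.
# EPSILON and the uniform-noise assumption on x are assumptions, not givens.

CANDIDATES = np.linspace(10, 100, 901)   # candidate values for A
EPSILON = 5.0                            # assumed half-width of the reward window

def likelihood(A, x_low, x_high, reward, n=200):
    """P(reward observation | target == A), marginalizing over the unknown
    true action, taken to be uniform on the reported range [x_low, x_high].
    For simplicity only reward vs. no reward is used; the reward magnitude
    could further constrain |A - x| via the 1/distance relation."""
    xs = np.linspace(x_low, x_high, n)
    inside = np.abs(A - xs) < EPSILON
    if reward == 0:
        return np.mean(~inside) + 1e-9    # zero reward: x was outside the window
    return np.mean(inside) + 1e-9         # positive reward: x was inside the window

def update(posterior, x_low, x_high, reward):
    """One trial's update; the sharpening posterior itself plays the role of
    the learning-rate / certainty parameter as trials accumulate."""
    posterior = posterior * np.array(
        [likelihood(A, x_low, x_high, reward) for A in CANDIDATES])
    return posterior / posterior.sum()

posterior = np.ones_like(CANDIDATES) / CANDIDATES.size   # uniform prior over A
for lo, hi, r in [(15, 30, 0.0), (25, 40, 0.0), (80, 95, 0.6)]:
    posterior = update(posterior, lo, hi, r)
print("most likely A so far:", CANDIDATES[np.argmax(posterior)])
```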

1 answer

This journal article may be relevant: it addresses delayed rewards and robust learning in the presence of noisy and inconsistent rewards.

Rare neural correlations implement robotic conditioning with delayed rewards and disturbances

In particular, they keep track of (remember) which synapses (or actions) fired before a reward event and reinforce them all, with the amount of reinforcement decaying with the time elapsed between the action and the reward.

A single reward event will reinforce any synapses (or actions) that fired before it, including those unrelated to the reward. However, with a suitable learning rate this should stabilize over many iterations, since only the desired action is consistently rewarded and therefore consistently reinforced.
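
A minimal sketch of that credit-assignment idea, assuming a discretized action space and an exponential decay of eligibility; the decay rate, learning rate, and reset-after-reward choice below are illustrative, not taken from the paper:

```python
import numpy as np

# Illustrative eligibility-trace sketch (not the paper's implementation):
# every action taken leaves a trace that decays over time; when a reward
# arrives, all recently taken actions are credited in proportion to their
# remaining trace, so spurious credit washes out over many trials.

N_ACTIONS = 91                 # actions 10..100 discretized in steps of 1
DECAY = 0.8                    # per-step trace decay
LEARNING_RATE = 0.1

trace = np.zeros(N_ACTIONS)    # eligibility of each action
value = np.zeros(N_ACTIONS)    # learned value estimate per action

def step(action_index, reward):
    """Take one action, then apply any reward to all eligible actions."""
    global trace, value
    trace *= DECAY                         # older actions become less eligible
    trace[action_index] += 1.0             # mark the action just taken
    if reward != 0:
        value += LEARNING_RATE * reward * trace
        trace[:] = 0.0                     # optionally reset after a reward

# Usage: feed (action, reward) pairs; the value array peaks near the target A
for a, r in [(20, 0.0), (30, 0.0), (85, 0.6), (87, 0.7)]:
    step(a - 10, r)
print("current best action estimate:", 10 + int(np.argmax(value)))
```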
