Alpha and Gamma Parameters in Q-Learning

What difference does a large or small gamma value make to the algorithm? In my view, as long as it isn't 0 or 1, it should work exactly the same. On the other hand, no matter what gamma I choose, the Q-values seem to approach zero very quickly (in a quick test I'm getting values on the order of 10^-300). How do people usually plot Q-values (I'm planning to plot (x, y, best Q-value for that state)) given this problem? I've tried to get around it using logarithms, but even then it looks awkward.

Also, I don't really understand why the alpha parameter is in the Q-learning update function at all. It basically sets the size of the update we apply to the Q-value function. I gather that it is usually decreased over time. Why would you want it to decrease over time? Should an update at the beginning matter more than one 1000 episodes later?
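For reference, here is a minimal sketch of the tabular update I am using, just so it is clear where alpha and gamma enter (variable names are my own, not from any particular library):

```python
from collections import defaultdict

Q = defaultdict(float)  # Q[(state, action)], defaults to 0.0

def q_update(state, action, reward, next_state, actions, alpha, gamma):
    """One Q-learning step: alpha scales the update, gamma discounts the future."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```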

In addition, I thought it might be a good idea to have the agent, whenever it is not taking the greedy action, explore any state that still has a zero Q-value (which means, at least most of the time, a state it has never visited before), but I don't see this mentioned anywhere in the literature. Are there any drawbacks to it? I realize it can't be used with (at least some) generalization functions.
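Concretely, the non-greedy branch I have in mind would look roughly like this (a sketch against the tabular Q above; the epsilon-greedy framing is my own assumption):

```python
import random

def choose_action(state, actions, epsilon=0.1):
    """Act greedily most of the time; otherwise prefer actions whose Q-value is still zero."""
    if random.random() > epsilon:
        return max(actions, key=lambda a: Q[(state, a)])
    untried = [a for a in actions if Q[(state, a)] == 0.0]
    return random.choice(untried) if untried else random.choice(actions)
```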

Another idea would be to keep a table of visited states/actions and try to take the actions that have been tried least often in that state. Of course, this is only feasible in relatively small state spaces (in my case it is definitely possible).
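A rough sketch of that table (again tabular, small spaces only; the tie-breaking rule is my own choice):

```python
from collections import defaultdict
import random

visits = defaultdict(int)  # visits[(state, action)] -> how many times it has been tried

def choose_least_tried(state, actions):
    """Pick uniformly among the actions tried fewest times in this state."""
    fewest = min(visits[(state, a)] for a in actions)
    action = random.choice([a for a in actions if visits[(state, a)] == fewest])
    visits[(state, action)] += 1
    return action
```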

A third and final idea would be, when searching for the best Q-values, to look not only at the selected action but at all possible actions in that state, and then at all actions from the resulting states, and so on.
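A sketch of that lookahead (note it needs a one-step model or simulator, which plain Q-learning does not assume; `model` below is hypothetical):

```python
def lookahead_value(state, actions, model, depth, gamma):
    """Evaluate every action from `state`, then every action from the resulting
    states, and so on, down to `depth` steps; falls back to Q at the leaves."""
    if depth == 0:
        return max(Q[(state, a)] for a in actions)
    values = []
    for a in actions:
        next_state, reward = model(state, a)  # hypothetical deterministic simulator
        values.append(reward + gamma * lookahead_value(next_state, actions, model, depth - 1, gamma))
    return max(values)
```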

I know these questions are not really related to each other, but I would like to hear from people who have worked with this before and have (possibly) struggled with some of them.

+6
language-agnostic artificial-intelligence reinforcement-learning
3 answers

From a doctoral candidate working in this area:

Alpha is the learning rate. If the reward or transition function is stochastic (random), then alpha should change over time, approaching zero at infinity. This has to do with approximating the expected outcome of the inner product (T(transition) * R(reward)) when one of the two, or both, behave randomly.

This is important to note.
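One common schedule (my own illustration; the point above is only that alpha should approach zero) is to decay it per state-action visit:

```python
def alpha_for(visit_count, alpha0=1.0):
    """Decaying learning rate: 1.0, 0.5, 0.33, 0.25, ... -> 0 as visits grow."""
    return alpha0 / (1.0 + visit_count)
```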

Gamma is the value of future reward. It can affect learning quite a bit, and can be a dynamic or static value. If it is equal to one, the agent values future reward JUST AS MUCH as current reward. This means that if, ten actions from now, the agent does something good, it is JUST AS VALUABLE as doing that action directly. So learning doesn't work well at high gamma values like that.

Conversely, a gamma of zero will cause the agent to value only immediate rewards, which only works with very detailed reward functions.

Also, as for exploration behaviour... there is actually a TON of literature on this. All of your ideas have, 100%, been tried before. I would recommend a more detailed search, and even starting to look up decision theory and "policy improvement".

Just to add a note on alpha: imagine you have a reward function that spits out 1 or 0 for a certain state-action combination SA. Now every time you execute SA, you will get 1 or 0. If you keep alpha at 1, you will get Q-values of 1 or 0. If it is 0.5, you get values of +0.5 or 0, and the function will oscillate between the two values forever. However, if you decrease alpha by 50 percent each time, you get values like this (assuming the rewards arrive 1, 0, 1, 0, ...): your Q-values will end up 1, 0.5, 0.75, 0.9, 0.8, ... and will eventually converge near 0.5. In the limit it will be 0.5, which is the expected reward in the probabilistic sense.
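A quick numerical check of that note (single state-action pair, no future term; I use a 1/t schedule rather than halving, since that one converges exactly to the mean):

```python
def run(alphas, rewards):
    """Apply Q <- Q + alpha * (reward - Q) for each (alpha, reward) pair in turn."""
    q = 0.0
    for alpha, r in zip(alphas, rewards):
        q += alpha * (r - q)
    return q

rewards = [1, 0] * 50                                     # rewards alternate 1, 0, 1, 0, ...
print(run([1.0] * 100, rewards))                          # fixed alpha = 1: Q jumps between 1 and 0
print(run([0.5] * 100, rewards))                          # fixed alpha = 0.5: oscillates near 2/3 and 1/3 forever
print(run([1.0 / (t + 1) for t in range(100)], rewards))  # decaying alpha: Q -> 0.5, the expected reward
```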

+10

What difference does a large or small gamma value make to the algorithm?

Gamma should be matched to the size of the observation space: you should use larger gammas (i.e. closer to 1) for large state spaces, and smaller gammas for smaller spaces.

One way to think of gamma is that it measures how much the reward from the final, successful state is damped as it propagates back to earlier states.
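To make that concrete (my own illustration): with a single reward of 1 at the goal, the value that propagates back to a state n steps away is on the order of gamma**n, so gamma determines how far the goal's reward "reaches" through a large state space:

```python
for gamma in (0.5, 0.9, 0.99):
    print(gamma, [round(gamma ** n, 3) for n in (1, 5, 10, 50)])
# 0.5  [0.5, 0.031, 0.001, 0.0]     reward is nearly invisible 10 steps from the goal
# 0.9  [0.9, 0.59, 0.349, 0.005]
# 0.99 [0.99, 0.951, 0.904, 0.605]  still informative even 50 steps away
```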

0

I haven’t worked with systems like this before, so I don’t know how useful I can be, but ...

Gamma is a measure of the agent's tendency to look ahead to future rewards. The smaller it is, the more the agent will tend to take the action with the highest immediate reward, regardless of the resulting state. Agents with larger gammas will learn long paths to big rewards. As for all the Q-values approaching zero, have you tried a very simple state map (say, one state and two actions) with gamma = 0? That should quickly approach Q = reward.
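A throwaway version of that test (the state, action names, and reward values are made up; with gamma = 0 the future term vanishes, so each Q should head straight for its reward):

```python
import random
from collections import defaultdict

Q = defaultdict(float)
rewards = {"left": 1.0, "right": 0.0}   # one state, two actions, fixed rewards
alpha, gamma = 0.5, 0.0

for _ in range(100):
    a = random.choice(list(rewards))
    best_next = max(Q[("s0", x)] for x in rewards)
    Q[("s0", a)] += alpha * (rewards[a] + gamma * best_next - Q[("s0", a)])

print(dict(Q))   # expect approximately {('s0', 'left'): 1.0, ('s0', 'right'): 0.0}
```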

The point of reducing alpha is to damp out oscillations in the Q-values, so that the agent can settle into a stable pattern after a wild youth.

Exploring the state space? Why not just iterate over it and put the agent through every state? There is no reason the agent actually has to follow a course of actions during training, unless that is the point of your simulation. If the idea is to find the best behaviour pattern, update all the Qs, not just the highest ones along a path.
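A sketch of what that sweep might look like (only possible if you have, or can query, a transition/reward model; `model` below is hypothetical and deterministic for brevity):

```python
def sweep_all(states, actions, model, Q, alpha, gamma):
    """One pass over every state-action pair instead of following trajectories."""
    for s in states:
        for a in actions:
            next_s, r = model(s, a)                    # hypothetical one-step model
            best_next = max(Q[(next_s, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```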

-2
