From a PhD candidate in the field:
Alpha is the learning rate. If the reward or transition function is stochastic (random), then alpha should change over time, approaching zero at infinity. This has to do with approximating the expected outcome of the inner product T (transition) × R (reward) when one of the two, or both, behave randomly.
This is important to note.
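To make that concrete, here is a minimal sketch (mine, not from the original answer) of a tabular Q-learning update in which alpha shrinks toward zero as a state-action pair is updated more often. The 1/n-visits schedule, the variable names, and the toy usage at the end are all illustrative assumptions:

```
from collections import defaultdict

Q = defaultdict(float)      # Q[(state, action)] -> current value estimate
visits = defaultdict(int)   # update count per (state, action) pair
gamma = 0.9                 # discount factor (see the gamma discussion below)

def update(state, action, reward, next_state, actions):
    """One tabular Q-learning step with a per-pair decaying learning rate."""
    visits[(state, action)] += 1
    alpha = 1.0 / visits[(state, action)]   # shrinks toward zero over time
    best_next = max(Q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    Q[(state, action)] += alpha * (target - Q[(state, action)])

# Hypothetical usage with made-up states and actions:
update(state="s0", action="a1", reward=1.0, next_state="s1", actions=["a0", "a1"])
```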
Gamma is the value of a future reward. It can affect learning quite a bit, and it can be a dynamic or a static value. If it is equal to one, the agent values a future reward JUST AS MUCH as a current reward. This means that if the agent does something good ten actions from now, that reward is JUST AS VALUABLE as if the action had earned it directly. So learning doesn't work that well at high gamma values.
Conversely, a gamma of zero will cause the agent to value only immediate rewards, which only works with very detailed reward functions.
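A small illustration of those two extremes (my own, not from the original answer): the present value of a reward that arrives ten actions in the future, under different gammas. The function name and numbers are just for illustration:

```
def discounted_value(reward, steps, gamma):
    """Value today of a reward received `steps` actions in the future."""
    return (gamma ** steps) * reward

for g in (1.0, 0.9, 0.5, 0.0):
    print(g, discounted_value(1.0, steps=10, gamma=g))

# gamma = 1.0 -> 1.0      a reward 10 steps away counts as much as an immediate one
# gamma = 0.9 -> ~0.349   future reward is noticeably discounted
# gamma = 0.5 -> ~0.001   almost only near-term reward matters
# gamma = 0.0 -> 0.0      only the immediate reward counts
```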
Also, as for exploration behavior... there is actually a TON of literature on this. All of your ideas have, 100%, been tried. I would recommend a more detailed search, and even starting from decision theory and "policy improvement."
Just adding a note on alpha: imagine you have a reward function that spits out 1 or 0 for a certain state-action combination SA. Now every time you execute SA, you will get a 1 or a 0. If you keep alpha at 1, you will get Q-values of 1 or 0. If it is 0.5, you get values of +0.5 or 0, and the function will oscillate between the two values forever. However, if you decrease alpha by 50 percent every time, you get values like this (assuming the reward is received as 1, 0, 1, 0, ...): your Q-values will end up being 1, 0.5, 0.75, 0.9, 0.8, ..., and they will eventually converge to roughly 0.5. At infinity it will be 0.5, which is the expected reward in a probabilistic sense.
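A quick numeric sketch of that note (my own; I use a 1/n decay schedule as one standard choice, since the note only requires alpha to shrink toward zero): the fixed-alpha run keeps oscillating, while the decaying-alpha run settles toward the expected reward of 0.5.

```
def run(alphas, rewards):
    """Repeated single-state updates Q <- Q + alpha * (r - Q)."""
    q, trace = 0.0, []
    for alpha, r in zip(alphas, rewards):
        q += alpha * (r - q)
        trace.append(round(q, 3))
    return trace

rewards = [1, 0] * 10    # reward arrives as 1, 0, 1, 0, ...

fixed = run([0.5] * 20, rewards)                        # constant alpha = 0.5
decaying = run([1 / n for n in range(1, 21)], rewards)  # alpha_n = 1/n -> 0

print("fixed alpha:   ", fixed)     # keeps oscillating between two values
print("decaying alpha:", decaying)  # settles closer and closer to 0.5
```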