Problem:
I have trained an agent to perform a simple task in a grid world (reach the top of the grid without running into obstacles), but the following situation keeps arising. The agent spends time in an easy part of the state space (no obstacles nearby) and therefore consistently receives a strong positive reinforcement signal. Then, when it ends up in a difficult part of the state space (wedged next to two obstacles), it simply keeps choosing the same action as before, to no effect (it tries to move up and hits the obstacle). Eventually the Q-value for that action falls to match the negative reward, but by then the other actions have even lower Q-values, because they were useless in the easy part of the state space. So the error signal drops to zero and the wrong action keeps being selected.
How can I prevent this? I have thought of several solutions, but none of them seems viable:
- Use an always-exploring policy. Since it takes ~5 actions to get around an obstacle, an occasional random action doesn't seem effective.
- Shape the reward function so that bad actions become worse when they are repeated (see the sketch after this list). This makes the reward function violate the Markov property. Maybe that isn't a problem, but I'm not sure.
- Only reward the agent for completing the task. Since the task takes more than a thousand actions to complete, the training signal would be far too sparse.
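To make the second idea concrete, here is a minimal sketch of what I mean by a repeat-scaled penalty. The class and parameter names (RepeatPenaltyShaper, growth) are just illustrative, and the shaped reward depends on history, so it is no longer Markov in the observed state:

```python
from collections import defaultdict

class RepeatPenaltyShaper:
    """Illustrative only: scale the bump penalty by how often the same
    blocked action has been repeated from the same cell."""

    def __init__(self, base_penalty=-1.0, growth=0.5):
        self.base_penalty = base_penalty
        self.growth = growth                    # extra penalty per repeat
        self.repeat_counts = defaultdict(int)   # (cell, action) -> repeat count

    def shape(self, cell, action, reward, bumped):
        """Return a shaped reward that grows more negative on repeated bumps."""
        key = (cell, action)
        if bumped:                              # the move hit an obstacle or the edge
            self.repeat_counts[key] += 1
            n = self.repeat_counts[key]
            return self.base_penalty * (1.0 + self.growth * (n - 1))
        self.repeat_counts[key] = 0             # reset after a successful move
        return reward
```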
Some information about the task:
I set up a small testbed for RL algorithms: something like a more complex version of the grid world described in Sutton's book. The world is a large binary grid (300 by 1000) filled with rectangles of 1s of arbitrary size on a background of 0s, and the edges of the world are bordered by a band of 1s.
The agent occupies a single cell in this world and observes only a fixed window around itself (41 by 41, with the agent at the center). The agent's actions are moves of one cell in any of the four cardinal directions. The agent can only move through cells marked 0; cells marked 1 are impassable.
The current task in this environment is to reach the top of the grid world, starting from a random position at the bottom. A reward of +1 is given for each successful move upward, a reward of -1 for any move that would hit an obstacle or the edge of the world, and all other moves receive a reward of 0.
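For reference, here is a minimal sketch of how this environment could be reproduced. The function names, obstacle counts and sizes are illustrative guesses, not my actual code:

```python
import numpy as np

H, W, WIN = 300, 1000, 41              # grid height, width, observation window size
PAD = WIN // 2

def make_grid(n_rects=200, rng=np.random.default_rng(0)):
    """Binary grid: 0 = free, 1 = obstacle, with a border band of 1s."""
    grid = np.zeros((H, W), dtype=np.int8)
    grid[0, :] = grid[-1, :] = 1
    grid[:, 0] = grid[:, -1] = 1
    for _ in range(n_rects):           # rectangles of 1s of arbitrary size
        h, w = rng.integers(2, 20, size=2)
        r = int(rng.integers(1, H - h - 1))
        c = int(rng.integers(1, W - w - 1))
        grid[r:r + h, c:c + w] = 1
    return grid

def observe(grid, pos):
    """The fixed 41x41 window centred on the agent (outside the world reads as 1)."""
    padded = np.pad(grid, PAD, constant_values=1)
    r, c = pos[0] + PAD, pos[1] + PAD
    return padded[r - PAD:r + PAD + 1, c - PAD:c + PAD + 1]

MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}   # up, down, left, right

def step(grid, pos, action):
    """One move: -1 for bumping into an obstacle/border, +1 for a successful move up."""
    dr, dc = MOVES[action]
    r, c = pos[0] + dr, pos[1] + dc
    if grid[r, c] == 1:                # blocked: the agent stays where it is
        return pos, -1.0, False
    reward = 1.0 if dr == -1 else 0.0
    return (r, c), reward, r == 1      # episode ends at the topmost free row
```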
The agent uses the basic SARSA algorithm with a neural-network function approximator (as described in Sutton's book). For action selection, I have tried both ε-greedy and softmax policies.
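A rough sketch of the learning loop I am describing (semi-gradient SARSA with a small MLP Q-network and ε-greedy selection). It assumes the environment helpers sketched above; the network size and hyperparameters are placeholders, not the values I actually use:

```python
import numpy as np
import torch
import torch.nn as nn

def epsilon_greedy(q_values, eps, rng):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if rng.random() < eps:
        return int(rng.integers(4))
    return int(torch.argmax(q_values).item())

qnet = nn.Sequential(                   # Q-network: flattened 41x41 window -> 4 Q-values
    nn.Linear(41 * 41, 128), nn.ReLU(),
    nn.Linear(128, 4),
)
opt = torch.optim.SGD(qnet.parameters(), lr=1e-3)
gamma, eps = 0.99, 0.1
rng = np.random.default_rng(0)
grid = make_grid(rng=rng)

for episode in range(100):
    while True:                          # random free start cell near the bottom
        pos = (H - 2, int(rng.integers(1, W - 1)))
        if grid[pos] == 0:
            break
    s = torch.tensor(observe(grid, pos).ravel(), dtype=torch.float32)
    a = epsilon_greedy(qnet(s), eps, rng)
    for t in range(5000):                # cap the episode length
        pos, r, done = step(grid, pos, a)
        s2 = torch.tensor(observe(grid, pos).ravel(), dtype=torch.float32)
        a2 = epsilon_greedy(qnet(s2), eps, rng)
        q = qnet(s)[a]
        with torch.no_grad():            # bootstrap target (semi-gradient update)
            target = r if done else r + gamma * qnet(s2)[a2]
        loss = (target - q) ** 2
        opt.zero_grad()
        loss.backward()
        opt.step()
        if done:
            break
        s, a = s2, a2
```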