I recently made an attempt to implement the basic Q-Learning algorithm in Go. Note that I am new to reinforcement learning and AI in general, so the error may very well be mine.
Here's how I implemented the solution for an m,n,k-game environment: at each given time t, the agent holds the last state-action pair (s, a) and the acquired reward for it; the agent selects a move a' based on an epsilon-greedy policy and calculates the reward r, then proceeds to update the value of Q(s, a) for time t-1:
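For reference, the textbook one-step Q-learning update this code is meant to implement (with learning rate α and discount factor γ) is:

```latex
Q(s_{t-1}, a_{t-1}) \leftarrow Q(s_{t-1}, a_{t-1})
  + \alpha \Bigl[ r_t + \gamma \max_{a'} Q(s_t, a') - Q(s_{t-1}, a_{t-1}) \Bigr]
```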
func (agent *RLAgent) learn(reward float64) {
    var mState = marshallState(agent.prevState, agent.id)
    var oldVal = agent.values[mState]

    agent.values[mState] = oldVal + (agent.LearningRate *
        (agent.prevScore + (agent.DiscountFactor * reward) - oldVal))
}
Note:

- agent.prevState holds the previous state right after taking the action and before the environment responds (i.e. after the agent makes its move and before the other player moves). I use that in place of a state-action tuple, but I'm not quite sure that's the right approach.
- agent.prevScore holds the reward for the previous state-action.
- The reward argument represents the reward for the current step's state-action (Qmax).
With agent.LearningRate = 0.2 and agent.DiscountFactor = 0.8, the agent fails to reach 100K episodes because of state-action value overflow. I'm using Go's float64 (IEEE 754-1985 standard double-precision floating point), which overflows at around ±1.80×10^308 and yields ±Infinity. That's too big a value, I'd say!
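The overflow itself is unsurprising once the values start growing geometrically. The toy below (my own sketch, not the actual environment; the 16% growth factor is assumed purely for illustration) shows how quickly a value that grows by a constant factor per update blows through float64's range:

```go
package main

import (
	"fmt"
	"math"
)

// overflowAfter returns the number of iterations it takes a value
// starting at `start` and growing by `factor` each update to
// overflow float64 to +Inf.
func overflowAfter(start, factor float64) int {
	v := start
	for i := 1; ; i++ {
		v *= factor
		if math.IsInf(v, 1) {
			return i
		}
	}
}

func main() {
	// Even a modest assumed 16% growth per update exceeds
	// float64's ±1.80e308 range in under 5000 updates.
	fmt.Println(overflowAfter(0.5, 1.16))
}
```

So an agent whose update rule makes values scale with themselves has no trouble hitting ±Infinity well before 100K episodes.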
Here's the state of a model trained with agent.LearningRate = 0.02 and agent.DiscountFactor = 0.08, which made it through 2M episodes (1M games against itself):
Reinforcement learning model report
Iterations: 2000000
Learned states: 4973
Maximum value: 88781786878142287058992045692178302709335321375413536179603017129368394119653322992958428880260210391115335655910912645569618040471973513955473468092393367618971462560382976.000000
Minimum value: 0.000000
The reward function only ever returns -1, 0, 0.5 or 1. Yet as you can see, the minimum learned value is zero while the maximum value is absurdly large.
It may be worth mentioning that a simpler value-update method I found in a python script works perfectly fine and actually feels more intelligent! When I play against it, most games end in a draw (it even wins if I play carelessly), whereas with the standard Q-Learning method above I can't even let it win! Its update rule is:
agent.values[mState] = oldVal + (agent.LearningRate * (reward - agent.prevScore))
Any ideas on how to fix this? Is that kind of state-action value normal in Q-Learning?!
Update:
I realized the problem was that prevScore was holding the Q-value of the previous step (equal to oldVal) instead of the reward for the previous step (in this example, -1, 0, 0.5 or 1).
After that change, the agent behaves normally, and after 2M episodes the state of the model is as follows:
Reinforcement learning model report
Iterations: 2000000
Learned states: 5477
Maximum value: 1.090465
Minimum value: -0.554718
Out of 5 games against the agent, I won 2 (it did attack me when it noticed I had left two stones in a row) and the other 3 were draws.