I implemented Q-Learning as described in this paper:
http://web.cs.swarthmore.edu/~meeden/cs81/s12/papers/MarkStevePaper.pdf
To approximate Q(S, A) I use a neural network with the structure listed below (a minimal sketch follows the list),
- Sigmoid activation
- Inputs: number of state inputs + 1 for the action neuron (all inputs are scaled to 0-1)
- Outputs: one output, the Q-value
- M hidden layers with N neurons each
- Random exploration method: explore when 0 < rand() < propExplore
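
To make the setup concrete, here is a minimal NumPy sketch of what I mean. The class name QNetwork, the layer sizes, and the weight initialization are my own illustration, not from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class QNetwork:
    """Approximates Q(s, a): state inputs plus one action input -> a single Q output."""
    def __init__(self, n_state_inputs, n_neurons=8, n_hidden_layers=2, seed=0):
        self.rng = np.random.default_rng(seed)
        sizes = [n_state_inputs + 1] + [n_neurons] * n_hidden_layers + [1]
        # One weight matrix per layer, with an extra row acting as the bias.
        self.weights = [self.rng.normal(0.0, 0.1, (a + 1, b))
                        for a, b in zip(sizes, sizes[1:])]

    def forward(self, state, action):
        # All inputs scaled to [0, 1]; the action is appended as one extra input.
        x = np.append(np.asarray(state, dtype=float), action)
        for w in self.weights:
            x = sigmoid(np.append(x, 1.0) @ w)  # 1.0 is the bias input
        return float(x[0])  # single output neuron: the Q-value

    def choose_action(self, state, actions, prop_explore):
        # Explore when rand() < propExplore, otherwise act greedily on Q.
        if self.rng.random() < prop_explore:
            return self.rng.choice(actions)
        return max(actions, key=lambda a: self.forward(state, a))
```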
At each training iteration, using the following formula,

QTarget = reward + gamma * max_a' Q(s', a')

I calculate the Q-Target value, then calculate the error using
error = QTarget - LastQValueReturnedFromNN
and backpropagate the error through the neural network.
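
In code, the target and error computation look roughly like this (gamma = 0.9 and the placeholder q function are my own assumptions; the resulting error is what I feed into standard backpropagation):

```python
def q_target(reward, next_state, actions, q, gamma=0.9, terminal=False):
    """Standard Q-learning target: r + gamma * max over a' of Q(s', a')."""
    if terminal:
        return reward  # no bootstrapping from a terminal state
    return reward + gamma * max(q(next_state, a) for a in actions)

# Placeholder Q function standing in for the network's forward pass:
q = lambda s, a: 0.5  # hypothetical constant prediction
target = q_target(reward=1.0, next_state=[0.2, 0.8], actions=[0.0, 1.0], q=q)
error = target - q([0.1, 0.7], 1.0)  # error = QTarget - LastQValueReturnedFromNN
# 'error' is then backpropagated through the network as the output-layer delta.
```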
Q1. Am I on the right track? I have seen several papers that implement the NN with one output neuron for each action.
Q2. My reward function returns a number between -1 and 1. Is it OK for the target to be in [-1, 1] when the activation function is a sigmoid, whose output lies in (0, 1)?
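
For reference, the rescaling I am asking about would look like this (my own sketch, not from the paper):

```python
def sigmoid_to_signed(y):
    """Map a sigmoid output in (0, 1) onto (-1, 1)."""
    return 2.0 * y - 1.0

def reward_to_unit(r):
    """Or go the other way: squash a reward in [-1, 1] into [0, 1] for the target."""
    return (r + 1.0) / 2.0
```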
Q3. From my understanding of this method, given sufficient training iterations, is it guaranteed to converge to the optimal policy? When learning XOR, sometimes it learns it after 2k iterations, and sometimes it still has not learned it even after 40k-50k iterations.
artificial-intelligence reinforcement-learning machine-learning q-learning neural-network
Hamza yerlikaya