I implemented Q-Learning as described in this paper:
http://web.cs.swarthmore.edu/~meeden/cs81/s12/papers/MarkStevePaper.pdf
To approximate Q(S, A) I use a neural network with the structure listed below (a minimal sketch follows the list),
- Sigmoid activation
- Inputs: number of state inputs + 1 for the action neuron (all inputs are scaled to 0-1)
- Outputs: one output, the Q-value
- M hidden layers with N neurons each
- Random exploration method: explore when 0 < rand() < propExplore
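
To make the setup concrete, here is a minimal NumPy sketch of what I mean. The class name QNetwork, the layer sizes, and the weight initialization are my own illustration, not from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class QNetwork:
    """Approximates Q(s, a): state inputs plus one action input -> a single Q output."""
    def __init__(self, n_state_inputs, n_neurons=8, n_hidden_layers=2, seed=0):
        self.rng = np.random.default_rng(seed)
        sizes = [n_state_inputs + 1] + [n_neurons] * n_hidden_layers + [1]
        # One weight matrix per layer, with an extra row acting as the bias.
        self.weights = [self.rng.normal(0.0, 0.1, (a + 1, b))
                        for a, b in zip(sizes, sizes[1:])]

    def forward(self, state, action):
        # All inputs scaled to [0, 1]; the action is appended as one extra input.
        x = np.append(np.asarray(state, dtype=float), action)
        for w in self.weights:
            x = sigmoid(np.append(x, 1.0) @ w)  # 1.0 is the bias input
        return float(x[0])  # single output neuron: the Q-value

    def choose_action(self, state, actions, prop_explore):
        # Explore when rand() < propExplore, otherwise act greedily on Q.
        if self.rng.random() < prop_explore:
            return self.rng.choice(actions)
        return max(actions, key=lambda a: self.forward(state, a))
```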
At each training iteration, using the following formula,

QTarget = reward + gamma * max_a' Q(s', a')

I calculate the Q-Target value, then calculate the error using
error = QTarget - LastQValueReturnedFromNN
and backpropagate the error through the neural network.
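
In code, the target and error computation look roughly like this (gamma = 0.9 and the placeholder q function are my own assumptions; the resulting error is what I feed into standard backpropagation):

```python
def q_target(reward, next_state, actions, q, gamma=0.9, terminal=False):
    """Standard Q-learning target: r + gamma * max over a' of Q(s', a')."""
    if terminal:
        return reward  # no bootstrapping from a terminal state
    return reward + gamma * max(q(next_state, a) for a in actions)

# Placeholder Q function standing in for the network's forward pass:
q = lambda s, a: 0.5  # hypothetical constant prediction
target = q_target(reward=1.0, next_state=[0.2, 0.8], actions=[0.0, 1.0], q=q)
error = target - q([0.1, 0.7], 1.0)  # error = QTarget - LastQValueReturnedFromNN
# 'error' is then backpropagated through the network as the output-layer delta.
```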
Q1. Am I on the right track? I have seen several papers that implement the NN with one output neuron for each action.
Q2. My reward function returns a number between -1 and 1. Is it OK for the target to be in [-1, 1] when the activation function is a sigmoid, whose output lies in (0, 1)?
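
For reference, the rescaling I am asking about would look like this (my own sketch, not from the paper):

```python
def sigmoid_to_signed(y):
    """Map a sigmoid output in (0, 1) onto (-1, 1)."""
    return 2.0 * y - 1.0

def reward_to_unit(r):
    """Or go the other way: squash a reward in [-1, 1] into [0, 1] for the target."""
    return (r + 1.0) / 2.0
```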
Q3. From my understanding of this method, given sufficient training iterations, is it guaranteed to converge to the optimal policy? When learning XOR, sometimes it learns it after 2k iterations, and sometimes it still has not learned it even after 40k-50k iterations.
artificial-intelligence reinforcement-learning machine-learning q-learning neural-network
Hamza yerlikaya