Updates in the Temporal Difference section

I read about Tesauro's TD-Gammon program and would like to implement it for tic-tac-toe, but almost all of the information is inaccessible to me as a high school student, because I don't know the terminology.

The first equation here:

http://www.stanford.edu/group/pdplab/pdphandbook/handbookch10.html#x26-1310009.2

gives a "general supervised learning paradigm." It says that w_t on the left-hand side of the equation is the parameter vector at time step t. What does "time step" mean? For a neural network for tic-tac-toe, designed to output the value of a board state, would the time step refer to the number of moves played so far in the game? For example, would the board represented by the string "xoxoxoxox" be at time step 9, and the board "xoxoxoxo-" at time step 8? Or does the time step refer to the amount of time that has passed since training began?

Since w_t is the weight vector at a given time step, does that mean each time step has its own evaluation function (neural network)? So to evaluate a board state with only one move played you would use one NN, and to evaluate a board state with two moves played you would feed it to a different NN? I think I'm misinterpreting something here, because as far as I know Tesauro used a single NN to evaluate all board states (although it is hard to find reliable information about TD-Gammon).

Also, why is the gradient of the output taken with respect to w rather than w_t?

Thanks in advance for clarifying these ideas. I would appreciate any advice regarding my project or suggestions on available reading materials.

1 answer

TD learning takes place in the framework of a Markov Decision Process (MDP). That is, you start in some state s_t, perform an action a_t, receive a reward r_t, and end up in another state, s_{t+1}. The initial state is called s_0. The subscript t is the time step.
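
Here is a minimal sketch of that loop in code (Python). The `env` and `choose_action` objects are hypothetical stand-ins, not any particular library -- any game with reset()/step() methods would do:

    # One episode of the s_t, a_t, r_t, s_{t+1} loop described above.
    def run_episode(env, choose_action):
        s = env.reset()                      # s_0
        t, done, history = 0, False, []
        while not done:
            a = choose_action(s)             # a_t
            s_next, r, done = env.step(a)    # r_t and s_{t+1}
            history.append((s, a, r, s_next))
            s, t = s_next, t + 1             # the subscript t is just this move counter
        return history

So for tic-tac-toe, the time step is simply how many moves have been made so far in the current game, not wall-clock training time.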

A TD agent starts out not knowing what rewards it will receive for which actions, or which states those actions lead to. Its goal is to learn which actions maximize the long-term overall reward.
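
In symbols, the "long-term overall reward" starting from time t is usually written as a sum of future rewards. The discount factor \gamma is an extra detail I'm adding; for an episodic game like tic-tac-toe you can simply take \gamma = 1:

    G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}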

A state can represent the state of the tic-tac-toe board. So s_0 in your case would be the empty board: "---------", s_1 might be "-o-------", s_2 might be "-ox------", and so on. An action can be the index of the cell to mark. With this representation you have 3^9 = 19,683 possible states and 9 possible actions. After learning the game, you would have a table that tells you which cell to mark for each possible board.
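
As a concrete sketch of that table in Python (the values of alpha and gamma below are my own arbitrary choices, not something from TD-Gammon), keyed by the same board strings, with a TD(0)-style value update; the "which cell to mark" table then falls out by picking the move whose resulting board has the highest value:

    alpha, gamma = 0.1, 1.0   # learning rate and discount; arbitrary choices here
    V = {}                    # board string -> estimated value (missing means 0.0)

    def td0_update(s, r, s_next):
        """V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))"""
        v, v_next = V.get(s, 0.0), V.get(s_next, 0.0)
        V[s] = v + alpha * (r + gamma * v_next - v)

    # The example states from above; the reward stays 0 until a game actually ends.
    td0_update('---------', 0.0, '-o-------')
    td0_update('-o-------', 0.0, '-ox------')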

The same direct representation will not work well for backgammon, because there are far too many possible states. What TD-Gammon does instead is approximate the state using a neural network. The network weights are treated as the state vector, and the reward is always 0, except for a win.
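
Concretely, the per-move weight update then has roughly this form (this is the standard TD(0) update with function approximation, written from memory rather than copied from the handbook, so the notation there may differ slightly):

    w_{t+1} = w_t + \alpha \, \bigl( r_t + \gamma \, V(s_{t+1}; w_t) - V(s_t; w_t) \bigr) \, \nabla_{w} V(s_t; w_t)

This is also the answer to the gradient question: the gradient is taken with respect to the weight variables w as such, and is then evaluated at their current values w_t. There is only one set of weights at any moment, so writing \nabla_w rather than \nabla_{w_t} is just notation, not a different quantity.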

The tricky part here is that the state the TD-Gammon algorithm learns is the state of the neural network used to evaluate board positions, not the state of the board. So at t = 0 you have not played a single game yet, and the network is in its initial state. Each time you need to make a move, you use the network to pick the best one available and then update the network weights based on whether that move eventually led to a win or not. Before the move the network has weights w_t; after it, it has weights w_{t+1}. After playing hundreds of thousands of games, you will have learned weights that let the neural network evaluate board positions accurately.
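
To make the "one network whose weights go from w_t to w_{t+1}" point concrete, here is a tiny self-contained sketch with numpy. The board encoding, network size, and hyperparameters are my own guesses, not TD-Gammon's, and TD-Gammon actually uses TD(lambda) with eligibility traces, which I've simplified down to plain TD(0):

    import numpy as np

    rng = np.random.default_rng(0)

    # A single tiny value network used at *every* time step; only its weights change.
    W1, b1 = rng.normal(scale=0.1, size=(9, 18)), np.zeros(18)
    W2, b2 = rng.normal(scale=0.1, size=(18, 1)), np.zeros(1)

    def encode(board):
        """Board string -> vector: 'x' -> +1, 'o' -> -1, '-' -> 0."""
        return np.array([{'x': 1.0, 'o': -1.0, '-': 0.0}[c] for c in board])

    def value_and_grads(x):
        """Forward pass V(s; w) plus the gradient of V with respect to each weight."""
        h = np.tanh(x @ W1 + b1)
        v = (h @ W2 + b2).item()
        dv_dW2, dv_db2 = h[:, None], np.ones(1)
        dh = (1.0 - h ** 2) * W2[:, 0]          # backprop through the tanh layer
        dv_dW1, dv_db1 = np.outer(x, dh), dh
        return v, (dv_dW1, dv_db1, dv_dW2, dv_db2)

    def td_step(board, next_board, reward, alpha=0.01, gamma=1.0):
        """One update: w_{t+1} = w_t + alpha * delta * grad_w V(s_t; w_t)."""
        global W1, b1, W2, b2
        v, grads = value_and_grads(encode(board))
        v_next = 0.0 if next_board is None else value_and_grads(encode(next_board))[0]
        delta = reward + gamma * v_next - v     # the TD error
        W1, b1, W2, b2 = (p + alpha * delta * g
                          for p, g in zip((W1, b1, W2, b2), grads))

    # After this call the very same network simply has new weights (w_{t+1}).
    td_step('---------', '-o-------', 0.0)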

Hope this clarifies the situation.


Source: https://habr.com/ru/post/1413662/

