TD learning takes place in a Markov Decision Process. That is, you start in some state s_t, perform an action a_t, receive a reward r_t, and end up in another state s_{t+1}. The initial state is called s_0. The subscript t is called time.
The TD agent starts out knowing neither what rewards it will receive for which actions, nor which states those actions lead to. Its goal is to figure out which actions maximize the long-term total reward.
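For concreteness, here is a minimal sketch of the tabular TD(0) value update that drives this kind of learning; the value table V and the constants alpha (step size) and gamma (discount) are illustrative names I chose, not something specified above:

```python
# A minimal sketch of the tabular TD(0) update: nudge the value of the state
# you just left toward the reward received plus the value of the state you
# landed in. V, alpha, and gamma are illustrative choices for this sketch.
def td0_update(V, s_t, r_t, s_next, alpha=0.1, gamma=1.0):
    """Move V[s_t] a little toward the TD target r_t + gamma * V[s_next]."""
    td_target = r_t + gamma * V.get(s_next, 0.0)
    V[s_t] = V.get(s_t, 0.0) + alpha * (td_target - V.get(s_t, 0.0))
    return V
```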
The state could be the state of a tic-tac-toe board. So s_0 in your case would be the empty board: "---------", s_1 might be "-o-------", s_2 might be "-ox------", and so on. An action could be the index of the cell to mark. With this representation you would have 3^9 = 19,683 possible states and 9 possible actions. After learning the game, you would have a table that tells you which cell to mark on each possible board.
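If it helps, here is a rough sketch of what such a lookup table could look like in code; the table Q and the function greedy_action are hypothetical names for illustration only:

```python
# Hypothetical sketch of the tic-tac-toe lookup table: one learned value per
# (board string, cell index) pair, i.e. up to 3**9 states x 9 actions.
from collections import defaultdict

Q = defaultdict(float)  # Q[("---------", 4)] -> estimated long-term reward

def greedy_action(board):
    """Pick the empty cell with the highest learned value for this board."""
    empty_cells = [i for i, c in enumerate(board) if c == "-"]
    return max(empty_cells, key=lambda a: Q[(board, a)])

print(greedy_action("---------"))  # on the empty board, all cells tie at 0.0
```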
The same tabular representation will not work well for backgammon, because there are far too many possible states. What TD-Gammon does instead is approximate the state using a neural network. The weights of the network are treated as the state vector, and the reward is always 0, except on a win.
The tricky part here is that the state the TD-Gammon algorithm examines is the state of the neural network it uses to evaluate board positions, not the state of the board itself. So at t = 0 you have not played a single game, and the network is in its initial state. Every time you need to make a move, you use the network to choose the best possible one, and then update the network's weights based on whether the move led to a win or not. Before the move the network has weights w_t, and afterward it has weights w_{t+1}. After playing hundreds of thousands of games, you learn weights that let the neural network evaluate board positions accurately.
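Here is a simplified sketch of that weight update, using a single linear layer in place of TD-Gammon's actual multi-layer network; the feature encoding, the random toy inputs, and the constants alpha and gamma are assumptions for illustration, not the real TD-Gammon implementation:

```python
import numpy as np

def td_weight_update(w, x_t, x_next, reward, alpha=0.01, gamma=1.0):
    """Nudge the weights w_t toward w_{t+1} by following the TD error."""
    v_t, v_next = w @ x_t, w @ x_next          # value before and after the move
    td_error = reward + gamma * v_next - v_t   # reward stays 0 until a win
    return w + alpha * td_error * x_t          # gradient of v_t w.r.t. w is x_t

# Toy usage: random features stand in for an encoded backgammon position
# (TD-Gammon's raw board encoding used 198 input units).
w = np.zeros(198)
w = td_weight_update(w, np.random.rand(198), np.random.rand(198), reward=0.0)
```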
Hope this clarifies the situation.