5 A multitude of algorithms for learning the state-values exist; see Sutton and Barto (2018) for a comprehensive treatment. To begin with, it is possible to use (stochastic) gradient descent on the objective function given by the total expected future reward. However, it is more common to use algorithms that proceed by a recursion in which the value of a state is defined based on the values of the states the agent can reach from that state. This recursion is based on the theory of dynamic programming, and in particular on what is called the Bellman equation. The recursion essentially ends at the goal state, in the sense that the value of the goal state is fixed simply by the fact that it is a goal state, and the values of the other states are derived from it by the Bellman equation.

Consider a simple world with three states, A, B, and C, where C is the goal state. You can move from A to B and from B to C. The first part of the recursion says that the value of state A must be the value of state B minus a small quantity. That is because from state A you could go to state B in a single step, and the small quantity expresses the fact that you need one step; it is a consequence of discounting. Likewise, the value of state B must be a bit less than the value of C. Now, the value of C is fixed (to some numerical value which is irrelevant) by the fact that it is the goal, and needs no recursion or computation. So, once the agent encounters state C even once, it knows the value of C. Based on that knowledge and its model of the world, it can start the recursive computation, applying the ideas above (the value of B equals the value of C minus a small constant, and likewise for A) to compute the values of B and A. If we fix the value of the goal to 1, the state-values could be 0.8, 0.9, and 1 for A, B, and C respectively.

Note that in this example, we assumed the agent follows the optimal policy, i.e., it always takes the smartest possible actions; thus, the state-values computed are the state-values of the optimal policy. You could also compute the state-values for a very dumb policy (say, one that always takes random actions), and they would be lower, because by taking less smart actions the agent would get less reward.
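To make the recursion concrete, here is a minimal sketch in Python of how the state-values of this three-state world could be computed under the optimal policy. The discount factor (0.9 here), the transition table NEXT_STATE, and the function name state_value are illustrative assumptions, not part of any particular library; with the value of the goal fixed to 1, the recursion yields roughly 0.8, 0.9, and 1 for A, B, and C, as in the text.

```python
# A minimal sketch of the Bellman recursion on the three-state world
# described above (A -> B -> C, with C the goal). The discount factor
# and the transition table are assumptions chosen to reproduce the
# values quoted in the text.

GAMMA = 0.9                         # discount factor: each extra step shrinks the value a bit
GOAL_VALUE = 1.0                    # the value of the goal state C is fixed; no recursion needed
NEXT_STATE = {"A": "B", "B": "C"}   # deterministic moves under the optimal policy

def state_value(state):
    """Value of a state under the optimal policy, by recursion towards the goal."""
    if state == "C":
        # The recursion ends at the goal: its value is given, not computed.
        return GOAL_VALUE
    # Bellman recursion: the value is the discounted value of the next state.
    return GAMMA * state_value(NEXT_STATE[state])

if __name__ == "__main__":
    for s in ["A", "B", "C"]:
        print(s, round(state_value(s), 2))   # prints roughly 0.81, 0.9, 1.0
```

Fixing the discount factor at 0.9 makes "minus a small quantity" precise: each step away from the goal multiplies the value by 0.9, so being one or two steps away gives values of about 0.9 and 0.8.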