8 This can be seen as an example of using something like action-values without a sophisticated world model. On the level of neurobiology, such reactive behaviour can also be explained by a special form of Hebbian learning, which implements something similar to the abstract theory of reinforcement learning we have just seen. In such learning, the association weight between the state (being in the box) and the action (pressing the lever) increases every time both the state and the action are active and a reward (fish) is delivered. Ordinary Hebbian learning would only be able to learn, in an unsupervised manner, the connection between the state and the action if the same action were frequently taken in that state; it would be useless in itself for selecting the best action, since it does not take the reward gained by the actions into account. So, an extension of Hebbian learning to such “three-factor” learning, modulated by reward, is necessary. This may not be exactly what happens in the brain, but it is probably a useful approximation nevertheless (Nevin, 1999). Such three-factor (or modulated) Hebbian learning rules have a long history; see, e.g., the discussions by Legenstein et al. (2010) and Gerstner et al. (2018). These learning rules can also be extended to choosing action sequences in a dynamic environment: basically, instead of the reward itself, the Hebbian rule might be modulated by the reward prediction error considered next in the main text.
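As a rough illustration of the contrast drawn above, the following minimal Python sketch compares an ordinary (two-factor) Hebbian update with a reward-modulated (three-factor) one. The one-hot state and action encodings, the learning rate, and all function names are illustrative assumptions, not taken from the text; the extension mentioned at the end of the footnote would replace the scalar reward with a reward prediction error.

```python
import numpy as np

# Illustrative sketch (not from the text): compare ordinary Hebbian
# learning with reward-modulated "three-factor" Hebbian learning.

def hebbian_update(w, state, action, lr=0.1):
    """Two-factor rule: strengthens whatever state-action pairing
    occurs, regardless of reward -- hence useless for picking the
    best action on its own."""
    return w + lr * np.outer(action, state)

def three_factor_update(w, state, action, reward, lr=0.1):
    """Three-factor rule: the Hebbian term is gated by a third
    factor, the reward. The weight grows only when state and action
    are co-active AND a reward is delivered. (In the dynamic-
    environment extension, `reward` would be a prediction error.)"""
    return w + lr * reward * np.outer(action, state)

# Toy episode: one state ("in the box"), two candidate actions.
state = np.array([1.0])            # the animal is in the box
press = np.array([1.0, 0.0])       # pressing the lever yields fish
groom = np.array([0.0, 1.0])       # grooming yields nothing

w = np.zeros((2, 1))               # action-by-state weight matrix
for _ in range(10):
    w = three_factor_update(w, state, press, reward=1.0)
    w = three_factor_update(w, state, groom, reward=0.0)

print(w)  # only the rewarded action's association weight has grown
```

Running the loop with `hebbian_update` instead would strengthen both associations equally, which is exactly why the reward factor is needed for action selection.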