(Moore and Atkeson, 1993; Schaul et al., 2016; Singer and Frank, 2009) More precisely, the prioritization mechanism replays memories of individual states (and the actions taken in them) whose replay leads to the largest change in the estimated state-value function. This is not exactly the same as replaying episodes in which a strong reinforcement occurred, as proposed earlier in the text, but it is closely related. Typically, a strong reward or punishment is unexpected, at least at the beginning of learning. When you first encounter a reward, your state-value function is still in a rather arbitrary initial state, and you could not really have predicted that the reward would be obtained; thus any reward is initially surprising. That is why, as a first approximation, prioritized sweeping prioritizes episodes containing reward or punishment.
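The mechanism described above can be sketched in code. The following is a minimal, illustrative implementation of prioritized sweeping on a hypothetical deterministic chain of states, where only the final transition yields a reward; all names and parameters (the chain, `gamma`, `theta`) are assumptions for the sake of the example, not taken from the cited papers. A priority queue orders replay by the magnitude of the predicted change in the state-value estimate, so the surprising rewarding transition is replayed first, and the resulting value change propagates backwards to its predecessors.

```python
import heapq

gamma = 0.9        # discount factor (assumed)
theta = 1e-4       # minimum priority worth replaying (assumed)
n_states = 5
V = [0.0] * n_states   # state-value estimates, initially uninformative

# Deterministic toy model: state s leads to s+1 with reward 0,
# except the last transition, which yields reward 1.
model = {s: (s + 1, 0.0) for s in range(n_states - 1)}
model[n_states - 2] = (n_states - 1, 1.0)

# Map each state to the set of its predecessors (needed to propagate
# value changes backwards through the model).
predecessors = {s_next: {s} for s, (s_next, _) in model.items()}

def bellman_backup(s):
    """One-step lookahead estimate of V(s) under the model."""
    s_next, r = model[s]
    return r + gamma * V[s_next]

# Seed the queue as if the agent just experienced the rewarding
# transition: its value change is large, hence "surprising".
pq = []  # max-heap via negated priorities
s0 = n_states - 2
heapq.heappush(pq, (-abs(bellman_backup(s0) - V[s0]), s0))

while pq:
    _, s = heapq.heappop(pq)
    V[s] = bellman_backup(s)          # replay: update the value estimate
    for p in predecessors.get(s, ()):
        # A change in V(s) may now make p's estimate inaccurate too.
        priority = abs(bellman_backup(p) - V[p])
        if priority > theta:
            heapq.heappush(pq, (-priority, p))
```

After the loop, the reward's value has been propagated back along the chain (`V` decays by a factor of `gamma` per step from the rewarding transition), even though only one rewarding experience was ever observed, which is the point of prioritizing surprising memories for replay.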