Chapter 5
Frustration due to reward prediction error

Now, armed with modern machine learning theory, we revisit the problem of action selection and the concept of frustration. In planning, as treated in Chapter 3, the main computational problem is looking several steps ahead, which can place quite impossible demands on computational capacity. Another constraint is that it requires a model of how your actions affect the world, i.e. where you go in the search tree when you perform a given action in a given state. As such, planning is not really a good method for action selection if computational resources are very limited, as in a simple computer or a very simple animal such as an insect.

In this chapter, we consider an alternative approach to action selection, based on learning. A paradigm called reinforcement learning enables learning intelligent actions without any explicit planning, thus avoiding many of its problems. It also generalizes the framework of a single goal to the maximization of rewards obtained at different states. While it can be performed even by very simple animals and computers, it is also used by humans; it is similar to how habits work.

We then consider how frustration can be defined in such a case; it can no longer be simply defined as not reaching the goal—since there is no explicit goal. We define more general error signals called reward loss and reward prediction error, which have been linked to signals of certain neurons in the mammalian brain. Thus, we expand the view where frustration is related to error signalling by linking it to errors in prediction.

Repeated frustration is thus something necessary for learning algorithms to work, and intelligence may not be possible without some frustration. We further see how the very construction of an agent based on reward maximization means that it is insatiable, never satisfied with the amount of reward obtained. Moreover, it can be directed towards intermediate goals which are not valuable in themselves, but simply predictive of future reward. Evolutionary rewards, in particular, can lead to behaviour which resembles obsessions.

Maximizing rewards instead of reaching goals

In modern AI, action selection is most often not based on planning, but on a framework where the obtained rewards, or reinforcement, are maximized. This is useful because often an AI does not have just a single goal to accomplish, but many things it should take care of. Defining behaviour as maximization of rewards as opposed to reaching goals is also often thought to be more appropriate for modelling behaviour in simple animals, which are thought to be incapable of the sophisticated computations needed in planning (more on this in Chapter 9).

For example, if a cleaning robot disposes of some dust in the dustbin, it could be given a reward signal. Since there are many rooms and many dustbins in the building, it makes sense to give a reward whenever the robot disposes of some of the dust. In principle, we could decide to give it a single reward when all the rooms are completely clean; however, it is common sense to rather give it a reward every time it removes some dirt or dust from any of the rooms. After all, the robot has done something useful every time it reduces the amount of dust in a room; telling the robot so is highly useful information, and it would simply complicate the learning if the reward were postponed until the robot had completed some larger part of the task.

In fact, giving a single reward at the end would mean the robot has to engage in long-term planning, which is difficult. A “piece-wise” training by giving rewards for small accomplishments is not very different from how you would teach a child to perform a rather demanding, long task, say tying shoelaces: Divide the task into successive parts and give the child a small encouragement when it completes each small part. This is computationally advantageous since it eliminates the need for long-term planning, a bit like the heuristics we saw above.1

Reinforcement or reward can also be negative; if the robot tries to put household items in the dustbin, it can be given some negative reward. A negative reward is really what we usually call a punishment—but the word is interpreted without any moral connotations here.

Thus, we actually ground action selection in the optimization of an objective function, i.e. a quantity to be optimized. Earlier, we saw that minimization of an error function, such as the number of images incorrectly classified, is the way an AI can learn to recognize objects in images. Here we define a different kind of objective function which is the basis of action selection: It is equal to the sum of all future rewards. It is a function of the action selection parameters of the agent, and more precisely, it expresses how much reward the agent can obtain by behaving according to its current action selection system.

Such a learning process based on maximization of future rewards by learning a value function is called reinforcement learning. Reinforcement learning can be seen as a third major type of learning in AI, in addition to supervised and unsupervised learning.

In a sense, this future reward is the ultimate objective function of an agent. Its maximization, by tuning the action-selection system, is the very meaning of life of the agent. The objective functions we saw earlier, used to learn things like pattern recognition by minimization of errors, are there merely to help in maximizing this reward-based objective function.

In such a reward-based objective function, more weight is often put on the rewards in the near future as opposed to rewards in the far-away future, which is called discounting. The justification for this is complicated, but suffice it to say that such discounting is often evident in human behaviour: Humans prefer to have their reward right now, and value it less if they have to wait. To keep the discussion simple, I sometimes ignore discounting in what follows, but it could be used in almost every case considered in this chapter.2
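To make the idea concrete, here is a minimal sketch in Python of how such a discounted sum of future rewards could be computed; the discount factor gamma and the particular numbers are illustrative assumptions, not anything prescribed by the theory.

    # Discounted sum of future rewards: rewards arriving later are
    # weighted down by successive powers of gamma (0 < gamma < 1).
    def discounted_return(rewards, gamma=0.9):
        return sum(gamma**t * r for t, r in enumerate(rewards))

    # A reward of 1 obtained now counts fully; the same reward obtained
    # three steps later counts only gamma**3 = 0.729 as much.
    discounted_return([1, 0, 0, 1])  # 1 + 0.9**3 = 1.729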

Learning to plan using state-values and action-values

As such, the sum of future rewards gives a more general framework than having a single goal as in Chapter 3, since trying to reach a single goal can be accommodated in the reward framework by simply giving a reward when the agent reaches the goal, and no reward otherwise. In such a case, discounting further means the agent receives more reward if it reaches the goal more quickly, which is intuitively reasonable.

It turns out that we can use this reformulation of planning as reward maximization to our advantage, since the algorithms developed for maximizing future rewards give a particularly attractive way of solving the problem of planning. In Chapter 3, we saw how difficult planning is due to the exponential explosion in the number of possible plans to choose from. While heuristics were proposed as a practical trick to make the computations more manageable, there is no universal way of designing good heuristics.

Like in other branches of AI, it has been found that learning solves many of these problems. Intuitively, if the agent encounters the same planning problem again and again, it can store information about the previous solutions (or attempts) in memory. For example, a cleaning robot will probably clean the same building many, many times, and a delivery robot will deliver the parcels to the same addresses quite a few times. So, such agents should be able to learn something about planning in their respective worlds. This would be a clear improvement over heuristics, which need to be explicitly programmed into the system by programmers, as in our examples above, and it is often not at all clear how to do that.3

In reinforcement learning, there is a sophisticated mathematical theory that tells how to learn a particularly good substitute of a heuristic, called the state-value function. It is a clever way of learning to deal with the complexity of the search in a planning tree. The basic principle is simple: Using the previous planning results in its memory, the agent can compute something like a heuristic based on how well it performed starting from each possible state. If it found the goal quickly starting from a certain state, that state gets a large state-value.

In the case where we have a single goal, the state-value function basically tells you how far from the goal you are, thanks to discounting, which takes account of the time needed to reach the goal. A delivery robot that frequently delivers parcels to the same building (say, the town hall) would easily learn the distance from any other building to the town hall. In the beginning, when it had a delivery to the town hall, it had to spend a lot of time and effort planning the path there. But little by little, by storing the results of executed plans in its memory, it learned the distance from any other building to the town hall. Such distances now give the state-value function for that goal (the state-value is actually a decreasing function of the distance). When the robot next needs to go to the town hall, it recalls the distances to the town hall from those buildings that are close to its current location, and simply decides to move in the direction of the nearby building which has the smallest distance to the town hall. Thus, it has learned a kind of heuristic that avoids planning action sequences altogether.

This works even in a very general setting where there is no particular goal. In general, the value of a state is defined as the sum of all future rewards the agent can obtain starting from that state.4 After successful learning of the state-values, the solution to the problem of action selection is that at each step, whatever state the agent may be in, it just selects the single action which leads to the state with the highest state-value (e.g. the one closest to the town hall above). There is no need to compute several steps ahead, or to search within the huge search tree anymore. This reduces the complexity of the computations radically: instead of planning all its actions up to reaching the goal, the state-values now provide a kind of intermediate goal, one step ahead, which the agent can very easily reach. However, a lot of time and computation still needs to be spent on first learning the state-values.5
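As a rough illustration, here is a minimal sketch in Python of both ideas: learning state-values from experienced transitions (a simple temporal-difference update), and then selecting actions one step ahead using those values. The table V, the helper next_state and the parameter values are illustrative assumptions, and the agent is assumed to have a one-step model of its world, as discussed above.

    # Learning: nudge the value of the state we were in towards the
    # reward we obtained plus the (discounted) value of the state that
    # actually followed.
    def update_state_value(V, s, reward, s_next, alpha=0.1, gamma=0.9):
        V[s] += alpha * (reward + gamma * V[s_next] - V[s])

    # Acting: no search tree -- just pick the single action whose
    # resulting state has the highest learned value.
    def greedy_action(V, s, actions, next_state):
        return max(actions, key=lambda a: V[next_state(s, a)])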

Completely reactive action selection by action-values

While we have thus solved the problem of the computational explosion of planning, there is still the problem that the agent needs to have a model of how the world works. Even using the state-values, it needs to understand which action takes it into a state with higher state-value. Now, consider an extreme case where the agent has no model of how the world works, in the sense that it has no idea about the effects of its actions. Then, it is not enough to assign values to different states, since the agent does not know how to get from state A to state B. (Still, we assume the agent knows in which state it is at any given moment, so it does have some minimal model of the world.)

The trick to learning to act even with such a minimal model of the world is to learn what is called the action-value function. When the agent is in a given state, the action-value function tells the value of each of its actions, in terms of how much total future reward the agent will obtain if it performs that action.6 This makes action selection really easy and extremely fast: Just compare the values of different actions and choose the one which gives the maximum. In fact, all the relevant information about the effects of the agent’s actions is implicitly included in the action-value function. The agent still has to learn the action-values, but that is not really more difficult than learning the state-values.7
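One well-known algorithm of this kind is Q-learning; the following minimal Python sketch is only meant to illustrate the idea, with a table Q of action-values and illustrative parameter values. Note that neither function needs a model of what the actions do to the world.

    # Acting: compare the learned values of the available actions in the
    # current state and take the one with the maximum action-value.
    def choose_action(Q, s, actions):
        return max(actions, key=lambda a: Q[(s, a)])

    # Learning (Q-learning style): update an action-value from a single
    # experienced transition, using the best action-value available in
    # the state that followed.
    def update_action_value(Q, s, a, reward, s_next, actions,
                            alpha=0.1, gamma=0.9):
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])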

At the end of the 19th century, Edward Thorndike put cats in a box where they would need to press a lever to get out of the box and receive some fish to eat. He observed that in successive trials, the cats were pressing the lever more and more often. Such learning is called instrumental conditioning (as opposed to classical conditioning as in the famous Pavlov’s dogs, to be considered below). This shows how learning to choose actions is possible by simply associating what we call a state in AI theory (here, being in the box) with an action.8

So, using reinforcement learning, an AI or an animal can actually learn to act without doing any real planning and having almost no model about the world. If it learns the action-value function, it only needs to look at the single actions immediately available, and then take the action which has the largest action-value—at the state where it happens to find itself. Since the action is here triggered immediately without any deliberation, like a habit or a knee-jerk reaction, the resulting behaviour is often called habit-based, or reactive.9

Reinforcement learning has recently become popular as a model of human behaviour in neuroscience, where humans may not be considered too different from experimental animals such as cats or rats. Current thinking is that the same reinforcement learning algorithms can be used to model at least one part of the action selection system in most animals, including humans. Nevertheless, there is little doubt that some animals, probably most mammals, engage in planning as well.10

In fact, reinforcement learning using value functions is not a magic trick that will obliterate the complexity of the action selection: It simply shifts the computational burden from search in the tree to learning a value function. Sometimes, this is a good idea, but not always. We will discuss the pros and cons of reinforcement learning vs. planning in Chapter 7. Let me just mention here the main disadvantages of habit-based behaviour: such learning often needs a lot of time and data, and leads to inflexible behaviour. This is quite in line with the common-sense idea we have about habits.

Frustration as reward loss and prediction error

We have thus divided action selection into planning and habits, where habits refer to more automated action selection mechanisms.11 Now we consider how to define frustration in the case of habits, where there are no goals but rather rewards obtained here and there, so we cannot talk about frustration in the sense of not reaching the goal.

What defines frustration in this case is an error signal called reward loss12 which we already saw briefly in Chapter 2. It is computed by the following simple formula:

reward loss = expected reward - obtained reward

which is set to zero in case the difference is negative. That is, a reward loss is incurred when an agent expects to get some reward but actually gets less than expected. Maybe a cleaning robot expected to find a lot of dust in a room, but in fact there was much less. If the obtained reward is actually larger than expected, there is obviously no reward loss, which is why the reward loss is defined as zero whenever the difference in the expression above is negative. Reward loss can also occur if the expected reward is negative and the obtained reward is even more negative (larger in absolute value): the agent did expect something bad to happen, but it turned out to be even worse.13
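In code, the definition is just a couple of lines; this small Python sketch uses made-up numbers purely for illustration.

    # Reward loss: positive only when the obtained reward falls short of
    # the expected reward; otherwise defined as zero.
    def reward_loss(expected, obtained):
        return max(expected - obtained, 0)

    reward_loss(5, 2)    # 3: found much less dust than expected
    reward_loss(2, 5)    # 0: pleasant surprise, no loss
    reward_loss(-2, -6)  # 4: expected something bad, got something worse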

Expectation of reward here refers to the mathematical expectation as defined in probability theory. It is obtained by weighting the possible values by their probabilities: if the probability of obtaining a reward is 50% and the reward is 10 pieces of chocolate, the expected reward is 5 pieces of chocolate.14 Expectations of the future are often called predictions, which are in fact ubiquitous in the brain: it is likely that the brain makes a prediction of almost any important quantity in the environment.15

In contrast to our basic definition of frustration in Chapter 3, which works only on the level of plans, such a reward loss can be computed after every single action and at every single time point. This definition of reward loss is, in fact, quite flexible since the time interval in which the reward is computed can be specified to be anything from seconds to days. Therefore, it provides a general framework encompassing both planning and habit-based action. Reward loss coming from planning and reward loss coming from single actions are similar except that they work on very different time scales. We shall consider this point in more detail at the end of Chapter 7.16

Reward loss, in its turn, is related to what is called the reward prediction error (RPE), a most fundamental quantity in machine learning theory. RPE means any error made in the prediction of the reward. This definition is very general because the expected reward can be greater or less than the obtained one, and thus RPE can be positive or negative. If the obtained reward was larger than expected, that is the opposite of reward loss and suffering, and related to pleasure.17

As the very expression “reward prediction error” indicates, the theory of RPE also shows how suffering is related to learning by minimization of errors, which is a fundamental approach in machine learning. If the agent can predict the rewards obtained by different actions in different states, it will be able to act so as to maximize the obtained rewards (at least if it has a good model of other aspects of the world as well). To learn and improve such predictions, it is necessary to compute the errors that the current system makes in predicting the reward. This is how minimizing RPE is related to maximization of rewards. In fact, it is possible to devise reinforcement learning algorithms that work simply by minimizing reward prediction error.18

The exact mathematical definition of RPE is quite involved and relegated to a footnote.19 Let me just point out that RPE is a more general concept than reward loss also in the sense that it provides an error signal even when the agent is far away from any actual or expected rewards but receives “bad news” about future reward. This is in contrast to reward loss, which does not make any sense except when a reward is actually expected to be obtained right now (the exact meaning of “now” depends on the time scale). In particular, if the expectation of total future reward decreases, RPE signals frustration, while the reward loss has no meaning if no reward was expected to be obtained at this moment.

Suppose a cleaning robot is on its way to a room where there is a lot of dust (yummy!), and thus its expected (predicted) reward is high, but this reward will in any case not be obtained for quite a while. Then, it finds that the door to the room is locked and it cannot enter; that is bad news it didn’t expect. Thus the robot finds itself in a new state that has a much lower expected total future reward, since the dust in that room cannot be reached. It is this difference between the earlier prediction and the new prediction that creates RPE and suffering. This is not an ordinary reward loss because no actual reward was expected to appear at this time point anyway: the robot has not yet even entered the room, and the dust is still far away. However, RPE can create suffering merely based on predictions: if information arrives that makes the agent reduce its prediction of future reward, frustration is created.

This is intuitively appealing since a lot of our frustration is actually about such negative news and the lowering of expectations they create. Suppose I’m planning to attend an event that I expect to enjoy, and then, well in advance, I hear the event has been cancelled. I will suffer, although I didn’t expect to obtain anything enjoyable yet, and I may not have taken any action either; it was all just predictions in my head.20
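While the exact definition is left to the footnote, one common formalization of RPE in reinforcement learning is the so-called temporal-difference error, and a minimal Python sketch of it may clarify the locked-door example; the state-values and numbers below are illustrative assumptions only.

    # Temporal-difference style reward prediction error: the obtained
    # reward plus the new prediction of future reward, minus the old
    # prediction.
    def rpe(reward, v_old, v_new, gamma=0.9):
        return reward + gamma * v_new - v_old

    # Locked door: no reward arrives (reward = 0), but the new state
    # ("standing at a locked door") predicts far less future reward than
    # the old one ("about to enter a dusty room"), so the RPE is
    # strongly negative -- bad news alone creates frustration.
    rpe(reward=0, v_old=10.0, v_new=1.0)  # 0 + 0.9*1.0 - 10.0 = -9.1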

Expectations or predictions are crucial for frustration

Reward loss and RPE highlight the importance of expectations and predictions. Clearly, there must be some expectation or prediction in order for them to occur. If the cleaning robot were so primitive that it had no expectations or predictions at all, it might be just enjoying every single speck of dust it finds. Making it more intelligent so that it can predict the future thus deprives it of its “innocence”, and enables frustration to occur. Likewise, Cassell says that “to suffer, there must be a source of thoughts about possible futures”, even though his approach to suffering is quite different.21

The importance of predictions is well appreciated in neuroscience. It has been observed that in the brain, RPE is coded by certain neurons using a neurotransmitter called dopamine. More precisely, it is coded by quick changes in the level of dopamine (called “phasic dopamine signal”), typically originating in evolutionarily old areas such as the midbrain, which is literally in the very center of the brain.22 In case the obtained reward is higher than expected, there is a temporary peak in the amount of dopamine in the signalling pathways, which is called by some a “dopamine surge”. That’s why many drugs of abuse target the dopamine pathways in the brain. For example, cocaine blocks the removal of dopamine in the synapse so that its signal is amplified.23 Such drugs are fooling the reward-processing system in the brain, thus leading to a strong desire for such drugs, in addition to a pleasurable feeling. This has led some to think that dopamine is the neurotransmitter responsible for the feeling of pleasure itself. Such a viewpoint is probably incorrect, and the actual feeling of pleasure is mainly mediated by other transmitters, namely those in the opioid family, while dopamine is more related to “cold” action selection and learning.24

Classical conditioning

To emphasize the importance of predictions in the brain, let’s consider an extremely famous kind of prediction learning in the animal realm: classical conditioning. Ivan Pavlov, doing physiological experiments on dogs around the year 1900, observed that the dogs began to salivate when they saw the staff person who was responsible for feeding them, even before receiving any food. Pavlov was intrigued and tried to see if the dogs would be able to associate any arbitrary stimuli to food. He succeeded in making the dogs associate food with many different kinds of stimuli, including the sound of a bell or a metronome, provided that these stimuli were consistently presented just before food was given.

What the animal is clearly doing here is predicting the future: After the bell, food is likely to arrive. Such predictions are ubiquitous in the brain; the brain is constantly trying to predict what happens, using many different systems. Predicting the results of any action you might take is important if you want to choose good actions, as we already saw in the case of instrumental conditioning above. Predicting where the rabbit will be a second or two later is necessary if you want to catch it. Note the crucial distinction between classical conditioning and instrumental conditioning: In classical conditioning, the agent does not yet learn to choose actions, but merely to predict future states, independently of any rewards.25

It would be natural to assume that such classical conditioning could be easily performed by Hebbian learning. It is just the kind of association of two stimuli—bell and food—that Hebbian learning seems to be good at. That is to some extent true, although this is a bit tricky; the most successful models actually use supervised learning, with the bell as input and the food as the output. Such learning, again, proceeds by minimizing prediction error.26
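A classic example of such a prediction-error model is the Rescorla-Wagner rule; the minimal Python sketch below is only an illustration of the general idea, with an arbitrary learning rate.

    # Update the strength with which the bell predicts food, in
    # proportion to the prediction error on each trial.
    def conditioning_step(v_bell, food_present, alpha=0.3):
        prediction_error = float(food_present) - v_bell
        return v_bell + alpha * prediction_error

    v = 0.0
    for trial in range(10):          # bell followed by food on every trial
        v = conditioning_step(v, food_present=True)
    # v approaches 1: the bell comes to (almost) fully predict the food.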

Does a low level of rewards produce suffering?

Intuitively, however, it might seem that talking about frustration based on expectations and predictions is unnecessarily complicated. If the agent is in a state with low state-value (in its own estimation), would that not in itself imply suffering? Being in a state of low value means that the agent believes it will not obtain much net reward in the future, which sounds like a good reason for mental pain. Or, even more fundamentally, why not just say that lack of rewards, presumably during recent history, is suffering?

One fundamental problem with such an approach would be that it is not obvious how to define a suitable baseline or comparison: What level of state-value is actually low, and how small should recent reward actually be to create suffering? The reward loss or prediction error actually solves this problem by using the expected reward as the baseline. Thus, the obtained level of rewards is compared with the expected level, and if it is “low” in this particular sense, suffering occurs.27

Unexpected implications of state-value computation

In the rest of this chapter, I will consider some practical implications of the theory presented here. First, let us consider how the computation of state-values, as proposed in basic reinforcement learning theory, fundamentally changes the behaviour of human agents. Originally, of course, evolutionary forces demand that an action is pursued by a biological organism if it helps the organism reproduce and spread its genes, and an action is avoided if it hampers this effort. So, evolution “tells” us that kicking a stone is bad because it can cause damage to our foot, and the damage decreases our potential for reproduction—thus giving us negative reward for such an action. Having sex is very good, and rewarded by basic evolutionary mechanisms, because then we are fulfilling our deepest evolutionary calling and spreading our genes.

The computation of state-values changes the situation: The organism will not only try to reach states directly giving reward—such as having sex—but also states that have higher state-values. This is a mechanism for looking forward in time: instead of immediate reward, the organism will try to maximize the total reward in the future, and just that is given by the state-value.

Seemingly valueless states are now valued by the agent because they predict that actual reinforcement will soon be found. Such states provide intermediate goals in the pursuit of the actual reward, similarly to heuristics in tree search. If you train a robot to get orange juice from the fridge, it must of course first go to the fridge and open it. So, the state where the robot is standing next to the fridge acquires a positive state-value, and we could even say that the robot “likes” being next to a fridge, even more so if it is open.

The situation is even more complex due to the existence of human civilization and society. Culture plays an important role in determining the state-value function, and it is often difficult to separate the influences of biology and culture. In neuroscience, this is called the “nature vs. nurture” question. There can be extremely complex chains of value computation which transform the original evolutionary goals into behaviour based on intermediate goals. For example, humans have evolved to strive for high social status. From an evolutionary perspective, this is because it helps humans get more sexual partners and increases the number and the survival probability of their offspring. This then implies that we want to increase our status: for example, winning a gold medal in the Olympics is a good behavioural goal. Clearly, a gold medal in the Olympics has no evolutionary value in itself: it does not satisfy your hunger, thirst, or sexual appetite. It is just an arbitrary piece of metal. There is no logical connection between such a piece of metal and sex. It is only due to a complex interplay of value function calculation and cultural meanings that the original evolutionary reward of sex has been subtly transformed into a goal such as excelling in sports—or science, or politics.

Such slightly weird desires are another manifestation of the phenomenon discussed earlier: emergence of unexpected phenomena due to the interaction between the learning agent and a complex environment. If we program sufficiently complex AI, the same thing is likely to happen as with human evolution. The AI will pursue goals that were not intended by the programmers, but which still happen to produce a high state-value.28

Evolutionary rewards as obsessions

Now, if we admit that our desires are based on evolution, even if quite indirectly, is that a good thing or a bad thing? Should we just follow our desires, or think twice, or even try to follow some completely different goals? There are actually people who try to justify certain kinds of behaviour by saying they are evolutionarily conditioned, i.e. “evolution made me do it”. In popular science magazines and web sites, such logic is not uncommon. Fortunately, it is rejected by many as an example of sloppy thinking.29 In the following, I argue the very opposite: following evolutionary desires is often a bad idea and even morally wrong.

In fact, even the evolutionarily conditioned rewards themselves can go wrong, sometimes quite catastrophically. One reason is that, evolutionarily, we may be adapted to the environment where our evolutionary ancestors lived, often assumed to be the “African savannah”. However, the modern world is different and, therefore, our evolutionary programming may not be very suitable.30 With humans, a well-known example is the addictive quality of sugary food. The sweet taste of sugar must have signalled the high nutritional quality of food in the environment where our ancestors lived.31 But these days it tends to signal added refined sugar, which is bad for your health; evolutionarily speaking, sweet taste should rather be punishing in the modern context, not rewarding.32 Yet, the state of having a sweet taste in your mouth is rewarding, and humans tend to try to reach such “sweet” states.33

What is even more serious is that evolution makes us want particularly questionable goals, especially from a societal viewpoint. Evolution is fundamentally based on selfish, merciless competition between different organisms (or strictly speaking, between their genes). Many behavioural tendencies evolution has imposed on us should be seen as instruments for such egoistic competition. Evolution is all about maximally spreading our genes. It makes us hoard finite resources such as food for ourselves in order to spread our genes. It makes us violent; it even makes us go to war, again for the sole purpose of spreading genes. This is in stark contrast to most ethical systems in the world which see such selfishness as evil, and recommend quite opposite courses of action.34

Even more fundamentally, the rewards defined by evolution never had the goal of making us happy in any meaningful sense. They are a force that drives us to do exactly those things which are good for our evolutionary fitness. Even if you come to the conclusion that the evolutionary reward system makes you suffer, it cannot be switched off or modified. You cannot decide to be rewarded by something you consider more meaningful and good for society.

I think what evolution offers us is something I would call evolutionary obsessions.35 That is, the evolutionary rewards, together with the learned state-values, make us desire, even crave, many things which we would actually prefer not to desire if we could rationally decide what we desire. If you could just consciously, rationally, “switch off” your desire for, say, sugary food—would you not do that? Chapter 14 explains how Buddhist and Stoic thinking are based on the rather extreme tenet that switching off all desires would actually be very good for you. Whether one agrees with that extreme viewpoint or not, surely most people have certain desires that they would rather not have. I call them obsessions because they are automatically created, they often override any conscious deliberation, and they may even feel unwanted and intrusive. (We will look at the computational mechanisms for this in Chapters 7 and 8.)

Reward maximization is insatiable

Finally, let me mention another dark side to this reinforcement learning theory. One crucial property of the algorithms based on reward prediction error is that they drive the system to get more and more reward, and there is never any long-term satisfaction. This is because any prediction of the future is learned by the agent, and constantly updated by learning. Thus, in the reward loss, the level of expected reward is updated based on what the agent has obtained recently.

Suppose that an agent gets an exceptional amount of rewards for a while, maybe because a cleaning robot finds itself in a building with lots of nice dust to clean, and it is rewarded for every speck of dust it sucks away. Now, the agent’s prediction system is updated so that an equally large amount of rewards is predicted in the future as well. An environment that produced an unexpectedly large amount of reward for a while becomes the new baseline. That level of reward is not unexpected anymore and, therefore, does not produce any particular “pleasure” anymore either.36 What’s worse is that when things get back to normal, the agent will get less reward than it has now learned to expect, since the prediction was updated to reflect the particularly nice environment that lasted for a while. Therefore, the agent suffers enormously when it has to go back to a normal room with a modest amount of dirt.
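The following minimal Python sketch illustrates this mechanism; tracking the expectation as a simple running average is my own illustrative choice, not the exact model discussed here.

    def step(expected, obtained, alpha=0.2):
        loss = max(expected - obtained, 0)         # reward loss, as defined above
        expected += alpha * (obtained - expected)  # the expectation adapts
        return expected, loss

    expected = 1.0
    for r in [5, 5, 5, 5, 5]:      # a streak of unusually rich rewards
        expected, loss = step(expected, r)         # loss stays at 0
    for r in [1, 1, 1]:            # back to a normal amount of dust
        expected, loss = step(expected, r)         # now loss > 0: frustration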

Similar computations take place in our brain, since our brain also computes the reward prediction error and updates its expected level of reward. No wonder that Wolfram Schultz, one of the leading neuroscientists on dopamine, calls the dopamine neurons “little devils”.37 In fact, this is a logical consequence of the guiding principle of AI agent design: the agent should maximize obtained reward. The reward prediction system has no other goal than helping in maximization of rewards. If you program an agent to maximize reward, then by definition, nothing can possibly be enough; the system will be insatiable. The agent will relentlessly try to get more and more reward, and it is precisely the frustration signal that will force the agent to try harder and harder.38

A merciful programmer might program some stopping criterion to limit the greed of the agent: Once you have obtained X units of reward, you can stop. Unfortunately, evolution knows no mercy, and humans don’t seem to have any such stopping criterion programmed in them. We need more money, more power, more sex (and better sex), and better food (and more food). If we follow our evolutionary “obsessions”, as I called them, nothing is enough.

Suppose you program a robot called Pat to clean a building. You would like the building to be superclean, and the building is quite large, with dozens of rooms. So, you would be very tempted to program Pat so that it will spend all its time cleaning the building. You probably want to program a couple of other functions in Pat as well, such as a routine for charging its batteries, some basic maintenance procedures, as well as safety systems to prevent it from hurting people or breaking things. But you would probably program Pat to spend all the rest of the time tirelessly cleaning the rooms, with no breaks in between. This is what most programmers would do. Here, you have implemented a kind of “cleaning drive” which is without mercy. Pat will spend all its time and energy just making the rooms spotlessly clean. This may seem completely natural, given that it is “just” a robot.

Now, suppose your colleague, responsible for the visual design of the robot, decides to make Pat look really cute, giving it the shape of a little kitten. It even says “Meow” using its loudspeakers. Many people may suddenly start feeling sympathy for this poor little kitten. “Does it really have to be working all the time? Can’t it ever play, or take a rest?” they would ask. What would you reply?