The Difficulty of Specifying Goals

Jun 22, 2025

One of the central challenges in aligning an AI system is the difficulty of specifying the full range of desired and undesired behaviors.

A well-known example by Nick Bostrom goes as follows:

"Suppose we have an AI whose only goal is to make as many paper clips as possible. The AI will realize quickly that it would be much better if there were no humans because humans might decide to switch it off. Because if humans do so, there would be fewer paper clips. Also, human bodies contain a lot of atoms that could be made into paper clips. The future that the AI would be trying to gear towards would be one in which there were a lot of paper clips but no humans."

The point is not that this will happen per se; rather, it shows how difficult it would be to specify the goal of such a paper clip maximizer.

For example, one might object to the above example that we could add a requirement that the AI should make as many paper clips as possible while not hurting humans. However, the AI might then hijack all steel factories to use for the production of paper clips. Again, one might say that we can add a requirement that they should not hijack any factories. The AI might then build new steel factories in unwanted places. Again, one might object...

Hopefully you can see the pattern. Whatever the goal is, it is impossible to specify all (un)desired behaviors. This makes it hard to trust such an AI, because we are always unsure about a possible undesired, unspecified subgoal that it will pursue.

Alexander’s Substack

Discussion about this post