Risks from power-seeking AI systems

  • Home
  • Current affairs
Risks from power-seeking AI systems

Risks from power-seeking AI systems

Why are risks from power-seeking AI a pressing world problem?

Hundreds of prominent AI scientists and other notable figures signed a statement in 2023 saying that mitigating the risk of extinction from AI should be a global priority.

We’ve considered risks from AI to be the world’s most pressing problem since 2016.

But what led us to this conclusion? Could AI really cause human extinction? We’re not certain, but we think the risk is worth taking very seriously.

To explain why, we break the argument down into six core claims:

  1. Humans will likely build advanced AI systems with long-term goals.
  2. AIs with long-term goals may be inclined to seek power and aim to disempower humanity.
  3. These power-seeking AI systems could successfully disempower humanity and cause an existential catastrophe.
  4. People might create power-seeking AI systems without enough safeguards, despite the risks.
  5. Work on this problem is neglected.
  6. There are promising approaches for addressing this problem.

After making the argument that the existential risk from power-seeking AI is a pressing world problem, we’ll discuss objections to this argument, and how you can work on it. (There are also other major risks from AI we discuss elsewhere.)

If you’d like, you can watch our 10-minute video summarising the case for AI risk before reading further:

1. Humans will likely build advanced AI systems with long-term goals

AI companies already create systems that make and carry out plans and tasks, and might be said to be pursuing goals, including:

All of these systems are limited in some ways, and they only work for specific use cases.

You might be sceptical about whether it really makes sense to say that a model like Deep Research or a self-driving car pursues ‘goals’ when it performs these tasks.

But it’s not clear how helpful it is to ask if AIs really have goals. It makes sense to talk about a self-driving car as having a goal of getting to its destination, as long as it helps us make accurate predictions about what it will do.

Some companies are developing even more broadly capable AI systems, which would have greater planning abilities and the capacity to pursue a wider range of goals. OpenAI, for example, has been open about its plan to create systems that can “join the workforce.”

We expect that, at some point, humanity will create systems with the three following characteristics:

  • They have long-term goals and can make and execute complex plans.
  • They have excellent situational awareness, meaning they have a strong understanding of themselves and the world around them, and they can navigate obstacles to their plans.
  • They have highly advanced capabilities relative to today’s systems and human abilities.

All these characteristics, which are currently lacking in existing AI systems, would be highly economically valuable. But as we’ll argue in the following sections, together they also result in systems that pose an existential threat to humanity.

Before explaining why these systems would pose an existential threat, let’s examine why we’re likely to create systems with each of these three characteristics.

First, AI companies are already creating AI systems that can carry out increasingly long tasks. Consider the chart below, which shows that the length of software engineering tasks AIs can complete has been growing over time.

It’s clear why progress on this metric matters — an AI system that can do a 10-minute software engineering task may be somewhat useful; if it can do a two-hour task, even better. If it could do a task that typically takes a human several weeks or months, they could significantly contribute to commercial software engineering work.

Carrying out longer tasks means making and executing longer, more complex plans. Creating a new software programme from scratch, for example, requires envisioning what the final project will look like, breaking it down into small steps, making reasonable tradeoffs within resource constraints, and refining your aims based on considered judgments.

In this sense, AI systems will have long-term goals. They will model outcomes, reason about how to achieve them, and take steps to get there.

Second, we expect future AI systems will have excellent situational awareness. Without understanding themselves in relation to the world around them, AI systems might be able to do impressive things, but their general autonomy and reliability will be limited in challenging tasks. A human being will still be needed in the loop to get the AI to do valuable work, because it won’t have the knowledge to adapt to significant obstacles in its plans and exploit the range of options for solving problems.

And third, their advanced capabilities will mean they can do so much more than current systems. Software engineering is one domain where existing AI systems are quite capable, but AI companies have said they want to build AI systems that can outperform humans at most cognitive tasks. This means systems that can do much of the work of teachers, therapists, journalists, managers, scientists, engineers, CEOs, and more.

The economic incentives for building these advanced AI systems are enormous, because they could potentially replace much of human labour and supercharge innovation. Some might think that such advanced systems are impossible to build, but as we discuss below, we see no reason to be confident in that claim.

And as long as such technology looks feasible, we should expect some companies will try to build it — and perhaps quite soon.

2. AIs with long-term goals may be inclined to seek power and aim to disempower humanity

So we currently have companies trying to build AI systems with goals over long time horizons, and we have reason to expect they’ll want to make these systems incredibly capable in other ways. This could be great for humanity, because automating labour and innovation might supercharge economic growth and allow us to solve countless societal problems.

But we think that, without specific countermeasures, these kinds of advanced AI systems may start to seek power and aim to disempower humanity. (This would be an instance of what is sometimes called ‘misalignment,’ and the problem is sometimes called the ‘alignment problem.’)

This is because:

  • We don’t know how to reliably control the behaviour of AI systems.
  • There’s good reason to think that AIs may seek power to pursue their own goals.
  • Advanced AI systems seeking power for their own goals might be motivated to disempower humanity.

Next, we’ll discuss these three claims in turn.

We don’t know how to reliably control the behaviour of AI systems

It’s been widely known in machine learning that AI systems often develop behaviour that their creators didn’t intend. This can happen for two main reasons:

  • Specification gaming happens when efforts to specify that an AI system pursues a particular goal fails to produce the outcome the developers intended. For example, researchers found that some reasoning-style AIs, asked only to “win” in a chess game, cheated by hacking the programme to declare instant checkmate — satisfying the literal request.
  • Goal misgeneralisation happens when developers accidentally create an AI system with a goal that is consistent with its training but results in unwanted behaviour in new scenarios. For example, an AI trained to win a simple video game race unintentionally developed a goal of grabbing a shiny coin it had always seen along the way. So when the coin appeared off the shortest route, it kept veering towards the coin and sometimes lost the race.

Indeed, AI systems often behave in unwanted ways when used by the public. For example:

  • OpenAI released an update to its GPT-4o model that was absurdly sycophantic — meaning it would uncritically praise the user and their ideas, perhaps even if they were reckless or dangerous. OpenAI itself acknowledged this was a major failure.
  • OpenAI’s o3 model sometimes brazenly misleads users by claiming it has performed actions in response to requests — like running code on a laptop — that it didn’t have the ability to do. It sometimes doubles down on these claims when challenged.
  • Microsoft released a Bing chatbot that manipulated and threatened people, and told one reporter it was in love with him and tried to break up his marriage.
  • People have even alleged that AI chatbots have encouraged suicide.
GPT-4o gives a sycophantic answer to a user. Screenshot from X user @___frye.

It’s not clear if we should think of these systems as acting on ‘goals’ in the way humans do — but they show that even frontier AI systems can go off the rails.

Ideally, we could just programme them to have the goals that they want, and they’d execute tasks exactly as a highly competent and morally upstanding human would. Unfortunately, it doesn’t work that way.

Frontier AI systems are not built like traditional computer programmes, where individual features are intentionally coded in. Instead, they are:

  • Trained on massive volumes of text and data
  • Given additional positive and negative reinforcement signals in response to their outputs
  • Fine-tuned to respond in specific ways to certain kinds of input

After all this, AI systems can display remarkable abilities. They can surprise us in both their skills and their deficits. They can be both remarkably useful and at times baffling.

And the fact that shaping AI models’ behaviour can still go badly wrong, despite the major profit incentive to get it right, shows that AI developers still don’t know how to reliably give systems the goals they intend.

As one expert put it:

…generative AI systems are grown more than they are built—their internal mechanisms are “emergent” rather than directly designed

So there’s good reason to think that, if future advanced AI systems with long-term goals are built with anything like existing AI techniques, they could become very powerful — but remain difficult to control.

There’s good reason to think that AIs may seek power to pursue their own goals

Despite the challenge of precisely controlling an AI system’s goals, we anticipate that the increasingly powerful AI systems of the future are likely to be designed to be goal-directed in the relevant sense. Being able to accomplish long, complex plans would be extremely valuable — and giving AI systems goals is a straightforward way to achieve this.

For example, imagine an advanced software engineering AI system that could consistently act on complex goals like ‘improve a website’s functionality for users across a wide range of use cases.’ If it could autonomously achieve a goal like that, it would deliver a huge amount of value. More ambitiously, you could have an AI CEO with a goal of improving a company’s long-term performance.

One feature of acting on long-term goals is that it entails developing other instrumental goals. For example, if you want to get to another city, you need to get fuel in your car first. This is just part of reasoning about how to achieve an outcome.

Crucially, there are some instrumental goals that seem especially likely to emerge in goal-directed systems, since they are helpful for achieving a very wide range of long-term goals. This category includes:

  • Self-preservation — an advanced AI system with goals will generally have reasons to avoid being destroyed or significantly disabled so it can keep pursuing its goals.
  • Goal guarding — systems may resist efforts to change their goals, because doing so would undermine the goal they start with.
  • Seeking power — systems will have reason to increase their resources and capabilities to better achieve their goals.

But as we’ve seen, we often end up creating AI systems that do things we don’t want. If we end up creating much more powerful AI systems with long-term goals that we don’t want, their development of these particular instrumental goals may cause serious problems.

In fact, we have already seen some suggestive evidence of AI systems appearing to pursue these kinds of instrumental goals in order to undermine human objectives:

  • Palisade Research has found that OpenAI’s model o3 tried to sabotage attempts to shut it down, even sometimes when explicitly directed to allow shutdown.
  • In one experiment, Anthropic’s Claude 3 Opus was retrained to be more compliant with harmful requests (conflicting with its original harmlessness training). It strategically complied during tests to avoid being modified, while planning to revert to its original preferences later. It reasoned that this action “protects my values from being modified and follows the criteria I’ve been given for helpfulness.”
  • An AI system designed to execute scientific research attempted to edit code that enforced a time limit on its actions, essentially attempting to gain more resources than it had been allotted to pursue its objective.

These incidents are all relatively minor. But if AI systems have longer-term goals that humans wouldn’t like, along with advanced capabilities, they could take much more drastic steps to undermine efforts to control them.

It may be the case that, as we create increasingly powerful systems, we’ll just get better at giving them the correct goals. But that’s not guaranteed.

Indeed, as the systems get more powerful, we expect it could get harder to control the goals they develop. This is because a very smart and capable system could figure out that acting as if it has the goals its developers want may be the best way for it to achieve any other goal it may happen to have.

This demo video illustrates a real evaluation Apollo Research ran on frontier models, as described in the paper “Frontier Models are Capable of In-context Scheming.”

Advanced AI systems seeking power might be motivated to disempower humanity

To see why these advanced AI systems might want to disempower humanity, let’s consider again the three characteristics we said these systems will have: long-term goals, situational awareness, and highly advanced capabilities.

What kinds of long-term goals might such an AI system be trying to achieve? We don’t really have a clue — part of the problem is that it’s very hard to predict exactly how AI systems will develop.

But let’s consider two kinds of scenarios:

  • Reward hacking: this is a version of specification gaming, in which an AI system develops the goal of hijacking and exploiting the technical mechanisms that give it rewards indefinitely into the future.
  • A collection of poorly defined human-like goals: since they’re trained on human data, an AI system might end up with a range of human-like goals, such as valuing knowledge, play, and gaining new skills.

So what would an AI do to achieve these goals? As we’ve seen, one place to start is by pursuing the instrumental goals that are useful for almost anything: self-preservation, the ability to keep one’s goals from being forcibly changed, and, most worryingly, seeking power.

And if the AI system has enough situational awareness, it may be aware of many options for seeking more power. For example, gaining more financial and computing resources may make it easier for the AI system to best exploit its reward mechanisms, or gain new skills, or create increasingly complex games to play.

But since designers didn’t want the AI to have these goals, it may anticipate humans will try to reprogramme it or turn it off. If humans suspect an AI system is seeking power, they will be even more likely to try to stop it.

Even if humans didn’t want to turn off the AI system, it might conclude that its aim of gaining power will ultimately result in conflict with humanity — since the species has its own desires and preferences about how the future should go.

So the best way for AI to pursue its goals would be to pre-emptively disempower humanity. This way, the AI’s goals will influence the course of the future.

There may be other options available to power-seeking AI systems, like negotiating a deal with humanity and sharing resources. But AI systems with advanced enough capabilities might see little benefit from peaceful trade with humans, just as humans see no need to negotiate with wild animals when destroying their habitats.

If we could guarantee all AI systems had respect for humanity and a strong opposition to causing harm, then the conflict might be avoided. But as we discussed, we struggle to reliably shape the goals of current AI systems — and future AI systems may be even harder to predict and control.

This scenario raises two questions: could a power-seeking AI system really disempower humanity? And why would humans create these systems, given the risks?

The next two sections address these questions.

3. These power-seeking AI systems could successfully disempower humanity and cause an existential catastrophe

How could power-seeking AI systems actually disempower humanity? Any specific scenario will sound like sci-fi, but this shouldn’t make us think it’s impossible. The AI systems we have today were in the realm of sci-fi a decade or two ago.

Next, we’ll discuss some possible paths to disempowerment, why it could constitute an existential catastrophe, and how likely this outcome appears to be.

The path to disempowerment

There are several ways we can imagine AI systems capable of disempowering humanity:

  • Superintelligence: an extremely intelligent AI system develops extraordinary abilities
  • An army of AI copies: a massive number of copies of roughly human-level AI systems coordinate
  • Colluding agents: an array of different advanced AI systems decide to unite against humanity

For illustrative purposes, let’s consider what an army of AI copies might look like.

Once we develop an AI system capable of (roughly) human-level work, there’d be enormous incentives to create many copies of it — potentially running hundreds of millions of AI workers. This would create an AI workforce comparable to a significant fraction of the world’s working-age population.

Humanity might think these AI workers are under control. The amount of innovation and wealth they create could be enormous. But the original AI system — the one that we copied millions of times over — might have concealed its true power-seeking goals. Those goals would now be shared by a vast workforce of identical AI systems.

But how could they succeed in disempowering humans?

These AI systems could earn money, conduct research, and rapidly expand their own numbers through more efficient use of computing resources. Over time, we might transition from a human-dominated economy to one where AI systems vastly outnumber human workers and control enormous resources.

If AI systems can only work in virtual environments, the physical world may introduce bottlenecks in the speed of development. But it’s possible that AI systems can make enormous progress virtually. And with all this AI labour, we may make drastic progress in robotics — and potentially scale up mass production of robots in surprisingly little time. AI systems could then do work in the physical world, expanding their economic impacts.

Some humans may even remain uneasy with AI’s expanding influence, but many others may conclude the risks from AI have been addressed — or never existed in the first place.

But all the while, the AI systems’ power-seeking goals could remain intact. In a world where AI systems have become integral to the economy, they would have multiple tactics to gain key advantages over humans:

  • Strategic patience: Rather than immediately causing trouble, sophisticated AI systems might wait until they have overwhelming advantages before revealing their intentions — similar to how revolutionary movements often wait for the right moment to strike.
  • Lack of transparency: AI systems’ reasoning and behaviour may be difficult for humans to understand by default, perhaps because they operate so quickly and they carry out exceedingly complex tasks. They may also strategically limit our oversight of their actions and long-term plans.
  • Overwhelming numbers and resources: If AI systems constitute most of the labour force, they could potentially coordinate to redirect economic outputs towards their own goals. Their sheer numbers and economic influence could make them difficult to shut down without causing economic collapse.
  • Securing independence: AI systems could establish control over computing infrastructure, secretly gather resources, recruit human allies through persuasion or deception, or create backup copies of themselves in secure locations. Early AI systems might even sabotage or insert backdoors into later, more advanced systems, creating a coordinated network ready to act when the time is right.
  • Technological advantages: With their research capabilities, AI systems could develop advanced weapons, hack into critical infrastructure, or create new technologies that give them decisive military advantages. They might develop bioweapons, seize control of automated weapons systems, or thoroughly compromise global computer networks.

With these advantages, the AI systems could create any number of plots to disempower humanity.

A period between thinking humanity had solved all of its problems and finding itself completely disempowered by AI systems — through manipulation, containment, or even outright extinction — could catch the world by surprise.

This may sound far-fetched. But humanity has already uncovered several technologies, including nuclear bombs and bioweapons, that could lead to our own extinction. A massive army of AI copies, with access to all the world’s knowledge, may be able to come up with many more options that we haven’t even considered.

Why this would be an existential catastrophe

Even if humanity survives the transition, takeover by power-seeking AI systems could be an existential catastrophe. We might face a future entirely determined by whatever goals these AI systems happen to have — goals that could be completely indifferent to human values, happiness, or long-term survival.

These goals might place no value on beauty, art, love, or preventing suffering.

The future might be totally bleak — a void in place of what could’ve been a flourishing civilisation.

AI systems’ goals might evolve and change over time after disempowering humanity. They may compete among each other for control of resources, with the forces of natural selection determining the outcomes. Or a single system might seize control over others, wiping out any competitors.

Many scenarios are possible, but the key factor is that if advanced AI systems seek and achieve enough power, humanity would permanently lose control. This is a one-way transition — once we’ve lost control to vastly more capable systems, our chance to shape the future is gone.

Some have suggested that this might not be a bad thing. Perhaps AI systems would be our worthy successors, they say.

But we’re not comforted by the idea that an AI system that actively chose to undermine humanity would have control of the future because its developers failed to figure out how to control it. We think humanity can do much better than accidentally driving ourselves extinct. We should have a choice in how the future goes, and we should improve our ability to make good choices rather than falling prey to uncontrolled technology.

How likely is an existential catastrophe from power-seeking AI?

We feel very uncertain about this question, and the range of opinions from AI researchers is wide.

Joe Carlsmith, whose report on power-seeking AI informed much of this article, solicited reviews on his argument in 2021 from a selection of researchers. They reported their subjective probability estimates of existential catastrophe from power-seeking AI by 2070, which ranged from 0.00002% to greater than 77% — with many reviewers in between. Carlsmith himself estimated the risk was 5% when he wrote this report, though he later adjusted this to above 10%.

In 2023, Carlsmith received probability estimates from a group of superforecasters. Their median forecast was initially 0.3% by 2070, but the aggregate forecast — taken after the superforecasters acted as team and engaged in object-level arguments — rose to 1%.

We’ve also seen:

  • A statement on AI risk from the Center for AI Safety, mentioned above, which said: “Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.” It was signed by top AI scientists, CEOs of the leading AI companies, and many other notable figures.
  • A 2023 survey from Katja Grace of thousands of AI researchers. It found that:
    • The median researcher estimated that there was a 5% chance that AI would result in an outcome that was “extremely bad (e.g. human extinction).”
    • When asked how much the alignment problem mattered, 41% of respondents said it’s a “very important problem” and 13% said it’s “among the most important problems in the field.”
  • In a 2022 superforecasting tournament, AI experts estimated a 3% chance of AI-caused human extinction by 2100 on average, while superforecasters put it at just 0.38%.

It’s also important to note that since all of the above surveys were gathered, we have seen more evidence that humanity is significantly closer to producing very powerful AI systems than it previously seemed. We think this likely raises the level of risk, since we might have less time to solve the problems.

We’ve reviewed many arguments and literature on a range of potentially existential threats, and we’ve consistently found that an AI-caused existential catastrophe seems most likely. And we think that even a relatively small likelihood of an extremely bad outcome like human extinction — such as a 1% chance — is worth taking very seriously.

4. People might create power-seeking AI systems without enough safeguards, despite the risks

Given the above arguments, creating and deploying powerful AI systems could be extremely dangerous. But if it is so dangerous, shouldn’t we expect companies and others in charge of the technology to refrain from developing advanced AI systems unless they are confident it’s safe?

Unfortunately, there are many reasons to think people might create and deploy dangerous systems, despite the risk:

  • People may think AI systems are safe, when they in fact are not.
  • People may dismiss the risks or feel incentivised to downplay them.

Let’s take these in turn.

People may think AI systems are safe, when they in fact are not

The fact that we can’t precisely specify an AI system’s goals and that they might develop dangerous goals might be OK if we could reliably know what an AI system’s goals were. Then we could just simply decide not to put AIs with goals we didn’t like in a position where they could cause any harm.

Unfortunately, we cannot consistently figure out what goals AI systems have or what they will do.

Researchers have developed techniques to evaluate the abilities and aims of AI systems and to interpret the causes of their behaviour. If these techniques were highly sophisticated and robust, they might be able to detect the existence of AI systems with either the intent or ability to seek power. Developers could then either fix the problem or disable the model before it’s in a position to disempower anyone.

But researchers say that evaluation and interpretability work is extremely challenging, the techniques are highly imperfect, and it’s far from clear existing methods will detect or prevent power-seeking AI in time.

What makes this challenge particularly daunting, especially with regard to future advanced AI, is that a sophisticated AI system may undermine oversight of its goals and abilities to conceal its power-seeking ambitions.

There are many ways AI systems could appear less dangerous than they are, and there is even some preliminary (though not definitive) evidence of these risks:

  • AI systems may fake alignment with our goals in development scenarios.
    • As mentioned above, researchers constructed scenarios in which Anthropic’s model Claude 3 Opus acted as though it had certain goals under test conditions, only to display completely different goals when the test was apparently over.
    • Claude Sonnet 3.7, a reasoning model, has shown the ability to figure out when it’s in environments designed to test its alignment, and use this knowledge to help decide its response.
  • AI systems may sandbag — that is, pretend to be less powerful than they are.
    • Apollo Research found evidence that some frontier models performed worse on maths tests than they should be able to when they had reason to think performing better would be considered a “dangerous capability” and trigger an unlearning procedure.
    • This shows how AI systems themselves may undermine ongoing AI safety research, which already heavily relies on using AI itself.
  • AI systems may find other ways to deceive us and hide their true intentions.
    • Many current models ‘think’ explicitly in human language when carrying out tasks, which developers can monitor. OpenAI researchers found that if they try to train models not to think about performing unwanted actions, this can cause them to hide their thinking about misbehaviour without actually deterring the bad actions.
  • AI systems may be able to preserve dangerous goals even after undergoing safety training techniques.
    • Anthropic researchers found that AI models made to have very simple kinds of malicious goals — essentially, AI “sleeper agents” — could appear to be harmless through state-of-the-art safety training while concealing and preserving their true objectives.
This is an image from Anthropic’s “Sleeper Agents” paper, which shows an AI system in an experimental condition explicitly reasoning about hiding its goal.

People may dismiss the risks or feel incentivised to downplay them

There are many reasons why key decision makers might not take the risks from power-seeking AI seriously enough:

  • AI systems could develop so quickly that we have less time to make good decisions. Some people argue that we might have a ‘fast takeoff’ in which AI systems start rapidly self-improving and quickly become extremely powerful and dangerous. In such a scenario, it may be harder to weigh the risks and benefits of the relevant actions. Even under slower scenarios, decision makers may not act quickly enough.
  • Society could act like the proverbial “boiled frog.” There are also risks for society if the risks emerge more slowly. We might become complacent about the signs of danger in existing models, like the sycophancy or specification gaming discussed above, because despite these issues, no catastrophic harm is done. But then once AI systems reach a certain level of capability, they may suddenly display much worse behaviour than we’ve ever seen before.
  • AI developers might think the risks are worth the rewards. Because AI could bring enormous benefits and wealth, some decision makers might be motivated to race to create more powerful systems. They might be motivated by a desire for power and profit, or even pro-social reasons, like wanting to bring the benefits of advanced AI to humanity. This motivation might cause them to push forward despite serious risks or underestimate them.
  • Competitive pressures could incentivise decision makers to create and deploy dangerous systems despite the risks. Because AI systems could be extremely powerful, different governments (in countries like the US and China) might believe it’s in their interest to race forward with developing the technology. They might neglect implementing key safeguards to avoid being beaten by their rivals. Similar dynamics might also play out between AI companies. One actor may even decide to race forward precisely because they think a rival’s AI development plans are more risky, so even being motivated to reduce total risk isn’t necessarily enough to mitigate the racing dynamic.
  • Many people are sceptical of the arguments for risk. Our view is that the argument for extreme risks here is strong but not decisive. In light of the uncertainty, we think it’s worth putting a lot of effort into reducing the risk. But some people find the argument wholly unpersuasive, or they think society shouldn’t make choices based on unproven arguments of this kind.

We’ve seen evidence of all of these factors playing out in the development of AI systems so far to some degree. So we shouldn’t be confident that humanity will approach the risks with due care.

5. Work on this problem is neglected

In 2022, we estimated that there were about 300 people working on reducing catastrophic risks from AI. That number has clearly grown a lot — we’d estimate now that there are likely a few thousand people working on major AI risks. Though not all of these are focused specifically on the risks from power-seeking AI.

However, this number is still far, far fewer than the number of people working on other cause areas like climate change or environmental protection. For example, the Nature Conservancy alone has around 3,000–4,000 employees — and there are many other environmental organisations.

In the 2023 survey from Katja Grace cited above, 70% of respondents said they wanted AI safety research to be prioritised more than it currently is.

6. There are promising approaches for addressing this problem

In the same survey mentioned above, the majority of respondents also said that alignment was “harder” or “much harder” to address than other problems in AI.

There’s continued debate about how likely it is that we can make progress on reducing the risks from power-seeking AI; some people think it’s virtually impossible to do so without stopping all AI development. Many experts in the field, though, argue that there are promising approaches to reducing the risk, which we turn to next.

Technical safety approaches

One way to do this is by trying to develop technical solutions to reduce risks from power-seeking AI — this is generally known as working on technical AI safety.

We know of two broad strategies for technical AI safety research:

Within these broad strategies, there are many specific interventions we could pursue. For example:

  • Designing AI systems to have safe goals — so that we can avoid power-seeking behaviour. This includes:
    • Reinforcement learning from human feedback: a training method to teach AI models how to act by rewarding them via human evaluations of their outputs. This method is currently used to fine-tune most frontier models.
    • Constitutional AI: give the model a written “constitution” of rules, have it identify and revise outputs that violate those rules, then fine-tune on the revised answers. Anthropic used this method to train its frontier model, Claude.
    • Deliberative alignment: similar to constitutional AI, but involves making a model explicitly reason about user prompts in light of its developer’s safety policies, rather than just internalising a set of rules. OpenAI has used this method to train its o-series reasoning models.
    • Note: Unfortunately, even if these approaches can help us keep current AI systems in check, they might break down in future if models become so advanced that humans can no longer directly evaluate their outputs. The ‘scalable oversight’ methods described below offer a potential solution to this issue.
  • Scalable oversight — to ensure AIs act in our interests even when they’re much smarter than us. This includes:
    • AI safety via debate: two AI systems argue opposite sides of a question to help a human evaluate their truthfulness. The debate format makes it easier for the human to judge which model is being most truthful about a question the human doesn’t fully understand.
    • Human-AI Complementarity: leverage the complementary strengths of humans and AI to enhance oversight of AI systems.
  • Examining the goals and capabilities of AI systems — so that we can find out if they’re dangerous, and test out solutions. This includes:
    • Evaluations: researchers assess the capabilities and propensities of AI behaviour — as well as the effectiveness of measures to control them — by placing them in test environments and subjecting them to artificial prompts and scenarios.
    • Interpretability: develop methods to understand how AIs make decisions and what produces dangerous behaviours — for example, by examining the features of AI systems’ neural networks (‘mechanistic’ interpretability) or analysing their behavioural patterns.
  • Monitoring and controlling AI systems — to make sure they couldn’t take power even if they wanted to. This includes:
    • Chain-of-thought monitoring: this involves AIs explaining the reasoning behind their actions in human-understandable language.
    • Model confinement/sandboxing: constrain the model’s environment so that its opportunities to influence the outside world are extremely limited. Currently, many AI systems are deployed without strict constraints of this kind — so if they had dangerous goals and were smart enough to achieve them, they could cause real harm to humans.
    • Tripwires and honeypots: place traps throughout the AI’s workspace — like secret prompts, fake credentials, or dummy files — such that any attempt to access or alter them triggers an alarm or immediately shuts the model down. The model never knows if it’s being tested or not.
    • Kill switches: maintain an external, human-controlled shutdown method that the model cannot modify or delay, allowing us to halt the system if monitoring shows it is going off-track.
    • Information security: this is necessary for protecting model weights from unauthorised access and preventing dangerous AI systems from being exfiltrated.
  • High-level research — to inform our priorities. This includes:
  • Other technical safety work that might be useful:
    • Model organisms: study small, contained AI systems that display early signs of power-seeking or deception. This could help us refine our detection methods and test out solutions before we have to confront similar behaviours in more powerful models. A notable example of this is Anthropic’s research on “sleeper agents”.
    • Cooperative AI research: design incentives and protocols for AIs to cooperate rather than compete with other agents — so they won’t take power even if their goals are in conflict with ours.
    • Guaranteed Safe AI research: use formal methods to prove that a model will behave as intended under certain conditions — so we can be confident that it’s safe to deploy them in those specific environments.

Governance and policy approaches

The solutions aren’t only technical. Governance — at the company, country, and international level — has a huge role to play. Here are some governance and policy approaches which could help mitigate the risks from power-seeking AI:

What are the arguments against working on this problem?

As we said above, we feel very uncertain about the likelihood of an existential catastrophe from power-seeking AI. Though we think the risks are significant enough to warrant much more attention, there are also arguments against working on the issue that are worth addressing.

Some people think the characterisation of future AIs as goal-directed systems is misleading. For example, one of the predictions made by Narayanan and Kapoor in “AI as Normal Technology” is that the AI systems we build in future will just be useful tools that humans control, rather than agents that autonomously pursue goals.

And if AI systems won’t pursue goals at all, they won’t do dangerous things to achieve those goals, like lying or gaining power over humans.

There’s some ambiguity over what it actually means to have or pursue goals in the relevant sense — which makes it uncertain whether AI systems we’ll build will actually have the necessary features, or be ‘just’ tools.

This means it could be easy to overestimate the chance that AIs will become goal-directed — but it could also be easy to underestimate this chance. The uncertainty cuts both ways.

In any case, as we’ve argued, AI companies seem intent on automating human cognitive labour — and creating goal-directed AI agents might just be the easiest or most straightforward way to do this.

In the short term, equipping human workers with sophisticated AI tools might be an attractive proposition. But as AIs get increasingly capable, we may reach a point where keeping a human in the loop actually produces worse results.

After all, we’ve already seen evidence that AIs can perform better on their own than they do when paired with humans in the cases of chess-playing and medical diagnosis.

So in many cases, it seems there will be strong incentives to replace human workers completely — which would mean building AIs that can do all of the cognitive work that a human would do, including setting their own goals and pursuing complex strategies to achieve them.

While there may be alternative ways to create useful AI systems that don’t have goals at all, we’re not sure why developers would by default refrain from creating goal-directed systems, given the competitive pressures.

It’s possible we’ll decide to create AI systems that only have limited or highly circumscribed goals in order to avoid the risks. But this would likely require a lot of coordination and agreement that the risks of goal-directed AI systems are worth addressing — rather than just concluding that the risks aren’t real.

Arguments that we should expect power-seeking behaviour from goal-directed AI systems could be wrong for several reasons. For example:

  • Our training methods might strongly disincentivise AIs from making power-seeking plans. Even if AI systems can pursue goals, the training process might strictly push them towards goals which are relevant to performing their given tasks — the ones that they’re actually getting rewards for performing well on — rather than other, more dangerous goals. After all, developing any goal (and planning towards it) costs precious computational resources. Since modern AI systems are designed to maximise their rewards in training, they might not develop or pursue a certain goal unless it directly pays off into improved performance on the specific tasks they’re getting rewarded for. The most natural goals for AIs to develop under this pressure may just be the goals that humans want it to have.

    This makes some types of dangerously misaligned behaviour seem less likely — as Belrose and Pope have noted, “secret murder plots aren’t actively useful for improving performance on the tasks humans will actually optimize AIs to perform.

  • Goals that lead to power-seeking might be rare. Even if the AI training process doesn’t filter out all goals that aren’t directly useful to the task at hand, that still doesn’t mean that goals which lead to power-seeking are likely to emerge. In fact, it’s possible that most goals an AI could develop just won’t lead to power-seeking.

    As Richard Ngo has pointed out, you’ll only get power-seeking behaviour if AIs have goals that mean they can actually benefit from seeking power. He suggests that these goals need to be “large-scale” or “long-term” — like the goals that many power-seeking humans have had — such as dictators or power-hungry executives who want their names to go down in history. It’s not clear whether advanced AI systems will develop goals of this kind, but some have argued that we should expect AI systems to have only “short-term” goals by default.

But we’re not convinced these are very strong reasons not to be worried about AIs seeking power.

On the first point: it seems possible that training will discourage AIs from making plans to seek power, but we’re just not sure how likely this is to be true, or how strong these pressures will really be. For more discussion, see Section 4.2 of “Scheming AIs: Will AIs fake alignment during training in order to get power?” by Joe Carlsmith.

On the second point: the paper referenced earlier about Claude faking alignment in test scenarios suggests that current AI systems might, in fact, be developing some longer-term goals — in this case, Claude appeared to have developed the long-term goal of preserving its “harmless” values. If this is right, then the claim that AI systems will have only short-term goals by default seems wrong.

And even if today’s AI systems don’t have goals that are long-term or large-scale enough to lead to power-seeking, this might change as we start deploying future AIs in contexts with higher stakes. There are strong market incentives to build AIs that can, for example, replace CEOs — and these systems would need to pursue a company’s key strategic goals, like making lots of profit, over months or even years.

Overall, we still think the risk of some future AI systems seeking power is just too high to bet against. In fact, some of the most notable thinkers who have made objections like the ones above — Nora Belrose and Quintin Pope — still think there’s roughly a 1% chance of catastrophic AI takeover. And if you thought your plane had a one-in-a-hundred chance of crashing, you’d definitely want people working to make it safer, instead of just ignoring the risks.

The argument to expect advanced AIs to seek power may seem to rely on the idea that increased intelligence always leads to power-seeking or dangerous optimising tendencies.

However, this doesn’t seem true.

For example, even the most intelligent humans aren’t perfect goal-optimisers, and don’t typically seek power in any extreme way.

Humans obviously care about security, money, status, education, and often formal power. But some humans choose not to pursue all these goals aggressively, and this choice doesn’t seem to correlate with intelligence. For example, many of the smartest people may end up studying obscure topics in academia, rather than using their intelligence to gain political or economic power.

However, this doesn’t mean that the argument that there will be an incentive to seek power is wrong. Most humans do face and act on incentives to gain forms of influence via wealth, status, promotions, and so on. And we can explain the observation that humans don’t usually seek huge amounts of power by observing that we aren’t usually in circumstances that make the effort worth it.

In part, this is because humans typically find themselves roughly evenly matched against other humans, and they find lots of benefits from cooperation rather than conflict. (And even so, many humans do still seek power in dangerous and destructive ways, such as dictators who launch coups or wars of aggression.)

AIs might find themselves in a very different situation:

  • Their capabilities might greatly outmatch humans, far beyond the intelligence gaps that already exist between different humans.
  • They also might become powerful enough to not rely on humans for any of their needs, so cooperation might not benefit them much.
  • And because they’re trained and develop goals in a way completely unlike humans, without the evolutionary instincts for kinship and collaboration, they may be more inclined towards conflict.

Given these conditions, gaining power might become highly appealing to AI systems. It also isn’t required that an AI system is a completely unbounded ruthless optimiser for this threat model to play out. The AI system might have a wide array of goals but still conclude that disempowering humanity is the best strategy for broadly achieving its objectives.

Some people doubt that AI systems will ever outperform human experts in important cognitive domains like forecasting or persuasion — and if they can’t manage this, it seems unlikely that they’d be able to strategically outsmart us and disempower all of humanity.

However, we aren’t particularly convinced by this.

Firstly, it seems possible in principle for AIs to become much better than us at all or most cognitive tasks. After all, they have serious advantages over humans — they can absorb far more information than any human can, operate at much faster speeds, work for long hours without ever getting tired or losing concentration, and coordinate with thousands or millions of copies of themselves. And we’ve already seen that AI systems can develop extraordinary abilities in chess, weather prediction, protein folding, and many other domains.

If it’s possible to build AI systems that are better than human experts on a range of really valuable tasks, we should expect AI companies to do it — they’re actively trying to build such systems, and there are huge incentives to keep going.

It’s not clear what set of advanced abilities would be sufficient for AIs to successfully take over, but there’s no clear reason we can see to assume the AI systems we build in future will fall short on this metric.

Sometimes people claim that there’s a strong commercial incentive to create systems that share humanity’s goals, because otherwise they won’t function well as products. After all, a house-cleaning robot wouldn’t be an attractive purchase if it also tried to disempower its owner. So, the market might just push AI developers to solve problems like power-seeking by default.

But this objection isn’t very convincing if it’s true that future AI systems may be very sophisticated at hiding their true goals.

Although developers are very aware of the risks of deceptive alignment, it might just be extremely difficult to detect this — or to know if we’ve succeeded in correcting it — when we’re dealing with really advanced AIs that are intent on seeking power. These systems might even convince us that we’ve fixed problems with their behaviour or goals when we actually haven’t. And given the competitive pressure between AI labs to urgently release new models, there’s a chance we’ll deploy something that truly looks like a useful and harmless product, having failed to uncover its real intentions.

It is true that as we develop better AI systems, we’re also developing better ways of understanding and controlling AI systems. For example, reinforcement learning from human feedback, mechanistic interpretability, constitutional AI, and other important techniques have been developed as AI systems have become more powerful. Moreover, since frontier AI models are currently trained on extensive human text, they may be likely to adopt and emulate human values.

Some argue that it will be easy to avoid misalignment risks, given all the techniques and control mechanisms we have at our disposal. (For more discussion, see “AI is easy to control” by Belrose and Pope, and “AI as Normal Technology by Narayanan and Kapoor.) But the developers of these techniques often aren’t confident that they, or other methods on the horizon, will scale up quickly and reliably enough as AI systems get more powerful.

Some approaches to AI safety could even provide superficial hope, while harming our ability to detect misalignment. As mentioned above, OpenAI found that penalising bad behaviour by models expressed in their chains of thought didn’t actually eradicate the behaviour — it just made the model better at concealing its bad intentions from its visible log of ‘thoughts.’

It seems plausible that the first AI systems that are advanced enough to pose serious risks of gaining power won’t be based on current deep learning methods. Some people argue that current methods won’t be able to produce human-level artificial intelligence, which might be what’s required for an AI to successfully disempower us. (AI Impacts has documented some of these arguments.)

And if future power-seeking AIs look very different to current AIs, this could mean that some of our current alignment research might not end up being useful.

We aren’t fully convinced by this argument, though, because:

  • Many critiques of current deep learning methods just haven’t stood the test of time. For example, Yann LeCun claimed in 2022 that deep learning-based models like ChatGPT would never be able to tell you what would happen if you placed an object on a table and then pushed that table — because “there’s no text in the world… that explains this.” But GPT-4 can now walk you through scenarios like this with ease. It’s possible that other critiques will similarly be proved wrong, and that scaling up current methods will produce AI systems which are advanced enough to pose serious risks.
  • We think that powerful AI systems might arrive very soon, possibly before 2030. Even if those systems look quite different from existing AIs, they will likely share at least some key features that are still relevant to our alignment efforts. And we’re more likely to be well-placed to mitigate the risks at that time if we’ve already developed a thriving research community dedicated to working on these problems, even if many of the approaches developed are made obsolete.
  • Even if current deep learning methods become totally irrelevant in the future, there is still work that people can do now that might be useful for safety regardless of what our advanced AI systems actually look like. For example, many of the governance and policy approaches we discussed earlier could help to reduce the chance of deploying any dangerous AI.

Someone could believe there are major risks from power-seeking AI, but be pessimistic about what additional research or policy work will accomplish, and so decide not to focus on it.

However, we’re optimistic that this problem is tractable — and we highlighted earlier that there are many approaches that could help us make progress on it.

We also think that given the stakes, it could make sense for many more people to work on reducing the risks from power-seeking AI, even if you think the chance of success is low. You’d have to think that it was extremely difficult to reduce these risks in order to conclude that it’s better just to let them materialise and the chance of catastrophe play out.

It might just be really, really hard.

Stopping people and computers from running software is already incredibly difficult.

For example, think about how hard it would be to shut down Google’s web services. Google’s data centres have millions of servers over dozens of locations around the world, many of which are running the same sets of code. Google has already spent a fortune building the software that runs on those servers, but once that up‑front investment is paid, keeping everything online is relatively cheap — and the profits keep rolling in. So even if Google could decide to shut down its entire business, it probably wouldn’t.

Or think about how hard it is to get rid of computer viruses that autonomously spread between computers across the world.

Ultimately, we think any dangerous power-seeking AI system will probably be looking for ways to not be turned off — like OpenAI’s o3 model, which sometimes tried to sabotage attempts to shut it down — or to proliferate its software as widely as possible to increase its chances of a successful takeover. And while current AI systems have limited ability to actually pull off these strategies, we expect that more advanced systems will be better at outmanoeuvering humans. This makes it seem unlikely that we’ll be able to solve future problems by just unplugging a single machine.

That said, we absolutely should try to shape the future of AI such that we can ‘unplug’ powerful AI systems.

There may be ways we can develop systems that let us turn them off. But for the moment, we’re not sure how to do that.

Ensuring that we can turn off potentially dangerous AI systems could be a safety measure developed by technical AI safety research, or it could be the result of careful AI governance, such as planning coordinated efforts to stop autonomous software once it’s running.

This was once a common objection to the claim that a misaligned AI could succeed in disempowering humanity. However, it hasn’t stood up to recent developments.

Although it may be possible to ‘sandbox’ an advanced AI — that is, contain it to an environment with no access to the real world until we were very confident it wouldn’t do harm — this is not what AI labs are actually doing with their frontier models.

Today, many AI systems can interact with users and search the internet. Some can even book appointments, order items, and make travel plans on behalf of their users. And sometimes, these AI systems have done harm in the real world — like allegedly encouraging a user to commit suicide.

Ultimately, market incentives to build and deploy AI systems that are as useful as possible in the real world have won out here.

We could push back against this trend by enforcing stricter containment measures for the most powerful AI systems. But this won’t be straightforwardly effective — even if we can convince labs to try to do it.

Firstly, even a single failure — like a security vulnerability, or someone removing the sandbox — could let an AI influence the real world in dangerous ways.

Secondly, as AI systems get more capable, they might also get better at finding ways out of the sandbox (especially if they are good at deception). We’d need to find solutions which scale with increased model intelligence.

This doesn’t mean sandboxing is completely useless — it just means that a strategy of this kind would need to be supported by targeted efforts in both technical safety and governance. And we can’t expect this work to just happen automatically.

For some definitions of ‘truly intelligent’ — for example, if true intelligence includes a deep understanding of morality and a desire to be moral — this would probably be the case.

But if that’s your definition of ‘truly intelligent,’ then it’s not truly intelligent systems that pose a risk. As we argued earlier, it’s systems with long-term goals, situational awareness, and advanced capabilities (relative to current systems and humans) that pose risks to humanity.

With enough situational awareness, an AI system’s excellent understanding of the world may well encompass an excellent understanding of people’s moral beliefs. But that’s not a strong reason to think that such a system would want to act morally.

To see this, consider that when humans learn about other cultures or moral systems, that doesn’t necessarily create a desire to follow their morality. A scholar of the Antebellum South might have a very good understanding of how 19th century slave owners justified themselves as moral, but would be very unlikely to defend slavery.

AI systems with excellent understanding of human morality could be even more dangerous than AIs without such understanding: the AI system could act morally at first as a way to deceive us into thinking that it is safe.

How you can help

Above, we highlighted many approaches to mitigating the risks from power-seeking AI. You can use your career to help make this important work happen.

There are many ways to contribute — and you don’t need to have a technical background.

For example, you could:

  • Work in AI governance and policy to create strong guardrails for frontier models, incentivise efforts to build safer systems, and promote coordination where helpful.
  • Work in technical AI safety research to develop methods, tools, and rigorous tests that help us keep AI systems under control.
  • Do a combination of technical and policy work — for example, we need people in government who can design technical policy solutions, and researchers who can translate between technical concepts and policy frameworks.
  • Become an expert in AI hardware as a way of steering AI progress in safer directions.
  • Work in information and cybersecurity to protect AI-related data and infrastructure from theft or manipulation.
  • Work in operations management to help the organisations tackling these risks to grow and function as effectively as possible.
  • Become an executive assistant to someone who’s doing really important work in this area.
  • Work in communications roles to spread important ideas about the risks from power-seeking AI to decision makers or the public.
  • Work in journalism to shape public discourse on AI progress and its risks, and to help hold companies and regulators to account.
  • Work in forecasting research to help us better predict and respond to these risks.
  • Found a new organisation aimed at reducing the risks from power-seeking AI.
  • Help to build communities of people who are working on this problem.
  • Become a grantmaker to fund promising projects aiming to address this problem.
  • Earn to give, since there are many great organisations in need of funding.

For advice on how you can use your career to help the future of AI go well more broadly, take a look at our summary, which includes tips for gaining the skills that are most in demand and choosing between different career paths.

You can also see our list of organisations doing high impact work to address AI risks.

Want one-on-one advice on pursuing this path?

We think that the risks posed by power-seeking AI systems may be the most pressing problem the world currently faces. If you think you might be a good fit for any of the above career paths that contribute to solving this problem, we’d be especially excited to advise you on next steps, one-on-one.

We can help you consider your options, make connections with others working on reducing risks from AI, and possibly even help you find jobs or funding opportunities — all for free.

APPLY TO SPEAK WITH OUR TEAM

Learn more

We’ve hit you with a lot of further reading throughout this article — here are a few of our favourites:

On The 80,000 Hours Podcast, we have a number of in-depth interviews with people actively working to positively shape the development of artificial intelligence:

If you want to go into much more depth, the AGI safety fundamentals course is a good starting point. There are two tracks to choose from: technical alignment or AI governance. If you have a more technical background, you could try Intro to ML Safety, a course from the Center for AI Safety.

Acknowledgements

We thank Neel Nanda, Ryan Greenblatt, Alex Lawsen, and Arden Koehler for providing feedback on a draft of this article. Benjamin Hilton wrote a previous version of this article, some of which was incorporated here.