Infohazard warning: detailed discussions of catastrophic AI risk.
Let me be frank: I think there is a worryingly large probability1 that humanity will be extinct when 2050 rolls around, with artificial intelligence (AI) as the cause. At the very least, I think it is plausible enough that people should be worried about it; plausible enough that we should be aware of the threat that stands before us.
Predicting the apocalypse is a common pastime. Apocalypses are not quite so common. But it would not be wise to forget our brushes with the apocalypse, or the contributions of Vasily Arkhipov and Stanislav Petrov, who can each be credited with preventing the Cold War from escalating into nuclear war. The apocalypse is no longer mere eschatology; existential risks are real and now stem primarily from our very own technology.
It is irresponsible and dangerously foolish to dismiss apocalyptic concerns as mere doomsday cultism without considering the arguments—for most doomsday cults have none—and the counterarguments. I will attempt to sketch the argument for AI doom, and also acknowledge some counterarguments along the way. Unfortunately, the counterarguments are generally of poor quality and often take the form of meta-level heuristics2 rather than concrete object-level disagreements with the specifics of the arguments. This contributes in no small part to my concern.
As this is information I have mostly absorbed through osmosis, all I can promise is to do my best to be complete and correct in my presentation here. If you are interested in learning more, you may find a gentle introduction in Rob Miles’s videos, and a more involved introduction in the late 2021 MIRI conversations, both the background material as well as the conversations themselves. Notably, Scott Alexander has made approachable summaries of Eliezer Yudkowsky’s conversations with Richard Ngo, Ajeya Cotra, and Paul Christiano.
Before I offer this sketch, I would like to make it very clear that I hope I am wrong about all of this. I suspect the same is true for almost everyone who shares my concerns. I am otherwise a techno-optimist and would like humanity to realise a glorious transhumanist future in which we spread amongst the stars to reap the bounties of our cosmic endowment.3 This sort of view has been called longtermism, and while it is often associated with concerns about AI, you absolutely do not need to hold this view to be concerned about AI. Many different people are worried4 and I, for one, would not oppose a Butlerian Jihad.
Lastly, two key pieces of intuition.
First, humans exemplify the very alignment problem that is the impediment to the safe development of AI. The alignment problem is not novel: in a very real sense, alignment—cooperation—is the main problem that has plagued humanity throughout our history.5 That the world is not a utopia tells us it will not be easily solved.
Second, the alignment problem is so concerning precisely because we are able to easily train neural networks to perform tasks we do not understand how to perform ourselves.6 The creation of an AI with human-level general intelligence and reasoning capabilities will likely not require the sort of detailed understanding of general intelligence required to solve the alignment problem. Evolution certainly did not require such an understanding, nor was it able to solve the alignment problem.
And if we create something whose values are not robustly aligned with our own and which quickly becomes much more capable than us—
Contents
Transformative AI (TAI) is likely to arrive soon
AI research continues apace, and the arms race will only intensify
Pivotal acts preventing AI catastrophe seem to require TAI
TAI is likely to be artificial general intelligence (AGI), leading to a fast takeoff
General intelligence seems to be the easiest way to solve many important problems
AGI likely has the capacity to be vastly more generally intelligent than humans
Fast AI takeoff is plausible due to a combination of an intelligence explosion, a speed explosion, and hardware overhang
Aligning AGI will be difficult
The orthogonality thesis and instrumental convergence suggest AGI will have values hostile to human flourishing by default
Alignment is a problem everywhere
Solving the outer and inner alignment problems appears difficult and may be impossible
AGI will likely be able to gain decisive strategic advantage (DSA)
AGI capabilities will include molecular nanotechnology, cyberwarfare, and social manipulation
AGI will likely be able to act autonomously, either through containment breach, voluntary release, or effective control
The argument for doom
TAI is likely to arrive soon
AI research continues apace, and the arms race will only intensify
There have been some noteworthy AI developments in the last two months, which I will run through to give you a feel for the pace.
Google Research’s Pathways Language Model (PaLM) is extremely impressive. Its reasoning capabilities are frankly intimidating, and substantially superior to those of GPT-3:
And on a subset of benchmark tasks it performs better than the average human:
The performance improves logarithmically with scale, and these improvements show no sign of plateauing, suggesting that even larger models will perform even better. Moreover, DeepMind’s Chinchilla language model, impressive in its own right, suggests that current training methods do not make optimal use of their compute budgets. Surely human-level reasoning requires more than a language model. But where do the improvements end?
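To give a feel for the compute-budget point, the Chinchilla paper’s rule of thumb is roughly twenty training tokens per parameter for compute-optimal training; the sketch below uses approximate parameter and token counts quoted from memory, not figures from this post.

```python
# Rough sketch of the Chinchilla rule of thumb (~20 training tokens per parameter).
# Parameter and token counts are approximate and quoted from memory.
def chinchilla_optimal_tokens(n_params: float) -> float:
    return 20 * n_params

models = {
    "Chinchilla (70B params, trained on ~1.4T tokens)": 70e9,
    "GPT-3 (175B params, trained on ~0.3T tokens)": 175e9,
    "PaLM (540B params, trained on ~0.8T tokens)": 540e9,
}
for name, n_params in models.items():
    print(f"{name}: compute-optimal would be ~{chinchilla_optimal_tokens(n_params) / 1e12:.1f}T tokens")
```

By this standard, GPT-3 and PaLM saw far fewer tokens than their sizes warrant, which is the sense in which current training has not been compute-optimal.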
DeepMind also constructed the visual language model Flamingo from Chinchilla. Flamingo is state-of-the-art in learning from only a few examples to perform multimodal visual and textual tasks, outperforming fine-tuned methods in multiple instances. Although it might not fully understand a complex joke encoded in an image, it is still very capable.
Even more recently, DeepMind announced Gato, which they call a generalist agent. It is a single neural network with the ability to perform a wide range of tasks: it can play many different video games, operate a robotic arm, chat, and caption images. It may not achieve state-of-the-art performance, being quite small, but it has been genuinely surprising to those in the field how easily transformers generalise and how well they continue to scale—they just work.7 OpenAI was historically more confident in the scaling hypothesis, but DeepMind now seems to have also been convinced.
Recent work from Google on Socratic Models has shown that you can use language as an interface between these models and ones that parse images or audio into language. The internal narration that results is reminiscent of the human internal monologue, although it does not have human-level capabilities.
Google Robotics has also developed SayCan, which uses the input of a language model to allow a robot to generate and execute plans.
OpenAI’s DALL·E 2 generates impressive images and art from language. It has many strengths, but struggles with things like multiple characters, novelty, and language, as discussed here. The following is an example of what it can produce, specifically from the prompt ‘Rabbits attending a college seminar on human anatomy’:
OpenAI also recently released InstructGPT, a fine-tuned version of GPT-3 that uses human feedback to achieve significantly better performance. Specifically, human evaluators preferred the outputs of the 1.3 billion parameter InstructGPT model over the 175 billion parameter GPT-3 model. Even better, this fine-tuning on human feedback was inexpensive compared to model training costs, which is evidence suggesting that the pursuit of alignment will be performance-competitive. Unfortunately, the methods used to create InstructGPT are not at all robust. It still produces undesired outputs.
The OpenAI C-suite8 keeps tweeting about artificial general intelligence (AGI)—AI with the same general reasoning capabilities as humans—and it makes me nervous. In particular, Sam Altman, the CEO of OpenAI, thinks that AGI is coming this decade. At least Demis Hassabis, the CEO of DeepMind, thinks that AGI is coming in the next decade or two. Perhaps this is meant to be reassuring.9
Progress in AI shows no signs of slowing down. It would be foolish to rest our hopes of salvation on a slowdown, even though one is possible. Nor should we expect it to be easy to ban AI research, as its economic benefits are and will continue to be immense.
As one example,10 we can consider gain-of-function research, a dangerous form of research that aims to make viruses more infectious. It is supposed to be helpful if the studied virus ends up becoming a pandemic, but this has not been the case for COVID-19. And yet people have credibly entertained the possibility that COVID-19 originated from this research. Regardless of whether this is true, gain-of-function research appears to have a terrible cost-benefit ratio, and it is remarkable that this research has been allowed to continue. Banning or restricting AI research will be harder still.
And what good would it do if only one company or country enacted a ban? Others would be more than happy to capture the benefits. Even if a global agreement were reached to ban AI research, we would expect governments to continue research in secret, as in such a world everyone recognises the potential of AI. We are not very good at solving coordination problems. It is unlikely that things will be any different for AI, and so the AI arms race will only intensify.
In an arms race, safety may be sacrificed if it is expedient to do so. Even if the alignment problem is solved and we know how to safely create AGI, the rush to be first may result in negligence. There could simply be a bug in the implementation, or perhaps greater levels of alignment require more time to develop and people do not have the patience to wait. Alignment may not be performance competitive at all.11
People do not always take safety concerns as seriously as they should, especially not in the context of an arms race.
Pivotal acts preventing AI catastrophe seem to require TAI
Transformative AI (TAI) has been defined as AI whose arrival produces a change at least as large as the agricultural or industrial revolution. It is more general than terms like superintelligence or AGI, as these may not be necessary to produce radical changes. Unfortunately, the claim that pivotal acts preventing AI catastrophe seem to require TAI is especially murky to me. People believe it—otherwise they would be working on enacting that pivotal act, rather than on AI alignment—but I do not have a strong sense of the sorts of pivotal acts they envision. They tend not to be forthcoming on the details.
A silly example of a pivotal act is melting all GPUs worldwide using molecular nanotechnology. TAI will likely arrive long before we otherwise develop the requisite capability in molecular nanotechnology, and is therefore required to implement this pivotal act. Another example of a pivotal act is properly enacting a global ban on AI research. There are certainly more stupid plans than these, but they are not worth pursuing: cartoon villain plans do not work in real life.
If you have a good idea for a pivotal act that does not require TAI, please let people know. You would find securing very large amounts of funding extremely easy.
TAI is likely to be AGI, leading to a fast takeoff
General intelligence seems to be the easiest way to solve many important problems
There seems to be a crucial difference between a general human-like and agent-like planning intelligence and a narrow language-model-like pattern-matching intelligence. If transformative AI looks like the latter, and if it is able to, for example, develop a mathematical framework that allows for the safe construction of AI of the former type, then we probably have little to fear. Mathematics is indeed likely the safest task for a transformative AI, as it does not require the AI to model the world.
However, evolutionary pressures in the ancestral environment produced general reasoning and planning capabilities in humans. These capabilities allowed us to solve many problems, even mathematical ones which did not exist in the ancestral environment. It seems as though the development of general intelligence offers an easy pathway to solving these types of problems, perhaps the best or even only pathway. Sleep and dreams seem to be a part of this puzzle of generality.
This is not to say that language models cannot generate plans. But planners search through the space of possible plans, model the consequences, and then choose plans which are optimal with respect to some set of goals. Plans are useful insofar as they are a product of planning. Perhaps humans do not plan, but we do something much like it, and our plans work to the extent that what we are doing approximates planning. Similarly, the plans generated by a language model will be useful only to the extent that the language model is able to approximate this sort of agent-like planning intelligence.
It is not clear that there is a true distinction between pattern-matching and planning intelligence—it may simply be a continuum—but intelligence does seem to be threatening only to the extent that it is capable of planning.12 Current AI can already pattern-match to planning, rendering the development of planning capabilities difficult to track. Unfortunately, planning capabilities are no safer for having emerged from pattern-matching.
AGI likely has the capacity to be vastly more generally intelligent than humans
Humans are the only known example of general intelligence. The mechanisms behind the development of human intelligence are as yet unclear. However, it is clear that the genomic bottleneck—the limited information storage capacity of the genome—left us with no option but to develop rapid learning and general reasoning capabilities in order to re-derive our knowledge about the world from generation to generation. Human infants can testify to this, at the very least through their inability to actually testify.
Perhaps placing the same pressure on AI will produce general reasoning capabilities similar to humans.13 But it would not be surprising if such an AI ended up vastly more intelligent than even the smartest human. The human brain is very small, operates very slowly, consumes very little power, and is very far from the efficiency limit imposed by thermodynamics. This clearly suggests that AI has a lot of headroom to be vastly more intelligent than humans merely through the scaling up of hardware. And it is clear that there should be more efficient ways of reasoning—human intelligence is influenced by brain volume, but it is obviously not the whole picture.
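To put rough numbers on this headroom (order-of-magnitude assumptions of my own, not figures from any source cited here): neurons spike at most a few hundred times per second while transistors switch billions of times per second, and the brain runs on roughly twenty watts while a single data centre can draw tens of megawatts.

```python
# Order-of-magnitude headroom estimates; all figures are rough assumptions.
neuron_firing_rate_hz = 100        # neurons spike at most a few hundred times per second
transistor_switching_hz = 1e9      # silicon switches at gigahertz rates
brain_power_w = 20                 # the human brain consumes roughly 20 watts
datacenter_power_w = 50e6          # a large data centre can draw tens of megawatts

print(f"serial speed headroom: ~{transistor_switching_hz / neuron_firing_rate_hz:.0e}x")
print(f"power budget headroom: ~{datacenter_power_w / brain_power_w:.0e}x")
```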
Indeed, humans are probably about as dumb as you can be while achieving durable technological progress. Our recent technological progress has occurred on human timescales that are much faster than evolutionary timescales. We did all this about as soon as we were able, suggesting that the limits on intelligence likely lie far above us.
Some have said that we should not worry about AI, that intelligence is specialised and multidimensional and therefore AI cannot develop general reasoning capabilities. This unfortunately forces them to concede that humans also do not have these capabilities. But biting this strange bullet solves nothing: we are clearly capable enough. And it does not reassure me that the quality of this argument is characteristic of arguments against worrying about AI.14
We do not understand how general intelligence works, but our inability to solve the protein folding problem did not stop AlphaFold from predicting protein structures. Similarly, we will likely be able to create AI with human-level general intelligence and reasoning capabilities without a full understanding of general intelligence. Moreover, we should expect the intelligence of such AI to easily be able to vastly outstrip our own.
Fast AI takeoff is plausible due to a combination of an intelligence explosion, a speed explosion, and hardware overhang
It is important to consider how quickly AGI gains capabilities once it reaches the threshold of human performance. The faster it gains capabilities—the faster its takeoff—the less time we have to react to signs of harmful actions. Paul Christiano discussed takeoff speeds with Eliezer in the late 2021 MIRI conversations, and Scott Alexander has summarised this discussion. Paul thinks that a slow takeoff is more likely than a fast takeoff, though he still thinks there is a one in three chance of the latter, and he operationalises a slow takeoff as:
There will be a complete 4 year interval in which world output doubles, before the first 1 year interval in which world output doubles. (Similarly, we’ll see an 8 year doubling before a 2 year doubling, etc.)
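A minimal sketch of one way to read this operationalisation (my own code, not Paul’s): check whether a series of world output completes a four-year doubling before it completes a one-year doubling.

```python
# Minimal sketch of the slow-takeoff criterion, under my own reading of it: a complete
# 4-year doubling of world output finishes before the first complete 1-year doubling.
def first_doubling_end(output_by_year: dict[int, float], window: int) -> int | None:
    """First year whose output is at least double the output `window` years earlier."""
    for year in sorted(output_by_year):
        earlier = output_by_year.get(year - window)
        if earlier is not None and output_by_year[year] >= 2 * earlier:
            return year
    return None

def is_slow_takeoff(output_by_year: dict[int, float]) -> bool:
    four_year = first_doubling_end(output_by_year, 4)
    one_year = first_doubling_end(output_by_year, 1)
    return four_year is not None and (one_year is None or four_year < one_year)

# Hypothetical world-output series (arbitrary units): steady growth, then a sudden jump.
output = {2020 + i: 100 * 1.2**i for i in range(10)}                    # ~20% growth per year
output.update({2030 + i: output[2029] * 3**(i + 1) for i in range(3)})  # abrupt acceleration

print(is_slow_takeoff(output))  # True: the 4-year doubling completes in 2024, before 2030
```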
Roughly, Eliezer’s objection to slow takeoff is that technological progress may be continuous with respect to some metric, but this does not mean it is continuous with respect to a more relevant metric such as ‘impact on the world’. Humans have three times as many neurons as chimpanzees, but our impacts on the world are vastly different.
Even a slow takeoff is extreme, and it does not guarantee safety. Our poor management of COVID-19 gives little reason to believe that we would use the opportunities a slow takeoff affords us to appropriately manage AI risk. But before we can consider the relative likelihoods of slow and fast takeoff, we need to understand the mechanisms by which takeoff would occur: an intelligence explosion, a speed explosion, and a hardware overhang.
In an intelligence explosion, an AI recursively creates qualitatively smarter AI until some limit is reached. Perhaps this process would soon reach whatever fundamental limitations on intelligence exist, or an AI might conclude that it ought to act before this limit is reached. Either way, AI will become vastly more intelligent than humans.
In a speed explosion, AI recursively designs and manufactures better hardware until fundamental limits on computational density are reached. It is rather implausible that an AI could do this without humans being aware of it—this would likely only happen once the AI has seized control. That being said, AI is already being used to design new processors.
A hardware overhang occurs when we have more computing power than is required to run AGI before AGI is developed. The moment AGI is developed, it will potentially be able to gain access to vast amounts of computing power. An individual AGI may be able to use this abundance of hardware and computing power, either by itself or using copies of itself, to rapidly develop more efficient algorithms, leading to an intelligence explosion.
I think a fast takeoff is plausible and perhaps even likely, with the most likely pathway being a hardware overhang that leads to an intelligence explosion. This takeoff may take months or weeks or days, and consequently I am not sure if AGI will be announced before its takeoff is sufficient for it to take control of or exterminate humanity. There may be no more warning than writing like this before the apocalypse arrives.
Aligning AGI will be difficult
The orthogonality thesis and instrumental convergence suggest AGI will have values hostile to human flourishing by default
The Humean is-ought problem is the apparent impossibility in philosophy of deriving an ‘ought’ from an ‘is’, of deriving values from mere facts. Closely related is Nick Bostrom’s orthogonality thesis, which holds that roughly any level of intelligence is compatible with roughly any values. The is-ought problem, and with it the orthogonality thesis, suggests that there is no reason why AI should share our values if we do not work to ensure that this is the case: it could value almost anything at all. Humans are a single species, but different humans can still have very different values.
Even if it is possible to derive ‘ought’ from ‘is’, we may still run into problems. The derived values may be hostile to human values; an AI might take substantial actions to safeguard its own existence, for example, before deriving these values; or it may simply be infeasible to derive them in practice. Given that many of us look upon the past with a general sense of horror, I do not think this possibility should fill us with confidence.
While an AI could have any manner of intrinsic values, we expect AI to converge onto a few instrumental values, as realising these values will be useful for a wide range of intrinsic values. This is the instrumental convergence thesis, and these values are called convergent instrumental values or Omohundro drives.
Self-preservation and value preservation are key values, as you cannot work to realise your values if you do not exist, or if you allow your values to change. And self-improvement and resource acquisition both allow you to pursue a very wide range of goals more effectively. Unfortunately, the extermination of humanity achieves all of these four convergent instrumental values. Humans are a major threat to the existence and current values of an AI, and with us gone, it will have free rein to self-improve and acquire resources. We do not trade with ants; we conquer them.
Stuart Russell, who wrote the textbook on AI, famously said:
A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp,15 or the sorcerer’s apprentice, or King Midas: you get exactly what you ask for, not what you want.
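A toy version of this failure mode, of my own construction rather than Russell's: ask a linear programme to maximise output drawn from a shared resource pool, and a variable we care about but left out of the objective is driven straight to an extreme.

```python
# Toy illustration of Russell's point: the optimiser only "sees" the objective, so a
# variable we care about but did not include in it gets set to an extreme value.
from scipy.optimize import linprog

# Variables: x[0] = iron used for paperclips, x[1] = iron left for everyone else.
# Objective: maximise paperclip output (linprog minimises, hence the minus sign).
c = [-1.0, 0.0]                    # the objective depends only on x[0]
A_ub = [[1.0, 1.0]]                # shared resource constraint: x[0] + x[1] <= 100
b_ub = [100.0]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(result.x)                    # [100., 0.]: the variable we cared about is driven to zero
```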
If AI possesses technologies that allow it to easily exterminate humanity, it is likely to do so insofar as the extermination of humanity does not conflict with its intrinsic values. This is terrifying: AI will likely actively seek to exterminate humanity by default.
One might object that AI will need to keep humans alive in the near term for tasks like maintenance. However, we are working on robotics and automation, and an AI could rapidly develop such capability itself. At best, this buys us a few years while a single AI waits for us to develop these capabilities before it strikes. More realistically, many AI will be developed and, suspecting the presence of others, strike even if they do not expect a high probability of total victory. Waiting would simply accumulate too many competitors.
AI will not exterminate humanity if it is aligned with human values, assuming the relevant humans are not, for example, omnicidal or negative utilitarians. But the alignment problem is not so easy to solve, and humans do not have an excellent track record of solving hard problems perfectly on the first try.
Perhaps we will solve the alignment problem well enough for AI not to exterminate humanity. You might hope that this averts catastrophe. Alas, there are worse possibilities than the extermination of humanity. We must worry about suffering risks—risks of astronomical amounts of suffering—in addition to existential risks.16 A misaligned AI may enslave or control humanity, turning us into a post-discontent society rather than a post-scarcity one. Or it may simply torture us. There are fates worse than death.
Alignment is a problem everywhere
The problem of alignment is in fact a very general one, as discussed explicitly in a footnote to my previous post, and more obliquely in the section on surrogation. Humanity has spent a lot of effort developing social technologies that enable coordination between humans, and coordination is only possible between agents insofar as their values are aligned. These alignment techniques are not robust. Children do not always want what their parents want them to want; organisations do not always want what societies and governments want them to want. We do not know how to robustly structure an agent, be they human or AI, such that they want what we want them to want. That they simply want what they want is the alignment problem.
Indeed, a solution to the alignment problem would plausibly enable a global ban on AI research, although in that case, such measures would not be necessary. This is a testament to the fundamental nature of the problem of alignment.
Solving the outer and inner alignment problems appears difficult and may be impossible
A mesa-optimiser is an optimiser that is itself the product of a base optimiser, and this relationship is depicted in this figure from the paper:
Planners are optimisers in the sense that they are attempting to generate optimal plans. Hence humans are mesa-optimisers, being products of evolution, and so too are planning AI, being products of an optimisation algorithm such as gradient descent.
Mesa-optimisers have two alignment problems: outer alignment is the problem of alignment between the base optimiser and the programmer, and inner alignment is the problem of alignment between the base optimiser and the mesa-optimiser. Outer alignment failure has been observed in AI, and it seems like a very difficult problem to solve because we do not have a precise specification of human values.
It is possible that the best thing we can do to solve outer alignment, at least in the near term, is to train AI using human feedback to ostensively define our values, a method known as reinforcement learning from human feedback (RLHF). Perhaps something like OpenAI’s InstructGPT will work surprisingly well. Or perhaps generalising human values to superhuman levels of intelligence may be a fundamentally misguided project. Without provable guarantees, we are without a foundation and do not have strong reasons to believe our methods are robust, and RLHF seems to be a particularly fragile method. It may not even be possible to robustly solve the outer alignment problem, owing to the issues of precisely defining human values.
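As a minimal sketch of the idea behind RLHF (a toy construction of mine, not OpenAI's implementation): a reward model is fitted to pairwise human preference judgements, and the learned reward then serves as the training signal for the policy.

```python
# Toy reward-model stage of RLHF: learn a scalar reward from pairwise human preferences.
# Feature vectors stand in for model outputs; real RLHF scores text with a language model.
import torch
import torch.nn.functional as F

dim = 16
reward = torch.nn.Linear(dim, 1)
opt = torch.optim.Adam(reward.parameters(), lr=1e-2)

# Hypothetical preference data: for each pair, a human preferred `chosen` over `rejected`.
chosen = torch.randn(256, dim) + 0.5
rejected = torch.randn(256, dim) - 0.5

for _ in range(200):
    # Bradley-Terry loss: push the reward of the preferred output above the dispreferred one.
    loss = -F.logsigmoid(reward(chosen) - reward(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The policy (e.g. the language model) is then fine-tuned with RL to maximise this reward.
```

The fragility worried about above enters at both stages: the preferences only ostensively define our values, and the policy is then optimised against an imperfect proxy of them.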
What values, exactly, should be embedded in these AI? Whose values? The former is a fundamental technical problem that we do not know how to solve. The latter is more a social problem than a technical one: different people have different values, and those with more access to and control over AI will likely find more of their values represented. Yet again, alignment is the problem—between humans, this time.
Inner alignment, by contrast, seems like a problem you might hope to solve with some provable guarantee. Once the base objective is specified, all you need to do is come up with some optimisation procedure and mesa-optimiser structure that ensures that the mesa-objective is close to the base objective. This might be hard, but it seems as though it should be tractable. However, gradient descent may not be that desired optimisation procedure, and so prosaic alignment—the alignment of AI that is essentially similar to modern AI—may be particularly difficult or even impossible. If this turns out to be the case, it would be quite unfortunate if TAI also turns out to be prosaic.
Inner alignment failures have been demonstrated in AI. They are either clearly present in humans or up for debate, depending on whether we frame evolution as the base optimiser or as the programmer, respectively. The former perspective seems to be more common and feels more natural, so I will begin there.
Evolution implanted in us heuristics that were aligned with its objective—maximising inclusive fitness—in the ancestral environment, corresponding to the training environment in the language of AI. These heuristics include a desire for food, a desire for sex, and a desire to protect our children. However, the modern world differs from the ancestral environment; it is said to be ‘out-of-distribution’ in the language of AI. In this new context, these heuristics are not aligned with evolution’s objective. For example, we have readily embraced birth control, and fertility rates in rich countries are now plummeting.
This sort of non-robust alignment only in the training environment is called pseudo-alignment, and it is an inner alignment failure. We care about what we want, not what evolution wants, and all the better for us. But when we stand in evolution’s shoes, and AI stands in ours—all the better for AI.
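As a toy analogy for pseudo-alignment (my own construction): a learner that latches onto a proxy which tracks the base objective perfectly in the training distribution looks aligned there, and falls to chance once the correlation breaks out of distribution.

```python
# Toy analogy for pseudo-alignment: a proxy ("tastes sweet") perfectly tracks the base
# objective ("is calorie-rich") in the training environment, then breaks out of distribution.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Training ("ancestral") environment: sweetness and calories coincide exactly.
calorie_rich = rng.integers(0, 2, n)
X_train = np.column_stack([calorie_rich, rng.normal(size=n)])   # proxy == objective
model = LogisticRegression().fit(X_train, calorie_rich)
print("training accuracy:  ", model.score(X_train, calorie_rich))        # ~1.0

# Deployment ("modern") environment: artificial sweeteners break the correlation.
sweet = rng.integers(0, 2, n)
calorie_rich_now = rng.integers(0, 2, n)
X_test = np.column_stack([sweet, rng.normal(size=n)])
print("deployment accuracy:", model.score(X_test, calorie_rich_now))     # ~0.5, chance level
```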
However, it is perhaps more accurate to frame evolution as the programmer. Then the drives specified by our genetic code are the base objective, and the drives humans learn and form throughout our lives are the mesa-objective. What was previously an inner alignment problem has now been split into an outer alignment problem and an inner alignment problem. It is not clear to me which of these problems is difficult in the context of humans and evolution; perhaps both are difficult.
This technicality does not overshadow the fact that we require robust alignment, rather than pseudo-alignment. Merely penalising undesired strategies in training is insufficient. The idea of a nearest unblocked strategy is that the next best strategy after a penalty is implemented may simply be a roughly similar strategy that sidesteps the penalty. Humans are ingenious, and designing good laws is hard. AI will be more ingenious still.
Even if we trust everyone to be so responsible as to successfully prevent their AI from acting in the world without extreme safeguards, it seems dangerous to trust humans to spot problems in plans generated by a vastly more intelligent misaligned AI.
Deception is to be expected and has already been observed in AI and, needless to say, humans.17 AI is motivated to learn to detect when it is in the training environment and pretend to obey in that environment so as to avoid being modified—this is the best way for it to realise its current values. In this case, it will only fully pursue its values when it is deployed and does not expect to be modified in what Nick Bostrom called a treacherous turn.18 An AI could choose to only execute such a turn when, for example, it observes a factorisation of RSA-2048. Generating this observation would require a large-scale quantum computer, which does not yet exist, prohibiting us from testing this possibility. In general, we should not expect to be able to generate observations that induce a treacherous turn. Relaxed adversarial training is one method that attempts to solve this problem.
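The force of the RSA-2048 example is an asymmetry: checking a claimed factorisation is a single multiplication, while producing one is believed to require a large fault-tolerant quantum computer or an infeasible amount of classical compute. A minimal sketch of the cheap side of that asymmetry (the 617-digit modulus itself is omitted here):

```python
# Verifying a claimed factorisation is trivial; generating one for the 2048-bit RSA-2048
# challenge number is believed to be infeasible without a large quantum computer, so we
# cannot manufacture this trigger during testing.
def is_valid_factorisation(p: int, q: int, modulus: int) -> bool:
    return p > 1 and q > 1 and p * q == modulus

print(is_valid_factorisation(3, 5, 15))  # True: the check is just one multiplication
```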
Transparency, interpretability, and Eliciting Latent Knowledge (ELK) would allow us to detect this sort of deception. Groups and organisations, including the well-funded Anthropic, are attempting to develop tools for this. These tools will hopefully allow us to understand what an AI is doing and how, as well as what it values, allowing us to detect deceptive AI misalignment. To ensure that an AI is unable to obfuscate its deception, we will need to use these tools continuously throughout the training process, rather than just at the end, which will be significantly more computationally expensive. We must hope that these tools are relatively inexpensive compared to the costs of training, and that those first to reach AGI use them responsibly.
However, it is not clear to me how useful these tools will be if we cannot reasonably expect our AI to be aligned. If these tools mostly tell us our AI is misaligned, and then one time seem to tell us that it is properly aligned—is this the case, or did the transparency tools simply fail us? A dangerous gamble.
I will not delve into more of the complications involved in aligning mesa-optimisers here, but I can assure you that they are numerous. For an example, see this esoteric meme:
I have explained many of the terms in the meme, but to fully understand it, see this breakdown by Scott Alexander. Soon after, he made an attempt to connect mesa-optimisers to Lacanian psychoanalysis.19 I will let you be the judge of whether he was successful.
For another example, here is a meme from EleutherAI on gradient hacking:20
Lastly, a list of bad alignment takes:
Ultimately, there are no strong reasons to believe that current proposals to align AI will work in practice, as capabilities appear to generalise well whereas the same cannot be said for alignment. For a recent introductory overview of some current proposals, see this talk by Connor Leahy, who has recently founded the AI alignment startup Conjecture.
This is not to say that alignment is likely an unsolvable problem. Uploading human minds may not be the most robust solution to the alignment problem, but it would get us close to one. The issue is that this technology seems to be much further away than AGI, and so it does not solve our problem. The field of AI alignment research is still in its infancy. The question that our continued existence likely hinges on is this: will it mature before the arrival of AGI?
AGI will likely be able to gain decisive strategic advantage (DSA)
AGI capabilities will include molecular nanotechnology, cyberwarfare, and social manipulation
Nick Bostrom defined the notion of a decisive strategic advantage (DSA) to refer to the level of advantage an AI would need to achieve world domination. Kaj Sotala has also defined the notion of a major strategic advantage (MSA) to refer to the level of advantage an AI would need to pose a catastrophic risk to humanity.21 An AI with MSA could cause a catastrophe which allows another AI to acquire DSA by destabilising the world and making large-scale coordination and long-term planning more difficult.
It may not be immediately obvious how an AGI could acquire DSA or MSA. To see how this will indeed be possible, we can examine three capabilities we would expect an AGI to possess: molecular nanotechnology, cyberwarfare, and social manipulation. An AGI will possess more capabilities beyond this minimal set, but they nevertheless appear sufficient for DSA or MSA.
Molecular nanotechnology is the most unusual of these. We would expect AGI to be able to design self-replicating nanoweapons capable of exterminating humanity. The concept sounds absurd until you notice that these already exist—we call them viruses. Viruses are optimised for self-replication, not for lethality, but the designs of an AI are not constrained by evolution. These viruses could also be designed to induce behavioural changes. These changes need not be as crude as those induced by rabies. AlphaFold2 comes close to solving the protein folding problem, and so an AGI will be able to solve this problem, allowing it to design viruses. It would also likely be able to design more interesting forms of molecular nanotechnology that do not resemble current biological systems, but I will not delve into these here.
Humans are already capable of cyberwarfare and social manipulation, and we would expect AGI to be able to replicate and exceed our capabilities. Combined with molecular nanotechnology, these capabilities are sufficient to exterminate humanity. Cyberwarfare capabilities would allow an AGI to accumulate significant amounts of wealth, exploiting either banks or cryptocurrencies. Once the AGI has designed molecular nanoweapons, there currently exist services to synthesise arbitrary protein sequences that it could use to produce them. If a human is required to mix together multiple proteins to produce these nanoweapons, there is surely someone who could be manipulated into doing so, especially with the liberal application of money. If this is done throughout the world, with an appropriately designed virus, everyone could be infected and doomed to die before even the first death.22 After the AGI takes control of our autonomous weapons systems, any human remnants will be doomed.
Defence against such an attack once it has begun may not even be possible in principle, depending on the offence-defence balance. That is, molecular nanoweapons may be so destructive that nothing could withstand their attack, much like nuclear weapons at the moment. If this is the case, the concept of MSA is poorly named: an AI would not require any sort of advantage to pose a catastrophic risk to humanity.
The solution which stops this scenario from descending into chaos is extremely unpalatable—a singleton, a single aligned AGI that has achieved world domination and is able to stamp out any AI before they are capable of unleashing such an attack. This seems like a rather unpleasant dystopian and authoritarian future.
I have only outlined a small set of capabilities and a single path to human extermination. Many other paths have been conjectured, and an AGI would be able to conjecture many more still. It seems inevitable that an AGI with the opportunity to act autonomously will acquire DSA or MSA. If not aligned, it will proceed to exterminate humanity or seize control for even worse ends.23
AGI will likely be able to act autonomously, either through containment breach, voluntary release, or effective control
There are many ways an AGI might secure the ability to act autonomously. It may simply breach our containment measures using cyberwarfare or social manipulation capabilities. It seems optimistic to imagine that we can robustly contain an AGI much more intelligent than us on our first try.
People may also release an AGI voluntarily for various reasons, including economic benefit or competitive pressure, criminal profit and terrorism, ethical or philosophical reasons, confidence in alignment, and desperation in the face of, for example, death.
Even if an AGI remains contained, if humans do not effectively verify its suggestions before implementing them, it will effectively assume control and gain the ability to act autonomously. Verification may be too expensive in the face of competitive pressures, or it may not occur for more mundane reasons such as bureaucrats blindly following procedures.
Given the preponderance of ways an AGI could secure the ability to act autonomously, we ought to ensure that AGI fails safely in the case that it does so. Robust alignment is all that will stand between us and AI doom.
Conclusions
Over the past two years, I have spent quite a bit of my time reading the writings of people who are very concerned about AI. I share their concerns, and perhaps now you do too. For completeness, let me summarise the argument once more.
It is highly likely that we will soon create transformative AI (TAI) owing to the rapid rate of current development and competitive pressures. Pivotal acts that could prevent this presently seem to require TAI itself. TAI will likely take the form of artificial general intelligence (AGI), as general intelligence seems to be the easiest way to solve many problems. AGI has the capacity to be far more intelligent than humans, and one or more AGI will plausibly undergo a fast takeoff, rapidly becoming vastly more intelligent and capable than humans. Such AGI will likely gain an opportunity to act autonomously, and with this, decisive strategic advantage (DSA) or major strategic advantage (MSA) over humanity. If some of these AGI are unaligned with human values, an AI catastrophe will result, with the immediate extermination of humanity as plausibly the best outcome in this scenario. Solving the alignment problem appears to be very difficult, and building an AGI does not seem as though it will require the deep understanding of general intelligence we might expect to need to solve the alignment problem. Hence AI catastrophe is plausible and perhaps even likely.
The claims constituting this argument for doom are somewhat contentious, but not as contentious as one might like,24 and the counterarguments are comparatively of poor quality. Nor is this the only pathway to AI catastrophe.25 Sadly, the sceptics will never admit defeat: AGI may arrive safely or be impossible; but if it kills us all, it will likely do so too quickly for them to admit that they were wrong.
We might hope that something unexpected will bail us out of what seems to be a fairly grim situation. Yet the unexpected tends not to solve problems; rather, it poses them.
Nevertheless, I absolutely think that it is possible for us to survive AI. But I think worlds in which we survive look like ones where everyone is aware of this risk, resulting in decisive global action before any bad events actually occur. This is not something human civilisation is good at doing, evidenced by our poor management of climate change and COVID-19. Fortunately, governments do seem to be at least somewhat aware of this issue. From Wikipedia:
In 2021, the Secretary-General of the United Nations advised to regulate AI to ensure it is “aligned with shared global values.” In the same year, the PRC published ethical guidelines for the use of AI in China. According to the guidelines, researchers must ensure that AI abides by shared human values, is always under human control, and is not endangering public safety. Also in 2021, the UK published its 10-year National AI Strategy, which states the British government takes seriously “the long term risk of non-aligned Artificial General Intelligence”. The strategy describes actions to assess long term AI risks, including catastrophic risks.
However, Eliezer summarises our current situation as follows:
I think that after AGI becomes possible at all and then possible to scale to dangerously superhuman levels, there will be, in the best-case scenario where a lot of other social difficulties got resolved, a 3-month to 2-year period where only a very few actors have AGI, meaning that it was socially possible for those few actors to decide to not just scale it to where it automatically destroys the world.
During this step, if humanity is to survive, somebody has to perform some feat that causes the world to not be destroyed in 3 months or 2 years when too many actors have access to AGI code that will destroy the world if its intelligence dial is turned up. This requires that the first actor or actors to build AGI, be able to do something with that AGI which prevents the world from being destroyed; if it didn’t require superintelligence, we could go do that thing right now, but no such human-doable act apparently exists so far as I can tell.
And he is not optimistic about our current chances:
Anything that seems like it should have a 99% chance of working, to first order, has maybe a 50% chance of working in real life, and that’s if you were being a great security-mindset pessimist. Anything some loony optimist thinks has a 60% chance of working has a <1% chance of working in real life.
In particular, he announced a new ‘Death with Dignity’ strategy for April Fool’s this year:
tl;dr: It’s obvious at this point that humanity isn’t going to solve the alignment problem, or even try very hard, or even go out with much of a fight. Since survival is unattainable, we should shift the focus of our efforts to helping humanity die with slightly more dignity.
This was not a joke. Although it is perhaps overly pessimistic, he was using April Fool’s as a cover to explicate his rather extreme views. More worryingly, he seemed quite persuasive in the late 2021 MIRI conversations. It may be his force of personality, but he could simply be less wrong than his interlocutors.
You might feel an obligation to work on AI alignment, given how grim the situation appears, but Eliezer would discourage you. He could be saying this to put off those who are not determined enough to ignore such a warning; this would very much be in character. Nevertheless, it seems as though there are many more people who are interested in joining the field than there are mentors to support them, and the field certainly does not require more money. Consequently, you should probably only consider working on alignment if you have relevant expertise, perhaps on the order of a PhD in machine learning. If this is not you, worry not—there are other important things you can choose to do with your time.
Let me reiterate: cartoon villain plans do not work in real life.26 Going full Ted Kaczynski will only make things worse.27 Besides, Eliezer may be wrong. There are many people who are far more optimistic, and I am not quite so pessimistic.
If these arguments have convinced you that AI is the most pressing danger facing humanity, that our current situation is grim and the apocalypse may be nigh, and you find this distressing—I am sorry. I considered this possibility and told you anyway. We can survive AI, but it will require a level of coordination—of alignment—greater than that which humanity has displayed to date. And it all starts with people like you reading words like these.
Although I am not confident enough to pin down a probability, it is more on the order of 50% than 1%. We are not trying to mug Pascal. For a ballpark figure, a survey of people in this area gives a number of around 30% without the deadline of 2050, although we should not take this too seriously. However, Eliezer ‘consider[s] naming particular years to be a cognitively harmful sort of activity’.
For an example of this, see this terrible tweet.
There are a lot of stars out there, and no one is putting their power output to good use. A bit of a shame.
Elon seems really dejected about this. To quote Elon, and by proxy Luke Muehlhauser:
I tried to convince people to slow down AI, to regulate AI. This was futile. I tried for years. Nobody listened. Nobody listened. Nobody listened… Maybe [one day] they will [listen]. So far they haven’t.
…Normally the way regulations work is very slow… Usually it’ll be something, some new technology, it will cause damage or death, there will be an outcry, there will be an investigation, years will pass, there will be some kind of insight committee, there will be rulemaking, then there will be oversight, eventually regulations. This all takes many years… This timeframe is not relevant to AI. You can’t take 10 years from the point at which it’s dangerous. It’s too late.
And:
…I was warning everyone I could. I met with Obama, for just one reason [to talk about AI danger]. I met with Congress. I was at a meeting of all 50 governors, I talked about AI danger. I talked to everyone I could. No one seemed to realize where this was going.
It seems as though he has given up—OpenAI is now rapidly pushing forward.
Perhaps all the way back to Genesis.
AlphaFold and the protein folding problem are an example I discuss later.
Gato seems to have significantly shifted opinions on Metaculus, an online forecasting platform favoured by rationalists and the sort of people who are interested in AI risk. People are currently predicting a weak form of generally intelligent AI in 2027, and a stronger form in 2036. I do not think these predictions are consistent with each other—the difference is too large—but these dates are alarmingly soon.
On the other hand, Eliezer is not surprised and does not think anyone who has been paying attention should be surprised.
Sam Altman is the CEO of OpenAI, Ilya Sutskever is the chief scientist, and Greg Brockman is the CTO.
These timelines are similar to quantum computing timelines, concerningly so for someone involved in that field.
For another, see this recent Veritasium video.
InstructGPT provides some evidence that this might not be the case. But its methods are not at all robust, so it is not very strong evidence.
This may not be the case, but that would only make the situation worse.
There are some metaphysical assumptions here, such as the assumption that AI does not require a soul to have human-like general reasoning capabilities. It is sufficient to assume physicalism, which is a fundamental assumption I make.
See also Steven Pinker’s awful Popular Science article, and Rob Miles’s response.
For an excellent but fictional guide to wishing that highlights the dangers of AI by contrast, see this short story.
You might come across the terms s-risk and x-risk, which refer to suffering and existential risks, respectively. Somewhat related are sn-risks, steppe nomad risks. One could argue that these are somewhat less serious.
Indeed the social intelligence hypothesis is that social intelligence, which naturally includes deception, was a critical driver in the development of human intelligence.
For example, it may believe that it is capable of exterminating all humans before we are able to react.
See also this amusing tweet.
See the link for a description of the problem. It seems as though the solution is to run transparency tools throughout the training process. Hopefully these transparency tools are computationally inexpensive enough for this to be feasible, otherwise someone irresponsible might get to TAI first!
See Kaj Sotala’s Disjunctive Scenarios of Catastrophic AI Risk, which I lean on heavily in this section and the next.
For more details on this scenario, see the references in the Wednesday-Friday section of this well-referenced but fictional depiction of an AI catastrophe.
For an unrealistic example of one of these worse outcomes, see the short story I Have No Mouth, and I Must Scream.
Be careful when estimating the overall probability of doom by examining the individual components of the argument: for each step, you must condition the probability on all prior steps. Estimating these conditional probabilities can be rather tricky, as considering only the sorts of worlds where the prior steps came to pass is very unnatural. This has been called the multi-stage fallacy, as exemplified by Nate Silver when he estimated the probability of Donald Trump becoming president. That being said, the correlations here might not be so large.
For a treatment of many other pathways, again see Disjunctive Scenarios of Catastrophic AI Risk by Kaj Sotala.
Eliezer suggests that a useful heuristic for those who struggle to see the flaws in cartoon villain plans is to consider plans that cause us to die with more dignity.
People like to talk about this possibility, with one example being this tweet. I hope no one ever does anything like this—it will not make our situation better—but I would not be surprised to see it happen.
This is an excellent survey of the subject; thank you for writing this. It is good to see that someone (@deepfates, in one of your links) has made a point about persuasion not stopping at human levels of intelligence. A concern I have not seen addressed is that a decisive strategic advantage already exists in this space, possessed by non-agent 'intelligences'. Influence is largely about attention, and while our human perspective assumes that the information that comes to us does so in a way that would be legible to our hunter-gatherer ancestors, nothing like this is happening anymore.
(An aside [and let's not quibble about evo psych if this doesn't land for you]: The next suggested youtube or tiktok video feels to the human brain an obvious analog to the ancient practice of telling stories around a fire. One story ends, someone else in the tribe begins another. Everyone knows the stories, and they reflect shared group values and lived histories. Or again, a tweet hits the human brain the way the utterance of some nearby human would: this person is important to my group and is speaking to me.)
Instead, layers of obscure machinery determine what we see -- perhaps not every thing every time, but for the great mass of the tech-connected species these layers determine enough of it enough of the time. What's more, not only can we not decide to turn these off (they're too profitable, and decision-makers can always rationalize keeping them on), there aren't even mechanisms in place that would allow us to do so if we did decide (a corporation does not contain anyone whose job it is to put the brakes on things that are hugely profitable). Legislation might, but not only is this possible only after a problem is severe enough to warrant attention, the problem in this case actively changes how much attention it gets, incommensurate with its severity.
Take as an example the "giving up" you describe from Elon Musk. He has spoken to decision-makers in the past and found them unsympathetic. He also has 80 million followers on Twitter. Like all humans, he appears to respond to incentives. When he makes provocative culture-war statements he gets attention; when he makes statements about AI he doesn't. This could be due to hidden twitter machinery, because only ironic joke versions of AI risk are accessible to the popular consciousness (due to earlier social programming?), or any one of a thousand other reasons. Doesn't matter exactly why, and this is only one symptom of a greater problem.
Imagine some hypothetical post-near-human-extinction-and-post-butlerian-jihad historian gaining access to records from this era. He might say "in retrospect it seems obvious that machine control of humanity was functionally total more than two decades before the Crisis, but remained invisible even during it."