Why it is so dangerous for AI to learn how to lie: ‘It will deceive us like the rich’

A new scientific article examines how specific models of artificial intelligence have deceived via manipulation, sycophancy and cheating to achieve their goals

A Meta AI named Cicero tricked its rivals to win in the strategy game Diplomacy. In the photo, Mark Zuckerberg, CEO of Meta, presents his company's new AI projects at the Meta Connect conference in Menlo Park (California) last September. Carlos Barria (REUTERS)
Jordi Pérez Colomé

A poker player has bad cards but makes the biggest bet. The rest of the players are scared off by the bluff and concede victory. A buyer wants to negotiate for a product, but shows no interest. They look at other things first and ask questions. Then, casually, they ask about what they really want in order to get a cheaper price. These two examples are not from humans, but from models made with artificial intelligence (AI).

A new scientific article titled AI deception: A Survey of Examples, Risks, and Potential Solutions, published in the journal Patterns, analyzes known cases of models that have deceived via manipulation, sycophancy and cheating to achieve their goals. Robots are not aware of what they are doing and are only looking for the best way to obtain their objective, but researchers believe that these incipient deceptions do not bode well if legislation does not limit AI options.

“At this point, my biggest fear about AI deception is that a super-intelligent autonomous AI will use its deception capabilities to form an ever-growing coalition of human allies and eventually use this coalition to achieve power, in the long-term pursuit of a mysterious goal that would not be known until after the fact,” says Peter S. Park, a postdoctoral researcher in AI existential safety at the Massachusetts Institute of Technology (MIT) and one of the paper’s lead authors.

Park’s fear is hypothetical, but we have already seen it happen in AI programmed for a game. Meta announced in 2022 that its Cicero model had beaten human rivals at Diplomacy, a strategy game that, in the company’s words, is a mix of Risk, poker and the television show Survivor. As in real diplomacy, one of the resources players have is to lie and pretend. Meta employees noticed that when Cicero lied, its moves were worse, and they programmed it to be more honest. But it was not really honest.

Peter S. Park and his co-authors also tested Cicero’s honesty. “It fell to us to correct Meta’s false claim about Cicero’s supposed honesty that had been published in Science.” The political context of the Diplomacy game involves less risk than real-life contexts, such as elections and military conflicts. But three facts should be kept in mind, says Park: “First, Meta successfully trained its AI to excel in the pursuit of political power, albeit in a game. Second, Meta tried, but failed, to train that AI to be honest. And third, it was up to us outside independent scientists to, long after the fact, disprove Meta’s falsehood that its power-seeking AI was supposedly honest. The combination of these three facts is, in my opinion, sufficient cause for concern.”

How AI lies

Researchers believe there are several ways in which specific AI models have shown that they can deceive effectively: they can manipulate as in Diplomacy, pretend by saying they will do something knowing they will not, bluff as in poker, haggle in negotiations, play dead to avoid detection and trick human reviewers into believing that the AI has done what it was supposed to do when it has not.

Not all types of deception involve this type of knowledge. Sometimes, and unintentionally, AI models are “sycophants” and simply agree with the human users. “Sycophancy could lead to persistent false beliefs in human users. Unlike ordinary errors, sycophantic claims are specifically designed to appeal to the user. When a user encounters these claims, they may be less likely to fact-check their sources. This could result in long-term trends away from accurate belief formation,” states the study.

No one knows for sure how to make these models tell the truth, says Park: “With our current level of scientific understanding, no one can reliably train large language models not to deceive.” What’s more, many engineers in many companies are working on creating different and more powerful models. Not everyone has the same initial interest in their robots being honest: “Some engineers take the risk of AI deception very seriously, to the point of advocating for or implementing AI safety measures. Other engineers do not take it so seriously and believe that applying a trial and error process will be enough to move towards safe and non-lying AI. And there are still others who refuse to even accept that the risk of AI deception exists,” says Park.

Deceiving to gain power

In the article, the researchers compare super-intelligent AI to how the rich aspire to gain more power. “Throughout history, wealthy actors have used deception to increase their power,” reads the study.

Park explains how this may happen: “AI companies are in an uncontrolled race to create a super-intelligent AI that surpasses humans in most of the economically and strategically relevant capabilities. An AI of this type, like the rich, would be expert in carrying out long-term plans in the service of deceptively seeking power over various parts of society, such as influencing politicians with incomplete or false information, financing disinformation in the media or among investigators, and evading responsibility using the laws. Just as money translates into power, many AI capabilities, such as deception, also translate into power.”

But not all academics are as concerned as Park. Michael Rovatsos, professor of Artificial Intelligence at the University of Edinburgh, told SMC Spain that the research is too speculative: “I am not so convinced that the ability to deceive creates a risk of ‘loss of control’ over AI systems, if appropriate rigor is applied to their design; the real problem is that this is not currently the case and systems are released into the market without such safety checks. The discussion of the long-term implications of deceptive capabilities raised in the article is highly speculative and makes many additional assumptions about things that may or may not happen in the future.”

The study says that the solution to curtailing the risks of AI deception is legislation. The European Union assigns each AI system one of four risk levels: unacceptable, high, limited, and minimal (or no) risk. Systems with unacceptable risk are prohibited, while systems with high risk are subject to special requirements. “We argue that AI deception presents a wide range of risks to society, so they should be treated by default as high risk or unacceptable risk,” says Park.
