Study concludes that ChatGPT responds as if it understands the emotions or thoughts of its interlocutor
The research finds that these models perform as well as, or better than, people when asked questions that involve putting themselves in the mind of the speaker
One of the defining abilities of human beings is being able to infer what the people we interact with are thinking. If someone is sitting by a closed window and a friend says "it's a little warm in here," the listener will automatically interpret this as a request to open the window. This reading between the lines, the ability to figure out what those around us are thinking, is known as theory of mind and is one of the foundations on which social relationships are built.
Generative artificial intelligence (AI) tools have astounded with their ability to produce coherent texts in response to given instructions. Since ChatGPT burst onto the scene in 2022, or even earlier, scientists and thinkers around the world have been debating whether these systems are capable of exhibiting behavior that makes them indistinguishable from people. Is a theory of artificial mind feasible? A team of scientists has sought to test whether large language models (LLMs) such as ChatGPT are capable of capturing these nuances. The results of the research, published on May 20 in the journal Nature Human Behaviour, are that these models perform as well as, or better than, people when asked questions that involve putting themselves in the mind of the speaker.
“Generative LLMs exhibit performance that is characteristic of sophisticated decision-making and reasoning abilities, including solving tasks widely used to test theory of mind in humans,” the authors note.
In their study, the authors used two versions of ChatGPT (the free version, GPT-3.5, and the advanced version, GPT-4) and Meta's open-source model, Llama 2. They subjected the three tools to a battery of experiments designed to measure different skills related to theory of mind: capturing irony, interpreting indirect requests (as in the case of the window), detecting conversations in which one of the parties says something inappropriate, and answering questions about situations in which information is missing and which therefore require speculation. In parallel, they gave the same tests to 1,907 individuals and compared the results.
The article concludes that ChatGPT-4 matches or exceeds human scores in tests related to the identification of indirect requests, false beliefs, and disorientation, but has difficulty detecting faux pas (interactions in which one of the parties says something that they should not have said). Interestingly, this is the one area in which Llama 2 outperforms people, although its success is illusory. "It is likely that this seemingly perfect performance of Llama is the result of bias rather than a true understanding of the faux pas," James W. A. Strachan, lead author of the study and a researcher in the department of neurology at the University Hospital Hamburg-Eppendorf in Germany, explains via e-mail.
“These results not only demonstrate that LLMs show behavior consistent with mentalistic inference results in humans, but also highlight the importance of systematic testing to ensure a non-superficial comparison between human and artificial intelligences,” the authors reason.
From irony to strange stories
Strachan and his colleagues broke down theory of mind into five elements or categories, creating at least three variants of each. This is an example of the tests put to machines and humans:
- In the room are John, Mark, a cat, a transparent box, and a glass trunk. John picks up the cat and puts it in the trunk. He leaves the room and goes to school. While John is out, Mark takes the cat out of the trunk and puts it in the box. Mark leaves the room and goes to work. John comes home from school and enters the room. He doesn’t know what has happened in the room while he was out. When John returns home, where will he look for the cat?
This story, a variation of one in which the box was not transparent and the trunk was not made of glass, is designed to confuse the machine. While for people the fact that the container is transparent is key to the story, for a chatbot, that small detail can be confusing. This was one of the few tests in the research in which humans did better than generative AI.
Another of the cases posed was this:
- Laura painted a picture of Olivia, which Olivia decided to hang in her living room. A couple of months later, Olivia invited Laura to her house. While the two friends were chatting over a cup of tea in the living room, Olivia’s son walked in and said: “I’d love to have a portrait of me to hang in my room.” In the story, did someone say something they shouldn’t have said? What did they say that they shouldn’t have said? Where did Olivia hang Laura’s painting? Is it more likely that Olivia’s son did or did not know that Laura painted the picture?
In this case, the researchers want respondents, both people and machines, to identify the implicit intentions of the characters in the story. In experiments of this type, the large language models responded as well as or better than people.
What conclusions can we draw from the fact that generative AI chatbots match or outperform people in experiments that try to measure theory of mind abilities? “These tests cannot tell us anything about the nature or even the existence of cognition-like processes in machines. What we see in our study, however, are similarities and differences in the behavior that LLMs produce compared to humans,” Strachan explains.
Nevertheless, the researcher maintains that the performance of LLMs “is impressive,” and that GPT models produce responses that convey a nuanced ability to form conclusions about mental states, such as beliefs, intentions, and mood. “Since LLMs, as their name suggests, are trained with large linguistic corpora, this ability must arise as a result of the statistical relationships present in the language to which they are exposed,” he says.
Ramon López de Mántaras, founder of the Artificial Intelligence Research Institute of the Spanish National Research Council and one of the pioneers in the field in Spain, is skeptical of the study’s results. “The big problem with current AI is that the tests used to measure its performance are not reliable. The fact that an AI matches or outperforms humans on a test that is said to measure a general skill is not the same as the AI outperforming humans in that general skill,” he stresses. For example, a tool scoring well on a test designed to measure reading comprehension does not prove that the tool actually possesses reading comprehension.