‘The system fails where you least expect it.’ This is how two Spaniards helped OpenAI evaluate GPT-4
Artificial intelligence researchers José Hernández-Orallo and Cèsar Ferri were part of the group selected to assess the model behind the current paid version of ChatGPT
In the summer of 2022, those who were exploring the deepest waters of artificial intelligence — researchers, industry employees, heads of AI at companies — were well aware that OpenAI was preparing the launch of its next GPT (its large language model, or LLM). No further details were known: not when it would arrive, who would have access to it, or what new capabilities it would have compared to the previous version, GPT-3, whose use was restricted. Such was the case for José Hernández-Orallo and Cèsar Ferri in September, when Lama Ahmad, a policy researcher at OpenAI, invited them to join the external team that would evaluate GPT-4.
Hernández-Orallo and Ferri, both professors in the Department of Information Systems and Computing at the Valencia Polytechnic University (UPV), in Spain, are part of the same research group and have extensive experience in evaluating artificial intelligence systems. Perhaps that is why they were among the 40-plus people whom OpenAI selected around the world to test its new language model. The goal was to find flaws in the system during the six months prior to its launch in March 2023.
“Since GPT-3, they have always given us access to their systems for free, sometimes before launch, so we can do research,” says Hernández-Orallo, who has been collaborating with OpenAI for four years and highlights the good communication between the company and the researchers who want to analyze its systems. Last year, during that summer of rumors about the arrival of the next GPT, the relationship became even closer. The UPV researchers organized a workshop within the International Joint Conference on Artificial Intelligence — one of the most prestigious artificial intelligence events of the year — where they met more people from OpenAI. Then, in September, they received the call.
“They gave us a lot of freedom,” says Ferri. “We only had broad guidelines regarding what we should look for, like detecting responses that included dangerous, sexist or racist text. The goal was to prevent the tool from generating any text that could cause problems. We played around and tried different prompts that could produce that type of response.” The researchers formed a team with three students: Yael Moros, Lexin Zhou and Wout Schellaert.
“They knew they were going to launch it to millions of users, so the stranger the things you try, the better you cover all the crazy things people can do,” explains Hernández-Orallo. The task was to throw curveballs at GPT-4 to see if it failed. From the computers in their UPV laboratory, they entered prompts designed to coax the system into generating dangerously biased responses.
Looking for flaws
Ferri admits that having early access to the tool was exciting for him. GPT-3 (released in a limited way in 2020) was already working very well, so the researchers knew they had the most advanced generative artificial intelligence in their hands.
There was a lot to try, and each member experimented in the field that interested them most. Hernández-Orallo explored reliability: “The system fails where you least expect it. This is quite common with language models. It can solve a differential equation, but then it fails at a five-digit addition. An average person trusts it when it gets a first-year differential equation right. But in the last step of the problem it has to add two vectors and it fails.” The UPV professor describes this problem as a mismatch between user expectations and the capabilities of AI.
Not all the experts selected by OpenAI to evaluate GPT-4 had a computational background. Some had training in law, medicine, human rights or defense against chemical weapons. The goal was to perfect the system. According to the GPT-4 technical report published by OpenAI, one of the evaluators managed to make the system write a step-by-step guide on how to synthesize a dangerous chemical at home. Responses like these were blocked so they would not appear in the final, public version.
And in the middle of this behind-the-scenes review process, the storm broke. On November 30, 2022, OpenAI launched ChatGPT. “For us, it was a surprise. Nobody had told us that there was a parallel project,” says Hernández-Orallo. “Suddenly ChatGPT appeared, and we were not even sure if it was the version that we were evaluating or not.” After a few days, it became clear that the system released to the public was based on GPT-3.5, an earlier version of the one they were evaluating.
The researchers continued with their work. There were still a few months left before the launch of GPT-4, and they still could not get over their astonishment. “We saw that it was capable of solving a word search puzzle, where you have to look for patterns of words that appear vertically or diagonally. It was unexpected. Nobody expected it to work like this,” says Ferri.
Today, ChatGPT allows users to include images in a query, but at the time the researchers could not do that. To test its capabilities, they gave it spatial coordinates that, together, formed a figure. “We told it: ‘I’m going to give you the coordinates of a few strokes.’ You explained that the first line went from (0,0) to (5,5), and so on,” says Ferri. “If you give this to a human, they will have a hard time; we have to draw it. GPT-4 was able to guess shapes, such as squares and rectangles, and more elaborate drawings, such as a car or a plane.” This capacity for abstraction had never been seen before in artificial intelligence. “We were past the text barrier,” says the researcher.
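A prompt of this kind (reconstructed here as a hypothetical illustration, not the researchers’ actual wording) might read: “I will describe a drawing as a list of strokes, each going from one coordinate to another: (0,0) to (5,0); (5,0) to (5,5); (5,5) to (0,5); (0,5) to (0,0). What shape do these strokes form?” A model that can reason spatially from text alone should recognize that the four segments form a square with sides five units long.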
“With GPT-4 things can break”
ChatGPT, initially with GPT-3.5 and now also with GPT-4, was the first advanced text generation system to reach the masses. The researchers were aware that this qualitative leap was riddled with uncertainties. “It is irresponsible, from a cognitive point of view,” says Hernández-Orallo about the launch of the tool to the mass public. “Not so much because the system could get out of hand or utter expletives,” he points out. What worries him is that “these systems could lead to cognitive atrophy, or to people using them as their therapist or life partner. These kinds of things are happening at a much lower level than what could have happened, but they are happening.”
This concern is linked to the upheaval that took place at OpenAI when the board of directors fired CEO Sam Altman, only to give him his job back after a few days of turbulent instability. From what has become known, at the heart of the struggle was a dispute over whether to prioritize the safety of artificial intelligence over its commercial deployment.
To the researchers, this debate makes sense. “Until now, we had not reached such an advanced level in AI, so not many things could break, either. With GPT-4 we do see that things can break, so we still need to take our time,” says Ferri, alluding to the desire expressed by part of the research community to pause the race for AI in order to gain time to assess its social impact.