Spanish researchers discover the trick AI uses to get such good grades: ‘It’s true kryptonite for the models’

Elon Musk has announced Grok 3 from his company xAI, and it is already being touted as the best chatbot yet. But new research shows that the tests behind such claims have major limitations

Elon Musk in Washington on February 13. Nathan Howard (REUTERS)

“Grok 3 is the world’s smartest AI,” the Grok X account posted on Tuesday. Elon Musk, owner of the company that develops it, xAI, spent the whole day repeating messages about how Grok is “the best chatbot in the world.” Hours earlier, OpenAI co-founder Sam Altman had written: “Trying GPT-4.5 has been much more of a ‘feel the AGI’ moment among high-taste testers than I expected!”

Many of these claims are pure marketing. AI chatbots are an extremely competitive field today, and claiming to be the best attracts a lot of investment. But there are also a handful of benchmarks that serve as evidence of which AI models perform best on the same tests. If you’re not at the top of those rankings, you’re a nobody.

“The numbers from Grok 3 at launch are a perfect example of the problems with current evaluation methods,” says Julio Gonzalo, professor of computer languages and systems at Spain’s National Distance Education University (UNED). “With so much competitive pressure, too much attention is paid to the benchmarks, and it is easy for companies to manipulate them, so we cannot trust the numbers they report.” Together with two other Spanish researchers, Gonzalo has tested a simple but unforgiving trick to check the reliability of some of the most prominent of these tests. The basic objective was to find out whether the models read and answer like any other student, or whether they simply look up the answer in the huge body of data used to train them.

The result is that they are still, above all, the most prodigious cramming machines ever devised: “In their first training phase, in which they learn the language, the procedure is a trawl: they read, essentially, all the content available online. Therefore, the developers know that the probability that they have seen the answer to an exam somewhere online is very high,” explains Gonzalo.

How to trip up the models

What detail did they change in the experiment to fool the models? The researchers replaced the correct answer with a generic one that reads: “None of the others.” In this way, the model has to understand the question and reason through it, not just retrieve the most probable answer from its memory. “The correct answer has a vocabulary completely disconnected from the question, which forces it to reason about each of the other possible answers and discard them; it is a much more demanding variation,” says Gonzalo. “It’s true kryptonite for the models,” he adds.
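To make the idea concrete, here is a minimal sketch in Python, assuming a hypothetical question/options/correct-index item format rather than any real benchmark’s files, and not the researchers’ own code: the correct option is removed, “None of the others” takes its place, and the options are shuffled so the model cannot rely on position or on word overlap with the question.

```python
# Illustrative sketch only (not the researchers' code): rewriting a
# hypothetical multiple-choice item so the correct option becomes
# "None of the others", removing any vocabulary overlap between the
# question and the right answer.
import random

def none_of_the_others_variant(question, options, correct_index, seed=0):
    """Return (question, new_options, new_correct_index) for the modified item."""
    variant = list(options)
    variant[correct_index] = "None of the others"  # the real answer disappears
    random.Random(seed).shuffle(variant)           # avoid positional cues
    return question, variant, variant.index("None of the others")

# Hypothetical example item, just to show the transformation:
q, opts, gold = none_of_the_others_variant(
    "Which planet is closest to the Sun?",
    ["Mercury", "Venus", "Earth", "Mars"],
    correct_index=0,
)
print(q)
print(opts, "-> correct choice:", gold)
```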

“The results show that all models lose accuracy significantly with our proposed variation, with an average drop of 57% and 50% [on two traditional benchmarks], and ranging from 10% to 93% depending on the model,” the researchers write in their paper.

Variations of this kind had been tried before, mostly by rewording the questions, but it was this change to the answers that produced the clearest results. “This simple change suddenly lifts the veil on benchmark experimentation and allows us to see the real progress in the approximate reasoning capabilities of the systems, without the noise produced by memorization,” says Gonzalo.

None of this suggests that AI is suddenly useless, but it does show that its reasoning ability was inflated and that it is evolving more slowly than marketing departments and hype merchants claim: “Our results show that chatbots, in general, continue to apply a kind of intuitive reasoning and have a poor capacity for generalization,” says Gonzalo. “In other words, they continue to answer by hearsay, intuitively, and they are still, in essence, the ultimate know-it-all brothers-in-law who have read everything but have assimilated nothing.”

The debate about the limitations of benchmarks is more widespread than it seems. Just this Tuesday, one of the leading AI educators, Ethan Mollick, called for more reliable tests.

A few weeks ago, another test, Humanity’s Last Exam, was released, which, again, the models seem to be passing quickly, faster than expected. Its questions are harder, at a doctoral level, with answers that cannot be found online. An added problem with this test is that the grader is another model, OpenAI’s o3-mini. It does not seem to be the solution to the measurement problems either: “It is much more important to design the exams well, so that the results are interpretable, than to invent more difficult exams, as if the chatbots already had the level of graduates and had to be asked to write a doctoral thesis,” says Gonzalo.

The difference between languages is also substantial. These models get better marks in English. The researchers tested them in Spanish for comparison, and the results were worse. In languages with fewer speakers, the results are likely to be weaker still: “We have done this work within the Odesia project, an agreement between Red.es and UNED to measure the gap between English and Spanish in AI,” says Gonzalo. “We have detected a very clear trend: the worse the model (in general, those that are artificial brains with fewer neurons), the more noticeable the difference between Spanish and English.” This gap matters more than it seems, because small models can be installed locally on devices, which guarantees the privacy of the data. “That means ending up with models that work much worse in Spanish than ChatGPT or Claude,” adds Gonzalo.

None of this means that AI models have hit a hard ceiling. Pure language models do seem to have a limit, but the new reasoning models are more capable than their predecessors. “For example, o3-mini, although its performance drops significantly, is the only one that manages to pass [one of the benchmarks]. New techniques are being sought to go beyond the performance of language models,” says Gonzalo. In the researchers’ tests, alongside o3-mini, the only model that barely passes, the one that holds up best is DeepSeek R1-70b, because its performance drops less than the rest under the new test.
