José Hernández-Orallo, AI expert: ‘We can’t use human measures to evaluate artificial intelligence’

The researcher joined 15 other scientists in calling for more transparent models

Photo: José Hernández-Orallo, an expert in artificial intelligence evaluation, at the Technical University of Valencia (Spain). Credit: María Fabra

José Hernández-Orallo won his first computer in a raffle when he was 10. “It was a Spectrum. My brother had a subscription to a computer encyclopedia series, and once you had every issue, they entered you in a raffle,” he recalls. The brothers won. “We played games like other kids, but we also learned to program. We had complete control of the computer; it wasn’t locked down like personal computers today.” Hernández-Orallo now has a doctorate and is a professor at Spain’s Technical University of Valencia (UPV). He’s an expert in the evaluation of artificial intelligence (AI) and recently co-signed a letter published in Science calling for a rethink of how evaluation results are reported in AI, so that the field can move toward more transparent models whose effectiveness and capabilities are disclosed.

Question. What do you think of Geoffrey Hinton’s decision to leave his job at Google to focus on raising public awareness and speak freely about potential risks of artificial intelligence?

Answer. I think it’s very reasonable, but I’m a little surprised that he waited this long. We have been saying much the same thing for a while at the University of Cambridge’s Leverhulme Centre for the Future of Intelligence and the Centre for the Study of Existential Risk (U.K.). I believe Hinton has made similar statements before, but perhaps not as clearly or as loudly. I’m also surprised that Hinton is just now recognizing that artificial and natural systems are very different, apart from the obvious differences in scale and multiplicity (they can replicate, communicate, and update much faster than humans). What works for one (capabilities, evaluation, control, ethics, etc.) doesn’t necessarily work for the other. But it’s good that such a prominent scientist in the field is speaking out about this. There is a very high level of agreement in the field about the risks of AI, even if we may differ on priorities. For example, I don’t think artificially generated material (text, images and video) is that problematic, because arousing skepticism and forcing us to check sources is healthy. I’m more concerned about solutions to the “alignment problem” that allows certain countries, political and religious groups to use AI to further their interests and ideology, or to censor AI systems. This singular AI alignment [which aims to steer these systems towards a single goal, preference, or ethical principle] reminds me of some of humanity’s darkest periods.

Q. How did you enter the field of artificial intelligence?

A. We had another encyclopedia at home — it was about human evolution. I was fascinated by intelligence and wanted to understand how it evolved. I also read philosophy books. I studied computer science because that was what my brother was studying, although artificial intelligence was only half of one course back then. Then I did my doctoral thesis at the University of Valencia’s (Spain) Department of Logic and Philosophy of Science, which had a program more oriented toward the philosophy of artificial intelligence. I was fascinated by this subject and really had no other options because we didn’t have the means to do something else. That year, I did work I enjoyed, wrote a book and fulfilled my social service. Sometimes you don’t choose — it’s the other way around. Ultimately, I focused on a pursuit I’ve always liked — understanding intelligence, both natural and artificial.

Q. How do you evaluate artificial intelligence systems?

A. We know what bikes and kitchen robots are for and what they can do. We evaluate them from a quality perspective. Until recently, that’s the way AI systems were developed. If their purpose was to classify cats and dogs, they were developed to do that as perfectly as possible. They were task-oriented systems. If you know how to evaluate something, you can determine how well it performs a certain task and how many mistakes it makes. But those types of AI systems are very different from systems like GPT-4, which have cognitive capabilities.

Q. What do those systems look like now?

A. A system is good if it works for you, if it meets your expectations, and if it doesn’t unexpectedly produce poor results. AI systems are usually general-purpose systems. You must give them instructions to produce the desired output. They’re pretty good, but they’re not human beings. Problems start when you think they’re going to react like a person. Their answers sound confident, which makes you believe they’re correct. I’m not saying that humans always answer correctly, but we are accustomed to judging the reliability of people. These systems don’t have the intuition we use when interacting with other people.

Q. How do you improve the evaluation of general-purpose tools that can do so many things?

A. Well, the approach that’s been tried is called competency-based assessment, not task-based assessment. It has a long tradition and a scientific basis. But many people apply the same tests used to evaluate humans to AI systems. They’re just not meant for machines. It’s like using a weather thermometer to take body temperature — it’s not going to work.

Q. But is there a way to evaluate the capabilities of artificial intelligence?

A. That’s what we’re trying to develop. For example, GPT-4 can take all sorts of tests, like academic and university admission tests, chemistry, physics, language, a bit of everything. Trying to compare those test results with humans so we can declare that GPT-4 was in the 70th percentile makes little sense. It may be an indicator, but that doesn’t mean that its score was better than 70% of the people who took the same test. When comparing these instruments with humans, you make a lot of assumptions, like the assumption that it’s capable of serving you coffee, for example... Show me a system that can serve you coffee.

Q. So there is no way to evaluate them?

A. We can’t measure how they work on a task-by-task basis because we would never finish. To evaluate a system like this, we have to extract indicators that enable us to extrapolate how the system will work in the future. The indicator for an AI system could be its capabilities, but the goal is not to produce a number or a score. We should be able to compare humans and AI systems, but it’s being done poorly right now. It is a very complex problem, but I’m hopeful. Where we are now is comparable to physics in the 15th or 16th century. Everything is very confusing right now, but we have to disrupt paradigms. The end goal, whether it takes decades or centuries, is to derive a series of universal indicators that can be applied not only to humans and artificial intelligence but also to other animals.

Q. You know that sounds scary, right?

A. We are an evolving species that lives on an evolving planet. Humans are only one form of intelligence among many. Sometimes we think we are sublime, but we are what we are because of a great deal of evolutionary chance. Bonobos [chimpanzees] are the closest species to humans, but there is still an enormous gap between us because we have language. We think humans are the highest form of natural life, but that’s not necessarily so. Artificial intelligence makes us question this. Most people agree we shouldn’t fool around with creating new species. But we’re fooling around with AI and when you do that, you can get burned. We are reaching levels of sophistication that must be taken seriously. It’s fascinating; it’s like creating a whole new world.

Q. You and the others who signed the letter in Science are proposing a roadmap for AI models, in which results are presented in more nuanced ways and instance-by-instance evaluation results are made publicly available.

A. Yes. We need higher levels of scrutiny. We can evaluate other systems by looking at the training data and running algorithms and code. But I can’t do that with these systems because of the computational and energy costs.

Q. So, how can they be more transparent?

A. You can make the process transparent. We are asking for more detailed evaluation results for each instance. If there are a million instances or sample cases, I want to see the results for each one because I cannot reproduce them on my own. I don’t have access to the computing resources needed to do that. This limitation hampers a basic aspect of scientific validation: peer review. We don’t have access to or insight into the system’s failures.

Q. Is regulation a solution?

A. Regulation is necessary, but it has to be done right. There are consequences to a lack of regulation. If you don’t regulate aviation, accidents happen, people lose confidence and the industry suffers. If something terrible happens [with AI], our society might turn against these systems. There will be much lower adoption of these tools over the medium and long term, even though they are generally positive for society. Regulation is necessary, but it can’t be too restrictive. Some people are afraid of flying even though they know aviation regulation is very strict and flying is one of the safest means of transport. Businesses know that in the long term, regulation is beneficial for them.

Q. Can there be a single global regulation?

A. We have an International Atomic Energy Agency and certain agreements on recombinant DNA. But the world can’t agree on genetically modified foods, which we are consuming in Europe even though we’re not allowed to make them here. That could happen with AI. The European Union’s [AI] regulation may be flawed, but we have to take the plunge and implement it.

Q. Do you think this regulation should be strict or lax?

A. I think it has to be tailored appropriately. It should be strict with big companies, but a little more lax with small ones. You can’t hold Google to the same standard as a startup run by four college kids because you’ll kill innovation.

Q. Is there another gap between regulation and science?

A. Artificial intelligence is moving quickly, and you can’t foresee everything. It’s difficult to regulate something that’s so transdisciplinary and so cognitive. We are moving slowly, but we are also late [on regulating] social networks and centuries late with tobacco.

Q. Would regulation shed some light on how these [AI] black boxes work?

A. A black box system is one that doesn’t reveal how it works. You need to do a lot of evaluation to really understand one and how it fails. To evaluate students, we don’t give them a scan — we give them tests. To evaluate a car, we usually want to know how it handles a curve and what safety tests were conducted, not how many spark plugs it has. That’s why the evaluation issue is critical. We want to test these systems to determine where they can be safely used. That’s how cars and airplanes are evaluated.

Q. Why does artificial intelligence cause so much anxiety?

A. There are public awareness efforts underway, but their goal is not to explain how AI works. OpenAI has been criticized for giving hundreds of millions of people, including children and people with mental health problems, access to the most powerful artificial intelligence system. The access comes with a disclaimer absolving OpenAI of responsibility for how the system is used. That’s the culture we have today. We download apps and nobody takes responsibility. Perhaps they [OpenAI] thought that if they didn’t release it, then they wouldn’t be able to discover all the risks. But you can do pilot tests. They say they offer different access levels, but it’s really all about the AI race: they’re challenging Google’s internet search business. And people are afraid because a few players dominate the industry; it’s an oligopoly.
