How Wikipedia is surviving in the age of ChatGPT
Faced with advances in artificial intelligence, volunteers at the great online encyclopedia are tightening their oversight to keep the site from filling up with misinformation and promotional content
There has always been a risk of fake articles appearing on Wikipedia. Here’s just one example: for a time, the page of a Northern Irish radio presenter stated that he had been a promising break-dancer whose career was cut short by a spinal injury. That was just trolling. Sometimes, however, fake information is inserted for promotional purposes or to spread disinformation. Wikipedia has a long tradition of tackling such manipulation, and its committed community of 265,000 active volunteers has kept it under control. But the rise of content generated by artificial intelligence (AI) is posing new challenges.
With more than 16 billion visits per month, Wikipedia’s prestige is beyond question. That also makes it a prime target for inserting disinformation or marketing messages on behalf of companies and individuals. And with AI, credible-sounding texts can be generated at will, with little effort.
Following the launch of ChatGPT, the site expanded its machine learning team. Wikipedia co-founder Jimmy Wales has said that AI is both “an opportunity and a threat.” And the platform’s latest fundraising campaign highlighted its role in the “age of artificial intelligence.”
Miguel Ángel García, a Wikimedia Spain member and former board member of the organization, says he has already come across texts suspected of having been generated with AI. “We have noticed that new editors appear who want to add content, and they add very extensive, highly developed content, which is unusual. When you are a volunteer starting out, you build articles little by little. You go paragraph by paragraph.”
García knows these patterns well. He started contributing to Wikipedia in 2006, when he was in high school. He would correct the occasional spelling mistake or make obvious grammatical changes. He created his first article because he had written a paper about his parents’ village, Campaspera, near the Spanish city of Valladolid. There was no information about this town on the site, so he uploaded his text with photos he had taken himself.
“Since artificial intelligence has existed, more and more volunteers are appearing who hand you a giant text, apparently well-structured and well-developed. But then you read it and discover the redundancies that a person is often able to detect in texts made with artificial intelligence,” says García, referring to stock phrases and a certain way of presenting information, with formulaic introductions and conclusions.
Such texts risk getting lost in an ocean of more than 62 million articles in over 300 languages. Chris Albon, director of Machine Learning at the Wikimedia Foundation, which operates Wikipedia, points out that some volunteers have used AI tools since 2002, especially for repetitive tasks. Technology is nothing new to them. And the key to controlling inappropriate texts lies precisely in the community of volunteers who moderate the content. They not only write texts; they also edit them and decide which ones are not worth keeping.
“In this new era of artificial intelligence, the strength of this human-led model of content moderation is more relevant than ever. Wikipedia’s model, based on debate, consensus and strict rules for citing [sources], has proven resilient in maintaining content quality over the past two decades,” says Albon. All texts must be backed by secondary sources, that is, links to pages on other websites.
Suspicious surge following ChatGPT
If an article has no sources, the community of volunteers detects this and takes action. “In most cases, articles are deleted instantly, because with just two clicks you can see that the text makes no sense at all. If not, they are usually marked for automatic deletion within a maximum of 30 days if the author cannot back up what is written with sources,” explains García.
The Wikimedia Spain member says that when ChatGPT emerged, there was a spike in AI-generated texts uploaded to the site, but the trend has since stabilized thanks to the community’s efforts. For his part, Albon says we have to learn to live with these tools. “Wikipedia’s approach to AI has always been that people edit, improve and audit the work that AI does. Volunteers create the policies for the responsible use of AI tools on Wikipedia and monitor their correct application,” he says. The site does not penalize the use of artificial intelligence as such, but rather texts that do not meet the quality required by its policies.
According to García, the biggest risk for Wikipedia comes from outside the platform itself, since it relies on secondary sources. “I see a medium-term problem with AI-generated texts that become apparently reliable sources in the real world. More and more digital newspapers are emerging that publish almost anything. At some point, people want to cite these pseudo-media outlets as references,” he says.
The solution, like almost everything on the platform, lies with the editors. If volunteers detect that a site is unreliable, the community can decide to blacklist it. This has happened with established media outlets, including the Daily Mail. A few years ago, the British tabloid was banned from being used as a source because it had repeatedly published unverified information.
Wikipedia vs AI chatbots
There is another concern regarding the future of Wikipedia in the era of artificial intelligence. In a hypothetical scenario where chatbots, such as ChatGPT or Google Gemini, resolve user queries with a summary, who will visit Wikipedia articles? And more importantly, who will edit them?
“If there is a disconnect between where knowledge is generated, such as on Wikipedia, and where it is consumed, such as on ChatGPT, we risk losing a generation of volunteers,” Albon reasons.
The link between the sites where knowledge is produced and the AI chatbots that extract and replicate it is also a broader concern. “Without clear attribution and links to the original source from which information was obtained, AI applications risk introducing an unprecedented amount of misinformation into the world. Users will not be able to easily distinguish between accurate information and hallucinations. We have been thinking a lot about this challenge and believe that the solution is attribution,” says Wikimedia’s machine learning director.
The situation is not without irony: applications like ChatGPT and Google Gemini are built on systems trained on Wikipedia content. Part of the knowledge acquired by large language models (LLMs) comes from those millions of articles written and edited by volunteers.