Luciana Benotti, computational linguistics expert: ‘Data extraction for AI is a new form of colonization’

The Argentine researcher claims that tech giants gather free data from the same sources once exploited by colonial powers

Luciana Benotti is a researcher with Argentina's National University of Córdoba. RAMIRO PEREYRA

Mariana Otero

Córdoba (Argentina) - Jan 26, 2024 - 00:06CET

Artificial intelligence (AI) and technology giants use everyone’s personal data, including that of billions of citizens from less developed nations, without providing any compensation or benefits. “These companies contribute very little wealth to the Spanish-speaking community, yet they profit from the data we generate in Spanish, free of charge,” said Luciana Benotti, who earned a doctorate in computer science specializing in computational linguistics. Benotti argues that the unrestricted use of data represents a new form of extractive colonization. “The data is captured from the same places where slaves were once captured, and moved to the same places where slaves were once taken.”

The researcher at Argentina’s National Scientific and Technical Research Council (CONICET) is the first Latin American chair of the executive board of the North American Chapter of the Association for Computational Linguistics (NAACL), an association of 5,000 researchers and language model developers from universities and large technology companies like Google and Meta. Benotti also collaborates with the Vía Libre Foundation for digital rights, and is a member of the steering committee of Khipu, a community of AI researchers and developers in Latin America.

Benotti was the only Latin American academic who participated in the 2023 AI Safety Summit in the United Kingdom. “Latin America was the least represented region at the summit, and Spanish wasn’t one of the seven languages that had simultaneous translation,” she said. Currently, Benotti leads a research team developing a tool to detect social biases in Spanish-language AI models.

Question. How does the Spanish-speaking world contribute to developing AI tools?

Answer. According to data from the Inter-American Development Bank (2020), Latin America and the Caribbean are underrepresented in patents and scientific articles on AI development. The upcoming conference in Singapore on natural language processing (NLP) and language models for AI has over 3,000 registered attendees, with only 23 from Spain and 13 from Spanish-speaking Latin America. A language analysis of scientific articles in the computational linguistics field over the past decade ranks Spanish as the eighth most studied language, lagging behind English, Chinese and German, which represent over 70% of the research.

Q. Does AI speak English?

A. Even when the AI speaks Spanish, it thinks in English or Chinese because most of its training data is in those languages. The AI’s “positionality” [where one is located in relation to their various social identities] is influenced by individuals born in English- and Chinese-speaking countries. Recent research shows that publicly available AI models largely align with white, college-educated, native English speakers from the Northern Hemisphere.

Q. Is this a new form of colonization?

A. Indeed, it represents a modern form of extractive colonization. The data is captured from the same places where slaves were once captured, and moved to the same places where slaves were once taken. While oil, mining and intensive agriculture yield royalties to their owners, data extraction does not. However, it does consume the time of the individuals who generate the data.

Q. How can the Spanish-speaking world play a leading role in this field?

A. Companies don’t sell AI; they rent it, storing the data of their clients (companies and governments) on cloud computing networks. This data usually becomes their property and is used to train new AI models. One way to protect our data or charge for it is to reconsider the moratorium on customs taxes on data leaving Spanish-speaking countries. This would require big tech to pay not only for the hardware but also for the data, which is the raw material. This approach could encourage the creation of companies or institutions that store data in Spanish-speaking territories and leverage it for AI development. But it’s very challenging to compete with big tech these days.

Q. So, it’s important to ensure you position yourself well in the AI race.

A. AI is already impacting the labor market and will have a greater influence in the future. Its effect on employment is an unavoidable issue. Enhancing productivity should directly improve working conditions and employment quality, particularly for vulnerable populations. However, if imported AI is used, it becomes challenging to achieve this. Transforming the labor market requires proactive and effective measures to address unemployment and job insecurity. This is crucial for our communities.

Q. Is the cultural perspective of Latin America present in AI?

A. Companies like OpenAI, Meta, Google and others likely have access to Spanish data, but the extent and specifics are unknown. It’s possible that they use our personal data from sources like WhatsApp and Google apps to develop language models like ChatGPT. These models have evolved into powerful technologies with innovative capabilities. However, cultural contexts shape human behavior, which may not be fully captured in the data used to train NLP models.

Q. So, the models that appear to represent us actually don’t?

A. Current language models like ChatGPT are multilingual and primarily trained on English and Chinese data. Thus, the positionality of ChatGPT is typically that of someone from an English-speaking culture, while Baidu’s Ernie Bot reflects Chinese culture.

Q. Is it fair to say there is little diversity in AI?

A. Yes, it is. While diversity is emphasized at these events [like the 2023 AI Safety Summit], there are no concrete measures in place. Investment to improve diversity is lacking compared to countries that already dominate the field. The global summit had over 100 participants, but Spanish speakers were very underrepresented.

Q. How does the low Latin American representation in AI impact our economies?

A. Big tech AI companies are currently some of the richest in the world, and AI is a major source of economic wealth globally. Six of the eight most valuable companies in the world rely heavily on AI. Unfortunately, very few of these companies contribute to the Spanish-speaking community, despite using our data for free. At the summit, I mentioned the need for representation from the global South, but no one paid much attention. While diversity was discussed in general terms, only China’s Minister of Science addressed the lack of representation in AI from the global South.
