Will we stop typing one day? Advances in speech recognition technology already raise the possibility

Speech-to-text technologies have made huge strides in recent years, but switching from keyboard to dictation to create texts carries other implications

Ana Bulnes

Jul 30, 2024 - 18:44 CEST

The technology already exists: if necessary, this report could have been written without typing, simply by dictating the text to a word processor. However, it is still far from convenient. It would be necessary to go back over the text to correct (and possibly add) punctuation and to change words that have been misunderstood. And, after rereading, it is also likely that the result would have to be generally reworked, since we do not speak in the same way as we write, even if, when dictating, we are thinking that the result will be a written text. These are some of the problems that graphic designer Miriam Inza encountered when she wrote the article “Writing with the mouth: voice dictation as an experiment” for the magazine Inmaterial. In the text, some of the consequences of writing by dictation are evident: the machine sometimes misunderstands or does not detect certain words. “For this article to make sense, for it to really involve the implementation of a type of writing with the mouth, I self-imposed the rule of not correcting what is being written,” Inza said.

“Perhaps one of the areas where [speech-to-text technologies] can still make a huge qualitative leap is in automatic punctuation,” Inza notes, in an email she typed. “At the moment, to write by voice, you have to dictate punctuation marks or, in the case of transcribing an interview, for example, enter them manually. Some tools have automatic punctuation — only in some languages, but work is ongoing on it.” Still, what is missing are just “details: being able to write at the speed at which one speaks without using one’s hands is already the future of the present,” she said.

One of the keys to the huge advances in speech-to-text technologies in recent years has been the arrival of Whisper, the Automatic Speech Recognition (ASR) model that OpenAI released at the end of 2022. The tool is controversial: according to an investigation by The New York Times, OpenAI created Whisper when it ran out of text on the internet with which to feed its AI. With Whisper, the door to the entirety of YouTube was opened to the company, giving it more natural and conversational material with which to train GPT-4, its most advanced language model. This use, however, could have violated YouTube’s rules, not to mention the privacy of the users who appear in the videos (Google, the owner of the online video service, also uses the same material to train its own AI).

Technological wars aside, “Whisper has changed everything,” says José María Fernández Gil, head of the Digital Accessibility Unit at the University of Alicante in Spain. “AI tries to transcribe entire sentences, with their full stops, commas, exclamation marks, questions… And it is not going to make, or would residually make, contextual errors such as ‘gray hair is very comfortable,’ because it has not distinguished between the N and the M [in Spanish, cana (‘gray hair’) versus cama (‘bed’)],” he explains. At the University of Alicante the model has been used to subtitle nearly 1,800 hours of video with “impressive” precision.

As for what still needs to be improved, Fernández Gil points out that there is still a lack of vocabulary and that the model makes mistakes with some acronyms, although “much less so than traditional systems.” However, Whisper’s computational cost is very high, something that is “beyond the reach of most people.”

Another issue that has not yet been resolved is the processing of different accents and dialects, “especially if they are used locally or regionally,” notes Dayana Ribas, scientific director of Business Telecommunication Services (BTS), a telecommunications company that is also using these technologies in various projects. Ribas mentions that transcription also fails when words in different languages are used, a situation “frequent in the daily life of practically bilingual countries, such as Puerto Rico.” The fact that these types of details are still missing is a clear example of the problem of bias, she points out.

There are also pending issues such as the transcription of audio in realistic, everyday scenarios “that present a mix of distortions of various kinds, for example, telephone calls with their ambient noises”; the automatic correction of errors; and the “constant and growing” need to address the issue of security and privacy, adds the expert.

Are we going to move on to writing by dictation?

With the technology already on the verge of becoming a reality, the question arises: will there come a time when the first option when we want to produce a written text is to dictate it to a machine? All the experts interviewed agree that we speak and write differently, so this is something that will always have to be taken into account. Ribas believes that dictation can be practical for more creative tasks or writing drafts, since “it facilitates speed and naturalness in the production and saving of ideas” and we can do it while doing “other semi-automatic things for humans, such as walking or cooking, and it requires less effort.” However, “for generating more precise and concentration-demanding ideas, such as writing a technical report or a novel, sitting down to type is likely to provide adequate time to think and produce ideas with more control,” she adds.

On this point, Inza recalls the French essayist Roland Barthes, who said: “There is a great distance between my head and my hand and I take advantage of it in order to avoid saying the first thing that comes to me.” One of the things Inza noticed in her research on “writing with the mouth” is that it also changes the way in which one speaks. “To write a text with voice dictation, you have to adopt a specific way of dictating,” she said.

It is also quite possible that in all this we will end up seeing a generation gap. Compared to people who are used to typing quickly on a computer keyboard, “the new generations have seen the microphone icon for dictation from an early age and use it a lot,” says Fernández Gil, who gives the example of his niece, who is a teenager and, when she uses her cell phone, “usually prefers to dictate in the applications rather than type.” From what she tells her uncle, this is something generalized among her generation.

On the other hand, a change in the writing instrument will produce texts with different characteristics. Virginia Woolf, for example, complained when she wrote a letter with a typewriter (she tried to avoid doing so) about how the instrument cut and broke sentences that were crystal clear and precious in her head. Related to all this, using AI tools for writing also has an impact: recent research from Harvard University concluded that texts written with the help of predictive writing are “more succinct, more predictable and less colorful” than those that do not use it. There are still no studies on how texts written “by word of mouth” will be.

A revolution for accessibility

Developing speech-to-text technology is not only a step forward in terms of convenience or speed when carrying out certain tasks, but it will also be an option that will help many people. Fernández Gil provides some examples: it will help people with hearing impairment who, thanks to the generalization of automatic subtitles, will be able to “hear [read]” what they cannot hear; it will improve the integration of people from other countries and cultures by combining spoken language recognition with translation; it will aid “people who cannot write well,” as well as making life much easier for those who, due to motor problems, cannot or have difficulty writing using their hands.

For her part, Ribas also highlights the possibilities that it opens up from the point of view of learning, since it “boosts the educational system with tools that make it easier to take notes and study.” It could also have many uses in the field of customer service. In a health center, for example, doctors could better attend to patients while the computer transcribes what they are saying.

When it comes to simply producing a text like this, dictation will be another option. “Having options is always an advantage. The choice of producing a text one way or another will be very personal and will in any case be filtered by the auditory, visual or reproductive characteristics of each person for inspiration or to better fix ideas,” says the scientific director of BTS.

Perhaps the images of writers — which went from depicting them pen in hand to showing them behind a screen — will in a few years become photographs of people walking and talking at the same time. Or maybe not. “Voice dictation technology is having, and will have, a strong positive impact on the various writing endeavors. But just as some of us prefer to handwrite certain things rather than type them on a cell phone or computer, there will also be those of us who find keystrokes more pleasurable than dictation. If only for the pleasure of being able to write in silence,” Inza concludes.
