How WhatsApp messages can identify you

A team of researchers has trained an algorithm to extract personal data from anonymous conversations in an experiment that highlights the importance of protecting privacy

What is the extent of our digital footprint? We know about the traces we leave on social networks and when we share content on other websites. But we edit this content according to who is going to see it and the image we want to portray. Instant messaging platforms, such as WhatsApp, which is owned by Facebook, are another matter altogether. “You reveal more about yourself in private messages, not only in the content, but also in the way you use language,” explains Timo Koch, a researcher at Munich University’s department of Psychology.

Koch and his team analyzed more than 300,000 WhatsApp messages and trained an algorithm to recognize the age and gender of their authors – an experiment, he says, that highlights the importance of protecting privacy in these spaces. “End-to-end encryption is an important first step,” says Koch. “But beyond that, we need to be informed – platforms need to be transparent and add labels when information is not encrypted.”

Koch and his research team’s concerns come as social networks increasingly favor the use of private messaging spaces. “Facebook is shifting the focus to these conversations and they will probably want to use the data, so we need to have a conversation about how to protect those messages and make sure that, if they are labeled as private, they really are,” says Koch.

How many messages does it take to identify us? That depends on what part of the process we’re considering. Koch and his team based their algorithm on the contents of What’s up, Deutschland?, a project that studied 451,938 WhatsApp conversations between 495 German volunteers. After filtering out cases where age and gender were not provided and exchanges that were too brief, the researchers were left with 226 subjects, 309,229 messages and 1,949,518 words.
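The filtering step described above can be pictured with a minimal sketch. It is not the researchers' code: the `Volunteer` record, the function names and the `min_words` cut-off are hypothetical, and the article does not say what counted as an exchange that was "too brief".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Volunteer:
    """Hypothetical record for one participant; field names are assumptions."""
    age: Optional[int]
    gender: Optional[str]
    messages: list[str]


def word_count(volunteer: Volunteer) -> int:
    """Count whitespace-separated words across all of a volunteer's messages."""
    return sum(len(message.split()) for message in volunteer.messages)


def filter_volunteers(volunteers: list[Volunteer], min_words: int) -> list[Volunteer]:
    """Keep volunteers who reported age and gender and wrote at least min_words.

    min_words is an illustrative threshold; the article does not state the
    cut-off the researchers used.
    """
    return [
        v for v in volunteers
        if v.age is not None and v.gender is not None and word_count(v) >= min_words
    ]
```

Applied to the real dataset, a filter of this kind is what reduced the 495 volunteers to the 226 subjects the team analyzed.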

Similar studies using social networks as a source of content have based their analyses on large text samples of tens of millions of words contributed by tens of thousands of volunteers. The volume of information in the WhatsApp research is far smaller, but the shortfall is offset by the nature of the information and the more intimate way we express ourselves in these private messaging spaces. “The fact that we have such a small data set and our predictions still work suggests how much more could be done. Our results should be considered a starting point,” says the research team.

Once the algorithm was trained, a sample of about 1,000 words was enough to obtain a reasonably accurate classification of age and gender. To put this figure in context, the researchers counted the number of words in a moderately active conversation between two people: three days of dialogue generally amount to just over 1,000 words. Nevertheless, the team acknowledges that with a larger database the potential of the analysis would be much greater. “If we think about personality analysis or other characteristics, we would need more information because there are more subtle differences involved,” says Koch. “When you have a good model, making a prediction takes less than two seconds.”
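A back-of-the-envelope calculation helps to gauge what 1,000 words means in practice. The sketch below uses only the figures quoted in this article (226 subjects, 309,229 messages, 1,949,518 words) and assumes, purely for illustration, that the dataset-wide average message length is representative of a typical chat.

```python
import math

# Figures quoted in the article for the filtered dataset.
SUBJECTS = 226
MESSAGES = 309_229
WORDS = 1_949_518

# Sample size the article says was enough for a reasonable classification.
SAMPLE_WORDS = 1_000

# Average message length across the whole filtered dataset.
words_per_message = WORDS / MESSAGES

# Average number of messages contributed by each subject.
messages_per_subject = MESSAGES / SUBJECTS

# Messages needed to reach the sample size, assuming average-length messages.
messages_needed = math.ceil(SAMPLE_WORDS / words_per_message)

print(f"{words_per_message:.1f} words per message")        # 6.3
print(f"{messages_per_subject:.0f} messages per subject")  # 1368
print(f"{messages_needed} messages for a 1,000-word sample")  # 159
```

On these averages, about 159 messages add up to 1,000 words, a small fraction of the roughly 1,368 messages each subject contributed.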

Tell me who you are and I’ll tell you how you WhatsApp

This identification is possible because the way we express ourselves on WhatsApp follows demographic patterns. According to the contents of What’s up, Deutschland?, younger users tend to use more emojis and express themselves in the first person more often, a characteristic which has already been observed in the study of content posted on other platforms and which seems to confirm that we become less individualistic with age.

Regarding gender, Koch and his team found greater and more varied use of emojis in women, who also use more pronouns in the first person singular. Men, on the other hand, use more colloquial language and include more references to alcohol consumption.
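The signals mentioned above (how many emojis, how varied they are, how often the first person singular appears) are the kind of thing that can be counted automatically. The following is a rough sketch, not the team's actual feature set: the emoji ranges are an approximation that misses some symbols, the German pronoun list is deliberately incomplete, and the function and feature names are invented for illustration.

```python
import re

# Approximate emoji detection: covers the main pictograph blocks only (assumption).
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

# A small, incomplete list of German first-person-singular forms (assumption).
FIRST_PERSON_SINGULAR = {"ich", "mich", "mir", "mein", "meine"}

# A word is a run of letters; digits, underscores and symbols are excluded.
WORD = re.compile(r"[^\W\d_]+")


def style_features(messages):
    """Return simple per-author style counts from a list of message strings."""
    text = " ".join(messages)
    words = [w.lower() for w in WORD.findall(text)]
    emojis = EMOJI.findall(text)
    n_words = max(len(words), 1)  # avoid division by zero for empty input
    first_person = sum(w in FIRST_PERSON_SINGULAR for w in words)
    return {
        "emoji_per_word": len(emojis) / n_words,
        "distinct_emojis": len(set(emojis)),
        "first_person_rate": first_person / n_words,
    }
```

Numbers like these, computed for each author, are what a classifier would then compare across age groups and genders.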

Koch does not rule out the possibility that the way we express ourselves in these forums has changed slightly over time, since the dataset used in his study was compiled between November 2014 and January 2015. Formats such as stickers, which were incorporated in 2018 despite already being available in other applications such as Line, and direct access to GIFs might have introduced certain variations.

But accessing a larger and more up-to-date body of messages is not easy, at least in the academic world. “A big technology company has access to far more data,” says Koch. Richer and more recent sources of information would allow, for example, more complex analyses of users’ personalities, and would let researchers study how much more sincere we are in private messages than in what we share on social networks across different cultures and national contexts.

Another limitation on analysis beyond English-speaking countries is the language factor. The dominance of English in the development of language processing systems means that most of the tools available are in this language. “We had to train our own models,” says Koch. “Each language is different and has its own signals.”

Given what we know, should we be censoring what we say on private messaging apps? According to Koch, it depends on how much importance we give to privacy and how much to convenience. “There are some good alternatives, such as [instant messaging service] Signal, which is also encrypted and doesn’t have a corporation behind it that has a vested interest in profiting from the information,” he says.

English version by Heather Galloway.
