Get Ready for Generative AI’s Next Assault On Your Privacy

Updated on Jun 24, 2024 by Glyn Moody

In the 15 months since we first warned that generative AI would be terrible for privacy, the potential threats have become clearer. One example of the new problems generative AI brings is the increasing popularity of romantic chatbots, which explicitly request – and often obtain – highly intimate data. Now a new trend is emerging that could be even more problematic than what we’ve seen so far.

It arises from generative AI’s apparently insatiable demand for training data. The more high-quality data a system is trained on, the more likely it is to deliver good output. In their efforts to position themselves as market leaders, companies are searching for new sources of decent material with which to train the large language models (LLMs) that generative AI products are built on.

How LLMs Are Trained

An article in the New York Times explains that until now, the primary AI training sets have been sourced from just a few places. One is web pages collected since 2007; another is Wikipedia; two more are thought to be based on text from millions of published books. Another major dataset consists of popular web pages that have been linked from Reddit. That dataset was initially created as proprietary training data by OpenAI, but a completely free open-source version is available for anyone to download.
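Just how accessible this kind of material is can be seen from that open-source version, widely known as OpenWebText. Below is a minimal sketch of pulling a few documents from it; the Hugging Face datasets library and the Skylion007/openwebtext mirror are our assumptions, not details from the Times article.

```python
from datasets import load_dataset  # Hugging Face "datasets" library

# Stream a few documents from the open-source OpenWebText corpus
# (a community re-creation of OpenAI's Reddit-linked WebText set)
# without downloading the full ~40 GB archive.
dataset = load_dataset("Skylion007/openwebtext", split="train", streaming=True)

for i, record in enumerate(dataset):
    print(record["text"][:200])  # first 200 characters of each document
    if i == 2:
        break
```

Each record is simply the plain text of a web page that was once popular enough to be linked from Reddit.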

The big companies in the generative AI space are now looking for untapped sources of high-quality material to train their LLMs. For example, the New York Times reported that Meta discussed buying an entire publishing house, effectively acquiring a large number of texts for future LLM training. Another approach being considered is paying people directly to train generative AI systems, either by interacting with an AI chatbot to improve its quality or by simply writing new material for it.

The New Threat to Privacy: User-Generated Material

There’s a more important issue at play, though. Last year, Zoom updated its terms and conditions in a way that seemed to grant the company the right to train AI systems on material generated by its users. After a public outcry, the company quickly “clarified” matters by assuring customers that it would not use audio, video, or chat data to train its artificial intelligence models without customers’ permission. In the future, however, the company could make granting that permission a condition of use.

We’re starting to see this approach from other companies, too. For example, just a month after Zoom’s AI blunder, Meta announced its generative AI features, explaining:

Generative AI models take a large amount of data to effectively train, so a combination of sources are used for training, including information that’s publicly available online, licensed data and information from Meta’s products and services. For publicly available online information, we filtered the dataset to exclude certain websites that commonly share personal information. Publicly shared posts from Instagram and Facebook – including photos and text – were part of the data used to train the generative AI models underlying the features we announced at Connect. We didn’t train these models using people’s private posts. We also do not use the content of your private messages with friends and family to train our AIs. 

In other words, private posts are excluded from the training set, but public ones aren’t – even though they might contain a great deal of highly personal data. However, following a request from the Irish Data Protection Commission, Meta has announced a “delay” in training its LLMs using public content shared by adults on Facebook and Instagram in Europe.

Google also aims to use publicly available information from its users in the same way. In July 2023, it updated its privacy policy to allow the use of user data “to help train Google’s AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities.” Google has also been busy signing deals with other services to add new user-generated material to its training sets. For example, in February, Google announced an expanded partnership with Reddit:

Google now has access to Reddit’s Data API, which delivers real-time, structured, unique content from their large and dynamic platform. With the Reddit Data API, Google will now have efficient and structured access to fresher information, as well as enhanced signals that will help us better understand Reddit content and display, train on, and otherwise use it in the most accurate and relevant ways. This expanded partnership does not change Google’s use of publicly available, crawlable content for indexing, training, or display in Google products.
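The quote refers to Reddit’s partner-level Data API, but the kind of structured, fresh content it delivers is easy to picture from Reddit’s ordinary public JSON endpoints. The sketch below is illustrative only; the subreddit, parameters, and User-Agent string are our assumptions, and the partner API Google uses is a different, authenticated service.

```python
import requests

# Fetch today's top posts from one subreddit via Reddit's public
# JSON listing endpoint; Reddit expects a descriptive User-Agent.
response = requests.get(
    "https://www.reddit.com/r/technology/top.json",
    params={"limit": 5, "t": "day"},
    headers={"User-Agent": "structured-content-demo/0.1"},
    timeout=10,
)
response.raise_for_status()

for child in response.json()["data"]["children"]:
    post = child["data"]
    print(post["title"], "-", post["url"])
```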

An article in the New York Times reveals that Google has been transcribing YouTube videos so that it can use the transcripts for training purposes. Google is not the only company taking this route: OpenAI has created a speech recognition tool called Whisper that lets it transcribe audio from YouTube videos.
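Whisper is open source, so turning speech into training-ready text takes only a few lines. A minimal sketch, assuming the openai-whisper Python package and an audio file already saved locally (extracting the audio from a video in the first place is a separate step):

```python
import whisper  # the open-source "openai-whisper" package

# Load a small pretrained model and transcribe a local audio file.
# "talk.mp3" is a placeholder filename of our own choosing.
model = whisper.load_model("base")
result = model.transcribe("talk.mp3")
print(result["text"])
```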

Other companies with large stores of user-generated content are realizing how valuable their assets could be. For example, Automattic is planning to sell “public content that’s hosted on WordPress.com and Tumblr” to AI companies. Video company Vimeo is considering doing the same, and is currently conducting a survey about its users’ views on such a move.

Can You Avoid It?

There are already articles on how to stop your online posts from being used to train AI, but as one article in WIRED points out, it may be too late: “Many companies building AI have already scraped the web, so anything you’ve posted is probably already in their systems.” Given how much user-generated content some services store and how valuable it could be, companies may start requiring users to consent to their words, images, videos, and audio files being licensed to generative AI companies for training purposes. That would be a terrible setback for the online privacy of billions of people.
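For site owners, one commonly suggested safeguard against future scraping is blocking AI crawlers in robots.txt, using the crawler tokens the companies themselves have documented, such as OpenAI’s GPTBot and Google’s Google-Extended:

```
# Block OpenAI's AI-training crawler
User-agent: GPTBot
Disallow: /

# Block Google's AI-training crawler (normal Search indexing is unaffected)
User-agent: Google-Extended
Disallow: /
```

Even this only governs well-behaved crawlers from now on; it does nothing about content that has already been scraped, which is precisely the WIRED article’s point.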

The fightback has already begun, in the EU at least. Privacy activist Max Schrems’s noyb organization has asked the Austrian data protection authority (DSB) to investigate OpenAI’s data processing and the measures it has taken to govern how personal data is handled in its LLM training sets. Noyb wants the DSB to order OpenAI to bring its processing in line with the EU’s GDPR privacy law and to impose a fine on the company “to ensure future compliance.” If successful, this complaint is likely to affect most AI companies working with large language models in the EU.