Chatbot Privacy Is Possible with Open Source Models, but Big Tech Might Not Like It

Posted on May 11, 2023 by Glyn Moody

Back in February, the PIA blog was one of the first to point out that chatbots might indeed be an exciting breakthrough for artificial intelligence (AI), but that they also represent a huge new threat to privacy. And sure enough, a couple of months later, data protection authorities around the world started taking an interest and raising the alarm.

In the vanguard of this scrutiny was Italy’s data protection agency, “Il Garante per la protezione dei dati personali”. It shut down the use of ChatGPT in the country precisely because of concerns that it violated the EU’s General Data Protection Regulation (GDPR). It gave the company behind ChatGPT, OpenAI, a list of changes it wanted to see, and a deadline for making them.

While OpenAI complied with the request, the bigger problem of proprietary chatbot systems logging your personal information remains. The best solution seems to be simple, yet potentially harmful to the biggest players in the field: open source chatbots.

Can Proprietary AI Systems Actually Remove Your Personal Information?

Italy has rescinded its ban on ChatGPT after OpenAI modified how it operates. The company agreed to give more information about how it processes personal data, and updated its sign-up page in Italy to require users to give their date of birth. This is designed to prevent those under 13 years old from using the system.

OpenAI has also created a form for those who wish to object to the processing of their personal data. But one of the requirements is that people need to provide evidence that their data is being used:

Please provide any relevant prompts that resulted in the model mentioning the data subject. To be able to properly address your requests, we need clear evidence that the model has knowledge of the data subject conditioned on the prompts.

In other words, you need to interact with ChatGPT – and therefore sign up with OpenAI – just to gather the evidence required to object to its processing of your data, which seems unsatisfactory.

It is not clear how completely OpenAI can remove personal data even if it wants to. Its models do not store such data directly. Instead, the information is encoded across the vast set of parameters learned during training. That means all kinds of personal information is “baked in” to those models in a subtle, non-obvious way. There are no precedents for trying to delete specific personal data from these huge and inscrutable models, so it remains to be seen whether and to what extent it can be done.
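To make that point concrete, here is a rough sketch of what “no records to delete” means in practice. It uses the small, openly available GPT-2 model and the Hugging Face transformers library purely as stand-ins (OpenAI’s production models are far larger and not publicly available), but the principle is the same: a trained model is a mass of numeric weights, not a database that can be searched.

    # pip install torch transformers
    from transformers import AutoModelForCausalLM

    # Load a small open model as an illustrative stand-in.
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Count the parameters: roughly 124 million floating-point numbers for GPT-2.
    total = sum(p.numel() for p in model.parameters())
    print(f"{total:,} parameters")

    # Inspect a few of them: just opaque numbers, with no names, addresses,
    # or other records to look up. Anything the model "knows" about a person
    # is spread across these weights in a non-obvious way.
    print(next(model.parameters()).flatten()[:5])

Erasing a specific person from such a model would mean either retraining it without their data or applying still-experimental “machine unlearning” techniques, which is why regulators’ deletion requests are so hard to satisfy.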

Fortunately, while data protection authorities around the world have been fretting about the privacy implications, a major technical development has taken place that could solve many of the problems they have been discovering. There are now open source chatbots and generative AI systems that are as powerful as commercial systems costing thousands of times more. Moreover, these new systems are extremely compact, as this blog post by Christopher S. Penn explains:

There are even projects to put these models on your laptop as private chat instances, like the GPT4ALL software. This looks and runs like ChatGPT, but it’s a desktop app that doesn’t need an internet connection once it’s set up and, critically, it does not share data outside your individual computer, ensuring privacy.

His post has an excellent explanation of how this has been achieved, and the implications for protecting personal data:

Up until now, services like ChatGPT have sent your data to a third party company for use, which is why we’ve said you should never, ever use them with sensitive information. Now, that’s no longer the case – you can use GPT4ALL in complete privacy. It’s the best of both worlds – the performance and capabilities of a service like ChatGPT with ironclad privacy because the data – your data – never leaves your computer. That makes it ideal for industries like finance, healthcare, government – any place where you wouldn’t just want to hand over protected information willy nilly.
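For readers who want to try this themselves, the same idea can be sketched in a few lines using the open source gpt4all Python bindings (the specific model file named below is an assumption – check the GPT4All documentation for the models currently supported). Everything runs on your own machine, and nothing is sent over the network once the model file has been downloaded.

    # pip install gpt4all
    from gpt4all import GPT4All

    # Downloads the model file on first run, then loads it from local disk.
    # The model name is only an example; pick any model GPT4All lists as supported.
    model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")

    # All inference happens locally: prompts and responses never leave this computer.
    with model.chat_session():
        reply = model.generate("Summarise the GDPR in one sentence.", max_tokens=128)
        print(reply)

The GPT4ALL desktop app Penn describes wraps the same kind of local engine in a point-and-click interface, so no code is needed for everyday use.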

Open-Source Chatbot Privacy Is Possible

Now that companies can run powerful chatbot systems on their own hardware, with no data sent to the outside world, it seems likely that a new tech sector producing highly tailored solutions for business and general users could spring up.

Locally hosted AI chatbots will avoid many of the problems with popular solutions by limiting their functionality to very specific domains. That means they are unlikely to start producing wildly incorrect, irrelevant, or even dangerous responses to queries. Best of all, there will be fewer privacy concerns because any personal data used for training these systems will not leave the company’s premises (although there may still be important GDPR issues concerning user consent).

This is great news for businesses and general users. But it does raise questions about today’s leaders in the chatbot field – companies like OpenAI, Google and Microsoft – that are spending billions of dollars developing proprietary solutions. What seems to be a leaked internal Google document shows that at least some within that company realize that they and others have a serious problem:

in the end, OpenAI doesn’t matter. They are making the same mistakes we are in their posture relative to open source, and their ability to maintain an edge is necessarily in question. Open source alternatives can and will eventually eclipse them unless they change their stance.

Although it is still early days, the rise of compact, powerful chatbot systems that are open source is perhaps the best tech development we could have hoped for. It makes it less likely that the nascent AI market will be owned by one giant company, and it will allow new startups based on open source code to be launched quickly. Moreover, the ability to run AI chatbots on corporate or even personal computers makes solving the major privacy issues that currently plague chatbots much easier than originally feared.

Featured image created with Stable Diffusion.