There’s a chance that ChatGPT knows personal details about you—and if it doesn’t, it might just make something up. As OpenAI’s generative text chatbot has boomed in popularity over the past six months, the risks of the system being trained on data vacuumed up from the web have become clearer.
Data regulators around the world are investigating issues with how OpenAI gathered the data it uses to train its large language models, the accuracy of answers it provides about people, and other legal concerns about the use of its generative text systems. Europe’s data regulators have joined forces to look at OpenAI after Italy temporarily banned ChatGPT from the country. And Canada is also investigating the technology’s potential privacy risks.
In Europe, the GDPR requires companies and organizations to demonstrate lawful reasons for handling people’s personal information and to let people access information held about them, be informed of how that information is used, and demand that errors be rectified. In some cases, they can ask that certain types of data be erased. The way people’s personal information has been used in training data has been an early area of concern for EU regulators.
As people have experimented with the chatbot, asking it questions about their lives and friends, a range of potential problems have emerged. OpenAI warns that ChatGPT may provide inaccurate information, and people have found that it makes up jobs and hobbies. It has cooked up false newspaper articles that had even the alleged human authors wondering if they were real. It generated incorrect statements saying a law professor was involved in a sexual harassment scandal, and it said a mayor in Australia had been implicated in a bribery scandal—he is preparing to sue for defamation.
It’s not just individuals who are concerned about how data is used. Samsung has banned employees from using generative AI tools, in part over fears about how data is stored on external servers and the risk that company secrets could ultimately be disclosed to other users. (There are separate issues around copyright and intellectual property.)
In response to the scrutiny—particularly from the Italian data regulator, which has now allowed ChatGPT back into the country after OpenAI made changes to its service—the company has introduced tools and processes that allow people more control over at least some of their data. Here’s how to use them.
ChatGPT and GPT-4 generate their human-like answers statistically—predicting which words are likely to follow others after seeing millions of examples of sentences written by human authors. OpenAI has been secretive about the data it has trained its large language models on, so nobody outside the company knows exactly how much of the web (including people’s personal information) it has scraped in the process.
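The idea of predicting likely next words from past examples can be illustrated with a toy sketch. This is not how OpenAI’s models actually work under the hood—real large language models use neural networks over billions of tokens—but a simple word-frequency count captures the basic statistical intuition:

```python
from collections import Counter, defaultdict

# Toy illustration only: count which word follows each word in a tiny
# corpus, then predict the most frequent follower. LLMs do something
# statistically richer, over tokens rather than whole words.
corpus = "the cat sat on the mat and the cat slept".split()

followers = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    followers[current][nxt] += 1

def predict_next(word):
    """Return the word most often seen after `word`, or None."""
    counts = followers[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # prints "cat" ("cat" follows "the" twice, "mat" once)
```

Scale that counting up to much of the public web, and the model’s “knowledge” of you is whatever patterns—accurate or not—appeared in that data.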
OpenAI says its large language models are trained on three sources of information: data taken from the web, data that the company licenses from others, and the information people feed it through chats. This can include information about individuals. “A large amount of data on the internet relates to people, so our training information does incidentally include personal information,” OpenAI explains in a post, stating that it takes steps to reduce the amount it gathers.
OpenAI has now introduced a Personal Data Removal Request form that allows people—primarily in Europe, although also in Japan—to ask that information about them be removed from OpenAI’s systems. It is described in an OpenAI blog post about how the company develops its language models.
The form primarily appears to be for requesting that information be removed from answers ChatGPT provides to users, rather than from its training data. It asks you to provide your name; email; the country you are in; whether you are making the application for yourself or on behalf of someone else (for instance a lawyer making a request for a client); and whether you are a public person, such as a celebrity.
OpenAI then asks for evidence that its systems have mentioned you. It asks you to provide “relevant prompts” that have resulted in you being mentioned and also for any screenshots where you are mentioned. “To be able to properly address your requests, we need clear evidence that the model has knowledge of the data subject conditioned on the prompts,” the form says. It asks you to swear that the details are correct and that you understand OpenAI may not, in all cases, delete the data. The company says it will balance “privacy and free expression” when making decisions about people’s deletion requests.
Daniel Leufer, a senior policy analyst at digital rights nonprofit Access Now, says the changes that OpenAI has made in recent weeks are OK but that it is only dealing with “the low-hanging fruit” when it comes to data protection. “They still have done nothing to address the more complex, systemic issue of how people’s data was used to train these models, and I expect that this is not an issue that’s just going to go away, especially with the creation of the EDPB taskforce on ChatGPT,” Leufer says, referring to the European regulators coming together to look at OpenAI.
“Individuals also may have the right to access, correct, restrict, delete, or transfer their personal information that may be included in our training information,” OpenAI’s help center page also says. To do this, it recommends emailing its data protection staff at email@example.com. People who have already requested their data from OpenAI have not been impressed with its responses. And Italy’s data regulator says OpenAI claims it’s “technically impossible” to correct inaccuracies at the moment.
You should be cautious about what you tell ChatGPT, especially given OpenAI’s limited data-deletion options. The conversations you have with ChatGPT can, by default, be used by OpenAI as training data for its future large language models. This means the information could, at least theoretically, be reproduced in answers to other people’s future questions. On April 25, the company introduced a new setting that allows anyone to stop this process, no matter where in the world they are.
When logged in to ChatGPT, click on your user profile in the bottom left-hand corner of the screen, click Settings, and then Data Controls. Here you can toggle off Chat History & Training. OpenAI says turning your chat history off means data you input into conversations “won’t be used to train and improve our models.”
As a result, anything you enter into ChatGPT—such as information about yourself, your life, and your work—shouldn’t be resurfaced in future iterations of OpenAI’s large language models. OpenAI says when chat history is turned off, it will retain all conversations for 30 days “to monitor for abuse” and then they will be permanently deleted.
When your chat history is turned off, ChatGPT nudges you to turn it back on by placing a button in the sidebar that gives you the option to enable chat history again—a stark contrast to the “off” setting buried in the settings menu.
Article: How to Delete Your Data From ChatGPT