It is said that if you have ever written a blog, posted a product review, or commented on just about anything online, chances are your words were used to train ChatGPT. ChatGPT knows what you wrote, whether you wanted it to or not. By now, we have told you a few times about the humongous amount of data that went into training ChatGPT. As a rule of thumb, the more data used to train a large language model like ChatGPT, the better the model becomes at identifying patterns, predicting what comes next, and delivering plausible text.
Just as a quick refresher, more than 300 billion words from across the internet were used to train the ChatGPT model: books, articles, blogs, websites, posts, comments, and so much more. From a capability standpoint, training on so much data is a good thing. From a privacy standpoint, however, it raises some serious concerns.
First, nobody was asked for permission to use their data to train ChatGPT, which is essentially a violation of people's privacy. If I write a blog, anybody who wants to use its content for any purpose is ideally required to seek my permission first. OpenAI did not do this for any of the data it used to train ChatGPT. Some of that data could be sensitive in nature for a multitude of reasons, which makes this a clear data privacy concern. On top of that, a lot of the content ChatGPT has been trained on is likely copyright-protected or proprietary. For instance, if you ask ChatGPT to pull up some paragraphs from a novel and it delivers them, you know that content is copyright-protected and cannot be reproduced without the publisher's permission.
I think we as users might also be handing over sensitive information to ChatGPT without realizing it. What you type can be retained by OpenAI and used to improve its models, which poses a definite data privacy risk. Say you are a lawyer who asked ChatGPT to draft a legal agreement, a business executive who asked it to draft emails to your clients, or a developer who asked it to check some code. In all these cases, everything you submitted, whether drafted or checked, is now out of your hands. These are undoubtedly confidential pieces of information, and one cannot afford any leakage of them, yet you inadvertently handed them over to a third party.
According to OpenAI’s privacy policy, the company collects users’ IP addresses, browser type and settings, and data about their interactions with the site, such as the type of content they engage with, the features they use, and the actions they take online. To put it plainly, if you are a ChatGPT user, OpenAI is collecting data about your online browsing activity over time and across multiple websites.
What is more alarming, in my opinion, is that OpenAI has stated it may share users’ personal information with unspecified third parties, without informing them, in order to meet its own business objectives.
Now, that is definitely scary. Granted, a lot of websites and platforms already do this: considering how often we are asked to hand over personal information without being given a real choice, data privacy is taken quite lightly as it is. Even so, the privacy risks that come with using ChatGPT seem genuinely concerning.
I think we could all do with being mindful of what we share with ChatGPT. I would also recommend making sure we log out once we are done using the tool.
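If you do want to be systematic about being mindful, one simple habit is scrubbing obvious identifiers out of text before pasting it into a chatbot. Here is a minimal sketch in Python; the regex patterns and the `redact` helper are our own illustrative assumptions, not anything OpenAI provides, and a few regexes are nowhere near exhaustive PII detection:

```python
import re

# Illustrative patterns for obvious identifiers. Real PII detection
# needs far more than a handful of regexes (names, addresses, IDs...).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{8,}\d"),
}

def redact(text: str) -> str:
    """Replace each pattern match with a [LABEL] placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

draft = "Contact Jane at jane.doe@example.com or +1 555-010-9999."
print(redact(draft))  # the email and phone number become placeholders
```

Note that this sketch would still leave the name "Jane" untouched; the safest default remains simply not pasting confidential material into the tool at all.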
So that is what we had to tell you about the data privacy concerns associated with ChatGPT. We hope this episode helped you see things from a different perspective.