The privacy policy is a mandatory document that should be an integral part of every website, application, and online service. In the vast majority of countries, privacy policies are strictly regulated in both content and form. Sometimes, however, a specific policy turns out not to be compliant with legal regulations, or even with the product it is attached to. Verifying these inconsistencies is vital yet time-consuming work, and this is where the technology called NLP (Natural Language Processing) comes into play. In this article, we are going to show you what the analysis of privacy policies using NLP looks like.
Before we talk about privacy policy analysis using NLP, let's take a few moments to examine what natural language processing (and natural language analysis) is all about.
Put briefly, NLP is a broad subset of AI that revolves around reading, understanding, and deriving meaning from human language, both written and spoken. Natural Language Processing systems are usually based on machine learning.
Interestingly, if you want to see an advanced NLP algorithm in action, all you have to do is pick up your smartphone. Yes, Google Assistant, Alexa, and Siri are perfect examples of NLP algorithms! If you understand how Siri works, you, in fact, understand what NLP is all about. However, knowing how NLP works is one thing. Making it work is a whole different story.
We reckon that the biggest challenge in NLP is that the process of understanding a language is extremely complex. That's why, for example, machines don't understand the concept of sarcasm. You see, when humans communicate, the same words and sentences can be used in different contexts, with different meanings, and with different intent.
If someone says, “it’s great weather today!”, they can mean that it really is great weather; it’s warm and sunny. But they can also mean the exact opposite, and, in fact, it’s cold and windy. How can a machine tell the difference?
And the problems don't end there! The next issue in line is idioms and slang, which are also incredibly difficult for machines to understand. And finally, every language is a living thing. Languages constantly evolve, and that fact has to be taken into consideration as well. As a result, devising a decent NLP algorithm is very, very complex. It isn't impossible, though.
In 2013, the global market saw Word2Vec, a group of related models used to produce word embeddings. These models are essentially two-layer neural networks trained to reconstruct the linguistic contexts of words. The Word2Vec algorithm learns word associations from a large linguistic corpus of text (its input source). As a result, it produces a vector space, typically of several hundred dimensions, with each unique word in the corpus assigned a corresponding vector in that space.
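To make this less abstract, here is a minimal sketch of how word embeddings can be trained today with the open-source gensim library. The toy corpus and parameter values below are purely illustrative; they are not the original 2013 setup, and a useful model needs millions of sentences.

```python
# A minimal word-embedding sketch using gensim's Word2Vec.
# The tiny "corpus" is illustrative only; real training data is far larger.
from gensim.models import Word2Vec

corpus = [
    ["privacy", "policies", "describe", "data", "collection"],
    ["users", "rarely", "read", "privacy", "policies"],
    ["apps", "collect", "location", "and", "identifier", "data"],
]

# vector_size is the dimensionality of the embedding space
# (gensim >= 4.0; older versions call this parameter "size").
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1)

# Each unique word is now mapped to a dense vector in that space.
vector = model.wv["privacy"]                       # a 100-dimensional array
similar = model.wv.most_similar("privacy", topn=3)
print(vector.shape, similar)
```

Words that appear in similar contexts end up with similar vectors, which is exactly the property that later NLP systems build on.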
Now, a linguistic corpus is typically a dataset comprising representative words, sentences, idioms, and phrases in a specific language. Corpora are usually built from books, magazines, newspapers, and internet portals. More and more commonly, they also contain informal forms and expressions, for example from online chats. Corpora are used, for instance, in machine translation, but their main role is to help algorithms and machines understand the way people talk, write, and communicate in a given language.
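For a sense of what a corpus looks like from a programmer's point of view, here is a tiny sketch using NLTK's bundled Brown corpus. The Brown corpus is chosen only because it ships with the toolkit; it is not one of the privacy-policy corpora discussed later in this article.

```python
# Accessing a ready-made linguistic corpus through NLTK.
import nltk

nltk.download("brown")            # fetch the corpus once
from nltk.corpus import brown

print(len(brown.words()))         # total number of word tokens
print(brown.sents()[0])           # the first sentence, as a list of tokens
print(brown.categories())         # the text genres the corpus is divided into
```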
If you go here, you will find a list of corpora for the English language. The newest corpus comes from 2017 and contains 14 billion words from 22 million web pages. NLP algorithms have to be trained on this and other corpora in order to understand human language. And they handle this job more and more effectively. Just consider Google Assistant.
Today, this incredibly complex algorithm can handle an impressively wide range of questions and tasks [1].
In short, we can expect that in the coming years, these algorithms will be even more effective! With this introduction done, let’s focus on the analysis of privacy policy using NLP.
Perhaps when you first read the title of this article, you wondered why privacy policies need analyzing at all. It all started with the Federal Trade Commission (FTC), which decided to create a so-called “common law of privacy”. As the Internet began to grow in the mid-1990s and people began to surf the web and engage in online commercial activity, a lot of new questions and concerns arose. One of the most significant revolved around personal data security. For obvious reasons, many people were afraid to use the Internet; they worried that their personal data could be improperly viewed or accessed. At that time, there were no sufficient laws and regulations covering these issues.
The FTC was created in 1914. Its initial goal was to ensure fair competition in commerce. Of course, over the years, the role and scope of the FTC's activity have grown significantly. Above all, one of the most significant expansions happened when Congress passed the Wheeler-Lea Amendment to the Federal Trade Commission Act, expanding the FTC's jurisdiction to prohibit “unfair or deceptive acts or practices”.
As a result of this new act, the FTC was directly charged with protecting consumers. In 1995, the FTC became involved with consumer privacy issues, and it's still their job today. After this extremely short history lesson, let's go back to the present day.
Today, everyone knows that privacy policies are significant and should be a part of every online effort.
However, as it happens, these documents don't always serve the purpose they were introduced for in the first place. We believe that Internet users should have tools that analyze these documents for them, without having to scrutinize every paragraph.
As we will show you in this article, there are projects that try to help users achieve this goal. And this is what brings us to the MAPS project.
Today, the FTC engages in enforcement actions against operators of mobile apps that are non-compliant with their own privacy policies. Although it may sound harmless, such non-compliance is considered an unfair or deceptive act or practice in or affecting commerce, in violation of Section 5(a) of the FTC Act (FTC 2014). And that's why the MAPS project was introduced in the first place. Its goal is to detect whether a specific privacy policy is inconsistent with American law or with the application it describes. Because MAPS can autonomously identify potential issues, it makes the whole process, and the investigations that follow, much quicker and cheaper.
MAPS is a three-tiered classification model that uses NLP and natural language analysis to detect potential inconsistencies in privacy policies. As the authors of the MAPS paper indicate [3] (the list of sources is at the end of the article), many apps' privacy policies do not sufficiently disclose the identifier and location data access practices performed by ad networks and other third parties.
But let's get back to the analysis of privacy policies using NLP. MAPS comprises separate modules for the analysis of privacy policies and apps. The authors used the APP-350 corpus of 350 annotated mobile app privacy policies to train and test their classifiers. The corpus's policies were selected from the most popular apps on the Google Play Store and then annotated by legal experts using a set of privacy practice annotation labels. To make productive use of the annotated data, the authors classified individual policy segments rather than entire documents.
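To give a rough idea of what segment-level classification can look like in practice, here is a hedged sketch using scikit-learn. The labels and training segments below are invented for illustration; the real APP-350 annotation scheme and the classifiers actually built for MAPS are considerably richer.

```python
# A toy segment-level privacy-practice classifier: TF-IDF features plus
# logistic regression. Labels and example segments are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_segments = [
    "We share your device identifiers with advertising partners.",
    "We collect your precise location to personalize content.",
    "You may contact our support team at any time.",
    "Third-party analytics providers may access your location data.",
]
# One illustrative label per segment; real annotation schemes are multi-label.
train_labels = [
    "identifier_third_party",
    "location_first_party",
    "other",
    "location_third_party",
]

classifier = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
classifier.fit(train_segments, train_labels)

new_segment = "Our ad network partners receive your advertising ID."
print(classifier.predict([new_segment])[0])
```

Working at the level of segments rather than whole policies means every annotated sentence or paragraph becomes a training example, which is what makes a small, expensively annotated corpus usable in the first place.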
In the conclusions of their paper, the authors stated that: “Our results from analyzing 1,035,853 free apps on the Google Play Store suggest that potential compliance issues are rather common, particularly, when it comes to the disclosure of third party practices. These and similar results may be of interest to app developers, app stores, privacy activists, and regulators. Recently enacted laws, such as the General Data Protection Directive, impose new obligations and provide for substantial penalties for failing to properly disclose privacy practices.” And here’s the crucial conclusion, at least from our point of view:
“We believe that the natural language analysis of privacy policies, […] has the potential to improve privacy transparency and enhance privacy levels overall.”
Here, the authors of the paper explicitly state that natural language analysis is a vital tool that, at least when it comes to privacy policies, can enhance the transparency of these documents. As a result, it can have a significant impact on the level of privacy protection.
As you already know, linguistic corpora are essential to NLP. The MAPS project, for example, uses a corpus called APP-350. We want to show you another such corpus, called OPP-115, which can also play a significant role in privacy policy analysis using NLP. This corpus consists of 115 online privacy policies, split into subsets of 75 policies for training and 40 for testing. As the authors of this corpus put it:
“This corpus should serve as a resource for language technologies research to help Internet users understand the privacy practices of businesses and other entities that they interact with online.”
And here's another interesting project: scientists from Eindhoven University of Technology wrote a paper [4] and developed a system that automatically evaluates the completeness of a privacy policy using machine learning and text classification techniques. The idea behind the project is simple: users very rarely make informed decisions concerning their personal data, so they need help making them. And this is where the Eindhoven project comes in handy.
Thanks to machine learning and text classification, their algorithm automatically reads a given privacy policy and provides the user with clear, easy-to-understand information about what they agree to by ticking the privacy policy box.
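To make the idea of "completeness" concrete, here is a minimal, hypothetical sketch: classify each paragraph of a policy into a content category and report which required categories the policy fails to cover. This only illustrates the general approach; it is not the Eindhoven system itself, and the category names are made up.

```python
# A toy "completeness" check: which required content categories does a
# privacy policy actually cover? The categories below are invented examples.
REQUIRED_CATEGORIES = {"data_collection", "data_sharing", "data_retention", "user_choice"}

def completeness_report(paragraphs, paragraph_category):
    """paragraph_category is any function mapping a paragraph to a label,
    e.g. a trained text classifier like the one sketched earlier."""
    covered = {paragraph_category(p) for p in paragraphs}
    missing = REQUIRED_CATEGORIES - covered
    score = 1 - len(missing) / len(REQUIRED_CATEGORIES)
    return score, missing

# Example with a stub classifier standing in for a real ML model:
stub = lambda p: "data_collection" if "collect" in p else "other"
policy = ["We collect your email address.", "Contact us for questions."]
print(completeness_report(policy, stub))   # -> (0.25, three missing categories)
```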
The scientists behind this project also used a linguistic corpus. Their corpus was based on 64 privacy policies (including those of Google, Amazon, and Fox News), in which every paragraph was manually annotated. This resulted in 1,049 annotated paragraphs, 772 of which were used as a training set and 277 as a testing set. What were their results? Here's what we can read in the conclusion of the paper:
“We test several automatic classifiers, obtained by applying machine learning over a corpus of pre-annotated privacy policies, to prove the feasibility [of the] approach and give an indication of the accuracy that can be achieved. The results of our experiments show that it is possible to create automatic classification with a high accuracy (~92%). This accuracy level is similar to the one obtainable with human classifiers.”
The truth is that all the projects mentioned earlier in this article function primarily in the academic environment. This doesn't mean, however, that there's nothing going on in the real world! We want to show you the Usable Privacy Policy Project, which has been funded by the National Science Foundation under its Secure and Trustworthy Computing initiative.
Several recognizable American universities participate in this project.
The project uses the latest advances in NLP, privacy preference modeling, crowdsourcing, formal methods, and privacy interfaces to help Internet users understand what they are really reading in online privacy policies.
Moreover, they also use machine learning, code analysis, and software engineering to develop a solution that's easy to use and versatile. Their side goal is to achieve a higher level of transparency without the need to impose new legal requirements on website operators. Their resources include both human-annotated privacy policies (with annotations made by law students on a set of 115 privacy policies collected in 2015) and machine-annotated privacy policies (with annotations generated by machine learning classifiers for a set of over 7,000 privacy policies collected in 2017).
As a result, you can go here and explore the results of their work on your own, 100% for free [5]. Inside, you will find privacy policies from Google, YouTube, Twitter, Amazon, and many more.
To sum up, privacy policy analysis is a vital part of our online activity. It's the best way to ensure that we understand what we agree to when clicking the “I agree to the Privacy Policy” checkbox found on almost every website. And NLP and machine learning are two technologies increasingly used to make this privacy policy analysis automated and smart.
And, as it happens, we thrive in both these areas. If you need any help with NLP algorithms or machine learning solutions, please drop us a line. Let’s find out what we can accomplish together!
This article has been written based on the following papers and resources:
[1] Maggie Tillman, Pocket-lint, What is Google Assistant and what can it do?, March 25, 2021, https://www.pocket-lint.com/apps/news/google/137722-what-is-google-assistant-how-does-it-work-and-which-devices-offer-it, accessed May 15, 2021.
[2] https://cyberlaw.stanford.edu/files/publication/files/SSRN-id2312913.pdf
[3] Natural Language Processing for Mobile App Privacy Compliance, https://www.researchgate.net/publication/339935768_Natural_Language_Processing_for_Mobile_App_Privacy_Compliance
[4] A machine learning solution to assess privacy policy completeness (Costante, Sun, et al.), https://www.semanticscholar.org/paper/A-machine-learning-solution-to-assess-privacy-Costante-Sun/653783690e08307561453205cb873b03efe3f568#paper-header
[5] https://explore.usableprivacy.org/?view=machine