What is NLP?
In short, NLP is a broad subfield of AI concerned with reading, understanding, and deriving meaning from human language, both written and spoken. Most modern Natural Language Processing systems are built on machine learning. If you want to see advanced NLP in action, all you have to do is pick up your smartphone. Voice assistants such as Google Assistant, Alexa, and Siri are prime examples of NLP at work. If you understand how Siri works, you understand what NLP is all about. However, knowing how NLP works is one thing. Making it work is a whole different story.
We reckon that the biggest challenge in NLP is that understanding a language is extremely complex. That’s why, for example, machines struggle with sarcasm. When humans communicate, the same words and sentences can be used in different contexts, with different meanings and different intent. If someone says, “It’s great weather today!”, they may mean that the weather really is great: warm and sunny. But they may also mean the exact opposite because it is, in fact, cold and windy. How can a machine tell the difference?
And the problems don’t end there! Next come idioms and slang, which are also very difficult for machines to understand. Finally, every language is a living thing. Languages constantly evolve, and that has to be taken into account as well. As a result, devising a decent NLP algorithm is very complex. It isn’t impossible, though.
In 2013, researchers at Google introduced Word2Vec. Word2Vec is a group of related models used to produce word embeddings. These models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. Word2Vec takes a large text corpus as input and uses it to learn word associations. The result is a vector space, typically of several hundred dimensions, in which each unique word in the corpus is assigned a corresponding vector.
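To make the idea more concrete, here is a minimal sketch of skip-gram training, one of the Word2Vec model variants, in plain Python. It is an illustration under simplifying assumptions, not the original implementation: it uses a full softmax instead of the faster approximations used in practice, a toy corpus, and only 8 dimensions. The function names `skipgram_pairs` and `train_skipgram` are hypothetical.

```python
import math
import random


def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def train_skipgram(tokens, dim=8, window=2, epochs=50, lr=0.05, seed=0):
    """Train a tiny skip-gram model and return one vector per unique word."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocab)}
    size = len(vocab)
    # The "two layers": input embeddings and output (context) weights.
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(size)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(size)]
    pairs = [(index[c], index[o]) for c, o in skipgram_pairs(tokens, window)]

    for _ in range(epochs):
        rng.shuffle(pairs)
        for center, context in pairs:
            h = w_in[center]
            # Softmax over the whole vocabulary: P(context word | center word).
            scores = [sum(h[k] * w_out[j][k] for k in range(dim)) for j in range(size)]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            grad_h = [0.0] * dim
            for j in range(size):
                err = exps[j] / total - (1.0 if j == context else 0.0)
                for k in range(dim):
                    grad_h[k] += err * w_out[j][k]   # uses the pre-update weight
                    w_out[j][k] -= lr * err * h[k]
            for k in range(dim):
                h[k] -= lr * grad_h[k]               # updates w_in[center] in place

    return {word: w_in[index[word]] for word in vocab}


corpus = "the cat sat on the mat the dog sat on the rug".split()
vectors = train_skipgram(corpus)
print(len(vectors), "words,", len(vectors["cat"]), "dimensions each")
```

On a real corpus, words that appear in similar contexts tend to end up with similar vectors; a corpus this small is only enough to show the mechanics.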
Now, a linguistic corpus is typically a dataset comprising representative words, sentences, idioms, and phrases in a specific language. Corpora usually consist of books, magazines, newspapers, and internet portals. More and more commonly, they even contain informal forms and expressions, for example, coming from online chats. Corpora are used, for instance, in machine translation, but their main role is to help algorithms and machines understand the way people talk, write, and communicate in a given language.
Lists of corpora for the English language are available online. The newest corpus on the list we consulted dates from 2017 and contains 14 billion words from 22 million web pages. NLP algorithms have to be trained on this and other corpora in order to understand human language, and they handle this job more and more effectively. Just consider Google Assistant.
Today, this incredibly complex system can, for instance:
- Control your devices and your smart home
- Access information from your calendars and other personal information
- Find information online, from restaurant bookings to directions, weather, and news
- Control music
- Run timers and reminders
- Make appointments and send messages
- Open apps on your smartphone
- Read notifications to you
- And even conduct real-time spoken translations
Why would you want to analyze privacy policies?
Perhaps when you first read the title of this article, that was one of the questions that popped into your mind. It all started with the Federal Trade Commission (FTC) and what has been called a “common law of privacy”. As the Internet grew in the mid-1990s and people began to surf the web and shop online, many new questions and concerns arose. One of the most significant concerned the security of personal data. Many people were afraid to use the Internet because they worried that their personal data could be improperly viewed or accessed. At that time, existing laws and regulations did not adequately cover these issues.
The FTC was created in 1914. Its initial goal was to ensure fair competition in commerce. Over the years, the role and scope of the FTC’s activity have grown significantly. One of the most significant expansions came in 1938, when Congress passed the Wheeler-Lea Amendment to the Federal Trade Commission Act in order to expand the FTC’s jurisdiction “to prohibit ‘unfair or deceptive acts or practices’”. With this amendment, the FTC was directly charged with protecting consumers. In 1995, the FTC became involved with consumer privacy issues, and that remains part of its job today. After this extremely short history lesson, let’s go back to the present day.
Today, everyone knows that privacy policies are important and should be a part of every online service. However, these documents don’t always serve the purpose they were introduced for in the first place. We believe that Internet users should have tools that analyze these documents for them, so that they don’t have to scrutinize every paragraph themselves.
As we will show you in this article, there are projects that try to help users achieve this goal. And this is what brings us to the MAPS project.
MOBILE APP PRIVACY SYSTEM (MAPS)
MAPS is a three-tiered classification model that uses natural language processing to detect potential inconsistencies in privacy policies. As the authors of paper number two (see the list of sources at the end of the article) indicate, many apps’ privacy policies do not sufficiently disclose identifier and location data access practices performed by ad networks and other third parties.
In the conclusions of their paper, the authors stated that: “Our results from analyzing 1,035,853 free apps on the Google Play Store suggest that potential compliance issues are rather common, particularly, when it comes to the disclosure of third party practices. These and similar results may be of interest to app developers, app stores, privacy activists, and regulators. Recently enacted laws, such as the General Data Protection Directive, impose new obligations and provide for substantial penalties for failing to properly disclose privacy practices.” And here’s the crucial conclusion, at least from our point of view:
“We believe that the natural language analysis of privacy policies, […] has the potential to improve privacy transparency and enhance privacy levels overall.”
Here, the authors explicitly state that natural language analysis of privacy policies has the potential to make these documents more transparent. As a result, it can have a significant impact on the level of privacy protection.
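The kind of check described above can be sketched in a few lines. This is a hypothetical illustration, not the MAPS implementation: it assumes that one step has already extracted the practices a policy discloses and another has determined the practices the app actually performs, and it simply flags the difference. The `Practice` type and the `potential_compliance_issues` function are names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Practice:
    data_type: str  # e.g. "location" or "identifier"
    party: str      # "first_party" or "third_party"


def potential_compliance_issues(disclosed, performed):
    """Return practices the app performs that its policy does not disclose."""
    undisclosed = performed - disclosed
    return sorted(undisclosed, key=lambda p: (p.data_type, p.party))


# Assumed inputs: in a real system these sets would come from policy
# classifiers and from an analysis of the app itself.
disclosed = {Practice("location", "first_party")}
performed = {
    Practice("location", "first_party"),
    Practice("location", "third_party"),
    Practice("identifier", "third_party"),
}

issues = potential_compliance_issues(disclosed, performed)
for issue in issues:
    print(f"Not disclosed: {issue.data_type} access by {issue.party}")
```

A flagged practice is only a potential issue. As the authors’ wording suggests, such results are a starting point for review, not a verdict.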
“This corpus should serve as a resource for language technologies research to help Internet users understand the privacy practices of businesses and other entities that they interact with online.”
The scientists behind this project also used a linguistic corpus. Their corpus, however, was based on 64 privacy policies (among others, those of Google, Amazon, and FoxNews), each paragraph of which was manually annotated. This resulted in 1,049 annotated paragraphs, 772 of which were used as a training set and 277 as a testing set. What were their results? Here’s what we can read in the conclusion of the paper:
“We test several automatic classifiers, obtained by applying machine learning over a corpus of pre-annotated privacy policies, to prove the feasibility approach and give an indication of the accuracy that can be achieved. The results of our experiments show that it is possible to create automatic classification with a high accuracy ~92%. This accuracy level is similar to the one obtainable with human classifiers.”
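The workflow the authors describe (annotate paragraphs, train on one part, measure accuracy on the rest) can be illustrated with a small sketch. The example below is an assumption-laden toy, not the classifiers from the paper: it uses a simple Naive Bayes model written in plain Python, two invented categories, and six made-up paragraphs in place of the 1,049 annotated ones.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, texts, labels):
    """Share of test paragraphs whose predicted label matches the annotation."""
    correct = sum(model.predict(t) == y for t, y in zip(texts, labels))
    return correct / len(labels)


# Made-up annotated paragraphs (hypothetical categories).
train_texts = [
    "We collect your email address and location when you register.",
    "The app collects device identifiers and location data.",
    "We share your information with advertising partners.",
    "Data may be shared with third parties and ad networks.",
]
train_labels = ["collection", "collection", "sharing", "sharing"]
test_texts = ["We collect your location.", "Information is shared with ad networks."]
test_labels = ["collection", "sharing"]

model = NaiveBayes().fit(train_texts, train_labels)
print("accuracy:", accuracy(model, test_texts, test_labels))
```

Accuracy on two toy paragraphs says nothing about real-world performance. The point is only to show how a held-out test set, like the 277 paragraphs in the paper, is used to estimate a figure such as the reported ~92%.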
Several well-known American universities participate in this project, including:
- Carnegie Mellon University
- Fordham University
- Stanford University
- University of Cincinnati
- Columbia University
What is this project all about?
NSF uses the latest advances in NLP, privacy preference modeling, crowdsourcing, formal methods, and privacy interfaces to help Internet users understand what they are really reading in online privacy policies. Here’s how they do it:
- They present these features to users in an easy-to-digest format that enables them to make more informed privacy decisions as they interact with various websites and online services.
They also use machine learning, code analysis, and software engineering to develop a solution that’s easy to use and versatile. A secondary goal is to achieve a higher level of transparency without imposing new legal requirements on website operators.
Their resources include both human-annotated privacy policies (a set of 115 privacy policies collected in 2015 and annotated by law students) and machine-annotated privacy policies (a set of over 7,000 privacy policies collected in 2017, with annotations generated by machine learning classifiers). You can explore the results of their work on your own, completely free of charge. Inside, you will find privacy policies from Google, YouTube, Twitter, Amazon, and many more.
As it happens, we specialize in both NLP and machine learning. If you need any help with NLP algorithms or machine learning solutions, please drop us a line. Let’s find out what we can accomplish together!
This article is based on several papers published by researchers at a number of American universities:
- The FTC And The New Common Law Of Privacy, Daniel J. Solove & Woodrow Hartzog. This paper is available for viewing and downloading via the Stanford University website.
- Natural Language Processing for Mobile App Privacy Compliance, Peter Story, Sebastian Zimmeck, Abhilasha Ravichander, Daniel Smullen, Ziqi Wang, Joel Reidenberg, N. Cameron Russell, and Norman Sadeh; School of Computer Science, Carnegie Mellon University, Department of Mathematics and Computer Science, Wesleyan University, School of Law, Fordham University.
- Maggie Tillman, Pocket-lint, What is Google Assistant and what can it do?, March 25, 2021, https://www.pocket-lint.com/apps/news/google/137722-what-is-google-assistant-how-does-it-work-and-which-devices-offer-it, accessed May 15, 2021.