Today, we want to tackle another fascinating field of Artificial Intelligence. Natural Language Processing (NLP) is a subset of AI that aims to read, understand, and derive meaning from human language, both written and spoken. It’s one of those AI applications that anyone can experience simply by using a smartphone: Google Assistant, Alexa, and Siri are perfect examples of NLP algorithms in action. But there’s far more to NLP algorithms! Let’s examine NLP solutions a bit more closely and find out how they’re used today.
Modern NLP is a relatively young technology. A major leap forward in the field happened in 2013, when Word2Vec[1] was introduced: a group of related models used to produce word embeddings. These models are essentially two-layer neural networks trained to reconstruct the linguistic contexts of words. Word2Vec takes a large corpus of text as input and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus assigned to a corresponding vector in that space. At this point, we ought to explain what corpora (also called corpuses) are.
A linguistic corpus is a dataset of representative words, sentences, and phrases in a given language. Typically, corpora are built from books, magazines, newspapers, and internet portals. Sometimes they also contain less formal forms and expressions, for instance ones originating from chats and instant messengers. All in all, the main idea is to help machines understand the way people talk and communicate.
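To make this concrete, here is a minimal sketch of training Word2Vec embeddings with the gensim library (our choice of library for the example; any comparable implementation would do). The three-sentence corpus and the parameter values are placeholders, far from a realistic training setup:

```python
# A minimal Word2Vec sketch using the gensim library (assumed to be installed).
# The toy corpus below is a placeholder; real training needs a large corpus.
from gensim.models import Word2Vec

corpus = [
    ["the", "patient", "was", "given", "medication"],
    ["the", "doctor", "prescribed", "medication", "to", "the", "patient"],
    ["the", "nurse", "recorded", "the", "dosage"],
]

# Train a small skip-gram model; vector_size sets the embedding dimensionality.
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)

# Each unique word now maps to a dense vector in the embedding space.
print(model.wv["patient"].shape)        # -> (50,)
print(model.wv.most_similar("doctor"))  # nearest words in the vector space
```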
Natural Language Processing (NLP) is a field at the intersection of computer science, artificial intelligence, and linguistics. It focuses on the interaction between computers and human (natural) language. The primary goal of NLP is to enable computers to understand, interpret, and respond to human language in a way that is both meaningful and useful.
NLP algorithms are a set of methods and techniques designed to process, analyze, and understand human language. These algorithms enable computers to perform a variety of tasks involving natural language, such as translation, sentiment analysis, and topic extraction. The development and refinement of these algorithms are central to advances in Natural Language Processing (NLP).
In 2014, Stanford’s research group introduced another Natural Language Processing (NLP) algorithm: GloVe[2]. It’s an unsupervised learning algorithm for obtaining vector representations for words. According to Stanford’s website, GloVe’s training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
Stanford’s researchers proposed a different approach: they argued that the best way to encode the semantic meaning of words is through a global word-word co-occurrence matrix, as opposed to the local co-occurrences used in Word2Vec. The GloVe algorithm represents words as vectors in such a way that the difference between two word vectors, projected onto a context word vector, corresponds to the ratio of their co-occurrence probabilities.
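As an illustration of those linear substructures, here is a minimal sketch that loads one of the publicly hosted pre-trained GloVe models through gensim’s downloader (the library and the model name, “glove-wiki-gigaword-100”, are assumptions made for this example):

```python
# A minimal sketch using gensim's downloader to fetch pre-trained GloVe vectors.
# "glove-wiki-gigaword-100" is one of the publicly hosted GloVe models.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")  # downloads on first use

# Linear substructure: vector arithmetic captures semantic relationships.
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Co-occurrence-based similarity between related words.
print(glove.similarity("doctor", "nurse"))
```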
These two algorithms have significantly accelerated the pace of NLP development. There are, however, several problems with this technology.
The largest NLP-related challenge is that the process of understanding and manipulating language is extremely complex. The same words can be used in different contexts and with different meanings and intents. Then there are idioms and slang, which are incredibly difficult for machines to understand. On top of all that, language is a living thing: it constantly evolves, and that fact has to be taken into consideration.
One of the most common tasks is sentiment analysis. It’s all about determining the attitude or emotional reaction of a speaker or writer toward a particular topic. Possible sentiments are positive, neutral, and negative. What’s easy and natural for humans is incredibly difficult for machines.
Naturally, data scientists and NLP specialists try to overcome these issues and train NLP algorithms to operate as efficiently as possible. There are many models and methods for training NLP algorithms so that they can understand and derive meaning from text. In this article, we are going to show you four exemplary techniques:
In this model, a text is represented as a bag (multiset) of words (hence its name), disregarding grammar and even word order, but keeping multiplicity[3]. Basically, the bag of words model creates an occurrence matrix. These word frequencies or occurrences are then used as features for training a classifier.
Unfortunately, this model has several downsides. The biggest ones are the absence of semantic meaning and context, and the fact that some words are not weighted appropriately (for instance, in this model, a rare but informative word like “universe” carries less weight than a frequent word like “they”).
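Here is a minimal sketch of such an occurrence matrix, built with scikit-learn’s CountVectorizer (the library choice and the two toy documents are assumptions made for illustration):

```python
# A minimal bag-of-words sketch using scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "they said they saw the universe",
    "they went home",
]

vectorizer = CountVectorizer()
occurrence_matrix = vectorizer.fit_transform(documents)

# Each row is a document, each column a word; values are raw counts,
# so a frequent word like "they" outweighs a rarer word like "universe".
print(vectorizer.get_feature_names_out())
print(occurrence_matrix.toarray())
```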
It’s the process of segmenting text into sentences and words. In essence, it’s the task of cutting a text into smaller pieces (called tokens), and at the same time throwing away certain characters, such as punctuation[4].
For example:
Input text: Peter went to school yesterday.
Output text: Peter, went, to, school, yesterday
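A minimal sketch of the same example, assuming NLTK and its tokenizer data are installed; punctuation tokens are filtered out afterwards, as described above:

```python
# A minimal tokenization sketch using NLTK (the "punkt" tokenizer data
# must be downloaded once with nltk.download("punkt")).
import string
from nltk.tokenize import word_tokenize

text = "Peter went to school yesterday."
tokens = word_tokenize(text)

# Throw away punctuation tokens, as described above.
tokens = [t for t in tokens if t not in string.punctuation]
print(tokens)  # ['Peter', 'went', 'to', 'school', 'yesterday']
```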
The major downside of this technique is that it works well for some languages and much worse for others. This is especially the case with tonal languages, like Mandarin or Vietnamese. For instance, depending on the tone, the Mandarin word ma can mean “a horse”, “hemp”, “to scold”, or “a mother”. It’s a real challenge for NLP algorithms!
This technique is based on removing words that provide little or no value to the NLP algorithm. These so-called stop words, for instance “and”, “the”, or “an”, are removed from the text before it’s processed.
There are a couple of advantages to this method: among other things, it shrinks the dataset and speeds up processing.
Naturally, there are also downsides. There is always a risk that stop word removal will wipe out relevant information and change the context of a given sentence. That’s why it’s immensely important to select the stop words carefully and exclude any that can change the meaning of a sentence (like, for example, “not”).
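A minimal sketch of stop word removal, assuming NLTK’s English stop word list is available; note that “not” is deliberately kept, for the reason given above:

```python
# A minimal stop word removal sketch using NLTK's English stop word list
# (download once with nltk.download("stopwords")).
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
# Keep words that would change the meaning of the sentence, such as "not".
stop_words.discard("not")

tokens = ["the", "movie", "was", "not", "good"]
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['movie', 'not', 'good']
```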
This technique is based on reducing a word to its base form and grouping together different forms of the same word.
It is all about getting to the root (lemma) of each word. Let’s take the words “working”, “worked”, and “works”. All of them are different forms of the word “work”. So “work” is our lemma: the base form that we want to use.
The lemmatization technique takes the context of the word into consideration in order to solve problems like disambiguation, where one word can have two or more meanings. Take the word “cancer”: it can mean either a severe disease or a marine animal. It’s the context that allows you to decide which meaning is correct. And this is what lemmatization is about.
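A minimal sketch of lemmatization, assuming NLTK’s WordNet lemmatizer is available; supplying the part of speech is one simple way of giving the lemmatizer context:

```python
# A minimal lemmatization sketch using NLTK's WordNet lemmatizer
# (download once with nltk.download("wordnet")).
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Different forms of "work" are mapped to the same lemma
# when the part of speech ("v" for verb) is supplied.
for word in ["working", "worked", "works"]:
    print(lemmatizer.lemmatize(word, pos="v"))  # -> work, work, work
```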
That’s the theory. Now, let’s talk about the practical implementation of this technology. We will show you two common applications of NLP algorithms: one in the medical field and one in the field of mobile devices.
NLP plays an increasingly important role in medicine. It enables the recognition and prediction of diseases based on patients’ electronic health records and speech. According to the paper “The promise of natural language processing in healthcare”[5], published in The University of Western Ontario Medical Journal, medical NLP algorithms can be divided into four major categories.
As you can see, almost every group within the medical field can benefit from NLP in medicine. Now, let’s take a look at two real-life examples of functioning NLP algorithms:
One of the most outstanding examples of NLP application in medicine is Amazon Comprehend Medical (ACM), part of Amazon Web Services. ACM is an NLP service that makes it easy to use machine learning to extract medical information from unstructured text. With ACM, physicians can quickly and accurately gather information such as medical condition, medication, dosage, strength, and frequency from a variety of sources like doctors’ notes, clinical trial reports, and patient health records[6].
The idea behind this project is that a lot of patient data is “trapped” in free-form medical texts, especially hospital admission notes and patients’ medical histories. These materials are frequently handwritten and, on many occasions, difficult for other people to read. ACM helps extract information from such texts.
Another interesting example is Curai[7]. They are trying to build an AI-fueled care service that involves many NLP tasks. For instance, they’re working on a question-answering NLP service for both patients and physicians. Let’s say a patient wants to know whether they can take Mucinex while on a Z-Pack. Curai wants to extract the relevant medical terms and surrounding context from such questions, use them to retrieve the most relevant documents from a repository of curated answers, and provide the patient with an answer. Fully automatically! Their ultimate goal is to develop a “dialogue system that can lead a medically sound conversation with a patient”.
Each one of us has instant access to a sophisticated NLP algorithm. How so? Think of your mobile device. Currently, there are four major voice assistants that are nothing but real-life applications of NLP algorithms!
These assistants are Google Assistant[8], Apple’s Siri, Amazon’s Alexa, and Microsoft’s Cortana.
Let’s take Amazon Alexa as an example of how these assistants work. What you perceive as a simple, short “conversation” is actually a sophisticated, four-step process[9]:
Step 1: The recording of your speech is sent to Amazon’s servers to be analyzed.
Step 2: Alexa breaks down your “orders” into individual sounds. Next, they are sent to the database containing various word pronunciations to find which words most closely correspond to the combination of your individual sounds.
Step 3: Alexa identifies crucial words to understand the sense of your order and carry out corresponding functions.
Step 4: Amazon’s servers send the information back to your device so that Alexa speaks the answer.
That’s impressive, especially given that it all happens within seconds! Today, you can use these intelligent assistants to open specific apps on your mobile device, take notes, control your smart home devices, and more!
Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that involves analyzing text to determine the sentiment behind it.
Sentiment analysis is used to understand the attitudes, opinions, and emotions expressed in a piece of writing, especially in user-generated content like reviews, social media posts, and survey responses.
The primary goal of sentiment analysis is to categorize text as positive, negative, or neutral, though more advanced systems can also detect specific emotions like happiness, anger, or disappointment.
Sentiment analysis is typically performed using machine learning algorithms that have been trained on large datasets of labeled text.
These approaches can range from simple rule-based systems that look for positive or negative keywords to advanced deep learning models that can understand context and subtle nuances in language.
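To illustrate the simplest end of that spectrum, here is a minimal rule-based sketch that merely counts positive and negative keywords; the tiny word lists are placeholders, nothing like a production lexicon:

```python
# A minimal rule-based sentiment sketch: count positive vs. negative keywords.
# The keyword lists are tiny placeholders for illustration only.
POSITIVE = {"good", "great", "excellent", "happy", "love"}
NEGATIVE = {"bad", "terrible", "awful", "angry", "hate"}

def keyword_sentiment(text: str) -> str:
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(keyword_sentiment("I love this product it is great"))  # positive
print(keyword_sentiment("The support was terrible"))         # negative
```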
As with any AI technology, the effectiveness of sentiment analysis depends heavily on the quality of the data it’s trained on, which needs to be diverse and representative.
Natural Language Processing (NLP) leverages machine learning (ML) in numerous ways to understand and manipulate human language. Initially, in NLP, raw text data undergoes preprocessing, where it’s broken down and structured through processes like tokenization and part-of-speech tagging. This is essential for machine learning (ML) algorithms, which thrive on structured data.
Once the data is ready, machine learning (ML) models are trained with these features. This training involves feeding the model vast amounts of text so it can learn patterns and associations. For example, in sentiment analysis, models learn to associate specific words and phrases with sentiments like happiness or anger.
A large part of NLP work involves supervised learning, where models learn from examples that are clearly labeled. For instance, in spam detection, models are trained on emails marked as ‘spam’ or ‘not spam’, learning to classify new emails accordingly. On the other hand, unsupervised learning in NLP allows models to find patterns in unlabeled text data, like identifying different topics in a collection of documents without any pre-defined categories.
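Here is a minimal sketch of that supervised spam detection setup, assuming scikit-learn; the handful of labeled emails is a placeholder, as a real system would need far more data:

```python
# A minimal supervised learning sketch with scikit-learn: spam vs. not spam.
# The training examples are placeholders; real systems need much more data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

emails = [
    "win a free prize now",
    "claim your free money today",
    "meeting rescheduled to friday",
    "please review the attached report",
]
labels = ["spam", "spam", "not spam", "not spam"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(emails, labels)

print(model.predict(["free prize waiting for you"]))    # likely 'spam'
print(model.predict(["see the report before friday"]))  # likely 'not spam'
```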
Deep learning, a more advanced subset of machine learning (ML), has revolutionized NLP. Neural networks, particularly those like recurrent neural networks (RNNs) and transformers, are adept at handling language. They excel in capturing contextual nuances, which is vital for understanding the subtleties of human language.
NLP tasks often involve sequence modeling, where the order of words and their context is crucial. RNNs and their advanced versions, like Long Short-Term Memory networks (LSTMs), are particularly effective for tasks that involve sequences, such as translating languages or recognizing speech.
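For illustration, here is a minimal sketch of such a sequence model in Keras; the vocabulary size, sequence length, and layer sizes are arbitrary placeholders rather than a tuned architecture:

```python
# A minimal sequence-model sketch in Keras: an embedding layer feeds an LSTM,
# which feeds a small classifier head. All sizes are arbitrary placeholders.
import tensorflow as tf

vocab_size = 10_000      # placeholder vocabulary size
embedding_dim = 64       # placeholder embedding dimensionality
sequence_length = 100    # placeholder (padded) sequence length

model = tf.keras.Sequential([
    tf.keras.Input(shape=(sequence_length,)),
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.LSTM(64),                        # reads the tokens in order
    tf.keras.layers.Dense(1, activation="sigmoid"),  # e.g. positive vs. negative
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```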
Another critical development in NLP is the use of transfer learning. Here, models pre-trained on large text datasets, like BERT and GPT, are fine-tuned for specific tasks. This approach has dramatically improved performance across various NLP applications, reducing the need for large labeled datasets in every new task.
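A minimal sketch of that idea using the Hugging Face transformers library (our choice for this example): a model pre-trained on large text corpora, and already fine-tuned for sentiment analysis, is reused as-is:

```python
# A minimal transfer-learning sketch using the Hugging Face transformers library:
# a pre-trained model, already fine-tuned for sentiment analysis, is reused as-is.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default pre-trained model

print(classifier("The new update made the app much faster."))
print(classifier("The checkout process keeps failing."))
```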
In some advanced applications, like interactive chatbots or language-based games, NLP systems employ reinforcement learning. This technique allows models to improve over time based on feedback, learning through a system of rewards and penalties.
In essence, ML provides the tools and techniques for NLP to process and generate human language, enabling a wide array of applications from automated translation services to sophisticated chatbots.
Granted, NLP technology has taken a huge leap forward, and all of that happened in just the past eight years! Imagine what can happen in the next eight! Without a shadow of a doubt, we will have a multitude of topics to cover on our blog!
If you’d like to see whether NLP can be implemented in your company, drop us a line. We will gladly guide you through the amazing world of AI. With our help, you’re on a straight course to improving the way your company works. It’s time to find out how!
[1] Wikipedia. Word2vec. URL: https://en.wikipedia.org/wiki/Word2vec. Accessed Jul 3, 2020.
[2] Jeffrey Pennington, Richard Socher, Christopher D. Manning. GloVe: Global Vectors for Word Representation. URL: https://nlp.stanford.edu/projects/glove/. Accessed Jul 3, 2020.
[3] Wikipedia. Bag-of-words model. URL: https://en.wikipedia.org/wiki/Bag-of-words_model. Accessed Jul 3, 2020.
[4] Stanford. Tokenization. URL: https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html. Accessed Jul 3, 2020.
[5] Rohin Attrey, Alexander Levit. The promise of natural language processing in healthcare. URL: https://ojs.lib.uwo.ca/index.php/uwomj/article/view/1152/4587. Accessed Jul 3, 2020.
[6] Amazon. Amazon Comprehend Medical. URL: https://aws.amazon.com/comprehend/medical/. Accessed Jul 3, 2020.
[7] Xavier Amatriain. NLP & Healthcare: Understanding the Language of Medicine. Nov 5, 2018. URL: https://medium.com/curai-tech/nlp-healthcare-understanding-the-language-of-medicine-e9917bbf49e7. Accessed Jul 3, 2020.
[8] Wikipedia. Google Assistant. URL: https://en.wikipedia.org/wiki/Google_Assistant. Accessed Jul 3, 2020.
[9] Alexandre Gonfalonieri. How Amazon Alexa works? Your guide to Natural Language Processing (AI). Nov 21, 2018. URL: https://towardsdatascience.com/how-amazon-alexa-works-your-guide-to-natural-language-processing-ai-7506004709d3. Accessed Jul 3, 2020.