Today, we want to tackle another fascinating field of Artificial Intelligence. NLP, which stands for Natural Language Processing, is a subset of AI that aims at reading, understanding, and deriving meaning from human language, both written and spoken. It’s one of those AI applications that anyone can experience simply by using a smartphone: Google Assistant, Alexa, and Siri are perfect examples of NLP algorithms in action. But there’s far more to NLP algorithms! Let’s examine NLP solutions a bit closer and find out how they are used today.
Modern, neural-network-based NLP is a relatively new technology. One of its first major leaps forward happened in 2013, when Word2Vec was introduced. Word2Vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2Vec takes a large corpus of text as its input and produces a vector space, typically of several hundred dimensions, in which each unique word in the corpus is assigned a corresponding vector. At this point, we ought to explain what corpora are.
A linguistic corpus is a dataset of representative words, sentences, and phrases in a given language. Typically, corpora consist of texts from books, magazines, newspapers, and internet portals. Sometimes they also contain less formal forms and expressions, for instance, ones originating from chats and instant messengers. All in all, the main idea is to help machines understand the way people talk and communicate.
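To make the idea of “reconstructing linguistic contexts” a bit more concrete, here is a minimal sketch of how training pairs might be collected from a corpus for a skip-gram-style model (one of the Word2Vec variants). The function name `skipgram_pairs`, the window size, and the sample sentence are assumptions made for illustration; this is not the original Word2Vec implementation, and it covers only the data preparation step, not the neural network itself.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for all words within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# Toy corpus (an assumption); real corpora contain millions of words.
corpus = "the cat sat on the mat".split()
pairs = skipgram_pairs(corpus, window=1)
print(pairs[:4])
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```

A model trained on such pairs learns to predict the context word from the target word, and the weights it learns along the way serve as the word vectors.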
A different approach to NLP algorithms
In 2014, Stanford’s research group introduced another NLP algorithm: GloVe. It’s an unsupervised learning algorithm for obtaining vector representations for words. According to Stanford’s website, GloVe’s training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
Stanford’s researchers proposed that the best way to encode the semantic meaning of words is through the global word-word co-occurrence matrix, as opposed to the local context windows used in Word2Vec. The GloVe algorithm represents words as vectors in such a way that the difference between two word vectors, multiplied (as a dot product) by a context word vector, is equal to the logarithm of the ratio of the two words’ co-occurrence probabilities with that context word.
These two algorithms have significantly accelerated the pace at which NLP algorithms develop. There are, however, several problems with this technology.
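The co-occurrence statistics that GloVe starts from can be sketched in a few lines. The example below is a simplified illustration, not Stanford’s implementation: the function names, the toy sentences, and the window size are assumptions, and plain counts are used instead of any distance weighting.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=3):
    """Count how often each (word, context) pair appears within `window` words."""
    counts = defaultdict(float)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1.0
    return counts


def cooccurrence_probability(counts, word, context):
    """Estimate P(context | word) from the raw counts."""
    total = sum(c for (w, _), c in counts.items() if w == word)
    return counts.get((word, context), 0.0) / total if total else 0.0


# Toy corpus (an assumption) built around the words "ice" and "steam".
sentences = [
    "ice is solid",
    "steam is gas",
    "ice is cold water",
    "steam is hot water",
]
counts = cooccurrence_counts(sentences)

p_ice_water = cooccurrence_probability(counts, "ice", "water")
p_steam_water = cooccurrence_probability(counts, "steam", "water")
print(p_ice_water / p_steam_water)                          # 1.0
print(cooccurrence_probability(counts, "ice", "solid"))     # 0.2
print(cooccurrence_probability(counts, "steam", "solid"))   # 0.0
```

In this toy corpus, “water” is equally likely next to “ice” and “steam”, so the ratio is 1, while “solid” only ever appears with “ice”. GloVe trains its vectors so that such ratios are reflected in the geometry of the vector space.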
The largest NLP-related challenge is the fact that the process of understanding and manipulating language is extremely complex. The same words can be used in different contexts, with different meanings and intents. And then there are idioms and slang, which are incredibly difficult for machines to understand. On top of all that, language is a living thing: it constantly evolves, and that fact has to be taken into consideration.
One of the most common problems is sentiment analysis. It’s all about determining the attitude or emotional reaction of a speaker/writer toward a particular topic. Possible sentiments are positive, neutral, and negative. What’s easy and natural for humans is incredibly difficult for machines.
Naturally, data scientists and NLP specialists try to overcome these issues and train NLP algorithms so that they operate as efficiently as possible. There are many models and methods that help NLP algorithms understand and derive meaning from text. In this article, we are going to show you four exemplary techniques:
BAG OF WORDS
In this model, a text is represented as a bag (multiset) of words (hence its name), disregarding grammar and even word order, but keeping multiplicity. Basically, the bag of words model creates an occurrence matrix. These word frequencies or occurrences are then used as features for training a classifier.
Unfortunately, this model has several downsides. The biggest are the absence of semantic meaning and context, and the fact that words are not weighted by how informative they are (for instance, in this model, the word “universe” can weigh less than the word “they”, simply because it occurs less often).
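The occurrence matrix behind the bag of words model can be sketched with the Python standard library alone. The function name `bag_of_words` and the two sample documents are assumptions for illustration; libraries used in practice offer far more options.

```python
from collections import Counter


def bag_of_words(documents):
    """Return a sorted vocabulary and one row of word counts per document."""
    bags = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*bags))
    matrix = [[bag[word] for word in vocabulary] for bag in bags]
    return vocabulary, matrix


documents = ["they saw the universe", "they said they saw it"]
vocabulary, matrix = bag_of_words(documents)
print(vocabulary)  # ['it', 'said', 'saw', 'the', 'they', 'universe']
print(matrix)      # [[0, 0, 1, 1, 1, 1], [1, 1, 1, 0, 2, 0]]
```

Note how word order is lost and how “they” ends up with a higher count than “universe”, which is exactly the weighting problem of this model.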
TOKENIZATION
Tokenization is the process of segmenting text into sentences and words. In essence, it’s the task of cutting a text into smaller pieces (called tokens), and at the same time throwing away certain characters, such as punctuation.
Input text: Peter went to school yesterday.
Output text: Peter, went, to, school, yesterday
The major downside of this technique is the fact that it works well with some languages, and much worse with others. This is especially the case for languages such as Mandarin, where written text does not separate words with spaces, so word boundaries have to be inferred. Tonal languages, like Mandarin or Vietnamese, add a further difficulty for spoken input. For instance, depending on the tone, the Mandarin syllable ma can mean “a horse”, “hemp”, “to scold”, or “a mother”. It’s a real challenge for NLP algorithms!
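For English text, a very simple tokenizer can be sketched with regular expressions. The function names and the splitting rules below are assumptions chosen for brevity; they reproduce the “Peter went to school yesterday” example, but they will mishandle abbreviations, contractions such as “don’t”, and languages that do not put spaces between words.

```python
import re


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Return runs of word characters, dropping punctuation."""
    return re.findall(r"\w+", sentence)


text = "Peter went to school yesterday. He came back late!"
for sentence in split_sentences(text):
    print(tokenize(sentence))
# ['Peter', 'went', 'to', 'school', 'yesterday']
# ['He', 'came', 'back', 'late']
```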
STOP WORDS REMOVAL
This technique is based on removing words that provide little or no value to the NLP algorithm. They are called stop words (for instance, “and”, “the”, or “an”), and they are removed from the text before it’s processed.
There are a couple of advantages to this method:
- The NLP algorithm’s database is not overloaded with words that aren’t useful
- The text is processed more quickly
- The algorithm can be trained more quickly because the training set contains only vital information
Naturally, there are also downsides. There is always a risk that stop word removal will wipe out relevant information and change the meaning of a given sentence. That’s why it’s immensely important to select the stop words carefully, and to exclude ones that can change the meaning of a sentence (like, for example, “not”).
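A minimal sketch of stop word removal is shown below. The stop word list is a tiny, hand-picked assumption (real lists contain a hundred or more entries), and “not” is deliberately left out of it so that negation survives.

```python
# Hypothetical, deliberately short stop word list; "not" is excluded on purpose.
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word list."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = "The movie is not good and the ending is weak".split()
filtered = remove_stop_words(tokens)
print(filtered)  # ['movie', 'not', 'good', 'ending', 'weak']
```

Had “not” been on the list, the result would start with “movie good”, which reverses the sentiment of the original sentence.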
LEMMATIZATION
This technique is based on reducing a word to its base form and grouping together different forms of the same word. In this technique:
- Verbs in the past tense are changed into the present (worked to work)
- Irregular forms are mapped to their dictionary form (better to good)
This technique is all about reaching the root (lemma) of each word. Let’s take the words “working”, “worked”, and “works”. All of them are different forms of the word “work”. So “work” is our lemma: the base form that we want to use.
The lemmatization technique takes the context of the word into consideration, in order to solve other problems like disambiguation, where one word can have two or more meanings. Take the word “cancer”: it can mean either a severe disease or a marine animal. It’s the context that allows you to decide which meaning is correct. And this is what lemmatization is about.
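The role of context in lemmatization can be illustrated with a toy, lookup-based sketch. Everything here is an assumption made for illustration: the table is hand-written and tiny, and the part-of-speech tag that stands in for “context” is supplied by the caller, whereas real lemmatizers rely on large dictionaries and an automatic tagger.

```python
# Hypothetical lookup table: (word form, part of speech) -> lemma.
LEMMA_TABLE = {
    ("working", "VERB"): "work",
    ("worked", "VERB"): "work",
    ("works", "VERB"): "work",
    ("works", "NOUN"): "work",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def lemmatize(word, pos):
    """Return the lemma for a word in a given part of speech.

    Unknown words are returned unchanged (lowercased).
    """
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


for form in ["working", "worked", "works"]:
    print(form, "->", lemmatize(form, "VERB"))   # all three map to "work"

# The same surface form gets a different lemma depending on its context.
print(lemmatize("meeting", "VERB"))  # meet    ("We are meeting tomorrow")
print(lemmatize("meeting", "NOUN"))  # meeting ("The meeting was long")
```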
That’s the theory. Now, let’s talk about the practical implementation of this technology. We will show you two common applications of NLP algorithms: one in medicine and one in mobile devices.
NLP algorithms in medicine
NLP plays an increasingly important role in medicine. It enables the recognition and prediction of diseases based on patients’ electronic health records and their speech. According to the paper “The promise of natural language processing in healthcare”, published in The University of Western Ontario Medical Journal, medical NLP algorithms can be divided into four major categories:
- For patients: this includes teletriage services (telephone access to triage nurses who can provide basic disease information and instructions), where NLP-powered chatbots could replace nurses and physicians in the most straightforward cases.
- For physicians: a computerized Clinical Decision Support System (CDSS) can be enhanced by NLP algorithms. For instance, it can alert physicians about rare conditions of a given patient. This could prove to be a life-saving improvement!
- For researchers: NLP offers great methodological promise for qualitative research. It helps physicians and researchers to enable, empower, and accelerate qualitative studies in a number of ways.
- For healthcare management: NLP can take patient reviews and opinions in the form of unstructured and unrestricted feedback and create meaningful summaries for healthcare management teams.
As you can see, almost every group within the medical field can benefit from NLP. Now, let’s take a look at two real-life examples of functioning NLP algorithms:
AMAZON COMPREHEND MEDICAL
One of the most outstanding examples of NLP application in medicine is Amazon Comprehend Medical (ACM), which is part of Amazon Web Services. ACM is an NLP service that makes it easy to use machine learning to extract medical information from unstructured text. With ACM, physicians can quickly and accurately gather information, such as medical condition, medication, dosage, strength, and frequency, from a variety of sources like doctors’ notes, clinical trial reports, and patient health records.
This project’s idea is based on the fact that a lot of patient data is “trapped” in free-form medical texts, especially hospital admission notes and patients’ medical histories. These materials are frequently hand-written and, on many occasions, difficult for other people to read. ACM can help extract information from these texts.
CURAI
Another interesting example is Curai. The company is trying to build an AI-fueled care service that involves many NLP tasks. For instance, they’re working on a question-answering NLP service for both patients and physicians. Let’s say a patient wants to know whether they can take Mucinex while on a Z-Pack. Curai wants to extract the relevant medical terms and surrounding context from such questions, use them to retrieve the documents most responsive to the question terms from a repository of curated answers, and provide the patient with an answer. Fully automatically! Their ultimate goal is to develop a “dialogue system that can lead a medically sound conversation with a patient”.
NLP algorithms in mobile devices
Each one of us has instant access to a sophisticated NLP algorithm. How so? Think of your mobile device. Currently, there are four major voice assistants that are nothing but real-life applications of NLP algorithms!
These assistants are:
- Microsoft Cortana: Personal productivity assistant in Microsoft 365.
- Google Assistant: Virtual assistant developed by Google, primarily available on mobile and smart home devices. Interestingly, Google Assistant can engage in two-way conversations.
- Amazon Alexa: Virtual assistant developed by Amazon, originally for Echo smart speakers.
- Apple Siri: Virtual assistant built by Apple for their operating systems.
Let’s take Amazon Alexa as an example and look at how these assistants work. What you perceive as a simple and short “conversation” is actually a sophisticated, four-step process:
Step 1: The recording of your speech is sent to Amazon’s servers to be analyzed.
Step 2: Alexa breaks down your “orders” into individual sounds. Next, these sounds are compared against a database containing various word pronunciations to find which words most closely correspond to the combination of sounds.
Step 3: Alexa identifies crucial words to understand the sense of your order and carries out the corresponding functions.
Step 4: Amazon’s servers send the information back to your device so that Alexa speaks the answer.
That’s impressive, especially given that it all happens within several seconds! Today, you can use these intelligent assistants to open specific apps on your mobile device, take notes, control your smart home devices, and more!
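Step 3 above, identifying crucial words and mapping them to a function, can be illustrated with a toy keyword matcher. This is only a sketch under loud assumptions: the intent names, the keyword sets, and the matching rule are invented for illustration and say nothing about how Alexa is actually implemented.

```python
# Hypothetical intents and trigger keywords, invented for illustration only.
INTENT_KEYWORDS = {
    "play_music": {"play", "music", "song"},
    "set_timer": {"timer", "alarm", "remind"},
    "weather": {"weather", "forecast", "rain"},
}


def identify_intent(transcript):
    """Return the intent sharing the most keywords with the transcript, or None."""
    words = set(transcript.lower().split())
    best_intent, best_overlap = None, 0
    for intent, keywords in INTENT_KEYWORDS.items():
        overlap = len(words & keywords)
        if overlap > best_overlap:
            best_intent, best_overlap = intent, overlap
    return best_intent


print(identify_intent("Alexa play my favourite song"))        # play_music
print(identify_intent("what is the weather forecast today"))  # weather
print(identify_intent("hello there"))                         # None
```

The input here is assumed to be the text produced by steps 1 and 2; a production assistant would use trained language-understanding models rather than fixed keyword sets.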
Granted, NLP technology has taken a huge leap forward! And all of that happened in just the past eight years! Imagine what can happen in the next eight! Without a shadow of a doubt, we will have a multitude of topics to cover on our blog!
If you’d like to see whether NLP can be implemented in your company, drop us a line. We will gladly guide you through the amazing AI world. With our help, you’re on a straight course to improve the way your company works. It’s time to find out how!
Wikipedia. Word2vec. URL: https://en.wikipedia.org/wiki/Word2vec. Accessed Jul 3, 2020.
Jeffrey Pennington, Richard Socher, Christopher D. Manning. GloVe: Global Vectors for Word Representation. URL: https://nlp.stanford.edu/projects/glove/. Accessed Jul 3, 2020.
Wikipedia. Bag-of-words model. URL: https://en.wikipedia.org/wiki/Bag-of-words_model. Accessed Jul 3, 2020.
Stanford. Tokenization. URL: https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html. Accessed Jul 3, 2020.
Rohin Attrey, Alexander Levit. The promise of natural language processing in healthcare. URL: https://ojs.lib.uwo.ca/index.php/uwomj/article/view/1152/4587. Accessed Jul 3, 2020.
Amazon. Amazon Comprehend Medical. URL: https://aws.amazon.com/comprehend/medical/. Accessed Jul 3, 2020.
Xavier Amatriain. NLP & Healthcare: Understanding the Language of Medicine. Nov 5, 2018. URL: https://medium.com/curai-tech/nlp-healthcare-understanding-the-language-of-medicine-e9917bbf49e7. Accessed Jul 3, 2020.
Wikipedia. Google Assistant. URL: https://en.wikipedia.org/wiki/Google_Assistant. Accessed Jul 3, 2020.
Alexandre Gonfalonieri. How Amazon Alexa works? Your guide to Natural Language Processing (AI). Nov 21, 2018. URL: https://towardsdatascience.com/how-amazon-alexa-works-your-guide-to-natural-language-processing-ai-7506004709d3. Accessed Jul 3, 2020.