July 03, 2020

NLP Algorithms: Definition, Types & Examples (update: 2024)

Author: Artur Haponik, CEO & Co-Founder

Reading time: 16 minutes


Today, we want to tackle another fascinating field of Artificial Intelligence. NLP, which stands for Natural Language Processing, is a subset of AI that aims to read, understand, and derive meaning from human language, both written and spoken. It’s one of those AI applications that anyone can experience simply by using a smartphone: Google Assistant, Alexa, and Siri are perfect examples of NLP algorithms in action. But there’s far more to NLP algorithms! Let’s examine NLP solutions a bit closer and find out how they’re utilized today.

NLP is a relatively new technology. The first major leap forward in the field of natural language processing (NLP) happened in 2013, when Word2Vec[1] was introduced: a group of related models used to produce word embeddings. These models are basically two-layer neural networks trained to reconstruct the linguistic contexts of words. Word2Vec takes a large corpus of text as its input and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus assigned a corresponding vector in that space. At this point, we ought to explain what corpora (the plural of corpus) are.

A linguistic corpus is a dataset of representative words, sentences, and phrases in a given language. Typically, corpora consist of books, magazines, newspapers, and internet portals. Sometimes they also contain less formal forms and expressions, for instance, originating from chats and instant messengers. All in all, the main idea is to help machines understand the way people talk and communicate.

NLP Algorithms: Understanding Natural Language Processing (NLP)

Natural Language Processing (NLP) is a field at the intersection of computer science, artificial intelligence, and linguistics. It focuses on the interaction between computers and human (natural) languages. The primary goal of NLP is to enable computers to understand, interpret, and respond to human language in a way that is both meaningful and useful.

What are NLP Algorithms?

NLP algorithms are a set of methods and techniques designed to process, analyze, and understand human language. These algorithms enable computers to perform a variety of tasks involving natural language, such as translation, sentiment analysis, and topic extraction. The development and refinement of these algorithms are central to advances in Natural Language Processing (NLP).

Types of NLP Algorithms

  • Speech Recognition: Converting spoken language into text (e.g., voice-controlled assistants).
  • Natural Language Understanding: Comprehending and interpreting natural language, including context and intent.
  • Natural Language Generation: Generating human-like text from computer data (like what I’m doing right now).
  • Machine Translation: Translating text or speech from one language to another.
  • Sentiment Analysis: Determining the emotional tone behind a series of words, to gain an understanding of the attitudes, opinions, and emotions expressed within an online mention.
  • Text Classification and Categorization: Assigning categories or labels to text to easily organize and manage large volumes of information (e.g., spam detection in emails).
  • Chatbots and Virtual Assistants: Simulating conversational interactions with users to perform tasks or provide information.

A different approach to Natural Language Processing (NLP) algorithms

In 2014, Stanford’s research group introduced another Natural Language Processing (NLP) algorithm: GloVe[2]. It’s an unsupervised learning algorithm for obtaining vector representations for words. According to Stanford’s website, GloVe’s training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

Stanford’s researchers proposed a different approach. They argued that the best way to encode the semantic meaning of words is through the global word-word co-occurrence matrix, as opposed to the local co-occurrence windows used by Word2Vec. The GloVe algorithm represents words as vectors in such a way that the difference between two word vectors, projected onto a context word’s vector, reflects the ratio of their co-occurrence probabilities with that context word.
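To make the idea of global co-occurrence statistics concrete, here is a minimal sketch in plain Python of the word-word co-occurrence counts that GloVe trains on. The toy corpus and window size are arbitrary choices for illustration; the real GloVe pipeline builds these counts from corpora with billions of tokens and then fits word vectors to them.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, window=1):
    """Count how often each pair of words appears within `window` positions of each other."""
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[word][tokens[j]] += 1
    return counts

tokens = "the cat sat on the mat the cat slept".split()
counts = cooccurrence_counts(tokens)
print(counts["the"]["cat"])  # 2
```

Each entry of this matrix says how often two words appear near each other across the whole corpus, which is exactly the "global" statistic GloVe fits its vectors to.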

These two algorithms have significantly accelerated the pace of Natural Language Processing (NLP) algorithms development. There are, however, several problems with this technology.


Natural Language Processing (NLP) training

The largest NLP-related challenge is the fact that the process of understanding and manipulating language is extremely complex. The same words can be used in different contexts, with different meanings and intents. And then there are idioms and slang, which are incredibly difficult for machines to understand. On top of all that, language is a living thing: it constantly evolves, and that fact has to be taken into consideration.

One of the most common tasks is sentiment analysis: determining the attitude or emotional reaction of a speaker or writer toward a particular topic. Possible sentiments are positive, neutral, and negative. What’s easy and natural for humans is incredibly difficult for machines.

Naturally, data scientists and NLP specialists try to overcome these issues and train Natural Language Processing (NLP) algorithms to operate as efficiently as possible. There are many models and methods for training NLP algorithms so that they can understand and derive meaning from text. In this article, we are going to show you four example techniques:


Bag of words

In this model, a text is represented as a bag (multiset) of words (hence its name), disregarding grammar and even word order, but keeping multiplicity[3]. Basically, the bag of words model creates an occurrence matrix. These word frequencies or occurrences are then used as features for training a classifier.

Unfortunately, this model has several downsides. The biggest is the absence of semantic meaning and context, and the fact that words are not weighted by how informative they are (for instance, in this model, a rare, meaningful word like “universe” can weigh less than a frequent but uninformative word like “they”).
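A minimal bag-of-words sketch in plain Python, with invented toy documents for illustration:

```python
from collections import Counter

def bag_of_words(documents):
    """Build an occurrence matrix: one row per document, one column per vocabulary word."""
    vocab = sorted({word for doc in documents for word in doc.lower().split()})
    matrix = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        matrix.append([counts[word] for word in vocab])
    return vocab, matrix

docs = ["John likes movies", "Mary likes movies too"]
vocab, matrix = bag_of_words(docs)
print(vocab)   # ['john', 'likes', 'mary', 'movies', 'too']
print(matrix)  # [[1, 1, 0, 1, 0], [0, 1, 1, 1, 1]]
```

Notice that the two rows say nothing about word order or meaning, only about which words occur and how often. That is exactly the limitation described above.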

Tokenization

It’s the process of segmenting text into sentences and words. In essence, it’s the task of cutting a text into smaller pieces (called tokens), and at the same time throwing away certain characters, such as punctuation[4].

For example:

Input text: Peter went to school yesterday.

Output text: Peter, went, to, school, yesterday

The major downside of this technique is that it works well with some languages and much worse with others. This is especially the case with tonal languages, like Mandarin or Vietnamese. For instance, depending on the tone, the Mandarin word “ma” can mean “a horse”, “hemp”, “a scold”, or “a mother”. It’s a real challenge for NLP algorithms!
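For languages with space-separated words, a basic tokenizer can be sketched in a few lines, using the article’s example sentence (real tokenizers handle many more edge cases, such as contractions, hyphens, and languages without word boundaries):

```python
import re

def tokenize(text):
    """Split text into word tokens, discarding punctuation characters."""
    return re.findall(r"[A-Za-z0-9]+", text)

print(tokenize("Peter went to school yesterday."))
# ['Peter', 'went', 'to', 'school', 'yesterday']
```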

Stop words removal

This technique is based on removing words that provide little or no value to the NLP algorithm. These so-called stop words, for instance “and”, “the”, or “an”, are removed from the text before it’s processed.

There are a couple of advantages to this method:

  • The NLP algorithm’s database is not overloaded with words that aren’t useful
  • The text is processed more quickly
  • The algorithm can be trained more quickly because the training set contains only vital information

Naturally, there are also downsides. There is always a risk that the stop word removal can wipe out relevant information and modify the context in a given sentence. That’s why it’s immensely important to carefully select the stop words, and exclude ones that can change the meaning of a word (like, for example, “not”).
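A minimal stop-word-removal sketch; the stop word list here is a tiny, hand-picked example, and note that “not” is deliberately kept, for the reason just described:

```python
# A deliberately small stop word list; "not" is excluded from it on purpose,
# because removing it would flip the meaning of negated sentences.
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of"}

def remove_stop_words(tokens):
    """Drop tokens that carry little information for downstream processing."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["the", "movie", "is", "not", "a", "masterpiece"]))
# ['movie', 'not', 'masterpiece']
```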


Lemmatization

This technique is based on reducing a word just to its base form and grouping together different forms of the same word. In this technique:

  • Verbs in the past tense are changed into the present (worked to work)
  • Plural forms are reduced to the singular (mice to mouse)

This technique is all about reaching the root (lemma) of each word. Let’s take the words “working”, “worked”, and “works”. All of them are different forms of the word “work”. So “work” is our lemma, the base form that we want to use.

The lemmatization technique takes the context of the word into consideration in order to help with problems like disambiguation, where one word can have two or more meanings. Take the word “cancer”: it can mean either a severe disease or a marine animal. It’s the context that allows you to decide which meaning is correct.
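A toy lemmatization sketch based on a small, hand-made lookup table; real lemmatizers, such as those in NLTK or spaCy, rely on large dictionaries plus part-of-speech context rather than a hard-coded map like this:

```python
# A toy lemma lookup table, invented for illustration only.
LEMMAS = {
    "working": "work", "worked": "work", "works": "work",
    "went": "go", "mice": "mouse",
}

def lemmatize(tokens):
    """Map each token to its base form (lemma), falling back to the lowercased token."""
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]

print(lemmatize(["Working", "mice", "went", "home"]))
# ['work', 'mouse', 'go', 'home']
```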

That’s the theory. Now, let’s talk about the practical implementation of this technology. We will show you two common applications of NLP algorithms: one in the medical field and one in the mobile devices field.

Natural Language Processing (NLP) algorithms in medicine

NLP plays a more and more important role in medicine. It enables the recognition and prediction of diseases based on patient electronic health records and their speech. According to the paper called “The promise of natural language processing in healthcare”[5] published in The University of Western Ontario Medical Journal, medical NLP algorithms can be divided into four major categories:

  • For patients: this includes teletriage services (telephone access to triage nurses who can provide basic disease information and instructions), where NLP-powered chatbots could replace nurses and physicians in the most straightforward cases.
  • For physicians: a computerized Clinical Decision Support System (CDSS) can be enhanced by NLP algorithms. For instance, it can alert physicians about rare conditions of a given patient. This could prove to be a life-saving improvement!
  • For researchers: NLP offers great methodological promise for qualitative research. It helps physicians and researchers enable, empower, and accelerate qualitative studies across a number of vectors.
  • For healthcare management: NLP can take patient reviews and opinions in the form of unstructured and unrestricted feedback and create meaningful summaries for healthcare management teams.


As you can see, almost every group within the medical field can benefit from NLP in medicine. Now, let’s take a look at two real-life examples of functioning NLP algorithms:

Amazon Comprehend Medical

One of the most outstanding examples of NLP application in medicine is Amazon Comprehend Medical, part of Amazon Web Services. ACM is an NLP service that makes it easy to use machine learning to extract medical information from unstructured text. With ACM, physicians can quickly and accurately gather information such as medical condition, medication, dosage, strength, and frequency from a variety of sources like doctors’ notes, clinical trial reports, and patient health records[6].

This project’s idea is based on the fact that a lot of patient data is “trapped” in free-form medical texts, especially hospital admission notes and patients’ medical histories. These materials are frequently handwritten and, on many occasions, difficult for other people to read. ACM can help extract information from these texts.

Curai

Another interesting example is Curai[7]. They are building an AI-fueled care service that involves many NLP tasks. For instance, they’re working on a question-answering NLP service for both patients and physicians. Let’s say a patient wants to know if they can take Mucinex while on a Z-Pack. Curai wants to extract the relevant medical terms and surrounding context from such questions, use them to retrieve the most relevant documents from a repository of curated answers, and provide the patient with an answer. Fully automatically! Their ultimate goal is to develop a “dialogue system that can lead a medically sound conversation with a patient”.

NLP algorithms in mobile devices

Each one of us has instant access to a sophisticated NLP algorithm. How so? Think of your mobile device. Currently, there are four major voice assistants that are nothing but real-life applications of NLP algorithms!


These assistants are:

  • Microsoft Cortana: It’s a personal productivity assistant in Microsoft 365.
  • Google Assistant: Virtual assistant developed by Google, primarily available on mobile and smart home devices. Interestingly, Google Assistant can engage in two-way conversations[8].
  • Amazon Alexa: Virtual assistant developed by Amazon, originally for Echo smart speakers.
  • Apple Siri: Virtual assistant built by Apple for their operating systems.

Using Amazon Alexa as an example, let’s take a look at how these assistants work. What you perceive as a simple, short “conversation” is actually a sophisticated, four-step process[9]:

Step 1: The recording of your speech is sent to Amazon’s servers to be analyzed.

Step 2: Alexa breaks down your “orders” into individual sounds. Next, they are sent to a database containing various word pronunciations to find the words that most closely correspond to the combination of your individual sounds.

Step 3: Alexa identifies crucial words to understand the sense of your order and carry out corresponding functions.

Step 4: Amazon’s servers send the information back to your device so that Alexa speaks the answer.
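The four steps above can be sketched as a toy pipeline. Every name and piece of logic here is a hypothetical placeholder, not Amazon’s actual implementation: the transcription step is stubbed out, and the intent extraction is a bare keyword match.

```python
def transcribe(audio):
    # Steps 1-2: server-side speech-to-text (stubbed with a fixed result here).
    return "play jazz music"

def extract_intent(text):
    # Step 3: identify the crucial words that carry the command.
    known_actions = {"play", "stop", "open"}
    words = text.split()
    action = next((w for w in words if w in known_actions), None)
    return {"action": action, "object": " ".join(w for w in words if w != action)}

def respond(intent):
    # Step 4: formulate the answer that is sent back to the device.
    return f"OK, I will {intent['action']} {intent['object']}"

print(respond(extract_intent(transcribe(b"..."))))
# OK, I will play jazz music
```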

That’s impressive, especially given that it all happens within a few seconds! Today, you can use these intelligent assistants to open specific apps on your mobile device, take notes, control your smart home devices, and more!


Sentiment analysis: explanation and use cases

Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that involves analyzing text to determine the sentiment behind it.

Sentiment analysis is used to understand the attitudes, opinions, and emotions expressed in a piece of writing, especially in user-generated content like reviews, social media posts, and survey responses.

The primary goal of sentiment analysis is to categorize text as positive, negative, or neutral, though more advanced systems can also detect specific emotions like happiness, anger, or disappointment.

Uses of sentiment analysis in business areas

  • Social Media monitoring:
    Companies use sentiment analysis to monitor and analyze social media posts about their brand, products, or services. This helps them gauge public opinion and react appropriately.
  • Customer feedback analysis:
    Analyzing customer reviews and survey responses to understand customer satisfaction and improve products or services.
  • Market research:
    Businesses use sentiment analysis to research market trends, track competitors, and understand consumer responses to campaigns or product launches.
  • Stock market prediction:
    Sentiment analysis of news articles, financial reports, and social media can be used to make investment decisions.
  • Public opinion analysis:
    Politicians and political parties use sentiment analysis to gauge public opinion on policies, debates, and election campaigns.
  • Customer Support:
    Automating responses in customer service by analyzing the sentiment of customer inquiries and routing them to the appropriate department or escalating issues based on urgency.
  • Medical Analysis:
    Analyzing patient feedback, social media discussions, and medical literature to understand patient sentiments and concerns about treatments and healthcare services.
  • Content recommendation:
    Streaming services and content platforms use sentiment analysis to understand user reviews and preferences, improving their recommendation algorithms.
  • Human Resources:
    Analyzing employee feedback, survey responses, and communication to gauge overall employee satisfaction and organizational health.
  • Public Relations:
    Monitoring news and social media to quickly identify and address negative sentiments or public relations crises.

Sentiment analysis is typically performed using machine learning algorithms that have been trained on large datasets of labeled text.

Machine learning algorithms can range from simple rule-based systems that look for positive or negative keywords to advanced deep learning models that can understand context and subtle nuances in language.
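The simple rule-based end of that spectrum can be sketched in a few lines. The keyword lists below are tiny, hand-picked examples; production systems use curated lexicons or learned models instead:

```python
# Toy sentiment lexicons, invented for illustration only.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "angry"}

def keyword_sentiment(text):
    """Label text by counting positive vs. negative keywords."""
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(keyword_sentiment("I love this great product"))  # positive
print(keyword_sentiment("terrible support bad app"))   # negative
```

A system this naive fails on negation (“not great”), sarcasm, and context, which is precisely why the more advanced learned models mentioned above are needed.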

As with any AI technology, the effectiveness of sentiment analysis can be influenced by the quality of the data it’s trained on, including the need for it to be diverse and representative.

Natural Language Processing (NLP) and Machine Learning

Natural Language Processing (NLP) leverages machine learning (ML) in numerous ways to understand and manipulate human language. Initially, in NLP, raw text data undergoes preprocessing, where it’s broken down and structured through processes like tokenization and part-of-speech tagging. This is essential for machine learning (ML) algorithms, which thrive on structured data.

Once the data is ready, machine learning (ML) models are trained with these features. This training involves feeding the model vast amounts of text so it can learn patterns and associations. For example, in sentiment analysis, models learn to associate specific words and phrases with sentiments like happiness or anger.

A large part of NLP work involves supervised learning, where models learn from examples that are clearly labeled. For instance, in spam detection, models are trained on emails marked as ‘spam’ or ‘not spam’, learning to classify new emails accordingly. On the other hand, unsupervised learning in NLP allows models to find patterns in unlabeled text data, like identifying different topics in a collection of documents without any pre-defined categories.
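As an illustration of supervised learning for spam detection, here is a tiny Naive Bayes classifier sketch in plain Python. The training examples are invented, and real systems train on far larger labeled datasets with better features:

```python
import math
from collections import Counter

def train(labeled_docs):
    """Count word frequencies per class from (text, label) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, class_counts

def classify(text, word_counts, class_counts):
    """Return the class with the highest log-probability score (add-one smoothing)."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    best_label, best_score = None, float("-inf")
    for label in ("spam", "ham"):
        total = sum(word_counts[label].values())
        # Prior: how common the class is overall.
        score = math.log(class_counts[label] / sum(class_counts.values()))
        # Likelihood: how typical each word is for the class.
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

data = [
    ("win a free prize now", "spam"),
    ("cheap meds free offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch with the team", "ham"),
]
word_counts, class_counts = train(data)
print(classify("free prize offer", word_counts, class_counts))  # spam
```

The model never sees rules written by hand; it learns which words are typical of each class purely from the labeled examples, which is the essence of supervised learning described above.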

Deep learning, a more advanced subset of machine learning (ML), has revolutionized NLP. Neural networks, particularly those like recurrent neural networks (RNNs) and transformers, are adept at handling language. They excel in capturing contextual nuances, which is vital for understanding the subtleties of human language.

NLP tasks often involve sequence modeling, where the order of words and their context is crucial. RNNs and their advanced versions, like Long Short-Term Memory networks (LSTMs), are particularly effective for tasks that involve sequences, such as translating languages or recognizing speech.

Another critical development in NLP is the use of transfer learning. Here, models pre-trained on large text datasets, like BERT and GPT, are fine-tuned for specific tasks. This approach has dramatically improved performance across various NLP applications, reducing the need for large labeled datasets in every new task.

In some advanced applications, like interactive chatbots or language-based games, NLP systems employ reinforcement learning. This technique allows models to improve over time based on feedback, learning through a system of rewards and penalties.

In essence, ML provides the tools and techniques for NLP to process and generate human language, enabling a wide array of applications from automated translation services to sophisticated chatbots.

Conclusion

Granted, NLP technology has taken a huge leap forward, and all of that happened in just the past several years! Imagine what can happen in the next several years! Without a shadow of a doubt, we will have a multitude of topics to cover on our blog!

If you’d like to see whether NLP can be implemented in your company, drop us a line. We will gladly guide you through the amazing AI world. With our help, you’re on a straight course to improving the way your company works. It’s time to find out how!


References

[1] Wikipedia. Word2vec. URL: https://en.wikipedia.org/wiki/Word2vec. Accessed Jul 3, 2020.

[2] Jeffrey Pennington, Richard Socher, Christopher D. Manning. GloVe: Global Vectors for Word Representation. URL: https://nlp.stanford.edu/projects/glove/. Accessed Jul 3, 2020.

[3] Wikipedia. Bag-of-words model. URL: https://en.wikipedia.org/wiki/Bag-of-words_model. Accessed Jul 3, 2020.

[4] Stanford. Tokenization. URL: https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html. Accessed Jul 3, 2020.

[5] Rohin Attrey, Alexander Levit. The promise of natural language processing in healthcare. URL: https://ojs.lib.uwo.ca/index.php/uwomj/article/view/1152/4587. Accessed Jul 3, 2020.

[6] Amazon. Amazon Comprehend Medical. URL: https://aws.amazon.com/comprehend/medical/. Accessed Jul 3, 2020.

[7] Xavier Amatriain. NLP & Healthcare: Understanding the Language of Medicine. Nov 5, 2018. URL: https://medium.com/curai-tech/nlp-healthcare-understanding-the-language-of-medicine-e9917bbf49e7. Accessed Jul 3, 2020.

[8] Wikipedia. Google Assistant. URL: https://en.wikipedia.org/wiki/Google_Assistant. Accessed Jul 3, 2020.

[9] Alexandre Gonfalonieri. How Amazon Alexa works? Your guide to Natural Language Processing (AI). Nov 21, 2018. URL: https://towardsdatascience.com/how-amazon-alexa-works-your-guide-to-natural-language-processing-ai-7506004709d3. Accessed Jul 3, 2020.



Category: Artificial Intelligence