
January 28, 2022

Author: Senior Data Scientist

In today’s world, many challenges have to be solved by decision-making. Whether in our lives or in business, someone must make a decision about the future based on their current knowledge and predictions. The decision can be as trivial as a choice between two flavors of juice, or as important as deciding which company is worth further investment. Each such problem can be solved with random guessing (rarely a good option), or via careful analysis of the possible choices and their outcomes.

In this article, we are going to talk about real-world decision-making processes. Sometimes a business use case consists of one consistent challenge: one type of decision, made repeatedly, based on the same structure of input data (such as the characteristics of a client or a product). The solution can be chosen only from a closed set of answers, called classes.

Then, each time a new input arrives, the procedure is the same and repetitive, like classifying whether an email is spam or not. And when we hear “repetitive procedure”, we should think of a programming challenge: it should be solvable automatically, by a computer program. This observation leads us to the definition of a classification problem.

But first, let’s take a closer look at how people solve problems and how machine learning consulting can help them make the best choices possible.

So how does a human solve problems like detecting irony in a text message, classifying an email as spam (and not opening a suspicious link), or identifying the species of a fish? Well, there are many approaches.

One can ask a relatively small number of trusted friends and base the decision on majority voting. Whether the majority of this group of friends are specialists in fish species identification is doubtful, but that’s another discussion.

Another person might look at how different people classified this kind of email and what they took into consideration, and based on that create a set of rules to distinguish what is spam and what isn’t.

Rules can look as follows:

- If there is “lottery” in the text then it is spam.
- If there is “you won” in the text then it is spam.

And so on.

Based on such simple rules, a person can easily classify an email that is obviously spam or a scam.
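Such keyword rules can be sketched in a few lines of code. The phrase list below is purely illustrative, not a real spam filter:

```python
# Hand-written spam rules; the phrase list is illustrative only.
SPAM_PHRASES = ["lottery", "you won"]

def is_spam(text: str) -> bool:
    """Return True if any spam phrase occurs in the (lowercased) text."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in SPAM_PHRASES)

print(is_spam("Congratulations, you won the lottery!"))  # True
print(is_spam("Meeting moved to 3pm"))                   # False
```

Every new phrase becomes one more `if`-style condition, which is exactly why hand-maintained rule lists grow unwieldy over time.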

If the person solving the challenge is a maths nerd, they can use statistics and probability to get an answer. In the case of spam classification, conditional probability can be really helpful: if the word “lottery” appears in the text of an email, there is a 90% chance that the email is spam. The same can be applied to every word in the text, and then one can build a formula to compute the probability of an email being spam.

Based on the description presented above, we can define a challenge called a **classification problem**. It is among the most common types of problems and can be found in every kind of business.

A typical classification problem consists of:

**Input data**:

- a table of data with columns, each corresponding to a feature, as in the classical Iris problem (tabular data with petal length, width, etc.)
- a set of photos for image classification or object detection (photos of iris flowers)
- a set of texts like in spam emails detection
- an ordered list of values that change in time or signals like a song or human voice translated to machine-readable form.

**Classification algorithm** used for problem-solving – we will discuss some classical algorithms in the next sections of this article.

**Set of output classes**, i.e. the possible decisions. They can be as simple as a true-or-false decision, or more complicated, with a bigger set of answers.

In this article, we will focus on several classical, easily explainable, and interpretable algorithms that are considered an ideal introduction to more complicated, real-life application models.

The first algorithm that is going to be discussed and defined is the K Nearest Neighbours algorithm (**KNN**). It is a basic algorithm and its logic is very simple.

Imagine that you have a fish tank. Then, you notice a fish and wonder if it is a clownfish or an angelfish.

- If you set K to 1, then you search for the most similar fish and assume that your one is of the same species, so the same class.
- Bigger values of parameter K mean that you take more of the most similar examples and make voting.
- For two classes, all odd numbers are fine for the K parameter (although a common suggestion is to try the square root of the number of training examples as a first guess), so there won’t be any draws.
- For a bigger number of classes, there may be cases when there is a tie after the voting. Then the algorithm should take the class which contains examples closer to the query object.

This algorithm seems to incorporate very human-like thinking: if you see a new object, check other objects and assign the class that is most common among its neighbors. But in reality, it’s not always easy to find the closest similar objects.
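The whole procedure fits in a short sketch. The fish measurements below are made up for illustration; distance is Euclidean and ties are not specially handled:

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3):
    """Majority vote among the k training examples closest to `query`.

    `train` is a list of (feature_vector, label) pairs."""
    neighbours = sorted(train, key=lambda ex: euclidean(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy fish data: (length_cm, stripe_count) -> species; values are made up.
train = [((10, 3), "clownfish"), ((11, 3), "clownfish"),
         ((9, 2), "clownfish"), ((20, 0), "angelfish"),
         ((22, 1), "angelfish"), ((19, 0), "angelfish")]

print(knn_predict(train, (10, 2), k=3))  # clownfish
print(knn_predict(train, (21, 1), k=3))  # angelfish
```

Note there is no training step at all: the “model” is just the stored examples, and all the work happens at prediction time.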

There are some characteristics of this algorithm that make it not so useful in the real world:

- It works best when there are many, preferably continuous, features. KNN doesn’t handle text features very well; it needs them in an encoded format. Just replacing string values with numbers won’t do, because they will be treated as ordered numbers from 1 to n, where n is the number of distinct values. Text features must be encoded via a one-hot encoder or a similar transformation into multiple binary columns.
- When there are not many examples in the dataset, the algorithm can make more mistakes. The training set of KNN must contain all the important groups of examples. If it doesn’t, the algorithm can omit some important clusters and assign the class of improper examples, which happen to be the most similar, while better points of reference were simply not present in the training set.
- When there are too many examples, the prediction for each example by default can take a lot of time.

For each new observation, we need to compare it to all other examples in order to compute the similarity score (the chosen distance metric) in the multidimensional space for each pair.

Even though KNN is a very simple approach to the classification problem, it is still used nowadays, most frequently as a baseline solution for time series classification problems.

Because time series examples are simply sequences of numbers, they are an ideal input to throw directly into a KNN model. The similarity metric can be the default one (most often Euclidean distance) or DTW (Dynamic Time Warping), which is more suitable for comparing two time series.
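For intuition, here is the classic dynamic-programming formulation of DTW, a minimal sketch using absolute difference as the local cost (real libraries add windowing and other refinements):

```python
import math

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric sequences.

    Classic O(len(a) * len(b)) dynamic programming: dp[i][j] is the
    cheapest cost of aligning the first i points of `a` with the
    first j points of `b`."""
    n, m = len(a), len(b)
    dp = [[math.inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],      # stretch a
                                  dp[i][j - 1],      # stretch b
                                  dp[i - 1][j - 1])  # match both
    return dp[n][m]

# Two series with the same shape but shifted in time align at zero cost,
# where plain Euclidean distance would penalise the shift:
print(dtw_distance([0, 1, 2, 1, 0], [0, 0, 1, 2, 1, 0]))  # 0.0
```

This elasticity in time is exactly why DTW pairs so well with KNN on time series.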

Imagine a sentence like this: *If the sky is clear and the temperature is lower than 25 degrees Celsius then we will go and play golf.*

This is a simplification of another type of Machine Learning model, namely **Decision Rules**. There are many algorithms that generate such rules, from very naive to really complicated, but we won’t focus on any in particular in this article. We will just discuss the characteristics of decision rules and how to use them in classification problems.

Overall, a set of decision rules is an easy interpretation of how humans think when they are trying to make a decision in a reasonable way, like, *should I go sailing?* *Should I buy this TV set?*

In life, we have many variables such as:

- the weather
- our current budget
- our free time and so on.

We make a decision based on this information, sometimes making hard divisions.

For example, *if the TV costs above $5000 then I won’t buy it, or if the diagonal of the screen is bigger than 65 inches then I won’t be able to fit it in my living room.*

Decision rules incorporate this “if-thinking” into a machine learning model, which can then decide about future events based on the training data, that is, on experience from the past.

The easiest approach would be to look through the rule set and classify an example according to the rule that matches its features.
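A minimal first-match sketch of this idea, with purely illustrative rules expressed as predicates over a feature dictionary:

```python
# Each rule is a (condition, class) pair; conditions are plain
# predicates over a feature dictionary. The rules are illustrative only.
RULES = [
    (lambda x: "lottery" in x["text"], "spam"),
    (lambda x: "you won" in x["text"], "spam"),
]

def classify(example, rules=RULES, default="not spam"):
    """Return the class of the first rule matched by `example`,
    or `default` if no rule fires."""
    for condition, label in rules:
        if condition(example):
            return label
    return default

print(classify({"text": "you won a prize"}))    # spam
print(classify({"text": "agenda for monday"}))  # not spam
```

The `default` class is needed because a finite rule set cannot cover every possible input.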

Of course, when generating a bigger number of rules, they can overlap with each other. Therefore, an observation can trigger more than one rule from the set. Whenever this occurs, we can use one of several strategies to resolve the conflict.

**1. Ordered rules**

The first strategy is simply ordering the rules. The algorithm can score the rules according to some metric and then order them, so that we search through the list and assign a class as soon as the first rule is fired by the example.

However, this approach is a bit tricky.

The designer of the algorithm has to think of a proper metric that will be used for the ordering. *Which one will fit the biggest number of problems? Which will give the best result for the current user’s challenge?*

The only way to make sure that the algorithm uses the best possible metric is to implement many of them and allow users to search for the best one.

**2. Voting**

The second strategy is voting. If several rules get triggered by a new example, each of them can vote with a weight equal to 1, or the rules can be weighted by some measure of importance.

Thanks to the voting, even if one rule classified the example incorrectly, some other rules can fix this error by outvoting the faulty rule.
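A sketch of weighted voting over fired rules; the rules and weights below are assumed for illustration (in practice a weight could come from a rule’s accuracy on the training set):

```python
from collections import defaultdict

# Illustrative weighted rules: (condition, class, weight).
WEIGHTED_RULES = [
    (lambda x: "lottery" in x["text"], "spam", 2.0),
    (lambda x: "invoice" in x["text"], "not spam", 1.0),
    (lambda x: "free" in x["text"], "spam", 0.5),
]

def classify_by_voting(example, rules=WEIGHTED_RULES, default="not spam"):
    """Sum the weights of all fired rules per class; highest total wins."""
    votes = defaultdict(float)
    for condition, label, weight in rules:
        if condition(example):
            votes[label] += weight
    if not votes:
        return default
    return max(votes, key=votes.get)

# "spam" collects 2.0 + 0.5 = 2.5 votes against 1.0 for "not spam":
print(classify_by_voting({"text": "free lottery invoice"}))  # spam
```

Here the weaker “invoice” rule is outvoted by the two spam rules, which is exactly the error-correcting effect described above.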

**3. Strict enforcement of mutual exclusiveness**

There is also a third strategy, completely different from the previous two. It’s based on the idea that there shouldn’t be any two rules that cover the same example. All rules should be mutually exclusive.

With this approach, we might not be able to create some rules that could be built when using the voting or ordering strategies. But we can try to create rules that never conflict, so that whenever an example triggers a rule, it gets the class specified by that rule. [1]

In summary, decision rules may look like a simple approach that works only for easy use cases. However, thanks to these strategies and the interpretability of each rule, they were used to develop many more advanced algorithms, such as:

- decision trees,
- ensemble classifiers like Random Forests, XGBoost and many others, which are still used today in real-life applications.

Consider the following scenario: you open an email, and notice that there are many words that grab your attention.

*“Reward”, “inheritance”, “credit card”.* You could think that there is a high probability that emails containing these words are spam emails from a bot or any other kind of scam account. This behavior seems like a very reasonable pattern of thinking.

If a person sees a hundred emails containing the word “inheritance” and every one of them is spam, they conclude that there is almost a 100% chance that an email about an alleged inheritance must be spam.

This algorithm is based on conditional probability, so the main question that the **Naive Bayes** approach answers is “*What is the probability of class X given these values of features?”*

But the exact formula behind this question would be difficult to use and too specific for a general-purpose machine learning algorithm.

That’s why there is an assumption that each feature is completely independent of the others. This way, we can simply multiply the probabilities of all feature values for every new input.
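A minimal sketch of this multiplication, with hypothetical per-word probabilities and class priors assumed for illustration (a real model would estimate these from labelled emails):

```python
# Hypothetical per-word probabilities, assumed for illustration.
P_WORD_GIVEN_SPAM = {"reward": 0.8, "inheritance": 0.9, "meeting": 0.1}
P_WORD_GIVEN_HAM  = {"reward": 0.1, "inheritance": 0.05, "meeting": 0.6}
P_SPAM, P_HAM = 0.4, 0.6  # class priors

def p_spam(words):
    """P(spam | words) under the naive independence assumption:
    multiply per-word likelihoods, then normalise via Bayes' rule.
    Unknown words get a neutral 0.5 likelihood in this sketch."""
    spam_score = P_SPAM
    ham_score = P_HAM
    for w in words:
        spam_score *= P_WORD_GIVEN_SPAM.get(w, 0.5)
        ham_score *= P_WORD_GIVEN_HAM.get(w, 0.5)
    return spam_score / (spam_score + ham_score)

print(round(p_spam(["reward", "inheritance"]), 3))  # 0.99
```

The independence assumption is rarely true of real text, yet this simple multiplication often works surprisingly well in practice.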

In this article, we defined a classification problem. Furthermore, we introduced three basic approaches to classification and explained the intuition behind them.

This is just the first article in the series on Machine Learning algorithms and intuitions which helped researchers to create ML solutions used nowadays in real-life applications. We started with something really simple, incorporating human-like thinking.

In the next articles of this series, we will continue with the decision-making process and decision trees, with more technical descriptions of how a decision tree is built so it can later be used for inference. You will also discover the definitions of under- and overfitting and some techniques for fighting them.

[1] Mimuw.edu.pl. URL: https://www.mimuw.edu.pl/~son/datamining/DM/6-rules.pdf. Accessed January 25, 2022.

Category: Machine Learning