Naive Bayes in machine learning

Do you like statistics? There is a high probability 😉 that you don’t. But don’t worry, statistics is not my strong point either. It’s hard to say why most people shun statistics, but I feel that many statistical results are not entirely intuitive. Just look up the gambler’s fallacy or the Monty Hall problem. Interesting reading :-). Seriously! Regardless of whether you like statistics or not, the vast majority of machine learning is based on or uses statistics. That certainly applies to Naive Bayes in machine learning, or more precisely to the Naive Bayes classifier.

After reading this post:

you will know why Naive Bayes is naive,
you will learn the basics of the theory behind the Naive Bayes Classifier,
you will be able to calculate the classifier result by yourself,
and you will know how to use the classifier available in the scikit-learn library.

Why is Naive Bayes actually naive?

I would like to explain this at the beginning, because it is a key issue for understanding the classifier principle. Suppose we want to classify some information based on several input features. Let this classification be sunny vs. cloudy based on: season, temperature, humidity and atmospheric pressure. In the case of the Bayesian classifier, we assume that inputs are completely independent of each other. In our example, one of the assumptions will be: the temperature does not depend on the season. How naive… 😉

In addition, we assume that each of the input features is just as important as any other. This is also quite naive, because although I’m not a meteorologist, I could bet that some parameters will have a greater impact on the sunny day, and others a little less.

Although Bayesian assumptions are somewhat detached from reality, Naive Bayes in machine learning works very well, especially in classification tasks (spam detection, text classification, sentiment analysis), recommendation systems, and because of the speed also in real-time prediction. It doesn’t work well for regression though.

Naive Bayes in machine learning – the data set

It’s time to look at the data on which we will train a classifier. Since I am a basketball fan and I like to watch NBA games, I have prepared my own small data set on which you can do relatively simple manual calculations, but also run the algorithm offered by scikit-learn. We have the following columns in this data set:

Team – an auxiliary / descriptive column containing the name of the team; does not participate in the calculations.
Season – an auxiliary / descriptive column containing the indication of the season to which the data row applies; does not participate in the calculations.
Play-offs a year before – the first of the input features, indicating whether the team played in the play-offs in the previous season.
Player(s) in the All-Star game – whether the team had a representative in the All-Star Game this season.
Salary cap over the 50th percentile – teams have a limit on their players’ salaries, the so-called salary cap. After exceeding it, the team must pay an additional, expensive tax. The mechanism is designed to protect against the dominance of very rich teams and to even out the chances. It works effectively, but most teams exceed the salary cap anyway, sometimes by tens of millions. The column indicates whether the given team exceeded the salary cap by more than at least half of the other teams did this season, i.e. whether it is above the 50th percentile.
Play-offs – the column indicating the actual class: Yes – the team qualified for the play-offs in the given season / No – the team finished the competition in the regular season and did not play in the play-offs.

Team	Season	Play-offs a year before	Player(s) in the All-Star game	Salary cap over the 50th percentile	Play-offs
Atlanta Hawks	2018-19	No	No	No	No
Boston Celtics	2018-19	Yes	Yes	Yes	Yes
Brooklyn Nets	2018-19	No	Yes	No	Yes
Charlotte Hornets	2018-19	No	Yes	Yes	No
…	…	…	…	…	…
Sacramento Kings	2017-18	No	No	No	No
San Antonio Spurs	2017-18	Yes	Yes	Yes	Yes
Toronto Raptors	2017-18	Yes	Yes	Yes	Yes
Utah Jazz	2017-18	Yes	No	No	Yes
Washington Wizards	2017-18	Yes	Yes	Yes	Yes

Data is available for download in csv format: NBA – Naive Bayes. All “Yes” are represented by 1.0, all “No” by 0.0. If you want to import a file into OpenOffice or another spreadsheet, you should set the import parameters as follows:

[Image: importing the dataset from the csv file]

Naive Bayes in machine learning using scikit-learn

Before we start calculating the classifier ourselves, let’s use the scikit-learn library for this. I suggest that you create a new virtual environment using conda and run Jupyter Notebook. If you are not sure how to prepare a programming environment for machine learning, read my post on how to build it. In addition to the packages listed in that post, please also add the scikit-learn library.

First, we import the numpy and pandas libraries and load the .csv file into a dataframe:

import numpy as np
import pandas as pd

data_frame = pd.read_csv(r'.\NBA - Naive Bayes.csv', encoding='utf-8')

We display the first few lines of the dataframe:

data_frame.head()

	Team	Season	Play-offs a year before	Player(s) in the All-Star game	Salary cap over the 50th percentile	Play-offs
0	Atlanta Hawks	2018-19	0.0	0.0	0.0	0.0
1	Boston Celtics	2018-19	1.0	1.0	1.0	1.0
2	Brooklyn Nets	2018-19	0.0	1.0	0.0	1.0
3	Charlotte Hornets	2018-19	0.0	1.0	1.0	0.0
4	Chicago Bulls	2018-19	0.0	0.0	0.0	0.0

As you can see, the data loaded correctly. In addition, the dataframe adds an index column as the first column. Counting that index as column No. 1, columns No. 2 “Team” and No. 3 “Season” are descriptive and we will not use them in the learning process. The last column, No. 7 “Play-offs”, tells us whether the team actually qualified for the play-offs or not; this is our target (label). Columns No. 4, 5 and 6 will take part in the learning process as input data. Therefore, we need to transform the data so that we get two variables: one with columns 4, 5 and 6, and the other with column No. 7. First, let’s make a copy of the dataframe containing only columns 4, 5 and 6:

df_X = data_frame.drop(['Team', 'Season', 'Play-offs'], axis=1)
df_X.head()

	Play-offs a year before	Player(s) in the All-Star game	Salary cap over the 50th percentile
0	0.0	0.0	0.0
1	1.0	1.0	1.0
2	0.0	1.0	0.0
3	0.0	1.0	1.0
4	0.0	0.0	0.0

In the next step, we convert the dataframe to the numpy array and then repeat the entire operation for the target variable:

data = df_X.to_numpy()
data[0:5]

>>>array([[0., 0., 0.],
       [1., 1., 1.],
       [0., 1., 0.],
       [0., 1., 1.],
       [0., 0., 0.]])

target = data_frame[['Play-offs']].to_numpy()
target[0:5]

>>>array([[0.],
       [1.],
       [1.],
       [0.],
       [0.]])

Having prepared the data as numpy arrays, we can pass them on to a classifier provided by the scikit-learn library.

Looking through the scikit-learn documentation for Naive Bayes, it’s easy to see that we have several classifiers available. Which one should we use?

Gaussian (GaussianNB) – for continuous data for which we can assume a normal distribution.
Multinomial (MultinomialNB) – for discrete data (e.g. number of All-Star players in a team, rating of purchased goods on a scale of 1 to 5).
Bernoulli (BernoulliNB) – for binary data (zeros and ones); this is our case, so it will be the most appropriate.
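As a rough sketch of how the three variants are used: they all share the same fit/predict interface, and only the kind of input data differs. The tiny arrays below are made up purely for illustration and have nothing to do with the NBA data set.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Made-up toy data, for illustration only (not the NBA data set).
y = np.array([0, 0, 1, 1])

X_continuous = np.array([[1.2, 30.5], [0.8, 28.1], [3.1, 12.0], [2.9, 10.4]])
X_counts = np.array([[0, 3], [1, 4], [5, 0], [4, 1]])
X_binary = np.array([[0, 1], [0, 1], [1, 0], [1, 0]])

models = {
    "GaussianNB": (GaussianNB(), X_continuous),    # continuous features
    "MultinomialNB": (MultinomialNB(), X_counts),  # count features
    "BernoulliNB": (BernoulliNB(), X_binary),      # 0/1 features
}

predictions_by_model = {}
for name, (model, X) in models.items():
    model.fit(X, y)
    predictions_by_model[name] = model.predict(X)
    print(name, predictions_by_model[name])
```

Because the interface is identical, swapping one variant for another is a one-line change; the real decision is which distribution fits the features.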

Before we get to the code, it is worth noting that our data set is very small and contains only three input features. Modern basketball is an advanced sport in which many factors play a role. Teams competing at the highest level try to refine every aspect related to the technique of the game, team tactics, players’ health, and even such issues as nutrition, supplementation or travel logistics. To assess a team’s actual chances of reaching the playoffs, dozens of inputs, if not hundreds, should probably be taken into account. Definitely a lot more than just three ;-). Hence, the result we get should be taken with a grain of salt, and the goal of this simple exercise is only to familiarize you with building a classifier and the theory underlying Naive Bayes in machine learning. Moreover, it is good practice in machine learning to separate the training and test sets. We will not do this because of the size of our data set and the rather theoretical nature of this exercise. Returning to the code:

from sklearn.naive_bayes import BernoulliNB

classifier = BernoulliNB()
classifier.fit(data, target.ravel())

predictions = classifier.predict(data)
predictions

>>> array([0., 1., 0., 1., 0., 0., 0., 0., 1., 1., 1., 1., 0., 0., 0., 1., 1.,
           1., 1., 0., 1., 0., 1., 0., 1., 0., 1., 1., 0., 1., 0., 1., 0., 1.,
           0., 1., 0., 0., 1., 1., 1., 1., 0., 0., 0., 0., 1., 0., 1., 0., 1.,
           0., 0., 0., 1., 0., 1., 1., 0., 1.])
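Besides hard 0/1 labels, the fitted classifier can also report class probabilities through predict_proba. The sketch below is only an illustration: it refits BernoulliNB on just the nine rows visible in the table earlier (not the full csv file), and the new team it scores is hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# The nine rows visible in the table earlier, columns:
# [play-offs a year before, All-Star player(s), salary cap over 50th percentile]
X = np.array([
    [0, 0, 0],  # Atlanta Hawks 2018-19
    [1, 1, 1],  # Boston Celtics 2018-19
    [0, 1, 0],  # Brooklyn Nets 2018-19
    [0, 1, 1],  # Charlotte Hornets 2018-19
    [0, 0, 0],  # Sacramento Kings 2017-18
    [1, 1, 1],  # San Antonio Spurs 2017-18
    [1, 1, 1],  # Toronto Raptors 2017-18
    [1, 0, 0],  # Utah Jazz 2017-18
    [1, 1, 1],  # Washington Wizards 2017-18
])
y = np.array([0, 1, 1, 0, 0, 1, 1, 1, 1])  # play-offs: 1 = Yes, 0 = No

model = BernoulliNB()
model.fit(X, y)

# Hypothetical new team: no play-offs last year, an All-Star player, low payroll.
new_team = np.array([[0, 1, 0]])
predicted_class = model.predict(new_team)[0]
probabilities = model.predict_proba(new_team)[0]  # ordered like model.classes_

print("classes:", model.classes_)
print("probabilities:", probabilities)
print("predicted class:", predicted_class)
```

The probabilities are listed in the order given by model.classes_, so the first entry belongs to class 0 (No) and the second to class 1 (Yes).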

To check how well the classifier handled the data, we calculate accuracy:

print("Bernoulli Naive Bayes Accuracy:",
      np.round((predictions == target.ravel()).sum() / target.size * 100, decimals=2), "%")

>>>Bernoulli Naive Bayes Accuracy: 73.33 %
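Keep in mind that this accuracy is measured on the same rows the classifier was trained on. A hedged sketch of what a hold-out evaluation could look like is shown below. It uses only the nine rows visible in the table earlier, and the split size (test_size=3) and random_state=0 are arbitrary choices of mine, so the resulting number says nothing about the full data set.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# The nine rows visible in the table earlier (features, then the play-offs label).
X = np.array([
    [0, 0, 0], [1, 1, 1], [0, 1, 0], [0, 1, 1], [0, 0, 0],
    [1, 1, 1], [1, 1, 1], [1, 0, 0], [1, 1, 1],
])
y = np.array([0, 1, 1, 0, 0, 1, 1, 1, 1])

# Keep three rows aside for testing; stratify keeps both classes in each part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=3, stratify=y, random_state=0
)

model = BernoulliNB().fit(X_train, y_train)
y_pred = model.predict(X_test)

test_accuracy = accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred, labels=[0, 1])  # rows: actual, columns: predicted
print("hold-out accuracy:", test_accuracy)
print(matrix)
```

With the real csv file, the same calls would take the data and target arrays prepared earlier instead of the hard-coded rows.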

How to manually calculate Naive Bayes in machine learning?

We have reached the point where you can no longer hide behind the proven machine learning library and you have to roll up your sleeves for work 😉 . Using Naive Bayes in machine learning is actually nothing else but creating a statistical model. More precisely – the probability model. We must therefore reach for some statistics. By creating a probability model for the classifier, we try to answer the question: what is the probability that we are dealing with some class C, assuming the input data, x₁, x₂, …, x_n? In our case, it would be, e.g., what is the probability that class C = team will play in the playoffs, assuming that x₁ = the team played in the playoffs last year, x₂ = the team does not have a representative in the All-Star Game and x₃ = the team spends less on salaries than at least half of the other teams. This problem can be written down as the following formula:

P(C | x₁, x₂, x₃)

or more generally:

P (C | X_n)

We read this formula as: what is the probability of class C given the input vector X_n?

In our simple example, with 3 binary input features, the model could be built by calculating the probabilities for each combination of values. However, usually there are many more features (and thus combinations), and in addition they can take continuous values, which means that the problem formulated in this way cannot be easily solved, if at all. Here, Bayes’ theorem comes to our aid. The classifier built on it additionally assumes, “naively”, that the input features of X are independent of each other and that each of them is just as important for the model. Again, this is a very large simplification of the real-life scenario, but first of all it works 😎 , and secondly it significantly simplifies the calculations. Bayes’ theorem looks like this:

P(C | X) = P(C) * P(X | C) / P(X)

And for our simple example you can write it like this:

P(C | x1,x2,x3) = P(C) * P(x1,x2,x3 | C) / P(x1,x2,x3)

Believe it or not, but by creating this classifier we may completely ignore the denominator of this expression. Why? Let’s assume that we want to answer the question whether the team will play in the playoffs this year (C = Yes) or not (C = No), providing x₁ = Yes = the team played in playoffs last year, x₂ = No = the team has no representative in the All-Star Game this year and x₃ = No = the team spends less on salaries than at least half of the other teams. We will know the answer if we can calculate and compare the probabilities for both classes:

P(C=Yes) * P(Yes, No, No | C=Yes) / P(Yes, No, No)

and

P(C=No) * P(Yes, No, No | C=No) / P(Yes, No, No)

Will we consider the denominator at all when assessing which of these probabilities is higher? No, because it is the same in both expressions. Therefore, only the value of the numerator decides. This simplifies the expression to a proportionality (∝ means “is proportional to”):

P(C | x1,x2,x3) ∝ P(C) * P(x1,x2,x3 | C)

What’s more, since we operate on naive Bayesian assumptions, and more precisely since we assume that the features x₁, x₂, x₃ are independent of each other within a given class, we ultimately calculate a score proportional to the probability of a given class C using the formula:

P(C | x1,x2,x3) ∝ P(C) * P(x1 | C) * P(x2 | C) * P(x3 | C)

Remember that the above formula is written for our case of two classes and three input features. Up to the dropped denominator, it generalizes to any number n of features as a product over all of them:

P(C | X_n) ∝ P(C) * ∏_{i=1}^{n} P(x_i | C)
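The general formula is just the prior multiplied by one conditional probability per feature. A minimal sketch follows, with made-up probabilities for two hypothetical classes A and B. It also shows the common trick of summing logarithms instead of multiplying, which avoids very small numbers when there are many features; the function names are my own.

```python
import math


def score_product(prior, conditionals):
    """P(C) multiplied by every P(x_i | C)."""
    return prior * math.prod(conditionals)


def score_log(prior, conditionals):
    """log P(C) plus the sum of log P(x_i | C)."""
    return math.log(prior) + sum(math.log(p) for p in conditionals)


# Made-up probabilities for two hypothetical classes A and B.
prior_a, conds_a = 0.4, [0.9, 0.2, 0.7]
prior_b, conds_b = 0.6, [0.5, 0.3, 0.6]

product_a = score_product(prior_a, conds_a)
product_b = score_product(prior_b, conds_b)
log_a = score_log(prior_a, conds_a)
log_b = score_log(prior_b, conds_b)

# Both variants rank the classes the same way.
winner_by_product = "A" if product_a > product_b else "B"
winner_by_log = "A" if log_a > log_b else "B"
print(winner_by_product, winner_by_log)
```

Since the logarithm is monotonic, comparing the log scores always picks the same class as comparing the products.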

Let’s return to our simple example, in which we have two classes C and three input features x. In order to calculate the class probability for any combination of input values, we need the probability of each class, P(C), and all the conditional probabilities P(x_i | C). I suggest you try to calculate them by yourself. It’s best to use a spreadsheet and its filtering options for this. Don’t worry if you make a mistake (I probably made one as well somewhere below 😉 ) – all these tedious calculations will ultimately be carried out by the machine. We only want to understand the methodology and the principles of the Bayesian classifier.
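If you prefer pandas to a spreadsheet, the same counting can be sketched with value_counts and crosstab. The example below uses only the nine rows visible in the table earlier, so its numbers differ from those for the full data set; the variable names are my own.

```python
import pandas as pd

# Only the nine rows visible in the table earlier (1 = Yes, 0 = No).
df = pd.DataFrame(
    {
        "Play-offs a year before": [0, 1, 0, 0, 0, 1, 1, 1, 1],
        "Player(s) in the All-Star game": [0, 1, 1, 1, 0, 1, 1, 0, 1],
        "Salary cap over the 50th percentile": [0, 1, 0, 1, 0, 1, 1, 0, 1],
        "Play-offs": [0, 1, 1, 0, 0, 1, 1, 1, 1],
    }
)

# P(C): share of rows in each class.
priors = df["Play-offs"].value_counts(normalize=True)

# P(x_i | C): one table per feature; each column (class) sums to 1.
conditionals = {
    feature: pd.crosstab(df[feature], df["Play-offs"], normalize="columns")
    for feature in df.columns.drop("Play-offs")
}

print(priors)
for feature, table in conditionals.items():
    print(table, end="\n\n")
```

On such a small sample some conditional probabilities come out as exactly 0, which would zero out the whole product for that class. This is one reason why libraries usually add smoothing to the counts.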

I collected my calculations in the table below.

P(C=No)	28/60
P(C=Yes)	32/60

Play-offs a year before	P(x₁ | C=No)	P(x₁ | C=Yes)
No	19/28	9/32
Yes	9/28	23/32

Player(s) in the All-Star game	P(x₂ | C=No)	P(x₂ | C=Yes)
No	18/28	3/32
Yes	10/28	29/32

Salary cap over the 50th percentile	P(x₃ | C=No)	P(x₃ | C=Yes)
No	18/28	12/32
Yes	10/28	20/32

Having calculated the individual probabilities, we can attempt to classify an example input. Let’s look at the Atlanta Hawks. This is a young team with high aspirations for the 2019-20 season and with a rising star of the young generation – Trae Young, who I bet will be in the All-Star Game in 2020. This gives us x₂ = Yes. In the 2018-19 season, Atlanta missed the playoffs (x₁ = No). As for players’ salaries, the Hawks spend the least in the entire league (x₃ = No). For such input data, we want to calculate two probabilities: P(C = Yes – will play in the playoffs) and P(C = No – will not play in the playoffs), and then check which of them is higher and hence which will be chosen by our classifier.

Because we can ignore the denominators when comparing the probabilities for both classes (they are the same), it should be added for the sake of accuracy that we will not calculate the probabilities themselves, but only the numerators of the probability formula. However, this is sufficient for correct classification.

For the Yes class (the Atlanta Hawks will play in the playoffs) we have:

P(Yes | No, Yes, No) ∝ P(Yes) * P(x1=No | Yes) * P(x2=Yes | Yes) * P(x3=No | Yes) = 32/60 * 9/32 * 29/32 * 12/32 = 3132/61440 ≈ 0.05098

For the No class (the Atlanta Hawks will not play in the playoffs) we have:

P(No | No, Yes, No) ∝ P(No) * P(x1=No | No) * P(x2=Yes | No) * P(x3=No | No) = 28/60 * 19/28 * 10/28 * 18/28 = 3420/47040 ≈ 0.07270

As you can see, the variant in which the Atlanta Hawks will not play in the playoffs has the higher value, and therefore the higher probability. However, I hope that our classifier is wrong 😉
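The hand calculation can be double-checked with a few lines of Python. This is only a sketch: the counts are read off the tables above, the total number of rows is taken as the sum of the two class counts, and the names (class_scores, alpha and the dictionary keys) are my own.

```python
from fractions import Fraction

# Counts read off the tables above (class totals: 28 x No, 32 x Yes).
class_counts = {"No": 28, "Yes": 32}
# feature_counts[class][feature][value] = number of rows
feature_counts = {
    "No": {
        "playoffs_before": {"No": 19, "Yes": 9},
        "all_star": {"No": 18, "Yes": 10},
        "salary_over_median": {"No": 18, "Yes": 10},
    },
    "Yes": {
        "playoffs_before": {"No": 9, "Yes": 23},
        "all_star": {"No": 3, "Yes": 29},
        "salary_over_median": {"No": 12, "Yes": 20},
    },
}


def class_scores(sample, alpha=0):
    """Return P(C) * product of P(x_i | C) for every class C.

    alpha=0 uses the plain relative frequencies from the tables;
    alpha=1 adds Laplace smoothing for binary features.
    """
    total = sum(class_counts.values())
    scores = {}
    for label, n_rows in class_counts.items():
        score = Fraction(n_rows, total)
        for feature, value in sample.items():
            count = feature_counts[label][feature][value]
            score *= Fraction(count + alpha, n_rows + 2 * alpha)
        scores[label] = score
    return scores


hawks = {"playoffs_before": "No", "all_star": "Yes", "salary_over_median": "No"}
scores = class_scores(hawks)

# Dividing by the sum of the numerators restores the dropped denominator.
evidence = sum(scores.values())
posteriors = {label: float(score / evidence) for label, score in scores.items()}
print(posteriors)  # roughly {'No': 0.588, 'Yes': 0.412}
```

Calling class_scores(hawks, alpha=1) applies Laplace smoothing, which scikit-learn’s BernoulliNB also uses by default (alpha=1.0), so the probabilities reported by the library will differ slightly from the hand calculation.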

If you liked my post about Naive Bayes in machine learning, please share it with your friends.

Are you looking for more information about machine learning? Maybe you will be interested in my other posts? For example, the one about logistic regression or this one about handwriting recognition.

Good luck in your studies!