Setup
This guide was written in Python 3.6.
Python and Pip
If you haven't already, please download Python and Pip.
Introduction
In this tutorial set, we'll review the Naive Bayes algorithm used in the field of machine learning. Naive Bayes applies Bayes' Theorem of probability to predict the class of a given data point, and is extremely fast compared to many other classification algorithms.
Because it works with an assumption of independence among predictors, the Naive Bayes model is easy to build and particularly useful for large datasets. Despite its simplicity, Naive Bayes can outperform even some far more sophisticated classification methods.
This tutorial assumes prior programming experience with Python and a working knowledge of probability. While I will review some of the principles of probability, this tutorial is not intended to teach you these fundamental concepts. If you need some background on this material, please see my tutorial here.
Bayes' Theorem
Recall Bayes' Theorem, which provides a way of calculating the posterior probability:

P(A | B) = P(B | A) · P(A) / P(B)

Here, P(A | B) is the posterior probability of class A given predictor B, P(B | A) is the likelihood, P(A) is the prior probability of the class, and P(B) is the prior probability of the predictor.
Before we go into more specifics of the Naive Bayes Algorithm, we'll go through an example of classification to determine whether a sports team will play or not based on the weather.
To start, we'll load in the data, which you can find here.
import pandas as pd

# Load the weather dataset into a DataFrame
f1 = pd.read_csv("./data/weather.csv")
Before we go any further, let's take a look at the dataset we're working with. It consists of two columns (excluding the index), weather and play. The weather column contains one of three possible weather categories: sunny, overcast, and rainy. The play column is a binary value of yes or no, and indicates whether or not the sports team played that day.
f1.head(3)
Weather Play
0 Sunny No
1 Overcast Yes
2 Rainy Yes
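If you don't have the CSV on hand, a minimal reconstruction of the same dataset can be built directly in pandas. Note that the row order here is an assumption; only the counts are known, and they match the frequency tables computed later in this tutorial:

```python
import pandas as pd

# Counts from the frequency tables:
# Overcast: 4 Yes; Rainy: 3 No, 2 Yes; Sunny: 2 No, 3 Yes (row order assumed)
data = (
    [("Overcast", "Yes")] * 4
    + [("Rainy", "No")] * 3
    + [("Rainy", "Yes")] * 2
    + [("Sunny", "No")] * 2
    + [("Sunny", "Yes")] * 3
)
f1 = pd.DataFrame(data, columns=["Weather", "Play"])
print(len(f1))  # 14 observations in total
```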
Frequency Table
If you recall from probability theory, frequencies are an important part of eventually calculating the probability of a given class. In this section of the tutorial, we'll first convert the dataset into different frequency tables, using the groupby() function. First, we retrieve the frequencies of each combination of the weather and play columns:
df = f1.groupby(['Weather','Play']).size()
print(df)
Weather Play
Overcast Yes 4
Rainy No 3
Yes 2
Sunny No 2
Yes 3
dtype: int64
It will also come in handy to split the frequencies by weather and yes/no. Let's start with the three weather frequencies:
df2 = f1.groupby('Weather').count()
print(df2)
Play
Weather
Overcast 4
Rainy 5
Sunny 5
And now for the frequencies of yes and no:
df1 = f1.groupby('Play').count()
print(df1)
Weather
Play
No 5
Yes 9
Likelihood Table
The frequencies of each class are important in calculating the likelihood, or the probability that a certain class will occur. Using the frequency tables we just created, we'll find the likelihoods of each weather condition and yes/no. We'll accomplish this by adding a new column that takes the frequency column and divides it by the total number of data occurrences:
df1['Likelihood'] = df1['Weather']/len(f1)
df2['Likelihood'] = df2['Play']/len(f1)
print(df1)
print(df2)
Weather Likelihood
Play
No 5 0.357143
Yes 9 0.642857
Play Likelihood
Weather
Overcast 4 0.285714
Rainy 5 0.357143
Sunny 5 0.357143
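As a quick sanity check, each likelihood column is a probability distribution over its classes, so it should sum to 1. Here's a self-contained sketch of that check (rebuilding the dataset from the counts above, with row order assumed):

```python
import pandas as pd

# Rebuild the dataset from the frequency-table counts (row order assumed)
data = (
    [("Overcast", "Yes")] * 4
    + [("Rainy", "No")] * 3
    + [("Rainy", "Yes")] * 2
    + [("Sunny", "No")] * 2
    + [("Sunny", "Yes")] * 3
)
f1 = pd.DataFrame(data, columns=["Weather", "Play"])

df1 = f1.groupby("Play").count()
df1["Likelihood"] = df1["Weather"] / len(f1)
df2 = f1.groupby("Weather").count()
df2["Likelihood"] = df2["Play"] / len(f1)

# Each set of classes partitions the data, so each column sums to 1
assert abs(df1["Likelihood"].sum() - 1.0) < 1e-9
assert abs(df2["Likelihood"].sum() - 1.0) < 1e-9
```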
Now, we're able to use the Naive Bayesian equation to calculate the posterior probability for each class. The highest posterior probability is the outcome of prediction.
Calculation
Now, let's get back to our question: Will the team play if the weather is sunny?
From this question, we can construct Bayes' Theorem. Because the known factor is that it's sunny, P(A | B) becomes P(Yes | Sunny). From there, it's just a matter of plugging in probabilities.
Since we already created some likelihood tables, we can just index P(Sunny) and P(Yes) off the tables:
ps = df2['Likelihood']['Sunny']
py = df1['Likelihood']['Yes']
That leaves us with P(Sunny | Yes). This is the probability that the weather is sunny given that the players played that day. In df, we see that the number of yes days under sunny is 3. We take this number and divide it by the total number of yes days, which we can get from df1:
psy = df['Sunny']['Yes']/df1['Weather']['Yes']
And finally, we can plug these variables into Bayes' Theorem:
p = (psy*py)/ps
print(p)
0.6
This tells us that there's a 60% probability of the team playing if it's sunny. Because this is a binary classification of yes or no, any posterior probability above 50% means we predict that the team will play.
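The same result can be recomputed directly from the raw counts, without pandas, which makes it easy to see where each term in Bayes' Theorem comes from:

```python
# Counts taken from the frequency tables in this tutorial
n_total = 14          # total observations
n_yes = 9             # days the team played
n_sunny = 5           # sunny days
n_sunny_and_yes = 3   # sunny days on which the team played

p_yes = n_yes / n_total                       # P(Yes)          = 9/14
p_sunny = n_sunny / n_total                   # P(Sunny)        = 5/14
p_sunny_given_yes = n_sunny_and_yes / n_yes   # P(Sunny | Yes)  = 3/9

# Bayes' Theorem: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p = (p_sunny_given_yes * p_yes) / p_sunny
print(round(p, 2))  # prints 0.6
```

Note that the P(Yes) and n_yes factors cancel, so the posterior reduces to 3/5, the fraction of sunny days on which the team played.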