Implementing Naive Bayes from Scratch: A Detailed Guide
Introduction to Naive Bayes
Naive Bayes is a simple yet powerful classification algorithm based on Bayes’ Theorem with the “naive” assumption of conditional independence between every pair of features given the class label. Despite its simplicity, it performs surprisingly well in applications such as spam detection and text classification.
Bayes’ Theorem
Bayes’ Theorem forms the foundation of Naive Bayes and is stated as:
𝑃(𝑦∣𝑋) = (𝑃(𝑋∣𝑦) ⋅ 𝑃(𝑦)) / 𝑃(𝑋)
Where:
- 𝑃(𝑦∣𝑋) is the posterior probability of class 𝑦 given the features X.
- 𝑃(𝑋∣𝑦) is the likelihood of the features 𝑋 given the class y.
- 𝑃(𝑦) is the prior probability of class 𝑦.
- 𝑃(𝑋) is the evidence, or the total probability of the features X. Because it is the same for every class, it acts only as a normalizing constant when the classes are compared.
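To make the theorem concrete, here is a small numeric sketch with purely hypothetical probabilities (a one-feature spam filter invented for illustration, not part of the dataset used later):

# Hypothetical numbers, for illustration only
p_spam = 0.3                 # prior P(y = spam)
p_word_given_spam = 0.8      # likelihood P(word present | spam)
p_word_given_ham = 0.1       # likelihood P(word present | not spam)

# Evidence P(X): total probability of observing the word
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' Theorem: posterior P(spam | word)
p_spam_given_word = (p_word_given_spam * p_spam) / p_word
print(round(p_spam_given_word, 3))  # prints 0.774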
Naive Bayes Assumption
The key assumption in Naive Bayes is that all features are independent given the class label. This simplifies the computation of 𝑃(𝑋∣𝑦) as:
𝑃(𝑋∣𝑦) = 𝑃(𝑥₁∣𝑦) ⋅ 𝑃(𝑥₂∣𝑦) ⋅ … ⋅ 𝑃(𝑥ₙ∣𝑦)
Steps to Implement Naive Bayes
1. Calculate Priors:
   - 𝑃(𝑦): The prior probability of each class.
2. Calculate Likelihoods:
   - 𝑃(𝑥ᵢ∣𝑦): The likelihood of each feature given each class. For continuous features, we often assume a Gaussian distribution.
3. Calculate Posterior Probabilities:
   - Using Bayes’ Theorem, compute the posterior probability for each class and choose the class with the highest posterior probability.
Step-by-Step Implementation
Importing Libraries
import numpy as np
import pandas as pd
from scipy.stats import norm
# For simplicity, assume we have a dataset df with features and a target column 'Class'
Calculate Priors
def calculate_priors(df, target_col):
    # P(y): relative frequency of each class in the training data
    priors = df[target_col].value_counts(normalize=True)
    return priors.to_dict()
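As a quick sanity check, the function simply returns the class frequencies as a dictionary; a minimal sketch on a toy frame (assuming pandas is imported as above):

toy = pd.DataFrame({'Class': ['A', 'A', 'A', 'B']})
print(calculate_priors(toy, 'Class'))  # expected: {'A': 0.75, 'B': 0.25}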
Calculate Likelihoods (Gaussian)
The function calculate_likelihood_gaussian computes the likelihood of a given feature value under the assumption that the feature follows a Gaussian (normal) distribution. Here's the formula used in the function:
𝑃(𝑥ᵢ∣𝑦) = (1 / √(2πσ²)) ⋅ exp(−(𝑥ᵢ − μ)² / (2σ²))
where μ and σ are the mean and standard deviation of the feature computed from the training rows belonging to class 𝑦.
def calculate_likelihood_gaussian(df, feat_name, feat_val, target_col, label):
    # Restrict to the rows belonging to the given class label
    df = df[df[target_col] == label]
    # Estimate the Gaussian parameters of this feature within the class
    mean, std = df[feat_name].mean(), df[feat_name].std()
    # Gaussian probability density of feat_val under N(mean, std^2)
    p_x_given_y = (1 / (np.sqrt(2 * np.pi) * std)) * np.exp(-((feat_val - mean) ** 2 / (2 * std ** 2)))
    return p_x_given_y
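Since the expression above is exactly the Gaussian probability density, it should agree with scipy's norm.pdf (already imported); a minimal cross-check using hypothetical parameter values:

mean, std, x = 2.0, 0.5, 2.3  # hypothetical mean, standard deviation, and feature value
manual = (1 / (np.sqrt(2 * np.pi) * std)) * np.exp(-((x - mean) ** 2) / (2 * std ** 2))
print(np.isclose(manual, norm.pdf(x, loc=mean, scale=std)))  # expected: True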
Calculate Posteriors
def calculate_posteriors(df, input_data, target_col):
    labels = sorted(df[target_col].unique())
    priors = calculate_priors(df, target_col)
    posteriors = {}
    for label in labels:
        # Naive independence assumption: multiply the per-feature likelihoods
        likelihood = 1
        for feat in input_data:
            likelihood *= calculate_likelihood_gaussian(df, feat, input_data[feat], target_col, label)
        # Unnormalized posterior: P(X | y) * P(y)
        posteriors[label] = likelihood * priors[label]
    # Normalize so the posteriors sum to 1; this plays the role of dividing by P(X)
    evidence = sum(posteriors.values())
    for label in labels:
        posteriors[label] /= evidence
    return posteriors
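One practical caveat: multiplying many small likelihoods can underflow to zero when there are many features. A common remedy is to add log-probabilities instead and compare the resulting scores directly; the following log-space variant is a sketch of that idea, not part of the original walkthrough:

def calculate_log_posteriors(df, input_data, target_col):
    labels = sorted(df[target_col].unique())
    priors = calculate_priors(df, target_col)
    log_scores = {}
    for label in labels:
        # log P(y) + sum over features of log P(x_i | y)
        score = np.log(priors[label])
        for feat in input_data:
            score += np.log(calculate_likelihood_gaussian(df, feat, input_data[feat], target_col, label))
        log_scores[label] = score
    # The label with the largest log-score is the same as the MAP class
    return log_scores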
Predict Function
def predict(df, input_data, target_col):
    posteriors = calculate_posteriors(df, input_data, target_col)
    # Return the class with the highest posterior probability (MAP decision rule)
    return max(posteriors, key=posteriors.get)
Example Usage
# Example DataFrame
data = {
    'Feature1': [2.5, 3.1, 1.8, 3.6],
    'Feature2': [0.5, 1.2, 0.3, 0.4],
    'Class': ['A', 'B', 'A', 'B']
}
df = pd.DataFrame(data)

# Predicting class for a new data point
new_data = {'Feature1': 2.0, 'Feature2': 0.6}
predicted_class = predict(df, new_data, 'Class')
print(f"Predicted class: {predicted_class}")
Conclusion
Naive Bayes is an elegant algorithm for classification problems, especially when feature independence can be assumed. By implementing it from scratch, we gain deeper insights into its mechanics and appreciate its simplicity and efficiency.