Implementing Naive Bayes from Scratch: A Detailed Guide
Introduction to Naive Bayes
Naive Bayes is a simple yet powerful classification algorithm based on Bayes’ Theorem with the “naive” assumption of conditional independence between every pair of features given the class label. Despite its simplicity, it performs surprisingly well in applications such as spam detection and text classification.
Bayes’ Theorem
Bayes’ Theorem forms the foundation of Naive Bayes and is stated as:
𝑃(𝑦∣𝑋) = (𝑃(𝑋∣𝑦) ⋅ 𝑃(𝑦)) / 𝑃(𝑋)
Where:
- 𝑃(𝑦∣𝑋) is the posterior probability of class 𝑦 given the features X.
- 𝑃(𝑋∣𝑦) is the likelihood of the features 𝑋 given the class y.
- 𝑃(𝑦) is the prior probability of class 𝑦.
- 𝑃(𝑋) is the evidence, or the total probability of the features X. Because it is the same for every class, it acts only as a normalizing constant when the classes are compared.
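To make the theorem concrete, here is a small numeric sketch with purely hypothetical probabilities (a one-feature spam filter invented for illustration, not part of the dataset used later):

# Hypothetical numbers, for illustration only
p_spam = 0.3                 # prior P(y = spam)
p_word_given_spam = 0.8      # likelihood P(word present | spam)
p_word_given_ham = 0.1       # likelihood P(word present | not spam)

# Evidence P(X): total probability of observing the word
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' Theorem: posterior P(spam | word)
p_spam_given_word = (p_word_given_spam * p_spam) / p_word
print(round(p_spam_given_word, 3))  # prints 0.774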
Naive Bayes Assumption
The key assumption in Naive Bayes is that all features are independent given the class label. This simplifies the computation of 𝑃(𝑋∣𝑦) as:
𝑃(𝑋∣𝑦) = 𝑃(𝑥₁∣𝑦) ⋅ 𝑃(𝑥₂∣𝑦) ⋅ … ⋅ 𝑃(𝑥ₙ∣𝑦)
Steps to Implement Naive Bayes
1. Calculate Priors:
   - 𝑃(𝑦): The prior probability of each class.
2. Calculate Likelihoods:
   - 𝑃(𝑥ᵢ∣𝑦): The likelihood of each feature given each class. For continuous features, we often assume a Gaussian distribution.
3. Calculate Posterior Probabilities:
   - Using Bayes’ Theorem, compute the posterior probability for each class and choose the class with the highest posterior probability.
Step-by-Step Implementation
Importing Libraries
import numpy as np
import pandas as pd
from scipy.stats import norm
# For simplicity, assume we have a dataset df with features and a target column 'Class'
Calculate Priors
def calculate_priors(df, target_col):
    # P(y): relative frequency of each class in the training data
    priors = df[target_col].value_counts(normalize=True)
    return priors.to_dict()
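As a quick sanity check, the function simply returns the class frequencies as a dictionary; a minimal sketch on a toy frame (assuming pandas is imported as above):

toy = pd.DataFrame({'Class': ['A', 'A', 'A', 'B']})
print(calculate_priors(toy, 'Class'))  # expected: {'A': 0.75, 'B': 0.25}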
Calculate Likelihoods (Gaussian)
The function calculate_likelihood_gaussian computes the likelihood of a given feature value under the assumption that the feature follows a Gaussian (normal) distribution. Here's the formula used in the function:
𝑃(𝑥ᵢ∣𝑦) = (1 / √(2πσ²)) ⋅ exp(−(𝑥ᵢ − μ)² / (2σ²))
where μ and σ are the mean and standard deviation of the feature computed from the training rows belonging to class 𝑦.
def calculate_likelihood_gaussian(df, feat_name, feat_val, target_col, label):
    # Restrict to the rows belonging to the given class label
    df = df[df[target_col] == label]
    # Estimate the Gaussian parameters of this feature within the class
    mean, std = df[feat_name].mean(), df[feat_name].std()
    # Gaussian probability density of feat_val under N(mean, std^2)
    p_x_given_y = (1 / (np.sqrt(2 * np.pi) * std)) * np.exp(-((feat_val - mean) ** 2 / (2 * std ** 2)))
    return p_x_given_y
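Since the expression above is exactly the Gaussian probability density, it should agree with scipy's norm.pdf (already imported); a minimal cross-check using hypothetical parameter values:

mean, std, x = 2.0, 0.5, 2.3  # hypothetical mean, standard deviation, and feature value
manual = (1 / (np.sqrt(2 * np.pi) * std)) * np.exp(-((x - mean) ** 2) / (2 * std ** 2))
print(np.isclose(manual, norm.pdf(x, loc=mean, scale=std)))  # expected: True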
Calculate Posteriors
def calculate_posteriors(df, input_data, target_col):
    labels = sorted(df[target_col].unique())
    priors = calculate_priors(df, target_col)
    posteriors = {}
    for label in labels:
        # Naive independence assumption: multiply the per-feature likelihoods
        likelihood = 1
        for feat in input_data:
            likelihood *= calculate_likelihood_gaussian(df, feat, input_data[feat], target_col, label)
        # Unnormalized posterior: P(X | y) * P(y)
        posteriors[label] = likelihood * priors[label]
    # Normalize so the posteriors sum to 1; this plays the role of dividing by P(X)
    evidence = sum(posteriors.values())
    for label in labels:
        posteriors[label] /= evidence
    return posteriors
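One practical caveat: multiplying many small likelihoods can underflow to zero when there are many features. A common remedy is to add log-probabilities instead and compare the resulting scores directly; the following log-space variant is a sketch of that idea, not part of the original walkthrough:

def calculate_log_posteriors(df, input_data, target_col):
    labels = sorted(df[target_col].unique())
    priors = calculate_priors(df, target_col)
    log_scores = {}
    for label in labels:
        # log P(y) + sum over features of log P(x_i | y)
        score = np.log(priors[label])
        for feat in input_data:
            score += np.log(calculate_likelihood_gaussian(df, feat, input_data[feat], target_col, label))
        log_scores[label] = score
    # The label with the largest log-score is the same as the MAP class
    return log_scores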
Predict Function
def predict(df, input_data, target_col):
    posteriors = calculate_posteriors(df, input_data, target_col)
    # Return the class with the highest posterior probability (MAP decision rule)
    return max(posteriors, key=posteriors.get)
Example Usage
# Example DataFrame
data = {
    'Feature1': [2.5, 3.1, 1.8, 3.6],
    'Feature2': [0.5, 1.2, 0.3, 0.4],
    'Class': ['A', 'B', 'A', 'B']
}
df = pd.DataFrame(data)

# Predicting class for a new data point
new_data = {'Feature1': 2.0, 'Feature2': 0.6}
predicted_class = predict(df, new_data, 'Class')
print(f"Predicted class: {predicted_class}")
Conclusion
Naive Bayes is an elegant algorithm for classification problems, especially when feature independence can be assumed. By implementing it from scratch, we gain deeper insights into its mechanics and appreciate its simplicity and efficiency.