Implementing PCA from Scratch: A Detailed Guide

May 24, 2024

Introduction to PCA

Principal Component Analysis (PCA) is a dimensionality reduction technique widely used in data analysis and machine learning. It re-expresses the data along a smaller number of new axes, the principal components, chosen so that they retain as much of the variance in the original data as possible. This makes large datasets easier to visualize and interpret, and it reduces the computational cost of downstream machine learning algorithms.

Concepts Behind PCA

Mean Centering: Subtract the mean of each feature from the dataset to center the data around the origin.
Covariance Matrix: Calculate the covariance matrix to understand how the features in the dataset vary with respect to each other.
Eigenvalues and Eigenvectors: Compute the eigenvalues and eigenvectors of the covariance matrix. Each eigenvector is a candidate principal direction, and its eigenvalue is the variance of the data along that direction.
Sorting and Selecting Principal Components: Sort the eigenvalues in descending order and keep the eigenvectors that belong to the n largest eigenvalues; these are the principal components.
Projecting the Data: Transform the original data by projecting it onto the selected principal components.
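In symbols, the five steps above can be summarized as follows. The notation is an assumption added here for illustration: X is the n × d data matrix (n samples, d features), k is the number of components kept, and the 1/(n − 1) factor matches the default used by np.cov.

```latex
\begin{aligned}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i && \text{(feature means)} \\
X_c &= X - \mathbf{1}\,\bar{x}^{\top} && \text{(mean centering)} \\
C &= \frac{1}{n-1}\, X_c^{\top} X_c && \text{(covariance matrix)} \\
C\,v_j &= \lambda_j v_j, \quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d && \text{(eigendecomposition, sorted)} \\
W &= [\,v_1 \;\cdots\; v_k\,] \in \mathbb{R}^{d \times k} && \text{(top } k \text{ eigenvectors)} \\
Z &= X_c W && \text{(projection)}
\end{aligned}
```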

Step-by-Step Implementation

Let’s break down the implementation of PCA in Python.

Importing Libraries

First, we need to import the necessary libraries.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

Defining the PCA Class

We create a class PCA to encapsulate all the functionality required to perform PCA.

class PCA:
    def __init__(self, n_components):
        self.n_components = n_components
        self.components = None
        self.mean = None

__init__ Method: Initializes the PCA object with the number of principal components to be extracted.

Fit Method

The fit method calculates the principal components from the input data.

    def fit(self, X):
        # mean centering
        self.mean = np.mean(X, axis=0)
        X = X - self.mean

Mean Centering: Subtract the mean of each feature to center the data.
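As a standalone illustration, separate from the class, here is what mean centering does to a tiny array. The data and the names X_toy, feature_means and X_centered are made up for this sketch.

```python
import numpy as np

# Hypothetical toy data: 4 samples (rows), 2 features (columns)
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0],
                  [4.0, 40.0]])

# axis=0 averages down the rows, giving one mean per feature
feature_means = X_toy.mean(axis=0)

# Broadcasting subtracts the per-feature mean from every row
X_centered = X_toy - feature_means

print(feature_means)            # the per-feature means, 2.5 and 25
print(X_centered.mean(axis=0))  # each column now averages to zero
```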

        # covariance matrix
        cov = np.cov(X.T)

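Covariance Matrix: Compute the covariance matrix of the centered data. The transpose is needed because np.cov treats each row as a variable (feature) and each column as an observation.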
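To make the covariance step concrete, the following standalone sketch checks that np.cov on the transposed, centered data agrees with the textbook formula. The random data and all variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))               # hypothetical data: 50 samples, 3 features
X_demo_centered = X_demo - X_demo.mean(axis=0)

# np.cov treats each ROW as a variable, hence the transpose
cov_np = np.cov(X_demo_centered.T)

# The same quantity written out by hand: C = X_c^T X_c / (n - 1)
n_samples = X_demo_centered.shape[0]
cov_manual = X_demo_centered.T @ X_demo_centered / (n_samples - 1)

print(cov_np.shape)                     # (3, 3): one row and column per feature
print(np.allclose(cov_np, cov_manual))  # True
```

Passing rowvar=False to np.cov is an equivalent way to avoid the explicit transpose.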

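        # eigenvalues, eigenvectors
        # np.linalg.eig returns the eigenvalues first; the eigenvectors are the columns
        eigenvalues, eigenvectors = np.linalg.eig(cov)
        # transpose so that each row is one eigenvector
        eigenvectors = eigenvectors.T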

Eigenvectors and Eigenvalues: Compute the eigenvectors and eigenvalues of the covariance matrix.
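Two details of np.linalg.eig are easy to get wrong: it returns the eigenvalues first, and the eigenvectors are the columns of the second return value. The standalone sketch below checks both on a small symmetric matrix chosen for illustration (cov_demo is a stand-in, not a covariance matrix from the article's data).

```python
import numpy as np

# Hypothetical 2x2 symmetric matrix standing in for a covariance matrix
cov_demo = np.array([[2.0, 1.0],
                     [1.0, 2.0]])

# Return order is (eigenvalues, eigenvectors)
eigenvalues, eigenvectors = np.linalg.eig(cov_demo)

# Column i of `eigenvectors` pairs with eigenvalues[i]: C v = lambda v
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    assert np.allclose(cov_demo @ v, eigenvalues[i] * v)

# After transposing, each ROW is an eigenvector
eigenvectors_as_rows = eigenvectors.T

print(np.sort(eigenvalues))  # approximately 1 and 3 for this matrix
```

Because a covariance matrix is symmetric, np.linalg.eigh is a reasonable alternative; it returns the eigenvalues in ascending order.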

        # sort eigenvectors
        idxs = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[idxs]
        eigenvectors = eigenvectors[idxs]

        self.components = eigenvectors[:self.n_components]

Sorting and Selecting Principal Components: Sort the eigenvalues in descending order, reorder the eigenvectors to match, and keep the first n_components eigenvectors as the rows of self.components.
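The sorting step can be traced on made-up numbers. The sketch below is standalone; the eigenvalues, the eigenvectors, and the explained_ratio helper variable are assumptions for illustration. It also shows one common way to judge how many components to keep: the share of total variance each eigenvalue accounts for.

```python
import numpy as np

# Hypothetical eigenvalues with matching eigenvectors (one eigenvector per row)
eigenvalues = np.array([0.5, 3.0, 1.5])
eigenvectors = np.array([[0.0, 0.0, 1.0],
                         [1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0]])

# argsort is ascending, so reverse it to get largest-first indices
order = np.argsort(eigenvalues)[::-1]
eigenvalues_sorted = eigenvalues[order]
eigenvectors_sorted = eigenvectors[order]

# Fraction of the total variance carried by each component
explained_ratio = eigenvalues_sorted / eigenvalues_sorted.sum()

n_components = 2
components = eigenvectors_sorted[:n_components]

print(order)                                   # indices of the eigenvalues, largest first
print(components.shape)                        # (2, 3)
print(explained_ratio[:n_components].sum())    # about 0.9 of the variance is kept
```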

Transform Method

The transform method projects the original data onto the principal components.

    def transform(self, X):
        # project data
        X = X - self.mean
        return np.dot(X, self.components.T)

Projecting the Data: Subtract the mean and project the data onto the principal components using the dot product.
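A useful property to check after projecting is that the new coordinates are uncorrelated and that their variances equal the eigenvalues that were kept. The following standalone sketch verifies this on random data; the data and variable names are assumptions, and np.linalg.eigh is used here instead of eig because the covariance matrix is symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))      # hypothetical data: 100 samples, 3 features

mean = X_demo.mean(axis=0)
X_c = X_demo - mean

# eigh returns (eigenvalues, eigenvectors) with eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_c.T))
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors.T[order][:2]  # shape (2, 3): one component per row

# Projection: each sample gets one coordinate per kept component
Z = X_c @ components.T

# Covariance of the projected data is diagonal, with the top eigenvalues on the diagonal
cov_Z = np.cov(Z.T)

print(Z.shape)                                               # (100, 2)
print(np.allclose(cov_Z, np.diag(eigenvalues[order][:2])))   # True
```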

Testing the Implementation

We test the PCA implementation using the Iris dataset from scikit-learn.

if __name__ == "__main__":
    # Load the dataset
    data = datasets.load_iris()
    X = data.data
    y = data.target

    # Project the data onto the first 2 principal components
    pca = PCA(2)
    pca.fit(X)
    X_projected = pca.transform(X)

    print("Shape of X:", X.shape)
    print("Shape of transformed X:", X_projected.shape)

    x1 = X_projected[:, 0]
    x2 = X_projected[:, 1]

    plt.scatter(
        x1, x2, c=y, edgecolor="none", alpha=0.8, cmap=plt.get_cmap("viridis", 3)
    )
    plt.xlabel("Principal Component 1")
    plt.ylabel("Principal Component 2")
    plt.colorbar()
    plt.show()

Loading the Dataset: Load the Iris dataset.
Fitting the PCA Model: Fit the PCA model to the data.
Transforming the Data: Transform the data using the PCA model.
Plotting the Results: Plot the projected data on a 2D plane.
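For a check that does not depend on a plot, the fragments from the earlier sections can be assembled into one listing and run on synthetic data whose dominant direction is known in advance. This is a sketch under stated assumptions: the synthetic data, the direction (0.6, 0.8), and the alignment variable are invented for illustration, and the unpacking of np.linalg.eig follows NumPy's documented return order of eigenvalues first.

```python
import numpy as np


class PCA:
    def __init__(self, n_components):
        self.n_components = n_components
        self.components = None  # set by fit(): one principal component per row
        self.mean = None        # set by fit(): per-feature mean

    def fit(self, X):
        # mean centering
        self.mean = np.mean(X, axis=0)
        X = X - self.mean

        # covariance matrix (np.cov expects one variable per row)
        cov = np.cov(X.T)

        # eigenvalues first, eigenvectors as columns; transpose to get rows
        eigenvalues, eigenvectors = np.linalg.eig(cov)
        eigenvectors = eigenvectors.T

        # sort by descending eigenvalue and keep the leading rows
        idxs = np.argsort(eigenvalues)[::-1]
        self.components = eigenvectors[idxs][: self.n_components]

    def transform(self, X):
        # center with the training mean, then project
        return np.dot(X - self.mean, self.components.T)


# Synthetic data: points spread along (0.6, 0.8) plus a little noise
rng = np.random.default_rng(0)
direction = np.array([0.6, 0.8])
X_line = np.outer(rng.normal(size=200), direction) + 0.01 * rng.normal(size=(200, 2))

pca = PCA(1)
pca.fit(X_line)
Z_line = pca.transform(X_line)

# Eigenvectors are only defined up to sign, so compare the absolute dot product
alignment = abs(pca.components[0] @ direction)
print(Z_line.shape)  # (200, 1)
print(alignment)     # close to 1: the recovered axis matches the known direction
```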

Conclusion

PCA is an essential tool in the data scientist’s toolkit, offering a way to simplify complex datasets and make them more manageable. By understanding and implementing PCA from scratch, we gain a deeper appreciation for the algorithm and its utility in various applications.

Feel free to experiment with different datasets and the number of principal components to see how PCA can help in your specific use case. Happy coding!