Understanding Different Types of Optimization Techniques: L1, L2
Introduction
In the realm of machine learning, optimization plays a crucial role. It is the process of adjusting the parameters of a model to minimize the error between predicted and actual outcomes.
One of the significant challenges in this process is finding the right balance between bias (error due to overly simplistic models) and variance (error due to overly complex models). This balance is often managed through regularization techniques such as L1 and L2 regularization. These techniques help in controlling the complexity of the model, thereby improving its performance on unseen data.
In this blog, we’ll explore L1 and L2 regularization in detail, understand the mathematics behind them, and demonstrate their application through a logistic regression example.
Explanation of L1 and L2 Regularization
L1 Regularization (Lasso):
- Think of L1 regularization as a way to make your model simpler. Imagine you have a bookshelf full of books, but you only want to keep the most important ones. L1 regularization helps you remove the unnecessary books (features) by shrinking some coefficients to zero.
- It’s like a lazy person who prefers to take shortcuts and ignore some paths entirely.
L2 Regularization (Ridge):
- L2 regularization is about distributing your importance evenly. Instead of removing books, you make sure that no book stands out too much. All books are important but in a balanced way.
- It’s like a meticulous person who carefully adjusts the importance of each path to ensure a smooth journey.
Let’s dive deeper into the mathematics behind these techniques.
L1 Regularization:
- L1 regularization adds a penalty equal to the absolute value of the magnitude of coefficients.
- The cost function for L1 regularization can be written as:
𝐽(𝜃) = Loss(ℎ𝜃(𝑥), 𝑦) + 𝜆 ∑ ∣𝜃𝑖∣
where:
- 𝐽(𝜃) is the cost function.
- Loss(ℎ𝜃(𝑥),𝑦) is the loss function (e.g., mean squared error).
- 𝜆 is the regularization parameter.
- 𝜃𝑖 are the parameters of the model.
L2 Regularization:
- L2 regularization adds a penalty equal to the square of the magnitude of coefficients.
- The cost function for L2 regularization can be written as:
𝐽(𝜃) = Loss(ℎ𝜃(𝑥), 𝑦) + 𝜆 ∑ 𝜃𝑖²
where the terms are the same as in L1 regularization but with squared coefficients.
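To make the two penalty terms concrete, here’s a tiny NumPy sketch (the coefficient values and 𝜆 are made up purely for illustration) that computes each penalty for a small coefficient vector:
import numpy as np

theta = np.array([0.8, -0.2, 0.0, 1.5])  # example coefficients (illustrative values only)
lam = 0.1                                # regularization strength (lambda)

l1_penalty = lam * np.sum(np.abs(theta))  # lambda * sum(|theta_i|)
l2_penalty = lam * np.sum(theta ** 2)     # lambda * sum(theta_i^2)

print(l1_penalty)  # 0.25
print(l2_penalty)  # 0.293
Notice that the zero coefficient contributes nothing to either penalty, while the largest coefficient (1.5) dominates the L2 penalty because it gets squared.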
Which Optimization to Use for Which Problem?
- L1 Regularization (Lasso): Use L1 when you want to perform feature selection, i.e., when you want to identify the most important features and ignore the rest. It’s useful when you suspect that many of the features are irrelevant.
- L2 Regularization (Ridge): Use L2 when you want to prevent overfitting by reducing the magnitude of coefficients but don’t necessarily want to eliminate any features. It’s useful when you have multicollinearity among your features.
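If you’d rather not implement these from scratch, scikit-learn’s LogisticRegression exposes both penalties directly. Here’s a short sketch (the hyperparameter values are arbitrary) of how to switch between them:
from sklearn.linear_model import LogisticRegression

# L1-penalized logistic regression; the liblinear and saga solvers support the L1 penalty
lasso_like = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)

# L2-penalized logistic regression (L2 is the default penalty)
ridge_like = LogisticRegression(penalty='l2', C=1.0)
Note that in scikit-learn, C is the inverse of the regularization strength, so a smaller C means stronger regularization.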
Implementing Logistic Regression with L1 and L2 Regularization from Scratch
Let’s demonstrate the usefulness of these optimizations with a logistic regression example. We’ll implement logistic regression with both L1 and L2 regularization.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
# Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(X)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Explanation:
- Loading Dataset: We use the breast cancer dataset from sklearn, which provides features of tumors and their labels (malignant or benign).
- Standardizing Features: We standardize the features using StandardScaler to ensure they have a mean of 0 and a standard deviation of 1, which helps in faster and more accurate convergence during training.
- Splitting Data: We split the data into training and testing sets to evaluate model performance.
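As a quick, optional sanity check, you can confirm that the preprocessing behaves as expected:
# Each standardized feature should have mean ~0 and standard deviation ~1
print(X.mean(axis=0).round(2))
print(X.std(axis=0).round(2))

# Expect roughly an 80/20 split of the dataset's 569 samples
print(X_train.shape, X_test.shape)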
class LogisticRegressionL2:
    def __init__(self, lr=0.01, iterations=1000, lambda_param=0.01):
        self.lr = lr
        self.iterations = iterations
        self.lambda_param = lambda_param

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        self.theta = np.zeros(X.shape[1])
        self.bias = 0
        m = X.shape[0]
        for _ in range(self.iterations):
            linear_model = np.dot(X, self.theta) + self.bias
            y_predicted = self.sigmoid(linear_model)
            # Gradient descent with L2 penalty: the (lambda/m) * theta term shrinks the weights
            d_theta = (1 / m) * np.dot(X.T, (y_predicted - y)) + (self.lambda_param / m) * self.theta
            # The bias term is not regularized
            d_bias = (1 / m) * np.sum(y_predicted - y)
            self.theta -= self.lr * d_theta
            self.bias -= self.lr * d_bias

    def predict(self, X):
        linear_model = np.dot(X, self.theta) + self.bias
        y_predicted = self.sigmoid(linear_model)
        # Threshold the predicted probabilities at 0.5
        y_predicted_cls = [1 if i > 0.5 else 0 for i in y_predicted]
        return np.array(y_predicted_cls)
# Train model with L2 regularization
model_l2 = LogisticRegressionL2(lambda_param=0.1)
model_l2.fit(X_train, y_train)
predictions_l2 = model_l2.predict(X_test)
Explanation:
- Class Definition: The LogisticRegressionL2 class implements logistic regression with L2 regularization.
- Sigmoid Function: The sigmoid function computes the logistic function, which maps the linear model's output to a probability.
- Fit Method: In the fit method, we initialize the parameters and iterate over the training data to update the weights (theta) and bias using gradient descent. The L2 penalty contributes an extra (𝜆 / 𝑚) · 𝜃𝑖 term to each weight's gradient, where 𝜆 is the regularization parameter and 𝑚 is the number of samples, which steadily shrinks the weights toward zero.
- Predict Method: The predict method calculates predictions based on the learned parameters. The decision threshold is 0.5 to classify the output as either 1 or 0.
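To connect this back to the cost function from earlier, here’s a small illustrative sketch (not part of the class above; the helper name compute_cost_l2 is ours, and it assumes cross-entropy as the loss) of the L2-regularized cost that the fit method's gradient step is descending:
def compute_cost_l2(theta, bias, X, y, lambda_param):
    m = X.shape[0]
    h = 1 / (1 + np.exp(-(np.dot(X, theta) + bias)))  # predicted probabilities
    eps = 1e-15                                       # avoid log(0)
    cross_entropy = -(1 / m) * np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
    # L2 penalty written as (lambda / 2m) * sum(theta_i^2); its gradient is (lambda / m) * theta
    l2_term = (lambda_param / (2 * m)) * np.sum(theta ** 2)
    return cross_entropy + l2_term
With the penalty written as (𝜆 / 2𝑚) ∑ 𝜃𝑖², its derivative with respect to each weight is exactly the (𝜆 / 𝑚) · 𝜃𝑖 term that appears in d_theta above.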
class LogisticRegressionL1:
    def __init__(self, lr=0.01, iterations=1000, lambda_param=0.01):
        self.lr = lr
        self.iterations = iterations
        self.lambda_param = lambda_param

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        self.theta = np.zeros(X.shape[1])
        self.bias = 0
        m = X.shape[0]
        for _ in range(self.iterations):
            linear_model = np.dot(X, self.theta) + self.bias
            y_predicted = self.sigmoid(linear_model)
            # Gradient descent with L1 penalty: np.sign(theta) is the (sub)gradient of |theta|
            d_theta = (1 / m) * np.dot(X.T, (y_predicted - y)) + (self.lambda_param / m) * np.sign(self.theta)
            d_bias = (1 / m) * np.sum(y_predicted - y)
            self.theta -= self.lr * d_theta
            self.bias -= self.lr * d_bias

    def predict(self, X):
        linear_model = np.dot(X, self.theta) + self.bias
        y_predicted = self.sigmoid(linear_model)
        y_predicted_cls = [1 if i > 0.5 else 0 for i in y_predicted]
        return np.array(y_predicted_cls)
# Train model with L1 regularization
model_l1 = LogisticRegressionL1(lambda_param=0.1)
model_l1.fit(X_train, y_train)
predictions_l1 = model_l1.predict(X_test)
Explanation:
- Class Definition: The LogisticRegressionL1 class implements logistic regression with L1 regularization.
- Fit Method: In the fit method, the gradient descent step accounts for the L1 penalty 𝜆 / 𝑚 · ∑ ∣𝜃𝑖∣. Because the absolute value is not differentiable at zero, the np.sign function is used as its (sub)gradient, contributing (𝜆 / 𝑚) · sign(𝜃𝑖) to each weight update.
- Predict Method: Similar to the L2 implementation, this method calculates predictions based on the learned parameters.
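One subtle point worth calling out: the absolute value has no derivative at exactly zero. Conveniently, np.sign returns 0 there, so a weight that has already been driven to zero receives no further penalty gradient. A one-line check:
print(np.sign(np.array([-0.3, 0.0, 2.1])))  # [-1.  0.  1.]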
from sklearn.metrics import accuracy_score
# Evaluate the L2 regularized model
accuracy_l2 = accuracy_score(y_test, predictions_l2)
print(f'L2 Regularized Logistic Regression Accuracy: {accuracy_l2}')
# Evaluate the L1 regularized model
accuracy_l1 = accuracy_score(y_test, predictions_l1)
print(f'L1 Regularized Logistic Regression Accuracy: {accuracy_l1}')
Explanation:
- Accuracy Calculation: We use the accuracy_score function from sklearn to evaluate the performance of both models. Accuracy is calculated as the ratio of correctly predicted instances to the total instances.
- Print Accuracy: The accuracy of both the L2 and L1 regularized logistic regression models is printed to compare their performance.
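Accuracy alone doesn’t show the most interesting difference between the two models, so it’s also worth inspecting the learned coefficients. A rough sketch like the one below (the 1e-2 threshold is an arbitrary choice) gives a sense of how sparse each solution is; with these settings you would generally expect the L1 model to end up with more near-zero weights than the L2 model:
# Count how many learned weights are (near) zero in each model
near_zero_l1 = np.sum(np.abs(model_l1.theta) < 1e-2)
near_zero_l2 = np.sum(np.abs(model_l2.theta) < 1e-2)
print(f'L1 near-zero coefficients: {near_zero_l1} / {len(model_l1.theta)}')
print(f'L2 near-zero coefficients: {near_zero_l2} / {len(model_l2.theta)}')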
Conclusion
Understanding different optimization techniques like L1 and L2 regularization is crucial for building effective machine learning models.
L1 regularization helps with feature selection, while L2 regularization prevents overfitting by reducing the magnitude of coefficients. By balancing these techniques, we can create models that generalize well to new data.