Understanding Different Types of Optimization Techniques: L1, L2
Introduction
In the realm of machine learning, optimization plays a crucial role. It is the process of adjusting the parameters of a model to minimize the error between predicted and actual outcomes.
One of the significant challenges in this process is finding the right balance between bias (error due to overly simplistic models) and variance (error due to overly complex models). This balance is often managed through regularization techniques such as L1 and L2 regularization. These techniques help in controlling the complexity of the model, thereby improving its performance on unseen data.
In this blog, we’ll explore L1 and L2 regularization in detail, understand the mathematics behind them, and demonstrate their application through a logistic regression example.
Explanation of L1 and L2 Regularization
L1 Regularization (Lasso):
- Think of L1 regularization as a way to make your model simpler. Imagine you have a bookshelf full of books, but you only want to keep the most important ones. L1 regularization helps you remove the unnecessary books (features) by shrinking some coefficients to zero.
- It’s like a lazy person who prefers to take shortcuts and ignore some paths entirely.
L2 Regularization (Ridge):
- L2 regularization is about distributing your importance evenly. Instead of removing books, you make sure that no book stands out too much. All books are important but in a balanced way.
- It’s like a meticulous person who carefully adjusts the importance of each path to ensure a smooth journey.
Let’s dive deeper into the mathematics behind these techniques.
L1 Regularization:
- L1 regularization adds a penalty equal to the absolute value of the magnitude of coefficients.
- The cost function for L1 regularization can be written as:
𝐽(𝜃) = Loss(ℎ𝜃(𝑥), 𝑦) + 𝜆 ∑ ∣𝜃𝑖∣
where:
- 𝐽(𝜃) is the cost function.
- Loss(ℎ𝜃(𝑥),𝑦) is the loss function (e.g., mean squared error).
- 𝜆 is the regularization parameter.
- 𝜃𝑖 are the parameters of the model.
L2 Regularization:
- L2 regularization adds a penalty equal to the square of the magnitude of coefficients.
- The cost function for L2 regularization can be written as:
𝐽(𝜃) = Loss(ℎ𝜃(𝑥), 𝑦) + 𝜆 ∑ 𝜃𝑖²
where the terms are the same as in L1 regularization but with squared coefficients.
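To make the two penalty terms concrete, here’s a tiny NumPy sketch (the coefficient values and 𝜆 are made up purely for illustration) that computes each penalty for a small coefficient vector:
import numpy as np

theta = np.array([0.8, -0.2, 0.0, 1.5])  # example coefficients (illustrative values only)
lam = 0.1                                # regularization strength (lambda)

l1_penalty = lam * np.sum(np.abs(theta))  # lambda * sum(|theta_i|)
l2_penalty = lam * np.sum(theta ** 2)     # lambda * sum(theta_i^2)

print(l1_penalty)  # 0.25
print(l2_penalty)  # 0.293
Notice that the zero coefficient contributes nothing to either penalty, while the largest coefficient (1.5) dominates the L2 penalty because it gets squared.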
Which Optimization to Use for Which Problem?
- L1 Regularization (Lasso): Use L1 when you want to perform feature selection, i.e., when you want to identify the most important features and ignore the rest. It’s useful when you suspect that many of the features are irrelevant.
- L2 Regularization (Ridge): Use L2 when you want to prevent overfitting by reducing the magnitude of coefficients but don’t necessarily want to eliminate any features. It’s useful when you have multicollinearity among your features.
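If you’d rather not implement these from scratch, scikit-learn’s LogisticRegression exposes both penalties directly. Here’s a short sketch (the hyperparameter values are arbitrary) of how to switch between them:
from sklearn.linear_model import LogisticRegression

# L1-penalized logistic regression; the liblinear and saga solvers support the L1 penalty
lasso_like = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)

# L2-penalized logistic regression (L2 is the default penalty)
ridge_like = LogisticRegression(penalty='l2', C=1.0)
Note that in scikit-learn, C is the inverse of the regularization strength, so a smaller C means stronger regularization.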
Implementing Logistic Regression with L1 and L2 Regularization from Scratch
Let’s demonstrate the usefulness of these optimizations with a logistic regression example. We’ll implement logistic regression with both L1 and L2 regularization.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
# Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(X)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Explanation:
- Loading Dataset: We use the breast cancer dataset from sklearn, which provides features of tumors and their labels (malignant or benign).
- Standardizing Features: We standardize the features using StandardScaler to ensure they have a mean of 0 and a standard deviation of 1, which helps in faster and more accurate convergence during training.
- Splitting Data: We split the data into training and testing sets to evaluate model performance.
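As a quick, optional sanity check, you can confirm that the preprocessing behaves as expected:
# Each standardized feature should have mean ~0 and standard deviation ~1
print(X.mean(axis=0).round(2))
print(X.std(axis=0).round(2))

# Expect roughly an 80/20 split of the dataset's 569 samples
print(X_train.shape, X_test.shape)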
class LogisticRegressionL2:
    def __init__(self, lr=0.01, iterations=1000, lambda_param=0.01):
        self.lr = lr
        self.iterations = iterations
        self.lambda_param = lambda_param

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        self.theta = np.zeros(X.shape[1])
        self.bias = 0
        m = X.shape[0]
        for _ in range(self.iterations):
            linear_model = np.dot(X, self.theta) + self.bias
            y_predicted = self.sigmoid(linear_model)
            # Gradient descent with L2 penalty: the (lambda/m) * theta term shrinks the weights
            d_theta = (1 / m) * np.dot(X.T, (y_predicted - y)) + (self.lambda_param / m) * self.theta
            # The bias term is not regularized
            d_bias = (1 / m) * np.sum(y_predicted - y)
            self.theta -= self.lr * d_theta
            self.bias -= self.lr * d_bias

    def predict(self, X):
        linear_model = np.dot(X, self.theta) + self.bias
        y_predicted = self.sigmoid(linear_model)
        # Threshold the predicted probabilities at 0.5
        y_predicted_cls = [1 if i > 0.5 else 0 for i in y_predicted]
        return np.array(y_predicted_cls)
# Train model with L2 regularization
model_l2 = LogisticRegressionL2(lambda_param=0.1)
model_l2.fit(X_train, y_train)
predictions_l2 = model_l2.predict(X_test)
Explanation:
- Class Definition: The LogisticRegressionL2 class implements logistic regression with L2 regularization.
- Sigmoid Function: The sigmoid function computes the logistic function, which maps the linear model's output to a probability.
- Fit Method: In the fit method, we initialize the parameters and iterate over the training data to update the weights (theta) and bias using gradient descent. The L2 penalty contributes an extra (𝜆 / 𝑚) · 𝜃𝑖 term to each weight's gradient, where 𝜆 is the regularization parameter and 𝑚 is the number of samples, which steadily shrinks the weights toward zero.
- Predict Method: The predict method calculates predictions based on the learned parameters. The decision threshold is 0.5 to classify the output as either 1 or 0.
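To connect this back to the cost function from earlier, here’s a small illustrative sketch (not part of the class above; the helper name compute_cost_l2 is ours, and it assumes cross-entropy as the loss) of the L2-regularized cost that the fit method's gradient step is descending:
def compute_cost_l2(theta, bias, X, y, lambda_param):
    m = X.shape[0]
    h = 1 / (1 + np.exp(-(np.dot(X, theta) + bias)))  # predicted probabilities
    eps = 1e-15                                       # avoid log(0)
    cross_entropy = -(1 / m) * np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
    # L2 penalty written as (lambda / 2m) * sum(theta_i^2); its gradient is (lambda / m) * theta
    l2_term = (lambda_param / (2 * m)) * np.sum(theta ** 2)
    return cross_entropy + l2_term
With the penalty written as (𝜆 / 2𝑚) ∑ 𝜃𝑖², its derivative with respect to each weight is exactly the (𝜆 / 𝑚) · 𝜃𝑖 term that appears in d_theta above.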
class LogisticRegressionL1:
    def __init__(self, lr=0.01, iterations=1000, lambda_param=0.01):
        self.lr = lr
        self.iterations = iterations
        self.lambda_param = lambda_param

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        self.theta = np.zeros(X.shape[1])
        self.bias = 0
        m = X.shape[0]
        for _ in range(self.iterations):
            linear_model = np.dot(X, self.theta) + self.bias
            y_predicted = self.sigmoid(linear_model)
            # Gradient descent with L1 penalty: np.sign(theta) is the (sub)gradient of |theta|
            d_theta = (1 / m) * np.dot(X.T, (y_predicted - y)) + (self.lambda_param / m) * np.sign(self.theta)
            d_bias = (1 / m) * np.sum(y_predicted - y)
            self.theta -= self.lr * d_theta
            self.bias -= self.lr * d_bias

    def predict(self, X):
        linear_model = np.dot(X, self.theta) + self.bias
        y_predicted = self.sigmoid(linear_model)
        y_predicted_cls = [1 if i > 0.5 else 0 for i in y_predicted]
        return np.array(y_predicted_cls)
# Train model with L1 regularization
model_l1 = LogisticRegressionL1(lambda_param=0.1)
model_l1.fit(X_train, y_train)
predictions_l1 = model_l1.predict(X_test)
Explanation:
- Class Definition: The LogisticRegressionL1 class implements logistic regression with L1 regularization.
- Fit Method: In the fit method, the gradient descent step accounts for the L1 penalty 𝜆 / 𝑚 · ∑ ∣𝜃𝑖∣. Because the absolute value is not differentiable at zero, the np.sign function is used as its (sub)gradient, contributing (𝜆 / 𝑚) · sign(𝜃𝑖) to each weight update.
- Predict Method: Similar to the L2 implementation, this method calculates predictions based on the learned parameters.
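One subtle point worth calling out: the absolute value has no derivative at exactly zero. Conveniently, np.sign returns 0 there, so a weight that has already been driven to zero receives no further penalty gradient. A one-line check:
print(np.sign(np.array([-0.3, 0.0, 2.1])))  # [-1.  0.  1.]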
from sklearn.metrics import accuracy_score
# Evaluate the L2 regularized model
accuracy_l2 = accuracy_score(y_test, predictions_l2)
print(f'L2 Regularized Logistic Regression Accuracy: {accuracy_l2}')
# Evaluate the L1 regularized model
accuracy_l1 = accuracy_score(y_test, predictions_l1)
print(f'L1 Regularized Logistic Regression Accuracy: {accuracy_l1}')
Explanation:
- Accuracy Calculation: We use the accuracy_score function from sklearn to evaluate the performance of both models. Accuracy is calculated as the ratio of correctly predicted instances to the total instances.
- Print Accuracy: The accuracy of both the L2 and L1 regularized logistic regression models is printed to compare their performance.
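Accuracy alone doesn’t show the most interesting difference between the two models, so it’s also worth inspecting the learned coefficients. A rough sketch like the one below (the 1e-2 threshold is an arbitrary choice) gives a sense of how sparse each solution is; with these settings you would generally expect the L1 model to end up with more near-zero weights than the L2 model:
# Count how many learned weights are (near) zero in each model
near_zero_l1 = np.sum(np.abs(model_l1.theta) < 1e-2)
near_zero_l2 = np.sum(np.abs(model_l2.theta) < 1e-2)
print(f'L1 near-zero coefficients: {near_zero_l1} / {len(model_l1.theta)}')
print(f'L2 near-zero coefficients: {near_zero_l2} / {len(model_l2.theta)}')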
Conclusion
Understanding different optimization techniques like L1 and L2 regularization is crucial for building effective machine learning models.
L1 regularization helps with feature selection, while L2 regularization prevents overfitting by reducing the magnitude of coefficients. By balancing these techniques, we can create models that generalize well to new data.