Understanding the Bias-Variance Trade-off

Rahul Jain
4 min read · May 27, 2024


Imagine you’re playing basketball and trying to make a basket. Sometimes you miss because you’re not aiming well (high bias), and sometimes you miss because you’re trying too hard to adjust for every little difference (high variance).

  1. High Bias: You always shoot the ball the same way, but you keep missing because you’re not aiming correctly. Your shots are consistent but off target.
  2. High Variance: You change how you shoot every time, trying to adjust for the wind or how you feel, but this makes your shots inconsistent.

The best way to get the ball in the basket is to find a balance between aiming well and being consistent.

Predicting Exam Scores

Now, let’s think about predicting exam scores based on study hours.

  • High Bias: Imagine you have a simple rule: everyone gets the same score regardless of how much they study. This is like predicting with a flat line: it is too simple to capture the real relationship between study time and results.
  • High Variance: Now imagine you try to create a model that fits every single data point perfectly, including outliers. This model might predict scores well for the training data but will perform poorly on new data because it’s too sensitive to small variations.

The goal is to find a model that accurately captures the general trend without overfitting to the noise.
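
To make this concrete, here is a minimal sketch using a small, hypothetical set of study-hours data (every number is made up for illustration): a mean-only predictor stands in for the high-bias rule, and a degree-14 polynomial that can chase every training point stands in for the high-variance one.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: 15 students, hours studied vs. exam score (illustrative only)
rng = np.random.default_rng(42)
hours = np.sort(rng.uniform(0, 10, 15)).reshape(-1, 1)
scores = 45 + 4 * hours.ravel() + rng.normal(0, 5, 15)

# High bias: ignore study hours and predict the average score for everyone
mean_prediction = np.full_like(scores, scores.mean())

# High variance: a degree-14 polynomial flexible enough to pass near every point
poly = PolynomialFeatures(degree=14)
wiggly_model = LinearRegression().fit(poly.fit_transform(hours), scores)

print("Mean-only training MSE:", np.mean((scores - mean_prediction) ** 2))
print("Degree-14 training MSE:",
      np.mean((scores - wiggly_model.predict(poly.transform(hours))) ** 2))
# The wiggly model looks near-perfect on these 15 points, but its predictions
# for unseen study hours would swing wildly between them.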

Housing Price Prediction

Let’s move to a more complex example: predicting house prices based on various features like size, location, and number of bedrooms.

  1. High Bias: Using a very simple model, such as a linear regression with just the size of the house as the predictor, might not capture the true relationship. This results in underfitting.
  2. High Variance: Using a very complex model, such as a high-degree polynomial regression that fits all the data points exactly, captures the noise in the training data and leads to overfitting.

We aim to build a model that generalizes well to new data by balancing bias and variance.
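
As a rough sketch of this contrast (the article's full polynomial walkthrough follows below), the snippet here uses synthetic, made-up housing features, and an unpruned decision tree stands in for the overly complex model; every feature name and coefficient is an illustrative assumption, not a real dataset.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Synthetic, hypothetical housing data: size, bedrooms, and a location score
rng = np.random.default_rng(0)
n = 200
size = rng.uniform(50, 250, n)
bedrooms = rng.integers(1, 6, n)
location = rng.uniform(0, 1, n)
price = 2000 * size + 15000 * bedrooms + 80000 * location + rng.normal(0, 20000, n)

X = np.column_stack([size, bedrooms, location])
X_train, X_test, y_train, y_test = train_test_split(X, price, test_size=0.3, random_state=0)

# High bias: a linear model that only sees house size and ignores everything else
underfit = LinearRegression().fit(X_train[:, [0]], y_train)
# High variance: an unpruned tree deep enough to memorise the training set
overfit = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

for name, model, cols in [("size-only linear", underfit, [0]),
                          ("unpruned tree", overfit, [0, 1, 2])]:
    train_mse = mean_squared_error(y_train, model.predict(X_train[:, cols]))
    test_mse = mean_squared_error(y_test, model.predict(X_test[:, cols]))
    print(f"{name}: train MSE = {train_mse:.0f}, test MSE = {test_mse:.0f}")
# The size-only model has similar (large) errors on both splits: underfitting.
# The tree's training error collapses toward zero while its test error stays
# far higher: overfitting.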

Advanced Statistical Learning

In more formal terms, the bias-variance trade-off describes how a model's total prediction error breaks down into three parts: bias, variance, and irreducible error.

  • Bias: Error from incorrect assumptions in the model. High bias can cause underfitting.
  • Variance: Error from sensitivity to small fluctuations in the training set. High variance can cause overfitting.
  • Irreducible Error: Error that cannot be reduced by any model due to inherent noise in the data.

Mathematically, the expected prediction error at a point x can be decomposed as:

Expected Error(x) = E[(y - f̂(x))²] = Bias[f̂(x)]² + Var[f̂(x)] + σ²

where f̂ is the fitted model and σ² is the irreducible error.
Our goal is to minimize both bias and variance to achieve the best predictive performance; the irreducible error sets a floor that no model can go below.
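
A hands-on way to see the decomposition is to estimate the bias and variance terms empirically: repeatedly draw a fresh training set, refit the same model class, and look at how its predictions at one fixed point x scatter around the true value. The sketch below does this for a hypothetical sine-shaped true function; the function, noise level, and polynomial degrees are all illustrative assumptions.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
true_f = lambda x: np.sin(2 * x)   # assumed "true" relationship (illustrative)
sigma = 0.3                        # irreducible noise level (illustrative)
x0 = np.array([[1.0]])             # the point x where we decompose the error

def fit_and_predict(degree, n_train=30):
    """Draw a fresh training set, fit a polynomial model, and predict at x0."""
    x = rng.uniform(0, 3, (n_train, 1))
    y = true_f(x).ravel() + rng.normal(0, sigma, n_train)
    poly = PolynomialFeatures(degree)
    model = LinearRegression().fit(poly.fit_transform(x), y)
    return model.predict(poly.transform(x0))[0]

for degree in (1, 10):
    preds = np.array([fit_and_predict(degree) for _ in range(500)])
    bias_sq = (preds.mean() - true_f(x0)[0, 0]) ** 2
    variance = preds.var()
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}, "
          f"irreducible = {sigma ** 2:.4f}")
# Typically the simple (degree-1) model shows larger bias, while the flexible
# (degree-10) model shows larger variance; the noise term is fixed either way.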

Python Code: Comparing Linear and Polynomial Regression

Let’s use Python to demonstrate this with a simple example of predicting house prices.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

# Generating synthetic data
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fitting a simple linear regression model
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Predicting with the linear model
y_pred_lin = lin_reg.predict(X_test)

# Plotting the results
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred_lin, color='red', linewidth=2, label='Predicted')
plt.title('Linear Regression')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()

# Mean Squared Error for linear model
mse_lin = mean_squared_error(y_test, y_pred_lin)
print(f'Linear Model MSE: {mse_lin}')

# Fitting a polynomial regression model
poly_features = PolynomialFeatures(degree=4)
X_poly_train = poly_features.fit_transform(X_train)
X_poly_test = poly_features.transform(X_test)

poly_reg = LinearRegression()
poly_reg.fit(X_poly_train, y_train)

# Predicting with the polynomial model
y_pred_poly = poly_reg.predict(X_poly_test)

# Plotting the results
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.scatter(X_test, y_pred_poly, color='red', label='Predicted')
plt.title('Polynomial Regression')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()

# Mean Squared Error for polynomial model
mse_poly = mean_squared_error(y_test, y_pred_poly)
print(f'Polynomial Model MSE: {mse_poly}')

By comparing the MSE of the linear and polynomial models, we can see how increasing model complexity (and with it, variance) affects performance on held-out data. Because the synthetic data follows a linear trend plus noise, the extra flexibility of the degree-4 polynomial mostly goes into fitting that noise, so its test MSE is typically no better than the linear model's. The goal is to find a balance where the model is complex enough to capture the underlying trend but not so complex that it overfits the data.
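
As a follow-up, one way to look for that balance is to sweep the polynomial degree on the same synthetic data and watch where the test MSE stops improving. This sketch reuses X_train, X_test, and the imports from the snippet above; the degree range is just an illustrative choice.

from sklearn.pipeline import make_pipeline

for degree in range(1, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
# Training MSE keeps drifting down as the degree grows, but once the model has
# captured the underlying (roughly linear) trend, test MSE tends to creep back up.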
