Linear Regression: From Equation to Code
Linear regression is the “hello, world” of supervised learning. This post walks through the math, the LaTeX you’d type to render it, and a NumPy implementation small enough to read on one screen.
Setup
Given a dataset \(\{(x_i, y_i)\}_{i=1}^n\) with \(x_i \in \mathbb{R}^d\) and \(y_i \in \mathbb{R}\), we want to fit a linear model
\[\hat{y}_i = w^\top x_i + b\]

by minimising the mean squared error:
\begin{equation} \label{eq:mse} \mathcal{L}(w, b) \;=\; \frac{1}{n} \sum_{i=1}^{n} \left( y_i - w^\top x_i - b \right)^2 \end{equation}
The objective in equation \eqref{eq:mse} is convex and quadratic in \((w, b)\), so it has a unique minimum.
The closed form
Stack the inputs into a design matrix \(X \in \mathbb{R}^{n \times d}\) (with a column of ones absorbed for the bias) and the targets into \(y \in \mathbb{R}^n\). Setting the gradient of \(\mathcal{L}\) to zero gives the normal equations:
\[\theta^\ast \;=\; (X^\top X)^{-1} X^\top y, \qquad \theta = \begin{bmatrix} w \\ b \end{bmatrix}\]

That single line is what you’d type in a paper:

```latex
\theta^\ast = (X^\top X)^{-1} X^\top y, \qquad
\theta = \begin{bmatrix} w \\ b \end{bmatrix}
```
In practice you almost never compute \((X^\top X)^{-1}\) explicitly — it’s both numerically wobbly and unnecessary. np.linalg.solve (LU under the hood) or np.linalg.lstsq (SVD) are both better-behaved.
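A minimal sketch of the two routes on some hypothetical well-conditioned data (the array names here are illustrative, not from the post):

```python
import numpy as np

# Hypothetical well-conditioned data, just to compare the two solvers.
rng = np.random.default_rng(42)
A = rng.normal(size=(50, 4))
t = rng.normal(size=50)

# Route 1: normal equations, solved with an LU factorisation.
x_solve = np.linalg.solve(A.T @ A, A.T @ t)

# Route 2: SVD-based least squares -- never forms A^T A explicitly.
x_lstsq, *_ = np.linalg.lstsq(A, t, rcond=None)

print(np.allclose(x_solve, x_lstsq))  # they agree on well-conditioned data
```

On well-conditioned problems the two agree; the difference shows up when the columns of the design matrix are nearly collinear, where squaring the condition number via \(X^\top X\) costs precision that the SVD route keeps.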
NumPy implementation
```python
import numpy as np


def fit_closed_form(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Solve theta = (X^T X)^{-1} X^T y via a stable linear solve."""
    X_b = np.hstack([X, np.ones((X.shape[0], 1))])  # absorb bias
    return np.linalg.solve(X_b.T @ X_b, X_b.T @ y)


def fit_gradient_descent(
    X: np.ndarray,
    y: np.ndarray,
    lr: float = 1e-2,
    n_steps: int = 1_000,
) -> np.ndarray:
    """Vanilla full-batch GD on the MSE objective."""
    X_b = np.hstack([X, np.ones((X.shape[0], 1))])
    theta = np.zeros(X_b.shape[1])
    n = len(y)
    for _ in range(n_steps):
        grad = (2 / n) * X_b.T @ (X_b @ theta - y)
        theta -= lr * grad
    return theta


# Sanity check on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.5, -2.0, 0.7, 0.3])  # last entry is bias
y = X @ true_theta[:-1] + true_theta[-1] + 0.1 * rng.normal(size=200)

theta_cf = fit_closed_form(X, y)
theta_gd = fit_gradient_descent(X, y, lr=1e-2, n_steps=5_000)

print(f"true:   {true_theta}")
print(f"closed: {theta_cf}")
print(f"GD:     {theta_gd}")
```
Both estimators recover the true parameters to within the noise floor; the gradient-descent path converges to the closed-form answer because the loss surface is convex.
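To make “within the noise floor” quantitative: the training MSE of the closed-form fit should land near the injected noise variance, \(0.1^2 = 0.01\). A quick sketch that re-runs the synthetic setup above:

```python
import numpy as np

# Re-run the synthetic setup from the post (same seed, same shapes).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.5, -2.0, 0.7, 0.3])
y = X @ true_theta[:-1] + true_theta[-1] + 0.1 * rng.normal(size=200)

# Closed-form fit via a stable solve of the normal equations.
X_b = np.hstack([X, np.ones((200, 1))])
theta = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

# Training MSE should sit close to the noise variance 0.1^2 = 0.01.
mse = np.mean((y - X_b @ theta) ** 2)
print(f"training MSE = {mse:.4f}")
```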
Gradient descent — the update rule
The per-step update for vanilla GD is
\begin{equation} \label{eq:gd-update} \theta_{t+1} \;=\; \theta_t \;-\; \eta \cdot \nabla_\theta \mathcal{L}(\theta_t) \;=\; \theta_t \;-\; \frac{2 \eta}{n} \, X^\top (X \theta_t - y) \end{equation}
with learning rate \(\eta > 0\). Equation \eqref{eq:gd-update} is what the loop in fit_gradient_descent implements one step at a time.
A quick rule of thumb for choosing \(\eta\): with feature scaling so each column of \(X\) has unit variance, anything from \(10^{-3}\) to \(10^{-1}\) tends to work; outside that range expect divergence or glacial convergence.
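The rule of thumb has a precise counterpart (a sketch under standard GD analysis, not something the post derives): because the loss is quadratic, its Hessian is the constant matrix \(\frac{2}{n} X^\top X\), and full-batch GD is stable for any \(\eta\) below \(2 / \lambda_{\max}\), the reciprocal of half the largest Hessian eigenvalue:

```python
import numpy as np

# Same synthetic design matrix as the sanity check above.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X_b = np.hstack([X, np.ones((200, 1))])

# Hessian of the MSE objective is constant: H = (2/n) X_b^T X_b.
H = (2 / 200) * X_b.T @ X_b
lam_max = np.linalg.eigvalsh(H).max()

eta_max = 2.0 / lam_max  # step sizes below this bound cannot diverge
print(f"max stable learning rate = {eta_max:.3f}")
```

The post’s default of \(\eta = 10^{-2}\) sits comfortably inside this bound for unit-variance features, which is why the sanity check converges without tuning.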
When to reach for which
| Situation | Method |
|---|---|
| \(n \lesssim 10^4\), \(d \lesssim 10^3\), no regularisation | Closed form via np.linalg.lstsq. |
| Very large \(n\) (out-of-core) | Mini-batch SGD. |
| Adding \(\ell_2\) regularisation (ridge) | Closed form still works: \(\theta^\ast = (X^\top X + \lambda I)^{-1} X^\top y\). |
| Adding \(\ell_1\) regularisation (lasso) | Coordinate descent or proximal GD; no closed form. |
| Want uncertainty estimates | Bayesian linear regression — same conjugate-Gaussian story, plus a posterior covariance. |
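The ridge row of the table translates almost verbatim into code. A sketch (the helper name `fit_ridge` and the \(\lambda\) values are illustrative; note that, matching the table’s formula, this version penalises the bias term too, whereas practice often leaves the bias unregularised):

```python
import numpy as np

def fit_ridge(X: np.ndarray, y: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """Closed-form ridge: solve (X^T X + lam I) theta = X^T y."""
    X_b = np.hstack([X, np.ones((X.shape[0], 1))])  # absorb bias, as before
    d = X_b.shape[1]
    return np.linalg.solve(X_b.T @ X_b + lam * np.eye(d), X_b.T @ y)

# Same synthetic setup as the sanity check above.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + 0.3 + 0.1 * rng.normal(size=200)

theta_ridge = fit_ridge(X, y, lam=1.0)
print(theta_ridge)  # slightly shrunk toward zero relative to the OLS fit
```

Increasing \(\lambda\) shrinks the solution norm monotonically toward zero; at \(\lambda = 0\) the formula reduces to the ordinary closed form.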
That’s the whole package: a one-line solution, a one-line LaTeX equation, and a thirty-line NumPy implementation.