Linear regression is the “hello, world” of supervised learning. This post walks through the math, the LaTeX you’d type to render it, and a NumPy implementation small enough to read on one screen.

Setup

Given a dataset \(\{(x_i, y_i)\}_{i=1}^n\) with \(x_i \in \mathbb{R}^d\) and \(y_i \in \mathbb{R}\), we want to fit a linear model

\[\hat{y}_i = w^\top x_i + b\]

by minimising the mean squared error:

\begin{equation} \label{eq:mse} \mathcal{L}(w, b) \;=\; \frac{1}{n} \sum_{i=1}^{n} \left( y_i - w^\top x_i - b \right)^2 \end{equation}

The objective in equation \eqref{eq:mse} is convex and quadratic in \((w, b)\), so every stationary point is a global minimum; the minimiser is unique whenever the design matrix introduced below has full column rank.

The closed form

Stack the inputs into a design matrix \(X \in \mathbb{R}^{n \times (d+1)}\), whose last column of ones absorbs the bias, and the targets into a vector \(y \in \mathbb{R}^n\). Setting the gradient of \(\mathcal{L}\) to zero gives the normal equations:

\[\theta^\ast \;=\; (X^\top X)^{-1} X^\top y, \qquad \theta = \begin{bmatrix} w \\ b \end{bmatrix}\]

That single line is what you’d type in a paper:

\theta^\ast = (X^\top X)^{-1} X^\top y, \qquad
\theta = \begin{bmatrix} w \\ b \end{bmatrix}
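
For completeness, writing the gradient out makes the derivation explicit (\(\theta\) already carries the bias via the column of ones):

\[\nabla_\theta \mathcal{L}(\theta) \;=\; \frac{2}{n}\, X^\top (X \theta - y) \;=\; 0 \quad\Longleftrightarrow\quad X^\top X \,\theta \;=\; X^\top y\]

The Hessian \(\tfrac{2}{n} X^\top X\) is positive semidefinite, which is exactly the convexity claim from the setup.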

In practice you almost never compute \((X^\top X)^{-1}\) explicitly — it’s both numerically wobbly and unnecessary. np.linalg.solve (LU under the hood) or np.linalg.lstsq (SVD) are both better-behaved.
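
For example, once the bias column has been appended (the X_b pattern used in the code below), the SVD route is a one-liner; a quick sketch:

theta, residuals, rank, sv = np.linalg.lstsq(X_b, y, rcond=None)   # least squares via SVD

lstsq also copes with a rank-deficient \(X\), where the explicit inverse does not exist at all.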

NumPy implementation

import numpy as np


def fit_closed_form(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Solve theta = (X^T X)^{-1} X^T y via a stable linear solve."""
    X_b = np.hstack([X, np.ones((X.shape[0], 1))])      # absorb bias
    return np.linalg.solve(X_b.T @ X_b, X_b.T @ y)


def fit_gradient_descent(
    X: np.ndarray,
    y: np.ndarray,
    lr: float = 1e-2,
    n_steps: int = 1_000,
) -> np.ndarray:
    """Vanilla full-batch GD on the MSE objective."""
    X_b = np.hstack([X, np.ones((X.shape[0], 1))])
    theta = np.zeros(X_b.shape[1])
    n = len(y)
    for _ in range(n_steps):
        grad = (2 / n) * X_b.T @ (X_b @ theta - y)   # full-batch gradient of the MSE
        theta -= lr * grad
    return theta


# Sanity check on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.5, -2.0, 0.7, 0.3])           # last entry is bias
y = X @ true_theta[:-1] + true_theta[-1] + 0.1 * rng.normal(size=200)

theta_cf = fit_closed_form(X, y)
theta_gd = fit_gradient_descent(X, y, lr=1e-2, n_steps=5_000)

print(f"true:    {true_theta}")
print(f"closed:  {theta_cf}")
print(f"GD:      {theta_gd}")

Both estimators recover the true parameters to within the noise floor; with this step size, the gradient-descent path converges to the closed-form answer because the loss surface is convex.

Gradient descent — the update rule

The per-step update for vanilla GD is

\begin{equation} \label{eq:gd-update} \theta_{t+1} \;=\; \theta_t \;-\; \eta \cdot \nabla_\theta \mathcal{L}(\theta_t) \;=\; \theta_t \;-\; \frac{2 \eta}{n} \, X^\top (X \theta_t - y) \end{equation}

with learning rate \(\eta > 0\). Equation \eqref{eq:gd-update} is what the loop in fit_gradient_descent implements one step at a time.

A quick rule of thumb for choosing \(\eta\): with feature scaling so each column of \(X\) has unit variance, anything from \(10^{-3}\) to \(10^{-1}\) tends to work; outside that range expect divergence or glacial convergence.
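
As a concrete sketch of that preprocessing (the variable names are illustrative; it reuses X and fit_gradient_descent from above):

# Standardise each column to zero mean and unit variance before running GD.
X_mean, X_std = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - X_mean) / X_std
theta_scaled = fit_gradient_descent(X_scaled, y, lr=1e-1, n_steps=1_000)
# The learned weights live in the scaled coordinates; divide by X_std (and
# adjust the bias accordingly) to map them back to the original feature space.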

When to reach for which

- \(n \lesssim 10^4\), \(d \lesssim 10^3\), no regularisation: closed form via np.linalg.lstsq.
- Very large \(n\) (out-of-core): mini-batch SGD.
- Adding \(\ell_2\) regularisation (ridge): the closed form still works, \(\theta^\ast = (X^\top X + \lambda I)^{-1} X^\top y\); see the sketch after this list.
- Adding \(\ell_1\) regularisation (lasso): coordinate descent or proximal GD; no closed form.
- Want uncertainty estimates: Bayesian linear regression, the same conjugate-Gaussian story plus a posterior covariance.
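
A minimal ridge sketch matching that formula, reusing the numpy import and the bias-absorbing pattern from the implementation above (leaving the bias unpenalised is a common convention, not something the closed form requires):

def fit_ridge(X: np.ndarray, y: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """Closed-form ridge: solve (X^T X + lam * I) theta = X^T y, bias unpenalised."""
    X_b = np.hstack([X, np.ones((X.shape[0], 1))])
    penalty = lam * np.eye(X_b.shape[1])
    penalty[-1, -1] = 0.0                                # do not regularise the bias entry
    return np.linalg.solve(X_b.T @ X_b + penalty, X_b.T @ y)

As \(\lambda \to 0\) this collapses back to fit_closed_form.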

That’s the whole package: a one-line solution, a one-line LaTeX equation, and a thirty-line NumPy implementation.