Linear regression is the “hello, world” of supervised learning. This post walks through the math, the LaTeX you’d type to render it, and a NumPy implementation small enough to read on one screen.

Setup

Given a dataset \(\{(x_i, y_i)\}_{i=1}^n\) with \(x_i \in \mathbb{R}^d\) and \(y_i \in \mathbb{R}\), we want to fit a linear model

\[\hat{y}_i = w^\top x_i + b\]

by minimising the mean squared error:

\begin{equation} \label{eq:mse} \mathcal{L}(w, b) \;=\; \frac{1}{n} \sum_{i=1}^{n} \left( y_i - w^\top x_i - b \right)^2 \end{equation}

The objective in equation \eqref{eq:mse} is convex and quadratic in \((w, b)\), so every stationary point is a global minimum; the minimiser is unique whenever the design matrix introduced below has full column rank.

The closed form

Stack the inputs into a design matrix \(X \in \mathbb{R}^{n \times (d+1)}\), whose last column of ones absorbs the bias, and the targets into a vector \(y \in \mathbb{R}^n\). Setting the gradient of \(\mathcal{L}\) to zero gives the normal equations:

\[\theta^\ast \;=\; (X^\top X)^{-1} X^\top y, \qquad \theta = \begin{bmatrix} w \\ b \end{bmatrix}\]

That single line is what you’d type in a paper:

\theta^\ast = (X^\top X)^{-1} X^\top y, \qquad
\theta = \begin{bmatrix} w \\ b \end{bmatrix}
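
For completeness, writing the gradient out makes the derivation explicit (\(\theta\) already carries the bias via the column of ones):

\[\nabla_\theta \mathcal{L}(\theta) \;=\; \frac{2}{n}\, X^\top (X \theta - y) \;=\; 0 \quad\Longleftrightarrow\quad X^\top X \,\theta \;=\; X^\top y\]

The Hessian \(\tfrac{2}{n} X^\top X\) is positive semidefinite, which is exactly the convexity claim from the setup.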

In practice you almost never compute \((X^\top X)^{-1}\) explicitly — it’s both numerically wobbly and unnecessary. np.linalg.solve (LU under the hood) or np.linalg.lstsq (SVD) are both better-behaved.
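
For example, once the bias column has been appended (the X_b pattern used in the code below), the SVD route is a one-liner; a quick sketch:

theta, residuals, rank, sv = np.linalg.lstsq(X_b, y, rcond=None)   # least squares via SVD

lstsq also copes with a rank-deficient \(X\), where the explicit inverse does not exist at all.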

NumPy implementation

import numpy as np


def fit_closed_form(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Solve theta = (X^T X)^{-1} X^T y via a stable linear solve."""
    X_b = np.hstack([X, np.ones((X.shape[0], 1))])      # absorb bias
    return np.linalg.solve(X_b.T @ X_b, X_b.T @ y)


def fit_gradient_descent(
    X: np.ndarray,
    y: np.ndarray,
    lr: float = 1e-2,
    n_steps: int = 1_000,
) -> np.ndarray:
    """Vanilla full-batch GD on the MSE objective."""
    X_b = np.hstack([X, np.ones((X.shape[0], 1))])
    theta = np.zeros(X_b.shape[1])
    n = len(y)
    for _ in range(n_steps):
        grad = (2 / n) * X_b.T @ (X_b @ theta - y)   # full-batch gradient of the MSE
        theta -= lr * grad
    return theta


# Sanity check on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.5, -2.0, 0.7, 0.3])           # last entry is bias
y = X @ true_theta[:-1] + true_theta[-1] + 0.1 * rng.normal(size=200)

theta_cf = fit_closed_form(X, y)
theta_gd = fit_gradient_descent(X, y, lr=1e-2, n_steps=5_000)

print(f"true:    {true_theta}")
print(f"closed:  {theta_cf}")
print(f"GD:      {theta_gd}")

Both estimators recover the true parameters to within the noise floor; with this step size, the gradient-descent path converges to the closed-form answer because the loss surface is convex.

Gradient descent — the update rule

The per-step update for vanilla GD is

\begin{equation} \label{eq:gd-update} \theta_{t+1} \;=\; \theta_t \;-\; \eta \cdot \nabla_\theta \mathcal{L}(\theta_t) \;=\; \theta_t \;-\; \frac{2 \eta}{n} \, X^\top (X \theta_t - y) \end{equation}

with learning rate \(\eta > 0\). Equation \eqref{eq:gd-update} is what the loop in fit_gradient_descent implements one step at a time.

A quick rule of thumb for choosing \(\eta\): with feature scaling so each column of \(X\) has unit variance, anything from \(10^{-3}\) to \(10^{-1}\) tends to work; outside that range expect divergence or glacial convergence.
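
As a concrete sketch of that preprocessing (the variable names are illustrative; it reuses X and fit_gradient_descent from above):

# Standardise each column to zero mean and unit variance before running GD.
X_mean, X_std = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - X_mean) / X_std
theta_scaled = fit_gradient_descent(X_scaled, y, lr=1e-1, n_steps=1_000)
# The learned weights live in the scaled coordinates; divide by X_std (and
# adjust the bias accordingly) to map them back to the original feature space.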

When to reach for which

- \(n \lesssim 10^4\), \(d \lesssim 10^3\), no regularisation: closed form via np.linalg.lstsq.
- Very large \(n\) (out-of-core): mini-batch SGD.
- Adding \(\ell_2\) regularisation (ridge): the closed form still works, \(\theta^\ast = (X^\top X + \lambda I)^{-1} X^\top y\); see the sketch after this list.
- Adding \(\ell_1\) regularisation (lasso): coordinate descent or proximal GD; no closed form.
- Want uncertainty estimates: Bayesian linear regression, the same conjugate-Gaussian story plus a posterior covariance.
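
A minimal ridge sketch matching that formula, reusing the numpy import and the bias-absorbing pattern from the implementation above (leaving the bias unpenalised is a common convention, not something the closed form requires):

def fit_ridge(X: np.ndarray, y: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """Closed-form ridge: solve (X^T X + lam * I) theta = X^T y, bias unpenalised."""
    X_b = np.hstack([X, np.ones((X.shape[0], 1))])
    penalty = lam * np.eye(X_b.shape[1])
    penalty[-1, -1] = 0.0                                # do not regularise the bias entry
    return np.linalg.solve(X_b.T @ X_b + penalty, X_b.T @ y)

As \(\lambda \to 0\) this collapses back to fit_closed_form.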

That’s the whole package: a one-line solution, a one-line LaTeX equation, and a thirty-line NumPy implementation.