Linear Regression

Peter Ma | Dec 28th 2025

Linear regression is exactly what it sounds like. We assume the underlying model is a linear relationship between the data $X$ and the labels $y$: they are related by $\hat y_\theta = W^T X + \beta$, where the weights are $\theta = (W, \beta)$. We try to find $$\theta^* := \text{argmin}_\theta \; \mathcal{L}(\theta, y)$$ If we assume the loss function is the squared error (as in $\chi^2$), then $$\theta^* := \text{argmin}_\theta \; (W^T X + \beta - y)^2$$ Let us write this down more carefully. With the residual vector $E := W^T X + \beta - y$, $$\mathcal{L} = (W^T X + \beta - y)^T(W^T X + \beta - y) = E^T E$$ Expanding, $$\mathcal{L} = (W^T X + \beta)^T(W^T X + \beta) - (W^T X + \beta)^T y - y^T(W^T X + \beta) + y^T y$$ Okay, let's bang this out: $$\mathcal{L} = (W^T X + \beta)^T(W^T X + \beta) - 2 y^T(W^T X + \beta) + y^T y$$

We can optimize this by differentiating with the chain rule, $$\frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}}{\partial E}\frac{\partial E}{\partial W}$$ and setting the derivative to zero, which gives the normal equation $$X X^T W = X y^T - X \beta^T$$ $$W = (X X^T)^{-1} (X y^T - X \beta^T)$$ We could do the same for $\beta$, but the two equations are coupled, so we would have to solve a system of equations, which is annoying; $W$ and $\beta$ are hard to solve for separately.
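As a quick sanity check on the algebra, here is a minimal sketch on synthetic data (made-up shapes and values, no bias term for simplicity) confirming that the closed-form solution actually zeroes the gradient of the squared-error loss, computed with jax.grad:

import jax
import jax.numpy as jnp

# Synthetic data in the features-as-rows convention used above: X is (d, n), y is (1, n)
key = jax.random.PRNGKey(0)
d, n = 3, 50
X = jax.random.normal(key, (d, n))
true_W = jnp.array([[2.0], [-1.0], [0.5]])
y = true_W.T @ X  # noiseless, and beta = 0, purely for illustration

def loss(W):
    E = W.T @ X - y          # residuals, one per data point
    return jnp.sum(E ** 2)   # squared-error loss E^T E (a scalar)

# Closed-form solution (with beta = 0): W = (X X^T)^{-1} X y^T
W_star = jnp.linalg.solve(X @ X.T, X @ y.T)

print(jax.grad(loss)(W_star))  # ~0 everywhere: the normal equation kills the gradient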

Consider the following: we apply the bias trick, folding $\beta$ into the weight vector by prepending a constant $1$ to each input, $$\tilde{W}^T \tilde{x} = \begin{bmatrix} \beta & w_1 & w_2 \end{bmatrix} \cdot \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix}$$ With the augmented $X$, this lets us write a single closed-form solution: $$W^* = (XX^T)^{-1} X y^T$$

import jax
import jax.numpy as jnp
import matplotlib.pyplot as plt
from tensorflow.keras.datasets import boston_housing

# 1. Load Data
(X_train, y_train), (X_test, y_test) = boston_housing.load_data()

X_train = jnp.array(X_train)
X_test = jnp.array(X_test) 
y_train = jnp.array(y_train[:, None])
y_test = jnp.array(y_test[:, None])

print(f"Training data shape: {X_train.shape}")
print(f"Target data shape:   {y_train.shape}")

# 2. The Bias Trick
train_ones = jnp.ones((X_train.shape[0], 1))
test_ones = jnp.ones((X_test.shape[0], 1))

X_train_aug = jnp.hstack((train_ones, X_train))
X_test_aug = jnp.hstack((test_ones, X_test))

# 3. The Normal Equation
# W = (X^T X)^-1 X^T y  (rows of X are samples here, i.e. this X is the transpose of the X in the math above)
# We use solve() for stability instead of explicitly inverting: (X^T X) W = X^T y
XTX = X_train_aug.T @ X_train_aug
XTy = X_train_aug.T @ y_train

W = jnp.linalg.solve(XTX, XTy)

print(f"\nWeights shape: {W.shape}") 


# 4. Predict on the test set
y_pred = X_test_aug @ W

# 5. Evaluate with MSE
mse = jnp.mean((y_test - y_pred) ** 2)
print(f"Test MSE: {mse:.2f}")

plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.7)
plt.plot([0, 50], [0, 50], '--r', label='Perfect Prediction') # Diagonal line
plt.xlabel("Actual Prices ($1000s)")
plt.ylabel("Predicted Prices ($1000s)")
plt.title("Boston Housing: Linear Regression with JAX")
plt.legend()
plt.grid(True)
plt.show()
Training data shape: (404, 13)
Target data shape:   (404, 1)

Weights shape: (14, 1)
Test MSE: 23.20

Probabilistic / Generative Interpretation

Instead of just saying $y \approx mx + b$, we assume the target value $y$ comes from a deterministic signal plus random noise $\epsilon$: $$y = \mathbf{w}^T\mathbf{x} + \epsilon$$

In the probabilistic view, we define the distribution of this noise. Let's just assume that the noise is Gaussian (normal) with mean 0 and variance $\sigma^2$: $$\epsilon \sim \mathcal{N}(0, \sigma^2)$$

Since $\epsilon$ is Gaussian, and $\mathbf{w}^T\mathbf{x}$ is just a constant shift (for a fixed $\mathbf{x}$), the target $y$ itself follows a Gaussian distribution centered at the regression line: $$p(y | \mathbf{x}, \mathbf{w}) = \mathcal{N}(y | \mu=\mathbf{w}^T\mathbf{x}, \sigma^2)$$ This is the probability density function (PDF) for a single data point: $$p(y_i | \mathbf{x}_i, \mathbf{w}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \mathbf{w}^T\mathbf{x}_i)^2}{2\sigma^2} \right)$$
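As a small sanity check (with arbitrary made-up values for $\mathbf{w}$, $\mathbf{x}_i$, $y_i$ and $\sigma$, purely for illustration), the hand-written PDF above matches what jax.scipy gives for a Gaussian centered at $\mathbf{w}^T\mathbf{x}_i$:

import jax.numpy as jnp
from jax.scipy.stats import norm

# Arbitrary example values, for illustration only
w = jnp.array([2.0, -1.0])
x_i = jnp.array([0.5, 1.5])
y_i = 3.0
sigma = 0.8

mu = w @ x_i  # w^T x_i, the mean of the Gaussian for this data point

# The PDF written out by hand, as in the formula above
p_manual = jnp.exp(-(y_i - mu) ** 2 / (2 * sigma ** 2)) / jnp.sqrt(2 * jnp.pi * sigma ** 2)

# The same density from jax.scipy
p_jax = norm.pdf(y_i, loc=mu, scale=sigma)

print(p_manual, p_jax)  # identical up to floating point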

We want to find the weights $\mathbf{w}$ that explain all our data points best. Assuming the data points are independent and identically distributed (IID), the probability of observing our entire dataset $Y$ is the product of the individual probabilities. This is called the Likelihood: $$L(\mathbf{w}) = \prod_{i=1}^{N} p(y_i | \mathbf{x}_i, \mathbf{w})$$

Substituting the Gaussian PDF from before:$$L(\mathbf{w}) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \mathbf{w}^T\mathbf{x}_i)^2}{2\sigma^2} \right)$$

Maximum Likelihood Estimation (MLE): Our goal is to find the $\mathbf{w}$ that maximizes this Likelihood. To make the math easier (turning products into sums), we maximize the Log-Likelihood instead. $$\ln L(\mathbf{w}) = \sum_{i=1}^{N} \ln \left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \mathbf{w}^T\mathbf{x}_i)^2}{2\sigma^2} \right) \right]$$ Splitting the log terms gives us: $$\ln L(\mathbf{w}) = \underbrace{- \frac{N}{2} \ln(2\pi\sigma^2)}_{\text{Constant}} - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - \mathbf{w}^T\mathbf{x}_i)^2$$
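To see that this split is just bookkeeping, here is a quick numerical check on made-up data (shapes, weights and $\sigma$ chosen arbitrarily): summing the per-point Gaussian log-densities gives the same number as the constant term minus the scaled sum of squared residuals.

import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

# Made-up dataset: N samples as rows, 2 features, arbitrary weights and noise scale
key = jax.random.PRNGKey(1)
N = 100
X = jax.random.normal(key, (N, 2))
w = jnp.array([1.5, -0.5])
sigma = 0.7
y = X @ w + sigma * jax.random.normal(jax.random.PRNGKey(2), (N,))

# Log-likelihood summed point by point
log_L_direct = jnp.sum(norm.logpdf(y, loc=X @ w, scale=sigma))

# Constant term minus the scaled sum of squares, as derived above
sse = jnp.sum((y - X @ w) ** 2)
log_L_split = -N / 2 * jnp.log(2 * jnp.pi * sigma ** 2) - sse / (2 * sigma ** 2)

print(log_L_direct, log_L_split)  # the two agree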

Now, to find the best $\mathbf{w}$, we maximize this expression. The first term is constant with respect to $\mathbf{w}$, so we ignore it. The factor $\frac{1}{2\sigma^2}$ is just a positive scaling factor, so we ignore it too. We are left with maximizing the negative sum of squares: $$\text{Maximize } -\sum_{i=1}^{N} (y_i - \mathbf{w}^T\mathbf{x}_i)^2$$ Maximizing a negative is the same as minimizing the positive: $$\boxed{\text{Minimize } \sum_{i=1}^{N} (y_i - \mathbf{w}^T\mathbf{x}_i)^2}$$ Conclusion: this is exactly the MSE loss function used in standard least-squares regression.
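And to close the loop, a minimal sketch (synthetic data, an assumed known $\sigma$, and an arbitrarily chosen learning rate and step count) showing that minimizing the negative log-likelihood by gradient descent with jax.grad lands on the same weights as the closed-form least-squares solution:

import jax
import jax.numpy as jnp

# Synthetic data, rows are samples; the first column of ones is the bias trick again
key = jax.random.PRNGKey(0)
N = 200
X = jnp.hstack([jnp.ones((N, 1)), jax.random.normal(key, (N, 2))])
w_true = jnp.array([0.5, 2.0, -1.0])
y = X @ w_true + 0.3 * jax.random.normal(jax.random.PRNGKey(1), (N,))

sigma = 0.3  # assume the noise scale is known

def neg_log_likelihood(w):
    resid = y - X @ w
    return N / 2 * jnp.log(2 * jnp.pi * sigma ** 2) + jnp.sum(resid ** 2) / (2 * sigma ** 2)

# Gradient descent on the negative log-likelihood (MLE)
w = jnp.zeros(3)
grad_fn = jax.jit(jax.grad(neg_log_likelihood))
for _ in range(500):
    w = w - 1e-4 * grad_fn(w)

# Closed-form least-squares solution via the normal equation
w_ls = jnp.linalg.solve(X.T @ X, X.T @ y)

print(w)     # MLE by gradient descent
print(w_ls)  # least squares -- the same weights, up to optimizer tolerance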