Logistic Regression¶

Peter Ma | Dec 31st 2025¶

Logistic regression is effectively the classification counterpart to linear regression. Instead of assuming a linear relationship between the data and the labels, we assume a linear relationship between the data $X$ and the log-odds of the labels $y$.

Why Log-Odds?¶

Reminder: the odds of an outcome is the ratio of the probability that it happens to the probability that it doesn't, and the log-odds is simply the logarithm of that ratio. For example, if the probability of success is $P$, then the odds of success are $P/(1-P)$ and the log-odds are $\ln(P/(1-P))$. But why do we care about log-odds?

Well... you can't predict a raw probability with a line like $y = mx+b$, because the line will eventually go above 1 or below 0, which is impossible for a probability.

But log-odds cover the entire number line from $-\infty$ to $+\infty$, so we can set them equal to a linear equation without breaking any rules! This effectively translates the classification problem into a regression-like problem.

$$\ln\left(\frac{P}{1-P}\right) = W^T X + \beta \tag{binary class.}$$

FYI: log-odds is also called logits!
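
To make the range concrete, here is a quick numerical sketch of my own (not part of the original derivation) showing how probabilities squeezed into $(0,1)$ get stretched over the whole real line once we take log-odds:

In [ ]:
import jax.numpy as jnp

# A few probabilities between 0 and 1
p = jnp.array([0.01, 0.1, 0.5, 0.9, 0.99])

odds = p / (1 - p)        # ratio of success to failure: lives in (0, +inf)
log_odds = jnp.log(odds)  # the logit: lives in (-inf, +inf), symmetric around p = 0.5

print(odds)      # ~[0.0101, 0.111, 1.0, 9.0, 99.0]
print(log_odds)  # ~[-4.6, -2.2, 0.0, 2.2, 4.6]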

Formalise¶

Consider a binary classification problem. As discussed before, the goal is to model the log-odds linearly, which means the prediction takes the form $$\hat y_\theta = \sigma(W^T X + \beta)$$ for some squashing function $\sigma$ that we still need to pin down.

We want to construct $\sigma$ such that this holds: $$\ln\left(\frac{P}{1-P}\right) = W^T X + \beta \tag{binary class.}$$ Exponentiating both sides, $$\frac{P}{1-P} = \exp\left(W^T X + \beta\right)$$

$$P = \exp\left(W^T X + \beta\right)(1-P)$$
$$P = \exp\left(W^T X + \beta\right) - P\exp\left(W^T X + \beta\right)$$
$$P + P\exp\left(W^T X + \beta\right) = \exp\left(W^T X + \beta\right)$$
$$P\left(1+\exp\left(W^T X + \beta\right)\right) = \exp\left(W^T X + \beta\right)$$
$$P = \frac{\exp\left(W^T X + \beta\right)}{1+\exp\left(W^T X + \beta\right)}$$
$$\boxed{P = \frac{1}{1+\exp\left(-(W^T X + \beta)\right)}} \tag{factor and divide}$$

This is our $\sigma(z) = \frac{1}{1+e^{-z}}$, more famously known as the sigmoid function! We can generalize this to multiple classes too.
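
Before moving on, a quick sanity check (a minimal sketch of my own): numerically, this $\sigma$ really does undo the log-odds, i.e. $\sigma\left(\ln\frac{P}{1-P}\right) = P$.

In [ ]:
import jax.numpy as jnp

P = jnp.array([0.05, 0.25, 0.5, 0.75, 0.95])

z = jnp.log(P / (1 - P))        # log-odds (logits)
P_back = 1 / (1 + jnp.exp(-z))  # sigmoid(z)

print(jnp.allclose(P, P_back))  # True (up to float precision)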

Generalization¶

If we want to generalize to multi-class problems, we can derive the famous softmax function. Consider this generalization of the log-odds...

Let us define a reference class: the class we arbitrarily choose as the "standard" to compare against. A concrete example: let us classify Red, Green, or Blue ($K=3$). The probabilities must sum to 1, so if I tell you $P(\text{Red}) = 0.2$ and $P(\text{Green}) = 0.5$, you automatically know that $P(\text{Blue}) = 0.3$. You don't need a separate equation to predict Blue! If we try to learn distinct weights ($W_{red}, W_{green}, W_{blue}$) for all three, the model is over-parameterized. To fix this, we pick one class (say, Blue) to be the "Reference Class" and force its score to be zero.

Alright, so instead of one equation we now have $K-1$ equations. We model the log-ratio of the probability of class $i$ vs. the reference class $K$:
$$\ln \left( \frac{P(y=i)}{P(y=K)} \right) = z_i$$
We repeat the earlier process and solve for the probability:
$$\frac{P(y=i)}{P(y=K)} = e^{z_i} \implies P(y=i) = P(y=K) \cdot e^{z_i}$$
To close the system, we use the fact that the probabilities of all classes must sum to 1:
$$\sum_{j=1}^{K} P(y=j) = 1$$
Now we drop in our expression from before:
$$\sum_{j=1}^{K} \left[ P(y=K) \cdot e^{z_j} \right] = 1$$
And then we realise $P(y=K)$ does not depend on $j$, so we can factor it out:
$$P(y=K)\sum_{j=1}^{K} e^{z_j} = 1$$
$$P(y=K) = \frac{1}{\sum_{j=1}^{K} e^{z_j}}$$
Substituting this back into $P(y=i) = P(y=K) \cdot e^{z_i}$ gives the probability of any particular class:
$$\boxed{P(y=i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}}\tag{softmax!}$$
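
Here is a small numerical check of my own (assuming jax.nn.softmax as a reference, analogous to the jax.nn.sigmoid used later): with the reference class' score pinned to zero, the boxed formula is exactly the standard softmax over the score vector.

In [ ]:
import jax
import jax.numpy as jnp

# Scores for Red and Green, with Blue as the reference class forced to z = 0
z = jnp.array([2.0, -1.0, 0.0])

manual = jnp.exp(z) / jnp.sum(jnp.exp(z))       # the boxed formula
print(manual)                                   # probabilities, sum to 1
print(jnp.allclose(manual, jax.nn.softmax(z)))  # True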

Physics Connection¶

For the physics-inclined student, the softmax function has exactly the same form as the Boltzmann distribution (or Gibbs distribution) from statistical mechanics. In physics, nature prefers low-energy states: the lower the energy $E_i$, the higher the probability. In machine learning, we prefer high scores (logits): the higher the logit $z_i$, the higher the probability. Thus, we can think of the logits as negative energies: $$z_i \propto -E_i$$ This is the foundation of Energy-Based Models (EBMs) (popularized by Yann LeCun!), where instead of training a classifier to output "probabilities," you train a model to output an "energy scalar" for a given input configuration, and you push the energy of the "correct" answer down and pull the energy of incorrect answers up.

Even more... the denominator $\sum_j e^{z_j}$ can be recognized as the Partition Function ($Z$). In both fields, calculating $Z$ is the hardest part. In physics, summing over all possible microstates of a gas is often intractable (it requires integrals over a high-dimensional phase space). In ML, if you are doing simple classification, summing over $K=10$ classes is easy, but in generative modeling (like LLMs or diffusion models), summing over "all possible text sequences" to normalize the probability is impossible. This is why we use techniques like Contrastive Divergence or approximations that avoid calculating $Z$ directly.
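
To make the analogy tangible, here is a tiny sketch (illustrative numbers only, my own addition): treating logits as negative energies, softmax of $-E_i/T$ reproduces the Boltzmann weights $e^{-E_i/T}/Z$.

In [ ]:
import jax
import jax.numpy as jnp

E = jnp.array([0.0, 1.0, 2.0])  # energies of three states (arbitrary units, k_B = 1)
T = 1.0                         # temperature

Z = jnp.sum(jnp.exp(-E / T))    # the partition function
p_boltzmann = jnp.exp(-E / T) / Z

print(p_boltzmann)                                        # lowest energy gets highest probability
print(jnp.allclose(p_boltzmann, jax.nn.softmax(-E / T)))  # True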

Optimization Problem¶

If we assume the labels follow a Bernoulli distribution, the loss function is the binary cross-entropy (or negative log-likelihood):
$$\boxed{\mathcal{L} = - \sum_{i} \left[ y_i \log(\hat y_i) + (1-y_i) \log(1-\hat y_i) \right]} \tag{binary cross-entropy loss}$$
Okay, let's bang this out. Unlike linear regression, we cannot solve this with a closed-form solution (like the Normal Equation) because of the non-linearity introduced by the sigmoid. However, we can optimize it using gradient descent.

$$\frac{\partial \mathcal{L}}{\partial \theta} = X^T (\hat y - y)$$
$$\theta := \theta - \alpha \frac{\partial \mathcal{L}}{\partial \theta}$$
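
In case that clean gradient looks like magic, here is the chain-rule bookkeeping for a single example (using the identity $\sigma'(z) = \sigma(z)(1-\sigma(z))$, with $z = W^T x + \beta$ and $\hat y = \sigma(z)$):
$$\frac{\partial \mathcal{L}}{\partial z} = \frac{\partial \mathcal{L}}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial z} = \left( -\frac{y}{\hat y} + \frac{1-y}{1-\hat y} \right) \hat y (1-\hat y) = \hat y - y$$
$$\frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \mathcal{L}}{\partial z} \cdot \frac{\partial z}{\partial \theta} = (\hat y - y)\, x$$
Stacking all $N$ examples (with the bias folded into $X$ via a column of ones) gives $X^T(\hat y - y)$.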

Okay, let's implement this. I can't use any autograd calls because I haven't shown how that's implemented, so I will manually compute the gradient ;)

Implementation¶

Okay, enough math, let's try this out. Below is an end-to-end logistic regression for breast cancer classification (malignant vs. benign).

In [2]:
import jax
import jax.numpy as jnp
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Load & Prep Data
data = load_breast_cancer()
X, y = data.data, data.target
# Note: fitting the scaler on the full X before splitting leaks test-set statistics;
# fine for this demo, but in practice fit on X_train only and transform X_test.
scaler = StandardScaler()
X = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train = jnp.array(X_train)
X_test = jnp.array(X_test)
y_train = jnp.array(y_train[:, None])
y_test = jnp.array(y_test[:, None])

# 2. Bias Trick
train_ones = jnp.ones((X_train.shape[0], 1))
test_ones = jnp.ones((X_test.shape[0], 1))
X_train_aug = jnp.hstack((train_ones, X_train))
X_test_aug = jnp.hstack((test_ones, X_test))

# 3. Initialization
key = jax.random.PRNGKey(0)
W = jax.random.normal(key, (X_train_aug.shape[1], 1)) * 0.01

def predict(W, X):
    return jax.nn.sigmoid(X @ W)

# We keep this just for tracking the loss history, not for training
def loss_fn(W, X, y):
    y_pred = predict(W, X)
    eps = 1e-15
    y_pred = jnp.clip(y_pred, eps, 1 - eps)
    return -jnp.mean(y * jnp.log(y_pred) + (1 - y) * jnp.log(1 - y_pred))

# MANUAL GRADIENT UPDATE
@jax.jit
def update_step(W, X, y, learning_rate):
    N = X.shape[0]
    
    # 1. Forward Pass
    y_pred = predict(W, X)
    
    # 2. Compute Gradient Manually: (1/N) * X.T @ (y_pred - y)
    error = y_pred - y
    grads = (X.T @ error) / N
    
    # 3. Update
    return W - learning_rate * grads

# Training Loop
learning_rate = 0.01
epochs = 2000
loss_history = []

for i in range(epochs):
    W = update_step(W, X_train_aug, y_train, learning_rate)
    
    if i % 100 == 0:
        # We only call loss_fn here for logging
        current_loss = loss_fn(W, X_train_aug, y_train)
        loss_history.append(current_loss)

print(f"\nFinal Weights shape: {W.shape}")

# 4. Results
y_pred_prob = predict(W, X_test_aug)
y_pred_class = (y_pred_prob >= 0.5).astype(jnp.float32)
accuracy = jnp.mean(y_pred_class == y_test)
print(f"Test Accuracy: {accuracy:.2%}")

# 5. Plotting
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(loss_history)
plt.xlabel("Hundreds of Iterations")
plt.ylabel("Binary Cross Entropy Loss")
plt.title("Training Loss (Manual Gradient)")
plt.grid(True)

plt.subplot(1, 2, 2)
plt.hist(y_pred_prob[y_test==0], alpha=0.5, label='Class 0', bins=20)
plt.hist(y_pred_prob[y_test==1], alpha=0.5, label='Class 1', bins=20)
plt.legend()
plt.show()
Final Weights shape: (31, 1)
Test Accuracy: 99.12%

Probabilistic / Generative Interpretation¶

You might be wondering: how the hell did I get the binary cross-entropy loss function? Let's derive it.

First, we assume our target variable $y$ (where $y \in \{0,1\}$) follows a Bernoulli distribution. Let our model's predicted probability be $\hat{y} = P(y=1|x)$. The probability mass function for a single observation is:
$$P(y|x) = \begin{cases} \hat{y} & \text{if } y=1 \\ 1-\hat{y} & \text{if } y=0 \end{cases}$$
We can compress this into a single mathematical expression:
$$P(y|x) = \hat{y}^y (1-\hat{y})^{1-y}$$

Now, assume we have a dataset of $N$ independent examples. The probability of observing this exact sequence of data (the likelihood $L$) is the product of the individual probabilities:
$$L(\theta) = \prod_{i=1}^{N} P(y^{(i)}|x^{(i)})$$
Substitute our compressed Bernoulli formula:
$$L(\theta) = \prod_{i=1}^{N} \left[ (\hat{y}^{(i)})^{y^{(i)}} (1-\hat{y}^{(i)})^{1-y^{(i)}} \right]$$

We want to maximize this likelihood. To make the math easier (turning products into sums), we take the natural logarithm ($\ln$ or $\log$):
$$\ell(\theta) = \log \left( \prod_{i=1}^{N} \left[ (\hat{y}^{(i)})^{y^{(i)}} (1-\hat{y}^{(i)})^{1-y^{(i)}} \right] \right)$$
Using the log rule $\log(ab) = \log(a) + \log(b)$:
$$\ell(\theta) = \sum_{i=1}^{N} \log \left[ (\hat{y}^{(i)})^{y^{(i)}} (1-\hat{y}^{(i)})^{1-y^{(i)}} \right]$$
Using the log rule $\log(a^b) = b \log(a)$:
$$\boxed{\ell(\theta) = \sum_{i=1}^{N} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1-y^{(i)}) \log(1-\hat{y}^{(i)}) \right]}$$
Maximizing this log-likelihood is the same as minimizing its negative, which is exactly the binary cross-entropy loss from before. Taddaaaaaaa! :)
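
As a final sanity check (a small sketch of my own), the compressed Bernoulli pmf and the expanded log terms in the boxed expression agree numerically, and negating their sum gives the loss we trained with:

In [ ]:
import jax.numpy as jnp

y     = jnp.array([1.0, 0.0, 1.0, 0.0])   # labels
y_hat = jnp.array([0.9, 0.2, 0.6, 0.4])   # predicted probabilities

log_pmf  = jnp.log(y_hat**y * (1 - y_hat)**(1 - y))           # log of the compressed pmf
expanded = y * jnp.log(y_hat) + (1 - y) * jnp.log(1 - y_hat)  # expanded log-likelihood terms

print(jnp.allclose(log_pmf, expanded))  # True
print(-jnp.sum(expanded))               # the binary cross-entropy loss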