Briefing | April 10, 2026

The Mathematics of Modern Machine Learning

Understanding loss functions, gradient descent, and optimization equations


At the core of every modern neural network lies a beautiful, rigorous mathematical foundation. In this article, we'll walk through some of the most critical equations that power modern optimization and loss calculations.

1. Gradient Descent Optimization

To train a machine learning model, we seek to minimize a loss function $L(\theta)$, where $\theta$ represents the model's weights and biases. We update the parameter vector $\theta$ iteratively by stepping in the direction of steepest descent, that is, along the negative gradient of the loss function:

$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$$

Here, $\eta$ is the learning rate (step size), and $\nabla L(\theta_t)$ is the gradient of the loss with respect to $\theta$, evaluated at step $t$.

💡 Mathematical Tip

If the learning rate $\eta$ is too large, the updates can overshoot the minimum and diverge. If it is too small, convergence will be slow.
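To make the effect of the learning rate concrete, here is a minimal sketch of the update rule on a one-dimensional toy problem. The loss $L(\theta) = (\theta - 3)^2$, the function names `gradient_descent` and `toy_grad`, and the three learning rates are all assumptions chosen for illustration. For this particular loss, each step multiplies the distance to the minimum by $(1 - 2\eta)$, so the iterates diverge once $\eta > 1$.

```python
def gradient_descent(grad, theta0, eta, steps):
    """Apply theta <- theta - eta * grad(theta) for a fixed number of steps."""
    theta = theta0
    history = [theta]
    for _ in range(steps):
        theta = theta - eta * grad(theta)
        history.append(theta)
    return history


# Hypothetical toy loss: L(theta) = (theta - 3)^2, minimized at theta = 3.
def toy_grad(theta):
    return 2.0 * (theta - 3.0)


small = gradient_descent(toy_grad, theta0=0.0, eta=0.01, steps=50)
good = gradient_descent(toy_grad, theta0=0.0, eta=0.1, steps=50)
large = gradient_descent(toy_grad, theta0=0.0, eta=1.1, steps=50)

print(f"eta=0.01: theta after 50 steps = {small[-1]:.4f}")  # still well short of 3
print(f"eta=0.1:  theta after 50 steps = {good[-1]:.4f}")   # very close to 3
print(f"eta=1.1:  theta after 50 steps = {large[-1]:.4e}")  # far from 3 and growing
```

Real loss surfaces are not simple quadratics, so the threshold at which training diverges depends on the model and the data. The qualitative behavior is the same.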

2. Mean Squared Error (MSE)

For regression problems, the standard loss metric is the Mean Squared Error, calculated across $n$ training examples:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Here $y_i$ is the actual target value and $\hat{y}_i$ is the predicted output of our model. We can expand $\hat{y}_i$ as a linear combination of inputs $X_{ij}$ and weights $w_j$ over $d$ features, plus a bias $b$:

$$\hat{y}_i = \sum_{j=1}^{d} X_{ij} w_j + b$$
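The two formulas above translate almost line for line into code. The sketch below uses plain Python lists. The helper names `predict` and `mse` and the small dataset are made up for illustration.

```python
def predict(X, w, b):
    """Return y_hat_i = sum_j X[i][j] * w[j] + b for every row i."""
    return [sum(x_ij * w_j for x_ij, w_j in zip(row, w)) + b for row in X]


def mse(y, y_hat):
    """Mean of the squared residuals (y_i - y_hat_i)^2 over n examples."""
    n = len(y)
    return sum((y_i - p_i) ** 2 for y_i, p_i in zip(y, y_hat)) / n


# Made-up data: n = 3 examples, d = 2 features.
X = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]]
y = [5.0, 2.0, 4.0]
w = [1.0, 1.5]
b = 0.5

y_hat = predict(X, w, b)
print(y_hat)          # [4.5, 2.0, 3.5]
print(mse(y, y_hat))  # (0.25 + 0.0 + 0.25) / 3, roughly 0.1667
```

In practice this would usually be vectorized with an array library. The loops are kept explicit here so the indices $i$ and $j$ stay visible.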

3. Backpropagation and the Chain Rule

To compute the gradient $\nabla L(\theta)$ for nested functions (such as layers in a neural network), we apply the multivariate chain rule. Consider a network layer that outputs $a = \sigma(z)$, where $z = wx + b$ and $\sigma$ is an activation function such as the sigmoid:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Its derivative is beautifully clean:

$$\sigma'(z) = \sigma(z)(1 - \sigma(z))$$
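One way to sanity-check the derivative identity is to compare it against a numerical estimate. The sketch below does this with a central finite difference. The helper name `central_difference`, the step size `h`, and the test points are arbitrary choices for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Analytic derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def central_difference(f, z, h=1e-5):
    """Numerical estimate of f'(z); h is an arbitrary small step."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


for z in (-2.0, 0.0, 1.5):
    analytic = sigmoid_prime(z)
    numeric = central_difference(sigmoid, z)
    print(f"z={z:+.1f}  analytic={analytic:.6f}  numeric={numeric:.6f}")
```

The two columns should agree to several decimal places. At $z = 0$ the derivative reaches its maximum value of $0.25$.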

Using this, the partial derivative of the loss with respect to a weight $w$ in the final layer is:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w} = \frac{\partial L}{\partial a} \cdot \sigma'(z) \cdot x$$
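The chain-rule product can be sketched for a single sigmoid neuron. The article does not fix a particular loss, so this example assumes a squared-error loss $L = (a - y)^2$ on one training example, which gives $\partial L / \partial a = 2(a - y)$. The function names and numeric values are likewise hypothetical. The result is compared against a finite-difference estimate as a check.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, x, y):
    """Assumed loss for illustration: squared error of one sigmoid neuron."""
    a = sigmoid(w * x + b)
    return (a - y) ** 2


def grad_w(w, b, x, y):
    """dL/dw = dL/da * da/dz * dz/dw, one factor per link in the chain."""
    z = w * x + b
    a = sigmoid(z)
    dL_da = 2.0 * (a - y)  # from the assumed squared-error loss
    da_dz = a * (1.0 - a)  # sigma'(z) = sigma(z) * (1 - sigma(z))
    dz_dw = x              # because z = w * x + b
    return dL_da * da_dz * dz_dw


# Arbitrary example values.
w, b, x, y = 0.5, -0.1, 2.0, 1.0

analytic = grad_w(w, b, x, y)
h = 1e-6
numeric = (loss(w + h, b, x, y) - loss(w - h, b, x, y)) / (2.0 * h)
print(f"analytic={analytic:.6f}  numeric={numeric:.6f}")
```

With a different loss, only the `dL_da` line changes. The remaining two factors depend only on the activation function and the layer's linear form.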

The same three ingredients, a loss function, its gradient computed via the chain rule, and an iterative update rule, underlie the training of today's largest models.

