The Mathematics of Modern Machine Learning

At the core of every modern neural network lies a beautiful, rigorous mathematical foundation. In this article, we'll walk through some of the most critical equations that power modern optimization and loss calculations.

1. Gradient Descent Optimization

To train a machine learning model, we seek to minimize a loss function $L(\theta)$ , where $\theta$ represents the model's weights and biases. We update the parameter vector $\theta$ iteratively by taking steps in the direction of the steepest descent—that is, the negative gradient of the loss function:

$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$

Here, $\eta$ represents the learning rate (step size), and $MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ 0 is the gradient vector with respect to $\theta$ at step $t$ .

💡 Mathematical Tip

If the learning rate $\eta$ is too large, the updates might overshoot the minimum, causing divergence. If it is too small, convergence will be slow.

2. Mean Squared Error (MSE)

For regression problems, the standard loss metric is the Mean Squared Error, calculated across $n$ training examples:

MathFormulaPlaceholder1

Where $y_i$ is the actual target value, and $\hat{y}_i$ is the predicted output of our model. We can expand $\hat{y}_i$ as a linear combination of inputs $X_{ij}$ and weights $w_j$ :

$\hat{y}_i = \sum_{j=1}^{d} X_{ij} w_j + b$

3. Backpropagation and the Chain Rule

To compute the gradient $\nabla L(\theta)$ for nested functions (such as layers in a neural network), we apply the multivariate chain rule. For a network layer outputting $a = \sigma(z)$ where $z = wx + b$ and $\sigma$ is an activation function like Sigmoid:

$\sigma(z) = \frac{1}{1 + e^{-z}}$

Its derivative is beautifully clean:

$\sigma'(z) = \sigma(z)(1 - \sigma(z))$

Using this, the partial derivative of the loss with respect to a weight $w$ in the final layer is:

$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w} = \frac{\partial L}{\partial a} \cdot \sigma'(z) \cdot x$

These mathematical principles power today's largest models, enabling robust learning across complex datasets.