The Mathematics of Modern Machine Learning
Understanding loss functions, gradient descent, and optimization equations
At the core of every modern neural network lies a beautiful, rigorous mathematical foundation. In this article, we'll walk through some of the most critical equations that power modern optimization and loss calculations.
1. Gradient Descent Optimization
To train a machine learning model, we seek to minimize a loss function $L(\theta)$, where $\theta$ represents the model's weights and biases. We update the parameter vector $\theta$ iteratively by taking steps in the direction of steepest descent, that is, along the negative gradient of the loss function:
$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)$$
Here, $\eta$ represents the learning rate (step size), and $\nabla_\theta L(\theta_t)$ is the gradient vector with respect to $\theta$ at step $t$.
💡 Mathematical Tip
If the learning rate is too large, the updates might overshoot the minimum, causing divergence. If it is too small, convergence will be slow.
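To make the update rule and the effect of the step size concrete, here is a minimal sketch in plain Python. The toy loss $(\theta - 3)^2$, the function names `gradient_descent` and `grad_toy`, and the three learning rates are illustrative assumptions, not part of the article's original material.

```python
def gradient_descent(grad, theta0, lr, steps):
    """Repeatedly apply the update theta <- theta - lr * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta


def grad_toy(theta):
    """Gradient of the assumed toy loss L(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


# Same starting point and step count, three different learning rates.
small = gradient_descent(grad_toy, theta0=0.0, lr=0.001, steps=50)  # barely moves
good = gradient_descent(grad_toy, theta0=0.0, lr=0.1, steps=50)     # approaches 3
large = gradient_descent(grad_toy, theta0=0.0, lr=1.1, steps=50)    # overshoots and diverges

print(small, good, large)
```

For this particular loss, each step multiplies the distance to the minimum by $(1 - 2\eta)$, so any learning rate above 1 makes the error grow in magnitude rather than shrink. Real loss surfaces are not this tidy, but the same qualitative behavior tends to show up.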
2. Mean Squared Error (MSE)
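For regression problems, the standard loss metric is the Mean Squared Error, calculated across $n$ training examples:
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$
Where $y_i$ is the actual target value, and $\hat{y}_i$ is the predicted output of our model. We can expand $\hat{y}_i$ as a linear combination of inputs $x_{ij}$ and weights $w_j$, plus a bias $b$:
$$\hat{y}_i = \sum_{j=1}^{m} w_j x_{ij} + b$$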
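As a small worked example, the sketch below computes linear predictions and their Mean Squared Error in plain Python. The helper names `predict` and `mse`, the bias term `b`, and the toy data are hypothetical choices made only for illustration.

```python
def predict(x, w, b):
    """Linear model: weighted sum of the inputs plus a bias."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b


def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n


# Hypothetical toy data: three examples with two features each.
X = [[1.0, 2.0], [2.0, 0.0], [0.0, 1.0]]
y = [5.0, 4.0, 3.0]
w = [2.0, 1.0]
b = 1.0

y_hat = [predict(x, w, b) for x in X]  # predictions: 5.0, 5.0, 2.0
loss = mse(y, y_hat)                   # (0 + 1 + 1) / 3

print(y_hat, loss)
```

The first example is predicted exactly, while the other two are each off by 1, so the squared errors are 0, 1, and 1 and the loss is 2/3.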
3. Backpropagation and the Chain Rule
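To compute the gradient for nested functions (such as layers in a neural network), we apply the multivariate chain rule. For a network layer outputting $a = \sigma(z)$, where $z = \sum_j w_j x_j + b$ and $\sigma$ is an activation function like the sigmoid:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Its derivative is beautifully clean:
$$\sigma'(z) = \sigma(z) \left( 1 - \sigma(z) \right)$$
Using this, the partial derivative of the loss $L$ with respect to a weight $w_j$ in the final layer is:
$$\frac{\partial L}{\partial w_j} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w_j} = \frac{\partial L}{\partial a} \, \sigma(z) \left( 1 - \sigma(z) \right) x_j$$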
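One way to sanity-check a chain-rule gradient is to compare it against a finite-difference estimate. The sketch below does this for a single sigmoid unit. The squared-error loss, the function names, and the sample values are assumptions chosen for illustration; the article does not fix a particular loss for this step.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Derivative of the sigmoid, written in terms of the sigmoid itself."""
    s = sigmoid(z)
    return s * (1.0 - s)


def loss(w, x, b, y):
    """Squared error of one sigmoid unit (an assumed choice of loss)."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return (sigmoid(z) - y) ** 2


def grad_w(w, x, b, y):
    """Chain rule: dL/dw_j = dL/da * da/dz * dz/dw_j."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    a = sigmoid(z)
    dL_da = 2.0 * (a - y)      # derivative of (a - y)^2 with respect to a
    da_dz = sigmoid_prime(z)   # derivative of the activation
    return [dL_da * da_dz * x_j for x_j in x]  # dz/dw_j is x_j


# Hypothetical sample values.
w, x, b, y = [0.5, -0.3], [1.0, 2.0], 0.1, 1.0
analytic = grad_w(w, x, b, y)

# Central finite differences as an independent numerical check.
eps = 1e-6
numeric = []
for j in range(len(w)):
    w_plus = list(w)
    w_plus[j] += eps
    w_minus = list(w)
    w_minus[j] -= eps
    numeric.append((loss(w_plus, x, b, y) - loss(w_minus, x, b, y)) / (2 * eps))

print(analytic, numeric)
```

If the two lists agree to several decimal places, the analytic derivative is very likely correct. The same check is commonly used when implementing backpropagation by hand.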
These mathematical principles power today's largest models, enabling robust learning across complex datasets.