Gradient Boosting Seen as Gradient Descent
Mathematics ·tldr
The short version of this whole thing is that we can see the gradient boosting algorithm as just gradient descent on blackbox models. Typically gradient descent involves parameters and a closed form solution to the gradient of your learner as a way to update those parameters. In the case of trees, for instance, there are no parameters you want to update^{1}, and somewhat more importantly they aren’t even continuous predictors, much less differentiable!
But gradient boosting can be seen as gradient descent on black box models resulting in an additive black box model.
GBM
The naive algorithm
A description of the algorithm I’m talking about can be found on wikipedia, but I’ll go over the algorithm somewhat quickly here just to add another phrasing of the thing.
Let’s assume we have a featureset $X$, a response variable $y$, a loss function $L(\hat y, y)$, and a learning rate $\lambda\in\mathbb{R}_+$ such that $\lambda \leq 1$. The algorithm is as follows:
First we define $f_0(X) = 0$.
Now if $i \geq 1$, we let:
We then train a new set of trees (or whatever learner we want) such that:
We then find the real number scaler that minimizes our loss in the following equation (AKA a linesearch):
Finally we define:
So that
In the special case where our loss function is mean squared error (or $L2$), our gradients are just the residuals.
Gradient Descent
The typical scenario we generally talk about in gradient descent is fitting linear weights in some type of model (whether that be a linear model, logistic regression, neural network, whatever). So let’s stick to this paradigm and consider the simplest case (a linear model), and we’ll show how the algorithm above is actually the same algorithm, just slightly more general.
So we’re modeling $y = X\beta + \epsilon$; in the parlance of GBM: $f = \beta$ or $f(X) = X\beta$.
To solve this, we define $\beta_0 = 0$.
Then for $i \geq 1 \in \mathbb{N}$,
Where $L, \lambda, \gamma_i$ are exactly the same as in the gradient boosting algorithm^{2}. So that was the standard gradient descent algorithm. Now let’s show how these two are really the same beast.
Similarities
We’ll show that the update formula is the same (mutatis mutandis) in both, and then as a consequence the final models will be the constructed in the same way (because they’re both aggregates of their updates)^{3}.
To make the notation easier, we define:
Update Formula
For the update formula, we need to first note two things:
 Gradients are linear by definition, so $X\frac{\partial L}{\partial \beta_{i1}}$ can be seen as the result of modeling ($\frac{\partial L}{\partial \beta_{i1}(X)}$) by the application of a linear function (although it is normally unwise to do so).
Which is the same formula we have in gradient boosting.
Observations
Since we’re modeling the gradient with an arbitrary (potentially blackbox) learner, we don’t have the option to find the gradient with respect to the parameters, so the scale might not decrease as desired. To exemplify this, let’s consider an $L_1$ objective (MeanAbsoluteError), and a blackbox learner. The gradient at each point is either 1, 1, or np.nan
(because the absolute value function is $f(x) = \pm x$ depending on $x$). The magnitude of the gradients will never change. In a linear model we have that extra $\frac{\partial f}{\partial \beta}$ which adds scale to our gradient, but in trees we have no such thing. So in order to add scale, we tend to fit the line search and then add a learning rate to avoid overfitting.
One can also subsample (as is a parameter in popular packages like LightGBM
). Subsampling is the blackbox model version of the familiar Stochastic Gradient Descent.
Summary
I realize this might be obvious to some, but it was pretty cool when I first realized this. I hope you found something useful and/or interesting here.

Technically the split leaves in a tree define an indicator function on your data and the average value within a leaf (the prediction for that leaf) can be seen as the parameters of a tree, but this is kind of ridiculous because these are not tuned in the learning of the tree and there’s really no reason to do so (as far as I can tell). ↩

In practice we generally don’t include the line search and just have a decreasing $\lambda$ – and sometimes we don’t even do that. We can get away with these shortcuts because the magnitude of the gradients will decrease as you get closer to the optima, and the derivative with respect to $\beta$ is always continuous. The same cannot be said about the gradient boosting algorithm. ↩

If we call our final linear model: $X\hat\beta$, then $X\hat\beta = X\left(\sum\limits_{i=0}^N\lambda\gamma_i\alpha_i\right) = \sum\limits_{i=0}^N\lambda\gamma_iX\alpha_i $. So we can see that linear regression has always been constructed as a sum. ↩