tldr

The short version of this whole thing is that we can see the gradient boosting algorithm as just gradient descent on black-box models.

Typically, gradient descent involves parameters and a closed-form expression for the gradient of your loss with respect to those parameters, which you use to update them. In the case of trees, for instance, there are no parameters you want to update[1], and, somewhat more importantly, trees aren't even continuous predictors, much less differentiable ones!

But gradient boosting can be seen as gradient descent on black-box models, resulting in an additive black-box model.

GBM

The naive algorithm

A description of the algorithm I’m talking about can be found on Wikipedia, but I’ll go over the algorithm somewhat quickly here just so we’re on the same page. The basic idea is that Gradient Boosting can be applied to any additive model (black-box or otherwise), not just trees, although you will generally see it applied to trees. We won’t be assuming any particular form of model in this post, but we may occasionally add information specific to trees because of their ubiquity.

Note: an iteration in an additive tree-based model is a new tree. So when we say things like “train a new iteration,” we mean “train a new tree,” if trees are your model.

The wordy way

At each step, you train a new iteration of the model where the target is the negative gradient of your loss with respect to the predictions of the previous iteration. You then add that iteration to the ensemble, weighted by a line search on the loss and the pre-defined learning rate.

The rigorous way

Let’s assume we have a feature-set $X$, a response variable $y$, a loss function $L(y, \hat y)$, and a learning rate $\lambda$ such that $0 \lt \lambda \leq 1$. The algorithm is as follows:

First we define $\hat y_0 = f_0(X) = 0$.

Now for $i \geq 1$ (up until our stopping criterion[2]), we let:

\[\begin{align*} y_i =& -\frac{\partial{L(y, f_{i-1}(X))}}{\partial f_{i-1}(X)} \\ =& -\frac{\partial{L(y, \hat y_{i-1})}}{\partial \hat y_{i-1}} \end{align*}\]

The key here is that the form of $f$ is irrelevant. The target for the next iteration, $y_i$, is the negative gradient of the loss with respect to the most recent predictions, $\hat y_{i-1} = f_{i-1}(X)$.
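In code, computing that target is a one-liner. A minimal sketch (the function names here are my own, not from any library):

```python
def next_target(y, y_pred, loss_grad):
    """Pseudo-residuals: the training target for the next iteration."""
    return -loss_grad(y, y_pred)

# For the (halved) squared-error loss L = 0.5 * (y - y_pred)**2:
def squared_error_grad(y, y_pred):
    return -(y - y_pred)  # dL/dy_pred; negating it gives back y - y_pred
```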

We then train a new iteration of the model such that:

\[\begin{align*} g_i(X) \sim& y_i \end{align*}\]
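Concretely, with trees as the learner, this step might look like the following sketch (assuming `X` and the target `y_i` from above are in scope; the depth is an arbitrary choice of mine):

```python
from sklearn.tree import DecisionTreeRegressor

# "g_i(X) ~ y_i" just means fitting any regressor to the pseudo-residuals.
# A shallow tree is the usual choice, but nothing here requires a tree.
g_i = DecisionTreeRegressor(max_depth=3)
g_i.fit(X, y_i)
```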

We then find the real-valued scalar $\gamma_i$ that minimizes our loss (AKA a line search):

\[\begin{align*} \gamma_i =& \text{argmin}_\gamma L(y, f_{i-1}(X) + \gamma g_i(X)) \end{align*}\]
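One way to sketch that line search in Python is with `scipy.optimize.minimize_scalar` (here `loss`, `f_prev`, and `g_i` are assumed to be defined as above; `f_prev(X)` stands in for $f_{i-1}(X)$):

```python
from scipy.optimize import minimize_scalar

prev_pred = f_prev(X)        # f_{i-1}(X)
step_pred = g_i.predict(X)   # g_i(X)

# Find the scalar gamma that minimizes the loss of the combined model.
gamma_i = minimize_scalar(lambda g: loss(y, prev_pred + g * step_pred)).x
```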

Finally we define:

\[\begin{align*} f_i(X) =& f_{i-1}(X) + \lambda\gamma_i g_i(X) \end{align*}\]

So that

\[\begin{align*} f(X) =& \sum\limits_{i=1}^N \lambda\gamma_i g_i(X) \end{align*}\]
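So prediction is just a weighted sum over the stored iterations; a sketch (again, the names are mine):

```python
def predict(X, learners, gammas, lam):
    # f(X) = sum_i lam * gamma_i * g_i(X); f_0 contributes nothing.
    return sum(lam * gamma * g.predict(X) for g, gamma in zip(learners, gammas))
```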

In the special case where our loss function is mean squared error (or $L_2$ loss), the negative gradients are just the residuals (up to a constant factor).
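To see why, take the conventionally halved squared-error loss:

\[\begin{align*} L(y, \hat y) =& \tfrac{1}{2}(y - \hat y)^2 \\ -\frac{\partial L}{\partial \hat y} =& \; y - \hat y \end{align*}\]

which is exactly the residual.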

The outline

Given a model $f_i$, we want to construct $f_{i+1}$, and we’ll do that like so:

  1. Compute the predictions $\hat y_i = f_i(X)$
  2. Compute the negative gradient, $g_{i+1}$, of the loss with respect to those predictions
  3. Model that gradient with a new iteration of the chosen learner ($f_{i+1}$)
  4. Line search to find the optimal weight $\alpha_{i+1}$ in the combination $f_i + \alpha_{i+1}f_{i+1}$
  5. Scale that optimal weight by $\lambda$ (the learning rate parameter) and add the weighted iteration to the previous model
  6. Repeat as necessary
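Putting the whole outline together, here's a from-scratch sketch (my own illustration, not a reference implementation; the squared-error loss, depth-3 trees, and default hyperparameters are all arbitrary assumptions):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_iter=100, lam=0.1, max_depth=3):
    y = np.asarray(y, dtype=float)
    loss = lambda y_true, y_pred: np.mean((y_true - y_pred) ** 2)

    pred = np.zeros(len(y))  # f_0(X) = 0
    learners, weights = [], []

    for _ in range(n_iter):
        target = y - pred  # negative gradient of the (halved) squared error
        g = DecisionTreeRegressor(max_depth=max_depth).fit(X, target)
        step = g.predict(X)
        gamma = minimize_scalar(lambda a: loss(y, pred + a * step)).x
        pred = pred + lam * gamma * step
        learners.append(g)
        weights.append(lam * gamma)

    def predict(X_new):
        return sum(w * g.predict(X_new) for g, w in zip(learners, weights))

    return predict
```

Usage would be something like `model = fit_gbm(X_train, y_train)` followed by `model(X_test)`; swapping `DecisionTreeRegressor` for any other regressor leaves the algorithm unchanged.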

Gradient Descent

Given a differentiable model $f$ parameterized by a parameter vector $\beta$, we want to find the $\hat\beta$ that minimizes the loss. We’re going to do that by following the gradient. More rigorously, given $X,y,\beta_i,\lambda$, we want to estimate $\beta_{i+1}$. The way we do that is:

  1. Compute the predictions $\hat y_i = f_{\beta_i}(X)$
  2. Compute the negative gradient, $g_{i+1}$, of the loss with respect to the parameters
  3. Line search to find the optimal weight $\alpha_{i+1}$ in the combination $\beta_i + \alpha_{i+1}g_{i+1}$
  4. Scale that optimal weight by $\lambda$ (the learning rate parameter) and add the weighted step to the previous $\beta$
  5. Repeat as necessary
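For comparison, here's the same loop for a linear model $f_\beta(X) = X\beta$ (again a sketch under my own assumptions; in practice the line search is often replaced with a fixed step size):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_gd(X, y, n_iter=100, lam=0.1):
    loss = lambda b: np.mean((y - X @ b) ** 2)
    beta = np.zeros(X.shape[1])

    for _ in range(n_iter):
        # dL/dbeta = -2/n * X^T (y - X beta), so the negative gradient is:
        direction = 2 * X.T @ (y - X @ beta) / len(y)
        alpha = minimize_scalar(lambda a: loss(beta + a * direction)).x
        beta = beta + lam * alpha * direction

    return beta
```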

This outline is exactly the same as the Gradient Boosting outline! The only difference is that, instead of adding the models, we add the $\beta$s. This is, in fact, a superficial difference. A linear approximation of a gradient is the gradient itself, and the sum of linear models is the linear model of the sum[3]. In that light, traditional gradient descent is just a special case of gradient boosting!

Observations

Since we’re modeling the gradient with an arbitrary (potentially black-box) learner, we don’t have the option of taking the gradient with respect to the parameters, so the magnitude of the gradients might not decrease as we approach the optimum. To exemplify this, let’s consider an $L_1$ objective (Mean Absolute Error) and a black-box learner. The gradient at each point is either 1, -1, or np.nan (because the absolute value function is $f(x) = \pm x$ depending on the sign of $x$, with a kink at 0). The magnitude of the gradients will never change[4]. In a linear model we have that extra $\frac{\partial f}{\partial \beta}$[5], which adds scale to our gradient, but trees have no such thing. It’s for this reason that we generally don’t use a magnitude-based stopping criterion, but rather opt for an explicit max-iterations.
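A quick numerical illustration of that point (the arrays here are made up):

```python
import numpy as np

y = np.array([3.0, -1.0, 0.5])
for y_pred in (np.zeros(3), np.array([2.9, -1.1, 0.4])):
    grad = np.sign(y_pred - y)  # dL/dy_pred for L = |y - y_pred|
    print(grad)                 # entries are +/-1 whether far from or near y
```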

One can also sub-sample the data at each iteration (this is exposed as a parameter in popular packages like LightGBM). Sub-sampling is the black-box-model version of the familiar Stochastic Gradient Descent.
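In the sketch above, that would amount to fitting each iteration on a random subset of rows, something like this (the `subsample` fraction is a made-up name here, though packages expose similar knobs):

```python
import numpy as np

rng = np.random.default_rng(0)
subsample = 0.5
idx = rng.choice(len(y), size=int(subsample * len(y)), replace=False)
# ...then fit this iteration's learner on X[idx] and target[idx] only.
```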

Summary

Gradient Descent is essentially an optimization of Gradient Boosting, under the assumptions that your function is parameterized by a continuous parameter vector and that it is differentiable with respect to that parameter vector. Under those assumptions, we can greatly simplify the computation, and we get Gradient Descent.

  1. Technically the split leaves in a tree define an indicator function on your data, and the average value within a leaf (the prediction for that leaf) can be seen as the parameters of a tree; but this is kind of ridiculous, because these are not tuned in the learning of the tree and there’s really no reason to do so (as far as I can tell). 

  2. One generally explicitly sets a maximum number of estimators as the stopping criterion, but it doesn’t have to be so. One could imagine a criterion based on the magnitude of the gradients or something like that. 

  3. If we call our final linear model $X\hat\beta$ and take $\beta_0 = 0$, then $X\hat\beta = X\left(\sum\limits_{i=1}^N\lambda\alpha_i g_i\right) = \sum\limits_{i=1}^N\lambda\alpha_i X g_i$, where $g_i$ is the gradient step at iteration $i$. So we can see that linear regression has always been constructed as a sum. 

  4. Neglecting the case where you get a perfect fit for a meaningful amount of data, the magnitude of each entry of the gradient will always be $|-1| = |1| = 1$. 

  5. This comes from the chain rule when computing the derivative of the loss with respect to the parameters.