Gradient Boosting Seen as Gradient Descent

tldr

The short version of this whole thing is that we can see the gradient boosting algorithm as just gradient descent on black-box models. Typically gradient descent involves parameters and a closed form solution to the gradient of your learner as a way to update those parameters. In the case of trees, for instance, there are no parameters you want to update1, and somewhat more importantly they aren’t even continuous predictors, much less differentiable!

But gradient boosting can be seen as gradient descent on black box models resulting in an additive black box model.

GBM

The naive algorithm

A description of the algorithm I’m talking about can be found on wikipedia, but I’ll go over the algorithm somewhat quickly here just to add another phrasing of the thing.

Let’s assume we have a feature-set $X$, a response variable $y$, a loss function $L(\hat y, y)$, and a learning rate $\lambda\in\mathbb{R}_+$ such that $\lambda \leq 1$. The algorithm is as follows:

First we define $f_0(X) = 0$.

Now if $i \geq 1$, we let:

We then train a new set of trees (or whatever learner we want) such that:

We then find the real number scaler that minimizes our loss in the following equation (AKA a line-search):

Finally we define:

So that

In the special case where our loss function is mean squared error (or $L2$), our gradients are just the residuals.

Gradient Descent

The typical scenario we generally talk about in gradient descent is fitting linear weights in some type of model (whether that be a linear model, logistic regression, neural network, whatever). So let’s stick to this paradigm and consider the simplest case (a linear model), and we’ll show how the algorithm above is actually the same algorithm, just slightly more general.

So we’re modeling $y = X\beta + \epsilon$; in the parlance of GBM: $f = \beta$ or $f(X) = X\beta$.

To solve this, we define $\beta_0 = 0$.

Then for $i \geq 1 \in \mathbb{N}$,

Where $L, \lambda, \gamma_i$ are exactly the same as in the gradient boosting algorithm2. So that was the standard gradient descent algorithm. Now let’s show how these two are really the same beast.

Similarities

We’ll show that the update formula is the same (mutatis mutandis) in both, and then as a consequence the final models will be the constructed in the same way (because they’re both aggregates of their updates)3.

To make the notation easier, we define:

Update Formula

For the update formula, we need to first note two things:

  1. Gradients are linear by definition, so $X\frac{\partial L}{\partial \beta_{i-1}}$ can be seen as the result of modeling ($\frac{\partial L}{\partial \beta_{i-1}(X)}$) by the application of a linear function (although it is normally unwise to do so).

Which is the same formula we have in gradient boosting.

Observations

Since we’re modeling the gradient with an arbitrary (potentially black-box) learner, we don’t have the option to find the gradient with respect to the parameters, so the scale might not decrease as desired. To exemplify this, let’s consider an $L_1$ objective (Mean-Absolute-Error), and a black-box learner. The gradient at each point is either 1, -1, or np.nan (because the absolute value function is $f(x) = \pm x$ depending on $x$). The magnitude of the gradients will never change. In a linear model we have that extra $\frac{\partial f}{\partial \beta}$ which adds scale to our gradient, but in trees we have no such thing. So in order to add scale, we tend to fit the line search and then add a learning rate to avoid over-fitting.

One can also sub-sample (as is a parameter in popular packages like LightGBM). Sub-sampling is the black-box model version of the familiar Stochastic Gradient Descent.

Summary

I realize this might be obvious to some, but it was pretty cool when I first realized this. I hope you found something useful and/or interesting here.

  1. Technically the split leaves in a tree define an indicator function on your data and the average value within a leaf (the prediction for that leaf) can be seen as the parameters of a tree, but this is kind of ridiculous because these are not tuned in the learning of the tree and there’s really no reason to do so (as far as I can tell). 

  2. If we call our final linear model: $X\hat\beta$, then $X\hat\beta = X\left(\sum\limits_{i=0}^N\lambda\gamma_i\alpha_i\right) = \sum\limits_{i=0}^N\lambda\gamma_iX\alpha_i $. So we can see that linear regression has always been constructed as a sum.