# Stacking

*To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions.*

*If you have questions or suggestions about this tutorial, please open an issue here.*

In stacking one blends the predictions of different regressors or classifiers to gain, in some cases, better performance than naive averaging or majority vote.

For routine stacking tasks the MLJ user should use the `Stack` model documented here. In this tutorial we build a two-model stack as an MLJ learning network, which we export as a new stand-alone composite model type `MyTwoModelStack`. The objectives of this tutorial are to: (i) explain with Julia code how stacking works; and (ii) give an advanced demonstration of MLJ's composite model interface.

As we shall see, because `MyTwoModelStack` is a new stand-alone model type, we can apply the usual meta-algorithms, such as performance evaluation and tuning, to it.

## Basic stacking using out-of-sample base learner predictions

A rather general stacking protocol was first described in a 1992 paper by David Wolpert. For a generic introduction to the basic two-layer stack described here, see this blog post of Burak Himmetoglu.

A basic stack consists of a number of base learners (two, in this illustration) and a single adjudicating model.

When a stacked model is called to make a prediction, the individual predictions of the base learners are made the columns of an *input* table for the adjudicating model, which then outputs the final prediction. However, it is crucial to understand that the flow of data *during training* is not the same.

The base model predictions used to train the adjudicating model are *not* the predictions of the base learners fitted to all the training data. Rather, to prevent the adjudicator giving too much weight to the base learners with low *training* error, the input data is first split into a number of folds (as in cross-validation), a base learner is trained on each fold complement individually, and corresponding predictions on the folds are spliced together to form a full-length prediction called the *out-of-sample prediction*.
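To see why out-of-sample predictions matter, consider a learner that memorises its training data. The plain-Julia sketch below (no MLJ involved) uses a hypothetical 1-nearest-neighbour learner `fit_1nn`, made-up targets, and three hard-coded folds; all of these names and values are assumptions for illustration only. In-sample, the learner looks perfect; the spliced out-of-sample prediction gives a more honest picture of what the adjudicator can expect from it.

```julia
using Statistics

# Made-up one-dimensional data (assumption, not from the tutorial)
x = collect(1.0:9.0)
y = 2 .* x .+ [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.0]

# Hypothetical learner that memorises its training data (1-nearest neighbour).
# "Fitting" returns a closure that predicts for a single input x0.
fit_1nn(xtrain, ytrain) = x0 -> ytrain[argmin(abs.(xtrain .- x0))]

rms_error(yhat, y) = sqrt(mean((yhat .- y) .^ 2))

# In-sample prediction: train on everything, predict on the same rows
predict_all = fit_1nn(x, y)
in_sample = predict_all.(x)

# Out-of-sample prediction: for each fold, train on the fold complement,
# predict on the fold, then splice the pieces together in row order
fold_rows = [1:3, 4:6, 7:9]
parts = map(fold_rows) do f
    train = setdiff(1:length(x), f)
    predictor = fit_1nn(x[train], y[train])
    predictor.(x[f])
end
oos = vcat(parts...)

println("in-sample rms     = ", rms_error(in_sample, y))
println("out-of-sample rms = ", rms_error(oos, y))
```

An adjudicator trained on `in_sample` would conclude that this learner is never wrong; trained on `oos`, it sees errors closer to those the learner makes on unseen data.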

For illustrative purposes we use just three folds. Each base learner will get three separate machines, for training on each fold complement, and a fourth machine, trained on all the supplied data, for use in the prediction flow.

We build the learning network with dummy data at the source nodes, so that the reader can inspect the workings of the network as it is built (by calling `fit!` on nodes, and by calling the nodes themselves). As usual, this data is not seen by the exported composite model type, and the component models we choose are just default values for the hyperparameters of the composite model.

```
using MLJ
using PyPlot
using StableRNGs
```

Some models we will use:

```
linear = (@load LinearRegressor pkg=MLJLinearModels)()
knn = (@load KNNRegressor)()
tree_booster = (@load EvoTreeRegressor)()
forest = (@load RandomForestRegressor pkg=DecisionTree)()
svm = (@load SVMRegressor)()
```

```
import MLJLinearModels ✔
import NearestNeighborModels ✔
import EvoTrees ✔
import MLJDecisionTreeInterface ✔
import MLJScikitLearnInterface ✔
SVMRegressor(
kernel = "rbf",
degree = 3,
gamma = "auto",
coef0 = 0.0,
tol = 0.001,
C = 1.0,
epsilon = 0.1,
shrinking = true,
cache_size = 200,
max_iter = -1)
```

### Warm-up exercise: Define a model type to average predictions

Let's define a composite model type `MyAverageTwo` that averages the predictions of two deterministic regressors. Here's the learning network:

```
X = source()
y = source()
model1 = linear
model2 = knn
m1 = machine(model1, X, y)
y1 = predict(m1, X)
m2 = machine(model2, X, y)
y2 = predict(m2, X)
yhat = 0.5*y1 + 0.5*y2
```

```
Node{Nothing}
args:
1: Node{Nothing}
2: Node{Nothing}
formula:
+(
#100(
predict(
Machine{LinearRegressor,…},
Source @196)),
#100(
predict(
Machine{KNNRegressor,…},
Source @196)))
```

In preparation for export, we wrap the learning network in a learning network machine, which specifies what the source nodes are, and which node is for prediction. As our exported model will make point-predictions (as opposed to probabilistic ones), we use a `Deterministic` "surrogate" model:

`mach = machine(Deterministic(), X, y; predict=yhat)`

```
Machine{DeterministicSurrogate,…} trained 0 times; does not cache data
model: MLJBase.DeterministicSurrogate
args:
1: Source @196 ⏎ `Nothing`
2: Source @595 ⏎ `Nothing`
```

Note that we cannot actually fit this machine because we chose not to wrap our source nodes `X` and `y` in data.

Here's the macro call that "exports" the learning network as a new composite model `MyAverageTwo`:

```
@from_network mach begin
    mutable struct MyAverageTwo
        regressor1=model1
        regressor2=model2
    end
end
```

Note that, unlike a normal struct definition, the defaults `model1` and `model2` must be specified, and they must refer to model instances in the learning network.

We can now create an instance of the new type:

`average_two = MyAverageTwo()`

```
MyAverageTwo(
regressor1 = LinearRegressor(
fit_intercept = true,
solver = nothing),
regressor2 = KNNRegressor(
K = 5,
algorithm = :kdtree,
metric = Distances.Euclidean(0.0),
leafsize = 10,
reorder = true,
weights = NearestNeighborModels.Uniform()))
```

We now evaluate this averaging model on the Boston data set and compare its performance with that of the two base models:

```
function print_performance(model, data...)
    e = evaluate(model, data...;
                 resampling=CV(rng=StableRNG(1234), nfolds=8),
                 measure=rms,
                 verbosity=0)
    μ = round(e.measurement[1], sigdigits=5)
    ste = round(std(e.per_fold[1])/sqrt(8), digits=5)
    println("$model = $μ ± $(2*ste)")
end;
X, y = @load_boston
print_performance(linear, X, y)
print_performance(knn, X, y)
print_performance(average_two, X, y)

```
LinearRegressor = 4.8635 ± 0.34864
KNNRegressor = 6.2243 ± 0.44292
MyAverageTwo = 4.8523 ± 0.36264
```
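The `±` figure printed by `print_performance` is twice the standard error of the per-fold measurements. The plain-Julia sketch below reproduces that arithmetic on made-up per-fold values (the numbers are assumptions, not results from the Boston data). Because cross-validation folds share training data, the per-fold measurements are not independent, so the interval is best read as a rough indication of spread rather than a formal confidence interval.

```julia
using Statistics

# Made-up per-fold RMS errors for an 8-fold evaluation (illustrative only)
per_fold = [4.1, 5.3, 4.8, 5.0, 4.4, 5.6, 4.7, 4.9]
nfolds = length(per_fold)

# Standard error of the mean of the per-fold values
ste = std(per_fold) / sqrt(nfolds)

# Half-width reported after the "±" sign
halfwidth = 2 * ste

println("mean of folds = ", round(mean(per_fold), sigdigits=5),
        " ± ", round(halfwidth, digits=5))
```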

## Stacking proper

### Helper functions:

To generate the folds used for the out-of-sample predictions, we define

```
folds(data, nfolds) =
    partition(1:nrows(data), (1/nfolds for i in 1:(nfolds-1))...);
```

For example, we have:

`f = folds(1:10, 3)`

`([1, 2, 3], [4, 5, 6], [7, 8, 9, 10])`
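For readers who want to see the splitting logic without MLJ, here is a plain-Julia stand-in. The helper names `fraction_partition` and `simple_folds` are hypothetical, and the rounding rule used (each fraction takes `round(Int, p * n)` rows, with the remainder going to the last piece) is an assumption that happens to reproduce the output above; it is not claimed to match MLJ's `partition` in every edge case.

```julia
# Hypothetical stand-in for a fraction-based partition of row indices
function fraction_partition(rows, fractions...)
    n = length(rows)
    pieces = Vector{Vector{eltype(rows)}}()
    first_row = 1
    for p in fractions
        len = round(Int, p * n)              # assumed rounding rule
        push!(pieces, collect(rows[first_row:first_row + len - 1]))
        first_row += len
    end
    push!(pieces, collect(rows[first_row:end]))   # remainder goes to the last piece
    return Tuple(pieces)
end

# nfolds - 1 equal fractions; the final fold absorbs any leftover rows
simple_folds(n, nfolds) =
    fraction_partition(1:n, (1 / nfolds for _ in 1:(nfolds - 1))...)

f = simple_folds(10, 3)
println(f)
```

Note that the last fold is larger whenever the number of rows is not a multiple of the number of folds.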

It will also be convenient to use the MLJ method `restrict(X, f, i)` that restricts data `X` to the `i`th element (fold) of `f`, and `corestrict(X, f, i)` that restricts to the corresponding fold complement (the concatenation of all but the `i`th fold).

For example, we have:

`corestrict(string.(1:10), f, 2)`

```
7-element Vector{String}:
"1"
"2"
"3"
"7"
"8"
"9"
"10"
```
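The behaviour of these two methods on plain vectors can be sketched in a couple of lines. The names `my_restrict` and `my_corestrict` below are hypothetical stand-ins that handle vectors only; MLJ's own methods also accept tables, which this sketch does not attempt.

```julia
# Hypothetical vector-only stand-ins for `restrict` and `corestrict`
my_restrict(v, f, i) = v[f[i]]                                   # rows in fold i
my_corestrict(v, f, i) = v[vcat(f[1:i-1]..., f[i+1:end]...)]     # rows in all other folds

f = ([1, 2, 3], [4, 5, 6], [7, 8, 9, 10])
v = string.(1:10)

println(my_restrict(v, f, 2))
println(my_corestrict(v, f, 2))
```

Together the fold and its complement always account for every row exactly once, which is what allows the per-fold predictions to be spliced back into a full-length vector.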

### Choose some test data (optional) and some component models (defaults for the composite model):

```
figure(figsize=(8,6))
steps(x) = x < -3/2 ? -1 : (x < 3/2 ? 0 : 1)
x = Float64[-4, -1, 2, -3, 0, 3, -2, 1, 4]
Xraw = (x = x, )
yraw = steps.(x);
idxsort = sortperm(x)
xsort = x[idxsort]
ysort = yraw[idxsort]
step(xsort, ysort, label="truth", where="mid")
plot(x, yraw, ls="none", marker="o", label="data")
xlim(-4.5, 4.5)
legend()
```

Some models to stack (which we can change later):

```
model1 = linear
model2 = knn
```

```
KNNRegressor(
K = 5,
algorithm = :kdtree,
metric = Distances.Euclidean(0.0),
leafsize = 10,
reorder = true,
weights = NearestNeighborModels.Uniform())
```

The adjudicating model:

`judge = linear`

```
LinearRegressor(
fit_intercept = true,
solver = nothing)
```

### Define the training nodes

Let's instantiate input and target source nodes for the learning network, wrapping the play data defined above:

```
X = source(Xraw)
y = source(yraw)
```

`Source @014 ⏎ `AbstractVector{ScientificTypesBase.Count}``

Our first internal node will represent the three folds (vectors of row indices) for creating the out-of-sample predictions. We would like to define `f = folds(X, 3)` but this will not work because `X` is not a table, just a node representing a table. We could fix this by using the `@node` macro:

`f = @node folds(X, 3)`

```
Node{Nothing}
args:
1: Source @489
formula:
#10(
Source @489)
```

Now `f` is itself a node, and so callable:

`f()`

`([1, 2, 3], [4, 5, 6], [7, 8, 9])`

However, we can also just overload `folds` to work on nodes, using the `node` *function*:

```
folds(X::AbstractNode, nfolds) = node(XX->folds(XX, nfolds), X)
f = folds(X, 3)
f()
```

`([1, 2, 3], [4, 5, 6], [7, 8, 9])`
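The idea of a node as a deferred, callable computation can be illustrated with a toy model in plain Julia. The types `ToySource` and `ToyNode` and the helper `toy_node` below are hypothetical and deliberately minimal; they are only an analogy and omit everything MLJ nodes do with machines and training.

```julia
# Toy analogy for a source node: wraps data and returns it when called
struct ToySource
    data
end
(s::ToySource)() = s.data

# Toy analogy for a node: remembers a function and its upstream arguments,
# and only evaluates them when the node itself is called
struct ToyNode
    f::Function
    args::Tuple
end
(n::ToyNode)() = n.f((a() for a in n.args)...)

toy_node(f, args...) = ToyNode(f, args)

Xs = ToySource((x = [1.0, 2.0, 3.0],))
nrows_node = toy_node(t -> length(t.x), Xs)   # nothing is computed yet
doubled = toy_node(n -> 2n, nrows_node)       # nodes can feed other nodes

println(nrows_node())   # evaluates the source, then the function
println(doubled())      # evaluates the whole chain
```

Overloading a function to accept nodes, as done for `folds` above, amounts to returning such a deferred computation instead of an immediate result.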

In the case of `restrict` and `corestrict`, which also don't operate on nodes, method overloading will save us writing `@node` all the time:

```
MLJ.restrict(X::AbstractNode, f::AbstractNode, i) =
    node((XX, ff) -> restrict(XX, ff, i), X, f);
MLJ.corestrict(X::AbstractNode, f::AbstractNode, i) =
    node((XX, ff) -> corestrict(XX, ff, i), X, f);
```

We are now ready to define machines for training `model1` on each fold-complement:

```
m11 = machine(model1, corestrict(X, f, 1), corestrict(y, f, 1))
m12 = machine(model1, corestrict(X, f, 2), corestrict(y, f, 2))
m13 = machine(model1, corestrict(X, f, 3), corestrict(y, f, 3))
```

```
Machine{LinearRegressor,…} trained 0 times; caches data
model: MLJLinearModels.LinearRegressor
args:
1: Node{Nothing}
2: Node{Nothing}
```

Define each out-of-sample prediction of `model1`:

```
y11 = predict(m11, restrict(X, f, 1));
y12 = predict(m12, restrict(X, f, 2));
y13 = predict(m13, restrict(X, f, 3));
```

Splice together the out-of-sample predictions for model1:

`y1_oos = vcat(y11, y12, y13);`

Note there is no need to overload the `vcat` function to work on nodes; it does so out of the box, as do `hcat` and basic arithmetic operations.

Since our source nodes are wrapping data, we can optionally check our network so far, by fitting and calling `y1_oos`:

```
fit!(y1_oos, verbosity=0)
figure(figsize=(8,6))
step(xsort, ysort, label="truth", where="mid")
plot(x, y1_oos(), ls="none", marker="o", label="linear oos")
legend()
```

We now repeat the procedure for the other model:

```
m21 = machine(model2, corestrict(X, f, 1), corestrict(y, f, 1))
m22 = machine(model2, corestrict(X, f, 2), corestrict(y, f, 2))
m23 = machine(model2, corestrict(X, f, 3), corestrict(y, f, 3))
y21 = predict(m21, restrict(X, f, 1));
y22 = predict(m22, restrict(X, f, 2));
y23 = predict(m23, restrict(X, f, 3));
```

We test the knn out-of-sample prediction in the same way:

```
y2_oos = vcat(y21, y22, y23);
fit!(y2_oos, verbosity=0)
figure(figsize=(8,6))
step(xsort, ysort, label="truth", where="mid")
plot(x, y2_oos(), ls="none", marker="o", label="knn oos")
legend()
```

Now that we have the out-of-sample base learner predictions, we are ready to merge them into the adjudicator's input table and construct the machine for training the adjudicator:

```
X_oos = MLJ.table(hcat(y1_oos, y2_oos))
m_judge = machine(judge, X_oos, y)
```

```
Machine{LinearRegressor,…} trained 0 times; caches data
model: MLJLinearModels.LinearRegressor
args:
1: Node{Nothing}
2: Source @014 ⏎ `AbstractVector{ScientificTypesBase.Count}`
```

Are we done with constructing machines? Well, not quite. Recall that when we use the stack to make predictions on new data, we will be feeding the adjudicator ordinary predictions of the base learners (rather than out-of-sample predictions). But so far, we have only defined machines to train the base learners on fold complements, not on the full data, which we do now:

```
m1 = machine(model1, X, y)
m2 = machine(model2, X, y)
```

```
Machine{KNNRegressor,…} trained 0 times; caches data
model: NearestNeighborModels.KNNRegressor
args:
1: Source @489 ⏎ `ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}`
2: Source @014 ⏎ `AbstractVector{ScientificTypesBase.Count}`
```

### Define nodes still needed for prediction

To obtain the final prediction, `yhat`, we get the base learner predictions, based on training with all data, and feed them to the adjudicator:

```
y1 = predict(m1, X);
y2 = predict(m2, X);
X_judge = MLJ.table(hcat(y1, y2))
yhat = predict(m_judge, X_judge)
```

```
Node{Machine{LinearRegressor,…}}
args:
1: Node{Nothing}
formula:
predict(
Machine{LinearRegressor,…},
table(
hcat(
predict(
Machine{LinearRegressor,…},
Source @489),
predict(
Machine{KNNRegressor,…},
Source @489))))
```
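The two data flows built above (training the adjudicator on out-of-sample predictions, predicting with base learners fitted to all the data) can be summarised in plain Julia without any MLJ machinery. The sketch below is only an illustration under stated assumptions: `fit_linear`, `fit_knn`, `simple_folds` and `fit_stack` are hypothetical helpers written for this example, the learners are bare-bones least squares and brute-force nearest neighbours, and no claim is made that the numbers agree with those produced by the learning network.

```julia
using Statistics

# Hypothetical learners: each `fit_*` returns a closure mapping new inputs
# (a matrix with one row per observation) to a vector of predictions.
function fit_linear(X, y)
    β = hcat(ones(size(X, 1)), X) \ y               # least squares with intercept
    return Xnew -> hcat(ones(size(Xnew, 1)), Xnew) * β
end

function fit_knn(X, y; K=5)
    k = min(K, length(y))
    return function (Xnew)
        map(1:size(Xnew, 1)) do i
            d = vec(sum(abs2, X .- Xnew[i:i, :]; dims=2))   # squared distances to row i
            mean(y[sortperm(d)[1:k]])                       # uniform-weight average
        end
    end
end

# Contiguous folds; the last fold takes any leftover rows
function simple_folds(n, nfolds)
    len = div(n, nfolds)
    return [((k - 1) * len + 1):(k == nfolds ? n : k * len) for k in 1:nfolds]
end

function fit_stack(X, y; nfolds=3)
    n = length(y)
    fs = simple_folds(n, nfolds)
    base_fits = (fit_linear, fit_knn)

    # Training flow: out-of-sample predictions of each base learner,
    # spliced together in row order, become the adjudicator's inputs
    oos = map(base_fits) do fit
        vcat((fit(X[setdiff(1:n, f), :], y[setdiff(1:n, f)])(X[f, :]) for f in fs)...)
    end
    judge = fit_linear(hcat(oos...), y)

    # Prediction flow: base learners refitted on all the data feed the adjudicator
    full = map(fit -> fit(X, y), base_fits)
    return Xnew -> judge(hcat((p(Xnew) for p in full)...))
end

# Toy data shaped like the tutorial's step-function example
x = Float64[-4, -1, 2, -3, 0, 3, -2, 1, 4]
X = reshape(x, :, 1)
y = Float64[-1, 0, 1, -1, 0, 1, -1, 0, 1]

stack_predict = fit_stack(X, y)
yhat = stack_predict(X)
println(round.(yhat, digits=3))
```

Each base learner is fitted `nfolds + 1` times here: once per fold complement for the training flow, and once on all the data for the prediction flow, mirroring the four machines per base learner in the network.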

Let's check the final prediction node can be fit and called:

```
fit!(yhat, verbosity=0)
figure(figsize=(8,6))
step(xsort, ysort, label="truth", where="mid")
plot(x, yhat(), ls="none", marker="o", label="yhat")
legend()
```

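Although of little statistical significance here, we can compare the *training* errors of the base learners, of naive averaging, and of the stack. On this toy data set the stack's training error (`estack`) turns out higher than that of naive averaging (`emean`):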

```
e1 = rms(y1(), y())
e2 = rms(y2(), y())
emean = rms(0.5*y1() + 0.5*y2(), y())
estack = rms(yhat(), y())
@show e1 e2 emean estack;
```

```
e1 = 0.2581988897471611
e2 = 0.3771236166328254
emean = 0.2808716591058786
estack = 0.3373908215636326
```
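As a sanity check, the first three of these numbers can be recomputed by hand in plain Julia. The sketch below assumes that the linear model is an ordinary least-squares line with intercept and that the knn model averages the targets of the five nearest training points with uniform weights (consistent with the defaults shown earlier); the helper names are hypothetical. The stack error `estack` depends on the fold-wise training flow and is not recomputed here.

```julia
using Statistics

# The tutorial's toy data, listed in sorted order (rms does not depend on order)
x = Float64[-4, -3, -2, -1, 0, 1, 2, 3, 4]
y = Float64[-1, -1, -1, 0, 0, 0, 1, 1, 1]

rms_error(yhat, y) = sqrt(mean((yhat .- y) .^ 2))

# Ordinary least-squares line with intercept, fitted to all the data
A = hcat(ones(length(x)), x)
β = A \ y
y1 = A * β

# Brute-force 5-nearest-neighbour average with uniform weights
knn_predict(x0; K=5) = mean(y[sortperm(abs.(x .- x0))[1:K]])
y2 = knn_predict.(x)

e1 = rms_error(y1, y)
e2 = rms_error(y2, y)
emean = rms_error(0.5 .* y1 .+ 0.5 .* y2, y)
@show e1 e2 emean;
```

Under these assumptions the fitted line has slope 0.3 and zero intercept, so its squared residuals sum to 0.6 and `e1` equals `sqrt(0.6/9)`, in agreement with the value printed above.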

## Export the learning network as a new model type

The learning network (less the data wrapped in the source nodes) amounts to a specification of a new composite model type for two-model stacks, trained with three-fold resampling of base model predictions. Let's create the new type `MyTwoModelStack`, in the same way we exported the network for model averaging:

```
@from_network machine(Deterministic(), X, y; predict=yhat) begin
    mutable struct MyTwoModelStack
        regressor1=model1
        regressor2=model2
        judge=judge
    end
end
my_two_model_stack = MyTwoModelStack()
```

```
MyTwoModelStack(
regressor1 = LinearRegressor(
fit_intercept = true,
solver = nothing),
regressor2 = KNNRegressor(
K = 5,
algorithm = :kdtree,
metric = Distances.Euclidean(0.0),
leafsize = 10,
reorder = true,
weights = NearestNeighborModels.Uniform()),
judge = LinearRegressor(
fit_intercept = true,
solver = nothing))
```

And this completes the definition of our re-usable stacking model type.

## Applying `MyTwoModelStack` to some data

Without undertaking any hyperparameter optimization, we evaluate the performance of a tree boosting algorithm and a support vector machine on a synthetic data set. As adjudicator, we'll use a random forest.

We use a synthetic data set to give an example where stacking is effective but the data is not too large. (As the synthetic data is based on perturbations of linear models, we deliberately avoid linear models in this stacking illustration.)

`X, y = make_regression(1000, 20; sparse=0.75, noise=0.1, rng=123);`

#### Define the stack and compare performance

```
avg = MyAverageTwo(regressor1=tree_booster,
                   regressor2=svm)
stack = MyTwoModelStack(regressor1=tree_booster,
                        regressor2=svm,
                        judge=forest)
all_models = [tree_booster, svm, forest, avg, stack];
for model in all_models
    print_performance(model, X, y)
end
```

```
EvoTreeRegressor{Float64,…} = 1.9201 ± 0.04538
SVMRegressor = 0.93596 ± 0.06682
RandomForestRegressor = 1.7479 ± 0.06646
MyAverageTwo = 1.3103 ± 0.06588
MyTwoModelStack = 0.90942 ± 0.07224
```

#### Tuning a stack

A standard abuse of good data hygiene is to optimize stack component models *separately* and then tune the adjudicating model hyperparameters (using the same resampling of the data) with the base learners fixed. Tuning the stack as a whole, either simultaneously or in cheaper sequential steps, is more computationally expensive but might be expected to generalize better. Since our stack is a stand-alone model, this is readily implemented.

As a proof of concept, let's see how to tune one of the base model hyperparameters, based on performance of the stack as a whole:

```
r = range(stack, :(regressor2.C), lower = 0.01, upper = 10, scale=:log)
tuned_stack = TunedModel(model=stack,
                         ranges=r,
                         tuning=Grid(shuffle=false),
                         measure=rms,
                         resampling=Holdout())
mach = fit!(machine(tuned_stack, X, y), verbosity=0)
best_stack = fitted_params(mach).best_model
best_stack.regressor2.C
```

`10.000000000000002`
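With `scale=:log` the candidate values of `C` are spaced evenly on a logarithmic axis rather than a linear one. A minimal plain-Julia sketch of such a grid follows; the resolution of ten points and the use of natural logarithms are assumptions made for illustration, not a statement of how MLJ constructs its grid. Floating-point round trips through `log` and `exp` are one way an endpoint can come back very slightly different from `10`.

```julia
lower, upper = 0.01, 10.0
npoints = 10                      # assumed grid resolution

# Evenly spaced in log-space, then mapped back to the original scale
grid = exp.(range(log(lower), log(upper); length=npoints))

# On a log scale, consecutive grid values differ by a constant factor
ratios = grid[2:end] ./ grid[1:end-1]

println(round.(grid, sigdigits=4))
println("common ratio ≈ ", round(ratios[1], sigdigits=4))
```

Note that the selected value sits at the upper end of the range, which suggests that extending the range upwards might be worth exploring.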

Let's evaluate the best stack using the same data resampling used to evaluate the various untuned models earlier (now we are neglecting data hygiene!):

`print_performance(best_stack, X, y)`

```
MyTwoModelStack = 0.86036 ± 0.04804
```