Ensemble models

To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions.

If you have questions or suggestions about this tutorial, please open an issue here.

Preliminary steps
Homogenous ensembles

Let's start by loading the relevant packages and generating some dummy data.

using MLJ
import DataFrames: DataFrame
using StableRNGs

rng = StableRNG(512)
Xraw = rand(rng, 300, 3)
y = exp.(Xraw[:,1] - Xraw[:,2] - 2Xraw[:,3] + 0.1*rand(rng, 300))
X = DataFrame(Xraw, :auto)

train, test = partition(eachindex(y), 0.7);
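As a minimal sketch of what partition does: by default it splits the indices in their original order, while shuffle=true (with an optional rng) randomises them first. The toy index range 1:10 and the seed below are for illustration only.

```julia
using MLJ
using StableRNGs

# Unshuffled split: the first 70% of the indices go to `train_demo`, the rest to `test_demo`.
train_demo, test_demo = partition(1:10, 0.7)

# Shuffled split; fixing the rng keeps the split reproducible.
train_shuf, test_shuf = partition(1:10, 0.7; shuffle=true, rng=StableRNG(1))
```

Because the dummy data here is generated i.i.d., the unshuffled split used in this tutorial is harmless; for data with a meaningful row order, shuffling is usually the safer choice.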

Let's also load a simple model:

KNNRegressor = @load KNNRegressor
knn_model = KNNRegressor(K=10)

import NearestNeighborModels ✔
KNNRegressor(
  K = 10, 
  algorithm = :kdtree, 
  metric = Distances.Euclidean(0.0), 
  leafsize = 10, 
  reorder = true, 
  weights = NearestNeighborModels.Uniform())

As before, let's instantiate a machine that wraps the model and data:

knn = machine(knn_model, X, y)

untrained Machine; caches model-specific representations of data
  model: KNNRegressor(K = 10, …)
  args: 
    1:	Source @877 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
    2:	Source @914 ⏎ AbstractVector{ScientificTypesBase.Continuous}

and fit it on the training rows:

fit!(knn, rows=train)
ŷ = predict(knn, X[test, :]) # or use rows=test
l2(ŷ, y[test]) # L2 loss (mean of squared errors)

0.004083184660412992
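The scalar returned by l2 can be reproduced by hand. The sketch below assumes, as in recent MLJ versions, that l2 aggregates the per-observation squared errors by taking their mean; the toy vectors are illustrative values, not data from this tutorial.

```julia
using Statistics

# Toy predictions and targets (illustrative values only).
ŷ_toy = [1.0, 2.0, 3.0]
y_toy = [1.5, 2.0, 2.0]

# Mean of the squared residuals, i.e. what an aggregated L2 loss reports
# under the assumption stated above.
l2_by_hand = mean(abs2, ŷ_toy .- y_toy)
```

Taking the square root of this quantity gives the root mean squared error, which is on the same scale as the target.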

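A similar holdout workflow can be run in a single call to evaluate. Here the split is shuffled with its own seeded RNG and the measure is rms rather than l2, so the result is not directly comparable with the number above: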

evaluate(
    knn_model,
    X,
    y;
    resampling=Holdout(fraction_train=0.7, rng=StableRNG(666)),
    measure=rms,
)

PerformanceEvaluation object with these fields:
  model, measure, operation,
  measurement, per_fold, per_observation,
  fitted_params_per_fold, report_per_fold,
  train_test_rows, resampling, repeats
Extract:
┌────────────────────────┬───────────┬─────────────┐
│ measure                │ operation │ measurement │
├────────────────────────┼───────────┼─────────────┤
│ RootMeanSquaredError() │ predict   │ 0.124       │
└────────────────────────┴───────────┴─────────────┘

Training and testing an ensemble

MLJ offers basic support for ensembling such as bagging. Defining such an ensemble of simple "atomic" models is done via the EnsembleModel constructor:

ensemble_model = EnsembleModel(model=knn_model, n=20);

where the n=20 indicates how many models are present in the ensemble.
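To make the idea of bagging concrete, here is a conceptual sketch in plain Julia. It is not MLJ's implementation: the helpers knn_predict and bagged_predict are hypothetical, the data is one-dimensional, and drawing each subsample without replacement is an assumption about how the bagging fraction is applied.

```julia
using Random, Statistics

# Hypothetical helper: predict at x0 by averaging the targets of the
# K nearest training points (1-D inputs for simplicity).
knn_predict(Xtrain, ytrain, x0, K) =
    mean(ytrain[partialsortperm(abs.(Xtrain .- x0), 1:K)])

# Hypothetical helper: train n "atomic" KNN predictors, each on a random
# subsample of the data, and average their predictions.
function bagged_predict(rng, X, y, x0; n=20, bagging_fraction=0.8, K=3)
    m = round(Int, bagging_fraction * length(y))
    preds = map(1:n) do _
        rows = randperm(rng, length(y))[1:m]  # subsample without replacement (assumption)
        knn_predict(X[rows], y[rows], x0, K)
    end
    return mean(preds)
end

rng_demo = MersenneTwister(1)
X_demo = collect(range(0, 1; length=50))
y_demo = 2 .* X_demo                      # noiseless toy target y = 2x
ŷ_demo = bagged_predict(rng_demo, X_demo, y_demo, 0.5)
```

Averaging over differently subsampled atoms is what reduces the variance of the final prediction relative to a single atomic model.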

Now that we've instantiated an ensemble, it can be trained and tested the same as any other model:

estimates = evaluate(ensemble_model, X, y, resampling=CV())
estimates

PerformanceEvaluation object with these fields:
  model, measure, operation,
  measurement, per_fold, per_observation,
  fitted_params_per_fold, report_per_fold,
  train_test_rows, resampling, repeats
Extract:
┌──────────┬───────────┬─────────────┐
│ measure  │ operation │ measurement │
├──────────┼───────────┼─────────────┤
│ LPLoss(  │ predict   │ 0.00735     │
│   p = 2) │           │             │
└──────────┴───────────┴─────────────┘
┌─────────────────────────────────────────────────────┬─────────┐
│ per_fold                                            │ 1.96*SE │
├─────────────────────────────────────────────────────┼─────────┤
│ [0.0085, 0.0131, 0.00548, 0.00777, 0.00457, 0.0047] │ 0.00285 │
└─────────────────────────────────────────────────────┴─────────┘

Here the implicit measure is the L2 loss (the default for deterministic regressors), i.e. the mean of the squared errors within each fold. The measurement is the mean taken over the folds:

@show estimates.measurement[1]
@show mean(estimates.per_fold[1])

estimates.measurement[1] = 0.007350794741211419
mean(estimates.per_fold[1]) = 0.007350794741211419

Note that multiple measures can be specified jointly. Here only one measure is (implicitly) specified but we still have to select the corresponding results (whence the [1] for both the measurement and per_fold).
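As a sketch of how several measures can be requested at once, the following passes a vector of measures to evaluate and then indexes the results by position. The synthetic data from make_regression, the fold count, and the variable names are illustrative choices, not part of the tutorial's workflow.

```julia
using MLJ

KNNRegressor = @load KNNRegressor pkg=NearestNeighborModels verbosity=0

# Synthetic regression data (illustrative only).
X_multi, y_multi = make_regression(120, 3; rng=1)

e = evaluate(
    KNNRegressor(K=5), X_multi, y_multi;
    resampling=CV(nfolds=4),
    measure=[l2, rms],
    verbosity=0,
)

e.measurement[1]  # aggregated l2
e.measurement[2]  # aggregated rms
e.per_fold[2]     # rms for each of the 4 folds
```

The position of each result matches the position of its measure in the measure vector.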

Systematic tuning

Let's simultaneously tune the ensemble's bagging_fraction and the K-nearest-neighbour hyperparameter K. Since the KNN model is a field of the ensemble model, K is a nested hyperparameter:

ensemble_model

DeterministicEnsembleModel(
  model = KNNRegressor(
        K = 10, 
        algorithm = :kdtree, 
        metric = Distances.Euclidean(0.0), 
        leafsize = 10, 
        reorder = true, 
        weights = NearestNeighborModels.Uniform()), 
  atomic_weights = Float64[], 
  bagging_fraction = 0.8, 
  rng = Random._GLOBAL_RNG(), 
  n = 20, 
  acceleration = ComputationalResources.CPU1{Nothing}(nothing), 
  out_of_bag_measure = Any[])
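Nested hyperparameters can be read and set with ordinary dot syntax, which is also what the expression :(model.K) refers to when defining a range. A minimal sketch, using a fresh ensemble named ens so as not to disturb ensemble_model:

```julia
using MLJ

KNNRegressor = @load KNNRegressor pkg=NearestNeighborModels verbosity=0
ens = EnsembleModel(model=KNNRegressor(K=10), n=20)

ens.model.K            # read the nested hyperparameter of the atomic model
ens.model.K = 5        # models are mutable, so it can be changed in place
ens.bagging_fraction   # a top-level hyperparameter of the ensemble
```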

To define a tuning grid, we construct ranges for the two parameters and collate these ranges:

B_range = range(
    ensemble_model,
    :bagging_fraction,
    lower=0.5,
    upper=1.0,
)

NumericRange(0.5 ≤ bagging_fraction ≤ 1.0; origin=0.75, unit=0.25)

K_range = range(
    ensemble_model,
    :(model.K),
    lower=1,
    upper=20,
)

NumericRange(1 ≤ model.K ≤ 20; origin=10.5, unit=9.5)
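To see which values a range contributes to a grid, iterator can be called with a resolution. The sketch below builds ranges from a type and a field name rather than from a model instance; the parameter name lambda is hypothetical and only serves to show a logarithmic scale.

```julia
using MLJ

# Integer range with the same bounds as the K range; 10 grid points,
# with non-integer values rounded.
K_demo = range(Int, :K, lower=1, upper=20)
K_values = iterator(K_demo, 10)

# Hypothetical parameter `lambda`, sampled on a logarithmic scale.
λ_demo = range(Float64, :lambda, lower=1e-3, upper=1.0, scale=:log10)
λ_values = iterator(λ_demo, 4)
```

For an integer range, the number of distinct grid values can be smaller than the requested resolution when rounding produces duplicates.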

The scale of a range is linear by default, but a logarithmic scale can be requested with, for example, scale=:log10. Now we define a TunedModel and fit it:

tm = TunedModel(
    model=ensemble_model,
    tuning=Grid(resolution=10), # 10x10 grid
    resampling=Holdout(fraction_train=0.8, rng=StableRNG(42)),
    ranges=[B_range, K_range],
)

tuned_ensemble = machine(tm, X, y)
fit!(tuned_ensemble, rows=train);

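Note that rng=StableRNG(42) seeds the random number generator used by the Holdout resampling, for reproducibility of this example. The ensemble's own rng field is left at its default (the global RNG), so the bagging subsamples themselves are not seeded.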

Reporting results

The best model can be accessed like so:

best_ensemble = fitted_params(tuned_ensemble).best_model
@show best_ensemble.model.K
@show best_ensemble.bagging_fraction

best_ensemble.model.K = 3
best_ensemble.bagging_fraction = 0.5555555555555556

The report method gives more detailed information on the tuning process:

r = report(tuned_ensemble)
keys(r)

(:best_model, :best_history_entry, :history, :best_report, :plotting)

For instance, r.plotting contains details about the optimization you might use in a plot:

r.plotting

(parameter_names = ["bagging_fraction", "model.K"],
 parameter_scales = [:linear, :linear],
 parameter_values = Any[0.8888888888888888 16; 0.7222222222222222 12; 0.5 14; 0.5555555555555556 9; 0.8333333333333334 7; 0.5555555555555556 5; 0.6111111111111112 1; 0.6111111111111112 9; 0.9444444444444444 20; 0.7777777777777778 7; 0.5555555555555556 1; 0.7222222222222222 7; 0.6111111111111112 12; 1.0 3; 0.8888888888888888 7; 0.5555555555555556 12; 0.8888888888888888 20; 1.0 18; 0.7777777777777778 9; 0.6666666666666666 18; 1.0 16; 0.7777777777777778 20; 0.5555555555555556 18; 0.6111111111111112 18; 1.0 1; 0.8333333333333334 18; 0.6666666666666666 14; 0.5 5; 0.5 9; 0.9444444444444444 1; 0.6666666666666666 12; 0.9444444444444444 7; 0.8333333333333334 9; 0.5 18; 0.7222222222222222 14; 0.9444444444444444 5; 0.8888888888888888 14; 0.6111111111111112 16; 0.6111111111111112 3; 0.5555555555555556 14; 0.6111111111111112 5; 0.7222222222222222 3; 0.7222222222222222 5; 0.6666666666666666 9; 0.8888888888888888 5; 0.8333333333333334 16; 0.6666666666666666 20; 0.9444444444444444 3; 0.7777777777777778 3; 0.6111111111111112 7; 0.5555555555555556 7; 0.7222222222222222 20; 0.6111111111111112 14; 0.9444444444444444 9; 0.5 12; 0.9444444444444444 16; 0.6666666666666666 16; 0.7777777777777778 5; 1.0 14; 0.8333333333333334 1; 0.7222222222222222 1; 1.0 12; 0.5555555555555556 20; 0.6666666666666666 7; 0.9444444444444444 18; 0.8333333333333334 3; 0.5 16; 0.8888888888888888 1; 0.7222222222222222 16; 0.7222222222222222 18; 0.6111111111111112 20; 0.8888888888888888 18; 0.7777777777777778 12; 0.5555555555555556 16; 0.8333333333333334 5; 0.8888888888888888 9; 0.5 20; 0.8888888888888888 12; 0.7777777777777778 1; 0.6666666666666666 3; 0.5 3; 0.6666666666666666 1; 0.9444444444444444 14; 0.7777777777777778 18; 1.0 5; 0.8888888888888888 3; 0.8333333333333334 20; 0.6666666666666666 5; 0.8333333333333334 12; 1.0 7; 0.5 1; 1.0 9; 0.7777777777777778 14; 0.9444444444444444 12; 0.8333333333333334 14; 1.0 20; 0.7222222222222222 9; 0.5555555555555556 3; 0.5 7; 0.7777777777777778 16],
 measurements = [0.005510537040250436, 0.005294018508449513, 0.01256484067837205, 0.0054860018049651, 0.0020693277819589496, 0.0019639439830071797, 0.0026603182414894643, 0.004444146375870455, 0.0077358949058529925, 0.0021799433620184293, 0.002901400981057841, 0.0028715722765500153, 0.0063458333077938715, 0.0036225306262139797, 0.002282828591914064, 0.00918965779826136, 0.008544481598245, 0.005444233904727673, 0.0027553596560795043, 0.013585059291628657, 0.004708208753217298, 0.011515816102703132, 0.015262646259212401, 0.015262646259212401, 0.004013058912260644, 0.007608354823647869, 0.008174126731907353, 0.0025495413808797004, 0.007349983003073203, 0.004008941270310222, 0.005569183605700463, 0.0024000064268967607, 0.0027606157648073283, 0.0177264497695686, 0.006709265024484596, 0.0019687980791523577, 0.004532308770527076, 0.010027634843248803, 0.0019779434879680985, 0.010841808736560734, 0.0020759881586082107, 0.0019617229890438992, 0.002018417496027778, 0.0031996429793088826, 0.0019590935482526337, 0.006171774142431638, 0.014983868707489145, 0.003177835188809685, 0.003177835188809685, 0.002723962310208088, 0.002723962310208088, 0.013727058382608561, 0.009915483651085747, 0.002628169289712383, 0.009036198537464698, 0.004849112481888398, 0.004849112481888398, 0.0020620505509132366, 0.004396831413497289, 0.0030967322661130544, 0.0030967322661130544, 0.0030050786079581266, 0.01837511993297943, 0.0024307747926904802, 0.006426314301505299, 0.002535709537183082, 0.016113432653505892, 0.0031873668461001063, 0.008375530466285825, 0.010134608879858106, 0.014975086337362012, 0.0070015054388494246, 0.005275102165054362, 0.013285463374577996, 0.0018856635463664257, 0.002738847963493905, 0.022079608995649973, 0.003959343153073119, 0.003848302817873994, 0.0022478853165749956, 0.0022478853165749956, 0.003198094938422581, 0.0047792631410639675, 0.008846111297017867, 0.0021374928753109066, 0.002577937915771061, 0.009416224842983895, 0.001936536575767905, 0.004489428112904196, 
0.0029393446137467344, 0.0022063466071862636, 0.0025629081912535425, 0.0053210828789251345, 0.0034506146716379144, 0.005555011357520935, 0.007131032347247583, 0.0033241004726587385, 0.0016464401054677978, 0.004584602950789503, 0.007192982512486613],)
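The fields of r.plotting line up row by row, so the best grid point can be recovered with argmin. The sketch below uses a small hand-written named tuple with the same field layout; its numbers are illustrative and are not results from this tutorial.

```julia
# Toy stand-in for r.plotting (illustrative values only).
plotting_demo = (
    parameter_names = ["bagging_fraction", "model.K"],
    parameter_values = Any[0.5 1; 0.5 3; 1.0 1; 1.0 3],
    measurements = [0.004, 0.002, 0.005, 0.003],
)

# Index of the smallest loss, and the corresponding row of parameter values.
i_best = argmin(plotting_demo.measurements)
best_params = Dict(zip(plotting_demo.parameter_names,
                       plotting_demo.parameter_values[i_best, :]))
```

Applied to the real r.plotting, the same two lines should agree with the values reported for best_ensemble.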

Alternatively, you can use the built-in plot recipe for a machine wrapping a TunedModel:

using Plots
plot(tuned_ensemble)

Plot{Plots.GRBackend() n=4}

Finally, you can evaluate the tuned model by computing l2 on the test set:

ŷ = predict(tuned_ensemble, rows=test)
@show l2(ŷ, y[test])

l2(ŷ, y[test]) = 0.00267032197153539

