Ensemble models

Preliminary steps

Let's start by loading the relevant packages and generating some dummy data.

using MLJ
import DataFrames: DataFrame
using PrettyPrinting
using StableRNGs

rng = StableRNG(512)
Xraw = rand(rng, 300, 3)
y = exp.(Xraw[:,1] - Xraw[:,2] - 2Xraw[:,3] + 0.1*rand(rng, 300))
X = DataFrame(Xraw, :auto)

train, test = partition(eachindex(y), 0.7);
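
By default, partition splits the indices in the order given; if you prefer a randomized split, it also accepts shuffle and rng keywords. A sketch, reusing the rng defined above (the variable names are illustrative):

train_shuffled, test_shuffled = partition(eachindex(y), 0.7, shuffle=true, rng=rng);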

Let's also load a simple model:

KNNRegressor = @load KNNRegressor
knn_model = KNNRegressor(K=10)
import NearestNeighborModels ✔
KNNRegressor(
    K = 10,
    algorithm = :kdtree,
    metric = Distances.Euclidean(0.0),
    leafsize = 10,
    reorder = true,
    weights = NearestNeighborModels.Uniform())

As before, let's instantiate a machine that wraps the model and data:

knn = machine(knn_model, X, y)
Machine{KNNRegressor,…} trained 0 times; caches data
  model: NearestNeighborModels.KNNRegressor
  args: 
    1:	Source @057 ⏎ `ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}`
    2:	Source @811 ⏎ `AbstractVector{ScientificTypesBase.Continuous}`

and fit it:

fit!(knn, rows=train)
ŷ = predict(knn, X[test, :]) # or use rows=test
rms(ŷ, y[test])
0.06389980172436367

The steps above are equivalent to a single call to evaluate!:

evaluate!(knn, resampling=Holdout(fraction_train=0.7, rng=StableRNG(666)),
          measure=rms)
PerformanceEvaluation object with these fields:
  measure, measurement, operation, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_pairs
Extract:
┌────────────────────────┬─────────────┬───────────┬──────────┐
│ measure                │ measurement │ operation │ per_fold │
├────────────────────────┼─────────────┼───────────┼──────────┤
│ RootMeanSquaredError() │ 0.124       │ predict   │ [0.124]  │
└────────────────────────┴─────────────┴───────────┴──────────┘

Homogeneous ensembles

MLJ offers basic support for ensembling techniques such as bagging. An ensemble of simple "atomic" models is defined via the EnsembleModel constructor:

ensemble_model = EnsembleModel(model=knn_model, n=20);

where n=20 specifies the number of atomic models in the ensemble.
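
Other keyword arguments control the bagging itself; for example (an illustrative sketch using fields shown in the params output further below, with a hypothetical variable name and seed):

ensemble_model_seeded = EnsembleModel(model=knn_model, n=20,
                                      bagging_fraction=0.8,   # fraction of rows each atom trains on
                                      rng=StableRNG(2024));   # seed the bagging for reproducibility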

Training and testing an ensemble

Now that we've instantiated an ensemble, it can be trained and tested just like any other model:

ensemble = machine(ensemble_model, X, y)
estimates = evaluate!(ensemble, resampling=CV())
estimates
PerformanceEvaluation object with these fields:
  measure, measurement, operation, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_pairs
Extract:
┌────────────────────────┬─────────────┬───────────┬─────────────────────────────────────────────┐
│ measure                │ measurement │ operation │ per_fold                                    │
├────────────────────────┼─────────────┼───────────┼─────────────────────────────────────────────┤
│ RootMeanSquaredError() │ 0.0856      │ predict   │ [0.09, 0.112, 0.079, 0.089, 0.0661, 0.0694] │
└────────────────────────┴─────────────┴───────────┴─────────────────────────────────────────────┘

Here no measure was specified, so it defaults to the RMS (the default for deterministic regression). The aggregate measurement is computed from the per-fold values; for the RMS this aggregation is itself a root mean square rather than a plain mean, which is why the two numbers below differ slightly:

@show estimates.measurement[1]
@show mean(estimates.per_fold[1])
estimates.measurement[1] = 0.08560905340996186
mean(estimates.per_fold[1]) = 0.08423976327531794
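
You can verify this aggregation directly from the per-fold values (a quick sanity check, not part of the API):

fold_scores = estimates.per_fold[1]
sqrt(mean(fold_scores .^ 2))   # reproduces estimates.measurement[1]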

Note that multiple measures can be specified jointly. Here only one measure is (implicitly) specified, but we still have to index into the corresponding results (hence the [1] for both the measurement and per_fold).
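
For instance, to evaluate with both the RMS and the mean absolute error (a minimal sketch; mae is MLJ's built-in mean-absolute-error measure):

estimates2 = evaluate!(ensemble, resampling=CV(), measure=[rms, mae])
estimates2.measurement[1]   # aggregate RMS
estimates2.measurement[2]   # aggregate MAE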

Systematic tuning

Let's simultaneously tune the ensemble's bagging_fraction and the K-nearest neighbours hyperparameter K. Since the atomic model is a field of the ensemble model, we have nested hyperparameters:

params(ensemble_model) |> pprint
(model = (K = 10,
          algorithm = :kdtree,
          metric = Distances.Euclidean(0.0),
          leafsize = 10,
          reorder = true,
          weights = NearestNeighborModels.Uniform()),
 atomic_weights = [],
 bagging_fraction = 0.8,
 rng = Random._GLOBAL_RNG(),
 n = 20,
 acceleration = ComputationalResources.CPU1{Nothing}(nothing),
 out_of_bag_measure = [])
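
Since MLJ models are ordinary mutable structs, these nested hyperparameters can be read (and mutated) with plain field access:

@show ensemble_model.model.K
@show ensemble_model.bagging_fraction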

To define a tuning grid, we construct ranges for the two parameters and collate these ranges:

B_range = range(ensemble_model, :bagging_fraction,
                lower=0.5, upper=1.0)
K_range = range(ensemble_model, :(model.K),
                lower=1, upper=20);

The scale of a range is linear by default, but it can be set to :log10 for logarithmically spaced grids, as sketched below.
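
For example, a log-scaled variant of the K range would be built with the scale keyword (illustrative only; the tuning below keeps the linear K_range):

K_log_range = range(ensemble_model, :(model.K),
                    lower=1, upper=20, scale=:log10);

Now we have to define a TunedModel and fit it: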

tm = TunedModel(model=ensemble_model,
                tuning=Grid(resolution=10), # 10x10 grid
                resampling=Holdout(fraction_train=0.8, rng=StableRNG(42)),
                ranges=[B_range, K_range])

tuned_ensemble = machine(tm, X, y)
fit!(tuned_ensemble, rows=train);

Note that rng=StableRNG(42) seeds the random number generator used by the Holdout resampling, making this example reproducible.

Reporting results

The best model can be accessed like so:

best_ensemble = fitted_params(tuned_ensemble).best_model
@show best_ensemble.model.K
@show best_ensemble.bagging_fraction
best_ensemble.model.K = 5
best_ensemble.bagging_fraction = 0.6666666666666666
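
Predicting through tuned_ensemble, as done at the end of this tutorial, already uses this best model retrained on the training rows. To work with it outside the tuned wrapper, you can also bind it to a fresh machine (a sketch; best_machine is a hypothetical name):

best_machine = machine(best_ensemble, X, y)
fit!(best_machine, rows=train);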

The report method gives more detailed information on the tuning process:

r = report(tuned_ensemble);
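
In recent MLJ versions the report behaves like a named tuple, so you can list what it contains:

keys(r) |> collect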

For instance, r.plotting gathers the evaluated hyperparameter pairs and the corresponding measurements, which you can visualise nicely:

using PyPlot

figure(figsize=(8,6))

res = r.plotting
vals_b = res.parameter_values[:, 1]   # bagging_fraction values on the grid
vals_k = res.parameter_values[:, 2]   # K values on the grid

tricontourf(vals_b, vals_k, res.measurements)   # filled contours of the RMS over the grid
xticks(0.5:0.1:1, fontsize=12)
xlabel("Bagging fraction", fontsize=14)
yticks([1, 5, 10, 15, 20], fontsize=12)
ylabel("Number of neighbors - K", fontsize=14)
[Figure: hyperparameter heatmap of the RMS over bagging fraction and K]

Finally, you can always evaluate the tuned model directly by computing the RMS on the test set:

ŷ = predict(tuned_ensemble, rows=test)
@show rms(ŷ, y[test])
rms(ŷ, y[test]) = 0.056334546503186374