Ensemble models

Preliminary steps

Let's start by loading the relevant packages and generating some dummy data.

using MLJ
import DataFrames: DataFrame
using PrettyPrinting
using StableRNGs

rng = StableRNG(512)
Xraw = rand(rng, 300, 3)
y = exp.(Xraw[:,1] - Xraw[:,2] - 2Xraw[:,3] + 0.1*rand(rng, 300))
X = DataFrame(Xraw, :auto)

train, test = partition(eachindex(y), 0.7);
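
By default, partition splits the indices in the order given; if you prefer a randomized split, it also accepts shuffle and rng keywords. A sketch, reusing the rng defined above (the variable names are illustrative):

train_shuffled, test_shuffled = partition(eachindex(y), 0.7, shuffle=true, rng=rng);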

Let's also load a simple model:

KNNRegressor = @load KNNRegressor
knn_model = KNNRegressor(K=10)
import NearestNeighborModels ✔
KNNRegressor(
    K = 10,
    algorithm = :kdtree,
    metric = Distances.Euclidean(0.0),
    leafsize = 10,
    reorder = true,
    weights = NearestNeighborModels.Uniform())

As before, let's instantiate a machine that wraps the model and data:

knn = machine(knn_model, X, y)
Machine{KNNRegressor,…} trained 0 times; caches data
  model: NearestNeighborModels.KNNRegressor
  args: 
    1:	Source @057 ⏎ `ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}`
    2:	Source @811 ⏎ `AbstractVector{ScientificTypesBase.Continuous}`

and fit it:

fit!(knn, rows=train)
ŷ = predict(knn, X[test, :]) # or use rows=test
rms(ŷ, y[test])
0.06389980172436367

The steps above are equivalent to a single call to evaluate!:

evaluate!(knn, resampling=Holdout(fraction_train=0.7, rng=StableRNG(666)),
          measure=rms)
PerformanceEvaluation object with these fields:
  measure, measurement, operation, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_pairs
Extract:
┌────────────────────────┬─────────────┬───────────┬──────────┐
│ measure                │ measurement │ operation │ per_fold │
├────────────────────────┼─────────────┼───────────┼──────────┤
│ RootMeanSquaredError() │ 0.124       │ predict   │ [0.124]  │
└────────────────────────┴─────────────┴───────────┴──────────┘

Homogeneous ensembles

MLJ offers basic support for ensembling techniques such as bagging. An ensemble of simple "atomic" models is defined via the EnsembleModel constructor:

ensemble_model = EnsembleModel(model=knn_model, n=20);

where n=20 specifies the number of atomic models in the ensemble.
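
Other keyword arguments control the bagging itself; for example (an illustrative sketch using fields shown in the params output further below, with a hypothetical variable name and seed):

ensemble_model_seeded = EnsembleModel(model=knn_model, n=20,
                                      bagging_fraction=0.8,   # fraction of rows each atom trains on
                                      rng=StableRNG(2024));   # seed the bagging for reproducibility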

Training and testing an ensemble

Now that we've instantiated an ensemble, it can be trained and tested just like any other model:

ensemble = machine(ensemble_model, X, y)
estimates = evaluate!(ensemble, resampling=CV())
estimates
PerformanceEvaluation object with these fields:
  measure, measurement, operation, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_pairs
Extract:
┌────────────────────────┬─────────────┬───────────┬─────────────────────────────────────────────┐
│ measure                │ measurement │ operation │ per_fold                                    │
├────────────────────────┼─────────────┼───────────┼─────────────────────────────────────────────┤
│ RootMeanSquaredError() │ 0.0856      │ predict   │ [0.09, 0.112, 0.079, 0.089, 0.0661, 0.0694] │
└────────────────────────┴─────────────┴───────────┴─────────────────────────────────────────────┘

Here no measure was specified, so it defaults to the RMS (the default for deterministic regression). The aggregate measurement is computed from the per-fold values; for the RMS this aggregation is itself a root mean square rather than a plain mean, which is why the two numbers below differ slightly:

@show estimates.measurement[1]
@show mean(estimates.per_fold[1])
estimates.measurement[1] = 0.08560905340996186
mean(estimates.per_fold[1]) = 0.08423976327531794
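
You can verify this aggregation directly from the per-fold values (a quick sanity check, not part of the API):

fold_scores = estimates.per_fold[1]
sqrt(mean(fold_scores .^ 2))   # reproduces estimates.measurement[1]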

Note that multiple measures can be specified jointly. Here only one measure is (implicitly) specified, but we still have to index into the corresponding results (hence the [1] for both the measurement and per_fold).
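
For instance, to evaluate with both the RMS and the mean absolute error (a minimal sketch; mae is MLJ's built-in mean-absolute-error measure):

estimates2 = evaluate!(ensemble, resampling=CV(), measure=[rms, mae])
estimates2.measurement[1]   # aggregate RMS
estimates2.measurement[2]   # aggregate MAE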

Systematic tuning

Let's simultaneously tune the ensemble's bagging_fraction and the K-nearest neighbours hyperparameter K. Since the atomic model is a field of the ensemble model, we have nested hyperparameters:

params(ensemble_model) |> pprint
(model = (K = 10,
          algorithm = :kdtree,
          metric = Distances.Euclidean(0.0),
          leafsize = 10,
          reorder = true,
          weights = NearestNeighborModels.Uniform()),
 atomic_weights = [],
 bagging_fraction = 0.8,
 rng = Random._GLOBAL_RNG(),
 n = 20,
 acceleration = ComputationalResources.CPU1{Nothing}(nothing),
 out_of_bag_measure = [])
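
Since MLJ models are ordinary mutable structs, these nested hyperparameters can be read (and mutated) with plain field access:

@show ensemble_model.model.K
@show ensemble_model.bagging_fraction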

To define a tuning grid, we construct ranges for the two parameters and collate these ranges:

B_range = range(ensemble_model, :bagging_fraction,
                lower=0.5, upper=1.0)
K_range = range(ensemble_model, :(model.K),
                lower=1, upper=20);

The scale of a range is linear by default, but it can be set to :log10 for logarithmically spaced grids, as sketched below.
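
For example, a log-scaled variant of the K range would be built with the scale keyword (illustrative only; the tuning below keeps the linear K_range):

K_log_range = range(ensemble_model, :(model.K),
                    lower=1, upper=20, scale=:log10);

Now we have to define a TunedModel and fit it: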

tm = TunedModel(model=ensemble_model,
                tuning=Grid(resolution=10), # 10x10 grid
                resampling=Holdout(fraction_train=0.8, rng=StableRNG(42)),
                ranges=[B_range, K_range])

tuned_ensemble = machine(tm, X, y)
fit!(tuned_ensemble, rows=train);

Note that rng=StableRNG(42) seeds the random number generator used by the Holdout resampling, making this example reproducible.

Reporting results

The best model can be accessed like so:

best_ensemble = fitted_params(tuned_ensemble).best_model
@show best_ensemble.model.K
@show best_ensemble.bagging_fraction
best_ensemble.model.K = 5
best_ensemble.bagging_fraction = 0.6666666666666666
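
Predicting through tuned_ensemble, as done at the end of this tutorial, already uses this best model retrained on the training rows. To work with it outside the tuned wrapper, you can also bind it to a fresh machine (a sketch; best_machine is a hypothetical name):

best_machine = machine(best_ensemble, X, y)
fit!(best_machine, rows=train);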

The report method gives more detailed information on the tuning process:

r = report(tuned_ensemble);
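
In recent MLJ versions the report behaves like a named tuple, so you can list what it contains:

keys(r) |> collect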

For instance, r.plotting gathers the evaluated hyperparameter pairs and the corresponding measurements, which you can visualise nicely:

using PyPlot

figure(figsize=(8,6))

res = r.plotting
vals_b = res.parameter_values[:, 1]   # bagging_fraction values on the grid
vals_k = res.parameter_values[:, 2]   # K values on the grid

tricontourf(vals_b, vals_k, res.measurements)   # filled contours of the RMS over the grid
xticks(0.5:0.1:1, fontsize=12)
xlabel("Bagging fraction", fontsize=14)
yticks([1, 5, 10, 15, 20], fontsize=12)
ylabel("Number of neighbors - K", fontsize=14)
[Figure: hyperparameter heatmap of the RMS over bagging fraction and K]

Finally, you can always evaluate the tuned model directly by computing the RMS on the test set:

ŷ = predict(tuned_ensemble, rows=test)
@show rms(ŷ, y[test])
rms(ŷ, y[test]) = 0.056334546503186374