# Ensemble models

To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions.

If you have questions or suggestions about this tutorial, please open an issue here.

## Preliminary steps

Let's start by loading the relevant packages and generating some dummy data.

``````using MLJ
import DataFrames: DataFrame
using PrettyPrinting
using StableRNGs

rng = StableRNG(512)
Xraw = rand(rng, 300, 3)
y = exp.(Xraw[:,1] - Xraw[:,2] - 2Xraw[:,3] + 0.1*rand(rng, 300))
X = DataFrame(Xraw, :auto)

train, test = partition(eachindex(y), 0.7);``````

Let's also load a simple model:

``````KNNRegressor = @load KNNRegressor
knn_model = KNNRegressor(K=10)``````
``````import NearestNeighborModels ✔
KNNRegressor(
    K = 10,
    algorithm = :kdtree,
    metric = Distances.Euclidean(0.0),
    leafsize = 10,
    reorder = true,
    weights = NearestNeighborModels.Uniform())``````

As before, let's instantiate a machine that wraps the model and data:

``knn = machine(knn_model, X, y)``
``````Machine{KNNRegressor,…} trained 0 times; caches data
  model: NearestNeighborModels.KNNRegressor
  args:
    1:	Source @057 ⏎ `ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}`
    2:	Source @811 ⏎ `AbstractVector{ScientificTypesBase.Continuous}`
``````

and fit it:

``````fit!(knn, rows=train)
ŷ = predict(knn, X[test, :]) # or use rows=test
rms(ŷ, y[test])``````
``0.06389980172436367``
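As a sanity check, `rms` here is just the root mean squared error, which can be computed by hand:

``````using Statistics

# Root mean squared error, computed directly (matches MLJ's `rms` measure):
rms_by_hand(ŷ, y) = sqrt(mean((ŷ .- y) .^ 2))

rms_by_hand([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # sqrt(4/3) ≈ 1.1547``````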

The few steps above are equivalent to just calling `evaluate!`:

``````evaluate!(knn, resampling=Holdout(fraction_train=0.7, rng=StableRNG(666)),
          measure=rms)``````
``````PerformanceEvaluation object with these fields:
  measure, measurement, operation, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_pairs
Extract:
┌────────────────────────┬─────────────┬───────────┬──────────┐
│ measure                │ measurement │ operation │ per_fold │
├────────────────────────┼─────────────┼───────────┼──────────┤
│ RootMeanSquaredError() │ 0.124       │ predict   │ [0.124]  │
└────────────────────────┴─────────────┴───────────┴──────────┘
``````

## Homogeneous ensembles

MLJ offers basic support for ensembling such as bagging. Defining such an ensemble of simple "atomic" models is done via the `EnsembleModel` constructor:

``ensemble_model = EnsembleModel(model=knn_model, n=20);``

where `n=20` specifies the number of atomic models in the ensemble.
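Conceptually, a bagged ensemble fits each atomic model on a random subsample of the training data (controlled by `bagging_fraction`) and averages the atomic predictions. Here is a minimal pure-Julia sketch of the idea, using a toy "atomic model" that just predicts the mean of its training targets (a stand-in, not the MLJ internals):

``````using Random, Statistics

# Toy "atomic model": training stores the mean of the targets,
# prediction returns that constant (stand-in for e.g. a KNN regressor).
fit_atom(y_sub) = mean(y_sub)
predict_atom(atom) = atom

# Bagging: fit n atoms, each on a random `bagging_fraction` of the data,
# then average their predictions.
function bagged_predict(y; n=20, bagging_fraction=0.8, rng=MersenneTwister(0))
    m = round(Int, bagging_fraction * length(y))
    atoms = [fit_atom(y[randperm(rng, length(y))[1:m]]) for _ in 1:n]
    return mean(predict_atom.(atoms))
end``````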

### Training and testing an ensemble

Now that we've instantiated an ensemble, it can be trained and tested the same as any other model:

``````ensemble = machine(ensemble_model, X, y)
estimates = evaluate!(ensemble, resampling=CV())
estimates``````
``````PerformanceEvaluation object with these fields:
  measure, measurement, operation, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_pairs
Extract:
┌────────────────────────┬─────────────┬───────────┬─────────────────────────────────────────────┐
│ measure                │ measurement │ operation │ per_fold                                    │
├────────────────────────┼─────────────┼───────────┼─────────────────────────────────────────────┤
│ RootMeanSquaredError() │ 0.0856      │ predict   │ [0.09, 0.112, 0.079, 0.089, 0.0661, 0.0694] │
└────────────────────────┴─────────────┴───────────┴─────────────────────────────────────────────┘
``````

Here the implicit measure is `rms` (the default for regression). The `measurement` aggregates the results over the folds; note that for `rms` this aggregate is the root mean square of the per-fold values rather than their plain mean, which is why the two numbers below differ slightly:

``````@show estimates.measurement
@show mean(estimates.per_fold)``````
``````estimates.measurement = 0.08560905340996186
mean(estimates.per_fold) = 0.08423976327531794
``````
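The small discrepancy comes from the aggregation: the overall `measurement` is the root mean square of the per-fold values, not their plain mean. We can check this against the per-fold values printed above:

``````using Statistics

per_fold = [0.09, 0.112, 0.079, 0.089, 0.0661, 0.0694]

mean(per_fold)             # ≈ 0.084, the plain mean
sqrt(mean(per_fold .^ 2))  # ≈ 0.0856, matching `measurement```````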

Note that multiple measures can be specified jointly. Here only one measure is (implicitly) specified, so `measurement` and `per_fold` each contain a single set of results.

### Systematic tuning

Let's simultaneously tune the ensemble's `bagging_fraction` and the number of neighbors `K` of the atomic KNN model. Since the atomic model is a field of the ensemble, the hyperparameters are nested:

``params(ensemble_model) |> pprint``
``````(model = (K = 10,
          algorithm = :kdtree,
          metric = Distances.Euclidean(0.0),
          leafsize = 10,
          reorder = true,
          weights = NearestNeighborModels.Uniform()),
 atomic_weights = [],
 bagging_fraction = 0.8,
 rng = Random._GLOBAL_RNG(),
 n = 20,
 acceleration = ComputationalResources.CPU1{Nothing}(nothing),
 out_of_bag_measure = [])``````
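The nesting mirrors ordinary Julia field access: the atomic model lives in the ensemble's `model` field, so its hyperparameters are reached through a dot chain. A toy illustration with hypothetical structs (not the actual MLJ types):

``````# Hypothetical stand-ins for the atomic model and the ensemble:
Base.@kwdef mutable struct ToyKNN
    K::Int = 10
end

Base.@kwdef mutable struct ToyEnsemble
    model::ToyKNN = ToyKNN()
    bagging_fraction::Float64 = 0.8
end

e = ToyEnsemble()
e.model.K = 5   # the nested field addressed as `:(model.K)` when tuning``````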

To define a tuning grid, we construct ranges for the two parameters and collate these ranges:

``````B_range = range(ensemble_model, :bagging_fraction,
                lower=0.5, upper=1.0)
K_range = range(ensemble_model, :(model.K),
                lower=1, upper=20);``````

The scale of a tuning grid is linear by default but can be set to `:log10` in the `range` constructor for a logarithmic range. Now we define a `TunedModel` and fit it:

``````tm = TunedModel(model=ensemble_model,
                tuning=Grid(resolution=10), # 10x10 grid
                resampling=Holdout(fraction_train=0.8, rng=StableRNG(42)),
                ranges=[B_range, K_range])

tuned_ensemble = machine(tm, X, y)
fit!(tuned_ensemble, rows=train);``````

Note that `rng=StableRNG(42)` seeds the random number generator, making this example reproducible.
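For intuition, `Grid(resolution=10)` visits evenly spaced points in each range, with the spacing taken on the range's scale. A pure-Julia sketch of the candidate values, assuming linear spacing for `bagging_fraction` and showing what a `:log10` scale would give for `K` (before rounding to integers; illustrative only, not the internals of `Grid`):

``````# Ten evenly spaced candidates for bagging_fraction ∈ [0.5, 1.0] (linear scale):
lin = collect(range(0.5, 1.0, length=10))

# Ten candidates for K ∈ [1, 20], evenly spaced in log10:
logg = 10 .^ range(log10(1), log10(20), length=10)``````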

### Reporting results

The best model can be accessed like so:

``````best_ensemble = fitted_params(tuned_ensemble).best_model
@show best_ensemble.model.K
@show best_ensemble.bagging_fraction``````
``````best_ensemble.model.K = 5
best_ensemble.bagging_fraction = 0.6666666666666666
``````

The `report` method gives more detailed information on the tuning process:

``r = report(tuned_ensemble);``

For instance, `r.plotting` gathers the parameter values and corresponding measurements for all evaluated pairs of hyperparameters, which you can visualise nicely:

``````using PyPlot

figure(figsize=(8,6))

res = r.plotting
vals_b = res.parameter_values[:, 1]
vals_k = res.parameter_values[:, 2]

tricontourf(vals_b, vals_k, res.measurements)
xticks(0.5:0.1:1, fontsize=12)
xlabel("Bagging fraction", fontsize=14)
yticks([1, 5, 10, 15, 20], fontsize=12)
ylabel("Number of neighbors - K", fontsize=14)``````

Finally, you can always evaluate the model by reporting `rms` on the test set:

``````ŷ = predict(tuned_ensemble, rows=test)
@show rms(ŷ, y[test])``````
``````rms(ŷ, y[test]) = 0.056334546503186374
``````