AMES

To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions.

If you have questions or suggestions about this tutorial, please open an issue here.

Baby steps

Let's load a reduced version of the well-known Ames House Price data set (containing six of the more important categorical features and six of the more important numerical features). Like "iris", this dataset is common enough that you can load the full version directly with @load_ames and the reduced version with @load_reduced_ames.

using MLJ
using PrettyPrinting
import DataFrames: DataFrame
import Statistics

X, y = @load_reduced_ames
X = DataFrame(X)
@show size(X)
first(X, 3) |> pretty
size(X) = (1456, 12)
┌─────────────────────────────────┬────────────┬──────────────────────────────────┬────────────┬─────────────┬────────────┬────────────┬────────────┬──────────────────────────────────┬────────────┬──────────────┬───────────┐
│ OverallQual                     │ GrLivArea  │ Neighborhood                     │ x1stFlrSF  │ TotalBsmtSF │ BsmtFinSF1 │ LotArea    │ GarageCars │ MSSubClass                       │ GarageArea │ YearRemodAdd │ YearBuilt │
│ CategoricalValue{Int64, UInt32} │ Float64    │ CategoricalValue{String, UInt32} │ Float64    │ Float64     │ Float64    │ Float64    │ Int64      │ CategoricalValue{String, UInt32} │ Float64    │ Int64        │ Int64     │
│ OrderedFactor{10}               │ Continuous │ Multiclass{25}                   │ Continuous │ Continuous  │ Continuous │ Continuous │ Count      │ Multiclass{15}                   │ Continuous │ Count        │ Count     │
├─────────────────────────────────┼────────────┼──────────────────────────────────┼────────────┼─────────────┼────────────┼────────────┼────────────┼──────────────────────────────────┼────────────┼──────────────┼───────────┤
│ 5                               │ 816.0      │ Mitchel                          │ 816.0      │ 816.0       │ 816.0      │ 6600.0     │ 2          │ _20                              │ 816.0      │ 2003         │ 1982      │
│ 8                               │ 2028.0     │ Timber                           │ 2028.0     │ 1868.0      │ 1460.0     │ 11443.0    │ 3          │ _20                              │ 880.0      │ 2006         │ 2005      │
│ 7                               │ 1509.0     │ Gilbert                          │ 807.0      │ 783.0       │ 0.0        │ 7875.0     │ 2          │ _60                              │ 393.0      │ 2003         │ 2003      │
└─────────────────────────────────┴────────────┴──────────────────────────────────┴────────────┴─────────────┴────────────┴────────────┴────────────┴──────────────────────────────────┴────────────┴──────────────┴───────────┘

and the target is a continuous vector:

@show y[1:3]
scitype(y)
y[1:3] = [138000.0, 369900.0, 180000.0]
AbstractVector{Continuous} (alias for AbstractArray{ScientificTypesBase.Continuous, 1})

so this is a standard regression problem with a mix of categorical and continuous input.
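
If you prefer to check the scientific types programmatically rather than reading them off the pretty-printed table above, a call to schema does the trick (a sketch; output omitted):

schema(X)   # lists column names, machine types and scientific types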

Dummy model

Remember that a model is just a container for hyperparameters; let's take a particularly simple one: the constant regressor.

creg = ConstantRegressor()
ConstantRegressor(
    distribution_type = Distributions.Normal)

Binding the model to data creates a machine, which will store training outcomes (fit-results):

cmach = machine(creg, X, y)
Machine{ConstantRegressor,…} trained 0 times; caches data
  model: MLJModels.ConstantRegressor{Distributions.Normal}
  args: 
    1:	Source @329 ⏎ `ScientificTypesBase.Table{Union{AbstractVector{ScientificTypesBase.Continuous}, AbstractVector{ScientificTypesBase.Count}, AbstractVector{ScientificTypesBase.Multiclass{15}}, AbstractVector{ScientificTypesBase.Multiclass{25}}, AbstractVector{ScientificTypesBase.OrderedFactor{10}}}}`
    2:	Source @651 ⏎ `AbstractVector{ScientificTypesBase.Continuous}`

You can now train the machine, specifying the rows it should be trained on (if unspecified, all rows are used):

train, test = partition(collect(eachindex(y)), 0.70, shuffle=true); # 70:30 split
fit!(cmach, rows=train)
ŷ = predict(cmach, rows=test)
ŷ[1:3] |> pprint
[Distributions.Normal{Float64}(μ=181016.7193326791, σ=78837.28933000278),
 Distributions.Normal{Float64}(μ=181016.7193326791, σ=78837.28933000278),
 Distributions.Normal{Float64}(μ=181016.7193326791, σ=78837.28933000278)]

Observe that the output is probabilistic: each element is a univariate normal distribution (with the same mean and variance for every observation, since the model is constant).
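
Since each element is a Distributions.jl distribution, the usual queries apply to it. Here is a small sketch (not part of the original output), assuming the Distributions package is available in your environment:

import Distributions
Statistics.mean(ŷ[1])             # mean of the predicted distribution
Statistics.std(ŷ[1])              # its standard deviation
Distributions.pdf(ŷ[1], 150_000)  # density assigned to a sale price of 150,000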

You can recover deterministic output either by computing the mean of the predictions yourself or by using predict_mean directly (the mean function can be applied to any distribution from Distributions.jl):

ŷ = predict_mean(cmach, rows=test)
ŷ[1:3]
3-element Vector{Float64}:
 181016.7193326791
 181016.7193326791
 181016.7193326791

You can then call one of the loss functions to assess the quality of the model, comparing its predictions with the ground truth on the test set:

rmsl(ŷ, y[test])
0.39288848618607314
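
As an aside, MLJ can handle the resampling and measurement in a single call with evaluate!; the sketch below uses a holdout split similar to the manual 70:30 split above (output omitted, and note that this refits the machine internally):

evaluate!(cmach, resampling=Holdout(fraction_train=0.7, shuffle=true),
          measure=rmsl, verbosity=0)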

KNN-Ridge blend

Let's try something a bit fancier than a constant regressor; the plan is to:

  • one-hot-encode categorical inputs

  • log-transform the target

  • fit both a KNN regression and a Ridge regression on the data

  • compute a weighted average of the individual model predictions

  • inverse transform (exponentiate) the blended prediction

You will first define a fixed model where all hyperparameters are specified or set to default. Then you will see how to create a model around a learning network that can be tuned.

RidgeRegressor = @load RidgeRegressor pkg="MultivariateStats"
KNNRegressor = @load KNNRegressor
import MLJMultivariateStatsInterface ✔
import NearestNeighborModels ✔
NearestNeighborModels.KNNRegressor
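
As an aside, if you are not sure which registered models can handle this particular mix of scientific types, MLJ can list the candidates for you; a sketch (output omitted):

models(matching(X, y))   # models whose input/target requirements match (X, y)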

Using the expanded syntax

Let's start by defining the source nodes:

Xs = source(X)
ys = source(y)
Source @163 ⏎ `AbstractVector{ScientificTypesBase.Continuous}`

On the "first layer", there's one hot encoder and a log transform, these will respectively lead to node W and node z:

hot = machine(OneHotEncoder(), Xs)

W = transform(hot, Xs)
z = log(ys);

On the "second layer", there's a KNN regressor and a ridge regressor, these lead to node ẑ₁ and ẑ₂

knn   = machine(KNNRegressor(K=5), W, z)
ridge = machine(RidgeRegressor(lambda=2.5), W, z)

ẑ₁ = predict(ridge, W)
ẑ₂ = predict(knn, W)
Node{Machine{KNNRegressor,…}}
  args:
    1:	Node{Machine{OneHotEncoder,…}}
  formula:
    predict(
        Machine{KNNRegressor,…}, 
        transform(
            Machine{OneHotEncoder,…}, 
            Source @052))

On the "third layer", there's a weighted combination of the two regression models:

ẑ = 0.3ẑ₁ + 0.7ẑ₂;

And finally we need to invert the initial transformation of the target (which was a log):

ŷ = exp(ẑ);

You've now defined a full learning network which you can fit and use for prediction:

fit!(ŷ, rows=train)
ypreds = ŷ(rows=test)
rmsl(y[test], ypreds)
0.17100586759139108
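
Since the prediction path of this network depends only on the input source, the terminal node can also be called on a table directly instead of on row indices. A sketch, re-using the test rows of the original table (it should give the same predictions as ŷ(rows=test)):

ŷ(X[test, :])   # call the terminal node on "new" data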

Tuning the model

So far the hyperparameters were given explicitly, but it makes more sense to learn them. For this, we define a model around the learning network, which can then be trained and tuned like any other model:

mutable struct KNNRidgeBlend <: DeterministicComposite
    knn_model::KNNRegressor
    ridge_model::RidgeRegressor
    knn_weight::Float64
end
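
MLJ models conventionally also provide a keyword constructor with default values; a hypothetical way to add one for this composite (not part of the original tutorial) is:

# convenience keyword constructor; the default values are illustrative only
KNNRidgeBlend(; knn_model=KNNRegressor(K=5),
                ridge_model=RidgeRegressor(lambda=2.5),
                knn_weight=0.3) =
    KNNRidgeBlend(knn_model, ridge_model, knn_weight)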

We must specify how such a model should be fit. This is effectively just the learning network defined before, except that the hyperparameters are now taken from the struct:

function MLJ.fit(model::KNNRidgeBlend, verbosity::Int, X, y)
    Xs = source(X)
    ys = source(y)
    hot = machine(OneHotEncoder(), Xs)
    W = transform(hot, Xs)
    z = log(ys)
    ridge_model = model.ridge_model
    knn_model = model.knn_model
    ridge = machine(ridge_model, W, z)
    knn = machine(knn_model, W, z)
    # and finally
    ẑ = model.knn_weight * predict(knn, W) + (1.0 - model.knn_weight) * predict(ridge, W)
    ŷ = exp(ẑ)

    mach = machine(Deterministic(), Xs, ys; predict=ŷ)
    return!(mach, model, verbosity)
end

Note: you really want verbosity=0 here, otherwise you will get a lot of verbose output during tuning!

You can now instantiate and fit such a model:

krb = KNNRidgeBlend(KNNRegressor(K=5), RidgeRegressor(lambda=2.5), 0.3)
mach = machine(krb, X, y)
fit!(mach, rows=train)

preds = predict(mach, rows=test)
rmsl(y[test], preds)
0.13456064548487007

But more interestingly, the hyperparameters of the model can be tuned.

Before we get started, it's important to note that the hyperparameters of the model have different levels of nesting. This becomes explicit when trying to access elements:

@show krb.knn_weight
@show krb.knn_model.K
@show krb.ridge_model.lambda
krb.knn_weight = 0.3
krb.knn_model.K = 5
krb.ridge_model.lambda = 2.5

You can also see all the hyperparameters using the params function:

params(krb) |> pprint
(knn_model = (K = 5,
              algorithm = :kdtree,
              metric = Distances.Euclidean(0.0),
              leafsize = 10,
              reorder = true,
              weights = NearestNeighborModels.Uniform()),
 ridge_model = (lambda = 2.5, bias = true),
 knn_weight = 0.3)

The ranges of values over which to tune the hyperparameters should follow the nesting structure reflected by params:

k_range = range(krb, :(knn_model.K), lower=2, upper=100, scale=:log10)
l_range = range(krb, :(ridge_model.lambda), lower=1e-4, upper=10, scale=:log10)
w_range = range(krb, :(knn_weight), lower=0.1, upper=0.9)

ranges = [k_range, l_range, w_range]
3-element Vector{MLJBase.NumericRange{T, MLJBase.Bounded, Symbol} where T}:
 NumericRange(2 ≤ knn_model.K ≤ 100; origin=51.0, unit=49.0) on log10 scale
 NumericRange(0.0001 ≤ ridge_model.lambda ≤ 10; origin=5.00005, unit=4.99995) on log10 scale
 NumericRange(0.1 ≤ knn_weight ≤ 0.9; origin=0.5, unit=0.4)
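
To see which values a Grid search at a given resolution would actually visit for one of these ranges, you can materialize the range with iterator; a sketch (output omitted):

iterator(k_range, 3)   # at resolution 3: the endpoints plus a midpoint on the log scale
iterator(w_range, 3)   # same idea, on the linear scale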

It now remains to define how the tuning should be done; let's specify a very coarse grid search with cross-validation and instantiate a tuned model:

tuning = Grid(resolution=3)
resampling = CV(nfolds=6)

tm = TunedModel(model=krb, tuning=tuning, resampling=resampling,
                ranges=ranges, measure=rmsl)
DeterministicTunedModel(
    model = KNNRidgeBlend(
            knn_model = KNNRegressor,
            ridge_model = RidgeRegressor,
            knn_weight = 0.3),
    tuning = Grid(
            goal = nothing,
            resolution = 3,
            shuffle = true,
            rng = Random._GLOBAL_RNG()),
    resampling = CV(
            nfolds = 6,
            shuffle = false,
            rng = Random._GLOBAL_RNG()),
    measure = RootMeanSquaredLogError(),
    weights = nothing,
    operation = nothing,
    range = MLJBase.NumericRange{T, MLJBase.Bounded, Symbol} where T[NumericRange(2 ≤ knn_model.K ≤ 100; origin=51.0, unit=49.0) on log10 scale, NumericRange(0.0001 ≤ ridge_model.lambda ≤ 10; origin=5.00005, unit=4.99995) on log10 scale, NumericRange(0.1 ≤ knn_weight ≤ 0.9; origin=0.5, unit=0.4)],
    selection_heuristic = MLJTuning.NaiveSelection(nothing),
    train_best = true,
    repeats = 1,
    n = nothing,
    acceleration = ComputationalResources.CPU1{Nothing}(nothing),
    acceleration_resampling = ComputationalResources.CPU1{Nothing}(nothing),
    check_measure = true,
    cache = true)

which we can now finally fit...

mtm = machine(tm, X, y)
fit!(mtm, rows=train);

To retrieve the best model, you can use:

krb_best = fitted_params(mtm).best_model
@show krb_best.knn_model.K
@show krb_best.ridge_model.lambda
@show krb_best.knn_weight
krb_best.knn_model.K = 2
krb_best.ridge_model.lambda = 0.03162277660168379
krb_best.knn_weight = 0.1
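
The tuning history can also be inspected through the machine's report; a sketch (output omitted):

r = report(mtm)
r.best_history_entry   # evaluation associated with the best model
first(r.history, 3)    # first few (model, measurement) entries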

You can also use mtm to make predictions (these will be made using the best model):

preds = predict(mtm, rows=test)
rmsl(y[test], preds)
0.12864316424030595
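
If you have a plotting backend available, the grid of evaluations can also be visualized directly from the tuned machine; a sketch, assuming the Plots package is installed:

using Plots
plot(mtm)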