Composing models
A tutorial showing how to wrap a supervised model in input feature preprocessing (creating a pipeline model) and in transformations of the target.
Let's start by generating some dummy data with both numerical and categorical values:
using MLJ
import StableRNGs.StableRNG
RidgeRegressor = @load RidgeRegressor pkg=MLJLinearModels
import MLJLinearModels ✔
MLJLinearModels.RidgeRegressor
Here's a table of input features:
X = (age = [23, 45, 34, 25, 67],
gender = categorical(['m', 'm', 'f', 'm', 'f']))
(age = [23, 45, 34, 25, 67],
gender = CategoricalArrays.CategoricalValue{Char, UInt32}['m', 'm', 'f', 'm', 'f'],)
And a vector target (height in mm):
y = Float64[1780, 1940, 1650, 1730, 1680];
Note that the scientific type of age is Count here:
schema(X)
┌────────┬───────────────┬────────────────────────────────┐
│ names │ scitypes │ types │
├────────┼───────────────┼────────────────────────────────┤
│ age │ Count │ Int64 │
│ gender │ Multiclass{2} │ CategoricalValue{Char, UInt32} │
└────────┴───────────────┴────────────────────────────────┘
We will want to coerce that to Continuous so that it can be given to a regressor that expects such values.
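To see the effect of such a coercion on its own, here's a quick sketch (the name Xfixed is just illustrative; below, the coercion is instead folded into a pipeline):
Xfixed = coerce(X, :age => Continuous)  # age is now Continuous (Float64)
schema(Xfixed)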
A typical workflow for such data is to one-hot-encode the categorical features and then apply a regression model.
Let's say that we want to apply the following steps:
1. One-hot encode the categorical features in X (see the sketch after this list)
2. Apply a learned Box-Cox transformation to the target y
3. Train a ridge regression model on the one-hot encoded data and the transformed target
4. Return target predictions on the original scale
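Before building the full pipeline, here's a standalone sketch of the one-hot encoding step (the names hot_mach and W are ours; the pipeline below performs this step automatically):
hot_mach = machine(OneHotEncoder(), X) |> fit!
W = transform(hot_mach, X)  # gender is replaced by one indicator column per class
schema(W)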
First, we wrap our supervised model in the target transformation we want:
transformed_target_model = TransformedTargetModel(
RidgeRegressor();
transformer=UnivariateBoxCoxTransformer(),
)
TransformedTargetModelDeterministic(
model = RidgeRegressor(
lambda = 1.0,
fit_intercept = true,
penalize_intercept = false,
scale_penalty_with_samples = true,
solver = nothing),
transformer = UnivariateBoxCoxTransformer(
n = 171,
shift = false),
inverse = nothing,
cache = true)
Such a model internally transforms the target by applying the Box-Cox transformation (the one that makes the data look most Gaussian) before using it to train the ridge regressor, but it returns target predictions on the original, untransformed scale. Here's a demonstration (with continuous data):
rng = StableRNG(123)
Xcont = (x1 = rand(rng, 5), x2 = rand(rng, 5))  # seed both columns for reproducibility
mach = machine(transformed_target_model, Xcont, y) |> fit!
yhat = predict(mach, Xcont)
5-element Vector{Float64}:
1751.8669503721433
1761.0380391714327
1745.4164443644154
1749.3333949147136
1752.3807141222846
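To see what the wrapper is doing internally, here's a sketch applying UnivariateBoxCoxTransformer on its own (the names box_mach and z are ours):
box_mach = machine(UnivariateBoxCoxTransformer(), y) |> fit!
z = transform(box_mach, y)            # target on the Box-Cox scale
inverse_transform(box_mach, z) ≈ y    # true: predictions are mapped back like this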
In case you need convincing, removing the target transformation indeed gives a different outcome:
mach = machine(RidgeRegressor(), Xcont, y) |> fit!
yhat - predict(mach, Xcont)
5-element Vector{Float64}:
-3.886929751397247
-4.432585089219401
-3.8908797851088366
-3.6635370418441653
-4.090525387439129
Next we insert our target-transformed model into a pipeline, to create a new model which includes the input data pre-processing we want:
pipe = (X -> coerce(X, :age=>Continuous)) |> OneHotEncoder() |> transformed_target_model
DeterministicPipeline(
f = var"#1#2"(),
one_hot_encoder = OneHotEncoder(
features = Symbol[],
drop_last = false,
ordered_factor = true,
ignore = false),
transformed_target_model_deterministic = TransformedTargetModelDeterministic(
model = RidgeRegressor(lambda = 1.0, …),
transformer = UnivariateBoxCoxTransformer(n = 171, …),
inverse = nothing,
cache = true),
cache = true)
The first element in the pipeline is just an ordinary function to coerce the :age variable to Continuous (needed because RidgeRegressor expects Continuous input).
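For the record, the |> syntax used above is shorthand for MLJ's Pipeline constructor; an equivalent construction looks like this (pipe2 is just an illustrative name):
pipe2 = Pipeline(
    X -> coerce(X, :age => Continuous),  # same coercion step
    OneHotEncoder(),
    transformed_target_model,
)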
Hyperparameters of this pipeline can be accessed (and set) using dot syntax:
pipe.transformed_target_model_deterministic.model.lambda = 10.0
pipe.one_hot_encoder.drop_last = true;
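These nested names can also be referenced when tuning. As a sketch, here's how one might wrap the pipeline in a TunedModel searching over the ridge penalty (the range bounds here are arbitrary choices for illustration):
r = range(pipe, :(transformed_target_model_deterministic.model.lambda);
          lower=0.01, upper=100.0, scale=:log)
tuned_pipe = TunedModel(model=pipe, range=r, resampling=CV(nfolds=3), measure=l1)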
Evaluation of a pipeline can be done with the evaluate method:
evaluate(pipe, X, y, resampling=CV(nfolds=3), measure=l1)
PerformanceEvaluation object with these fields:
model, measure, operation,
measurement, per_fold, per_observation,
fitted_params_per_fold, report_per_fold,
train_test_rows, resampling, repeats
Extract:
┌──────────┬───────────┬─────────────┐
│ measure │ operation │ measurement │
├──────────┼───────────┼─────────────┤
│ LPLoss( │ predict │ 187.0 │
│ p = 1) │ │ │
└──────────┴───────────┴─────────────┘
┌───────────────────────┬─────────┐
│ per_fold │ 1.96*SE │
├───────────────────────┼─────────┤
│ [168.0, 141.0, 314.0] │ 129.0 │
└───────────────────────┴─────────┘
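The returned PerformanceEvaluation object can also be queried programmatically; a sketch (e is our name for the return value):
e = evaluate(pipe, X, y, resampling=CV(nfolds=3), measure=l1)
e.measurement[1]   # aggregated l1 loss over the folds
e.per_fold[1]      # the per-fold losses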