Composing models
A tutorial showing how to wrap a supervised model in input feature preprocessing (creating a pipeline model) and in transformations of the target.
Let's start by generating some dummy data with both numerical and categorical values:
using MLJ
import StableRNGs.StableRNG
RidgeRegressor = @load RidgeRegressor pkg=MLJLinearModels
import MLJLinearModels ✔
MLJLinearModels.RidgeRegressor
Here's a table of input features:
X = (age = [23, 45, 34, 25, 67],
gender = categorical(['m', 'm', 'f', 'm', 'f']))
(age = [23, 45, 34, 25, 67],
gender = CategoricalArrays.CategoricalValue{Char, UInt32}['m', 'm', 'f', 'm', 'f'],)
And a vector target (height in mm):
y = Float64[1780, 1940, 1650, 1730, 1680];
Note that the scientific type of age is Count here:
schema(X)
┌────────┬───────────────┬────────────────────────────────┐
│ names │ scitypes │ types │
├────────┼───────────────┼────────────────────────────────┤
│ age │ Count │ Int64 │
│ gender │ Multiclass{2} │ CategoricalValue{Char, UInt32} │
└────────┴───────────────┴────────────────────────────────┘
We will want to coerce that to Continuous so that it can be given to a regressor that expects such values.
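To see the effect of such a coercion on its own, here's a quick sketch (the name Xfixed is just illustrative; below, the coercion is instead folded into a pipeline):
Xfixed = coerce(X, :age => Continuous)  # age is now Continuous (Float64)
schema(Xfixed)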
A typical workflow for such data is to one-hot-encode the categorical features and then apply a regression model.
Let's say that we want to apply the following steps:
1. One-hot encode the categorical features in X (see the sketch after this list)
2. Apply a learned Box-Cox transformation to the target y
3. Train a ridge regression model on the one-hot encoded data and the transformed target
4. Return target predictions on the original scale
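Before building the full pipeline, here's a standalone sketch of the one-hot encoding step (the names hot_mach and W are ours; the pipeline below performs this step automatically):
hot_mach = machine(OneHotEncoder(), X) |> fit!
W = transform(hot_mach, X)  # gender is replaced by one indicator column per class
schema(W)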
First, we wrap our supervised model in the target transformation we want:
transformed_target_model = TransformedTargetModel(
RidgeRegressor();
transformer=UnivariateBoxCoxTransformer(),
)
TransformedTargetModelDeterministic(
model = RidgeRegressor(
lambda = 1.0,
fit_intercept = true,
penalize_intercept = false,
scale_penalty_with_samples = true,
solver = nothing),
transformer = UnivariateBoxCoxTransformer(
n = 171,
shift = false),
inverse = nothing,
cache = true)
Such a model internally transforms the target by applying the Box-Cox transformation (the one that makes the data look most Gaussian) before using it to train the ridge regressor, but it returns target predictions on the original, untransformed scale. Here's a demonstration (with continuous data):
rng = StableRNG(123)
Xcont = (x1 = rand(rng, 5), x2 = rand(rng, 5))  # seed both columns for reproducibility
mach = machine(transformed_target_model, Xcont, y) |> fit!
yhat = predict(mach, Xcont)
5-element Vector{Float64}:
1751.8669503721433
1761.0380391714327
1745.4164443644154
1749.3333949147136
1752.3807141222846
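To see what the wrapper is doing internally, here's a sketch applying UnivariateBoxCoxTransformer on its own (the names box_mach and z are ours):
box_mach = machine(UnivariateBoxCoxTransformer(), y) |> fit!
z = transform(box_mach, y)            # target on the Box-Cox scale
inverse_transform(box_mach, z) ≈ y    # true: predictions are mapped back like this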
In case you need convincing, removing the target transformation indeed gives a different outcome:
mach = machine(RidgeRegressor(), Xcont, y) |> fit!
yhat - predict(mach, Xcont)
5-element Vector{Float64}:
-3.886929751397247
-4.432585089219401
-3.8908797851088366
-3.6635370418441653
-4.090525387439129
Next we insert our target-transformed model into a pipeline, to create a new model which includes the input data pre-processing we want:
pipe = (X -> coerce(X, :age=>Continuous)) |> OneHotEncoder() |> transformed_target_model
DeterministicPipeline(
f = var"#1#2"(),
one_hot_encoder = OneHotEncoder(
features = Symbol[],
drop_last = false,
ordered_factor = true,
ignore = false),
transformed_target_model_deterministic = TransformedTargetModelDeterministic(
model = RidgeRegressor(lambda = 1.0, …),
transformer = UnivariateBoxCoxTransformer(n = 171, …),
inverse = nothing,
cache = true),
cache = true)
The first element in the pipeline is just an ordinary function to coerce the :age variable to Continuous (needed because RidgeRegressor expects Continuous input).
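For the record, the |> syntax used above is shorthand for MLJ's Pipeline constructor; an equivalent construction looks like this (pipe2 is just an illustrative name):
pipe2 = Pipeline(
    X -> coerce(X, :age => Continuous),  # same coercion step
    OneHotEncoder(),
    transformed_target_model,
)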
Hyperparameters of this pipeline can be accessed (and set) using dot syntax:
pipe.transformed_target_model_deterministic.model.lambda = 10.0
pipe.one_hot_encoder.drop_last = true;
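These nested names can also be referenced when tuning. As a sketch, here's how one might wrap the pipeline in a TunedModel searching over the ridge penalty (the range bounds here are arbitrary choices for illustration):
r = range(pipe, :(transformed_target_model_deterministic.model.lambda);
          lower=0.01, upper=100.0, scale=:log)
tuned_pipe = TunedModel(model=pipe, range=r, resampling=CV(nfolds=3), measure=l1)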
Evaluation of a pipeline can be done with the evaluate method:
evaluate(pipe, X, y, resampling=CV(nfolds=3), measure=l1)
PerformanceEvaluation object with these fields:
model, measure, operation,
measurement, per_fold, per_observation,
fitted_params_per_fold, report_per_fold,
train_test_rows, resampling, repeats
Extract:
┌──────────┬───────────┬─────────────┐
│ measure │ operation │ measurement │
├──────────┼───────────┼─────────────┤
│ LPLoss( │ predict │ 187.0 │
│ p = 1) │ │ │
└──────────┴───────────┴─────────────┘
┌───────────────────────┬─────────┐
│ per_fold │ 1.96*SE │
├───────────────────────┼─────────┤
│ [168.0, 141.0, 314.0] │ 129.0 │
└───────────────────────┴─────────┘
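The returned PerformanceEvaluation object can also be queried programmatically; a sketch (e is our name for the return value):
e = evaluate(pipe, X, y, resampling=CV(nfolds=3), measure=l1)
e.measurement[1]   # aggregated l1 loss over the folds
e.per_fold[1]      # the per-fold losses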