Composing models

To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions.

If you have questions or suggestions about this tutorial, please open an issue here.

Generating dummy data

Let's start by generating some dummy data with both numerical values and categorical values:

using MLJ
using PrettyPrinting

KNNRegressor = @load KNNRegressor
# input
X = (age    = [23, 45, 34, 25, 67],
     gender = categorical(['m', 'm', 'f', 'm', 'f']))
# target
height = [178, 194, 165, 173, 168];
import NearestNeighborModels ✔

Note that the scientific type of age is Count here:

scitype(X.age)
AbstractVector{Count} (alias for AbstractArray{ScientificTypesBase.Count, 1})

We will want to coerce that to Continuous so that it can be given to a regressor that expects such values.
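
As a minimal sketch of that coercion on its own, outside any pipeline and assuming MLJ is installed: coerce returns a new table rather than modifying X in place. The name Xc is introduced here for illustration only.

```julia
using MLJ

X = (age    = [23, 45, 34, 25, 67],
     gender = categorical(['m', 'm', 'f', 'm', 'f']))

# coerce returns a new table; the original X is left unchanged
Xc = coerce(X, :age => Continuous)

scitype(Xc.age)   # a vector scitype with Continuous elements
eltype(Xc.age)    # the integers have been converted to Float64
schema(Xc)        # lists the scitype and machine type of every column
```

Inside the pipeline declared later in this tutorial, the same call is wrapped in an anonymous function so that it is applied to whatever data reaches that step.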

Declaring a pipeline

A typical workflow for such data is to one-hot-encode the categorical data and then apply some regression model on the data. Let's say that we want to apply the following steps:

  1. One-hot encode the categorical features in X

  2. Standardize the target variable (:height)

  3. Train a KNN regression model on the one-hot-encoded data and the standardized target.

The Pipeline constructor helps you define such a simple (non-branching) pipeline of steps to be applied in order:

pipe = Pipeline(
    coercer = X -> coerce(X, :age=>Continuous),
    one_hot_encoder = OneHotEncoder(),
    transformed_target_model = TransformedTargetModel(
        model = KNNRegressor(K=3);
        target=UnivariateStandardizer()
    )
)
DeterministicPipeline(
    coercer = var"#1#2"(),
    one_hot_encoder = OneHotEncoder(
            features = Symbol[],
            drop_last = false,
            ordered_factor = true,
            ignore = false),
    transformed_target_model = TransformedTargetModelDeterministic(
            model = KNNRegressor,
            target = UnivariateStandardizer,
            inverse = nothing,
            cache = true),
    cache = true)

Note the coercion of the :age variable to Continuous, since KNNRegressor expects Continuous input. Note also the TransformedTargetModel wrapper, which learns a transformation of the target variable (in this case standardization), trains the KNNRegressor on the transformed target, and returns predictions on the original scale by applying the inverse transformation.
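
To make the idea of a learned target transformation concrete, here is a hand-written sketch using only the Statistics standard library. It is not how TransformedTargetModel is implemented; the names mu, sigma, z and restored are illustrative, and the use of the sample standard deviation is an assumption.

```julia
using Statistics

height = [178, 194, 165, 173, 168]

# "Fit": learn the location and scale of the target
mu, sigma = mean(height), std(height)

# Transform: a regressor would be trained on z rather than on height
z = (height .- mu) ./ sigma

# Inverse transform: map values on the standardized scale back to the original one
restored = z .* sigma .+ mu
```

In a resampling setting, the location and scale would be learned from the training rows only and then reused to invert predictions for the test rows.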

Hyperparameters of this pipeline can be accessed (and set) using dot syntax:

pipe.transformed_target_model.model.K = 2
pipe.one_hot_encoder.drop_last = true;
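
The effect of drop_last can be sketched in plain Julia. The helper one_hot below is hypothetical and is not part of MLJ; it only mimics the idea of producing one indicator column per level and optionally omitting the column for the last level.

```julia
gender = ['m', 'm', 'f', 'm', 'f']
lvls = sort(unique(gender))   # ['f', 'm']

# Hypothetical helper: one indicator column per level,
# optionally without the column for the last level
function one_hot(vals, lvls; drop_last=false)
    kept = drop_last ? lvls[1:end-1] : lvls
    return Dict(l => Float64.(vals .== l) for l in kept)
end

full    = one_hot(gender, lvls)                   # columns for 'f' and 'm'
reduced = one_hot(gender, lvls; drop_last=true)   # column for 'f' only
```

With both columns present, every row sums to one, so the last column carries no extra information; dropping it removes that redundancy.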

A pipe can be evaluated with the evaluate method (evaluate! is the counterpart for a model that has already been wrapped in a machine); it implicitly constructs the machines that hold the fitted parameters:

evaluate(
    pipe,
    X,
    height,
    resampling=Holdout(),
    measure=rms
) |> pprint
PerformanceEvaluation(11.5,)
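
For comparison, the machine can also be built explicitly. The sketch below assumes MLJ and NearestNeighborModels are installed. To stay short it uses a simplified pipeline without the target transformation, and a Float64 target so that its scientific type is Continuous; simple_pipe, knn, mach, yhat and e are names chosen here for illustration.

```julia
using MLJ

KNNRegressor = @load KNNRegressor pkg=NearestNeighborModels verbosity=0

X = (age    = [23, 45, 34, 25, 67],
     gender = categorical(['m', 'm', 'f', 'm', 'f']))
height = [178.0, 194.0, 165.0, 173.0, 168.0]   # floats, so the target is Continuous

# Simplified pipeline (no target transformation), assumed here for brevity
simple_pipe = Pipeline(
    coercer = X -> coerce(X, :age => Continuous),
    one_hot_encoder = OneHotEncoder(),
    knn = KNNRegressor(K=2)
)

mach = machine(simple_pipe, X, height)   # bind the model to the data
fit!(mach, verbosity=0)                  # learn the fitted parameters of every step
yhat = predict(mach, X)                  # one prediction per row of X

# evaluate! works on a machine; evaluate works on a model plus data
e = evaluate!(mach, resampling=Holdout(), measure=rms, verbosity=0)
```

Working with an explicit machine is useful when the fitted pipeline is needed afterwards, for example to call predict on new data.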