Linear Pipelines

In MLJ a pipeline is a composite model in which models are chained together in a linear (non-branching) chain. For other arrangements, including custom architectures via learning networks, see Composing Models.

For purposes of illustration, consider a supervised learning problem with the following toy data:

using MLJ
X = (age    = [23, 45, 34, 25, 67],
     gender = categorical(['m', 'm', 'f', 'm', 'f']));
y = [67.0, 81.5, 55.6, 90.0, 61.1]

We would like to train using a K-nearest neighbor model, but the model type KNNRegressor assumes the features are all Continuous. This can be fixed by first:

  • coercing the :age feature to have Continuous type by replacing X with coerce(X, :age=>Continuous)
  • standardizing continuous features and one-hot encoding the Multiclass features using the ContinuousEncoder model

However, we can avoid separately applying these preprocessing steps (two of which require fit! steps) by combining them with the supervised KKNRegressor model in a new pipeline model, using Julia's |> syntax:

KNNRegressor = @load KNNRegressor pkg=NearestNeighborModels
pipe = (X -> coerce(X, :age=>Continuous)) |> ContinuousEncoder() |> KNNRegressor(K=2)
DeterministicPipeline(
  f = Main.var"#1#2"(), 
  continuous_encoder = ContinuousEncoder(
        drop_last = false, 
        one_hot_ordered_factors = false), 
  knn_regressor = KNNRegressor(
        K = 2, 
        algorithm = :kdtree, 
        metric = Distances.Euclidean(0.0), 
        leafsize = 10, 
        reorder = true, 
        weights = NearestNeighborModels.Uniform()), 
  cache = true)

We see above that pipe is a model whose hyperparameters are themselves other models or a function. (The names of these hyper-parameters are automatically generated. To specify your own names, use the explicit Pipeline constructor instead.)

The |> syntax can also be used to extend an existing pipeline or concatenate two existing pipelines. So, we could instead have defined:

pipe_transformer = (X -> coerce(X, :age=>Continuous)) |> ContinuousEncoder()
pipe = pipe_transformer |> KNNRegressor(K=2)

A pipeline is just a model like any other. For example, we can evaluate its performance on the data above:

evaluate(pipe, X, y, resampling=CV(nfolds=3), measure=mae)
PerformanceEvaluation object with these fields:
  model, measure, operation,
  measurement, per_fold, per_observation,
  fitted_params_per_fold, report_per_fold,
  train_test_rows, resampling, repeats
Extract:
┌──────────┬───────────┬─────────────┐
│ measure  │ operation │ measurement │
├──────────┼───────────┼─────────────┤
│ LPLoss(  │ predict   │ 11.3        │
│   p = 1) │           │             │
└──────────┴───────────┴─────────────┘
┌────────────────────┬─────────┐
│ per_fold           │ 1.96*SE │
├────────────────────┼─────────┤
│ [7.25, 17.2, 7.45] │ 7.88    │
└────────────────────┴─────────┘

To include target transformations in a pipeline, wrap the supervised component using TransformedTargetModel.

MLJBase.PipelineFunction
Pipeline(component1, component2, ... , componentk; options...)
Pipeline(name1=component1, name2=component2, ..., namek=componentk; options...)
component1 |> component2 |> ... |> componentk

Create an instance of a composite model type which sequentially composes the specified components in order. This means component1 receives inputs, whose output is passed to component2, and so forth. A "component" is either a Model instance, a model type (converted immediately to its default instance) or any callable object. Here the "output" of a model is what predict returns if it is Supervised, or what transform returns if it is Unsupervised.

Names for the component fields are automatically generated unless explicitly specified, as in

Pipeline(encoder=ContinuousEncoder(drop_last=false),
         stand=Standardizer())

The Pipeline constructor accepts keyword options discussed further below.

Ordinary functions (and other callables) may be inserted in the pipeline as shown in the following example:

Pipeline(X->coerce(X, :age=>Continuous), OneHotEncoder, ConstantClassifier)

Syntactic sugar

The |> operator is overloaded to construct pipelines out of models, callables, and existing pipelines:

LinearRegressor = @load LinearRegressor pkg=MLJLinearModels add=true
PCA = @load PCA pkg=MultivariateStats add=true

pipe1 = MLJBase.table |> ContinuousEncoder |> Standardizer
pipe2 = PCA |> LinearRegressor
pipe1 |> pipe2

At most one of the components may be a supervised model, but this model can appear in any position. A pipeline with a Supervised component is itself Supervised and implements the predict operation. It is otherwise Unsupervised (possibly Static) and implements transform.

Special operations

If all the components are invertible unsupervised models (ie, implement inverse_transform) then inverse_transform is implemented for the pipeline. If there are no supervised models, then predict is nevertheless implemented, assuming the last component is a model that implements it (some clustering models). Similarly, calling transform on a supervised pipeline calls transform on the supervised component.

Transformers that need a target in training

Some transformers that have type Unsupervised (so that the output of transform is propagated in pipelines) may require a target variable for training. An example are so-called target encoders (which transform categorical input features, based on some target observations). Provided they appear before any Supervised component in the pipelines, such models are supported. Of course a target must be provided whenever training such a pipeline, whether or not it contains a Supervised component.

Optional key-word arguments

  • prediction_type - prediction type of the pipeline; possible values: :deterministic, :probabilistic, :interval (default=:deterministic if not inferable)

  • operation - operation applied to the supervised component model, when present; possible values: predict, predict_mean, predict_median, predict_mode (default=predict)

  • cache - whether the internal machines created for component models should cache model-specific representations of data (see machine) (default=true)

Warning

Set cache=false to guarantee data anonymization.

To build more complicated non-branching pipelines, refer to the MLJ manual sections on composing models.

source