Linear Pipelines
In MLJ a pipeline is a composite model in which models are chained together in a linear (non-branching) chain. For other arrangements, including custom architectures via learning networks, see Composing Models.
For purposes of illustration, consider a supervised learning problem with the following toy data:
using MLJ
X = (age = [23, 45, 34, 25, 67],
     gender = categorical(['m', 'm', 'f', 'm', 'f']));
y = [67.0, 81.5, 55.6, 90.0, 61.1]
We would like to train using a K-nearest neighbor model, but the model type KNNRegressor assumes the features are all Continuous. This can be fixed by first:
- coercing the :age feature to have Continuous scitype, by replacing X with coerce(X, :age=>Continuous)
- one-hot encoding the Multiclass features using the ContinuousEncoder model
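Applied separately, these two steps look like the following sketch (assuming the toy data X defined above; Xc and Xready are our own variable names):

```julia
using MLJ

X = (age = [23, 45, 34, 25, 67],
     gender = categorical(['m', 'm', 'f', 'm', 'f']))

# Step 1: coerce `:age` to Continuous (an ordinary function call, no fit!):
Xc = coerce(X, :age=>Continuous)

# Step 2: one-hot encode the Multiclass `:gender` feature; this transformer
# must be trained with its own fit! step:
mach = machine(ContinuousEncoder(), Xc) |> fit!
Xready = transform(mach, Xc)  # all features now Continuous
```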
However, we can avoid applying these preprocessing steps separately (the encoding step requiring its own fit! step) by combining them with the supervised KNNRegressor model in a new pipeline model, using Julia's |> syntax:
KNNRegressor = @load KNNRegressor pkg=NearestNeighborModels
pipe = (X -> coerce(X, :age=>Continuous)) |> ContinuousEncoder() |> KNNRegressor(K=2)
DeterministicPipeline(
  f = Main.var"#1#2"(),
  continuous_encoder = ContinuousEncoder(
        drop_last = false,
        one_hot_ordered_factors = false),
  knn_regressor = KNNRegressor(
        K = 2,
        algorithm = :kdtree,
        metric = Distances.Euclidean(0.0),
        leafsize = 10,
        reorder = true,
        weights = NearestNeighborModels.Uniform()),
  cache = true)
We see above that pipe is a model whose hyper-parameters are themselves other models or a function. (The names of these hyper-parameters are automatically generated. To specify your own names, use the explicit Pipeline constructor instead.)
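For example, here is a sketch using the explicit Pipeline constructor to choose the hyper-parameter names ourselves (the names coercer, encoder and knn are our own):

```julia
using MLJ
KNNRegressor = @load KNNRegressor pkg=NearestNeighborModels

pipe = Pipeline(
    coercer = X -> coerce(X, :age=>Continuous),
    encoder = ContinuousEncoder(),
    knn = KNNRegressor(K=2),
)

# nested hyper-parameters are then addressed as, e.g.:
pipe.knn.K
```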
The |> syntax can also be used to extend an existing pipeline or to concatenate two existing pipelines. So, we could instead have defined:
pipe_transformer = (X -> coerce(X, :age=>Continuous)) |> ContinuousEncoder()
pipe = pipe_transformer |> KNNRegressor(K=2)
A pipeline is just a model like any other. For example, we can evaluate its performance on the data above:
evaluate(pipe, X, y, resampling=CV(nfolds=3), measure=mae)
PerformanceEvaluation object with these fields:
model, measure, operation,
measurement, per_fold, per_observation,
fitted_params_per_fold, report_per_fold,
train_test_rows, resampling, repeats
Extract:
┌──────────┬───────────┬─────────────┐
│ measure │ operation │ measurement │
├──────────┼───────────┼─────────────┤
│ LPLoss( │ predict │ 11.3 │
│ p = 1) │ │ │
└──────────┴───────────┴─────────────┘
┌────────────────────┬─────────┐
│ per_fold │ 1.96*SE │
├────────────────────┼─────────┤
│ [7.25, 17.2, 7.45] │ 7.88 │
└────────────────────┴─────────┘
To include target transformations in a pipeline, wrap the supervised component using TransformedTargetModel.
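For example, here is a sketch (assuming standardizing the target is appropriate for the data at hand) in which the target is standardized during training, with predictions automatically inverse-transformed back to the original scale:

```julia
using MLJ
KNNRegressor = @load KNNRegressor pkg=NearestNeighborModels

# wrap the supervised component so it sees a standardized target:
knn = TransformedTargetModel(KNNRegressor(K=2), transformer=Standardizer())
pipe = (X -> coerce(X, :age=>Continuous)) |> ContinuousEncoder() |> knn
```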
MLJBase.Pipeline — Function

Pipeline(component1, component2, ... , componentk; options...)
Pipeline(name1=component1, name2=component2, ..., namek=componentk; options...)
component1 |> component2 |> ... |> componentk
Create an instance of a composite model type which sequentially composes the specified components in order: component1 receives the input, its output is passed to component2, and so forth. A "component" is either a Model instance, a model type (converted immediately to its default instance), or any callable object. Here the "output" of a model is what predict returns if it is Supervised, or what transform returns if it is Unsupervised.
Names for the component fields are automatically generated unless explicitly specified, as in
Pipeline(encoder=ContinuousEncoder(drop_last=false),
stand=Standardizer())
The Pipeline constructor accepts keyword options discussed further below.
Ordinary functions (and other callables) may be inserted in the pipeline as shown in the following example:
Pipeline(X->coerce(X, :age=>Continuous), OneHotEncoder, ConstantClassifier)
Syntactic sugar
The |> operator is overloaded to construct pipelines out of models, callables, and existing pipelines:
LinearRegressor = @load LinearRegressor pkg=MLJLinearModels add=true
PCA = @load PCA pkg=MultivariateStats add=true
pipe1 = MLJBase.table |> ContinuousEncoder |> Standardizer
pipe2 = PCA |> LinearRegressor
pipe1 |> pipe2
At most one of the components may be a supervised model, but this model can appear in any position. A pipeline with a Supervised component is itself Supervised and implements the predict operation. It is otherwise Unsupervised (possibly Static) and implements transform.
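For instance, a pipeline containing a Supervised component can be trained and used for prediction like any other supervised model; a minimal sketch using the toy data from earlier:

```julia
using MLJ
KNNRegressor = @load KNNRegressor pkg=NearestNeighborModels

X = (age = [23, 45, 34, 25, 67],
     gender = categorical(['m', 'm', 'f', 'm', 'f']))
y = [67.0, 81.5, 55.6, 90.0, 61.1]

pipe = (X -> coerce(X, :age=>Continuous)) |> ContinuousEncoder() |> KNNRegressor(K=2)

mach = machine(pipe, X, y) |> fit!
yhat = predict(mach, X)  # the pipeline is Supervised, so predict is available
```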
Special operations
If all the components are invertible unsupervised models (i.e., implement inverse_transform), then inverse_transform is implemented for the pipeline. If there are no supervised models, then predict is nevertheless implemented, assuming the last component is a model that implements it (as is the case for some clustering models). Similarly, calling transform on a supervised pipeline calls transform on the supervised component.
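As a sketch of the invertible case, here is a pipeline of two univariate transformers, both of which we assume implement inverse_transform:

```julia
using MLJ

# both components are invertible unsupervised models:
pipe = UnivariateBoxCoxTransformer() |> UnivariateStandardizer()

v = [0.2, 0.5, 1.1, 2.3]  # Box-Cox requires strictly positive data
mach = machine(pipe, v) |> fit!
w = transform(mach, v)
v_recovered = inverse_transform(mach, w)  # approximately v
```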
Transformers that need a target in training
Some transformers that have type Unsupervised (so that the output of transform is propagated in pipelines) may require a target variable for training. An example is the class of so-called target encoders (which transform categorical input features based on some target observations). Provided they appear before any Supervised component in the pipeline, such models are supported. Of course, a target must be provided whenever training such a pipeline, whether or not it contains a Supervised component.
Optional key-word arguments
- prediction_type - prediction type of the pipeline; possible values: :deterministic, :probabilistic, :interval (default=:deterministic if not inferable)
- operation - operation applied to the supervised component model, when present; possible values: predict, predict_mean, predict_median, predict_mode (default=predict)
- cache - whether the internal machines created for component models should cache model-specific representations of data (see machine) (default=true)
Set cache=false
to guarantee data anonymization.
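For example, a sketch passing this option to the explicit constructor (assuming the KNNRegressor type has been loaded as above):

```julia
using MLJ
KNNRegressor = @load KNNRegressor pkg=NearestNeighborModels

pipe = Pipeline(
    ContinuousEncoder(),
    KNNRegressor(K=2);
    cache=false,  # do not cache data representations in the internal machines
)
```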
To build more complicated non-branching pipelines, refer to the MLJ manual sections on composing models.