Adding Models for General Use

This guide outlines in detail the specification of the MLJ model interface and provides guidelines for implementing the interface for models intended for general use. For sample implementations, see MLJModels/src.

The machine learning tools provided by MLJ can be applied to the models in any package that imports the package MLJBase and implements the API defined there, as outlined below. For a quick-and-dirty implementation of user-defined models see Simple User Defined Models. To make new models available to all MLJ users, see Where to place code implementing new models.

It is assumed the reader has read Getting Started. To implement the API described here, some familiarity with the following packages is also helpful: Tables.jl (for handling input data), Distributions.jl (for probabilistic predictions), and CategoricalArrays.jl (for targets with Multiclass or FiniteOrderedFactor scitype).

In MLJ, the basic interface exposed to the user, built atop the model interface described here, is the machine interface. After a first reading of this document, the reader may wish to refer to MLJ Internals for context.

Overview

A model is an object storing hyperparameters associated with some machine learning algorithm. In MLJ, hyperparameters include configuration parameters, like the number of threads, and special instructions, such as "compute feature rankings", which may or may not affect the final learning outcome. However, the logging level (verbosity below) is excluded.

The name of the Julia type associated with a model indicates the associated algorithm (e.g., DecisionTreeClassifier). The outcome of training a learning algorithm is called a fit-result. For multiple linear regression, for example, this would be the coefficients and intercept. For a general supervised model, it is the (generally minimal) information needed to make new predictions.

The ultimate supertype of all models is MLJBase.Model, which has two abstract subtypes:

abstract type Supervised{R} <: Model end
abstract type Unsupervised <: Model end

Here the parameter R refers to a fit-result type. By declaring a model to be a subtype of MLJBase.Supervised{R} you guarantee the fit-result to be of type R and, if R is concrete, you may improve the performance of homogeneous ensembles of the model (as defined by the built-in MLJ EnsembleModel wrapper). There is no abstract type for fit-results because these types are generally declared outside of MLJBase.

WIP: The necessity to declare the fitresult type R may disappear in the future (issue #93).

Supervised models are further divided according to whether they furnish probabilistic predictions of the target(s) (which they do by default) or directly predict "point" estimates, for each new input pattern:

abstract type Probabilistic{R} <: Supervised{R} end
abstract type Deterministic{R} <: Supervised{R} end

Further division of model types is realized through Trait declarations.

Associated with every concrete subtype of Model there must be a fit method, which implements the associated algorithm to produce the fit-result. Additionally, every Supervised model has a predict method, while Unsupervised models must have a transform method. More generally, methods such as these, that are dispatched on a model instance and a fit-result (plus other data), are called operations. Probabilistic supervised models optionally implement a predict_mode operation (in the case of classifiers) or predict_mean and/or predict_median operations (in the case of regressors), overriding obvious fallbacks provided by MLJBase. Unsupervised models may implement an inverse_transform operation.
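
To give the flavor of such fallbacks, the predict_mean fallback provided by MLJBase is essentially equivalent to the following (a sketch, not the literal definition):

import Distributions: mean

# broadcast mean over the vector of predicted distributions:
predict_mean(model::Probabilistic, fitresult, Xnew) = mean.(predict(model, fitresult, Xnew))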

New model type declarations and optional clean! method

Here is an example of a concrete supervised model type declaration, made after defining an appropriate fit-result type (an optional step):

import MLJBase

struct LinearFitResult{F<:AbstractFloat} <: MLJBase.MLJType
    coefficients::Vector{F}
    bias::F
end

mutable struct RidgeRegressor{F} <: MLJBase.Deterministic{LinearFitResult{F}}
    target_type::Type{F}
    lambda::Float64
end

Note. Model fields may be of any type except NamedTuple. (This is because named tuples are used to represent the nested hyperparameters of composite models, i.e., models that have other models as fields.)

Models (which are mutable) should not be given internal constructors. It is recommended that they instead be given an external lazy keyword constructor of the same name. This constructor defines default values for every field and optionally corrects invalid field values by calling a clean! method (whose fallback returns an empty message string):

function MLJBase.clean!(model::RidgeRegressor)
    warning = ""
    if model.lambda < 0
        warning *= "Need lambda ≥ 0. Resetting lambda=0. "
        model.lambda = 0
    end
    return warning
end

# keyword constructor
function RidgeRegressor(; target_type=Float64, lambda=0.0)
    model = RidgeRegressor(target_type, lambda)
    message = MLJBase.clean!(model)
    isempty(message) || @warn message
    return model
end
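
Under these definitions, constructing a model with an invalid hyperparameter issues a warning and corrects the value:

model = RidgeRegressor(lambda=-1.0)  # warns: "Need lambda ≥ 0. Resetting lambda=0."
model.lambda                         # 0.0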

Supervised models

Below we describe the compulsory and optional methods to be specified for each concrete type SomeSupervisedModelType{R} <: MLJBase.Supervised{R}.

The form of data for fitting and predicting

In every circumstance, the argument X passed to the fit method described below, and the argument Xnew of the predict method, will be some table supporting the Tables.jl API. The interface implementer can control the scientific type of data appearing in X with an appropriate input_scitype_union declaration (see Trait declarations below), as Union{scitypes(X)...} <: input_scitype_union(SomeSupervisedModelType) will always hold. See Convenience methods below for the definition of scitypes. If the core algorithm requires data in a different or more specific form, then fit will need to coerce the table into the form desired. To this end, MLJ provides the convenience method MLJBase.matrix; MLJBase.matrix(Xtable) has type Matrix{T}, where T is the tightest common type of the elements of Xtable, and Xtable is any table.

Tables.jl has recently added a matrix method as well.

Other convenience methods provided by MLJBase for handling tabular data are: selectrows, selectcols, select, schema (for extracting the size, names and eltypes of a table) and table (for materializing an abstract matrix, or named tuple of vectors, as a table matching a given prototype). See Convenience methods below for details.

Note that generally the same type coercions applied to X by fit will need to be applied by predict to Xnew.
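For example, if the core algorithm expects a matrix, fit might begin like this (a sketch; SomePackage and its fit method are hypothetical placeholders):

function MLJBase.fit(model::SomeSupervisedModelType, verbosity::Int, X, y)
    Xmatrix = MLJBase.matrix(X)  # coerce the table to a Matrix; rows are patterns
    core_fitresult = SomePackage.fit(Xmatrix, y, verbosity=verbosity)
    return core_fitresult, nothing, NamedTuple()
end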

Important convention It is to be understood that the columns of the table X correspond to features and the rows to patterns.

For univariate targets, y is always a Vector or CategoricalVector, according to the value of the trait:

target_scitype_union(SomeSupervisedModelType)    type of y            a supertype of eltype(y)
Continuous                                       Vector               Real
<: Multiclass                                    CategoricalVector    Union{CategoricalString, CategoricalValue}
<: FiniteOrderedFactor                           CategoricalVector    Union{CategoricalString, CategoricalValue}
Count                                            Vector               Integer

The form of the target data y passed to fit is constrained by the target_scitype_union trait declaration, as scitype_union(y) <: target_scitype_union(SomeSupervisedModelType) will always hold. See Convenience methods below for the definition of scitype_union.

So, for example, if your model is a binary classifier, you declare

target_scitype_union(SomeSupervisedModelType) = Multiclass{2}

If it can predict any number of classes, you might instead declare

target_scitype_union(SomeSupervisedModelType) = Union{Multiclass, FiniteOrderedFactor}

See also the table in Getting Started.

In the case of a multivariate target, y is a vector of tuples, and the same constraint scitype_union(y) <: target_scitype_union(SomeSupervisedModelType) holds. For example, if you declare target_scitype_union(SomeSupervisedModelType) = Tuple{Continuous,Count}, then each element of y will be a tuple of type Tuple{Real,Integer}.
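
For instance, under the Tuple{Continuous,Count} declaration just given, the following hypothetical target would be admissible:

y = [(2.3, 4), (1.0, 7), (5.5, 2)]  # scitype_union(y) == Tuple{Continuous,Count}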

The fit method

A compulsory fit method returns three objects:

MLJBase.fit(model::SomeSupervisedModelType, verbosity::Int, X, y) -> fitresult, cache, report

Note: The Int typing of verbosity cannot be omitted.

  1. fitresult::R is the fit-result in the sense above (which becomes an argument for predict discussed below).

  2. report is a (possibly empty) NamedTuple, for example, report=(deviance=..., dof_residual=..., stderror=..., vcov=...). Any training-related statistics, such as internal estimates of the generalization error, and feature rankings, should be returned in the report tuple. How, or if, these are generated should be controlled by hyperparameters (the fields of model). Fitted parameters, such as the coefficients of a linear model, do not go in the report as they will be extractable from fitresult (and accessible to MLJ through the fitted_params method, see below).

  3. The value of cache can be nothing, unless one is also defining an update method (see below). The Julia type of cache is not presently restricted.

It is not necessary for fit to provide dimension checks or to call clean! on the model; MLJ will carry out such checks.

The method fit should never alter hyperparameter values. If the package is able to suggest better hyperparameters, as a byproduct of training, return these in the report field.

One should test that actual fit-results have the type declared in the model mutable struct declaration. To help with this, MLJBase.fitresult_type(m) returns the declared type, for any supervised model (or model type) m.
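
For example, a package test might include a check of this kind (model, X and y being the test model and data):

fitresult, cache, report = MLJBase.fit(model, 1, X, y)
@assert fitresult isa MLJBase.fitresult_type(model)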

The verbosity level (0 for silent) is for passing to the learning algorithm itself. A fit method wrapping such an algorithm should generally avoid doing any of its own logging.
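
Putting these requirements together, a complete fit method for the RidgeRegressor example above might look something like this (a sketch only; for brevity, the bias term is regularized along with the coefficients):

using LinearAlgebra

function MLJBase.fit(model::RidgeRegressor{F}, verbosity::Int, X, y) where F
    Xmatrix = MLJBase.matrix(X)
    n, p = size(Xmatrix)
    A = hcat(Xmatrix, ones(F, n))  # augment with a column for the bias
    # regularized least-squares solution of the augmented problem:
    coefs = (A'A + model.lambda*Matrix{F}(I, p + 1, p + 1)) \ (A'y)
    fitresult = LinearFitResult(Vector{F}(coefs[1:p]), F(coefs[end]))
    cache = nothing
    report = NamedTuple()
    return fitresult, cache, report
end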

The fitted_params method

A fitted_params method may optionally be overloaded. Its purpose is to provide MLJ access to a user-friendly representation of the learned parameters of the model (as opposed to the hyperparameters). These must be extractable from fitresult.

MLJBase.fitted_params(model::SomeSupervisedModelType, fitresult) -> friendly_fitresult::NamedTuple

For a linear model, for example, one might declare something like friendly_fitresult=(coefs=[...], bias=...).

The fallback is to return (fitresult=fitresult,).
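
In the RidgeRegressor example above, one might declare:

MLJBase.fitted_params(::RidgeRegressor, fitresult) = (coefficients=fitresult.coefficients, bias=fitresult.bias)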

The predict method

A compulsory predict method has the form

MLJBase.predict(model::SomeSupervisedModelType, fitresult, Xnew) -> yhat

Here Xnew is any table whose entries satisfy the same scitype constraints as discussed for X above.

Prediction types for deterministic responses. In the case of Deterministic models, yhat must have the same form as the target y passed to the fit method (see above discussion on the form of data for fitting), with one exception: If predicting a Count, the prediction may be Continuous. For all models predicting a Multiclass or FiniteOrderedFactor, the categorical vectors returned by predict must have the levels in the categorical pool of the target data presented in training, even if not all levels appear in the training data or prediction itself. That is, we must have levels(yhat) == levels(y).

Unfortunately, code not written with the preservation of categorical levels in mind poses special problems. To help with this, MLJ provides a utility CategoricalDecoder which can decode a CategoricalArray into a plain array, and re-encode a prediction with the original levels intact. The CategoricalDecoder object created during fit will need to be bundled with fitresult to make it available to predict during re-encoding.

So, for example, if the core algorithm being wrapped by fit expects a nominal target yint of type Vector{Int64} then a fit method may look something like this:

function MLJBase.fit(model::SomeSupervisedModelType, verbosity, X, y)
    decoder = MLJBase.CategoricalDecoder(y, Int64)
    yint = transform(decoder, y)
    core_fitresult = SomePackage.fit(X, yint, verbosity=verbosity)
    fitresult = (decoder, core_fitresult)
    cache = nothing
    report = NamedTuple()
    return fitresult, cache, report
end

while a corresponding deterministic predict operation might look like this:

function MLJBase.predict(model::SomeSupervisedModelType, fitresult, Xnew)
    decoder, core_fitresult = fitresult
    yhat = SomePackage.predict(core_fitresult, Xnew)
    return inverse_transform(decoder, yhat)
end

Query ?MLJBase.CategoricalDecoder for more information.

If you are coding a learning algorithm from scratch, rather than wrapping an existing one, conversions may be unnecessary. It may suffice to record the pool of y and bundle that with the fitresult for predict to append to the levels of its categorical output.

Prediction types for probabilistic responses. In the case of Probabilistic models with univariate targets, yhat must be a Vector whose elements are distributions (one distribution per row of Xnew).

A distribution is any instance of a subtype of Distributions.Distribution from the package Distributions.jl, or any instance of the additional types UnivariateNominal and MultivariateNominal defined in MLJBase.jl (or any other type D you define for which MLJBase.isdistribution(::D) = true, meaning Base.rand and Distributions.pdf are implemented, as well as Distributions.mean, Distributions.median or Distributions.mode).

Use UnivariateNominal for Probabilistic models predicting Multiclass or FiniteOrderedFactor targets. For example, suppose levels(y) = ["yes", "no", "maybe"] and set L = levels(y). If the predicted probabilities for some input pattern are [0.1, 0.7, 0.2], respectively, then the prediction returned for that pattern is UnivariateNominal(L, [0.1, 0.7, 0.2]). Query ?UnivariateNominal for more information.

The predict method will need access to all levels in the pool of the target variable y presented for training, which consequently need to be encoded in the fitresult returned by fit. If a CategoricalDecoder object, decoder, has been bundled in fitresult, as in the deterministic example above, then the levels are given by levels(decoder). Levels not observed in the training data (i.e., appearing only in its pool) should be assigned probability zero.
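
For example, continuing the decoder-bundling pattern above, a probabilistic predict might look like this (a sketch; SomePackage.predict_probabilities is a hypothetical core call returning an n × length(L) matrix of probabilities, with columns ordered as L):

function MLJBase.predict(model::SomeProbabilisticModelType, fitresult, Xnew)
    decoder, core_fitresult = fitresult
    L = levels(decoder)  # all levels in the training pool
    probs = SomePackage.predict_probabilities(core_fitresult, MLJBase.matrix(Xnew))
    return [UnivariateNominal(L, probs[i, :]) for i in 1:size(probs, 1)]
end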

Trait declarations

There are a number of recommended trait declarations for each model mutable structure SomeSupervisedModelType <: Supervised you define. Basic fitting, resampling and tuning in MLJ do not require these traits, but some advanced MLJ meta-algorithms may require them now, or in the future. In particular, MLJ's models(::Task) method (matching models to user-specified tasks) can only identify models having a complete set of trait declarations. A full set of declarations is shown below for the DecisionTreeClassifier type (defined in the submodule DecisionTree_ of MLJModels):

MLJBase.load_path(::Type{<:DecisionTreeClassifier}) = "MLJModels.DecisionTree_.DecisionTreeClassifier" 
MLJBase.package_name(::Type{<:DecisionTreeClassifier}) = "DecisionTree"
MLJBase.package_uuid(::Type{<:DecisionTreeClassifier}) = "7806a523-6efd-50cb-b5f6-3fa6f1930dbb"
MLJBase.package_url(::Type{<:DecisionTreeClassifier}) = "https://github.com/bensadeghi/DecisionTree.jl"
MLJBase.is_pure_julia(::Type{<:DecisionTreeClassifier}) = true
MLJBase.input_is_multivariate(::Type{<:DecisionTreeClassifier}) = true
MLJBase.input_scitype_union(::Type{<:DecisionTreeClassifier}) = MLJBase.Continuous
MLJBase.target_scitype_union(::Type{<:DecisionTreeClassifier}) = MLJBase.Multiclass

Note that models predicting multivariate targets will need to have target_scitype_union return an appropriate Tuple type.

For an explanation of Found and Other in the table below, see Scientific Types.

method                   return type    declarable return values                     default value
target_scitype_union     DataType       subtype of Found or tuple of such types      Union{Found,NTuple{<:Found}}
input_scitype_union      DataType       subtype of Union{Missing,Found}              Union{Missing,Found}
input_is_multivariate    Bool           true or false                                true
is_pure_julia            Bool           true or false                                false
load_path                String         unrestricted                                 "unknown"
package_name             String         unrestricted                                 "unknown"
package_uuid             String         unrestricted                                 "unknown"
package_url              String         unrestricted                                 "unknown"

You can test declarations of traits by calling info(SomeModelType).

The update method

An update method may optionally be overloaded, to enable MLJ to retrain a model (on the same training data) without repeating computations unnecessarily.

MLJBase.update(model::SomeSupervisedModelType, verbosity, old_fitresult, old_cache, X, y) -> fitresult, cache, report

If an MLJ Machine is being fit! and it is not the first time, then update is called instead of fit unless fit! has been called with new rows. However, MLJBase defines a fallback for update which just calls fit. For context, see MLJ Internals.

Learning networks wrapped as models constitute one use-case: one would like each component model to be retrained only when hyperparameter changes "upstream" make this necessary. In this case MLJ provides a fallback (specifically, the fallback is for any subtype of Supervised{Node}). A second important use-case is iterative models, where calls to increase the number of iterations should only restart the iterative procedure if other hyperparameters have also changed. For an example, see builtins/Ensembles.jl.

In the event that the argument fitresult (returned by a preceding call to fit) is not sufficient for performing an update, the author can arrange for fit to output in its cache return value any additional information required, as this is also passed as an argument to the update method.
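
For example, for a hypothetical iterative model with an iteration-count hyperparameter n and one other hyperparameter lambda, an update might look something like this (a sketch; add_iterations is a hypothetical helper, and fit is assumed to have returned a deep copy of the model as its cache):

function MLJBase.update(model::SomeIterativeModelType, verbosity, old_fitresult, old_cache, X, y)
    old_model = old_cache  # the hyperparameters in force at the preceding fit
    if model.lambda == old_model.lambda && model.n >= old_model.n
        # perform only the extra iterations:
        fitresult = add_iterations(old_fitresult, X, y, model.n - old_model.n)
        return fitresult, deepcopy(model), NamedTuple()
    end
    return MLJBase.fit(model, verbosity, X, y)  # otherwise retrain from scratch
end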

Multivariate models

TODO

Unsupervised models

TODO

Convenience methods

CategoricalDecoder(C::CategoricalArray)
CategoricalDecoder(C::CategoricalArray, eltype, start_at_zero=false)

Construct a decoder for transforming a CategoricalArray{T} object into an ordinary array, and for re-encoding similar arrays back into a CategoricalArray{T} object having the same pool (and, in particular, the same levels) as C. If eltype is not specified, then the element type of the transformed array is T. Otherwise, the element type is eltype and the elements are conversions to eltype of the internal (unsigned integer) refs of the CategoricalArray, shifted backwards by one if start_at_zero=true. One must have eltype <: Real.

If eltype = Bool, then start_at_zero is ignored.

transform(decoder::CategoricalDecoder, C::CategoricalArray)

Transform C into an ordinary Array.

inverse_transform(decoder::CategoricalDecoder, A::Array)

Transform an array A suitably compatible with decoder into a CategoricalArray having the same pool as C.

levels(decoder::CategoricalDecoder)
levels_seen(decoder::CategoricalDecoder)

Return, respectively, all levels in the pool of the categorical vector C used to construct decoder (i.e., levels(C)), and just those levels explicitly appearing as entries of C (i.e., unique(C)).

Example

julia> using CategoricalArrays
julia> C = categorical(["a" "b"; "a" "c"])
2×2 CategoricalArray{String,2,UInt32}:
 "a"  "b"
 "a"  "c"

julia> decoder = MLJBase.CategoricalDecoder(C, Float64);
julia> A = transform(decoder, C)
2×2 Array{Float64,2}:
 1.0  2.0
 1.0  3.0

julia> inverse_transform(decoder, A[1:1,:])
1×2 CategoricalArray{String,2,UInt32}:
 "a"  "b"

julia> levels(ans)
3-element Array{String,1}:
 "a"
 "b"
 "c"

MLJBase.matrix(X)

Convert a table source X into a Matrix; or, if X is an AbstractMatrix, return X. Optimized for column-based sources.

If instead X is a sparse table, then a SparseMatrixCSC object is returned. The integer relabelling of column names follows the lexicographic ordering (as indicated by schema(X).names).


MLJBase.table(cols; prototype=cols)

Convert a named tuple of vectors cols into a table. The table type returned is the "preferred sink type" for prototype (see the Tables.jl documentation).

MLJBase.table(X::AbstractMatrix; names=nothing, prototype=nothing)

Convert an abstract matrix X into a table with names (a tuple of symbols) as column names, or with labels (:x1, :x2, ..., :xn) where n=size(X, 2), if names is not specified. If prototype=nothing, then a named tuple of vectors is returned.

Equivalent to table(cols, prototype=prototype) where cols is the named tuple of columns of X, with keys(cols) = names.


MLJBase.select(X, r, c)

Select the element of a table or sparse table at row r and column c. In the case of sparse data, where the key (r, c) is missing, zero or missing is returned, depending on the value type.

See also: selectrows, selectcols


MLJBase.selectrows(X, r)

Select single or multiple rows from any table, sparse table, or abstract vector X. If X is tabular, the object returned is a table of the preferred sink type of typeof(X), even if only a single row is selected.


MLJBase.selectcols(X, c)

Select single or multiple columns from any table or sparse table X. If c is an abstract vector of integers or symbols, then the object returned is a table of the preferred sink type of typeof(X). If c is a single integer or symbol, then a Vector or CategoricalVector is returned.


MLJBase.schema(X)

Returns a struct with properties names and types, with the obvious meanings. Here X is any table or sparse table.


MLJBase.nrows(X)

Return the number of rows in a table, sparse table, or abstract vector.
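
As a quick illustration of these data utilities (results shown as comments; the exact table sink types may vary):

X = (x1 = [1.0, 2.0], x2 = [3.0, 4.0])  # a named tuple of vectors is a valid table
MLJBase.matrix(X)                       # 2×2 Array{Float64,2}
MLJBase.table(MLJBase.matrix(X))        # back to a named-tuple table with columns :x1, :x2
MLJBase.selectrows(X, 1)                # the first row, as a table
MLJBase.selectcols(X, :x2)              # the vector [3.0, 4.0]
MLJBase.schema(X).names                 # (:x1, :x2)
MLJBase.nrows(X)                        # 2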


MLJBase.scitype(x)

Return the scientific type for scalar values that object x can represent. If x is a tuple, then Tuple{scitype.(x)...} is returned.

julia> scitype(4.5)
Continuous

julia> scitype("book")
Unknown

julia> scitype((1, 4.5))
Tuple{Count,Continuous}

julia> using CategoricalArrays
julia> v = categorical([:m, :f, :f]);
julia> scitype(v[1])
Multiclass{2}

MLJBase.scitype_union(A)

Return the type union, over all elements x generated by the iterable A, of scitype(x).


MLJBase.scitypes(X)

Returns a named tuple keyed on the column names of the table X with values the corresponding scitype unions over a column's entries.


Where to place code implementing new models

Note that different packages can implement models having the same name without causing conflicts, although an MLJ user cannot simultaneously load two such models.

There are two options for making a new model implementation available to all MLJ users:

  1. Native implementations (preferred option). The implementation code lives in the same package that contains the learning algorithms implementing the interface. In this case, it is sufficient to open an issue at MLJRegistry requesting the package to be registered with MLJ. Registering a package allows the MLJ user to access its models' metadata and to selectively load them.

  2. External implementations (short-term alternative). The model implementation code is necessarily separate from the package SomePkg defining the learning algorithm being wrapped. In this case, the recommended procedure is to include the implementation code at MLJModels/src via a pull-request, and test code at MLJModels/test. Assuming SomePkg is the only package imported by the implementation code, one needs to: (i) register SomePkg at MLJRegistry as explained above; and (ii) add a corresponding @require line in the PR to MLJModels/src/MLJModels.jl to enable lazy-loading of that package by MLJ (following the pattern of existing additions). If other packages must be imported, add them to the MLJModels project file after checking they are not already there. If it is really necessary, packages can also be added to Project.toml for testing purposes.

Additionally, one needs to ensure that the implementation code defines the package_name and load_path model traits appropriately, so that MLJ's @load macro can find the necessary code (see MLJModels/src for examples). The @load command can only be tested after registration. If changes are made, lodge an issue at MLJRegistry to make the changes available to MLJ.