# Adding New Models
This guide outlines in detail the specification of the MLJ model interface and provides guidelines for implementing the interface for models intended for general use. For sample implementations, see MLJModels/src.
The machine learning tools provided by MLJ can be applied to the models in any package that imports the package MLJBase and implements the API defined there, as outlined below. For a quick-and-dirty implementation of user-defined models see Simple User Defined Models. To make new models available to all MLJ users, see Where to place code implementing new models.
It is assumed the reader has read Getting Started. To implement the API described here, some familiarity with the following packages is also helpful:
- Distributions.jl (for probabilistic predictions)
- CategoricalArrays.jl (essential if you are implementing a model handling data of `Multiclass` or `FiniteOrderedFactor` scitype)
- Tables.jl (if your algorithm needs input data in a novel format)
In MLJ, the basic interface exposed to the user, built atop the model interface described here, is the machine interface. After a first reading of this document, the reader may wish to refer to MLJ Internals for context.
## Overview
A model is an object storing hyperparameters associated with some machine learning algorithm. In MLJ, hyperparameters include configuration parameters, like the number of threads, and special instructions, such as "compute feature rankings", which may or may not affect the final learning outcome. However, the logging level (`verbosity` below) is excluded.
The name of the Julia type associated with a model indicates the associated algorithm (e.g., `DecisionTreeClassifier`). The outcome of training a learning algorithm is called a *fit-result*. For ordinary multilinear regression, for example, this would be the coefficients and intercept. For a general supervised model, it is the (generally minimal) information needed to make new predictions.
The ultimate supertype of all models is `MLJBase.Model`, which has two abstract subtypes:

```julia
abstract type Supervised{R} <: Model end
abstract type Unsupervised <: Model end
```
Here the parameter `R` refers to a fit-result type. By declaring a model to be a subtype of `MLJBase.Supervised{R}` you guarantee the fit-result to be of type `R` and, if `R` is concrete, one may improve the performance of homogeneous ensembles of the model (as defined by the built-in MLJ `EnsembleModel` wrapper). There is no abstract type for fit-results because these types are generally declared outside of MLJBase.

**WIP:** The necessity to declare the fit-result type `R` may disappear in the future (issue #93).
`Supervised` models are further divided according to whether they are able to furnish probabilistic predictions of the target(s) (which they do by default) or directly predict "point" estimates, for each new input pattern:

```julia
abstract type Probabilistic{R} <: Supervised{R} end
abstract type Deterministic{R} <: Supervised{R} end
```
Further division of model types is realized through Trait declarations.
Associated with every concrete subtype of `Model` there must be a `fit` method, which implements the associated algorithm to produce the fit-result. Additionally, every `Supervised` model has a `predict` method, while `Unsupervised` models must have a `transform` method. More generally, methods such as these, that are dispatched on a model instance and a fit-result (plus other data), are called *operations*. `Probabilistic` supervised models optionally implement a `predict_mode` operation (in the case of classifiers) or `predict_mean` and/or `predict_median` operations (in the case of regressors), overriding the obvious fallbacks provided by `MLJBase`. `Unsupervised` models may implement an `inverse_transform` operation.
## New model type declarations and optional clean! method
Here is an example of a concrete supervised model type declaration, made after defining an appropriate fit-result type (an optional step):
```julia
import MLJBase

struct LinearFitResult{F<:AbstractFloat} <: MLJBase.MLJType
    coefficients::Vector{F}
    bias::F
end

mutable struct RidgeRegressor{F} <: MLJBase.Deterministic{LinearFitResult{F}}
    target_type::Type{F}
    lambda::Float64
end
```
*Note.* Model fields may be of any type except `NamedTuple`. (This is because named tuples are used to represent the nested hyperparameters of composite models, i.e., models that have other models as fields.)
Models (which are mutable) should not be given internal constructors. It is recommended that they be given an external lazy keyword constructor of the same name. This constructor defines default values for every field, and optionally corrects invalid field values by calling a `clean!` method (whose fallback returns an empty message string):
```julia
function MLJBase.clean!(model::RidgeRegressor)
    warning = ""
    if model.lambda < 0
        warning *= "Need lambda ≥ 0. Resetting lambda=0. "
        model.lambda = 0
    end
    return warning
end

# keyword constructor
function RidgeRegressor(; target_type=Float64, lambda=0.0)
    model = RidgeRegressor(target_type, lambda)
    message = MLJBase.clean!(model)
    isempty(message) || @warn message
    return model
end
```
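For example, constructing an instance with an invalid hyperparameter triggers the correction (a usage sketch based on the definitions above):

```julia
model = RidgeRegressor(lambda=-1.0)  # logs the warning "Need lambda ≥ 0. Resetting lambda=0."
model.lambda                         # 0.0, corrected by clean!
model.target_type                    # Float64, the default
```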
## Supervised models
Below we describe the compulsory and optional methods to be specified for each concrete type `SomeSupervisedModelType{R} <: MLJBase.Supervised{R}`.
### The form of data for fitting and predicting
In every circumstance, the argument `X` passed to the `fit` method described below, and the argument `Xnew` of the `predict` method, will be some table supporting the Tables.jl API. The interface implementer can control the scientific type of data appearing in `X` with an appropriate `input_scitype_union` declaration (see Trait declarations below), as `Union{scitypes(X)...} <: input_scitype_union(SomeSupervisedModelType)` will always hold. See Convenience methods below for the definition of `scitypes`. If the core algorithm requires data in a different or more specific form, then `fit` will need to coerce the table into the form desired. To this end, MLJ provides the convenience method `MLJBase.matrix`; `MLJBase.matrix(Xtable)` has type `Matrix{T}`, where `T` is the tightest common type of the elements of `Xtable`, and `Xtable` is any table. (Tables.jl has recently added a `matrix` method as well.)
Other convenience methods provided by MLJBase for handling tabular data are: `selectrows`, `selectcols`, `select`, `schema` (for extracting the size, names and eltypes of a table) and `table` (for materializing an abstract matrix, or named tuple of vectors, as a table matching a given prototype). See Convenience methods below for details.
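For illustration, here is a hypothetical sketch of these methods in action, assuming `X` is a column table (a named tuple of vectors):

```julia
X = (age=[21.0, 25.0, 30.0], height=[1.85, 1.67, 1.74])

MLJBase.schema(X).names      # (:age, :height)
MLJBase.selectrows(X, 1:2)   # a table with the first two rows
MLJBase.selectcols(X, 1)     # the vector [21.0, 25.0, 30.0]
MLJBase.matrix(X)            # 3×2 Matrix{Float64}
```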
Note that generally the same type coercions applied to `X` by `fit` will need to be applied by `predict` to `Xnew`.
**Important convention.** It is to be understood that the columns of the table `X` correspond to features and the rows to patterns.
For univariate targets, `y` is always a `Vector` or `CategoricalVector`, according to the value of the trait:
| `target_scitype_union(SomeSupervisedModelType)` | type of `y` | a supertype of `eltype(y)` |
|---|---|---|
| `Continuous` | `Vector` | `Real` |
| `<: Multiclass` | `CategoricalVector` | `Union{CategoricalString, CategoricalValue}` |
| `<: FiniteOrderedFactor` | `CategoricalVector` | `Union{CategoricalString, CategoricalValue}` |
| `Count` | `Vector` | `Integer` |
The form of the target data `y` passed to `fit` is constrained by the `target_scitype_union` trait declaration, as `scitype_union(y) <: target_scitype_union(SomeSupervisedModelType)` will always hold. See Convenience methods below for the definition of `scitype_union`.
So, for example, if your model is a binary classifier, you declare

```julia
target_scitype_union(SomeSupervisedModelType) = Multiclass{2}
```
If it can predict any number of classes, you might instead declare

```julia
target_scitype_union(SomeSupervisedModelType) = Union{Multiclass, FiniteOrderedFactor}
```
See also the table in Getting Started.
In the case of a multivariate target, `y` is a vector of tuples, and the same constraint `scitype_union(y) <: target_scitype_union(SomeSupervisedModelType)` holds. For example, if you declare `target_scitype_union(SomeSupervisedModelType) = Tuple{Continuous,Count}`, then each element of `y` will be a tuple of type `Tuple{Real,Integer}`.
### The fit method
A compulsory `fit` method returns three objects:

```julia
MLJBase.fit(model::SomeSupervisedModelType, verbosity::Int, X, y) -> fitresult, cache, report
```
*Note:* The `Int` typing of `verbosity` cannot be omitted.
1. `fitresult::R` is the fit-result in the sense above (which becomes an argument for `predict`, discussed below).

2. `report` is a (possibly empty) `NamedTuple`, for example, `report=(deviance=..., dof_residual=..., stderror=..., vcov=...)`. Any training-related statistics, such as internal estimates of the generalization error, and feature rankings, should be returned in the `report` tuple. How, or if, these are generated should be controlled by hyperparameters (the fields of `model`). Fitted parameters, such as the coefficients of a linear model, do not go in the report as they will be extractable from `fitresult` (and accessible to MLJ through the `fitted_params` method, see below).

3. The value of `cache` can be `nothing`, unless one is also defining an `update` method (see below). The Julia type of `cache` is not presently restricted.
It is not necessary for `fit` to provide dimension checks or to call `clean!` on the model; MLJ will carry out such checks.
The method `fit` should never alter hyperparameter values. If the package is able to suggest better hyperparameters, as a byproduct of training, return these in the report field.
One should test that actual fit-results have the type declared in the model `mutable struct` declaration. To help with this, `MLJBase.fitresult_type(m)` returns the declared type, for any supervised model (or model type) `m`.
The `verbosity` level (0 for silent) is for passing to the learning algorithm itself. A `fit` method wrapping such an algorithm should generally avoid doing any of its own logging.
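To make the above concrete, here is a minimal sketch of a `fit` method for the `RidgeRegressor` defined earlier. It solves the regularized normal equations directly; a real implementation would more likely wrap an existing solver:

```julia
using LinearAlgebra

function MLJBase.fit(model::RidgeRegressor{F}, verbosity::Int, X, y) where F
    Xmatrix = MLJBase.matrix(X)                   # coerce the table to a matrix
    A = hcat(Xmatrix, ones(F, size(Xmatrix, 1)))  # append a column for the bias
    # ridge solution (penalizing the bias too, for simplicity of the sketch):
    coefs = (A'A + model.lambda*I) \ (A'y)
    fitresult = LinearFitResult(Vector{F}(coefs[1:end-1]), F(coefs[end]))
    cache = nothing        # we are not defining an update method
    report = NamedTuple()  # no training statistics to report
    return fitresult, cache, report
end
```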
### The fitted_params method
A `fitted_params` method may optionally be overloaded. Its purpose is to provide MLJ access to a user-friendly representation of the learned parameters of the model (as opposed to the hyperparameters). They must be extractable from `fitresult`.
```julia
MLJBase.fitted_params(model::SomeSupervisedModelType, fitresult) -> friendly_fitresult::NamedTuple
```

For a linear model, for example, one might declare something like `friendly_fitresult=(coefs=[...], bias=...)`.
The fallback is to return `(fitresult=fitresult,)`.
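For instance, for the `RidgeRegressor` example above, one might overload:

```julia
MLJBase.fitted_params(::RidgeRegressor, fitresult) =
    (coefficients=fitresult.coefficients, bias=fitresult.bias)
```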
### The predict method
A compulsory `predict` method has the form

```julia
MLJBase.predict(model::SomeSupervisedModelType, fitresult, Xnew) -> yhat
```
Here `Xnew` is any table whose entries satisfy the same scitype constraints as discussed for `X` above.
**Prediction types for deterministic responses.** In the case of `Deterministic` models, `yhat` must have the same form as the target `y` passed to the `fit` method (see the above discussion on the form of data for fitting), with one exception: if predicting a `Count`, the prediction may be `Continuous`. For all models predicting a `Multiclass` or `FiniteOrderedFactor`, the categorical vectors returned by `predict` must have the levels in the categorical pool of the target data presented in training, even if not all levels appear in the training data or prediction itself. That is, we must have `levels(yhat) == levels(y)`.
Unfortunately, code not written with the preservation of categorical levels in mind poses special problems. To help with this, MLJ provides a utility `CategoricalDecoder`, which can decode a `CategoricalArray` into a plain array, and re-encode a prediction with the original levels intact. The `CategoricalDecoder` object created during `fit` will need to be bundled with `fitresult` to make it available to `predict` during re-encoding.
So, for example, if the core algorithm being wrapped by `fit` expects a nominal target `yint` of type `Vector{Int64}`, then a `fit` method may look something like this:
```julia
function MLJBase.fit(model::SomeSupervisedModelType, verbosity, X, y)
    decoder = MLJBase.CategoricalDecoder(y, Int64)
    yint = transform(decoder, y)
    core_fitresult = SomePackage.fit(X, yint, verbosity=verbosity)
    fitresult = (decoder, core_fitresult)
    cache = nothing
    report = nothing
    return fitresult, cache, report
end
```
while a corresponding deterministic `predict` operation might look like this:
```julia
function MLJBase.predict(model::SomeSupervisedModelType, fitresult, Xnew)
    decoder, core_fitresult = fitresult
    yhat = SomePackage.predict(core_fitresult, Xnew)
    return inverse_transform(decoder, yhat)
end
```
Query `?MLJBase.CategoricalDecoder` for more information.
If you are coding a learning algorithm from scratch, rather than wrapping an existing one, conversions may be unnecessary. It may suffice to record the pool of `y` and bundle that with the fitresult, for `predict` to append to the levels of its categorical output, as sketched below.
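Here is one such hypothetical sketch, using the `levels!` method of CategoricalArrays.jl to restore the full pool on the predictions (`SomeScratchClassifier` and `predict_labels` are stand-in names):

```julia
using CategoricalArrays

# At fit time, record the complete pool alongside the learned state, e.g.:
#     fitresult = (levels(y), core_fitresult)

function MLJBase.predict(model::SomeScratchClassifier, fitresult, Xnew)
    y_levels, core_fitresult = fitresult
    labels = predict_labels(core_fitresult, MLJBase.matrix(Xnew))  # raw label vector
    yhat = categorical(labels)
    levels!(yhat, y_levels)  # ensure levels(yhat) == levels(y)
    return yhat
end
```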
**Prediction types for probabilistic responses.** In the case of `Probabilistic` models with univariate targets, `yhat` must be a `Vector` whose elements are distributions (one distribution per row of `Xnew`).
A *distribution* is any instance of a subtype of `Distributions.Distribution` from the package Distributions.jl, or any instance of the additional types `UnivariateNominal` and `MultivariateNominal` defined in MLJBase.jl (or any other type `D` you define for which `MLJBase.isdistribution(::D) = true`, meaning that `Base.rand` and `Distributions.pdf` are implemented, as well as `Distributions.mean`/`Distributions.median` or `Distributions.mode`).
Use `UnivariateNominal` for `Probabilistic` models predicting `Multiclass` or `FiniteOrderedFactor` targets. For example, suppose `levels(y) = ["yes", "no", "maybe"]` and set `L = levels(y)`. If the predicted probabilities for some input pattern are `[0.1, 0.7, 0.2]`, respectively, then the prediction returned for that pattern will be `UnivariateNominal(L, [0.1, 0.7, 0.2])`. Query `?UnivariateNominal` for more information.
The `predict` method will need access to all levels in the pool of the target variable `y` presented for training, which consequently need to be encoded in the `fitresult` returned by `fit`. If a `CategoricalDecoder` object, `decoder`, has been bundled in `fitresult`, as in the deterministic example above, then the levels are given by `levels(decoder)`. Levels not observed in the training data (i.e., only in its pool) should be assigned probability zero, as in the sketch below.
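For example, a probabilistic `predict` might look like the following sketch, in which `fitresult` bundles a `CategoricalDecoder` with the core fit-result, and `SomePackage.predict_probs` is a hypothetical routine returning one row of probabilities per input pattern, ordered as in `levels(decoder)`:

```julia
function MLJBase.predict(model::SomeProbabilisticModelType, fitresult, Xnew)
    decoder, core_fitresult = fitresult
    probs = SomePackage.predict_probs(core_fitresult, MLJBase.matrix(Xnew))
    L = levels(decoder)  # all levels in the training pool, including unobserved ones
    return [MLJBase.UnivariateNominal(L, probs[i, :]) for i in 1:size(probs, 1)]
end
```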
### Trait declarations
There are a number of recommended trait declarations for each model mutable structure `SomeSupervisedModelType <: Supervised` you define. Basic fitting, resampling and tuning in MLJ does not require these traits, but some advanced MLJ meta-algorithms may require them now, or in the future. In particular, MLJ's `models(::Task)` method (matching models to user-specified tasks) can only identify models having a complete set of trait declarations. A full set of declarations is shown below for the `DecisionTreeClassifier` type (defined in the submodule `DecisionTree_` of MLJModels):
```julia
MLJBase.load_path(::Type{<:DecisionTreeClassifier}) = "MLJModels.DecisionTree_.DecisionTreeClassifier"
MLJBase.package_name(::Type{<:DecisionTreeClassifier}) = "DecisionTree"
MLJBase.package_uuid(::Type{<:DecisionTreeClassifier}) = "7806a523-6efd-50cb-b5f6-3fa6f1930dbb"
MLJBase.package_url(::Type{<:DecisionTreeClassifier}) = "https://github.com/bensadeghi/DecisionTree.jl"
MLJBase.is_pure_julia(::Type{<:DecisionTreeClassifier}) = true
MLJBase.input_is_multivariate(::Type{<:DecisionTreeClassifier}) = true
MLJBase.input_scitype_union(::Type{<:DecisionTreeClassifier}) = MLJBase.Continuous
MLJBase.target_scitype_union(::Type{<:DecisionTreeClassifier}) = MLJBase.Multiclass
```
Note that models predicting multivariate targets will need to have `target_scitype_union` return an appropriate `Tuple` type.
For an explanation of `Found` and `Other` in the table below, see Scientific Types.
| method | return type | declarable return values | default value |
|---|---|---|---|
| `target_scitype_union` | `DataType` | subtype of `Found` or tuple of such types | `Union{Found,NTuple{<:Found}}` |
| `input_scitype_union` | `DataType` | subtype of `Union{Missing,Found}` | `Union{Missing,Found}` |
| `input_is_multivariate` | `Bool` | `true` or `false` | `true` |
| `is_pure_julia` | `Bool` | `true` or `false` | `false` |
| `load_path` | `String` | unrestricted | `"unknown"` |
| `package_name` | `String` | unrestricted | `"unknown"` |
| `package_uuid` | `String` | unrestricted | `"unknown"` |
| `package_url` | `String` | unrestricted | `"unknown"` |
You can test declarations of traits by calling `info(SomeModelType)`.
### The update method
An `update` method may be optionally overloaded to enable a call by MLJ to retrain a model (on the same training data) to avoid repeating computations unnecessarily.

```julia
MLJBase.update(model::SomeSupervisedModelType, verbosity, old_fitresult, old_cache, X, y) -> fitresult, cache, report
```
If an MLJ `Machine` is being `fit!` and it is not the first time, then `update` is called instead of `fit`, unless `fit!` has been called with new rows. However, `MLJBase` defines a fallback for `update` which just calls `fit`. For context, see MLJ Internals.
Learning networks wrapped as models constitute one use-case: one would like each component model to be retrained only when hyperparameter changes "upstream" make this necessary. In this case MLJ provides a fallback (specifically, the fallback is for any subtype of `Supervised{Node}`). A second important use-case is iterative models, where calls to increase the number of iterations should restart the iterative procedure only if other hyperparameters have also changed. For an example, see `builtins/Ensembles.jl`.
In the event that the argument `fitresult` (returned by a preceding call to `fit`) is not sufficient for performing an update, the author can arrange for `fit` to output in its `cache` return value any additional information required, as this is also passed as an argument to the `update` method.
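As a hypothetical sketch, an iterative model with an `n_iterations` field might implement `update` along the following lines. Here `SomeIterativeModel`, its `learning_rate` field, `SomePackage.train!`, and the use of `cache` to stash a copy of the model are all assumptions for illustration:

```julia
function MLJBase.update(model::SomeIterativeModel, verbosity::Int,
                        old_fitresult, old_cache, X, y)
    old_model = old_cache  # assume fit stored a copy of the model in cache
    # warm-restart only if nothing but the iteration count has increased:
    only_iterations_changed =
        model.n_iterations >= old_model.n_iterations &&
        model.learning_rate == old_model.learning_rate  # compare remaining fields likewise
    only_iterations_changed || return MLJBase.fit(model, verbosity, X, y)

    extra = model.n_iterations - old_model.n_iterations
    fitresult = SomePackage.train!(old_fitresult, X, y, extra)  # run extra iterations
    cache = deepcopy(model)
    report = NamedTuple()
    return fitresult, cache, report
end
```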
### Multivariate models
TODO
## Unsupervised models
TODO
## Convenience methods
`MLJBase.CategoricalDecoder` — Type.

```julia
CategoricalDecoder(C::CategoricalArray)
CategoricalDecoder(C::CategoricalArray, eltype, start_at_zero=false)
```

Construct a decoder for transforming a `CategoricalArray{T}` object into an ordinary array, and for re-encoding similar arrays back into a `CategoricalArray{T}` object having the same `pool` (and, in particular, the same levels) as `C`. If `eltype` is not specified, then the element type of the transformed array is `T`. Otherwise, the element type is `eltype` and the elements are conversions to `eltype` of the internal (unsigned integer) `ref`s of the `CategoricalArray`, shifted backwards by one if `start_at_zero=true`. One must have `eltype <: Real`.

If `eltype = Bool`, then `start_at_zero` is ignored.
```julia
transform(decoder::CategoricalDecoder, C::CategoricalArray)
```

Transform `C` into an ordinary `Array`.
```julia
inverse_transform(decoder::CategoricalDecoder, A::Array)
```

Transform an array `A` suitably compatible with `decoder` into a `CategoricalArray` having the same `pool` as the array `C` used to construct `decoder`.
```julia
levels(decoder::CategoricalDecoder)
levels_seen(decoder::CategoricalDecoder)
```

Return, respectively, all levels in the pool of the categorical vector `C` used to construct `decoder` (i.e., `levels(C)`), and just those levels explicitly appearing as entries of `C` (i.e., `unique(C)`).
**Example**

```julia
julia> using CategoricalArrays

julia> C = categorical(["a" "b"; "a" "c"])
2×2 CategoricalArray{String,2,UInt32}:
 "a"  "b"
 "a"  "c"

julia> decoder = MLJBase.CategoricalDecoder(C, eltype=Float64);

julia> A = transform(decoder, C)
2×2 Array{Float64,2}:
 1.0  2.0
 1.0  3.0

julia> inverse_transform(decoder, A[1:1,:])
1×2 CategoricalArray{String,2,UInt32}:
 "a"  "b"

julia> levels(ans)
3-element Array{String,1}:
 "a"
 "b"
 "c"
```
`MLJBase.matrix` — Function.

```julia
MLJBase.matrix(X)
```

Convert a table source `X` into a `Matrix`; or, if `X` is an `AbstractMatrix`, return `X`. Optimized for column-based sources.

If instead `X` is a sparse table, then a `SparseMatrixCSC` object is returned. The integer relabelling of column names follows the lexicographic ordering (as indicated by `schema(X).names`).
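For example (a usage sketch):

```julia
X = (x1=[1.0, 2.0], x2=[3.0, 4.0])  # a column table
MLJBase.matrix(X)                   # expected: [1.0 3.0; 2.0 4.0]
```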
`MLJBase.table` — Function.

```julia
MLJBase.table(cols; prototype=cols)
```

Convert a named tuple of vectors `cols` into a table. The table type returned is the "preferred sink type" for `prototype` (see the Tables.jl documentation).

```julia
MLJBase.table(X::AbstractMatrix; names=nothing, prototype=nothing)
```

Convert an abstract matrix `X` into a table with `names` (a tuple of symbols) as column names, or with labels `(:x1, :x2, ..., :xn)`, where `n = size(X, 2)`, if `names` is not specified. If `prototype=nothing`, then a named tuple of vectors is returned.

Equivalent to `table(cols, prototype=prototype)`, where `cols` is the named tuple of columns of `X`, with `keys(cols) = names`.
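For example (a usage sketch):

```julia
A = [1 2; 3 4]
MLJBase.table(A)  # expected: the named tuple of vectors (x1 = [1, 3], x2 = [2, 4])
```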
`MLJBase.select` — Function.

```julia
select(X, r, c)
```

Select the element of a table or sparse table at row `r` and column `c`. In the case of sparse data where the key `(r, c)` has no entry, zero or `missing` is returned, depending on the value type.

See also: `selectrows`, `selectcols`.
`MLJBase.selectrows` — Function.

```julia
selectrows(X, r)
```

Select single or multiple rows from any table, sparse table, or abstract vector `X`. If `X` is tabular, the object returned is a table of the preferred sink type of `typeof(X)`, even if a single row is selected.
`MLJBase.selectcols` — Function.

```julia
selectcols(X, c)
```

Select single or multiple columns from any table or sparse table `X`. If `c` is an abstract vector of integers or symbols, then the object returned is a table of the preferred sink type of `typeof(X)`. If `c` is a single integer or symbol, then a `Vector` or `CategoricalVector` is returned.
`MLJBase.schema` — Function.

```julia
schema(X)
```

Returns a struct with properties `names` and `types`, with the obvious meanings. Here `X` is any table or sparse table.
`MLJBase.nrows` — Function.

```julia
nrows(X)
```

Return the number of rows in a table, sparse table, or abstract vector.
`MLJBase.scitype` — Function.

```julia
scitype(x)
```

Return the scientific type for scalar values that object `x` can represent. If `x` is a tuple, then `Tuple{scitype.(x)...}` is returned.

```julia
julia> scitype(4.5)
Continuous

julia> scitype("book")
Unknown

julia> scitype((1, 4.5))
Tuple{Count,Continuous}

julia> using CategoricalArrays

julia> v = categorical([:m, :f, :f]);

julia> scitype(v[1])
Multiclass{2}
```
`MLJBase.scitype_union` — Function.

```julia
scitype_union(A)
```

Return the type union, over all elements `x` generated by the iterable `A`, of `scitype(x)`.
`MLJBase.scitypes` — Function.

```julia
scitypes(X)
```

Returns a named tuple keyed on the column names of the table `X`, with values the corresponding scitype unions over a column's entries.
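For example (a usage sketch, following the `scitype` examples above):

```julia
X = (a=[1.5, 2.5], b=[1, 2])
MLJBase.scitypes(X)  # expected: (a = Continuous, b = Count)
```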
## Where to place code implementing new models
Note that different packages can implement models having the same name without causing conflicts, although an MLJ user cannot simultaneously load two such models.
There are two options for making a new model implementation available to all MLJ users:
- **Native implementations** (preferred option). The implementation code lives in the same package that contains the learning algorithms implementing the interface. In this case, it is sufficient to open an issue at MLJRegistry requesting the package to be registered with MLJ. Registering a package allows the MLJ user to access its models' metadata and to selectively load them.

- **External implementations** (short-term alternative). The model implementation code is necessarily separate from the package `SomePkg` defining the learning algorithm being wrapped. In this case, the recommended procedure is to include the implementation code at MLJModels/src via a pull-request, and test code at MLJModels/test. Assuming `SomePkg` is the only package imported by the implementation code, one needs to: (i) register `SomePkg` at MLJRegistry as explained above; and (ii) add a corresponding `@require` line in the PR to MLJModels/src/MLJModels.jl to enable lazy-loading of that package by MLJ (following the pattern of existing additions). If other packages must be imported, add them to the MLJModels project file after checking they are not already there. If it is really necessary, packages can also be added to Project.toml for testing purposes.
Additionally, one needs to ensure that the implementation code defines the `package_name` and `load_path` model traits appropriately, so that `MLJ`'s `@load` macro can find the necessary code (see MLJModels/src for examples). The `@load` command can only be tested after registration. If changes are made, lodge an issue at MLJRegistry to make the changes available to MLJ.