Fit, predict, transform

To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions.

If you have questions or suggestions about this tutorial, please open an issue here.

Preliminary steps


As in "choosing a model", let's load the Iris dataset and unpack it:

using MLJ
import Statistics
using PrettyPrinting
using StableRNGs

X, y = @load_iris;

Let's also load the DecisionTreeClassifier:

DecisionTreeClassifier = @load DecisionTreeClassifier pkg=DecisionTree
tree_model = DecisionTreeClassifier()
import MLJDecisionTreeInterface ✔
DecisionTreeClassifier(
    max_depth = -1,
    min_samples_leaf = 1,
    min_samples_split = 2,
    min_purity_increase = 0.0,
    n_subfeatures = 0,
    post_prune = false,
    merge_purity_threshold = 1.0,
    pdf_smoothing = 0.0,
    display_depth = 5,
    rng = Random._GLOBAL_RNG())

MLJ Machine

Recall that in MLJ a model is only a container for hyperparameters. A machine wraps a model together with data and, once trained, stores the learned parameters; constructing a machine does not fit the model. It does, however, check that the model is compatible with the scientific type of the data and warns you if it is not.

tree = machine(tree_model, X, y)
Machine{DecisionTreeClassifier,…} trained 0 times; caches data
  model: MLJDecisionTreeInterface.DecisionTreeClassifier
    1:	Source @605 ⏎ `ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}`
    2:	Source @014 ⏎ `AbstractVector{ScientificTypesBase.Multiclass{3}}`
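
The compatibility check mentioned above can be reproduced by hand. The following is a minimal sketch, assuming the trait functions `input_scitype` and `target_scitype` describe what the model accepts; it illustrates the kind of comparison involved and is not necessarily the exact check a machine runs internally.

```julia
using MLJ

X, y = @load_iris
DecisionTreeClassifier = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0
tree_model = DecisionTreeClassifier()

# Scientific types of the data
@show scitype(X)
@show scitype(y)

# Scientific types the model declares it can handle
@show input_scitype(tree_model)
@show target_scitype(tree_model)

# The data is compatible when its scitype is a subtype of what the model expects
input_ok  = scitype(X) <: input_scitype(tree_model)
target_ok = scitype(y) <: target_scitype(tree_model)
@show input_ok target_ok
```

If either comparison is false, coercing the offending columns with `coerce` is the usual remedy.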

A machine is used for both supervised and unsupervised models. This tutorial gives an example for the supervised case first and then moves on to the unsupervised case.

Training and testing a supervised model

Now that you've declared the model and the data, what remains are the standard training and testing steps for a supervised learning algorithm.

Splitting the data

To split the data into a training and testing set, you can use the function partition to obtain indices for data points that should be considered either as training or testing data:

rng = StableRNG(566)
train, test = partition(eachindex(y), 0.7, shuffle=true, rng=rng)
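
A quick sanity check on the split can be useful. The sketch below repeats the call above and verifies that the two index sets are disjoint and together cover every row; the variable names `no_overlap` and `covers_all` are introduced here for illustration only.

```julia
using MLJ
using StableRNGs

X, y = @load_iris
rng = StableRNG(566)
train, test = partition(eachindex(y), 0.7, shuffle=true, rng=rng)

# Sizes of the two index sets
@show length(train) length(test)

# No index should appear in both sets
no_overlap = isempty(intersect(train, test))

# Together the two sets should cover every row exactly once
covers_all = sort(vcat(train, test)) == collect(eachindex(y))

@show no_overlap covers_all
```

Passing a seeded `rng` makes the shuffled split reproducible between runs.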

Fitting and testing the machine

To fit the machine, you can use the function fit! specifying the rows to be used for the training:

fit!(tree, rows=train)
Machine{DecisionTreeClassifier,…} trained 1 time; caches data
  model: MLJDecisionTreeInterface.DecisionTreeClassifier
    1:	Source @605 ⏎ `ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}`
    2:	Source @014 ⏎ `AbstractVector{ScientificTypesBase.Multiclass{3}}`

Note that this modifies the machine, which now contains the trained parameters of the decision tree. You can inspect the result of the fitting with the fitted_params method:

fitted_params(tree) |> pprint
(tree = Decision Tree
Leaves: 5
Depth:  4,
 encoding =
     Dict(CategoricalArrays.CategoricalValue{String, UInt32} "virginica" =>
          CategoricalArrays.CategoricalValue{String, UInt32} "setosa" =>
          CategoricalArrays.CategoricalValue{String, UInt32} "versicolor" =>

This result varies from model to model, though classifiers usually return a tuple whose first element is the fitted object and whose second element keeps track of how the classes are named (so that predictions can be labelled appropriately).

You can now use the machine to make predictions with the predict function specifying rows to be used for the prediction:

ŷ = predict(tree, rows=test)
@show ŷ[1]
ŷ[1] = UnivariateFinite{ScientificTypesBase.Multiclass{3}}(setosa=>1.0, versicolor=>0.0, virginica=>0.0)

Note that the output is probabilistic, effectively a vector with a score for each class. You can get the mode by calling the mode function on an element of ŷ or by using predict_mode:

ȳ = predict_mode(tree, rows=test)
@show ȳ[1]
@show mode(ŷ[1])
ȳ[1] = CategoricalArrays.CategoricalValue{String, UInt32} "setosa"
mode(ŷ[1]) = CategoricalArrays.CategoricalValue{String, UInt32} "setosa"
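
Beyond the mode, individual class probabilities can be read off with `pdf`. The sketch below refits the same tree so that it runs on its own, then queries the probability of one class and compares point predictions with the true labels using `accuracy`. The names `p_setosa`, `same` and `acc` are introduced here for illustration.

```julia
using MLJ
using StableRNGs

X, y = @load_iris
DecisionTreeClassifier = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0
tree = machine(DecisionTreeClassifier(), X, y)
train, test = partition(eachindex(y), 0.7, shuffle=true, rng=StableRNG(566))
fit!(tree, rows=train, verbosity=0)

ŷ = predict(tree, rows=test)        # probabilistic predictions
ȳ = predict_mode(tree, rows=test)   # point predictions

# Probability assigned to "setosa" for every test row
p_setosa = pdf.(ŷ, "setosa")
@show p_setosa[1]

# predict_mode should agree with taking the mode of each distribution
same = ȳ == mode.(ŷ)
@show same

# Deterministic measures such as accuracy take point predictions
acc = accuracy(ȳ, y[test])
@show acc
```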

To measure the discrepancy between ŷ and y, you could use the average cross entropy:

mce = cross_entropy(ŷ, y[test]) |> mean
round(mce, digits=4)
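
To make the quantity concrete: the cross entropy of a single observation is the negative log of the probability the model assigned to the true class. The sketch below computes this by hand on made-up probabilities (`p_true` is hypothetical data, not output from the tree above). It also clamps probabilities away from zero, on the assumption that a library implementation does something similar to avoid infinite losses; the tolerance `tol` is an illustrative choice.

```julia
using Statistics

# Hypothetical probabilities assigned to the *true* class of four observations
p_true = [1.0, 0.5, 0.25, 0.0]

# Clamp away from 0 and 1 so that log stays finite (tolerance is an assumption)
tol = eps()
p_clamped = clamp.(p_true, tol, 1 - tol)

# Per-observation cross entropy: -log of the probability of the true class
losses = -log.(p_clamped)

# Average cross entropy over the observations
mean_loss = mean(losses)
@show losses
@show mean_loss
```

A confident wrong prediction (probability 0 for the true class) dominates the average, which is why a tree that outputs hard 0/1 probabilities can score a large cross entropy even when most predictions are correct.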

Unsupervised models

Unsupervised models define a transform method, and may optionally implement an inverse_transform method. As in the supervised case, we use a machine to wrap the unsupervised model and the data:

v = [1, 2, 3, 4]
stand_model = UnivariateStandardizer()
stand = machine(stand_model, v)
Machine{UnivariateStandardizer,…} trained 0 times; caches data
  model: MLJModels.UnivariateStandardizer
    1:	Source @762 ⏎ `AbstractVector{ScientificTypesBase.Count}`

We can then fit the machine and use it to apply the corresponding data transformation:

fit!(stand)
w = transform(stand, v)
@show round.(w, digits=2)
@show mean(w)
@show std(w)
round.(w, digits = 2) = [-1.16, -0.39, 0.39, 1.16]
mean(w) = 0.0
std(w) = 1.0

In this case, the model also has an inverse transform:

vv = inverse_transform(stand, w)
sum(abs.(vv .- v))
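
The transform and its inverse can also be reproduced by hand. This is a minimal sketch, assuming the standardizer subtracts the mean and divides by the sample standard deviation, which is consistent with the output shown above; `w_manual` and `v_back` are names introduced here.

```julia
using Statistics

v = [1, 2, 3, 4]

# Parameters a standardizer would learn during fitting
μ = mean(v)
σ = std(v)   # sample standard deviation (n - 1 in the denominator)

# Forward transform: centre and scale
w_manual = (v .- μ) ./ σ

# Inverse transform: undo the scaling and the centring
v_back = w_manual .* σ .+ μ

@show round.(w_manual, digits=2)
@show v_back ≈ v
```

Keeping the learned `μ` and `σ` separate from the data is what lets the same transformation be applied later to unseen values.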