Fit, predict, transform
To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions. If you have questions or suggestions about this tutorial, please open an issue here.
As in "choosing a model", let's load the Iris dataset and unpack it:
using MLJ
import Statistics
using PrettyPrinting
using StableRNGs
X, y = @load_iris;
Let's also load the DecisionTreeClassifier:
DecisionTreeClassifier = @load DecisionTreeClassifier pkg=DecisionTree
tree_model = DecisionTreeClassifier()
import MLJDecisionTreeInterface ✔
DecisionTreeClassifier(
max_depth = -1,
min_samples_leaf = 1,
min_samples_split = 2,
min_purity_increase = 0.0,
n_subfeatures = 0,
post_prune = false,
merge_purity_threshold = 1.0,
display_depth = 5,
feature_importance = :impurity,
rng = Random._GLOBAL_RNG())
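Hyperparameters are ordinary fields of the model object, so you can read (or mutate) them directly; for example (a quick sketch, leaving the defaults untouched):
tree_model.max_depth    # -1 here means the tree depth is not limited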
Remember that in MLJ, a model is an object that only serves as a container for the hyperparameters. A machine is an object wrapping both a model and data; it can store the results of training, but it does not fit the model by itself. It does, however, check that the model is compatible with the scientific type of the data and will warn you otherwise.
tree = machine(tree_model, X, y)
untrained Machine; caches model-specific representations of data
model: DecisionTreeClassifier(max_depth = -1, …)
args:
1: Source @481 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
2: Source @685 ⏎ AbstractVector{ScientificTypesBase.Multiclass{3}}
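The compatibility check is based on the scientific types of the data; you can inspect these yourself with scitype (a quick sketch):
scitype(X)    # Table{AbstractVector{Continuous}}
scitype(y)    # AbstractVector{Multiclass{3}}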
A machine is used for both supervised and unsupervised models. In this tutorial we give a supervised example first and then move on to the unsupervised case.
Now that you've declared the model you'd like to use and the data, we are left with the standard training and testing steps for a supervised learning algorithm.
To split the data into a training and testing set, you can use the function partition to obtain indices for data points that should be considered either as training or testing data:
rng = StableRNG(566)
train, test = partition(eachindex(y), 0.7, shuffle=true, rng=rng)
test[1:3]
3-element Vector{Int64}:
39
54
9
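As a quick sanity check, the two index sets should split the 150 observations roughly 70/30 (a sketch):
length(train), length(test)    # (105, 45)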
To fit the machine, you can use the function fit!, specifying the rows to be used for the training:
fit!(tree, rows=train)
trained Machine; caches model-specific representations of data
model: DecisionTreeClassifier(max_depth = -1, …)
args:
1: Source @481 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
2: Source @685 ⏎ AbstractVector{ScientificTypesBase.Multiclass{3}}
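Machines cache their results: calling fit! again with an unchanged model and rows does not retrain; you can pass force=true to retrain from scratch (a sketch):
fit!(tree, rows=train)              # skipped, nothing has changed since the last fit
fit!(tree, rows=train, force=true)  # retrains unconditionally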
Note that fitting modifies the machine, which now contains the trained parameters of the decision tree. You can inspect the result of the fitting with the fitted_params method:
fitted_params(tree) |> pprint
(tree =
DecisionTree.InfoNode{Float64, UInt32}(Decision Tree
Leaves: 5
Depth: 4, nchildren=2),
raw_tree = Decision Tree
Leaves: 5
Depth: 4,
encoding =
Dict(0x00000002 =>
CategoricalArrays.CategoricalValue{String, UInt32} "versicolor",
0x00000003 =>
CategoricalArrays.CategoricalValue{String, UInt32} "virginica",
0x00000001 =>
CategoricalArrays.CategoricalValue{String, UInt32} "setosa"),
features = [:sepal_length, :sepal_width, :petal_length, :petal_width])
This fitresult will vary from model to model, though classifiers will usually return a tuple whose first element corresponds to the fitted model and whose second element keeps track of how the classes are encoded (so that predictions can be appropriately labelled).
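Besides the learned parameters, a machine also stores a training report whose contents depend on the model; you can access it with report (a sketch; for this classifier it includes, e.g., the classes seen during training):
report(tree)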
You can now use the machine to make predictions with the predict function, specifying the rows to be used for the prediction:
ŷ = predict(tree, rows=test)
@show ŷ[1]
ŷ[1] = UnivariateFinite{ScientificTypesBase.Multiclass{3}}(setosa=>1.0, versicolor=>0.0, virginica=>0.0)
Note that the output is probabilistic, effectively a vector with a score for each class. You could get the mode by using the mode function on ŷ or by using predict_mode:
ȳ = predict_mode(tree, rows=test)
@show ȳ[1]
@show mode(ŷ[1])
ȳ[1] = CategoricalArrays.CategoricalValue{String, UInt32} "setosa"
mode(ŷ[1]) = CategoricalArrays.CategoricalValue{String, UInt32} "setosa"
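Each element of ŷ is a UnivariateFinite distribution, so you can also extract the probability assigned to a specific class with pdf (a sketch):
pdf.(ŷ[1:3], "virginica")    # probability of "virginica" for the first three test observations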
To measure the discrepancy between ŷ and y, you could use the cross entropy:
mce = cross_entropy(ŷ, y[test])
round(mce, digits=4)
2.4029
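Other measures follow the same calling pattern; deterministic ones such as accuracy apply to the point predictions ȳ instead (a sketch):
accuracy(ȳ, y[test])                # proportion of correct predictions
misclassification_rate(ȳ, y[test])  # 1 - accuracy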
Unsupervised models define a transform method, and may optionally implement an inverse_transform method. As in the supervised case, we use a machine to wrap the unsupervised model and the data:
v = [1, 2, 3, 4]
stand_model = UnivariateStandardizer()
stand = machine(stand_model, v)
untrained Machine; caches model-specific representations of data
model: UnivariateStandardizer()
args:
1: Source @685 ⏎ AbstractVector{ScientificTypesBase.Count}
We can then fit the machine and use it to apply the corresponding data transformation:
fit!(stand)
w = transform(stand, v)
@show round.(w, digits=2)
@show mean(w)
@show std(w)
round.(w, digits = 2) = [-1.16, -0.39, 0.39, 1.16]
mean(w) = 0.0
std(w) = 1.0
In this case, the model also has an inverse transform:
vv = inverse_transform(stand, w)
sum(abs.(vv .- v))
0.0
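The fitted machine is not restricted to the training data: you can transform new values as well, using the mean and standard deviation learned from v (a sketch):
transform(stand, [5, 6])    # standardized with the statistics fitted on v = [1, 2, 3, 4]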