Getting Started
For installation, see Installation instructions; for a summary of MLJ syntax, see the Cheatsheet.
Plug-and-play model evaluation
To load some data, add RDatasets to your load path and enter:
julia> using RDatasets
julia> iris = dataset("datasets", "iris"); # a DataFrame
and then split the data into input and target parts:
julia> using MLJ
julia> y, X = unpack(iris, ==(:Species), colname -> true);
julia> first(X, 3)
3×4 DataFrames.DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │
│ │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────────┼────────────┼─────────────┼────────────┤
│ 1 │ 5.1 │ 3.5 │ 1.4 │ 0.2 │
│ 2 │ 4.9 │ 3.0 │ 1.4 │ 0.2 │
│ 3 │ 4.7 │ 3.2 │ 1.3 │ 0.2 │
In MLJ a model is a struct storing the hyperparameters of the learning algorithm indicated by the struct name.
Assuming the DecisionTree package is in your load path, we can use @load to load the code defining the DecisionTreeClassifier model type. This macro also returns an instance with default hyperparameters. Drop the verbosity=1 declaration for silent loading:
julia> tree_model = @load DecisionTreeClassifier verbosity=1
import MLJModels ✔
import DecisionTree ✔
import MLJModels.DecisionTree_.DecisionTreeClassifier ✔
MLJModels.DecisionTree_.DecisionTreeClassifier(pruning_purity = 1.0,
max_depth = -1,
min_samples_leaf = 1,
min_samples_split = 2,
min_purity_increase = 0.0,
n_subfeatures = 0,
display_depth = 5,
post_prune = false,
merge_purity_threshold = 0.9,
pdf_smoothing = 0.05,) @ 1…30
Important: DecisionTree and most other packages implementing machine learning algorithms for use in MLJ are not MLJ dependencies. If such a package is not in your load path you will receive an error explaining how to add the package to your current environment.
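For example, a minimal sketch using the standard Pkg API (DecisionTree is the package used below; any registered package is added to the current environment the same way):
julia> using Pkg
julia> Pkg.add("DecisionTree")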
Once loaded, a model can be evaluated with the evaluate method:
julia> evaluate(tree_model, X, y,
resampling=CV(shuffle=true), measure=cross_entropy, verbosity=0)
(measure = MLJBase.CrossEntropy[cross_entropy],
measurement = [0.360337],
per_fold = Array{Float64,1}[[0.0327898, 0.0327898, 0.524111, 0.360337, 0.360337, 0.851659]],
per_observation = Array{Array{Float64,1},1}[[[0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898 … 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898], [0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898 … 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898], [0.0327898, 0.0327898, 4.12713, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898 … 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 4.12713, 0.0327898, 4.12713, 0.0327898, 0.0327898], [0.0327898, 0.0327898, 4.12713, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 4.12713, 0.0327898 … 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898], [0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 4.12713, 0.0327898, 0.0327898, 0.0327898 … 0.0327898, 0.0327898, 0.0327898, 0.0327898, 4.12713, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898], [0.0327898, 0.0327898, 4.12713, 0.0327898, 4.12713, 0.0327898, 0.0327898, 0.0327898, 4.12713, 4.12713 … 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898]]],)
Evaluating against multiple performance measures is also possible. See Evaluating model performance for details.
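As a minimal sketch, a vector of measures can be passed to the measure keyword (here assuming the probabilistic measure BrierScore() is available from MLJBase; the trailing semicolon suppresses the output):
julia> evaluate(tree_model, X, y,
                resampling=CV(shuffle=true),
                measure=[cross_entropy, BrierScore()],
                verbosity=0);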
Fit and predict
To illustrate MLJ's fit and predict interface, let's perform the above evaluations by hand. Binding the model to data creates a machine, which will store training outcomes:
julia> tree = machine(tree_model, X, y)
Machine{DecisionTreeClassifier} @ 5…41
Training and testing on a hold-out set:
julia> train, test = partition(eachindex(y), 0.7, shuffle=true); # 70:30 split
julia> fit!(tree, rows=train);
[ Info: Training Machine{DecisionTreeClassifier} @ 5…41.
julia> yhat = predict(tree, X[test,:]);
julia> yhat[3:5]
3-element Array{UnivariateFinite{String,UInt8,Float64},1}:
UnivariateFinite(setosa=>0.9677419354838711, versicolor=>0.01612903225806452, virginica=>0.01612903225806452)
UnivariateFinite(setosa=>0.9677419354838711, versicolor=>0.01612903225806452, virginica=>0.01612903225806452)
UnivariateFinite(setosa=>0.016129032258064516, versicolor=>0.016129032258064516, virginica=>0.9677419354838709)
julia> cross_entropy(yhat, y[test]) |> mean
0.39673156168717766
Notice that yhat is a vector of Distribution objects (because DecisionTreeClassifier makes probabilistic predictions). The methods of the Distributions package can be applied to such distributions:
julia> broadcast(pdf, yhat[3:5], "virginica") # predicted probabilities of virginica
3-element Array{Float64,1}:
0.01612903225806452
0.01612903225806452
0.9677419354838709
julia> mode.(yhat[3:5])
3-element Array{CategoricalArrays.CategoricalString{UInt8},1}:
"setosa"
"setosa"
"virginica"
One can explicitly get modes by using predict_mode instead of predict:
julia> predict_mode(tree, rows=test[3:5])
3-element Array{CategoricalArrays.CategoricalString{UInt8},1}:
"setosa"
"setosa"
"virginica"
Machines have an internal state which allows them, under certain conditions, to avoid redundant calculations when retrained - for example, when increasing the number of trees in a random forest, or the number of epochs in a neural network. The machine-building syntax also anticipates a more general syntax for composing multiple models, as explained in Composing Models.
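For example, a minimal sketch of this behaviour (log messages omitted): refitting an up-to-date machine skips the computation, while the force keyword of fit! retrains unconditionally:
julia> fit!(tree, rows=train);              # machine already trained above: training is skipped
julia> fit!(tree, rows=train, force=true);  # force retraining from scratch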
There is a version of evaluate for machines as well as models:
julia> evaluate!(tree, resampling=Holdout(fraction_train=0.5, shuffle=true),
measure=cross_entropy,
verbosity=0)
(measure = MLJBase.CrossEntropy[cross_entropy],
measurement = [0.141972],
per_fold = Array{Float64,1}[[0.141972]],
per_observation = Array{Array{Float64,1},1}[[[0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 4.12713, 0.0327898, 0.0327898, 0.0327898 … 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898]]],)
Changing a hyperparameter and re-evaluating:
julia> tree_model.max_depth = 3
3
julia> evaluate!(tree, resampling=Holdout(fraction_train=0.5, shuffle=true),
measure=cross_entropy,
verbosity=0)
(measure = MLJBase.CrossEntropy[cross_entropy],
measurement = [0.185266],
per_fold = Array{Float64,1}[[0.185266]],
per_observation = Array{Array{Float64,1},1}[[[0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 1.11514, 0.0327898 … 0.0327898, 0.0327898, 0.0327898, 1.11514, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898]]],)
Next steps
To learn a little more about what MLJ can do, take the MLJ tour, and then return to the manual as needed. Read at least the remainder of this page before considering serious use of MLJ.
Prerequisites
MLJ assumes some familiarity with the CategoricalValue and CategoricalString types from CategoricalArrays.jl, used here for representing categorical data. For probabilistic predictors, a basic acquaintance with Distributions.jl is also assumed.
Data containers and scientific types
The MLJ user should acquaint themselves with some basic assumptions about the form of data expected by MLJ, as outlined below.
machine(model::Supervised, X, y)
machine(model::Unsupervised, X)
Each supervised model in MLJ declares the permitted scientific type of the inputs X and targets y that can be bound to it in the first constructor above, rather than specifying specific machine types (such as Array{Float32,2}). Similar remarks apply to the input X of an unsupervised model. Scientific types are Julia types defined in the package ScientificTypes.jl, which also defines the convention used here (and there called mlj) for assigning a specific scientific type (interpretation) to each Julia object (see the scitype examples below).
The basic "scalar" scientific types are Continuous, Multiclass{N}, OrderedFactor{N} and Count. Be sure you read Container element types below to guarantee your scalar data is interpreted correctly. Additionally, most data containers - such as tuples, vectors, matrices and tables - have a scientific type.
Figure 1. Part of the scientific type hierarchy in ScientificTypes.jl.
julia> scitype(4.6)
Continuous
julia> scitype(42)
Count
julia> x1 = categorical(["yes", "no", "yes", "maybe"]);
julia> scitype(x1)
AbstractArray{Multiclass{3},1}
julia> X = (x1=x1, x2=rand(4), x3=rand(4)) # a "column table"
(x1 = CategoricalArrays.CategoricalString{UInt32}["yes", "no", "yes", "maybe"],
x2 = [0.821551, 0.35096, 0.36934, 0.203404],
x3 = [0.206296, 0.880461, 0.534247, 0.119658],)
julia> scitype(X)
Table{Union{AbstractArray{Continuous,1}, AbstractArray{Multiclass{3},1}}}
Tabular data
All data containers compatible with the Tables.jl interface (which includes all source formats listed here) have the scientific type Table{K}, where K depends on the scientific types of the columns, which can be individually inspected using schema:
julia> schema(X)
(names = (:x1, :x2, :x3),
types = (CategoricalArrays.CategoricalString{UInt32}, Float64, Float64),
scitypes = (Multiclass{3}, Continuous, Continuous),
nrows = 4,)
Inputs
Since an MLJ model only specifies the scientific type of data, if that type is Table - which is the case for the majority of MLJ models - then any Tables.jl format is permitted. However, the Tables.jl API excludes matrices. If Xmatrix is a matrix, convert it to a column table using X = MLJ.table(Xmatrix).
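For example, a minimal sketch (Xcols is a hypothetical name; the displayed scitype assumes every column is interpreted as Continuous):
julia> Xmatrix = rand(4, 2);        # a plain matrix is not a Tables.jl table
julia> Xcols = MLJ.table(Xmatrix);  # wrap the matrix as a column table
julia> scitype(Xcols)
Table{AbstractArray{Continuous,1}}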
Specifically, the requirement for an arbitrary model's input is scitype(X) <: input_scitype(model).
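For instance, one can verify this requirement directly (a sketch assuming X is the iris input table loaded above, whose columns are all Continuous):
julia> scitype(X) <: input_scitype(tree_model)
true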
Targets
The target y expected by MLJ models is generally an AbstractVector. A multivariate target y will generally be a vector of tuples. The tuples need not have uniform length, so some forms of sequence prediction are supported. Only the element types of y matter (the types of y[j] for each j).
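As a purely illustrative sketch (y_seq is hypothetical, and no particular model is implied to support it), such a target could look like:
julia> y_seq = [(:a, :b), (:a,), (:b, :a, :a)];  # tuples of non-uniform length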
Specifically, the type requirement for a model target is scitype(y) <: target_scitype(model).
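Similarly, a sketch assuming y is the iris target from above (a categorical vector, so its elements have Finite scitype):
julia> scitype(y) <: target_scitype(tree_model)
true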
Querying a model for acceptable data types
Given a model instance, one can inspect the admissible scientific types of its input and target by querying the scientific type of the model itself:
julia> tree = DecisionTreeClassifier();
julia> scitype(tree)
(input_scitype = ScientificTypes.Table{#s13} where #s13<:(AbstractArray{#s12,1} where #s12<:Continuous),
target_scitype = AbstractArray{#s21,1} where #s21<:Finite,
is_probabilistic = true,)
This does not work if relevant model code has not been loaded. In that case one can extract this information from the model type's registry entry, using info:
julia> info("DecisionTreeClassifier")
DecisionTreeClassifier from DecisionTree.jl.
[Documentation](https://github.com/bensadeghi/DecisionTree.jl).
(name = "DecisionTreeClassifier",
package_name = "DecisionTree",
is_supervised = true,
docstring = "DecisionTreeClassifier from DecisionTree.jl.\n[Documentation](https://github.com/bensadeghi/DecisionTree.jl).",
hyperparameter_types = ["Float64", "Int64", "Int64", "Int64", "Float64", "Int64", "Int64", "Bool", "Float64", "Float64"],
hyperparameters = Symbol[:pruning_purity, :max_depth, :min_samples_leaf, :min_samples_split, :min_purity_increase, :n_subfeatures, :display_depth, :post_prune, :merge_purity_threshold, :pdf_smoothing],
implemented_methods = Symbol[:fit, :predict, :clean!, :fitted_params],
is_pure_julia = true,
is_wrapper = false,
load_path = "MLJModels.DecisionTree_.DecisionTreeClassifier",
package_license = "unknown",
package_url = "https://github.com/bensadeghi/DecisionTree.jl",
package_uuid = "7806a523-6efd-50cb-b5f6-3fa6f1930dbb",
prediction_type = :probabilistic,
supports_weights = false,
input_scitype = Table{_s13} where _s13<:(AbstractArray{_s12,1} where _s12<:Continuous),
target_scitype = AbstractArray{_s709,1} where _s709<:Finite,)
See also Working with tasks on searching for models solving a specified task.
Container element types
Models in MLJ will always apply the mlj convention described in ScientificTypes.jl to decide how to interpret the elements of your container types. Here are the key aspects of that convention:
- Any AbstractFloat is interpreted as Continuous.
- Any Integer is interpreted as Count.
- Any CategoricalValue or CategoricalString, x, is interpreted as Multiclass or OrderedFactor, depending on the value of x.pool.ordered.
- Strings and Chars are not interpreted as Finite; they have Unknown scitype. Coerce vectors of strings or characters to CategoricalVectors if they represent Multiclass or OrderedFactor data. Do ?coerce and ?unpack to learn how.
- In particular, integers (including Bools) cannot be used to represent categorical data.
To coerce the scientific type of a vector or table, use the coerce method (re-exported from ScientificTypes.jl).
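For example, a minimal sketch assuming the vector method coerce(v, Multiclass) (the printed scitypes follow the mlj convention described above, under which strings have Unknown scitype):
julia> v = ["yes", "no", "yes", "maybe"];
julia> scitype(v)
AbstractArray{Unknown,1}
julia> scitype(coerce(v, Multiclass))
AbstractArray{Multiclass{3},1}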