Getting Started
For installation, see Installation instructions; for a summary of MLJ syntax, see the Cheatsheet.
Plug-and-play model evaluation
To load some data, add RDatasets to your load path and enter:
julia> using RDatasets
julia> iris = dataset("datasets", "iris"); # a DataFrame
and then split the data into input and target parts:
julia> using MLJ
julia> y, X = unpack(iris, ==(:Species), colname -> true);
julia> first(X, 3)
3×4 DataFrames.DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │
│ │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────────┼────────────┼─────────────┼────────────┤
│ 1 │ 5.1 │ 3.5 │ 1.4 │ 0.2 │
│ 2 │ 4.9 │ 3.0 │ 1.4 │ 0.2 │
│ 3 │ 4.7 │ 3.2 │ 1.3 │ 0.2 │
In MLJ a model is a struct storing the hyperparameters of the learning algorithm indicated by the struct name.
Assuming the DecisionTree package is in your load path, we can use @load to load the code defining the DecisionTreeClassifier model type. This macro also returns an instance with default hyperparameters. Drop the verbosity=1 declaration for silent loading:
julia> tree_model = @load DecisionTreeClassifier verbosity=1
import MLJModels ✔
import DecisionTree ✔
import MLJModels.DecisionTree_.DecisionTreeClassifier ✔
MLJModels.DecisionTree_.DecisionTreeClassifier(pruning_purity = 1.0,
max_depth = -1,
min_samples_leaf = 1,
min_samples_split = 2,
min_purity_increase = 0.0,
n_subfeatures = 0,
display_depth = 5,
post_prune = false,
merge_purity_threshold = 0.9,
pdf_smoothing = 0.05,) @ 1…30
Important: DecisionTree and most other packages implementing machine learning algorithms for use in MLJ are not MLJ dependencies. If such a package is not in your load path you will receive an error explaining how to add the package to your current environment.
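For example, a minimal sketch using the standard Pkg API (DecisionTree is the package used below; any registered package is added to the current environment the same way):
julia> using Pkg
julia> Pkg.add("DecisionTree")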
Once loaded, a model can be evaluated with the evaluate method:
julia> evaluate(tree_model, X, y,
resampling=CV(shuffle=true), measure=cross_entropy, verbosity=0)
(measure = MLJBase.CrossEntropy[cross_entropy],
measurement = [0.360337],
per_fold = Array{Float64,1}[[0.0327898, 0.0327898, 0.524111, 0.360337, 0.360337, 0.851659]],
per_observation = Array{Array{Float64,1},1}[[[0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898 … 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898], [0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898 … 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898], [0.0327898, 0.0327898, 4.12713, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898 … 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 4.12713, 0.0327898, 4.12713, 0.0327898, 0.0327898], [0.0327898, 0.0327898, 4.12713, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 4.12713, 0.0327898 … 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898], [0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 4.12713, 0.0327898, 0.0327898, 0.0327898 … 0.0327898, 0.0327898, 0.0327898, 0.0327898, 4.12713, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898], [0.0327898, 0.0327898, 4.12713, 0.0327898, 4.12713, 0.0327898, 0.0327898, 0.0327898, 4.12713, 4.12713 … 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898]]],)
Evaluating against multiple performance measures is also possible. See Evaluating model performance for details.
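As a minimal sketch, a vector of measures can be passed to the measure keyword (here assuming the probabilistic measure BrierScore() is available from MLJBase; the trailing semicolon suppresses the output):
julia> evaluate(tree_model, X, y,
                resampling=CV(shuffle=true),
                measure=[cross_entropy, BrierScore()],
                verbosity=0);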
Fit and predict
To illustrate MLJ's fit and predict interface, let's perform the above evaluations by hand. Binding the model to data creates a machine, which will store training outcomes:
julia> tree = machine(tree_model, X, y)
Machine{DecisionTreeClassifier} @ 5…41
Training and testing on a hold-out set:
julia> train, test = partition(eachindex(y), 0.7, shuffle=true); # 70:30 split
julia> fit!(tree, rows=train);
[ Info: Training Machine{DecisionTreeClassifier} @ 5…41.
julia> yhat = predict(tree, X[test,:]);
julia> yhat[3:5]
3-element Array{UnivariateFinite{String,UInt8,Float64},1}:
UnivariateFinite(setosa=>0.9677419354838711, versicolor=>0.01612903225806452, virginica=>0.01612903225806452)
UnivariateFinite(setosa=>0.9677419354838711, versicolor=>0.01612903225806452, virginica=>0.01612903225806452)
UnivariateFinite(setosa=>0.016129032258064516, versicolor=>0.016129032258064516, virginica=>0.9677419354838709)
julia> cross_entropy(yhat, y[test]) |> mean
0.39673156168717766
Notice that yhat is a vector of Distribution objects (because DecisionTreeClassifier makes probabilistic predictions). The methods of the Distributions package can be applied to such distributions:
julia> broadcast(pdf, yhat[3:5], "virginica") # predicted probabilities of virginica
3-element Array{Float64,1}:
0.01612903225806452
0.01612903225806452
0.9677419354838709
julia> mode.(yhat[3:5])
3-element Array{CategoricalArrays.CategoricalString{UInt8},1}:
"setosa"
"setosa"
"virginica"
One can explicitly get modes by using predict_mode instead of predict:
julia> predict_mode(tree, rows=test[3:5])
3-element Array{CategoricalArrays.CategoricalString{UInt8},1}:
"setosa"
"setosa"
"virginica"
Machines have an internal state which allows them, under certain conditions, to avoid redundant calculations when retrained - for example, when increasing the number of trees in a random forest, or the number of epochs in a neural network. The machine-building syntax also anticipates a more general syntax for composing multiple models, as explained in Composing Models.
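For example, a minimal sketch of this behaviour (log messages omitted): refitting an up-to-date machine skips the computation, while the force keyword of fit! retrains unconditionally:
julia> fit!(tree, rows=train);              # machine already trained above: training is skipped
julia> fit!(tree, rows=train, force=true);  # force retraining from scratch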
There is a version of evaluate for machines as well as models:
julia> evaluate!(tree, resampling=Holdout(fraction_train=0.5, shuffle=true),
measure=cross_entropy,
verbosity=0)
(measure = MLJBase.CrossEntropy[cross_entropy],
measurement = [0.141972],
per_fold = Array{Float64,1}[[0.141972]],
per_observation = Array{Array{Float64,1},1}[[[0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 4.12713, 0.0327898, 0.0327898, 0.0327898 … 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898]]],)
Changing a hyperparameter and re-evaluating:
julia> tree_model.max_depth = 3
3
julia> evaluate!(tree, resampling=Holdout(fraction_train=0.5, shuffle=true),
measure=cross_entropy,
verbosity=0)
(measure = MLJBase.CrossEntropy[cross_entropy],
measurement = [0.185266],
per_fold = Array{Float64,1}[[0.185266]],
per_observation = Array{Array{Float64,1},1}[[[0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 1.11514, 0.0327898 … 0.0327898, 0.0327898, 0.0327898, 1.11514, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898, 0.0327898]]],)
Next steps
To learn a little more about what MLJ can do, take the MLJ tour, and then return to the manual as needed. Read at least the remainder of this page before considering serious use of MLJ.
Prerequisites
MLJ assumes some familiarity with the CategoricalValue and CategoricalString types from CategoricalArrays.jl, used here for representing categorical data. For probabilistic predictors, a basic acquaintance with Distributions.jl is also assumed.
Data containers and scientific types
The MLJ user should acquaint themselves with some basic assumptions about the form of data expected by MLJ, as outlined below.
machine(model::Supervised, X, y)
machine(model::Unsupervised, X)
Each supervised model in MLJ declares the permitted scientific type of the inputs X and targets y that can be bound to it in the first constructor above, rather than specifying specific machine types (such as Array{Float32,2}). Similar remarks apply to the input X of an unsupervised model. Scientific types are Julia types defined in the package ScientificTypes.jl, which also defines the convention used here (and there called mlj) for assigning a specific scientific type (interpretation) to each Julia object (see the scitype examples below).
The basic "scalar" scientific types are Continuous, Multiclass{N}, OrderedFactor{N} and Count. Be sure you read Container element types below to guarantee your scalar data is interpreted correctly. Additionally, most data containers - such as tuples, vectors, matrices and tables - have a scientific type.
Figure 1. Part of the scientific type hierarchy in ScientificTypes.jl.
julia> scitype(4.6)
Continuous
julia> scitype(42)
Count
julia> x1 = categorical(["yes", "no", "yes", "maybe"]);
julia> scitype(x1)
AbstractArray{Multiclass{3},1}
julia> X = (x1=x1, x2=rand(4), x3=rand(4)) # a "column table"
(x1 = CategoricalArrays.CategoricalString{UInt32}["yes", "no", "yes", "maybe"],
x2 = [0.821551, 0.35096, 0.36934, 0.203404],
x3 = [0.206296, 0.880461, 0.534247, 0.119658],)
julia> scitype(X)
Table{Union{AbstractArray{Continuous,1}, AbstractArray{Multiclass{3},1}}}
Tabular data
All data containers compatible with the Tables.jl interface (which includes all source formats listed here) have the scientific type Table{K}, where K depends on the scientific types of the columns, which can be individually inspected using schema:
julia> schema(X)
(names = (:x1, :x2, :x3),
types = (CategoricalArrays.CategoricalString{UInt32}, Float64, Float64),
scitypes = (Multiclass{3}, Continuous, Continuous),
nrows = 4,)
Inputs
Since an MLJ model only specifies the scientific type of data, if that type is Table - which is the case for the majority of MLJ models - then any Tables.jl format is permitted. However, the Tables.jl API excludes matrices. If Xmatrix is a matrix, convert it to a column table using X = MLJ.table(Xmatrix).
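For example, a minimal sketch (Xcols is a hypothetical name; the displayed scitype assumes every column is interpreted as Continuous):
julia> Xmatrix = rand(4, 2);        # a plain matrix is not a Tables.jl table
julia> Xcols = MLJ.table(Xmatrix);  # wrap the matrix as a column table
julia> scitype(Xcols)
Table{AbstractArray{Continuous,1}}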
Specifically, the requirement for an arbitrary model's input is scitype(X) <: input_scitype(model).
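For instance, one can verify this requirement directly (a sketch assuming X is the iris input table loaded above, whose columns are all Continuous):
julia> scitype(X) <: input_scitype(tree_model)
true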
Targets
The target y expected by MLJ models is generally an AbstractVector. A multivariate target y will generally be a vector of tuples. The tuples need not have uniform length, so some forms of sequence prediction are supported. Only the element types of y matter (the types of y[j] for each j).
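As a purely illustrative sketch (y_seq is hypothetical, and no particular model is implied to support it), such a target could look like:
julia> y_seq = [(:a, :b), (:a,), (:b, :a, :a)];  # tuples of non-uniform length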
Specifically, the type requirement for a model target is scitype(y) <: target_scitype(model).
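Similarly, a sketch assuming y is the iris target from above (a categorical vector, so its elements have Finite scitype):
julia> scitype(y) <: target_scitype(tree_model)
true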
Querying a model for acceptable data types
Given a model instance, one can inspect the admissible scientific types of its input and target by querying the scientific type of the model itself:
julia> tree = DecisionTreeClassifier();
julia> scitype(tree)
(input_scitype = ScientificTypes.Table{#s13} where #s13<:(AbstractArray{#s12,1} where #s12<:Continuous),
target_scitype = AbstractArray{#s21,1} where #s21<:Finite,
is_probabilistic = true,)
This does not work if relevant model code has not been loaded. In that case one can extract this information from the model type's registry entry, using info:
julia> info("DecisionTreeClassifier")
DecisionTreeClassifier from DecisionTree.jl.
[Documentation](https://github.com/bensadeghi/DecisionTree.jl).
(name = "DecisionTreeClassifier",
package_name = "DecisionTree",
is_supervised = true,
docstring = "DecisionTreeClassifier from DecisionTree.jl.\n[Documentation](https://github.com/bensadeghi/DecisionTree.jl).",
hyperparameter_types = ["Float64", "Int64", "Int64", "Int64", "Float64", "Int64", "Int64", "Bool", "Float64", "Float64"],
hyperparameters = Symbol[:pruning_purity, :max_depth, :min_samples_leaf, :min_samples_split, :min_purity_increase, :n_subfeatures, :display_depth, :post_prune, :merge_purity_threshold, :pdf_smoothing],
implemented_methods = Symbol[:fit, :predict, :clean!, :fitted_params],
is_pure_julia = true,
is_wrapper = false,
load_path = "MLJModels.DecisionTree_.DecisionTreeClassifier",
package_license = "unknown",
package_url = "https://github.com/bensadeghi/DecisionTree.jl",
package_uuid = "7806a523-6efd-50cb-b5f6-3fa6f1930dbb",
prediction_type = :probabilistic,
supports_weights = false,
input_scitype = Table{_s13} where _s13<:(AbstractArray{_s12,1} where _s12<:Continuous),
target_scitype = AbstractArray{_s709,1} where _s709<:Finite,)
See also Working with tasks on searching for models solving a specified task.
Container element types
Models in MLJ will always apply the mlj convention described in ScientificTypes.jl to decide how to interpret the elements of your container types. Here are the key aspects of that convention:
- Any AbstractFloat is interpreted as Continuous.
- Any Integer is interpreted as Count.
- Any CategoricalValue or CategoricalString, x, is interpreted as Multiclass or OrderedFactor, depending on the value of x.pool.ordered.
- Strings and Chars are not interpreted as Finite; they have Unknown scitype. Coerce vectors of strings or characters to CategoricalVectors if they represent Multiclass or OrderedFactor data. Do ?coerce and ?unpack to learn how.
- In particular, integers (including Bools) cannot be used to represent categorical data.
To coerce the scientific type of a vector or table, use the coerce method (re-exported from ScientificTypes.jl).
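For example, a minimal sketch assuming the vector method coerce(v, Multiclass) (the printed scitypes follow the mlj convention described above, under which strings have Unknown scitype):
julia> v = ["yes", "no", "yes", "maybe"];
julia> scitype(v)
AbstractArray{Unknown,1}
julia> scitype(coerce(v, Multiclass))
AbstractArray{Multiclass{3},1}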