Choosing and evaluating a model

To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions.

If you have questions or suggestions about this tutorial, please open an issue here.

Data and its interpretation

Machine type and scientific type

using RDatasets
using MLJ
iris = dataset("datasets", "iris")

first(iris, 3) |> pretty
│ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species                         │
│ Float64     │ Float64    │ Float64     │ Float64    │ CategoricalValue{String, UInt8} │
│ Continuous  │ Continuous │ Continuous  │ Continuous │ Multiclass{3}                   │
│ 5.1         │ 3.5        │ 1.4         │ 0.2        │ setosa                          │
│ 4.9         │ 3.0        │ 1.4         │ 0.2        │ setosa                          │
│ 4.7         │ 3.2        │ 1.3         │ 0.2        │ setosa                          │

Observe that below each column name two types are given: the first is the machine type and the second is the scientific type.

  • machine type: is the Julia type the data is currently encoded as, for instance Float64,

  • scientific type: is a type corresponding to how the data should be interpreted, for instance Multiclass{3}.
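Both kinds of type can also be inspected programmatically. A small sketch using scitype and schema (re-exported by MLJ from ScientificTypes.jl):

```julia
using MLJ
using RDatasets

iris = dataset("datasets", "iris")

# machine type: the Julia type of the stored elements
eltype(iris.SepalLength)    # Float64

# scientific type: how the data should be interpreted
scitype(iris.SepalLength)   # AbstractVector{Continuous}

# `schema` reports both types for every column at once
schema(iris)
```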

If you want to specify a scientific type different from the inferred one, you can do so by using the function coerce along with pairs of column names and scientific types:

iris2 = coerce(iris, :PetalWidth => OrderedFactor)
first(iris2[:, [:PetalLength, :PetalWidth]], 1) |> pretty
│ PetalLength │ PetalWidth                        │
│ Float64     │ CategoricalValue{Float64, UInt32} │
│ Continuous  │ OrderedFactor{22}                 │
│ 1.4         │ 0.2                               │
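Several columns can be coerced in a single call by passing multiple pairs. A quick sketch (the choice of columns here is purely illustrative):

```julia
using MLJ
using RDatasets

iris = dataset("datasets", "iris")

# coerce two columns at once, each to its own scientific type
iris3 = coerce(iris,
               :PetalLength => OrderedFactor,
               :PetalWidth  => OrderedFactor)
```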

Unpacking data

The function unpack helps specify the target and the input for a regression or classification task:

y, X = unpack(iris, ==(:Species))
first(X, 1) |> pretty
│ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │
│ Float64     │ Float64    │ Float64     │ Float64    │
│ Continuous  │ Continuous │ Continuous  │ Continuous │
│ 5.1         │ 3.5        │ 1.4         │ 0.2        │

The arguments after the dataframe should be understood as boolean functions over column names, each selecting part of the data. Let's look in more detail at what we used here:

  • ==(:Species) is a shorthand for the function colname -> colname == :Species, specifying that the target should be the column with name equal to :Species,

  • every remaining column is, by default, taken as input.

Let's try another one:

y, X = unpack(iris, ==(:Species), !=(:PetalLength))
first(X, 1) |> pretty
│ SepalLength │ SepalWidth │ PetalWidth │
│ Float64     │ Float64    │ Float64    │
│ Continuous  │ Continuous │ Continuous │
│ 5.1         │ 3.5        │ 0.2        │
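Since the iris rows are sorted by species, it can be useful to shuffle them while unpacking; unpack accepts shuffle and rng keyword arguments for this (a sketch, assuming a recent MLJ version):

```julia
using MLJ
using RDatasets

iris = dataset("datasets", "iris")

# shuffle the rows reproducibly while separating target and input
y, X = unpack(iris, ==(:Species); shuffle=true, rng=123)
```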

You can also use the shorthand @load_iris for such common examples:

X, y = @load_iris;

Choosing a model

In MLJ, a model is a struct storing the hyperparameters of the learning algorithm indicated by the struct's name, and nothing else.

A number of models are available in MLJ, usually thanks to external packages interfacing with MLJ (see also MLJModels.jl). To see which ones are appropriate for your data, given its scientific interpretation, you can use the function models along with the function matching; let us look specifically at models which support a probabilistic output:

for m in models(matching(X, y))
    if m.prediction_type == :probabilistic
        println(rpad(m.name, 30), "($(m.package_name))")
    end
end

AdaBoostClassifier            (ScikitLearn)
AdaBoostStumpClassifier       (DecisionTree)
BaggingClassifier             (ScikitLearn)
BayesianLDA                   (MultivariateStats)
BayesianLDA                   (ScikitLearn)
BayesianQDA                   (ScikitLearn)
BayesianSubspaceLDA           (MultivariateStats)
ConstantClassifier            (MLJModels)
DecisionTreeClassifier        (BetaML)
DecisionTreeClassifier        (DecisionTree)
DummyClassifier               (ScikitLearn)
EvoTreeClassifier             (EvoTrees)
ExtraTreesClassifier          (ScikitLearn)
GaussianNBClassifier          (NaiveBayes)
GaussianNBClassifier          (ScikitLearn)
GaussianProcessClassifier     (ScikitLearn)
GradientBoostingClassifier    (ScikitLearn)
KNNClassifier                 (NearestNeighborModels)
KNeighborsClassifier          (ScikitLearn)
KernelPerceptronClassifier    (BetaML)
LDA                           (MultivariateStats)
LGBMClassifier                (LightGBM)
LogisticCVClassifier          (ScikitLearn)
LogisticClassifier            (MLJLinearModels)
LogisticClassifier            (ScikitLearn)
MultinomialClassifier         (MLJLinearModels)
NeuralNetworkClassifier       (MLJFlux)
PegasosClassifier             (BetaML)
PerceptronClassifier          (BetaML)
ProbabilisticSGDClassifier    (ScikitLearn)
RandomForestClassifier        (BetaML)
RandomForestClassifier        (DecisionTree)
RandomForestClassifier        (ScikitLearn)
SubspaceLDA                   (MultivariateStats)
XGBoostClassifier             (XGBoost)
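As a side note, models accepts any number of Bool-valued filters on the model metadata, so the loop above can be condensed into a single query (a sketch, assuming a recent MLJ version):

```julia
using MLJ

X, y = @load_iris

# `matching(X, y)` is itself a Bool-valued filter on model metadata,
# and further filters can be passed as additional arguments
probabilistic = models(matching(X, y),
                       m -> m.prediction_type == :probabilistic)

for m in probabilistic
    println(rpad(m.name, 30), "($(m.package_name))")
end
```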

Loading a model

Most models are implemented outside of MLJ itself; you therefore have to load them using the @load macro.

Note: you must have the package providing the model's interface available in your environment (in the example below, MLJScikitLearnInterface.jl, which wraps ScikitLearn.jl), otherwise MLJ will not be able to load the model.

For instance, let's say you want to fit a K-Nearest Neighbours classifier:

knc = @load KNeighborsClassifier
import MLJScikitLearnInterface ✔

In some cases, several packages offer the same model; for instance, LinearRegressor is offered by both GLM.jl and ScikitLearn.jl, so you will need to specify the package you would like to use by adding pkg="ThePackage" in the load command:

linreg = @load LinearRegressor pkg=GLM
import MLJGLMInterface ✔
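Once loaded, a model type is instantiated and then trained by binding it to data in a machine. A minimal end-to-end sketch, assuming MLJDecisionTreeInterface.jl is available in the environment:

```julia
using MLJ

X, y = @load_iris

# `@load` returns the model *type*; call it to build an instance
# with default hyperparameters
Tree = @load DecisionTreeClassifier pkg=DecisionTree
tree = Tree()

# bind the model to data in a machine, train, and predict
mach = machine(tree, X, y)
fit!(mach)

yhat = predict(mach, X)            # probabilistic predictions
yhat_mode = predict_mode(mach, X)  # point predictions
```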