Choosing and evaluating a model
To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions. If you have questions or suggestions about this tutorial, please open an issue here.
using RDatasets
using MLJ
iris = dataset("datasets", "iris")
first(iris, 3) |> pretty
┌─────────────┬────────────┬─────────────┬────────────┬─────────────────────────────────┐
│ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species                         │
│ Float64     │ Float64    │ Float64     │ Float64    │ CategoricalValue{String, UInt8} │
│ Continuous  │ Continuous │ Continuous  │ Continuous │ Multiclass{3}                   │
├─────────────┼────────────┼─────────────┼────────────┼─────────────────────────────────┤
│ 5.1         │ 3.5        │ 1.4         │ 0.2        │ setosa                          │
│ 4.9         │ 3.0        │ 1.4         │ 0.2        │ setosa                          │
│ 4.7         │ 3.2        │ 1.3         │ 0.2        │ setosa                          │
└─────────────┴────────────┴─────────────┴────────────┴─────────────────────────────────┘
Observe that below each column name there are two types given: the first one is the machine type and the second one is the scientific type.

- machine type: the Julia type the data is currently encoded as, for instance Float64,
- scientific type: a type corresponding to how the data should be interpreted, for instance Multiclass{3}.
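Both types can also be queried programmatically. A minimal sketch, using the same iris table as above (scitype reports the scientific type of a value or vector, and schema summarizes a whole table):

```julia
using RDatasets, MLJ

iris = dataset("datasets", "iris")

# scientific type of a single entry (its machine type is Float64)
scitype(iris.SepalLength[1])   # Continuous

# name, scitype and machine type of every column at once
schema(iris)
```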
If you want to specify a different scientific type than the one inferred, you can do so by using the function coerce
along with pairs of column names and scientific types:
iris2 = coerce(iris, :PetalWidth => OrderedFactor)
first(iris2[:, [:PetalLength, :PetalWidth]], 1) |> pretty
┌─────────────┬───────────────────────────────────┐
│ PetalLength │ PetalWidth │
│ Float64 │ CategoricalValue{Float64, UInt32} │
│ Continuous │ OrderedFactor{22} │
├─────────────┼───────────────────────────────────┤
│ 1.4 │ 0.2 │
└─────────────┴───────────────────────────────────┘
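Several name => scitype pairs can be passed to coerce in one call; a sketch, again assuming the iris table:

```julia
using RDatasets, MLJ

iris = dataset("datasets", "iris")

# coerce several columns at once
iris3 = coerce(iris, :PetalLength => OrderedFactor,
                     :PetalWidth  => OrderedFactor)

# element scitype of a coerced column: an OrderedFactor with
# one level per distinct value observed in the column
elscitype(iris3.PetalWidth)
```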
The function unpack helps specify the target and the input for a regression or classification task:
y, X = unpack(iris, ==(:Species))
first(X, 1) |> pretty
┌─────────────┬────────────┬─────────────┬────────────┐
│ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │
│ Float64     │ Float64    │ Float64     │ Float64    │
│ Continuous  │ Continuous │ Continuous  │ Continuous │
├─────────────┼────────────┼─────────────┼────────────┤
│ 5.1         │ 3.5        │ 1.4         │ 0.2        │
└─────────────┴────────────┴─────────────┴────────────┘
The two arguments after the dataframe should be understood as functions over column names, specifying the target and the input data respectively. Let's look in more detail at what we used here:

- ==(:Species) is a shorthand specifying that the target should be the column whose name equals :Species,
- colname -> true indicates that every other column is to be taken as input.
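Since the selectors are just functions of the column name, any predicate works. For instance, a hypothetical selection keeping only the petal measurements as input could look like:

```julia
using RDatasets, MLJ

iris = dataset("datasets", "iris")

# target = :Species; inputs = columns whose name starts with "Petal"
y, X = unpack(iris, ==(:Species), name -> startswith(string(name), "Petal"))
```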
Let's try another one:
y, X = unpack(iris, ==(:Species), !=(:PetalLength))
first(X, 1) |> pretty
┌─────────────┬────────────┬────────────┐
│ SepalLength │ SepalWidth │ PetalWidth │
│ Float64     │ Float64    │ Float64    │
│ Continuous  │ Continuous │ Continuous │
├─────────────┼────────────┼────────────┤
│ 5.1         │ 3.5        │ 0.2        │
└─────────────┴────────────┴────────────┘
You can also use the shorthand @load_iris
for such common examples:
X, y = @load_iris;
In MLJ, a model is a struct storing the hyperparameters of the learning algorithm indicated by the struct name (and only that).
A number of models are available in MLJ, usually thanks to external packages interfacing with MLJ (see also MLJModels.jl). In order to see which ones are appropriate for the data you have and its scientific interpretation, you can use the function models along with the function matching; let us look specifically at models which support a probabilistic output:
for m in models(matching(X, y))
    if m.prediction_type == :probabilistic
        println(rpad(m.name, 30), "($(m.package_name))")
    end
end
AdaBoostClassifier (MLJScikitLearnInterface)
AdaBoostStumpClassifier (DecisionTree)
BaggingClassifier (MLJScikitLearnInterface)
BayesianLDA (MLJScikitLearnInterface)
BayesianLDA (MultivariateStats)
BayesianQDA (MLJScikitLearnInterface)
BayesianSubspaceLDA (MultivariateStats)
CatBoostClassifier (CatBoost)
ConstantClassifier (MLJModels)
DecisionTreeClassifier (BetaML)
DecisionTreeClassifier (DecisionTree)
DummyClassifier (MLJScikitLearnInterface)
EvoTreeClassifier (EvoTrees)
ExtraTreesClassifier (MLJScikitLearnInterface)
GaussianNBClassifier (MLJScikitLearnInterface)
GaussianNBClassifier (NaiveBayes)
GaussianProcessClassifier (MLJScikitLearnInterface)
GradientBoostingClassifier (MLJScikitLearnInterface)
HistGradientBoostingClassifier(MLJScikitLearnInterface)
KNNClassifier (NearestNeighborModels)
KNeighborsClassifier (MLJScikitLearnInterface)
KernelPerceptronClassifier (BetaML)
LDA (MultivariateStats)
LGBMClassifier (LightGBM)
LogisticCVClassifier (MLJScikitLearnInterface)
LogisticClassifier (MLJLinearModels)
LogisticClassifier (MLJScikitLearnInterface)
MultinomialClassifier (MLJLinearModels)
NeuralNetworkClassifier (BetaML)
NeuralNetworkClassifier (MLJFlux)
PegasosClassifier (BetaML)
PerceptronClassifier (BetaML)
ProbabilisticNuSVC (LIBSVM)
ProbabilisticSGDClassifier (MLJScikitLearnInterface)
ProbabilisticSVC (LIBSVM)
RandomForestClassifier (BetaML)
RandomForestClassifier (DecisionTree)
RandomForestClassifier (MLJScikitLearnInterface)
StableForestClassifier (SIRUS)
StableRulesClassifier (SIRUS)
SubspaceLDA (MultivariateStats)
XGBoostClassifier (XGBoost)
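Besides matching on data, the model registry can also be searched by a name fragment, or filtered with an arbitrary predicate; a sketch (the do-block reproduces the loop above):

```julia
using MLJ

X, y = @load_iris

# search the registry by name fragment
classifiers = models("Classifier")

# or pass an arbitrary predicate over model metadata
probabilistic = models() do m
    matching(m, X, y) && m.prediction_type == :probabilistic
end
```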
Most models are implemented outside of the MLJ ecosystem; you therefore have to load models using the @load command.
Note: you must have the package from which the model is loaded available in your environment (in the example below, MLJScikitLearnInterface.jl), otherwise MLJ will not be able to load the model.
For instance, let's say you want to fit a K-Nearest Neighbours classifier:
knc = @load KNeighborsClassifier
import MLJScikitLearnInterface ✔
MLJScikitLearnInterface.KNeighborsClassifier
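Note that @load returns the model type; to obtain an actual model (i.e. a concrete set of hyperparameters) you instantiate it. A minimal sketch, assuming MLJScikitLearnInterface.jl is in your environment:

```julia
using MLJ

# bind the returned type to a name, then instantiate it
KNC = @load KNeighborsClassifier pkg=MLJScikitLearnInterface verbosity=0
knc = KNC()   # an instance with default hyperparameters
```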
In some cases, there may be several packages offering the same model; for instance, LinearRegressor is offered by both GLM.jl and ScikitLearn.jl, so you will need to specify the package you would like to use by adding pkg="ThePackage" to the load command:
linreg = @load LinearRegressor pkg=GLM
import MLJGLMInterface ✔
MLJGLMInterface.LinearRegressor
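Once loaded and instantiated, a model can be bound to data with a machine, trained, and evaluated. A sketch on toy regression data, assuming MLJGLMInterface.jl is installed (the data and hyperparameters here are illustrative only):

```julia
using MLJ

# toy regression data: a noisy linear relationship
X = (x1 = rand(100), x2 = rand(100))
y = 2 .* X.x1 .- X.x2 .+ 0.1 .* randn(100)

LinReg = @load LinearRegressor pkg=GLM verbosity=0
linreg = LinReg()

# bind model and data, then train
mach = machine(linreg, X, y)
fit!(mach, verbosity=0)

# point predictions from this probabilistic model
ŷ = predict_mean(mach, X)

# or estimate out-of-sample performance directly
evaluate(linreg, X, y, resampling=CV(nfolds=3), measure=rms, verbosity=0)
```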