MLJ for Data Scientists in Two Hours
An end-to-end example using the Telco Churn dataset
To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions. If you have questions or suggestions about this tutorial, please open an issue here.
- Summary of methods and types introduced
- Warm up: Building a model for the iris dataset
- Getting the Telco data
- Type coercion
- Preparing a holdout set for final testing
- Splitting data into target and features
- Loading a model and checking type requirements
- Building a model pipeline to incorporate feature encoding
- Evaluating the pipeline model's performance
- Filtering out unimportant features
- Wrapping our iterative model in control strategies
- Hyper-parameter optimization (model tuning)
- Saving our model
- Final performance estimate
- Testing the final model
An application of the MLJ toolbox to the Telco Customer Churn dataset, aimed at practicing data scientists new to MLJ (Machine Learning in Julia). This tutorial does not cover exploratory data analysis.
MLJ is a general machine learning toolbox (i.e., not just deep learning).
For other MLJ learning resources see the Learning MLJ section of the manual.
Topics covered: Grabbing and preparing a dataset, basic fit/predict workflow, constructing a pipeline to include data pre-processing, estimating performance metrics, ROC curves, confusion matrices, feature importance, basic feature selection, controlling iterative models, hyper-parameter optimization (tuning).
Prerequisites for this tutorial. Previous experience building, evaluating, and optimizing machine learning models using scikit-learn, caret, mlr, Weka, or a similar tool. No previous experience with MLJ. Only fairly basic familiarity with Julia is required. Uses DataFrames.jl, but in a minimal way (this cheatsheet may help).
Time. Between two and three hours, first time through.
code | purpose |
---|---|
OpenML.load(id) | grab a dataset from OpenML.org |
scitype(X) | inspect the scientific type (scitype) of object X |
schema(X) | inspect the column scitypes (scientific types) of a table X |
coerce(X, ...) | fix column encodings to get appropriate scitypes |
partition(data, frac1, frac2, ...; rng=...) | vertically split data, which can be a table, vector, or matrix |
unpack(table, f1, f2, ...) | horizontally split table based on conditions f1, f2, ..., applied to column names |
@load ModelType pkg=... | load code defining a model type |
input_scitype(model) | inspect the scitype that a model requires for features (inputs) |
target_scitype(model) | inspect the scitype that a model requires for the target (labels) |
ContinuousEncoder | built-in model type for re-encoding all features as Continuous |
model1 \|> model2 \|> ... | combine multiple models into a pipeline |
measures("under curve") | list all measures (metrics) with string "under curve" in documentation |
accuracy(yhat, y) | compute accuracy of predictions yhat against ground truth observations y |
auc(yhat, y), brier_loss(yhat, y) | evaluate two probabilistic measures (yhat a vector of probability distributions) |
machine(model, X, y) | bind model to training data X (features) and y (target) |
fit!(mach, rows=...) | train machine using specified rows (observation indices) |
predict(mach, rows=...) | make in-sample model predictions given specified rows |
predict(mach, Xnew) | make predictions given new features Xnew |
fitted_params(mach) | inspect learned parameters |
report(mach) | inspect other outcomes of training |
feature_importances(mach) | inspect feature importances, where reported |
confmat(yhat, y) | confusion matrix for predictions yhat and ground truth y |
roc(yhat, y) | compute points on the receiver operating characteristic (ROC) curve |
StratifiedCV(nfolds=6) | 6-fold stratified cross-validation resampling strategy |
Holdout(fraction_train=0.7) | holdout resampling strategy |
evaluate(model, X, y; resampling=..., options...) | estimate performance metrics for model using the data X, y |
FeatureSelector() | transformer for selecting features |
Step(3) | iteration control for stepping 3 iterations |
NumberSinceBest(6), TimeLimit(60/5), InvalidValue() | stopping-criterion iteration controls |
IteratedModel(model=..., controls=..., options...) | wrap an iterative model in controls |
range(model, :some_hyperparam, lower=..., upper=...) | define a numeric range |
RandomSearch() | random search tuning strategy |
TunedModel(model=..., tuning=..., options...) | wrap the supervised model in specified tuning strategy |
Before turning to the Telco Customer Churn dataset, we very quickly build a predictive model for Fisher's well-known iris data set, as a way of introducing the main actors in any MLJ workflow. Details that you don't fully grasp should become clearer in the Telco study.
This section is a condensed adaptation of the Getting Started example in the MLJ documentation.
First, using the built-in iris dataset, we load and inspect the features X_iris (a table) and target variable y_iris (a vector):
using MLJ
X_iris, y_iris = @load_iris;
schema(X_iris)
┌──────────────┬────────────┬─────────┐
│ names │ scitypes │ types │
├──────────────┼────────────┼─────────┤
│ sepal_length │ Continuous │ Float64 │
│ sepal_width │ Continuous │ Float64 │
│ petal_length │ Continuous │ Float64 │
│ petal_width │ Continuous │ Float64 │
└──────────────┴────────────┴─────────┘
y_iris[1:4]
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"setosa"
"setosa"
"setosa"
"setosa"
levels(y_iris)
3-element Vector{String}:
"setosa"
"versicolor"
"virginica"
We load a decision tree model, from the package DecisionTree.jl:
DecisionTree = @load DecisionTreeClassifier pkg=DecisionTree # model type
model = DecisionTree(min_samples_split=5) # model instance
import MLJDecisionTreeInterface ✔
DecisionTreeClassifier(
max_depth = -1,
min_samples_leaf = 1,
min_samples_split = 5,
min_purity_increase = 0.0,
n_subfeatures = 0,
post_prune = false,
merge_purity_threshold = 1.0,
display_depth = 5,
feature_importance = :impurity,
rng = Random._GLOBAL_RNG())
In MLJ, a model is just a container for hyper-parameters of some learning algorithm. It does not store learned parameters.
Next, we bind the model together with the available data in what's called a machine:
mach = machine(model, X_iris, y_iris)
untrained Machine; caches model-specific representations of data
model: DecisionTreeClassifier(max_depth = -1, …)
args:
1: Source @683 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
2: Source @738 ⏎ AbstractVector{ScientificTypesBase.Multiclass{3}}
A machine is essentially just a model (ie, hyper-parameters) plus data, but it additionally stores learned parameters (the tree) once it is trained on some view of the data:
train_rows = vcat(1:60, 91:150); # some row indices (observations are rows not columns)
fit!(mach, rows=train_rows)
fitted_params(mach)
(tree = DecisionTree.InfoNode{Float64, UInt32}(Decision Tree
Leaves: 5
Depth: 3, nchildren=2),
raw_tree = Decision Tree
Leaves: 5
Depth: 3,
encoding = Dict{UInt32, CategoricalArrays.CategoricalValue{String, UInt32}}(0x00000002 => "versicolor", 0x00000003 => "virginica", 0x00000001 => "setosa"),
features = [:sepal_length, :sepal_width, :petal_length, :petal_width],)
A machine stores some other information enabling warm restart for some models, but we won't go into that here. You are allowed to access and mutate the model parameter:
mach.model.min_samples_split = 10
fit!(mach, rows=train_rows) # re-train with new hyper-parameter
trained Machine; caches model-specific representations of data
model: DecisionTreeClassifier(max_depth = -1, …)
args:
1: Source @683 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
2: Source @738 ⏎ AbstractVector{ScientificTypesBase.Multiclass{3}}
Now we can make predictions on some other view of the data, as in
predict(mach, rows=71:73)
3-element CategoricalDistributions.UnivariateFiniteVector{ScientificTypesBase.Multiclass{3}, String, UInt32, Float64}:
UnivariateFinite{ScientificTypesBase.Multiclass{3}}(setosa=>0.0, versicolor=>0.0, virginica=>1.0)
UnivariateFinite{ScientificTypesBase.Multiclass{3}}(setosa=>0.0, versicolor=>1.0, virginica=>0.0)
UnivariateFinite{ScientificTypesBase.Multiclass{3}}(setosa=>0.0, versicolor=>0.25, virginica=>0.75)
or on completely new data, as in
Xnew = (sepal_length = [5.1, 6.3],
sepal_width = [3.0, 2.5],
petal_length = [1.4, 4.9],
petal_width = [0.3, 1.5])
yhat = predict(mach, Xnew)
2-element CategoricalDistributions.UnivariateFiniteVector{ScientificTypesBase.Multiclass{3}, String, UInt32, Float64}:
UnivariateFinite{ScientificTypesBase.Multiclass{3}}(setosa=>1.0, versicolor=>0.0, virginica=>0.0)
UnivariateFinite{ScientificTypesBase.Multiclass{3}}(setosa=>0.0, versicolor=>0.25, virginica=>0.75)
These are probabilistic predictions which can be manipulated using a widely adopted interface defined in the Distributions.jl package. For example, we can get raw probabilities like this:
pdf.(yhat, "virginica")
2-element Vector{Float64}:
0.0
0.75
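To collapse such predictions to point predictions, we can broadcast mode over the distributions (a quick sketch; predict_mode combines both steps):
mode.(yhat)              # most probable class for each observation
predict_mode(mach, Xnew) # same result, without materializing distributions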
A single prediction is displayed like this:
yhat[2]
UnivariateFinite{ScientificTypesBase.Multiclass{3}}(setosa=>0.0, versicolor=>0.25, virginica=>0.75)
We now turn to the Telco dataset.
import DataFrames
data = OpenML.load(42178) # data set from OpenML.org
df0 = DataFrames.DataFrame(data)
first(df0, 4)
4×21 DataFrame
Row │ customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
│ String String Float64 String String Float64 String String String String String String String String String String String String Float64 String String
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 7590-VHVEG Female 0.0 Yes No 1.0 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
2 │ 5575-GNVDE Male 0.0 No No 34.0 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.5 No
3 │ 3668-QPYBK Male 0.0 No No 2.0 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
4 │ 7795-CFOCW Male 0.0 No No 45.0 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.3 1840.75 No
The object of this tutorial is to build and evaluate supervised learning models to predict the Churn variable, a binary variable measuring customer retention, based on the other variables that are relevant.
In the table, observations correspond to rows, and features to columns, which is the convention for representing all two-dimensional data in MLJ.
Introduces: scitype, schema, coerce
A "scientific type" or scitype indicates how MLJ will interpret data. For example, typeof(3.14) == Float64, while scitype(3.14) == Continuous and also scitype(3.14f0) == Continuous. In MLJ, model data requirements are articulated using scitypes.
Here are the common "scalar" scitypes: Continuous (floating-point numbers), Count (integers), Textual (strings), Multiclass{N} and OrderedFactor{N} (categorical values with N levels), and Missing.
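To see a few of these in action (a quick sketch, with the expected results shown as comments):
scitype(3.14)   # Continuous
scitype(42)     # Count
scitype("cat")  # Textual
scitype(coerce(["low", "high"], OrderedFactor)[1])  # OrderedFactor{2}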
There are also container scitypes. For example, the scitype of any N-dimensional array is AbstractArray{S, N}, where S is the scitype of the elements:
scitype(["cat", "mouse", "dog"])
AbstractVector{Textual} (alias for AbstractArray{ScientificTypesBase.Textual, 1})
The schema operator summarizes the column scitypes of a table:
schema(df0) |> DataFrames.DataFrame # converted to DataFrame for better display
21×3 DataFrame
Row │ names scitypes types
│ Symbol DataType DataType
─────┼────────────────────────────────────────
1 │ customerID Textual String
2 │ gender Textual String
3 │ SeniorCitizen Continuous Float64
4 │ Partner Textual String
5 │ Dependents Textual String
6 │ tenure Continuous Float64
7 │ PhoneService Textual String
8 │ MultipleLines Textual String
9 │ InternetService Textual String
10 │ OnlineSecurity Textual String
11 │ OnlineBackup Textual String
12 │ DeviceProtection Textual String
13 │ TechSupport Textual String
14 │ StreamingTV Textual String
15 │ StreamingMovies Textual String
16 │ Contract Textual String
17 │ PaperlessBilling Textual String
18 │ PaymentMethod Textual String
19 │ MonthlyCharges Continuous Float64
20 │ TotalCharges Textual String
21 │ Churn Textual String
All of the fields being interpreted as Textual are really something else, either Multiclass or, in the case of TotalCharges, Continuous. In fact, TotalCharges is mostly floats wrapped as strings. However, it needs special treatment because some elements consist of a single space, " ", which we'll treat as "0.0".
fix_blanks(v) = map(v) do x
if x == " "
return "0.0"
else
return x
end
end
df0.TotalCharges = fix_blanks(df0.TotalCharges);
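Equivalently, as a minimal alternative, Base Julia's vectorized replace collapses this into one line:
df0.TotalCharges = replace(df0.TotalCharges, " " => "0.0");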
Coercing the TotalCharges type to ensure a Continuous scitype:
coerce!(df0, :TotalCharges => Continuous);
Coercing all remaining Textual data to Multiclass:
coerce!(df0, Textual => Multiclass);
Finally, we'll coerce our target variable Churn to be OrderedFactor, rather than Multiclass, to enable a reliable interpretation of metrics like "true positive rate". By convention, the first class is the negative one:
coerce!(df0, :Churn => OrderedFactor)
levels(df0.Churn) # to check order
2-element Vector{String}:
"No"
"Yes"
Re-inspecting the scitypes:
schema(df0) |> DataFrames.DataFrame
21×3 DataFrame
Row │ names scitypes types
│ Symbol DataType DataType
─────┼──────────────────────────────────────────────────────────────────────
1 │ customerID Multiclass{7043} CategoricalValue{String, UInt32}
2 │ gender Multiclass{2} CategoricalValue{String, UInt32}
3 │ SeniorCitizen Continuous Float64
4 │ Partner Multiclass{2} CategoricalValue{String, UInt32}
5 │ Dependents Multiclass{2} CategoricalValue{String, UInt32}
6 │ tenure Continuous Float64
7 │ PhoneService Multiclass{2} CategoricalValue{String, UInt32}
8 │ MultipleLines Multiclass{3} CategoricalValue{String, UInt32}
9 │ InternetService Multiclass{3} CategoricalValue{String, UInt32}
10 │ OnlineSecurity Multiclass{3} CategoricalValue{String, UInt32}
11 │ OnlineBackup Multiclass{3} CategoricalValue{String, UInt32}
12 │ DeviceProtection Multiclass{3} CategoricalValue{String, UInt32}
13 │ TechSupport Multiclass{3} CategoricalValue{String, UInt32}
14 │ StreamingTV Multiclass{3} CategoricalValue{String, UInt32}
15 │ StreamingMovies Multiclass{3} CategoricalValue{String, UInt32}
16 │ Contract Multiclass{3} CategoricalValue{String, UInt32}
17 │ PaperlessBilling Multiclass{2} CategoricalValue{String, UInt32}
18 │ PaymentMethod Multiclass{4} CategoricalValue{String, UInt32}
19 │ MonthlyCharges Continuous Float64
20 │ TotalCharges Continuous Float64
21 │ Churn OrderedFactor{2} CategoricalValue{String, UInt32}
Introduces: partition
To reduce training times for the purposes of this tutorial, we're going to dump 90% of observations (after shuffling) and split off 30% of the remainder for use as a lock-and-throw-away-the-key holdout set.
import Random.Xoshiro
rng = Xoshiro(123)
df, df_test, df_dumped = partition(df0, 0.07, 0.03; # in ratios 7:3:90
stratify=df0.Churn,
rng=rng);
The reader interested in including all data can instead do df, df_test = partition(df0, 0.7; rng=rng).
We have included the option stratify=df0.Churn to ensure the Churn classes have similar distributions in df and df_test.
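As a quick sanity check (a sketch, nothing more), the "Yes" proportions in the two splits should come out nearly identical:
proportion_yes(v) = count(==("Yes"), v) / length(v)
proportion_yes(df.Churn), proportion_yes(df_test.Churn)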
Introduces: unpack
In the following call, the column with name Churn is copied over to a vector y, and every remaining column, except customerID (which contains no useful information), goes into a table X. Here Churn is the target variable for which we seek predictions, given new versions of the features X.
y, X = unpack(df, ==(:Churn), !=(:customerID));
schema(X).names
(:gender, :SeniorCitizen, :Partner, :Dependents, :tenure, :PhoneService, :MultipleLines, :InternetService, :OnlineSecurity, :OnlineBackup, :DeviceProtection, :TechSupport, :StreamingTV, :StreamingMovies, :Contract, :PaperlessBilling, :PaymentMethod, :MonthlyCharges, :TotalCharges)
intersect([:Churn, :customerID], schema(X).names)
Symbol[]
We'll do the same for the holdout data:
ytest, Xtest = unpack(df_test, ==(:Churn), !=(:customerID));
Introduces: @load, input_scitype, target_scitype
For tools helping us to identify suitable models, see the Model Search section of the manual. We will build a gradient tree-boosting model, a popular first choice for structured data like ours. Model code is contained in a third-party package called EvoTrees.jl, which is loaded as follows:
Booster = @load EvoTreeClassifier pkg=EvoTrees
import EvoTrees ✔
EvoTrees.EvoTreeClassifier
Recall that a model is just a container for some algorithm's hyper-parameters. Let's create a Booster with default values for all the hyper-parameters, except its RNG, which we'll take to be a Xoshiro because it is thread-safe:
booster = Booster(rng=Xoshiro(123))
EvoTreeClassifier(
nrounds = 100,
L2 = 0.0,
lambda = 0.0,
gamma = 0.0,
eta = 0.1,
max_depth = 6,
min_weight = 1.0,
rowsample = 1.0,
colsample = 1.0,
nbins = 64,
alpha = 0.5,
tree_type = "binary",
rng = Random.Xoshiro(0xfefa8d41b8f5dca5, 0xf80cc98e147960c1, 0x20e2ccc17662fc1d, 0xea7a7dcb2e787c01, 0xf4e85a418b9c4f80))
This model is appropriate for the kind of target variable we have, as the following passing test confirms:
scitype(y) <: target_scitype(booster)
true
Our features X are also compatible with booster:
scitype(X) <: input_scitype(booster)
true
The majority of MLJ supervised models expect all features to be Continuous (and this used to be true for an earlier version of the EvoTrees.jl models). For the sake of illustration, we will pretend this is true here, and introduce our own categorical feature encoding, discussed next.
To see a list of all models that directly support the data (X, y) we can do this:
models(matching(X, y))
7-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :constructor, :deep_properties, :docstring, :fit_data_scitype, :human_name, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :reporting_operations, :reports_feature_importances, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :transform_scitype, :input_scitype, :target_scitype, :output_scitype)}}:
(name = CatBoostClassifier, package_name = CatBoost, ... )
(name = ConstantClassifier, package_name = MLJModels, ... )
(name = DecisionTreeClassifier, package_name = BetaML, ... )
(name = DeterministicConstantClassifier, package_name = MLJModels, ... )
(name = EvoTreeClassifier, package_name = EvoTrees, ... )
(name = LinearBinaryClassifier, package_name = GLM, ... )
(name = RandomForestClassifier, package_name = BetaML, ... )
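To read the detailed document string of any registered model before loading its defining code, one can use MLJ's doc function; for example (a sketch):
doc("EvoTreeClassifier", pkg="EvoTrees")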
Introduces: ContinuousEncoder, pipeline operator |>
The built-in ContinuousEncoder model transforms an arbitrary table to a table whose features are all Continuous (dropping any fields it does not know how to encode). In particular, all Multiclass features are one-hot encoded.
A pipeline is a stand-alone model that internally combines one or more models in a linear (non-branching) chain. Here's a pipeline that adds the ContinuousEncoder as a pre-processor to the gradient tree-boosting model above:
pipe = ContinuousEncoder() |> booster
ProbabilisticPipeline(
continuous_encoder = ContinuousEncoder(
drop_last = false,
one_hot_ordered_factors = false),
evo_tree_classifier = EvoTreeClassifier(
nrounds = 100,
L2 = 0.0,
lambda = 0.0,
gamma = 0.0,
eta = 0.1,
max_depth = 6,
min_weight = 1.0,
rowsample = 1.0,
colsample = 1.0,
nbins = 64,
alpha = 0.5,
tree_type = "binary",
rng = Random.Xoshiro(0xfefa8d41b8f5dca5, 0xf80cc98e147960c1, 0x20e2ccc17662fc1d, 0xea7a7dcb2e787c01, 0xf4e85a418b9c4f80)),
cache = true)
Note that the component models appear as hyper-parameters of pipe. Pipelines are an implementation of a more general model composition interface provided by MLJ that advanced users may want to learn about.
From the above display, we see that component model hyper-parameters are now nested, but they are still accessible (important in hyper-parameter optimization):
pipe.evo_tree_classifier.max_depth
6
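Like any other hyper-parameter, a nested one can be mutated in place (a sketch; we immediately restore the default so later results are unchanged):
pipe.evo_tree_classifier.max_depth = 7  # deepen the booster's trees
pipe.evo_tree_classifier.max_depth = 6  # restore the default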
Introduces: measures (function); measures brier_loss, auc, accuracy; machine, fit!, predict, fitted_params, report, roc; resampling strategy StratifiedCV; evaluate, FeatureSelector, feature_importances
Without touching our test set Xtest, ytest, we will estimate the performance of our pipeline model, with default hyper-parameters, in two different ways:
Evaluating by hand. First, we'll do this "by hand" using the fit! and predict workflow illustrated for the iris data set above, using a holdout resampling strategy. At the same time we'll see how to generate a confusion matrix, ROC curve, and inspect feature importances.
Automated performance evaluation. Next we'll apply the more typical and convenient evaluate workflow, but using StratifiedCV (stratified cross-validation), which is more informative.
In any case, we need to choose some measures (metrics) to quantify the performance of our model. For a complete list of measures, run measures(). We can also search by keyword:
measures("Brier")
OrderedCollections.LittleDict{Any, Any, Vector{Any}, Vector{Any}} with 2 entries:
BrierScore => (aliases = ("brier_score", "quadratic_score"), consumes_multiple_observations = true, can_report_unaggregated = true, kind_of_proxy = Distribution(), observation_scitype = Union{Missing, Infinite, Finite}, can_consume_tables = false, supports_weights = true, supports_class_weights = true, orientation = Score(), external_aggregation_mode = Mean(), human_name = "brier score")
BrierLoss => (aliases = ("brier_loss", "cross_entropy", "quadratic_loss"), consumes_multiple_observations = true, can_report_unaggregated = true, kind_of_proxy = Distribution(), observation_scitype = Union{Missing, Infinite, Finite}, can_consume_tables = false, supports_weights = true, supports_class_weights = true, orientation = Loss(), external_aggregation_mode = Mean(), human_name = "brier loss")
We will be primarily using brier_loss, but also auc (area under the ROC curve) and accuracy.
Our pipeline model can be trained just like the decision tree model we built for the iris data set. Binding all non-test data to the pipeline model:
mach_pipe = machine(pipe, X, y)
untrained Machine; does not cache data
model: ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = false, …), …)
args:
1: Source @506 ⏎ ScientificTypesBase.Table{Union{AbstractVector{ScientificTypesBase.Continuous}, AbstractVector{ScientificTypesBase.Multiclass{3}}, AbstractVector{ScientificTypesBase.Multiclass{2}}, AbstractVector{ScientificTypesBase.Multiclass{4}}}}
2: Source @237 ⏎ AbstractVector{ScientificTypesBase.OrderedFactor{2}}
We already encountered the partition method above. Here we apply it to row indices, instead of data containers, as fit! and predict only need a view of the data to work.
train, validation = partition(1:length(y), 0.7)
fit!(mach_pipe, rows=train)
trained Machine; does not cache data
model: ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = false, …), …)
args:
1: Source @506 ⏎ ScientificTypesBase.Table{Union{AbstractVector{ScientificTypesBase.Continuous}, AbstractVector{ScientificTypesBase.Multiclass{3}}, AbstractVector{ScientificTypesBase.Multiclass{2}}, AbstractVector{ScientificTypesBase.Multiclass{4}}}}
2: Source @237 ⏎ AbstractVector{ScientificTypesBase.OrderedFactor{2}}
We note in passing that we can access two kinds of information from a trained machine:
- The learned parameters (eg, coefficients of a linear model): we use fitted_params(mach_pipe)
- Other by-products of training (eg, p-values): we use report(mach_pipe)
fp = fitted_params(mach_pipe);
keys(fp)
(:evo_tree_classifier, :continuous_encoder)
For example, we can extract the raw EvoTrees.jl learned parameter object:
fp.evo_tree_classifier.fitresult
EvoTrees.EvoTree{EvoTrees.MLogLoss, 2}
- Contains 101 trees in field `trees` (incl. 1 bias tree).
- Data input has 45 features.
- [:target_levels, :target_isordered, :fnames, :feattypes, :edges, :featbins] info accessible in field `info`
And, from the report, extract the names of all features generated for the one-hot encoding:
rpt = report(mach_pipe);
keys(rpt.continuous_encoder)
(:features_to_keep, :new_features)
join(string.(rpt.continuous_encoder.new_features), ", ") |> println
gender__Female, gender__Male, SeniorCitizen, Partner__No, Partner__Yes, Dependents__No, Dependents__Yes, tenure, PhoneService__No, PhoneService__Yes, MultipleLines__No, MultipleLines__No phone service, MultipleLines__Yes, InternetService__DSL, InternetService__Fiber optic, InternetService__No, OnlineSecurity__No, OnlineSecurity__No internet service, OnlineSecurity__Yes, OnlineBackup__No, OnlineBackup__No internet service, OnlineBackup__Yes, DeviceProtection__No, DeviceProtection__No internet service, DeviceProtection__Yes, TechSupport__No, TechSupport__No internet service, TechSupport__Yes, StreamingTV__No, StreamingTV__No internet service, StreamingTV__Yes, StreamingMovies__No, StreamingMovies__No internet service, StreamingMovies__Yes, Contract__Month-to-month, Contract__One year, Contract__Two year, PaperlessBilling__No, PaperlessBilling__Yes, PaymentMethod__Bank transfer (automatic), PaymentMethod__Credit card (automatic), PaymentMethod__Electronic check, PaymentMethod__Mailed check, MonthlyCharges, TotalCharges
Another example of information sometimes appearing in a report is feature importances. For models supporting them, however, feature importances are always available directly:
reports_feature_importances(pipe)
true
This holds because the supervised component of our pipeline supports feature importances:
reports_feature_importances(booster)
true
And we can get the booster feature importances from the pipeline's machine like this:
fi = feature_importances(mach_pipe)
45-element Vector{Pair{String, Float64}}:
"MonthlyCharges" => 0.23890620397780685
"Contract__Month-to-month" => 0.16232294017366125
"tenure" => 0.14677468271414482
"TotalCharges" => 0.14666498887062812
"Partner__No" => 0.030441338887616118
"gender__Female" => 0.025860497369449405
"OnlineBackup__No" => 0.024299735947768884
"SeniorCitizen" => 0.02280229847728746
"PaymentMethod__Mailed check" => 0.02153253456456951
"OnlineSecurity__No" => 0.020133018731836202
"PhoneService__No" => 0.019949239825781784
"PaymentMethod__Electronic check" => 0.019613869758192123
"InternetService__Fiber optic" => 0.01710091424020063
"PaymentMethod__Credit card (automatic)" => 0.011909647547861153
"TechSupport__No" => 0.0107331453464553
"MultipleLines__No" => 0.01008697622583624
"PaperlessBilling__No" => 0.00943634524564372
"PaymentMethod__Bank transfer (automatic)" => 0.00821687161106134
"DeviceProtection__No" => 0.008179337485642651
"MultipleLines__Yes" => 0.006695659965183685
"StreamingTV__No" => 0.006085178170408351
"TechSupport__Yes" => 0.00573288513590814
"Contract__Two year" => 0.005648160146719027
"Dependents__No" => 0.0052877365294750085
"Contract__One year" => 0.003658289499817455
"StreamingMovies__No" => 0.003337851338772691
"OnlineBackup__Yes" => 0.003151342690951194
"OnlineSecurity__Yes" => 0.0028559465736072826
"InternetService__DSL" => 0.0022030341978608006
"DeviceProtection__Yes" => 0.0002864796512690182
"StreamingTV__Yes" => 9.284909858362433e-5
"OnlineSecurity__No internet service" => 0.0
"Dependents__Yes" => 0.0
"Partner__Yes" => 0.0
"PaperlessBilling__Yes" => 0.0
"OnlineBackup__No internet service" => 0.0
"InternetService__No" => 0.0
"StreamingMovies__Yes" => 0.0
"gender__Male" => 0.0
"PhoneService__Yes" => 0.0
"DeviceProtection__No internet service" => 0.0
"StreamingTV__No internet service" => 0.0
"StreamingMovies__No internet service" => 0.0
"TechSupport__No internet service" => 0.0
"MultipleLines__No phone service" => 0.0
which we'll put into a data frame for later use:
feature_importance_table =
(feature=Symbol.(first.(fi)), importance=last.(fi)) |> DataFrames.DataFrame;
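For example, to peek at the five most important features (a sketch):
first(DataFrames.sort(feature_importance_table, :importance, rev=true), 5)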
For models not reporting feature importances, we recommend the Shapley.jl package.
Returning to predictions and evaluations of our measures:
ŷ = predict(mach_pipe, rows=validation);
print(
"Measurements:\n",
" brier loss: ", brier_loss(ŷ, y[validation]), "\n",
" auc: ", auc(ŷ, y[validation]), "\n",
" accuracy: ", accuracy(mode.(ŷ), y[validation])
)
Measurements:
brier loss: 0.32075024
auc: 0.7906746031746031
accuracy: 0.7972972972972973
Note that we need mode in the last case because accuracy expects point predictions, not probabilistic ones. (One can alternatively use predict_mode to generate the predictions.)
While we're here, let's also generate a confusion matrix and receiver operating characteristic (ROC) curve:
confmat(mode.(ŷ), y[validation])
┌─────────────┐
│Ground Truth │
┌─────────┼──────┬──────┤
│Predicted│ No │ Yes │
├─────────┼──────┼──────┤
│ No │ 99 │ 17 │
├─────────┼──────┼──────┤
│ Yes │ 13 │ 19 │
└─────────┴──────┴──────┘
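Individual rates from this matrix are also available as measures. Assuming, per our earlier convention, that the positive class is the second level, "Yes", a sketch:
tpr(mode.(ŷ), y[validation])  # true positive rate (recall)
fpr(mode.(ŷ), y[validation])  # false positive rate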
Note: Importing the plotting package and calling the plotting functions for the first time can take a minute or so.
using Plots
Plots.scalefontsizes(0.85)
roc = roc_curve(ŷ, y[validation])
plt = scatter(roc, legend=false)
plot!(plt, xlab="false positive rate", ylab="true positive rate")
plot!([0, 1], [0, 1], linewidth=2, linestyle=:dash, color=:black)
We can also get performance estimates with a single call to the evaluate function, which also allows for more complicated resampling - in this case stratified cross-validation. To make this more comprehensive, we set repeats=3 below to make our cross-validation "Monte Carlo" (3 random size-6 partitions of the observation space, for a total of 18 folds) and set acceleration=CPUThreads() to parallelize the computation.
We choose a StratifiedCV resampling strategy; the complete list of options is here.
e_pipe = evaluate(pipe, X, y,
resampling=StratifiedCV(nfolds=6, rng=rng),
measures=[brier_loss, auc, accuracy],
repeats=3,
acceleration=CPUThreads())
PerformanceEvaluation object with these fields:
model, measure, operation,
measurement, per_fold, per_observation,
fitted_params_per_fold, report_per_fold,
train_test_rows, resampling, repeats
Extract:
┌───┬──────────────────┬──────────────┬─────────────┐
│ │ measure │ operation │ measurement │
├───┼──────────────────┼──────────────┼─────────────┤
│ A │ BrierLoss() │ predict │ 0.36 │
│ B │ AreaUnderCurve() │ predict │ 0.79 │
│ C │ Accuracy() │ predict_mode │ 0.759 │
└───┴──────────────────┴──────────────┴─────────────┘
┌───┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────┐
│ │ per_fold │ 1.96*SE │
├───┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────┤
│ A │ Float32[0.271, 0.441, 0.347, 0.381, 0.311, 0.446, 0.397, 0.305, 0.397, 0.259, 0.442, 0.282, 0.385, 0.289, 0.337, 0.46, 0.364, 0.366] │ 0.0305 │
│ B │ [0.846, 0.729, 0.826, 0.792, 0.812, 0.717, 0.742, 0.827, 0.766, 0.839, 0.753, 0.853, 0.71, 0.842, 0.829, 0.745, 0.782, 0.803] │ 0.0227 │
│ C │ [0.795, 0.707, 0.78, 0.732, 0.793, 0.695, 0.735, 0.793, 0.72, 0.817, 0.707, 0.817, 0.771, 0.829, 0.805, 0.72, 0.732, 0.72] │ 0.0213 │
└───┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────┘
(There is also a version of evaluate for machines. Query the evaluate and evaluate! doc-strings to learn more about these functions and what the PerformanceEvaluation object e_pipe records.)
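The fields listed in the display above are directly accessible, which is handy for programmatic comparisons; a sketch:
e_pipe.measurement  # one aggregated value per measure
e_pipe.per_fold[1]  # the Brier loss on each of the 18 folds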
While less than ideal, let's adopt the common practice of using the standard error of a cross-validation score as an estimate of the uncertainty of a performance measure's expected value. To get a 95% confidence interval based on this error, use "measurement ± delta" where "delta" is the number in the "1.96*SE" column.
Introduces: FeatureSelector
Before continuing, we'll modify our pipeline to drop those features with low feature importance, to speed up later optimization. For a more sophisticated alternative, you may want to try MLJ's RecursiveFeatureElimination model wrapper. Run doc("RecursiveFeatureElimination") for details.
unimportant_features = filter(:importance => <(0.005), feature_importance_table).feature
pipe2 = ContinuousEncoder() |>
FeatureSelector(features=unimportant_features, ignore=true) |> booster
ProbabilisticPipeline(
continuous_encoder = ContinuousEncoder(
drop_last = false,
one_hot_ordered_factors = false),
feature_selector = FeatureSelector(
features = [Symbol("Contract__One year"), :StreamingMovies__No, :OnlineBackup__Yes, :OnlineSecurity__Yes, :InternetService__DSL, :DeviceProtection__Yes, :StreamingTV__Yes, Symbol("OnlineSecurity__No internet service"), :Dependents__Yes, :Partner__Yes, :PaperlessBilling__Yes, Symbol("OnlineBackup__No internet service"), :InternetService__No, :StreamingMovies__Yes, :gender__Male, :PhoneService__Yes, Symbol("DeviceProtection__No internet service"), Symbol("StreamingTV__No internet service"), Symbol("StreamingMovies__No internet service"), Symbol("TechSupport__No internet service"), Symbol("MultipleLines__No phone service")],
ignore = true),
evo_tree_classifier = EvoTreeClassifier(
nrounds = 100,
L2 = 0.0,
lambda = 0.0,
gamma = 0.0,
eta = 0.1,
max_depth = 6,
min_weight = 1.0,
rowsample = 1.0,
colsample = 1.0,
nbins = 64,
alpha = 0.5,
tree_type = "binary",
rng = Random.Xoshiro(0x779da755ff77601e, 0x2bfc9047d023e84f, 0x244081633d9f19f8, 0xcfc8b8b9e72e3690, 0xf4e85a418b9c4f80)),
cache = true)
Introduces: control strategies Step, NumberSinceBest, TimeLimit, InvalidValue; model wrapper IteratedModel; resampling strategy Holdout
We want to optimize the hyper-parameters of our model. Since our model is iterative, these parameters include the (nested) iteration parameter pipe2.evo_tree_classifier.nrounds. Sometimes this parameter is optimized first, fixed, and then maybe optimized again after the other parameters. Here we take a more principled approach, wrapping our model in a control strategy that makes it "self-iterating". The strategy applies a stopping criterion to out-of-sample estimates of the model performance, computed on an internally constructed holdout set. In this way, we avoid some data hygiene issues and, when we subsequently optimize other parameters, we will always be using an optimal number of iterations.
Note that this approach can be applied to any iterative MLJ model, eg, the neural network models provided by MLJFlux.jl.
First, we select appropriate controls from this list:
controls = [
Step(1), # to increment iteration parameter (`pipe.nrounds`)
NumberSinceBest(4), # main stopping criterion
TimeLimit(2/3600), # never train more than 2 sec
InvalidValue() # stop if NaN or ±Inf encountered
]
4-element Vector{Any}:
IterationControl.Step(1)
EarlyStopping.NumberSinceBest(4)
EarlyStopping.TimeLimit(Dates.Millisecond(2000))
EarlyStopping.InvalidValue()
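Controls can also trigger callbacks. For example, to log each out-of-sample loss as training proceeds, we could have appended the following control to the list above (a sketch, not used in what follows):
WithLossDo(loss -> @info "out-of-sample loss: $loss")  # runs each time a loss is computed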
Now we wrap our pipeline model using the IteratedModel wrapper, being sure to specify the measure on which internal estimates of the out-of-sample performance will be based:
iterated_pipe = IteratedModel(model=pipe2,
controls=controls,
measure=brier_loss,
resampling=Holdout(fraction_train=0.7))
ProbabilisticIteratedModel(
model = ProbabilisticPipeline(
continuous_encoder = ContinuousEncoder(drop_last = false, …),
feature_selector = FeatureSelector(features = [Symbol("Contract__One year"), :StreamingMovies__No, :OnlineBackup__Yes, :OnlineSecurity__Yes, :InternetService__DSL, :DeviceProtection__Yes, :StreamingTV__Yes, Symbol("OnlineSecurity__No internet service"), :Dependents__Yes, :Partner__Yes, :PaperlessBilling__Yes, Symbol("OnlineBackup__No internet service"), :InternetService__No, :StreamingMovies__Yes, :gender__Male, :PhoneService__Yes, Symbol("DeviceProtection__No internet service"), Symbol("StreamingTV__No internet service"), Symbol("StreamingMovies__No internet service"), Symbol("TechSupport__No internet service"), Symbol("MultipleLines__No phone service")], …),
evo_tree_classifier = EvoTrees.EvoTreeClassifier{EvoTrees.MLogLoss}
- nrounds: 100
- L2: 0.0
- lambda: 0.0
- gamma: 0.0
- eta: 0.1
- max_depth: 6
- min_weight: 1.0
- rowsample: 1.0
- colsample: 1.0
- nbins: 64
- alpha: 0.5
- tree_type: binary
- rng: Random.Xoshiro(0x779da755ff77601e, 0x2bfc9047d023e84f, 0x244081633d9f19f8, 0xcfc8b8b9e72e3690, 0xf4e85a418b9c4f80)
,
cache = true),
controls = Any[IterationControl.Step(1), EarlyStopping.NumberSinceBest(4), EarlyStopping.TimeLimit(Dates.Millisecond(2000)), EarlyStopping.InvalidValue()],
resampling = Holdout(
fraction_train = 0.7,
shuffle = false,
rng = Random._GLOBAL_RNG()),
measure = BrierLoss(),
weights = nothing,
class_weights = nothing,
operation = nothing,
retrain = false,
check_measure = true,
iteration_parameter = nothing,
cache = true)
We've set resampling=Holdout(fraction_train=0.7)
to arrange that data attached to our model should be internally split into a train set (70%) and a holdout set (30%) for determining the out-of-sample estimate of the Brier loss.
For demonstration purposes, let's bind iterated_pipe to all data not in our don't-touch holdout set, and train on all of that data:
mach_iterated_pipe = machine(iterated_pipe, X, y)
fit!(mach_iterated_pipe);
To recap, internally this training is split into two separate steps:
- A controlled iteration step, training on an internal 70/30 split of the supplied data, with the total number of iterations determined by the specified stopping criteria (based on out-of-sample performance estimates computed on the internal holdout set)
- A final step that trains the atomic model on all available data using the number of iterations determined in the first step. Calling predict on mach_iterated_pipe means using the learned parameters of the second step.
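The outcomes of the first step are recorded in the machine's report. Mirroring what we do in the tuning section below, this sketch lists each control alongside its final outcome, including the total number of iterations applied:
report(mach_iterated_pipe).controls |> collect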
Introduces: range, model wrapper TunedModel, RandomSearch
We now turn to hyper-parameter optimization. A tool not discussed here is the learning_curve function, which can be useful when wanting to visualize the effect of changes to a single hyper-parameter (which could be an iteration parameter). See, for example, this section of the manual or this tutorial.
Fine-tuning the hyper-parameters of a gradient booster can be somewhat involved. Here we settle for simultaneously optimizing two key parameters: max_depth and eta (the learning rate).
Like iteration control, model optimization in MLJ is implemented as a model wrapper, called TunedModel. After wrapping a model in a tuning strategy and binding the wrapped model to data in a machine called mach, calling fit!(mach) instigates a search for optimal model hyper-parameters, within a specified range, and then uses all supplied data to train the best model. To predict using that model, one then calls predict(mach, Xnew). In this way the wrapped model may be viewed as a "self-tuning" version of the unwrapped model. That is, wrapping the model simply transforms certain hyper-parameters into learned parameters (just as IteratedModel does for an iteration parameter).
To start with, we define ranges for the parameters of interest. Since these parameters are nested, let's force a display of our model to a larger depth:
show(iterated_pipe, 2)
ProbabilisticIteratedModel(
model = ProbabilisticPipeline(
continuous_encoder = ContinuousEncoder(
drop_last = false,
one_hot_ordered_factors = false),
feature_selector = FeatureSelector(
features = [Symbol("Contract__One year"), :StreamingMovies__No, :OnlineBackup__Yes, :OnlineSecurity__Yes, :InternetService__DSL, :DeviceProtection__Yes, :StreamingTV__Yes, Symbol("OnlineSecurity__No internet service"), :Dependents__Yes, :Partner__Yes, :PaperlessBilling__Yes, Symbol("OnlineBackup__No internet service"), :InternetService__No, :StreamingMovies__Yes, :gender__Male, :PhoneService__Yes, Symbol("DeviceProtection__No internet service"), Symbol("StreamingTV__No internet service"), Symbol("StreamingMovies__No internet service"), Symbol("TechSupport__No internet service"), Symbol("MultipleLines__No phone service")],
ignore = true),
evo_tree_classifier = EvoTreeClassifier(
nrounds = 100,
L2 = 0.0,
lambda = 0.0,
gamma = 0.0,
eta = 0.1,
max_depth = 6,
min_weight = 1.0,
rowsample = 1.0,
colsample = 1.0,
nbins = 64,
alpha = 0.5,
tree_type = "binary",
rng = Random.Xoshiro(0x779da755ff77601e, 0x2bfc9047d023e84f, 0x244081633d9f19f8, 0xcfc8b8b9e72e3690, 0xf4e85a418b9c4f80)),
cache = true),
controls = Any[IterationControl.Step(1), EarlyStopping.NumberSinceBest(4), EarlyStopping.TimeLimit(Dates.Millisecond(2000)), EarlyStopping.InvalidValue()],
resampling = Holdout(
fraction_train = 0.7,
shuffle = false,
rng = Random._GLOBAL_RNG()),
measure = BrierLoss(),
weights = nothing,
class_weights = nothing,
operation = nothing,
retrain = false,
check_measure = true,
iteration_parameter = nothing,
cache = true)
p1 = :(model.evo_tree_classifier.eta)
p2 = :(model.evo_tree_classifier.max_depth)
r1 = range(iterated_pipe, p1, lower=-2, upper=-0.5, scale=x->10^x)
r2 = range(iterated_pipe, p2, lower=2, upper=6)
NumericRange(2 ≤ model.evo_tree_classifier.max_depth ≤ 6; origin=4.0, unit=2.0)
Nominal ranges are defined by specifying values instead of lower and upper.
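For example, a nominal range over the booster's tree_type hyper-parameter might look like this (a sketch, not used below; "binary" and "oblivious" are the values EvoTrees.jl supports):
r3 = range(iterated_pipe, :(model.evo_tree_classifier.tree_type),
           values=["binary", "oblivious"])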
Next, we choose an optimization strategy from this list:
tuning = RandomSearch(rng=rng)
RandomSearch(
bounded = Distributions.Uniform,
positive_unbounded = Distributions.Gamma,
other = Distributions.Normal,
rng = Random.Xoshiro(0x028dfc8b09743300, 0x63602df923831c17, 0x583c62916d9b98c3, 0x9a3836f04c25bd68, 0xf4e85a418b9c4f80))
Then we wrap the model, specifying a resampling strategy and a measure, as we did for IteratedModel. In fact, we can include a battery of measures; by default, optimization is with respect to performance estimates based on the first measure, but estimates for all measures can be accessed from the model's report.
The keyword n specifies the total number of models (sets of hyper-parameters) to evaluate.
tuned_iterated_pipe = TunedModel(model=iterated_pipe,
range=[r1, r2],
tuning=tuning,
measures=[brier_loss, auc, accuracy],
resampling=StratifiedCV(nfolds=6, rng=rng),
acceleration=CPUThreads(),
n=40)
ProbabilisticTunedModel(
model = ProbabilisticIteratedModel(
model = ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = false, …), …),
controls = Any[IterationControl.Step(1), EarlyStopping.NumberSinceBest(4), EarlyStopping.TimeLimit(Dates.Millisecond(2000)), EarlyStopping.InvalidValue()],
resampling = Holdout(fraction_train = 0.7, …),
measure = BrierLoss(),
weights = nothing,
class_weights = nothing,
operation = nothing,
retrain = false,
check_measure = true,
iteration_parameter = nothing,
cache = true),
tuning = RandomSearch(
bounded = Distributions.Uniform,
positive_unbounded = Distributions.Gamma,
other = Distributions.Normal,
rng = Random.Xoshiro(0x028dfc8b09743300, 0x63602df923831c17, 0x583c62916d9b98c3, 0x9a3836f04c25bd68, 0xf4e85a418b9c4f80)),
resampling = StratifiedCV(
nfolds = 6,
shuffle = true,
rng = Random.Xoshiro(0x028dfc8b09743300, 0x63602df923831c17, 0x583c62916d9b98c3, 0x9a3836f04c25bd68, 0xf4e85a418b9c4f80)),
measure = StatisticalMeasuresBase.FussyMeasure[BrierLoss(), AreaUnderCurve(), Accuracy()],
weights = nothing,
class_weights = nothing,
operation = nothing,
range = MLJBase.NumericRange{T, MLJBase.Bounded} where T[NumericRange(0.01 ≤ model.evo_tree_classifier.eta ≤ 0.3162; after scaling: origin=0.05623, unit=5.623), NumericRange(2 ≤ model.evo_tree_classifier.max_depth ≤ 6; origin=4.0, unit=2.0)],
selection_heuristic = MLJTuning.NaiveSelection(nothing),
train_best = true,
repeats = 1,
n = 40,
acceleration = ComputationalResources.CPUThreads{Int64}(12),
acceleration_resampling = ComputationalResources.CPU1{Nothing}(nothing),
check_measure = true,
cache = true,
compact_history = true,
logger = nothing)
To save time, we skip the repeats here.
Binding our final model to data and training:
mach_tuned_iterated_pipe = machine(tuned_iterated_pipe, X, y)
fit!(mach_tuned_iterated_pipe)
trained Machine; does not cache data
model: ProbabilisticTunedModel(model = ProbabilisticIteratedModel(model = ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = false, …), …), …), …)
args:
1: Source @030 ⏎ ScientificTypesBase.Table{Union{AbstractVector{ScientificTypesBase.Continuous}, AbstractVector{ScientificTypesBase.Multiclass{3}}, AbstractVector{ScientificTypesBase.Multiclass{2}}, AbstractVector{ScientificTypesBase.Multiclass{4}}}}
2: Source @174 ⏎ AbstractVector{ScientificTypesBase.OrderedFactor{2}}
As explained above, the training we have just performed was split internally into two separate steps:
- A step to determine the parameter values that optimize the aggregated cross-validation scores
- A final step that trains the optimal model on all available data. Future predictions predict(mach_tuned_iterated_pipe, ...) are based on this final training step.
From report(mach_tuned_iterated_pipe) we can extract details about the optimization procedure. For example:
rpt2 = report(mach_tuned_iterated_pipe);
best_booster = rpt2.best_model.model.evo_tree_classifier
print(
"Optimal hyper-parameters: \n",
" max_depth: ", best_booster.max_depth, "\n",
" eta: ", best_booster.eta
)
Optimal hyper-parameters:
max_depth: 3
eta: 0.06363819027728988
e_best = rpt2.best_history_entry
e_best.evaluation
CompactPerformanceEvaluation object with these fields:
model, measure, operation,
measurement, per_fold, per_observation,
train_test_rows, resampling, repeats
Extract:
┌───┬──────────────────┬──────────────┬─────────────┐
│ │ measure │ operation │ measurement │
├───┼──────────────────┼──────────────┼─────────────┤
│ A │ BrierLoss() │ predict │ 0.286 │
│ B │ AreaUnderCurve() │ predict │ 0.824 │
│ C │ Accuracy() │ predict_mode │ 0.817 │
└───┴──────────────────┴──────────────┴─────────────┘
┌───┬───────────────────────────────────────────────────┬─────────┐
│ │ per_fold │ 1.96*SE │
├───┼───────────────────────────────────────────────────┼─────────┤
│ A │ Float32[0.305, 0.299, 0.286, 0.235, 0.295, 0.294] │ 0.0226 │
│ B │ [0.786, 0.795, 0.826, 0.91, 0.806, 0.82] │ 0.0393 │
│ C │ [0.819, 0.817, 0.805, 0.878, 0.793, 0.793] │ 0.0279 │
└───┴───────────────────────────────────────────────────┴─────────┘
Digging a little deeper, we can learn what stopping criterion was applied in the case of the optimal model, and how many iterations were required:
rpt2.best_report.controls |> collect
4-element Vector{Tuple{Any, NamedTuple}}:
(IterationControl.Step(1), (new_iterations = 54,))
(EarlyStopping.NumberSinceBest(4), (done = true, log = "Stop triggered by EarlyStopping.NumberSinceBest(4) stopping criterion. "))
(EarlyStopping.TimeLimit(Dates.Millisecond(2000)), (done = false, log = ""))
(EarlyStopping.InvalidValue(), (done = false, log = ""))
Finally, we can visualize the optimization results:
plot(mach_tuned_iterated_pipe, size=(600,450))
Introduces: MLJ.save
Here's how to serialize our final, trained, self-iterating, self-tuning pipeline machine using Julia's native serializer (see the manual for more options):
FILE = joinpath(tempdir(), "tuned_iterated_pipe.jls")
MLJ.save(FILE, mach_tuned_iterated_pipe)
We'll deserialize this in "Testing the final model" below.
Finally, to get an even more accurate estimate of performance, we can evaluate our model using stratified cross-validation and all the data attached to our machine. Because this evaluation implies nested resampling, this computation takes quite a bit longer than the previous one (which is being repeated six times, using 5/6th of the data each time):
e_tuned_iterated_pipe = evaluate(tuned_iterated_pipe, X, y,
resampling=StratifiedCV(nfolds=6, rng=rng),
measures=[brier_loss, auc, accuracy])
PerformanceEvaluation object with these fields:
model, measure, operation,
measurement, per_fold, per_observation,
fitted_params_per_fold, report_per_fold,
train_test_rows, resampling, repeats
Extract:
┌───┬──────────────────┬──────────────┬─────────────┐
│ │ measure │ operation │ measurement │
├───┼──────────────────┼──────────────┼─────────────┤
│ A │ BrierLoss() │ predict │ 0.292 │
│ B │ AreaUnderCurve() │ predict │ 0.822 │
│ C │ Accuracy() │ predict_mode │ 0.781 │
└───┴──────────────────┴──────────────┴─────────────┘
┌───┬───────────────────────────────────────────────────┬─────────┐
│ │ per_fold │ 1.96*SE │
├───┼───────────────────────────────────────────────────┼─────────┤
│ A │ Float32[0.295, 0.258, 0.321, 0.371, 0.246, 0.261] │ 0.0415 │
│ B │ [0.81, 0.874, 0.786, 0.727, 0.86, 0.873] │ 0.0514 │
│ C │ [0.795, 0.829, 0.72, 0.72, 0.817, 0.805] │ 0.0429 │
└───┴───────────────────────────────────────────────────┴─────────┘
For comparison, here again is the evaluation for the basic pipeline model (no feature selection and default hyperparameters):
e_pipe
PerformanceEvaluation object with these fields:
model, measure, operation,
measurement, per_fold, per_observation,
fitted_params_per_fold, report_per_fold,
train_test_rows, resampling, repeats
Extract:
┌───┬──────────────────┬──────────────┬─────────────┐
│ │ measure │ operation │ measurement │
├───┼──────────────────┼──────────────┼─────────────┤
│ A │ BrierLoss() │ predict │ 0.36 │
│ B │ AreaUnderCurve() │ predict │ 0.79 │
│ C │ Accuracy() │ predict_mode │ 0.759 │
└───┴──────────────────┴──────────────┴─────────────┘
┌───┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────┐
│ │ per_fold │ 1.96*SE │
├───┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────┤
│ A │ Float32[0.271, 0.441, 0.347, 0.381, 0.311, 0.446, 0.397, 0.305, 0.397, 0.259, 0.442, 0.282, 0.385, 0.289, 0.337, 0.46, 0.364, 0.366] │ 0.0305 │
│ B │ [0.846, 0.729, 0.826, 0.792, 0.812, 0.717, 0.742, 0.827, 0.766, 0.839, 0.753, 0.853, 0.71, 0.842, 0.829, 0.745, 0.782, 0.803] │ 0.0227 │
│ C │ [0.795, 0.707, 0.78, 0.732, 0.793, 0.695, 0.735, 0.793, 0.72, 0.817, 0.707, 0.817, 0.771, 0.829, 0.805, 0.72, 0.732, 0.72] │ 0.0213 │
└───┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────┘
Tuning appears to improve all three scores (not just the Brier loss used in optimization). However, 95% confidence intervals based on the standard errors suggest we are not detecting statistically significant differences for auc and accuracy. In any case, the default booster hyper-parameters do a pretty good job. It would definitely be worth revisiting this if we were to use all the data.
We now determine the performance of our model on our lock-and-throw-away-the-key holdout set. To demonstrate deserialization, we'll pretend we're in a new Julia session (but have called import/using on the same packages). Then the following should suffice to recover our model trained under "Hyper-parameter optimization" above:
mach_restored = machine(FILE)
trained Machine; does not cache data
model: ProbabilisticTunedModel(model = ProbabilisticIteratedModel(model = ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = false, …), …), …), …)
args:
We compute predictions on the holdout set:
ŷ_tuned = predict(mach_restored, Xtest);
ŷ_tuned[1]
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(No=>0.712, Yes=>0.288)
And we can compute the final performance measures:
print(
"Tuned model measurements on test:\n",
" brier loss: ", brier_loss(ŷ_tuned, ytest), "\n",
" auc: ", auc(ŷ_tuned, ytest), "\n",
" accuracy: ", accuracy(mode.(ŷ_tuned), ytest)
)
Tuned model measurements on test:
brier loss: 0.27600387
auc: 0.8487327188940093
accuracy: 0.7914691943127962
For comparison, here's the performance of the basic pipeline model:
mach_basic = machine(pipe, X, y)
fit!(mach_basic, verbosity=0)
ŷ_basic = predict(mach_basic, Xtest);
print(
"Basic model measurements on test set:\n",
" brier loss: ", brier_loss(ŷ_basic, ytest), "\n",
" auc: ", auc(ŷ_basic, ytest), "\n",
" accuracy: ", accuracy(mode.(ŷ_basic), ytest)
)
Basic model measurements on test set:
brier loss: 0.33141056
auc: 0.8130184331797236
accuracy: 0.7725118483412322