Horse colic data

Initial data processing

In this example, we consider the UCI "horse colic" dataset.

This is a reasonably messy classification problem with missing values and the like, so some work should be expected in the feature processing.

Getting the data

The data comes pre-split into training and testing sets, and we will keep it that way.

using MLJ
using HTTP
using CSV
import DataFrames: DataFrame, select!, Not
req1 = HTTP.get("http://archive.ics.uci.edu/ml/machine-learning-databases/horse-colic/horse-colic.data")
req2 = HTTP.get("http://archive.ics.uci.edu/ml/machine-learning-databases/horse-colic/horse-colic.test")
header = ["surgery", "age", "hospital_number",
    "rectal_temperature", "pulse",
    "respiratory_rate", "temperature_extremities",
    "peripheral_pulse", "mucous_membranes",
    "capillary_refill_time", "pain",
    "peristalsis", "abdominal_distension",
    "nasogastric_tube", "nasogastric_reflux",
    "nasogastric_reflux_ph", "feces", "abdomen",
    "packed_cell_volume", "total_protein",
    "abdomcentesis_appearance", "abdomcentesis_total_protein",
    "outcome", "surgical_lesion", "lesion_1", "lesion_2", "lesion_3",
    "cp_data"]
csv_opts = (header=header, delim=' ', missingstring="?",
            ignorerepeated=true)
data_train = CSV.read(req1.body, DataFrame; csv_opts...)
data_test  = CSV.read(req2.body, DataFrame; csv_opts...)
@show size(data_train)
@show size(data_test)
size(data_train) = (300, 28)
size(data_test) = (68, 28)

Inspecting columns

To simplify the analysis, we will drop the lesion_* columns, as they would require specific re-encoding that would distract us a bit.

unwanted = [:lesion_1, :lesion_2, :lesion_3]
data = vcat(data_train, data_test)
select!(data, Not(unwanted));

Let's also keep track of the initial train-test split:

train = 1:nrows(data_train)
test = last(train) .+ (1:nrows(data_test));

We know from reading the description that some of these features represent multiclass data; to facilitate the interpretation, we can use autotype from ScientificTypes. By default, autotype checks all columns and suggests a Finite type for any column with relatively few distinct values. More sophisticated rules can be passed as well; see ScientificTypes.jl:

datac = coerce(data, autotype(data));
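
Note that autotype does not modify the data itself: it returns a dictionary of suggested scientific types, which coerce then applies. We can peek at individual suggestions, for instance:

autotype(data)[:age]   # scientific type suggested for the :age column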

Let's check, column by column, whether the result looks OK:

sch = schema(datac)
for (name, scitype) in zip(sch.names, sch.scitypes)
    println(rpad("$name", 30), scitype)
end
surgery                       Union{Missing, OrderedFactor{2}}
age                           OrderedFactor{2}
hospital_number               Count
rectal_temperature            Union{Missing, Continuous}
pulse                         Union{Missing, Count}
respiratory_rate              Union{Missing, Count}
temperature_extremities       Union{Missing, OrderedFactor{4}}
peripheral_pulse              Union{Missing, OrderedFactor{4}}
mucous_membranes              Union{Missing, OrderedFactor{6}}
capillary_refill_time         Union{Missing, OrderedFactor{3}}
pain                          Union{Missing, OrderedFactor{5}}
peristalsis                   Union{Missing, OrderedFactor{4}}
abdominal_distension          Union{Missing, OrderedFactor{4}}
nasogastric_tube              Union{Missing, OrderedFactor{3}}
nasogastric_reflux            Union{Missing, OrderedFactor{3}}
nasogastric_reflux_ph         Union{Missing, OrderedFactor{24}}
feces                         Union{Missing, OrderedFactor{4}}
abdomen                       Union{Missing, OrderedFactor{5}}
packed_cell_volume            Union{Missing, Continuous}
total_protein                 Union{Missing, Continuous}
abdomcentesis_appearance      Union{Missing, OrderedFactor{3}}
abdomcentesis_total_protein   Union{Missing, Continuous}
outcome                       Union{Missing, OrderedFactor{3}}
surgical_lesion               OrderedFactor{2}
cp_data                       OrderedFactor{2}

Most columns are now treated as either Multiclass or OrderedFactor, which corresponds to the description of the data. For instance:

  • Surgery is described as 1=yes / 2=no

  • Age is described as 1=adult / 2=young
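
We can quickly confirm that the coercion kept these codes as ordered categorical levels via levels (re-exported by MLJ from CategoricalArrays; failing that, using CategoricalArrays provides it):

levels(datac.age)   # the two codes present in the data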

Inspecting the rest of the descriptions against the current scientific types, a few more things can be observed:

  • hospital_number is still a Count; as we check below, there are many distinct hospitals, so this feature is probably not very useful,

  • pulse and respiratory_rate are still Count, but the data description suggests they can be treated as Continuous.

length(unique(datac.hospital_number))
346

Indeed, that's nearly one distinct value per row; let's drop that column:

datac = select!(datac, Not(:hospital_number));

Let's also coerce pulse and respiratory_rate to Continuous; in fact, we can do that with autotype by specifying the :discrete_to_continuous rule:

datac = coerce(datac, autotype(datac, rules=(:discrete_to_continuous,)));

Dealing with missing values

There are quite a lot of missing values. In this tutorial we'll be a bit rough in how we deal with them, applying the following rules of thumb:

  • drop the rows where the outcome is unknown

  • drop columns with more than 20% missing values

  • simple imputation of whatever's left

missing_outcome = ismissing.(datac.outcome)
idx_missing_outcome = missing_outcome |> findall
2-element Vector{Int64}:
 133
 309

OK, there are only two such rows, which is nice. Let's drop them; since dropping rows shifts all subsequent indices, we also recompute the train/test split for the filtered frame:

n_train = last(train)                 # length of the original training block
datac = datac[.!missing_outcome, :]
# recompute the split for the filtered frame
train = 1:(n_train - count(<=(n_train), idx_missing_outcome))
test  = (last(train) + 1):nrows(datac);

Now let's look at the proportion of missing values per feature:

for name in names(datac)
    col = datac[:, name]
    ratio_missing = sum(ismissing.(col)) / nrows(datac) * 100
    println(rpad(name, 30), round(ratio_missing, sigdigits=3))
end
surgery                       0.0
age                           0.0
rectal_temperature            18.9
pulse                         7.1
respiratory_rate              19.4
temperature_extremities       17.5
peripheral_pulse              22.7
mucous_membranes              13.1
capillary_refill_time         10.4
pain                          17.2
peristalsis                   13.9
abdominal_distension          17.5
nasogastric_tube              35.5
nasogastric_reflux            36.1
nasogastric_reflux_ph         81.1
feces                         34.7
abdomen                       39.1
packed_cell_volume            9.84
total_protein                 11.5
abdomcentesis_appearance      52.7
abdomcentesis_total_protein   63.9
outcome                       0.0
surgical_lesion               0.0
cp_data                       0.0

Let's drop the columns with more than 20% missing values (quite a few!):

unwanted = [:peripheral_pulse, :nasogastric_tube, :nasogastric_reflux,
            :nasogastric_reflux_ph, :feces, :abdomen,
            :abdomcentesis_appearance, :abdomcentesis_total_protein]
select!(datac, Not(unwanted));
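
As an aside, that list could have been assembled programmatically rather than by hand; run before the select! above, something like the following would do (a sketch, using the same 20% threshold):

thresh = 0.2
to_drop = [Symbol(n) for n in names(datac)
           if count(ismissing, datac[!, n]) / nrows(datac) > thresh]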

Note that we could have done this more carefully and investigated the nature of the features with many missing values, but don't forget that our goal here is to showcase MLJ!

Let's conclude by filling all remaining missing values and separating the feature matrix from the target:

@load FillImputer
filler = machine(FillImputer(), datac)
fit!(filler)
datac = MLJ.transform(filler, datac)

y, X = unpack(datac, ==(:outcome), name->true);
X = coerce(X, autotype(X, :discrete_to_continuous));
import MLJModels ✔
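
As a quick sanity check (not part of the original output, and assuming FillImputer returns a DataFrame here, as MLJ transformers generally preserve the table type), we can verify that no missing values remain:

@assert all(col -> !any(ismissing, col), eachcol(datac))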

A baseline model

Let's define a first sensible model and get a baseline. The basic steps are:

  • one-hot-encode the categoricals

  • feed all this into a classifier

@load OneHotEncoder
MultinomialClassifier = @load MultinomialClassifier pkg="MLJLinearModels"
import MLJModels ✔
import MLJLinearModels ✔
MLJLinearModels.MultinomialClassifier

Let's get convenient handles on the training data:

Xtrain = X[train,:]
ytrain = y[train];

And let's define a pipeline corresponding to the operations above:

SimplePipe = Pipeline(
    OneHotEncoder(),
    MultinomialClassifier(),
    prediction_type=:probabilistic
)
mach = machine(SimplePipe, Xtrain, ytrain)
res = evaluate!(
    mach;
    resampling=Holdout(fraction_train=0.9),
    measure=cross_entropy
)
round(res.measurement[1], sigdigits=3)
0.802

This is the cross entropy measured on a held-out 10% of the training set. Just for the sake of getting a baseline, we can also look at the misclassification rate on the whole training data:

mcr = misclassification_rate(predict_mode(mach, Xtrain), ytrain)
println(rpad("MNC mcr:", 10), round(mcr, sigdigits=3))
MNC mcr:  0.301
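
As an aside, the holdout above uses a single 90/10 split, so the estimate is fairly noisy; cross-validation would give a steadier figure (a quick sketch, with the number of folds chosen arbitrarily):

res_cv = evaluate!(
    mach;
    resampling=CV(nfolds=6, shuffle=true, rng=1234),
    measure=cross_entropy
)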

Back to the baseline: that's not bad at all, actually. Let's tune it a bit and see if we can do better; there's not much point in going crazy, as we might gain a few percent but not much more.

model = SimplePipe
lambdas = range(model, :(multinomial_classifier.lambda),
                lower=1e-3, upper=100, scale=:log10)
tm = TunedModel(model=SimplePipe, ranges=lambdas, measure=cross_entropy)
mtm = machine(tm, Xtrain, ytrain)
fit!(mtm)
best_pipe = fitted_params(mtm).best_model
ProbabilisticPipeline(
    one_hot_encoder = OneHotEncoder(
            features = Symbol[],
            drop_last = false,
            ordered_factor = true,
            ignore = false),
    multinomial_classifier = MultinomialClassifier(
            lambda = 0.046415888336127795,
            gamma = 0.0,
            penalty = :l2,
            fit_intercept = true,
            penalize_intercept = false,
            scale_penalty_with_samples = true,
            solver = nothing),
    cache = true)
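
We can also inspect how the tuning went through the machine's report; per MLJTuning's documentation, this includes the full search history and a best_history_entry holding the winning hyperparameters together with their measured performance (a sketch):

rep = report(mtm)
rep.best_history_entry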

So it looks like it's useful to regularise a fair bit to get a lower cross entropy:

ŷ = MLJ.predict(mtm, Xtrain)
cross_entropy(ŷ, ytrain) |> mean
0.5866213889827255

This also brings our misclassification rate down somewhat:

mcr = misclassification_rate(mode.(ŷ), ytrain)
println(rpad("MNC mcr:", 10), round(mcr, sigdigits=3))
MNC mcr:  0.244

We've probably reached the limit of a simple linear model.

Trying another model

There are lots of categorical features, so it may be better to use a model that handles them well, such as a tree-based classifier.

XGBC = @load XGBoostClassifier
xgbm = machine(XGBC(), Xtrain, ytrain)
fit!(xgbm)
ŷ = MLJ.predict(xgbm, Xtrain)
cross_entropy(ŷ, ytrain) |> mean
import MLJXGBoostInterface ✔
0.0248151f0

So we get a much lower cross entropy on the training data, and...

misclassification_rate(mode.(ŷ), ytrain)
0.0033444816053511705

a significantly better misclassification rate as well. Bear in mind, though, that both numbers are computed on the training data, which boosted trees can fit almost perfectly.
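
For a fairer comparison, we can evaluate on the held-out test rows we kept track of at the start (a quick sketch reusing the variables defined above; this was not part of the original analysis):

Xtest = X[test, :]
ytest = y[test]
ŷtest = MLJ.predict(xgbm, Xtest)
misclassification_rate(mode.(ŷtest), ytest)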

We could investigate further, do more tuning, etc., but the key point of this tutorial was to show how to handle data with missing values.