Horse colic data

Initial data processing

In this example, we consider the UCI "horse colic" dataset.

This is a reasonably messy classification problem with missing values and the like, so some work should be expected in the feature processing.

Getting the data

The data comes pre-split into training and testing sets, and we will keep it that way.

using MLJ
using HTTP
using CSV
import DataFrames: DataFrame, select!, Not
req1 = HTTP.get("http://archive.ics.uci.edu/ml/machine-learning-databases/horse-colic/horse-colic.data")
req2 = HTTP.get("http://archive.ics.uci.edu/ml/machine-learning-databases/horse-colic/horse-colic.test")
header = ["surgery", "age", "hospital_number",
    "rectal_temperature", "pulse",
    "respiratory_rate", "temperature_extremities",
    "peripheral_pulse", "mucous_membranes",
    "capillary_refill_time", "pain",
    "peristalsis", "abdominal_distension",
    "nasogastric_tube", "nasogastric_reflux",
    "nasogastric_reflux_ph", "feces", "abdomen",
    "packed_cell_volume", "total_protein",
    "abdomcentesis_appearance", "abdomcentesis_total_protein",
    "outcome", "surgical_lesion", "lesion_1", "lesion_2", "lesion_3",
    "cp_data"]
csv_opts = (header=header, delim=' ', missingstring="?",
            ignorerepeated=true)
data_train = CSV.read(req1.body, DataFrame; csv_opts...)
data_test  = CSV.read(req2.body, DataFrame; csv_opts...)
@show size(data_train)
@show size(data_test)
size(data_train) = (300, 28)
size(data_test) = (68, 28)

Inspecting columns

To simplify the analysis, we will drop the lesion_* columns, as they would require specific re-encoding that would distract us a bit.

unwanted = [:lesion_1, :lesion_2, :lesion_3]
data = vcat(data_train, data_test)
select!(data, Not(unwanted));

Let's also keep track of the initial train-test split:

train = 1:nrows(data_train)
test = last(train) .+ (1:nrows(data_test));

We know from reading the description that some of these features represent multiclass data; to facilitate the interpretation, we can use autotype from ScientificTypes. By default, autotype checks all columns and suggests a Finite type for any column with relatively few distinct values. More sophisticated rules can be passed as well; see ScientificTypes.jl:

datac = coerce(data, autotype(data));
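
Note that autotype does not modify the data itself: it returns a dictionary of suggested scientific types, which coerce then applies. We can peek at individual suggestions, for instance:

autotype(data)[:age]   # scientific type suggested for the :age column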

Let's check, column by column, whether the result looks OK:

sch = schema(datac)
for (name, scitype) in zip(sch.names, sch.scitypes)
    println(rpad("$name", 30), scitype)
end
surgery                       Union{Missing, OrderedFactor{2}}
age                           OrderedFactor{2}
hospital_number               Count
rectal_temperature            Union{Missing, Continuous}
pulse                         Union{Missing, Count}
respiratory_rate              Union{Missing, Count}
temperature_extremities       Union{Missing, OrderedFactor{4}}
peripheral_pulse              Union{Missing, OrderedFactor{4}}
mucous_membranes              Union{Missing, OrderedFactor{6}}
capillary_refill_time         Union{Missing, OrderedFactor{3}}
pain                          Union{Missing, OrderedFactor{5}}
peristalsis                   Union{Missing, OrderedFactor{4}}
abdominal_distension          Union{Missing, OrderedFactor{4}}
nasogastric_tube              Union{Missing, OrderedFactor{3}}
nasogastric_reflux            Union{Missing, OrderedFactor{3}}
nasogastric_reflux_ph         Union{Missing, OrderedFactor{24}}
feces                         Union{Missing, OrderedFactor{4}}
abdomen                       Union{Missing, OrderedFactor{5}}
packed_cell_volume            Union{Missing, Continuous}
total_protein                 Union{Missing, Continuous}
abdomcentesis_appearance      Union{Missing, OrderedFactor{3}}
abdomcentesis_total_protein   Union{Missing, Continuous}
outcome                       Union{Missing, OrderedFactor{3}}
surgical_lesion               OrderedFactor{2}
cp_data                       OrderedFactor{2}

Most columns are now treated as either Multiclass or OrderedFactor, which corresponds to the description of the data. For instance:

  • Surgery is described as 1=yes / 2=no

  • Age is described as 1=adult / 2=young
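
We can quickly confirm that the coercion kept these codes as ordered categorical levels via levels (re-exported by MLJ from CategoricalArrays; failing that, using CategoricalArrays provides it):

levels(datac.age)   # the two codes present in the data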

Inspecting the rest of the descriptions against the current scientific types, a few more things can be observed:

  • hospital_number is still a Count; as we check below, there are many distinct hospitals, so this feature is probably not very useful,

  • pulse and respiratory_rate are still Count, but the data description suggests they can be treated as Continuous.

length(unique(datac.hospital_number))
346

Indeed, that's nearly one distinct value per row; let's drop that column:

datac = select!(datac, Not(:hospital_number));

Let's also coerce pulse and respiratory_rate to Continuous; in fact, we can do that with autotype by specifying the :discrete_to_continuous rule:

datac = coerce(datac, autotype(datac, rules=(:discrete_to_continuous,)));

Dealing with missing values

There are quite a lot of missing values. In this tutorial we'll be a bit rough in how we deal with them, applying the following rules of thumb:

  • drop the rows where the outcome is unknown

  • drop columns with more than 20% missing values

  • simple imputation of whatever's left

missing_outcome = ismissing.(datac.outcome)
idx_missing_outcome = missing_outcome |> findall
2-element Vector{Int64}:
 133
 309

OK, there are only two such rows, which is nice. Let's drop them; since dropping rows shifts all subsequent indices, we also recompute the train/test split for the filtered frame:

n_train = last(train)                 # length of the original training block
datac = datac[.!missing_outcome, :]
# recompute the split for the filtered frame
train = 1:(n_train - count(<=(n_train), idx_missing_outcome))
test  = (last(train) + 1):nrows(datac);

Now let's look at the proportion of missing values per feature:

for name in names(datac)
    col = datac[:, name]
    ratio_missing = sum(ismissing.(col)) / nrows(datac) * 100
    println(rpad(name, 30), round(ratio_missing, sigdigits=3))
end
surgery                       0.0
age                           0.0
rectal_temperature            18.9
pulse                         7.1
respiratory_rate              19.4
temperature_extremities       17.5
peripheral_pulse              22.7
mucous_membranes              13.1
capillary_refill_time         10.4
pain                          17.2
peristalsis                   13.9
abdominal_distension          17.5
nasogastric_tube              35.5
nasogastric_reflux            36.1
nasogastric_reflux_ph         81.1
feces                         34.7
abdomen                       39.1
packed_cell_volume            9.84
total_protein                 11.5
abdomcentesis_appearance      52.7
abdomcentesis_total_protein   63.9
outcome                       0.0
surgical_lesion               0.0
cp_data                       0.0

Let's drop the columns with more than 20% missing values (quite a few!):

unwanted = [:peripheral_pulse, :nasogastric_tube, :nasogastric_reflux,
            :nasogastric_reflux_ph, :feces, :abdomen,
            :abdomcentesis_appearance, :abdomcentesis_total_protein]
select!(datac, Not(unwanted));
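
As an aside, that list could have been assembled programmatically rather than by hand; run before the select! above, something like the following would do (a sketch, using the same 20% threshold):

thresh = 0.2
to_drop = [Symbol(n) for n in names(datac)
           if count(ismissing, datac[!, n]) / nrows(datac) > thresh]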

Note that we could have done this more carefully and investigated the nature of the features with many missing values, but don't forget that our goal here is to showcase MLJ!

Let's conclude by filling all remaining missing values and separating the feature matrix from the target:

@load FillImputer
filler = machine(FillImputer(), datac)
fit!(filler)
datac = MLJ.transform(filler, datac)

y, X = unpack(datac, ==(:outcome), name->true);
X = coerce(X, autotype(X, :discrete_to_continuous));
import MLJModels ✔
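
As a quick sanity check (not part of the original output, and assuming FillImputer returns a DataFrame here, as MLJ transformers generally preserve the table type), we can verify that no missing values remain:

@assert all(col -> !any(ismissing, col), eachcol(datac))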

A baseline model

Let's define a first sensible model and get a baseline. The basic steps are:

  • one-hot-encode the categoricals

  • feed all this into a classifier

@load OneHotEncoder
MultinomialClassifier = @load MultinomialClassifier pkg="MLJLinearModels"
import MLJModels ✔
import MLJLinearModels ✔
MLJLinearModels.MultinomialClassifier

Let's get convenient handles on the training data:

Xtrain = X[train,:]
ytrain = y[train];

And let's define a pipeline corresponding to the operations above:

SimplePipe = Pipeline(
    OneHotEncoder(),
    MultinomialClassifier(),
    prediction_type=:probabilistic
)
mach = machine(SimplePipe, Xtrain, ytrain)
res = evaluate!(
    mach;
    resampling=Holdout(fraction_train=0.9),
    measure=cross_entropy
)
round(res.measurement[1], sigdigits=3)
0.802

This is the cross entropy measured on a held-out 10% of the training set. Just for the sake of getting a baseline, we can also look at the misclassification rate on the whole training data:

mcr = misclassification_rate(predict_mode(mach, Xtrain), ytrain)
println(rpad("MNC mcr:", 10), round(mcr, sigdigits=3))
MNC mcr:  0.301
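
As an aside, the holdout above uses a single 90/10 split, so the estimate is fairly noisy; cross-validation would give a steadier figure (a quick sketch, with the number of folds chosen arbitrarily):

res_cv = evaluate!(
    mach;
    resampling=CV(nfolds=6, shuffle=true, rng=1234),
    measure=cross_entropy
)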

Back to the baseline: that's not bad at all, actually. Let's tune it a bit and see if we can do better; there's not much point in going crazy, as we might gain a few percent but not much more.

model = SimplePipe
lambdas = range(model, :(multinomial_classifier.lambda),
                lower=1e-3, upper=100, scale=:log10)
tm = TunedModel(model=SimplePipe, ranges=lambdas, measure=cross_entropy)
mtm = machine(tm, Xtrain, ytrain)
fit!(mtm)
best_pipe = fitted_params(mtm).best_model
ProbabilisticPipeline(
    one_hot_encoder = OneHotEncoder(
            features = Symbol[],
            drop_last = false,
            ordered_factor = true,
            ignore = false),
    multinomial_classifier = MultinomialClassifier(
            lambda = 0.046415888336127795,
            gamma = 0.0,
            penalty = :l2,
            fit_intercept = true,
            penalize_intercept = false,
            scale_penalty_with_samples = true,
            solver = nothing),
    cache = true)
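
We can also inspect how the tuning went through the machine's report; per MLJTuning's documentation, this includes the full search history and a best_history_entry holding the winning hyperparameters together with their measured performance (a sketch):

rep = report(mtm)
rep.best_history_entry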

So it looks like it's useful to regularise a fair bit to get a lower cross entropy:

ŷ = MLJ.predict(mtm, Xtrain)
cross_entropy(ŷ, ytrain) |> mean
0.5866213889827255

This also brings our misclassification rate down somewhat:

mcr = misclassification_rate(mode.(ŷ), ytrain)
println(rpad("MNC mcr:", 10), round(mcr, sigdigits=3))
MNC mcr:  0.244

We've probably reached the limit of a simple linear model.

Trying another model

There are lots of categorical features, so it may be better to use a model that handles them well, such as a tree-based classifier.

XGBC = @load XGBoostClassifier
xgbm = machine(XGBC(), Xtrain, ytrain)
fit!(xgbm)
ŷ = MLJ.predict(xgbm, Xtrain)
cross_entropy(ŷ, ytrain) |> mean
import MLJXGBoostInterface ✔
0.0248151f0

So we get a much lower cross entropy on the training data, and...

misclassification_rate(mode.(ŷ), ytrain)
0.0033444816053511705

a significantly better misclassification rate as well. Bear in mind, though, that both numbers are computed on the training data, which boosted trees can fit almost perfectly.
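
For a fairer comparison, we can evaluate on the held-out test rows we kept track of at the start (a quick sketch reusing the variables defined above; this was not part of the original analysis):

Xtest = X[test, :]
ytest = y[test]
ŷtest = MLJ.predict(xgbm, Xtest)
misclassification_rate(mode.(ŷtest), ytest)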

We could investigate further, do more tuning, etc., but the key point of this tutorial was to show how to handle data with missing values.