Wine
To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions. If you have questions or suggestions about this tutorial, please open an issue here.
In this example, we consider the UCI "wine" dataset.
These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.
Let's download the data using the UrlDownload.jl package and load it into a DataFrame:
using HTTP
using MLJ
using StableRNGs # for RNGs, stable over Julia versions
import DataFrames: DataFrame, describe
using UrlDownload
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
header = ["Class", "Alcool", "Malic acid", "Ash", "Alcalinity of ash",
"Magnesium", "Total phenols", "Flavanoids",
"Nonflavanoid phenols", "Proanthcyanins", "Color intensity",
"Hue", "OD280/OD315 of diluted wines", "Proline"]
data = urldownload(url, true, format=:CSV, header=header);
The second argument to urldownload adds a progress meter for the download, format indicates the format of the file, and header supplies the column names, which are not present in the file.
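As an aside, since HTTP is also loaded, the file could equivalently be fetched and parsed by hand; the sketch below assumes CSV.jl is also installed:
import CSV
r = HTTP.get(url)                            # raw download
data_alt = CSV.File(r.body; header=header)   # parse the headerless file, supplying column names
In what follows we stick with the data returned by urldownload.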
df = DataFrame(data)
describe(df)
14×7 DataFrame
Row │ variable mean min median max nmissing eltype
│ Symbol Float64 Real Float64 Real Int64 DataType
─────┼────────────────────────────────────────────────────────────────────────────────────────
1 │ Class 1.9382 1 2.0 3 0 Int64
2 │ Alcool 13.0006 11.03 13.05 14.83 0 Float64
3 │ Malic acid 2.33635 0.74 1.865 5.8 0 Float64
4 │ Ash 2.36652 1.36 2.36 3.23 0 Float64
5 │ Alcalinity of ash 19.4949 10.6 19.5 30.0 0 Float64
6 │ Magnesium 99.7416 70 98.0 162 0 Int64
7 │ Total phenols 2.29511 0.98 2.355 3.88 0 Float64
8 │ Flavanoids 2.02927 0.34 2.135 5.08 0 Float64
9 │ Nonflavanoid phenols 0.361854 0.13 0.34 0.66 0 Float64
10 │ Proanthcyanins 1.5909 0.41 1.555 3.58 0 Float64
11 │ Color intensity 5.05809 1.28 4.69 13.0 0 Float64
12 │ Hue 0.957449 0.48 0.965 1.71 0 Float64
13 │ OD280/OD315 of diluted wines 2.61169 1.27 2.78 4.0 0 Float64
14 │ Proline 746.893 278 673.5 1680 0 Int64
The target is the Class column; everything else is a feature. We can separate the two using the unpack function:
y, X = unpack(df, ==(:Class)); # a vector and a table
Let's explore the scientific types attributed by default to the target and the features:
scitype(y)
AbstractVector{Count} (alias for AbstractArray{ScientificTypesBase.Count, 1})
This should be changed: the target is better interpreted as an ordered factor. The difference is as follows: a Count corresponds to a non-negative integer with no upper bound, whereas an OrderedFactor is a categorical value with finitely many possible levels and an ordering between them (here 1 < 2 < 3).
yc = coerce(y, OrderedFactor);
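As a quick sanity check, the coerced target now has the expected scitype and level ordering:
scitype(yc)    # AbstractVector{OrderedFactor{3}}
levels(yc)     # [1, 2, 3], with the ordering 1 < 2 < 3 now encoded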
Let's now consider the features. Since this is a table, we will inspect the scitypes using schema, which is more user-friendly:
schema(X)
┌──────────────────────────────┬────────────┬─────────┐
│ names │ scitypes │ types │
├──────────────────────────────┼────────────┼─────────┤
│ Alcool │ Continuous │ Float64 │
│ Malic acid │ Continuous │ Float64 │
│ Ash │ Continuous │ Float64 │
│ Alcalinity of ash │ Continuous │ Float64 │
│ Magnesium │ Count │ Int64 │
│ Total phenols │ Continuous │ Float64 │
│ Flavanoids │ Continuous │ Float64 │
│ Nonflavanoid phenols │ Continuous │ Float64 │
│ Proanthcyanins │ Continuous │ Float64 │
│ Color intensity │ Continuous │ Float64 │
│ Hue │ Continuous │ Float64 │
│ OD280/OD315 of diluted wines │ Continuous │ Float64 │
│ Proline │ Count │ Int64 │
└──────────────────────────────┴────────────┴─────────┘
So there are Continuous values (encoded as floating point) and Count values (integers). Note also that there are no missing values (otherwise one of the scientific types would have been a Union{Missing,*}). The two variables encoded as Count can probably be re-interpreted; let's have a look at Proline to see what it looks like:
X[1:5, :Proline]
5-element Vector{Int64}:
1065
1050
1185
1480
735
This likely represents a Continuous variable as well (it would be better to know precisely what the measurement is, but for now let's go with the hunch). We'll do the same with :Magnesium:
Xc = coerce(X, :Proline=>Continuous, :Magnesium=>Continuous);
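As a quick check, all thirteen features should now have the Continuous scitype:
scitype(Xc) <: Table(Continuous)    # true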
Finally, let's have a quick look at the mean and standard deviation of each feature to get a feel for their amplitude:
describe(Xc, :mean, :std)
13×3 DataFrame
Row │ variable mean std
│ Symbol Float64 Float64
─────┼──────────────────────────────────────────────────────
1 │ Alcool 13.0006 0.811827
2 │ Malic acid 2.33635 1.11715
3 │ Ash 2.36652 0.274344
4 │ Alcalinity of ash 19.4949 3.33956
5 │ Magnesium 99.7416 14.2825
6 │ Total phenols 2.29511 0.625851
7 │ Flavanoids 2.02927 0.998859
8 │ Nonflavanoid phenols 0.361854 0.124453
9 │ Proanthcyanins 1.5909 0.572359
10 │ Color intensity 5.05809 2.31829
11 │ Hue 0.957449 0.228572
12 │ OD280/OD315 of diluted wines 2.61169 0.70999
13 │ Proline 746.893 314.907
The scales vary quite a bit, which suggests standardising the data.
Note: to round off this first exploration, one could, for instance, plot histograms of the various features, check that there is enough variation among the continuous features, and verify that there are no apparent problems in the encoding; we leave this out to keep the tutorial short. We could also have checked that the classes are balanced.
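For instance, a quick way to check the class balance (assuming StatsBase is available in the environment):
import StatsBase: countmap
countmap(y)    # the three cultivars have 59, 71 and 48 samples respectively: reasonably balanced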
It's a multiclass classification problem with continuous inputs, so a sensible start is to test two very simple classifiers to get a baseline.
We'll train two simple pipelines:
a Standardizer + KNN classifier and
a Standardizer + Multinomial classifier (logistic regression).
KNNClassifier = @load KNNClassifier
MultinomialClassifier = @load MultinomialClassifier pkg=MLJLinearModels;
knn_pipe = Standardizer() |> KNNClassifier()
multinom_pipe = Standardizer() |> MultinomialClassifier()
import NearestNeighborModels ✔
import MLJLinearModels ✔
ProbabilisticPipeline(
standardizer = Standardizer(
features = Symbol[],
ignore = false,
ordered_factor = false,
count = false),
multinomial_classifier = MultinomialClassifier(
lambda = 2.220446049250313e-16,
gamma = 0.0,
penalty = :l2,
fit_intercept = true,
penalize_intercept = false,
scale_penalty_with_samples = true,
solver = nothing),
cache = true)
Note the |> syntax, which is syntactic sugar for creating a linear Pipeline.
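For instance, the first pipeline could equivalently have been built with the explicit constructor:
# equivalent to Standardizer() |> KNNClassifier()
knn_pipe_alt = Pipeline(Standardizer(), KNNClassifier())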
We can now fit these pipelines on a training split of the data, setting aside 20% of the data for eventual testing.
(Xtrain, Xtest), (ytrain, ytest) =
partition((Xc, yc), 0.8, rng=StableRNG(123), multi=true);
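As a quick sanity check, the split sizes are as expected (80% of the 178 rows for training):
length(ytrain), length(ytest)    # (142, 36)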
Let's now wrap instances of these pipelines with the training data (all hyperparameters are left at their defaults here):
knn = machine(knn_pipe, Xtrain, ytrain)
multinom = machine(multinom_pipe, Xtrain, ytrain)
untrained Machine; does not cache data
model: ProbabilisticPipeline(standardizer = Standardizer(features = Symbol[], …), …)
args:
1: Source @586 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
2: Source @285 ⏎ AbstractVector{ScientificTypesBase.OrderedFactor{3}}
Let's train the KNN pipeline with default hyperparameters and get baseline performance figures, using 90% of the training data to train the model and the remaining 10% to evaluate it:
opts = (
resampling=Holdout(fraction_train=0.9),
measures=[log_loss, accuracy],
)
evaluate!(knn; opts...)
PerformanceEvaluation object with these fields:
model, measure, operation,
measurement, per_fold, per_observation,
fitted_params_per_fold, report_per_fold,
train_test_rows, resampling, repeats
Extract:
┌───┬──────────────────────┬──────────────┬─────────────┐
│ │ measure │ operation │ measurement │
├───┼──────────────────────┼──────────────┼─────────────┤
│ A │ LogLoss( │ predict │ 0.0319 │
│ │ tol = 2.22045e-16) │ │ │
│ B │ Accuracy() │ predict_mode │ 1.0 │
└───┴──────────────────────┴──────────────┴─────────────┘
Now we do the same with the MultinomialClassifier:
evaluate!(multinom; opts...)
PerformanceEvaluation object with these fields:
model, measure, operation,
measurement, per_fold, per_observation,
fitted_params_per_fold, report_per_fold,
train_test_rows, resampling, repeats
Extract:
┌───┬──────────────────────┬──────────────┬─────────────┐
│ │ measure │ operation │ measurement │
├───┼──────────────────────┼──────────────┼─────────────┤
│ A │ LogLoss( │ predict │ 3.3e-6 │
│ │ tol = 2.22045e-16) │ │ │
│ B │ Accuracy() │ predict_mode │ 1.0 │
└───┴──────────────────────┴──────────────┴─────────────┘
Both methods have perfect out-of-sample accuracy, without any tuning!
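Of course, these estimates are based on a single 90/10 holdout; if we wanted a more robust estimate, the holdout could be swapped for stratified cross-validation, for instance:
# a sketch: replace the single holdout with 6-fold stratified cross-validation
cv_opts = (
    resampling=StratifiedCV(nfolds=6, rng=StableRNG(123)),
    measures=[log_loss, accuracy],
)
evaluate!(knn; cv_opts...)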
Let's check the accuracy on the test set:
fit!(knn) # train on all train data
yhat = predict_mode(knn, Xtest)
accuracy(yhat, ytest)
0.8888888888888888
Still pretty good.
fit!(multinom) # train on all train data
yhat = predict_mode(multinom, Xtest)
accuracy(yhat, ytest)
0.9444444444444444
Even better.
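Beyond a single accuracy figure, a confusion matrix shows which classes, if any, get mixed up; for instance, for the multinomial predictions on the test set:
# rows and columns correspond to the three cultivar classes
confusion_matrix(yhat, ytest)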
One way to get intuition for why the dataset is so easy to classify is to project it onto a 2D space using PCA and plot the three classes to see whether they are well separated; we use the |> pipeline syntax again here:
PCA = @load PCA
pca_pipe = Standardizer() |> PCA(maxoutdim=2)
pca = machine(pca_pipe, Xtrain)
fit!(pca)
W = transform(pca, Xtrain);
import MLJMultivariateStatsInterface ✔
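As an aside, it can be instructive to check how much of the total variance the two retained components capture. The field names below are those assumed for the PCA report provided by MLJMultivariateStatsInterface, so treat this as a sketch:
rpt = report(pca).pca                 # report of the PCA component of the pipeline
sum(rpt.principalvars) / rpt.tvar     # proportion of variance explained (assumed field names)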
Let's now display this using different colours for the different classes:
x1 = W.x1
x2 = W.x2
mask_1 = ytrain .== 1
mask_2 = ytrain .== 2
mask_3 = ytrain .== 3
using Plots
scatter(x1[mask_1], x2[mask_1], color="red", label="Class 1")
scatter!(x1[mask_2], x2[mask_2], color="blue", label="Class 2")
scatter!(x1[mask_3], x2[mask_3], color="yellow", label="Class 3")
xlabel!("PCA dimension 1")
ylabel!("PCA dimension 2")
From the figure it's clear why we managed to achieve such high scores with very simple classifiers. At this point there is little to be gained from digging much deeper into parameter tuning and the like.