Wine

To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions.

If you have questions or suggestions about this tutorial, please open an issue on the tutorial repository.

Initial data processing

In this example, we consider the UCI "wine" dataset

These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

Getting the data

Let's download the data using the UrlDownload.jl package and load it into a DataFrame:

using HTTP
using MLJ
using PyPlot
import DataFrames: DataFrame, describe
using UrlDownload

url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
header = ["Class", "Alcool", "Malic acid", "Ash", "Alcalinity of ash",
          "Magnesium", "Total phenols", "Flavanoids",
          "Nonflavanoid phenols", "Proanthcyanins", "Color intensity",
          "Hue", "OD280/OD315 of diluted wines", "Proline"]
data = urldownload(url, true, format=:CSV, header=header);

The second argument to urldownload enables a progress meter for the download, format indicates the format of the file, and header passes the column names, which are not included in the file itself.
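Alternatively, since HTTP.jl is already loaded, the same file could be fetched and parsed directly with CSV.jl. This is only a sketch of an equivalent route (it assumes CSV.jl is installed; the names raw, data_alt and df_alt are illustrative):

import CSV
raw = HTTP.get(url).body                  # raw bytes of the file
data_alt = CSV.File(raw; header=header)   # `header` supplies the column names
df_alt = DataFrame(data_alt);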

df = DataFrame(data)
describe(df)
14×7 DataFrame
 Row │ variable                      mean        min     median   max      nmissing  eltype
     │ Symbol                        Float64     Real    Float64  Real     Int64     DataType
─────┼────────────────────────────────────────────────────────────────────────────────────────
   1 │ Class                           1.9382      1       2.0       3            0  Int64
   2 │ Alcool                         13.0006     11.03   13.05     14.83         0  Float64
   3 │ Malic acid                      2.33635     0.74    1.865     5.8          0  Float64
   4 │ Ash                             2.36652     1.36    2.36      3.23         0  Float64
   5 │ Alcalinity of ash              19.4949     10.6    19.5      30.0          0  Float64
   6 │ Magnesium                      99.7416     70      98.0     162            0  Int64
   7 │ Total phenols                   2.29511     0.98    2.355     3.88         0  Float64
   8 │ Flavanoids                      2.02927     0.34    2.135     5.08         0  Float64
   9 │ Nonflavanoid phenols            0.361854    0.13    0.34      0.66         0  Float64
  10 │ Proanthcyanins                  1.5909      0.41    1.555     3.58         0  Float64
  11 │ Color intensity                 5.05809     1.28    4.69     13.0          0  Float64
  12 │ Hue                             0.957449    0.48    0.965     1.71         0  Float64
  13 │ OD280/OD315 of diluted wines    2.61169     1.27    2.78      4.0          0  Float64
  14 │ Proline                       746.893     278     673.5    1680            0  Int64

The target is the Class column and everything else is a feature; we can separate the two using the unpack function:

y, X = unpack(df, ==(:Class));

Setting the scientific type

Let's explore the scientific types attributed by default to the target and the features:

scitype(y)
AbstractVector{Count} (alias for AbstractArray{ScientificTypesBase.Count, 1})

This should be changed: the target is better interpreted as an ordered factor. The difference is as follows:

  • a Count corresponds to an integer between 0 and infinity

  • an OrderedFactor, by contrast, is a categorical object (there are finitely many levels) with an ordering (1 < 2 < 3).

yc = coerce(y, OrderedFactor);
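To confirm the coercion took effect, we can inspect the new scientific type and the (now ordered) levels; a quick sanity check using CategoricalArrays, which backs MLJ's categorical vectors:

import CategoricalArrays: levels, isordered
scitype(yc)    # AbstractVector{OrderedFactor{3}}
levels(yc)     # the three classes: [1, 2, 3]
isordered(yc)  # true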

Let's now consider the features:

scitype(X)
ScientificTypesBase.Table{Union{AbstractVector{ScientificTypesBase.Continuous}, AbstractVector{ScientificTypesBase.Count}}}

So there are Continuous values (encoded as floating point) and Count values (integers). Note also that there are no missing values (otherwise one of the scientific types would have been a Union{Missing,*}). Let's check which column is which:

schema(X)
┌──────────────────────────────┬────────────┬─────────┐
│ names                        │ scitypes   │ types   │
├──────────────────────────────┼────────────┼─────────┤
│ Alcool                       │ Continuous │ Float64 │
│ Malic acid                   │ Continuous │ Float64 │
│ Ash                          │ Continuous │ Float64 │
│ Alcalinity of ash            │ Continuous │ Float64 │
│ Magnesium                    │ Count      │ Int64   │
│ Total phenols                │ Continuous │ Float64 │
│ Flavanoids                   │ Continuous │ Float64 │
│ Nonflavanoid phenols         │ Continuous │ Float64 │
│ Proanthcyanins               │ Continuous │ Float64 │
│ Color intensity              │ Continuous │ Float64 │
│ Hue                          │ Continuous │ Float64 │
│ OD280/OD315 of diluted wines │ Continuous │ Float64 │
│ Proline                      │ Count      │ Int64   │
└──────────────────────────────┴────────────┴─────────┘

The two variables that are encoded as Count can probably be re-interpreted; let's have a look at Proline to see what it looks like:

X[1:5, :Proline]
5-element Vector{Int64}:
 1065
 1050
 1185
 1480
  735

It can likely be interpreted as Continuous as well (it would be better to know precisely what it represents, but for now let's go with the hunch). We'll do the same with :Magnesium:

Xc = coerce(X, :Proline=>Continuous, :Magnesium=>Continuous);
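A quick check that every feature is now interpreted as Continuous:

scitype(Xc)    # Table{AbstractVector{Continuous}}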

Finally, let's have a quick look at the mean and standard deviation of each feature to get a feel for their amplitude:

describe(Xc, :mean, :std)
13×3 DataFrame
 Row │ variable                      mean        std
     │ Symbol                        Float64     Float64
─────┼──────────────────────────────────────────────────────
   1 │ Alcool                         13.0006      0.811827
   2 │ Malic acid                      2.33635     1.11715
   3 │ Ash                             2.36652     0.274344
   4 │ Alcalinity of ash              19.4949      3.33956
   5 │ Magnesium                      99.7416     14.2825
   6 │ Total phenols                   2.29511     0.625851
   7 │ Flavanoids                      2.02927     0.998859
   8 │ Nonflavanoid phenols            0.361854    0.124453
   9 │ Proanthcyanins                  1.5909      0.572359
  10 │ Color intensity                 5.05809     2.31829
  11 │ Hue                             0.957449    0.228572
  12 │ OD280/OD315 of diluted wines    2.61169     0.70999
  13 │ Proline                       746.893     314.907

The scales vary quite a bit across features, which invites standardising the data.

Note: to complete such a first exploration, one could, for instance, look at histograms of the various features to check that there is enough variation among the continuous features and that there do not seem to be problems in the encoding; we cut this out to keep the tutorial short. We could also have checked that the classes are balanced, as sketched below.
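For instance, a class-balance check could look like the following sketch (it assumes StatsBase is available; countmap simply counts the occurrences of each class label):

import StatsBase: countmap
countmap(yc)   # number of observations per class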

Getting a baseline

It's a multiclass classification problem with continuous inputs, so a sensible start is to test two very simple classifiers to get a baseline. We'll train two simple pipelines:

  • a Standardizer + KNN classifier and

  • a Standardizer + Multinomial classifier (logistic regression).

KNNC = @load KNNClassifier
MNC = @load MultinomialClassifier pkg=MLJLinearModels;

KnnPipe = Standardizer |> KNNC
MnPipe = Standardizer |> MNC
import NearestNeighborModels ✔
import MLJLinearModels ✔
ProbabilisticPipeline(
    standardizer = Standardizer(
            features = Symbol[],
            ignore = false,
            ordered_factor = false,
            count = false),
    multinomial_classifier = MultinomialClassifier(
            lambda = 1.0,
            gamma = 0.0,
            penalty = :l2,
            fit_intercept = true,
            penalize_intercept = false,
            scale_penalty_with_samples = true,
            solver = nothing),
    cache = true)

Note the |> syntax, which is syntactic sugar for creating a linear Pipeline from component models.
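For reference, the same composite model can be built without the sugar by calling the Pipeline constructor directly; the following is an equivalent sketch (the name MnPipeExplicit is illustrative):

MnPipeExplicit = Pipeline(Standardizer(), MNC())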

We can now fit this on a train split of the data, setting aside 20% of the data for final testing.

train, test = partition(collect(eachindex(yc)), 0.8, shuffle=true, rng=111)
Xtrain = selectrows(Xc, train)
Xtest = selectrows(Xc, test)
ytrain = selectrows(yc, train)
ytest = selectrows(yc, test);

Let's now wrap instances of these models with the data (all hyperparameters are left at their defaults here):

knn = machine(KnnPipe, Xtrain, ytrain)
multi = machine(MnPipe, Xtrain, ytrain)
Machine{ProbabilisticPipeline{NamedTuple{,…},…},…} trained 0 times; caches data
  model: MLJBase.ProbabilisticPipeline{NamedTuple{(:standardizer, :multinomial_classifier), Tuple{MLJModelInterface.Unsupervised, MLJModelInterface.Probabilistic}}, MLJModelInterface.predict}
  args: 
    1:	Source @166 ⏎ `ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}`
    2:	Source @129 ⏎ `AbstractVector{ScientificTypesBase.OrderedFactor{3}}`

Let's train the KNN pipeline with default hyperparameters and get a baseline cross-entropy score, using 90% of the training data to train the model and the remaining 10% to evaluate it:

opts = (resampling=Holdout(fraction_train=0.9), measure=cross_entropy)
res = evaluate!(knn; opts...)
round(res.measurement[1], sigdigits=3)
0.0159

Now we do the same with a MultinomialClassifier

res = evaluate!(multi; opts...)
round(res.measurement[1], sigdigits=3)
0.143

Both methods perform well, with the KNN pipeline ahead on cross-entropy. Let's also check the misclassification rate over the full training set:

mcr_k = misclassification_rate(predict_mode(knn, Xtrain), ytrain)
mcr_m = misclassification_rate(predict_mode(multi, Xtrain), ytrain)
println(rpad("KNN mcr:", 10), round(mcr_k, sigdigits=3))
println(rpad("MNC mcr:", 10), round(mcr_m, sigdigits=3))
KNN mcr:  0.0352
MNC mcr:  0.0423

So, without any hyperparameter tuning, we already have misclassification rates below 5%. Clearly the problem is not very difficult.

Visualising the classes

One way to get intuition for why the dataset is so easy to classify is to project it onto a 2D space using PCA and display the three classes to see whether they are well separated; we use the |> pipeline syntax again here.

PCA = @load PCA
pca_pipe = Standardizer() |> PCA(maxoutdim=2)
pca = machine(pca_pipe, Xtrain)
fit!(pca)
W = transform(pca, Xtrain)
import MLJMultivariateStatsInterface ✔
142×2 DataFrame
 Row │ x1           x2
     │ Float64      Float64
─────┼─────────────────────────
   1 │ -1.61545     -0.531419
   2 │ -1.91617     -1.32584
   3 │  1.31861     -2.15251
   4 │  3.29083      2.5577
   5 │ -3.33859      1.28375
  ⋮  │      ⋮            ⋮
 141 │ -0.766391    -3.3572
 142 │  3.89499     -0.0425803
                135 rows omitted

Let's now display this using different colours for the different classes:

x1 = W.x1
x2 = W.x2

mask_1 = ytrain .== 1
mask_2 = ytrain .== 2
mask_3 = ytrain .== 3

figure(figsize=(8, 6))
plot(x1[mask_1], x2[mask_1], linestyle="none", marker="o", color="red")
plot(x1[mask_2], x2[mask_2], linestyle="none", marker="o", color="blue")
plot(x1[mask_3], x2[mask_3], linestyle="none", marker="o", color="magenta")

xlabel("PCA dimension 1", fontsize=14)
ylabel("PCA dimension 2", fontsize=14)
legend(["Class 1", "Class 2", "Class 3"], fontsize=12)
xticks(fontsize=12)
yticks(fontsize=12)
[Figure: 2D PCA projection of the training data, coloured by class]

On this figure it becomes quite clear why we managed to achieve such high scores with very simple classifiers. At this point there is little to be gained from digging much deeper into parameter tuning etc.

As a last step, we can report the performance of both models on the test set which we set aside earlier:

perf_k = misclassification_rate(predict_mode(knn, Xtest), ytest)
perf_m = misclassification_rate(predict_mode(multi, Xtest), ytest)
println(rpad("KNN mcr:", 10), round(perf_k, sigdigits=3))
println(rpad("MNC mcr:", 10), round(perf_m, sigdigits=3))
KNN mcr:  0.111
MNC mcr:  0.0833

Pretty good for so little work!