Wine
To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions. If you have questions or suggestions about this tutorial, please open an issue here.
In this example, we consider the UCI "wine" dataset.
These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.
Let's download the data using the UrlDownload.jl package and load it into a DataFrame:
using HTTP
using MLJ
using StableRNGs # for RNGs, stable over Julia versions
import DataFrames: DataFrame, describe
using UrlDownload
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
header = ["Class", "Alcool", "Malic acid", "Ash", "Alcalinity of ash",
"Magnesium", "Total phenols", "Flavanoids",
"Nonflavanoid phenols", "Proanthcyanins", "Color intensity",
"Hue", "OD280/OD315 of diluted wines", "Proline"]
data = urldownload(url, true, format=:CSV, header=header);
The second argument to urldownload adds a progress meter for the download, format indicates the format of the file, and header supplies the column names, which are not present in the file.
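As an aside, since HTTP is also loaded, the file could equivalently be fetched and parsed by hand; the sketch below assumes CSV.jl is also installed:
import CSV
r = HTTP.get(url)                            # raw download
data_alt = CSV.File(r.body; header=header)   # parse the headerless file, supplying column names
In what follows we stick with the data returned by urldownload.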
df = DataFrame(data)
describe(df)
14×7 DataFrame
Row │ variable mean min median max nmissing eltype
│ Symbol Float64 Real Float64 Real Int64 DataType
─────┼────────────────────────────────────────────────────────────────────────────────────────
1 │ Class 1.9382 1 2.0 3 0 Int64
2 │ Alcool 13.0006 11.03 13.05 14.83 0 Float64
3 │ Malic acid 2.33635 0.74 1.865 5.8 0 Float64
4 │ Ash 2.36652 1.36 2.36 3.23 0 Float64
5 │ Alcalinity of ash 19.4949 10.6 19.5 30.0 0 Float64
6 │ Magnesium 99.7416 70 98.0 162 0 Int64
7 │ Total phenols 2.29511 0.98 2.355 3.88 0 Float64
8 │ Flavanoids 2.02927 0.34 2.135 5.08 0 Float64
9 │ Nonflavanoid phenols 0.361854 0.13 0.34 0.66 0 Float64
10 │ Proanthcyanins 1.5909 0.41 1.555 3.58 0 Float64
11 │ Color intensity 5.05809 1.28 4.69 13.0 0 Float64
12 │ Hue 0.957449 0.48 0.965 1.71 0 Float64
13 │ OD280/OD315 of diluted wines 2.61169 1.27 2.78 4.0 0 Float64
14 │ Proline 746.893 278 673.5 1680 0 Int64
The target is the Class column; everything else is a feature. We can separate the two using the unpack function:
y, X = unpack(df, ==(:Class)); # a vector and a table
Let's explore the scientific types attributed by default to the target and the features:
scitype(y)
AbstractVector{Count} (alias for AbstractArray{ScientificTypesBase.Count, 1})
This should be changed: the target is better interpreted as an ordered factor. The difference is as follows: a Count corresponds to a non-negative integer with no upper bound, whereas an OrderedFactor is a categorical value with finitely many possible levels and an ordering between them (here 1 < 2 < 3).
yc = coerce(y, OrderedFactor);
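As a quick sanity check, the coerced target now has the expected scitype and level ordering:
scitype(yc)    # AbstractVector{OrderedFactor{3}}
levels(yc)     # [1, 2, 3], with the ordering 1 < 2 < 3 now encoded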
Let's now consider the features. Since this is a table, we will inspect the scitypes using schema, which is more user-friendly:
schema(X)
┌──────────────────────────────┬────────────┬─────────┐
│ names │ scitypes │ types │
├──────────────────────────────┼────────────┼─────────┤
│ Alcool │ Continuous │ Float64 │
│ Malic acid │ Continuous │ Float64 │
│ Ash │ Continuous │ Float64 │
│ Alcalinity of ash │ Continuous │ Float64 │
│ Magnesium │ Count │ Int64 │
│ Total phenols │ Continuous │ Float64 │
│ Flavanoids │ Continuous │ Float64 │
│ Nonflavanoid phenols │ Continuous │ Float64 │
│ Proanthcyanins │ Continuous │ Float64 │
│ Color intensity │ Continuous │ Float64 │
│ Hue │ Continuous │ Float64 │
│ OD280/OD315 of diluted wines │ Continuous │ Float64 │
│ Proline │ Count │ Int64 │
└──────────────────────────────┴────────────┴─────────┘
So there are Continuous values (encoded as floating point) and Count values (integers). Note also that there are no missing values (otherwise one of the scientific types would have been a Union{Missing,*}). The two variables encoded as Count can probably be re-interpreted; let's have a look at Proline to see what it looks like:
X[1:5, :Proline]
5-element Vector{Int64}:
1065
1050
1185
1480
735
This likely represents a Continuous variable as well (it would be better to know precisely what the measurement is, but for now let's go with the hunch). We'll do the same with :Magnesium:
Xc = coerce(X, :Proline=>Continuous, :Magnesium=>Continuous);
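As a quick check, all thirteen features should now have the Continuous scitype:
scitype(Xc) <: Table(Continuous)    # true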
Finally, let's have a quick look at the mean and standard deviation of each feature to get a feel for their amplitude:
describe(Xc, :mean, :std)
13×3 DataFrame
Row │ variable mean std
│ Symbol Float64 Float64
─────┼──────────────────────────────────────────────────────
1 │ Alcool 13.0006 0.811827
2 │ Malic acid 2.33635 1.11715
3 │ Ash 2.36652 0.274344
4 │ Alcalinity of ash 19.4949 3.33956
5 │ Magnesium 99.7416 14.2825
6 │ Total phenols 2.29511 0.625851
7 │ Flavanoids 2.02927 0.998859
8 │ Nonflavanoid phenols 0.361854 0.124453
9 │ Proanthcyanins 1.5909 0.572359
10 │ Color intensity 5.05809 2.31829
11 │ Hue 0.957449 0.228572
12 │ OD280/OD315 of diluted wines 2.61169 0.70999
13 │ Proline 746.893 314.907
The scales vary quite a bit, which suggests standardising the data.
Note: to round off this first exploration, one could, for instance, plot histograms of the various features, check that there is enough variation among the continuous features, and verify that there are no apparent problems in the encoding; we leave this out to keep the tutorial short. We could also have checked that the classes are balanced.
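For instance, a quick way to check the class balance (assuming StatsBase is available in the environment):
import StatsBase: countmap
countmap(y)    # the three cultivars have 59, 71 and 48 samples respectively: reasonably balanced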
It's a multiclass classification problem with continuous inputs, so a sensible start is to test two very simple classifiers to get a baseline.
We'll train two simple pipelines:
a Standardizer + KNN classifier and
a Standardizer + Multinomial classifier (logistic regression).
KNNClassifier = @load KNNClassifier
MultinomialClassifier = @load MultinomialClassifier pkg=MLJLinearModels;
knn_pipe = Standardizer() |> KNNClassifier()
multinom_pipe = Standardizer() |> MultinomialClassifier()
import NearestNeighborModels ✔
import MLJLinearModels ✔
ProbabilisticPipeline(
standardizer = Standardizer(
features = Symbol[],
ignore = false,
ordered_factor = false,
count = false),
multinomial_classifier = MultinomialClassifier(
lambda = 2.220446049250313e-16,
gamma = 0.0,
penalty = :l2,
fit_intercept = true,
penalize_intercept = false,
scale_penalty_with_samples = true,
solver = nothing),
cache = true)
Note the |> syntax, which is syntactic sugar for creating a linear Pipeline.
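For instance, the first pipeline could equivalently have been built with the explicit constructor:
# equivalent to Standardizer() |> KNNClassifier()
knn_pipe_alt = Pipeline(Standardizer(), KNNClassifier())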
We can now fit these pipelines on a training split of the data, setting aside 20% of the data for eventual testing.
(Xtrain, Xtest), (ytrain, ytest) =
partition((Xc, yc), 0.8, rng=StableRNG(123), multi=true);
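As a quick sanity check, the split sizes are as expected (80% of the 178 rows for training):
length(ytrain), length(ytest)    # (142, 36)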
Let's now wrap instances of these pipelines with the training data (all hyperparameters are left at their defaults here):
knn = machine(knn_pipe, Xtrain, ytrain)
multinom = machine(multinom_pipe, Xtrain, ytrain)
untrained Machine; does not cache data
model: ProbabilisticPipeline(standardizer = Standardizer(features = Symbol[], …), …)
args:
1: Source @586 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
2: Source @285 ⏎ AbstractVector{ScientificTypesBase.OrderedFactor{3}}
Let's train the KNN pipeline with default hyperparameters and get baseline performance figures, using 90% of the training data to train the model and the remaining 10% to evaluate it:
opts = (
resampling=Holdout(fraction_train=0.9),
measures=[log_loss, accuracy],
)
evaluate!(knn; opts...)
PerformanceEvaluation object with these fields:
model, measure, operation,
measurement, per_fold, per_observation,
fitted_params_per_fold, report_per_fold,
train_test_rows, resampling, repeats
Extract:
┌───┬──────────────────────┬──────────────┬─────────────┐
│ │ measure │ operation │ measurement │
├───┼──────────────────────┼──────────────┼─────────────┤
│ A │ LogLoss( │ predict │ 0.0319 │
│ │ tol = 2.22045e-16) │ │ │
│ B │ Accuracy() │ predict_mode │ 1.0 │
└───┴──────────────────────┴──────────────┴─────────────┘
Now we do the same with the MultinomialClassifier:
evaluate!(multinom; opts...)
PerformanceEvaluation object with these fields:
model, measure, operation,
measurement, per_fold, per_observation,
fitted_params_per_fold, report_per_fold,
train_test_rows, resampling, repeats
Extract:
┌───┬──────────────────────┬──────────────┬─────────────┐
│ │ measure │ operation │ measurement │
├───┼──────────────────────┼──────────────┼─────────────┤
│ A │ LogLoss( │ predict │ 3.3e-6 │
│ │ tol = 2.22045e-16) │ │ │
│ B │ Accuracy() │ predict_mode │ 1.0 │
└───┴──────────────────────┴──────────────┴─────────────┘
Both methods have perfect out-of-sample accuracy, without any tuning!
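Of course, these estimates are based on a single 90/10 holdout; if we wanted a more robust estimate, the holdout could be swapped for stratified cross-validation, for instance:
# a sketch: replace the single holdout with 6-fold stratified cross-validation
cv_opts = (
    resampling=StratifiedCV(nfolds=6, rng=StableRNG(123)),
    measures=[log_loss, accuracy],
)
evaluate!(knn; cv_opts...)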
Let's check the accuracy on the test set:
fit!(knn) # train on all train data
yhat = predict_mode(knn, Xtest)
accuracy(yhat, ytest)
0.8888888888888888
Still pretty good.
fit!(multinom) # train on all train data
yhat = predict_mode(multinom, Xtest)
accuracy(yhat, ytest)
0.9444444444444444
Even better.
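Beyond a single accuracy figure, a confusion matrix shows which classes, if any, get mixed up; for instance, for the multinomial predictions on the test set:
# rows and columns correspond to the three cultivar classes
confusion_matrix(yhat, ytest)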
One way to get intuition for why the dataset is so easy to classify is to project it onto a 2D space using PCA and plot the three classes to see whether they are well separated; we use the |> pipeline syntax again here:
PCA = @load PCA
pca_pipe = Standardizer() |> PCA(maxoutdim=2)
pca = machine(pca_pipe, Xtrain)
fit!(pca)
W = transform(pca, Xtrain);
import MLJMultivariateStatsInterface ✔
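As an aside, it can be instructive to check how much of the total variance the two retained components capture. The field names below are those assumed for the PCA report provided by MLJMultivariateStatsInterface, so treat this as a sketch:
rpt = report(pca).pca                 # report of the PCA component of the pipeline
sum(rpt.principalvars) / rpt.tvar     # proportion of variance explained (assumed field names)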
Let's now display this using different colours for the different classes:
x1 = W.x1
x2 = W.x2
mask_1 = ytrain .== 1
mask_2 = ytrain .== 2
mask_3 = ytrain .== 3
using Plots
scatter(x1[mask_1], x2[mask_1], color="red", label="Class 1")
scatter!(x1[mask_2], x2[mask_2], color="blue", label="Class 2")
scatter!(x1[mask_3], x2[mask_3], color="yellow", label="Class 3")
xlabel!("PCA dimension 1")
ylabel!("PCA dimension 2")
From the figure it's clear why we managed to achieve such high scores with very simple classifiers. At this point there is little to be gained from digging much deeper into parameter tuning and the like.