Lab 4 - Logistic Regression, LDA, QDA, KNN

To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions.

If you have questions or suggestions about this tutorial, please open an issue here.

Stock market data
Caravan insurance data
1. ROC and AUC

Let's load the usual packages and the data

using MLJ
import RDatasets: dataset
import DataFrames: DataFrame, describe, select, Not
import StatsBase: countmap, cor, var
using PrettyPrinting

smarket = dataset("ISLR", "Smarket")
@show size(smarket)
@show names(smarket)

size(smarket) = (1250, 9)
names(smarket) = ["Year", "Lag1", "Lag2", "Lag3", "Lag4", "Lag5", "Volume", "Today", "Direction"]

Since we often want to only show a few significant digits for the metrics etc, let's introduce a very simple function that does that:

r3(x) = round(x, sigdigits=3)
r3(pi)

3.14

Let's get a description too

describe(smarket, :mean, :std, :eltype)

9×4 DataFrame
 Row │ variable   mean       std       eltype
     │ Symbol     Union…     Union…    DataType
─────┼─────────────────────────────────────────────────────────────────
   1 │ Year       2003.02    1.40902   Float64
   2 │ Lag1       0.0038344  1.1363    Float64
   3 │ Lag2       0.0039192  1.13628   Float64
   4 │ Lag3       0.001716   1.1387    Float64
   5 │ Lag4       0.001636   1.13877   Float64
   6 │ Lag5       0.0056096  1.14755   Float64
   7 │ Volume     1.47831    0.360357  Float64
   8 │ Today      0.0031384  1.13633   Float64
   9 │ Direction                       CategoricalValue{String, UInt8}

The target variable is :Direction:

y = smarket.Direction
X = select(smarket, Not(:Direction));

We can compute all the pairwise correlations; we use Matrix so that the dataframe entries are considered as one matrix of numbers with the same type (otherwise cor won't work):

cm = X |> Matrix |> cor
round.(cm, sigdigits=1)

8×8 Matrix{Float64}:
 1.0    0.03    0.03    0.03    0.04    0.03    0.5    0.03
 0.03   1.0    -0.03   -0.01   -0.003  -0.006   0.04  -0.03
 0.03  -0.03    1.0    -0.03   -0.01   -0.004  -0.04  -0.01
 0.03  -0.01   -0.03    1.0    -0.02   -0.02   -0.04  -0.002
 0.04  -0.003  -0.01   -0.02    1.0    -0.03   -0.05  -0.007
 0.03  -0.006  -0.004  -0.02   -0.03    1.0    -0.02  -0.03
 0.5    0.04   -0.04   -0.04   -0.05   -0.02    1.0    0.01
 0.03  -0.03   -0.01   -0.002  -0.007  -0.03    0.01   1.0

Let's see what the :Volume feature looks like:

using Plots

plot(X.Volume, size=(800,600), linewidth=2, legend=false)
xlabel!("Tick number")
ylabel!("Volume")

We will now try to train models; the target :Direction has two classes: Up and Down; it needs to be interpreted as a categorical object, and we will mark it as a ordered factor to specify that 'Up' is positive and 'Down' negative (for the confusion matrix later):

y = coerce(y, OrderedFactor)
classes(y[1])

2-element CategoricalArrays.CategoricalArray{String,1,UInt8}:
 "Down"
 "Up"

Note that in this case the default order comes from the lexicographic order which happens to map to our intuition since D comes before U.

cm = countmap(y)
categories, vals = collect(keys(cm)), collect(values(cm))
Plots.bar(categories, vals, title="Bar Chart Example", legend=false)
ylabel!("Number of occurrences")

Seems pretty balanced.

Let's now try fitting a simple logistic classifier (aka logistic regression) not using :Year and :Today:

LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels
X2 = select(X, Not([:Year, :Today]))
classif = machine(LogisticClassifier(), X2, y)

import MLJLinearModels ✔
untrained Machine; caches model-specific representations of data
  model: LogisticClassifier(lambda = 2.220446049250313e-16, …)
  args: 
    1:	Source @167 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
    2:	Source @784 ⏎ AbstractVector{ScientificTypesBase.OrderedFactor{2}}

Let's fit it to the data and try to reproduce the output:

fit!(classif)
ŷ = MLJ.predict(classif, X2)
ŷ[1:3]

3-element CategoricalDistributions.UnivariateFiniteVector{ScientificTypesBase.OrderedFactor{2}, String, UInt8, Float64}:
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(Down=>0.493, Up=>0.507)
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(Down=>0.519, Up=>0.481)
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(Down=>0.519, Up=>0.481)

Note that here the ŷ are scores. We can recover the average cross-entropy loss:

cross_entropy(ŷ, y) |> mean |> r3

0.691

in order to recover the class, we could use the mode and compare the misclassification rate:

ŷ = predict_mode(classif, X2)
misclassification_rate(ŷ, y) |> r3

0.478

Well that's not fantastic...

Let's visualise how we're doing building a confusion matrix, first is predicted, second is truth:

cm = confusion_matrix(ŷ, y)

          ┌─────────────┐
          │Ground Truth │
┌─────────┼──────┬──────┤
│Predicted│ Down │  Up  │
├─────────┼──────┼──────┤
│  Down   │ 145  │ 141  │
├─────────┼──────┼──────┤
│   Up    │ 457  │ 507  │
└─────────┴──────┴──────┘

We can then compute the accuracy or precision, etc. easily for instance:

@show false_positive(cm)
@show accuracy(ŷ, y)  |> r3
@show accuracy(cm)    |> r3  # same thing
@show positive_predictive_value(ŷ, y) |> r3   # a.k.a. precision
@show recall(ŷ, y)    |> r3
@show f1score(ŷ, y)   |> r3

false_positive(cm) = 457
accuracy(ŷ, y) |> r3 = 0.522
accuracy(cm) |> r3 = 0.522
positive_predictive_value(ŷ, y) |> r3 = 0.526
recall(ŷ, y) |> r3 = 0.782
f1score(ŷ, y) |> r3 = 0.629

Let's now train on the data before 2005 and use it to predict on the rest. Let's find the row indices for which the condition holds

train = 1:findlast(X.Year .< 2005)
test = last(train)+1:length(y);

We can now just re-fit the machine that we've already defined just on those rows and predict on the test:

fit!(classif, rows=train)
ŷ = predict_mode(classif, rows=test)
accuracy(ŷ, y[test]) |> r3

0.48

Well, that's not very good... Let's retrain a machine using only :Lag1 and :Lag2:

X3 = select(X2, [:Lag1, :Lag2])
classif = machine(LogisticClassifier(), X3, y)
fit!(classif, rows=train)
ŷ = predict_mode(classif, rows=test)
accuracy(ŷ, y[test]) |> r3

0.56

Interesting... it has higher accuracy than the model with more features! This could be investigated further by increasing the regularisation parameter but we'll leave that aside for now.

We can use a trained machine to predict on new data:

Xnew = (Lag1 = [1.2, 1.5], Lag2 = [1.1, -0.8])
ŷ = MLJ.predict(classif, Xnew)
ŷ |> pprint

CategoricalDistributions.UnivariateFinite{ScientificTypesBase.OrderedFactor{2}, String, UInt8, Float64}[UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(Down=>0.521, Up=>0.479), UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(Down=>0.504, Up=>0.496)]

Note: when specifying data, we used a simple NamedTuple; we could also have defined a dataframe or any other compatible tabular container. Note also that we retrieved the raw predictions here i.e.: a score for each class; we could have used predict_mode or indeed

mode.(ŷ)

2-element CategoricalArrays.CategoricalArray{String,1,UInt8}:
 "Down"
 "Down"

‎

Let's do a similar thing but with a LDA model this time:

BayesianLDA = @load BayesianLDA pkg=MultivariateStats

classif = machine(BayesianLDA(), X3, y)
fit!(classif, rows=train)
ŷ = predict_mode(classif, rows=test)

accuracy(ŷ, y[test]) |> r3

import MLJMultivariateStatsInterface ✔
0.56

Note: BayesianLDA is LDA using a multivariate normal model for each class with a default prior inferred from the proportions for each class in the training data. You can also use the bare LDA model which does not make these assumptions and allows using a different metric in the transformed space, see the docs for details.

LDA = @load LDA pkg=MultivariateStats
using Distances

classif = machine(LDA(dist=CosineDist()), X3, y)
fit!(classif, rows=train)
ŷ = predict_mode(classif, rows=test)

accuracy(ŷ, y[test]) |> r3

import MLJMultivariateStatsInterface ✔
0.548

‎

Bayesian QDA is available via ScikitLearn:

BayesianQDA = @load BayesianQDA pkg=MLJScikitLearnInterface

import MLJScikitLearnInterface ✔
MLJScikitLearnInterface.BayesianQDA

Using it is done in much the same way as before:

classif = machine(BayesianQDA(), X3, y)
fit!(classif, rows=train)
ŷ = predict_mode(classif, rows=test)

accuracy(ŷ, y[test]) |> r3

0.599

‎

We can use K-Nearest Neighbors models via the NearestNeighbors package:

KNNClassifier = @load KNNClassifier

knnc = KNNClassifier(K=1)
classif = machine(knnc, X3, y)
fit!(classif, rows=train)
ŷ = predict_mode(classif, rows=test)
accuracy(ŷ, y[test]) |> r3

import NearestNeighborModels ✔
0.5

Pretty bad... let's try with three neighbors

knnc.K = 3
fit!(classif, rows=train)
ŷ = predict_mode(classif, rows=test)
accuracy(ŷ, y[test]) |> r3

0.532

A bit better but not hugely so.

‎

The caravan dataset is part of ISLR as well:

caravan  = dataset("ISLR", "Caravan")
size(caravan)

(5822, 86)

The target variable is Purchase, effectively a categorical

purchase = caravan.Purchase
vals     = unique(purchase)

2-element Vector{String}:
 "No"
 "Yes"

Let's see how many of each we have

nl1 = sum(purchase .== vals[1])
nl2 = sum(purchase .== vals[2])
println("#$(vals[1]) ", nl1)
println("#$(vals[2]) ", nl2)

#No 5474
#Yes 348

we can also visualise this as was done before:

cm = countmap(purchase)
categories, vals = collect(keys(cm)), collect(values(cm))
bar(categories, vals, title="Bar Chart Example", legend=false)
ylabel!("Number of occurrences")

that's quite unbalanced.

Apart from the target, all other variables are numbers; we can standardize the data:

y, X = unpack(caravan, ==(:Purchase))

mstd = machine(Standardizer(), X)
fit!(mstd)
Xs = MLJ.transform(mstd, X)

var(Xs[:,1]) |> r3

1.0

Note: in MLJ, it is recommended to work with pipelines / networks when possible and not do "step-by-step" transformation and fitting of the data as this is more error prone. We do it here to stick to the ISL tutorial.

We split the data in the first 1000 rows for testing and the rest for training:

test = 1:1000
train = last(test)+1:nrows(Xs);

Let's now fit a KNN model and check the misclassification rate

classif = machine(KNNClassifier(K=3), Xs, y)
fit!(classif, rows=train)
ŷ = predict_mode(classif, rows=test)

accuracy(ŷ, y[test]) |> r3

0.925

that looks good but recall the problem is very unbalanced

mean(y[test] .!= "No") |> r3

0.059

Let's fit a logistic classifier to this problem

classif = machine(LogisticClassifier(), Xs, y)
fit!(classif, rows=train)
ŷ = predict_mode(classif, rows=test)

accuracy(ŷ, y[test]) |> r3

0.934

Since we have a probabilistic classifier, we can also check metrics that take scores into account such as the area under the ROC curve (AUC):

ŷ = MLJ.predict(classif, rows=test)

auc(ŷ, y[test])

0.7388551666996884

We can also display the curve itself

fprs, tprs, thresholds = roc_curve(ŷ, y[test])

plot(fprs, tprs, linewidth=2, size=(800,600))
xlabel!("False Positive Rate")
ylabel!("True Positive Rate")

‎