Lab 4 - Logistic Regression, LDA, QDA, KNN
To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions.If you have questions or suggestions about this tutorial, please open an issue here.
Let's load the usual packages and the data
using MLJ
import RDatasets: dataset
import DataFrames: DataFrame, describe, select, Not
import StatsBase: countmap, cor, var
using PrettyPrinting
smarket = dataset("ISLR", "Smarket")
@show size(smarket)
@show names(smarket)
size(smarket) = (1250, 9)
names(smarket) = ["Year", "Lag1", "Lag2", "Lag3", "Lag4", "Lag5", "Volume", "Today", "Direction"]
Since we often want to only show a few significant digits for the metrics etc, let's introduce a very simple function that does that:
r3(x) = round(x, sigdigits=3)
r3(pi)
3.14
Let's get a description too
describe(smarket, :mean, :std, :eltype)
9×4 DataFrame
Row │ variable mean std eltype
│ Symbol Union… Union… DataType
─────┼─────────────────────────────────────────────────────────────────
1 │ Year 2003.02 1.40902 Float64
2 │ Lag1 0.0038344 1.1363 Float64
3 │ Lag2 0.0039192 1.13628 Float64
4 │ Lag3 0.001716 1.1387 Float64
5 │ Lag4 0.001636 1.13877 Float64
6 │ Lag5 0.0056096 1.14755 Float64
7 │ Volume 1.4783 0.360357 Float64
8 │ Today 0.0031384 1.13633 Float64
9 │ Direction CategoricalValue{String, UInt8}
The target variable is :Direction
:
y = smarket.Direction
X = select(smarket, Not(:Direction));
We can compute all the pairwise correlations; we use Matrix
so that the dataframe entries are considered as one matrix of numbers with the same type (otherwise cor
won't work):
cm = X |> Matrix |> cor
round.(cm, sigdigits=1)
8×8 Matrix{Float64}:
1.0 0.03 0.03 0.03 0.04 0.03 0.5 0.03
0.03 1.0 -0.03 -0.01 -0.003 -0.006 0.04 -0.03
0.03 -0.03 1.0 -0.03 -0.01 -0.004 -0.04 -0.01
0.03 -0.01 -0.03 1.0 -0.02 -0.02 -0.04 -0.002
0.04 -0.003 -0.01 -0.02 1.0 -0.03 -0.05 -0.007
0.03 -0.006 -0.004 -0.02 -0.03 1.0 -0.02 -0.03
0.5 0.04 -0.04 -0.04 -0.05 -0.02 1.0 0.01
0.03 -0.03 -0.01 -0.002 -0.007 -0.03 0.01 1.0
Let's see what the :Volume
feature looks like:
using Plots
plot(X.Volume, size=(800,600), linewidth=2, legend=false)
xlabel!("Tick number")
ylabel!("Volume")
We will now try to train models; the target :Direction
has two classes: Up
and Down
; it needs to be interpreted as a categorical object, and we will mark it as a ordered factor to specify that 'Up' is positive and 'Down' negative (for the confusion matrix later):
y = coerce(y, OrderedFactor)
classes(y[1])
2-element CategoricalArrays.CategoricalArray{String,1,UInt8}:
"Down"
"Up"
Note that in this case the default order comes from the lexicographic order which happens to map to our intuition since D
comes before U
.
cm = countmap(y)
categories, vals = collect(keys(cm)), collect(values(cm))
Plots.bar(categories, vals, title="Bar Chart Example", legend=false)
ylabel!("Number of occurrences")
Seems pretty balanced.
Let's now try fitting a simple logistic classifier (aka logistic regression) not using :Year
and :Today
:
LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels
X2 = select(X, Not([:Year, :Today]))
classif = machine(LogisticClassifier(), X2, y)
import MLJLinearModels ✔
untrained Machine; caches model-specific representations of data
model: LogisticClassifier(lambda = 2.220446049250313e-16, …)
args:
1: Source @283 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
2: Source @894 ⏎ AbstractVector{ScientificTypesBase.OrderedFactor{2}}
Let's fit it to the data and try to reproduce the output:
fit!(classif)
ŷ = MLJ.predict(classif, X2)
ŷ[1:3]
3-element CategoricalDistributions.UnivariateFiniteVector{ScientificTypesBase.OrderedFactor{2}, String, UInt8, Float64}:
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(Down=>0.493, Up=>0.507)
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(Down=>0.519, Up=>0.481)
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(Down=>0.519, Up=>0.481)
Note that here the ŷ
are scores. We can recover the average cross-entropy loss:
cross_entropy(ŷ, y) |> mean |> r3
0.691
in order to recover the class, we could use the mode and compare the misclassification rate:
ŷ = predict_mode(classif, X2)
misclassification_rate(ŷ, y) |> r3
0.478
Well that's not fantastic...
Let's visualise how we're doing building a confusion matrix, first is predicted, second is truth:
cm = confusion_matrix(ŷ, y)
┌─────────────┐
│Ground Truth │
┌─────────┼──────┬──────┤
│Predicted│ Down │ Up │
├─────────┼──────┼──────┤
│ Down │ 145 │ 141 │
├─────────┼──────┼──────┤
│ Up │ 457 │ 507 │
└─────────┴──────┴──────┘
We can then compute the accuracy or precision, etc. easily for instance:
@show false_positive(cm)
@show accuracy(ŷ, y) |> r3
@show accuracy(cm) |> r3 # same thing
@show positive_predictive_value(ŷ, y) |> r3 # a.k.a. precision
@show recall(ŷ, y) |> r3
@show f1score(ŷ, y) |> r3
false_positive(cm) = 457
accuracy(ŷ, y) |> r3 = 0.522
accuracy(cm) |> r3 = 0.522
positive_predictive_value(ŷ, y) |> r3 = 0.526
recall(ŷ, y) |> r3 = 0.782
f1score(ŷ, y) |> r3 = 0.629
Let's now train on the data before 2005 and use it to predict on the rest. Let's find the row indices for which the condition holds
train = 1:findlast(X.Year .< 2005)
test = last(train)+1:length(y);
We can now just re-fit the machine that we've already defined just on those rows and predict on the test:
fit!(classif, rows=train)
ŷ = predict_mode(classif, rows=test)
accuracy(ŷ, y[test]) |> r3
0.48
Well, that's not very good... Let's retrain a machine using only :Lag1
and :Lag2
:
X3 = select(X2, [:Lag1, :Lag2])
classif = machine(LogisticClassifier(), X3, y)
fit!(classif, rows=train)
ŷ = predict_mode(classif, rows=test)
accuracy(ŷ, y[test]) |> r3
0.56
Interesting... it has higher accuracy than the model with more features! This could be investigated further by increasing the regularisation parameter but we'll leave that aside for now.
We can use a trained machine to predict on new data:
Xnew = (Lag1 = [1.2, 1.5], Lag2 = [1.1, -0.8])
ŷ = MLJ.predict(classif, Xnew)
ŷ |> pprint
CategoricalDistributions.UnivariateFinite{ScientificTypesBase.OrderedFactor{2}, String, UInt8, Float64}[UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(Down=>0.521, Up=>0.479), UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(Down=>0.504, Up=>0.496)]
Note: when specifying data, we used a simple NamedTuple
; we could also have defined a dataframe or any other compatible tabular container. Note also that we retrieved the raw predictions here i.e.: a score for each class; we could have used predict_mode
or indeed
mode.(ŷ)
2-element CategoricalArrays.CategoricalArray{String,1,UInt8}:
"Down"
"Down"
Let's do a similar thing but with a LDA model this time:
BayesianLDA = @load BayesianLDA pkg=MultivariateStats
classif = machine(BayesianLDA(), X3, y)
fit!(classif, rows=train)
ŷ = predict_mode(classif, rows=test)
accuracy(ŷ, y[test]) |> r3
import MLJMultivariateStatsInterface ✔
0.56
Note: BayesianLDA
is LDA using a multivariate normal model for each class with a default prior inferred from the proportions for each class in the training data. You can also use the bare LDA
model which does not make these assumptions and allows using a different metric in the transformed space, see the docs for details.
LDA = @load LDA pkg=MultivariateStats
using Distances
classif = machine(LDA(dist=CosineDist()), X3, y)
fit!(classif, rows=train)
ŷ = predict_mode(classif, rows=test)
accuracy(ŷ, y[test]) |> r3
import MLJMultivariateStatsInterface ✔
0.548
Bayesian QDA is available via ScikitLearn:
BayesianQDA = @load BayesianQDA pkg=MLJScikitLearnInterface
import MLJScikitLearnInterface ✔
MLJScikitLearnInterface.BayesianQDA
Using it is done in much the same way as before:
classif = machine(BayesianQDA(), X3, y)
fit!(classif, rows=train)
ŷ = predict_mode(classif, rows=test)
accuracy(ŷ, y[test]) |> r3
0.599
We can use K-Nearest Neighbors models via the NearestNeighbors
package:
KNNClassifier = @load KNNClassifier
knnc = KNNClassifier(K=1)
classif = machine(knnc, X3, y)
fit!(classif, rows=train)
ŷ = predict_mode(classif, rows=test)
accuracy(ŷ, y[test]) |> r3
import NearestNeighborModels ✔
0.5
Pretty bad... let's try with three neighbors
knnc.K = 3
fit!(classif, rows=train)
ŷ = predict_mode(classif, rows=test)
accuracy(ŷ, y[test]) |> r3
0.532
A bit better but not hugely so.
The caravan dataset is part of ISLR as well:
caravan = dataset("ISLR", "Caravan")
size(caravan)
(5822, 86)
The target variable is Purchase
, effectively a categorical
purchase = caravan.Purchase
vals = unique(purchase)
2-element Vector{String}:
"No"
"Yes"
Let's see how many of each we have
nl1 = sum(purchase .== vals[1])
nl2 = sum(purchase .== vals[2])
println("#$(vals[1]) ", nl1)
println("#$(vals[2]) ", nl2)
#No 5474
#Yes 348
we can also visualise this as was done before:
cm = countmap(purchase)
categories, vals = collect(keys(cm)), collect(values(cm))
bar(categories, vals, title="Bar Chart Example", legend=false)
ylabel!("Number of occurrences")
that's quite unbalanced.
Apart from the target, all other variables are numbers; we can standardize the data:
y, X = unpack(caravan, ==(:Purchase))
mstd = machine(Standardizer(), X)
fit!(mstd)
Xs = MLJ.transform(mstd, X)
var(Xs[:,1]) |> r3
1.0
Note: in MLJ, it is recommended to work with pipelines / networks when possible and not do "step-by-step" transformation and fitting of the data as this is more error prone. We do it here to stick to the ISL tutorial.
We split the data in the first 1000 rows for testing and the rest for training:
test = 1:1000
train = last(test)+1:nrows(Xs);
Let's now fit a KNN model and check the misclassification rate
classif = machine(KNNClassifier(K=3), Xs, y)
fit!(classif, rows=train)
ŷ = predict_mode(classif, rows=test)
accuracy(ŷ, y[test]) |> r3
0.925
that looks good but recall the problem is very unbalanced
mean(y[test] .!= "No") |> r3
0.059
Let's fit a logistic classifier to this problem
classif = machine(LogisticClassifier(), Xs, y)
fit!(classif, rows=train)
ŷ = predict_mode(classif, rows=test)
accuracy(ŷ, y[test]) |> r3
0.934
Since we have a probabilistic classifier, we can also check metrics that take scores into account such as the area under the ROC curve (AUC):
ŷ = MLJ.predict(classif, rows=test)
auc(ŷ, y[test])
0.7388551666996884
We can also display the curve itself
fprs, tprs, thresholds = roc_curve(ŷ, y[test])
plot(fprs, tprs, linewidth=2, size=(800,600))
xlabel!("False Positive Rate")
ylabel!("True Positive Rate")