Lab 10 - PCA and Clustering

Getting started

using MLJ
import RDatasets: dataset
import DataFrames: DataFrame, select, Not, describe
using Random

data = dataset("datasets", "USArrests")
names(data)
5-element Vector{String}:
 "State"
 "Murder"
 "Assault"
 "UrbanPop"
 "Rape"

Let's have a look at the mean and standard deviation of each feature:

describe(data, :mean, :std)
5×3 DataFrame
 Row │ variable  mean    std
     │ Symbol    Union…  Union…
─────┼───────────────────────────
   1 │ State
   2 │ Murder    7.788   4.35551
   3 │ Assault   170.76  83.3377
   4 │ UrbanPop  65.54   14.4748
   5 │ Rape      21.232  9.36638

Let's extract the numerical features and coerce the count columns (UrbanPop and Assault) to Continuous:

X = select(data, Not(:State))
X = coerce(X, :UrbanPop=>Continuous, :Assault=>Continuous);
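
To double-check the coercion, we can inspect the scientific types of the columns; all four should now report as Continuous:

schema(X).scitypes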

PCA pipeline

PCA usually works best on standardized data, but we will skip standardization here:

PCA = @load PCA pkg=MultivariateStats
import MLJMultivariateStatsInterface ✔

pca_mdl = PCA(pratio=1)
pca = machine(pca_mdl, X)
fit!(pca)
W = MLJ.transform(pca, X);

W is the PCA-transformed data; with pratio=1 no variance is discarded, so all 4 components are retained:

schema(W).names
(:x1, :x2, :x3, :x4)

Let's inspect the fit:

r = report(pca)
cumsum(r.principalvars ./ r.tvar)
4-element Vector{Float64}:
 0.9655342205668823
 0.9933515571990573
 0.9991510921213993
 1.0

The cumulative sums give the proportion of variance explained by the first 1, 2, 3 and 4 components: the first component alone explains about 96.5% of the variance, and with two components we recover over 99%.
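
We can also look at how the original features combine into the principal components. The fitted parameters of the machine expose the projection matrix (this assumes the MultivariateStats-backed PCA reports a field called projection, which it does in recent versions of the interface):

# columns are the principal directions in the original feature space
fitted_params(pca).projection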

More interesting data...

Instead of just playing with toy data, let's load the orange juice data and extract only the columns corresponding to price data:

data = dataset("ISLR", "OJ")

feature_names = [
    :PriceCH, :PriceMM, :DiscCH, :DiscMM, :SalePriceMM, :SalePriceCH,
    :PriceDiff, :PctDiscMM, :PctDiscCH
]

X = select(data, feature_names);
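
These features are on rather different scales, which is one reason the pipeline below standardizes them before applying PCA; a quick check:

# compare the scale of the selected features
describe(X, :mean, :std)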

PCA pipeline

Random.seed!(1515)

SPCA = Pipeline(
    Standardizer(),
    PCA(pratio=1-1e-4)
)

spca = machine(SPCA, X)
fit!(spca)
W = MLJ.transform(spca, X)
names(W)
6-element Vector{String}:
 "x1"
 "x2"
 "x3"
 "x4"
 "x5"
 "x6"

How much of the variance can we explain?

# grab the sub-report corresponding to the PCA step of the pipeline
rpca = collect(values(report(spca).report_given_machine))[2]
cs = cumsum(rpca.principalvars ./ rpca.tvar)
6-element Vector{Float64}:
 0.417469674848473
 0.7233074812209765
 0.9436456234171868
 0.9997505816044248
 0.9998956501446636
 0.9999999999999998
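
If we wanted to choose the number of components automatically, say the smallest number explaining at least 90% of the variance, the cumulative ratios above make this a one-liner:

# first index at which the cumulative explained variance reaches 90%
findfirst(>=(0.9), cs)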

Let's visualise this:

using PyPlot

figure(figsize=(8,6))

PyPlot.bar(1:length(cs), cs)
plot(1:length(cs), cs, color="red", marker="o")

xlabel("Number of PCA features", fontsize=14)
ylabel("Ratio of explained variance", fontsize=14)
Figure: PCA explained variance

So 4 PCA features are enough to recover most of the variance.
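
If we wanted the transform to return only those four components, the PCA model accepts a maxoutdim hyper-parameter capping the output dimension (a sketch reusing the same pipeline structure; maxoutdim is the name used by the MultivariateStats-backed model):

SPCA4 = Pipeline(
    Standardizer(),
    PCA(maxoutdim=4)   # keep at most 4 components
)
spca4 = machine(SPCA4, X)
fit!(spca4)
W4 = MLJ.transform(spca4, X);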

Clustering

Random.seed!(1515)

KMeans = @load KMeans pkg=Clustering
import MLJClusteringInterface ✔

SPCA2 = Pipeline(
    Standardizer(),
    PCA(),
    KMeans(k=3)
)

spca2 = machine(SPCA2, X)
fit!(spca2)

# extract the cluster assignments from the KMeans step of the pipeline report
assignments = collect(values(report(spca2).report_given_machine))[3].assignments
mask1 = assignments .== 1
mask2 = assignments .== 2
mask3 = assignments .== 3;
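
Before plotting, it's worth checking how the observations split across the three clusters; the masks defined above make this easy:

# number of observations assigned to each cluster
(count(mask1), count(mask2), count(mask3))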

Now we can try visualising this:

using PyPlot

figure(figsize=(8, 6))
# plot the observations in the plane of the first two principal components,
# coloured by cluster assignment
for (m, c) in zip((mask1, mask2, mask3), ("red", "green", "blue"))
    plot(W[m, 1], W[m, 2], ls="none", marker=".", markersize=10, color=c)
end

xlabel("PCA-1", fontsize=13)
ylabel("PCA-2", fontsize=13)
legend(["Group 1", "Group 2", "Group 3"], fontsize=13)