Lab 10 - PCA and Clustering

Getting started

using MLJ
import RDatasets: dataset
import DataFrames: DataFrame, select, Not, describe
using Random

data = dataset("datasets", "USArrests")
names(data)
5-element Vector{String}:
 "State"
 "Murder"
 "Assault"
 "UrbanPop"
 "Rape"

Let's have a look at the mean and standard deviation of each feature:

describe(data, :mean, :std)
5×3 DataFrame
 Row │ variable  mean    std
     │ Symbol    Union…  Union…
─────┼───────────────────────────
   1 │ State
   2 │ Murder    7.788   4.35551
   3 │ Assault   170.76  83.3377
   4 │ UrbanPop  65.54   14.4748
   5 │ Rape      21.232  9.36638

Let's extract the numerical features and coerce the count columns (UrbanPop and Assault) to Continuous:

X = select(data, Not(:State))
X = coerce(X, :UrbanPop=>Continuous, :Assault=>Continuous);
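
To double-check the coercion, we can inspect the scientific types of the columns; all four should now report as Continuous:

schema(X).scitypes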

PCA pipeline

PCA usually works best on standardized data, but we will skip standardization here:

PCA = @load PCA pkg=MultivariateStats
import MLJMultivariateStatsInterface ✔

pca_mdl = PCA(pratio=1)
pca = machine(pca_mdl, X)
fit!(pca)
W = MLJ.transform(pca, X);

W is the PCA-transformed data; with pratio=1 no variance is discarded, so all 4 components are retained:

schema(W).names
(:x1, :x2, :x3, :x4)

Let's inspect the fit:

r = report(pca)
cumsum(r.principalvars ./ r.tvar)
4-element Vector{Float64}:
 0.9655342205668823
 0.9933515571990573
 0.9991510921213993
 1.0

The cumulative sums give the proportion of variance explained by the first 1, 2, 3 and 4 components: the first component alone explains about 96.5% of the variance, and with two components we recover over 99%.
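
We can also look at how the original features combine into the principal components. The fitted parameters of the machine expose the projection matrix (this assumes the MultivariateStats-backed PCA reports a field called projection, which it does in recent versions of the interface):

# columns are the principal directions in the original feature space
fitted_params(pca).projection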

More interesting data...

Instead of just playing with toy data, let's load the orange juice data and extract only the columns corresponding to price data:

data = dataset("ISLR", "OJ")

feature_names = [
    :PriceCH, :PriceMM, :DiscCH, :DiscMM, :SalePriceMM, :SalePriceCH,
    :PriceDiff, :PctDiscMM, :PctDiscCH
]

X = select(data, feature_names);
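
These features are on rather different scales, which is one reason the pipeline below standardizes them before applying PCA; a quick check:

# compare the scale of the selected features
describe(X, :mean, :std)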

PCA pipeline

Random.seed!(1515)

SPCA = Pipeline(
    Standardizer(),
    PCA(pratio=1-1e-4)
)

spca = machine(SPCA, X)
fit!(spca)
W = MLJ.transform(spca, X)
names(W)
6-element Vector{String}:
 "x1"
 "x2"
 "x3"
 "x4"
 "x5"
 "x6"

How much of the variance can we explain?

# grab the sub-report corresponding to the PCA step of the pipeline
rpca = collect(values(report(spca).report_given_machine))[2]
cs = cumsum(rpca.principalvars ./ rpca.tvar)
6-element Vector{Float64}:
 0.417469674848473
 0.7233074812209765
 0.9436456234171868
 0.9997505816044248
 0.9998956501446636
 0.9999999999999998
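
If we wanted to choose the number of components automatically, say the smallest number explaining at least 90% of the variance, the cumulative ratios above make this a one-liner:

# first index at which the cumulative explained variance reaches 90%
findfirst(>=(0.9), cs)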

Let's visualise this:

using PyPlot

figure(figsize=(8,6))

PyPlot.bar(1:length(cs), cs)
plot(1:length(cs), cs, color="red", marker="o")

xlabel("Number of PCA features", fontsize=14)
ylabel("Ratio of explained variance", fontsize=14)
Figure: PCA explained variance

So 4 PCA features are enough to recover most of the variance.
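
If we wanted the transform to return only those four components, the PCA model accepts a maxoutdim hyper-parameter capping the output dimension (a sketch reusing the same pipeline structure; maxoutdim is the name used by the MultivariateStats-backed model):

SPCA4 = Pipeline(
    Standardizer(),
    PCA(maxoutdim=4)   # keep at most 4 components
)
spca4 = machine(SPCA4, X)
fit!(spca4)
W4 = MLJ.transform(spca4, X);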

Clustering

Random.seed!(1515)

KMeans = @load KMeans pkg=Clustering
import MLJClusteringInterface ✔

SPCA2 = Pipeline(
    Standardizer(),
    PCA(),
    KMeans(k=3)
)

spca2 = machine(SPCA2, X)
fit!(spca2)

# extract the cluster assignments from the KMeans step of the pipeline report
assignments = collect(values(report(spca2).report_given_machine))[3].assignments
mask1 = assignments .== 1
mask2 = assignments .== 2
mask3 = assignments .== 3;
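
Before plotting, it's worth checking how the observations split across the three clusters; the masks defined above make this easy:

# number of observations assigned to each cluster
(count(mask1), count(mask2), count(mask3))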

Now we can try visualising this:

using PyPlot

figure(figsize=(8, 6))
# plot the observations in the plane of the first two principal components,
# coloured by cluster assignment
for (m, c) in zip((mask1, mask2, mask3), ("red", "green", "blue"))
    plot(W[m, 1], W[m, 2], ls="none", marker=".", markersize=10, color=c)
end

xlabel("PCA-1", fontsize=13)
ylabel("PCA-2", fontsize=13)
legend(["Group 1", "Group 2", "Group 3"], fontsize=13)