Lab 10 - PCA and Clustering
To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions. If you have questions or suggestions about this tutorial, please open an issue here.
using MLJ
import RDatasets: dataset
import DataFrames: DataFrame, select, Not, describe
using Random
data = dataset("datasets", "USArrests")
names(data)
5-element Vector{String}:
"State"
"Murder"
"Assault"
"UrbanPop"
"Rape"
Let's have a look at the mean and standard deviation of each feature:
describe(data, :mean, :std)
5×3 DataFrame
Row │ variable mean std
│ Symbol Union… Union…
─────┼───────────────────────────
1 │ State
2 │ Murder 7.788 4.35551
3 │ Assault 170.76 83.3377
4 │ UrbanPop 65.54 14.4748
5 │ Rape 21.232 9.36638
Let's extract the numerical features and coerce the integer columns (UrbanPop and Assault) to Continuous:
X = select(data, Not(:State))
X = coerce(X, :UrbanPop=>Continuous, :Assault=>Continuous);
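We can sanity-check the coercion; schema (re-exported by MLJ) shows each column's scientific type:
schema(X)  # all four columns should now have scitype Continuous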
PCA is usually best done after standardization: as the summary above shows, Assault's standard deviation (≈83) dwarfs Murder's (≈4.4), so an unstandardized first component will be dominated by Assault. We won't standardize here, but a sketch of what it would look like follows.
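For reference, a minimal sketch of standardization using MLJ's built-in Standardizer (we won't use Xstd in what follows):
stand = machine(Standardizer(), X)
fit!(stand)
Xstd = MLJ.transform(stand, X);  # each column now has mean ≈ 0 and std ≈ 1
With that noted, let's fit PCA on the raw data: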
PCA = @load PCA pkg=MultivariateStats
import MLJMultivariateStatsInterface ✔
pca_mdl = PCA(variance_ratio=1)
pca = machine(pca_mdl, X)
fit!(pca)
W = MLJ.transform(pca, X);
W is the PCA-transformed data; because we set variance_ratio=1, all four principal components have been retained:
schema(W).names
(:x1, :x2, :x3, :x4)
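We can also inspect how each component combines the original features; according to the MLJMultivariateStatsInterface docstring, fitted_params exposes the learned projection matrix:
fitted_params(pca).projection  # 4×4 matrix whose columns are the principal directions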
Let's inspect the fit:
r = report(pca)
cumsum(r.principalvars ./ r.tvar)
4-element Vector{Float64}:
0.9655342205668822
0.9933515571990573
0.9991510921213993
1.0
The cumulative sums show the proportion of variance explained by the first k principal components: the first component alone accounts for about 96.6% of the variance, and with two components we recover over 99%.
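Incidentally, this is where variance_ratio matters: with its default value (0.99 at the time of writing) the model would keep only the first two components, since together they already explain about 99.3% of the variance. A sketch:
pca_def = machine(PCA(), X)  # default variance_ratio truncates at 99% explained variance
fit!(pca_def)
schema(MLJ.transform(pca_def, X)).names  # expect (:x1, :x2)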
Instead of just playing with toy data, let's load the orange juice data and extract only the columns corresponding to price data:
data = dataset("ISLR", "OJ")
feature_names = [
:PriceCH, :PriceMM, :DiscCH, :DiscMM, :SalePriceMM, :SalePriceCH,
:PriceDiff, :PctDiscMM, :PctDiscCH
]
X = select(data, feature_names);
Random.seed!(1515)
SPCA = Pipeline(
Standardizer(),
PCA(variance_ratio=1-1e-4)
)
spca = machine(SPCA, X)
fit!(spca)
W = MLJ.transform(spca, X)
names(W)
6-element Vector{String}:
"x1"
"x2"
"x3"
"x4"
"x5"
"x6"
What kind of variance can we explain?
rpca = report(spca).pca
cs = cumsum(rpca.principalvars ./ rpca.tvar)
6-element Vector{Float64}:
0.41746967484847314
0.7233074812209768
0.943645623417187
0.9997505816044248
0.9998956501446636
0.9999999999999998
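From these numbers we can read off, for instance, how many components are needed to explain at least 90% of the variance:
findfirst(>=(0.9), cs)  # 3: the first three components explain about 94.4%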
Let's visualise this:
using Plots
Plots.bar(1:length(cs), cs, legend=false, size=(800, 600), ylim=(0, 1.1))
xlabel!("Number of PCA features")
ylabel!("Ratio of explained variance")
plot!(1:length(cs), cs, color="red", marker=:circle, linewidth=3)
So four PCA features are enough to recover essentially all of the variance (≈99.98%).
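If we wanted to fix the number of components directly, rather than via a variance threshold, the PCA model also exposes a maxoutdim hyperparameter; a sketch (the names SPCA4 and W4 are just for illustration):
SPCA4 = Pipeline(
    Standardizer(),
    PCA(maxoutdim=4, variance_ratio=1.0)  # keep exactly the first four components
)
W4 = MLJ.transform(fit!(machine(SPCA4, X)), X);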
Random.seed!(1515)
KMeans = @load KMeans pkg=Clustering
import MLJClusteringInterface ✔
SPCA2 = Pipeline(
Standardizer(),
PCA(),
KMeans(k=3)
)
spca2 = machine(SPCA2, X)
fit!(spca2)
assignments = report(spca2).k_means.assignments
mask1 = assignments .== 1
mask2 = assignments .== 2
mask3 = assignments .== 3;
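Before plotting, a quick check that all three clusters are non-trivial:
(sum(mask1), sum(mask2), sum(mask3))  # number of observations assigned to each cluster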
Now we can try visualising the clusters in the plane of the first two principal components:
p = plot(size=(800,600))
legend_items = ["Group 1", "Group 2", "Group 3"]
for (i, (m, c)) in enumerate(zip((mask1, mask2, mask3), ("red", "green", "blue")))
    # plot each cluster in its own colour
    scatter!(p, W[m, 1], W[m, 2], color=c, label=legend_items[i])
end
plot(p)
xlabel!("PCA-1")
ylabel!("PCA-2")