Lab 5 - Cross validation and the bootstrap
To ensure the code in this tutorial runs as shown, download the tutorial project folder and follow these instructions. If you have questions or suggestions about this tutorial, please open an issue here.
using MLJ
import RDatasets: dataset
import DataFrames: DataFrame, select
auto = dataset("ISLR", "Auto")
y, X = unpack(auto, ==(:MPG))
train, test = partition(eachindex(y), 0.5, shuffle=true, rng=444);
Note the use of rng= to seed the shuffling of indices so that the results are reproducible.
This tutorial introduces polynomial regression in a very hands-on way. A more programmatic alternative is to use MLJ's InteractionTransformer. Run doc("InteractionTransformer") for details.
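For example, the transformer can be loaded like any registered model. A minimal sketch only; the hyperparameter names below are taken from the docstring and should be checked with doc("InteractionTransformer"):

IT = @load InteractionTransformer pkg=MLJModels
it = IT(order=2)  # assumption: `order` bounds the order of the generated interaction terms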
LR = @load LinearRegressor pkg=MLJLinearModels
import MLJLinearModels ✔
MLJLinearModels.LinearRegressor
In this part we only build models with the Horsepower feature.
using Plots
plot(X.Horsepower, y, seriestype=:scatter, legend=false, size=(800,600))
xlabel!("Horsepower")
ylabel!("MPG")
Let's get a baseline:
lm = LR()
mlm = machine(lm, select(X, :Horsepower), y)
fit!(mlm, rows=train)
rms(MLJ.predict(mlm, rows=test), y[test])^2
23.49399089500798
Note that we square the measure to match the results obtained in the ISL labs, where the mean squared error is reported (here we use the rms, which is the square root of the mse).
xx = (Horsepower=range(50, 225, length=100) |> collect, )
yy = MLJ.predict(mlm, xx)
plot(X.Horsepower, y, seriestype=:scatter, legend=false, size=(800,600))
plot!(xx.Horsepower, yy, legend=false, linewidth=3, color=:orange)
xlabel!("Horsepower")
ylabel!("MPG")
We now want to build three polynomial models of degree 1, 2 and 3 respectively; we start by forming the corresponding feature matrix:
hp = X.Horsepower
Xhp = DataFrame(hp1=hp, hp2=hp.^2, hp3=hp.^3);
Now we can write a simple pipeline where the first step selects the features we want (and with it the degree of the polynomial) and the second is the linear regressor:
LinMod = Pipeline(
    FeatureSelector(features=[:hp1]),
    LR()
);
Then we can instantiate and fit 3 models where we specify the features each time:
lr1 = machine(LinMod, Xhp, y) # poly of degree 1 (line)
fit!(lr1, rows=train)
LinMod.feature_selector.features = [:hp1, :hp2] # poly of degree 2
lr2 = machine(LinMod, Xhp, y)
fit!(lr2, rows=train)
LinMod.feature_selector.features = [:hp1, :hp2, :hp3] # poly of degree 3
lr3 = machine(LinMod, Xhp, y)
fit!(lr3, rows=train)
trained Machine; does not cache data
  model: DeterministicPipeline(feature_selector = FeatureSelector(features = [:hp1, :hp2, :hp3], …), …)
  args:
    1: Source @596 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
    2: Source @593 ⏎ AbstractVector{ScientificTypesBase.Continuous}
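A small caution: lr1, lr2 and lr3 all wrap the same mutable LinMod, so mutating LinMod.feature_selector.features also changes what an already-bound machine would use on its next fit! call (refitting lr1 at this point would train a degree-3 model). A minimal sketch, if you want fully independent machines, is to bind a deep copy of the pipeline instead:

lr2_indep = machine(deepcopy(LinMod), Xhp, y)  # this machine owns its own copy of the model
fit!(lr2_indep, rows=train)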
Let's check the performance on the test set:
get_mse(lr) = rms(MLJ.predict(lr, rows=test), y[test])^2
@show get_mse(lr1)
@show get_mse(lr2)
@show get_mse(lr3)
get_mse(lr1) = 23.49399089500798
get_mse(lr2) = 19.287175510952153
get_mse(lr3) = 19.381831638657932
Let's visualise the models:
hpn = xx.Horsepower
Xnew = DataFrame(hp1=hpn, hp2=hpn.^2, hp3=hpn.^3)
yy1 = MLJ.predict(lr1, Xnew)
yy2 = MLJ.predict(lr2, Xnew)
yy3 = MLJ.predict(lr3, Xnew)
plot(X.Horsepower, y, seriestype=:scatter, label=false, size=(800,600))
plot!(xx.Horsepower, yy1, label="Order 1", linewidth=3, color=:orange,)
plot!(xx.Horsepower, yy2, label="Order 2", linewidth=3, color=:green,)
plot!(xx.Horsepower, yy3, label="Order 3", linewidth=3, color=:red,)
xlabel!("Horsepower")
ylabel!("MPG")
Let's cross-validate over the degree of the polynomial.
Note: there's a bit of gymnastics here because MLJ doesn't directly support polynomial regression; see our tutorial on tuning models for a gentler introduction to model tuning. The gist of the following code is to create a dataframe where each column is a power of the Horsepower feature, from 1 to 10, and to build a series of regression models using incrementally more of those features (i.e. of higher degree):
Xhp = DataFrame([hp.^i for i in 1:10], :auto)           # columns x1, ..., x10 hold hp^1, ..., hp^10
cases = [[Symbol("x$j") for j in 1:i] for i in 1:10]    # feature sets of increasing degree
r = range(LinMod, :(feature_selector.features), values=cases)  # nominal range over those sets
tm = TunedModel(model=LinMod, ranges=r, resampling=CV(nfolds=10), measure=rms)
DeterministicTunedModel(
    model = DeterministicPipeline(
        feature_selector = FeatureSelector(features = [:hp1, :hp2, :hp3], …),
        linear_regressor = LinearRegressor(fit_intercept = true, …),
        cache = true),
    tuning = RandomSearch(
        bounded = Distributions.Uniform,
        positive_unbounded = Distributions.Gamma,
        other = Distributions.Normal,
        rng = Random._GLOBAL_RNG()),
    resampling = CV(
        nfolds = 10,
        shuffle = false,
        rng = Random._GLOBAL_RNG()),
    measure = RootMeanSquaredError(),
    weights = nothing,
    class_weights = nothing,
    operation = nothing,
    range = NominalRange(feature_selector.features = [:x1], [:x1, :x2], [:x1, :x2, :x3], ...),
    selection_heuristic = MLJTuning.NaiveSelection(nothing),
    train_best = true,
    repeats = 1,
    n = nothing,
    acceleration = ComputationalResources.CPU1{Nothing}(nothing),
    acceleration_resampling = ComputationalResources.CPU1{Nothing}(nothing),
    check_measure = true,
    cache = true,
    compact_history = true,
    logger = nothing)
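Note from the printout that, because we only passed ranges= and no tuning= strategy, MLJ fell back to its default RandomSearch. As a sketch of an alternative, an exhaustive Grid search would visit each of the 10 feature sets exactly once, and in order:

tm_grid = TunedModel(model=LinMod, tuning=Grid(), ranges=r,
                     resampling=CV(nfolds=10), measure=rms)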
Now we're left with fitting the tuned model:
mtm = machine(tm, Xhp, y)
fit!(mtm)
rep = report(mtm)
res = rep.plotting
@show round.(res.measurements.^2, digits=2)
@show argmin(res.measurements)
round.(res.measurements .^ 2, digits = 2) = [20.87, 20.92, 223.53, 88.3, 20.87, 21.3, 21.3, 21.2, 21.32, 21.3]
argmin(res.measurements) = 1
So the conclusion here is that the 5th order polynomial does quite well. (Because random search visits the candidates in a random order, the entries of res.measurements above are not sorted by polynomial degree, so argmin alone does not name the winning degree.)
In ISL they use a different seed so the results are a bit different but comparable.
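To read off which feature set actually won, you can inspect the best model stored in the machine's fitted parameters. A minimal sketch:

best = fitted_params(mtm).best_model   # the best pipeline found by the search
best.feature_selector.features         # per the conclusion above: [:x1, ..., :x5], i.e. degree 5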
Xnew = DataFrame([hpn.^i for i in 1:10], :auto)
yy5 = MLJ.predict(mtm, Xnew)
plot(X.Horsepower, y, seriestype=:scatter, legend=false, size=(800,600))
plot!(xx.Horsepower, yy5, color=:orange, linewidth=4, legend=false)
xlabel!("Horsepower")
ylabel!("MPG")
Bootstrapping is not currently supported in MLJ.
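That said, a simple bootstrap is easy to code by hand: resample the rows with replacement, refit, and look at the spread of the fitted coefficients. Below is a minimal sketch, not part of MLJ (the helper bootstrap_slope is ours, and it assumes fitted_params for the linear regressor returns a coefs vector of name => value pairs):

using Random, Statistics

function bootstrap_slope(X, y; B=1_000, rng=Random.MersenneTwister(444))
    n = length(y)
    slopes = Vector{Float64}(undef, B)
    for b in 1:B
        idx = rand(rng, 1:n, n)                         # sample rows with replacement
        mach = machine(LR(), selectrows(X, idx), y[idx])
        fit!(mach, verbosity=0)
        slopes[b] = last(fitted_params(mach).coefs[1])  # coefficient on Horsepower
    end
    return std(slopes)                                  # bootstrap standard error of the slope
end

bootstrap_slope(select(X, :Horsepower), y)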