Lab 3 - Linear Regression

To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions.

If you have questions or suggestions about this tutorial, please open an issue here.

Simple linear regression

MLJ essentially serves as a unified interface to many existing Julia packages, each of which provides its own models and functionality, with its own conventions.

Simple linear regression demonstrates this. Several packages offer it (beyond just using the backslash operator): here we will use MLJLinearModels, but we could also have used GLM, ScikitLearn, etc.
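As an aside, "using the backslash operator" refers to solving the least-squares problem directly with Julia's \ on a design matrix. A minimal sketch on toy data (the values below are made up purely for illustration; this is not part of the MLJ workflow used in the rest of the lab):

x = [1.0, 2.0, 3.0, 4.0, 5.0]        # toy feature values (illustration only)
y = [2.1, 3.9, 6.2, 8.1, 9.8]        # toy targets (illustration only)
A = [ones(length(x)) x]              # design matrix: intercept column plus the feature
intercept, slope = A \ y             # ordinary least-squares solution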

To load the model from a given package, use @load ModelName pkg=PackageName.

using MLJ

LinearRegressor = @load LinearRegressor pkg=MLJLinearModels
import MLJLinearModels ✔
MLJLinearModels.LinearRegressor

Note: in order to be able to load this, you must have the relevant package in your environment; if you don't, you can always add it (using Pkg; Pkg.add("MLJLinearModels")).
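For instance, from the Julia REPL (assuming you may also need the other packages used further down, RDatasets, DataFrames and PyPlot):

using Pkg
Pkg.add("MLJLinearModels")                       # package providing the LinearRegressor used here
Pkg.add(["RDatasets", "DataFrames", "PyPlot"])   # other packages used later in this tutorial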

Let's load the Boston data set

import RDatasets: dataset
import DataFrames: describe, select, Not, rename!
boston = dataset("MASS", "Boston")
first(boston, 3)
3×14 DataFrame
 Row │ Crim     Zn       Indus    Chas   NOx      Rm       Age      Dis      Rad    Tax    PTRatio  Black    LStat    MedV
     │ Float64  Float64  Float64  Int64  Float64  Float64  Float64  Float64  Int64  Int64  Float64  Float64  Float64  Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 0.00632     18.0     2.31      0    0.538    6.575     65.2   4.09        1    296     15.3   396.9      4.98     24.0
   2 │ 0.02731      0.0     7.07      0    0.469    6.421     78.9   4.9671      2    242     17.8   396.9      9.14     21.6
   3 │ 0.02729      0.0     7.07      0    0.469    7.185     61.1   4.9671      2    242     17.8   392.83     4.03     34.7

Let's get a feel for the data

describe(boston, :mean, :std, :eltype)
14×4 DataFrame
 Row │ variable  mean        std         eltype
     │ Symbol    Float64     Float64     DataType
─────┼────────────────────────────────────────────
   1 │ Crim        3.61352     8.60155   Float64
   2 │ Zn         11.3636     23.3225    Float64
   3 │ Indus      11.1368      6.86035   Float64
   4 │ Chas        0.06917     0.253994  Int64
   5 │ NOx         0.554695    0.115878  Float64
   6 │ Rm          6.28463     0.702617  Float64
   7 │ Age        68.5749     28.1489    Float64
   8 │ Dis         3.79504     2.10571   Float64
   9 │ Rad         9.54941     8.70726   Int64
  10 │ Tax       408.237     168.537     Int64
  11 │ PTRatio    18.4555      2.16495   Float64
  12 │ Black     356.674      91.2949    Float64
  13 │ LStat      12.6531      7.14106   Float64
  14 │ MedV       22.5328      9.1971    Float64

So there are no missing values and most variables are encoded as floating point numbers. In MLJ it's important to specify the interpretation of the features (should a feature be considered Continuous, a Count, ...?); see also this tutorial section on scientific types.
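If you want to check how MLJ currently interprets each column before any coercion, you can inspect the schema (a quick optional check, assuming MLJ is loaded as above):

schema(boston)    # lists the machine type and the scientific type inferred for each column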

Here we will simply interpret the integer features as continuous, since we will only use a basic linear regression:

data = coerce(boston, autotype(boston, :discrete_to_continuous));

Let's also extract the target variable (MedV):

y = data.MedV
X = select(data, Not(:MedV));

Let's declare a simple multivariate linear regression model:

mdl = LinearRegressor()
LinearRegressor(
    fit_intercept = true,
    solver = nothing)
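The printout lists the model's hyperparameters; these can be set as keyword arguments when constructing the model. A small illustration (this no-intercept variant is hypothetical and not used in the rest of the lab):

mdl_no_intercept = LinearRegressor(fit_intercept=false)   # construct with a keyword argument
mdl_no_intercept.fit_intercept                             # hyperparameters are accessible as fields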

First let's do a very simple univariate regression. In order to fit it on the data, we need to wrap it in a machine which, in MLJ, is the composition of a model and the data to apply the model on:

X_uni = select(X, :LStat) # only a single feature
mach_uni = machine(mdl, X_uni, y)
fit!(mach_uni)
Machine{LinearRegressor,…} trained 1 time; caches data
  model: MLJLinearModels.LinearRegressor
  args: 
    1:	Source @861 ⏎ `ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}`
    2:	Source @548 ⏎ `AbstractVector{ScientificTypesBase.Continuous}`

You can then retrieve the fitted parameters using fitted_params:

fp = fitted_params(mach_uni)
@show fp.coefs
@show fp.intercept
fp.coefs = [:LStat => -0.950049353757991]
fp.intercept = 34.553840879383095

You can also visualise this

using PyPlot

figure(figsize=(8,6))
plot(X.LStat, y, ls="none", marker="o")
Xnew = (LStat = collect(range(extrema(X.LStat)..., length=100)),)
plot(Xnew.LStat, MLJ.predict(mach_uni, Xnew))
Figure: Univariate regression

The multivariate case is very similar

mach = machine(mdl, X, y)
fit!(mach)

fp = fitted_params(mach)
coefs = fp.coefs
intercept = fp.intercept
for (name, val) in coefs
    println("$(rpad(name, 8)):  $(round(val, sigdigits=3))")
end
println("Intercept: $(round(intercept, sigdigits=3))")
Crim    :  -0.108
Zn      :  0.0464
Indus   :  0.0206
Chas    :  2.69
NOx     :  -17.8
Rm      :  3.81
Age     :  0.000692
Dis     :  -1.48
Rad     :  0.306
Tax     :  -0.0123
PTRatio :  -0.953
Black   :  0.00931
LStat   :  -0.525
Intercept: 36.5

You can also use the machine to predict values and, for instance, compute the root mean squared error:

ŷ = MLJ.predict(mach, X)
round(rms(ŷ, y), sigdigits=4)
4.679
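For reference, rms here is the usual root mean squared error, sqrt(mean((ŷ .- y).^2)); a quick manual check, reusing ŷ and y from above:

sqrt(sum((ŷ .- y).^2) / length(y))   # should agree with rms(ŷ, y)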

Let's see what the residuals look like

figure(figsize=(8,6))
res = ŷ .- y
stem(res)
Figure: Plot of the residuals

Maybe a histogram is more appropriate here

figure(figsize=(8,6))
hist(res, density=true)
Figure: Histogram of the residuals

Interaction and transformation

Let's say we also want to consider an interaction term of LStat and Age taken together. To do this, just create a new DataFrame with an additional column corresponding to the interaction term:

X2 = hcat(X, X.LStat .* X.Age);

So here we have a DataFrame with one extra column corresponding to the elementwise product between :LStat and :Age. DataFrames gives this column a default name (:x1), which we can change:

rename!(X2, :x1 => :interaction);

Ok cool, now let's try the linear regression again

mach = machine(mdl, X2, y)
fit!(mach)
ŷ = MLJ.predict(mach, X2)
round(rms(ŷ, y), sigdigits=4)
4.676

We get slightly better results but nothing spectacular.

Let's get back to the lab where they consider regressing the target variable on LStat and LStat^2; again, it's essentially a case of defining the right DataFrame:

X3 = hcat(X.LStat, X.LStat.^2) |> MLJ.table
mach = machine(mdl, X3, y)
fit!(mach)
ŷ = MLJ.predict(mach, X3)
round(rms(ŷ, y), sigdigits=4)
5.507

which again, we can visualise:

Xnew = (LStat = Xnew.LStat, LStat2 = Xnew.LStat.^2)

figure(figsize=(8,6))
plot(X.LStat, y, ls="none", marker="o")
plot(Xnew.LStat, MLJ.predict(mach, Xnew))
Figure: Polynomial regression