Using GLM.jl

To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions.

If you have questions or suggestions about this tutorial, please open an issue here.

Main author: Clarman Cruz.

This Jupyter lab showcases MLJ, in particular with the popular GLM Julia package. We use two datasets: one created manually for testing purposes, and the CollegeDistance dataset from the AER package in R.

MLJ lets us define our models quickly and study their results through a consistent interface.

using MLJ, CategoricalArrays, PrettyPrinting
import DataFrames: DataFrame, nrow
using UrlDownload

LinearRegressor = @load LinearRegressor pkg=GLM
LinearBinaryClassifier = @load LinearBinaryClassifier pkg=GLM
import MLJGLMInterface ✔
import MLJGLMInterface ✔
MLJGLMInterface.LinearBinaryClassifier
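
If you want to check a loaded model's hyperparameters and usage notes, its docstring is available in the usual way (a quick sketch, output not shown):

@doc LinearBinaryClassifier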

Reading the data

The data are stored in CSV files. Here, we read in the input files.

baseurl = "https://raw.githubusercontent.com/tlienart/DataScienceTutorialsData.jl/master/data/glm/"

dfX = DataFrame(urldownload(baseurl * "X3.csv"))
dfYbinary = DataFrame(urldownload(baseurl * "Y3.csv"))
dfX1 = DataFrame(urldownload(baseurl * "X1.csv"))
dfY1 = DataFrame(urldownload(baseurl * "Y1.csv"));

You can have a look at those using first:

first(dfX, 3)
3×12 DataFrame
 Row │ gender   ethnicity  score    fcollege  mcollege  home     urban    unemp    wage     tuition  income   region
     │ String7  String15   Float64  String3   String3   String3  String3  Float64  Float64  Float64  String7  String7
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ male     other        39.15  yes       no        yes      yes          6.2     8.09  0.88915  high     other
   2 │ female   other        48.87  no        no        yes      yes          6.2     8.09  0.88915  low      other
   3 │ male     other        48.74  no        no        yes      yes          6.2     8.09  0.88915  low      other

Same for the target dfY1:

first(dfY1, 3)
3×1 DataFrame
 Row │ Y
     │ Float64
─────┼───────────
   1 │ -2.04463
   2 │ -0.461529
   3 │  0.458262
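
Before coercing anything, it can also help to inspect the scientific types MLJ infers for each column; schema is re-exported by MLJ (a quick sketch, output not shown):

schema(dfX)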

Defining the Linear Model

Let's see how many MLJ models can handle our kind of target, i.e. the y variable.

ms = models() do m
    AbstractVector{Count} <: m.target_scitype
end
foreach(m -> println(m.name), ms)
EvoTreeCount
LinearCountRegressor
XGBoostCount

How about if the target type were Continuous:

ms = models() do m
    Vector{Continuous} <: m.target_scitype
end
foreach(m -> println(m.name), ms)
ARDRegressor
AdaBoostRegressor
BaggingRegressor
BayesianRidgeRegressor
ConstantRegressor
DecisionTreeRegressor
DecisionTreeRegressor
DeterministicConstantRegressor
DummyRegressor
ElasticNetCVRegressor
ElasticNetRegressor
ElasticNetRegressor
EpsilonSVR
EvoTreeGaussian
EvoTreeRegressor
ExtraTreesRegressor
GaussianProcessRegressor
GradientBoostingRegressor
HuberRegressor
HuberRegressor
KNNRegressor
KNeighborsRegressor
KPLSRegressor
LADRegressor
LGBMRegressor
LarsCVRegressor
LarsRegressor
LassoCVRegressor
LassoLarsCVRegressor
LassoLarsICRegressor
LassoLarsRegressor
LassoRegressor
LassoRegressor
LinearRegressor
LinearRegressor
LinearRegressor
LinearRegressor
NeuralNetworkRegressor
NuSVR
OrthogonalMatchingPursuitCVRegressor
OrthogonalMatchingPursuitRegressor
PLSRegressor
PassiveAggressiveRegressor
QuantileRegressor
RANSACRegressor
RandomForestRegressor
RandomForestRegressor
RandomForestRegressor
RidgeCVRegressor
RidgeRegressor
RidgeRegressor
RidgeRegressor
RobustRegressor
SGDRegressor
SVMLinearRegressor
SVMNuRegressor
SVMRegressor
TheilSenRegressor
XGBoostRegressor

We can now quickly define our pipeline model in MLJ; the interface is easy and consistent.

X = copy(dfX1)
y = copy(dfY1)

coerce!(X, autotype(X, :string_to_multiclass))
yv = Vector(y[:, 1])

LinearRegressorPipe = Pipeline(
    Standardizer(),
    OneHotEncoder(drop_last = true),
    LinearRegressor()
)

LinearModel = machine(LinearRegressorPipe, X, yv)
fit!(LinearModel)
fp = fitted_params(LinearModel)
(linear_regressor = (features = ["V1", "V2", "V3", "V4", "V5"],
                     coef = [1.0207869497405524, 1.03242891546997, 0.009406292423317668, 0.026633915171207462, 0.29985915636370225, 0.015893883995789806],
                     intercept = 0.015893883995789806,),
 one_hot_encoder = (fitresult = OneHotEncoderResult,),
 standardizer = Dict(:V1 => (0.0024456300706479973, 1.1309193246154066), :V2 => (-0.015561621122145304, 1.1238897897565245), :V5 => (0.0077036209704558975, 1.1421493464876622), :V3 => (0.02442889884313839, 2.332713568319154), :V4 => (0.15168404285157286, 6.806065861835239)),
 machines = MLJBase.Machine[Machine{Standardizer,…}, Machine{OneHotEncoder,…}, Machine{LinearRegressor,…}],
 fitted_params_given_machine = OrderedCollections.LittleDict{Any, Any, Vector{Any}, Vector{Any}}(Machine{Standardizer,…} => Dict(:V1 => (0.0024456300706479973, 1.1309193246154066), :V2 => (-0.015561621122145304, 1.1238897897565245), :V5 => (0.0077036209704558975, 1.1421493464876622), :V3 => (0.02442889884313839, 2.332713568319154), :V4 => (0.15168404285157286, 6.806065861835239)), Machine{OneHotEncoder,…} => (fitresult = OneHotEncoderResult,), Machine{LinearRegressor,…} => (features = ["V1", "V2", "V3", "V4", "V5"], coef = [1.0207869497405524, 1.03242891546997, 0.009406292423317668, 0.026633915171207462, 0.29985915636370225, 0.015893883995789806], intercept = 0.015893883995789806)),)
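
As an aside, now that X and yv are defined we can also ask MLJ directly which models match this particular data, instead of filtering on the target scitype by hand (a sketch using the matching helper):

ms = models(matching(X, yv))
foreach(m -> println(m.name), ms)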

Reading the Output of Fitting the Linear Model

We can quickly read the results of our model in MLJ. Don't forget to compute the accuracy of the linear model afterwards.

ŷ = MLJ.predict(LinearModel, X)
yhatResponse = [ŷ[i,1].μ for i in 1:nrow(y)]
residuals = y .- yhatResponse
r = report(LinearModel)

k = collect(keys(fp.fitted_params_given_machine))[3]
println("\n Coefficients:  ", fp.fitted_params_given_machine[k].coef)
println("\n y \n ", y[1:5,1])
println("\n ŷ \n ", ŷ[1:5])
println("\n yhatResponse \n ", yhatResponse[1:5])
println("\n Residuals \n ", y[1:5,1] .- yhatResponse[1:5])
println("\n Standard Error per Coefficient \n", r.linear_regressor.stderror[2:end])

 Coefficients:  [1.0207869497405524, 1.03242891546997, 0.009406292423317668, 0.026633915171207462, 0.29985915636370225, 0.015893883995789806]

 y 
 [-2.0446341129015, -0.461528671336098, 0.458261960749596, 2.2746223981481, -0.969887403007307]

 ŷ 
 Distributions.Normal{Float64}[Distributions.Normal{Float64}(μ=-1.6915415373374758, σ=0.9580569656804974), Distributions.Normal{Float64}(μ=1.412005563203644, σ=0.9580569656804974), Distributions.Normal{Float64}(μ=0.47362968068623923, σ=0.9580569656804974), Distributions.Normal{Float64}(μ=0.7266938985590492, σ=0.9580569656804974), Distributions.Normal{Float64}(μ=-1.8396459459760566, σ=0.9580569656804974)]

 yhatResponse 
 [-1.6915415373374758, 1.412005563203644, 0.47362968068623923, 0.7266938985590492, -1.8396459459760566]

 Residuals 
 [-0.3530925755640242, -1.8735342345397419, -0.01536771993664321, 1.547928499589051, 0.8697585429687495]

 Standard Error per Coefficient 
[0.015862782503144917, 0.01515900587321476, 0.01515667698600387, 0.016546721612329368, 0.015148210698700702]

and get the error (RMS) of the model:

round(rms(yhatResponse, y[:,1]), sigdigits=4)
0.9573
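
The RMS above is computed on the training data itself. To estimate out-of-sample error you could also resample with MLJ's evaluate!; a minimal sketch, assuming 6-fold cross-validation is acceptable here:

evaluate!(LinearModel, resampling=CV(nfolds=6, shuffle=true, rng=123), measure=rms)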

Defining the Logistic Model

X = copy(dfX)
y = copy(dfYbinary)

coerce!(X, autotype(X, :string_to_multiclass))
yc = CategoricalArray(y[:, 1])
yc = coerce(yc, OrderedFactor)

LinearBinaryClassifierPipe = Pipeline(
    Standardizer(),
    OneHotEncoder(drop_last = true),
    LinearBinaryClassifier()
)

LogisticModel = machine(LinearBinaryClassifierPipe, X, yc)
fit!(LogisticModel)
fp = fitted_params(LogisticModel)
(linear_binary_classifier = (features = ["gender__female", "ethnicity__afam", "ethnicity__hispanic", "score", "fcollege__no", "mcollege__no", "home__no", "urban__no", "unemp", "wage", "tuition", "income__high", "region__other"],
                             coef = [0.20250729378868748, 0.13075293910912905, 0.34495162493983506, 0.9977565847160846, -0.5022315102984594, -0.4785005626021653, -0.20440507809954997, -0.06922751403500084, 0.05892864973017095, -0.08344749828203228, -0.002315143333859723, 0.4617765395578657, 0.38432629581007743, -1.0766338905793653],
                             intercept = -1.0766338905793653,),
 one_hot_encoder = (fitresult = OneHotEncoderResult,),
 standardizer = Dict(:wage => (9.500506478338009, 1.3430670761078416), :unemp => (7.597214581091511, 2.763580873344848), :tuition => (0.8146082493518824, 0.33950381985971717), :score => (50.88902933684601, 8.701909614072397)),
 machines = MLJBase.Machine[Machine{Standardizer,…}, Machine{OneHotEncoder,…}, Machine{LinearBinaryClassifier{LogitLink},…}],
 fitted_params_given_machine = OrderedCollections.LittleDict{Any, Any, Vector{Any}, Vector{Any}}(Machine{Standardizer,…} => Dict(:wage => (9.500506478338009, 1.3430670761078416), :unemp => (7.597214581091511, 2.763580873344848), :tuition => (0.8146082493518824, 0.33950381985971717), :score => (50.88902933684601, 8.701909614072397)), Machine{OneHotEncoder,…} => (fitresult = OneHotEncoderResult,), Machine{LinearBinaryClassifier{LogitLink},…} => (features = ["gender__female", "ethnicity__afam", "ethnicity__hispanic", "score", "fcollege__no", "mcollege__no", "home__no", "urban__no", "unemp", "wage", "tuition", "income__high", "region__other"], coef = [0.20250729378868748, 0.13075293910912905, 0.34495162493983506, 0.9977565847160846, -0.5022315102984594, -0.4785005626021653, -0.20440507809954997, -0.06922751403500084, 0.05892864973017095, -0.08344749828203228, -0.002315143333859723, 0.4617765395578657, 0.38432629581007743, -1.0766338905793653], intercept = -1.0766338905793653)),)

Reading the Output from the Prediction of the Logistic Model

The output of the MLJ model contains essentially the same information as the R version of the model.

ŷ = MLJ.predict(LogisticModel, X)
residuals = [1 - pdf(ŷ[i], y[i,1]) for i in 1:nrow(y)]
r = report(LogisticModel)

k = collect(keys(fp.fitted_params_given_machine))[3]
println("\n Coefficients:  ", fp.fitted_params_given_machine[k].coef)
println("\n y \n ", y[1:5,1])
println("\n ŷ \n ", ŷ[1:5])
println("\n residuals \n ", residuals[1:5])
println("\n Standard Error per Coefficient \n", r.linear_binary_classifier.stderror[2:end])

 Coefficients:  [0.20250729378868748, 0.13075293910912905, 0.34495162493983506, 0.9977565847160846, -0.5022315102984594, -0.4785005626021653, -0.20440507809954997, -0.06922751403500084, 0.05892864973017095, -0.08344749828203228, -0.002315143333859723, 0.4617765395578657, 0.38432629581007743, -1.0766338905793653]

 y 
 [0, 0, 0, 0, 0]

 ŷ 
 CategoricalDistributions.UnivariateFinite{ScientificTypesBase.OrderedFactor{2}, Int64, UInt32, Float64}[UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(0=>0.881, 1=>0.119), UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(0=>0.838, 1=>0.162), UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(0=>0.866, 1=>0.134), UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(0=>0.936, 1=>0.0637), UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(0=>0.944, 1=>0.056)]

 residuals 
 [0.11944603346742211, 0.16182691493524637, 0.1344573037383121, 0.06370799769022917, 0.05604680411361729]

 Standard Error per Coefficient 
[0.12260004202741961, 0.10934317995152518, 0.0466143725037294, 0.0960924372481536, 0.10743620672240194, 0.10642223545563925, 0.09190778860389343, 0.03922724536508866, 0.04118915117919154, 0.05115399636339277, 0.08454431256127869, 0.1228145565794003, 0.17884724866298368]

No logistic analysis is complete without the confusion matrix:

yMode = [mode(ŷ[i]) for i in 1:length(ŷ)]
y = coerce(y[:,1], OrderedFactor)
yMode = coerce(yMode, OrderedFactor)
confusion_matrix(yMode, y)
              ┌───────────────────────────┐
              │       Ground Truth        │
┌─────────────┼─────────────┬─────────────┤
│  Predicted  │      0      │      1      │
├─────────────┼─────────────┼─────────────┤
│      0      │    3283     │     831     │
├─────────────┼─────────────┼─────────────┤
│      1      │     236     │     389     │
└─────────────┴─────────────┴─────────────┘
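
Beyond the confusion matrix, other classification measures built into MLJ can be computed from the same predictions; a minimal sketch (output not shown):

accuracy(yMode, y)   # proportion of correct mode predictions
auc(ŷ, yc)           # area under the ROC curve, using the probabilistic predictions and the original target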