Using GLM.jl

To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions.

If you have questions or suggestions about this tutorial, please open an issue here.

Main author: Clarman Cruz.

This Jupyter lab showcases MLJ, in particular with the popular GLM Julia package. We use two datasets: one created manually for testing purposes, and the CollegeDistance dataset from the AER package in R.

MLJ lets us define our models quickly and study their results through a consistent interface.

using MLJ, CategoricalArrays, PrettyPrinting
import DataFrames: DataFrame, nrow
using UrlDownload

LinearRegressor = @load LinearRegressor pkg=GLM
LinearBinaryClassifier = @load LinearBinaryClassifier pkg=GLM
import MLJGLMInterface ✔
import MLJGLMInterface ✔
MLJGLMInterface.LinearBinaryClassifier
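
If you want to check a loaded model's hyperparameters and usage notes, its docstring is available in the usual way (a quick sketch, output not shown):

@doc LinearBinaryClassifier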

Reading the data

The data are stored in CSV files. Here, we read in the input files.

baseurl = "https://raw.githubusercontent.com/tlienart/DataScienceTutorialsData.jl/master/data/glm/"

dfX = DataFrame(urldownload(baseurl * "X3.csv"))
dfYbinary = DataFrame(urldownload(baseurl * "Y3.csv"))
dfX1 = DataFrame(urldownload(baseurl * "X1.csv"))
dfY1 = DataFrame(urldownload(baseurl * "Y1.csv"));

You can have a look at those using first:

first(dfX, 3)
3×12 DataFrame
 Row │ gender   ethnicity  score    fcollege  mcollege  home     urban    unemp    wage     tuition  income   region
     │ String7  String15   Float64  String3   String3   String3  String3  Float64  Float64  Float64  String7  String7
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ male     other        39.15  yes       no        yes      yes          6.2     8.09  0.88915  high     other
   2 │ female   other        48.87  no        no        yes      yes          6.2     8.09  0.88915  low      other
   3 │ male     other        48.74  no        no        yes      yes          6.2     8.09  0.88915  low      other

Same for the target dfY1:

first(dfY1, 3)
3×1 DataFrame
 Row │ Y
     │ Float64
─────┼───────────
   1 │ -2.04463
   2 │ -0.461529
   3 │  0.458262
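
Before coercing anything, it can also help to inspect the scientific types MLJ infers for each column; schema is re-exported by MLJ (a quick sketch, output not shown):

schema(dfX)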

Defining the Linear Model

Let's see how many MLJ models can handle our kind of target, i.e. the y variable.

ms = models() do m
    AbstractVector{Count} <: m.target_scitype
end
foreach(m -> println(m.name), ms)
EvoTreeCount
LinearCountRegressor
XGBoostCount

How about if the target type were Continuous:

ms = models() do m
    Vector{Continuous} <: m.target_scitype
end
foreach(m -> println(m.name), ms)
ARDRegressor
AdaBoostRegressor
BaggingRegressor
BayesianRidgeRegressor
ConstantRegressor
DecisionTreeRegressor
DecisionTreeRegressor
DeterministicConstantRegressor
DummyRegressor
ElasticNetCVRegressor
ElasticNetRegressor
ElasticNetRegressor
EpsilonSVR
EvoTreeGaussian
EvoTreeRegressor
ExtraTreesRegressor
GaussianProcessRegressor
GradientBoostingRegressor
HuberRegressor
HuberRegressor
KNNRegressor
KNeighborsRegressor
KPLSRegressor
LADRegressor
LGBMRegressor
LarsCVRegressor
LarsRegressor
LassoCVRegressor
LassoLarsCVRegressor
LassoLarsICRegressor
LassoLarsRegressor
LassoRegressor
LassoRegressor
LinearRegressor
LinearRegressor
LinearRegressor
LinearRegressor
NeuralNetworkRegressor
NuSVR
OrthogonalMatchingPursuitCVRegressor
OrthogonalMatchingPursuitRegressor
PLSRegressor
PassiveAggressiveRegressor
QuantileRegressor
RANSACRegressor
RandomForestRegressor
RandomForestRegressor
RandomForestRegressor
RidgeCVRegressor
RidgeRegressor
RidgeRegressor
RidgeRegressor
RobustRegressor
SGDRegressor
SVMLinearRegressor
SVMNuRegressor
SVMRegressor
TheilSenRegressor
XGBoostRegressor

We can now quickly define our pipeline model in MLJ; the interface is easy and consistent.

X = copy(dfX1)
y = copy(dfY1)

coerce!(X, autotype(X, :string_to_multiclass))
yv = Vector(y[:, 1])

LinearRegressorPipe = Pipeline(
    Standardizer(),
    OneHotEncoder(drop_last = true),
    LinearRegressor()
)

LinearModel = machine(LinearRegressorPipe, X, yv)
fit!(LinearModel)
fp = fitted_params(LinearModel)
(linear_regressor = (features = ["V1", "V2", "V3", "V4", "V5"],
                     coef = [1.0207869497405524, 1.03242891546997, 0.009406292423317668, 0.026633915171207462, 0.29985915636370225, 0.015893883995789806],
                     intercept = 0.015893883995789806,),
 one_hot_encoder = (fitresult = OneHotEncoderResult,),
 standardizer = Dict(:V1 => (0.0024456300706479973, 1.1309193246154066), :V2 => (-0.015561621122145304, 1.1238897897565245), :V5 => (0.0077036209704558975, 1.1421493464876622), :V3 => (0.02442889884313839, 2.332713568319154), :V4 => (0.15168404285157286, 6.806065861835239)),
 machines = MLJBase.Machine[Machine{Standardizer,…}, Machine{OneHotEncoder,…}, Machine{LinearRegressor,…}],
 fitted_params_given_machine = OrderedCollections.LittleDict{Any, Any, Vector{Any}, Vector{Any}}(Machine{Standardizer,…} => Dict(:V1 => (0.0024456300706479973, 1.1309193246154066), :V2 => (-0.015561621122145304, 1.1238897897565245), :V5 => (0.0077036209704558975, 1.1421493464876622), :V3 => (0.02442889884313839, 2.332713568319154), :V4 => (0.15168404285157286, 6.806065861835239)), Machine{OneHotEncoder,…} => (fitresult = OneHotEncoderResult,), Machine{LinearRegressor,…} => (features = ["V1", "V2", "V3", "V4", "V5"], coef = [1.0207869497405524, 1.03242891546997, 0.009406292423317668, 0.026633915171207462, 0.29985915636370225, 0.015893883995789806], intercept = 0.015893883995789806)),)
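
As an aside, now that X and yv are defined we can also ask MLJ directly which models match this particular data, instead of filtering on the target scitype by hand (a sketch using the matching helper):

ms = models(matching(X, yv))
foreach(m -> println(m.name), ms)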

Reading the Output of Fitting the Linear Model

We can quickly read the results of our model in MLJ. Don't forget to compute the accuracy of the linear model afterwards.

ŷ = MLJ.predict(LinearModel, X)
yhatResponse = [ŷ[i,1].μ for i in 1:nrow(y)]
residuals = y .- yhatResponse
r = report(LinearModel)

k = collect(keys(fp.fitted_params_given_machine))[3]
println("\n Coefficients:  ", fp.fitted_params_given_machine[k].coef)
println("\n y \n ", y[1:5,1])
println("\n ŷ \n ", ŷ[1:5])
println("\n yhatResponse \n ", yhatResponse[1:5])
println("\n Residuals \n ", y[1:5,1] .- yhatResponse[1:5])
println("\n Standard Error per Coefficient \n", r.linear_regressor.stderror[2:end])

 Coefficients:  [1.0207869497405524, 1.03242891546997, 0.009406292423317668, 0.026633915171207462, 0.29985915636370225, 0.015893883995789806]

 y 
 [-2.0446341129015, -0.461528671336098, 0.458261960749596, 2.2746223981481, -0.969887403007307]

 ŷ 
 Distributions.Normal{Float64}[Distributions.Normal{Float64}(μ=-1.6915415373374758, σ=0.9580569656804974), Distributions.Normal{Float64}(μ=1.412005563203644, σ=0.9580569656804974), Distributions.Normal{Float64}(μ=0.47362968068623923, σ=0.9580569656804974), Distributions.Normal{Float64}(μ=0.7266938985590492, σ=0.9580569656804974), Distributions.Normal{Float64}(μ=-1.8396459459760566, σ=0.9580569656804974)]

 yhatResponse 
 [-1.6915415373374758, 1.412005563203644, 0.47362968068623923, 0.7266938985590492, -1.8396459459760566]

 Residuals 
 [-0.3530925755640242, -1.8735342345397419, -0.01536771993664321, 1.547928499589051, 0.8697585429687495]

 Standard Error per Coefficient 
[0.015862782503144917, 0.01515900587321476, 0.01515667698600387, 0.016546721612329368, 0.015148210698700702]

and get the error (RMS) of the model:

round(rms(yhatResponse, y[:,1]), sigdigits=4)
0.9573
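
The RMS above is computed on the training data itself. To estimate out-of-sample error you could also resample with MLJ's evaluate!; a minimal sketch, assuming 6-fold cross-validation is acceptable here:

evaluate!(LinearModel, resampling=CV(nfolds=6, shuffle=true, rng=123), measure=rms)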

Defining the Logistic Model

X = copy(dfX)
y = copy(dfYbinary)

coerce!(X, autotype(X, :string_to_multiclass))
yc = CategoricalArray(y[:, 1])
yc = coerce(yc, OrderedFactor)

LinearBinaryClassifierPipe = Pipeline(
    Standardizer(),
    OneHotEncoder(drop_last = true),
    LinearBinaryClassifier()
)

LogisticModel = machine(LinearBinaryClassifierPipe, X, yc)
fit!(LogisticModel)
fp = fitted_params(LogisticModel)
(linear_binary_classifier = (features = ["gender__female", "ethnicity__afam", "ethnicity__hispanic", "score", "fcollege__no", "mcollege__no", "home__no", "urban__no", "unemp", "wage", "tuition", "income__high", "region__other"],
                             coef = [0.20250729378868748, 0.13075293910912905, 0.34495162493983506, 0.9977565847160846, -0.5022315102984594, -0.4785005626021653, -0.20440507809954997, -0.06922751403500084, 0.05892864973017095, -0.08344749828203228, -0.002315143333859723, 0.4617765395578657, 0.38432629581007743, -1.0766338905793653],
                             intercept = -1.0766338905793653,),
 one_hot_encoder = (fitresult = OneHotEncoderResult,),
 standardizer = Dict(:wage => (9.500506478338009, 1.3430670761078416), :unemp => (7.597214581091511, 2.763580873344848), :tuition => (0.8146082493518824, 0.33950381985971717), :score => (50.88902933684601, 8.701909614072397)),
 machines = MLJBase.Machine[Machine{Standardizer,…}, Machine{OneHotEncoder,…}, Machine{LinearBinaryClassifier{LogitLink},…}],
 fitted_params_given_machine = OrderedCollections.LittleDict{Any, Any, Vector{Any}, Vector{Any}}(Machine{Standardizer,…} => Dict(:wage => (9.500506478338009, 1.3430670761078416), :unemp => (7.597214581091511, 2.763580873344848), :tuition => (0.8146082493518824, 0.33950381985971717), :score => (50.88902933684601, 8.701909614072397)), Machine{OneHotEncoder,…} => (fitresult = OneHotEncoderResult,), Machine{LinearBinaryClassifier{LogitLink},…} => (features = ["gender__female", "ethnicity__afam", "ethnicity__hispanic", "score", "fcollege__no", "mcollege__no", "home__no", "urban__no", "unemp", "wage", "tuition", "income__high", "region__other"], coef = [0.20250729378868748, 0.13075293910912905, 0.34495162493983506, 0.9977565847160846, -0.5022315102984594, -0.4785005626021653, -0.20440507809954997, -0.06922751403500084, 0.05892864973017095, -0.08344749828203228, -0.002315143333859723, 0.4617765395578657, 0.38432629581007743, -1.0766338905793653], intercept = -1.0766338905793653)),)

Reading the Output from the Prediction of the Logistic Model

The output of the MLJ model contains essentially the same information as the R version of the model.

ŷ = MLJ.predict(LogisticModel, X)
residuals = [1 - pdf(ŷ[i], y[i,1]) for i in 1:nrow(y)]
r = report(LogisticModel)

k = collect(keys(fp.fitted_params_given_machine))[3]
println("\n Coefficients:  ", fp.fitted_params_given_machine[k].coef)
println("\n y \n ", y[1:5,1])
println("\n ŷ \n ", ŷ[1:5])
println("\n residuals \n ", residuals[1:5])
println("\n Standard Error per Coefficient \n", r.linear_binary_classifier.stderror[2:end])

 Coefficients:  [0.20250729378868748, 0.13075293910912905, 0.34495162493983506, 0.9977565847160846, -0.5022315102984594, -0.4785005626021653, -0.20440507809954997, -0.06922751403500084, 0.05892864973017095, -0.08344749828203228, -0.002315143333859723, 0.4617765395578657, 0.38432629581007743, -1.0766338905793653]

 y 
 [0, 0, 0, 0, 0]

 ŷ 
 CategoricalDistributions.UnivariateFinite{ScientificTypesBase.OrderedFactor{2}, Int64, UInt32, Float64}[UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(0=>0.881, 1=>0.119), UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(0=>0.838, 1=>0.162), UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(0=>0.866, 1=>0.134), UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(0=>0.936, 1=>0.0637), UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(0=>0.944, 1=>0.056)]

 residuals 
 [0.11944603346742211, 0.16182691493524637, 0.1344573037383121, 0.06370799769022917, 0.05604680411361729]

 Standard Error per Coefficient 
[0.12260004202741961, 0.10934317995152518, 0.0466143725037294, 0.0960924372481536, 0.10743620672240194, 0.10642223545563925, 0.09190778860389343, 0.03922724536508866, 0.04118915117919154, 0.05115399636339277, 0.08454431256127869, 0.1228145565794003, 0.17884724866298368]

No logistic analysis is complete without the confusion matrix:

yMode = [mode(ŷ[i]) for i in 1:length(ŷ)]
y = coerce(y[:,1], OrderedFactor)
yMode = coerce(yMode, OrderedFactor)
confusion_matrix(yMode, y)
              ┌───────────────────────────┐
              │       Ground Truth        │
┌─────────────┼─────────────┬─────────────┤
│  Predicted  │      0      │      1      │
├─────────────┼─────────────┼─────────────┤
│      0      │    3283     │     831     │
├─────────────┼─────────────┼─────────────┤
│      1      │     236     │     389     │
└─────────────┴─────────────┴─────────────┘
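
Beyond the confusion matrix, other classification measures built into MLJ can be computed from the same predictions; a minimal sketch (output not shown):

accuracy(yMode, y)   # proportion of correct mode predictions
auc(ŷ, yc)           # area under the ROC curve, using the probabilistic predictions and the original target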