Using GLM.jl
To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions. If you have questions or suggestions about this tutorial, please open an issue here.
Main author: Clarman Cruz.
This Jupyter Lab showcases MLJ, in particular using the popular GLM Julia package. We use two datasets: one created manually for testing purposes, and the CollegeDistance dataset from the AER package in R.
We can quickly define our models in MLJ and study their results; the workflow is easy and consistent.
using MLJ, CategoricalArrays, PrettyPrinting
import DataFrames: DataFrame, nrow
using UrlDownload
LinearRegressor = @load LinearRegressor pkg=GLM
LinearBinaryClassifier = @load LinearBinaryClassifier pkg=GLM
import MLJGLMInterface ✔
import MLJGLMInterface ✔
MLJGLMInterface.LinearBinaryClassifier
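If you want to check a model's metadata (input and target scitypes, whether it is probabilistic, and so on) before loading it, you can query the model registry. A minimal sketch using MLJ's info function (output omitted):
info("LinearRegressor", pkg = "GLM")
info("LinearBinaryClassifier", pkg = "GLM")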
Both datasets are stored in CSV files. Here, we download and read the input files.
baseurl = "https://raw.githubusercontent.com/tlienart/DataScienceTutorialsData.jl/master/data/glm/"
dfX = DataFrame(urldownload(baseurl * "X3.csv"))
dfYbinary = DataFrame(urldownload(baseurl * "Y3.csv"))
dfX1 = DataFrame(urldownload(baseurl * "X1.csv"))
dfY1 = DataFrame(urldownload(baseurl * "Y1.csv"));
You can have a look at those using first:
first(dfX, 3)
3×12 DataFrame
Row │ gender ethnicity score fcollege mcollege home urban unemp wage tuition income region
│ String7 String15 Float64 String3 String3 String3 String3 Float64 Float64 Float64 String7 String7
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ male other 39.15 yes no yes yes 6.2 8.09 0.88915 high other
2 │ female other 48.87 no no yes yes 6.2 8.09 0.88915 low other
3 │ male other 48.74 no no yes yes 6.2 8.09 0.88915 low other
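Since MLJ dispatches on scientific types rather than machine types, it can also help to inspect the scitypes inferred for each column before fitting anything. A quick sketch using schema (output not shown):
schema(dfX)
schema(dfY1)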
same for Y:
first(dfY1, 3)
3×1 DataFrame
Row │ Y
│ Float64
─────┼───────────
1 │ -2.04463
2 │ -0.461529
3 │ 0.458262
Let's see how many MLJ models handle our kind of target, which is the y variable.
ms = models() do m
AbstractVector{Count} <: m.target_scitype
end
foreach(m -> println(m.name), ms)
EvoTreeCount
LinearCountRegressor
NeuralNetworkClassifier
NeuralNetworkRegressor
XGBoostCount
How about if the type is Continuous:
ms = models() do m
Vector{Continuous} <: m.target_scitype
end
foreach(m -> println(m.name), ms)
ARDRegressor
AdaBoostRegressor
BaggingRegressor
BayesianRidgeRegressor
CatBoostRegressor
ConstantRegressor
DecisionTreeRegressor
DecisionTreeRegressor
DeterministicConstantRegressor
DummyRegressor
ElasticNetCVRegressor
ElasticNetRegressor
ElasticNetRegressor
EpsilonSVR
EvoLinearRegressor
EvoSplineRegressor
EvoTreeGaussian
EvoTreeMLE
EvoTreeRegressor
ExtraTreesRegressor
GaussianMixtureRegressor
GaussianProcessRegressor
GradientBoostingRegressor
HistGradientBoostingRegressor
HuberRegressor
HuberRegressor
KNNRegressor
KNeighborsRegressor
KPLSRegressor
LADRegressor
LGBMRegressor
LarsCVRegressor
LarsRegressor
LassoCVRegressor
LassoLarsCVRegressor
LassoLarsICRegressor
LassoLarsRegressor
LassoRegressor
LassoRegressor
LinearRegressor
LinearRegressor
LinearRegressor
LinearRegressor
NeuralNetworkRegressor
NeuralNetworkRegressor
NuSVR
OrthogonalMatchingPursuitCVRegressor
OrthogonalMatchingPursuitRegressor
PLSRegressor
PartLS
PassiveAggressiveRegressor
QuantileRegressor
RANSACRegressor
RandomForestRegressor
RandomForestRegressor
RandomForestRegressor
RidgeCVRegressor
RidgeRegressor
RidgeRegressor
RidgeRegressor
RobustRegressor
SGDRegressor
SRRegressor
SVMLinearRegressor
SVMNuRegressor
SVMRegressor
StableForestRegressor
StableRulesRegressor
TheilSenRegressor
XGBoostRegressor
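Rather than filtering on the target scitype by hand, you can also ask MLJ for models compatible with a given input table and target at once. A minimal sketch using matching, applied to the first dataset (output omitted):
ms = models(matching(dfX1, dfY1.Y))
foreach(m -> println(m.name), ms)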
We can quickly define our pipeline model in MLJ; it is easy and consistent. Note that OneHotEncoder(drop_last = true) drops one dummy level per categorical feature, which avoids perfect collinearity with the intercept.
X = copy(dfX1)
y = copy(dfY1)
coerce!(X, autotype(X, :string_to_multiclass))
yv = Vector(y[:, 1])
LinearRegressorPipe = Pipeline(
Standardizer(),
OneHotEncoder(drop_last = true),
LinearRegressor()
)
LinearModel = machine(LinearRegressorPipe, X, yv)
fit!(LinearModel)
fp = fitted_params(LinearModel)
(linear_regressor = (features = [:V1, :V2, :V3, :V4, :V5],
coef = [1.020786949740552, 1.03242891546997, 0.009406292423317619, 0.02663391517120747, 0.2998591563637024],
intercept = 0.015893883995789778,),
one_hot_encoder = (all_features = [:V1, :V2, :V3, :V4, :V5],
fitted_levels_given_feature = Dict{Symbol, CategoricalArrays.CategoricalArray}(),
ref_name_pairs_given_feature = Dict{Symbol, Vector{Union{Pair{Missing, Symbol}, Pair{<:Unsigned, Symbol}}}}(),),
standardizer = (features_fit = [:V1, :V2, :V3, :V4, :V5],
means = (0.0024456300706479973, -0.015561621122145304, 0.02442889884313839, 0.15168404285157286, 0.0077036209704558975),
stds = (1.1309193246154066, 1.1238897897565245, 2.332713568319154, 6.806065861835239, 1.1421493464876622),),)
We can quickly read the results of our model in MLJ. The error of the linear model is computed further below.
ŷ = MLJ.predict(LinearModel, X)
yhatResponse = [ŷ[i,1].μ for i in 1:nrow(y)]
residuals = y .- yhatResponse
r = report(LinearModel)
println("\n Coefficients: ", fp.linear_regressor.coef)
println("\n y \n ", y[1:5,1])
println("\n ŷ \n ", ŷ[1:5])
println("\n yhatResponse \n ", yhatResponse[1:5])
println("\n Residuals \n ", y[1:5,1] .- yhatResponse[1:5])
println("\n Standard Error per Coefficient \n", r.linear_regressor.stderror[2:end])
Coefficients: [1.020786949740552, 1.03242891546997, 0.009406292423317619, 0.02663391517120747, 0.2998591563637024]
y
[-2.0446341129015, -0.461528671336098, 0.458261960749596, 2.2746223981481, -0.969887403007307]
ŷ
Distributions.Normal{Float64}[Distributions.Normal{Float64}(μ=-1.6915415373374758, σ=0.9580569656804974), Distributions.Normal{Float64}(μ=1.4120055632036441, σ=0.9580569656804974), Distributions.Normal{Float64}(μ=0.4736296806862398, σ=0.9580569656804974), Distributions.Normal{Float64}(μ=0.7266938985590489, σ=0.9580569656804974), Distributions.Normal{Float64}(μ=-1.8396459459760568, σ=0.9580569656804974)]
yhatResponse
[-1.6915415373374758, 1.4120055632036441, 0.4736296806862398, 0.7266938985590489, -1.8396459459760568]
Residuals
[-0.3530925755640242, -1.873534234539742, -0.015367719936643764, 1.5479284995890512, 0.8697585429687498]
Standard Error per Coefficient
[0.01587640310780568, 0.015862782503144917, 0.01515900587321476, 0.01515667698600387, 0.016546721612329368]
and compute the root mean squared error:
round(rms(yhatResponse, y[:,1]), sigdigits=4)
0.9573
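Note that this RMS is computed on the training data, so it is an in-sample (optimistic) estimate. For an out-of-sample estimate you could cross-validate the same pipeline; a minimal sketch, with the number of folds and seed chosen arbitrarily:
evaluate!(LinearModel, resampling = CV(nfolds = 6, shuffle = true, rng = 123),
          measure = rms, verbosity = 0)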
X = copy(dfX)
y = copy(dfYbinary)
coerce!(X, autotype(X, :string_to_multiclass))
yc = CategoricalArray(y[:, 1])
yc = coerce(yc, OrderedFactor)
LinearBinaryClassifierPipe = Pipeline(
Standardizer(),
OneHotEncoder(drop_last = true),
LinearBinaryClassifier()
)
LogisticModel = machine(LinearBinaryClassifierPipe, X, yc)
fit!(LogisticModel)
fp = fitted_params(LogisticModel)
(linear_binary_classifier = (features = [:gender__female, :ethnicity__afam, :ethnicity__hispanic, :score, :fcollege__no, :mcollege__no, :home__no, :urban__no, :unemp, :wage, :tuition, :income__high, :region__other],
coef = [0.20250729378868748, 0.13075293910912908, 0.344951624939835, 0.9977565847160846, -0.5022315102984592, -0.4785005626021651, -0.20440507809955, -0.06922751403500071, 0.05892864973017096, -0.08344749828203225, -0.0023151433338598607, 0.46177653955786574, 0.3843262958100779],
intercept = -1.076633890579366,),
one_hot_encoder = (all_features = [:gender, :ethnicity, :score, :fcollege, :mcollege, :home, :urban, :unemp, :wage, :tuition, :income, :region],
fitted_levels_given_feature = Dict{Symbol, CategoricalArrays.CategoricalArray}(:income => CategoricalArrays.CategoricalValue{InlineStrings.String7, UInt32}[InlineStrings.String7("high"), InlineStrings.String7("low")], :ethnicity => CategoricalArrays.CategoricalValue{InlineStrings.String15, UInt32}[InlineStrings.String15("afam"), InlineStrings.String15("hispanic"), InlineStrings.String15("other")], :fcollege => CategoricalArrays.CategoricalValue{InlineStrings.String3, UInt32}[InlineStrings.String3("no"), InlineStrings.String3("yes")], :home => CategoricalArrays.CategoricalValue{InlineStrings.String3, UInt32}[InlineStrings.String3("no"), InlineStrings.String3("yes")], :urban => CategoricalArrays.CategoricalValue{InlineStrings.String3, UInt32}[InlineStrings.String3("no"), InlineStrings.String3("yes")], :region => CategoricalArrays.CategoricalValue{InlineStrings.String7, UInt32}[InlineStrings.String7("other"), InlineStrings.String7("west")], :mcollege => CategoricalArrays.CategoricalValue{InlineStrings.String3, UInt32}[InlineStrings.String3("no"), InlineStrings.String3("yes")], :gender => CategoricalArrays.CategoricalValue{InlineStrings.String7, UInt32}[InlineStrings.String7("female"), InlineStrings.String7("male")]),
ref_name_pairs_given_feature = Dict{Symbol, Vector{Union{Pair{Missing, Symbol}, Pair{<:Unsigned, Symbol}}}}(:income => [0x00000001 => :income__high], :ethnicity => [0x00000001 => :ethnicity__afam, 0x00000002 => :ethnicity__hispanic], :fcollege => [0x00000001 => :fcollege__no], :home => [0x00000001 => :home__no], :urban => [0x00000001 => :urban__no], :region => [0x00000001 => :region__other], :mcollege => [0x00000001 => :mcollege__no], :gender => [0x00000001 => :gender__female]),),
standardizer = (features_fit = [:score, :unemp, :wage, :tuition],
means = (50.88902933684601, 7.597214581091511, 9.500506478338009, 0.8146082493518824),
stds = (8.701909614072397, 2.763580873344848, 1.3430670761078416, 0.33950381985971717),),)
The output of the MLJ model contains essentially the same information as the R version of the model.
ŷ = MLJ.predict(LogisticModel, X)
residuals = [1 - pdf(ŷ[i], y[i,1]) for i in 1:nrow(y)]
r = report(LogisticModel)
println("\n Coefficients: ", fp.linear_binary_classifier.coef)
println("\n y \n ", y[1:5,1])
println("\n ŷ \n ", ŷ[1:5])
println("\n residuals \n ", residuals[1:5])
println("\n Standard Error per Coefficient \n", r.linear_binary_classifier.stderror[2:end])
Coefficients: [0.20250729378868748, 0.13075293910912908, 0.344951624939835, 0.9977565847160846, -0.5022315102984592, -0.4785005626021651, -0.20440507809955, -0.06922751403500071, 0.05892864973017096, -0.08344749828203225, -0.0023151433338598607, 0.46177653955786574, 0.3843262958100779]
y
[0, 0, 0, 0, 0]
ŷ
CategoricalDistributions.UnivariateFinite{ScientificTypesBase.OrderedFactor{2}, Int64, UInt32, Float64}[UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(0=>0.881, 1=>0.119), UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(0=>0.838, 1=>0.162), UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(0=>0.866, 1=>0.134), UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(0=>0.936, 1=>0.0637), UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(0=>0.944, 1=>0.056)]
residuals
[0.11944603346742211, 0.16182691493524637, 0.13445730373831222, 0.06370799769022906, 0.05604680411361729]
Standard Error per Coefficient
[0.07542967234030676, 0.1226000420274196, 0.10934317995152516, 0.04661437250372939, 0.09609243724815357, 0.10743620672240191, 0.10642223545563922, 0.09190778860389334, 0.03922724536508867, 0.04118915117919154, 0.05115399636339277, 0.08454431256127866, 0.1228145565794004]
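With the coefficients and their standard errors in hand, you can also form approximate Wald z-statistics for each term. A minimal sketch, assuming (as in the printout above) that the first entry of stderror corresponds to the intercept:
coefs = fp.linear_binary_classifier.coef
ses = r.linear_binary_classifier.stderror[2:end]   # drop the intercept entry
z = coefs ./ ses                                   # approximate Wald z-statistics
println("\n z-statistics \n", z)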
No logistic analysis is complete without the confusion matrix:
yMode = [mode(ŷ[i]) for i in 1:length(ŷ)]
y = coerce(y[:,1], OrderedFactor)
yMode = coerce(yMode, OrderedFactor)
confusion_matrix(yMode, y)
┌─────────────┐
│Ground Truth │
┌─────────┼──────┬──────┤
│Predicted│ 0 │ 1 │
├─────────┼──────┼──────┤
│ 0 │ 3283 │ 831 │
├─────────┼──────┼──────┤
│ 1 │ 236 │ 389 │
└─────────┴──────┴──────┘
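Beyond the confusion matrix, a few standard classification measures round out the analysis. A minimal sketch using measures exported by MLJ (output omitted):
println("accuracy: ", accuracy(yMode, y))
println("misclassification rate: ", misclassification_rate(yMode, y))
println("auc: ", auc(ŷ, y))    # uses the probabilistic predictions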