Crabs with XGBoost

To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions.

If you have questions or suggestions about this tutorial, please open an issue here.

First steps
XGBoost machine

This example is inspired by this post showing how to use XGBoost.

MLJ provides a built-in macro to load the Crabs dataset:

using MLJ
using StatsBase
using Random
using Plots
import DataFrames
import StableRNGs.StableRNG


X, y = @load_crabs # a table and a vector
X = DataFrames.DataFrame(X)
@show size(X)
@show y[1:3]
first(X, 3)

size(X) = (200, 5)
y[1:3] = CategoricalArrays.CategoricalValue{String, UInt32}["B", "B", "B"]
3×5 DataFrame
 Row │ FL       RW       CL       CW       BD
     │ Float64  Float64  Float64  Float64  Float64
─────┼─────────────────────────────────────────────
   1 │     8.1      6.7     16.1     19.0      7.0
   2 │     8.8      7.7     18.1     20.8      7.4
   3 │     9.2      7.8     19.0     22.4      7.7

schema(X)

┌───────┬────────────┬─────────┐
│ names │ scitypes   │ types   │
├───────┼────────────┼─────────┤
│ FL    │ Continuous │ Float64 │
│ RW    │ Continuous │ Float64 │
│ CL    │ Continuous │ Float64 │
│ CW    │ Continuous │ Float64 │
│ BD    │ Continuous │ Float64 │
└───────┴────────────┴─────────┘

We are looking at a classification problem with the following classes:

levels(y)

2-element Vector{String}:
 "B"
 "O"

It's not a very big dataset, so we will likely overfit it badly with something as sophisticated as XGBoost, but it will do for a demonstration. Since the dataset is ordered by target class, we make sure to create shuffled train/test index sets:

train, test = partition(collect(eachindex(y)), 0.70, rng=StableRNG(123))
XGBC = @load XGBoostClassifier
xgb_model = XGBC()

import MLJXGBoostInterface ✔
XGBoostClassifier(
  test = 1, 
  num_round = 100, 
  booster = "gbtree", 
  disable_default_eval_metric = 0, 
  eta = 0.3, 
  num_parallel_tree = 1, 
  gamma = 0.0, 
  max_depth = 6, 
  min_child_weight = 1.0, 
  max_delta_step = 0.0, 
  subsample = 1.0, 
  colsample_bytree = 1.0, 
  colsample_bylevel = 1.0, 
  colsample_bynode = 1.0, 
  lambda = 1.0, 
  alpha = 0.0, 
  tree_method = "auto", 
  sketch_eps = 0.03, 
  scale_pos_weight = 1.0, 
  updater = nothing, 
  refresh_leaf = 1, 
  process_type = "default", 
  grow_policy = "depthwise", 
  max_leaves = 0, 
  max_bin = 256, 
  predictor = "cpu_predictor", 
  sample_type = "uniform", 
  normalize_type = "tree", 
  rate_drop = 0.0, 
  one_drop = 0, 
  skip_drop = 0.0, 
  feature_selector = "cyclic", 
  top_k = 0, 
  tweedie_variance_power = 1.5, 
  objective = "automatic", 
  base_score = 0.5, 
  watchlist = nothing, 
  nthread = 12, 
  importance_type = "gain", 
  seed = nothing, 
  validate_parameters = false, 
  eval_metric = String[])
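
To see why the shuffled split matters, here is a minimal sketch that uses only the standard library. It assumes the 200 rows are laid out as 100 "B" followed by 100 "O", which is an assumption for illustration; the names `labels`, `train_idx` and `test_idx` are ours and this is not how `partition` is implemented.

```julia
using Random

# Assumed layout (for illustration only): 100 "B" rows, then 100 "O" rows.
labels = vcat(fill("B", 100), fill("O", 100))
n_train = 140  # 70% of 200 rows

# Unshuffled split: the first 140 rows hold every "B" but only 40 "O".
naive_o = count(==("O"), labels[1:n_train])

# Shuffled split: permute the row indices first, then cut at 70%.
rng = MersenneTwister(123)
shuffled = shuffle(rng, collect(1:length(labels)))
train_idx = shuffled[1:n_train]
test_idx = shuffled[n_train+1:end]

println("O rows in unshuffled training set: ", naive_o)
println("train/test sizes: ", length(train_idx), "/", length(test_idx))
```

Without shuffling, the test rows would all belong to a single class, so any score computed on them would say little about the model.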

Let's check whether the training set is balanced; StatsBase.countmap is useful for that:

countmap(y[train])

Dict{CategoricalArrays.CategoricalValue{String, UInt32}, Int64} with 2 entries:
  "B" => 70
  "O" => 70

The two classes are perfectly balanced in the training set. The same check on the test set and on the full dataset leads to the same conclusion.

Wrap a machine around an XGBoost model (XGB) and the data:

xgb  = XGBC()
mach = machine(xgb, X, y)

untrained Machine; caches model-specific representations of data
  model: XGBoostClassifier(test = 1, …)
  args: 
    1:	Source @654 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
    2:	Source @983 ⏎ AbstractVector{ScientificTypesBase.Multiclass{2}}

We will vary the number of boosting rounds and generate a learning curve:

r = range(xgb, :num_round, lower=50, upper=500)
curve = learning_curve(
    mach,
    range=r,
    resolution=50,
    measure=brier_loss,
)

(parameter_name = "num_round",
 parameter_scale = :linear,
 parameter_values = [50, 59, 68, 78, 87, 96, 105, 114, 123, 133, 142, 151, 160, 169, 179, 188, 197, 206, 215, 224, 234, 243, 252, 261, 270, 280, 289, 298, 307, 316, 326, 335, 344, 353, 362, 371, 381, 390, 399, 408, 417, 427, 436, 445, 454, 463, 472, 482, 491, 500],
 measurements = [0.9827865362167358, 0.9765130877494812, 0.9802314043045044, 0.9803095459938049, 0.9777001738548279, 0.9799514412879944, 0.9762154817581177, 0.9763209223747253, 0.9742343425750732, 0.9760051965713501, 0.9764391779899597, 0.9756509065628052, 0.980719268321991, 0.9772397875785828, 0.9772177934646606, 0.9768736958503723, 0.9761089086532593, 0.9740495085716248, 0.9750213623046875, 0.9736446738243103, 0.9707552790641785, 0.9720619320869446, 0.9704257845878601, 0.9698981046676636, 0.969187319278717, 0.9682750701904297, 0.9691286683082581, 0.9673327803611755, 0.9684506058692932, 0.9694636464118958, 0.9687978625297546, 0.9675275087356567, 0.9685811996459961, 0.9674876928329468, 0.9682658314704895, 0.9671816229820251, 0.9675007462501526, 0.9683974981307983, 0.9670326113700867, 0.9688385128974915, 0.967542827129364, 0.9679813385009766, 0.9694725275039673, 0.9683115482330322, 0.9692115783691406, 0.9678464531898499, 0.9694718718528748, 0.9693232178688049, 0.9685226082801819, 0.9692760705947876],)
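
The measurements above are Brier losses, which score predicted class probabilities rather than hard labels. The sketch below computes the usual multiclass form for a single observation: the squared distance between the predicted probabilities and the one-hot encoding of the true class. `brier_single` is a hypothetical helper, not part of MLJ, and MLJ's own sign and scaling conventions should be checked in its documentation.

```julia
# Brier loss for one observation: sum over classes of
# (predicted probability - indicator of the true class)^2.
function brier_single(probs::Dict{String,Float64}, truth::String)
    return sum((p - (class == truth ? 1.0 : 0.0))^2 for (class, p) in probs)
end

# A confident, correct prediction gives a small loss ...
confident_right = brier_single(Dict("B" => 0.9, "O" => 0.1), "B")

# ... while a confident, wrong prediction is penalised heavily.
confident_wrong = brier_single(Dict("B" => 0.1, "O" => 0.9), "B")

println(confident_right, " vs ", confident_wrong)
```

Lower is better under this definition, so a flat region of the learning curve means that extra rounds no longer improve the predicted probabilities.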

Let's have a look:

plot(curve.parameter_values, curve.measurements)
xlabel!("Number of rounds", fontsize=14)
ylabel!("Brier loss", fontsize=14)

There is not much improvement after 300 rounds, so we fix num_round at 300:

xgb.num_round = 300;

Let's now tune the maximum depth of each tree and the minimum child weight in the boosting.

r1 = range(xgb, :max_depth, lower=3, upper=10)
r2 = range(xgb, :min_child_weight, lower=0, upper=5)

tuned_model = TunedModel(
    xgb,
    tuning=Grid(resolution=8),
    resampling=CV(rng=11),
    ranges=[r1,r2],
    measure=brier_loss,
)
mach = machine(tuned_model, X, y)
fit!(mach, rows=train)

trained Machine; does not cache data
  model: ProbabilisticTunedModel(model = XGBoostClassifier(test = 1, …), …)
  args: 
    1:	Source @211 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
    2:	Source @226 ⏎ AbstractVector{ScientificTypesBase.Multiclass{2}}

Let's visualize details about the tuning:

plot(mach)

Let's extract the optimal model and inspect its parameters:

xgb = fitted_params(mach).best_model
@show xgb.max_depth
@show xgb.min_child_weight

xgb.max_depth = 3
xgb.min_child_weight = 1.4285714285714286
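
The value 1.4285714285714286 is a grid point rather than a finely optimised number. As a rough sketch, assuming the grid places 8 evenly spaced values on each range with the endpoints included (which is consistent with the output above, but is our reading and not MLJ's code), the candidates can be reproduced with the standard library:

```julia
# 8 candidate values per hyperparameter, endpoints included (assumed spacing).
depths = collect(3:10)                      # integer range: 3, 4, ..., 10
weights = collect(range(0, 5, length=8))    # step of 5/7 between values

# The grid search evaluates every (max_depth, min_child_weight) pair.
grid = [(d, w) for d in depths for w in weights]

println("number of candidate models: ", length(grid))
println("third weight on the grid:   ", weights[3])
```

With resolution 8 on two ranges, that is 64 candidate models, each evaluated by cross-validation on the training rows, so a finer grid becomes expensive quickly.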

Let's examine the effect of gamma. This time we'll use a visual approach:

mach = machine(xgb, X, y)
curve = learning_curve(
    mach,
    range=range(xgb, :gamma, lower=0, upper=10),
    resolution=30,
    measure=brier_loss,
);

plot(curve.parameter_values, curve.measurements)
xlabel!("gamma", fontsize=14)
ylabel!("Brier loss", fontsize=14)

The following choice looks about optimal:

xgb.gamma = 3.8

3.8

Let's next examine the effect of subsample and colsample_bytree:

r1 = range(xgb, :subsample, lower=0.6, upper=1.0)
r2 = range(xgb, :colsample_bytree, lower=0.6, upper=1.0)

tuned_model = TunedModel(
    xgb,
    tuning=Grid(resolution=8),
    resampling=CV(rng=234),
    ranges=[r1,r2],
    measure=brier_loss,
)
mach = machine(tuned_model, X, y)
fit!(mach, rows=train)

trained Machine; does not cache data
  model: ProbabilisticTunedModel(model = XGBoostClassifier(test = 1, …), …)
  args: 
    1:	Source @968 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
    2:	Source @747 ⏎ AbstractVector{ScientificTypesBase.Multiclass{2}}

plot(mach)

Let's retrieve the best model:

xgb = fitted_params(mach).best_model
@show xgb.subsample
@show xgb.colsample_bytree

xgb.subsample = 0.6571428571428571
xgb.colsample_bytree = 1.0

We could continue with more fine-tuning, but given how small the dataset is, it doesn't make much sense. How does the tuned model fare on the test set?

ŷ = predict_mode(mach, rows=test)
round(accuracy(ŷ, y[test]), sigdigits=3)

0.817

Not too bad.
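
To put the reported accuracy in concrete terms, the sketch below computes accuracy by hand as the fraction of predicted labels that match the truth. The label vectors are made up for illustration. The last line assumes 60 test rows, which follows from the 70/30 split of 200 rows; under that assumption, 0.817 corresponds to 49 correct predictions.

```julia
using Statistics

# Hypothetical labels, only to illustrate the computation.
truth     = ["B", "B", "O", "O", "B", "O"]
predicted = ["B", "O", "O", "O", "B", "B"]

# Accuracy is the share of positions where prediction and truth agree.
acc = mean(predicted .== truth)

# With 60 test rows, 49 correct predictions round to the reported value.
reported = round(49 / 60, sigdigits=3)

println("toy accuracy: ", acc)
println("49/60 rounded: ", reported)
```

On a test set this small, a single misclassified crab moves the accuracy by more than 1.6 percentage points, so small differences between tuned models should not be over-interpreted.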
