Crabs with XGBoost
To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions. If you have questions or suggestions about this tutorial, please open an issue here.
This example is inspired by this post showing how to use XGBoost.
MLJ provides a built-in function to load the Crabs dataset:
using MLJ
using StatsBase
using Random
using Plots
import DataFrames
import StableRNGs.StableRNG
X, y = @load_crabs # a table and a vector
X = DataFrames.DataFrame(X)
@show size(X)
@show y[1:3]
first(X, 3)
size(X) = (200, 5)
y[1:3] = CategoricalArrays.CategoricalValue{String, UInt32}["B", "B", "B"]
3×5 DataFrame
Row │ FL RW CL CW BD
│ Float64 Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────
1 │ 8.1 6.7 16.1 19.0 7.0
2 │ 8.8 7.7 18.1 20.8 7.4
3 │ 9.2 7.8 19.0 22.4 7.7
schema(X)
┌───────┬────────────┬─────────┐
│ names │ scitypes │ types │
├───────┼────────────┼─────────┤
│ FL │ Continuous │ Float64 │
│ RW │ Continuous │ Float64 │
│ CL │ Continuous │ Float64 │
│ CW │ Continuous │ Float64 │
│ BD │ Continuous │ Float64 │
└───────┴────────────┴─────────┘
We are looking at a classification problem with the following classes:
levels(y)
2-element Vector{String}:
"B"
"O"
It's not a very big dataset, so we will likely overfit badly with something as sophisticated as XGBoost, but it will do for a demonstration. Since the dataset is ordered by target class, we make sure to create shuffled train/test index sets:
train, test = partition(collect(eachindex(y)), 0.70, rng=StableRNG(123))
XGBC = @load XGBoostClassifier
xgb_model = XGBC()
import MLJXGBoostInterface ✔
XGBoostClassifier(
test = 1,
num_round = 100,
booster = "gbtree",
disable_default_eval_metric = 0,
eta = 0.3,
num_parallel_tree = 1,
gamma = 0.0,
max_depth = 6,
min_child_weight = 1.0,
max_delta_step = 0.0,
subsample = 1.0,
colsample_bytree = 1.0,
colsample_bylevel = 1.0,
colsample_bynode = 1.0,
lambda = 1.0,
alpha = 0.0,
tree_method = "auto",
sketch_eps = 0.03,
scale_pos_weight = 1.0,
updater = nothing,
refresh_leaf = 1,
process_type = "default",
grow_policy = "depthwise",
max_leaves = 0,
max_bin = 256,
predictor = "cpu_predictor",
sample_type = "uniform",
normalize_type = "tree",
rate_drop = 0.0,
one_drop = 0,
skip_drop = 0.0,
feature_selector = "cyclic",
top_k = 0,
tweedie_variance_power = 1.5,
objective = "automatic",
base_score = 0.5,
watchlist = nothing,
nthread = 12,
importance_type = "gain",
seed = nothing,
validate_parameters = false,
eval_metric = String[])
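As an optional sanity check (not part of the original tutorial), you can confirm that the scientific types of the features and target are compatible with what the model declares, assuming MLJ's scitype, input_scitype and target_scitype functions are in scope:
scitype(X) <: input_scitype(xgb_model)   # expected true: a table of Continuous columns
scitype(y) <: target_scitype(xgb_model)  # expected true: a Multiclass target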
Let's check whether the training set is balanced; StatsBase.countmap is useful for that:
countmap(y[train])
Dict{CategoricalArrays.CategoricalValue{String, UInt32}, Int64} with 2 entries:
"B" => 70
"O" => 70
which is pretty balanced. You could run the same check on the test set and on the full dataset; the same comment would still hold.
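If you want to verify this, the same one-liner works on the test indices; since the full dataset has 100 crabs of each class and the training set holds 70 of each, the test set necessarily contains 30 of each:
countmap(y[test])   # expect 30 "B" and 30 "O"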
Wrap a machine around an XGBoost model instance (xgb) and the data:
xgb = XGBC()
mach = machine(xgb, X, y)
untrained Machine; caches model-specific representations of data
model: XGBoostClassifier(test = 1, …)
args:
1: Source @654 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
2: Source @983 ⏎ AbstractVector{ScientificTypesBase.Multiclass{2}}
We will start by varying the number of boosting rounds and generating a learning curve:
r = range(xgb, :num_round, lower=50, upper=500)
curve = learning_curve(
mach,
range=r,
resolution=50,
measure=brier_loss,
)
(parameter_name = "num_round",
parameter_scale = :linear,
parameter_values = [50, 59, 68, 78, 87, 96, 105, 114, 123, 133, 142, 151, 160, 169, 179, 188, 197, 206, 215, 224, 234, 243, 252, 261, 270, 280, 289, 298, 307, 316, 326, 335, 344, 353, 362, 371, 381, 390, 399, 408, 417, 427, 436, 445, 454, 463, 472, 482, 491, 500],
measurements = [0.9827865362167358, 0.9765130877494812, 0.9802314043045044, 0.9803095459938049, 0.9777001738548279, 0.9799514412879944, 0.9762154817581177, 0.9763209223747253, 0.9742343425750732, 0.9760051965713501, 0.9764391779899597, 0.9756509065628052, 0.980719268321991, 0.9772397875785828, 0.9772177934646606, 0.9768736958503723, 0.9761089086532593, 0.9740495085716248, 0.9750213623046875, 0.9736446738243103, 0.9707552790641785, 0.9720619320869446, 0.9704257845878601, 0.9698981046676636, 0.969187319278717, 0.9682750701904297, 0.9691286683082581, 0.9673327803611755, 0.9684506058692932, 0.9694636464118958, 0.9687978625297546, 0.9675275087356567, 0.9685811996459961, 0.9674876928329468, 0.9682658314704895, 0.9671816229820251, 0.9675007462501526, 0.9683974981307983, 0.9670326113700867, 0.9688385128974915, 0.967542827129364, 0.9679813385009766, 0.9694725275039673, 0.9683115482330322, 0.9692115783691406, 0.9678464531898499, 0.9694718718528748, 0.9693232178688049, 0.9685226082801819, 0.9692760705947876],)
Let's have a look:
plot(curve.parameter_values, curve.measurements)
xlabel!("Number of rounds", fontsize=14)
ylabel!("Brier loss", fontsize=14)
Not a lot of improvement after 300 rounds.
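If you prefer a numerical check over eyeballing the plot, you can also look up where the measured loss is smallest (an illustrative one-liner, not part of the original workflow); the improvement beyond roughly 300 rounds is marginal:
best = argmin(curve.measurements)
curve.parameter_values[best], curve.measurements[best]   # round count and loss at the minimum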
xgb.num_round = 300;
Let's now tune the maximum depth of each tree and the minimum child weight in the boosting.
r1 = range(xgb, :max_depth, lower=3, upper=10)
r2 = range(xgb, :min_child_weight, lower=0, upper=5)
tuned_model = TunedModel(
xgb,
tuning=Grid(resolution=8),
resampling=CV(rng=11),
ranges=[r1,r2],
measure=brier_loss,
)
mach = machine(tuned_model, X, y)
fit!(mach, rows=train)
trained Machine; does not cache data
model: ProbabilisticTunedModel(model = XGBoostClassifier(test = 1, …), …)
args:
1: Source @211 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
2: Source @226 ⏎ AbstractVector{ScientificTypesBase.Multiclass{2}}
Let's visualize details about the tuning:
plot(mach)
Let's extract the optimal model and inspect its parameters:
xgb = fitted_params(mach).best_model
@show xgb.max_depth
@show xgb.min_child_weight
xgb.max_depth = 3
xgb.min_child_weight = 1.4285714285714286
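You can also check how well this best model scored during the search itself; the tuning report keeps that information (a sketch assuming the standard MLJTuning report fields best_history_entry and measurement):
entry = report(mach).best_history_entry
entry.measurement   # cross-validated Brier loss of the best model (one entry per measure)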
Let's examine the effect of gamma. This time we'll use a visual approach:
mach = machine(xgb, X, y)
curve = learning_curve(
mach,
range= range(xgb, :gamma, lower=0, upper=10),
resolution=30,
measure=brier_loss,
);
plot(curve.parameter_values, curve.measurements)
xlabel!("gamma", fontsize=14)
ylabel!("Brier loss", fontsize=14)
The following choice looks about optimal in terms of performance:
xgb.gamma = 3.8
3.8
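To double-check that this setting behaves sensibly, you can run a quick cross-validated evaluation on the training rows (a sketch; the 6-fold CV and the RNG seed here are arbitrary choices, not from the original tutorial):
evaluate(xgb, X, y;
         rows=train,
         resampling=CV(nfolds=6, rng=StableRNG(123)),
         measure=brier_loss)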
Let's next examine the effect of subsample and colsample_bytree:
r1 = range(xgb, :subsample, lower=0.6, upper=1.0)
r2 = range(xgb, :colsample_bytree, lower=0.6, upper=1.0)
tuned_model = TunedModel(
xgb,
tuning=Grid(resolution=8),
resampling=CV(rng=234),
ranges=[r1,r2],
measure=brier_loss,
)
mach = machine(tuned_model, X, y)
fit!(mach, rows=train)
trained Machine; does not cache data
model: ProbabilisticTunedModel(model = XGBoostClassifier(test = 1, …), …)
args:
1: Source @968 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
2: Source @747 ⏎ AbstractVector{ScientificTypesBase.Multiclass{2}}
Let's again visualize the tuning results:
plot(mach)
Let's retrieve the best model and look at its parameters:
xgb = fitted_params(mach).best_model
@show xgb.subsample
@show xgb.colsample_bytree
xgb.subsample = 0.6571428571428571
xgb.colsample_bytree = 1.0
We could continue with more fine-tuning, but given how small the dataset is, it doesn't make much sense. How does the model fare on the test set?
ŷ = predict_mode(mach, rows=test)
round(accuracy(ŷ, y[test]), sigdigits=3)
0.817
Not too bad.
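For a slightly richer picture than a single accuracy number, you could also inspect the confusion matrix on the test set (output omitted here):
confusion_matrix(ŷ, y[test])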