Boston with LightGBM
To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions. If you have questions or suggestions about this tutorial, please open an issue here.
Main author: Yaqub Alwan (IQVIA).
using MLJ
import DataFrames
import Statistics
import StableRNGs.StableRNG
LGBMRegressor = @load LGBMRegressor pkg=LightGBM
import LightGBM ✔
LightGBM.MLJInterface.LGBMRegressor
Let us try LightGBM out by doing a regression task on the Boston house prices dataset. This is a commonly used dataset so there is a loader built into MLJ.
Here, the objective is to show that, with minimal effort, LightGBM can do better than a linear regressor.
We start out by taking a quick peek at the data itself and its statistical properties.
features, targets = @load_boston
features = DataFrames.DataFrame(features);
@show size(features)
@show targets[1:3]
first(features, 3)
size(features) = (506, 12)
targets[1:3] = [24.0, 21.6, 34.7]
3×12 DataFrame
Row │ Crim Zn Indus NOx Rm Age Dis Rad Tax PTRatio Black LStat
│ Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.00632 18.0 2.31 0.538 6.575 65.2 4.09 1.0 296.0 15.3 396.9 4.98
2 │ 0.02731 0.0 7.07 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.9 9.14
3 │ 0.02729 0.0 7.07 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
schema(features)
┌─────────┬────────────┬─────────┐
│ names │ scitypes │ types │
├─────────┼────────────┼─────────┤
│ Crim │ Continuous │ Float64 │
│ Zn │ Continuous │ Float64 │
│ Indus │ Continuous │ Float64 │
│ NOx │ Continuous │ Float64 │
│ Rm │ Continuous │ Float64 │
│ Age │ Continuous │ Float64 │
│ Dis │ Continuous │ Float64 │
│ Rad │ Continuous │ Float64 │
│ Tax │ Continuous │ Float64 │
│ PTRatio │ Continuous │ Float64 │
│ Black │ Continuous │ Float64 │
│ LStat │ Continuous │ Float64 │
└─────────┴────────────┴─────────┘
We can also describe the data frame:
DataFrames.describe(features)
12×7 DataFrame
Row │ variable mean min median max nmissing eltype
│ Symbol Float64 Float64 Float64 Float64 Int64 DataType
─────┼──────────────────────────────────────────────────────────────────────────
1 │ Crim 3.61352 0.00632 0.25651 88.9762 0 Float64
2 │ Zn 11.3636 0.0 0.0 100.0 0 Float64
3 │ Indus 11.1368 0.46 9.69 27.74 0 Float64
4 │ NOx 0.554695 0.385 0.538 0.871 0 Float64
5 │ Rm 6.28463 3.561 6.2085 8.78 0 Float64
6 │ Age 68.5749 2.9 77.5 100.0 0 Float64
7 │ Dis 3.79504 1.1296 3.20745 12.1265 0 Float64
8 │ Rad 9.54941 1.0 5.0 24.0 0 Float64
9 │ Tax 408.237 187.0 330.0 711.0 0 Float64
10 │ PTRatio 18.4555 12.6 19.05 22.0 0 Float64
11 │ Black 356.674 0.32 391.44 396.9 0 Float64
12 │ LStat 12.6531 1.73 11.36 37.97 0 Float64
We do the usual train/test partitioning, which is important so that we can estimate generalisation.
train, test = partition(eachindex(targets), 0.70, rng=StableRNG(52))
([52, 17, 330, 191, 265, 172, 19, 481, 94, 493  …  26, 397, 170, 162, 67, 230, 154], [233, 385, 30, 91, 140, 335, 25, 358, 254, 226  …  450, 506, 151, 246, 484, 93, 361])
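As a quick sanity check on the split (a small addition, not part of the original tutorial):
@show length(train) length(test)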
Let us investigate some of the commonly tweaked LightGBM parameters. We start by looking at a learning curve for the number of boosting rounds.
lgb = LGBMRegressor() # initialise a model with default parameters
mach = machine(lgb, features[train, :], targets[train, 1])
curve = learning_curve(
mach,
resampling=CV(nfolds=5),
range=range(lgb, :num_iterations, lower=2, upper=500),
resolution=60,
measure=rms,
)
using Plots
dims = (600, 370)
plt = plot(curve.parameter_values, curve.measurements, size=dims)
xlabel!("Number of rounds", fontsize=14)
ylabel!("RMSE", fontsize=14)
[Learning curve: RMSE vs. number of boosting rounds (lgbm_hp1.svg)]
It looks like we don't need to go much past 100 boosting rounds.
Since LightGBM is a gradient-based learning method, there is also a learning rate parameter which controls the size of the gradient updates.
Let us look at a learning curve for this parameter too.
lgb = LGBMRegressor() # initialise a model with default parameters
mach = machine(lgb, features[train, :], targets[train, 1])
curve = learning_curve(
mach,
resampling=CV(nfolds=5),
range=range(lgb, :learning_rate, lower=1e-3, upper=1, scale=:log),
resolution=60,
measure=rms,
)
plot(
curve.parameter_values,
curve.measurements,
size=dims,
xscale=:log10,
)
xlabel!("Learning rate (log scale)", fontsize=14)
ylabel!("RMSE", fontsize=14)
[Learning curve: RMSE vs. learning rate, log scale (lgbm_hp2.svg)]
A learning rate near 0.5 seems like a reasonable place. Bear in mind that lower learning rates generally require more boosting iterations to converge, so the default of 100 iterations might not be sufficient at the low end of this range; we leave exploring that as an exercise to the reader. We can still try to tune this parameter, however.
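For readers who want a starting point for that exercise, here is a sketch (not part of the original tutorial) that reuses the learning-curve machinery above with a larger, arbitrarily chosen iteration budget of 500:
lgb_long = LGBMRegressor(num_iterations=500) # larger budget than the default
mach_long = machine(lgb_long, features[train, :], targets[train, 1])
curve_long = learning_curve(
    mach_long,
    resampling=CV(nfolds=5),
    range=range(lgb_long, :learning_rate, lower=1e-3, upper=1, scale=:log),
    resolution=60,
    measure=rms,
)
# small learning rates should now look less bad if convergence was the issue
plot(curve_long.parameter_values, curve_long.measurements, size=dims, xscale=:log10)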
Finally, let us check the minimum number of data points required to produce a leaf in an individual tree. This parameter controls the complexity of the individual learner trees, and too low a value might lead to overfitting.
lgb = LGBMRegressor() # initialise a model with default parameters
mach = machine(lgb, features[train, :], targets[train, 1])
untrained Machine; caches model-specific representations of data
model: LGBMRegressor(boosting = gbdt, …)
args:
1: Source @079 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
2: Source @223 ⏎ AbstractVector{ScientificTypesBase.Continuous}
The dataset is small enough that the lower and upper bounds of this range effectively determine how many leaves each tree can have.
curve = learning_curve(
mach,
resampling=CV(nfolds=5),
range=range(lgb, :min_data_in_leaf, lower=1, upper=50),
measure=rms,
)
plot(curve.parameter_values, curve.measurements, size=dims)
xlabel!("Min data in leaf", fontsize=14)
ylabel!("RMSE", fontsize=14)
[Learning curve: RMSE vs. min data in leaf (lgbm_hp3.svg)]
It does not seem like there is a huge risk of overfitting, and lower values are better for this parameter.
Using the learning curves above, we can select some smallish ranges and jointly search for the best combination of these parameters via cross-validation.
r1 = range(lgb, :num_iterations, lower=50, upper=100)
r2 = range(lgb, :min_data_in_leaf, lower=2, upper=10)
r3 = range(lgb, :learning_rate, lower=1e-1, upper=1e0)
tuned_model = TunedModel(
lgb,
tuning=RandomSearch(),
resampling=CV(rng=StableRNG(123)),
ranges=[r1,r2,r3],
measure=rms,
n=100,
)
mach = machine(tuned_model, features, targets)
fit!(mach, rows=train);
Let's see what the best model parameters, as selected by cross-validation, turned out to be:
best_model = fitted_params(mach).best_model
@show best_model.learning_rate
@show best_model.min_data_in_leaf
@show best_model.num_iterations
best_model.learning_rate = 0.3320079815368962
best_model.min_data_in_leaf = 7
best_model.num_iterations = 82
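Beyond the best model, the tuned machine's report exposes the full search history. A brief sketch (field names are those provided by MLJTuning and may differ between versions):
rep = report(mach)          # report from the TunedModel machine
rep.best_history_entry      # best model together with its CV measurement
length(rep.history)         # number of models evaluated (n=100 in this search)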
Great, and now let's predict using the held-out data (predicting with the TunedModel machine uses the best model trained on all the train data):
predictions = MLJ.predict(mach, rows=test)
rms_score = round(rms(predictions, targets[test, 1]), sigdigits=4)
@show rms_score
rms_score = 3.606
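Finally, the introduction promised an improvement over a linear regressor. As a rough point of comparison, here is a minimal baseline sketch, assuming the MLJLinearModels package is available in the environment (it is not part of the original tutorial code):
LinearRegressor = @load LinearRegressor pkg=MLJLinearModels
lin_mach = machine(LinearRegressor(), features, targets)
fit!(lin_mach, rows=train)
lin_predictions = MLJ.predict(lin_mach, rows=test)
lin_rms = round(rms(lin_predictions, targets[test]), sigdigits=4)
@show lin_rms  # compare with the LightGBM rms_score above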