LGBMRegressor
A model type for constructing a LightGBM regressor, based on LightGBM.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
LGBMRegressor = @load LGBMRegressor pkg=LightGBM
Do `model = LGBMRegressor()` to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in `LGBMRegressor(boosting=...)`.
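For instance, a minimal sketch overriding a few of the defaults listed under "Hyper-parameters" below (the values chosen here are purely illustrative):

```julia
## assumes LGBMRegressor has been loaded as shown above
model = LGBMRegressor(
    num_iterations = 200,  ## boosting rounds (default 100)
    learning_rate = 0.05,  ## shrinkage per round (default 0.1)
    num_leaves = 63,       ## maximum leaves per tree (default 31)
)
```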
LightGBM, short for light gradient-boosting machine, is a framework for gradient boosting based on decision tree algorithms and used for classification, regression and other machine learning tasks, with a focus on performance and scalability. This model in particular is used for various types of regression tasks.
Training data
In MLJ or MLJBase, bind an instance `model` to data with

mach = machine(model, X, y)

Here:

- `X` is any table of input features (eg, a `DataFrame`) whose columns are of scitype `Continuous`; check the column scitypes with `schema(X)`. Alternatively, `X` is any `AbstractMatrix` with `Continuous` elements; check the scitype with `scitype(X)`.
- `y` is a vector of targets whose items are of scitype `Continuous`. Check the scitype with `scitype(y)`.
Train the machine using `fit!(mach, rows=...)`.
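As a minimal sketch (the column names and data here are hypothetical; any Tables.jl-compatible table with `Continuous` columns works), one way to prepare data, bind a machine, and train it:

```julia
using MLJ, DataFrames
## assumes LGBMRegressor = @load LGBMRegressor pkg=LightGBM has been run

## hypothetical data with Continuous columns
X = DataFrame(x1 = rand(100), x2 = rand(100))
y = rand(100)

schema(X)   ## confirm every column has scitype Continuous
## a Count or Textual column would first be coerced, eg:
## X = coerce(X, :x1 => Continuous)

model = LGBMRegressor()
mach = machine(model, X, y)
fit!(mach, rows = 1:80)   ## train on the first 80 rows only
```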
Operations
- `predict(mach, Xnew)`: return predictions of the target given new features `Xnew`, which should have the same scitype as `X` above.
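Continuing the sketch above, where `Xnew` is hypothetical new data of the same scitype as the training table:

```julia
Xnew = DataFrame(x1 = rand(10), x2 = rand(10))   ## hypothetical new features
yhat = predict(mach, Xnew)                       ## vector of point predictions
```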
Hyper-parameters
See https://lightgbm.readthedocs.io/en/v3.3.5/Parameters.html.
Currently, the following parameters and their defaults are supported:
- `boosting::String = "gbdt"`
- `num_iterations::Int = 100::(_ >= 0)`
- `learning_rate::Float64 = 0.1::(_ > 0.)`
- `num_leaves::Int = 31::(1 < _ <= 131072)`
- `max_depth::Int = -1`
- `tree_learner::String = "serial"`
- `histogram_pool_size::Float64 = -1.0`
- `min_data_in_leaf::Int = 20::(_ >= 0)`
- `min_sum_hessian_in_leaf::Float64 = 1e-3::(_ >= 0.0)`
- `max_delta_step::Float64 = 0.0`
- `lambda_l1::Float64 = 0.0::(_ >= 0.0)`
- `lambda_l2::Float64 = 0.0::(_ >= 0.0)`
- `min_gain_to_split::Float64 = 0.0::(_ >= 0.0)`
- `feature_fraction::Float64 = 1.0::(0.0 < _ <= 1.0)`
- `feature_fraction_bynode::Float64 = 1.0::(0.0 < _ <= 1.0)`
- `feature_fraction_seed::Int = 2`
- `bagging_fraction::Float64 = 1.0::(0.0 < _ <= 1.0)`
- `pos_bagging_fraction::Float64 = 1.0::(0.0 < _ <= 1.0)`
- `neg_bagging_fraction::Float64 = 1.0::(0.0 < _ <= 1.0)`
- `bagging_freq::Int = 0::(_ >= 0)`
- `bagging_seed::Int = 3`
- `early_stopping_round::Int = 0`
- `extra_trees::Bool = false`
- `extra_seed::Int = 6`
- `max_bin::Int = 255::(_ > 1)`
- `bin_construct_sample_cnt = 200000::(_ > 0)`
- `drop_rate::Float64 = 0.1::(0.0 <= _ <= 1.0)`
- `max_drop::Int = 50`
- `skip_drop::Float64 = 0.5::(0.0 <= _ <= 1.0)`
- `xgboost_dart_mode::Bool = false`
- `uniform_drop::Bool = false`
- `drop_seed::Int = 4`
- `top_rate::Float64 = 0.2::(0.0 <= _ <= 1.0)`
- `other_rate::Float64 = 0.1::(0.0 <= _ <= 1.0)`
- `min_data_per_group::Int = 100::(_ > 0)`
- `max_cat_threshold::Int = 32::(_ > 0)`
- `cat_l2::Float64 = 10.0::(_ >= 0)`
- `cat_smooth::Float64 = 10.0::(_ >= 0)`
- `objective::String = "regression"`
- `categorical_feature::Vector{Int} = Vector{Int}()`
- `data_random_seed::Int = 1`
- `is_sparse::Bool = true`
- `is_unbalance::Bool = false`
- `boost_from_average::Bool = true`
- `scale_pos_weight::Float64 = 1.0`
- `use_missing::Bool = true`
- `linear_tree::Bool = false`
- `feature_pre_filter::Bool = true`
- `alpha::Float64 = 0.9::(_ > 0.0)`
- `metric::Vector{String} = ["l2"]`
- `metric_freq::Int = 1::(_ > 0)`
- `is_provide_training_metric::Bool = false`
- `eval_at::Vector{Int} = Vector{Int}([1, 2, 3, 4, 5])::(all(_ .> 0))`
- `num_machines::Int = 1::(_ > 0)`
- `num_threads::Int = 0::(_ >= 0)`
- `local_listen_port::Int = 12400::(_ > 0)`
- `time_out::Int = 120::(_ > 0)`
- `machine_list_file::String = ""`
- `save_binary::Bool = false`
- `device_type::String = "cpu"`
- `gpu_use_dp::Bool = false`
- `gpu_platform_id::Int = -1`
- `gpu_device_id::Int = -1`
- `num_gpu::Int = 1`
- `force_col_wise::Bool = false`
- `force_row_wise::Bool = false`
- `truncate_booster::Bool = true`
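Because `LGBMRegressor` is an ordinary MLJ model, these hyper-parameters can be tuned with MLJ's generic tuning tooling. A sketch, assuming `X` and `y` as in "Training data" above, with purely illustrative bounds:

```julia
using MLJ

model = LGBMRegressor()

## log-spaced grid over the learning rate; bounds are illustrative
r = range(model, :learning_rate, lower = 0.01, upper = 0.3, scale = :log)

tuned = TunedModel(
    model = model,
    range = r,
    tuning = Grid(resolution = 10),
    resampling = CV(nfolds = 5),
    measure = rms,
)
mach = machine(tuned, X, y) |> fit!
fitted_params(mach).best_model   ## the LGBMRegressor with the best learning_rate
```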
Fitted parameters
The fields of `fitted_params(mach)` are:

- `fitresult`: Fitted model information; contains a `LGBMRegression` object, an empty vector, and the regressor with all its parameters.
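For example, to inspect the underlying fitted booster (a hedged sketch, assuming `mach` has been trained as above):

```julia
fp = fitted_params(mach)
fp.fitresult   ## contains the wrapped LightGBM.LGBMRegression object
```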
Report
The fields of `report(mach)` are:

- `training_metrics`: A dictionary containing all training metrics.
- `importance`: A namedtuple containing:
  - `gain`: The total gain of each split used by the model.
  - `split`: The number of times each feature is used by the model.
Examples
```julia
using DataFrames
using MLJ

## load the model
LGBMRegressor = @load LGBMRegressor pkg=LightGBM

X, y = @load_boston ## a table and a vector
X = DataFrame(X)
train, test = partition(collect(eachindex(y)), 0.70, shuffle=true)
first(X, 3)

lgb = LGBMRegressor() ## initialise a model with default params
mach = machine(lgb, X[train, :], y[train]) |> fit!
predict(mach, X[test, :])

## access feature importances
model_report = report(mach)
gain_importance = model_report.importance.gain
split_importance = model_report.importance.split
```
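One might extend the example to quantify out-of-sample accuracy, eg with the root-mean-square error (`rms` is exported by MLJ):

```julia
yhat = predict(mach, X[test, :])
rms(yhat, y[test])   ## RMSE on the holdout set
```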
See also LightGBM.jl and the unwrapped model type `LightGBM.LGBMRegression`.