Combining Resamplers
Resampling methods can be combined sequentially or in parallel, along with a classification model, to yield hybrid or ensemble models that may be even more powerful than the classification model combined with only one of the individual resamplers.
Sequential Resampling
MLJBalancing.jl allows chaining an arbitrary number of resamplers from Imbalance.jl (also called balancers) with classification models from MLJ via BalancedModel. This makes it possible to form hybrid resampling methods, such as SMOTE-Tomek and SMOTE-ENN, that combine oversampling and undersampling in a linear pipeline.
Construct the resampler and classification models
using MLJ
using Imbalance
using MLJBalancing
SMOTE = @load SMOTE pkg=Imbalance verbosity=0
TomekUndersampler = @load TomekUndersampler pkg=Imbalance verbosity=0
LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels verbosity=0
oversampler = SMOTE(k=5, ratios=1.0, rng=42)
undersampler = TomekUndersampler(min_ratios=0.5, rng=42)
logistic_model = LogisticClassifier()
Wrap them all in BalancedModel
balanced_model = BalancedModel(model=logistic_model,
    balancer1=oversampler, balancer2=undersampler)
Here, training data is passed to balancer1 and then to balancer2, and the output of balancer2 is used to train the classification model. During prediction, the resamplers balancer1 and balancer2 are bypassed. At this point, the wrapped components behave like a single MLJ model that can be fit, validated, or fine-tuned like any other.
In general, there can be any number of balancers, and the user can give the balancers arbitrary names.
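As an illustration of the points above, the following is a minimal sketch of fitting and cross-validating a BalancedModel whose balancers carry custom names. The toy data, the keyword names `smote` and `tomek`, and the choice of `log_loss` with 3-fold CV are assumptions made for this example, not part of the original text; it also assumes MLJLinearModels is installed.

```julia
using MLJ, Imbalance, MLJBalancing
using Random

# Hand-built imbalanced toy data (hypothetical): 180 rows of class "a", 20 of class "b"
rng = Xoshiro(0)
X = (x1 = vcat(randn(rng, 180), randn(rng, 20) .+ 2.0),
     x2 = randn(rng, 200))
y = coerce(vcat(fill("a", 180), fill("b", 20)), Multiclass)

SMOTE = @load SMOTE pkg=Imbalance verbosity=0
TomekUndersampler = @load TomekUndersampler pkg=Imbalance verbosity=0
LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels verbosity=0

# The keyword names `smote` and `tomek` are arbitrary labels chosen for this sketch
balanced_model = BalancedModel(model=LogisticClassifier(),
                               smote=SMOTE(k=5, ratios=1.0, rng=42),
                               tomek=TomekUndersampler(min_ratios=0.5, rng=42))

mach = machine(balanced_model, X, y)
fit!(mach, verbosity=0)      # the balancers run here, on the training data only
yhat = predict(mach, X)      # the balancers are bypassed; one prediction per row of X

# Cross-validate the wrapped model as a single unit
evaluate!(mach, resampling=CV(nfolds=3, shuffle=true, rng=1),
          measure=log_loss, verbosity=0)
```

Because resampling happens inside the wrapped model, each cross-validation fold resamples only its own training portion, which avoids leaking synthetic points into the held-out rows.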
Parallel Resampling with Balanced Bagging
MLJBalancing.jl also offers an implementation of bagging over probabilistic classifiers in which the majority class is randomly undersampled T times down to the size of the minority class, and a model is then trained on each of the T undersampled datasets. The predictions are aggregated by averaging. This is offered via BalancedBaggingClassifier and can only be used for binary classification.
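The procedure described above can be sketched in plain Julia, using only the standard library. This is an illustration of the idea and not the package's implementation: the centroid-based toy model, the function names, sampling without replacement, and rounding the class ratio to obtain a default T are all assumptions made for the example.

```julia
using Random
using Statistics

# Toy probabilistic "model" (hypothetical stand-in for a real classifier):
# returns a function giving P(minority) from relative distances to the class means.
function fit_centroid(x::AbstractVector{<:Real}, y::AbstractVector{Bool})
    mu_pos = mean(x[y])       # mean of the minority (true) class
    mu_neg = mean(x[.!y])     # mean of the majority (false) class
    return xi -> begin
        d_pos = abs(xi - mu_pos)
        d_neg = abs(xi - mu_neg)
        d_pos + d_neg == 0 ? 0.5 : d_neg / (d_pos + d_neg)
    end
end

function balanced_bagging(x, y; T::Int=0, rng::AbstractRNG=Random.default_rng())
    minority = findall(y)     # assumes `true` marks the minority class
    majority = findall(.!y)
    # Assumption: when T is not given, use the rounded majority/minority ratio
    T = T == 0 ? round(Int, length(majority) / length(minority)) : T
    models = map(1:T) do _
        # Undersample the majority class down to the size of the minority class
        sub = shuffle(rng, majority)[1:length(minority)]
        idx = vcat(minority, sub)
        fit_centroid(x[idx], y[idx])
    end
    # Aggregate by averaging the T predicted probabilities
    return xi -> mean(m(xi) for m in models)
end

rng = Xoshiro(42)
x = vcat(randn(rng, 90), randn(rng, 10) .+ 3.0)   # 90 majority rows, 10 minority rows
y = vcat(falses(90), trues(10))

predict_minority = balanced_bagging(x, y; rng=rng)
println(predict_minority(3.0))   # probability of the minority class near its mean
println(predict_minority(0.0))   # probability of the minority class near the majority mean
```

Each bag sees every minority row but a different random slice of the majority class, so the averaged prediction draws on most of the majority data without any single model being trained on an imbalanced set.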
BalancedBaggingClassifier(model=nothing, T=0, rng = Random.default_rng(),)
Arguments
model::Probabilistic: A probabilistic classification model that implements the MLJModelInterface
T::Integer=0: The number of bags to be used in the ensemble. If not given, will be set as the ratio between the frequency of the majority and minority classes.
rng::Union{AbstractRNG, Integer}=default_rng(): Either an AbstractRNG object or an Integer seed to be used with Xoshiro
Example
using MLJ
using Imbalance
using MLJBalancing
using Random
X, y = generate_imbalanced_data(100, 5; cat_feats_num_vals = [3, 2],
probs = [0.9, 0.1],
type = "ColTable",
rng=42)
LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels verbosity=0
logistic_model = LogisticClassifier()
bagging_model = BalancedBaggingClassifier(model=logistic_model, T=10, rng=Random.Xoshiro(42))
Now you can fit, predict, cross-validate, and fine-tune it like any other probabilistic MLJ model, where X must be a table input (e.g., a dataframe).
mach = machine(bagging_model, X, y)
fit!(mach)
pred = predict(mach, X)
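A short sketch of working with the resulting probabilistic predictions and of cross-validating the ensemble is given below. The hand-built toy data, the class labels "a" and "b", and the use of `log_loss` with 3-fold CV are assumptions made for illustration; here T is left unset and rng is passed as an integer seed, as the argument descriptions allow.

```julia
using MLJ, MLJBalancing
using Random

# Hand-built imbalanced toy data (hypothetical): 180 rows of class "a", 20 of class "b"
rng = Xoshiro(0)
X = (x1 = vcat(randn(rng, 180), randn(rng, 20) .+ 2.0),
     x2 = randn(rng, 200))
y = coerce(vcat(fill("a", 180), fill("b", 20)), Multiclass)

LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels verbosity=0

# T is omitted, so it is derived from the class ratio; rng is an integer seed
bagging_model = BalancedBaggingClassifier(model=LogisticClassifier(), rng=42)

mach = machine(bagging_model, X, y)
fit!(mach, verbosity=0)

probs  = predict(mach, X)        # probabilistic predictions, averaged over the bags
labels = predict_mode(mach, X)   # most probable class for each row
p_b    = pdf.(probs, "b")        # predicted probability of the minority class "b"

# Cross-validate the ensemble like any other probabilistic MLJ model
evaluate!(mach, resampling=CV(nfolds=3, shuffle=true, rng=1),
          measure=log_loss, verbosity=0)
```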