Combining Resamplers
Resampling methods can be combined sequentially or in parallel, along with a classification model, to yield hybrid or ensemble models that may be even more powerful than using the classification model with only one of the individual resamplers.
Sequential Resampling
MLJBalancing.jl
allows chaining an arbitrary number of resamplers from Imbalance.jl
(also called balancers) with classification models from MLJ
via BalancedModel
. This makes it possible to use BalancedModel
to form hybrid resampling methods that combine oversampling and under-sampling methods in a linear pipeline such as SMOTE-Tomek
and SMOTE-ENN
.
Construct the resampler and classification models
SMOTE = @load SMOTE pkg=Imbalance verbosity=0
TomekUndersampler = @load TomekUndersampler pkg=Imbalance verbosity=0
LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels verbosity=0
oversampler = SMOTE(k=5, ratios=1.0, rng=42)
undersampler = TomekUndersampler(min_ratios=0.5, rng=42)
logistic_model = LogisticClassifier()
Wrap them all in BalancedModel
balanced_model = BalancedModel(model=logistic_model,
balancer1=oversampler, balancer2=undersampler)
Here training data will be passed to balancer1
then balancer2
, whose output is used to train the classifier model
. In prediction, the resamplers balancer1
and blancer2
are bypassed and in general. At this point, they behave like one single MLJ
model that can be fit, validated or fine-tuned like any other.
In general, there can be any number of balancers, and the user can give the balancers arbitrary names.
Parallel Resampling with Balanced Bagging
MLJBalancing.jl
also offers an implementation of bagging over probabilistic classifiers where the majority class is randomly undersampled T
times down to the size of the minority class then a model is trained on each of the T
undersampled datasets. The predictions are then aggregated by averaging. This is offered via BalancedBaggingClassifier
and can be only used for binary classification.
BalancedBaggingClassifier(model=nothing, T=0, rng = Random.default_rng(),)
Arguments
model::Probabilistic
: A probabilistic classification model that implements theMLJModelInterface
T::Integer=0
: The number of bags to be used in the ensemble. If not given, will be set as the ratio between the frequency of the majority and minority classes.rng::Union{AbstractRNG, Integer}=default_rng()
: Either anAbstractRNG
object or anInteger
seed to be used with Xoshiro
Example
using MLJ
using Imbalance
using MLJBalancing
X, y = generate_imbalanced_data(100, 5; cat_feats_num_vals = [3, 2],
probs = [0.9, 0.1],
type = "ColTable",
rng=42)
LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels verbosity=0
logistic_model = LogisticClassifier()
bagging_model = BalancedBaggingClassifier(model=logistic_model, T=10, rng=Random.Xoshiro(42))
Now you can fit, predict, cross-validate and finetune it like any other probabilistic MLJ model where X
must be a table input (e.g., a dataframe).
mach = machine(bagging_model, X, y)
fit!(mach)
pred = predict(mach, X)