Oversampling Algorithms
The following table lists the supported oversampling algorithms, whether each repeats existing data or generates synthetic data, and the supported data types.
Oversampling Method | Mechanism | Supported Data Types
---|---|---
Random Oversampler | Repeat existing data | Continuous and/or nominal
Random Walk Oversampler | Generate synthetic data | Continuous and/or nominal
ROSE | Generate synthetic data | Continuous
SMOTE | Generate synthetic data | Continuous
Borderline SMOTE1 | Generate synthetic data | Continuous
SMOTE-N | Generate synthetic data | Nominal
SMOTE-NC | Generate synthetic data | Continuous and nominal
Random Oversampler
Imbalance.random_oversample
— Function

random_oversample(
X, y;
ratios=1.0, rng=default_rng(),
try_preserve_type=true
)
Description
Naively oversample a dataset by randomly repeating existing observations with replacement.
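Conceptually, random oversampling amounts to sampling row indices with replacement and appending the chosen rows. A minimal plain-Julia sketch on toy data (an illustration of the idea, not the package's implementation):

```julia
using Random

# toy minority-class data: 4 observations, 2 features
X = [1.0 2.0; 3.0 4.0; 5.0 6.0; 7.0 8.0]

# randomly repeat existing rows (with replacement) to add 3 observations
rng = Xoshiro(42)
inds = rand(rng, 1:size(X, 1), 3)
Xover = vcat(X, X[inds, :])

@assert size(Xover) == (7, 2)
# every appended row is an exact copy of an existing one
@assert all(any(Xover[i, :] == X[j, :] for j in 1:4) for i in 5:7)
```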
Positional Arguments
`X`: A matrix of real numbers or a table with element scitypes that subtype `Union{Finite, Infinite}`. Elements in nominal columns should subtype `Finite` (i.e., have scitype `OrderedFactor` or `Multiclass`) and elements in continuous columns should subtype `Infinite` (i.e., have scitype `Count` or `Continuous`).

`y`: An abstract vector of labels (e.g., strings) that correspond to the observations in `X`.
Keyword Arguments
`ratios=1.0`: A parameter that controls the amount of oversampling to be done for each class.
- Can be a float; in this case, each class will be oversampled to the size of the majority class times the float. By default, all classes are oversampled to the size of the majority class.
- Can be a dictionary mapping each class label to the float ratio for that class.
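The per-class target sizes implied by `ratios` can be computed as follows (a plain-Julia sketch of the rule just described, using the hypothetical class counts from the example below):

```julia
# class counts as in the example below
counts = Dict(0 => 48, 1 => 19, 2 => 33)
majority = maximum(values(counts))          # 48

# ratios as a single float: every class is raised to ratio * majority
ratio = 1.0
targets_float = Dict(c => round(Int, ratio * majority) for c in keys(counts))
@assert all(==(48), values(targets_float))

# ratios as a dictionary: per-class targets
ratios = Dict(0 => 1.0, 1 => 0.9, 2 => 0.8)
targets_dict = Dict(c => round(Int, r * majority) for (c, r) in ratios)
@assert targets_dict == Dict(0 => 48, 1 => 43, 2 => 38)
```

Note that these targets (48, 43, 38) match the `checkbalance(yover)` output shown in the example.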
`rng::Union{AbstractRNG, Integer}=default_rng()`: Either an `AbstractRNG` object or an `Integer` seed to be used with `Xoshiro` if the Julia `VERSION` supports it. Otherwise, `MersenneTwister` is used.
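Both forms of `rng` behave equivalently; an integer seed gives reproducible results (a small sketch using only the `Random` standard library):

```julia
using Random

# the same integer seed always yields the same stream of numbers
@assert rand(Xoshiro(42), 3) == rand(Xoshiro(42), 3)

# an AbstractRNG object may also be constructed and passed directly
rng = Xoshiro(42)
x = rand(rng, 3)
@assert length(x) == 3
```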
`try_preserve_type::Bool=true`: When `true`, the function will try not to change the type of the input table (e.g., `DataFrame`). However, for some tables this may not succeed, in which case the returned table will be a column table (a named tuple of vectors). This parameter is ignored if the input is a matrix.
Returns
`Xover`: A matrix or table (depending on whether the input `X` is a matrix or table, respectively) that includes the original data and the new observations generated by oversampling.

`yover`: An abstract vector of labels corresponding to `Xover`.
Example
using Imbalance
# set probability of each class
class_probs = [0.5, 0.2, 0.3]
num_rows, num_continuous_feats = 100, 5
# generate a table and categorical vector accordingly
X, y = generate_imbalanced_data(num_rows, num_continuous_feats;
class_probs, rng=42)
julia> Imbalance.checkbalance(y)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (39.6%)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 33 (68.8%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%)
# apply random oversampling
Xover, yover = random_oversample(X, y; ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
julia> Imbalance.checkbalance(yover)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 38 (79.2%)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 43 (89.6%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%)
MLJ Model Interface
Simply pass the keyword arguments when initializing the `RandomOversampler` model, and pass the positional arguments `X, y` to the `transform` method.
using MLJ
RandomOversampler = @load RandomOversampler pkg=Imbalance
# Wrap the model in a machine
oversampler = RandomOversampler(ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
mach = machine(oversampler)
# Provide the data to transform (there is nothing to fit)
Xover, yover = transform(mach, X, y)
You can read more about this MLJ
interface by accessing it from MLJ's model browser.
TableTransforms Interface
This interface assumes that the input is a single table `Xy` and that `y` is one of its columns. Hence, an integer `y_ind` indicating which column is `y` must be passed to the constructor, followed by the other keyword arguments. Only `Xy` is provided when applying the transform.
using Imbalance
using Imbalance.TableTransforms
# Generate imbalanced data
num_rows = 100
num_features = 5
y_ind = 3
Xy, _ = generate_imbalanced_data(num_rows, num_features;
class_probs=[0.5, 0.2, 0.3], insert_y=y_ind, rng=42)
# Initiate Random Oversampler model
oversampler = RandomOversampler(y_ind; ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
Xyover = Xy |> oversampler
# equivalently, if the TableTransforms API is used directly
Xyover, cache = TableTransforms.apply(oversampler, Xy)
The `reapply(oversampler, Xy, cache)` method from `TableTransforms` simply falls back to `apply(oversampler, Xy)`, and the `revert(oversampler, Xy, cache)` method reverts the transform by removing the oversampled observations from the table.
Illustration
A full basic example along with an animation can be found here. You may find more practical examples in the tutorial section which also explains running code on Google Colab.
Random Walk Oversampler
Imbalance.random_walk_oversample
— Function

random_walk_oversample(
X, y, cat_inds;
ratios=1.0, rng=default_rng(),
try_preserve_type=true
)
Description
Oversamples a dataset using random walk oversampling as presented in [1].
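Roughly, random walk oversampling perturbs a real observation's continuous features by a noise step scaled per feature, while nominal features are drawn from values observed in the class. The sketch below illustrates the continuous part only, with Gaussian steps scaled by the per-feature standard deviation; treat it as an illustration of the idea rather than the exact formula of [1]:

```julia
using Random, Statistics

rng = Xoshiro(42)

# toy minority-class data: 5 observations, 2 continuous features
X = [1.0 10.0; 1.2 9.5; 0.8 10.5; 1.1 9.8; 0.9 10.2]
σ = vec(std(X, dims=1))   # per-feature spread used to scale the step

# take a random walk step away from a randomly chosen real observation
base = X[rand(rng, 1:size(X, 1)), :]
synthetic = base .+ randn(rng, 2) .* σ

@assert length(synthetic) == 2
```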
Positional Arguments
`X`: A matrix of floats or a table with element scitypes that subtype `Union{Finite, Infinite}`. Elements in nominal columns should subtype `Finite` (i.e., have scitype `OrderedFactor` or `Multiclass`) and elements in continuous columns should subtype `Infinite` (i.e., have scitype `Count` or `Continuous`).

`y`: An abstract vector of labels (e.g., strings) that correspond to the observations in `X`.

`cat_inds::AbstractVector{<:Int}`: A vector of the indices of the nominal features. Supplied only if `X` is a matrix. Otherwise, they are inferred from the table's scitypes.
Keyword Arguments
`ratios=1.0`: A parameter that controls the amount of oversampling to be done for each class.
- Can be a float; in this case, each class will be oversampled to the size of the majority class times the float. By default, all classes are oversampled to the size of the majority class.
- Can be a dictionary mapping each class label to the float ratio for that class.

`rng::Union{AbstractRNG, Integer}=default_rng()`: Either an `AbstractRNG` object or an `Integer` seed to be used with `Xoshiro` if the Julia `VERSION` supports it. Otherwise, `MersenneTwister` is used.

`try_preserve_type::Bool=true`: When `true`, the function will try not to change the type of the input table (e.g., `DataFrame`). However, for some tables this may not succeed, in which case the returned table will be a column table (a named tuple of vectors). This parameter is ignored if the input is a matrix.
Returns
`Xover`: A matrix or table (depending on whether the input `X` is a matrix or table, respectively) that includes the original data and the new observations generated by oversampling.

`yover`: An abstract vector of labels corresponding to `Xover`.
Example
using Imbalance
using ScientificTypes
# set probability of each class
class_probs = [0.5, 0.2, 0.3]
num_rows = 100
num_continuous_feats = 3
# want two categorical features with three and two possible values respectively
num_vals_per_category = [3, 2]
# generate a table and categorical vector accordingly
X, y = generate_imbalanced_data(num_rows, num_continuous_feats;
class_probs, num_vals_per_category, rng=42)
julia> Imbalance.checkbalance(y)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (39.6%)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 33 (68.8%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%)
julia> ScientificTypes.schema(X).scitypes
(Continuous, Continuous, Continuous, Continuous, Continuous)
# coerce nominal columns to a finite scitype (multiclass or ordered factor)
X = coerce(X, :Column4=>Multiclass, :Column5=>Multiclass)
# apply random walk oversampling
Xover, yover = random_walk_oversample(X, y;
ratios = Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng = 42)
julia> Imbalance.checkbalance(yover)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 38 (79.2%)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 43 (89.6%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%)
MLJ Model Interface
Simply pass the keyword arguments when initializing the `RandomWalkOversampler` model, and pass the positional arguments (excluding `cat_inds`) to the `transform` method.
using MLJ
RandomWalkOversampler = @load RandomWalkOversampler pkg=Imbalance
# Wrap the model in a machine
oversampler = RandomWalkOversampler(ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
mach = machine(oversampler)
# Provide the data to transform (there is nothing to fit)
Xover, yover = transform(mach, X, y)
You can read more about this MLJ
interface by accessing it from MLJ's model browser. Note that only Table
input is supported by the MLJ interface for this method.
TableTransforms Interface
This interface assumes that the input is a single table `Xy` and that `y` is one of its columns. Hence, an integer `y_ind` indicating which column is `y` must be passed to the constructor, followed by the other keyword arguments. Only `Xy` is provided when applying the transform.
using Imbalance
using ScientificTypes
using Imbalance.TableTransforms
# Generate imbalanced data
num_rows = 100
num_continuous_feats = 3
y_ind = 2
# generate a table and categorical vector accordingly
Xy, _ = generate_imbalanced_data(num_rows, num_continuous_feats; insert_y=y_ind,
class_probs= [0.5, 0.2, 0.3], num_vals_per_category=[3, 2],
rng=42)
# Table must have only finite or continuous scitypes
Xy = coerce(Xy, :Column2=>Multiclass, :Column5=>Multiclass, :Column6=>Multiclass)
# Initiate Random Walk Oversampler model
oversampler = RandomWalkOversampler(y_ind;
ratios=Dict(1=>1.0, 2=> 0.9, 3=>0.9), rng=42)
Xyover = Xy |> oversampler
# equivalently, if the TableTransforms API is used directly
Xyover, cache = TableTransforms.apply(oversampler, Xy)
Illustration
A full basic example along with an animation can be found here. You may find more practical examples in the tutorial section which also explains running code on Google Colab.
References
[1] Zhang, H., & Li, M. (2014). RWO-Sampling: A random walk over-sampling approach to imbalanced data classification. Information Fusion, 25, 4-20.
ROSE
Imbalance.rose
— Function

rose(
X, y;
s=1.0, ratios=1.0, rng=default_rng(),
try_preserve_type=true
)
Description
Oversamples a dataset using ROSE
(Random Oversampling Examples) algorithm to correct for class imbalance, as presented in [1].
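In essence, ROSE draws each synthetic point from a Gaussian kernel centered at a randomly chosen real observation, with bandwidth scaled by `s`. A conceptual sketch on toy data (the bandwidth below is a simplified stand-in for the smoothing matrix computed as in [1]):

```julia
using Random, Statistics

rng = Xoshiro(42)
s = 0.3

X = [1.0 2.0; 1.5 2.5; 0.5 1.5; 1.2 2.2]   # toy class data
n, d = size(X)

# bandwidth proportional to s (simplified; ROSE uses a
# Silverman-style smoothing matrix)
h = s .* vec(std(X, dims=1))

# sample from a Gaussian centered at a random real observation
center = X[rand(rng, 1:n), :]
synthetic = center .+ h .* randn(rng, d)

@assert length(synthetic) == d
```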
Positional Arguments
`X`: A matrix or table of floats where each row is an observation from the dataset.

`y`: An abstract vector of labels (e.g., strings) that correspond to the observations in `X`.
Keyword Arguments
`s::float=1.0`: A parameter that proportionally controls the bandwidth of the Gaussian kernel.

`ratios=1.0`: A parameter that controls the amount of oversampling to be done for each class.
- Can be a float; in this case, each class will be oversampled to the size of the majority class times the float. By default, all classes are oversampled to the size of the majority class.
- Can be a dictionary mapping each class label to the float ratio for that class.

`rng::Union{AbstractRNG, Integer}=default_rng()`: Either an `AbstractRNG` object or an `Integer` seed to be used with `Xoshiro` if the Julia `VERSION` supports it. Otherwise, `MersenneTwister` is used.

`try_preserve_type::Bool=true`: When `true`, the function will try not to change the type of the input table (e.g., `DataFrame`). However, for some tables this may not succeed, in which case the returned table will be a column table (a named tuple of vectors). This parameter is ignored if the input is a matrix.
Returns
`Xover`: A matrix or table (depending on whether the input `X` is a matrix or table, respectively) that includes the original data and the new observations generated by oversampling.

`yover`: An abstract vector of labels corresponding to `Xover`.
Example
using Imbalance
# set probability of each class
class_probs = [0.5, 0.2, 0.3]
num_rows, num_continuous_feats = 100, 5
# generate a table and categorical vector accordingly
X, y = generate_imbalanced_data(num_rows, num_continuous_feats;
class_probs, rng=42)
julia> Imbalance.checkbalance(y)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (39.6%)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 33 (68.8%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%)
# apply ROSE
Xover, yover = rose(X, y; s=0.3, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
julia> Imbalance.checkbalance(yover)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 38 (79.2%)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 43 (89.6%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%)
MLJ Model Interface
Simply pass the keyword arguments when initializing the `ROSE` model, and pass the positional arguments `X, y` to the `transform` method.
using MLJ
ROSE = @load ROSE pkg=Imbalance
# Wrap the model in a machine
oversampler = ROSE(s=0.3, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
mach = machine(oversampler)
# Provide the data to transform (there is nothing to fit)
Xover, yover = transform(mach, X, y)
You can read more about this MLJ
interface by accessing it from MLJ's model browser.
TableTransforms Interface
This interface assumes that the input is a single table `Xy` and that `y` is one of its columns. Hence, an integer `y_ind` indicating which column is `y` must be passed to the constructor, followed by the other keyword arguments. Only `Xy` is provided when applying the transform.
using Imbalance
using Imbalance.TableTransforms
# Generate imbalanced data
num_rows = 200
num_features = 5
y_ind = 3
Xy, _ = generate_imbalanced_data(num_rows, num_features;
class_probs=[0.5, 0.2, 0.3], insert_y=y_ind, rng=42)
# Initiate ROSE model
oversampler = ROSE(y_ind; s=0.3, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
Xyover = Xy |> oversampler
# equivalently, if the TableTransforms API is used directly
Xyover, cache = TableTransforms.apply(oversampler, Xy)
The `reapply(oversampler, Xy, cache)` method from `TableTransforms` simply falls back to `apply(oversampler, Xy)`, and the `revert(oversampler, Xy, cache)` method reverts the transform by removing the oversampled observations from the table.
Illustration
A full basic example along with an animation can be found here. You may find more practical examples in the tutorial section which also explains running code on Google Colab.
References
[1] G Menardi, N. Torelli, “Training and assessing classification rules with imbalanced data,” Data Mining and Knowledge Discovery, 28(1), pp.92-122, 2014.
SMOTE
Imbalance.smote
— Function

smote(
X, y;
k=5, ratios=1.0, rng=default_rng(),
try_preserve_type=true
)
Description
Oversamples a dataset using SMOTE
(Synthetic Minority Oversampling Technique) algorithm to correct for class imbalance, as presented in [1].
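At its core, SMOTE picks a random minority point, one of its `k` nearest neighbors from the same class, and interpolates between the two at a random fraction. A minimal self-contained sketch of that step (brute-force neighbor search on toy data, not the package's implementation):

```julia
using Random, LinearAlgebra

rng = Xoshiro(42)

X = [1.0 1.0; 2.0 2.0; 3.0 3.0; 1.5 1.5]   # toy minority class
k = 2

i = rand(rng, 1:size(X, 1))
x = X[i, :]

# brute-force k nearest neighbors of x (excluding itself)
dists = [norm(X[j, :] - x) for j in 1:size(X, 1)]
dists[i] = Inf
neighbors = partialsortperm(dists, 1:k)

# interpolate at a random fraction towards a random neighbor
nb = X[rand(rng, neighbors), :]
synthetic = x .+ rand(rng) .* (nb .- x)

@assert length(synthetic) == 2
# the synthetic point lies on the segment between x and nb
@assert all(min.(x, nb) .<= synthetic .<= max.(x, nb))
```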
Positional Arguments
`X`: A matrix or table of floats where each row is an observation from the dataset.

`y`: An abstract vector of labels (e.g., strings) that correspond to the observations in `X`.
Keyword Arguments
`k::Integer=5`: Number of nearest neighbors to consider in the algorithm. Should be within the range `0 < k < n`, where `n` is the number of observations in the smallest class.

`ratios=1.0`: A parameter that controls the amount of oversampling to be done for each class.
- Can be a float; in this case, each class will be oversampled to the size of the majority class times the float. By default, all classes are oversampled to the size of the majority class.
- Can be a dictionary mapping each class label to the float ratio for that class.

`rng::Union{AbstractRNG, Integer}=default_rng()`: Either an `AbstractRNG` object or an `Integer` seed to be used with `Xoshiro` if the Julia `VERSION` supports it. Otherwise, `MersenneTwister` is used.

`try_preserve_type::Bool=true`: When `true`, the function will try not to change the type of the input table (e.g., `DataFrame`). However, for some tables this may not succeed, in which case the returned table will be a column table (a named tuple of vectors). This parameter is ignored if the input is a matrix.
Returns
`Xover`: A matrix or table (depending on whether the input `X` is a matrix or table, respectively) that includes the original data and the new observations generated by oversampling.

`yover`: An abstract vector of labels corresponding to `Xover`.
Example
using Imbalance
# set probability of each class
class_probs = [0.5, 0.2, 0.3]
num_rows, num_continuous_feats = 100, 5
# generate a table and categorical vector accordingly
X, y = generate_imbalanced_data(num_rows, num_continuous_feats;
class_probs, rng=42)
julia> Imbalance.checkbalance(y)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (39.6%)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 33 (68.8%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%)
# apply SMOTE
Xover, yover = smote(X, y; k = 5, ratios = Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng = 42)
julia> Imbalance.checkbalance(yover)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 38 (79.2%)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 43 (89.6%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%)
MLJ Model Interface
Simply pass the keyword arguments when initializing the `SMOTE` model, and pass the positional arguments `X, y` to the `transform` method.
using MLJ
SMOTE = @load SMOTE pkg=Imbalance
# Wrap the model in a machine
oversampler = SMOTE(k=5, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
mach = machine(oversampler)
# Provide the data to transform (there is nothing to fit)
Xover, yover = transform(mach, X, y)
You can read more about this MLJ
interface by accessing it from MLJ's model browser.
TableTransforms Interface
This interface assumes that the input is a single table `Xy` and that `y` is one of its columns. Hence, an integer `y_ind` indicating which column is `y` must be passed to the constructor, followed by the other keyword arguments. Only `Xy` is provided when applying the transform.
using Imbalance
using Imbalance.TableTransforms
# Generate imbalanced data
num_rows = 200
num_features = 5
y_ind = 3
Xy, _ = generate_imbalanced_data(num_rows, num_features;
class_probs=[0.5, 0.2, 0.3], insert_y=y_ind, rng=42)
# Initiate SMOTE model
oversampler = SMOTE(y_ind; k=5, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
Xyover = Xy |> oversampler
# equivalently, if the TableTransforms API is used directly
Xyover, cache = TableTransforms.apply(oversampler, Xy)
The `reapply(oversampler, Xy, cache)` method from `TableTransforms` simply falls back to `apply(oversampler, Xy)`, and the `revert(oversampler, Xy, cache)` method reverts the transform by removing the oversampled observations from the table.
Illustration
A full basic example along with an animation can be found here. You may find more practical examples in the tutorial section which also explains running code on Google Colab.
References
[1] N. V. Chawla, K. W. Bowyer, L. O.Hall, W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of artificial intelligence research, 321-357, 2002.
Borderline SMOTE1
Imbalance.borderline_smote1
— Function

borderline_smote1(
X, y;
m=5, k=5, ratios=1.0, rng=default_rng(),
try_preserve_type=true, verbosity=1
)
Description
Oversamples a dataset using the BorderlineSMOTE1 algorithm to correct for class imbalance, as presented in [1].
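The distinguishing step relative to plain SMOTE is the "danger" check: a minority point participates in oversampling only if at least half, but not all, of its `m` nearest neighbors belong to other classes. A sketch of that condition alone, on hypothetical neighbor labels:

```julia
# a point with label `label` is "in danger" when at least half (but not
# all) of its m nearest neighbors have a different label
is_danger(neighbor_labels, label, m) =
    m / 2 <= count(!=(label), neighbor_labels) < m

@assert is_danger([:a, :b, :b], :a, 3)    # most neighbors differ: borderline
@assert !is_danger([:a, :a, :b], :a, 3)   # mostly same class: "safe"
@assert !is_danger([:b, :b, :b], :a, 3)   # all differ: treated as noise
```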
Positional Arguments
`X`: A matrix or table of floats where each row is an observation from the dataset.

`y`: An abstract vector of labels (e.g., strings) that correspond to the observations in `X`.
Keyword Arguments
`m::Integer=5`: The number of neighbors to consider while checking the BorderlineSMOTE1 condition. Should be within the range `0 < m < N`, where `N` is the number of observations in the data. It will be automatically set to `N-1` if `N ≤ m`.

`k::Integer=5`: Number of nearest neighbors to consider in the SMOTE part of the algorithm. Should be within the range `0 < k < n`, where `n` is the number of observations in the smallest class. It will be automatically set to `l-1` for any class with `l` points where `l ≤ k`.

`ratios=1.0`: A parameter that controls the amount of oversampling to be done for each class.
- Can be a float; in this case, each class will be oversampled to the size of the majority class times the float. By default, all classes are oversampled to the size of the majority class.
- Can be a dictionary mapping each class label to the float ratio for that class.

`rng::Union{AbstractRNG, Integer}=default_rng()`: Either an `AbstractRNG` object or an `Integer` seed to be used with `Xoshiro` if the Julia `VERSION` supports it. Otherwise, `MersenneTwister` is used.

`try_preserve_type::Bool=true`: When `true`, the function will try not to change the type of the input table (e.g., `DataFrame`). However, for some tables this may not succeed, in which case the returned table will be a column table (a named tuple of vectors). This parameter is ignored if the input is a matrix.

`verbosity::Integer=1`: When higher than `0`, info about the points that will participate in oversampling is logged.
Returns
`Xover`: A matrix or table (depending on whether the input `X` is a matrix or table, respectively) that includes the original data and the new observations generated by oversampling.

`yover`: An abstract vector of labels corresponding to `Xover`.
Example
using Imbalance
# set probability of each class
class_probs = [0.5, 0.2, 0.3]
num_rows, num_continuous_feats = 1000, 5
# generate a table and categorical vector accordingly
X, y = generate_imbalanced_data(num_rows, num_continuous_feats;
stds=[0.1 0.1 0.1], min_sep=0.01, class_probs, rng=42)
julia> Imbalance.checkbalance(y)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 200 (40.8%)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 310 (63.3%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 490 (100.0%)
# apply BorderlineSMOTE1
Xover, yover = borderline_smote1(X, y; m = 3,
k = 5, ratios = Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng = 42)
julia> Imbalance.checkbalance(yover)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 392 (80.0%)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 441 (90.0%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 490 (100.0%)
MLJ Model Interface
Simply pass the keyword arguments when initializing the `BorderlineSMOTE1` model, and pass the positional arguments `X, y` to the `transform` method.
using MLJ
BorderlineSMOTE1 = @load BorderlineSMOTE1 pkg=Imbalance
# Wrap the model in a machine
oversampler = BorderlineSMOTE1(m=3, k=5, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
mach = machine(oversampler)
# Provide the data to transform (there is nothing to fit)
Xover, yover = transform(mach, X, y)
You can read more about this MLJ
interface by accessing it from MLJ's model browser.
TableTransforms Interface
This interface assumes that the input is a single table `Xy` and that `y` is one of its columns. Hence, an integer `y_ind` indicating which column is `y` must be passed to the constructor, followed by the other keyword arguments. Only `Xy` is provided when applying the transform.
using Imbalance
using Imbalance.TableTransforms
# Generate imbalanced data
num_rows = 1000
num_features = 5
y_ind = 3
Xy, _ = generate_imbalanced_data(num_rows, num_features;
class_probs=[0.5, 0.2, 0.3], min_sep=0.01, insert_y=y_ind, rng=42)
# Initiate BorderlineSMOTE1 Oversampler model
oversampler = BorderlineSMOTE1(y_ind; m=3, k=5,
ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
Xyover = Xy |> oversampler
# equivalently if TableTransforms is used
Xyover, cache = TableTransforms.apply(oversampler, Xy)
The `reapply(oversampler, Xy, cache)` method from `TableTransforms` simply falls back to `apply(oversampler, Xy)`, and the `revert(oversampler, Xy, cache)` method reverts the transform by removing the oversampled observations from the table.
Illustration
A full basic example along with an animation can be found here. You may find more practical examples in the tutorial section which also explains running code on Google Colab.
References
[1] Han, H., Wang, W.-Y., & Mao, B.-H. (2005). Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In D.S. Huang, X.-P. Zhang, & G.-B. Huang (Eds.), Advances in Intelligent Computing (pp. 878-887). Springer.
SMOTE-N
Imbalance.smoten
— Function

smoten(
X, y;
k=5, ratios=1.0, rng=default_rng(),
try_preserve_type=true
)
Description
Oversamples a dataset using SMOTE-N
(Synthetic Minority Oversampling Technique-Nominal) algorithm to correct for class imbalance, as presented in [1]. This is a variant of `SMOTE` designed for datasets where all features are nominal.
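Since nominal features cannot be interpolated, SMOTE-N instead builds a synthetic observation by taking, for each feature, the most frequent value among a point's nearest neighbors. A sketch of that voting step on hypothetical neighbor values (the package additionally uses a modified value difference metric for the neighbor search itself):

```julia
# most frequent value of a collection (ties broken by first occurrence)
featuremode(vals) = argmax(v -> count(==(v), vals), unique(vals))

# nominal feature values of a point's nearest neighbors (hypothetical):
# (color, size)
neighbors = [("red", "S"), ("red", "M"), ("blue", "M")]

# per-feature majority vote across the neighbors
synthetic = ntuple(j -> featuremode(getindex.(neighbors, j)), 2)
@assert synthetic == ("red", "M")
```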
Positional Arguments
`X`: A matrix of integers or a table with element scitypes that subtype `Finite`. That is, for table inputs, each column should have either `OrderedFactor` or `Multiclass` as the element scitype.

`y`: An abstract vector of labels (e.g., strings) that correspond to the observations in `X`.
Keyword Arguments
`k::Integer=5`: Number of nearest neighbors to consider in the algorithm. Should be within the range `0 < k < n`, where `n` is the number of observations in the smallest class.

`ratios=1.0`: A parameter that controls the amount of oversampling to be done for each class.
- Can be a float; in this case, each class will be oversampled to the size of the majority class times the float. By default, all classes are oversampled to the size of the majority class.
- Can be a dictionary mapping each class label to the float ratio for that class.

`rng::Union{AbstractRNG, Integer}=default_rng()`: Either an `AbstractRNG` object or an `Integer` seed to be used with `Xoshiro` if the Julia `VERSION` supports it. Otherwise, `MersenneTwister` is used.

`try_preserve_type::Bool=true`: When `true`, the function will try not to change the type of the input table (e.g., `DataFrame`). However, for some tables this may not succeed, in which case the returned table will be a column table (a named tuple of vectors). This parameter is ignored if the input is a matrix.
Returns
`Xover`: A matrix or table (depending on whether the input `X` is a matrix or table, respectively) that includes the original data and the new observations generated by oversampling.

`yover`: An abstract vector of labels corresponding to `Xover`.
Example
using Imbalance
using ScientificTypes
# set probability of each class
class_probs = [0.5, 0.2, 0.3]
num_rows = 100
num_continuous_feats = 0
# want two categorical features with three and two possible values respectively
num_vals_per_category = [3, 2]
# generate a table and categorical vector accordingly
X, y = generate_imbalanced_data(num_rows, num_continuous_feats;
class_probs, num_vals_per_category, rng=42)
julia> Imbalance.checkbalance(y)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (39.6%)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 33 (68.8%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%)
julia> ScientificTypes.schema(X).scitypes
(Count, Count)
# coerce to a finite scitype (multiclass or ordered factor)
X = coerce(X, autotype(X, :few_to_finite))
# apply SMOTEN
Xover, yover = smoten(X, y; k=5, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
julia> Imbalance.checkbalance(yover)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 38 (79.2%)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 43 (89.6%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%)
MLJ Model Interface
Simply pass the keyword arguments when initializing the `SMOTEN` model, and pass the positional arguments `X, y` to the `transform` method.
using MLJ
SMOTEN = @load SMOTEN pkg=Imbalance
# Wrap the model in a machine
oversampler = SMOTEN(k=5, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
mach = machine(oversampler)
# Provide the data to transform (there is nothing to fit)
Xover, yover = transform(mach, X, y)
You can read more about this MLJ
interface by accessing it from MLJ's model browser.
TableTransforms Interface
This interface assumes that the input is a single table `Xy` and that `y` is one of its columns. Hence, an integer `y_ind` indicating which column is `y` must be passed to the constructor, followed by the other keyword arguments. Only `Xy` is provided when applying the transform.
using Imbalance
using ScientificTypes
using Imbalance.TableTransforms
# Generate imbalanced data
num_rows = 100
num_continuous_feats = 0
y_ind = 2
# generate a table and categorical vector accordingly
Xy, _ = generate_imbalanced_data(num_rows, num_continuous_feats; insert_y=y_ind,
class_probs= [0.5, 0.2, 0.3], num_vals_per_category=[3, 2],
rng=42)
# Table must have only finite scitypes
Xy = coerce(Xy, :Column1=>Multiclass, :Column2=>Multiclass, :Column3=>Multiclass)
# Initiate SMOTEN model
oversampler = SMOTEN(y_ind; k=5, ratios=Dict(1=>1.0, 2=> 0.9, 3=>0.9), rng=42)
Xyover = Xy |> oversampler
# equivalently, if the TableTransforms API is used directly
Xyover, cache = TableTransforms.apply(oversampler, Xy)
The `reapply(oversampler, Xy, cache)` method from `TableTransforms` simply falls back to `apply(oversampler, Xy)`, and the `revert(oversampler, Xy, cache)` method reverts the transform by removing the oversampled observations from the table.
Illustration
A full basic example can be found here. You may find more practical examples in the tutorial section which also explains running code on Google Colab.
References
[1] N. V. Chawla, K. W. Bowyer, L. O.Hall, W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of artificial intelligence research, 321-357, 2002.
SMOTE-NC
Imbalance.smotenc
— Function

smotenc(
X, y, cat_inds;
k=5, ratios=1.0, knn_tree="Brute", rng=default_rng(),
try_preserve_type=true
)
Description
Oversamples a dataset using SMOTE-NC
(Synthetic Minority Oversampling Technique-Nominal Continuous) algorithm to correct for class imbalance, as presented in [1]. This is a variant of `SMOTE` designed for datasets with both nominal and continuous features.
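SMOTE-NC combines both ideas: continuous features are interpolated as in SMOTE, while each nominal feature is set to the most frequent value among the neighbors. A simplified sketch on hypothetical toy data (the actual distance metric in [1] also adds a penalty term for nominal mismatches, which is omitted here):

```julia
using Random

rng = Xoshiro(42)

# a point and two same-class neighbors: 2 continuous features + 1 nominal
x_cont = [1.0, 2.0]
neighbors = [([1.2, 2.1], "red"), ([0.9, 1.8], "blue")]

# continuous part: SMOTE-style interpolation towards a random neighbor
nb_cont, _ = neighbors[rand(rng, 1:2)]
cont = x_cont .+ rand(rng) .* (nb_cont .- x_cont)

# nominal part: most frequent category among the neighbors
cats = last.(neighbors)
nomval = argmax(c -> count(==(c), cats), unique(cats))

@assert length(cont) == 2
@assert nomval in ("red", "blue")
```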
Positional Arguments
`X`: A matrix of floats or a table with element scitypes that subtype `Union{Finite, Infinite}`. Elements in nominal columns should subtype `Finite` (i.e., have scitype `OrderedFactor` or `Multiclass`) and elements in continuous columns should subtype `Infinite` (i.e., have scitype `Count` or `Continuous`).

`y`: An abstract vector of labels (e.g., strings) that correspond to the observations in `X`.

`cat_inds::AbstractVector{<:Int}`: A vector of the indices of the nominal features. Supplied only if `X` is a matrix. Otherwise, they are inferred from the table's scitypes.
Keyword Arguments
`k::Integer=5`: Number of nearest neighbors to consider in the algorithm. Should be within the range `0 < k < n`, where `n` is the number of observations in the smallest class.

`ratios=1.0`: A parameter that controls the amount of oversampling to be done for each class.
- Can be a float; in this case, each class will be oversampled to the size of the majority class times the float. By default, all classes are oversampled to the size of the majority class.
- Can be a dictionary mapping each class label to the float ratio for that class.

`knn_tree`: Decides the tree used in KNN computations. Either `"Brute"` or `"Ball"`. `BallTree` can be much faster but may lead to inaccurate results.

`rng::Union{AbstractRNG, Integer}=default_rng()`: Either an `AbstractRNG` object or an `Integer` seed to be used with `Xoshiro` if the Julia `VERSION` supports it. Otherwise, `MersenneTwister` is used.

`try_preserve_type::Bool=true`: When `true`, the function will try not to change the type of the input table (e.g., `DataFrame`). However, for some tables this may not succeed, in which case the returned table will be a column table (a named tuple of vectors). This parameter is ignored if the input is a matrix.
Returns
`Xover`: A matrix or table (depending on whether the input `X` is a matrix or table, respectively) that includes the original data and the new observations generated by oversampling.

`yover`: An abstract vector of labels corresponding to `Xover`.
Example
using Imbalance
using ScientificTypes
# set probability of each class
class_probs = [0.5, 0.2, 0.3]
num_rows = 100
num_continuous_feats = 3
# want two categorical features with three and two possible values respectively
num_vals_per_category = [3, 2]
# generate a table and categorical vector accordingly
X, y = generate_imbalanced_data(num_rows, num_continuous_feats;
class_probs, num_vals_per_category, rng=42)
julia> Imbalance.checkbalance(y)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (39.6%)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 33 (68.8%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%)
julia> ScientificTypes.schema(X).scitypes
(Continuous, Continuous, Continuous, Continuous, Continuous)
# coerce nominal columns to a finite scitype (multiclass or ordered factor)
X = coerce(X, :Column4=>Multiclass, :Column5=>Multiclass)
# apply SMOTE-NC
Xover, yover = smotenc(X, y; k = 5, ratios = Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng = 42)
julia> Imbalance.checkbalance(yover)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 38 (79.2%)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 43 (89.6%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%)
MLJ Model Interface
Simply pass the keyword arguments when initializing the `SMOTENC` model, and pass the positional arguments (excluding `cat_inds`) to the `transform` method.
using MLJ
SMOTENC = @load SMOTENC pkg=Imbalance
# Wrap the model in a machine
oversampler = SMOTENC(k=5, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
mach = machine(oversampler)
# Provide the data to transform (there is nothing to fit)
Xover, yover = transform(mach, X, y)
You can read more about this MLJ
interface by accessing it from MLJ's model browser. Note that only Table
input is supported by the MLJ interface for this method.
TableTransforms Interface
This interface assumes that the input is a single table `Xy` and that `y` is one of its columns. Hence, an integer `y_ind` indicating which column is `y` must be passed to the constructor, followed by the other keyword arguments. Only `Xy` is provided when applying the transform.
using Imbalance
using ScientificTypes
using Imbalance.TableTransforms
# Generate imbalanced data
num_rows = 100
num_continuous_feats = 3
y_ind = 2
# generate a table and categorical vector accordingly
Xy, _ = generate_imbalanced_data(num_rows, num_continuous_feats; insert_y=y_ind,
class_probs= [0.5, 0.2, 0.3], num_vals_per_category=[3, 2],
rng=42)
# Table must have only finite or continuous scitypes
Xy = coerce(Xy, :Column2=>Multiclass, :Column5=>Multiclass, :Column6=>Multiclass)
# Initiate SMOTENC model
oversampler = SMOTENC(y_ind; k=5, ratios=Dict(1=>1.0, 2=> 0.9, 3=>0.9), rng=42)
Xyover = Xy |> oversampler
# equivalently, if the TableTransforms API is used directly
Xyover, cache = TableTransforms.apply(oversampler, Xy)
Illustration
A full basic example along with an animation can be found here. You may find more practical examples in the tutorial section which also explains running code on Google Colab.
References
[1] N. V. Chawla, K. W. Bowyer, L. O.Hall, W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of artificial intelligence research, 321-357, 2002.