Undersampling Algorithms
The following table lists the supported undersampling algorithms, whether each mechanism deletes existing data or generates new data, and the data types each supports.
Undersampling Method | Mechanism | Supported Data Types |
---|---|---|
Random Undersampler | Delete existing data as needed | Continuous and/or nominal |
Cluster Undersampler | Generate new data or delete existing data | Continuous |
Edited Nearest Neighbors Undersampler | Delete existing data meeting certain conditions (cleaning) | Continuous |
Tomek Links Undersampler | Delete existing data meeting certain conditions (cleaning) | Continuous |
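All four methods share the same functional call pattern, `method(X, y; kwargs...)`, and return an undersampled `(X_under, y_under)` pair. The following minimal sketch applies each of them to the same synthetic dataset (variable names are illustrative; the resulting class counts depend on the data and the method, so none are shown):

```julia
using Imbalance

# synthetic imbalanced data: 100 rows, 5 continuous features, 3 classes
X, y = generate_imbalanced_data(100, 5; class_probs=[0.5, 0.2, 0.3], rng=42)

# deletion-based undersampling toward balanced classes
Xr, yr = random_undersample(X, y; rng=42)
Xc, yc = cluster_undersample(X, y; mode="nearest", rng=42)

# cleaning-based undersampling: how much is removed depends on the data
Xe, ye = enn_undersample(X, y; k=5, keep_condition="mode", rng=42)
Xt, yt = tomek_undersample(X, y; rng=42)

# inspect the resulting class distributions
Imbalance.checkbalance(yr; ref="minority")
Imbalance.checkbalance(yt; ref="minority")
```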
Random Undersampler
Imbalance.random_undersample — Function
random_undersample(
    X, y;
    ratios=1.0, rng=default_rng(),
    try_preserve_type=true
)
Description
Naively undersample a dataset by randomly deleting existing observations.
Positional Arguments
`X`: A matrix of real numbers or a table with element scitypes that subtype `Union{Finite, Infinite}`. Elements in nominal columns should subtype `Finite` (i.e., have scitype `OrderedFactor` or `Multiclass`) and elements in continuous columns should subtype `Infinite` (i.e., have scitype `Count` or `Continuous`).
`y`: An abstract vector of labels (e.g., strings) that correspond to the observations in `X`.
Keyword Arguments
`ratios=1.0`: A parameter that controls the amount of undersampling to be done for each class (a short sketch of both forms follows the keyword arguments).
- Can be a float; in this case each class will be undersampled to the size of the minority class times the float. By default, all classes are undersampled to the size of the minority class.
- Can be a dictionary mapping each class label to the float ratio for that class.
`rng::Union{AbstractRNG, Integer}=default_rng()`: Either an `AbstractRNG` object or an `Integer` seed to be used with `Xoshiro` if the Julia `VERSION` supports it. Otherwise, `MersenneTwister` is used.
`try_preserve_type::Bool=true`: When `true`, the function will try not to change the type of the input table (e.g., `DataFrame`). However, for some tables this may not succeed, in which case the returned table will be a column table (a named tuple of vectors). This parameter is ignored if the input is a matrix.
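As a minimal sketch of the two `ratios` forms (the data come from `generate_imbalanced_data`, so the specific class counts mentioned in the comments are illustrative rather than guaranteed):

```julia
using Imbalance

X, y = generate_imbalanced_data(100, 5; class_probs=[0.5, 0.2, 0.3], rng=42)

# float form: every class is undersampled to 0.9 × (minority class size);
# with a minority class of 19 observations here, that is roughly 17 per class
X_f, y_f = random_undersample(X, y; ratios=0.9, rng=42)

# dictionary form: a per-class target, expressed relative to the minority class size
X_d, y_d = random_undersample(X, y; ratios=Dict(0 => 1.0, 1 => 1.0, 2 => 0.9), rng=42)

# matrix input: try_preserve_type is ignored and a matrix is returned
X_mat, y_mat = rand(100, 3), rand(["a", "a", "a", "b"], 100)
X_m, y_m = random_undersample(X_mat, y_mat; ratios=1.0, rng=42)
```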
Returns
`X_under`: A matrix or table that includes the data after undersampling, depending on whether the input `X` is a matrix or table respectively.
`y_under`: An abstract vector of labels corresponding to `X_under`.
Example
using Imbalance
# set probability of each class
class_probs = [0.5, 0.2, 0.3]
num_rows, num_continuous_feats = 100, 5
# generate a table and categorical vector accordingly
X, y = generate_imbalanced_data(num_rows, num_continuous_feats;
class_probs, rng=42)
julia> Imbalance.checkbalance(y; ref="minority")
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (100.0%)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 33 (173.7%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (252.6%)
# apply random undersampling
X_under, y_under = random_undersample(X, y; ratios=Dict(0=>1.0, 1=> 1.0, 2=>1.0),
rng=42)
julia> Imbalance.checkbalance(y_under; ref="minority")
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (100.0%)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (100.0%)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (100.0%)
MLJ Model Interface
Simply pass the keyword arguments when instantiating the `RandomUndersampler` model, then pass the positional arguments `X, y` to the `transform` method.
using MLJ
RandomUndersampler = @load RandomUndersampler pkg=Imbalance
# Wrap the model in a machine
undersampler = RandomUndersampler(ratios=Dict(0=>1.0, 1=> 1.0, 2=>1.0),
rng=42)
mach = machine(undersampler)
# Provide the data to transform (there is nothing to fit)
X_under, y_under = transform(mach, X, y)
You can read more about this MLJ
interface by accessing it from MLJ's model browser.
TableTransforms Interface
This interface assumes that the input is a single table `Xy` and that `y` is one of its columns. Hence, an integer `y_ind` must be passed to the constructor to indicate which column is `y`, followed by the other keyword arguments. Only `Xy` is provided when applying the transform.
using Imbalance
using Imbalance.TableTransforms
# Generate imbalanced data
num_rows = 100
num_features = 5
y_ind = 3
Xy, _ = generate_imbalanced_data(num_rows, num_features;
class_probs=[0.5, 0.2, 0.3], insert_y=y_ind, rng=42)
# Initiate Random Undersampler model
undersampler = RandomUndersampler(y_ind; ratios=Dict(0=>1.0, 1=>1.0, 2=>1.0), rng=42)
Xy_under = Xy |> undersampler
Xy_under, cache = TableTransforms.apply(undersampler, Xy) # equivalently
The `reapply(undersampler, Xy, cache)` method from TableTransforms simply falls back to `apply(undersampler, Xy)`, and `revert(undersampler, Xy, cache)` is not supported.
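A minimal sketch of this fallback, continuing from the example above (that `reapply` is reachable through the same `TableTransforms` name used for `apply` here is an assumption):

```julia
# reapply ignores the cache and simply re-applies the transform,
# i.e. it behaves like a fresh apply on the given table
TableTransforms.reapply(undersampler, Xy, cache)
# revert(undersampler, Xy_under, cache) is not supported for undersamplers
```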
Illustration
A full basic example along with an animation can be found here. You may find more practical examples in the tutorial section which also explains running code on Google Colab.
Cluster Undersampler
Imbalance.cluster_undersample — Function
cluster_undersample(
    X, y;
    mode="nearest", ratios=1.0, maxiter=100,
    rng=default_rng(), try_preserve_type=true
)
Description
Undersample a dataset using clustering-based undersampling as presented in [1], with K-means as the clustering algorithm.
Positional Arguments
`X`: A matrix or table of floats where each row is an observation from the dataset.
`y`: An abstract vector of labels (e.g., strings) that correspond to the observations in `X`.
Keyword Arguments
`mode::AbstractString="nearest"`: If `"center"`, the undersampled data will consist of the centroids of each cluster found; if `"nearest"`, it will consist of the nearest neighbor of each centroid.
`ratios=1.0`: A parameter that controls the amount of undersampling to be done for each class.
- Can be a float; in this case each class will be undersampled to the size of the minority class times the float. By default, all classes are undersampled to the size of the minority class.
- Can be a dictionary mapping each class label to the float ratio for that class.
`maxiter::Integer=100`: Maximum number of iterations to run K-means.
`rng::Union{AbstractRNG, Integer}=default_rng()`: Either an `AbstractRNG` object or an `Integer` seed to be used with `Xoshiro` if the Julia `VERSION` supports it. Otherwise, `MersenneTwister` is used.
`try_preserve_type::Bool=true`: When `true`, the function will try not to change the type of the input table (e.g., `DataFrame`). However, for some tables this may not succeed, in which case the returned table will be a column table (a named tuple of vectors). This parameter is ignored if the input is a matrix.
Returns
`X_under`: A matrix or table that includes the data after undersampling, depending on whether the input `X` is a matrix or table respectively.
`y_under`: An abstract vector of labels corresponding to `X_under`.
Example
using Imbalance
# set probability of each class
class_probs = [0.5, 0.2, 0.3]
num_rows, num_continuous_feats = 100, 5
# generate a table and categorical vector accordingly
X, y = generate_imbalanced_data(num_rows, num_continuous_feats;
class_probs, rng=42)
julia> Imbalance.checkbalance(y; ref="minority")
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (100.0%)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 33 (173.7%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (252.6%)
# apply cluster_undersampling
X_under, y_under = cluster_undersample(X, y; mode="nearest",
ratios=Dict(0=>1.0, 1=> 1.0, 2=>1.0), rng=42)
julia> Imbalance.checkbalance(y_under; ref="minority")
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (100.0%)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (100.0%)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (100.0%)
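For comparison, a minimal sketch of the `"center"` mode, which keeps the K-means centroids themselves rather than existing observations (the exact rows produced depend on how the clustering converges):

```julia
# "center" mode: the undersampled observations are the cluster centroids,
# i.e. newly generated points rather than rows selected from X
X_center, y_center = cluster_undersample(X, y; mode="center",
                                         ratios=Dict(0=>1.0, 1=>1.0, 2=>1.0),
                                         maxiter=100, rng=42)
Imbalance.checkbalance(y_center; ref="minority")
```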
MLJ Model Interface
Simply pass the keyword arguments when instantiating the `ClusterUndersampler` model, then pass the positional arguments `X, y` to the `transform` method.
using MLJ
ClusterUndersampler = @load ClusterUndersampler pkg=Imbalance
# Wrap the model in a machine
undersampler = ClusterUndersampler(mode="nearest",
ratios=Dict(0=>1.0, 1=> 1.0, 2=>1.0), rng=42)
mach = machine(undersampler)
# Provide the data to transform (there is nothing to fit)
X_under, y_under = transform(mach, X, y)
You can read more about this MLJ
interface by accessing it from MLJ's model browser.
TableTransforms Interface
This interface assumes that the input is a single table `Xy` and that `y` is one of its columns. Hence, an integer `y_ind` must be passed to the constructor to indicate which column is `y`, followed by the other keyword arguments. Only `Xy` is provided when applying the transform.
using Imbalance
using Imbalance.TableTransforms
# Generate imbalanced data
num_rows = 100
num_features = 5
y_ind = 3
Xy, _ = generate_imbalanced_data(num_rows, num_features;
class_probs=[0.5, 0.2, 0.3], insert_y=y_ind, rng=42)
# Initiate ClusterUndersampler model
undersampler = ClusterUndersampler(y_ind; mode="nearest",
ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
Xy_under = Xy |> undersampler
Xy_under, cache = TableTransforms.apply(undersampler, Xy) # equivalently
The `reapply(undersampler, Xy, cache)` method from TableTransforms simply falls back to `apply(undersampler, Xy)`, and `revert(undersampler, Xy, cache)` is not supported.
Illustration
A full basic example along with an animation can be found here. You may find more practical examples in the tutorial section which also explains running code on Google Colab.
References
[1] Wei-Chao, L., Chih-Fong, T., Ya-Han, H., & Jing-Shang, J. (2017). Clustering-based undersampling in class-imbalanced data. Information Sciences, 409–410, 17–26.
Edited Nearest Neighbors Undersampler
Imbalance.enn_undersample — Function
enn_undersample(
    X, y; k=5, keep_condition="mode",
    min_ratios=1.0, force_min_ratios=false,
    rng=default_rng(), try_preserve_type=true
)
Description
Undersample a dataset by removing points that violate a certain condition, such as belonging to a different class than the majority of their neighbors, as proposed in [1].
Positional Arguments
`X`: A matrix or table of floats where each row is an observation from the dataset.
`y`: An abstract vector of labels (e.g., strings) that correspond to the observations in `X`.
Keyword Arguments
`k::Integer=5`: Number of nearest neighbors to consider in the algorithm. Should be within the range `0 < k < n`, where `n` is the number of observations in the data. It will be automatically set to `n - 1` if `n ≤ k`.
`keep_condition::AbstractString="mode"`: The condition that leads to removing a point upon violation. Takes one of `"exists"`, `"mode"`, `"only mode"` and `"all"`:
- `"exists"`: the point has at least one neighbor from the same class
- `"mode"`: the class of the point is one of the most frequent classes of the neighbors (there may be many)
- `"only mode"`: the class of the point is the single most frequent class of the neighbors
- `"all"`: the class of the point is the same as all the neighbors
`min_ratios=1.0`: A parameter that controls the maximum amount of undersampling to be done for each class. If this algorithm cleans the data to an extent that this is violated, some of the cleaned points will be revived randomly so that it is satisfied.
- Can be a float; in this case each class will be at most undersampled to the size of the minority class times the float. By default, no class is allowed to fall below the size of the minority class.
- Can be a dictionary mapping each class label to the float minimum ratio for that class.
`force_min_ratios=false`: If `true`, and this algorithm cleans the data such that the ratios for each class exceed those specified in `min_ratios`, then further undersampling will be performed so that the final ratios are equal to `min_ratios`.
`rng::Union{AbstractRNG, Integer}=default_rng()`: Either an `AbstractRNG` object or an `Integer` seed to be used with `Xoshiro` if the Julia `VERSION` supports it. Otherwise, `MersenneTwister` is used.
`try_preserve_type::Bool=true`: When `true`, the function will try not to change the type of the input table (e.g., `DataFrame`). However, for some tables this may not succeed, in which case the returned table will be a column table (a named tuple of vectors). This parameter is ignored if the input is a matrix.
Returns
`X_under`: A matrix or table that includes the data after undersampling, depending on whether the input `X` is a matrix or table respectively.
`y_under`: An abstract vector of labels corresponding to `X_under`.
Example
using Imbalance
# set probability of each class
class_probs = [0.5, 0.2, 0.3]
num_rows, num_continuous_feats = 100, 5
# generate a table and categorical vector accordingly
X, y = generate_imbalanced_data(num_rows, num_continuous_feats;
min_sep=0.01, stds=[3.0 3.0 3.0], class_probs, rng=42)
julia> Imbalance.checkbalance(y; ref="minority")
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (100.0%)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 33 (173.7%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (252.6%)
# apply enn undersampling
X_under, y_under = enn_undersample(X, y; k=3, keep_condition="only mode",
min_ratios=0.5, rng=42)
julia> Imbalance.checkbalance(y_under; ref="minority")
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 10 (100.0%)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 10 (100.0%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 24 (240.0%)
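The stricter the `keep_condition`, the more points the cleaning removes. A minimal sketch comparing the four conditions on the same data (the resulting counts depend on the data, so none are shown here):

```julia
# "exists" is the most lenient condition and "all" the strictest,
# so the amount of removed data generally increases down this list
for condition in ("exists", "mode", "only mode", "all")
    _, y_c = enn_undersample(X, y; k=3, keep_condition=condition,
                             min_ratios=0.5, rng=42)
    println(condition, " => ", length(y_c), " observations kept")
end
```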
MLJ Model Interface
Simply pass the keyword arguments when instantiating the `ENNUndersampler` model, then pass the positional arguments `X, y` to the `transform` method.
using MLJ
ENNUndersampler = @load ENNUndersampler pkg=Imbalance
# Wrap the model in a machine
undersampler = ENNUndersampler(k=3, keep_condition="only mode", min_ratios=0.5, rng=42)
mach = machine(undersampler)
# Provide the data to transform (there is nothing to fit)
X_under, y_under = transform(mach, X, y)
You can read more about this MLJ
interface by accessing it from MLJ's model browser.
TableTransforms Interface
This interface assumes that the input is a single table `Xy` and that `y` is one of its columns. Hence, an integer `y_ind` must be passed to the constructor to indicate which column is `y`, followed by the other keyword arguments. Only `Xy` is provided when applying the transform.
using Imbalance
using Imbalance.TableTransforms
# Generate imbalanced data
num_rows = 100
num_features = 5
y_ind = 3
Xy, _ = generate_imbalanced_data(num_rows, num_features;
                                 class_probs=[0.5, 0.2, 0.3], insert_y=y_ind,
                                 min_sep=0.01, stds=[3.0 3.0 3.0], rng=42)
# Initiate ENN Undersampler model
undersampler = ENNUndersampler(y_ind; k=3, keep_condition="only mode", rng=42)
Xy_under = Xy |> undersampler
Xy_under, cache = TableTransforms.apply(undersampler, Xy) # equivalently
The `reapply(undersampler, Xy, cache)` method from TableTransforms simply falls back to `apply(undersampler, Xy)`, and `revert(undersampler, Xy, cache)` is not supported.
Illustration
A full basic example along with an animation can be found here. You may find more practical examples in the tutorial section which also explains running code on Google Colab.
References
[1] Dennis L Wilson. Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, pages 408–421, 1972.
Tomek Links Undersampler
Imbalance.tomek_undersample — Function
tomek_undersample(
    X, y;
    min_ratios=1.0, force_min_ratios=false,
    rng=default_rng(), try_preserve_type=true
)
Description
Undersample a dataset by removing ("cleaning") any point that is part of a Tomek link in the data. Tomek links are presented in [1].
Positional Arguments
`X`: A matrix or table of floats where each row is an observation from the dataset.
`y`: An abstract vector of labels (e.g., strings) that correspond to the observations in `X`.
Keyword Arguments
`min_ratios=1.0`: A parameter that controls the maximum amount of undersampling to be done for each class. If this algorithm cleans the data to an extent that this is violated, some of the cleaned points will be revived randomly so that it is satisfied.
- Can be a float; in this case each class will be at most undersampled to the size of the minority class times the float. By default, no class is allowed to fall below the size of the minority class.
- Can be a dictionary mapping each class label to the float minimum ratio for that class.
`force_min_ratios=false`: If `true`, and this algorithm cleans the data such that the ratios for each class exceed those specified in `min_ratios`, then further undersampling will be performed so that the final ratios are equal to `min_ratios`.
`rng::Union{AbstractRNG, Integer}=default_rng()`: Either an `AbstractRNG` object or an `Integer` seed to be used with `Xoshiro` if the Julia `VERSION` supports it. Otherwise, `MersenneTwister` is used.
`try_preserve_type::Bool=true`: When `true`, the function will try not to change the type of the input table (e.g., `DataFrame`). However, for some tables this may not succeed, in which case the returned table will be a column table (a named tuple of vectors). This parameter is ignored if the input is a matrix.
Returns
`X_under`: A matrix or table that includes the data after undersampling, depending on whether the input `X` is a matrix or table respectively.
`y_under`: An abstract vector of labels corresponding to `X_under`.
Example
using Imbalance
# set probability of each class
class_probs = [0.5, 0.2, 0.3]
num_rows, num_continuous_feats = 100, 5
# generate a table and categorical vector accordingly
X, y = generate_imbalanced_data(num_rows, num_continuous_feats;
min_sep=0.01, stds=[3.0 3.0 3.0], class_probs, rng=42)
julia> Imbalance.checkbalance(y; ref="minority")
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (100.0%)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 33 (173.7%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (252.6%)
# apply tomek undersampling
X_under, y_under = tomek_undersample(X, y; min_ratios=1.0, rng=42)
julia> Imbalance.checkbalance(y_under; ref="minority")
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (100.0%)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 22 (115.8%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 36 (189.5%)
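Tomek-link cleaning alone rarely balances the classes fully, as the output above shows. A minimal sketch of forcing the final ratios down to `min_ratios` (an `rng` is still passed, since the extra undersampling may involve randomness):

```julia
# extra undersampling after cleaning so that every class ends at exactly
# min_ratios × (minority class size)
X_bal, y_bal = tomek_undersample(X, y; min_ratios=1.0,
                                 force_min_ratios=true, rng=42)
Imbalance.checkbalance(y_bal; ref="minority")
```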
MLJ Model Interface
Simply pass the keyword arguments when instantiating the `TomekUndersampler` model, then pass the positional arguments `X, y` to the `transform` method.
using MLJ
TomekUndersampler = @load TomekUndersampler pkg=Imbalance
# Wrap the model in a machine
undersampler = TomekUndersampler(min_ratios=1.0, rng=42)
mach = machine(undersampler)
# Provide the data to transform (there is nothing to fit)
X_under, y_under = transform(mach, X, y)
You can read more about this MLJ
interface by accessing it from MLJ's model browser.
TableTransforms Interface
This interface assumes that the input is a single table `Xy` and that `y` is one of its columns. Hence, an integer `y_ind` must be passed to the constructor to indicate which column is `y`, followed by the other keyword arguments. Only `Xy` is provided when applying the transform.
using Imbalance
using Imbalance.TableTransforms
# Generate imbalanced data
num_rows = 100
num_features = 5
y_ind = 3
Xy, _ = generate_imbalanced_data(num_rows, num_features;
                                 class_probs=[0.5, 0.2, 0.3], insert_y=y_ind,
                                 min_sep=0.01, stds=[3.0 3.0 3.0], rng=42)
# Initiate TomekUndersampler model
undersampler = TomekUndersampler(y_ind; min_ratios=1.0, rng=42)
Xy_under = Xy |> undersampler
Xy_under, cache = TableTransforms.apply(undersampler, Xy) # equivalently
The `reapply(undersampler, Xy, cache)` method from TableTransforms simply falls back to `apply(undersampler, Xy)`, and `revert(undersampler, Xy, cache)` is not supported.
Illustration
A full basic example along with an animation can be found here. You may find more practical examples in the tutorial section which also explains running code on Google Colab.
References
[1] Ivan Tomek. Two modifications of CNN. IEEE Trans. Systems, Man and Cybernetics, 6:769–772, 1976.