Oversampling Algorithms

The following table lists the supported oversampling algorithms, whether each one repeats existing data or generates synthetic data, and the types of data it supports.

| Oversampling Method | Mechanism | Supported Data Types |
|---|---|---|
| Random Oversampler | Repeat existing data | Continuous and/or nominal |
| Random Walk Oversampler | Generate synthetic data | Continuous and/or nominal |
| ROSE | Generate synthetic data | Continuous |
| SMOTE | Generate synthetic data | Continuous |
| Borderline SMOTE1 | Generate synthetic data | Continuous |
| SMOTE-N | Generate synthetic data | Nominal |
| SMOTE-NC | Generate synthetic data | Continuous and nominal |

Random Oversampler

Imbalance.random_oversample - Function
random_oversample(
    X, y; 
    ratios=1.0, rng=default_rng(), 
    try_preserve_type=true
)

Description

Naively oversample a dataset by randomly repeating existing observations with replacement.
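For a plain matrix, the idea can be sketched in a few lines. This is an illustration only, not the package's implementation: `naive_random_oversample` is a hypothetical helper, and it always balances every class to the majority size (the behavior described for the default `ratios=1.0`).

```julia
using Random

# Hypothetical helper, not part of Imbalance.jl: repeat rows of minority
# classes (sampled with replacement) until every class matches the majority.
function naive_random_oversample(X::AbstractMatrix, y::AbstractVector;
                                 rng=Random.default_rng())
    labels = unique(y)
    counts = Dict(l => count(==(l), y) for l in labels)
    majority = maximum(values(counts))
    extra = Int[]
    for l in labels
        idx = findall(==(l), y)                   # rows belonging to class l
        n_extra = majority - length(idx)          # how many rows are missing
        append!(extra, rand(rng, idx, n_extra))   # sample with replacement
    end
    keep = vcat(collect(eachindex(y)), extra)     # originals first, then repeats
    return X[keep, :], y[keep]
end
```

Because the original rows come first in `keep`, the first `length(y)` rows of the result are the untouched input, and every appended row is an exact copy of an existing observation.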

Positional Arguments

  • X: A matrix of real numbers or a table with element scitypes that subtype Union{Finite, Infinite}. Elements in nominal columns should subtype Finite (i.e., have scitype OrderedFactor or Multiclass) and elements in continuous columns should subtype Infinite (i.e., have scitype Count or Continuous).

  • y: An abstract vector of labels (e.g., strings) that correspond to the observations in X

Keyword Arguments

  • ratios=1.0: A parameter that controls the amount of oversampling to be done for each class
    • Can be a float and in this case each class will be oversampled to the size of the majority class times the float. By default, all classes are oversampled to the size of the majority class
    • Can be a dictionary mapping each class label to the float ratio for that class
  • rng::Union{AbstractRNG, Integer}=default_rng(): Either an AbstractRNG object or an Integer seed. A seed is used with Xoshiro if the Julia VERSION supports it; otherwise, MersenneTwister is used.
  • try_preserve_type::Bool=true: When true, the function will try to not change the type of the input table (e.g., DataFrame). However, for some tables, this may not succeed, and in this case, the table returned will be a column table (named-tuple of vectors). This parameter is ignored if the input is a matrix.
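To make the meaning of `ratios` concrete, the following sketch computes the class sizes that a given `ratios` value implies. `target_counts` is a hypothetical helper, not a function of the package; the rounding rule and the handling of classes missing from the dictionary are assumptions made for illustration.

```julia
# Hypothetical helper, not part of Imbalance.jl.
function target_counts(y::AbstractVector, ratios)
    counts = Dict{eltype(y),Int}()
    for label in y
        counts[label] = get(counts, label, 0) + 1
    end
    majority = maximum(values(counts))
    targets = Dict{eltype(y),Int}()
    for (label, n) in counts
        ratio = ratios isa Real ? ratios : get(ratios, label, nothing)
        # Assumptions: round to the nearest integer; a class that is missing
        # from the dictionary or already at its target size is left unchanged.
        targets[label] = ratio === nothing ? n : max(n, round(Int, ratio * majority))
    end
    return targets
end

# Class sizes 48, 19 and 33, as in the example for this function
y = vcat(fill(0, 48), fill(1, 19), fill(2, 33))
targets = target_counts(y, Dict(0 => 1.0, 1 => 0.9, 2 => 0.8))
# class 0 => 48, class 1 => 43, class 2 => 38
```

With a majority class of 48 observations, the ratios 0.9 and 0.8 correspond to 43 and 38 observations, which are the counts reported by `checkbalance` in the example for this function.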

Returns

  • Xover: A matrix or table, matching the type of the input X, that includes the original data and the new observations due to oversampling

  • yover: An abstract vector of labels corresponding to Xover

Example

using Imbalance

# set probability of each class
class_probs = [0.5, 0.2, 0.3]                         
num_rows, num_continuous_feats = 100, 5
# generate a table and categorical vector accordingly
X, y = generate_imbalanced_data(num_rows, num_continuous_feats; 
                                class_probs, rng=42)    

julia> Imbalance.checkbalance(y)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (39.6%) 
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 33 (68.8%) 
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%) 

# apply random oversampling
Xover, yover = random_oversample(X, y; ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)

julia> Imbalance.checkbalance(yover)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 38 (79.2%) 
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 43 (89.6%) 
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%) 

MLJ Model Interface

Simply pass the keyword arguments when constructing the RandomOversampler model and pass the positional arguments X, y to the transform method.

using MLJ
RandomOversampler = @load RandomOversampler pkg=Imbalance

# Wrap the model in a machine
oversampler = RandomOversampler(ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
mach = machine(oversampler)

# Provide the data to transform (there is nothing to fit)
Xover, yover = transform(mach, X, y)

You can read more about this MLJ interface by accessing it from MLJ's model browser.

TableTransforms Interface

This interface assumes that the input is a single table Xy in which y is one of the columns. Hence, an integer y_ind must be passed to the constructor to indicate which column holds y, followed by the other keyword arguments. Only Xy is provided when applying the transform.

using Imbalance
using Imbalance.TableTransforms

# Generate imbalanced data
num_rows = 100
num_features = 5
y_ind = 3
Xy, _ = generate_imbalanced_data(num_rows, num_features; 
                                 class_probs=[0.5, 0.2, 0.3], insert_y=y_ind, rng=42)

# Initiate Random Oversampler model
oversampler = RandomOversampler(y_ind; ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
Xyover = Xy |> oversampler                    
# equivalently, using the TableTransforms API directly
Xyover, cache = TableTransforms.apply(oversampler, Xy)

The reapply(oversampler, Xy, cache) method from TableTransforms simply falls back to apply(oversampler, Xy), and revert(oversampler, Xy, cache) reverts the transform by removing the oversampled observations from the table.

Illustration

A full basic example along with an animation can be found here. You may find more practical examples in the tutorial section which also explains running code on Google Colab.


Random Walk Oversampler

Imbalance.random_walk_oversample - Function
random_walk_oversample(
    X, y, cat_inds;
    ratios=1.0, rng=default_rng(),
    try_preserve_type=true
)

Description

Oversamples a dataset using random walk oversampling as presented in [1].
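As a rough sketch of the continuous-feature step under one reading of [1], each synthetic value is an existing value of the class shifted by Gaussian noise scaled by the class standard deviation of that feature divided by the square root of the class size. `random_walk_sample` is a hypothetical helper, not the package's implementation; it ignores nominal features and needs at least two observations in the class.

```julia
using Random, Statistics

# Hypothetical helper, not part of Imbalance.jl. `Xc` holds the observations
# of ONE class (rows) with continuous features only (columns).
function random_walk_sample(Xc::AbstractMatrix{<:Real}, n_new::Integer;
                            rng=Random.default_rng())
    n, d = size(Xc)
    sigma = vec(std(Xc; dims=1))           # per-feature std within the class
    base = Xc[rand(rng, 1:n, n_new), :]    # randomly chosen existing points
    noise = randn(rng, n_new, d)           # standard normal steps
    return base .- noise .* (sigma' ./ sqrt(n))
end
```

Under this scheme a feature that is constant within the class stays constant in the synthetic points, since its standard deviation is zero.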

Positional Arguments

  • X: A matrix of floats or a table with element scitypes that subtype Union{Finite, Infinite}. Elements in nominal columns should subtype Finite (i.e., have scitype OrderedFactor or Multiclass) and elements in continuous columns should subtype Infinite (i.e., have scitype Count or Continuous).

  • y: An abstract vector of labels (e.g., strings) that correspond to the observations in X

  • cat_inds::AbstractVector{<:Int}: A vector of the indices of the nominal features. Supplied only if X is a matrix. Otherwise, they are inferred from the table's scitypes.

Keyword Arguments

  • ratios=1.0: A parameter that controls the amount of oversampling to be done for each class
    • Can be a float and in this case each class will be oversampled to the size of the majority class times the float. By default, all classes are oversampled to the size of the majority class
    • Can be a dictionary mapping each class label to the float ratio for that class
  • rng::Union{AbstractRNG, Integer}=default_rng(): Either an AbstractRNG object or an Integer seed. A seed is used with Xoshiro if the Julia VERSION supports it; otherwise, MersenneTwister is used.
  • try_preserve_type::Bool=true: When true, the function will try to not change the type of the input table (e.g., DataFrame). However, for some tables, this may not succeed, and in this case, the table returned will be a column table (named-tuple of vectors). This parameter is ignored if the input is a matrix.

Returns

  • Xover: A matrix or table, matching the type of the input X, that includes the original data and the new observations due to oversampling

  • yover: An abstract vector of labels corresponding to Xover

Example

using Imbalance
using ScientificTypes

# set probability of each class
class_probs = [0.5, 0.2, 0.3]                         
num_rows = 100
num_continuous_feats = 3
# want two categorical features with three and two possible values respectively
num_vals_per_category = [3, 2]

# generate a table and categorical vector accordingly
X, y = generate_imbalanced_data(num_rows, num_continuous_feats; 
                                class_probs, num_vals_per_category, rng=42)    
								                  
julia> Imbalance.checkbalance(y)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (39.6%) 
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 33 (68.8%) 
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%) 

julia> ScientificTypes.schema(X).scitypes
(Continuous, Continuous, Continuous, Continuous, Continuous)
# coerce nominal columns to a finite scitype (multiclass or ordered factor)
X = coerce(X, :Column4=>Multiclass, :Column5=>Multiclass)

# apply random walk oversampling
Xover, yover = random_walk_oversample(X, y; 
                                      ratios = Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng = 42)

julia> Imbalance.checkbalance(yover)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 38 (79.2%) 
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 43 (89.6%) 
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%) 

MLJ Model Interface

Simply pass the keyword arguments when constructing the RandomWalkOversampler model and pass the positional arguments (excluding cat_inds) to the transform method.

using MLJ
RandomWalkOversampler = @load RandomWalkOversampler pkg=Imbalance

# Wrap the model in a machine
oversampler = RandomWalkOversampler(ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
mach = machine(oversampler)

# Provide the data to transform (there is nothing to fit)
Xover, yover = transform(mach, X, y)

You can read more about this MLJ interface by accessing it from MLJ's model browser. Note that the MLJ interface for this method supports only table input.

TableTransforms Interface

This interface assumes that the input is a single table Xy in which y is one of the columns. Hence, an integer y_ind must be passed to the constructor to indicate which column holds y, followed by the other keyword arguments. Only Xy is provided when applying the transform.

using Imbalance
using ScientificTypes
using Imbalance.TableTransforms

# Generate imbalanced data
num_rows = 100
num_continuous_feats = 3
y_ind = 2

# generate a table and categorical vector accordingly
Xy, _ = generate_imbalanced_data(num_rows, num_continuous_feats; insert_y=y_ind,
                                class_probs= [0.5, 0.2, 0.3], num_vals_per_category=[3, 2],
                                 rng=42)  

# Table must have only finite or continuous scitypes                                
Xy = coerce(Xy, :Column2=>Multiclass, :Column5=>Multiclass, :Column6=>Multiclass)

# Initiate Random Walk Oversampler model
oversampler = RandomWalkOversampler(y_ind;
                                    ratios=Dict(1=>1.0, 2=> 0.9, 3=>0.9), rng=42)
Xyover = Xy |> oversampler                               
# equivalently, using the TableTransforms API directly
Xyover, cache = TableTransforms.apply(oversampler, Xy)

Illustration

A full basic example along with an animation can be found here. You may find more practical examples in the tutorial section which also explains running code on Google Colab.

References

[1] Zhang, H., & Li, M. (2014). RWO-Sampling: A random walk over-sampling approach to imbalanced data classification. Information Fusion, 25, 4-20.


ROSE

Imbalance.rose - Function
rose(
    X, y; 
    s=1.0, ratios=1.0, rng=default_rng(),
    try_preserve_type=true
)

Description

Oversamples a dataset using the ROSE (Random OverSampling Examples) algorithm to correct for class imbalance, as presented in [1].

Positional Arguments

  • X: A matrix or table of floats where each row is an observation from the dataset

  • y: An abstract vector of labels (e.g., strings) that correspond to the observations in X

Keyword Arguments

  • s::AbstractFloat=1.0: A parameter that proportionally scales the bandwidth of the Gaussian kernel

  • ratios=1.0: A parameter that controls the amount of oversampling to be done for each class

    • Can be a float and in this case each class will be oversampled to the size of the majority class times the float. By default, all classes are oversampled to the size of the majority class
    • Can be a dictionary mapping each class label to the float ratio for that class
  • rng::Union{AbstractRNG, Integer}=default_rng(): Either an AbstractRNG object or an Integer seed. A seed is used with Xoshiro if the Julia VERSION supports it; otherwise, MersenneTwister is used.
  • try_preserve_type::Bool=true: When true, the function will try to not change the type of the input table (e.g., DataFrame). However, for some tables, this may not succeed, and in this case, the table returned will be a column table (named-tuple of vectors). This parameter is ignored if the input is a matrix.
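The role of `s` can be illustrated with a small sketch of kernel-based sampling: a synthetic point is an existing class observation plus Gaussian noise whose per-feature scale is `s` times a rule-of-thumb bandwidth. `rose_sample` is a hypothetical helper, and the Silverman-style bandwidth formula used here is an assumption for illustration rather than a statement of what the package computes.

```julia
using Random, Statistics

# Hypothetical helper, not part of Imbalance.jl. `Xc` holds the observations
# of ONE class (rows) with continuous features (columns).
function rose_sample(Xc::AbstractMatrix{<:Real}, n_new::Integer;
                     s::Real=1.0, rng=Random.default_rng())
    n, d = size(Xc)
    sigma = vec(std(Xc; dims=1))                       # per-feature std
    # Assumed rule-of-thumb bandwidth, scaled proportionally by `s`
    h = s .* (4 / ((d + 2) * n))^(1 / (d + 4)) .* sigma
    centers = Xc[rand(rng, 1:n, n_new), :]             # kernel centers
    return centers .+ randn(rng, n_new, d) .* h'       # Gaussian perturbation
end
```

In this sketch, `s = 0` collapses the kernel so that every synthetic point is an exact copy of an existing one (plain random oversampling), while larger values of `s` spread the synthetic points further from the observed data.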

Returns

  • Xover: A matrix or table, matching the type of the input X, that includes the original data and the new observations due to oversampling

  • yover: An abstract vector of labels corresponding to Xover

Example

using Imbalance

# set probability of each class
class_probs = [0.5, 0.2, 0.3]                         
num_rows, num_continuous_feats = 100, 5
# generate a table and categorical vector accordingly
X, y = generate_imbalanced_data(num_rows, num_continuous_feats; 
                                class_probs, rng=42)  

julia> Imbalance.checkbalance(y)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (39.6%) 
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 33 (68.8%) 
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%) 

# apply ROSE
Xover, yover = rose(X, y; s=0.3, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)

julia> Imbalance.checkbalance(yover)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 38 (79.2%) 
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 43 (89.6%) 
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%) 

MLJ Model Interface

Simply pass the keyword arguments when constructing the ROSE model and pass the positional arguments X, y to the transform method.

using MLJ
ROSE = @load ROSE pkg=Imbalance

# Wrap the model in a machine
oversampler = ROSE(s=0.3, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
mach = machine(oversampler)

# Provide the data to transform (there is nothing to fit)
Xover, yover = transform(mach, X, y)

You can read more about this MLJ interface by accessing it from MLJ's model browser.

TableTransforms Interface

This interface assumes that the input is a single table Xy in which y is one of the columns. Hence, an integer y_ind must be passed to the constructor to indicate which column holds y, followed by the other keyword arguments. Only Xy is provided when applying the transform.

using Imbalance
using Imbalance.TableTransforms

# Generate imbalanced data
num_rows = 200
num_features = 5
y_ind = 3
Xy, _ = generate_imbalanced_data(num_rows, num_features; 
                                 class_probs=[0.5, 0.2, 0.3], insert_y=y_ind, rng=42)

# Initiate ROSE model
oversampler = ROSE(y_ind; s=0.3, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
Xyover = Xy |> oversampler                              
# equivalently, using the TableTransforms API directly
Xyover, cache = TableTransforms.apply(oversampler, Xy)

The reapply(oversampler, Xy, cache) method from TableTransforms simply falls back to apply(oversampler, Xy), and revert(oversampler, Xy, cache) reverts the transform by removing the oversampled observations from the table.

Illustration

A full basic example along with an animation can be found here. You may find more practical examples in the tutorial section which also explains running code on Google Colab.

References

[1] G. Menardi, N. Torelli, “Training and assessing classification rules with imbalanced data,” Data Mining and Knowledge Discovery, 28(1), pp. 92-122, 2014.


SMOTE

Imbalance.smote - Function
smote(
    X, y;
    k=5, ratios=1.0, rng=default_rng(),
    try_preserve_type=true
)

Description

Oversamples a dataset using the SMOTE (Synthetic Minority Oversampling Technique) algorithm to correct for class imbalance, as presented in [1].

Positional Arguments

  • X: A matrix or table of floats where each row is an observation from the dataset

  • y: An abstract vector of labels (e.g., strings) that correspond to the observations in X

Keyword Arguments

  • k::Integer=5: Number of nearest neighbors to consider in the algorithm. Should be within the range 0 < k < n where n is the number of observations in the smallest class.
  • ratios=1.0: A parameter that controls the amount of oversampling to be done for each class
    • Can be a float and in this case each class will be oversampled to the size of the majority class times the float. By default, all classes are oversampled to the size of the majority class
    • Can be a dictionary mapping each class label to the float ratio for that class
  • rng::Union{AbstractRNG, Integer}=default_rng(): Either an AbstractRNG object or an Integer seed. A seed is used with Xoshiro if the Julia VERSION supports it; otherwise, MersenneTwister is used.
  • try_preserve_type::Bool=true: When true, the function will try to not change the type of the input table (e.g., DataFrame). However, for some tables, this may not succeed, and in this case, the table returned will be a column table (named-tuple of vectors). This parameter is ignored if the input is a matrix.
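The role of `k` is easiest to see in a sketch of the interpolation step described in [1]: pick a random point of the class, pick one of its `k` nearest neighbors within the same class, and place the synthetic point somewhere on the segment between them. `smote_sample` is a hypothetical helper that uses a brute-force neighbor search; it is not the package's implementation.

```julia
using Random, LinearAlgebra

# Hypothetical helper, not part of Imbalance.jl. `Xc` holds the observations
# of ONE class (rows) with continuous features (columns).
function smote_sample(Xc::AbstractMatrix{<:Real}, n_new::Integer;
                      k::Integer=5, rng=Random.default_rng())
    n, d = size(Xc)
    0 < k < n || throw(ArgumentError("k must satisfy 0 < k < n"))
    new = Matrix{Float64}(undef, n_new, d)
    for i in 1:n_new
        p = rand(rng, 1:n)                            # random point of the class
        dists = [norm(Xc[p, :] .- Xc[q, :]) for q in 1:n]
        dists[p] = Inf                                # exclude the point itself
        neighbors = partialsortperm(dists, 1:k)       # its k nearest neighbors
        q = rand(rng, neighbors)                      # one neighbor at random
        lambda = rand(rng)                            # position on the segment
        new[i, :] = Xc[p, :] .+ lambda .* (Xc[q, :] .- Xc[p, :])
    end
    return new
end
```

The check `0 < k < n` mirrors the range stated for `k`: a point cannot have `k` neighbors within its class unless the class has more than `k` observations.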

Returns

  • Xover: A matrix or table, matching the type of the input X, that includes the original data and the new observations due to oversampling

  • yover: An abstract vector of labels corresponding to Xover

Example

using Imbalance

# set probability of each class
class_probs = [0.5, 0.2, 0.3]                         
num_rows, num_continuous_feats = 100, 5
# generate a table and categorical vector accordingly
X, y = generate_imbalanced_data(num_rows, num_continuous_feats; 
                                class_probs, rng=42)    

julia> Imbalance.checkbalance(y)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (39.6%) 
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 33 (68.8%) 
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%) 

# apply SMOTE
Xover, yover = smote(X, y; k = 5, ratios = Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng = 42)

julia> Imbalance.checkbalance(yover)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 38 (79.2%) 
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 43 (89.6%) 
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%) 

MLJ Model Interface

Simply pass the keyword arguments when constructing the SMOTE model and pass the positional arguments X, y to the transform method.

using MLJ
SMOTE = @load SMOTE pkg=Imbalance

# Wrap the model in a machine
oversampler = SMOTE(k=5, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
mach = machine(oversampler)

# Provide the data to transform (there is nothing to fit)
Xover, yover = transform(mach, X, y)

You can read more about this MLJ interface by accessing it from MLJ's model browser.

TableTransforms Interface

This interface assumes that the input is a single table Xy in which y is one of the columns. Hence, an integer y_ind must be passed to the constructor to indicate which column holds y, followed by the other keyword arguments. Only Xy is provided when applying the transform.

using Imbalance
using Imbalance.TableTransforms

# Generate imbalanced data
num_rows = 200
num_features = 5
y_ind = 3
Xy, _ = generate_imbalanced_data(num_rows, num_features; 
                                 class_probs=[0.5, 0.2, 0.3], insert_y=y_ind, rng=42)

# Initiate SMOTE model
oversampler = SMOTE(y_ind; k=5, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
Xyover = Xy |> oversampler                              
# equivalently, using the TableTransforms API directly
Xyover, cache = TableTransforms.apply(oversampler, Xy)

The reapply(oversampler, Xy, cache) method from TableTransforms simply falls back to apply(oversampler, Xy), and revert(oversampler, Xy, cache) reverts the transform by removing the oversampled observations from the table.

Illustration

A full basic example along with an animation can be found here. You may find more practical examples in the tutorial section which also explains running code on Google Colab.

References

[1] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, 16, 321-357, 2002.


Borderline SMOTE1

Imbalance.borderline_smote1 - Function
borderline_smote1(
    X, y;
    m=5, k=5, ratios=1.0, rng=default_rng(),
    try_preserve_type=true, verbosity=1
)

Description

Oversamples a dataset using the Borderline SMOTE1 algorithm to correct for class imbalance, as presented in [1].

Positional Arguments

  • X: A matrix or table of floats where each row is an observation from the dataset

  • y: An abstract vector of labels (e.g., strings) that correspond to the observations in X

Keyword Arguments

  • m::Integer=5: The number of neighbors to consider while checking the BorderlineSMOTE1 condition. Should be within the range 0 < m < N where N is the number of observations in the data. It will be automatically set to N-1 if N ≤ m.

  • k::Integer=5: Number of nearest neighbors to consider in the SMOTE part of the algorithm. Should be within the range 0 < k < n where n is the number of observations in the smallest class. It will be automatically set to l-1 for any class with l points where l ≤ k.

  • ratios=1.0: A parameter that controls the amount of oversampling to be done for each class

    • Can be a float and in this case each class will be oversampled to the size of the majority class times the float. By default, all classes are oversampled to the size of the majority class
    • Can be a dictionary mapping each class label to the float ratio for that class
  • rng::Union{AbstractRNG, Integer}=default_rng(): Either an AbstractRNG object or an Integer seed. A seed is used with Xoshiro if the Julia VERSION supports it; otherwise, MersenneTwister is used.
  • try_preserve_type::Bool=true: When true, the function will try to not change the type of the input table (e.g., DataFrame). However, for some tables, this may not succeed, and in this case, the table returned will be a column table (named-tuple of vectors). This parameter is ignored if the input is a matrix.
  • verbosity::Integer=1: When greater than 0, information about the points that will participate in oversampling is logged.
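The condition controlled by `m` can be sketched as follows. In [1], a point of the class being oversampled is treated as borderline ("in danger") when at least half, but not all, of its `m` nearest neighbors in the whole dataset belong to other classes; only such points take part in the SMOTE step. `borderline_indices` is a hypothetical helper with a brute-force neighbor search, not the package's implementation.

```julia
using LinearAlgebra

# Hypothetical helper, not part of Imbalance.jl: row indices of the points of
# class `label` that satisfy the borderline condition.
function borderline_indices(X::AbstractMatrix{<:Real}, y::AbstractVector, label;
                            m::Integer=5)
    N = size(X, 1)
    m = min(m, N - 1)                      # same fallback as stated for `m`
    danger = Int[]
    for p in findall(==(label), y)
        dists = [norm(X[p, :] .- X[q, :]) for q in 1:N]
        dists[p] = Inf                     # exclude the point itself
        nn = partialsortperm(dists, 1:m)   # m nearest neighbors in ALL the data
        n_other = count(q -> y[q] != label, nn)
        if m / 2 <= n_other < m            # mostly, but not only, other classes
            push!(danger, p)
        end
    end
    return danger
end
```

Points whose neighbors all belong to other classes are regarded as noise and points with mostly same-class neighbors as safe, so neither group is returned.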

Returns

  • Xover: A matrix or table, matching the type of the input X, that includes the original data and the new observations due to oversampling

  • yover: An abstract vector of labels corresponding to Xover

Example

using Imbalance

# set probability of each class
class_probs = [0.5, 0.2, 0.3]                         
num_rows, num_continuous_feats = 1000, 5
# generate a table and categorical vector accordingly
X, y = generate_imbalanced_data(num_rows, num_continuous_feats; 
                                stds=[0.1 0.1 0.1], min_sep=0.01, class_probs, rng=42)            

julia> Imbalance.checkbalance(y)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 200 (40.8%) 
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 310 (63.3%) 
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 490 (100.0%) 

# apply BorderlineSMOTE1
Xover, yover = borderline_smote1(X, y; m = 3, 
               k = 5, ratios = Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng = 42)

julia> Imbalance.checkbalance(yover)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 392 (80.0%) 
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 441 (90.0%) 
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 490 (100.0%) 

MLJ Model Interface

Simply pass the keyword arguments when constructing the BorderlineSMOTE1 model and pass the positional arguments X, y to the transform method.

using MLJ
BorderlineSMOTE1 = @load BorderlineSMOTE1 pkg=Imbalance

# Wrap the model in a machine
oversampler = BorderlineSMOTE1(m=3, k=5, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
mach = machine(oversampler)

# Provide the data to transform (there is nothing to fit)
Xover, yover = transform(mach, X, y)

You can read more about this MLJ interface by accessing it from MLJ's model browser.

TableTransforms Interface

This interface assumes that the input is a single table Xy in which y is one of the columns. Hence, an integer y_ind must be passed to the constructor to indicate which column holds y, followed by the other keyword arguments. Only Xy is provided when applying the transform.

using Imbalance
using Imbalance.TableTransforms

# Generate imbalanced data
num_rows = 1000
num_features = 5
y_ind = 3
Xy, _ = generate_imbalanced_data(num_rows, num_features; 
                                 class_probs=[0.5, 0.2, 0.3], min_sep=0.01, insert_y=y_ind, rng=42)

# Initiate BorderlineSMOTE1 Oversampler model
oversampler = BorderlineSMOTE1(y_ind; m=3, k=5, 
              ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
Xyover = Xy |> oversampler                              
# equivalently, using the TableTransforms API directly
Xyover, cache = TableTransforms.apply(oversampler, Xy)

The reapply(oversampler, Xy, cache) method from TableTransforms simply falls back to apply(oversampler, Xy), and revert(oversampler, Xy, cache) reverts the transform by removing the oversampled observations from the table.

Illustration

A full basic example along with an animation can be found here. You may find more practical examples in the tutorial section which also explains running code on Google Colab.

References

[1] Han, H., Wang, W.-Y., & Mao, B.-H. (2005). Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In D.S. Huang, X.-P. Zhang, & G.-B. Huang (Eds.), Advances in Intelligent Computing (pp. 878-887). Springer.


SMOTE-N

Imbalance.smoten - Function
smoten(
    X, y;
    k=5, ratios=1.0, rng=default_rng(),
    try_preserve_type=true
)

Description

Oversamples a dataset using the SMOTE-N (Synthetic Minority Oversampling Technique-Nominal) algorithm to correct for class imbalance, as presented in [1]. This is a variant of SMOTE for datasets where all features are nominal.
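Since nominal values cannot be interpolated, [1] builds each synthetic point by a per-feature majority vote among the nearest neighbors of a point. The sketch below illustrates that voting step. `vote` and `smoten_point` are hypothetical helpers, and the plain Hamming distance used to find neighbors is a simplification swapped in for the value difference metric described in [1].

```julia
# Hypothetical helpers, not part of Imbalance.jl.

# Most frequent value in `v`; ties go to the value seen first.
function vote(v::AbstractVector)
    counts = Dict{eltype(v),Int}()
    for x in v
        counts[x] = get(counts, x, 0) + 1
    end
    best = first(v)
    for x in v
        if counts[x] > counts[best]
            best = x
        end
    end
    return best
end

# Synthetic point generated from row `p` of `Xc`, where `Xc` holds the
# observations of ONE class (rows) with nominal features (columns).
function smoten_point(Xc::AbstractMatrix, p::Integer; k::Integer=5)
    n, d = size(Xc)
    0 < k < n || throw(ArgumentError("k must satisfy 0 < k < n"))
    # Hamming distance: number of features on which two rows disagree
    dists = [count(j -> Xc[p, j] != Xc[q, j], 1:d) for q in 1:n]
    dists[p] = typemax(Int)                    # exclude the point itself
    neighbors = partialsortperm(dists, 1:k)    # k nearest rows
    return [vote(Xc[neighbors, j]) for j in 1:d]
end
```

Every value in a synthetic point therefore already occurs in that feature among the neighbors, so no new category is ever invented.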

Positional Arguments

  • X: A matrix of integers or a table with element scitypes that subtype Finite. That is, for table inputs each column should have either OrderedFactor or Multiclass as the element scitype.

  • y: An abstract vector of labels (e.g., strings) that correspond to the observations in X

Keyword Arguments

  • k::Integer=5: Number of nearest neighbors to consider in the algorithm. Should be within the range 0 < k < n where n is the number of observations in the smallest class.
  • ratios=1.0: A parameter that controls the amount of oversampling to be done for each class
    • Can be a float and in this case each class will be oversampled to the size of the majority class times the float. By default, all classes are oversampled to the size of the majority class
    • Can be a dictionary mapping each class label to the float ratio for that class
  • rng::Union{AbstractRNG, Integer}=default_rng(): Either an AbstractRNG object or an Integer seed. A seed is used with Xoshiro if the Julia VERSION supports it; otherwise, MersenneTwister is used.
  • try_preserve_type::Bool=true: When true, the function will try to not change the type of the input table (e.g., DataFrame). However, for some tables, this may not succeed, and in this case, the table returned will be a column table (named-tuple of vectors). This parameter is ignored if the input is a matrix.

Returns

  • Xover: A matrix or table, matching the type of the input X, that includes the original data and the new observations due to oversampling

  • yover: An abstract vector of labels corresponding to Xover

Example

using Imbalance
using ScientificTypes

# set probability of each class
class_probs = [0.5, 0.2, 0.3]                         
num_rows = 100
num_continuous_feats = 0
# want two categorical features with three and two possible values respectively
num_vals_per_category = [3, 2]

# generate a table and categorical vector accordingly
X, y = generate_imbalanced_data(num_rows, num_continuous_feats; 
                                class_probs, num_vals_per_category, rng=42)                      
julia> Imbalance.checkbalance(y)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (39.6%) 
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 33 (68.8%) 
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%) 

julia> ScientificTypes.schema(X).scitypes
(Count, Count)

# coerce to a finite scitype (multiclass or ordered factor)
X = coerce(X, autotype(X, :few_to_finite))

# apply SMOTEN
Xover, yover = smoten(X, y; k=5, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)

julia> Imbalance.checkbalance(yover)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 38 (79.2%) 
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 43 (89.6%) 
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%) 

MLJ Model Interface

Simply pass the keyword arguments when constructing the SMOTEN model and pass the positional arguments X, y to the transform method.

using MLJ
SMOTEN = @load SMOTEN pkg=Imbalance

# Wrap the model in a machine
oversampler = SMOTEN(k=5, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
mach = machine(oversampler)

# Provide the data to transform (there is nothing to fit)
Xover, yover = transform(mach, X, y)

You can read more about this MLJ interface by accessing it from MLJ's model browser.

TableTransforms Interface

This interface assumes that the input is a single table Xy in which y is one of the columns. Hence, an integer y_ind must be passed to the constructor to indicate which column holds y, followed by the other keyword arguments. Only Xy is provided when applying the transform.

using Imbalance
using ScientificTypes
using Imbalance.TableTransforms

# Generate imbalanced data
num_rows = 100
num_continuous_feats = 0
y_ind = 2
# generate a table and categorical vector accordingly
Xy, _ = generate_imbalanced_data(num_rows, num_continuous_feats; insert_y=y_ind,
                                class_probs= [0.5, 0.2, 0.3], num_vals_per_category=[3, 2],
                                 rng=42)  

# Table must have only finite scitypes                                
Xy = coerce(Xy, :Column1=>Multiclass, :Column2=>Multiclass, :Column3=>Multiclass)

# Initiate SMOTEN model
oversampler = SMOTEN(y_ind; k=5, ratios=Dict(1=>1.0, 2=> 0.9, 3=>0.9), rng=42)
Xyover = Xy |> oversampler                               
# equivalently, using the TableTransforms API directly
Xyover, cache = TableTransforms.apply(oversampler, Xy)

The reapply(oversampler, Xy, cache) method from TableTransforms simply falls back to apply(oversampler, Xy), and revert(oversampler, Xy, cache) reverts the transform by removing the oversampled observations from the table.

Illustration

A full basic example can be found here. You may find more practical examples in the tutorial section which also explains running code on Google Colab.

References

[1] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, 16, 321-357, 2002.


SMOTE-NC

Imbalance.smotenc - Function
smotenc(
    X, y, cat_inds;
    k=5, ratios=1.0, knn_tree="Brute", rng=default_rng(),
    try_preserve_type=true
)

Description

Oversamples a dataset using the SMOTE-NC (Synthetic Minority Oversampling Technique-Nominal Continuous) algorithm to correct for class imbalance, as presented in [1]. This is a variant of SMOTE for datasets with both nominal and continuous features.

SMOTE-NC Assumes Continuous Features Exist

SMOTE-NC will not work if the dataset is purely nominal. In that case, refer to SMOTE-N instead. Meanwhile, if the dataset is purely continuous then it's equivalent to the standard SMOTE.
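The reason continuous features are needed shows up in the distance used to find neighbors. In the SMOTE paper by Chawla et al., each nominal feature on which two points disagree contributes a fixed penalty equal to the median of the standard deviations of the continuous features in the class, which is undefined when there are no continuous features. The helpers below, `median_std` and `smotenc_distance`, are hypothetical and sketch that distance only; they are not the package's implementation.

```julia
using Statistics

# Hypothetical helpers, not part of Imbalance.jl.

# Median of the per-feature standard deviations of the continuous columns
# of one class (rows are observations).
median_std(Xcont::AbstractMatrix{<:Real}) = median(vec(std(Xcont; dims=1)))

# Mixed distance between observations `x` and `z`; `cat_inds` lists the
# positions of the nominal features and `med` is the penalty from `median_std`.
function smotenc_distance(x::AbstractVector, z::AbstractVector, cat_inds, med::Real)
    total = 0.0
    for j in eachindex(x)
        if j in cat_inds
            total += x[j] != z[j] ? med^2 : 0.0   # fixed penalty per mismatch
        else
            total += (x[j] - z[j])^2              # usual squared difference
        end
    end
    return sqrt(total)
end
```

With no nominal features the penalty never applies and the distance reduces to the Euclidean distance, which is consistent with the method being equivalent to standard SMOTE on purely continuous data.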

Positional Arguments

  • X: A matrix of floats or a table with element scitypes that subtype Union{Finite, Infinite}. Elements in nominal columns should subtype Finite (i.e., have scitype OrderedFactor or Multiclass) and elements in continuous columns should subtype Infinite (i.e., have scitype Count or Continuous).

  • y: An abstract vector of labels (e.g., strings) that correspond to the observations in X

  • cat_inds::AbstractVector{<:Int}: A vector of the indices of the nominal features. Supplied only if X is a matrix. Otherwise, they are inferred from the table's scitypes.

Keyword Arguments

  • k::Integer=5: Number of nearest neighbors to consider in the algorithm. Should be within the range 0 < k < n where n is the number of observations in the smallest class.
  • ratios=1.0: A parameter that controls the amount of oversampling to be done for each class
    • Can be a float and in this case each class will be oversampled to the size of the majority class times the float. By default, all classes are oversampled to the size of the majority class
    • Can be a dictionary mapping each class label to the float ratio for that class
  • knn_tree: Decides the tree used in KNN computations. Either "Brute" or "Ball". BallTree can be much faster but may lead to inaccurate results.

  • rng::Union{AbstractRNG, Integer}=default_rng(): Either an AbstractRNG object or an Integer seed to be used with Xoshiro if the Julia VERSION supports it. Otherwise, uses MersenneTwister.

  • try_preserve_type::Bool=true: When true, the function will try to not change the type of the input table (e.g., DataFrame). However, for some tables, this may not succeed, and in this case, the table returned will be a column table (named-tuple of vectors). This parameter is ignored if the input is a matrix.
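As a rough illustration of how `ratios` translates into class sizes, the sketch below computes the target count for each class as the majority class size times that class's ratio. The helper `target_counts` is a hypothetical name, and both the rounding to the nearest integer and the choice to leave a class untouched when it already meets its target are assumptions made for this sketch, not statements about the package's internals.

```julia
# Hypothetical helper: target size per class given `ratios` (a number or a Dict).
# Assumptions: targets are rounded to the nearest integer, and a class that is
# already at or above its target is left unchanged.
function target_counts(counts::Dict, ratios)
    majority = maximum(values(counts))
    ratio_of(label) = ratios isa AbstractDict ? ratios[label] : ratios
    return Dict(label => max(n, round(Int, ratio_of(label) * majority))
                for (label, n) in counts)
end

counts = Dict(0 => 48, 1 => 19, 2 => 33)        # class sizes before oversampling
targets = target_counts(counts, Dict(0 => 1.0, 1 => 0.9, 2 => 0.8))
to_generate = Dict(label => targets[label] - counts[label] for label in keys(counts))
```

With these counts, class 1 would grow from 19 to 43 observations and class 2 from 33 to 38, while the majority class 0 stays at 48. Passing a single float such as `1.0` instead of a dictionary applies the same ratio to every class.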

Returns

  • Xover: A matrix or table, depending on whether the input X is a matrix or a table respectively, that includes the original data and the new observations generated by oversampling.

  • yover: An abstract vector of labels corresponding to Xover

Example

using Imbalance
using ScientificTypes

# set probability of each class
class_probs = [0.5, 0.2, 0.3]                         
num_rows = 100
num_continuous_feats = 3
# want two categorical features with three and two possible values respectively
num_vals_per_category = [3, 2]

# generate a table and categorical vector accordingly
X, y = generate_imbalanced_data(num_rows, num_continuous_feats; 
                                class_probs, num_vals_per_category, rng=42)                      
julia> Imbalance.checkbalance(y)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (39.6%) 
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 33 (68.8%) 
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%) 

julia> ScientificTypes.schema(X).scitypes
(Continuous, Continuous, Continuous, Continuous, Continuous)
# coerce nominal columns to a finite scitype (multiclass or ordered factor)
X = coerce(X, :Column4=>Multiclass, :Column5=>Multiclass)

# apply SMOTE-NC
Xover, yover = smotenc(X, y; k = 5, ratios = Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng = 42)

julia> Imbalance.checkbalance(yover)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 38 (79.2%) 
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 43 (89.6%) 
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%) 

MLJ Model Interface

Pass the keyword arguments when instantiating the SMOTENC model, then pass the positional arguments (excluding cat_inds) to the transform method.

using MLJ
SMOTENC = @load SMOTENC pkg=Imbalance

# Wrap the model in a machine
oversampler = SMOTENC(k=5, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
mach = machine(oversampler)

# Provide the data to transform (there is nothing to fit)
Xover, yover = transform(mach, X, y)

You can read more about this MLJ interface by accessing it from MLJ's model browser. Note that only Table input is supported by the MLJ interface for this method.

TableTransforms Interface

This interface assumes that the input is a single table Xy in which y is one of the columns. Hence, the constructor takes an integer y_ind that identifies the column holding y, followed by the other keyword arguments. Only Xy is provided when applying the transform.

using Imbalance
using ScientificTypes
using Imbalance.TableTransforms

# Generate imbalanced data
num_rows = 100
num_continuous_feats = 3
y_ind = 2
# generate a table and categorical vector accordingly
Xy, _ = generate_imbalanced_data(num_rows, num_continuous_feats; insert_y=y_ind,
                                class_probs= [0.5, 0.2, 0.3], num_vals_per_category=[3, 2],
                                 rng=42)  

# Table must have only finite or continuous scitypes                                
Xy = coerce(Xy, :Column2=>Multiclass, :Column5=>Multiclass, :Column6=>Multiclass)

# Instantiate the SMOTENC model (the generated class labels are 0, 1 and 2)
oversampler = SMOTENC(y_ind; k=5, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.9), rng=42)
Xyover = Xy |> oversampler
# equivalently, if TableTransforms is used
Xyover, cache = TableTransforms.apply(oversampler, Xy)
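Conceptually, y_ind tells the interface which column of Xy to treat as the target and which to treat as features. The sketch below shows that split on a plain column table (a named tuple of vectors). The helper `split_target` is a hypothetical name used only for illustration and is not a function provided by Imbalance or TableTransforms.

```julia
# Hypothetical helper: split a column table (NamedTuple of vectors) into
# features and target using the integer position of the target column.
function split_target(Xy::NamedTuple, y_ind::Integer)
    colnames = keys(Xy)
    y = Xy[y_ind]                                   # target column by position
    feature_names = Tuple(n for (i, n) in enumerate(colnames) if i != y_ind)
    X = NamedTuple{feature_names}(Xy)               # keep every other column
    return X, y
end

Xy = (Column1 = [0.1, 0.2, 0.3], Column2 = [0, 1, 0], Column3 = ["a", "b", "a"])
X, y = split_target(Xy, 2)
```

After the call, `y` holds the values of `Column2` and `X` keeps `Column1` and `Column3` in their original order.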

Illustration

A full basic example along with an animation can be found here. You may find more practical examples in the tutorial section which also explains running code on Google Colab.

References

[1] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.

source