TomekUndersampler

Instantiate a Tomek undersampling model with the given hyper-parameters.

TomekUndersampler

A model type for constructing a Tomek undersampler, based on Imbalance.jl, and implementing the MLJ model interface.

From MLJ, the type can be imported using

TomekUndersampler = @load TomekUndersampler pkg=Imbalance

Do model = TomekUndersampler() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in TomekUndersampler(min_ratios=...).

TomekUndersampler undersamples by removing any point that is part of a Tomek link in the data. Tomek links are defined in Ivan Tomek, "Two modifications of CNN," IEEE Trans. Systems, Man and Cybernetics, 6:769–772, 1976.
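
To make the notion concrete: a Tomek link is commonly described as a pair of points from different classes that are each other's nearest neighbour. The following is a minimal brute-force sketch of that definition, not the implementation used by Imbalance.jl. The helper names `nearest_neighbor` and `tomek_links` are hypothetical, and Euclidean distance is an assumption of the sketch.

```julia
using LinearAlgebra

# Hypothetical helper: index of the nearest neighbour of row i in X (excluding i itself),
# using Euclidean distance (an assumption of this sketch).
function nearest_neighbor(X::AbstractMatrix, i::Integer)
    best_j, best_d = 0, Inf
    for j in axes(X, 1)
        j == i && continue
        d = norm(@view(X[i, :]) .- @view(X[j, :]))
        if d < best_d
            best_j, best_d = j, d
        end
    end
    return best_j
end

# Hypothetical helper: all pairs (i, j) with i < j that are mutual nearest
# neighbours and carry different labels.
function tomek_links(X::AbstractMatrix, y::AbstractVector)
    nn = [nearest_neighbor(X, i) for i in axes(X, 1)]
    return [(i, nn[i]) for i in axes(X, 1)
            if i < nn[i] && nn[nn[i]] == i && y[i] != y[nn[i]]]
end

# Toy data: rows 1 and 2 are close together but belong to different classes.
X = [0.0 0.0; 0.1 0.0; 5.0 5.0; 5.2 5.0; 9.0 9.0]
y = ["a", "b", "a", "a", "b"]

links = tomek_links(X, y)

# Row indices that belong to at least one link; these are the removal candidates.
to_drop = sort(unique(collect(Iterators.flatten(links))))
```

In this toy data, rows 3 and 4 are also mutual nearest neighbours, but they share a label and therefore do not form a link. The sketch costs O(n²) distance computations, which is fine for illustration only.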

Training data

In MLJ or MLJBase, wrap the model in a machine by mach = machine(model)

There is no need to provide any data here because the model is a static transformer.

Likewise, there is no need to fit!(mach).

For default values of the hyper-parameters, model can be constructed by model = TomekUndersampler()

Hyperparameters

  • min_ratios=1.0: A parameter that controls the maximum amount of undersampling to be done for each class. If this algorithm cleans the data to an extent that this is violated, some of the cleaned points will be revived randomly so that it is satisfied.

    • Can be a float, in which case no class will be undersampled below the size of the minority class times the float. By default (1.0), no class is undersampled below the size of the minority class
    • Can be a dictionary mapping each class label to the float minimum ratio for that class
  • force_min_ratios=false: If true, and this algorithm cleans the data such that the ratios for each class exceed those specified in min_ratios, then further undersampling will be performed so that the final ratios are equal to min_ratios.

  • rng::Union{AbstractRNG, Integer}=default_rng(): Either an AbstractRNG object or an Integer seed to be used with Xoshiro if the Julia VERSION supports it. Otherwise, MersenneTwister is used.

  • try_preserve_type::Bool=true: When true, the function will try to not change the type of the input table (e.g., DataFrame). However, for some tables, this may not succeed, and in this case, the table returned will be a column table (named-tuple of vectors). This parameter is ignored if the input is a matrix.
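
The two accepted forms of min_ratios can be illustrated with a little arithmetic. The sketch below computes, for each class, the smallest size the class may be reduced to, using the class counts from the Example section. The helper `min_allowed` is hypothetical, and rounding up is an assumption of the sketch; the rounding actually used by Imbalance.jl may differ.

```julia
# Class counts before cleaning (matching the Example section of this document).
counts = Dict(0 => 48, 1 => 19, 2 => 33)

# Hypothetical helper: lower bound on each class size implied by min_ratios.
# `ratios` is either a single float or a dictionary keyed by class label.
function min_allowed(counts::AbstractDict, ratios)
    n_min = minimum(values(counts))  # size of the minority class
    ratio_of(c) = ratios isa AbstractDict ? ratios[c] : ratios
    # Rounding up is an assumption of this sketch.
    return Dict(c => ceil(Int, ratio_of(c) * n_min) for c in keys(counts))
end

floors_scalar = min_allowed(counts, 1.0)
floors_dict = min_allowed(counts, Dict(0 => 2.0, 1 => 1.0, 2 => 1.5))
```

Under these assumptions the scalar form gives a floor of 19 for every class, while the dictionary form gives 38, 19 and 29 for classes 0, 1 and 2. If cleaning were to push a class below its floor, some removed points would be revived at random, as described for min_ratios above.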

Transform Inputs

  • X: A matrix or table of floats where each row is an observation from the dataset
  • y: An abstract vector of labels (e.g., strings) that correspond to the observations in X

Transform Outputs

  • X_under: A matrix or table containing the data after undersampling; it is a matrix if the input X is a matrix and a table if X is a table
  • y_under: An abstract vector of labels corresponding to X_under

Operations

  • transform(mach, X, y): resample the data X and y using TomekUndersampler, returning the undersampled versions X_under and y_under

Example

using MLJ
import Imbalance

## set probability of each class
class_probs = [0.5, 0.2, 0.3]                         
num_rows, num_continuous_feats = 100, 5
## generate a table and categorical vector accordingly
X, y = Imbalance.generate_imbalanced_data(num_rows, num_continuous_feats; 
                                min_sep=0.01, stds=[3.0 3.0 3.0], class_probs, rng=42)   

julia> Imbalance.checkbalance(y; ref="minority")
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (100.0%) 
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 33 (173.7%) 
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (252.6%) 

## load TomekUndersampler model type:
TomekUndersampler = @load TomekUndersampler pkg=Imbalance

## Undersample the majority classes to sizes relative to the minority class:
tomek_undersampler = TomekUndersampler(min_ratios=1.0, rng=42)
mach = machine(tomek_undersampler)
X_under, y_under = transform(mach, X, y)

julia> Imbalance.checkbalance(y_under; ref="minority")
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (100.0%) 
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 22 (115.8%) 
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 36 (189.5%)