RandomWalkOversampler

Initiate a RandomWalkOversampler model with the given hyper-parameters.

A model type for constructing a random walk oversampler, based on Imbalance.jl, and implementing the MLJ model interface.

From MLJ, the type can be imported using

RandomWalkOversampler = @load RandomWalkOversampler pkg=Imbalance

Do model = RandomWalkOversampler() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in RandomWalkOversampler(ratios=...).

RandomWalkOversampler implements the random walk oversampling algorithm to correct for class imbalance as in Zhang, H., & Li, M. (2014). RWO-Sampling: A random walk over-sampling approach to imbalanced data classification. Information Fusion, 25, 4-20.
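As a rough illustration of the idea, the sketch below generates synthetic values for one class, one feature at a time. It assumes the update rule commonly attributed to the cited paper: a continuous value x becomes x - (σ/√n)·r, where σ is the feature's standard deviation within the class, n is the class size, and r is a standard normal draw, while nominal values are drawn in proportion to their observed frequency. The helper names walk_continuous and walk_nominal and the sample data are hypothetical; this is not the Imbalance.jl implementation.

```julia
using Random, Statistics

# Hypothetical helpers for illustration only; not the Imbalance.jl implementation.

# Continuous feature: start from randomly chosen existing values of the class
# and take a Gaussian step scaled by σ/√n (assumed update rule, see lead-in).
function walk_continuous(rng::AbstractRNG, x::AbstractVector{<:Real}, n_new::Integer)
    n = length(x)
    σ = std(x)                       # sample standard deviation within the class
    base = rand(rng, x, n_new)       # existing values to walk from
    return base .- (σ / sqrt(n)) .* randn(rng, n_new)
end

# Nominal feature: drawing uniformly from the observed values samples each
# category in proportion to its frequency within the class.
walk_nominal(rng::AbstractRNG, x::AbstractVector, n_new::Integer) = rand(rng, x, n_new)

rng = MersenneTwister(42)
heights = [1.62, 1.70, 1.75, 1.68, 1.81]   # made-up values for one minority class
colors  = ["red", "red", "blue", "red"]     # made-up nominal values
new_heights = walk_continuous(rng, heights, 3)
new_colors  = walk_nominal(rng, colors, 3)
```

Because the step size shrinks with √n, synthetic continuous values stay close to existing ones for larger classes, which is why the nominal and continuous columns need to be distinguished by scitype before transforming.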

Training data

In MLJ or MLJBase, wrap the model in a machine by

mach = machine(model)

There is no need to provide any data here because the model is a static transformer.

Likewise, there is no need to fit!(mach).

With default hyper-parameter values, the model can be constructed by

model = RandomWalkOversampler()

Hyperparameters

  • ratios=1.0: A parameter that controls the amount of oversampling to be done for each class

    • Can be a float, in which case each class is oversampled to the size of the majority class times the float. By default, all classes are oversampled to the size of the majority class.
    • Can be a dictionary mapping each class label to the float ratio for that class
  • rng::Union{AbstractRNG, Integer}=default_rng(): Either an AbstractRNG object or an Integer seed to be used with Xoshiro if the Julia VERSION supports it. Otherwise, uses MersenneTwister.
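To make the two forms of ratios concrete, the sketch below computes the class sizes that each form would aim for, using the class counts from the Example in this document (48, 19 and 33). The helper target_counts is hypothetical and not part of Imbalance.jl; rounding to the nearest integer and never shrinking a class are assumptions made for illustration.

```julia
# Hypothetical helper for illustration; not part of Imbalance.jl.
# `counts` maps each class label to its current number of observations.
function target_counts(counts::AbstractDict, ratios)
    majority = maximum(values(counts))
    targets = Dict{keytype(counts), Int}()
    for (label, n) in counts
        r = ratios isa AbstractDict ? ratios[label] : ratios
        # Assumption: round to the nearest integer and never drop below the
        # current size, since oversampling only adds observations.
        targets[label] = max(n, round(Int, r * majority))
    end
    return targets
end

counts = Dict(0 => 48, 1 => 19, 2 => 33)
all_equal = target_counts(counts, 1.0)
per_class = target_counts(counts, Dict(0 => 1.0, 1 => 0.9, 2 => 0.8))
```

With a float ratio of 1.0 every class is targeted at 48 observations, while the dictionary form targets 43 and 38 observations for classes 1 and 2, matching the balance reported in the Example.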

Transform Inputs

  • X: A table with element scitypes that subtype Union{Finite, Infinite}. Elements in nominal columns should subtype Finite (i.e., have scitype OrderedFactor or Multiclass) and elements in continuous columns should subtype Infinite (i.e., have scitype Count or Continuous; see https://juliaai.github.io/ScientificTypes.jl/).
  • y: An abstract vector of labels (e.g., strings) that correspond to the observations in X

Transform Outputs

  • Xover: A matrix or table, depending on whether the input X is a matrix or table respectively, that includes the original data and the new observations due to oversampling
  • yover: An abstract vector of labels corresponding to Xover

Operations

  • transform(mach, X, y): resample the data X and y using RandomWalkOversampler, returning both the new and original observations

Example

using MLJ
using ScientificTypes
import Imbalance

## set probability of each class
class_probs = [0.5, 0.2, 0.3]                         
num_rows = 100
num_continuous_feats = 3
## want two categorical features with three and two possible values respectively
num_vals_per_category = [3, 2]

## generate a table and categorical vector accordingly
X, y = Imbalance.generate_imbalanced_data(num_rows, num_continuous_feats; 
                                          class_probs, num_vals_per_category, rng=42)                      
julia> Imbalance.checkbalance(y)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (39.6%) 
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 33 (68.8%) 
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%) 


julia> ScientificTypes.schema(X).scitypes
(Continuous, Continuous, Continuous, Continuous, Continuous)
## coerce nominal columns to a finite scitype (multiclass or ordered factor)
X = coerce(X, :Column4=>Multiclass, :Column5=>Multiclass)

## load RandomWalkOversampler model type:
RandomWalkOversampler = @load RandomWalkOversampler pkg=Imbalance

## oversample the minority classes to sizes relative to the majority class:
oversampler = RandomWalkOversampler(ratios = Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng = 42)
mach = machine(oversampler)
Xover, yover = transform(mach, X, y)

julia> Imbalance.checkbalance(yover)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 38 (79.2%) 
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 43 (89.6%) 
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%)