ClusterUndersampler

Initialize a cluster undersampling model with the given hyper-parameters.

ClusterUndersampler

A model type for constructing a cluster undersampler, based on Imbalance.jl, and implementing the MLJ model interface.

From MLJ, the type can be imported using

ClusterUndersampler = @load ClusterUndersampler pkg=Imbalance

Do model = ClusterUndersampler() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in ClusterUndersampler(mode=...).

ClusterUndersampler implements clustering-based undersampling, with K-means as the clustering algorithm, as presented in Lin, W.-C., Tsai, C.-F., Hu, Y.-H., & Jhang, J.-S. (2017). Clustering-based undersampling in class-imbalanced data. Information Sciences, 409–410, 17–26.
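To make the idea concrete, here is a minimal, dependency-free sketch of clustering-based undersampling for a single class. It is not the Imbalance.jl implementation: the helper names `kmeans_centroids` and `undersample_class` are hypothetical, the K-means is a plain Lloyd iteration with random initialization, and the "nearest" branch may pick the same row for two centroids.

```julia
using Random
using Statistics: mean

# Plain Lloyd's K-means on the rows of X; returns a k × d matrix of centroids.
# (Hypothetical helper for illustration, not part of Imbalance.jl.)
function kmeans_centroids(X::AbstractMatrix{<:Real}, k::Integer;
                          maxiter::Integer=100, rng=MersenneTwister(42))
    n = size(X, 1)
    centroids = float.(X[randperm(rng, n)[1:k], :])   # start from k distinct rows
    assignments = zeros(Int, n)
    for _ in 1:maxiter
        # assign every row to its closest centroid (squared Euclidean distance)
        new_assignments = [argmin([sum(abs2, X[i, :] .- centroids[j, :]) for j in 1:k])
                           for i in 1:n]
        new_assignments == assignments && break        # converged
        assignments = new_assignments
        # move each non-empty cluster's centroid to the mean of its members
        for j in 1:k
            members = findall(==(j), assignments)
            isempty(members) || (centroids[j, :] = vec(mean(X[members, :]; dims=1)))
        end
    end
    return centroids
end

# Reduce the rows of one class to n_keep rows.
# mode="center" returns the centroids; mode="nearest" returns real observations.
function undersample_class(X::AbstractMatrix{<:Real}, n_keep::Integer;
                           mode="nearest", maxiter=100, rng=MersenneTwister(42))
    centroids = kmeans_centroids(X, n_keep; maxiter, rng)
    mode == "center" && return centroids
    mode == "nearest" || throw(ArgumentError("mode must be \"center\" or \"nearest\""))
    # replace each centroid with the closest row of X
    # (in this simplified sketch two centroids may select the same row)
    nearest = [argmin([sum(abs2, X[i, :] .- centroids[j, :]) for i in 1:size(X, 1)])
               for j in 1:n_keep]
    return X[nearest, :]
end

# six observations of one class, reduced to two
X_major = [0.0 0.0; 0.1 0.0; 0.0 0.1; 5.0 5.0; 5.1 5.0; 5.0 5.1]
centers = undersample_class(X_major, 2; mode="center")
picked  = undersample_class(X_major, 2; mode="nearest")
```

With mode="center" the returned rows are synthetic points (cluster means), whereas mode="nearest" returns only rows that already exist in the data.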

Training data

In MLJ or MLJBase, wrap the model in a machine with mach = machine(model).

There is no need to provide any data here because the model is a static transformer.

Likewise, there is no need to fit!(mach).

For default values of the hyper-parameters, model can be constructed with model = ClusterUndersampler().

Hyperparameters

mode::AbstractString="nearest": If "center", then the undersampled data will consist of the centroids of each cluster found; if "nearest", then it will consist of the nearest neighbor of each centroid.

ratios=1.0: Controls how much each class is undersampled.
- Can be a float, in which case each class is undersampled to the size of the minority class times the float. By default, all classes are undersampled to the size of the minority class.
- Can be a dictionary mapping each class label to the float ratio for that class.
maxiter::Integer=100: Maximum number of iterations to run K-means
rng::Integer=42: Random number generator seed. Must be an integer.
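As a rough illustration of how ratios translates into class sizes, the sketch below computes the number of observations kept per class from the class counts. The function `target_counts` is hypothetical and not part of Imbalance.jl; rounding down and capping at the original class size are assumptions made here, and the library may round differently or reject ratios that would require more observations than a class has.

```julia
# Number of observations to keep per class, given class counts and `ratios`.
# (Hypothetical helper for illustration, not part of Imbalance.jl.)
function target_counts(counts::AbstractDict, ratios)
    minority = minimum(values(counts))
    # a float applies to every class; a dictionary is looked up per label
    ratio_for(label) = ratios isa AbstractDict ? ratios[label] : ratios
    # assumption: round down, and never keep more rows than the class has
    return Dict(label => min(n, floor(Int, ratio_for(label) * minority))
                for (label, n) in counts)
end

# class counts taken from the example in this docstring
counts = Dict(0 => 48, 1 => 19, 2 => 33)

same_size = target_counts(counts, 1.0)
per_class = target_counts(counts, Dict(0 => 2.0, 1 => 1.0, 2 => 1.5))
```

Here same_size keeps 19 observations in every class, while per_class keeps 38, 19, and 28 observations for classes 0, 1, and 2 respectively.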

Transform Inputs

X: A matrix or table of floats where each row is an observation from the dataset
y: An abstract vector of labels (e.g., strings) that correspond to the observations in X

Transform Outputs

X_under: A matrix or table containing the undersampled data, matching the type of the input X (a matrix if X is a matrix, a table if X is a table)
y_under: An abstract vector of labels corresponding to X_under

Operations

transform(mach, X, y): resample the data X and y using ClusterUndersampler, returning the undersampled versions

Example

using MLJ
import Imbalance

## set probability of each class
class_probs = [0.5, 0.2, 0.3]                         
num_rows, num_continuous_feats = 100, 5
## generate a table and categorical vector accordingly
X, y = Imbalance.generate_imbalanced_data(num_rows, num_continuous_feats; 
                                class_probs, rng=42)   
                                                    
julia> Imbalance.checkbalance(y; ref="minority")
 1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (100.0%) 
 2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 33 (173.7%) 
 0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (252.6%) 

## load the ClusterUndersampler model type
ClusterUndersampler = @load ClusterUndersampler pkg=Imbalance

## construct the model and wrap it in a machine
undersampler = ClusterUndersampler(mode="nearest", 
                                   ratios=Dict(0=>1.0, 1=> 1.0, 2=>1.0), rng=42)
mach = machine(undersampler)

## provide the data to transform (there is nothing to fit)
X_under, y_under = transform(mach, X, y)

                                       
julia> Imbalance.checkbalance(y_under; ref="minority")
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (100.0%) 
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (100.0%) 
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (100.0%)