ENNUndersampler
Initiate an ENN undersampling model with the given hyper-parameters.
ENNUndersampler
A model type for constructing an ENN undersampler, based on Imbalance.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
ENNUndersampler = @load ENNUndersampler pkg=Imbalance
Do `model = ENNUndersampler()` to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in `ENNUndersampler(k=...)`.
`ENNUndersampler` undersamples a dataset by removing ("cleaning") points that violate a certain condition, such as having a different class from the majority of their neighbors, as proposed in: Dennis L. Wilson. Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, pages 408–421, 1972.
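The cleaning idea can be illustrated with a minimal, brute-force sketch in plain Julia. This is not the Imbalance.jl implementation: the function name `enn_keep_mask` is hypothetical, distances are assumed to be Euclidean, and only a "mode"-style rule is shown (a point is kept when its class is among the most frequent classes of its `k` nearest neighbors).

```julia
# Hypothetical helper (not part of Imbalance.jl): returns a Bool mask of points to keep.
function enn_keep_mask(X::AbstractMatrix{<:Real}, y::AbstractVector, k::Integer)
    n = size(X, 1)                        # each row is one observation
    keep = trues(n)
    for i in 1:n
        # Squared Euclidean distance from point i to every point (brute force).
        d = Float64[sum(abs2, X[i, :] .- X[j, :]) for j in 1:n]
        d[i] = Inf                        # a point is not its own neighbor
        neighbors = partialsortperm(d, 1:k)
        # Count how often each class occurs among the k nearest neighbors.
        counts = Dict{eltype(y), Int}()
        for j in neighbors
            counts[y[j]] = get(counts, y[j], 0) + 1
        end
        top = maximum(values(counts))
        # Keep the point only if its own class is (one of) the most frequent.
        keep[i] = get(counts, y[i], 0) == top
    end
    return keep
end

# Toy data: the last point is labeled "b" but sits inside the "a" cluster.
X = [0.0 0.0; 0.1 0.0; 0.0 0.1; 0.1 0.1; 5.0 5.0; 5.1 5.0; 5.0 5.1; 0.05 0.05]
y = ["a", "a", "a", "a", "b", "b", "b", "b"]
keep = enn_keep_mask(X, y, 3)
X_clean, y_clean = X[keep, :], y[keep]
```

On this toy data only the stray `"b"` point is cleaned, since all three of its nearest neighbors belong to class `"a"`.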
Training data
In MLJ or MLJBase, wrap the model in a machine by `mach = machine(model)`. There is no need to provide any data here because the model is a static transformer. Likewise, there is no need to `fit!(mach)`.
For default values of the hyper-parameters, the model can be constructed by `model = ENNUndersampler()`.
Hyperparameters
`k::Integer=5`: Number of nearest neighbors to consider in the algorithm. Should be within the range `0 < k < n`, where `n` is the number of observations in the smallest class.
`keep_condition::AbstractString="mode"`: The condition that leads to cleaning a point upon violation. Takes one of `"exists"`, `"mode"`, `"only mode"` and `"all"`:
- `"exists"`: the point has at least one neighbor from the same class
- `"mode"`: the class of the point is one of the most frequent classes of the neighbors (there may be many)
- `"only mode"`: the class of the point is the single most frequent class of the neighbors
- `"all"`: the class of the point is the same as all the neighbors
`min_ratios=1.0`: A parameter that controls the maximum amount of undersampling to be done for each class. If this algorithm cleans the data to an extent that this is violated, some of the cleaned points will be revived randomly so that it is satisfied.
- Can be a float, in which case no class will be undersampled below the size of the minority class times the float. By default (`1.0`), no class is undersampled below the size of the minority class
- Can be a dictionary mapping each class label to the float minimum ratio for that class
`force_min_ratios=false`: If `true`, and this algorithm cleans the data such that the ratios for each class exceed those specified in `min_ratios`, then further undersampling will be performed so that the final ratios are equal to `min_ratios`.
`rng::Union{AbstractRNG, Integer}=default_rng()`: Either an `AbstractRNG` object or an `Integer` seed to be used with `Xoshiro` if the Julia `VERSION` supports it. Otherwise, uses `MersenneTwister`.
`try_preserve_type::Bool=true`: When `true`, the function will try to not change the type of the input table (e.g., `DataFrame`). However, for some tables, this may not succeed, and in this case, the table returned will be a column table (named-tuple of vectors). This parameter is ignored if the input is a matrix.
Transform Inputs
`X`: A matrix or table of floats where each row is an observation from the dataset
`y`: An abstract vector of labels (e.g., strings) that correspond to the observations in `X`
Transform Outputs
`X_under`: A matrix or table that includes the data after undersampling, depending on whether the input `X` is a matrix or table respectively
`y_under`: An abstract vector of labels corresponding to `X_under`
Operations
`transform(mach, X, y)`: resample the data `X` and `y` using ENNUndersampler, returning the undersampled versions
Example
using MLJ
import Imbalance
## set probability of each class
class_probs = [0.5, 0.2, 0.3]
num_rows, num_continuous_feats = 100, 5
## generate a table and categorical vector accordingly
X, y = Imbalance.generate_imbalanced_data(num_rows, num_continuous_feats;
min_sep=0.01, stds=[3.0 3.0 3.0], class_probs, rng=42)
julia> Imbalance.checkbalance(y; ref="minority")
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (100.0%)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 33 (173.7%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (252.6%)
## load ENN model type:
ENNUndersampler = @load ENNUndersampler pkg=Imbalance
## undersample the majority classes to sizes relative to the minority class:
undersampler = ENNUndersampler(min_ratios=0.5, rng=42)
mach = machine(undersampler)
X_under, y_under = transform(mach, X, y)
julia> Imbalance.checkbalance(y_under; ref="minority")
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 10 (100.0%)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 10 (100.0%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 24 (240.0%)
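As a quick arithmetic check of how `min_ratios=0.5` relates to the counts printed above, the following sketch compares each class size after undersampling with the lower bound implied by the original minority class. The counts are copied by hand from the two `checkbalance` outputs; how the package rounds a fractional bound such as 9.5 is not stated here, so the sketch only checks the inequality.

```julia
# Class counts copied from the checkbalance outputs before and after undersampling.
counts_before = Dict(0 => 48, 1 => 19, 2 => 33)
counts_after  = Dict(0 => 24, 1 => 10, 2 => 10)

min_ratio = 0.5                                    # the value passed as min_ratios
minority_size = minimum(values(counts_before))     # size of the smallest original class
lower_bound = min_ratio * minority_size            # smallest size any class may be cleaned to

# Every class should retain at least `lower_bound` observations.
satisfied = all(c >= lower_bound for c in values(counts_after))
```

Here `lower_bound` is 9.5 and every class retains at least 10 observations, so the constraint holds.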