SMOTENC
Initiate a SMOTENC model with the given hyper-parameters.
SMOTENC
A model type for constructing a SMOTENC oversampler, based on Imbalance.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
SMOTENC = @load SMOTENC pkg=Imbalance
Do model = SMOTENC() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in SMOTENC(k=...).
SMOTENC implements the SMOTENC algorithm to correct for class imbalance as in N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.
Training data
In MLJ or MLJBase, wrap the model in a machine by
mach = machine(model)
There is no need to provide any data here because the model is a static transformer.
Likewise, there is no need to fit!(mach).
For default values of the hyper-parameters, model can be constructed by
model = SMOTENC()
Hyperparameters
k=5: Number of nearest neighbors to consider in the SMOTENC algorithm. Should be within the range [1, n - 1], where n is the number of observations; otherwise set to the nearest of these two values.
ratios=1.0: A parameter that controls the amount of oversampling to be done for each class
- Can be a float and in this case each class will be oversampled to the size of the majority class times the float. By default, all classes are oversampled to the size of the majority class
- Can be a dictionary mapping each class label to the float ratio for that class
knn_tree: Decides the tree used in KNN computations. Either "Brute" or "Ball". BallTree can be much faster but may lead to inaccurate results.
rng::Union{AbstractRNG, Integer}=default_rng(): Either an AbstractRNG object or an Integer seed to be used with Xoshiro if the Julia VERSION supports it. Otherwise, uses MersenneTwister.
Transform Inputs
X: A table with element scitypes that subtype Union{Finite, Infinite}. Elements in nominal columns should subtype Finite (i.e., have scitype OrderedFactor or Multiclass) and elements in continuous columns should subtype Infinite (i.e., have scitype Count or Continuous).
y: An abstract vector of labels (e.g., strings) that correspond to the observations in X
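One way to check the scitype requirement on X before calling transform is sketched below, using schema and coerce from ScientificTypes. The toy table, its column names, and the helper is_valid are hypothetical and exist only for illustration.

```julia
using ScientificTypes

# Hypothetical toy table (a NamedTuple of columns); names are for illustration only.
X = (height = [1.62, 1.75, 1.80, 1.68],
     visits = [3, 1, 4, 2],
     color  = ["red", "blue", "red", "green"])

# Does every column have an element scitype under Union{Finite, Infinite}?
is_valid(table) = all(s -> s <: Union{Finite, Infinite}, schema(table).scitypes)

# The string column has scitype Textual, so the raw table does not qualify.
is_valid(X)

# Coerce the nominal column to a finite scitype, then check again.
Xc = coerce(X, :color => Multiclass)
is_valid(Xc)
```

Float and integer columns already have the Continuous and Count scitypes, so only the nominal column needs coercion here.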
Transform Outputs
Xover: A matrix or table, depending on whether the input X is a matrix or table respectively, that includes the original data and the new observations due to oversampling.
yover: An abstract vector of labels corresponding to Xover
Operations
transform(mach, X, y): resample the data X and y using SMOTENC, returning both the new and original observations
Example
using MLJ
using ScientificTypes
import Imbalance
## set probability of each class
class_probs = [0.5, 0.2, 0.3]
num_rows = 100
num_continuous_feats = 3
## want two categorical features with three and two possible values respectively
num_vals_per_category = [3, 2]
## generate a table and categorical vector accordingly
X, y = Imbalance.generate_imbalanced_data(num_rows, num_continuous_feats;
                                          class_probs, num_vals_per_category, rng=42)
julia> Imbalance.checkbalance(y)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (39.6%)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 33 (68.8%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%)
julia> ScientificTypes.schema(X).scitypes
(Continuous, Continuous, Continuous, Continuous, Continuous)
## coerce nominal columns to a finite scitype (multiclass or ordered factor)
X = coerce(X, :Column4=>Multiclass, :Column5=>Multiclass)
## load SMOTE-NC
SMOTENC = @load SMOTENC pkg=Imbalance
## construct the oversampler and wrap it in a machine
oversampler = SMOTENC(k=5, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
mach = machine(oversampler)
## provide the data to transform (there is nothing to fit)
Xover, yover = transform(mach, X, y)
julia> Imbalance.checkbalance(yover)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 38 (79.2%)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 43 (89.6%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%)