# Oversampling Algorithms

The following table lists the supported oversampling algorithms, whether each method repeats existing data or generates synthetic data, and the supported data types.

Oversampling Method | Mechanism | Supported Data Types |
---|---|---|
Random Oversampler | Repeat existing data | Continuous and/or nominal |
Random Walk Oversampler | Generate synthetic data | Continuous and/or nominal |
ROSE | Generate synthetic data | Continuous |
SMOTE | Generate synthetic data | Continuous |
Borderline SMOTE1 | Generate synthetic data | Continuous |
SMOTE-N | Generate synthetic data | Nominal |
SMOTE-NC | Generate synthetic data | Continuous and nominal |

## Random Oversampler

`Imbalance.random_oversample` — Function

```
random_oversample(
    X, y;
    ratios=1.0, rng=default_rng(),
    try_preserve_type=true
)
```

**Description**

Naively oversample a dataset by randomly repeating existing observations with replacement.

**Positional Arguments**

`X`

: A matrix of real numbers or a table with element scitypes that subtype `Union{Finite, Infinite}`. Elements in nominal columns should subtype `Finite` (i.e., have scitype `OrderedFactor` or `Multiclass`) and elements in continuous columns should subtype `Infinite` (i.e., have scitype `Count` or `Continuous`).

`y`

: An abstract vector of labels (e.g., strings) that correspond to the observations in `X`

**Keyword Arguments**

`ratios=1.0`

: A parameter that controls the amount of oversampling to be done for each class
- Can be a float, in which case each class will be oversampled to the size of the majority class times the float. By default, all classes are oversampled to the size of the majority class
- Can be a dictionary mapping each class label to the float ratio for that class

`rng::Union{AbstractRNG, Integer}=default_rng()`

: Either an `AbstractRNG` object or an `Integer` seed to be used with `Xoshiro` if the Julia `VERSION` supports it. Otherwise, uses `MersenneTwister`.

`try_preserve_type::Bool=true`

: When `true`, the function will try to not change the type of the input table (e.g., `DataFrame`). However, for some tables this may not succeed, in which case the table returned will be a column table (named-tuple of vectors). This parameter is ignored if the input is a matrix.

**Returns**

`Xover`

: A matrix or table that includes the original data and the new observations due to oversampling, depending on whether the input `X` is a matrix or table respectively

`yover`

: An abstract vector of labels corresponding to `Xover`

**Example**

```
using Imbalance
# set probability of each class
class_probs = [0.5, 0.2, 0.3]
num_rows, num_continuous_feats = 100, 5
# generate a table and categorical vector accordingly
X, y = generate_imbalanced_data(num_rows, num_continuous_feats;
class_probs, rng=42)
julia> Imbalance.checkbalance(y)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (39.6%)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 33 (68.8%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%)
# apply random oversampling
Xover, yover = random_oversample(X, y; ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
julia> Imbalance.checkbalance(yover)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 38 (79.2%)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 43 (89.6%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%)
```
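The class counts in the output above can be reproduced with plain arithmetic, assuming (as the docstring implies) that each class grows toward `ratio * majority_size`, rounded to the nearest integer, and that existing observations are never removed. The names `counts` and `final` below are illustrative, not part of the package:

```julia
# Hypothetical illustration of how `ratios` maps to final class sizes
# in the example above (plain arithmetic, not a call into Imbalance).
counts = Dict(0 => 48, 1 => 19, 2 => 33)       # class sizes before oversampling
majority = maximum(values(counts))             # size of the majority class (48)
ratios = Dict(0 => 1.0, 1 => 0.9, 2 => 0.8)
# each class grows toward ratio * majority size (never shrinks)
final = Dict(c => max(n, round(Int, ratios[c] * majority)) for (c, n) in counts)
```

This yields 48, 43, and 38 observations for classes 0, 1, and 2, matching the `checkbalance` output above.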

**MLJ Model Interface**

Simply pass the keyword arguments while initiating the `RandomOversampler` model and pass the positional arguments `X, y` to the `transform` method.

```
using MLJ
RandomOversampler = @load RandomOversampler pkg=Imbalance
# Wrap the model in a machine
oversampler = RandomOversampler(ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
mach = machine(oversampler)
# Provide the data to transform (there is nothing to fit)
Xover, yover = transform(mach, X, y)
```

You can read more about this `MLJ` interface by accessing it from MLJ's model browser.

**TableTransforms Interface**

This interface assumes that the input is one table `Xy` and that `y` is one of the columns. Hence, an integer `y_ind` must be specified to the constructor to specify which column `y` is, followed by other keyword arguments. Only `Xy` is provided while applying the transform.

```
using Imbalance
using Imbalance.TableTransforms
# Generate imbalanced data
num_rows = 100
num_features = 5
y_ind = 3
Xy, _ = generate_imbalanced_data(num_rows, num_features;
class_probs=[0.5, 0.2, 0.3], insert_y=y_ind, rng=42)
# Initiate Random Oversampler model
oversampler = RandomOversampler(y_ind; ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
Xyover = Xy |> oversampler
# or, equivalently, using the TableTransforms API directly
Xyover, cache = TableTransforms.apply(oversampler, Xy)
```

The `reapply(oversampler, Xy, cache)` method from `TableTransforms` simply falls back to `apply(oversampler, Xy)` and the `revert(oversampler, Xy, cache)` method reverts the transform by removing the oversampled observations from the table.

**Illustration**

A full basic example along with an animation can be found here. You may find more practical examples in the tutorial section which also explains running code on Google Colab.

## Random Walk Oversampler

`Imbalance.random_walk_oversample` — Function

```
random_walk_oversample(
    X, y, cat_inds;
    ratios=1.0, rng=default_rng(),
    try_preserve_type=true
)
```

**Description**

Oversamples a dataset using random walk oversampling as presented in [1].
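As a rough sketch of the mechanism in [1] (an illustrative assumption, not the package's exact implementation), each continuous feature of a synthetic point is an existing observation's value perturbed by Gaussian noise scaled by that feature's spread within the class; nominal features are sampled from values observed in the class. The function name below is hypothetical:

```julia
using Random

# Illustrative random-walk step (assumed form; see [1] for the exact scaling):
# perturb each continuous feature by noise proportional to its class spread.
function random_walk_point(x::AbstractVector, feature_stds::AbstractVector;
                           rng=Random.default_rng())
    return x .+ feature_stds .* randn(rng, length(x))
end

# with zero spread the walk stays at the original point
random_walk_point([1.0, 2.0], [0.0, 0.0])
```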

**Positional Arguments**

`X`

: A matrix of floats or a table with element scitypes that subtype `Union{Finite, Infinite}`. Elements in nominal columns should subtype `Finite` (i.e., have scitype `OrderedFactor` or `Multiclass`) and elements in continuous columns should subtype `Infinite` (i.e., have scitype `Count` or `Continuous`).

`y`

: An abstract vector of labels (e.g., strings) that correspond to the observations in `X`

`cat_inds::AbstractVector{<:Int}`

: A vector of the indices of the nominal features. Supplied only if `X` is a matrix. Otherwise, they are inferred from the table's scitypes.

**Keyword Arguments**

`ratios=1.0`

: A parameter that controls the amount of oversampling to be done for each class
- Can be a float, in which case each class will be oversampled to the size of the majority class times the float. By default, all classes are oversampled to the size of the majority class
- Can be a dictionary mapping each class label to the float ratio for that class

`rng::Union{AbstractRNG, Integer}=default_rng()`

: Either an `AbstractRNG` object or an `Integer` seed to be used with `Xoshiro` if the Julia `VERSION` supports it. Otherwise, uses `MersenneTwister`.

`try_preserve_type::Bool=true`

: When `true`, the function will try to not change the type of the input table (e.g., `DataFrame`). However, for some tables this may not succeed, in which case the table returned will be a column table (named-tuple of vectors). This parameter is ignored if the input is a matrix.

**Returns**

`Xover`

: A matrix or table that includes the original data and the new observations due to oversampling, depending on whether the input `X` is a matrix or table respectively

`yover`

: An abstract vector of labels corresponding to `Xover`

**Example**

```
using Imbalance
using ScientificTypes
# set probability of each class
class_probs = [0.5, 0.2, 0.3]
num_rows = 100
num_continuous_feats = 3
# want two categorical features with three and two possible values respectively
num_vals_per_category = [3, 2]
# generate a table and categorical vector accordingly
X, y = generate_imbalanced_data(num_rows, num_continuous_feats;
class_probs, num_vals_per_category, rng=42)
julia> Imbalance.checkbalance(y)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (39.6%)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 33 (68.8%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%)
julia> ScientificTypes.schema(X).scitypes
(Continuous, Continuous, Continuous, Continuous, Continuous)
# coerce nominal columns to a finite scitype (multiclass or ordered factor)
X = coerce(X, :Column4=>Multiclass, :Column5=>Multiclass)
# apply random walk oversampling
Xover, yover = random_walk_oversample(X, y;
ratios = Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng = 42)
julia> Imbalance.checkbalance(yover)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 38 (79.2%)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 43 (89.6%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%)
```

**MLJ Model Interface**

Simply pass the keyword arguments while initiating the `RandomWalkOversampler` model and pass the positional arguments (excluding `cat_inds`) to the `transform` method.

```
using MLJ
RandomWalkOversampler = @load RandomWalkOversampler pkg=Imbalance
# Wrap the model in a machine
oversampler = RandomWalkOversampler(ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
mach = machine(oversampler)
# Provide the data to transform (there is nothing to fit)
Xover, yover = transform(mach, X, y)
```

You can read more about this `MLJ` interface by accessing it from MLJ's model browser. Note that only `Table` input is supported by the MLJ interface for this method.

**TableTransforms Interface**

This interface assumes that the input is one table `Xy` and that `y` is one of the columns. Hence, an integer `y_ind` must be specified to the constructor to specify which column `y` is, followed by other keyword arguments. Only `Xy` is provided while applying the transform.

```
using Imbalance
using ScientificTypes
using Imbalance.TableTransforms
# Generate imbalanced data
num_rows = 100
num_continuous_feats = 3
y_ind = 2
# generate a table and categorical vector accordingly
Xy, _ = generate_imbalanced_data(num_rows, num_continuous_feats; insert_y=y_ind,
class_probs= [0.5, 0.2, 0.3], num_vals_per_category=[3, 2],
rng=42)
# Table must have only finite or continuous scitypes
Xy = coerce(Xy, :Column2=>Multiclass, :Column5=>Multiclass, :Column6=>Multiclass)
# Initiate Random Walk Oversampler model
oversampler = RandomWalkOversampler(y_ind;
ratios=Dict(1=>1.0, 2=> 0.9, 3=>0.9), rng=42)
Xyover = Xy |> oversampler
# or, equivalently, using the TableTransforms API directly
Xyover, cache = TableTransforms.apply(oversampler, Xy)
```

**Illustration**

A full basic example along with an animation can be found here. You may find more practical examples in the tutorial section which also explains running code on Google Colab.

**References**

[1] Zhang, H., & Li, M. (2014). RWO-Sampling: A random walk over-sampling approach to imbalanced data classification. Information Fusion, 25, 4-20.

## ROSE

`Imbalance.rose` — Function

```
rose(
    X, y;
    s=1.0, ratios=1.0, rng=default_rng(),
    try_preserve_type=true
)
```

**Description**

Oversamples a dataset using the `ROSE` (Random Oversampling Examples) algorithm to correct for class imbalance, as presented in [1].
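The generation step can be sketched as drawing from a Gaussian kernel centered at an existing observation, with a per-feature bandwidth `h` proportionally scaled by `s`. This is a minimal sketch under an assumed bandwidth; the package computes `h` from the data as described in [1], and the function name below is hypothetical:

```julia
using Random

# Minimal ROSE-style generation sketch: draw a synthetic point from a Gaussian
# centered at `x` with per-feature bandwidth `h`, proportionally scaled by `s`.
rose_point(x, h, s; rng=Random.default_rng()) = x .+ s .* h .* randn(rng, length(x))

# s = 0 collapses the kernel onto the original observation
rose_point([1.0, 2.0], [0.5, 0.5], 0.0)
```

This makes the role of `s` concrete: larger `s` spreads synthetic points further from the originals, while `s = 0` reduces ROSE to random oversampling.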

**Positional Arguments**

`X`

: A matrix or table of floats where each row is an observation from the dataset

`y`

: An abstract vector of labels (e.g., strings) that correspond to the observations in `X`

**Keyword Arguments**

`s::float=1.0`

: A parameter that proportionally controls the bandwidth of the Gaussian kernel

`ratios=1.0`

: A parameter that controls the amount of oversampling to be done for each class
- Can be a float, in which case each class will be oversampled to the size of the majority class times the float. By default, all classes are oversampled to the size of the majority class
- Can be a dictionary mapping each class label to the float ratio for that class

`rng::Union{AbstractRNG, Integer}=default_rng()`

: Either an `AbstractRNG` object or an `Integer` seed to be used with `Xoshiro` if the Julia `VERSION` supports it. Otherwise, uses `MersenneTwister`.

`try_preserve_type::Bool=true`

: When `true`, the function will try to not change the type of the input table (e.g., `DataFrame`). However, for some tables this may not succeed, in which case the table returned will be a column table (named-tuple of vectors). This parameter is ignored if the input is a matrix.

**Returns**

`Xover`

: A matrix or table that includes the original data and the new observations due to oversampling, depending on whether the input `X` is a matrix or table respectively

`yover`

: An abstract vector of labels corresponding to `Xover`

**Example**

```
using Imbalance
# set probability of each class
class_probs = [0.5, 0.2, 0.3]
num_rows, num_continuous_feats = 100, 5
# generate a table and categorical vector accordingly
X, y = generate_imbalanced_data(num_rows, num_continuous_feats;
class_probs, rng=42)
julia> Imbalance.checkbalance(y)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (39.6%)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 33 (68.8%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%)
# apply ROSE
Xover, yover = rose(X, y; s=0.3, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
julia> Imbalance.checkbalance(yover)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 38 (79.2%)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 43 (89.6%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%)
```

**MLJ Model Interface**

Simply pass the keyword arguments while initiating the `ROSE` model and pass the positional arguments `X, y` to the `transform` method.

```
using MLJ
ROSE = @load ROSE pkg=Imbalance
# Wrap the model in a machine
oversampler = ROSE(s=0.3, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
mach = machine(oversampler)
# Provide the data to transform (there is nothing to fit)
Xover, yover = transform(mach, X, y)
```

You can read more about this `MLJ` interface by accessing it from MLJ's model browser.

**TableTransforms Interface**

This interface assumes that the input is one table `Xy` and that `y` is one of the columns. Hence, an integer `y_ind` must be specified to the constructor to specify which column `y` is, followed by other keyword arguments. Only `Xy` is provided while applying the transform.

```
using Imbalance
using Imbalance.TableTransforms
# Generate imbalanced data
num_rows = 200
num_features = 5
y_ind = 3
Xy, _ = generate_imbalanced_data(num_rows, num_features;
class_probs=[0.5, 0.2, 0.3], insert_y=y_ind, rng=42)
# Initiate ROSE model
oversampler = ROSE(y_ind; s=0.3, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
Xyover = Xy |> oversampler
# or, equivalently, using the TableTransforms API directly
Xyover, cache = TableTransforms.apply(oversampler, Xy)
```

The `reapply(oversampler, Xy, cache)` method from `TableTransforms` simply falls back to `apply(oversampler, Xy)` and the `revert(oversampler, Xy, cache)` method reverts the transform by removing the oversampled observations from the table.

**Illustration**

A full basic example along with an animation can be found here. You may find more practical examples in the tutorial section which also explains running code on Google Colab.

**References**

[1] G Menardi, N. Torelli, “Training and assessing classification rules with imbalanced data,” Data Mining and Knowledge Discovery, 28(1), pp.92-122, 2014.

## SMOTE

`Imbalance.smote` — Function

```
smote(
    X, y;
    k=5, ratios=1.0, rng=default_rng(),
    try_preserve_type=true
)
```

**Description**

Oversamples a dataset using the `SMOTE` (Synthetic Minority Oversampling Technique) algorithm to correct for class imbalance, as presented in [1].
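The core generation step places a synthetic point at a random position on the segment between a minority observation and one of its `k` nearest minority-class neighbors. The sketch below illustrates this single step (the function name is hypothetical, not the library's implementation):

```julia
using Random

# SMOTE-style interpolation sketch: a synthetic point is a random convex
# combination of a minority point `x` and one of its minority-class neighbors.
function smote_point(x::AbstractVector, neighbor::AbstractVector;
                     rng=Random.default_rng())
    r = rand(rng)                      # uniform in [0, 1)
    return x .+ r .* (neighbor .- x)   # point on the segment between the two
end

z = smote_point([0.0, 0.0], [1.0, 2.0]; rng=Random.MersenneTwister(42))
```

Because the synthetic point always lies on a segment between real observations, SMOTE requires continuous features, which is why the nominal variants SMOTE-N and SMOTE-NC exist.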

**Positional Arguments**

`X`

: A matrix or table of floats where each row is an observation from the dataset

`y`

: An abstract vector of labels (e.g., strings) that correspond to the observations in `X`

**Keyword Arguments**

`k::Integer=5`

: Number of nearest neighbors to consider in the algorithm. Should be within the range `0 < k < n` where `n` is the number of observations in the smallest class.

`ratios=1.0`

: A parameter that controls the amount of oversampling to be done for each class
- Can be a float, in which case each class will be oversampled to the size of the majority class times the float. By default, all classes are oversampled to the size of the majority class
- Can be a dictionary mapping each class label to the float ratio for that class

`rng::Union{AbstractRNG, Integer}=default_rng()`

: Either an `AbstractRNG` object or an `Integer` seed to be used with `Xoshiro` if the Julia `VERSION` supports it. Otherwise, uses `MersenneTwister`.

`try_preserve_type::Bool=true`

: When `true`, the function will try to not change the type of the input table (e.g., `DataFrame`). However, for some tables this may not succeed, in which case the table returned will be a column table (named-tuple of vectors). This parameter is ignored if the input is a matrix.

**Returns**

`Xover`

: A matrix or table that includes the original data and the new observations due to oversampling, depending on whether the input `X` is a matrix or table respectively

`yover`

: An abstract vector of labels corresponding to `Xover`

**Example**

```
using Imbalance
# set probability of each class
class_probs = [0.5, 0.2, 0.3]
num_rows, num_continuous_feats = 100, 5
# generate a table and categorical vector accordingly
X, y = generate_imbalanced_data(num_rows, num_continuous_feats;
class_probs, rng=42)
julia> Imbalance.checkbalance(y)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (39.6%)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 33 (68.8%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%)
# apply SMOTE
Xover, yover = smote(X, y; k = 5, ratios = Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng = 42)
julia> Imbalance.checkbalance(yover)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 38 (79.2%)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 43 (89.6%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%)
```

**MLJ Model Interface**

Simply pass the keyword arguments while initiating the `SMOTE` model and pass the positional arguments `X, y` to the `transform` method.

```
using MLJ
SMOTE = @load SMOTE pkg=Imbalance
# Wrap the model in a machine
oversampler = SMOTE(k=5, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
mach = machine(oversampler)
# Provide the data to transform (there is nothing to fit)
Xover, yover = transform(mach, X, y)
```

You can read more about this `MLJ` interface by accessing it from MLJ's model browser.

**TableTransforms Interface**

This interface assumes that the input is one table `Xy` and that `y` is one of the columns. Hence, an integer `y_ind` must be specified to the constructor to specify which column `y` is, followed by other keyword arguments. Only `Xy` is provided while applying the transform.

```
using Imbalance
using Imbalance.TableTransforms
# Generate imbalanced data
num_rows = 200
num_features = 5
y_ind = 3
Xy, _ = generate_imbalanced_data(num_rows, num_features;
class_probs=[0.5, 0.2, 0.3], insert_y=y_ind, rng=42)
# Initiate SMOTE model
oversampler = SMOTE(y_ind; k=5, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
Xyover = Xy |> oversampler
# or, equivalently, using the TableTransforms API directly
Xyover, cache = TableTransforms.apply(oversampler, Xy)
```

The `reapply(oversampler, Xy, cache)` method from `TableTransforms` simply falls back to `apply(oversampler, Xy)` and the `revert(oversampler, Xy, cache)` method reverts the transform by removing the oversampled observations from the table.

**Illustration**

A full basic example can be found here. You may find more practical examples in the tutorial section which also explains running code on Google Colab.

**References**

[1] N. V. Chawla, K. W. Bowyer, L. O.Hall, W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of artificial intelligence research, 321-357, 2002.

## Borderline SMOTE1

`Imbalance.borderline_smote1` — Function

```
borderline_smote1(
    X, y;
    m=5, k=5, ratios=1.0, rng=default_rng(),
    try_preserve_type=true, verbosity=1
)
```

**Description**

Oversamples a dataset using borderline SMOTE1 algorithm to correct for class imbalance as presented in [1]
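The "borderline" filter from [1] can be sketched as follows: a minority point participates in oversampling only when at least half, but not all, of its `m` nearest neighbors come from other classes; points whose neighbors are all from other classes are treated as noise and skipped. The function name below is illustrative:

```julia
# Sketch of the BorderlineSMOTE1 danger condition (illustrative, after [1]):
# `m_other` is how many of the point's `m` nearest neighbors belong to other classes.
is_danger(m_other::Int, m::Int) = m / 2 <= m_other < m

is_danger(3, 5)   # borderline point: participates in oversampling
is_danger(5, 5)   # all neighbors from other classes: treated as noise
is_danger(1, 5)   # safe interior point: skipped
```

Only the points passing this condition are then oversampled with the regular SMOTE interpolation using `k` neighbors.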

**Positional Arguments**

`X`

: A matrix or table of floats where each row is an observation from the dataset

`y`

: An abstract vector of labels (e.g., strings) that correspond to the observations in `X`

**Keyword Arguments**

`m::Integer=5`

: The number of neighbors to consider while checking the BorderlineSMOTE1 condition. Should be within the range `0 < m < N` where `N` is the number of observations in the data. It will be automatically set to `N-1` if `N ≤ m`.

`k::Integer=5`

: Number of nearest neighbors to consider in the SMOTE part of the algorithm. Should be within the range `0 < k < n` where `n` is the number of observations in the smallest class. It will be automatically set to `l-1` for any class with `l` points where `l ≤ k`.

`ratios=1.0`

: A parameter that controls the amount of oversampling to be done for each class
- Can be a float, in which case each class will be oversampled to the size of the majority class times the float. By default, all classes are oversampled to the size of the majority class
- Can be a dictionary mapping each class label to the float ratio for that class

`rng::Union{AbstractRNG, Integer}=default_rng()`

: Either an `AbstractRNG` object or an `Integer` seed to be used with `Xoshiro` if the Julia `VERSION` supports it. Otherwise, uses `MersenneTwister`.

`try_preserve_type::Bool=true`

: When `true`, the function will try to not change the type of the input table (e.g., `DataFrame`). However, for some tables this may not succeed, in which case the table returned will be a column table (named-tuple of vectors). This parameter is ignored if the input is a matrix.

`verbosity::Integer=1`

: Whenever higher than `0`, info regarding the points that will participate in oversampling is logged.

**Returns**

`Xover`

: A matrix or table that includes the original data and the new observations due to oversampling, depending on whether the input `X` is a matrix or table respectively

`yover`

: An abstract vector of labels corresponding to `Xover`

**Example**

```
using Imbalance
# set probability of each class
class_probs = [0.5, 0.2, 0.3]
num_rows, num_continuous_feats = 1000, 5
# generate a table and categorical vector accordingly
X, y = generate_imbalanced_data(num_rows, num_continuous_feats;
stds=[0.1 0.1 0.1], min_sep=0.01, class_probs, rng=42)
julia> Imbalance.checkbalance(y)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 200 (40.8%)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 310 (63.3%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 490 (100.0%)
# apply BorderlineSMOTE1
Xover, yover = borderline_smote1(X, y; m = 3,
k = 5, ratios = Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng = 42)
julia> Imbalance.checkbalance(yover)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 392 (80.0%)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 441 (90.0%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 490 (100.0%)
```

**MLJ Model Interface**

Simply pass the keyword arguments while initiating the `BorderlineSMOTE1` model and pass the positional arguments `X, y` to the `transform` method.

```
using MLJ
BorderlineSMOTE1 = @load BorderlineSMOTE1 pkg=Imbalance
# Wrap the model in a machine
oversampler = BorderlineSMOTE1(m=3, k=5, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
mach = machine(oversampler)
# Provide the data to transform (there is nothing to fit)
Xover, yover = transform(mach, X, y)
```

You can read more about this `MLJ` interface by accessing it from MLJ's model browser.

**TableTransforms Interface**

This interface assumes that the input is one table `Xy` and that `y` is one of the columns. Hence, an integer `y_ind` must be specified to the constructor to specify which column `y` is, followed by other keyword arguments. Only `Xy` is provided while applying the transform.

```
using Imbalance
using Imbalance.TableTransforms
# Generate imbalanced data
num_rows = 1000
num_features = 5
y_ind = 3
Xy, _ = generate_imbalanced_data(num_rows, num_features;
class_probs=[0.5, 0.2, 0.3], min_sep=0.01, insert_y=y_ind, rng=42)
# Initiate BorderlineSMOTE1 Oversampler model
oversampler = BorderlineSMOTE1(y_ind; m=3, k=5,
ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
Xyover = Xy |> oversampler
# equivalently if TableTransforms is used
Xyover, cache = TableTransforms.apply(oversampler, Xy)
```

The `reapply(oversampler, Xy, cache)` method from `TableTransforms` simply falls back to `apply(oversampler, Xy)` and the `revert(oversampler, Xy, cache)` method reverts the transform by removing the oversampled observations from the table.

**Illustration**

A full basic example can be found here. You may find more practical examples in the tutorial section which also explains running code on Google Colab.

**References**

[1] Han, H., Wang, W.-Y., & Mao, B.-H. (2005). Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In D.S. Huang, X.-P. Zhang, & G.-B. Huang (Eds.), Advances in Intelligent Computing (pp. 878-887). Springer.

## SMOTE-N

`Imbalance.smoten` — Function

```
smoten(
    X, y;
    k=5, ratios=1.0, rng=default_rng(),
    try_preserve_type=true
)
```

**Description**

Oversamples a dataset using the `SMOTE-N` (Synthetic Minority Oversampling Technique-Nominal) algorithm to correct for class imbalance, as presented in [1]. This is a variant of `SMOTE` to deal with datasets where all features are nominal.
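Since interpolation is undefined for categories, generation can be sketched as a per-feature majority vote among the chosen neighbors. This is an illustrative sketch only; per [1], the package also uses a modified value difference metric for the neighbor search, which is not shown here:

```julia
# SMOTE-N generation sketch: each feature of the synthetic point is the most
# frequent category among the chosen neighbors (ties broken arbitrarily).
modal_value(vals) = argmax(v -> count(==(v), vals), unique(vals))

neighbors = ["a" "x"; "a" "y"; "b" "x"]   # 3 neighbors × 2 nominal features
synthetic = [modal_value(col) for col in eachcol(neighbors)]
```

Here the synthetic point takes `"a"` for the first feature (2 of 3 neighbors) and `"x"` for the second.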

**Positional Arguments**

`X`

: A matrix of integers or a table with element scitypes that subtype `Finite`. That is, for table inputs each column should have either `OrderedFactor` or `Multiclass` as the element scitype.

`y`

: An abstract vector of labels (e.g., strings) that correspond to the observations in `X`

**Keyword Arguments**

`k::Integer=5`

: Number of nearest neighbors to consider in the algorithm. Should be within the range `0 < k < n` where `n` is the number of observations in the smallest class.

`ratios=1.0`

: A parameter that controls the amount of oversampling to be done for each class
- Can be a float, in which case each class will be oversampled to the size of the majority class times the float. By default, all classes are oversampled to the size of the majority class
- Can be a dictionary mapping each class label to the float ratio for that class

`rng::Union{AbstractRNG, Integer}=default_rng()`

: Either an `AbstractRNG` object or an `Integer` seed to be used with `Xoshiro` if the Julia `VERSION` supports it. Otherwise, uses `MersenneTwister`.

`try_preserve_type::Bool=true`

: When `true`, the function will try to not change the type of the input table (e.g., `DataFrame`). However, for some tables this may not succeed, in which case the table returned will be a column table (named-tuple of vectors). This parameter is ignored if the input is a matrix.

**Returns**

`Xover`

: A matrix or table that includes original data and the new observations due to oversampling. depending on whether the input`X`

is a matrix or table respectively`yover`

: An abstract vector of labels corresponding to`Xover`

**Example**

```
using Imbalance
using ScientificTypes
# set probability of each class
class_probs = [0.5, 0.2, 0.3]
num_rows = 100
num_continuous_feats = 0
# want two categorical features with three and two possible values respectively
num_vals_per_category = [3, 2]
# generate a table and categorical vector accordingly
X, y = generate_imbalanced_data(num_rows, num_continuous_feats;
class_probs, num_vals_per_category, rng=42)
julia> Imbalance.checkbalance(y)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (39.6%)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 33 (68.8%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%)
julia> ScientificTypes.schema(X).scitypes
(Count, Count)
# coerce to a finite scitype (multiclass or ordered factor)
X = coerce(X, autotype(X, :few_to_finite))
# apply SMOTEN
Xover, yover = smoten(X, y; k=5, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
julia> Imbalance.checkbalance(yover)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 38 (79.2%)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 43 (89.6%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%)
```

**MLJ Model Interface**

Simply pass the keyword arguments while initiating the `SMOTEN` model and pass the positional arguments `X, y` to the `transform` method.

```
using MLJ
SMOTEN = @load SMOTEN pkg=Imbalance
# Wrap the model in a machine
oversampler = SMOTEN(k=5, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
mach = machine(oversampler)
# Provide the data to transform (there is nothing to fit)
Xover, yover = transform(mach, X, y)
```

You can read more about this `MLJ` interface by accessing it from MLJ's model browser.

**TableTransforms Interface**

This interface assumes that the input is one table `Xy` and that `y` is one of the columns. Hence, an integer `y_ind` must be specified to the constructor to specify which column `y` is, followed by other keyword arguments. Only `Xy` is provided while applying the transform.

```
using Imbalance
using ScientificTypes
using Imbalance.TableTransforms
# Generate imbalanced data
num_rows = 100
num_continuous_feats = 0
y_ind = 2
# generate a table and categorical vector accordingly
Xy, _ = generate_imbalanced_data(num_rows, num_continuous_feats; insert_y=y_ind,
class_probs= [0.5, 0.2, 0.3], num_vals_per_category=[3, 2],
rng=42)
# Table must have only finite scitypes
Xy = coerce(Xy, :Column1=>Multiclass, :Column2=>Multiclass, :Column3=>Multiclass)
# Initiate SMOTEN model
oversampler = SMOTEN(y_ind; k=5, ratios=Dict(1=>1.0, 2=> 0.9, 3=>0.9), rng=42)
Xyover = Xy |> oversampler
# or, equivalently, using the TableTransforms API directly
Xyover, cache = TableTransforms.apply(oversampler, Xy)
```

The `reapply(oversampler, Xy, cache)` method from `TableTransforms` simply falls back to `apply(oversampler, Xy)` and the `revert(oversampler, Xy, cache)` method reverts the transform by removing the oversampled observations from the table.

**Illustration**

A full basic example can be found here. You may find more practical examples in the tutorial section which also explains running code on Google Colab.

**References**

[1] N. V. Chawla, K. W. Bowyer, L. O.Hall, W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of artificial intelligence research, 321-357, 2002.

## SMOTE-NC

`Imbalance.smotenc` — Function

```
smotenc(
    X, y, cat_inds;
    k=5, ratios=1.0, knn_tree="Brute", rng=default_rng(),
    try_preserve_type=true
)
```

**Description**

Oversamples a dataset using the `SMOTE-NC` (Synthetic Minority Oversampling Technique-Nominal Continuous) algorithm to correct for class imbalance, as presented in [1]. This is a variant of `SMOTE` to deal with datasets with both nominal and continuous features.
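Generation can be sketched as SMOTE interpolation on the continuous part plus a neighbor majority vote on the nominal part. This is illustrative only; per [1], the package's distance computation additionally penalizes nominal mismatches, which is not shown, and the function name is hypothetical:

```julia
using Random

# SMOTE-NC generation sketch: interpolate continuous features as in SMOTE and
# take the most frequent category per nominal feature among the neighbors.
function smotenc_point(x_cont, neighbor_cont, nominal_cols; rng=Random.default_rng())
    r = rand(rng)                                    # uniform in [0, 1)
    cont = x_cont .+ r .* (neighbor_cont .- x_cont)  # SMOTE-style interpolation
    nom = [argmax(v -> count(==(v), col), unique(col)) for col in nominal_cols]
    return cont, nom
end

cont, nom = smotenc_point([0.0], [2.0], [["a", "a", "b"]];
                          rng=Random.MersenneTwister(1))
```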

**Positional Arguments**

`X`

: A matrix of floats or a table with element scitypes that subtype `Union{Finite, Infinite}`. Elements in nominal columns should subtype `Finite` (i.e., have scitype `OrderedFactor` or `Multiclass`) and elements in continuous columns should subtype `Infinite` (i.e., have scitype `Count` or `Continuous`).

`y`

: An abstract vector of labels (e.g., strings) that correspond to the observations in `X`

`cat_inds::AbstractVector{<:Int}`

: A vector of the indices of the nominal features. Supplied only if `X` is a matrix. Otherwise, they are inferred from the table's scitypes.

**Keyword Arguments**

`k::Integer=5`

: Number of nearest neighbors to consider in the algorithm. Should be within the range `0 < k < n` where `n` is the number of observations in the smallest class.

`ratios=1.0`

: A parameter that controls the amount of oversampling to be done for each class
- Can be a float, in which case each class will be oversampled to the size of the majority class times the float. By default, all classes are oversampled to the size of the majority class
- Can be a dictionary mapping each class label to the float ratio for that class

`knn_tree`

: Decides the tree used in KNN computations. Either `"Brute"` or `"Ball"`. BallTree can be much faster but may lead to inaccurate results.

`rng::Union{AbstractRNG, Integer}=default_rng()`

: Either an `AbstractRNG` object or an `Integer` seed to be used with `Xoshiro` if the Julia `VERSION` supports it. Otherwise, uses `MersenneTwister`.

`try_preserve_type::Bool=true`

: When `true`, the function will try to not change the type of the input table (e.g., `DataFrame`). However, for some tables this may not succeed, in which case the table returned will be a column table (named-tuple of vectors). This parameter is ignored if the input is a matrix.

**Returns**

`Xover`

: A matrix or table that includes the original data and the new observations due to oversampling, depending on whether the input `X` is a matrix or table respectively

`yover`

: An abstract vector of labels corresponding to `Xover`

**Example**

```
using Imbalance
using ScientificTypes
# set probability of each class
class_probs = [0.5, 0.2, 0.3]
num_rows = 100
num_continuous_feats = 3
# want two categorical features with three and two possible values respectively
num_vals_per_category = [3, 2]
# generate a table and categorical vector accordingly
X, y = generate_imbalanced_data(num_rows, num_continuous_feats;
class_probs, num_vals_per_category, rng=42)
julia> Imbalance.checkbalance(y)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19 (39.6%)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 33 (68.8%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%)
julia> ScientificTypes.schema(X).scitypes
(Continuous, Continuous, Continuous, Continuous, Continuous)
# coerce nominal columns to a finite scitype (multiclass or ordered factor)
X = coerce(X, :Column4=>Multiclass, :Column5=>Multiclass)
# apply SMOTE-NC
Xover, yover = smotenc(X, y; k = 5, ratios = Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng = 42)
julia> Imbalance.checkbalance(yover)
2: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 38 (79.2%)
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 43 (89.6%)
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 48 (100.0%)
```

**MLJ Model Interface**

Simply pass the keyword arguments while initiating the `SMOTENC` model and pass the positional arguments (excluding `cat_inds`) to the `transform` method.

```
using MLJ
SMOTENC = @load SMOTENC pkg=Imbalance
# Wrap the model in a machine
oversampler = SMOTENC(k=5, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
mach = machine(oversampler)
# Provide the data to transform (there is nothing to fit)
Xover, yover = transform(mach, X, y)
```

You can read more about this `MLJ` interface by accessing it from MLJ's model browser. Note that only `Table` input is supported by the MLJ interface for this method.

**TableTransforms Interface**

This interface assumes that the input is one table `Xy` and that `y` is one of the columns. Hence, an integer `y_ind` must be specified to the constructor to specify which column `y` is, followed by other keyword arguments. Only `Xy` is provided while applying the transform.

```
using Imbalance
using ScientificTypes
using Imbalance.TableTransforms
# Generate imbalanced data
num_rows = 100
num_continuous_feats = 3
y_ind = 2
# generate a table and categorical vector accordingly
Xy, _ = generate_imbalanced_data(num_rows, num_continuous_feats; insert_y=y_ind,
class_probs= [0.5, 0.2, 0.3], num_vals_per_category=[3, 2],
rng=42)
# Table must have only finite or continuous scitypes
Xy = coerce(Xy, :Column2=>Multiclass, :Column5=>Multiclass, :Column6=>Multiclass)
# Initiate SMOTENC model
oversampler = SMOTENC(y_ind; k=5, ratios=Dict(1=>1.0, 2=> 0.9, 3=>0.9), rng=42)
Xyover = Xy |> oversampler
# or, equivalently, using the TableTransforms API directly
Xyover, cache = TableTransforms.apply(oversampler, Xy)
```

**Illustration**

A full basic example can be found here. You may find more practical examples in the tutorial section which also explains running code on Google Colab.

**References**

[1] N. V. Chawla, K. W. Bowyer, L. O.Hall, W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of artificial intelligence research, 321-357, 2002.