Extras

Generate Imbalanced Data

Imbalance.generate_imbalanced_dataFunction
generate_imbalanced_data(
    num_rows, num_continuous_feats;
    means=nothing, min_sep=1.0, stds=nothing,
    num_vals_per_category = [],
    class_probs = [0.8, 0.2],
    type= "ColTable", insert_y= nothing,
    rng= default_rng(),
)

Generate num_rows observations with target y respecting given probabilities of each class. Supports generating continuous features with a specific mean and variance and categorical features given the number of levels in each variable.

Arguments

  • num_rows::Integer: Number of observations to generate
  • num_continuous_feats::Integer: Number of continuous features to generate
  • means::AbstractVector=nothing: A vector of means for each continuous feature (must be as long as num_continuous_feats). If nothing, then will be set randomly
  • min_sep::AbstractFloat=1.0: Minimum distance between any two randomly chosen means. Will have no effect if the means are given.
  • stds::AbstractVector=nothing: A vector of standard deviations for each continuous feature (must be as long as num_continuous_feats). If nothing, then will be set randomly
  • num_vals_per_category::AbstractVector=[]: A vector of the number of levels of each extra categorical feature. the number of categorical features is inferred from this.
  • class_probs::AbstractVector{<:AbstractFloat}=[0.8, 0.2]: A vector of probabilities of each class. The number of classes is inferred from this vector.
  • type::AbstractString="ColTable": Can be "Matrix" or "ColTable". In the latter case, a named-tuple of vectors is returned.
  • insert_y::Integer=nothing: If not nothing, insert the class labels column at the given index in the table
  • rng::Union{AbstractRNG, Integer}=default_rng(): Random number generator. If integer then used as seed in Random.Xoshiro(seed) if the Julia VERSION supports it. Otherwise, uses Random.MersenneTwister(seed).

Returns

  • X:: A column table or matrix with generated imbalanced data with num_rows rows and num_continuous_feats + length(num_vals_per_category) columns. If insert_y is specified as in integer then y is also inserted at the specified index as an extra column.
  • y::CategoricalArray: An abstract vector of class labels with labels $0$, $1$, $2$, ..., $k-1$ where k=length(class_probs)

Example

using Imbalance
using Plots

num_rows = 500
num_features = 2
# generating continuous features given mean and std
X, y = generate_imbalanced_data(
	num_rows,
	num_features;
	means = [1.0, 4.0, [7.0 9.0]],
	stds = [1.0, [0.5 0.8], 2.0],
	class_probs=[0.5, 0.2, 0.3],
	type="Matrix",
	rng = 42,
)

p = plot()
[scatter!(p, X[:, 1][y.==yi], X[:, 2][y.==yi], label = "$y=yi$") for yi in unique(y)]

julia> plot(p)

generated data

# generating continuous features with random mean and std
X, y = generate_imbalanced_data(
	num_rows,
	num_features;
    min_sep=0.3,      
	class_probs=[0.5, 0.2, 0.3],
	type="Matrix",
	rng = 33,
)

p = plot()
[scatter!(p, X[:, 1][y.==yi], X[:, 2][y.==yi], label = "$y=yi$") for yi in unique(y)]

julia> plot(p)

generated data

num_rows = 500
num_features = 2
X, y = generate_imbalanced_data(
	num_rows,
	num_features;
    num_vals_per_category = [3, 5, 2],
	class_probs=[0.9, 0.1],
	insert_y=4,
	type="ColTable",
	rng = 33,
)

julia> X
(Column1 = [0.883, 0.9, 0.577  …  0.887,],
 Column2 = [0.578, 0.718, 0.378  …  0.573,],
 Column3 = [2.0, 2.0, 3.0, …  2.0,],
 Column4 = [0.0, 0.0, 0.0, …  0.0,],
 Column5 = [2.0, 3.0, 4.0, …  4.0,],
 Column6 = [1.0, 1.0, 2.0, …  1.0,],)
source

Check Balance of Data

Imbalance.checkbalanceFunction
checkbalance(y; reference="majority")

A visual version of StatsBase.countmap that returns nothing and prints how many observations in the dataset belong to each class and their percentage relative to the size of majority or minority class.

Arguments

  • y::AbstractVector: A vector of categorical values to test for imbalance
  • reference="majority": Either "majority" or "minority" and decides whether the percentage should be relative to the size of majority or minority class.

Example

num_rows = 50000
num_features = 2
X, y = generate_imbalanced_data(
	num_rows,
	num_features;
	class_probs=[0.8, 0.2],
	type="Matrix",
	rng = 42,
)

julia> Imbalance.checkbalance(y; ref="majority")
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇ 10034 (25.1%) 
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 39966 (100.0%) 

julia> Imbalance.checkbalance(y; ref="minority")
1: ▇▇▇▇▇▇▇▇▇▇▇▇▇ 10034 (100.0%) 
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 39966 (398.3%) 
source