Generating Synthetic Data
Here, synthetic data means artificially generated data, with no reference to a "real world" data set. It is not to be confused with "fake data" obtained by resampling from a distribution fit to some actual real data.
MLJ has a set of functions (make_blobs, make_circles, make_moons and make_regression, closely resembling the scikit-learn functions of the same names) for generating synthetic data sets. These are useful for testing machine learning models (e.g., testing user-defined composite models; see Composing Models).
Generating Gaussian blobs
MLJBase.make_blobs — Function
X, y = make_blobs(n=100, p=2; kwargs...)
Generate Gaussian blobs for clustering and classification problems.
Return value
By default, a table X with p columns (features) and n rows (observations), together with a corresponding vector of n Multiclass target observations y, indicating blob membership.
Keyword arguments
- shuffle=true: whether to shuffle the resulting points.
- centers=3: either a number of centers or a c x p matrix with c pre-determined centers.
- cluster_std=1.0: the standard deviation(s) of each blob.
- center_box=(-10. => 10.): the limits of the p-dimensional cube within which the cluster centers are drawn if they are not provided.
- eltype=Float64: machine type of points (any subtype of AbstractFloat).
- rng=Random.GLOBAL_RNG: any AbstractRNG object, or integer to seed a MersenneTwister (for reproducibility).
- as_table=true: whether to return the points as a table (true) or a matrix (false). If false, the target y has integer element type.
Example
X, y = make_blobs(100, 3; centers=2, cluster_std=[1.0, 3.0])
using MLJ, DataFrames
X, y = make_blobs(100, 3; centers=2, cluster_std=[1.0, 3.0])
dfBlobs = DataFrame(X)
dfBlobs.y = y
first(dfBlobs, 3)
| Row | x1 | x2 | x3 | y |
|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Cat… |
| 1 | -1.09355 | -6.62103 | 1.52951 | 2 |
| 2 | -10.2376 | 3.60171 | -1.14085 | 1 |
| 3 | -8.51697 | 5.55341 | -0.432054 | 1 |
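The keyword arguments described above can be combined to obtain reproducible blobs at known locations. The following is a minimal sketch rather than part of the MLJ documentation: the center coordinates, the spreads and the seed 123 are arbitrary illustrative choices, and it assumes that label k corresponds to row k of the supplied centers matrix. The per-blob means are computed with Statistics.mean purely as a sanity check.

```julia
using MLJ
using Statistics: mean

# Two pre-determined centers in 2 dimensions (a c x p matrix with c = 2, p = 2).
# These coordinates are illustrative assumptions.
centers = [0.0 0.0;
           8.0 8.0]

# An integer `rng` seeds a MersenneTwister, so repeated calls give the same data.
# `as_table=false` returns a plain matrix and an integer target vector.
X, y = make_blobs(200, 2;
                  centers=centers,
                  cluster_std=[0.5, 1.0],
                  rng=123,
                  as_table=false)

# Empirical mean of each blob, for comparison with the requested centers
# (assumes label k belongs to row k of `centers`).
blob_means = [vec(mean(X[y .== k, :]; dims=1)) for k in 1:2]

println(size(X))
println(sort(unique(y)))
println(blob_means)
```

The printed means should sit close to the requested centers, with the second blob spread more widely because of its larger standard deviation.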
using VegaLite
dfBlobs |> @vlplot(:point, x=:x1, y=:x2, color = :"y:n")
dfBlobs |> @vlplot(:point, x=:x1, y=:x3, color = :"y:n")
Generating concentric circles
MLJBase.make_circles — Function
X, y = make_circles(n=100; kwargs...)
Generate n labeled points close to two concentric circles for classification and clustering models.
Return value
By default, a table X with 2 columns and n rows (observations), together with a corresponding vector of n Multiclass target observations y. The target is either 0 or 1, corresponding to membership to the smaller or larger circle, respectively.
Keyword arguments
- shuffle=true: whether to shuffle the resulting points.
- noise=0: standard deviation of the Gaussian noise added to the data.
- factor=0.8: ratio of the smaller radius over the larger one.
- eltype=Float64: machine type of points (any subtype of AbstractFloat).
- rng=Random.GLOBAL_RNG: any AbstractRNG object, or integer to seed a MersenneTwister (for reproducibility).
- as_table=true: whether to return the points as a table (true) or a matrix (false). If false, the target y has integer element type.
Example
X, y = make_circles(100; noise=0.5, factor=0.3)
using MLJ, DataFrames
X, y = make_circles(100; noise=0.05, factor=0.3)
dfCircles = DataFrame(X)
dfCircles.y = y
first(dfCircles, 3)
| Row | x1 | x2 | y |
|---|---|---|---|
| | Float64 | Float64 | Cat… |
| 1 | 0.839304 | -0.510724 | 1 |
| 2 | 0.0150521 | -0.334344 | 0 |
| 3 | 0.20397 | -0.247111 | 0 |
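To see what factor controls, one can switch the noise off and compare the radii of the two classes directly. The sketch below is an illustration under stated assumptions, not part of the MLJ documentation: it assumes the circles are centered at the origin, and the seed 42 and sample size are arbitrary.

```julia
using MLJ

# With noise=0 every point lies exactly on one of the two circles.
# `as_table=false` gives a matrix X and an integer target y.
X, y = make_circles(200; noise=0, factor=0.3, rng=42, as_table=false)

# Distance of each point from the origin (assumed common center).
radii = sqrt.(vec(sum(abs2, X; dims=2)))

inner = radii[y .== 0]   # label 0: smaller circle
outer = radii[y .== 1]   # label 1: larger circle

# Ratio of the smaller radius over the larger one; should match `factor`.
ratio = maximum(inner) / minimum(outer)
println(ratio)
```

With a positive noise value the same ratio is only recovered approximately, for example by comparing the mean radius of each class.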
using VegaLite
dfCircles |> @vlplot(:circle, x=:x1, y=:x2, color = :"y:n")
Sampling from two interleaved half-circles
MLJBase.make_moons — Function
make_moons(n::Int=100; kwargs...)
Generates labeled two-dimensional points lying close to two interleaved semi-circles, for use with classification and clustering models.
Return value
By default, a table X with 2 columns and n rows (observations), together with a corresponding vector of n Multiclass target observations y. The target is either 0 or 1, corresponding to membership to the left or right semi-circle.
Keyword arguments
- shuffle=true: whether to shuffle the resulting points.
- noise=0.1: standard deviation of the Gaussian noise added to the data.
- xshift=1.0: horizontal translation of the second center with respect to the first one.
- yshift=0.3: vertical translation of the second center with respect to the first one.
- eltype=Float64: machine type of points (any subtype of AbstractFloat).
- rng=Random.GLOBAL_RNG: any AbstractRNG object, or integer to seed a MersenneTwister (for reproducibility).
- as_table=true: whether to return the points as a table (true) or a matrix (false). If false, the target y has integer element type.
Example
X, y = make_moons(100; noise=0.5)
using MLJ, DataFrames
X, y = make_moons(100; noise=0.05)
dfHalfCircles = DataFrame(X)
dfHalfCircles.y = y
first(dfHalfCircles, 3)
| Row | x1 | x2 | y |
|---|---|---|---|
| | Float64 | Float64 | Cat… |
| 1 | 0.463095 | 0.86277 | 0 |
| 2 | -0.686093 | 0.809959 | 0 |
| 3 | -0.263066 | 0.965137 | 0 |
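Because the points are drawn at random, the rows shown above will differ from run to run unless rng is fixed. The following minimal sketch, which is an illustration and not part of the MLJ documentation, shows how an integer seed makes the output repeatable; the seeds 2024 and 7 are arbitrary choices.

```julia
using MLJ

# The same integer seed should reproduce exactly the same points and labels.
X1, y1 = make_moons(50; noise=0.1, rng=2024, as_table=false)
X2, y2 = make_moons(50; noise=0.1, rng=2024, as_table=false)
same_seed_matches = (X1 == X2) && (y1 == y2)

# A different seed gives a different sample.
X3, y3 = make_moons(50; noise=0.1, rng=7, as_table=false)
other_seed_differs = X1 != X3

println(same_seed_matches)
println(other_seed_differs)
```

Fixing the seed in this way is convenient in unit tests for composite models, where a failing test should be repeatable.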
using VegaLite
dfHalfCircles |> @vlplot(:circle, x=:x1, y=:x2, color = :"y:n")
Regression data generated from noisy linear models
MLJBase.make_regression — Function
make_regression(n, p; kwargs...)
Generate Gaussian input features and a linear response with Gaussian noise, for use with regression models.
Return value
By default, a tuple (X, y) where table X has p columns and n rows (observations), together with a corresponding vector of n Continuous target observations y.
Keywords
- intercept=true: Whether to generate data from a model with intercept.
- n_targets=1: Number of columns in the target.
- sparse=0: Proportion of the generating weight vector that is sparse.
- noise=0.1: Standard deviation of the Gaussian noise added to the response (target).
- outliers=0: Proportion of the response vector to make as outliers by adding a random quantity with high variance. (Only applied if binary is false.)
- as_table=true: Whether X (and y, if n_targets > 1) should be a table or a matrix.
- eltype=Float64: Element type for X and y. Must subtype AbstractFloat.
- binary=false: Whether the target should be binarized (via a sigmoid).
- rng=Random.GLOBAL_RNG: Any AbstractRNG object, or integer to seed a MersenneTwister (for reproducibility).
Example
X, y = make_regression(100, 5; noise=0.5, sparse=0.2, outliers=0.1)
using MLJ, DataFrames
X, y = make_regression(100, 5; noise=0.5, sparse=0.2, outliers=0.1)
dfRegression = DataFrame(X)
dfRegression.y = y
first(dfRegression, 3)
| Row | x1 | x2 | x3 | x4 | x5 | y |
|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.0993396 | 0.406961 | -0.41877 | -0.993276 | 1.42414 | 0.117619 |
| 2 | -0.879567 | 0.446859 | 1.17152 | -0.0379617 | 0.204908 | -0.441979 |
| 3 | 0.514809 | 1.36118 | -0.411285 | -1.61638 | 1.15077 | 0.594742 |
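One way to check that the generated response really is linear in the features plus Gaussian noise is to fit ordinary least squares and look at the residual spread. The sketch below is an illustration under stated assumptions, not part of the MLJ documentation: the least-squares fit via the backslash operator is a choice made here, the seed and sizes are arbitrary, and outliers and sparsity are switched off so that only the noise term remains.

```julia
using MLJ
using Statistics: std

n, p = 500, 3

# No outliers and no sparsity, so the response is a linear function of the
# features plus Gaussian noise with standard deviation `noise`.
X, y = make_regression(n, p; noise=0.1, sparse=0, outliers=0,
                       rng=1, as_table=false)

# Ordinary least squares with an explicit intercept column.
# `vec` guards against `y` being returned as an n x 1 matrix.
A = hcat(X, ones(n))
yv = vec(y)
coefs = A \ yv

# The residual standard deviation should be close to the requested noise level.
residual_std = std(yv .- A * coefs)
println(residual_std)
```

Raising outliers above zero should inflate this residual spread noticeably, which is a quick way to see the effect of that keyword.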