Datasets

Standard datasets

To add a new dataset, assuming it has a header and is located at data/newdataset.csv:

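Start by loading it with CSV:

using CSV, DataFrames

fpath = joinpath("data", "newdataset.csv")
# Recent CSV.jl versions require an explicit sink type such as DataFrame;
# the older `categorical` keyword is no longer accepted there.
data = CSV.read(fpath, DataFrame; copycols=true)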

Load it with DelimitedFiles and Tables:

using DelimitedFiles, Tables

data_raw, data_header = readdlm(fpath, ',', header=true)
data_table = Tables.table(data_raw; header=Symbol.(vec(data_header)))

Retrieve the conversions:

for (n, st) in zip(names(data), scitype_union.(eachcol(data)))
    println(":$n=>$st,")
end

Copy and paste the result into a call to coerce:

data_table = coerce(data_table, ...)
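
As a rough illustration of this last step, the sketch below coerces a small in-memory column table rather than a file. The table and its column names (age, height, group) are invented for the example, and it assumes that MLJBase re-exports coerce, schema and the scientific types from ScientificTypes.

```julia
using MLJBase

# Hypothetical stand-in for `data_table`: a named tuple of columns is a valid table.
table = (age = [23, 45, 34], height = [1.7, 1.8, 1.6], group = ["a", "b", "a"])

# Integers are `Count` and strings are `Textual` by default; coerce them to the
# scientific types the models should see.
table = coerce(table, :age => Continuous, :group => Multiclass)

# Inspect the resulting scientific types column by column.
println(schema(table))
```

The pairs passed to coerce have the same :name => SciType shape as the lines printed by the loop above, which is why they can be pasted in directly.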

MLJBase.load_dataset — Method

load_dataset(fpath, coercions)

Load one of the standard datasets (such as Boston), assuming the file is a comma-separated file with a header.

MLJBase.load_sunspots — Method

Load a well-known sunspot time series (table with one column). https://www.sws.bom.gov.au/Educational/2/3/6

MLJBase.@load_ames — Macro

Load the full version of the well-known Ames Housing task.

MLJBase.@load_boston — Macro

Load a well-known public regression dataset with Continuous features.

MLJBase.@load_crabs — Macro

Load a well-known crab classification dataset with nominal features.

MLJBase.@load_iris — Macro

Load a well-known public classification task with continuous features and a nominal target.

MLJBase.@load_reduced_ames — Macro

Load a reduced version of the well-known Ames Housing task.

MLJBase.@load_smarket — Macro

Load the S&P Stock Market dataset, as used in An Introduction to Statistical Learning with Applications in R, by James, Witten, Hastie and Tibshirani (2013), Springer-Verlag, New York.

MLJBase.@load_sunspots — Macro

Load a well-known sunspot time series (single table with one column).

Synthetic datasets

MLJBase.augment_X — Method

augment_X(X, fit_intercept)

Given a matrix X, append a column of ones if fit_intercept is true. See make_regression.

MLJBase.finalize_Xy — Method

finalize_Xy(X, y, shuffle, as_table, eltype, rng; clf)

Internal function to finalize the make_* functions.

MLJBase.make_blobs — Function

X, y = make_blobs(n=100, p=2; kwargs...)

Generate Gaussian blobs for clustering and classification problems.

Return value

By default, a table X with p columns (features) and n rows (observations), together with a corresponding vector of n Multiclass target observations y, indicating blob membership.

Keyword arguments

shuffle=true: whether to shuffle the resulting points.
centers=3: either a number of centers or a c x p matrix with c pre-determined centers.
cluster_std=1.0: the standard deviation(s) of each blob.
center_box=(-10. => 10.): the limits of the p-dimensional cube within which the cluster centers are drawn if they are not provided.
eltype=Float64: machine type of points (any subtype of AbstractFloat).
rng=Random.GLOBAL_RNG: any AbstractRNG object, or integer to seed a MersenneTwister (for reproducibility).
as_table=true: whether to return the points as a table (true) or a matrix (false). If false the target y has integer element type.

Example

X, y = make_blobs(100, 3; centers=2, cluster_std=[1.0, 3.0])
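
The following sketch shows how the as_table and rng keywords described above change what is returned. The sizes and the seed are arbitrary choices for illustration.

```julia
using MLJBase

# Matrix output: X is a 60 x 2 matrix and y has an integer element type.
Xmat, ymat = make_blobs(60, 2; centers=3, cluster_std=0.5, rng=42, as_table=false)

# Default table output: X is a table with 60 rows and y is a categorical vector.
Xtab, ytab = make_blobs(60, 2; centers=3, cluster_std=0.5, rng=42)

println(size(Xmat))
println(eltype(ymat))
```

Passing an integer as rng seeds a fresh MersenneTwister on each call, so repeated calls with the same seed should give the same points.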

MLJBase.make_circles — Function

X, y = make_circles(n=100; kwargs...)

Generate n labeled points close to two concentric circles for classification and clustering models.

Return value

By default, a table X with 2 columns and n rows (observations), together with a corresponding vector of n Multiclass target observations y. The target is either 0 or 1, indicating membership in the smaller or larger circle, respectively.

Keyword arguments

shuffle=true: whether to shuffle the resulting points.
noise=0: standard deviation of the Gaussian noise added to the data.
factor=0.8: ratio of the smaller radius over the larger one.
eltype=Float64: machine type of points (any subtype of AbstractFloat).
rng=Random.GLOBAL_RNG: any AbstractRNG object, or integer to seed a MersenneTwister (for reproducibility).
as_table=true: whether to return the points as a table (true) or a matrix (false). If false the target y has integer element type.

Example

X, y = make_circles(100; noise=0.5, factor=0.3)

MLJBase.make_moons — Function

make_moons(n::Int=100; kwargs...)

Generate labeled two-dimensional points lying close to two interleaved semi-circles, for use with classification and clustering models.

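Return value

By default, a table X with 2 columns and n rows (observations), together with a corresponding vector of n Multiclass target observations y.
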
Keyword arguments

shuffle=true: whether to shuffle the resulting points.
noise=0.1: standard deviation of the Gaussian noise added to the data.
xshift=1.0: horizontal translation of the second center with respect to the first one.
yshift=0.3: vertical translation of the second center with respect to the first one.
eltype=Float64: machine type of points (any subtype of AbstractFloat).
rng=Random.GLOBAL_RNG: any AbstractRNG object, or integer to seed a MersenneTwister (for reproducibility).
as_table=true: whether to return the points as a table (true) or a matrix (false). If false the target y has integer element type.

Example

X, y = make_moons(100; noise=0.5)

MLJBase.make_regression — Function

make_regression(n, p; kwargs...)

Generate Gaussian input features and a linear response with Gaussian noise, for use with regression models.

Return value

By default, a tuple (X, y) where table X has p columns and n rows (observations), together with a corresponding vector of n Continuous target observations y.

Keywords

intercept=true: Whether to generate data from a model with intercept.
n_targets=1: Number of columns in the target.
sparse=0: Proportion of the generating weight vector that is sparse.
noise=0.1: Standard deviation of the Gaussian noise added to the response (target).
outliers=0: Proportion of the response vector to turn into outliers by adding a random quantity with high variance. (Only applied if binary is false.)
as_table=true: Whether X (and y, if n_targets > 1) should be a table or a matrix.
eltype=Float64: Element type for X and y. Must subtype AbstractFloat.
binary=false: Whether the target should be binarized (via a sigmoid).
rng=Random.GLOBAL_RNG: Any AbstractRNG object, or integer to seed a MersenneTwister (for reproducibility).

Example

X, y = make_regression(100, 5; noise=0.5, sparse=0.2, outliers=0.1)
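
A small sketch of the rng and as_table keywords in practice: with an integer seed, two calls are expected to return identical data, which is convenient in tests. The sizes and the seed are arbitrary.

```julia
using MLJBase

# Matrix output with a fixed integer seed.
X1, y1 = make_regression(200, 4; noise=0.1, rng=123, as_table=false)

# Same arguments and seed: the generated data should be reproduced exactly.
X2, y2 = make_regression(200, 4; noise=0.1, rng=123, as_table=false)

println(size(X1))
println(X1 == X2 && y1 == y2)
```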

MLJBase.outlify! — Method

outlify!(rng, y, s)

Add outliers to portion s of vector y.

MLJBase.runif_ab — Method

runif_ab(rng, n, p, a, b)

Internal function to generate n points in [a, b]ᵖ uniformly at random.

MLJBase.sigmoid — Method

sigmoid(x)

Return the sigmoid computed in a numerically stable way: $σ(x) = 1/(1+\exp(-x))$
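
One common way to evaluate this formula stably is to branch on the sign of x so that exp is only ever called on a non-positive argument and cannot overflow. The function below, with the hypothetical name stable_sigmoid, is a sketch of that general technique and is not taken from the MLJBase source.

```julia
# Hypothetical helper illustrating the sign-branching trick; not MLJBase's code.
function stable_sigmoid(x::Real)
    if x >= 0
        # exp(-x) <= 1 here, so the denominator stays in [1, 2].
        return 1 / (1 + exp(-x))
    else
        # Rewrite 1/(1 + exp(-x)) as exp(x)/(1 + exp(x)); exp(x) < 1 here.
        z = exp(x)
        return z / (1 + z)
    end
end

println(stable_sigmoid(0.0))
println(stable_sigmoid(-1000.0))
println(stable_sigmoid(1000.0))
```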

MLJBase.sparsify! — Method

sparsify!(rng, θ, s)

Make portion s of vector θ exactly 0.

Utility functions

MLJBase.complement — Method

complement(folds, i)

The complement of the ith fold of folds in the concatenation of all elements of folds. Here folds is a vector or tuple of integer vectors, typically representing row indices of a vector, matrix or table.

complement(([1, 2], [3,], [4, 5]), 2) # [1, 2, 4, 5]

MLJBase.corestrict — Method

corestrict(X, folds, i)

The restriction of X, a vector, matrix or table, to the complement of the ith fold of folds, where folds is a tuple of vectors of row indices.

The method is curried, so that corestrict(folds, i) is the operator on data defined by corestrict(folds, i)(X) = corestrict(X, folds, i).

Example

folds = ([1, 2], [3, 4, 5],  [6,])
corestrict([:x1, :x2, :x3, :x4, :x5, :x6], folds, 2) # [:x1, :x2, :x6]

MLJBase.partition — Method

partition(X, fractions...;
          shuffle=nothing,
          rng=Random.GLOBAL_RNG,
          stratify=nothing,
          multi=false)

Splits the vector, matrix or table X into a tuple of objects of the same type, whose vertical concatenation is X. The number of rows in each component of the return value is determined by the corresponding fractions of nrows(X), where valid fractions are floats between 0 and 1 whose sum is less than one. The last fraction is not provided, as it is inferred from the preceding ones.

For synchronized partitioning of multiple objects, use the multi=true option.

julia> partition(1:1000, 0.8)
([1,...,800], [801,...,1000])

julia> partition(1:1000, 0.2, 0.7)
([1,...,200], [201,...,900], [901,...,1000])

julia> partition(reshape(1:10, 5, 2), 0.2, 0.4)
([1 6], [2 7; 3 8], [4 9; 5 10])

julia> X, y = make_blobs() # a table and vector
julia> Xtrain, Xtest = partition(X, 0.8, stratify=y)

Here's an example of synchronized partitioning of multiple objects:

julia> (Xtrain, Xtest), (ytrain, ytest) = partition((X, y), 0.8, rng=123, multi=true)

Keywords

shuffle=nothing: if set to true, shuffles the rows before taking fractions.
rng=Random.GLOBAL_RNG: specifies the random number generator to be used; can be an integer seed. If rng is specified and shuffle === nothing, then shuffle is interpreted as true.
stratify=nothing: if a vector is specified, the partition will match the stratification of the given vector. In that case, shuffle cannot be false.
multi=false: if true then X is expected to be a tuple of objects sharing a common length, which are each partitioned separately using the same specified fractions and the same row shuffling. Returns a tuple of partitions (a tuple of tuples).
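
The keywords above can be combined as in the following sketch, which checks the sizes of the pieces rather than their exact contents. The data, fractions and seed are arbitrary choices for illustration.

```julia
using MLJBase

rows = 1:1000

# No shuffling: the first 80% of the rows, then the rest.
train, test = partition(rows, 0.8)

# Shuffled split with an integer seed; every row still lands in exactly one part.
strain, stest = partition(rows, 0.8; shuffle=true, rng=42)

# Synchronized split of a (hypothetical) feature table and target vector.
X = (x1 = rand(10), x2 = rand(10))
y = collect(1:10)
(Xtrain, Xtest), (ytrain, ytest) = partition((X, y), 0.8; multi=true)

println(length(train), " ", length(test))
println(length(ytrain), " ", length(ytest))
```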

MLJBase.restrict — Method

restrict(X, folds, i)

The restriction of X, a vector, matrix or table, to the ith fold of folds, where folds is a tuple of vectors of row indices.

The method is curried, so that restrict(folds, i) is the operator on data defined by restrict(folds, i)(X) = restrict(X, folds, i).

Example

folds = ([1, 2], [3, 4, 5],  [6,])
restrict([:x1, :x2, :x3, :x4, :x5, :x6], folds, 2) # [:x3, :x4, :x5]

See also corestrict
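
As a sketch of how restrict and corestrict fit together, the loop below walks over the folds in the style of cross-validation, holding out one fold at a time. The names are qualified with the module in case they are not exported in the installed version; the variable names are illustrative only.

```julia
using MLJBase

X = [:x1, :x2, :x3, :x4, :x5, :x6]
folds = ([1, 2], [3, 4, 5], [6])

for i in 1:length(folds)
    held_out = MLJBase.restrict(X, folds, i)     # rows in fold i
    remaining = MLJBase.corestrict(X, folds, i)  # rows in all other folds
    println("fold $i: held out $held_out, remaining $remaining")
end

# Curried forms: build the operators once, then apply them to any data.
take_fold_2 = MLJBase.restrict(folds, 2)
drop_fold_2 = MLJBase.corestrict(folds, 2)
```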

MLJBase.skipinvalid — Method

skipinvalid(A, B)

For vectors A and B of the same length, return a tuple of vectors (A[mask], B[mask]) where mask[i] is true if and only if A[i] and B[i] are both valid (non-missing and non-NaN). Can also be called on other iterators of matching length, such as arrays, but always returns a vector. Does not remove Missing from the element types if present in the original iterators.

MLJBase.skipinvalid — Method

skipinvalid(itr)

Return an iterator over the elements in itr skipping missing and NaN values. Behaviour is similar to skipmissing.

MLJBase.unpack — Method

unpack(table, f1, f2, ... fk;
       wrap_singles=false,
       shuffle=false,
       rng::Union{AbstractRNG,Int,Nothing}=nothing,
       coerce_options...)

Horizontally split any Tables.jl compatible table into smaller tables or vectors by making column selections determined by the predicates f1, f2, ..., fk. Selection from the column names is without replacement. A predicate is any object f such that f(name) is true or false for each column name::Symbol of table.

Returns a tuple of tables/vectors with length one greater than the number of supplied predicates, with the last component including all previously unselected columns.

julia> table = DataFrame(x=[1,2], y=['a', 'b'], z=[10.0, 20.0], w=["A", "B"])
2×4 DataFrame
 Row │ x      y     z        w
     │ Int64  Char  Float64  String
─────┼──────────────────────────────
   1 │     1  a        10.0  A
   2 │     2  b        20.0  B

julia> Z, XY, W = unpack(table, ==(:z), !=(:w));
julia> Z
2-element Vector{Float64}:
 10.0
 20.0

julia> XY
2×2 DataFrame
 Row │ x      y
     │ Int64  Char
─────┼─────────────
   1 │     1  a
   2 │     2  b

julia> W  # the column(s) left over
2-element Vector{String}:
 "A"
 "B"

Whenever a returned table contains a single column, it is converted to a vector unless wrap_singles=true.

If coerce_options are specified then table is first replaced with coerce(table, coerce_options). See ScientificTypes.coerce for details.

If shuffle=true then the rows of table are first shuffled, using the global RNG, unless rng is specified; if rng is an integer, it specifies the seed of an automatically generated Mersenne twister. If rng is specified then shuffle=true is implicit.
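
A typical use is to split a target column from the features and coerce a column in the same call. The sketch below uses a hypothetical column table in place of a DataFrame and assumes that a plain named tuple of vectors is handled like any other Tables.jl table.

```julia
using MLJBase

# Hypothetical column table: two features and a target column `y`.
table = (x1 = [1, 2, 3, 4], x2 = [0.5, 1.5, 2.5, 3.5], y = ["a", "b", "a", "b"])

# One predicate gives two components: the selected column and everything else.
# `:x1 => Continuous` is a coerce option applied to the table before the split.
y, X = unpack(table, ==(:y); :x1 => Continuous)

println(y)
println(schema(X))
```

Because the selected component has a single column, y comes back as a vector, while X keeps the two remaining columns as a table.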