Datasets

    Standard datasets

    To add a new dataset, assuming it is a CSV file with a header located at data/newdataset.csv:

    Start by loading it with CSV:

    fpath = joinpath("datadir", "newdataset.csv")
    data = CSV.read(fpath, copycols=true,
                    categorical=true)

    Load it with DelimitedFiles and Tables:

    data_raw, data_header = readdlm(fpath, ',', header=true)
    data_table = Tables.table(data_raw; header=Symbol.(vec(data_header)))

    Retrieve the conversions:

    for (n, st) in zip(names(data), scitype_union.(eachcol(data)))
        println(":$n=>$st,")
    end

    Copy and paste the result into a coerce call:

    data_table = coerce(data_table, ...)
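
    The DelimitedFiles step above can be tried without a file on disk. The sketch below is only an illustration: the CSV text is made up, and an IOBuffer stands in for fpath.

    ```julia
    using DelimitedFiles

    # Made-up CSV content standing in for data/newdataset.csv.
    io = IOBuffer("age,height\n31,1.75\n42,1.62\n")

    # With header=true, readdlm returns the data cells and the header cells separately.
    data_raw, data_header = readdlm(io, ',', header=true)

    # The header comes back as a 1 x p matrix of strings; convert it to column symbols.
    colnames = Symbol.(vec(data_header))
    ```

    Here data_raw is a 2 x 2 numeric matrix and colnames is a vector of symbols, the same form that is passed to Tables.table through its header keyword above.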

    MLJBase.load_dataset - Method

    load_dataset(fpath, coercions)

    Load one of the standard datasets (such as Boston), assuming the file is a comma-separated file with a header.


    Synthetic datasets

    MLJBase.make_blobs - Function
    make_blobs(n=100, p=2; kwargs...)

    Generate Gaussian blobs with n examples of p features. The function returns an n x p matrix with the samples and an integer vector of length n indicating the membership of each point.

    Keyword arguments

    • shuffle=true: whether to shuffle the resulting points,
    • centers=3: either a number of centers or a c x p matrix with c pre-determined centers,
    • cluster_std=1.0: the standard deviation(s) of each blob,
    • center_box=(-10. => 10.): the limits of the p-dimensional cube within which the cluster centers are drawn if they are not provided,
    • as_table=true: whether to return the points as a table (true) or a matrix (false). If true, the target vector is a categorical vector,
    • eltype=Float64: to specify another type for the points; can be any subtype of AbstractFloat,
    • rng=nothing: specify a number to make the points reproducible.

    Example

    X, y = make_blobs(100, 3; centers=2, cluster_std=[1.0, 3.0])
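
    To make the roles of centers, cluster_std and center_box concrete, here is a minimal sketch of blob generation in plain Julia. It is not the MLJBase implementation: blobs_sketch is a hypothetical helper, it only handles an integer centers and a scalar cluster_std, and it assumes points are assigned to centers at random.

    ```julia
    using Random

    # Hypothetical helper (not part of MLJBase): draw n points around `centers`
    # cluster centers picked uniformly at random in the cube [lo, hi]^p.
    function blobs_sketch(rng::AbstractRNG, n::Int, p::Int;
                          centers::Int=3, cluster_std::Float64=1.0,
                          center_box=(-10.0 => 10.0))
        lo, hi = center_box.first, center_box.second
        C = lo .+ (hi - lo) .* rand(rng, centers, p)    # centers x p matrix of centers
        y = rand(rng, 1:centers, n)                     # membership of each point
        X = C[y, :] .+ cluster_std .* randn(rng, n, p)  # n x p matrix of samples
        return X, y
    end

    X, y = blobs_sketch(MersenneTwister(42), 100, 2)
    ```

    Each row of X is its center plus isotropic Gaussian noise, and y records which center was used.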

    MLJBase.make_circles - Function
    make_circles(n=100; kwargs...)

    Generate n points along two concentric circles, returning the n x 2 matrix of points and a vector of membership (0, 1) depending on whether the points are on the smaller circle (0) or the larger one (1).

    Keyword arguments

    • shuffle=true: whether to shuffle the resulting points,
    • noise=0: standard deviation of the Gaussian noise added to the data,
    • factor=0.8: ratio of the smaller radius over the larger one,
    • as_table=true: whether to return the points as a table (true) or a matrix (false). If true, the target vector is a categorical vector,
    • eltype=Float64: to specify another type for the points; can be any subtype of AbstractFloat,
    • rng=nothing: specify a number to make the points reproducible.

    Example

    X, y = make_circles(100; noise=0.5, factor=0.3)
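
    The effect of factor and noise can be illustrated with a short sketch in plain Julia. This is an assumption-laden illustration rather than the MLJBase code: circles_sketch is a hypothetical helper, the larger circle is given radius 1, and points are assigned to circles at random.

    ```julia
    using Random

    # Hypothetical helper (not part of MLJBase): sample n points on two circles
    # centered at the origin, with radii `factor` (label 0) and 1 (label 1).
    function circles_sketch(rng::AbstractRNG, n::Int;
                            noise::Float64=0.0, factor::Float64=0.8)
        y = rand(rng, 0:1, n)                       # 0 = smaller circle, 1 = larger
        r = [yi == 0 ? factor : 1.0 for yi in y]    # radius of each point
        θ = 2π .* rand(rng, n)                      # angle of each point
        X = hcat(r .* cos.(θ), r .* sin.(θ)) .+ noise .* randn(rng, n, 2)
        return X, y
    end

    X, y = circles_sketch(MersenneTwister(1), 100; factor=0.3)
    radii = vec(sqrt.(sum(abs2, X; dims=2)))        # distance of each point to the origin
    ```

    With noise=0 every entry of radii equals either factor or 1, up to rounding; a positive noise blurs the two rings.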

    MLJBase.make_moons - Function
    make_moons(n::Int=100; kwargs...)

    Generate n examples sampled from two interleaved half-circles, returning the n x 2 matrix of points and a vector of membership (0, 1) depending on whether the points are on the half-circle on the left (0) or on the right (1).

    Keyword arguments

    • shuffle=true: whether to shuffle the resulting points,
    • noise=0.1: standard deviation of the Gaussian noise added to the data,
    • xshift=1.0: horizontal translation of the second center with respect to the first one,
    • yshift=0.3: vertical translation of the second center with respect to the first one,
    • as_table=true: whether to return the points as a table (true) or a matrix (false). If true, the target vector is a categorical vector,
    • eltype=Float64: to specify another type for the points; can be any subtype of AbstractFloat,
    • rng=nothing: specify a number to make the points reproducible.

    Example

    X, y = make_moons(100; noise=0.5)

    MLJBase.make_regression - Function

    make_regression(n, p; kwargs...)

    Keywords

    • intercept=true: whether to generate data from a model with intercept,
    • sparse=0: portion of the generating weight vector that is sparse,
    • noise=0.1: standard deviation of the Gaussian noise added to the response,
    • outliers=0: portion of the response vector to turn into outliers by adding a random quantity with high variance (only applied if binary is false),
    • binary=false: whether the target should be binarized (via a sigmoid),
    • as_table=true: whether to return the points as a table (true) or a matrix (false). If true, the target vector is a categorical vector,
    • eltype=Float64: to specify another type for the points; can be any subtype of AbstractFloat,
    • rng=nothing: specify a number to make the points reproducible.

    Example

    X, y = make_regression(100, 5; noise=0.5, sparse=0.2, outliers=0.1)
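
    The keywords intercept, sparse and noise suggest a linear generating model. The sketch below assumes exactly that (Gaussian features, a random weight vector, Gaussian noise on the response) and is not taken from MLJBase; regression_sketch is a hypothetical helper, and outliers and binary are left out.

    ```julia
    using Random, LinearAlgebra

    # Hypothetical helper (not part of MLJBase): y = X*w + b + noise, where a
    # fraction `sparse` of the feature weights is set to zero.
    function regression_sketch(rng::AbstractRNG, n::Int, p::Int;
                               intercept::Bool=true, sparse::Float64=0.0,
                               noise::Float64=0.1)
        X = randn(rng, n, p)
        θ = randn(rng, p + Int(intercept))          # last entry is the intercept, if any
        nzero = round(Int, sparse * p)              # number of feature weights to zero out
        θ[randperm(rng, p)[1:nzero]] .= 0.0
        b = intercept ? θ[end] : 0.0
        y = X * θ[1:p] .+ b .+ noise .* randn(rng, n)
        return X, y, θ
    end

    X, y, θ = regression_sketch(MersenneTwister(1), 100, 5; sparse=0.2)
    ```

    With p=5 and sparse=0.2, one of the five feature weights is zero, so the corresponding column of X has no influence on y.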

    MLJBase.finalize_Xy - Method
    finalize_Xy(X, y, shuffle, as_table, eltype, rng; clf)

    Internal function to finalize the make_* functions.

    MLJBase.runif_ab - Method
    runif_ab(rng, n, p, a, b)

    Internal function to generate n points in [a, b]ᵖ uniformly at random.
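
    One way to obtain such points, shown here as a sketch and not necessarily how MLJBase does it, is to rescale draws from [0, 1). The name runif_sketch is hypothetical.

    ```julia
    using Random

    # Hypothetical helper: n x p matrix whose rows are points drawn uniformly
    # from the cube with sides [a, b).
    runif_sketch(rng::AbstractRNG, n::Int, p::Int, a::Real, b::Real) =
        a .+ (b - a) .* rand(rng, n, p)

    U = runif_sketch(MersenneTwister(0), 4, 3, -10.0, 10.0)
    ```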

    MLJBase.sigmoid - Method

    sigmoid(x)

    Return the sigmoid computed in a numerically stable way:

    $σ(x) = 1/(1+exp(-x))$
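
    The source does not say which formulation MLJBase uses. A common one, sketched below under the hypothetical name stable_sigmoid, branches on the sign of x so that exp is only ever evaluated at a non-positive argument and cannot overflow.

    ```julia
    # Hypothetical helper: sigmoid evaluated without calling exp on a large
    # positive number.
    function stable_sigmoid(x::Real)
        if x >= 0
            return 1 / (1 + exp(-x))    # exp(-x) lies in (0, 1]
        else
            z = exp(x)                  # exp(x) lies in (0, 1)
            return z / (1 + z)          # algebraically equal to 1/(1+exp(-x))
        end
    end

    stable_sigmoid(0.0)       # 0.5
    stable_sigmoid(-1000.0)   # 0.0
    stable_sigmoid(1000.0)    # 1.0
    ```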


    Utility functions

    MLJBase.complement - Method
    complement(folds, i)

    The complement of the ith fold of folds in the concatenation of all elements of folds. Here folds is a vector or tuple of integer vectors, typically representing row indices of a vector, matrix or table.

    complement(([1, 2], [3,], [4, 5]), 2) # [1, 2, 4, 5]
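
    The behaviour described above amounts to concatenating every fold except the ith. A plain-Julia sketch, using the hypothetical name complement_sketch so as not to suggest this is the MLJBase source:

    ```julia
    # Hypothetical helper: concatenate all folds except the i-th, keeping order.
    complement_sketch(folds, i) =
        vcat([f for (k, f) in enumerate(folds) if k != i]...)

    rest = complement_sketch(([1, 2], [3,], [4, 5]), 2)   # [1, 2, 4, 5]
    ```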

    MLJBase.corestrict - Method
    corestrict(X, folds, i)

    The restriction of X, a vector, matrix or table, to the complement of the ith fold of folds, where folds is a tuple of vectors of row indices.

    The method is curried, so that corestrict(folds, i) is the operator on data defined by corestrict(folds, i)(X) = corestrict(X, folds, i).

    Example

    folds = ([1, 2], [3, 4, 5],  [6,])
    corestrict([:x1, :x2, :x3, :x4, :x5, :x6], folds, 2) # [:x1, :x2, :x6]

    MLJBase.partition - Method
    partition(rows::AbstractVector{Int}, fractions...;
              shuffle=nothing, rng=Random.GLOBAL_RNG, stratify=nothing)

    Split the vector rows into a tuple of vectors whose lengths are given by the corresponding fractions of length(rows). Valid fractions lie in (0,1) and sum to less than 1. The last fraction is not provided, as it is inferred from the preceding ones. For example:

    julia> partition(1:1000, 0.8)
    ([1,...,800], [801,...,1000])
    
    julia> partition(1:1000, 0.2, 0.7)
    ([1,...,200], [201,...,900], [901,...,1000])

    Keywords

    • shuffle=nothing: if set to true, shuffles the rows before taking fractions.
    • rng=Random.GLOBAL_RNG: specifies the random number generator to be used; can be an integer seed. If rng is specified and shuffle === nothing, then shuffle is interpreted as true.
    • stratify=nothing: if a vector is specified, the partition will match the stratification of the given vector. In that case, shuffle cannot be false.
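
    How the fractions translate into index ranges can be sketched in plain Julia. The helper below is hypothetical and ignores shuffle, rng and stratify; rounding the cumulative fractions to the nearest integer is an assumption, and MLJBase may round differently.

    ```julia
    # Hypothetical helper: cut `rows` at the cumulative fractions of its length.
    function partition_sketch(rows::AbstractVector{Int}, fractions::Real...)
        n = length(rows)
        # Last position of each explicitly requested piece.
        stops = [round(Int, sum(fractions[1:k]) * n) for k in 1:length(fractions)]
        starts = [1; stops .+ 1]
        stops = [stops; n]              # the final piece takes whatever remains
        return Tuple(rows[a:b] for (a, b) in zip(starts, stops))
    end

    train, valid, test = partition_sketch(1:1000, 0.2, 0.7)
    ```

    For 1:1000 with fractions 0.2 and 0.7 the three pieces have lengths 200, 700 and 100, matching the second example above.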

    MLJBase.restrict - Method
    restrict(X, folds, i)

    The restriction of X, a vector, matrix or table, to the ith fold of folds, where folds is a tuple of vectors of row indices.

    The method is curried, so that restrict(folds, i) is the operator on data defined by restrict(folds, i)(X) = restrict(X, folds, i).

    Example

    folds = ([1, 2], [3, 4, 5],  [6,])
    restrict([:x1, :x2, :x3, :x4, :x5, :x6], folds, 2) # [:x3, :x4, :x5]

    See also corestrict
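
    For a plain vector, restrict and corestrict reduce to indexing with a fold or with its complement, and the curried forms simply return closures. The sketch below uses hypothetical names and handles vectors only; the matrix and table cases are not covered.

    ```julia
    # Hypothetical helpers illustrating the documented behaviour on vectors.
    restrict_sketch(X::AbstractVector, folds, i) = X[folds[i]]

    function corestrict_sketch(X::AbstractVector, folds, i)
        # Row indices of every fold except the i-th.
        rest = vcat([f for (k, f) in enumerate(folds) if k != i]...)
        return X[rest]
    end

    # Curried forms: fix the folds and the index, return an operator on data.
    restrict_sketch(folds, i) = X -> restrict_sketch(X, folds, i)
    corestrict_sketch(folds, i) = X -> corestrict_sketch(X, folds, i)

    folds = ([1, 2], [3, 4, 5], [6,])
    X = [:x1, :x2, :x3, :x4, :x5, :x6]
    inside = restrict_sketch(folds, 2)(X)      # [:x3, :x4, :x5]
    outside = corestrict_sketch(folds, 2)(X)   # [:x1, :x2, :x6]
    ```

    The curried form is convenient when the same fold has to be applied to several objects, for example to both the features and the target.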

    MLJBase.unpack - Method
    t1, t2, ...., tk = unpack(table, f1, f2, ... fk; wrap_singles=false)

    Split any Tables.jl compatible table into smaller tables (or vectors) t1, t2, ..., tk by making selections without replacement from the column names defined by the filters f1, f2, ..., fk. A filter is any object f such that f(name) is true or false for each column name::Symbol of table.

    Whenever a returned table contains a single column, it is converted to a vector unless wrap_singles=true.

    Scientific type conversions can be optionally specified (note semicolon):

    unpack(table, t...; wrap_singles=false, col1=>scitype1, col2=>scitype2, ... )

    Example

    julia> table = DataFrame(x=[1,2], y=['a', 'b'], z=[10.0, 20.0], w=[:A, :B])
    julia> Z, XY = unpack(table, ==(:z), !=(:w);
                   :x=>Continuous, :y=>Multiclass)
    julia> XY
    2×2 DataFrame
    │ Row │ x       │ y            │
    │     │ Float64 │ Categorical… │
    ├─────┼─────────┼──────────────┤
    │ 1   │ 1.0     │ 'a'          │
    │ 2   │ 2.0     │ 'b'          │
    
    julia> Z
    2-element Array{Float64,1}:
     10.0
     20.0
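
    The phrase "selections without replacement" means that each filter only sees the columns not already claimed by an earlier filter. The sketch below illustrates that rule on a NamedTuple of column vectors; unpack_sketch is a hypothetical helper, and it performs no scientific type conversion and has no wrap_singles option.

    ```julia
    # Hypothetical helper: split a NamedTuple of columns using successive filters.
    function unpack_sketch(table::NamedTuple, filters...)
        remaining = collect(keys(table))            # column names not yet selected
        out = []
        for f in filters
            picked = filter(f, remaining)           # names accepted by this filter
            remaining = filter(!in(picked), remaining)
            cols = NamedTuple{Tuple(picked)}(Tuple(table[c] for c in picked))
            # A single selected column is returned as a bare vector.
            push!(out, length(picked) == 1 ? first(values(cols)) : cols)
        end
        return Tuple(out)
    end

    table = (x=[1, 2], y=['a', 'b'], z=[10.0, 20.0], w=[:A, :B])
    Z, XY = unpack_sketch(table, ==(:z), !=(:w))
    ```

    Because :z is taken by the first filter, the second filter !=(:w) only yields :x and :y, which is why XY in the example above has two columns rather than three.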

    MLJModelInterface.transform - Method
    transform(e::Union{CategoricalElement,CategoricalArray,CategoricalPool},  X)

    Transform the specified object X into a categorical version, using the pool contained in e. Here X is a raw value (an element of levels(e)) or an AbstractArray of such values.

    ```julia
    julia> v = categorical([:x, :y, :y, :x, :x])
    julia> transform(v, :x)
    CategoricalValue{Symbol,UInt32} :x

    julia> transform(v[1], [:x :x; missing :y])
    2×2 CategoricalArray{Union{Missing, Symbol},2,UInt32}:
     :x       :x
     missing  :y
    ```
