Datasets
Standard datasets
To add a new dataset, assuming it has a header and is located at data/newdataset.csv:
Start by loading it with CSV:
fpath = joinpath("datadir", "newdataset.csv")
data = CSV.read(fpath, copycols=true, categorical=true)
Then load it with DelimitedFiles and Tables:
data_raw, data_header = readdlm(fpath, ',', header=true)
data_table = Tables.table(data_raw; header=Symbol.(vec(data_header)))
Retrieve the conversions:
for (n, st) in zip(names(data), scitype_union.(eachcol(data)))
    println(":$n=>$st,")
end
Copy and paste the result into a call to coerce:
data_table = coerce(data_table, ...)
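The DelimitedFiles step can be tried on a small throwaway file. The sketch below is illustrative only: the file contents and the variable `colnames` are made up for the example, and it stops short of building the table or calling coerce.

```julia
using DelimitedFiles

# Write a tiny comma-separated file with a header (illustrative data only).
fpath = joinpath(mktempdir(), "newdataset.csv")
write(fpath, "a,b\n1,2.5\n3,4.5\n")

# With header=true, readdlm returns the data cells and the header cells separately.
data_raw, data_header = readdlm(fpath, ',', header=true)

# The header comes back as a 1 x p matrix of strings; flatten it and convert to Symbols.
colnames = Symbol.(vec(data_header))
```

The resulting `colnames` vector is what gets passed as the `header` keyword of Tables.table in the step above.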
MLJBase.@load_ames — Macro
Load the full version of the well-known Ames Housing task.
MLJBase.@load_boston — Macro
Load a well-known public regression dataset with Continuous features.
MLJBase.@load_crabs — Macro
Load a well-known crab classification dataset with nominal features.
MLJBase.@load_iris — Macro
Load a well-known public classification task with Continuous features and a nominal target.
MLJBase.@load_reduced_ames — Macro
Load a reduced version of the well-known Ames Housing task.
MLJBase.@load_smarket — Macro
Load the S&P Stock Market dataset, as used in An Introduction to Statistical Learning with Applications in R (https://rdrr.io/cran/ISLR/man/Smarket.html), by Witten et al. (2013), Springer-Verlag, New York.
MLJBase.load_dataset — Method
load_dataset(fpath, coercions)
Load one of the standard datasets (Boston, etc.), assuming the file is a comma-separated file with a header.
Synthetic datasets
MLJBase.make_blobs — Function
make_blobs(n=100, p=2; kwargs...)
Generate Gaussian blobs with n examples of p features. The function returns an n x p matrix with the samples and an integer vector of length n indicating the membership of each point.
Keyword arguments
shuffle=true: whether to shuffle the resulting points.
centers=3: either a number of centers or a c x p matrix with c pre-determined centers.
cluster_std=1.0: the standard deviation(s) of each blob.
center_box=(-10. => 10.): the limits of the p-dimensional cube within which the cluster centers are drawn if they are not provided.
as_table=true: whether to return the points as a table (true) or a matrix (false). If true, the target vector is a categorical vector.
eltype=Float64: to specify another type for the points, can be any subtype of AbstractFloat.
rng=nothing: specify a number to make the points reproducible.
Example
X, y = make_blobs(100, 3; centers=2, cluster_std=[1.0, 3.0])
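To make the roles of centers, cluster_std and center_box concrete, here is a minimal sketch of how Gaussian blobs can be generated by hand. The helper `blobs_sketch` is hypothetical and is not the MLJBase implementation; it assumes a single scalar standard deviation, round-robin assignment of points to centers, and no shuffling.

```julia
using Random

# Hypothetical helper, not part of MLJBase.
function blobs_sketch(n, p; centers=3, cluster_std=1.0,
                      center_box=(-10.0 => 10.0), rng=Random.default_rng())
    lo, hi = first(center_box), last(center_box)
    # Draw the cluster centers uniformly inside the p-dimensional box.
    C = lo .+ (hi - lo) .* rand(rng, centers, p)
    # Assign points to clusters in turn: 1, 2, ..., centers, 1, 2, ...
    y = [mod1(i, centers) for i in 1:n]
    # Each point is its cluster center plus isotropic Gaussian noise.
    X = C[y, :] .+ cluster_std .* randn(rng, n, p)
    return X, y
end

X, y = blobs_sketch(100, 3; centers=2, rng=MersenneTwister(1))
```

Here `X` is a plain 100 x 3 matrix and `y` a vector of integer labels, which corresponds roughly to what the documented function returns with as_table=false.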
MLJBase.make_circles — Function
make_circles(n=100; kwargs...)
Generate n points along two concentric circles, returning the n x 2 matrix of points and a vector of membership (0, 1) depending on whether the points are on the smaller circle (0) or the larger one (1).
Keyword arguments
shuffle=true: whether to shuffle the resulting points.
noise=0: standard deviation of the Gaussian noise added to the data.
factor=0.8: ratio of the smaller radius over the larger one.
as_table=true: whether to return the points as a table (true) or a matrix (false). If true, the target vector is a categorical vector.
eltype=Float64: to specify another type for the points, can be any subtype of AbstractFloat.
rng=nothing: specify a number to make the points reproducible.
Example
X, y = make_circles(100; noise=0.5, factor=0.3)
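The effect of factor and noise can be illustrated with a hand-written generator. The helper `circles_sketch` below is hypothetical and only approximates the documented behaviour: it assumes both circles are centered at the origin, the larger one has radius 1, and points alternate between the two circles.

```julia
using Random

# Hypothetical helper, not part of MLJBase.
function circles_sketch(n; factor=0.8, noise=0.0, rng=Random.default_rng())
    θ = 2π .* rand(rng, n)                      # random angle for each point
    y = [isodd(i) ? 0 : 1 for i in 1:n]         # 0 = smaller circle, 1 = larger circle
    r = [yi == 0 ? factor : 1.0 for yi in y]    # radius depends on membership
    # Polar to Cartesian coordinates, plus optional Gaussian noise.
    X = hcat(r .* cos.(θ), r .* sin.(θ)) .+ noise .* randn(rng, n, 2)
    return X, y
end

X, y = circles_sketch(100; factor=0.3, rng=MersenneTwister(2))
radii = vec(sqrt.(sum(abs2, X; dims=2)))
```

With noise=0 every entry of `radii` equals either factor or 1, depending on the label.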
MLJBase.make_moons — Function
make_moons(n::Int=100; kwargs...)
Generate n examples sampled from two interleaved half-circles, returning the n x 2 matrix of points and a vector of membership (0, 1) depending on whether the points are on the half-circle on the left (0) or on the right (1).
Keyword arguments
shuffle=true: whether to shuffle the resulting points.
noise=0.1: standard deviation of the Gaussian noise added to the data.
xshift=1.0: horizontal translation of the second center with respect to the first one.
yshift=0.3: vertical translation of the second center with respect to the first one.
as_table=true: whether to return the points as a table (true) or a matrix (false). If true, the target vector is a categorical vector.
eltype=Float64: to specify another type for the points, can be any subtype of AbstractFloat.
rng=nothing: specify a number to make the points reproducible.
Example
X, y = make_moons(100; noise=0.5)
MLJBase.make_regression — Function
make_regression(n, p; kwargs...)
Keywords
intercept=true: whether to generate data from a model with intercept.
sparse=0: portion of the generating weight vector whose entries are set to zero.
noise=0.1: standard deviation of the Gaussian noise added to the response.
outliers=0: portion of the response vector to turn into outliers by adding a random quantity with high variance. (Only applied if `binary` is `false`.)
binary=false: whether the target should be binarized (via a sigmoid).
as_table=true: whether to return the points as a table (true) or a matrix (false). If true, the target vector is a categorical vector.
eltype=Float64: to specify another type for the points, can be any subtype of AbstractFloat.
rng=nothing: specify a number to make the points reproducible.
Example
X, y = make_regression(100, 5; noise=0.5, sparse=0.2, outliers=0.1)
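The noise and sparse keywords describe a linear generating model. A minimal sketch of such a model is given below; `regression_sketch` is a hypothetical helper, it omits the intercept, outliers and binarization, and it zeroes the first weights rather than a random subset.

```julia
using Random

# Hypothetical helper, not the MLJBase implementation.
function regression_sketch(n, p; noise=0.1, sparse=0.0, rng=Random.default_rng())
    X = randn(rng, n, p)              # Gaussian features
    θ = randn(rng, p)                 # generating weight vector
    # Set a portion `sparse` of the weights exactly to zero.
    nzero = round(Int, sparse * p)
    θ[1:nzero] .= 0
    # Linear response plus Gaussian noise with standard deviation `noise`.
    y = X * θ .+ noise .* randn(rng, n)
    return X, y, θ
end

X, y, θ = regression_sketch(100, 5; noise=0.5, sparse=0.2, rng=MersenneTwister(0))
```

With p=5 and sparse=0.2, exactly one of the five weights in `θ` is zero.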
MLJBase.augment_X — Method
augment_X(X, fit_intercept)
Given a matrix X, append a column of ones if fit_intercept is true. See make_regression.
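The description of augment_X translates almost directly into code. The function `augment_sketch` below is a hypothetical stand-in written from that description, not the library's own definition.

```julia
# Hypothetical stand-in for augment_X, written from its one-line description.
augment_sketch(X::AbstractMatrix, fit_intercept::Bool) =
    fit_intercept ? hcat(X, ones(eltype(X), size(X, 1))) : X

A = augment_sketch([1.0 2.0; 3.0 4.0], true)    # gains a third column of ones
B = augment_sketch([1.0 2.0; 3.0 4.0], false)   # returned unchanged
```

The extra column lets the intercept be treated as one more weight in the product of the augmented matrix with the weight vector.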
MLJBase.finalize_Xy — Method
finalize_Xy(X, y, shuffle, as_table, eltype, rng; clf)
Internal function to finalize the make_* functions.
MLJBase.outlify! — Method
Add outliers to portion s of vector.
MLJBase.runif_ab — Method
runif_ab(rng, n, p, a, b)
Internal function to generate n points in [a, b]ᵖ uniformly at random.
MLJBase.sigmoid — Method
sigmoid(x)
Return the sigmoid computed in a numerically stable way:
$σ(x) = 1/(1+exp(-x))$
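One common way to evaluate this formula stably is to branch on the sign of x so that exp is never called on a large positive argument. The sketch below shows that formulation; `stable_sigmoid` is a hypothetical name, and the actual MLJBase code may be written differently.

```julia
# Hypothetical sketch of a sign-split sigmoid; not necessarily MLJBase's code.
function stable_sigmoid(x::Real)
    if x >= 0
        # For x >= 0, exp(-x) lies in (0, 1], so there is no overflow.
        return 1 / (1 + exp(-x))
    else
        # For x < 0, use the equivalent form exp(x) / (1 + exp(x)).
        e = exp(x)
        return e / (1 + e)
    end
end

s0 = stable_sigmoid(0)
s_neg = stable_sigmoid(-1000.0)
s_pos = stable_sigmoid(1000.0)
```

Both branches compute the same mathematical function; only the intermediate quantities differ.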
MLJBase.sparsify! — Method
sparsify!(rng, θ, s)
Make portion s of vector θ exactly 0.
Utility functions
MLJBase.complement — Method
complement(folds, i)
The complement of the ith fold of folds in the concatenation of all elements of folds. Here folds is a vector or tuple of integer vectors, typically representing row indices of a vector, matrix or table.
complement(([1,2], [3,], [4, 5]), 2) # [1, 2, 4, 5]
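The definition amounts to concatenating every fold except the ith. A minimal sketch using only Base functions follows; `complement_sketch` is a hypothetical name and not the library function.

```julia
# Hypothetical stand-in for complement: concatenate all folds except fold i.
complement_sketch(folds, i) =
    reduce(vcat, [f for (k, f) in enumerate(folds) if k != i])

rows = complement_sketch(([1, 2], [3,], [4, 5]), 2)
```

In a cross-validation loop, the ith fold would typically serve as the held-out rows and its complement as the training rows.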
MLJBase.corestrict — Method
corestrict(X, folds, i)
The restriction of X, a vector, matrix or table, to the complement of the ith fold of folds, where folds is a tuple of vectors of row indices.
The method is curried, so that corestrict(folds, i) is the operator on data defined by corestrict(folds, i)(X) = corestrict(X, folds, i).
Example
folds = ([1, 2], [3, 4, 5], [6,])
corestrict([:x1, :x2, :x3, :x4, :x5, :x6], folds, 2) # [:x1, :x2, :x6]
MLJBase.partition — Method
partition(rows::AbstractVector{Int}, fractions...;
          shuffle=nothing, rng=Random.GLOBAL_RNG, stratify=nothing)
Splits the vector rows into a tuple of vectors whose lengths are given by the corresponding fractions of length(rows), where valid fractions are in (0,1) and sum up to less than 1. The last fraction is not provided, as it is inferred from the preceding ones. So, for example,
julia> partition(1:1000, 0.8)
([1,...,800], [801,...,1000])
julia> partition(1:1000, 0.2, 0.7)
([1,...,200], [201,...,900], [901,...,1000])
Keywords
shuffle=nothing: if set to true, shuffles the rows before taking fractions.
rng=Random.GLOBAL_RNG: specifies the random number generator to be used; can be an integer seed. If rng is specified and shuffle === nothing, then shuffle is interpreted as true.
stratify=nothing: if a vector is specified, the partition will match the stratification of the given vector. In that case, shuffle cannot be false.
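The way fractions map to index ranges, in the case without shuffling or stratification, can be sketched as follows. `partition_sketch` is a hypothetical helper; in particular, rounding the cumulative fractions to the nearest integer is an assumption and may not match how the library resolves non-integer split points.

```julia
# Hypothetical helper covering only the unshuffled, unstratified case.
function partition_sketch(rows::AbstractVector{Int}, fractions::Real...)
    n = length(rows)
    # Cumulative fractions give the last position of each leading block.
    stops = round.(Int, cumsum(collect(fractions)) .* n)
    starts = [1; stops .+ 1]
    ends = [stops; n]          # the final block takes whatever remains
    return Tuple(rows[a:b] for (a, b) in zip(starts, ends))
end

train, test = partition_sketch(1:1000, 0.8)
a, b, c = partition_sketch(1:1000, 0.2, 0.7)
```

The last block is never given a fraction of its own, which mirrors the rule that the final fraction is inferred from the preceding ones.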
MLJBase.restrict — Method
restrict(X, folds, i)
The restriction of X, a vector, matrix or table, to the ith fold of folds, where folds is a tuple of vectors of row indices.
The method is curried, so that restrict(folds, i) is the operator on data defined by restrict(folds, i)(X) = restrict(X, folds, i).
Example
folds = ([1, 2], [3, 4, 5], [6,])
restrict([:x1, :x2, :x3, :x4, :x5, :x6], folds, 2) # [:x3, :x4, :x5]
See also corestrict
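The relationship between restrict and corestrict, including the curried forms, can be sketched for the vector case. The names `restrict_sketch` and `corestrict_sketch` are hypothetical; the real methods also accept matrices and tables, which this sketch does not attempt.

```julia
# Hypothetical stand-ins, limited to vectors for simplicity.
restrict_sketch(X::AbstractVector, folds, i) = X[folds[i]]
corestrict_sketch(X::AbstractVector, folds, i) =
    X[reduce(vcat, [f for (k, f) in enumerate(folds) if k != i])]

# Curried forms: fix folds and i, and get back an operator on data.
restrict_sketch(folds, i) = X -> restrict_sketch(X, folds, i)
corestrict_sketch(folds, i) = X -> corestrict_sketch(X, folds, i)

folds = ([1, 2], [3, 4, 5], [6,])
X = [:x1, :x2, :x3, :x4, :x5, :x6]
held_out = restrict_sketch(folds, 2)(X)
training = corestrict_sketch(folds, 2)(X)
```

The curried forms are convenient when the same fold selection has to be applied to several objects, for example features and target.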
MLJBase.unpack — Method
t1, t2, ..., tk = unpack(table, f1, f2, ... fk; wrap_singles=false)
Split any Tables.jl compatible table into smaller tables (or vectors) t1, t2, ..., tk by making selections without replacement from the column names defined by the filters f1, f2, ..., fk. A filter is any object f such that f(name) is true or false for each column name::Symbol of table.
Whenever a returned table contains a single column, it is converted to a vector unless wrap_singles=true.
Scientific type conversions can be optionally specified (note semicolon):
unpack(table, t...; wrap_singles=false, col1=>scitype1, col2=>scitype2, ... )
Example
julia> table = DataFrame(x=[1,2], y=['a', 'b'], z=[10.0, 20.0], w=[:A, :B])
julia> Z, XY = unpack(table, ==(:z), !=(:w);
                      :x=>Continuous, :y=>Multiclass)
julia> XY
2×2 DataFrame
│ Row │ x       │ y            │
│     │ Float64 │ Categorical… │
├─────┼─────────┼──────────────┤
│ 1   │ 1.0     │ 'a'          │
│ 2   │ 2.0     │ 'b'          │
julia> Z
2-element Array{Float64,1}:
10.0
20.0
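The phrase "selections without replacement" means that a column claimed by one filter is no longer available to later filters. The sketch below illustrates just that selection logic on a named tuple of column vectors; `unpack_sketch` is a hypothetical helper that ignores scientific type conversions and the wrap_singles option.

```julia
# Hypothetical helper: column selection only, no scitype coercion.
function unpack_sketch(table::NamedTuple, filters...)
    remaining = collect(keys(table))       # column names not yet claimed
    out = []
    for f in filters
        chosen = filter(f, remaining)
        # Remove the chosen names so later filters cannot select them again.
        filter!(c -> !(c in chosen), remaining)
        if length(chosen) == 1
            push!(out, table[chosen[1]])   # single column becomes a vector
        else
            cols = Tuple(table[c] for c in chosen)
            push!(out, NamedTuple{Tuple(chosen)}(cols))
        end
    end
    return Tuple(out)
end

table = (x=[1, 2], y=['a', 'b'], z=[10.0, 20.0], w=[:A, :B])
Z, XY = unpack_sketch(table, ==(:z), !=(:w))
```

Because `:z` is taken by the first filter, the second filter `!=(:w)` only sees `:x`, `:y` and `:w`, and so selects `:x` and `:y`.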
MLJModelInterface.transform — Method
transform(e::Union{CategoricalElement,CategoricalArray,CategoricalPool}, X)
Transform the specified object X into a categorical version, using the pool contained in e. Here X is a raw value (an element of levels(e)) or an AbstractArray of such values.
```julia
julia> v = categorical([:x, :y, :y, :x, :x])

julia> transform(v, :x)
CategoricalValue{Symbol,UInt32} :x

julia> transform(v[1], [:x :x; missing :y])
2×2 CategoricalArray{Union{Missing, Symbol},2,UInt32}:
 :x       :x
 missing  :y
```