Datasets
Standard datasets
To add a new dataset, assuming it has a header and is located at path data/newdataset.csv:
Start by loading it with CSV:
fpath = joinpath("datadir", "newdataset.csv")
data = CSV.read(fpath, copycols=true,
categorical=true)
Then load it with DelimitedFiles and Tables:
data_raw, data_header = readdlm(fpath, ',', header=true)
data_table = Tables.table(data_raw; header=Symbol.(vec(data_header)))
Retrieve the conversions:
for (n, st) in zip(names(data), scitype_union.(eachcol(data)))
println(":$n=>$st,")
end
Copy and paste the result into a coerce call:
data_table = coerce(data_table, ...)
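The DelimitedFiles step above can be tried on a throwaway file. The following is a minimal sketch that assumes a small made-up CSV (the file contents and the temporary path are illustrative only) and uses nothing beyond the DelimitedFiles standard library. The Tables.table and coerce steps are left out because they need external packages.

```julia
using DelimitedFiles

# Write a tiny comma-separated file with a header to a temporary location.
# The contents are made up purely for illustration.
fpath = joinpath(mktempdir(), "newdataset.csv")
write(fpath, "x1,x2,y\n1.0,2.0,0\n3.0,4.0,1\n")

# With header=true, readdlm returns the data matrix and a 1×p header matrix.
data_raw, data_header = readdlm(fpath, ',', header=true)

# Column names as symbols, in the form passed to Tables.table in the text above.
colnames = Symbol.(vec(data_header))
```

Here colnames is the vector that would be supplied as the header keyword of Tables.table.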
MLJBase.load_dataset — Method
load_dataset(fpath, coercions)
Load one of the standard datasets, such as Boston, assuming the file is comma-separated and has a header.
MLJBase.load_sunspots — Method
Load a well-known sunspot time series (table with one column). https://www.sws.bom.gov.au/Educational/2/3/6
MLJBase.@load_ames — Macro
Load the full version of the well-known Ames Housing task.
MLJBase.@load_boston — Macro
Load a well-known public regression dataset with Continuous features.
MLJBase.@load_crabs — Macro
Load a well-known crab classification dataset with nominal features.
MLJBase.@load_iris — Macro
Load a well-known public classification task with nominal features.
MLJBase.@load_reduced_ames — Macro
Load a reduced version of the well-known Ames Housing task.
MLJBase.@load_smarket — Macro
Load S&P Stock Market dataset, as used in An Introduction to Statistical Learning with applications in R, by Witten et al (2013), Springer-Verlag, New York.
MLJBase.@load_sunspots — Macro
Load a well-known sunspot time series (single table with one column).
Synthetic datasets
MLJBase.augment_X — Method
augment_X(X, fit_intercept)
Given a matrix X, append a column of ones if fit_intercept is true. See make_regression.
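The behaviour described for augment_X can be sketched in a few lines. This is an illustration under the assumption that "append" means adding the column on the right; augment_sketch is a hypothetical name and this is not MLJBase's implementation.

```julia
# Hypothetical stand-in for illustration; not the MLJBase source.
function augment_sketch(X::AbstractMatrix, fit_intercept::Bool)
    # Without an intercept the matrix is returned unchanged.
    fit_intercept || return X
    # Otherwise append a column of ones with the same element type as X.
    return hcat(X, ones(eltype(X), size(X, 1)))
end

Xaug = augment_sketch([1.0 2.0; 3.0 4.0], true)
```

The extra column lets a linear model absorb the intercept into its weight vector.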
MLJBase.finalize_Xy — Method
finalize_Xy(X, y, shuffle, as_table, eltype, rng; clf)
Internal function to finalize the make_* functions.
MLJBase.make_blobs — Function
X, y = make_blobs(n=100, p=2; kwargs...)
Generate Gaussian blobs for clustering and classification problems.
Return value
By default, a table X with p columns (features) and n rows (observations), together with a corresponding vector of n Multiclass target observations y, indicating blob membership.
Keyword arguments
shuffle=true: whether to shuffle the resulting points.
centers=3: either a number of centers or a c x p matrix with c pre-determined centers.
cluster_std=1.0: the standard deviation(s) of each blob.
center_box=(-10. => 10.): the limits of the p-dimensional cube within which the cluster centers are drawn if they are not provided.
eltype=Float64: machine type of points (any subtype of AbstractFloat).
rng=Random.GLOBAL_RNG: any AbstractRNG object, or integer to seed a MersenneTwister (for reproducibility).
as_table=true: whether to return the points as a table (true) or a matrix (false). If false, the target y has integer element type.
Example
X, y = make_blobs(100, 3; centers=2, cluster_std=[1.0, 3.0])
MLJBase.make_circles — Function
X, y = make_circles(n=100; kwargs...)
Generate n labeled points close to two concentric circles for classification and clustering models.
Return value
By default, a table X with 2 columns and n rows (observations), together with a corresponding vector of n Multiclass target observations y. The target is either 0 or 1, corresponding to membership to the smaller or larger circle, respectively.
Keyword arguments
shuffle=true: whether to shuffle the resulting points.
noise=0: standard deviation of the Gaussian noise added to the data.
factor=0.8: ratio of the smaller radius over the larger one.
eltype=Float64: machine type of points (any subtype of AbstractFloat).
rng=Random.GLOBAL_RNG: any AbstractRNG object, or integer to seed a MersenneTwister (for reproducibility).
as_table=true: whether to return the points as a table (true) or a matrix (false). If false, the target y has integer element type.
Example
X, y = make_circles(100; noise=0.5, factor=0.3)
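To make the roles of factor and noise concrete, here is a rough sketch of the geometry described above. It assumes the larger circle has radius 1 and that points are split evenly between the two circles; both are assumptions, circles_sketch is a hypothetical name, and this is not how MLJBase necessarily generates its points.

```julia
using Random

# Illustrative sketch only; not the MLJBase implementation.
function circles_sketch(n::Int; noise=0.0, factor=0.8, rng=Random.default_rng())
    n_small = n ÷ 2                                   # points on the smaller circle
    y = vcat(zeros(Int, n_small), ones(Int, n - n_small))
    θ = 2π .* rand(rng, n)                            # random angles
    r = ifelse.(y .== 0, factor, 1.0)                 # label 0: radius `factor`; label 1: radius 1
    # Cartesian coordinates plus isotropic Gaussian noise with std `noise`.
    X = hcat(r .* cos.(θ), r .* sin.(θ)) .+ noise .* randn(rng, n, 2)
    return X, y
end

X, y = circles_sketch(100; rng=MersenneTwister(1))
radii = vec(sqrt.(sum(abs2, X; dims=2)))              # distance of each point from the origin
```

With noise left at zero, every point sits exactly on its circle, so radii takes only the values factor and 1.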
MLJBase.make_moons — Function
make_moons(n::Int=100; kwargs...)
Generates labeled two-dimensional points lying close to two interleaved semi-circles, for use with classification and clustering models.
Return value
By default, a table X with 2 columns and n rows (observations), together with a corresponding vector of n Multiclass target observations y. The target is either 0 or 1, corresponding to membership to the left or right semi-circle.
Keyword arguments
shuffle=true: whether to shuffle the resulting points.
noise=0.1: standard deviation of the Gaussian noise added to the data.
xshift=1.0: horizontal translation of the second center with respect to the first one.
yshift=0.3: vertical translation of the second center with respect to the first one.
eltype=Float64: machine type of points (any subtype of AbstractFloat).
rng=Random.GLOBAL_RNG: any AbstractRNG object, or integer to seed a MersenneTwister (for reproducibility).
as_table=true: whether to return the points as a table (true) or a matrix (false). If false, the target y has integer element type.
Example
X, y = make_moons(100; noise=0.5)
MLJBase.make_regression — Function
make_regression(n, p; kwargs...)
Generate Gaussian input features and a linear response with Gaussian noise, for use with regression models.
Return value
By default, a tuple (X, y) where table X has p columns and n rows (observations), together with a corresponding vector of n Continuous target observations y.
Keywords
intercept=true: whether to generate data from a model with intercept.
n_targets=1: number of columns in the target.
sparse=0: proportion of the generating weight vector that is sparse.
noise=0.1: standard deviation of the Gaussian noise added to the response (target).
outliers=0: proportion of the response vector to turn into outliers by adding a random quantity with high variance (only applied if binary is false).
as_table=true: whether X (and y, if n_targets > 1) should be a table or a matrix.
eltype=Float64: element type for X and y; must be a subtype of AbstractFloat.
binary=false: whether the target should be binarized (via a sigmoid).
rng=Random.GLOBAL_RNG: any AbstractRNG object, or integer to seed a MersenneTwister (for reproducibility).
Example
X, y = make_regression(100, 5; noise=0.5, sparse=0.2, outliers=0.1)
MLJBase.outlify! — Method
outlify!(rng, y, s)
Add outliers to portion s of vector y.
MLJBase.runif_ab — Method
runif_ab(rng, n, p, a, b)
Internal function to generate n points in [a, b]ᵖ uniformly at random.
MLJBase.sigmoid — Method
sigmoid(x)
Return the sigmoid computed in a numerically stable way: $σ(x) = 1/(1+\exp(-x))$
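One common way to evaluate this formula stably is to branch on the sign of x so that exp is never called on a large positive argument. The sketch below shows that technique; stable_sigmoid is a hypothetical name, and whether MLJBase uses exactly this branching is an assumption.

```julia
# Illustrative sketch of a sign-branching sigmoid; not the MLJBase source.
function stable_sigmoid(x::Real)
    if x >= 0
        # exp(-x) lies in (0, 1], so the denominator stays between 1 and 2.
        return 1 / (1 + exp(-x))
    else
        # Algebraically the same as 1/(1+exp(-x)), but exp(x) lies in (0, 1) here.
        z = exp(x)
        return z / (1 + z)
    end
end

values = stable_sigmoid.([-1000.0, 0.0, 1000.0])
```

Both branches compute the same function; they differ only in which exponential is evaluated.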
MLJBase.sparsify! — Method
sparsify!(rng, θ, s)
Make portion s of vector θ exactly 0.
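As an illustration of what zeroing a portion of a vector can look like, the sketch below picks round(s * length(θ)) positions at random and sets them to zero in place. The rounding rule and the selection by random permutation are assumptions, and sparsify_sketch! is a hypothetical name rather than the MLJBase function.

```julia
using Random

# Illustrative sketch only; not the MLJBase implementation.
function sparsify_sketch!(rng::AbstractRNG, θ::AbstractVector, s::Real)
    n = length(θ)
    k = round(Int, s * n)            # number of entries to zero out (assumed rounding rule)
    idx = randperm(rng, n)[1:k]      # k distinct positions chosen at random
    θ[idx] .= 0
    return θ
end

θ = ones(10)
sparsify_sketch!(MersenneTwister(0), θ, 0.3)
```

After the call, three of the ten entries of θ are zero and the rest are untouched.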
Utility functions
MLJBase.complement — Method
complement(folds, i)
The complement of the ith fold of folds in the concatenation of all elements of folds. Here folds is a vector or tuple of integer vectors, typically representing row indices of a vector, matrix or table.
complement(([1,2], [3,], [4, 5]), 2) # [1, 2, 4, 5]
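The definition above amounts to concatenating every fold except the ith. A minimal sketch of that idea follows; complement_sketch is a hypothetical name and not the MLJBase source.

```julia
# Illustrative sketch only: concatenate all folds other than the i-th one.
complement_sketch(folds, i) =
    reduce(vcat, [folds[k] for k in 1:length(folds) if k != i])

rows = complement_sketch(([1, 2], [3], [4, 5]), 2)
```

The order of the remaining folds is preserved, which matches the example given above.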
MLJBase.corestrict — Method
corestrict(X, folds, i)
The restriction of X, a vector, matrix or table, to the complement of the ith fold of folds, where folds is a tuple of vectors of row indices.
The method is curried, so that corestrict(folds, i) is the operator on data defined by corestrict(folds, i)(X) = corestrict(X, folds, i).
Example
folds = ([1, 2], [3, 4, 5], [6,])
corestrict([:x1, :x2, :x3, :x4, :x5, :x6], folds, 2) # [:x1, :x2, :x6]
MLJBase.partition — Method
partition(X, fractions...;
          shuffle=nothing,
          rng=Random.GLOBAL_RNG,
          stratify=nothing,
          multi=false)
Splits the vector, matrix or table X into a tuple of objects of the same type, whose vertical concatenation is X. The number of rows in each component of the return value is determined by the corresponding fractions of nrows(X), where valid fractions are floats between 0 and 1 whose sum is less than one. The last fraction is not provided, as it is inferred from the preceding ones.
For synchronized partitioning of multiple objects, use the multi=true option.
julia> partition(1:1000, 0.8)
([1,...,800], [801,...,1000])
julia> partition(1:1000, 0.2, 0.7)
([1,...,200], [201,...,900], [901,...,1000])
julia> partition(reshape(1:10, 5, 2), 0.2, 0.4)
([1 6], [2 7; 3 8], [4 9; 5 10])
julia> X, y = make_blobs() # a table and vector
julia> Xtrain, Xtest = partition(X, 0.8, stratify=y)
Here's an example of synchronized partitioning of multiple objects:
julia> (Xtrain, Xtest), (ytrain, ytest) = partition((X, y), 0.8, rng=123, multi=true)
Keywords
shuffle=nothing: if set to true, shuffles the rows before taking fractions.
rng=Random.GLOBAL_RNG: specifies the random number generator to be used; can be an integer seed. If rng is specified and shuffle === nothing, then shuffle is interpreted as true.
stratify=nothing: if a vector is specified, the partition will match the stratification of the given vector. In that case, shuffle cannot be false.
multi=false: if true, then X is expected to be a tuple of objects sharing a common length, which are each partitioned separately using the same specified fractions and the same row shuffling. Returns a tuple of partitions (a tuple of tuples).
MLJBase.restrict — Method
restrict(X, folds, i)
The restriction of X, a vector, matrix or table, to the ith fold of folds, where folds is a tuple of vectors of row indices.
The method is curried, so that restrict(folds, i) is the operator on data defined by restrict(folds, i)(X) = restrict(X, folds, i).
Example
folds = ([1, 2], [3, 4, 5], [6,])
restrict([:x1, :x2, :x3, :x4, :x5, :x6], folds, 2) # [:x3, :x4, :x5]
See also corestrict
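The relationship between restrict, corestrict and their curried forms can be sketched for plain vectors as follows. The names restrict_sketch and corestrict_sketch are hypothetical, and matrices and tables are not handled here.

```julia
# Illustrative sketches for vectors only; not the MLJBase source.

# Rows belonging to the i-th fold.
restrict_sketch(X::AbstractVector, folds, i) = X[folds[i]]

# Rows belonging to every fold except the i-th.
corestrict_sketch(X::AbstractVector, folds, i) =
    X[reduce(vcat, [folds[k] for k in 1:length(folds) if k != i])]

# Curried forms: fix folds and i, and return an operator on data.
restrict_sketch(folds, i) = X -> restrict_sketch(X, folds, i)
corestrict_sketch(folds, i) = X -> corestrict_sketch(X, folds, i)

folds = ([1, 2], [3, 4, 5], [6])
v = [:x1, :x2, :x3, :x4, :x5, :x6]
test_rows = restrict_sketch(folds, 2)(v)
train_rows = corestrict_sketch(folds, 2)(v)
```

The curried forms are convenient when the same fold has to be applied to several objects, for example features and target.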
MLJBase.skipinvalid — Method
skipinvalid(A, B)
For vectors A and B of the same length, return a tuple of vectors (A[mask], B[mask]) where mask[i] is true if and only if A[i] and B[i] are both valid (non-missing and non-NaN). Can also be called on other iterators of matching length, such as arrays, but always returns a vector. Does not remove Missing from the element types if present in the original iterators.
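The masking rule described above can be sketched as follows. The helper names isinvalid_sketch and skipinvalid_sketch are hypothetical, and the sketch is meant only to illustrate the rule, not to reproduce MLJBase's code.

```julia
# Illustrative sketch only; not the MLJBase source.

# An entry is invalid if it is missing or a NaN number.
isinvalid_sketch(x) = ismissing(x) || (x isa Number && isnan(x))

function skipinvalid_sketch(A, B)
    a, b = vec(collect(A)), vec(collect(B))     # always work with vectors
    # Keep position i only when both a[i] and b[i] are valid.
    mask = .!(isinvalid_sketch.(a) .| isinvalid_sketch.(b))
    return a[mask], b[mask]
end

A = [1.0, missing, 3.0, NaN]
B = [1.0, 2.0, missing, 4.0]
A2, B2 = skipinvalid_sketch(A, B)
```

Only the first position survives here, and the element type of A2 still admits Missing, as noted in the description.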
MLJBase.skipinvalid — Method
skipinvalid(itr)
Return an iterator over the elements in itr skipping missing and NaN values. Behaviour is similar to skipmissing.
MLJBase.unpack — Method
unpack(table, f1, f2, ... fk;
       wrap_singles=false,
       shuffle=false,
       rng::Union{AbstractRNG,Int,Nothing}=nothing,
       coerce_options...)
Horizontally split any Tables.jl compatible table into smaller tables or vectors by making column selections determined by the predicates f1, f2, ..., fk. Selection from the column names is without replacement. A predicate is any object f such that f(name) is true or false for each column name::Symbol of table.
Returns a tuple of tables/vectors with length one greater than the number of supplied predicates, with the last component including all previously unselected columns.
julia> table = DataFrame(x=[1,2], y=['a', 'b'], z=[10.0, 20.0], w=["A", "B"])
2×4 DataFrame
Row │ x y z w
│ Int64 Char Float64 String
─────┼──────────────────────────────
1 │ 1 a 10.0 A
2 │ 2 b 20.0 B
julia> Z, XY, W = unpack(table, ==(:z), !=(:w));
julia> Z
2-element Vector{Float64}:
10.0
20.0
julia> XY
2×2 DataFrame
Row │ x y
│ Int64 Char
─────┼─────────────
1 │ 1 a
2 │ 2 b
julia> W # the column(s) left over
2-element Vector{String}:
"A"
"B"
Whenever a returned table contains a single column, it is converted to a vector unless wrap_singles=true.
If coerce_options are specified then table is first replaced with coerce(table, coerce_options). See ScientificTypes.coerce for details.
If shuffle=true then the rows of table are first shuffled, using the global RNG, unless rng is specified; if rng is an integer, it specifies the seed of an automatically generated Mersenne twister. If rng is specified then shuffle=true is implicit.
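Selection "without replacement" means that a column claimed by an earlier predicate is no longer available to later ones. The sketch below applies that rule to column names only, leaving tables aside. unpack_names_sketch is a hypothetical name and this is not the MLJBase implementation.

```julia
# Illustrative sketch only: group column names by predicates, without replacement.
function unpack_names_sketch(names, predicates...)
    remaining = collect(Symbol, names)
    groups = Vector{Vector{Symbol}}()
    for f in predicates
        selected = filter(f, remaining)                       # names this predicate claims
        push!(groups, selected)
        remaining = filter(n -> !(n in selected), remaining)  # claimed names are removed
    end
    push!(groups, remaining)                                  # leftovers form the last group
    return groups
end

groups = unpack_names_sketch([:x, :y, :z, :w], ==(:z), !=(:w))
```

This mirrors the DataFrame example above: the predicate !=(:w) also holds for :z, but :z has already been taken by the first predicate, so the second group contains only :x and :y.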