Preparing Data
Splitting data
MLJ has two tools for splitting data. To split data vertically (that is, to split by observations) use partition. This is commonly applied to a vector of observation indices, but can also be applied to datasets themselves, provided they are vectors, matrices or tables.

To split tabular data horizontally (i.e., to break up a table based on feature names) use unpack.
MLJBase.partition — Function

partition(X, fractions...;
          shuffle=nothing,
          rng=Random.GLOBAL_RNG,
          stratify=nothing,
          multi=false)
Splits the vector, matrix or table X into a tuple of objects of the same type, whose vertical concatenation is X. The number of rows in each component of the return value is determined by the corresponding fractions of length(nrows(X)), where valid fractions are floats between 0 and 1 whose sum is less than one. The last fraction is not provided, as it is inferred from the preceding ones.

For synchronized partitioning of multiple objects, use the multi=true option.
julia> partition(1:1000, 0.8)
([1,...,800], [801,...,1000])
julia> partition(1:1000, 0.2, 0.7)
([1,...,200], [201,...,900], [901,...,1000])
julia> partition(reshape(1:10, 5, 2), 0.2, 0.4)
([1 6], [2 7; 3 8], [4 9; 5 10])
julia> X, y = make_blobs() # a table and vector
julia> Xtrain, Xtest = partition(X, 0.8, stratify=y)
Here's an example of synchronized partitioning of multiple objects:
julia> (Xtrain, Xtest), (ytrain, ytest) = partition((X, y), 0.8, rng=123, multi=true)
Keywords

- shuffle=nothing: if set to true, shuffles the rows before taking fractions.

- rng=Random.GLOBAL_RNG: specifies the random number generator to be used; can be an integer seed. If rng is specified and shuffle === nothing, then shuffle is interpreted as true.

- stratify=nothing: if a vector is specified, the partition will match the stratification of the given vector. In that case, shuffle cannot be false.

- multi=false: if true then X is expected to be a tuple of objects sharing a common length, which are each partitioned separately using the same specified fractions and the same row shuffling. Returns a tuple of partitions (a tuple of tuples).
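For instance, the rng keyword on its own triggers shuffling, per the keyword descriptions above (a minimal sketch; the particular train/test indices depend on the RNG):

```julia
using MLJBase  # `partition` lives in MLJBase and is re-exported by MLJ

# Passing an integer seed for `rng` implies `shuffle=true`:
train, test = partition(1:10, 0.3; rng=123)

# However the rows are shuffled, the two parts together still
# contain every original observation exactly once:
sort(vcat(train, test)) == collect(1:10)  # true
```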
MLJBase.unpack — Function

unpack(table, f1, f2, ... fk;
       wrap_singles=false,
       shuffle=false,
       rng::Union{AbstractRNG,Int,Nothing}=nothing,
       coerce_options...)
Horizontally split any Tables.jl compatible table into smaller tables or vectors by making column selections determined by the predicates f1, f2, ..., fk. Selection from the column names is without replacement. A predicate is any object f such that f(name) is true or false for each column name::Symbol of table.

Returns a tuple of tables/vectors with length one greater than the number of supplied predicates, with the last component including all previously unselected columns.
julia> table = DataFrame(x=[1,2], y=['a', 'b'], z=[10.0, 20.0], w=["A", "B"])
2×4 DataFrame
Row │ x y z w
│ Int64 Char Float64 String
─────┼──────────────────────────────
1 │ 1 a 10.0 A
2 │ 2 b 20.0 B
julia> Z, XY, W = unpack(table, ==(:z), !=(:w));
julia> Z
2-element Vector{Float64}:
10.0
20.0
julia> XY
2×2 DataFrame
Row │ x y
│ Int64 Char
─────┼─────────────
1 │ 1 a
2 │ 2 b
julia> W # the column(s) left over
2-element Vector{String}:
"A"
"B"
Whenever a returned table contains a single column, it is converted to a vector unless wrap_singles=true.

If coerce_options are specified then table is first replaced with coerce(table, coerce_options). See ScientificTypes.coerce for details.

If shuffle=true then the rows of table are first shuffled, using the global RNG, unless rng is specified; if rng is an integer, it specifies the seed of an automatically generated Mersenne twister. If rng is specified then shuffle=true is implicit.
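A quick sketch of the wrap_singles keyword, using a plain NamedTuple as the table (the column names here are invented for illustration):

```julia
using MLJBase  # `unpack` lives in MLJBase and is re-exported by MLJ

tbl = (x = [1, 2, 3], y = [10.0, 20.0, 30.0], z = ["a", "b", "c"])

# A single selected column comes back as a vector by default ...
yvec, rest = unpack(tbl, ==(:y))

# ... but is kept as a one-column table when `wrap_singles=true`:
ytbl, _ = unpack(tbl, ==(:y); wrap_singles=true)
```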
Bridging the gap between data type and model requirements
As outlined in Getting Started, it is important that the scientific type of data matches the requirements of the model of interest. For example, while the majority of supervised learning models require input features to be Continuous, newcomers to MLJ are sometimes surprised at the disappointing results of model queries such as this one:
X = (height   = [185, 153, 163, 114, 180],
     time     = [2.3, 4.5, 4.2, 1.8, 7.1],
     mark     = ["D", "A", "C", "B", "A"],
     admitted = ["yes", "no", missing, "yes"]);
y = [12.4, 12.5, 12.0, 31.9, 43.0]
models(matching(X, y))
4-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :constructor, :deep_properties, :docstring, :fit_data_scitype, :human_name, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :reporting_operations, :reports_feature_importances, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :target_in_fit, :transform_scitype, :input_scitype, :target_scitype, :output_scitype)}}:
(name = ConstantRegressor, package_name = MLJModels, ... )
(name = DecisionTreeRegressor, package_name = BetaML, ... )
(name = DeterministicConstantRegressor, package_name = MLJModels, ... )
(name = RandomForestRegressor, package_name = BetaML, ... )
Or are unsure about the source of the following warning:
julia> Tree = @load DecisionTreeRegressor pkg=DecisionTree verbosity=0;
julia> tree = Tree();
julia> machine(tree, X, y)
┌ Warning: The scitype of `X`, in `machine(model, X, ...)` is incompatible with `model=DecisionTreeRegressor @378`:
│ scitype(X) = Table{Union{AbstractVector{Continuous}, AbstractVector{Count}, AbstractVector{Textual}, AbstractVector{Union{Missing, Textual}}}}
│ input_scitype(model) = Table{var"#s46"} where var"#s46"<:Union{AbstractVector{var"#s9"} where var"#s9"<:Continuous, AbstractVector{var"#s9"} where var"#s9"<:Count, AbstractVector{var"#s9"} where var"#s9"<:OrderedFactor}.
└ @ MLJBase ~/Dropbox/Julia7/MLJ/MLJBase/src/machines.jl:103
Machine{DecisionTreeRegressor,…} @198 trained 0 times; caches data
args:
1: Source @628 ⏎ `Table{Union{AbstractVector{Continuous}, AbstractVector{Count}, AbstractVector{Textual}, AbstractVector{Union{Missing, Textual}}}}`
2: Source @544 ⏎ `AbstractVector{Continuous}`
The meaning of the warning is:

- The input X is a table with column scitypes Continuous, Count, Textual and Union{Missing, Textual}, which we can also see by inspecting the schema:

schema(X)
┌──────────┬─────────────────────────┬────────────────────────┐
│ names    │ scitypes                │ types                  │
├──────────┼─────────────────────────┼────────────────────────┤
│ height   │ Count                   │ Int64                  │
│ time     │ Continuous              │ Float64                │
│ mark     │ Textual                 │ String                 │
│ admitted │ Union{Missing, Textual} │ Union{Missing, String} │
└──────────┴─────────────────────────┴────────────────────────┘

- The model requires a table whose column element scitypes subtype Continuous, an incompatibility.
Common data preprocessing workflows
There are two tools for addressing data-model type mismatches like the above, with links to further documentation given below:

Scientific type coercion: We coerce machine types to obtain the intended scientific interpretation. If height in the above example is intended to be Continuous, mark is supposed to be OrderedFactor, and admitted a (binary) Multiclass, then we can do
X_coerced = coerce(X, :height=>Continuous, :mark=>OrderedFactor, :admitted=>Multiclass);
schema(X_coerced)
┌──────────┬───────────────────────────────┬────────────────────────────────────
│ names │ scitypes │ types ⋯
├──────────┼───────────────────────────────┼────────────────────────────────────
│ height │ Continuous │ Float64 ⋯
│ time │ Continuous │ Float64 ⋯
│ mark │ OrderedFactor{4} │ CategoricalValue{String, UInt32} ⋯
│ admitted │ Union{Missing, Multiclass{2}} │ Union{Missing, CategoricalValue{S ⋯
└──────────┴───────────────────────────────┴────────────────────────────────────
1 column omitted
Data transformations: We carry out conventional data transformations, such as missing value imputation and feature encoding:
imputer = FillImputer()
mach = machine(imputer, X_coerced) |> fit!
X_imputed = transform(mach, X_coerced);
schema(X_imputed)
┌──────────┬──────────────────┬──────────────────────────────────┐
│ names │ scitypes │ types │
├──────────┼──────────────────┼──────────────────────────────────┤
│ height │ Continuous │ Float64 │
│ time │ Continuous │ Float64 │
│ mark │ OrderedFactor{4} │ CategoricalValue{String, UInt32} │
│ admitted │ Multiclass{2} │ CategoricalValue{String, UInt32} │
└──────────┴──────────────────┴──────────────────────────────────┘
encoder = ContinuousEncoder()
mach = machine(encoder, X_imputed) |> fit!
X_encoded = transform(mach, X_imputed)
(height = [185.0, 153.0, 163.0, 114.0, 180.0],
time = [2.3, 4.5, 4.2, 1.8, 7.1],
mark = [4.0, 1.0, 3.0, 2.0, 1.0],
admitted__no = [0.0, 1.0, 0.0, 0.0],
admitted__yes = [1.0, 0.0, 1.0, 1.0],)
schema(X_encoded)
┌───────────────┬────────────┬─────────┐
│ names │ scitypes │ types │
├───────────────┼────────────┼─────────┤
│ height │ Continuous │ Float64 │
│ time │ Continuous │ Float64 │
│ mark │ Continuous │ Float64 │
│ admitted__no │ Continuous │ Float64 │
│ admitted__yes │ Continuous │ Float64 │
└───────────────┴────────────┴─────────┘
Such transformations can also be combined in a pipeline; see Linear Pipelines.
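For instance, the imputation and encoding steps above can be chained with |> (a minimal sketch; the data is a variant of the earlier example, with a fifth admitted entry added so that all columns have equal length):

```julia
using MLJ

# Coerce the raw machine types to the intended scientific types:
X_coerced = coerce((height   = [185, 153, 163, 114, 180],
                    time     = [2.3, 4.5, 4.2, 1.8, 7.1],
                    mark     = ["D", "A", "C", "B", "A"],
                    admitted = ["yes", "no", missing, "yes", "yes"]),
                   :height=>Continuous, :mark=>OrderedFactor, :admitted=>Multiclass)

# Compose the two transformers into a single unsupervised pipeline:
pipe = FillImputer() |> ContinuousEncoder()

mach = fit!(machine(pipe, X_coerced), verbosity=0)
X_ready = transform(mach, X_coerced)  # every column is now Continuous
```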
Scientific type coercion
Scientific type coercion is documented in detail at ScientificTypesBase.jl. See also the tutorial at this MLJ Workshop (specifically, here) and this Data Science in Julia tutorial.
Also relevant is the section, Working with Categorical Data.
Data transformation
MLJ's built-in transformers are documented at Transformers and Other Unsupervised Models. The most relevant in the present context are ContinuousEncoder, OneHotEncoder, FeatureSelector and FillImputer. A Gaussian mixture model imputer is provided by BetaML, which can be loaded with

MissingImputator = @load MissingImputator pkg=BetaML

This MLJ Workshop and the "End-to-end examples" in Data Science in Julia tutorials give further illustrations of data preprocessing in MLJ.