Preparing Data
Splitting data
MLJ has two tools for splitting data. To split data vertically (that is, to split by observations) use partition. This is commonly applied to a vector of observation indices, but can also be applied to datasets themselves, provided they are vectors, matrices or tables.

To split tabular data horizontally (i.e., to break up a table based on feature names) use unpack.
MLJBase.partition — Function

partition(X, fractions...;
          shuffle=nothing,
          rng=Random.GLOBAL_RNG,
          stratify=nothing,
          multi=false)
Splits the vector, matrix or table X into a tuple of objects of the same type, whose vertical concatenation is X. The number of rows in each component of the return value is determined by the corresponding fractions of length(nrows(X)), where valid fractions are floats between 0 and 1 whose sum is less than one. The last fraction is not provided, as it is inferred from the preceding ones.

For synchronized partitioning of multiple objects, use the multi=true option.
julia> partition(1:1000, 0.8)
([1,...,800], [801,...,1000])
julia> partition(1:1000, 0.2, 0.7)
([1,...,200], [201,...,900], [901,...,1000])
julia> partition(reshape(1:10, 5, 2), 0.2, 0.4)
([1 6], [2 7; 3 8], [4 9; 5 10])
julia> X, y = make_blobs() # a table and vector
julia> Xtrain, Xtest = partition(X, 0.8, stratify=y)
Here's an example of synchronized partitioning of multiple objects:
julia> (Xtrain, Xtest), (ytrain, ytest) = partition((X, y), 0.8, rng=123, multi=true)
Keywords

- shuffle=nothing: if set to true, shuffles the rows before taking fractions.

- rng=Random.GLOBAL_RNG: specifies the random number generator to be used; can be an integer seed. If rng is specified and shuffle === nothing, then shuffle is interpreted as true.

- stratify=nothing: if a vector is specified, the partition will match the stratification of the given vector. In that case, shuffle cannot be false.

- multi=false: if true then X is expected to be a tuple of objects sharing a common length, which are each partitioned separately using the same specified fractions and the same row shuffling. Returns a tuple of partitions (a tuple of tuples).
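For instance, the rng keyword on its own triggers shuffling, per the keyword descriptions above (a minimal sketch; the particular train/test indices depend on the RNG):

```julia
using MLJBase  # `partition` lives in MLJBase and is re-exported by MLJ

# Passing an integer seed for `rng` implies `shuffle=true`:
train, test = partition(1:10, 0.3; rng=123)

# However the rows are shuffled, the two parts together still
# contain every original observation exactly once:
sort(vcat(train, test)) == collect(1:10)  # true
```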
MLJBase.unpack — Function

unpack(table, f1, f2, ... fk;
       wrap_singles=false,
       shuffle=false,
       rng::Union{AbstractRNG,Int,Nothing}=nothing,
       coerce_options...)
Horizontally split any Tables.jl compatible table into smaller tables or vectors by making column selections determined by the predicates f1, f2, ..., fk. Selection from the column names is without replacement. A predicate is any object f such that f(name) is true or false for each column name::Symbol of table.

Returns a tuple of tables/vectors with length one greater than the number of supplied predicates, with the last component including all previously unselected columns.
julia> table = DataFrame(x=[1,2], y=['a', 'b'], z=[10.0, 20.0], w=["A", "B"])
2×4 DataFrame
Row │ x y z w
│ Int64 Char Float64 String
─────┼──────────────────────────────
1 │ 1 a 10.0 A
2 │ 2 b 20.0 B
julia> Z, XY, W = unpack(table, ==(:z), !=(:w));
julia> Z
2-element Vector{Float64}:
10.0
20.0
julia> XY
2×2 DataFrame
Row │ x y
│ Int64 Char
─────┼─────────────
1 │ 1 a
2 │ 2 b
julia> W # the column(s) left over
2-element Vector{String}:
"A"
"B"
Whenever a returned table contains a single column, it is converted to a vector unless wrap_singles=true.

If coerce_options are specified then table is first replaced with coerce(table, coerce_options). See ScientificTypes.coerce for details.

If shuffle=true then the rows of table are first shuffled, using the global RNG, unless rng is specified; if rng is an integer, it specifies the seed of an automatically generated Mersenne twister. If rng is specified then shuffle=true is implicit.
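A quick sketch of the wrap_singles keyword, using a plain NamedTuple as the table (the column names here are invented for illustration):

```julia
using MLJBase  # `unpack` lives in MLJBase and is re-exported by MLJ

tbl = (x = [1, 2, 3], y = [10.0, 20.0, 30.0], z = ["a", "b", "c"])

# A single selected column comes back as a vector by default ...
yvec, rest = unpack(tbl, ==(:y))

# ... but is kept as a one-column table when `wrap_singles=true`:
ytbl, _ = unpack(tbl, ==(:y); wrap_singles=true)
```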
Bridging the gap between data type and model requirements
As outlined in Getting Started, it is important that the scientific type of data matches the requirements of the model of interest. For example, while the majority of supervised learning models require input features to be Continuous, newcomers to MLJ are sometimes surprised at the disappointing results of model queries such as this one:
X = (height   = [185, 153, 163, 114, 180],
     time     = [2.3, 4.5, 4.2, 1.8, 7.1],
     mark     = ["D", "A", "C", "B", "A"],
     admitted = ["yes", "no", missing, "yes"]);
y = [12.4, 12.5, 12.0, 31.9, 43.0]
models(matching(X, y))
4-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :constructor, :deep_properties, :docstring, :fit_data_scitype, :human_name, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :reporting_operations, :reports_feature_importances, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :target_in_fit, :transform_scitype, :input_scitype, :target_scitype, :output_scitype)}}:
(name = ConstantRegressor, package_name = MLJModels, ... )
(name = DecisionTreeRegressor, package_name = BetaML, ... )
(name = DeterministicConstantRegressor, package_name = MLJModels, ... )
(name = RandomForestRegressor, package_name = BetaML, ... )
Or are unsure about the source of the following warning:
julia> Tree = @load DecisionTreeRegressor pkg=DecisionTree verbosity=0;
julia> tree = Tree();
julia> machine(tree, X, y)
┌ Warning: The scitype of `X`, in `machine(model, X, ...)` is incompatible with `model=DecisionTreeRegressor @378`:
│ scitype(X) = Table{Union{AbstractVector{Continuous}, AbstractVector{Count}, AbstractVector{Textual}, AbstractVector{Union{Missing, Textual}}}}
│ input_scitype(model) = Table{var"#s46"} where var"#s46"<:Union{AbstractVector{var"#s9"} where var"#s9"<:Continuous, AbstractVector{var"#s9"} where var"#s9"<:Count, AbstractVector{var"#s9"} where var"#s9"<:OrderedFactor}.
└ @ MLJBase ~/Dropbox/Julia7/MLJ/MLJBase/src/machines.jl:103
Machine{DecisionTreeRegressor,…} @198 trained 0 times; caches data
args:
1: Source @628 ⏎ `Table{Union{AbstractVector{Continuous}, AbstractVector{Count}, AbstractVector{Textual}, AbstractVector{Union{Missing, Textual}}}}`
2: Source @544 ⏎ `AbstractVector{Continuous}`
The meaning of the warning is:

- The input X is a table with column scitypes Continuous, Count, Textual and Union{Missing, Textual}, which we can also see by inspecting the schema:

schema(X)
┌──────────┬─────────────────────────┬────────────────────────┐
│ names    │ scitypes                │ types                  │
├──────────┼─────────────────────────┼────────────────────────┤
│ height   │ Count                   │ Int64                  │
│ time     │ Continuous              │ Float64                │
│ mark     │ Textual                 │ String                 │
│ admitted │ Union{Missing, Textual} │ Union{Missing, String} │
└──────────┴─────────────────────────┴────────────────────────┘

- The model requires a table whose column element scitypes subtype Continuous, an incompatibility.
Common data preprocessing workflows
There are two tools for addressing data-model type mismatches like the above, with links to further documentation given below:

Scientific type coercion: We coerce machine types to obtain the intended scientific interpretation. If height in the above example is intended to be Continuous, mark is supposed to be OrderedFactor, and admitted a (binary) Multiclass, then we can do
X_coerced = coerce(X, :height=>Continuous, :mark=>OrderedFactor, :admitted=>Multiclass);
schema(X_coerced)
┌──────────┬───────────────────────────────┬────────────────────────────────────
│ names │ scitypes │ types ⋯
├──────────┼───────────────────────────────┼────────────────────────────────────
│ height │ Continuous │ Float64 ⋯
│ time │ Continuous │ Float64 ⋯
│ mark │ OrderedFactor{4} │ CategoricalValue{String, UInt32} ⋯
│ admitted │ Union{Missing, Multiclass{2}} │ Union{Missing, CategoricalValue{S ⋯
└──────────┴───────────────────────────────┴────────────────────────────────────
1 column omitted
Data transformations: We carry out conventional data transformations, such as missing value imputation and feature encoding:
imputer = FillImputer()
mach = machine(imputer, X_coerced) |> fit!
X_imputed = transform(mach, X_coerced);
schema(X_imputed)
┌──────────┬──────────────────┬──────────────────────────────────┐
│ names │ scitypes │ types │
├──────────┼──────────────────┼──────────────────────────────────┤
│ height │ Continuous │ Float64 │
│ time │ Continuous │ Float64 │
│ mark │ OrderedFactor{4} │ CategoricalValue{String, UInt32} │
│ admitted │ Multiclass{2} │ CategoricalValue{String, UInt32} │
└──────────┴──────────────────┴──────────────────────────────────┘
encoder = ContinuousEncoder()
mach = machine(encoder, X_imputed) |> fit!
X_encoded = transform(mach, X_imputed)
(height = [185.0, 153.0, 163.0, 114.0, 180.0],
time = [2.3, 4.5, 4.2, 1.8, 7.1],
mark = [4.0, 1.0, 3.0, 2.0, 1.0],
admitted__no = [0.0, 1.0, 0.0, 0.0],
admitted__yes = [1.0, 0.0, 1.0, 1.0],)
schema(X_encoded)
┌───────────────┬────────────┬─────────┐
│ names │ scitypes │ types │
├───────────────┼────────────┼─────────┤
│ height │ Continuous │ Float64 │
│ time │ Continuous │ Float64 │
│ mark │ Continuous │ Float64 │
│ admitted__no │ Continuous │ Float64 │
│ admitted__yes │ Continuous │ Float64 │
└───────────────┴────────────┴─────────┘
Such transformations can also be combined in a pipeline; see Linear Pipelines.
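For instance, the imputation and encoding steps above can be chained with |> (a minimal sketch; the data is a variant of the earlier example, with a fifth admitted entry added so that all columns have equal length):

```julia
using MLJ

# Coerce the raw machine types to the intended scientific types:
X_coerced = coerce((height   = [185, 153, 163, 114, 180],
                    time     = [2.3, 4.5, 4.2, 1.8, 7.1],
                    mark     = ["D", "A", "C", "B", "A"],
                    admitted = ["yes", "no", missing, "yes", "yes"]),
                   :height=>Continuous, :mark=>OrderedFactor, :admitted=>Multiclass)

# Compose the two transformers into a single unsupervised pipeline:
pipe = FillImputer() |> ContinuousEncoder()

mach = fit!(machine(pipe, X_coerced), verbosity=0)
X_ready = transform(mach, X_coerced)  # every column is now Continuous
```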
Scientific type coercion
Scientific type coercion is documented in detail at ScientificTypesBase.jl. See also the tutorial at this MLJ Workshop (specifically, here) and this Data Science in Julia tutorial.
Also relevant is the section, Working with Categorical Data.
Data transformation
MLJ's built-in transformers are documented at Transformers and Other Unsupervised Models. The most relevant in the present context are ContinuousEncoder, OneHotEncoder, FeatureSelector and FillImputer. A Gaussian mixture model imputer is provided by BetaML, which can be loaded with

MissingImputator = @load MissingImputator pkg=BetaML

This MLJ Workshop and the "End-to-end examples" in Data Science in Julia tutorials give further illustrations of data preprocessing in MLJ.