Built-in Transformers

MLJModels.Standardizer - Type
Standardizer(; features=Symbol[], ignore=false, ordered_factor=false, count=false)

Unsupervised model for standardizing (whitening) the columns of tabular data. If features is empty, then all columns with Continuous element scitype are standardized. Otherwise, the features standardized are the Continuous features named in features (ignore=false) or the Continuous features not named in features (ignore=true). To allow standardization of Count or OrderedFactor features as well, set the appropriate flag (count or ordered_factor) to true.

Instead of supplying a features vector, a Bool-valued callable can also be specified. For example, specifying Standardizer(features = name -> name in [:x1, :x3], ignore = true, count=true) has the same effect as Standardizer(features = [:x1, :x3], ignore = true, count=true), namely to standardize all Continuous and Count features, with the exception of :x1 and :x3.
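The equivalence between a features vector and a Bool-valued callable can be illustrated in plain Julia. This is only a sketch: the helper `selected` and the column names are hypothetical, not part of MLJModels, and the additional filtering by scitype (Continuous, Count, OrderedFactor) is deliberately left out.

```julia
# Hypothetical helper (not an MLJModels function): which column names
# would a given `features`/`ignore` pair pick out?
function selected(colnames, features, ignore::Bool)
    # An empty vector means "no restriction": every column is a candidate
    if features isa AbstractVector && isempty(features)
        return collect(colnames)
    end
    # Turn a vector of names into a Bool-valued predicate
    pred = features isa AbstractVector ? (name -> name in features) : features
    # ignore=false keeps the matches; ignore=true keeps the non-matches
    return [n for n in colnames if pred(n) != ignore]
end

colnames = [:x1, :x2, :x3, :x4]
by_vector   = selected(colnames, [:x1, :x3], true)
by_callable = selected(colnames, name -> name in [:x1, :x3], true)
```

Both calls select the same columns, here everything except :x1 and :x3.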

Example

julia> using MLJModels, CategoricalArrays, MLJBase

julia> X = (ordinal1 = [1, 2, 3],
            ordinal2 = categorical([:x, :y, :x], ordered=true),
            ordinal3 = [10.0, 20.0, 30.0],
            ordinal4 = [-20.0, -30.0, -40.0],
            nominal = categorical(["Your father", "he", "is"]));

julia> stand1 = Standardizer();

julia> transform(fit!(machine(stand1, X)), X)
[ Info: Training Machine{Standardizer} @ 7…97.
(ordinal1 = [1, 2, 3],
 ordinal2 = CategoricalValue{Symbol,UInt32}[:x, :y, :x],
 ordinal3 = [-1.0, 0.0, 1.0],
 ordinal4 = [1.0, 0.0, -1.0],
 nominal = CategoricalString{UInt32}["Your father", "he", "is"],)

julia> stand2 = Standardizer(features=[:ordinal3, ], ignore=true, count=true);

julia> transform(fit!(machine(stand2, X)), X)
[ Info: Training Machine{Standardizer} @ 1…87.
(ordinal1 = [-1.0, 0.0, 1.0],
 ordinal2 = CategoricalValue{Symbol,UInt32}[:x, :y, :x],
 ordinal3 = [10.0, 20.0, 30.0],
 ordinal4 = [1.0, 0.0, -1.0],
 nominal = CategoricalString{UInt32}["Your father", "he", "is"],)
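
The standardized values in the example output are consistent with subtracting the column mean and dividing by the sample standard deviation. A minimal sketch using only the Statistics standard library follows; the function name `standardize` is illustrative, and the use of Julia's default (corrected) `std` is an assumption inferred from the numbers above, not a statement about the MLJModels implementation.

```julia
using Statistics

# Subtract the mean, then divide by the sample standard deviation
standardize(x) = (x .- mean(x)) ./ std(x)

z3 = standardize([10.0, 20.0, 30.0])     # compare with ordinal3 above
z4 = standardize([-20.0, -30.0, -40.0])  # compare with ordinal4 above
```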
MLJModels.OneHotEncoder - Type
OneHotEncoder(; features=Symbol[], drop_last=false, ordered_factor=true)

Unsupervised model for one-hot encoding all features of Finite scitype, within some table. If ordered_factor=false then only Multiclass features are considered. The features encoded are further restricted to those in features, when specified and non-empty.

If drop_last is true, the column for the last level of each categorical feature is dropped. New data to be transformed may lack features present in the fit data, but no new features can be present.

Warning: This transformer assumes that the elements of a categorical feature in new data to be transformed point to the same CategoricalPool object encountered during the fit.
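
To make the effect of drop_last concrete, here is a plain-Julia sketch of one-hot encoding a single feature against a fixed list of levels. The helper `onehot` and its output column names are hypothetical; the real encoder works on tables and categorical pools and may name its columns differently.

```julia
# Hypothetical helper (not the MLJModels implementation): one indicator
# column per level, optionally omitting the column for the last level.
function onehot(v, levels; drop_last=false)
    kept = drop_last ? levels[1:end-1] : levels
    columns = Tuple(Float64.(v .== l) for l in kept)
    return NamedTuple{Tuple(Symbol.(kept))}(columns)
end

v = ["a", "b", "c", "a"]
full    = onehot(v, ["a", "b", "c"])
dropped = onehot(v, ["a", "b", "c"]; drop_last=true)
```

With drop_last=true, an observation at the last level is represented by zeros in all remaining columns, so no information is lost.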

MLJModels.FeatureSelector - Type
FeatureSelector(features=Symbol[])

An unsupervised model for filtering features (columns) of a table. If features is empty (the default), only those features encountered during fitting appear in transformed tables. Alternatively, if a non-empty features is specified, then only the specified features are used. Throws an error if a recorded or specified feature is not present in the transformation input.
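
The selection and error behaviour described here can be sketched on a NamedTuple table in plain Julia. The helper `select_columns` is hypothetical and only mirrors the description; it is not the MLJModels code.

```julia
# Hypothetical helper: keep only the named columns of a NamedTuple table,
# throwing an error if any requested column is absent from the input.
function select_columns(table::NamedTuple, features)
    absent = setdiff(features, keys(table))
    isempty(absent) || throw(ArgumentError("features not found: $absent"))
    return NamedTuple{Tuple(features)}(table)
end

X = (a = [1, 2], b = [3, 4], c = [5, 6])
Xsmall = select_columns(X, [:a, :c])
```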

MLJModels.UnivariateBoxCoxTransformer - Type
UnivariateBoxCoxTransformer(; n=171, shift=false)

Unsupervised model specifying a univariate Box-Cox transformation of a single variable taking non-negative values, with a possible preliminary shift. Such a transformation is of the form

x -> ((x + c)^λ - 1)/λ for λ not 0
x -> log(x + c) for λ = 0

On fitting to data, n different values of the Box-Cox exponent λ (between -0.4 and 3) are searched to find the value maximizing normality. If shift=true and zero values are encountered in the data, then the transformation sought includes a preliminary positive shift c of 0.2 times the data mean. If there are no zero values, then no shift is applied.
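
The two-case formula and the shift rule can be written out directly. The sketch below uses illustrative names (`boxcox`, `shift_for`) and fixed values of λ; it does not reproduce the search over λ, since the normality criterion used by the model is not specified here.

```julia
using Statistics

# Box-Cox map for exponent λ and shift c, following the two-case formula above
boxcox(x, λ; c=0.0) = λ == 0 ? log(x + c) : ((x + c)^λ - 1) / λ

# Shift rule as described in the text: 0.2 times the data mean, applied
# only when shift=true and the data contains a zero
shift_for(data; shift=false) = (shift && any(iszero, data)) ? 0.2 * mean(data) : 0.0

data = [0.0, 1.0, 4.0, 5.0]
c = shift_for(data; shift=true)       # the data mean is 2.5
rooted = boxcox.(data, 0.5; c=c)      # λ = 0.5
logged = boxcox.(data, 0.0; c=c)      # λ = 0 gives log(x + c)
```

Without the shift, the zero entry would be sent to -Inf by the λ = 0 branch, which is why a positive c is introduced when zeros are present.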

MLJModels.UnivariateDiscretizer - Type
UnivariateDiscretizer(n_classes=512)

Returns an MLJ model for discretizing any continuous vector v (scitype(v) <: AbstractVector{Continuous}), where n_classes describes the resolution of the discretization.

Transformed output w is a vector of ordered factors (scitype(w) <: AbstractVector{<:OrderedFactor}). Specifically, w is a CategoricalVector, with element type CategoricalValue{R,R}, where R <: Unsigned is optimized.

The transformation is chosen so that the vector on which the transformer is fit has, in transformed form, an approximately uniform distribution of values.

Example

using MLJ
t = UnivariateDiscretizer(n_classes=10)
discretizer = machine(t, randn(1000))
fit!(discretizer)
v = rand(10)
w = transform(discretizer, v)
v_approx = inverse_transform(discretizer, w) # reconstruction of v from w
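
One way to obtain an approximately uniform distribution over classes is to place the cut points at evenly spaced quantiles of the training data. The sketch below illustrates that idea with the Statistics standard library; the quantile-based approach and all names in it are assumptions made for illustration, not a description of the MLJModels internals, and it returns plain integer codes rather than a CategoricalVector.

```julia
using Statistics

train = collect(1.0:100.0)
n_classes = 4

# Interior cut points at the 1/4, 2/4 and 3/4 quantiles of the training data
edges = quantile(train, (1:n_classes-1) ./ n_classes)

# Class index = 1 + number of cut points strictly below the value
discretize(x) = 1 + count(e -> e < x, edges)

codes = discretize.(train)
counts = [count(c -> c == k, codes) for k in 1:n_classes]
```

On the training data itself, each class receives the same number of points, which is the uniformity property described above.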
MLJModels.FillImputer - Type
FillImputer(features=[],
            continuous_fill=<median>,
            count_fill=<round_median>,
            finite_fill=<mode>)

Imputes missing data with a fixed value computed on the non-missing values. A different imputing function can be specified for Continuous, Count and Finite data.

Fields

  • continuous_fill: function to use on Continuous data, by default the median

  • count_fill: function to use on Count data, by default the rounded median

  • finite_fill: function to use on Multiclass and OrderedFactor data (including binary data), by default the mode
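
The default Continuous and Count behaviour can be imitated in a few lines of plain Julia. The helper `impute` is hypothetical, and the lambda standing in for the rounded median is an assumption about what <round_median> computes; the mode-based fill for Finite data is omitted because the Statistics standard library does not provide a mode function.

```julia
using Statistics

# Hypothetical helper: compute a fill value from the non-missing entries,
# then substitute it for every `missing`
function impute(v, fillfn)
    value = fillfn(collect(skipmissing(v)))
    return coalesce.(v, value)
end

continuous = impute([1.0, missing, 3.0, 10.0], median)
counts     = impute([1, missing, 2, 4], x -> round(Int, median(x)))
```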
