Classical Encoders · MLJTransforms

The classical encoders are well-known, commonly used categorical encoders:

Transformer	Brief Description
OneHotEncoder	Encode categorical variables into one-hot vectors
ContinuousEncoder	Adds type casting functionality to OneHotEncoder
OrdinalEncoder	Encode categorical variables into ordered integers
FrequencyEncoder	Encode categorical variables into their normalized or unnormalized frequencies
TargetEncoder	Encode categorical variables into relevant target statistics

MLJTransforms.OneHotEncoder — Type

OneHotEncoder

A model type for constructing a one-hot encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.

From MLJ, the type can be imported using

OneHotEncoder = @load OneHotEncoder pkg=MLJTransforms

Do model = OneHotEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in OneHotEncoder(features=...).

Use this model to one-hot encode the Multiclass and OrderedFactor features (columns) of some table, leaving other columns unchanged.

New data to be transformed may lack features present in the fit data, but no new features can be present.

Warning: This transformer assumes that levels(col) for any Multiclass or OrderedFactor column, col, is the same for training data and new data to be transformed.

To ensure all features are transformed into Continuous features, or dropped, use ContinuousEncoder instead.

Training data

In MLJ or MLJBase, bind an instance model to data with

mach = machine(model, X)

where

X: any Tables.jl compatible table. Columns can be of mixed type but only those with element scitype Multiclass or OrderedFactor can be encoded. Check column scitypes with schema(X).

Train the machine using fit!(mach, rows=...).

Hyper-parameters

features: a vector of symbols (feature names). If empty (default) then all Multiclass and OrderedFactor features are encoded. Otherwise, encoding is further restricted to the specified features (ignore=false) or the unspecified features (ignore=true). This default behavior can be modified by the ordered_factor flag.
ordered_factor=false: when true, OrderedFactor features are universally excluded
drop_last=false: whether to drop the column corresponding to the final class of encoded features. For example, a three-class feature is spawned into three new features if drop_last=false, but just two features otherwise.
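
The interaction of features and ignore can be sketched as follows. This is a minimal illustration, assuming (as in the Example of this section) that using MLJ brings OneHotEncoder into scope; the table X is made up for the purpose.

```julia
using MLJ

X = (name=categorical(["Danesh", "Lee", "Mary", "John"]),
     grade=categorical(["A", "B", "A", "C"], ordered=true),
     height=[1.85, 1.67, 1.5, 1.67])

# Encode only the listed feature (ignore=false is the documented behavior
# when `features` is non-empty and `ignore` is not set).
only_name = OneHotEncoder(features=[:name])
W1 = transform(fit!(machine(only_name, X), verbosity=0), X)

# Encode every categorical feature except the listed one.
all_but_name = OneHotEncoder(features=[:name], ignore=true)
W2 = transform(fit!(machine(all_but_name, X), verbosity=0), X)

# Compare which columns survived unchanged in each case.
schema(W1).names
schema(W2).names
```

In the first table grade should remain a single column, while in the second it is name that is left untouched.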

Fitted parameters

The fields of fitted_params(mach) are:

all_features: names of all features encountered in training
fitted_levels_given_feature: dictionary of the levels associated with each feature encoded, keyed on the feature name
ref_name_pairs_given_feature: dictionary of pairs r => ftr (such as 0x00000001 => :grad__A) where r is a CategoricalArrays.jl reference integer representing a level, and ftr the corresponding new feature name; the dictionary is keyed on the names of features that are encoded

Report

The fields of report(mach) are:

features_to_be_encoded: names of input features to be encoded
new_features: names of all output features
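
A short sketch of how these fields might be inspected after training. It uses only the field names listed above and assumes using MLJ exports OneHotEncoder, as in the Example of this section; the data is illustrative.

```julia
using MLJ

X = (name=categorical(["Danesh", "Lee", "Mary", "John"]),
     grade=categorical(["A", "B", "A", "C"], ordered=true),
     height=[1.85, 1.67, 1.5, 1.67])

mach = fit!(machine(OneHotEncoder(drop_last=true), X), verbosity=0)

# Input features that were selected for encoding
report(mach).features_to_be_encoded

# Names of all columns in the transformed table
report(mach).new_features

# Levels recorded for each encoded feature, keyed on feature name
fitted_params(mach).fitted_levels_given_feature
```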

Example

using MLJ

X = (name=categorical(["Danesh", "Lee", "Mary", "John"]),
     grade=categorical(["A", "B", "A", "C"], ordered=true),
     height=[1.85, 1.67, 1.5, 1.67],
     n_devices=[3, 2, 4, 3])

julia> schema(X)
┌───────────┬──────────────────┐
│ names     │ scitypes         │
├───────────┼──────────────────┤
│ name      │ Multiclass{4}    │
│ grade     │ OrderedFactor{3} │
│ height    │ Continuous       │
│ n_devices │ Count            │
└───────────┴──────────────────┘

hot = OneHotEncoder(drop_last=true)
mach = fit!(machine(hot, X))
W = transform(mach, X)

julia> schema(W)
┌──────────────┬────────────┐
│ names        │ scitypes   │
├──────────────┼────────────┤
│ name__Danesh │ Continuous │
│ name__John   │ Continuous │
│ name__Lee    │ Continuous │
│ grade__A     │ Continuous │
│ grade__B     │ Continuous │
│ height       │ Continuous │
│ n_devices    │ Count      │
└──────────────┴────────────┘

See also ContinuousEncoder.

MLJTransforms.ContinuousEncoder — Type

ContinuousEncoder

A model type for constructing a continuous encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.

From MLJ, the type can be imported using

ContinuousEncoder = @load ContinuousEncoder pkg=MLJTransforms

Do model = ContinuousEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in ContinuousEncoder(drop_last=...).

Use this model to arrange all features (columns) of a table to have Continuous element scitype, by applying the following protocol to each feature ftr:

If ftr is already Continuous retain it.
If ftr is Multiclass, one-hot encode it.
If ftr is OrderedFactor, replace it with coerce(ftr, Continuous) (vector of floating point integers), unless one_hot_ordered_factors=true is specified, in which case one-hot encode it.
If ftr is Count, replace it with coerce(ftr, Continuous).
If ftr has some other element scitype, or was not observed in fitting the encoder, drop it from the table.
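
The protocol above can be sketched on a small table. This is an illustration only, assuming using MLJ exports ContinuousEncoder and using the one_hot_ordered_factors hyper-parameter documented below; the table is made up.

```julia
using MLJ

X = (grade=categorical(["A", "B", "A", "C"], ordered=true),
     n_devices=[3, 2, 4, 3],
     comments=["the force", "be", "with you", "too"])

# Default: the OrderedFactor column is coerced to a single Continuous column.
default_mach = fit!(machine(ContinuousEncoder(), X), verbosity=0)
W_default = transform(default_mach, X)

# Alternative: one-hot encode the OrderedFactor column instead.
onehot_mach = fit!(machine(ContinuousEncoder(one_hot_ordered_factors=true), X),
                   verbosity=0)
W_onehot = transform(onehot_mach, X)

# In both cases the Textual column `comments` is dropped and `n_devices`
# (Count) becomes Continuous.
schema(W_default)
schema(W_onehot)
```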

Warning: This transformer assumes that levels(col) for any Multiclass or OrderedFactor column, col, is the same for training data and new data to be transformed.

To selectively one-hot-encode categorical features (without dropping features) use OneHotEncoder instead.

Training data

In MLJ or MLJBase, bind an instance model to data with

mach = machine(model, X)

where

X: any Tables.jl compatible table. Columns can be of mixed type but only those with element scitype Multiclass or OrderedFactor can be encoded. Check column scitypes with schema(X).

Train the machine using fit!(mach, rows=...).

Hyper-parameters

drop_last=true: whether to drop the column corresponding to the final class of one-hot encoded features. For example, a three-class feature is spawned into three new features if drop_last=false, but just two features otherwise.
one_hot_ordered_factors=false: whether to one-hot any feature with OrderedFactor element scitype, or to instead coerce it directly to a (single) Continuous feature using the order

Fitted parameters

The fields of fitted_params(mach) are:

features_to_keep: names of features that will not be dropped from the table
one_hot_encoder: the OneHotEncoder model instance for handling the one-hot encoding
one_hot_encoder_fitresult: the fitted parameters of the OneHotEncoder model

Report

The fields of report(mach) are:

features_to_keep: names of input features that will not be dropped from the table
new_features: names of all output features

Example

using MLJ

X = (name=categorical(["Danesh", "Lee", "Mary", "John"]),
     grade=categorical(["A", "B", "A", "C"], ordered=true),
     height=[1.85, 1.67, 1.5, 1.67],
     n_devices=[3, 2, 4, 3],
     comments=["the force", "be", "with you", "too"])

julia> schema(X)
┌───────────┬──────────────────┐
│ names     │ scitypes         │
├───────────┼──────────────────┤
│ name      │ Multiclass{4}    │
│ grade     │ OrderedFactor{3} │
│ height    │ Continuous       │
│ n_devices │ Count            │
│ comments  │ Textual          │
└───────────┴──────────────────┘

encoder = ContinuousEncoder(drop_last=true)
mach = fit!(machine(encoder, X))
W = transform(mach, X)

julia> schema(W)
┌──────────────┬────────────┐
│ names        │ scitypes   │
├──────────────┼────────────┤
│ name__Danesh │ Continuous │
│ name__John   │ Continuous │
│ name__Lee    │ Continuous │
│ grade        │ Continuous │
│ height       │ Continuous │
│ n_devices    │ Continuous │
└──────────────┴────────────┘

julia> setdiff(schema(X).names, report(mach).features_to_keep) # dropped features
1-element Vector{Symbol}:
 :comments

See also TargetEncoder.

MLJTransforms.TargetEncoder — Type

TargetEncoder

A model type for constructing a target encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.

From MLJ, the type can be imported using

TargetEncoder = @load TargetEncoder pkg=MLJTransforms

Do model = TargetEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in TargetEncoder(features=...).

TargetEncoder implements target encoding as defined in [1] to encode categorical variables into continuous ones using statistics from the target variable.

Training data

In MLJ (or MLJBase) bind an instance model to data with

mach = machine(model, X, y)

Here:

X is any table of input features (e.g., a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.

y is the target, which can be any AbstractVector whose element scitype is Continuous or Count for regression problems and Multiclass or OrderedFactor for classification problems; check the scitype with scitype(y).

Train the machine using fit!(mach, rows=...).

Hyper-parameters

features=[]: A list of names of categorical features, given as symbols, to exclude from or include in encoding, according to the value of ignore. It may also be a single symbol (treated as a vector with one symbol) or a callable that returns true for the features to be included/excluded.

ignore=true: Whether to exclude or include the features given in features

ordered_factor=false: Whether to encode OrderedFactor features or ignore them

lambda: Shrinkage hyperparameter (λ) used to mix between posterior and prior statistics as described in [1]
m: An integer hyperparameter to compute shrinkage as described in [1]. If m=:auto then m will be computed using empirical Bayes estimation as described in [1]
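
The idea of mixing posterior and prior statistics can be sketched in plain Julia. The helpers below are hypothetical and are not MLJTransforms functions. The weight n / (n + m) is one common m-estimate style choice and is an assumption here, since this section does not state the exact formula TargetEncoder uses to combine lambda and m.

```julia
using Statistics

# Hypothetical helpers for illustration only; not part of MLJTransforms.

# Blend a level's own target mean (posterior) with the global mean (prior).
shrink(posterior, prior, λ) = λ * posterior + (1 - λ) * prior

# Assumed m-estimate style weight: the more rows a level has, the closer λ is to 1.
m_weight(n, m) = n / (n + m)

# Map each level of `feature` to its shrunken target statistic.
function encode_levels(feature, y; m)
    prior = mean(y)
    return Dict(
        lvl => shrink(mean(y[feature .== lvl]), prior,
                      m_weight(count(==(lvl), feature), m))
        for lvl in unique(feature)
    )
end

feature = ["g", "b", "g", "r", "r"]
y = [1.0, 0.0, 1.0, 0.0, 1.0]   # binary target coded as 0/1, global mean 0.6

encoding = encode_levels(feature, y; m = 2)
# "g" => 0.8, "b" => 0.4, "r" => 0.55 (up to floating point rounding)
```

With λ = 1 the statistic is the raw per-level mean; as λ approaches 0 every level is pulled toward the global mean, which is what protects rare levels from overfitting.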

Operations

transform(mach, Xnew): Apply target encoding to selected Multiclass or OrderedFactor features of Xnew specified by hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.
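
A minimal sketch of transform with a Continuous target (a regression task), assuming using MLJ exports TargetEncoder and reusing the keyword values from the Examples section; the data is illustrative.

```julia
using MLJ

X = (A = categorical(["g", "b", "g", "r", "r"]),
     B = [1.0, 2.0, 3.0, 4.0, 5.0])
y = [10.0, 20.0, 30.0, 40.0, 50.0]   # Continuous target

encoder = TargetEncoder(ordered_factor = false, lambda = 1.0, m = 0)
mach = fit!(machine(encoder, X, y), verbosity = 0)

# Encode the Multiclass feature A; the Continuous feature B is left unchanged.
Xnew = transform(mach, X)

schema(Xnew)                # inspect names and scitypes of the result
fitted_params(mach).task    # the task inferred from the target
```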

Fitted parameters

The fields of fitted_params(mach) are:

task: Whether the task is Classification or Regression
y_statistic_given_feat_level: A dictionary with the necessary statistics to encode each categorical feature. It maps each level in each categorical feature to a statistic computed over the target.

Report

The fields of report(mach) are:

encoded_features: The subset of the categorical features of X that were encoded

Examples

using MLJ

# Define categorical features
A = ["g", "b", "g", "r", "r",]
B = [1.0, 2.0, 3.0, 4.0, 5.0,]
C = ["f", "f", "f", "m", "f",]
D = [true, false, true, false, true,]
E = [1, 2, 3, 4, 5,]

# Define the target variable
y = ["c1", "c2", "c3", "c1", "c2",]

# Combine into a named tuple
X = (A = A, B = B, C = C, D = D, E = E)

# Coerce A, C, D to multiclass and B to continuous and E to ordinal
X = coerce(X,
    :A => Multiclass,
    :B => Continuous,
    :C => Multiclass,
    :D => Multiclass,
    :E => OrderedFactor,
)
y = coerce(y, Multiclass)

encoder = TargetEncoder(ordered_factor = false, lambda = 1.0, m = 0,)
mach = fit!(machine(encoder, X, y))
Xnew = transform(mach, X)

julia> schema(Xnew)
┌───────┬──────────────────┬─────────────────────────────────┐
│ names │ scitypes         │ types                           │
├───────┼──────────────────┼─────────────────────────────────┤
│ A_1   │ Continuous       │ Float64                         │
│ A_2   │ Continuous       │ Float64                         │
│ A_3   │ Continuous       │ Float64                         │
│ B     │ Continuous       │ Float64                         │
│ C_1   │ Continuous       │ Float64                         │
│ C_2   │ Continuous       │ Float64                         │
│ C_3   │ Continuous       │ Float64                         │
│ D_1   │ Continuous       │ Float64                         │
│ D_2   │ Continuous       │ Float64                         │
│ D_3   │ Continuous       │ Float64                         │
│ E     │ OrderedFactor{5} │ CategoricalValue{Int64, UInt32} │
└───────┴──────────────────┴─────────────────────────────────┘

Reference

[1] Micci-Barreca, Daniele. “A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems” SIGKDD Explor. Newsl. 3, 1 (July 2001), 27–32.