Classical encoders include well known and commonly used categorical encoders:

| Transformer | Brief Description |
|---|---|
| OneHotEncoder | Encode categorical variables into one-hot vectors |
| ContinuousEncoder | Adds type casting functionality to OneHotEncoder |
| OrdinalEncoder | Encode categorical variables into ordered integers |
| FrequencyEncoder | Encode categorical variables into their normalized or unnormalized frequencies |
| TargetEncoder | Encode categorical variables into relevant target statistics |

MLJTransforms.OneHotEncoder - Type
OneHotEncoder

A model type for constructing a one-hot encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.

From MLJ, the type can be imported using

OneHotEncoder = @load OneHotEncoder pkg=MLJTransforms

Do model = OneHotEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in OneHotEncoder(features=...).

Use this model to one-hot encode the Multiclass and OrderedFactor features (columns) of some table, leaving other columns unchanged.

New data to be transformed may lack features present in the fit data, but no new features can be present.

Warning: This transformer assumes that levels(col) for any Multiclass or OrderedFactor column, col, is the same for training data and new data to be transformed.

To ensure all features are transformed into Continuous features, or dropped, use ContinuousEncoder instead.

Training data

In MLJ or MLJBase, bind an instance model to data with

mach = machine(model, X)

where

  • X: any Tables.jl compatible table. Columns can be of mixed type but only those with element scitype Multiclass or OrderedFactor can be encoded. Check column scitypes with schema(X).

Train the machine using fit!(mach, rows=...).

Hyper-parameters

  • features: a vector of symbols (feature names). If empty (default) then all Multiclass and OrderedFactor features are encoded. Otherwise, encoding is further restricted to the specified features (ignore=false) or the unspecified features (ignore=true). This default behavior can be modified by the ordered_factor flag.

  • ordered_factor=false: when true, OrderedFactor features are universally excluded

  • drop_last=false: whether to drop the column corresponding to the final class of encoded features. For example, a three-class feature is spawned into three new features if drop_last=false, but just two features otherwise.
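
The effect of drop_last can be sketched in plain Julia, without MLJ. The helper onehot_sketch below is hypothetical and is not how MLJTransforms implements the encoder; it only mimics the name__level naming seen in the example output and assumes the levels are supplied explicitly.

```julia
# Hypothetical helper for illustration only; not the MLJTransforms implementation.
# One-hot encode `col` against a fixed, ordered list of `levels`.
function onehot_sketch(name::Symbol, col, levels; drop_last=false)
    # With drop_last=true the final level gets no indicator column
    kept = drop_last ? levels[1:end-1] : levels
    # One Float64 indicator column per kept level, named `name__level`
    return [Symbol(name, "__", l) => Float64.(col .== l) for l in kept]
end

grade = ["A", "B", "A", "C"]
grade_levels = ["A", "B", "C"]

full = onehot_sketch(:grade, grade, grade_levels)
reduced = onehot_sketch(:grade, grade, grade_levels; drop_last=true)

println(first.(full))     # names of the three indicator columns
println(first.(reduced))  # the column for the final level "C" is gone
```

Because the levels are passed in rather than recomputed from the data, the sketch also shows why the levels must be the same for training data and new data: the set of output columns is fixed by the levels, not by the values that happen to occur.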

Fitted parameters

The fields of fitted_params(mach) are:

  • all_features: names of all features encountered in training

  • fitted_levels_given_feature: dictionary of the levels associated with each feature encoded, keyed on the feature name

  • ref_name_pairs_given_feature: dictionary of pairs r => ftr (such as 0x00000001 => :grad__A) where r is a CategoricalArrays.jl reference integer representing a level, and ftr the corresponding new feature name; the dictionary is keyed on the names of features that are encoded

Report

The fields of report(mach) are:

  • features_to_be_encoded: names of input features to be encoded

  • new_features: names of all output features

Example

using MLJ

X = (name=categorical(["Danesh", "Lee", "Mary", "John"]),
     grade=categorical(["A", "B", "A", "C"], ordered=true),
     height=[1.85, 1.67, 1.5, 1.67],
     n_devices=[3, 2, 4, 3])

julia> schema(X)
┌───────────┬──────────────────┐
│ names     │ scitypes         │
├───────────┼──────────────────┤
│ name      │ Multiclass{4}    │
│ grade     │ OrderedFactor{3} │
│ height    │ Continuous       │
│ n_devices │ Count            │
└───────────┴──────────────────┘

hot = OneHotEncoder(drop_last=true)
mach = fit!(machine(hot, X))
W = transform(mach, X)

julia> schema(W)
┌──────────────┬────────────┐
│ names        │ scitypes   │
├──────────────┼────────────┤
│ name__Danesh │ Continuous │
│ name__John   │ Continuous │
│ name__Lee    │ Continuous │
│ grade__A     │ Continuous │
│ grade__B     │ Continuous │
│ height       │ Continuous │
│ n_devices    │ Count      │
└──────────────┴────────────┘

See also ContinuousEncoder.

MLJTransforms.ContinuousEncoder - Type
ContinuousEncoder

A model type for constructing a continuous encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.

From MLJ, the type can be imported using

ContinuousEncoder = @load ContinuousEncoder pkg=MLJTransforms

Do model = ContinuousEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in ContinuousEncoder(drop_last=...).

Use this model to arrange all features (columns) of a table to have Continuous element scitype, by applying the following protocol to each feature ftr:

  • If ftr is already Continuous retain it.

  • If ftr is Multiclass, one-hot encode it.

  • If ftr is OrderedFactor, replace it with coerce(ftr, Continuous) (vector of floating point integers), unless one_hot_ordered_factors=true is specified, in which case one-hot encode it.

  • If ftr is Count, replace it with coerce(ftr, Continuous).

  • If ftr has some other element scitype, or was not observed in fitting the encoder, drop it from the table.
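
The protocol above amounts to a small decision table. The following is a minimal sketch in plain Julia, with scitypes written as plain symbols and a hypothetical helper continuous_action that is not part of MLJTransforms. It assumes the one_hot_ordered_factors flag listed under the hyper-parameters is what switches OrderedFactor features to one-hot encoding.

```julia
# Hypothetical helper for illustration only; scitypes are written as plain symbols.
function continuous_action(scitype::Symbol; one_hot_ordered_factors=false)
    scitype == :Continuous && return :keep
    scitype == :Multiclass && return :one_hot
    scitype == :OrderedFactor && return one_hot_ordered_factors ? :one_hot : :coerce
    scitype == :Count && return :coerce
    return :drop  # any other element scitype
end

# Scitypes of a small made-up table, as (feature => scitype) pairs
schema_sketch = [:name => :Multiclass, :grade => :OrderedFactor,
                 :height => :Continuous, :n_devices => :Count,
                 :comments => :Textual]

for (ftr, st) in schema_sketch
    println(ftr, " => ", continuous_action(st))
end
```

The sketch does not model the remaining rule, that a feature not observed when fitting the encoder is dropped as well.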

Warning: This transformer assumes that levels(col) for any Multiclass or OrderedFactor column, col, is the same for training data and new data to be transformed.

To selectively one-hot-encode categorical features (without dropping features) use OneHotEncoder instead.

Training data

In MLJ or MLJBase, bind an instance model to data with

mach = machine(model, X)

where

  • X: any Tables.jl compatible table. Columns can be of mixed type but only those with element scitype Multiclass or OrderedFactor can be encoded. Check column scitypes with schema(X).

Train the machine using fit!(mach, rows=...).

Hyper-parameters

  • drop_last=true: whether to drop the column corresponding to the final class of one-hot encoded features. For example, a three-class feature is spawned into three new features if drop_last=false, but just two features otherwise.

  • one_hot_ordered_factors=false: whether to one-hot any feature with OrderedFactor element scitype, or to instead coerce it directly to a (single) Continuous feature using the order

Fitted parameters

The fields of fitted_params(mach) are:

  • features_to_keep: names of features that will not be dropped from the table

  • one_hot_encoder: the OneHotEncoder model instance for handling the one-hot encoding

  • one_hot_encoder_fitresult: the fitted parameters of the OneHotEncoder model

Report

The fields of report(mach) are:

  • features_to_keep: names of input features that will not be dropped from the table

  • new_features: names of all output features

Example

using MLJ

X = (name=categorical(["Danesh", "Lee", "Mary", "John"]),
     grade=categorical(["A", "B", "A", "C"], ordered=true),
     height=[1.85, 1.67, 1.5, 1.67],
     n_devices=[3, 2, 4, 3],
     comments=["the force", "be", "with you", "too"])

julia> schema(X)
┌───────────┬──────────────────┐
│ names     │ scitypes         │
├───────────┼──────────────────┤
│ name      │ Multiclass{4}    │
│ grade     │ OrderedFactor{3} │
│ height    │ Continuous       │
│ n_devices │ Count            │
│ comments  │ Textual          │
└───────────┴──────────────────┘

encoder = ContinuousEncoder(drop_last=true)
mach = fit!(machine(encoder, X))
W = transform(mach, X)

julia> schema(W)
┌──────────────┬────────────┐
│ names        │ scitypes   │
├──────────────┼────────────┤
│ name__Danesh │ Continuous │
│ name__John   │ Continuous │
│ name__Lee    │ Continuous │
│ grade        │ Continuous │
│ height       │ Continuous │
│ n_devices    │ Continuous │
└──────────────┴────────────┘

julia> setdiff(schema(X).names, report(mach).features_to_keep) # dropped features
1-element Vector{Symbol}:
 :comments

See also OneHotEncoder

MLJTransforms.OrdinalEncoder - Type
OrdinalEncoder

A model type for constructing an ordinal encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.

From MLJ, the type can be imported using

OrdinalEncoder = @load OrdinalEncoder pkg=MLJTransforms

Do model = OrdinalEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in OrdinalEncoder(features=...).

OrdinalEncoder implements ordinal encoding which replaces the categorical values in the specified categorical features with integers (ordered arbitrarily). This will create an implicit ordering between categories which may not be a proper modelling assumption.

Training data

In MLJ (or MLJBase), bind an instance model of this unsupervised transformer to data with

mach = machine(model, X)

Here:

  • X is any table of input features (e.g., a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.

Train the machine using fit!(mach, rows=...).

Hyper-parameters

  • features=[]: A list of names of categorical features given as symbols to exclude or include from encoding, according to the value of ignore, or a single symbol (which is treated as a vector with one symbol), or a callable that returns true for features to be included/excluded.

  • ignore=true: Whether to exclude (true) or include (false) the features given in features

  • ordered_factor=false: Whether to encode OrderedFactor features or ignore them

  • output_type: The numerical concrete type of the encoded features. Default is Float32.
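
How features and ignore interact can be sketched in plain Julia. The helpers matches and select_features below are hypothetical and are not the MLJTransforms implementation; they only mirror the three documented forms of features (vector of symbols, single symbol, callable) and assume the candidates are the categorical feature names.

```julia
# Hypothetical helpers for illustration only; not the MLJTransforms implementation.
matches(features::AbstractVector, name::Symbol) = name in features
matches(features::Symbol, name::Symbol) = name == features   # single symbol
matches(features, name::Symbol) = features(name)             # any callable

function select_features(candidates, features; ignore=true)
    # ignore=true: encode everything except the matches;
    # ignore=false: encode only the matches
    return [c for c in candidates if matches(features, c) != ignore]
end

candidates = [:A, :C, :D]   # made-up categorical feature names

println(select_features(candidates, []))                    # defaults: nothing excluded
println(select_features(candidates, [:A]))                  # exclude :A
println(select_features(candidates, [:A]; ignore=false))    # include only :A
println(select_features(candidates, name -> name != :D; ignore=false))
```

With the defaults (features=[] and ignore=true) nothing is excluded, so every candidate is encoded. The sketch makes no claim about how the library treats an empty features together with ignore=false.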

Operations

  • transform(mach, Xnew): Apply ordinal encoding to selected Multiclass or OrderedFactor features of Xnew specified by hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.

Fitted parameters

The fields of fitted_params(mach) are:

  • index_given_feat_level: A dictionary that maps each level for each column in a subset of the categorical features of X into an integer.
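
As a rough illustration of such a level-to-integer dictionary, the following plain-Julia sketch encodes a single column. The helper ordinal_sketch is hypothetical; it assumes levels are taken in sorted order and returns plain integers rather than values of the configured output_type.

```julia
# Hypothetical helper for illustration only; not the MLJTransforms implementation.
function ordinal_sketch(col)
    # Assumption: levels are enumerated in sorted order
    lvls = sort(unique(col))
    index = Dict(l => i for (i, l) in enumerate(lvls))
    # Replace each value by the integer assigned to its level
    return [index[v] for v in col], index
end

A = ["g", "b", "g", "r", "r"]
codes, index = ordinal_sketch(A)

println(index)   # level => integer
println(codes)   # encoded column
```

The integers carry an order ("b" < "g" < "r" here) that the original categories may not have, which is the implicit-ordering caveat mentioned above.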

Report

The fields of report(mach) are:

  • encoded_features: The subset of the categorical features of X that were encoded

Examples

using MLJ

# Define categorical features
A = ["g", "b", "g", "r", "r",]  
B = [1.0, 2.0, 3.0, 4.0, 5.0,]
C = ["f", "f", "f", "m", "f",]  
D = [true, false, true, false, true,]
E = [1, 2, 3, 4, 5,]

# Combine into a named tuple
X = (A = A, B = B, C = C, D = D, E = E)

# Coerce A, C, D to Multiclass, B to Continuous, and E to OrderedFactor
X = coerce(X,
    :A => Multiclass,
    :B => Continuous,
    :C => Multiclass,
    :D => Multiclass,
    :E => OrderedFactor,
)

# Check scitype coercion:
schema(X)

encoder = OrdinalEncoder(ordered_factor = false)
mach = fit!(machine(encoder, X))
Xnew = transform(mach, X)

julia> Xnew
    (A = [2, 1, 2, 3, 3],
    B = [1.0, 2.0, 3.0, 4.0, 5.0],
    C = [1, 1, 1, 2, 1],
    D = [2, 1, 2, 1, 2],
    E = CategoricalArrays.CategoricalValue{Int64, UInt32}[1, 2, 3, 4, 5],)

See also TargetEncoder

MLJTransforms.FrequencyEncoder - Type
FrequencyEncoder

A model type for constructing a frequency encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.

From MLJ, the type can be imported using

FrequencyEncoder = @load FrequencyEncoder pkg=MLJTransforms

Do model = FrequencyEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in FrequencyEncoder(features=...).

FrequencyEncoder implements frequency encoding which replaces the categorical values in the specified categorical features with their (normalized or raw) frequencies of occurrence in the dataset.

Training data

In MLJ (or MLJBase), bind an instance model of this unsupervised transformer to data with

mach = machine(model, X)

Here:

  • X is any table of input features (e.g., a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.

Train the machine using fit!(mach, rows=...).

Hyper-parameters

  • features=[]: A list of names of categorical features given as symbols to exclude or include from encoding, according to the value of ignore, or a single symbol (which is treated as a vector with one symbol), or a callable that returns true for features to be included/excluded.

  • ignore=true: Whether to exclude (true) or include (false) the features given in features

  • ordered_factor=false: Whether to encode OrderedFactor features or ignore them

  • normalize=false: Whether to use normalized frequencies that sum to 1 over category values or to use raw counts.

  • output_type=Float32: The type of the output values. The default is Float32, but you can set it to Float64 or any other type that can hold the frequency values.
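
The difference between raw and normalized frequencies can be sketched in plain Julia. The helper frequency_sketch is hypothetical and is not the MLJTransforms implementation; it works on a single column and returns Float64 values regardless of output_type.

```julia
# Hypothetical helper for illustration only; not the MLJTransforms implementation.
function frequency_sketch(col; normalize=false)
    # Count how often each value occurs
    counts = Dict{eltype(col), Int}()
    for v in col
        counts[v] = get(counts, v, 0) + 1
    end
    n = length(col)
    # Raw counts, or counts divided by the number of rows
    stat = Dict(k => (normalize ? c / n : float(c)) for (k, c) in counts)
    return [stat[v] for v in col], stat
end

A = ["g", "b", "g", "r", "r"]

raw, raw_stat = frequency_sketch(A)
normed, normed_stat = frequency_sketch(A; normalize=true)

println(raw)      # per-row counts
println(normed)   # per-row relative frequencies
```

Note that distinct levels with the same frequency ("g" and "r" here) become indistinguishable after encoding.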

Operations

  • transform(mach, Xnew): Apply frequency encoding to selected Multiclass or OrderedFactor features of Xnew specified by hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.

Fitted parameters

The fields of fitted_params(mach) are:

  • statistic_given_feat_val: A dictionary that maps each level for each column in a subset of the categorical features of X into its frequency.

Report

The fields of report(mach) are:

  • encoded_features: The subset of the categorical features of X that were encoded

Examples

using MLJ

# Define categorical features
A = ["g", "b", "g", "r", "r",]  
B = [1.0, 2.0, 3.0, 4.0, 5.0,]
C = ["f", "f", "f", "m", "f",]  
D = [true, false, true, false, true,]
E = [1, 2, 3, 4, 5,]

# Combine into a named tuple
X = (A = A, B = B, C = C, D = D, E = E)

# Coerce A, C, D to Multiclass, B to Continuous, and E to OrderedFactor
X = coerce(X,
    :A => Multiclass,
    :B => Continuous,
    :C => Multiclass,
    :D => Multiclass,
    :E => OrderedFactor,
)

# Check scitype coercions:
schema(X)

encoder = FrequencyEncoder(ordered_factor = false, normalize=false)
mach = fit!(machine(encoder, X))
Xnew = transform(mach, X)

julia> Xnew
    (A = [2, 1, 2, 2, 2],
    B = [1.0, 2.0, 3.0, 4.0, 5.0],
    C = [4, 4, 4, 1, 4],
    D = [3, 2, 3, 2, 3],
    E = CategoricalArrays.CategoricalValue{Int64, UInt32}[1, 2, 3, 4, 5],)

See also TargetEncoder

MLJTransforms.TargetEncoder - Type
TargetEncoder

A model type for constructing a target encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.

From MLJ, the type can be imported using

TargetEncoder = @load TargetEncoder pkg=MLJTransforms

Do model = TargetEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in TargetEncoder(features=...).

TargetEncoder implements target encoding as defined in [1] to encode categorical variables into continuous ones using statistics from the target variable.

Training data

In MLJ (or MLJBase) bind an instance model to data with

mach = machine(model, X, y)

Here:

  • X is any table of input features (e.g., a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.

  • y is the target, which can be any AbstractVector whose element scitype is Continuous or Count for regression problems and Multiclass or OrderedFactor for classification problems; check the scitype with scitype(y).

Train the machine using fit!(mach, rows=...).

Hyper-parameters

  • features=[]: A list of names of categorical features given as symbols to exclude or include from encoding, according to the value of ignore, or a single symbol (which is treated as a vector with one symbol), or a callable that returns true for features to be included/excluded.

  • ignore=true: Whether to exclude (true) or include (false) the features given in features

  • ordered_factor=false: Whether to encode OrderedFactor features or ignore them

  • lambda: Shrinkage hyperparameter (λ) used to mix between posterior and prior statistics as described in [1]

  • m: An integer hyperparameter to compute shrinkage as described in [1]. If m=:auto then m will be computed using empirical Bayes estimation as described in [1].
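
The idea of mixing posterior and prior statistics can be illustrated with a small plain-Julia sketch for a numeric target. The helpers avg and shrunk_statistic are hypothetical, and the rule that a positive m replaces lambda by the weight n/(n + m) is an assumption made for this illustration; the text above does not spell out how the library combines the two hyper-parameters.

```julia
# Hypothetical helpers for illustration only; not the MLJTransforms implementation.
avg(v) = sum(v) / length(v)

function shrunk_statistic(y_level, y_all; lambda=1.0, m=0)
    n = length(y_level)
    # Assumption: a positive m replaces lambda by the m-estimate weight n/(n + m)
    w = m > 0 ? n / (n + m) : lambda
    # Blend the per-level mean (posterior) with the overall mean (prior)
    return w * avg(y_level) + (1 - w) * avg(y_all)
end

A = ["g", "b", "g", "r", "r"]
y_all = [1.0, 0.0, 1.0, 1.0, 0.0]   # made-up numeric target
y_g = y_all[A .== "g"]              # targets observed for level "g"

println(shrunk_statistic(y_g, y_all))               # weight 1: pure per-level mean
println(shrunk_statistic(y_g, y_all; m=2))          # pulled toward the overall mean
println(shrunk_statistic(y_g, y_all; lambda=0.0))   # weight 0: overall mean only
```

Under this reading, a weight of 1 with m = 0 applies no shrinkage, so rare levels are encoded by their own, possibly noisy, statistic. Larger m values move rare levels toward the overall statistic.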

Operations

  • transform(mach, Xnew): Apply target encoding to selected Multiclass or OrderedFactor features of Xnew specified by hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.

Fitted parameters

The fields of fitted_params(mach) are:

  • task: Whether the task is Classification or Regression

  • y_statistic_given_feat_level: A dictionary with the necessary statistics to encode each categorical feature. It maps each level in each categorical feature to a statistic computed over the target.

Report

The fields of report(mach) are:

  • encoded_features: The subset of the categorical features of X that were encoded

Examples

using MLJ

# Define categorical features
A = ["g", "b", "g", "r", "r",]
B = [1.0, 2.0, 3.0, 4.0, 5.0,]
C = ["f", "f", "f", "m", "f",]
D = [true, false, true, false, true,]
E = [1, 2, 3, 4, 5,]

# Define the target variable
y = ["c1", "c2", "c3", "c1", "c2",]

# Combine into a named tuple
X = (A = A, B = B, C = C, D = D, E = E)

# Coerce A, C, D to Multiclass, B to Continuous, and E to OrderedFactor
X = coerce(X,
    :A => Multiclass,
    :B => Continuous,
    :C => Multiclass,
    :D => Multiclass,
    :E => OrderedFactor,
)
y = coerce(y, Multiclass)

encoder = TargetEncoder(ordered_factor = false, lambda = 1.0, m = 0,)
mach = fit!(machine(encoder, X, y))
Xnew = transform(mach, X)

julia> schema(Xnew)
┌───────┬──────────────────┬─────────────────────────────────┐
│ names │ scitypes         │ types                           │
├───────┼──────────────────┼─────────────────────────────────┤
│ A_1   │ Continuous       │ Float64                         │
│ A_2   │ Continuous       │ Float64                         │
│ A_3   │ Continuous       │ Float64                         │
│ B     │ Continuous       │ Float64                         │
│ C_1   │ Continuous       │ Float64                         │
│ C_2   │ Continuous       │ Float64                         │
│ C_3   │ Continuous       │ Float64                         │
│ D_1   │ Continuous       │ Float64                         │
│ D_2   │ Continuous       │ Float64                         │
│ D_3   │ Continuous       │ Float64                         │
│ E     │ OrderedFactor{5} │ CategoricalValue{Int64, UInt32} │
└───────┴──────────────────┴─────────────────────────────────┘

Reference

[1] Micci-Barreca, Daniele. “A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems” SIGKDD Explor. Newsl. 3, 1 (July 2001), 27–32.

See also OneHotEncoder
