Classical encoders include well-known and commonly used categorical encoders:

Transformer | Brief Description |
---|---|
OneHotEncoder | Encode categorical variables into one-hot vectors |
ContinuousEncoder | Adds type casting functionality to OneHotEncoder |
OrdinalEncoder | Encode categorical variables into ordered integers |
FrequencyEncoder | Encode categorical variables into their normalized or unnormalized frequencies |
TargetEncoder | Encode categorical variables into relevant target statistics |
MLJTransforms.OneHotEncoder — Type
OneHotEncoder
A model type for constructing a one-hot encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
OneHotEncoder = @load OneHotEncoder pkg=MLJTransforms
Do `model = OneHotEncoder()` to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in `OneHotEncoder(features=...)`.
Use this model to one-hot encode the `Multiclass` and `OrderedFactor` features (columns) of some table, leaving other columns unchanged.
New data to be transformed may lack features present in the fit data, but no new features can be present.
Warning: This transformer assumes that `levels(col)` for any `Multiclass` or `OrderedFactor` column, `col`, is the same for training data and new data to be transformed.
To ensure all features are transformed into `Continuous` features, or dropped, use `ContinuousEncoder` instead.
Training data
In MLJ or MLJBase, bind an instance `model` to data with
mach = machine(model, X)
where
- `X`: any Tables.jl compatible table. Columns can be of mixed type but only those with element scitype `Multiclass` or `OrderedFactor` can be encoded. Check column scitypes with `schema(X)`.
Train the machine using `fit!(mach, rows=...)`.
Hyper-parameters
- `features`: a vector of symbols (feature names). If empty (default) then all `Multiclass` and `OrderedFactor` features are encoded. Otherwise, encoding is further restricted to the specified features (`ignore=false`) or the unspecified features (`ignore=true`). This default behavior can be modified by the `ordered_factor` flag.
- `ordered_factor=false`: when `true`, `OrderedFactor` features are universally excluded
- `drop_last=false`: whether to drop the column corresponding to the final class of encoded features. For example, a three-class feature is spawned into three new features if `drop_last=false`, but just two features otherwise.
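To make the effect of `drop_last` concrete, here is a minimal sketch in plain Julia that does not use MLJ at all. `onehot_columns` is a hypothetical helper, not part of MLJTransforms, and returning `level => indicator vector` pairs is an assumption made only for illustration.

```julia
# Hypothetical helper, NOT part of MLJTransforms: one-hot encode one column.
# `levels` fixes the class order; with `drop_last=true` the final class
# gets no indicator column.
function onehot_columns(col, levels; drop_last = false)
    kept = drop_last ? levels[1:end-1] : levels
    # One Float64 indicator vector per kept level, in level order
    return [lvl => Float64.(col .== lvl) for lvl in kept]
end

grade = ["A", "B", "A", "C"]
full = onehot_columns(grade, ["A", "B", "C"])                        # three columns
reduced = onehot_columns(grade, ["A", "B", "C"]; drop_last = true)   # two columns
```

With `drop_last=true` the dropped class is still recoverable: a row belongs to it exactly when all remaining indicators are zero.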
Fitted parameters
The fields of `fitted_params(mach)` are:
- `all_features`: names of all features encountered in training
- `fitted_levels_given_feature`: dictionary of the levels associated with each feature encoded, keyed on the feature name
- `ref_name_pairs_given_feature`: dictionary of pairs `r => ftr` (such as `0x00000001 => :grad__A`) where `r` is a CategoricalArrays.jl reference integer representing a level, and `ftr` the corresponding new feature name; the dictionary is keyed on the names of features that are encoded
Report
The fields of `report(mach)` are:
- `features_to_be_encoded`: names of input features to be encoded
- `new_features`: names of all output features
Example
using MLJ
X = (name=categorical(["Danesh", "Lee", "Mary", "John"]),
grade=categorical(["A", "B", "A", "C"], ordered=true),
height=[1.85, 1.67, 1.5, 1.67],
n_devices=[3, 2, 4, 3])
julia> schema(X)
┌───────────┬──────────────────┐
│ names     │ scitypes         │
├───────────┼──────────────────┤
│ name      │ Multiclass{4}    │
│ grade     │ OrderedFactor{3} │
│ height    │ Continuous       │
│ n_devices │ Count            │
└───────────┴──────────────────┘
hot = OneHotEncoder(drop_last=true)
mach = fit!(machine(hot, X))
W = transform(mach, X)
julia> schema(W)
┌──────────────┬────────────┐
│ names        │ scitypes   │
├──────────────┼────────────┤
│ name__Danesh │ Continuous │
│ name__John   │ Continuous │
│ name__Lee    │ Continuous │
│ grade__A     │ Continuous │
│ grade__B     │ Continuous │
│ height       │ Continuous │
│ n_devices    │ Count      │
└──────────────┴────────────┘
See also `ContinuousEncoder`.
MLJTransforms.ContinuousEncoder — Type
ContinuousEncoder
A model type for constructing a continuous encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
ContinuousEncoder = @load ContinuousEncoder pkg=MLJTransforms
Do `model = ContinuousEncoder()` to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in `ContinuousEncoder(drop_last=...)`.
Use this model to arrange all features (columns) of a table to have `Continuous` element scitype, by applying the following protocol to each feature `ftr`:
- If `ftr` is already `Continuous`, retain it.
- If `ftr` is `Multiclass`, one-hot encode it.
- If `ftr` is `OrderedFactor`, replace it with `coerce(ftr, Continuous)` (vector of floating point integers), unless `one_hot_ordered_factors=true` is specified, in which case one-hot encode it.
- If `ftr` is `Count`, replace it with `coerce(ftr, Continuous)`.
- If `ftr` has some other element scitype, or was not observed in fitting the encoder, drop it from the table.
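The protocol above amounts to a small decision table. The sketch below writes it out in plain Julia, without MLJ. `continuous_action` is a hypothetical function, the scitype is passed as a `Symbol` purely for illustration, and the keyword follows the `one_hot_ordered_factors` hyper-parameter of this model.

```julia
# Hypothetical sketch, NOT the MLJTransforms implementation: decide what
# happens to a feature given the name of its element scitype.
function continuous_action(kind::Symbol; one_hot_ordered_factors = false)
    if kind == :Continuous
        return :retain
    elseif kind == :Multiclass
        return :one_hot
    elseif kind == :OrderedFactor
        return one_hot_ordered_factors ? :one_hot : :coerce
    elseif kind == :Count
        return :coerce
    else
        return :drop   # any other element scitype, for example Textual
    end
end

kinds = (:Continuous, :Multiclass, :OrderedFactor, :Count, :Textual)
actions = Dict(k => continuous_action(k) for k in kinds)
```

Features that were not seen during fitting are also dropped; that case depends on the fitted state rather than on the scitype, so it is left out of this sketch.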
Warning: This transformer assumes that `levels(col)` for any `Multiclass` or `OrderedFactor` column, `col`, is the same for training data and new data to be transformed.
To selectively one-hot-encode categorical features (without dropping features) use `OneHotEncoder` instead.
Training data
In MLJ or MLJBase, bind an instance `model` to data with
mach = machine(model, X)
where
- `X`: any Tables.jl compatible table. Columns can be of mixed type but only those with element scitype `Multiclass` or `OrderedFactor` can be encoded. Check column scitypes with `schema(X)`.
Train the machine using `fit!(mach, rows=...)`.
Hyper-parameters
- `drop_last=true`: whether to drop the column corresponding to the final class of one-hot encoded features. For example, a three-class feature is spawned into three new features if `drop_last=false`, but just two features otherwise.
- `one_hot_ordered_factors=false`: whether to one-hot any feature with `OrderedFactor` element scitype, or to instead coerce it directly to a (single) `Continuous` feature using the order
Fitted parameters
The fields of `fitted_params(mach)` are:
- `features_to_keep`: names of features that will not be dropped from the table
- `one_hot_encoder`: the `OneHotEncoder` model instance for handling the one-hot encoding
- `one_hot_encoder_fitresult`: the fitted parameters of the `OneHotEncoder` model
Report
- `features_to_keep`: names of input features that will not be dropped from the table
- `new_features`: names of all output features
Example
using MLJ
X = (name=categorical(["Danesh", "Lee", "Mary", "John"]),
grade=categorical(["A", "B", "A", "C"], ordered=true),
height=[1.85, 1.67, 1.5, 1.67],
n_devices=[3, 2, 4, 3],
comments=["the force", "be", "with you", "too"])
julia> schema(X)
┌───────────┬──────────────────┐
│ names     │ scitypes         │
├───────────┼──────────────────┤
│ name      │ Multiclass{4}    │
│ grade     │ OrderedFactor{3} │
│ height    │ Continuous       │
│ n_devices │ Count            │
│ comments  │ Textual          │
└───────────┴──────────────────┘
encoder = ContinuousEncoder(drop_last=true)
mach = fit!(machine(encoder, X))
W = transform(mach, X)
julia> schema(W)
┌──────────────┬────────────┐
│ names        │ scitypes   │
├──────────────┼────────────┤
│ name__Danesh │ Continuous │
│ name__John   │ Continuous │
│ name__Lee    │ Continuous │
│ grade        │ Continuous │
│ height       │ Continuous │
│ n_devices    │ Continuous │
└──────────────┴────────────┘
julia> setdiff(schema(X).names, report(mach).features_to_keep) # dropped features
1-element Vector{Symbol}:
:comments
See also OneHotEncoder
MLJTransforms.OrdinalEncoder — Type
OrdinalEncoder
A model type for constructing an ordinal encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
OrdinalEncoder = @load OrdinalEncoder pkg=MLJTransforms
Do `model = OrdinalEncoder()` to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in `OrdinalEncoder(features=...)`.
`OrdinalEncoder` implements ordinal encoding, which replaces the categorical values in the specified categorical features with integers (ordered arbitrarily). This creates an implicit ordering between categories, which may not be an appropriate modelling assumption.
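As a rough illustration of the idea, the following plain-Julia sketch maps each distinct value of a column to an integer. `ordinal_encode` is a hypothetical helper, not the MLJTransforms implementation, and sorting the distinct values to fix the integer order is an assumption made here so that the result is deterministic.

```julia
# Hypothetical helper, NOT the MLJTransforms implementation:
# replace each value in a column by an integer index.
function ordinal_encode(col)
    lvls = sort(unique(col))   # assumed level order: sorted distinct values
    index = Dict(lvl => i for (i, lvl) in enumerate(lvls))
    return [index[v] for v in col], index
end

codes, index = ordinal_encode(["g", "b", "g", "r", "r"])
# codes is [2, 1, 2, 3, 3] because "b" < "g" < "r"
```

The integers carry no meaning beyond identity: nothing in the data says that "r" is greater than "g", which is the implicit-ordering caveat mentioned above.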
Training data
In MLJ (or MLJBase) bind an unsupervised `model` instance to data with
mach = machine(model, X)
Here:
- `X` is any table of input features (e.g., a `DataFrame`). Features to be transformed must have element scitype `Multiclass` or `OrderedFactor`. Use `schema(X)` to check scitypes.
Train the machine using `fit!(mach, rows=...)`.
Hyper-parameters
- `features=[]`: A list of names of categorical features given as symbols to exclude or include from encoding, according to the value of `ignore`, or a single symbol (which is treated as a vector with one symbol), or a callable that returns true for features to be included/excluded.
- `ignore=true`: Whether to exclude or include the features given in `features`
- `ordered_factor=false`: Whether to encode `OrderedFactor` or ignore them
- `output_type`: The numerical concrete type of the encoded features. Default is `Float32`.
Operations
- `transform(mach, Xnew)`: Apply ordinal encoding to selected `Multiclass` or `OrderedFactor` features of `Xnew` specified by hyper-parameters, and return the new table. Features that are neither `Multiclass` nor `OrderedFactor` are always left unchanged.
Fitted parameters
The fields of `fitted_params(mach)` are:
- `index_given_feat_level`: A dictionary that maps each level for each column in a subset of the categorical features of `X` into an integer.
Report
The fields of `report(mach)` are:
- `encoded_features`: The subset of the categorical features of `X` that were encoded
Examples
using MLJ
# Define categorical features
A = ["g", "b", "g", "r", "r",]
B = [1.0, 2.0, 3.0, 4.0, 5.0,]
C = ["f", "f", "f", "m", "f",]
D = [true, false, true, false, true,]
E = [1, 2, 3, 4, 5,]
# Combine into a named tuple
X = (A = A, B = B, C = C, D = D, E = E)
# Coerce A, C, D to multiclass and B to continuous and E to ordinal
X = coerce(X,
:A => Multiclass,
:B => Continuous,
:C => Multiclass,
:D => Multiclass,
:E => OrderedFactor,
)
# Check scitype coercion:
schema(X)
encoder = OrdinalEncoder(ordered_factor = false)
mach = fit!(machine(encoder, X))
Xnew = transform(mach, X)
julia> Xnew
(A = [2, 1, 2, 3, 3],
B = [1.0, 2.0, 3.0, 4.0, 5.0],
C = [1, 1, 1, 2, 1],
D = [2, 1, 2, 1, 2],
E = CategoricalArrays.CategoricalValue{Int64, UInt32}[1, 2, 3, 4, 5],)
See also TargetEncoder
MLJTransforms.FrequencyEncoder — Type
FrequencyEncoder
A model type for constructing a frequency encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
FrequencyEncoder = @load FrequencyEncoder pkg=MLJTransforms
Do `model = FrequencyEncoder()` to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in `FrequencyEncoder(features=...)`.
`FrequencyEncoder` implements frequency encoding, which replaces the categorical values in the specified categorical features with their (normalized or raw) frequencies of occurrence in the dataset.
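The difference between raw and normalized frequencies can be seen in a small plain-Julia sketch. `frequency_encode` is a hypothetical helper, not the MLJTransforms implementation; the keyword name `normalize` simply mirrors the hyper-parameter of this model.

```julia
# Hypothetical helper, NOT the MLJTransforms implementation:
# replace each value by how often it occurs in the column.
function frequency_encode(col; normalize = false)
    counts = Dict{eltype(col), Int}()
    for v in col
        counts[v] = get(counts, v, 0) + 1
    end
    # Dividing by the column length turns counts into relative frequencies
    scale = normalize ? length(col) : 1
    return [counts[v] / scale for v in col]
end

A = ["g", "b", "g", "r", "r"]
raw = frequency_encode(A)                     # counts: g is 2, b is 1, r is 2
rel = frequency_encode(A; normalize = true)   # the same counts divided by 5
```

Note that distinct categories with equal counts (here "g" and "r") become indistinguishable after encoding.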
Training data
In MLJ (or MLJBase) bind an unsupervised `model` instance to data with
mach = machine(model, X)
Here:
- `X` is any table of input features (e.g., a `DataFrame`). Features to be transformed must have element scitype `Multiclass` or `OrderedFactor`. Use `schema(X)` to check scitypes.
Train the machine using `fit!(mach, rows=...)`.
Hyper-parameters
- `features=[]`: A list of names of categorical features given as symbols to exclude or include from encoding, according to the value of `ignore`, or a single symbol (which is treated as a vector with one symbol), or a callable that returns true for features to be included/excluded.
- `ignore=true`: Whether to exclude or include the features given in `features`
- `ordered_factor=false`: Whether to encode `OrderedFactor` or ignore them
- `normalize=false`: Whether to use normalized frequencies that sum to 1 over category values or to use raw counts.
- `output_type=Float32`: The type of the output values. The default is `Float32`, but you can set it to `Float64` or any other type that can hold the frequency values.
Operations
- `transform(mach, Xnew)`: Apply frequency encoding to selected `Multiclass` or `OrderedFactor` features of `Xnew` specified by hyper-parameters, and return the new table. Features that are neither `Multiclass` nor `OrderedFactor` are always left unchanged.
Fitted parameters
The fields of `fitted_params(mach)` are:
- `statistic_given_feat_val`: A dictionary that maps each level for each column in a subset of the categorical features of `X` into its frequency.
Report
The fields of `report(mach)` are:
- `encoded_features`: The subset of the categorical features of `X` that were encoded
Examples
using MLJ
# Define categorical features
A = ["g", "b", "g", "r", "r",]
B = [1.0, 2.0, 3.0, 4.0, 5.0,]
C = ["f", "f", "f", "m", "f",]
D = [true, false, true, false, true,]
E = [1, 2, 3, 4, 5,]
# Combine into a named tuple
X = (A = A, B = B, C = C, D = D, E = E)
# Coerce A, C, D to multiclass and B to continuous and E to ordinal
X = coerce(X,
:A => Multiclass,
:B => Continuous,
:C => Multiclass,
:D => Multiclass,
:E => OrderedFactor,
)
# Check scitype coercions:
schema(X)
encoder = FrequencyEncoder(ordered_factor = false, normalize = false)  # raw counts
mach = fit!(machine(encoder, X))
Xnew = transform(mach, X)
julia> Xnew
(A = [2, 1, 2, 2, 2],
B = [1.0, 2.0, 3.0, 4.0, 5.0],
C = [4, 4, 4, 1, 4],
D = [3, 2, 3, 2, 3],
E = CategoricalArrays.CategoricalValue{Int64, UInt32}[1, 2, 3, 4, 5],)
See also TargetEncoder
MLJTransforms.TargetEncoder — Type
TargetEncoder
A model type for constructing a target encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
TargetEncoder = @load TargetEncoder pkg=MLJTransforms
Do `model = TargetEncoder()` to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in `TargetEncoder(features=...)`.
`TargetEncoder` implements target encoding as defined in [1] to encode categorical variables into continuous ones using statistics from the target variable.
Training data
In MLJ (or MLJBase) bind an instance `model` to data with
mach = machine(model, X, y)
Here:
- `X` is any table of input features (e.g., a `DataFrame`). Features to be transformed must have element scitype `Multiclass` or `OrderedFactor`. Use `schema(X)` to check scitypes.
- `y` is the target, which can be any `AbstractVector` whose element scitype is `Continuous` or `Count` for regression problems and `Multiclass` or `OrderedFactor` for classification problems; check the scitype with `scitype(y)`
Train the machine using `fit!(mach, rows=...)`.
Hyper-parameters
- `features=[]`: A list of names of categorical features given as symbols to exclude or include from encoding, according to the value of `ignore`, or a single symbol (which is treated as a vector with one symbol), or a callable that returns true for features to be included/excluded.
- `ignore=true`: Whether to exclude or include the features given in `features`
- `ordered_factor=false`: Whether to encode `OrderedFactor` or ignore them
- `lambda` (λ): Shrinkage hyperparameter used to mix between posterior and prior statistics as described in [1]
- `m`: An integer hyperparameter to compute shrinkage as described in [1]. If `m=:auto` then `m` will be computed using empirical Bayes estimation as described in [1]
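To give a feel for what "mixing posterior and prior statistics" means, here is a minimal plain-Julia sketch for a regression target. It is not the MLJTransforms implementation: `target_encode` is a hypothetical helper, and it assumes that the shrinkage is a fixed weight `lambda` in [0, 1] blending the per-level target mean (posterior) with the global target mean (prior). The precise formulas, including the role of `m`, are those given in [1].

```julia
# Hypothetical helper, NOT the MLJTransforms implementation.
# Assumption: `lambda` in [0, 1] weights the per-level target mean
# against the global target mean.
function target_encode(x, y; lambda = 1.0)
    prior = sum(y) / length(y)   # global target mean
    sums = Dict{eltype(x), Float64}()
    counts = Dict{eltype(x), Int}()
    for (xi, yi) in zip(x, y)
        sums[xi] = get(sums, xi, 0.0) + yi
        counts[xi] = get(counts, xi, 0) + 1
    end
    # Blend each level's mean with the prior
    stat = Dict(k => lambda * sums[k] / counts[k] + (1 - lambda) * prior
                for k in keys(sums))
    return [stat[xi] for xi in x], stat
end

A = ["g", "b", "g", "r", "r"]
y = [1.0, 2.0, 3.0, 4.0, 5.0]
encoded, stat = target_encode(A, y)             # lambda = 1: pure per-level means
shrunk, _ = target_encode(A, y; lambda = 0.5)   # halfway towards the global mean 3.0
```

Under this assumption, smaller weights pull every level towards the global mean, which tempers estimates for levels that have few observations.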
Operations
- `transform(mach, Xnew)`: Apply target encoding to selected `Multiclass` or `OrderedFactor` features of `Xnew` specified by hyper-parameters, and return the new table. Features that are neither `Multiclass` nor `OrderedFactor` are always left unchanged.
Fitted parameters
The fields of `fitted_params(mach)` are:
- `task`: Whether the task is `Classification` or `Regression`
- `y_statistic_given_feat_level`: A dictionary with the necessary statistics to encode each categorical feature. It maps each level in each categorical feature to a statistic computed over the target.
Report
The fields of `report(mach)` are:
- `encoded_features`: The subset of the categorical features of `X` that were encoded
Examples
using MLJ
# Define categorical features
A = ["g", "b", "g", "r", "r",]
B = [1.0, 2.0, 3.0, 4.0, 5.0,]
C = ["f", "f", "f", "m", "f",]
D = [true, false, true, false, true,]
E = [1, 2, 3, 4, 5,]
# Define the target variable
y = ["c1", "c2", "c3", "c1", "c2",]
# Combine into a named tuple
X = (A = A, B = B, C = C, D = D, E = E)
# Coerce A, C, D to multiclass and B to continuous and E to ordinal
X = coerce(X,
:A => Multiclass,
:B => Continuous,
:C => Multiclass,
:D => Multiclass,
:E => OrderedFactor,
)
y = coerce(y, Multiclass)
encoder = TargetEncoder(ordered_factor = false, lambda = 1.0, m = 0,)
mach = fit!(machine(encoder, X, y))
Xnew = transform(mach, X)
julia> schema(Xnew)
┌───────┬──────────────────┬─────────────────────────────────┐
│ names │ scitypes         │ types                           │
├───────┼──────────────────┼─────────────────────────────────┤
│ A_1   │ Continuous       │ Float64                         │
│ A_2   │ Continuous       │ Float64                         │
│ A_3   │ Continuous       │ Float64                         │
│ B     │ Continuous       │ Float64                         │
│ C_1   │ Continuous       │ Float64                         │
│ C_2   │ Continuous       │ Float64                         │
│ C_3   │ Continuous       │ Float64                         │
│ D_1   │ Continuous       │ Float64                         │
│ D_2   │ Continuous       │ Float64                         │
│ D_3   │ Continuous       │ Float64                         │
│ E     │ OrderedFactor{5} │ CategoricalValue{Int64, UInt32} │
└───────┴──────────────────┴─────────────────────────────────┘
Reference
[1] Micci-Barreca, Daniele. “A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems” SIGKDD Explor. Newsl. 3, 1 (July 2001), 27–32.
See also OneHotEncoder