TargetEncoder
A model type for constructing a target encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
TargetEncoder = @load TargetEncoder pkg=MLJTransforms
Do model = TargetEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in TargetEncoder(features=...).
TargetEncoder implements target encoding as defined in [1] to encode categorical variables into continuous ones using statistics from the target variable.
Training data
In MLJ (or MLJBase) bind an instance model to data with
mach = machine(model, X, y)
Here:
- X is any table of input features (e.g., a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.
- y is the target, which can be any AbstractVector whose element scitype is Continuous or Count for regression problems and Multiclass or OrderedFactor for classification problems; check the scitype with scitype(y).
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- features=[]: A list of names of categorical features, given as symbols, to exclude from or include in encoding, according to the value of ignore; or a single symbol (treated as a vector with one symbol); or a callable that returns true for the features to be included/excluded.
- ignore=true: Whether to exclude (true) or include (false) the features given in features.
- ordered_factor=false: Whether to encode OrderedFactor features or ignore them.
- λ (passed as the keyword lambda in the example below): Shrinkage hyperparameter used to mix between posterior and prior statistics, as described in [1].
- m: An integer hyperparameter to compute shrinkage, as described in [1]. If m=:auto then m will be computed using empirical Bayes estimation as described in [1].
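The mixing of posterior and prior statistics can be illustrated with a minimal sketch in plain Julia. It is not the MLJTransforms implementation: the helper blend_encoding is hypothetical, it covers only a numeric (regression) target, and the rule used for m (a level observed n times gets weight n / (n + m), the usual m-estimate) is an assumption made for illustration.

```julia
using Statistics

# Hypothetical helper, for illustration only (not part of MLJTransforms).
# Encodes each level of `x` as a blend of the mean of `y` within that level
# (posterior statistic) and the overall mean of `y` (prior statistic).
function blend_encoding(x, y; lambda = 1.0, m = 0)
    prior = mean(y)
    encoding = Dict{eltype(x), Float64}()
    for level in unique(x)
        y_level = y[x .== level]
        n = length(y_level)
        # Assumption: a positive `m` replaces `lambda` by the weight n / (n + m).
        w = m > 0 ? n / (n + m) : lambda
        encoding[level] = w * mean(y_level) + (1 - w) * prior
    end
    return encoding
end

x = ["g", "b", "g", "r", "r"]
y = [1.0, 2.0, 3.0, 4.0, 5.0]

blend_encoding(x, y)                 # weight 1: each level maps to its own mean
blend_encoding(x, y; lambda = 0.0)   # weight 0: every level maps to mean(y)
blend_encoding(x, y; m = 2)          # rarer levels are pulled harder towards mean(y)
```

With a weight of 1 the encoding relies entirely on the per-level statistic, which is exact for the training data but noisy for levels seen only a few times; smaller weights trade that noise for bias towards the global statistic.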
Operations
- transform(mach, Xnew): Apply target encoding to the Multiclass or OrderedFactor features of Xnew selected by the hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.
Fitted parameters
The fields of fitted_params(mach) are:
- task: Whether the task is Classification or Regression
- y_statistic_given_feat_level: A dictionary with the necessary statistics to encode each categorical feature. It maps each level in each categorical feature to a statistic computed over the target.
Report
The fields of report(mach) are:
- encoded_features: The subset of the categorical features of X that were encoded
Examples
using MLJ
## Define categorical features
A = ["g", "b", "g", "r", "r",]
B = [1.0, 2.0, 3.0, 4.0, 5.0,]
C = ["f", "f", "f", "m", "f",]
D = [true, false, true, false, true,]
E = [1, 2, 3, 4, 5,]
## Define the target variable
y = ["c1", "c2", "c3", "c1", "c2",]
## Combine into a named tuple
X = (A = A, B = B, C = C, D = D, E = E)
## Coerce A, C, D to multiclass and B to continuous and E to ordinal
X = coerce(X,
    :A => Multiclass,
    :B => Continuous,
    :C => Multiclass,
    :D => Multiclass,
    :E => OrderedFactor,
)
y = coerce(y, Multiclass)
encoder = TargetEncoder(ordered_factor = false, lambda = 1.0, m = 0,)
mach = fit!(machine(encoder, X, y))
Xnew = transform(mach, X)
julia> schema(Xnew)
┌───────┬──────────────────┬─────────────────────────────────┐
│ names │ scitypes         │ types                           │
├───────┼──────────────────┼─────────────────────────────────┤
│ A_1   │ Continuous       │ Float64                         │
│ A_2   │ Continuous       │ Float64                         │
│ A_3   │ Continuous       │ Float64                         │
│ B     │ Continuous       │ Float64                         │
│ C_1   │ Continuous       │ Float64                         │
│ C_2   │ Continuous       │ Float64                         │
│ C_3   │ Continuous       │ Float64                         │
│ D_1   │ Continuous       │ Float64                         │
│ D_2   │ Continuous       │ Float64                         │
│ D_3   │ Continuous       │ Float64                         │
│ E     │ OrderedFactor{5} │ CategoricalValue{Int64, UInt32} │
└───────┴──────────────────┴─────────────────────────────────┘
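The three columns produced for each encoded feature are consistent with one column per class of y (three classes here); this reading is inferred from the schema above rather than stated in the text. Under that assumption, and with lambda = 1.0 and m = 0 (no shrinkage towards the prior), the statistics for feature A would be plain per-level class frequencies, which can be computed by hand. The helper class_frequencies is hypothetical and the sketch does not call MLJTransforms.

```julia
# Per-level class frequencies of y for feature A, computed by hand.
A = ["g", "b", "g", "r", "r"]
y = ["c1", "c2", "c3", "c1", "c2"]

classes = sort(unique(y))   # ["c1", "c2", "c3"]

# Hypothetical helper: among the rows where A == level, the fraction
# that falls in each class of y.
class_frequencies(level) =
    [count((A .== level) .& (y .== c)) / count(A .== level) for c in classes]

freqs = Dict(level => class_frequencies(level) for level in unique(A))
# freqs["g"] == [0.5, 0.0, 0.5]
# freqs["b"] == [0.0, 1.0, 0.0]
# freqs["r"] == [0.5, 0.5, 0.0]
```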
Reference
[1] Micci-Barreca, Daniele. “A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems.” SIGKDD Explor. Newsl. 3, 1 (July 2001), 27–32.
See also OneHotEncoder