TargetEncoder
A model type for constructing a target encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
TargetEncoder = @load TargetEncoder pkg=MLJTransforms
Do model = TargetEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in TargetEncoder(features=...).
TargetEncoder implements target encoding as defined in [1] to encode categorical variables into continuous ones using statistics from the target variable.
Training data
In MLJ (or MLJBase) bind an instance model to data with
mach = machine(model, X, y)
Here:
- X is any table of input features (e.g., a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.
- y is the target, which can be any AbstractVector whose element scitype is Continuous or Count for regression problems and Multiclass or OrderedFactor for classification problems; check the scitype with scitype(y).
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- features=[]: A list of names of categorical features, given as symbols, to exclude from or include in encoding, according to the value of ignore; or a single symbol (which is treated as a vector with one symbol); or a callable that returns true for features to be included/excluded.
- ignore=true: Whether to exclude or include the features given in features (ignore=true excludes them; ignore=false encodes only them).
- ordered_factor=false: Whether to encode OrderedFactor features or ignore them.
- λ: Shrinkage hyper-parameter used to mix between posterior and prior statistics, as described in [1]. The example below passes it as the keyword lambda.
- m: An integer hyper-parameter used to compute shrinkage, as described in [1]. If m=:auto, then m is computed using empirical Bayes estimation, as described in [1].
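To make the role of λ concrete, here is a minimal sketch of the blending step for a regression target, written in plain Julia. It is not the MLJTransforms implementation: the helper name blend_target_means and its lambda keyword are hypothetical, and the sketch assumes the encoded value is λ times the per-level target mean (the posterior) plus (1 - λ) times the overall target mean (the prior). How the package derives the weight from m is not covered here.

```julia
using Statistics

# Hypothetical helper (not part of MLJTransforms): encode each level of a
# categorical vector by blending the level's target mean with the overall mean.
# Assumed rule: statistic = lambda * posterior + (1 - lambda) * prior.
function blend_target_means(levels::AbstractVector, y::AbstractVector{<:Real}; lambda::Real = 1.0)
    prior = mean(y)                            # overall target mean
    stats = Dict{eltype(levels), Float64}()
    for level in unique(levels)
        posterior = mean(y[levels .== level])  # target mean within this level
        stats[level] = lambda * posterior + (1 - lambda) * prior
    end
    return stats
end

A = ["g", "b", "g", "r", "r"]
y = [1.0, 2.0, 3.0, 4.0, 5.0]

no_shrinkage = blend_target_means(A, y; lambda = 1.0)    # pure per-level means
half_shrinkage = blend_target_means(A, y; lambda = 0.5)  # pulled halfway to the overall mean

encoded_A = [half_shrinkage[a] for a in A]  # replace each level by its statistic
```

With lambda = 1.0 each level maps to its own target mean (here g and b map to 2.0 and r to 4.5); smaller values pull every level toward the overall mean of 3.0.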
Operations
- transform(mach, Xnew): Apply target encoding to the selected Multiclass or OrderedFactor features of Xnew, as specified by the hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.
Fitted parameters
The fields of fitted_params(mach) are:
- task: Whether the task is Classification or Regression
- y_statistic_given_feat_level: A dictionary with the necessary statistics to encode each categorical feature. It maps each level in each categorical feature to a statistic computed over the target.
Report
The fields of report(mach) are:
- encoded_features: The subset of the categorical features of X that were encoded
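As a hedged illustration of the two sections above, the sketch below fits the encoder on a small regression problem and reads back the documented fields. The toy data are made up for illustration, and the snippet assumes MLJ and MLJTransforms are installed. Printed values are omitted because their exact form may depend on the package version.

```julia
using MLJ

# Load the model type as described at the top of this page
TargetEncoder = @load TargetEncoder pkg=MLJTransforms verbosity=0

# Toy data (made up): one categorical feature and one continuous feature
X = (A = ["g", "b", "g", "r", "r"], B = [1.0, 2.0, 3.0, 4.0, 5.0])
X = coerce(X, :A => Multiclass, :B => Continuous)
y = [10.0, 20.0, 30.0, 40.0, 50.0]  # Continuous target, i.e. a regression problem

mach = fit!(machine(TargetEncoder(), X, y), verbosity = 0)

fp = fitted_params(mach)
fp.task                          # whether the task is classification or regression
fp.y_statistic_given_feat_level  # per-feature, per-level target statistics

report(mach).encoded_features    # categorical features of X that were encoded
```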
Examples
using MLJ
## Define the input features
A = ["g", "b", "g", "r", "r",]
B = [1.0, 2.0, 3.0, 4.0, 5.0,]
C = ["f", "f", "f", "m", "f",]
D = [true, false, true, false, true,]
E = [1, 2, 3, 4, 5,]
## Define the target variable
y = ["c1", "c2", "c3", "c1", "c2",]
## Combine into a named tuple
X = (A = A, B = B, C = C, D = D, E = E)
## Coerce A, C, D to multiclass and B to continuous and E to ordinal
X = coerce(X,
    :A => Multiclass,
    :B => Continuous,
    :C => Multiclass,
    :D => Multiclass,
    :E => OrderedFactor,
)
y = coerce(y, Multiclass)
encoder = TargetEncoder(ordered_factor = false, lambda = 1.0, m = 0,)
mach = fit!(machine(encoder, X, y))
Xnew = transform(mach, X)
julia> schema(Xnew)
┌───────┬──────────────────┬─────────────────────────────────┐
│ names │ scitypes         │ types                           │
├───────┼──────────────────┼─────────────────────────────────┤
│ A_1   │ Continuous       │ Float64                         │
│ A_2   │ Continuous       │ Float64                         │
│ A_3   │ Continuous       │ Float64                         │
│ B     │ Continuous       │ Float64                         │
│ C_1   │ Continuous       │ Float64                         │
│ C_2   │ Continuous       │ Float64                         │
│ C_3   │ Continuous       │ Float64                         │
│ D_1   │ Continuous       │ Float64                         │
│ D_2   │ Continuous       │ Float64                         │
│ D_3   │ Continuous       │ Float64                         │
│ E     │ OrderedFactor{5} │ CategoricalValue{Int64, UInt32} │
└───────┴──────────────────┴─────────────────────────────────┘
Reference
[1] Micci-Barreca, Daniele. “A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems.” SIGKDD Explor. Newsl. 3, 1 (July 2001), 27–32.
See also OneHotEncoder