TargetEncoder

A model type for constructing a target encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.

From MLJ, the type can be imported using

TargetEncoder = @load TargetEncoder pkg=MLJTransforms

Do model = TargetEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in TargetEncoder(features=...).

TargetEncoder implements target encoding as defined in [1] to encode categorical variables into continuous ones using statistics from the target variable.
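
To make the idea concrete, here is a minimal sketch of target encoding for a single categorical column and a continuous target, written in plain Julia. It is not the MLJTransforms implementation: `sketch_target_encode` is a hypothetical helper, and the shrinkage weight `n / (n + m)` is one common form assumed here for illustration. How the package combines its own `lambda` and `m` hyper-parameters is not shown.

```julia
using Statistics

# Hypothetical helper, not part of MLJTransforms: blends each level's target
# mean with the global target mean using the assumed weight n / (n + m).
function sketch_target_encode(x::AbstractVector, y::AbstractVector{<:Real}; m::Real = 0)
    prior = mean(y)                      # global target mean (the "prior")
    encoding = Dict{eltype(x), Float64}()
    for level in unique(x)
        ys = y[x .== level]              # target values observed for this level
        n = length(ys)
        w = n / (n + m)                  # shrinkage weight; equals 1.0 when m == 0
        encoding[level] = w * mean(ys) + (1 - w) * prior
    end
    return encoding
end

x = ["g", "b", "g", "r", "r"]
y = [1.0, 2.0, 3.0, 4.0, 5.0]

enc = sketch_target_encode(x, y; m = 0)         # no shrinkage: plain per-level means
enc_shrunk = sketch_target_encode(x, y; m = 2)  # levels pulled toward the global mean
```

With `m = 0` each level is replaced by its own target mean. Larger values of `m` pull rarely observed levels more strongly toward the global mean.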

Training data

In MLJ (or MLJBase) bind an instance model to data with

mach = machine(model, X, y)

Here:

X is any table of input features (e.g., a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.
y is the target, which can be any AbstractVector whose element scitype is Continuous or Count for regression problems, and Multiclass or OrderedFactor for classification problems. Check the scitype with scitype(y).

Train the machine using fit!(mach, rows=...).

Hyper-parameters

features=[]: The categorical features to exclude from or include in encoding, according to the value of ignore. This can be a list of feature names given as symbols, a single symbol (treated as a vector with one symbol), or a callable that returns true for the features to be excluded/included.
ignore=true: Whether to exclude (true) or include (false) the features given in features. With the defaults, no feature is excluded.
ordered_factor=false: Whether to encode OrderedFactor features (true) or leave them unchanged (false)
λ: Shrinkage hyperparameter used to mix between posterior and prior statistics as described in [1]. It is passed as the keyword lambda, as in the example below.
m: An integer hyperparameter to compute shrinkage as described in [1]. If m=:auto then m will be computed using empirical Bayes estimation as described in [1]
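
The interaction between features and ignore can be sketched in plain Julia as follows. `sketch_select` is a hypothetical helper written only to illustrate the semantics described above. It is not the package's code, and it leaves out the further restriction to Multiclass/OrderedFactor columns and the ordered_factor switch.

```julia
# Hypothetical helper, not part of MLJTransforms: returns the feature names
# that would be considered for encoding given `features` and `ignore`.
function sketch_select(all_features::Vector{Symbol}, features; ignore::Bool = true)
    # A single symbol is treated as a vector with one symbol
    listed = features isa Symbol ? [features] : features
    # `listed` is either a collection of names or a callable predicate
    matches(name) = listed isa AbstractVector ? name in listed : listed(name)
    # ignore = true keeps the non-matching names; ignore = false keeps the matching ones
    return [name for name in all_features if matches(name) != ignore]
end

names = [:A, :B, :C]

sketch_select(names, Symbol[])                            # nothing excluded
sketch_select(names, [:A])                                # :A excluded
sketch_select(names, :A; ignore = false)                  # only :A included
sketch_select(names, name -> name == :C; ignore = false)  # callable form
```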

Operations

transform(mach, Xnew): Apply target encoding to selected Multiclass or OrderedFactor features of Xnew specified by hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.

Fitted parameters

The fields of fitted_params(mach) are:

task: Whether the task is Classification or Regression
y_statistic_given_feat_level: A dictionary with the necessary statistics to encode each categorical feature. It maps each level in each categorical feature to a statistic computed over the target.

Report

The fields of report(mach) are:

encoded_features: The subset of the categorical features of X that were encoded

Examples

using MLJ

## Define the input features
A = ["g", "b", "g", "r", "r",]
B = [1.0, 2.0, 3.0, 4.0, 5.0,]
C = ["f", "f", "f", "m", "f",]
D = [true, false, true, false, true,]
E = [1, 2, 3, 4, 5,]

## Define the target variable
y = ["c1", "c2", "c3", "c1", "c2",]

## Combine into a named tuple
X = (A = A, B = B, C = C, D = D, E = E)

## Coerce A, C, D to Multiclass, B to Continuous, and E to OrderedFactor
X = coerce(X,
    :A => Multiclass,
    :B => Continuous,
    :C => Multiclass,
    :D => Multiclass,
    :E => OrderedFactor,
)
y = coerce(y, Multiclass)

encoder = TargetEncoder(ordered_factor = false, lambda = 1.0, m = 0,)
mach = fit!(machine(encoder, X, y))
Xnew = transform(mach, X)

julia> schema(Xnew)
┌───────┬──────────────────┬─────────────────────────────────┐
│ names │ scitypes         │ types                           │
├───────┼──────────────────┼─────────────────────────────────┤
│ A_1   │ Continuous       │ Float64                         │
│ A_2   │ Continuous       │ Float64                         │
│ A_3   │ Continuous       │ Float64                         │
│ B     │ Continuous       │ Float64                         │
│ C_1   │ Continuous       │ Float64                         │
│ C_2   │ Continuous       │ Float64                         │
│ C_3   │ Continuous       │ Float64                         │
│ D_1   │ Continuous       │ Float64                         │
│ D_2   │ Continuous       │ Float64                         │
│ D_3   │ Continuous       │ Float64                         │
│ E     │ OrderedFactor{5} │ CategoricalValue{Int64, UInt32} │
└───────┴──────────────────┴─────────────────────────────────┘
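
The schema above shows three columns for each encoded feature, which is consistent with one column per class of the three-class target. The following plain-Julia sketch computes per-level class frequencies for feature A to illustrate that idea. `sketch_class_frequencies` is a hypothetical helper, and the values it produces are not claimed to match the package's output exactly.

```julia
# Hypothetical helper, not part of MLJTransforms: for each level of `x`,
# compute the fraction of rows falling in each target class.
function sketch_class_frequencies(x::AbstractVector, y::AbstractVector)
    classes = sort(unique(y))            # fixed class order for the output columns
    freqs = Dict{eltype(x), Vector{Float64}}()
    for level in unique(x)
        ys = y[x .== level]              # target labels observed for this level
        freqs[level] = [count(==(c), ys) / length(ys) for c in classes]
    end
    return classes, freqs
end

A = ["g", "b", "g", "r", "r"]
y = ["c1", "c2", "c3", "c1", "c2"]

classes, freqs = sketch_class_frequencies(A, y)

# Each row of the feature becomes one value per class, i.e. three columns here
encoded_rows = [freqs[level] for level in A]
```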

Reference

[1] Micci-Barreca, Daniele. “A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems.” SIGKDD Explor. Newsl. 3, 1 (July 2001), 27–32.