Contrast encoders are categorical encoders that can be modeled by a contrast matrix:

| Transformer | Brief Description |
|---|---|
| DummyEncoder | Encodes by comparing each level to the reference level, intercept being the cell mean of the reference group |
| SumEncoder | Encodes by comparing each level to the reference level, intercept being the grand mean |
| HelmertEncoder | Encodes by comparing levels of a variable with the mean of the subsequent levels of the variable |
| ForwardDifferenceEncoder | Encodes by comparing adjacent levels of a variable (each level minus the next level) |
| ContrastEncoder | Allows defining a custom contrast encoder via a contrast matrix |
| HypothesisEncoder | Allows defining a custom contrast encoder via a hypothesis matrix |

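
To make the idea of a contrast matrix concrete, here is a minimal sketch in plain Julia of three of the standard schemes listed above. The function names are hypothetical, and the choice of reference level and sign conventions are assumptions that may differ from what MLJTransforms uses internally. In each matrix, row i is the vector that replaces level i of a feature with k levels.

```julia
# Each function returns a k x (k-1) contrast matrix whose row i encodes level i.
# All names here are illustrative, not part of MLJTransforms.

# Dummy coding: one indicator column per non-reference level.
# Assumption: the last level is the reference (all-zero row).
dummy_contrast(k) = [i == j ? 1.0 : 0.0 for i in 1:k, j in 1:k-1]

# Sum coding: like dummy coding, but the last level is -1 in every column,
# so each column sums to zero.
sum_contrast(k) = [i == k ? -1.0 : (i == j ? 1.0 : 0.0) for i in 1:k, j in 1:k-1]

# Helmert coding: in column j, levels 1..j get -1, level j+1 gets j,
# and all later levels get 0.
helmert_contrast(k) =
    [i <= j ? -1.0 : (i == j + 1 ? Float64(j) : 0.0) for i in 1:k, j in 1:k-1]

helmert_contrast(4)
```

With this convention, the third row of helmert_contrast(4) is [0.0, 2.0, -1.0], which agrees with how favnum = 7 (the third of four levels) is encoded in the Examples section.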
MLJTransforms.ContrastEncoder - Type
ContrastEncoder

A model type for constructing a contrast encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.

From MLJ, the type can be imported using

ContrastEncoder = @load ContrastEncoder pkg=MLJTransforms

Do model = ContrastEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in ContrastEncoder(features=...).

ContrastEncoder implements the following contrast encoding methods for categorical features: dummy, sum, backward/forward difference, and Helmert coding. More generally, users can specify a custom contrast or hypothesis matrix, and each feature can be encoded using a different method.
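
The difference between a contrast matrix and a hypothesis matrix can be sketched in plain Julia. A hypothesis matrix states the comparisons of interest as weights on the level means, and a common way to turn it into a contrast matrix is to take its generalized (Moore-Penrose) inverse. Whether MLJTransforms performs exactly this conversion is an assumption here; the sketch below only illustrates the general relationship, using forward-difference hypotheses for a feature with 4 levels.

```julia
using LinearAlgebra

# Hypothesis matrix for k = 4 levels: row j states "level j minus level j+1".
H = [1.0 -1.0  0.0  0.0
     0.0  1.0 -1.0  0.0
     0.0  0.0  1.0 -1.0]

# Generalized inverse: a 4 x 3 matrix whose row i is the encoding of level i.
C = pinv(H)

# Applying the hypotheses to the encodings recovers the identity matrix.
@assert H * C ≈ Matrix(1.0I, 3, 3)
```

Each column of C sums to zero, so the comparisons are centered and the intercept of a linear model fitted on the encoded columns is the grand mean of the level means.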

Training data

In MLJ (or MLJBase), bind an instance of this unsupervised model, model, to data with

mach = machine(model, X)

Here:

  • X is any table of input features (e.g., a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.

Train the machine using fit!(mach, rows=...).
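
If schema(X) shows that a feature does not yet have a categorical scitype, it can be coerced before binding the data to a machine. The following is a small sketch with made-up column names (city and rank are purely illustrative), assuming MLJ is installed:

```julia
using MLJ

# Raw table: strings have scitype Textual and integers have scitype Count,
# so neither column would be treated as categorical.
X = (city = ["Rome", "Oslo", "Rome"], rank = [1, 2, 3])
schema(X)

# Coerce the columns to the scitypes expected for features to be encoded.
Xc = coerce(X, :city => Multiclass, :rank => OrderedFactor)
schema(Xc)
```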

Hyper-parameters

  • features=[]: A list of names of categorical features, given as symbols, to exclude from or include in encoding, according to the value of ignore. It may also be a single symbol (treated as a vector with one symbol) or a callable that returns true for the features to be included/excluded.
  • mode=:dummy: The type of encoding to use. Can be one of :contrast, :dummy, :sum, :backward_diff, :forward_diff, :helmert or :hypothesis. If ignore=false (features to be encoded are listed explicitly in features), then this can be a vector of the same length as features to specify a different contrast encoding scheme for each feature.

  • buildmatrix=nothing: A function or other callable with signature buildmatrix(colname, k), where colname is the name of the feature and k is the number of its levels, and which returns a contrast or hypothesis matrix with row/column ordering consistent with the ordering of levels(col). Only relevant if mode is :contrast or :hypothesis.

  • ignore=true: Whether to exclude or include the features given in features.

  • ordered_factor=false: Whether to encode OrderedFactor features or ignore them.
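
As a hedged illustration of the buildmatrix hyper-parameter, the sketch below passes a custom builder with mode = :contrast. The builder name my_contrast, the toy table, and the assumption that a contrast matrix for k levels must have size k x (k-1) are not taken from the text above, so check them against the package before relying on this.

```julia
using MLJ

ContrastEncoder = @load ContrastEncoder pkg=MLJTransforms verbosity=0

# Toy data: one categorical feature and one continuous feature.
X = (
    grade = categorical(["low", "mid", "high", "mid"]),
    score = [0.2, 0.5, 0.9, 0.4],
)

# Hypothetical builder producing a sum-style contrast matrix.
# Assumed shape: k x (k-1), with rows ordered like levels(X.grade).
my_contrast(colname, k) =
    [i == j ? 1.0 : (i == k ? -1.0 : 0.0) for i in 1:k, j in 1:k-1]

encoder = ContrastEncoder(
    features = [:grade],
    ignore = false,          # encode only the listed features
    mode = :contrast,        # use the matrix returned by buildmatrix
    buildmatrix = my_contrast,
)
mach = fit!(machine(encoder, X), verbosity=0)
Xnew = transform(mach, X)
```

Because the builder receives colname, a single function can return different matrices for different features, which is one way to mix custom schemes across columns.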

Operations

  • transform(mach, Xnew): Apply contrast encoding to the selected Multiclass or OrderedFactor features of Xnew specified by hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.

Fitted parameters

The fields of fitted_params(mach) are:

  • vector_given_value_given_feature: A dictionary that maps each level of each encoded categorical feature of X to the vector that encodes it.

Report

The fields of report(mach) are:

  • encoded_features: The subset of the categorical features of X that were encoded.
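
The fitted parameters and the report can be inspected after training. This sketch only accesses the two fields named above; the exact key and value types inside the dictionary are not specified here, so the code displays them rather than assuming a particular layout.

```julia
using MLJ

ContrastEncoder = @load ContrastEncoder pkg=MLJTransforms verbosity=0

X = (
    name = categorical(["Ben", "John", "Mary", "John"]),
    age  = [23, 23, 14, 23],
)

encoder = ContrastEncoder(features = [:name], ignore = false, mode = :sum)
mach = fit!(machine(encoder, X), verbosity=0)

# Lookup structure: feature -> level -> encoding vector.
fp = fitted_params(mach)
display(fp.vector_given_value_given_feature)

# Names of the features that were actually encoded.
rep = report(mach)
display(rep.encoded_features)
```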

Examples

using MLJ

# Define categorical dataset
X = (
    name   = categorical(["Ben", "John", "Mary", "John"]),
    height = [1.85, 1.67, 1.5, 1.67],
    favnum = categorical([7, 5, 10, 1]),
    age    = [23, 23, 14, 23],
)

# Check scitype coercions:
schema(X)

encoder = ContrastEncoder(
    features = [:name, :favnum],
    ignore = false,
    mode = [:dummy, :helmert],
)
mach = fit!(machine(encoder, X))
Xnew = transform(mach, X)

julia> Xnew
    (name_John = [1.0, 0.0, 0.0, 0.0],
    name_Mary = [0.0, 1.0, 0.0, 1.0],
    height = [1.85, 1.67, 1.5, 1.67],
    favnum_5 = [0.0, 1.0, 0.0, -1.0],
    favnum_7 = [2.0, -1.0, 0.0, -1.0],
    favnum_10 = [-1.0, -1.0, 3.0, -1.0],
    age = [23, 23, 14, 23],)

See also OneHotEncoder.
