ContrastEncoder

A model type for constructing a contrast encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.

From MLJ, the type can be imported using

ContrastEncoder = @load ContrastEncoder pkg=MLJTransforms

Do model = ContrastEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in ContrastEncoder(features=...).

ContrastEncoder implements the following contrast encoding methods for categorical features: dummy, sum, backward/forward difference, and Helmert coding. More generally, users can specify a custom contrast or hypothesis matrix, and each feature can be encoded using a different method.
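To make the named schemes concrete, the following is a minimal sketch of the k × (k-1) contrast matrices commonly associated with dummy, sum and Helmert coding. The function names are hypothetical and are not part of MLJTransforms.jl; the package's own conventions (for example, which level acts as the reference) may differ.

```julia
using LinearAlgebra

# Each function returns a k × (k-1) matrix: row i is the vector that encodes
# the i-th level of a categorical feature with k levels.

# Dummy (treatment) coding: the reference level is encoded as all zeros.
# Assumption: the first level is used as the reference here.
dummy_contrast(k) = [zeros(1, k - 1); Matrix(1.0I, k - 1, k - 1)]

# Sum (deviation) coding: the last level is encoded as all -1, so that every
# column sums to zero.
sum_contrast(k) = [Matrix(1.0I, k - 1, k - 1); -ones(1, k - 1)]

# Helmert coding: column j contrasts level j+1 with the preceding j levels.
function helmert_contrast(k)
    C = zeros(k, k - 1)
    for j in 1:k-1
        C[1:j, j] .= -1.0   # the first j levels
        C[j+1, j] = j       # level j+1
    end
    return C
end

helmert_contrast(4)
# 4×3 Matrix{Float64}:
#  -1.0  -1.0  -1.0
#   1.0  -1.0  -1.0
#   0.0   2.0  -1.0
#   0.0   0.0   3.0
```

Under this Helmert convention the third of four levels is encoded as (0, 2, -1), which agrees with the favnum columns for the first observation in the example output at the end of this docstring.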

Training data

In MLJ (or MLJBase), bind an instance model of this unsupervised model to data with

mach = machine(model, X)

Here:

  • X is any table of input features (e.g., a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.

Train the machine using fit!(mach, rows=...).
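If a feature does not yet have a categorical scitype, it can be coerced before binding the data to a machine. The sketch below uses schema and coerce as exported by MLJ; the table and its column names are made up for illustration.

```julia
using MLJ

# Hypothetical table: `grade` holds plain strings, so its element scitype is
# Textual and the encoder would leave it unchanged.
X = (
    grade = ["a", "b", "a", "c"],
    score = [1.0, 2.5, 3.0, 0.5],
)
schema(X)

# Coerce `grade` to Multiclass so that it qualifies for encoding.
Xc = coerce(X, :grade => Multiclass)
schema(Xc)
```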

Hyper-parameters

  • features=[]: The categorical features to include in or exclude from encoding, according to the value of ignore. Can be a vector of feature names given as symbols, a single symbol (treated as a vector with one symbol), or a callable that returns true for the features to be included/excluded.
  • mode=:dummy: The type of encoding to use. Can be one of :contrast, :dummy, :sum, :backward_diff, :forward_diff, :helmert or :hypothesis. If ignore=false (so that the features to be encoded are listed explicitly in features), then this can instead be a vector of the same length as features, specifying a different contrast encoding scheme for each feature.

  • buildmatrix=nothing: A function or other callable with signature buildmatrix(colname, k), where colname is the name of the feature and k is the number of its levels, and which returns a contrast or hypothesis matrix with row/column ordering consistent with the ordering of levels(col). Only relevant if mode is :contrast or :hypothesis.

  • ignore=true: Whether to exclude or include the features given in features
  • ordered_factor=false: Whether to encode OrderedFactor features or ignore them
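As an illustration of the buildmatrix hyper-parameter, here is a sketch of a callable with the documented signature buildmatrix(colname, k). The name my_buildmatrix is hypothetical, and two things are assumptions rather than documented facts: that colname is passed as a Symbol, and that a contrast matrix for k levels is expected to have size k × (k-1).

```julia
using LinearAlgebra

# Hypothetical builder: sum coding for the feature `:name`, dummy coding
# (first level as reference) for every other feature.
function my_buildmatrix(colname, k)
    if colname == :name
        # rows 1..k-1 form an identity block; the last level is all -1
        return [Matrix(1.0I, k - 1, k - 1); -ones(1, k - 1)]
    else
        # the first level is all zeros; remaining levels get indicator rows
        return [zeros(1, k - 1); Matrix(1.0I, k - 1, k - 1)]
    end
end

my_buildmatrix(:name, 3)
```

Such a function would then be supplied with something like ContrastEncoder(features = [:name, :favnum], ignore = false, mode = :contrast, buildmatrix = my_buildmatrix).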

Operations

  • transform(mach, Xnew): Apply contrast encoding to selected Multiclass or OrderedFactor features of Xnew specified by hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.

Fitted parameters

The fields of fitted_params(mach) are:

  • vector_given_value_given_feature: A dictionary that maps each level of each encoded categorical feature of X to the vector used to encode that level.

Report

The fields of report(mach) are:

  • encoded_features: The subset of the categorical features of X that were encoded
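The learned encoding and the list of encoded features can be inspected after training. The following sketch assumes, as in the example below, that ContrastEncoder is available after using MLJ (otherwise load it with @load as described above); the small table is made up for illustration.

```julia
using MLJ

X = (
    name = categorical(["Ben", "John", "Mary", "John"]),
    age  = [23, 23, 14, 23],
)

encoder = ContrastEncoder(features = [:name], ignore = false, mode = :sum)
mach = fit!(machine(encoder, X))

# Level-to-vector mapping learned for each encoded feature
fitted_params(mach).vector_given_value_given_feature

# Names of the features that were actually encoded
report(mach).encoded_features
```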

Examples

using MLJ

## Define categorical dataset
X = (
    name   = categorical(["Ben", "John", "Mary", "John"]),
    height = [1.85, 1.67, 1.5, 1.67],
    favnum = categorical([7, 5, 10, 1]),
    age    = [23, 23, 14, 23],
)

## Check scitype coercions:
schema(X)

encoder = ContrastEncoder(
    features = [:name, :favnum],
    ignore = false,
    mode = [:dummy, :helmert],
)
mach = fit!(machine(encoder, X))
Xnew = transform(mach, X)

julia> Xnew
    (name_John = [1.0, 0.0, 0.0, 0.0],
    name_Mary = [0.0, 1.0, 0.0, 1.0],
    height = [1.85, 1.67, 1.5, 1.67],
    favnum_5 = [0.0, 1.0, 0.0, -1.0],
    favnum_7 = [2.0, -1.0, 0.0, -1.0],
    favnum_10 = [-1.0, -1.0, 3.0, -1.0],
    age = [23, 23, 14, 23],)

See also OneHotEncoder