OrdinalEncoder

OrdinalEncoder

A model type for constructing a ordinal encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.

From MLJ, the type can be imported using

OrdinalEncoder = @load OrdinalEncoder pkg=MLJTransforms

Do model = OrdinalEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in OrdinalEncoder(features=...).

OrdinalEncoder implements ordinal encoding which replaces the categorical values in the specified categorical features with integers (ordered arbitrarily). This will create an implicit ordering between categories which may not be a proper modelling assumption.

Training data

In MLJ (or MLJBase) bind an instance unsupervised model to data with

mach = machine(model, X)

Here:

  • X is any table of input features (eg, a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.

Train the machine using fit!(mach, rows=...).

Hyper-parameters

  • features=[]: A list of names of categorical features given as symbols to exclude or include from encoding, according to the value of ignore, or a single symbol (which is treated as a vector with one symbol), or a callable that returns true for features to be included/excluded
  • ignore=true: Whether to exclude or include the features given in features
  • ordered_factor=false: Whether to encode OrderedFactor or ignore them
  • output_type: The numerical concrete type of the encoded features. Default is Float32.

Operations

  • transform(mach, Xnew): Apply ordinal encoding to selected Multiclass or OrderedFactor features ofXnewspecified by hyper-parameters, and return the new table. Features that are neitherMulticlassnorOrderedFactor` are always left unchanged.

Fitted parameters

The fields of fitted_params(mach) are:

  • index_given_feat_level: A dictionary that maps each level for each column in a subset of the categorical features of X into an integer.

Report

The fields of report(mach) are:

  • encoded_features: The subset of the categorical features of X that were encoded

Examples

using MLJ

## Define categorical features
A = ["g", "b", "g", "r", "r",]  
B = [1.0, 2.0, 3.0, 4.0, 5.0,]
C = ["f", "f", "f", "m", "f",]  
D = [true, false, true, false, true,]
E = [1, 2, 3, 4, 5,]

## Combine into a named tuple
X = (A = A, B = B, C = C, D = D, E = E)

## Coerce A, C, D to multiclass and B to continuous and E to ordinal
X = coerce(X,
:A => Multiclass,
:B => Continuous,
:C => Multiclass,
:D => Multiclass,
:E => OrderedFactor,
)

## Check scitype coercion:
schema(X)

encoder = OrdinalEncoder(ordered_factor = false)
mach = fit!(machine(encoder, X))
Xnew = transform(mach, X)

julia > Xnew
    (A = [2, 1, 2, 3, 3],
    B = [1.0, 2.0, 3.0, 4.0, 5.0],
    C = [1, 1, 1, 2, 1],
    D = [2, 1, 2, 1, 2],
    E = CategoricalArrays.CategoricalValue{Int64, UInt32}[1, 2, 3, 4, 5],)

See also TargetEncoder