FrequencyEncoder

A model type for constructing a frequency encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.

From MLJ, the type can be imported using

FrequencyEncoder = @load FrequencyEncoder pkg=MLJTransforms

Do model = FrequencyEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in FrequencyEncoder(features=...).

FrequencyEncoder implements frequency encoding, which replaces each categorical value in the selected categorical features with its frequency of occurrence in the training data, given either as a raw count or as a normalized frequency.
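As a rough illustration of the idea, the following dependency-free sketch computes the frequencies for a single column and then maps each value to its frequency. The helper name frequency_map is hypothetical and is not part of MLJTransforms; the actual encoder operates on tables and categorical levels, and its internals may differ.

```julia
# Hypothetical helper (not part of MLJTransforms): map each distinct value
# in `column` to its raw count, or to its relative frequency if `normalize=true`.
function frequency_map(column; normalize=false)
    counts = Dict{eltype(column), Float64}()
    for v in column
        counts[v] = get(counts, v, 0.0) + 1.0
    end
    normalize || return counts
    n = length(column)
    return Dict(k => c / n for (k, c) in counts)
end

colors = ["g", "b", "g", "r", "r"]

# Raw counts: "g" and "r" occur twice, "b" once
raw = frequency_map(colors)
encoded_raw = [raw[v] for v in colors]                 # [2.0, 1.0, 2.0, 2.0, 2.0]

# Normalized frequencies sum to 1 over the distinct values
normalized = frequency_map(colors; normalize=true)
encoded_normalized = [normalized[v] for v in colors]   # [0.4, 0.2, 0.4, 0.4, 0.4]
```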

Training data

In MLJ (or MLJBase), bind an instance of the unsupervised model to data with

mach = machine(model, X)

Here:

  • X is any table of input features (e.g., a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.

Train the machine using fit!(mach, rows=...).

Hyper-parameters

  • features=[]: The categorical features to include in or exclude from encoding, depending on the value of ignore. Given as a vector of symbols, as a single symbol (treated as a one-element vector), or as a callable that returns true for the features to be included or excluded.
  • ignore=true: Whether to exclude (true) or include (false) the features given in features.
  • ordered_factor=false: Whether to encode OrderedFactor features (true) or leave them unchanged (false).
  • normalize=false: Whether to use normalized frequencies that sum to 1 over category values or to use raw counts.
  • output_type=Float32: The type of the output values. The default is Float32, but you can set it to Float64 or any other type that can hold the frequency values.
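A minimal sketch of how these hyper-parameters might be combined, using only the keywords listed above. The feature names :A and :C are placeholders for columns in your own table.

```julia
using MLJ
FrequencyEncoder = @load FrequencyEncoder pkg=MLJTransforms

# Encode only the features :A and :C (ignore=false means "include the listed
# features"), using normalized frequencies stored as Float64.
encoder = FrequencyEncoder(
    features = [:A, :C],
    ignore = false,
    normalize = true,
    output_type = Float64,
)
```

With the defaults (features=[] and ignore=true), nothing is excluded, so every Multiclass feature is a candidate for encoding.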

Operations

  • transform(mach, Xnew): Apply frequency encoding to the selected Multiclass or OrderedFactor features of Xnew, as specified by the hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.

Fitted parameters

The fields of fitted_params(mach) are:

  • statistic_given_feat_val: A dictionary that, for each encoded categorical feature of X, maps each level of that feature to its frequency.

Report

The fields of report(mach) are:

  • encoded_features: The subset of the categorical features of X that were encoded
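The fitted parameters and the report can be inspected after training. The snippet below is a small sketch on a two-column table made up for illustration; it relies only on the field names documented above and does not assume a particular printed form for the results.

```julia
using MLJ
FrequencyEncoder = @load FrequencyEncoder pkg=MLJTransforms

# Toy table: :A is coerced to Multiclass, :B stays Continuous
X = coerce(
    (A = ["g", "b", "g", "r", "r"], B = [1.0, 2.0, 3.0, 4.0, 5.0]),
    :A => Multiclass,
)

mach = fit!(machine(FrequencyEncoder(), X))

# Learned mapping from each level to its frequency, per encoded feature
stats = fitted_params(mach).statistic_given_feat_val

# Features that were actually encoded
encoded = report(mach).encoded_features
```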

Examples

using MLJ

## Define categorical features
A = ["g", "b", "g", "r", "r"]
B = [1.0, 2.0, 3.0, 4.0, 5.0]
C = ["f", "f", "f", "m", "f"]
D = [true, false, true, false, true]
E = [1, 2, 3, 4, 5]

## Combine into a named tuple
X = (A = A, B = B, C = C, D = D, E = E)

## Coerce A, C, D to Multiclass, B to Continuous, and E to OrderedFactor
X = coerce(X,
    :A => Multiclass,
    :B => Continuous,
    :C => Multiclass,
    :D => Multiclass,
    :E => OrderedFactor,
)

## Check scitype coercions:
schema(X)

## normalize=false gives raw counts; normalize=true would give frequencies that sum to 1
encoder = FrequencyEncoder(ordered_factor=false, normalize=false)
mach = fit!(machine(encoder, X))
Xnew = transform(mach, X)

julia> Xnew
    (A = [2, 1, 2, 2, 2],
    B = [1.0, 2.0, 3.0, 4.0, 5.0],
    C = [4, 4, 4, 1, 4],
    D = [3, 2, 3, 2, 3],
    E = CategoricalArrays.CategoricalValue{Int64, UInt32}[1, 2, 3, 4, 5],)

See also TargetEncoder