CardinalityReducer

A model type for constructing a cardinality reducer, based on MLJTransforms.jl, and implementing the MLJ model interface.

From MLJ, the type can be imported using

CardinalityReducer = @load CardinalityReducer pkg=MLJTransforms

Do model = CardinalityReducer() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in CardinalityReducer(features=...).

CardinalityReducer maps any level of a categorical feature that occurs with frequency < min_frequency into a new level (e.g., "Other"). This is useful when some categorical features have high cardinality and many levels are infrequent. This assumes that the categorical features have raw types that are in Union{AbstractString, Char, Number}.

Training data

In MLJ (or MLJBase), bind an instance model to data with

mach = machine(model, X)

Here:

  • X is any table of input features (e.g., a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.

Train the machine using fit!(mach, rows=...).

Hyper-parameters

  • features=[]: The categorical features to exclude from or include in the encoding, depending on the value of ignore. Can be a vector of feature names given as symbols, a single symbol (treated as a vector with one symbol), or a callable that returns true for the features to be excluded/included
  • ignore=true: Whether to exclude (true) or include (false) the features given in features
  • ordered_factor=false: Whether to encode OrderedFactor features or ignore them
  • min_frequency::Real=3: Any level of a categorical feature that occurs with frequency < min_frequency will be mapped to a new level. Can be an integer or a float; the type decides whether raw counts (integer) or normalized frequencies (float) are used.

  • label_for_infrequent::Dict{<:Type, <:Any}=Dict(AbstractString => "Other", Char => 'O'): A dictionary whose keys are drawn from the types Char, AbstractString, and Number, and whose values give the new level that infrequent levels are mapped into for a column whose raw type is a subtype of the key. By default, the new level is "Other" if the raw type of the column subtypes AbstractString, 'O' if it subtypes Char, and the lowest value in the column minus 1 if it subtypes Number.
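
The interaction between min_frequency and label_for_infrequent can be illustrated with a minimal sketch in plain Julia. This is not the MLJTransforms implementation: default_label and reduce_levels are hypothetical helpers written only for illustration, and they operate on ordinary vectors rather than categorical columns. The sketch assumes that an integer threshold is compared with raw counts and any other real threshold with proportions.

```julia
# Hypothetical helper (not part of MLJTransforms): default replacement level
# chosen from the raw element type, following the defaults described above.
default_label(col::AbstractVector{<:AbstractString}) = "Other"
default_label(col::AbstractVector{<:Char}) = 'O'
default_label(col::AbstractVector{<:Number}) = minimum(col) - 1

# Hypothetical helper: map every level whose frequency is below `min_frequency`
# to `label`. An Integer threshold is compared with raw counts; any other Real
# threshold is compared with proportions (assumption stated in the lead-in).
function reduce_levels(col::AbstractVector, min_frequency::Real;
                       label = default_label(col))
    counts = Dict{eltype(col), Int}()
    for x in col
        counts[x] = get(counts, x, 0) + 1
    end
    n = length(col)
    freq(x) = min_frequency isa Integer ? counts[x] : counts[x] / n
    return [freq(x) < min_frequency ? label : x for x in col]
end

# Same data as in the Examples section.
A = vcat(fill("a", 100), ["b", "b", "b", "c", "d"])
B = vcat(fill(0, 100), [1, 2, 3, 4, 4])

A_counts = reduce_levels(A, 3)      # "c" and "d" occur fewer than 3 times
A_props  = reduce_levels(A, 0.05)   # "b", "c", "d" each fall below 5% of rows
B_counts = reduce_levels(B, 3)      # levels 1, 2, 3, 4 collapse to minimum(B) - 1
```

With the integer threshold, "b" survives because its count of 3 is not strictly below 3; with the float threshold 0.05 it is merged as well, since 3/105 is below 5%.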

Operations

  • transform(mach, Xnew): Apply cardinality reduction to selected Multiclass or OrderedFactor features of Xnew specified by hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.
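
Which features end up transformed depends on how features and ignore combine. The following plain-Julia sketch shows one reading of the rules stated under Hyper-parameters; as_collection and selected_features are hypothetical names, not functions from MLJTransforms, and the sketch leaves out the ordered_factor flag.

```julia
# Hypothetical helper (not part of MLJTransforms): a single symbol is treated
# as a vector with one symbol; anything else is passed through unchanged.
as_collection(features::Symbol) = [features]
as_collection(features) = features

# Hypothetical helper: given the names of the categorical columns, return the
# ones that would be transformed under `features` and `ignore`.
function selected_features(categorical::Vector{Symbol}, features, ignore::Bool)
    features = as_collection(features)
    # `listed(name)` is true when `name` appears in the vector, or when the
    # callable returns true for it.
    listed(name) = features isa AbstractVector ? name in features : features(name)
    # ignore = true  -> keep the names that are NOT listed
    # ignore = false -> keep only the names that ARE listed
    return [name for name in categorical if listed(name) != ignore]
end

cats = [:A, :B, :C]

all_cols    = selected_features(cats, [], true)     # defaults: nothing excluded
without_A   = selected_features(cats, [:A], true)   # exclude :A
only_A      = selected_features(cats, [:A], false)  # include only :A
only_B      = selected_features(cats, :B, false)    # single symbol
by_callable = selected_features(cats, name -> name in (:A, :C), true)
```

Under the defaults (features=[], ignore=true) nothing is excluded, so every categorical feature is a candidate for reduction.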

Fitted parameters

The fields of fitted_params(mach) are:

  • new_cat_given_col_val: A dictionary that maps each level in a categorical feature to a new level (either itself or the new level specified in label_for_infrequent)

Report

The fields of report(mach) are:

  • encoded_features: The subset of the categorical features of X that were encoded

Examples

import StatsBase.proportionmap
using MLJ
CardinalityReducer = @load CardinalityReducer pkg=MLJTransforms

## Define categorical features
A = [ ["a" for i in 1:100]..., "b", "b", "b", "c", "d"]
B = [ [0 for i in 1:100]..., 1, 2, 3, 4, 4]

## Combine into a named tuple
X = (A = A, B = B)

## Coerce A and B to multiclass
X = coerce(X,
    :A => Multiclass,
    :B => Multiclass
)

encoder = CardinalityReducer(ordered_factor = false, min_frequency=3)
mach = fit!(machine(encoder, X))
Xnew = transform(mach, X)

julia> proportionmap(Xnew.A)
Dict{CategoricalArrays.CategoricalValue{String, UInt32}, Float64} with 3 entries:
  "Other" => 0.0190476
  "b"     => 0.0285714
  "a"     => 0.952381

julia> proportionmap(Xnew.B)
Dict{CategoricalArrays.CategoricalValue{Int64, UInt32}, Float64} with 2 entries:
  0  => 0.952381
  -1 => 0.047619

See also FrequencyEncoder.