CardinalityReducer
A model type for constructing a cardinality reducer, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
CardinalityReducer = @load CardinalityReducer pkg=MLJTransforms
Do model = CardinalityReducer() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in CardinalityReducer(features=...).
CardinalityReducer maps any level of a categorical feature that occurs with frequency < min_frequency into a new level (e.g., "Other"). This is useful when some categorical features have high cardinality and many levels are infrequent. It assumes that the categorical features have raw types in Union{AbstractString, Char, Number}.
Training data
In MLJ (or MLJBase), bind an instance of the unsupervised model, model, to data with
mach = machine(model, X)
Here:
- X is any table of input features (e.g., a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- features=[]: A list of names of categorical features, given as symbols, to exclude from or include in encoding, according to the value of ignore; or a single symbol (which is treated as a vector with one symbol); or a callable that returns true for features to be included/excluded.
- ignore=true: Whether to exclude (true) or include (false) the features given in features.
- ordered_factor=false: Whether to encode OrderedFactor features or ignore them.
- min_frequency::Real=3: Any level of a categorical feature that occurs with frequency < min_frequency will be mapped to a new level. May be an integer or a float, which determines whether raw counts or normalized frequencies, respectively, are used.
- label_for_infrequent::Dict{<:Type, <:Any}() = Dict(AbstractString => "Other", Char => 'O'): A dictionary whose keys may be any of the types Char, AbstractString, and Number, and whose values give the new level that infrequent levels are mapped into for a column whose raw type subtypes the key. By default, if the raw type of the column subtypes AbstractString, the new level is "Other"; if it subtypes Char, the new level is 'O'; and if it subtypes Number, the new level is the lowest value in the column minus 1.
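The thresholding rule described by min_frequency and label_for_infrequent can be sketched in plain Julia, without MLJ. This is only an illustration under stated assumptions: reduce_cardinality and default_label are hypothetical helpers, not functions of MLJTransforms, and the sketch assumes that an Integer threshold is compared against raw counts while any other Real is compared against proportions.

```julia
# Hypothetical helper (not part of MLJTransforms): replace infrequent levels
# of a plain vector with `label`.
function reduce_cardinality(col::AbstractVector, min_frequency::Real, label)
    # Count the occurrences of each level.
    counts = Dict{eltype(col), Int}()
    for v in col
        counts[v] = get(counts, v, 0) + 1
    end
    n = length(col)
    # Assumption: an Integer threshold uses raw counts; any other Real uses proportions.
    freq(v) = min_frequency isa Integer ? counts[v] : counts[v] / n
    # Keep frequent levels; map levels with frequency < min_frequency to `label`.
    return [freq(v) < min_frequency ? label : v for v in col]
end

# Assumed defaults, following the description of label_for_infrequent above.
default_label(col::AbstractVector{<:AbstractString}) = "Other"
default_label(col::AbstractVector{<:Char}) = 'O'
default_label(col::AbstractVector{<:Number}) = minimum(col) - 1

A = [fill("a", 100)..., "b", "b", "b", "c", "d"]
B = [fill(0, 100)..., 1, 2, 3, 4, 4]

A_reduced = reduce_cardinality(A, 3, default_label(A))     # count threshold
B_reduced = reduce_cardinality(B, 3, default_label(B))     # numeric label is min - 1
A_prop    = reduce_cardinality(A, 0.03, default_label(A))  # proportion threshold
```

With the count threshold 3, "b" (three occurrences) is kept because the comparison is strict, while "c" and "d" become "Other". With the proportion threshold 0.03, "b" accounts for 3/105 of the rows and is merged as well.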
Operations
- transform(mach, Xnew): Apply cardinality reduction to selected Multiclass or OrderedFactor features of Xnew specified by hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.
Fitted parameters
The fields of fitted_params(mach) are:
- new_cat_given_col_val: A dictionary that maps each level in a categorical feature to a new level (either itself or the new level specified in label_for_infrequent)
Report
The fields of report(mach) are:
- encoded_features: The subset of the categorical features of X that were encoded
Examples
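import StatsBase.proportionmap
using MLJ
## Load the model type, as described at the top of this page
CardinalityReducer = @load CardinalityReducer pkg=MLJTransforms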
## Define categorical features
A = [ ["a" for i in 1:100]..., "b", "b", "b", "c", "d"]
B = [ [0 for i in 1:100]..., 1, 2, 3, 4, 4]
## Combine into a named tuple
X = (A = A, B = B)
## Coerce A and B to multiclass
X = coerce(X,
    :A => Multiclass,
    :B => Multiclass
)
encoder = CardinalityReducer(ordered_factor = false, min_frequency=3)
mach = fit!(machine(encoder, X))
Xnew = transform(mach, X)
julia> proportionmap(Xnew.A)
Dict{CategoricalArrays.CategoricalValue{String, UInt32}, Float64} with 3 entries:
  "Other" => 0.0190476
  "b"     => 0.0285714
  "a"     => 0.952381
julia> proportionmap(Xnew.B)
Dict{CategoricalArrays.CategoricalValue{Int64, UInt32}, Float64} with 2 entries:
  0  => 0.952381
  -1 => 0.047619
See also FrequencyEncoder