FrequencyEncoder
A model type for constructing a frequency encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
FrequencyEncoder = @load FrequencyEncoder pkg=MLJTransforms
Do model = FrequencyEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in FrequencyEncoder(features=...).
FrequencyEncoder implements frequency encoding, which replaces the categorical values in the specified categorical features with their (normalized or raw) frequencies of occurrence in the dataset.
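To make the idea concrete, here is a minimal sketch of frequency encoding for a single column, written in plain Julia with no MLJ dependency. The helper name frequency_encode is hypothetical and is not part of MLJTransforms; its keyword arguments only imitate the normalize and output_type hyper-parameters described on this page, and the library's own implementation may differ.

```julia
# Hypothetical helper (not the MLJTransforms implementation):
# frequency-encode a single column of categorical values.
function frequency_encode(column; normalize=false, output_type=Float32)
    # Count how often each level occurs.
    counts = Dict{eltype(column), Int}()
    for value in column
        counts[value] = get(counts, value, 0) + 1
    end
    total = length(column)
    # Map each level to its raw count, or to its share of the rows.
    stat = Dict(level => output_type(normalize ? n / total : n) for (level, n) in counts)
    # Replace every value by the statistic of its level.
    return [stat[value] for value in column], stat
end

A = ["g", "b", "g", "r", "r"]
raw, _ = frequency_encode(A)                         # raw counts: 2, 1, 2, 2, 2
normalized, _ = frequency_encode(A; normalize=true)  # shares: 0.4, 0.2, 0.4, 0.4, 0.4
```

With normalize=true the per-level values sum to 1, so the encoded column does not depend on the number of rows.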
Training data
In MLJ (or MLJBase), bind an instance of the unsupervised model, model, to data with
mach = machine(model, X)
Here:
X is any table of input features (e.g., a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- features=[]: A list of names of categorical features, given as symbols, to exclude from or include in encoding, according to the value of ignore; or a single symbol (which is treated as a vector with one symbol); or a callable that returns true for features to be included/excluded.
- ignore=true: Whether to exclude or include the features given in features.
- ordered_factor=false: Whether to encode OrderedFactor features or ignore them.
- normalize=false: Whether to use normalized frequencies that sum to 1 over category values or to use raw counts.
- output_type=Float32: The type of the output values. The default is Float32, but you can set it to Float64 or any other type that can hold the frequency values.
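The interaction between features and ignore can be sketched as follows. This is an illustration of the documented semantics only: selected_features is a hypothetical helper, not a function exported by MLJTransforms, and the library's own selection logic may differ in detail.

```julia
# Hypothetical helper mirroring the documented meaning of features and ignore.
function selected_features(categorical_names, features, ignore)
    # Normalise the three accepted forms to a predicate on feature names.
    listed = if features isa Symbol
        name -> name == features
    elseif features isa AbstractVector
        name -> name in features
    else
        features  # assumed to be a callable returning true or false
    end
    # ignore=true excludes the listed features; ignore=false keeps only them.
    return filter(name -> ignore ? !listed(name) : listed(name), categorical_names)
end

categorical = [:A, :C, :D]
all_encoded  = selected_features(categorical, [], true)      # defaults: nothing excluded
without_a    = selected_features(categorical, :A, true)      # exclude A
only_c       = selected_features(categorical, [:C], false)   # include only C
by_predicate = selected_features(categorical, name -> name != :D, false)
```

Under this reading, the defaults features=[] and ignore=true exclude nothing, so every categorical feature is a candidate for encoding.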
Operations
- transform(mach, Xnew): Apply frequency encoding to selected Multiclass or OrderedFactor features of Xnew specified by hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.
Fitted parameters
The fields of fitted_params(mach) are:
- statistic_given_feat_val: A dictionary that maps each level of each encoded feature (a subset of the categorical features of X) to its frequency.
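As a rough picture of what such a dictionary might look like, the sketch below assumes a nested layout (feature name, then level, then frequency) and fills it with the raw counts for columns A and C of the example data on this page. Both the layout and the use of plain strings as keys are assumptions made for illustration; inspect fitted_params(mach).statistic_given_feat_val on a trained machine to see the actual structure.

```julia
# Assumed shape: feature name => (level => frequency).
# The values are the raw counts for columns A and C of the example data.
statistic_given_feat_val = Dict(
    :A => Dict("g" => 2f0, "b" => 1f0, "r" => 2f0),
    :C => Dict("f" => 4f0, "m" => 1f0),
)

# Under this assumption, transforming a column is one lookup per row.
encode(feature, column) = [statistic_given_feat_val[feature][v] for v in column]

encoded_c = encode(:C, ["m", "f", "f"])
```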
Report
The fields of report(mach) are:
- encoded_features: The subset of the categorical features of X that were encoded.
Examples
using MLJ
## Define categorical features
A = ["g", "b", "g", "r", "r",]
B = [1.0, 2.0, 3.0, 4.0, 5.0,]
C = ["f", "f", "f", "m", "f",]
D = [true, false, true, false, true,]
E = [1, 2, 3, 4, 5,]
## Combine into a named tuple
X = (A = A, B = B, C = C, D = D, E = E)
## Coerce A, C, D to Multiclass, B to Continuous, and E to OrderedFactor
X = coerce(X,
    :A => Multiclass,
    :B => Continuous,
    :C => Multiclass,
    :D => Multiclass,
    :E => OrderedFactor,
)
## Check scitype coercions:
schema(X)
## normalize=false yields raw counts, which is what the output shown after transform contains
encoder = FrequencyEncoder(ordered_factor = false, normalize = false)
mach = fit!(machine(encoder, X))
Xnew = transform(mach, X)
julia> Xnew
(A = [2, 1, 2, 2, 2],
B = [1.0, 2.0, 3.0, 4.0, 5.0],
C = [4, 4, 4, 1, 4],
D = [3, 2, 3, 2, 3],
E = CategoricalArrays.CategoricalValue{Int64, UInt32}[1, 2, 3, 4, 5],)
See also TargetEncoder