Utility encoders are categorical encoders meant to be used as preprocessors for other encoders or models.
| Transformer | Brief Description |
|---|---|
| CardinalityReducer | Reduce cardinality of high cardinality categorical features by grouping infrequent categories |
| MissingnessEncoder | Encode missing values of categorical features into new values |
MLJTransforms.CardinalityReducer - Type
CardinalityReducer
A model type for constructing a cardinality reducer, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
CardinalityReducer = @load CardinalityReducer pkg=MLJTransforms
Do model = CardinalityReducer() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in CardinalityReducer(features=...).
CardinalityReducer maps any level of a categorical feature that occurs with frequency < min_frequency into a new level (e.g., "Other"). This is useful when some categorical features have high cardinality and many levels are infrequent. It assumes that the categorical features have raw types in Union{AbstractString, Char, Number}.
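The rule above can be sketched in plain Julia, independent of MLJ. This is only an illustration of the idea, not the MLJTransforms implementation: the helpers count_levels and reduce_levels are hypothetical names, and the sketch assumes an integer threshold compared against raw counts.

```julia
# Hypothetical helper (not part of MLJTransforms): count how often each level occurs.
function count_levels(values)
    counts = Dict{eltype(values), Int}()
    for v in values
        counts[v] = get(counts, v, 0) + 1
    end
    return counts
end

# Hypothetical helper: replace every level seen fewer than `min_count` times.
function reduce_levels(values, min_count::Integer; other = "Other")
    counts = count_levels(values)
    return [counts[v] < min_count ? other : v for v in values]
end

A = [fill("a", 100)..., "b", "b", "b", "c", "d"]
reduced = reduce_levels(A, 3)
```

Here "b" occurs exactly 3 times, which is not below the threshold, so it is kept; only "c" and "d" are merged into "Other".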
Training data
In MLJ (or MLJBase), bind an instance of the unsupervised model, model, to data with
mach = machine(model, X)
Here:
- X is any table of input features (e.g., a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- features=[]: A list of names of categorical features, given as symbols, to exclude from or include in encoding, according to the value of ignore; or a single symbol (treated as a vector with one symbol); or a callable that returns true for the features to be included/excluded.
- ignore=true: Whether to exclude (true) or include (false) the features given in features.
- ordered_factor=false: Whether to encode OrderedFactor features or ignore them.
- min_frequency::Real=3: Any level of a categorical feature that occurs with frequency < min_frequency will be mapped to a new level. May be an integer or a float; this determines whether raw counts (integer) or normalized frequencies (float) are used.
- label_for_infrequent::Dict{<:Type, <:Any}=Dict(AbstractString => "Other", Char => 'O'): A dictionary whose keys are among the types Char, AbstractString, and Number, and whose values give the new level that infrequent levels are mapped into for a column whose raw type is a subtype of the key. By default, the new level is "Other" if the column's raw type subtypes AbstractString, 'O' if it subtypes Char, and the lowest value in the column minus 1 if it subtypes Number.
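As a rough sketch of how min_frequency and the default labels described above could be interpreted, the following plain-Julia snippet uses dispatch to separate the integer and float cases and to pick a label by raw type. The function names is_infrequent and default_label are hypothetical and are not part of MLJTransforms; the library's internals may differ.

```julia
# Hypothetical helper: an integer threshold is compared against the raw count...
is_infrequent(n::Integer, total::Integer, min_frequency::Integer) = n < min_frequency
# ...while a float threshold is compared against the normalized frequency.
is_infrequent(n::Integer, total::Integer, min_frequency::AbstractFloat) =
    n / total < min_frequency

# Hypothetical helper mirroring the documented default labels per raw type.
default_label(col::AbstractVector{<:AbstractString}) = "Other"
default_label(col::AbstractVector{<:Char}) = 'O'
default_label(col::AbstractVector{<:Number}) = minimum(col) - 1

B = [fill(0, 100)..., 1, 2, 3, 4, 4]
label_B = default_label(B)   # lowest value in B minus 1
```

With these definitions, a level seen 3 times out of 105 is kept under min_frequency=3 but would count as infrequent under min_frequency=0.05, since 3/105 is below 0.05.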
Operations
- transform(mach, Xnew): Apply cardinality reduction to the selected Multiclass or OrderedFactor features of Xnew specified by the hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.
Fitted parameters
The fields of fitted_params(mach) are:
- new_cat_given_col_val: A dictionary that maps each level in a categorical feature to a new level (either itself or the new level specified in label_for_infrequent).
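To make the idea of a level-to-level dictionary concrete, here is a minimal single-column sketch in plain Julia. It is an assumption-laden illustration: fit_level_map is a hypothetical name, the exact layout of new_cat_given_col_val in MLJTransforms is not reproduced here, and the new data contains only levels that were seen during fitting.

```julia
# Hypothetical fit step: map each observed level to itself or to `other`.
function fit_level_map(values, min_count::Integer; other = "Other")
    counts = Dict{eltype(values), Int}()
    for v in values
        counts[v] = get(counts, v, 0) + 1
    end
    return Dict(level => (n < min_count ? other : level) for (level, n) in counts)
end

train = [fill("a", 100)..., "b", "b", "b", "c", "d"]
level_map = fit_level_map(train, 3)

# Hypothetical transform step: look up each value of new data in the stored map.
new_values = ["a", "c", "b", "d"]
transformed = [level_map[v] for v in new_values]
```

Separating the two steps reflects the fit/transform split: the map is learned once from the training column and then reused on new data.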
Report
The fields of report(mach) are:
- encoded_features: The subset of the categorical features of X that were encoded.
Examples
import StatsBase.proportionmap
using MLJ
# Define categorical features
A = [ ["a" for i in 1:100]..., "b", "b", "b", "c", "d"]
B = [ [0 for i in 1:100]..., 1, 2, 3, 4, 4]
# Combine into a named tuple
X = (A = A, B = B)
# Coerce A and B to Multiclass
X = coerce(X,
    :A => Multiclass,
    :B => Multiclass
)
encoder = CardinalityReducer(ordered_factor = false, min_frequency=3)
mach = fit!(machine(encoder, X))
Xnew = transform(mach, X)
julia> proportionmap(Xnew.A)
Dict{CategoricalArrays.CategoricalValue{String, UInt32}, Float64} with 3 entries:
"Other" => 0.0190476
"b" => 0.0285714
"a" => 0.952381
julia> proportionmap(Xnew.B)
Dict{CategoricalArrays.CategoricalValue{Int64, UInt32}, Float64} with 2 entries:
0 => 0.952381
-1 => 0.047619
See also FrequencyEncoder
MLJTransforms.MissingnessEncoder - Type
MissingnessEncoder
A model type for constructing a missingness encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
MissingnessEncoder = @load MissingnessEncoder pkg=MLJTransforms
Do model = MissingnessEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in MissingnessEncoder(features=...).
MissingnessEncoder maps any missing level of a categorical feature into a new level (e.g., "Missing"). In this way, missingness is treated as a level in its own right by any subsequent model. This assumes that the categorical features have raw types that are subtypes of Char, AbstractString, or Number.
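The mapping described above can be sketched on plain vectors, without MLJ or categorical arrays. The helpers missing_label and encode_missing are hypothetical names, and the labels chosen simply mirror the defaults documented under Hyper-parameters; this is not the MLJTransforms implementation.

```julia
# Hypothetical helper mirroring the documented default labels per raw type.
missing_label(::Type{<:AbstractString}, col) = "missing"
missing_label(::Type{<:Char}, col) = 'm'
missing_label(::Type{<:Number}, col) = minimum(skipmissing(col)) - 1

# Hypothetical helper: replace every `missing` with the label for the column.
function encode_missing(col)
    T = nonmissingtype(eltype(col))   # raw type with Missing stripped
    return coalesce.(col, missing_label(T, col))
end

A = ["Ben", "John", missing, missing, "Mary", "John", missing]
C = [7, 5, missing, missing, 10, 0, missing]
E = [missing, 'g', 'r', missing, 'r', 'g', 'p']

encoded_A = encode_missing(A)
encoded_C = encode_missing(C)
encoded_E = encode_missing(E)
```

For the numeric column, the lowest non-missing value is 0, so missing entries become -1.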
Training data
In MLJ (or MLJBase), bind an instance of the unsupervised model, model, to data with
mach = machine(model, X)
Here:
- X is any table of input features (e.g., a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- features=[]: A list of names of categorical features, given as symbols, to exclude from or include in encoding, according to the value of ignore; or a single symbol (treated as a vector with one symbol); or a callable that returns true for the features to be included/excluded.
- ignore=true: Whether to exclude (true) or include (false) the features given in features.
- ordered_factor=false: Whether to encode OrderedFactor features or ignore them.
- label_for_missing::Dict{<:Type, <:Any}=Dict(AbstractString => "missing", Char => 'm'): A dictionary whose keys are among the types Char, AbstractString, and Number, and whose values give the new level that missing values are mapped into for a column whose raw type is a subtype of the key. By default, missing values are replaced with "missing" if the column's raw type subtypes AbstractString, with 'm' if it subtypes Char, and with the lowest value in the column minus 1 if it subtypes Number.
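The interplay of features and ignore can be illustrated with a small plain-Julia sketch. The function select_features is a hypothetical name introduced for illustration, and the logic is one reading of the description above rather than the library's own code.

```julia
# Hypothetical helper: decide which categorical features get encoded.
# `features` may be a vector of symbols, a single symbol, or a callable.
function select_features(feature_names::Vector{Symbol}, features, ignore::Bool)
    marked = if features isa Symbol
        name -> name == features      # single symbol acts as a one-element list
    elseif features isa AbstractVector
        name -> name in features
    else
        features                      # assumed callable: name -> Bool
    end
    # ignore = true drops the marked features; ignore = false keeps only them
    return [name for name in feature_names if marked(name) != ignore]
end

feature_names = [:A, :C, :E]
default_selection = select_features(feature_names, [], true)
```

Under this reading, the defaults (features=[], ignore=true) exclude nothing, so every categorical feature is encoded.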
Operations
- transform(mach, Xnew): Apply missingness encoding to the selected Multiclass or OrderedFactor features of Xnew specified by the hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.
Fitted parameters
The fields of fitted_params(mach) are:
- label_for_missing_given_feature: A dictionary that, for each column, maps missing into some value according to label_for_missing.
Report
The fields of report(mach) are:
- encoded_features: The subset of the categorical features of X that were encoded.
Examples
import StatsBase.proportionmap
using MLJ
# Define a table with missing values
Xm = (
    A = categorical(["Ben", "John", missing, missing, "Mary", "John", missing]),
    B = [1.85, 1.67, missing, missing, 1.5, 1.67, missing],
    C = categorical([7, 5, missing, missing, 10, 0, missing]),
    D = [23, 23, 44, 66, 14, 23, 11],
    E = categorical([missing, 'g', 'r', missing, 'r', 'g', 'p'])
)
encoder = MissingnessEncoder()
mach = fit!(machine(encoder, Xm))
Xnew = transform(mach, Xm)
julia> Xnew
(A = ["Ben", "John", "missing", "missing", "Mary", "John", "missing"],
B = Union{Missing, Float64}[1.85, 1.67, missing, missing, 1.5, 1.67, missing],
C = [7, 5, -1, -1, 10, 0, -1],
D = [23, 23, 44, 66, 14, 23, 11],
E = ['m', 'g', 'r', 'm', 'r', 'g', 'p'],)
See also CardinalityReducer