Adding new models to MLJTransforms
In this package, general data transformers do not follow a shared template, but categorical encoders do, because they all work the same way: each categorical level is encoded as a scalar or a vector. Consequently, the most important method when implementing a new categorical encoder is:
MLJTransforms.generic_fit - Function
generic_fit(X,
    features = Symbol[],
    args...;
    ignore::Bool = true,
    ordered_factor::Bool = false,
    feature_mapper,
    kwargs...,
)
Given a feature_mapper (see definition below), this method applies feature_mapper across a specified subset of categorical columns in X and returns a dictionary whose keys are the feature names and whose values are the corresponding level-to-value mappings produced by feature_mapper.
In essence, it spares you the effort of looping over each column and applying feature_mapper manually, as well as handling the feature selection logic.
Arguments
- X: A table where the elements of the categorical features have scitypes Multiclass or OrderedFactor
- features=[]: A list of names of categorical features, given as symbols, to exclude from or include in encoding according to the value of ignore; or a single symbol (treated as a vector with one symbol); or a callable that returns true for features to be included/excluded.
- ignore=true: Whether to exclude (true) or include (false) the features given in features
- ordered_factor=false: Whether to encode OrderedFactor features or ignore them
- feature_mapper: A function that, for a given vector (e.g., one corresponding to a categorical column of the dataset X), produces a mapping from each category level name in this vector to a scalar or vector according to the specified transformation logic.
Note
- Any additional arguments (whether keyword or not) provided to this function are passed to the feature_mapper function, which is helpful when feature_mapper requires additional arguments to compute the mapping (e.g., hyperparameters).
Returns
- mapping_per_feat_level: Maps each level for each feature in a subset of the categorical features of X into a scalar or a vector.
- encoded_features: The subset of the categorical features of X that were encoded
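To make the role of feature_mapper concrete, here is a minimal standalone sketch in plain Julia. It does not call MLJTransforms: frequency_mapper and fit_columns are hypothetical names invented for illustration, the table is assumed to be a NamedTuple of plain string vectors rather than categorical columns, and the scitype checks and the ordered_factor option are left out.

```julia
# Hypothetical feature mapper: maps each level found in a column to its count.
function frequency_mapper(col::AbstractVector)
    counts = Dict{eltype(col), Int}()
    for level in col
        counts[level] = get(counts, level, 0) + 1
    end
    return counts
end

# Simplified stand-in for the looping and feature selection described above.
# With ignore = true the listed features are excluded; with ignore = false
# only the listed features are used. (Assumption: X is a NamedTuple of vectors.)
function fit_columns(X::NamedTuple, features::Vector{Symbol} = Symbol[];
                     ignore::Bool = true, feature_mapper)
    selected = [name for name in keys(X) if (name in features) != ignore]
    return Dict(name => feature_mapper(X[name]) for name in selected)
end

X = (color = ["red", "blue", "red"], size = ["S", "M", "S"], id = ["a", "b", "c"])

# Encode every column except :id.
mappings = fit_columns(X, [:id]; feature_mapper = frequency_mapper)
```

The result holds one level-to-value dictionary per selected feature, which is the shape of mapping that a transform step can then consume.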
followed by:
MLJTransforms.generic_transform - Function
generic_transform(
    X,
    mapping_per_feat_level;
    single_feat::Bool = true,
    ignore_unknown::Bool = false,
    use_levelnames::Bool = false,
    custom_levels = nothing,
    ensure_categorical::Bool = false,
)
Apply a per-level feature mapping to selected categorical columns in X, returning a new table of the same type.
Arguments
- X: A table where the elements of the categorical features have scitypes Multiclass or OrderedFactor
- mapping_per_feat_level::Dict{Symbol,Dict}: A dict whose keys are feature names (Symbol) and whose values are themselves dictionaries mapping each observed level to either a scalar (if single_feat=true) or a fixed-length vector (if single_feat=false). Only columns whose names appear in mapping_per_feat_level are transformed; others pass through unchanged.
- single_feat::Bool=true: If true, each input level is mapped to a single scalar feature; if false, each input level is mapped to a length-k vector, producing k output columns.
- ignore_unknown::Bool=false: If false, novel levels in X (not seen during fit) will raise an error; if true, novel levels will be left unchanged (identity mapping).
- use_levelnames::Bool=false: When single_feat=false, controls naming of the expanded columns:
  - true: use actual level names (e.g. :color_red, :color_blue),
  - false: use numeric indices (e.g. :color_1, :color_2).
- custom_levels::Union{Nothing,Vector}: If not nothing, overrides the names of levels used to generate feature names when single_feat=false.
- ensure_categorical::Bool=false: Only applies when single_feat=true. If true, preserves the categorical type of the column after recoding (e.g., the feature should still be recognized as Multiclass after transformation).
Returns
A new table, potentially similar to X, but with categorical columns transformed according to mapping_per_feat_level.
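The scalar case (single_feat=true) and the ignore_unknown behaviour described above can be sketched in plain Julia as follows. This is an illustration under stated assumptions, not the package's implementation: transform_columns is a hypothetical name, the table is assumed to be a NamedTuple of plain vectors, and the vector-valued case, column renaming, and ensure_categorical are omitted.

```julia
# Illustrative stand-in for the scalar case; not MLJTransforms code.
function transform_columns(X::NamedTuple, mapping_per_feat_level::Dict;
                           ignore_unknown::Bool = false)
    new_columns = map(keys(X)) do name
        col = X[name]
        # Columns without a mapping pass through unchanged.
        haskey(mapping_per_feat_level, name) || return col
        mapping = mapping_per_feat_level[name]
        return map(col) do level
            haskey(mapping, level) && return mapping[level]
            # Novel level: keep it as is, or fail, depending on ignore_unknown.
            ignore_unknown && return level
            error("Level $level of feature $name was not seen during fit")
        end
    end
    # Rebuild a table with the same column names, in the same order.
    return NamedTuple{keys(X)}(new_columns)
end

mapping = Dict(:color => Dict("red" => 2, "blue" => 1))
X = (color = ["red", "blue", "red"], id = ["a", "b", "c"])
Xt = transform_columns(X, mapping)

# "green" was never seen when the mapping was built.
Xnew = (color = ["red", "green"], id = ["d", "e"])
Xnew_t = transform_columns(Xnew, mapping; ignore_unknown = true)
```

Calling transform_columns(Xnew, mapping) with the default ignore_unknown=false raises an error instead, mirroring the documented default.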
All categorical encoders in this package are implemented using these two methods. For an example, see the FrequencyEncoder source code.
Moreover, you should implement the MLJModelInterface for any method you provide in this package. Check the interface docs and/or the existing interfaces in this package (e.g., the interface for the FrequencyEncoder).