Adding new models to MLJTransforms
In this package, general data transformers do not follow a specific generic template. Categorical encoders do, because they all work the same way: each categorical level is encoded into a scalar or a vector. For this reason, the most important method when implementing a new categorical encoder is:
MLJTransforms.generic_fit - Function
generic_fit(X,
    features = Symbol[],
    args...;
    ignore::Bool = true,
    ordered_factor::Bool = false,
    feature_mapper,
    kwargs...,
)
Given a feature_mapper (see definition below), this method applies feature_mapper across a specified subset of categorical columns in X. It returns a dictionary whose keys are the feature names and whose values are the corresponding level-to-value mappings produced by feature_mapper.
In essence, it spares you the effort of looping over each column and applying the feature_mapper function manually, and it also handles the feature selection logic.
Arguments
- X: A table where the elements of the categorical features have scitypes Multiclass or OrderedFactor
- features=[]: A list of names of categorical features, given as symbols, to exclude from or include in encoding according to the value of ignore; or a single symbol (which is treated as a vector with one symbol); or a callable that returns true for features to be included/excluded.
- ignore=true: Whether to exclude or include the features given in features
- ordered_factor=false: Whether to encode OrderedFactor features or ignore them
- feature_mapper: A function that, for a given vector (eg, corresponding to a categorical column from the dataset X), produces a mapping from each category level name in this vector to a scalar or vector according to the specified transformation logic.
Note
- Any additional arguments (whether keyword or not) provided to this function are passed to the feature_mapper function, which is helpful when feature_mapper requires additional arguments to compute the mapping (eg, hyperparameters).
Returns
- mapping_per_feat_level: Maps each level for each feature in a subset of the categorical features of X into a scalar or a vector.
- encoded_features: The subset of the categorical features of X that were encoded
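To make the division of labour concrete, the following plain-Julia sketch shows a frequency-style feature_mapper together with the per-column loop that generic_fit is described as automating. It is not the package's implementation: the names frequency_mapper and manual_fit are hypothetical, plain string vectors stand in for categorical columns, and the exact positional arguments that generic_fit forwards to feature_mapper are not assumed here (hence the args... catch-all).

```julia
# Hypothetical feature_mapper: maps each level observed in `col` to its count,
# or to its proportion when normalize = true. Extra positional arguments are
# accepted and ignored, because this sketch makes no assumption about what
# generic_fit forwards besides the column itself.
function frequency_mapper(col, args...; normalize::Bool = false)
    counts = Dict{eltype(col), Float64}()
    for level in col
        counts[level] = get(counts, level, 0.0) + 1.0
    end
    normalize || return counts
    n = length(col)
    return Dict(level => c / n for (level, c) in counts)
end

# Hypothetical stand-in for the loop that generic_fit spares you from writing:
# apply the mapper to each selected column and collect the results by name.
function manual_fit(X, features; feature_mapper, kwargs...)
    mapping_per_feat_level = Dict{Symbol, Dict}()
    for name in features
        col = getproperty(X, name)
        mapping_per_feat_level[name] = feature_mapper(col; kwargs...)
    end
    return mapping_per_feat_level, collect(features)
end

# A column table written as a NamedTuple of vectors.
X = (
    color = ["red", "blue", "red", "red"],
    size  = ["S", "M", "M", "S"],
    price = [1.0, 2.0, 3.0, 4.0],
)

mapping, encoded = manual_fit(X, [:color, :size]; feature_mapper = frequency_mapper)
```

Here the keyword normalize plays the role of a hyperparameter: passing normalize = true to manual_fit forwards it to the mapper, mirroring how the Note above describes extra arguments reaching feature_mapper.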
followed by:
MLJTransforms.generic_transform - Function
generic_transform(
    X,
    mapping_per_feat_level;
    single_feat::Bool = true,
    ignore_unknown::Bool = false,
    use_levelnames::Bool = false,
    custom_levels = nothing,
    ensure_categorical::Bool = false,
)
Apply a per-level feature mapping to selected categorical columns in X, returning a new table of the same type.
Arguments
- X: A table where the elements of the categorical features have scitypes Multiclass or OrderedFactor
- mapping_per_feat_level::Dict{Symbol,Dict}: A dict whose keys are feature names (Symbol) and values are themselves dictionaries mapping each observed level to either a scalar (if single_feat=true) or a fixed-length vector (if single_feat=false). Only columns whose names appear in mapping_per_feat_level are transformed; others pass through unchanged.
- single_feat::Bool=true: If true, each input level is mapped to a single scalar feature; if false, each input level is mapped to a length-k vector, producing k output columns.
- ignore_unknown::Bool=false: If false, novel levels in X (not seen during fit) will raise an error; if true, novel levels will be left unchanged (identity mapping).
- use_levelnames::Bool=false: When single_feat=false, controls naming of the expanded columns: true uses actual level names (e.g. :color_red, :color_blue); false uses numeric indices (e.g. :color_1, :color_2).
- custom_levels::Union{Nothing,Vector}: If not nothing, overrides the names of levels used to generate feature names when single_feat=false.
- ensure_categorical::Bool=false: Only applies when single_feat=true. If true, preserves the categorical type of the column after recoding (eg, the feature should still be recognized as Multiclass after transformation)
Returns
A new table, similar to X, with the categorical columns transformed according to mapping_per_feat_level.
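The scalar case (single_feat=true) and the ignore_unknown behaviour described above can be illustrated with a small plain-Julia sketch. The function manual_transform is hypothetical and only mimics the documented semantics on a NamedTuple of vectors; it does not call or reproduce generic_transform itself.

```julia
# Hypothetical sketch of the documented semantics for single_feat = true:
# columns named in the mapping are recoded level by level, all other columns
# pass through unchanged.
function manual_transform(X::NamedTuple, mapping_per_feat_level; ignore_unknown::Bool = false)
    new_cols = map(keys(X)) do name
        col = getproperty(X, name)
        # Columns without a mapping are returned as they are.
        haskey(mapping_per_feat_level, name) || return col
        level_map = mapping_per_feat_level[name]
        return map(col) do level
            if haskey(level_map, level)
                level_map[level]          # seen during fit: replace by its value
            elseif ignore_unknown
                level                     # novel level: identity mapping
            else
                error("Level $level of feature $name was not seen during fit")
            end
        end
    end
    return NamedTuple{keys(X)}(new_cols)
end

X = (color = ["red", "blue"], price = [1.0, 2.0])
mapping = Dict(:color => Dict("red" => 3.0, "blue" => 1.0))

Xt = manual_transform(X, mapping)

# A table containing a level that the mapping has never seen.
Xnew = (color = ["green"], price = [5.0])
Xkept = manual_transform(Xnew, mapping; ignore_unknown = true)
```

With ignore_unknown left at false, the call on Xnew raises an error instead of passing "green" through, which matches the default described in the argument list.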
All categorical encoders in this package are implemented using these two methods. For an example, see the FrequencyEncoder source code.
Moreover, you should implement the MLJModelInterface for any model you add to this package. Check the interface docs and/or the existing interfaces in this package (eg, the interface for the FrequencyEncoder).