Adding new models to MLJTransforms

Data transformers in this package do not follow a generic template. Categorical encoders do, because they all work the same way: each categorical level is mapped to a scalar or a vector. Consequently, the most important method when implementing a new categorical encoder is:

MLJTransforms.generic_fit (Function)
generic_fit(X,
    features = Symbol[],
    args...;
    ignore::Bool = true,
    ordered_factor::Bool = false,
    feature_mapper,
    kwargs...,
)

Given a feature_mapper (see definition below), this method applies feature_mapper across a specified subset of categorical columns in X and returns a dictionary whose keys are the feature names, and each value is the corresponding level‑to‑value mapping produced by feature_mapper.

In essence, it spares you the effort of looping over each column and applying feature_mapper manually, and it also handles the feature selection logic.

Arguments

  • X: A table where the elements of the categorical features have scitypes Multiclass or OrderedFactor

  • features=[]: A list of names of categorical features, given as symbols, to exclude from or include in encoding according to the value of ignore. It may also be a single symbol (treated as a vector with one symbol) or a callable that returns true for the features to be included/excluded.

  • ignore=true: Whether to exclude or include the features given in features

  • ordered_factor=false: Whether to encode OrderedFactor features or ignore them

  • feature_mapper: A function that, for a given vector (e.g., a categorical column of the dataset X), produces a mapping from each category level name in this vector to a scalar or vector according to the specified transformation logic.

Note

  • Any additional arguments (positional or keyword) provided to this function are passed on to feature_mapper. This is helpful when feature_mapper needs extra arguments, such as hyperparameters, to compute the mapping.

Returns

  • mapping_per_feat_level: Maps each level for each feature in a subset of the categorical features of X into a scalar or a vector.
  • encoded_features: The subset of the categorical features of X that were encoded
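
To make the role of feature_mapper more concrete, here is a minimal sketch in plain Julia. The names frequency_mapper, toy_fit, and the normalize keyword are hypothetical and not part of MLJTransforms. toy_fit only mimics the shape of the result described above and the forwarding of extra keyword arguments; it treats every column of a NamedTuple as categorical and uses plain vectors to stay dependency-free, so it is not the package's implementation.

```julia
# Hypothetical feature_mapper: maps each level of a column to its frequency.
# `normalize` stands in for a hyperparameter forwarded through kwargs.
function frequency_mapper(col; normalize::Bool = false)
    counts = Dict{eltype(col),Float64}()
    for level in col
        counts[level] = get(counts, level, 0.0) + 1.0
    end
    normalize || return counts
    n = length(col)
    return Dict(level => c / n for (level, c) in counts)
end

# Toy stand-in for the looping and feature selection described above.
# Assumption: X is a column table given as a NamedTuple of vectors.
function toy_fit(X, features::Vector{Symbol};
                 ignore::Bool = true, feature_mapper, kwargs...)
    # ignore = true  -> encode every column NOT listed in `features`
    # ignore = false -> encode only the columns listed in `features`
    selected = ignore ? [n for n in keys(X) if !(n in features)] :
                        [n for n in keys(X) if n in features]
    mapping_per_feat_level = Dict{Symbol,Dict}(
        name => feature_mapper(getproperty(X, name); kwargs...)
        for name in selected
    )
    return mapping_per_feat_level, selected
end

X = (color = ["red", "blue", "red", "red"], shape = ["sq", "sq", "tri", "sq"])
mapping, encoded = toy_fit(X, [:shape]; ignore = true,
                           feature_mapper = frequency_mapper, normalize = true)
# mapping[:color] == Dict("red" => 0.75, "blue" => 0.25); encoded == [:color]
```

With this pattern, the encoder-specific logic lives entirely in the mapper, while column selection and looping are shared across encoders.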

followed by:

MLJTransforms.generic_transform (Function)
generic_transform(
    X,
    mapping_per_feat_level;
    single_feat::Bool = true,
    ignore_unknown::Bool = false,
    use_levelnames::Bool = false,
    custom_levels = nothing,
    ensure_categorical::Bool = false,
)

Apply a per‐level feature mapping to selected categorical columns in X, returning a new table of the same type.

Arguments

  • X: A table where the elements of the categorical features have scitypes Multiclass or OrderedFactor

  • mapping_per_feat_level::Dict{Symbol,Dict}: A dict whose keys are feature names (Symbol) and values are themselves dictionaries mapping each observed level to either a scalar (if single_feat=true) or a fixed‐length vector (if single_feat=false). Only columns whose names appear in mapping_per_feat_level are transformed; others pass through unchanged.

  • single_feat::Bool=true: If true, each input level is mapped to a single scalar feature; if false, each input level is mapped to a length‑k vector, producing k output columns.

  • ignore_unknown::Bool=false: If false, novel levels in X (not seen during fit) will raise an error; if true, novel levels will be left unchanged (identity mapping).

  • use_levelnames::Bool=false: When single_feat=false, controls the naming of the expanded columns. If true, use the actual level names (e.g. :color_red, :color_blue); if false, use numeric indices (e.g. :color_1, :color_2).

  • custom_levels::Union{Nothing,Vector}: If not nothing, overrides the names of levels used to generate feature names when single_feat=false.

  • ensure_categorical::Bool=false: Applies only when single_feat=true. If true, preserves the categorical type of the column after recoding (e.g., the feature is still recognized as Multiclass after transformation).

Returns

A new table, potentially similar to X, but with categorical columns transformed according to mapping_per_feat_level.
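
The behaviour described by these arguments can be illustrated with a small plain-Julia sketch. The helpers toy_transform and expanded_names are hypothetical and not part of MLJTransforms. They assume a column table given as a NamedTuple of plain vectors and only imitate the documented semantics of ignore_unknown (scalar case) and use_levelnames (column naming in the vector case).

```julia
# Hypothetical helper mimicking the scalar case (single_feat = true).
function toy_transform(X, mapping_per_feat_level; ignore_unknown::Bool = false)
    names = keys(X)
    cols = map(names) do name
        col = getproperty(X, name)
        # Columns without a mapping pass through unchanged.
        haskey(mapping_per_feat_level, name) || return col
        level_map = mapping_per_feat_level[name]
        map(col) do level
            if haskey(level_map, level)
                level_map[level]
            elseif ignore_unknown
                level   # novel level: identity mapping
            else
                error("Unseen level $(repr(level)) in feature $(repr(name))")
            end
        end
    end
    return NamedTuple{names}(cols)
end

# Hypothetical helper for the vector case (single_feat = false): names the
# k output columns either by numeric index or by level name.
function expanded_names(feature::Symbol, levels; use_levelnames::Bool = false)
    return [Symbol(feature, "_", use_levelnames ? levels[i] : i)
            for i in eachindex(levels)]
end

X = (color = ["red", "blue", "green"], size = [1.0, 2.0, 3.0])
mapping = Dict(:color => Dict("red" => 0.75, "blue" => 0.25))

Xt = toy_transform(X, mapping; ignore_unknown = true)
# Xt.color == [0.75, 0.25, "green"]; Xt.size is passed through unchanged

expanded_names(:color, ["red", "blue"])                         # [:color_1, :color_2]
expanded_names(:color, ["red", "blue"]; use_levelnames = true)  # [:color_red, :color_blue]
```

Calling toy_transform(X, mapping) with the default ignore_unknown = false throws an error here, because "green" has no entry in the mapping.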


All categorical encoders in this package are implemented using these two methods. For an example, see the FrequencyEncoder source code.

Moreover, you should implement the MLJModelInterface for any model you add to this package. Check the interface docs and/or the existing interfaces in this package (e.g., the interface for the FrequencyEncoder).
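
As a rough orientation, the skeleton below shows the general shape of an unsupervised model under MLJModelInterface: a model struct plus fit and transform methods. ToyFrequencyEncoder is a hypothetical type invented for this sketch, and the method bodies use plain Julia on a NamedTuple of vectors. An encoder in this package would presumably delegate to generic_fit and generic_transform instead, so treat this as an illustration of the interface and not of the package's actual code.

```julia
import MLJModelInterface
const MMI = MLJModelInterface

# Hypothetical encoder type, for illustration only.
mutable struct ToyFrequencyEncoder <: MMI.Unsupervised
    features::Vector{Symbol}   # names of the columns to encode
end

# fit returns the triple (fitresult, cache, report).
function MMI.fit(model::ToyFrequencyEncoder, verbosity::Int, X)
    mapping = Dict{Symbol,Dict}()
    for name in model.features
        col = getproperty(X, name)
        counts = Dict{eltype(col),Int}()
        for level in col
            counts[level] = get(counts, level, 0) + 1
        end
        mapping[name] = counts
    end
    fitresult = mapping
    cache = nothing
    report = NamedTuple()
    return fitresult, cache, report
end

# transform applies the learned mapping to new data.
function MMI.transform(model::ToyFrequencyEncoder, fitresult, Xnew)
    names = keys(Xnew)
    cols = map(names) do name
        col = getproperty(Xnew, name)
        haskey(fitresult, name) ? [fitresult[name][v] for v in col] : col
    end
    return NamedTuple{names}(cols)
end

X = (color = ["red", "blue", "red"], size = [1.0, 2.0, 3.0])
model = ToyFrequencyEncoder([:color])
fitresult, _, _ = MMI.fit(model, 0, X)
Xt = MMI.transform(model, fitresult, X)
# Xt == (color = [2, 1, 2], size = [1.0, 2.0, 3.0])
```

A complete interface typically also declares model metadata (for example, input and output scitypes); see the MLJModelInterface documentation for those details.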