ContrastEncoder
A model type for constructing a contrast encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
ContrastEncoder = @load ContrastEncoder pkg=MLJTransforms
Do model = ContrastEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in ContrastEncoder(features=...).
ContrastEncoder implements the following contrast encoding methods for categorical features: dummy, sum, backward/forward difference, and Helmert coding. More generally, users can specify a custom contrast or hypothesis matrix, and each feature can be encoded using a different method.
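For orientation, the named schemes can be written as k × (k-1) contrast matrices, one row per level. The sketch below uses the textbook forms in plain Julia. The function names are illustrative and not part of MLJTransforms, and the package's own conventions for reference level, sign, or scaling may differ. For dummy coding, the last level is assumed to be the reference.

```julia
using LinearAlgebra

# Dummy coding: one indicator column per non-reference level.
# Assumption: the last level is the reference (all-zero row).
dummy_contrast(k) = Matrix(1.0I, k, k - 1)

# Sum coding: like dummy coding, but the last level's row is all -1.
function sum_contrast(k)
    C = Matrix(1.0I, k, k - 1)
    C[end, :] .= -1.0
    return C
end

# Helmert coding: column j compares level j+1 with all earlier levels.
function helmert_contrast(k)
    C = zeros(k, k - 1)
    for j in 1:k-1
        C[1:j, j] .= -1.0
        C[j+1, j] = j
    end
    return C
end

# Backward difference coding: column j compares level j+1 with level j.
function backward_diff_contrast(k)
    C = zeros(k, k - 1)
    for j in 1:k-1
        C[1:j, j] .= -(k - j) / k
        C[j+1:k, j] .= j / k
    end
    return C
end

# Forward difference coding is the negation of the backward form.
forward_diff_contrast(k) = -backward_diff_contrast(k)

helmert_contrast(4)
```

Each row of such a matrix is the numeric vector that replaces the corresponding level, so a feature with k levels becomes k-1 numeric columns.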
Training data
In MLJ (or MLJBase), bind an instance of the unsupervised model to data with
mach = machine(model, X)
Here:
- X is any table of input features (e.g., a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.
Train the machine using fit!(mach, rows=...).
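Columns of plain strings or integers usually do not have a categorical scitype out of the box. A minimal sketch of checking and fixing this with schema and coerce from MLJ follows. The table and its column names are invented for illustration.

```julia
using MLJ

# Hypothetical table: grade holds plain strings, so its scitype is Textual.
X = (
    grade = ["low", "high", "medium", "low"],
    score = [1.2, 3.4, 2.2, 0.7],
)

schema(X)   # inspect the scitype of each column

# Coerce the string column to a categorical scitype before encoding.
Xc = coerce(X, :grade => Multiclass)
schema(Xc)
```

After coercion, Xc can be bound to the model with machine(model, Xc).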
Hyper-parameters
- features=[]: A list of names of categorical features, given as symbols, to exclude from or include in encoding, according to the value of ignore; or a single symbol (which is treated as a vector with one symbol); or a callable that returns true for features to be included/excluded.
- mode=:dummy: The type of encoding to use. Can be one of :contrast, :dummy, :sum, :backward_diff, :forward_diff, :helmert or :hypothesis. If ignore=false (features to be encoded are listed explicitly in features), then this can be a vector of the same length as features to specify a different contrast encoding scheme for each feature.
- buildmatrix=nothing: A function or other callable with signature buildmatrix(colname, k), where colname is the name of the feature and k is the number of its levels, and which returns a contrast or hypothesis matrix with row/column ordering consistent with the ordering of levels(col). Only relevant if mode is :contrast or :hypothesis.
- ignore=true: Whether to exclude (true) or include (false) the features given in features.
- ordered_factor=false: Whether to encode OrderedFactor features or ignore them.
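As an illustration of the buildmatrix hyper-parameter, here is a sketch of a callable with the (colname, k) signature that returns a "simple coding" contrast matrix, a scheme not among the built-in modes. The name simple_buildmatrix is hypothetical, and the k × (k-1) shape (one row per level) is an assumption about what mode=:contrast expects.

```julia
# Hypothetical buildmatrix callable.
# colname: name of the feature (unused here, but could select a scheme per feature)
# k:       number of levels of that feature
function simple_buildmatrix(colname, k)
    # Simple coding: dummy coding against the first level, shifted by -1/k
    # so that every column sums to zero.
    C = fill(-1 / k, k, k - 1)
    for j in 1:k-1
        C[j+1, j] += 1.0
    end
    return C
end

simple_buildmatrix(:name, 3)

# Assumed usage with the hyper-parameters described above:
# encoder = ContrastEncoder(
#     features = [:name],
#     ignore = false,
#     mode = :contrast,
#     buildmatrix = simple_buildmatrix,
# )
```

Because the callable receives the feature name, a single function can return a different matrix for each feature if needed.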
Operations
- transform(mach, Xnew): Apply contrast encoding to the selected Multiclass or OrderedFactor features of Xnew specified by hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.
Fitted parameters
The fields of fitted_params(mach) are:
- vector_given_value_given_feature: A dictionary that maps each level of each encoded categorical feature of X to the vector that encodes it.
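To make the role of this dictionary concrete, the sketch below applies a level-to-vector lookup to one column in plain Julia. The nested Dict layout (feature name, then level, then vector) is an assumption suggested by the field name, not a confirmed description of the package internals. The values are the Helmert vectors for a feature with levels 1, 5, 7 and 10.

```julia
# Assumed layout: feature name => (level => encoding vector).
vector_given_value_given_feature = Dict(
    :favnum => Dict(
        1  => [-1.0, -1.0, -1.0],
        5  => [ 1.0, -1.0, -1.0],
        7  => [ 0.0,  2.0, -1.0],
        10 => [ 0.0,  0.0,  3.0],
    ),
)

favnum = [7, 5, 10, 1]
lookup = vector_given_value_given_feature[:favnum]

# Replace each observed level by its vector:
# one row per observation, one column per contrast.
encoded = permutedims(reduce(hcat, [lookup[v] for v in favnum]))
```

Each column of encoded then becomes one new numeric feature in the transformed table.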
Report
The fields of report(mach) are:
- encoded_features: The subset of the categorical features of X that were encoded.
Examples
using MLJ

# Load the model type (requires MLJTransforms to be installed)
ContrastEncoder = @load ContrastEncoder pkg=MLJTransforms

# Define categorical dataset
X = (
    name   = categorical(["Ben", "John", "Mary", "John"]),
    height = [1.85, 1.67, 1.5, 1.67],
    favnum = categorical([7, 5, 10, 1]),
    age    = [23, 23, 14, 23],
)

# Check scitype coercions:
schema(X)

# Encode :name with dummy coding and :favnum with Helmert coding
encoder = ContrastEncoder(
    features = [:name, :favnum],
    ignore = false,
    mode = [:dummy, :helmert],
)
mach = fit!(machine(encoder, X))
Xnew = transform(mach, X)

julia> Xnew
(name_John = [1.0, 0.0, 0.0, 0.0],
 name_Mary = [0.0, 1.0, 0.0, 1.0],
 height = [1.85, 1.67, 1.5, 1.67],
 favnum_5 = [0.0, 1.0, 0.0, -1.0],
 favnum_7 = [2.0, -1.0, 0.0, -1.0],
 favnum_10 = [-1.0, -1.0, 3.0, -1.0],
 age = [23, 23, 14, 23],)
See also OneHotEncoder