OneHotEncoder
OneHotEncoder
A model type for constructing a one-hot encoder, based on MLJModels.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
OneHotEncoder = @load OneHotEncoder pkg=MLJModels
Do model = OneHotEncoder()
to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in OneHotEncoder(features=...)
.
Use this model to one-hot encode the Multiclass
and OrderedFactor
features (columns) of some table, leaving other columns unchanged.
New data to be transformed may lack features present in the fit data, but no new features can be present.
Warning: This transformer assumes that levels(col)
for any Multiclass
or OrderedFactor
column, col
, is the same for training data and new data to be transformed.
To ensure all features are transformed into Continuous
features, or dropped, use ContinuousEncoder
instead.
Training data
In MLJ or MLJBase, bind an instance model
to data with
mach = machine(model, X)
where
X
: any Tables.jl compatible table. Columns can be of mixed type but only those with element scitypeMulticlass
orOrderedFactor
can be encoded. Check column scitypes withschema(X)
.
Train the machine using fit!(mach, rows=...)
.
Hyper-parameters
features
: a vector of symbols (column names). If empty (default) then allMulticlass
andOrderedFactor
features are encoded. Otherwise, encoding is further restricted to the specified features (ignore=false
) or the unspecified features (ignore=true
). This default behavior can be modified by theordered_factor
flag.ordered_factor=false
: whentrue
,OrderedFactor
features are universally excludeddrop_last=true
: whether to drop the column corresponding to the final class of encoded features. For example, a three-class feature is spawned into three new features ifdrop_last=false
, but just two features otherwise.
Fitted parameters
The fields of fitted_params(mach)
are:
all_features
: names of all features encountered in trainingfitted_levels_given_feature
: dictionary of the levels associated with each feature encoded, keyed on the feature nameref_name_pairs_given_feature
: dictionary of pairsr => ftr
(such as0x00000001 => :grad__A
) wherer
is a CategoricalArrays.jl reference integer representing a level, andftr
the corresponding new feature name; the dictionary is keyed on the names of features that are encoded
Report
The fields of report(mach)
are:
features_to_be_encoded
: names of input features to be encodednew_features
: names of all output features
Example
using MLJ
X = (name=categorical(["Danesh", "Lee", "Mary", "John"]),
grade=categorical(["A", "B", "A", "C"], ordered=true),
height=[1.85, 1.67, 1.5, 1.67],
n_devices=[3, 2, 4, 3])
julia> schema(X)
┌───────────┬──────────────────┐
│ names │ scitypes │
├───────────┼──────────────────┤
│ name │ Multiclass{4} │
│ grade │ OrderedFactor{3} │
│ height │ Continuous │
│ n_devices │ Count │
└───────────┴──────────────────┘
hot = OneHotEncoder(drop_last=true)
mach = fit!(machine(hot, X))
W = transform(mach, X)
julia> schema(W)
┌──────────────┬────────────┐
│ names │ scitypes │
├──────────────┼────────────┤
│ name__Danesh │ Continuous │
│ name__John │ Continuous │
│ name__Lee │ Continuous │
│ grade__A │ Continuous │
│ grade__B │ Continuous │
│ height │ Continuous │
│ n_devices │ Count │
└──────────────┴────────────┘
See also ContinuousEncoder
.