MissingnessEncoder
A model type for constructing a missingness encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
MissingnessEncoder = @load MissingnessEncoder pkg=MLJTransforms
Do model = MissingnessEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in MissingnessEncoder(features=...).
MissingnessEncoder maps any missing level of a categorical feature into a new level (e.g., "Missing"). Any subsequent model will then treat missingness as a level of its own. This assumes that the categorical features have raw types that subtype Char, AbstractString, or Number.
Training data
In MLJ (or MLJBase), bind an instance of the unsupervised model, model, to data with
mach = machine(model, X)
Here:
- X is any table of input features (e.g., a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- features=[]: A list of names of categorical features, given as symbols, to exclude from or include in encoding, according to the value of ignore; or a single symbol (treated as a vector with one symbol); or a callable that returns true for features to be included/excluded
- ignore=true: Whether to exclude (true) or include (false) the features given in features
- ordered_factor=false: Whether to encode OrderedFactor features or ignore them
- label_for_missing::Dict{<:Type, <:Any}() = Dict(AbstractString => "missing", Char => 'm'): A dictionary whose keys are among the types Char, AbstractString, and Number, and whose values give the new level that missing is mapped to for a column whose raw type subtypes the key. By default, if the raw type of the column subtypes AbstractString, missing values are replaced with "missing"; if it subtypes Char, the new value is 'm'; and if it subtypes Number, the new value is the lowest value in the column minus 1.
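The default replacement rule described above can be sketched in plain Julia. This is only an illustration under stated assumptions, not the MLJTransforms implementation: the helper names sketch_missing_label and sketch_encode are hypothetical, and the sketch works on ordinary vectors rather than categorical columns.

```julia
# Hypothetical helper (not part of MLJTransforms): choose the label that
# replaces `missing`, following the default rule described above.
function sketch_missing_label(column::AbstractVector)
    T = nonmissingtype(eltype(column))   # raw type with `Missing` stripped
    if T <: AbstractString
        return "missing"
    elseif T <: Char
        return 'm'
    elseif T <: Number
        # lowest observed value minus 1, so the label cannot clash with data
        return minimum(skipmissing(column)) - 1
    else
        throw(ArgumentError("unsupported raw type $T"))
    end
end

# Hypothetical helper: replace every `missing` entry with the chosen label.
sketch_encode(column) = coalesce.(column, sketch_missing_label(column))

name_col   = ["Ben", "John", missing, "Mary"]
count_col  = [7, 5, missing, 10, 0]
letter_col = [missing, 'g', 'r']

sketch_encode(name_col)    # "missing" takes the place of `missing`
sketch_encode(count_col)   # the minimum is 0, so `missing` becomes -1
sketch_encode(letter_col)  # 'm' takes the place of `missing`
```

Using the column minimum minus 1 for numeric raw types keeps the new level distinct from every observed value, which is presumably why the default is computed per column rather than fixed in the dictionary.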
Operations
- transform(mach, Xnew): Apply missingness encoding to the selected Multiclass or OrderedFactor features of Xnew specified by hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.
Fitted parameters
The fields of fitted_params(mach) are:
- label_for_missing_given_feature: A dictionary that, for each column, maps missing into some value according to label_for_missing
Report
The fields of report(mach) are:
- encoded_features: The subset of the categorical features of X that were encoded
Examples
import StatsBase.proportionmap
using MLJ
## Define a table with missing values
Xm = (
    A = categorical(["Ben", "John", missing, missing, "Mary", "John", missing]),
    B = [1.85, 1.67, missing, missing, 1.5, 1.67, missing],
    C = categorical([7, 5, missing, missing, 10, 0, missing]),
    D = [23, 23, 44, 66, 14, 23, 11],
    E = categorical([missing, 'g', 'r', missing, 'r', 'g', 'p'])
)
encoder = MissingnessEncoder()
mach = fit!(machine(encoder, Xm))
Xnew = transform(mach, Xm)
julia> Xnew
(A = ["Ben", "John", "missing", "missing", "Mary", "John", "missing"],
 B = Union{Missing, Float64}[1.85, 1.67, missing, missing, 1.5, 1.67, missing],
 C = [7, 5, -1, -1, 10, 0, -1],
 D = [23, 23, 44, 66, 14, 23, 11],
 E = ['m', 'g', 'r', 'm', 'r', 'g', 'p'],)
See also CardinalityReducer