MissingnessEncoder
A model type for constructing a missingness encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
MissingnessEncoder = @load MissingnessEncoder pkg=MLJTransforms
Do model = MissingnessEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in MissingnessEncoder(features=...).
MissingnessEncoder maps the missing entries of a categorical feature to a new level (e.g., "Missing"), so that any subsequent model treats missingness as a level of its own. This assumes that the raw types of the categorical features are subtypes of Char, AbstractString, or Number.
Training data
In MLJ (or MLJBase), bind an instance of the unsupervised model, model, to data with
mach = machine(model, X)
Here:
- X is any table of input features (e.g., a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- features=[]: A list of names of categorical features, given as symbols, to exclude from or include in encoding, according to the value of ignore; or a single symbol (which is treated as a vector with one symbol); or a callable that returns true for the features to be included/excluded.
- ignore=true: Whether to exclude or include the features given in features.
- ordered_factor=false: Whether to encode OrderedFactor features or ignore them.
- label_for_missing::Dict{<:Type, <:Any} = Dict(AbstractString => "missing", Char => 'm'): A dictionary whose keys are drawn from the types Char, AbstractString, and Number, and whose values give the new level that missing is mapped to for a column with that raw supertype. By default, if the raw type of the column subtypes AbstractString, missing values are replaced with "missing"; if it subtypes Char, the new value is 'm'; and if it subtypes Number, the new value is the lowest value in the column minus 1.
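The selection rule (features together with ignore) and the default replacement labels described above can be sketched in plain Julia. This is only an illustration of the documented behaviour, not the MLJTransforms implementation: the helpers selected_features, default_missing_label and encode_missing are hypothetical names, and they operate on ordinary vectors rather than categorical ones.

```julia
# Hypothetical helpers illustrating the documented rules; not part of MLJTransforms.

# Decide which feature names get encoded, given `features` and `ignore`.
function selected_features(names::AbstractVector{Symbol}, features, ignore::Bool)
    listed = features isa Symbol ? [features] :          # single symbol
             features isa AbstractVector ? features :    # list of symbols
             filter(features, names)                     # callable predicate
    # ignore=true excludes the listed features; ignore=false keeps only them.
    return ignore ? setdiff(names, listed) : intersect(names, listed)
end

# Pick the default replacement level from the column's raw (non-missing) type.
function default_missing_label(col::AbstractVector)
    T = nonmissingtype(eltype(col))
    if T <: AbstractString
        return "missing"
    elseif T <: Char
        return 'm'
    elseif T <: Number
        return minimum(skipmissing(col)) - 1   # lowest value in the column minus 1
    else
        throw(ArgumentError("unsupported raw type $T"))
    end
end

# Replace every missing entry with the chosen label.
encode_missing(col) = coalesce.(col, default_missing_label(col))

selected_features([:A, :C, :E], [], true)     # [:A, :C, :E] (defaults: nothing excluded)
selected_features([:A, :C, :E], :A, false)    # [:A]
encode_missing(["Ben", missing, "Mary"])      # ["Ben", "missing", "Mary"]
encode_missing([7, 5, missing, 10, 0])        # [7, 5, -1, 10, 0]
encode_missing([missing, 'g', 'r'])           # ['m', 'g', 'r']
```

Under these assumptions, the defaults features=[] and ignore=true exclude nothing, so every eligible categorical feature is encoded.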
Operations
- transform(mach, Xnew): Apply missingness encoding to the Multiclass or OrderedFactor features of Xnew selected by the hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.
Fitted parameters
The fields of fitted_params(mach) are:
- label_for_missing_given_feature: A dictionary that, for each column, maps missing into some value according to label_for_missing.
Report
The fields of report(mach) are:
- encoded_features: The subset of the categorical features of X that were encoded.
Examples
import StatsBase.proportionmap
using MLJ
## Define a table with missing values
Xm = (
A = categorical(["Ben", "John", missing, missing, "Mary", "John", missing]),
B = [1.85, 1.67, missing, missing, 1.5, 1.67, missing],
C = categorical([7, 5, missing, missing, 10, 0, missing]),
D = [23, 23, 44, 66, 14, 23, 11],
E = categorical([missing, 'g', 'r', missing, 'r', 'g', 'p'])
)
encoder = MissingnessEncoder()
mach = fit!(machine(encoder, Xm))
Xnew = transform(mach, Xm)
julia> Xnew
(A = ["Ben", "John", "missing", "missing", "Mary", "John", "missing"],
B = Union{Missing, Float64}[1.85, 1.67, missing, missing, 1.5, 1.67, missing],
C = [7, 5, -1, -1, 10, 0, -1],
D = [23, 23, 44, 66, 14, 23, 11],
E = ['m', 'g', 'r', 'm', 'r', 'g', 'p'],)
See also CardinalityReducer