GeneralImputer

mutable struct GeneralImputer <: MLJModelInterface.Unsupervised

Impute missing values using arbitrary learning models, from the Beta Machine Learning Toolkit (BetaML).

Impute missing values using a vector (one per column) of arbitrary learning models (classifiers/regressors, not necessarily from BetaML) that implement the interface m = Model([options]), fit!(m,X,Y) and predict(m,X).

Hyperparameters:

  • cols_to_impute::Union{String, Vector{Int64}}: Columns in the matrix for which to create an imputation model, i.e. the columns to impute. It can be a vector of column IDs (positions), or one of the keywords "auto" (default) or "all". With "auto" the model automatically detects the columns with missing data and imputes only those. Specify the columns manually, or use "all", if you want an imputation model to be created for those columns during training even when all the training data is non-missing, so that the trained model can then be applied to further data that may contain missing values.
  • estimator::Any: An estimator model (regressor or classifier), optionally with its options (hyper-parameters), to be used to impute the various columns of the matrix. It can also be a vector of different estimators, one per column (dimension) to impute, for example when some columns are categorical (and hence require a classifier) and others are numerical (and hence require a regressor). [default: nothing, i.e. use BetaML random forests, handling classification and regression jobs automatically].
  • missing_supported::Union{Bool, Vector{Bool}}: Whether the estimator(s) used to predict the missing data themselves support missing data in the training features (X). If not, when the model for a certain dimension is fitted, dimensions with missing data in the same rows as those where imputation is needed are dropped, and then only the non-missing rows in the remaining dimensions are considered. It can be a vector of boolean values, to specify this property for each individual estimator, or a single boolean value to apply to all the estimators [default: false].
  • fit_function::Union{Function, Vector{Function}}: The function used by the estimator(s) to fit the model. It should take the model itself as its first argument, a matrix representing the features as its second argument, and a vector representing the labels as its third argument. This parameter is mandatory for non-BetaML estimators and can be a single value or a vector (one per estimator) when estimators from different packages are used. [default: BetaML.fit!]
  • predict_function::Union{Function, Vector{Function}}: The function used by the estimator(s) to predict the labels. It should take the model itself as its first argument and a matrix representing the features as its second argument. This parameter is mandatory for non-BetaML estimators and can be a single value or a vector (one per estimator) when estimators from different packages are used. [default: BetaML.predict]
  • recursive_passages::Int64: The number of times to go through the various columns to impute their data. Useful when there are data to impute in multiple columns. The order of the first pass is given by the decreasing number of missing values per column; the order of the other passes is random [default: 1].
  • rng::Random.AbstractRNG: A Random Number Generator to be used in the stochastic parts of the code [default: Random.GLOBAL_RNG]. Note that this affects only the GeneralImputer code itself; the individual estimators may have their own rng (or similar) parameter.
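
The column selection described for cols_to_impute and the first-pass ordering described for recursive_passages can be illustrated in plain Julia. This is only a sketch of the documented behaviour, not BetaML's internal code; the variable names (n_missing, cols_auto, cols_all, first_pass_order) are hypothetical and chosen for this illustration.

```julia
# Illustration only: plain Julia, not BetaML's implementation.
X = [1.0      missing  3.0;
     missing  missing  6.0;
     7.0      8.0      9.0;
     10.0     missing  12.0]

# Number of missing values in each column.
n_missing = [count(ismissing, X[:, j]) for j in 1:size(X, 2)]

# "auto": only the columns that actually contain missing values.
cols_auto = findall(>(0), n_missing)

# "all": every column gets an imputation model, even without missing data.
cols_all = collect(1:size(X, 2))

# First pass: columns visited by decreasing number of missing values.
first_pass_order = cols_auto[sortperm(n_missing[cols_auto], rev=true)]
```

Here the second column has the most missing values, so under this reading it would be visited first; the third column has none and would be selected only with "all" or an explicit vector of positions.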

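The argument order described for fit_function and predict_function can be shown with a toy estimator written from scratch. MeanRegressor, mean_fit! and mean_predict are hypothetical names invented for this sketch and belong to no package; it is also an assumption here that the predict function returns one value per row of the feature matrix. Functions of this shape are what one would pass as fit_function and predict_function for a non-BetaML estimator.

```julia
using Statistics: mean

# Hypothetical estimator: always predicts the mean of the training labels.
mutable struct MeanRegressor
    mu::Float64
end
MeanRegressor() = MeanRegressor(NaN)   # unfitted model

# Argument order described for `fit_function`:
# the model, then the feature matrix, then the label vector.
function mean_fit!(m::MeanRegressor, X::AbstractMatrix, y::AbstractVector)
    m.mu = mean(y)
    return m
end

# Argument order described for `predict_function`:
# the model, then the feature matrix.
mean_predict(m::MeanRegressor, X::AbstractMatrix) = fill(m.mu, size(X, 1))

X_train = [1.0 2.0; 3.0 4.0; 5.0 6.0]
y_train = [10.0, 20.0, 30.0]

m = MeanRegressor()
mean_fit!(m, X_train, y_train)
y_hat = mean_predict(m, [7.0 8.0; 9.0 10.0])   # one prediction per row
```

The features are ignored by this toy model on purpose: the point is only the shape of the two functions, not the quality of the imputation.
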
Examples:

  • Using BetaML models:
julia> using MLJ;
julia> import BetaML ## The library from which to get the individual estimators to be used for each column imputation
julia> X = ["a"         8.2;
            "a"     missing;
            "a"         7.8;
            "b"          21;
            "b"          18;
            "c"        -0.9;
            missing      20;
            "c"        -1.8;
            missing    -2.3;
            "c"        -2.4] |> table ;
julia> modelType = @load GeneralImputer  pkg = "BetaML" verbosity=0
BetaML.Imputation.GeneralImputer
julia> model     = modelType(estimator=BetaML.DecisionTreeEstimator(),recursive_passages=2);
julia> mach      = machine(model, X);
julia> fit!(mach);
[ Info: Training machine(GeneralImputer(cols_to_impute = auto, …), …).
julia> X_full       = transform(mach) |> MLJ.matrix
10×2 Matrix{Any}:
 "a"   8.2
 "a"   8.0
 "a"   7.8
 "b"  21
 "b"  18
 "c"  -0.9
 "b"  20
 "c"  -1.8
 "c"  -2.3
 "c"  -2.4
  • Using third-party packages (in this example DecisionTree):
julia> using MLJ;
julia> import DecisionTree ## An example of external estimators to be used for each column imputation
julia> X = ["a"         8.2;
            "a"     missing;
            "a"         7.8;
            "b"          21;
            "b"          18;
            "c"        -0.9;
            missing      20;
            "c"        -1.8;
            missing    -2.3;
            "c"        -2.4] |> table ;
julia> modelType   = @load GeneralImputer  pkg = "BetaML" verbosity=0
BetaML.Imputation.GeneralImputer
julia> model     = modelType(estimator=[DecisionTree.DecisionTreeClassifier(),DecisionTree.DecisionTreeRegressor()], fit_function=DecisionTree.fit!,predict_function=DecisionTree.predict,recursive_passages=2);
julia> mach      = machine(model, X);
julia> fit!(mach);
[ Info: Training machine(GeneralImputer(cols_to_impute = auto, …), …).
julia> X_full       = transform(mach) |> MLJ.matrix
10×2 Matrix{Any}:
 "a"   8.2
 "a"   7.51111
 "a"   7.8
 "b"  21
 "b"  18
 "c"  -0.9
 "b"  20
 "c"  -1.8
 "c"  -2.3
 "c"  -2.4