RandomForestImputer

mutable struct RandomForestImputer <: MLJModelInterface.Unsupervised

Impute missing values using Random Forests, from the Beta Machine Learning Toolkit (BetaML).

Hyperparameters:

  • n_trees::Int64: Number of (decision) trees in the forest [def: 30]
  • max_depth::Union{Nothing, Int64}: The maximum depth the tree is allowed to reach. When this limit is reached, the node is forced to become a leaf [def: nothing, i.e. no limit]
  • min_gain::Float64: The minimum information gain required to allow a node's partition [def: 0]
  • min_records::Int64: The minimum number of records a node must hold to be considered for partitioning [def: 2]
  • max_features::Union{Nothing, Int64}: The maximum number of (random) features to consider at each partitioning [def: nothing, i.e. the square root of the data dimension]
  • forced_categorical_cols::Vector{Int64}: The positions of the integer columns to treat as categorical instead of cardinal [def: empty vector (all numerical columns are treated as cardinal by default and the others as categorical)]
  • splitting_criterion::Union{Nothing, Function}: Either gini, entropy or variance: the name of the function used to compute the information gain of a specific partition. The gain is measured as the difference between the "impurity" of the labels of the parent node and that of the two child nodes, weighted by the respective number of items [def: nothing, i.e. gini for categorical labels (classification task) and variance for numerical labels (regression task)]. It can be an anonymous function.
  • recursive_passages::Int64: The number of times to go through the various columns to impute their data. Useful when multiple columns have data to impute. The order of the first passage is given by the decreasing number of missing values per column; subsequent passages are in random order [default: 1].
  • rng::Random.AbstractRNG: A random number generator to be used in the stochastic parts of the code [default: Random.GLOBAL_RNG]
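
Any of the hyperparameters above can be set as keyword arguments when constructing the model. A minimal sketch (the specific values are illustrative, not recommendations):

```julia
using MLJ, Random

# Load the model type from BetaML (as in the example below).
RFImputer = @load RandomForestImputer pkg = "BetaML" verbosity = 0

# Illustrative settings: a larger forest, two imputation passes over the
# columns, and a fixed-seed RNG so the stochastic parts are reproducible.
model = RFImputer(
    n_trees            = 50,
    recursive_passages = 2,
    rng                = Random.MersenneTwister(123),
)
```

With more than one `recursive_passages`, columns imputed early benefit from the values already imputed in the other columns on later passes.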

Example:

julia> using MLJ

julia> X = [1 10.5; 1.5 missing; 1.8 8; 1.7 15; 3.2 40; missing missing; 3.3 38; missing -2.3; 5.2 -2.4] |> table;

julia> modelType   = @load RandomForestImputer  pkg = "BetaML" verbosity=0
BetaML.Imputation.RandomForestImputer

julia> model     = modelType(n_trees=40)
RandomForestImputer(
  n_trees = 40, 
  max_depth = nothing, 
  min_gain = 0.0, 
  min_records = 2, 
  max_features = nothing, 
  forced_categorical_cols = Int64[], 
  splitting_criterion = nothing, 
  recursive_passages = 1, 
  rng = Random._GLOBAL_RNG())

julia> mach      = machine(model, X);

julia> fit!(mach);
[ Info: Training machine(RandomForestImputer(n_trees = 40, …), …).

julia> X_full       = transform(mach) |> MLJ.matrix
9×2 Matrix{Float64}:
 1.0      10.5
 1.5      10.3909
 1.8       8.0
 1.7      15.0
 3.2      40.0
 2.88375   8.66125
 3.3      38.0
 3.98125  -2.3
 5.2      -2.4