FeatureSelection
FeatureSelection is a Julia package containing implementations of feature selection algorithms for use with the machine learning toolbox MLJ.
Installation
On a running instance of Julia with at least version 1.6, run
import Pkg;
Pkg.add("FeatureSelection")Example Usage
Let's build a supervised recursive feature eliminator with RandomForestRegressor from DecisionTree.jl as our base model. But first we need a dataset to train on. We shall create a synthetic dataset popularly known in the R community as the Friedman dataset #1. Notice how the target vector for this dataset depends on only the first five columns of the feature table, so we expect recursive feature elimination to return those first five columns as the important features.
using MLJ, FeatureSelection, StableRNGs
rng = StableRNG(123)
A = rand(rng, 50, 10)
X = MLJ.table(A) # features
y = @views(
10 .* sin.(
pi .* A[:, 1] .* A[:, 2]
) + 20 .* (A[:, 3] .- 0.5).^ 2 .+ 10 .* A[:, 4] .+ 5 * A[:, 5]
) # target

50-element Vector{Float64}:
15.823421292367547
11.300228454892402
14.70281910203931
5.771835160196897
18.552879762728146
20.78516621103614
20.681427309506923
21.326088995836216
14.247147497721128
13.537577529977188
⋮
19.965258516245633
19.364285908333393
13.314083067474565
19.297478118395937
22.704030205168113
8.23163352846279
19.138707544262704
10.856925348363083
18.098524734814458

Now that we have our data, we can create our recursive feature elimination model and train it on our dataset.
RandomForestRegressor = @load RandomForestRegressor pkg=DecisionTree
forest = RandomForestRegressor(rng=rng)
rfe = RecursiveFeatureElimination(
model = forest, n_features=5, step=1
) # see docstring for a description of the defaults
mach = machine(rfe, X, y)
fit!(mach)

trained Machine; caches model-specific representations of data
model: DeterministicRecursiveFeatureElimination(model = RandomForestRegressor(max_depth = -1, …), …)
args:
1: Source @051 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
2: Source @777 ⏎ AbstractVector{ScientificTypesBase.Continuous}
We can inspect the feature importances in two ways:
julia> report(mach).scores
Dict{Symbol, Int64} with 10 entries:
:x9 => 4
:x2 => 6
:x5 => 6
:x6 => 3
:x7 => 2
:x3 => 6
:x8 => 1
:x4 => 6
:x10 => 5
:x1 => 6
julia> feature_importances(mach)
10-element Vector{Pair{Symbol, Int64}}:
:x9 => 4
:x2 => 6
:x5 => 6
:x6 => 3
:x7 => 2
:x3 => 6
:x8 => 1
:x4 => 6
:x10 => 5
:x1 => 6

We can view the important features used by our model by inspecting the fitted_params object.
julia> p = fitted_params(mach)
(features_left = [:x4, :x2, :x1, :x5, :x3],
model_fitresult = (forest = Ensemble of Decision Trees
Trees: 100
Avg Leaves: 25.3
Avg Depth: 8.01,),)
julia> p.features_left
5-element Vector{Symbol}:
:x4
:x2
:x1
:x5
:x3

We can also call the predict method on the fitted machine to predict using a random forest regressor trained on only the important features, or call the transform method to select just those features from some new table that includes all the original features. For more information, type ?RecursiveFeatureElimination at the Julia REPL.
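For example (a minimal sketch, not output reproduced from above; Xfresh is a hypothetical new table with the same ten features, built from a separate RNG so the results shown earlier are unaffected):

Xfresh = MLJ.table(rand(StableRNG(42), 5, 10)) # hypothetical new data with the same ten features
predict(mach, Xfresh)   # predictions from the random forest trained on the selected features
transform(mach, Xfresh) # a table containing only the selected features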
Okay, let's say that we didn't know that our synthetic dataset depends on only five columns of our feature table. We could apply cross-validation, here StratifiedCV(nfolds=5), with our recursive feature elimination model to select the optimal value of n_features. In this case we use a simple Grid search with root mean squared error (rms) as the measure.
rfe = RecursiveFeatureElimination(model = forest)
tuning_rfe_model = TunedModel(
model = rfe,
measure = rms,
tuning = Grid(rng=rng),
resampling = StratifiedCV(nfolds = 5),
range = range(
rfe, :n_features, values = 1:10
)
)
self_tuning_rfe_mach = machine(tuning_rfe_model, X, y)
fit!(self_tuning_rfe_mach)

trained Machine; does not cache data
model: ProbabilisticTunedModel(model = DeterministicRecursiveFeatureElimination(model = RandomForestRegressor(max_depth = -1, …), …), …)
args:
1: Source @052 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
2: Source @616 ⏎ AbstractVector{ScientificTypesBase.Continuous}
As before, we can inspect the important features using fitted_params or feature_importances, as shown below.
julia> fitted_params(self_tuning_rfe_mach).best_fitted_params.features_left
5-element Vector{Symbol}:
:x4
:x2
:x1
:x5
:x3
julia> feature_importances(self_tuning_rfe_mach)
10-element Vector{Pair{Symbol, Int64}}:
:x9 => 2
:x2 => 6
:x5 => 6
:x6 => 4
:x7 => 1
:x3 => 6
:x8 => 5
:x4 => 6
:x10 => 3
:x1 => 6

We can also call predict on the tuned model machine, as shown below.
Xnew = MLJ.table(rand(rng, 50, 10)) # create test data
predict(self_tuning_rfe_mach, Xnew)

50-element Vector{Float64}:
14.612915980139846
18.487917617909144
13.618764198364365
11.672276660630205
14.002553975255037
15.873693213080983
13.441382659338426
18.91285351506013
12.339465903155364
15.877906366769594
⋮
15.782144419104085
10.94908418407388
11.85904254303697
14.716854931815396
13.54784125547524
11.502891246322188
14.093312357135678
13.443435888734937
16.061363024914662

In this case, prediction is done using the best recursive feature elimination model obtained from the tuning process above.
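To check which value of n_features the search actually settled on, one can inspect the best model directly. The following is only a sketch using standard MLJ accessors for tuned machines, not output reproduced from above:

best_rfe = fitted_params(self_tuning_rfe_mach).best_model # the winning RecursiveFeatureElimination model
best_rfe.n_features                                       # number of features it retains
report(self_tuning_rfe_mach).best_history_entry           # its evaluation in the tuning history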
For resampling methods other than cross-validation, and for other TunedModel options such as parallelization, see the Tuning Models section of the MLJ documentation.
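As an illustration only (a hedged sketch assuming the rfe, X, y and rng defined above; Holdout resampling and CPUThreads acceleration are standard MLJ options, but this variant is not part of the worked example):

tuned_rfe_holdout = TunedModel(
    model = rfe,
    measure = rms,
    tuning = Grid(rng = rng),
    resampling = Holdout(fraction_train = 0.7, rng = rng), # holdout instead of cross-validation
    acceleration = CPUThreads(),                           # evaluate grid points on multiple threads
    range = range(rfe, :n_features, values = 1:10)
)
mach_holdout = machine(tuned_rfe_holdout, X, y)
fit!(mach_holdout)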