FeatureSelection
FeatureSelection is a Julia package containing implementations of feature selection algorithms for use with the machine learning toolbox MLJ.
Installation
In a running Julia session (version 1.6 or later), run
import Pkg;
Pkg.add("FeatureSelection")
Example Usage
Let's build a supervised recursive feature eliminator with RandomForestRegressor
from DecisionTree.jl as our base model. But first we need a dataset to train on. We shall create a synthetic dataset popularly known in the R community as the Friedman #1 dataset. Notice how the target vector for this dataset depends on only the first five columns of the feature table, so we expect our recursive feature elimination to return those first five columns as the important features.
using MLJ, FeatureSelection, StableRNGs
rng = StableRNG(123)
A = rand(rng, 50, 10)
X = MLJ.table(A) # features
y = @views(
    10 .* sin.(
        pi .* A[:, 1] .* A[:, 2]
    ) .+ 20 .* (A[:, 3] .- 0.5).^2 .+ 10 .* A[:, 4] .+ 5 .* A[:, 5]
) # target
50-element Vector{Float64}:
15.823421292367547
11.300228454892402
14.70281910203931
5.771835160196897
18.552879762728146
20.78516621103614
20.681427309506923
21.326088995836216
14.247147497721128
13.537577529977188
⋮
19.965258516245633
19.364285908333393
13.314083067474565
19.297478118395937
22.704030205168113
8.23163352846279
19.138707544262704
10.856925348363083
18.098524734814458
Now that we have our data, we can create our recursive feature elimination model and train it on our dataset:
RandomForestRegressor = @load RandomForestRegressor pkg=DecisionTree
forest = RandomForestRegressor(rng=rng)
rfe = RecursiveFeatureElimination(
    model = forest, n_features = 5, step = 1
) # see docstring for description of defaults
mach = machine(rfe, X, y)
fit!(mach)
trained Machine; caches model-specific representations of data
model: DeterministicRecursiveFeatureElimination(model = RandomForestRegressor(max_depth = -1, …), …)
args:
1: Source @051 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
2: Source @777 ⏎ AbstractVector{ScientificTypesBase.Continuous}
We can inspect the feature importances in two ways:
julia> report(mach).scores
Dict{Symbol, Int64} with 10 entries:
:x9 => 4
:x2 => 6
:x5 => 6
:x6 => 3
:x7 => 2
:x3 => 6
:x8 => 1
:x4 => 6
:x10 => 5
:x1 => 6
julia> feature_importances(mach)
10-element Vector{Pair{Symbol, Int64}}:
:x9 => 4
:x2 => 6
:x5 => 6
:x6 => 3
:x7 => 2
:x3 => 6
:x8 => 1
:x4 => 6
:x10 => 5
:x1 => 6
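Judging by the output above, features eliminated earlier in the loop receive lower scores, and the retained features share the highest score. If you just want the feature names ranked from most to least important, you can sort these pairs with Base Julia; a minimal sketch:
ranked = first.(sort(feature_importances(mach), by = last, rev = true))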
We can view the important features used by our model by inspecting the fitted_params object.
julia> p = fitted_params(mach)
(features_left = [:x4, :x2, :x1, :x5, :x3],
model_fitresult = (forest = Ensemble of Decision Trees
Trees: 100
Avg Leaves: 25.3
Avg Depth: 8.01,),)
julia> p.features_left
5-element Vector{Symbol}:
:x4
:x2
:x1
:x5
:x3
We can also call the predict method on the fitted machine, to predict using a random forest regressor trained on only the important features, or call the transform method, to select just those features from some new table containing all the original features, as illustrated just below. For more information, type ?RecursiveFeatureElimination in the Julia REPL.
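For instance, assuming Xtest is a table with the same ten columns as X, calls to these two methods might look like this (a minimal sketch; Xtest is a name introduced here just for illustration):
Xtest = MLJ.table(rand(rng, 3, 10)) # hypothetical new data with the original ten columns
predict(mach, Xtest)   # predictions from the forest trained on the selected features
transform(mach, Xtest) # table containing only the selected features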
Okay, let's say that we didn't know that our synthetic dataset depends on only five columns of our feature table. We could apply cross-validation, StratifiedCV(nfolds=5), with our recursive feature elimination model to select the optimal value of n_features
for our model. In this case we will use a simple grid search with root mean squared error (rms) as the measure.
rfe = RecursiveFeatureElimination(model = forest)
tuning_rfe_model = TunedModel(
    model = rfe,
    measure = rms,
    tuning = Grid(rng=rng),
    resampling = StratifiedCV(nfolds = 5),
    range = range(
        rfe, :n_features, values = 1:10
    )
)
self_tuning_rfe_mach = machine(tuning_rfe_model, X, y)
fit!(self_tuning_rfe_mach)
trained Machine; does not cache data
model: ProbabilisticTunedModel(model = DeterministicRecursiveFeatureElimination(model = RandomForestRegressor(max_depth = -1, …), …), …)
args:
1: Source @052 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
2: Source @616 ⏎ AbstractVector{ScientificTypesBase.Continuous}
As before, we can inspect the important features by inspecting the object returned by fitted_params or feature_importances, as shown below.
julia> fitted_params(self_tuning_rfe_mach).best_fitted_params.features_left
5-element Vector{Symbol}:
:x4
:x2
:x1
:x5
:x3
julia> feature_importances(self_tuning_rfe_mach)
10-element Vector{Pair{Symbol, Int64}}:
:x9 => 2
:x2 => 6
:x5 => 6
:x6 => 4
:x7 => 1
:x3 => 6
:x8 => 5
:x4 => 6
:x10 => 3
:x1 => 6
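To see which value of n_features the grid search actually settled on, you can also inspect the tuning report; a minimal sketch, assuming the standard TunedModel report fields:
report(self_tuning_rfe_mach).best_model.n_features # number of features retained by the best model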
We can also call predict on the tuned model machine, as shown below:
Xnew = MLJ.table(rand(rng, 50, 10)) # create test data
predict(self_tuning_rfe_mach, Xnew)
50-element Vector{Float64}:
14.612915980139846
18.487917617909144
13.618764198364365
11.672276660630205
14.002553975255037
15.873693213080983
13.441382659338426
18.91285351506013
12.339465903155364
15.877906366769594
⋮
15.782144419104085
10.94908418407388
11.85904254303697
14.716854931815396
13.54784125547524
11.502891246322188
14.093312357135678
13.443435888734937
16.061363024914662
In this case, prediction is done using the best recursive feature elimination model obtained from the tuning process above.
For resampling methods other than cross-validation, and for other TunedModel options such as parallelization, see the Tuning Models section of the MLJ documentation.
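For instance, a variant of the tuned model above that uses ordinary CV(nfolds=5) and multithreaded tuning might look like this (a sketch under the assumption that the same rfe, rng, X and y are in scope; it is not the package's recommended configuration):
tuned = TunedModel(
    model = rfe,
    measure = rms,
    tuning = Grid(rng = rng),
    resampling = CV(nfolds = 5),   # plain cross-validation instead of StratifiedCV
    acceleration = CPUThreads(),   # evaluate grid points across Julia threads
    range = range(rfe, :n_features, values = 1:10)
)
mach2 = machine(tuned, X, y)
fit!(mach2)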