Airfoil
To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions. If you have questions or suggestions about this tutorial, please open an issue here.
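For example, assuming the downloaded project folder is called DataScienceTutorials and sits in your working directory (the name and path here are only illustrative), you can activate and instantiate its environment like this:
using Pkg
Pkg.activate("DataScienceTutorials")  # hypothetical path to the downloaded tutorial project
Pkg.instantiate()                     # install the pinned dependencies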
Main author: Ashrya Agrawal.
Here we use the UCI "Airfoil Self-Noise" dataset.
using MLJ
using PrettyPrinting
import DataFrames
import Statistics
using CSV
using HTTP
using StableRNGs
req = HTTP.get("https://raw.githubusercontent.com/rupakc/UCI-Data-Analysis/master/Airfoil%20Dataset/airfoil_self_noise.dat");
df = CSV.read(req.body, DataFrames.DataFrame; header=[
        "Frequency", "Attack_Angle", "Chord+Length",
        "Free_Velocity", "Suction_Side", "Scaled_Sound"
    ]);
df[1:5, :] |> pretty
┌───────────┬──────────────┬──────────────┬───────────────┬──────────────┬──────────────┐
│ Frequency │ Attack_Angle │ Chord+Length │ Free_Velocity │ Suction_Side │ Scaled_Sound │
│ Int64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
│ Count │ Continuous │ Continuous │ Continuous │ Continuous │ Continuous │
├───────────┼──────────────┼──────────────┼───────────────┼──────────────┼──────────────┤
│ 800.0 │ 0.0 │ 0.3048 │ 71.3 │ 0.00266337 │ 126.201 │
│ 1000.0 │ 0.0 │ 0.3048 │ 71.3 │ 0.00266337 │ 125.201 │
│ 1250.0 │ 0.0 │ 0.3048 │ 71.3 │ 0.00266337 │ 125.951 │
│ 1600.0 │ 0.0 │ 0.3048 │ 71.3 │ 0.00266337 │ 127.591 │
│ 2000.0 │ 0.0 │ 0.3048 │ 71.3 │ 0.00266337 │ 127.461 │
└───────────┴──────────────┴──────────────┴───────────────┴──────────────┴──────────────┘
Inspect the schema:
schema(df)
┌───────────────┬────────────┬─────────┐
│ names │ scitypes │ types │
├───────────────┼────────────┼─────────┤
│ Frequency │ Count │ Int64 │
│ Attack_Angle │ Continuous │ Float64 │
│ Chord+Length │ Continuous │ Float64 │
│ Free_Velocity │ Continuous │ Float64 │
│ Suction_Side │ Continuous │ Float64 │
│ Scaled_Sound │ Continuous │ Float64 │
└───────────────┴────────────┴─────────┘
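Frequency is interpreted as Count because it is stored as an Int64; the other columns are Continuous. To check the scientific type of a single column directly you can, for example, query:
scitype(df.Frequency)   # AbstractVector{Count} under the default scitype convention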
Unpack into the features and the target:
y, X = unpack(df, ==(:Scaled_Sound));
Now we standardize the features using the Standardizer() transformer:
X = MLJ.transform(fit!(machine(Standardizer(), X)), X);
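As a sanity check, a standardized (Continuous) column should now have mean close to zero and standard deviation close to one, for example:
Statistics.mean(X.Attack_Angle), Statistics.std(X.Attack_Angle)   # ≈ (0.0, 1.0)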
Partition the row indices into train and test sets:
train, test = partition(collect(eachindex(y)), 0.7, shuffle=true, rng=StableRNG(612));
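Here partition shuffles and splits the vector of row indices, with roughly 70% of them going to train; a quick check:
length(train), length(test)   # approximately a 70/30 split of the rows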
Let's first see which models are compatible with the scientific type and machine type of our data:
for model in models(matching(X, y))
print("Model Name: " , model.name , " , Package: " , model.package_name , "\n")
end
Model Name: CatBoostRegressor , Package: CatBoost
Model Name: ConstantRegressor , Package: MLJModels
Model Name: DecisionTreeRegressor , Package: BetaML
Model Name: DecisionTreeRegressor , Package: DecisionTree
Model Name: DeterministicConstantRegressor , Package: MLJModels
Model Name: EvoLinearRegressor , Package: EvoLinear
Model Name: EvoSplineRegressor , Package: EvoLinear
Model Name: EvoTreeGaussian , Package: EvoTrees
Model Name: EvoTreeMLE , Package: EvoTrees
Model Name: EvoTreeRegressor , Package: EvoTrees
Model Name: GaussianMixtureRegressor , Package: BetaML
Model Name: NeuralNetworkRegressor , Package: BetaML
Model Name: RandomForestRegressor , Package: BetaML
Model Name: RandomForestRegressor , Package: DecisionTree
Model Name: RandomForestRegressor , Package: MLJScikitLearnInterface
Model Name: StableForestRegressor , Package: SIRUS
Model Name: StableRulesRegressor , Package: SIRUS
Note that if we coerce X.Frequency to Continuous, many more models are available:
coerce!(X, :Frequency=>Continuous)
for model in models(matching(X, y))
print("Model Name: " , model.name , " , Package: " , model.package_name , "\n")
end
Model Name: ARDRegressor , Package: MLJScikitLearnInterface
Model Name: AdaBoostRegressor , Package: MLJScikitLearnInterface
Model Name: BaggingRegressor , Package: MLJScikitLearnInterface
Model Name: BayesianRidgeRegressor , Package: MLJScikitLearnInterface
Model Name: CatBoostRegressor , Package: CatBoost
Model Name: ConstantRegressor , Package: MLJModels
Model Name: DecisionTreeRegressor , Package: BetaML
Model Name: DecisionTreeRegressor , Package: DecisionTree
Model Name: DeterministicConstantRegressor , Package: MLJModels
Model Name: DummyRegressor , Package: MLJScikitLearnInterface
Model Name: ElasticNetCVRegressor , Package: MLJScikitLearnInterface
Model Name: ElasticNetRegressor , Package: MLJLinearModels
Model Name: ElasticNetRegressor , Package: MLJScikitLearnInterface
Model Name: EpsilonSVR , Package: LIBSVM
Model Name: EvoLinearRegressor , Package: EvoLinear
Model Name: EvoSplineRegressor , Package: EvoLinear
Model Name: EvoTreeGaussian , Package: EvoTrees
Model Name: EvoTreeMLE , Package: EvoTrees
Model Name: EvoTreeRegressor , Package: EvoTrees
Model Name: ExtraTreesRegressor , Package: MLJScikitLearnInterface
Model Name: GaussianMixtureRegressor , Package: BetaML
Model Name: GaussianProcessRegressor , Package: MLJScikitLearnInterface
Model Name: GradientBoostingRegressor , Package: MLJScikitLearnInterface
Model Name: HistGradientBoostingRegressor , Package: MLJScikitLearnInterface
Model Name: HuberRegressor , Package: MLJLinearModels
Model Name: HuberRegressor , Package: MLJScikitLearnInterface
Model Name: KNNRegressor , Package: NearestNeighborModels
Model Name: KNeighborsRegressor , Package: MLJScikitLearnInterface
Model Name: KPLSRegressor , Package: PartialLeastSquaresRegressor
Model Name: LADRegressor , Package: MLJLinearModels
Model Name: LGBMRegressor , Package: LightGBM
Model Name: LarsCVRegressor , Package: MLJScikitLearnInterface
Model Name: LarsRegressor , Package: MLJScikitLearnInterface
Model Name: LassoCVRegressor , Package: MLJScikitLearnInterface
Model Name: LassoLarsCVRegressor , Package: MLJScikitLearnInterface
Model Name: LassoLarsICRegressor , Package: MLJScikitLearnInterface
Model Name: LassoLarsRegressor , Package: MLJScikitLearnInterface
Model Name: LassoRegressor , Package: MLJLinearModels
Model Name: LassoRegressor , Package: MLJScikitLearnInterface
Model Name: LinearRegressor , Package: GLM
Model Name: LinearRegressor , Package: MLJLinearModels
Model Name: LinearRegressor , Package: MLJScikitLearnInterface
Model Name: LinearRegressor , Package: MultivariateStats
Model Name: NeuralNetworkRegressor , Package: BetaML
Model Name: NeuralNetworkRegressor , Package: MLJFlux
Model Name: NuSVR , Package: LIBSVM
Model Name: OrthogonalMatchingPursuitCVRegressor , Package: MLJScikitLearnInterface
Model Name: OrthogonalMatchingPursuitRegressor , Package: MLJScikitLearnInterface
Model Name: PLSRegressor , Package: PartialLeastSquaresRegressor
Model Name: PassiveAggressiveRegressor , Package: MLJScikitLearnInterface
Model Name: QuantileRegressor , Package: MLJLinearModels
Model Name: RANSACRegressor , Package: MLJScikitLearnInterface
Model Name: RandomForestRegressor , Package: BetaML
Model Name: RandomForestRegressor , Package: DecisionTree
Model Name: RandomForestRegressor , Package: MLJScikitLearnInterface
Model Name: RidgeCVRegressor , Package: MLJScikitLearnInterface
Model Name: RidgeRegressor , Package: MLJLinearModels
Model Name: RidgeRegressor , Package: MLJScikitLearnInterface
Model Name: RidgeRegressor , Package: MultivariateStats
Model Name: RobustRegressor , Package: MLJLinearModels
Model Name: SGDRegressor , Package: MLJScikitLearnInterface
Model Name: SRRegressor , Package: SymbolicRegression
Model Name: SVMLinearRegressor , Package: MLJScikitLearnInterface
Model Name: SVMNuRegressor , Package: MLJScikitLearnInterface
Model Name: SVMRegressor , Package: MLJScikitLearnInterface
Model Name: StableForestRegressor , Package: SIRUS
Model Name: StableRulesRegressor , Package: SIRUS
Model Name: TheilSenRegressor , Package: MLJScikitLearnInterface
Model Name: XGBoostRegressor , Package: XGBoost
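The models(matching(X, y)) call simply returns a vector of model metadata entries, so if you only want the count of compatible models rather than the full listing you can write, for example:
length(models(matching(X, y)))   # number of models compatible with the coerced data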
We will first try out DecisionTreeRegressor:
DecisionTreeRegressor = @load DecisionTreeRegressor pkg=DecisionTree
dcrm = machine(DecisionTreeRegressor(), X, y)
fit!(dcrm, rows=train)
pred_dcrm = predict(dcrm, rows=test);
import MLJDecisionTreeInterface ✔
Now you can call a loss function to assess the performance on the test set.
rms(pred_dcrm, y[test])
2.9034811027815564
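rms is just one of the regression measures MLJ re-exports; the same predictions can be assessed with other measures in exactly the same way, for instance:
mae(pred_dcrm, y[test])   # mean absolute error
rsq(pred_dcrm, y[test])   # coefficient of determination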
Now let's try out RandomForestRegressor:
RandomForestRegressor = @load RandomForestRegressor pkg=DecisionTree
rfr = RandomForestRegressor()
rfr_m = machine(rfr, X, y);
import MLJDecisionTreeInterface ✔
Train on the rows corresponding to train:
fit!(rfr_m, rows=train);
Predict values on the rows corresponding to test:
pred_rfr = predict(rfr_m, rows=test);
rms(pred_rfr, y[test])
2.173840483006333
Unsurprisingly, the RandomForestRegressor does a better job.
Can we do even better? Yes: we can tune the model's hyperparameters.
If you are new to model tuning with MLJ, refer to lab5 and model-tuning.
To tune hyperparameters, we first specify a range of values for each parameter of interest:
r_maxD = range(rfr, :n_trees, lower=9, upper=15)
r_samF = range(rfr, :sampling_fraction, lower=0.6, upper=0.8)
r = [r_maxD, r_samF];
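A range object is only a specification. To see the candidate values a Grid strategy will actually try for a given resolution (here the resolution of 7 used below), you can expand it with MLJ's iterator function:
iterator(r_maxD, 7)   # the n_trees values on the grid (integers from 9 to 15)
iterator(r_samF, 7)   # seven evenly spaced sampling_fraction values between 0.6 and 0.8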
Now we specify how the tuning should be done: a coarse grid search with cross-validation, wrapped in a tuned model:
tuning = Grid(resolution=7)
resampling = CV(nfolds=6)
tm = TunedModel(model=rfr, tuning=tuning,
resampling=resampling, ranges=r, measure=rms)
rfr_tm = machine(tm, X, y);
Train on the rows corresponding to train:
fit!(rfr_tm, rows=train);
Predict values on the rows corresponding to test:
pred_rfr_tm = predict(rfr_tm, rows=test);
rms(pred_rfr_tm, y[test])
2.2752847131799996
In this particular run the tuned forest's RMS is actually slightly higher than that of the untuned forest; the outcome of such a comparison depends on the random seed and the train/test split, but tuning gives us a principled way of selecting the hyperparameters.
To retrieve the best model found by the search, you can use:
fitted_params(rfr_tm).best_model
RandomForestRegressor(
max_depth = -1,
min_samples_leaf = 1,
min_samples_split = 2,
min_purity_increase = 0.0,
n_subfeatures = -1,
n_trees = 14,
sampling_fraction = 0.7,
feature_importance = :impurity,
rng = Random._GLOBAL_RNG())
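Besides the best model, the machine's report contains the full tuning history, which is useful for inspecting how each grid point performed (field names as documented for MLJ's TunedModel report):
rep = report(rfr_tm)
rep.best_history_entry   # the evaluation associated with the best model
length(rep.history)      # total number of hyperparameter combinations evaluated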
Let's visualize the tuning results:
using Plots
plot(rfr_tm)
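Finally, note that rfr_tm is itself a machine wrapping the best model found, so it can be used directly to predict on new rows, for example:
predict(rfr_tm, X[test[1:3], :])   # predictions for the first three test rows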