Airfoil

To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions.

If you have questions or suggestions about this tutorial, please open an issue here.

Main author: Ashrya Agrawal.

Getting started

Here we use the UCI "Airfoil Self-Noise" dataset, where the goal is to predict the scaled sound pressure level from the other measurements.

Loading and preparing the data

using MLJ
using PrettyPrinting
import DataFrames
import Statistics
using CSV
using PyPlot
using HTTP
using StableRNGs


req = HTTP.get("https://raw.githubusercontent.com/rupakc/UCI-Data-Analysis/master/Airfoil%20Dataset/airfoil_self_noise.dat");

df = CSV.read(req.body, DataFrames.DataFrame; header=[
                   "Frequency","Attack_Angle","Chord+Length",
                   "Free_Velocity","Suction_Side","Scaled_Sound"
                   ]
              );
df[1:5, :] |> pretty
┌───────────┬──────────────┬──────────────┬───────────────┬──────────────┬──────────────┐
│ Frequency │ Attack_Angle │ Chord+Length │ Free_Velocity │ Suction_Side │ Scaled_Sound │
│ Int64     │ Float64      │ Float64      │ Float64       │ Float64      │ Float64      │
│ Count     │ Continuous   │ Continuous   │ Continuous    │ Continuous   │ Continuous   │
├───────────┼──────────────┼──────────────┼───────────────┼──────────────┼──────────────┤
│ 800       │ 0.0          │ 0.3048       │ 71.3          │ 0.00266337   │ 126.201      │
│ 1000      │ 0.0          │ 0.3048       │ 71.3          │ 0.00266337   │ 125.201      │
│ 1250      │ 0.0          │ 0.3048       │ 71.3          │ 0.00266337   │ 125.951      │
│ 1600      │ 0.0          │ 0.3048       │ 71.3          │ 0.00266337   │ 127.591      │
│ 2000      │ 0.0          │ 0.3048       │ 71.3          │ 0.00266337   │ 127.461      │
└───────────┴──────────────┴──────────────┴───────────────┴──────────────┴──────────────┘

Inspect the schema:

schema(df)
┌───────────────┬────────────┬─────────┐
│ names         │ scitypes   │ types   │
├───────────────┼────────────┼─────────┤
│ Frequency     │ Count      │ Int64   │
│ Attack_Angle  │ Continuous │ Float64 │
│ Chord+Length  │ Continuous │ Float64 │
│ Free_Velocity │ Continuous │ Float64 │
│ Suction_Side  │ Continuous │ Float64 │
│ Scaled_Sound  │ Continuous │ Float64 │
└───────────────┴────────────┴─────────┘

Unpack into target and features:

y, X = unpack(df, ==(:Scaled_Sound));
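
As a quick sanity check (not part of the original tutorial output), the target should now be a continuous vector and the features a table with the five remaining columns:

scitype(y)            # expect AbstractVector{Continuous}
first(X, 3) |> pretty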

Now we standardize the features using the Standardizer transformer:

X = MLJ.transform(fit!(machine(Standardizer(), X)), X);
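
A brief check (a sketch, assuming X is still a DataFrame after the transform): the continuous columns should now have mean roughly zero and unit standard deviation, while Frequency is left untouched because its Count scitype is ignored by the default Standardizer:

for col in DataFrames.names(X)
    # Frequency keeps its raw scale; the Continuous columns are standardized.
    println(rpad(col, 14), "mean = ", round(Statistics.mean(X[!, col]), digits=3),
            ",  std = ", round(Statistics.std(X[!, col]), digits=3))
end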

Partition the row indices into train and test sets:

train, test = partition(collect(eachindex(y)), 0.7, shuffle=true, rng=StableRNG(612));
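
partition returns two vectors of row indices; with a fraction of 0.7, roughly 70% of the rows land in train (a quick check, not shown in the original):

length(train), length(test)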

Let's first see which models are compatible with the scientific types and machine types of our data:

for model in models(matching(X, y))
    println("Model Name: ", model.name, " , Package: ", model.package_name)
end
Model Name: ConstantRegressor , Package: MLJModels
Model Name: DecisionTreeRegressor , Package: BetaML
Model Name: DecisionTreeRegressor , Package: DecisionTree
Model Name: DeterministicConstantRegressor , Package: MLJModels
Model Name: RandomForestRegressor , Package: BetaML
Model Name: RandomForestRegressor , Package: DecisionTree
Model Name: RandomForestRegressor , Package: ScikitLearn

Note that if we coerce X.Frequency to Continuous, many more models are available:

coerce!(X, :Frequency=>Continuous)

for model in models(matching(X, y))
    println("Model Name: ", model.name, " , Package: ", model.package_name)
end
Model Name: ARDRegressor , Package: ScikitLearn
Model Name: AdaBoostRegressor , Package: ScikitLearn
Model Name: BaggingRegressor , Package: ScikitLearn
Model Name: BayesianRidgeRegressor , Package: ScikitLearn
Model Name: ConstantRegressor , Package: MLJModels
Model Name: DecisionTreeRegressor , Package: BetaML
Model Name: DecisionTreeRegressor , Package: DecisionTree
Model Name: DeterministicConstantRegressor , Package: MLJModels
Model Name: DummyRegressor , Package: ScikitLearn
Model Name: ElasticNetCVRegressor , Package: ScikitLearn
Model Name: ElasticNetRegressor , Package: MLJLinearModels
Model Name: ElasticNetRegressor , Package: ScikitLearn
Model Name: EpsilonSVR , Package: LIBSVM
Model Name: EvoTreeGaussian , Package: EvoTrees
Model Name: EvoTreeRegressor , Package: EvoTrees
Model Name: ExtraTreesRegressor , Package: ScikitLearn
Model Name: GaussianProcessRegressor , Package: ScikitLearn
Model Name: GradientBoostingRegressor , Package: ScikitLearn
Model Name: HuberRegressor , Package: MLJLinearModels
Model Name: HuberRegressor , Package: ScikitLearn
Model Name: KNNRegressor , Package: NearestNeighborModels
Model Name: KNeighborsRegressor , Package: ScikitLearn
Model Name: KPLSRegressor , Package: PartialLeastSquaresRegressor
Model Name: LADRegressor , Package: MLJLinearModels
Model Name: LGBMRegressor , Package: LightGBM
Model Name: LarsCVRegressor , Package: ScikitLearn
Model Name: LarsRegressor , Package: ScikitLearn
Model Name: LassoCVRegressor , Package: ScikitLearn
Model Name: LassoLarsCVRegressor , Package: ScikitLearn
Model Name: LassoLarsICRegressor , Package: ScikitLearn
Model Name: LassoLarsRegressor , Package: ScikitLearn
Model Name: LassoRegressor , Package: MLJLinearModels
Model Name: LassoRegressor , Package: ScikitLearn
Model Name: LinearRegressor , Package: GLM
Model Name: LinearRegressor , Package: MLJLinearModels
Model Name: LinearRegressor , Package: MultivariateStats
Model Name: LinearRegressor , Package: ScikitLearn
Model Name: NeuralNetworkRegressor , Package: MLJFlux
Model Name: NuSVR , Package: LIBSVM
Model Name: OrthogonalMatchingPursuitCVRegressor , Package: ScikitLearn
Model Name: OrthogonalMatchingPursuitRegressor , Package: ScikitLearn
Model Name: PLSRegressor , Package: PartialLeastSquaresRegressor
Model Name: PassiveAggressiveRegressor , Package: ScikitLearn
Model Name: QuantileRegressor , Package: MLJLinearModels
Model Name: RANSACRegressor , Package: ScikitLearn
Model Name: RandomForestRegressor , Package: BetaML
Model Name: RandomForestRegressor , Package: DecisionTree
Model Name: RandomForestRegressor , Package: ScikitLearn
Model Name: RidgeCVRegressor , Package: ScikitLearn
Model Name: RidgeRegressor , Package: MLJLinearModels
Model Name: RidgeRegressor , Package: MultivariateStats
Model Name: RidgeRegressor , Package: ScikitLearn
Model Name: RobustRegressor , Package: MLJLinearModels
Model Name: SGDRegressor , Package: ScikitLearn
Model Name: SVMLinearRegressor , Package: ScikitLearn
Model Name: SVMNuRegressor , Package: ScikitLearn
Model Name: SVMRegressor , Package: ScikitLearn
Model Name: TheilSenRegressor , Package: ScikitLearn
Model Name: XGBoostRegressor , Package: XGBoost
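
As an aside (not in the original tutorial), models also accepts a string or regular expression if you want to search the registry by name rather than by data compatibility:

models("RandomForest")   # entries whose name or docstring mentions "RandomForest"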

DecisionTreeRegressor

We will first try out DecisionTreeRegressor:

DecisionTreeRegressor = @load DecisionTreeRegressor pkg=DecisionTree
import MLJDecisionTreeInterface ✔

dcrm = machine(DecisionTreeRegressor(), X, y)

fit!(dcrm, rows=train)
pred_dcrm = predict(dcrm, rows=test);

Now we can call a loss function to assess the performance on the test set.

rms(pred_dcrm, y[test])
2.912630331542761
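
Alternatively (a sketch, not from the original tutorial), MLJ's evaluate! can compute a hold-out estimate directly; note that it re-partitions the rows itself, so the number will not exactly match the manual split above:

evaluate!(dcrm,
          resampling=Holdout(fraction_train=0.7, shuffle=true, rng=StableRNG(612)),
          measure=rms)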

RandomForestRegressor

Now let's try out RandomForestRegressor:

RandomForestRegressor = @load RandomForestRegressor pkg=DecisionTree
import MLJDecisionTreeInterface ✔

rfr = RandomForestRegressor()

rfr_m = machine(rfr, X, y);

Train on the rows corresponding to the train indices:

fit!(rfr_m, rows=train);

Predict on the rows corresponding to the test indices:

pred_rfr = predict(rfr_m, rows=test);
rms(pred_rfr, y[test])
2.3791419542821077

Unsurprisingly, the RandomForestRegressor does a better job.

Can we do even better? Yes, we can, by tuning the model's hyperparameters.

Tuning

If you are new to model tuning with MLJ, refer to lab5 and model-tuning.

To tune the hyperparameters, we first specify ranges of values for the parameters of interest:

r_n_trees = range(rfr, :n_trees, lower=9, upper=15)
r_s_frac = range(rfr, :sampling_fraction, lower=0.6, upper=0.8)
r = [r_n_trees, r_s_frac];

Now we specify how the tuning should be done. Let's use a coarse grid search with 6-fold cross-validation and instantiate a tuned model:

tuning = Grid(resolution=7)
resampling = CV(nfolds=6)

tm = TunedModel(model=rfr, tuning=tuning,
                resampling=resampling, ranges=r, measure=rms)

rfr_tm = machine(tm, X, y);

Train on the rows corresponding to the train indices:

fit!(rfr_tm, rows=train);

Predict on the rows corresponding to the test indices:

pred_rfr_tm = predict(rfr_tm, rows=test);
rms(pred_rfr_tm, y[test])
2.2884649951457146

That's even better! Tuning has further reduced the test error.

To retrieve the best model, you can use:

fitted_params(rfr_tm).best_model
RandomForestRegressor(
    max_depth = -1,
    min_samples_leaf = 1,
    min_samples_split = 2,
    min_purity_increase = 0.0,
    n_subfeatures = -1,
    n_trees = 13,
    sampling_fraction = 0.7666666666666667,
    pdf_smoothing = 0.0,
    rng = Random._GLOBAL_RNG())
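
The machine's report also records the tuning history alongside the best model; for instance (field names as in MLJTuning, worth verifying against your installed version), the cross-validation estimate attained by the best model is available as:

report(rfr_tm).best_history_entry.measurement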

Now we can investigate the tuning by using report. Let's plot a heatmap of the measurements:

rep = report(rfr_tm)
res = rep.plotting

vals_ntrees = res.parameter_values[:, 1]
vals_sfrac  = res.parameter_values[:, 2]

figure(figsize=(8, 6))
tricontourf(vals_ntrees, vals_sfrac, res.measurements)

xlabel("Number of trees", fontsize=14)
ylabel("Sampling fraction", fontsize=14)
xticks(9:1:15, fontsize=12)
yticks(fontsize=12)
Hyperparameter heatmap
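
If you want to keep the figure, PyPlot's savefig writes it to disk (the filename here is just an example):

savefig("airfoil_hyperparameter_heatmap.svg")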