Common MLJ Workflows
Data ingestion
using RDatasets
channing = dataset("boot", "channing")
first(channing, 4)
| Row | Sex | Entry | Exit | Time | Cens |
|---|---|---|---|---|---|
| | Categorical… | Int32 | Int32 | Int32 | Int32 |
| 1 | Male | 782 | 909 | 127 | 1 |
| 2 | Male | 1020 | 1128 | 108 | 1 |
| 3 | Male | 856 | 969 | 113 | 1 |
| 4 | Male | 915 | 957 | 42 | 1 |
Inspecting metadata, including column scientific types:
schema(channing)
_.table =
┌─────────┬────────────────────────────────────────────┬───────────────┐
│ _.names │ _.types │ _.scitypes │
├─────────┼────────────────────────────────────────────┼───────────────┤
│ Sex │ CategoricalArrays.CategoricalString{UInt8} │ Multiclass{2} │
│ Entry │ Int32 │ Count │
│ Exit │ Int32 │ Count │
│ Time │ Int32 │ Count │
│ Cens │ Int32 │ Count │
└─────────┴────────────────────────────────────────────┴───────────────┘
_.nrows = 462
Unpacking data and correcting for wrong scitypes:
y, X = unpack(channing,
==(:Exit), # y is the :Exit column
!=(:Time); # X is the rest, except :Time
:Exit=>Continuous,
:Entry=>Continuous,
:Cens=>Multiclass)
first(X, 4)
| Row | Sex | Entry | Cens |
|---|---|---|---|
| | Categorical… | Float64 | Categorical… |
| 1 | Male | 782.0 | 1 |
| 2 | Male | 1020.0 | 1 |
| 3 | Male | 856.0 | 1 |
| 4 | Male | 915.0 | 1 |
Note: Before Julia 1.2, replace !=(:Time) with col -> col != :Time.
y[1:4]
4-element Array{Float64,1}:
909.0
1128.0
969.0
957.0
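The predicate-based selection above can be sketched in plain Julia. This is a minimal illustration, not MLJ's unpack: split_columns is a hypothetical helper, it works on a named tuple of columns only, and it assumes each predicate draws from the column names not yet taken by an earlier predicate. Unlike the call above, it returns a one-column named tuple for y, so the column is extracted by name afterwards.

```julia
# Hypothetical helper (not part of MLJ): split a named tuple of columns into
# groups, each predicate selecting from the column names still unclaimed.
function split_columns(table::NamedTuple, predicates...)
    remaining = collect(keys(table))
    groups = NamedTuple[]
    for pred in predicates
        chosen = filter(pred, remaining)                # names this predicate selects
        filter!(name -> !(name in chosen), remaining)   # take them out of the pool
        push!(groups, NamedTuple{Tuple(chosen)}(Tuple(table[name] for name in chosen)))
    end
    return groups
end

# First two rows of the channing data shown above
table = (Sex = ["Male", "Male"], Entry = [782, 1020], Exit = [909, 1128],
         Time = [127, 108], Cens = [1, 1])

ycols, X = split_columns(table, ==(:Exit), !=(:Time))
y = float.(ycols.Exit)   # stands in for the :Exit => Continuous coercion
```

Because :Exit is claimed by the first predicate, the second predicate only has to exclude :Time.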
Loading a built-in supervised dataset:
X, y = @load_iris;
selectrows(X, 1:4) # selectrows works for any Tables.jl table
(sepal_length = [5.1, 4.9, 4.7, 4.6],
sepal_width = [3.5, 3.0, 3.2, 3.1],
petal_length = [1.4, 1.4, 1.3, 1.5],
petal_width = [0.2, 0.2, 0.2, 0.2],)
y[1:4]
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"setosa"
"setosa"
"setosa"
"setosa"
Model search (experimental)
Reference: Model Search
Searching for a supervised model:
X, y = @load_boston
models(matching(X, y))
49-element Array{NamedTuple,1}:
(name = ARDRegressor, package_name = ScikitLearn, ... )
(name = AdaBoostRegressor, package_name = ScikitLearn, ... )
(name = BaggingRegressor, package_name = ScikitLearn, ... )
(name = BayesianRidgeRegressor, package_name = ScikitLearn, ... )
(name = ConstantRegressor, package_name = MLJModels, ... )
(name = DecisionTreeRegressor, package_name = DecisionTree, ... )
(name = DeterministicConstantRegressor, package_name = MLJModels, ... )
(name = DummyRegressor, package_name = ScikitLearn, ... )
(name = ElasticNetCVRegressor, package_name = ScikitLearn, ... )
(name = ElasticNetRegressor, package_name = MLJLinearModels, ... )
⋮
(name = RidgeRegressor, package_name = MultivariateStats, ... )
(name = RidgeRegressor, package_name = ScikitLearn, ... )
(name = RobustRegressor, package_name = MLJLinearModels, ... )
(name = SGDRegressor, package_name = ScikitLearn, ... )
(name = SVMLRegressor, package_name = ScikitLearn, ... )
(name = SVMNuRegressor, package_name = ScikitLearn, ... )
(name = SVMRegressor, package_name = ScikitLearn, ... )
(name = TheilSenRegressor, package_name = ScikitLearn, ... )
(name = XGBoostRegressor, package_name = XGBoost, ... )
models(matching(X, y))[6]
CART decision tree regressor.
→ based on [DecisionTree](https://github.com/bensadeghi/DecisionTree.jl).
→ do `@load DecisionTreeRegressor pkg="DecisionTree"` to use the model.
→ do `?DecisionTreeRegressor` for documentation.
(name = "DecisionTreeRegressor",
package_name = "DecisionTree",
is_supervised = true,
docstring = "CART decision tree regressor.\n→ based on [DecisionTree](https://github.com/bensadeghi/DecisionTree.jl).\n→ do `@load DecisionTreeRegressor pkg=\"DecisionTree\"` to use the model.\n→ do `?DecisionTreeRegressor` for documentation.",
hyperparameter_ranges = (nothing, nothing, nothing, nothing, nothing, nothing, nothing),
hyperparameter_types = ("Int64", "Int64", "Int64", "Float64", "Int64", "Bool", "Float64"),
hyperparameters = (:max_depth, :min_samples_leaf, :min_samples_split, :min_purity_increase, :n_subfeatures, :post_prune, :merge_purity_threshold),
implemented_methods = Symbol[:fit, :predict, :fitted_params],
is_pure_julia = true,
is_wrapper = false,
load_path = "MLJModels.DecisionTree_.DecisionTreeRegressor",
package_license = "MIT",
package_url = "https://github.com/bensadeghi/DecisionTree.jl",
package_uuid = "7806a523-6efd-50cb-b5f6-3fa6f1930dbb",
prediction_type = :deterministic,
supports_online = false,
supports_weights = false,
input_scitype = Table{_s13} where _s13<:Union{AbstractArray{_s12,1} where _s12<:Continuous, AbstractArray{_s12,1} where _s12<:Count, AbstractArray{_s12,1} where _s12<:OrderedFactor},
target_scitype = AbstractArray{Continuous,1},)
More refined searches:
models() do model
matching(model, X, y) &&
model.prediction_type == :deterministic &&
model.is_pure_julia
end
13-element Array{NamedTuple,1}:
(name = DecisionTreeRegressor, package_name = DecisionTree, ... )
(name = DeterministicConstantRegressor, package_name = MLJModels, ... )
(name = ElasticNetRegressor, package_name = MLJLinearModels, ... )
(name = HuberRegressor, package_name = MLJLinearModels, ... )
(name = KNNRegressor, package_name = NearestNeighbors, ... )
(name = LADRegressor, package_name = MLJLinearModels, ... )
(name = LassoRegressor, package_name = MLJLinearModels, ... )
(name = LinearRegressor, package_name = MLJLinearModels, ... )
(name = QuantileRegressor, package_name = MLJLinearModels, ... )
(name = RandomForestRegressor, package_name = DecisionTree, ... )
(name = RidgeRegressor, package_name = MLJLinearModels, ... )
(name = RidgeRegressor, package_name = MultivariateStats, ... )
(name = RobustRegressor, package_name = MLJLinearModels, ... )
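The do-block form used above is ordinary Julia syntax for passing an anonymous function as the first argument. A minimal sketch with Base's filter, assuming a small hand-written list of metadata entries (the field values below are illustrative, not read from the model registry):

```julia
# Illustrative metadata entries; the field values are assumptions for this demo.
entries = [
    (name = "DecisionTreeRegressor", prediction_type = :deterministic, is_pure_julia = true),
    (name = "ConstantRegressor",     prediction_type = :probabilistic, is_pure_julia = true),
    (name = "SGDRegressor",          prediction_type = :deterministic, is_pure_julia = false),
]

# Keep only the entries for which the block returns true.
selected = filter(entries) do model
    model.prediction_type == :deterministic && model.is_pure_julia
end

selected_names = [m.name for m in selected]
```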
Searching for an unsupervised model:
models(matching(X))
11-element Array{NamedTuple,1}:
(name = FeatureSelector, package_name = MLJModels, ... )
(name = FillImputer, package_name = MLJModels, ... )
(name = ICA, package_name = MultivariateStats, ... )
(name = KMeans, package_name = Clustering, ... )
(name = KMedoids, package_name = Clustering, ... )
(name = KernelPCA, package_name = MultivariateStats, ... )
(name = OneClassSVM, package_name = LIBSVM, ... )
(name = OneHotEncoder, package_name = MLJModels, ... )
(name = PCA, package_name = MultivariateStats, ... )
(name = Standardizer, package_name = MLJModels, ... )
(name = StaticTransformer, package_name = MLJBase, ... )
Getting the metadata entry for a given model type:
info("PCA")
info("RidgeRegressor", pkg="MultivariateStats") # a model type in multiple packages
Ridge regressor with regularization parameter lambda. Learns a linear regression with a penalty on the l2 norm of the coefficients.
→ based on [MultivariateStats](https://github.com/JuliaStats/MultivariateStats.jl).
→ do `@load RidgeRegressor pkg="MultivariateStats"` to use the model.
→ do `?RidgeRegressor` for documentation.
(name = "RidgeRegressor",
package_name = "MultivariateStats",
is_supervised = true,
docstring = "Ridge regressor with regularization parameter lambda. Learns a linear regression with a penalty on the l2 norm of the coefficients.\n→ based on [MultivariateStats](https://github.com/JuliaStats/MultivariateStats.jl).\n→ do `@load RidgeRegressor pkg=\"MultivariateStats\"` to use the model.\n→ do `?RidgeRegressor` for documentation.",
hyperparameter_ranges = (nothing,),
hyperparameter_types = ("Real",),
hyperparameters = (:lambda,),
implemented_methods = Symbol[:fit, :predict, :fitted_params],
is_pure_julia = true,
is_wrapper = false,
load_path = "MLJModels.MultivariateStats_.RidgeRegressor",
package_license = "MIT",
package_url = "https://github.com/JuliaStats/MultivariateStats.jl",
package_uuid = "6f286f6a-111f-5878-ab1e-185364afe411",
prediction_type = :deterministic,
supports_online = false,
supports_weights = false,
input_scitype = Table{_s13} where _s13<:(AbstractArray{_s12,1} where _s12<:Continuous),
target_scitype = AbstractArray{Continuous,1},)
Instantiating a model
Reference: Getting Started
@load DecisionTreeClassifier
model = DecisionTreeClassifier(min_samples_split=5, max_depth=4)
DecisionTreeClassifier(
max_depth = 4,
min_samples_leaf = 1,
min_samples_split = 5,
min_purity_increase = 0.0,
n_subfeatures = 0,
post_prune = false,
merge_purity_threshold = 1.0,
pdf_smoothing = 0.0,
display_depth = 5) @ 9…94
or
model = @load DecisionTreeClassifier
model.min_samples_split = 5
model.max_depth = 4
Evaluating a model
Reference: Evaluating Model Performance
X, y = @load_boston
model = @load KNNRegressor
evaluate(model, X, y, resampling=CV(nfolds=5), measure=[rms, mav])
┌───────────┬───────────────┬────────────────────────────────┐
│ _.measure │ _.measurement │ _.per_fold │
├───────────┼───────────────┼────────────────────────────────┤
│ rms │ 8.82 │ [8.53, 8.52, 10.7, 9.39, 6.32] │
│ mav │ 6.07 │ [6.49, 5.43, 7.61, 6.03, 4.79] │
└───────────┴───────────────┴────────────────────────────────┘
_.per_observation = [missing, missing]
Basic fit/evaluate/predict by hand
Reference: Getting Started, Machines, Evaluating Model Performance, Performance Measures
using RDatasets
vaso = dataset("robustbase", "vaso"); # a DataFrame
first(vaso, 3)
| Row | Volume | Rate | Y |
|---|---|---|---|
| | Float64 | Float64 | Int64 |
| 1 | 3.7 | 0.825 | 1 |
| 2 | 3.5 | 1.09 | 1 |
| 3 | 1.25 | 2.5 | 1 |
y, X = unpack(vaso, ==(:Y), c -> true; :Y => Multiclass)
tree_model = @load DecisionTreeClassifier
┌ Info: A model type "DecisionTreeClassifier" is already loaded.
└ No new code loaded.
Bind the model and data together in a machine, which will additionally store the learned parameters (fitresults) when fit:
tree = machine(tree_model, X, y)
Machine{DecisionTreeClassifier} @ 1…78
Split row indices into training and evaluation rows:
train, test = partition(eachindex(y), 0.7, shuffle=true, rng=1234); # 70:30 split
([27, 28, 30, 31, 32, 18, 21, 9, 26, 14 … 7, 39, 2, 37, 1, 8, 19, 25, 35, 34], [22, 13, 11, 4, 10, 16, 3, 20, 29, 23, 12, 24])
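The idea behind a shuffled split of row indices can be sketched with the standard library alone. split_indices is a hypothetical helper, not MLJ's partition; the rounding rule is an assumption, and the shuffled order will generally differ from the output shown above.

```julia
using Random

# Hypothetical helper (not MLJ's partition): shuffle the row indices with a
# seeded RNG, then cut them at the requested fraction.
function split_indices(rows, fraction; rng=MersenneTwister(1234))
    shuffled = shuffle(rng, collect(rows))
    ntrain = round(Int, fraction * length(shuffled))   # rounding rule is an assumption
    return shuffled[1:ntrain], shuffled[ntrain+1:end]
end

train_rows, test_rows = split_indices(1:39, 0.7)
```

With 39 rows this gives 27 training and 12 evaluation indices, the same sizes as in the output above.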
Fit on train and evaluate on test:
fit!(tree, rows=train)
yhat = predict(tree, rows=test);
mean(cross_entropy(yhat, y[test]))
6.5216583816514975
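To see where such a large cross-entropy can come from, here is a sketch of the per-observation loss: minus the log of the probability assigned to the observed class. The clamping of probabilities to [eps, 1 - eps] is an assumption, chosen because it is consistent with the eps = 2.22e-16 and the per-observation values 36.0 and 2.22e-16 displayed later on this page. toy_cross_entropy is a hypothetical name.

```julia
using Statistics

# Hypothetical helper: -log of the probability given to the observed class,
# with the probability clamped to [tol, 1 - tol] (clamping is an assumption).
toy_cross_entropy(p; tol=eps(Float64)) = -log(clamp(p, tol, 1 - tol))

# Probability each prediction assigned to the class that was actually observed
p_observed = [0.9, 0.0, 0.273, 1.0]

losses = toy_cross_entropy.(p_observed)
mean_loss = mean(losses)
```

A single confident but wrong prediction (probability 0 for the observed class) contributes about 36 to the sum, which dominates the mean on a small test set.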
Predict on new data:
Xnew = (Volume=3*rand(3), Rate=3*rand(3))
predict(tree, Xnew) # a vector of distributions
3-element Array{UnivariateFinite{Int64,UInt32,Float64},1}:
UnivariateFinite(0=>0.0, 1=>1.0)
UnivariateFinite(0=>0.273, 1=>0.727)
UnivariateFinite(0=>0.273, 1=>0.727)
predict_mode(tree, Xnew) # a vector of point-predictions
3-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
1
1
1
More performance evaluation examples
import LossFunctions.ZeroOneLoss
Evaluating model + data directly:
evaluate(tree_model, X, y,
resampling=Holdout(fraction_train=0.7, shuffle=true, rng=1234),
measure=[cross_entropy, ZeroOneLoss()])
┌───────────────┬───────────────┬────────────┐
│ _.measure │ _.measurement │ _.per_fold │
├───────────────┼───────────────┼────────────┤
│ cross_entropy │ 6.52 │ [6.52] │
│ ZeroOneLoss │ 0.417 │ [0.417] │
└───────────────┴───────────────┴────────────┘
_.per_observation = [[[0.105, 36.0, ..., 1.3]], [[0.0, 1.0, ..., 1.0]]]
If a machine is already defined, as above:
evaluate!(tree,
resampling=Holdout(fraction_train=0.7, shuffle=true, rng=1234),
measure=[cross_entropy, ZeroOneLoss()])
┌───────────────┬───────────────┬────────────┐
│ _.measure │ _.measurement │ _.per_fold │
├───────────────┼───────────────┼────────────┤
│ cross_entropy │ 6.52 │ [6.52] │
│ ZeroOneLoss │ 0.417 │ [0.417] │
└───────────────┴───────────────┴────────────┘
_.per_observation = [[[0.105, 36.0, ..., 1.3]], [[0.0, 1.0, ..., 1.0]]]
Using cross-validation:
evaluate!(tree, resampling=CV(nfolds=5, shuffle=true, rng=1234),
measure=[cross_entropy, ZeroOneLoss()])
┌───────────────┬───────────────┬───────────────────────────────────┐
│ _.measure │ _.measurement │ _.per_fold │
├───────────────┼───────────────┼───────────────────────────────────┤
│ cross_entropy │ 3.91 │ [10.6, 0.676, 0.495, 0.717, 7.11] │
│ ZeroOneLoss │ 0.377 │ [0.571, 0.429, 0.0, 0.429, 0.455] │
└───────────────┴───────────────┴───────────────────────────────────┘
_.per_observation = [[[2.22e-16, 0.944, ..., 0.944], [1.23, 2.22e-16, ..., 0.345], [0.693, 0.693, ..., 0.693], [0.363, 1.19, ..., 1.19], [36.0, 0.0953, ..., 1.3]], [[0.0, 1.0, ..., 1.0], [1.0, 0.0, ..., 0.0], [0.0, 0.0, ..., 0.0], [0.0, 1.0, ..., 1.0], [1.0, 0.0, ..., 1.0]]]
With user-specified train/test pairs of row indices:
f1, f2, f3 = 1:13, 14:26, 27:36
pairs = [(f1, vcat(f2, f3)), (f2, vcat(f3, f1)), (f3, vcat(f1, f2))];
evaluate!(tree,
resampling=pairs,
measure=[cross_entropy, ZeroOneLoss()])
┌───────────────┬───────────────┬───────────────────────┐
│ _.measure │ _.measurement │ _.per_fold │
├───────────────┼───────────────┼───────────────────────┤
│ cross_entropy │ 5.88 │ [2.16, 11.0, 4.51] │
│ ZeroOneLoss │ 0.241 │ [0.304, 0.304, 0.115] │
└───────────────┴───────────────┴───────────────────────┘
_.per_observation = [[[0.154, 0.154, ..., 0.154], [2.22e-16, 36.0, ..., 2.22e-16], [2.22e-16, 2.22e-16, ..., 0.693]], [[0.0, 0.0, ..., 0.0], [0.0, 1.0, ..., 0.0], [0.0, 0.0, ..., 0.0]]]
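Pairs like these can also be generated rather than written by hand. A minimal sketch, assuming contiguous folds and the same (train, test) ordering as the pairs above; kfold_pairs is a hypothetical helper and makes no claim about how CV builds its folds internally. Here each fold serves once as the test rows, with the remaining rows used for training.

```julia
# Hypothetical helper: cut 1:n into k contiguous folds and pair each fold
# (as test rows) with all remaining rows (as training rows).
function kfold_pairs(n, k)
    bounds = round.(Int, range(0, stop=n, length=k + 1))
    folds = [(bounds[i] + 1):bounds[i + 1] for i in 1:k]
    return [(setdiff(1:n, fold), collect(fold)) for fold in folds]
end

pairs3 = kfold_pairs(36, 3)   # three (train, test) tuples over rows 1:36
```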
Changing a hyperparameter and re-evaluating:
tree_model.max_depth = 3
evaluate!(tree,
resampling=CV(nfolds=5, shuffle=true, rng=1234),
measure=[cross_entropy, ZeroOneLoss()])
┌───────────────┬───────────────┬─────────────────────────────────────┐
│ _.measure │ _.measurement │ _.per_fold │
├───────────────┼───────────────┼─────────────────────────────────────┤
│ cross_entropy │ 5.47 │ [10.5, 0.532, 0.389, 5.63, 10.3] │
│ ZeroOneLoss │ 0.377 │ [0.429, 0.429, 0.143, 0.429, 0.455] │
└───────────────┴───────────────┴─────────────────────────────────────┘
_.per_observation = [[[2.22e-16, 1.32, ..., 2.22e-16], [0.887, 2.22e-16, ..., 2.22e-16], [0.405, 0.405, ..., 1.1], [2.22e-16, 0.827, ..., 0.827], [36.0, 0.288, ..., 2.2]], [[0.0, 1.0, ..., 0.0], [1.0, 0.0, ..., 0.0], [0.0, 0.0, ..., 1.0], [0.0, 1.0, ..., 1.0], [1.0, 0.0, ..., 1.0]]]
Inspecting training results
Fit an ordinary least squares model to some synthetic data:
x1 = rand(100)
x2 = rand(100)
X = (x1=x1, x2=x2)
y = x1 - 2x2 + 0.1*rand(100);
ols_model = @load LinearRegressor pkg=GLM
ols = machine(ols_model, X, y)
fit!(ols)
Machine{LinearRegressor} @ 5…32
Get a named tuple representing the learned parameters, human-readable if appropriate:
fitted_params(ols)
(coef = [1.014856406802105, -2.0099582504333346],
intercept = 0.05167929582996633,)
Get other training-related information:
report(ols)
(deviance = 0.08029492971630983,
dof_residual = 97.0,
stderror = [0.009734201944682372, 0.009233448013839254, 0.007113896443524906],
vcov = [9.475468749985808e-5 -6.992301318089999e-6 -4.205649743475585e-5; -6.992301318089999e-6 8.525656222427209e-5 -4.167621342137693e-5; -4.205649743475585e-5 -4.167621342137693e-5 5.060752260919631e-5],)
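As a cross-check on what the learned parameters mean, the same kind of fit can be sketched by hand with a design matrix and the backslash operator. This is not how GLM computes its solution; the seeded RNG is an assumption added for repeatability, so the numbers will differ from those above.

```julia
using LinearAlgebra, Random

rng = MersenneTwister(0)              # seeded so the run is repeatable
x1 = rand(rng, 100)
x2 = rand(rng, 100)
y = x1 .- 2 .* x2 .+ 0.1 .* rand(rng, 100)

A = hcat(x1, x2, ones(100))           # design matrix with an intercept column
beta = A \ y                          # least-squares solution
coef, intercept = beta[1:2], beta[3]
```

The coefficients land near 1 and -2, and the intercept near 0.05, the mean of the uniform noise term 0.1*rand.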
Basic fit/transform for unsupervised models
Load data:
X, y = @load_iris
train, test = partition(eachindex(y), 0.97, shuffle=true, rng=123)
([125, 100, 130, 9, 70, 148, 39, 64, 6, 107 … 110, 59, 139, 21, 112, 144, 140, 72, 109, 41], [106, 147, 47, 5])
Instantiate and fit the model/machine:
@load PCA
pca_model = PCA(maxoutdim=2)
pca = machine(pca_model, X)
fit!(pca, rows=train)
Machine{PCA} @ 1…98
Transform selected data bound to the machine:
transform(pca, rows=test);
(x1 = [-3.3942826854483243, -1.5219827578765068, 2.538247455185219, 2.7299639893931373],
x2 = [0.5472450223745241, -0.36842368617126214, 0.5199299511335698, 0.3448466122232363],)
Transform new data:
Xnew = (sepal_length=rand(3), sepal_width=rand(3),
petal_length=rand(3), petal_width=rand(3));
transform(pca, Xnew)
(x1 = [5.020946674915796, 4.485504558406022, 4.465166636994135],
x2 = [-4.9586673813068005, -5.214435617710437, -4.645350477231649],)
Inverting learned transformations
y = rand(100);
stand_model = UnivariateStandardizer()
stand = machine(stand_model, y)
fit!(stand)
z = transform(stand, y);
@assert inverse_transform(stand, z) ≈ y # true
[ Info: Training Machine{UnivariateStandardizer} @ 6…38.
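The round trip can be sketched by hand, assuming standardization means subtracting the mean and dividing by the sample standard deviation (whether UnivariateStandardizer uses exactly this estimator is an assumption):

```julia
using Statistics

y = [2.0, 4.0, 4.0, 6.0]
mu, sigma = mean(y), std(y)       # the "learned" parameters

z = (y .- mu) ./ sigma            # transform
y_back = z .* sigma .+ mu         # inverse transform
```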
Nested hyperparameter tuning
Reference: Tuning Models
Define a model with nested hyperparameters:
X, y = @load_iris; # reload iris, since y was overwritten in the previous section
tree_model = @load DecisionTreeClassifier
forest_model = EnsembleModel(atom=tree_model, n=300)
ProbabilisticEnsembleModel(
atom = DecisionTreeClassifier(
max_depth = -1,
min_samples_leaf = 1,
min_samples_split = 2,
min_purity_increase = 0.0,
n_subfeatures = 0,
post_prune = false,
merge_purity_threshold = 1.0,
pdf_smoothing = 0.0,
display_depth = 5),
atomic_weights = Float64[],
bagging_fraction = 0.8,
rng = MersenneTwister(UInt32[0x47c03f76, 0x2c4136ea, 0x861b2316, 0x4ad3f251]) @ 103,
n = 300,
acceleration = ComputationalResources.CPU1{Nothing}(nothing),
out_of_bag_measure = Any[]) @ 8…40
Inspect all hyperparameters, even nested ones (returns nested named tuple):
params(forest_model)
(atom = (max_depth = -1,
min_samples_leaf = 1,
min_samples_split = 2,
min_purity_increase = 0.0,
n_subfeatures = 0,
post_prune = false,
merge_purity_threshold = 1.0,
pdf_smoothing = 0.0,
display_depth = 5,),
atomic_weights = Float64[],
bagging_fraction = 0.8,
rng = MersenneTwister(UInt32[0x47c03f76, 0x2c4136ea, 0x861b2316, 0x4ad3f251]) @ 103,
n = 300,
acceleration = ComputationalResources.CPU1{Nothing}(nothing),
out_of_bag_measure = Any[],)
Define ranges for hyperparameters to be tuned:
r1 = range(forest_model, :bagging_fraction, lower=0.5, upper=1.0, scale=:log10)
NumericRange(
field = :bagging_fraction,
lower = 0.5,
upper = 1.0,
origin = 0.75,
unit = 0.25,
scale = :log10) @ 1…04
r2 = range(forest_model, :(atom.n_subfeatures), lower=1, upper=4) # nested
NumericRange(
field = :(atom.n_subfeatures),
lower = 1,
upper = 4,
origin = 2.5,
unit = 1.5,
scale = :linear) @ 1…05
Wrap the model in a tuning strategy:
tuned_forest = TunedModel(model=forest_model,
tuning=Grid(resolution=12),
resampling=CV(nfolds=6),
ranges=[r1, r2],
measure=cross_entropy)
ProbabilisticTunedModel(
model = ProbabilisticEnsembleModel(
atom = DecisionTreeClassifier @ 1…63,
atomic_weights = Float64[],
bagging_fraction = 0.8,
rng = MersenneTwister(UInt32[0x47c03f76, 0x2c4136ea, 0x861b2316, 0x4ad3f251]) @ 103,
n = 300,
acceleration = ComputationalResources.CPU1{Nothing}(nothing),
out_of_bag_measure = Any[]),
tuning = Grid(
resolution = 12,
acceleration = ComputationalResources.CPU1{Nothing}(nothing)),
resampling = CV(
nfolds = 6,
shuffle = false,
rng = MersenneTwister(UInt32[0x47c03f76, 0x2c4136ea, 0x861b2316, 0x4ad3f251]) @ 103),
measure = CrossEntropy(
eps = 2.220446049250313e-16),
weights = nothing,
operation = StatsBase.predict,
ranges = NumericRange{MLJBase.Bounded,T,Symbol} where T[NumericRange @ 1…04, NumericRange @ 1…05],
full_report = true,
train_best = true,
repeats = 1) @ 9…34
Bind the wrapped model to data:
tuned = machine(tuned_forest, X, y)
Machine{ProbabilisticTunedModel} @ 9…74
Fitting the resultant machine optimizes the hyperparameters specified in ranges, using the specified tuning and resampling strategies and performance measure (possibly a vector of measures), and retrains on all data bound to the machine:
fit!(tuned)
Machine{ProbabilisticTunedModel} @ 9…74
Inspecting the optimal model:
F = fitted_params(tuned)
(best_model = ProbabilisticEnsembleModel{DecisionTreeClassifier} @ 8…47,
best_fitted_params = (fitresult = WrappedEnsemble @ 1…22,),)
F.best_model
ProbabilisticEnsembleModel(
atom = DecisionTreeClassifier(
max_depth = -1,
min_samples_leaf = 1,
min_samples_split = 2,
min_purity_increase = 0.0,
n_subfeatures = 3,
post_prune = false,
merge_purity_threshold = 1.0,
pdf_smoothing = 0.0,
display_depth = 5),
atomic_weights = Float64[],
bagging_fraction = 0.6040447222022236,
rng = MersenneTwister(UInt32[0x47c03f76, 0x2c4136ea, 0x861b2316, 0x4ad3f251]) @ 747,
n = 300,
acceleration = ComputationalResources.CPU1{Nothing}(nothing),
out_of_bag_measure = Any[]) @ 8…47
Inspecting details of tuning procedure:
report(tuned)
(parameter_names = ["bagging_fraction" "atom.n_subfeatures"],
parameter_scales = Symbol[:log10 :linear],
best_measurement = 0.15482036371032085,
best_report = (measures = Any[],
oob_measurements = missing,),
parameter_values = Any[0.5 1; 0.5325205447199813 1; … ; 0.9389309106617063 4; 1.0 4],
measurements = [0.21534391030896025, 0.21819242425407892, 0.21083086862471032, 0.21062331572020118, 0.20143827595698505, 0.21325662234887452, 0.1959983990972185, 0.2088293410624006, 0.22164521608178123, 0.19486003193571125 … 0.1673363008872999, 0.1712625485095091, 0.17545372860277597, 0.40470613366972313, 0.20284315870791336, 0.633737683291989, 0.43764175364915986, 0.6443458093091506, 0.877868832382005, 2.425319130101292],)
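The bagging_fraction values in parameter_values above (0.5, 0.5325…, …, 0.9389…, 1.0) are consistent with 12 points spaced evenly on a log10 axis. A sketch of that spacing, which is not necessarily how Grid generates its points:

```julia
# 12 points between 0.5 and 1.0, evenly spaced on a log10 axis
lower, upper, resolution = 0.5, 1.0, 12
grid = 10 .^ range(log10(lower), stop=log10(upper), length=resolution)
```

Consecutive points differ by a constant factor (here 2^(1/11)) rather than a constant step.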
Visualizing these results:
using Plots
plot(tuned)
Predicting on new data using the optimized model:
predict(tuned, Xnew)
3-element Array{UnivariateFinite{String,UInt32,Float64},1}:
UnivariateFinite(setosa=>1.0, versicolor=>0.0, virginica=>0.0)
UnivariateFinite(setosa=>0.5, versicolor=>0.46, virginica=>0.04)
UnivariateFinite(setosa=>0.903, versicolor=>0.08, virginica=>0.0167)
Constructing a linear pipeline
Reference: Composing Models
Constructing a linear (unbranching) pipeline with a learned target transformation/inverse transformation:
X, y = @load_reduced_ames
@load KNNRegressor
pipe = @pipeline MyPipe(X -> coerce(X, :age=>Continuous),
hot = OneHotEncoder(),
knn = KNNRegressor(K=3),
target = UnivariateStandardizer())
MyPipe(
hot = OneHotEncoder(
features = Symbol[],
drop_last = false,
ordered_factor = true),
knn = KNNRegressor(
K = 3,
algorithm = :kdtree,
metric = Distances.Euclidean(0.0),
leafsize = 10,
reorder = true,
weights = :uniform),
target = UnivariateStandardizer()) @ 1…51
Evaluating the pipeline (just as you would any other model):
pipe.knn.K = 2
pipe.hot.drop_last = true
evaluate(pipe, X, y, resampling=Holdout(), measure=rms, verbosity=2)
┌───────────┬───────────────┬────────────┐
│ _.measure │ _.measurement │ _.per_fold │
├───────────┼───────────────┼────────────┤
│ rms │ 53100.0 │ [53100.0] │
└───────────┴───────────────┴────────────┘
_.per_observation = [missing]
Constructing a linear (unbranching) pipeline with a static (unlearned) target transformation/inverse transformation:
@load DecisionTreeRegressor
pipe2 = @pipeline MyPipe2(X -> coerce(X, :age=>Continuous),
hot = OneHotEncoder(),
tree = DecisionTreeRegressor(max_depth=4),
target = y -> log.(y),
inverse = z -> exp.(z))
MyPipe2(
hot = OneHotEncoder(
features = Symbol[],
drop_last = false,
ordered_factor = true),
tree = DecisionTreeRegressor(
max_depth = 4,
min_samples_leaf = 5,
min_samples_split = 2,
min_purity_increase = 0.0,
n_subfeatures = 0,
post_prune = false,
merge_purity_threshold = 1.0),
target = StaticTransformer(
f = getfield(Main.ex-workflows, Symbol("##24#25"))()),
inverse = StaticTransformer(
f = getfield(Main.ex-workflows, Symbol("##26#27"))())) @ 3…08
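What the target and inverse functions do can be sketched without a pipeline: the model is trained on the transformed target, and its predictions are mapped back through the inverse. The toy model below, which always predicts the training mean, is an assumption made only to keep the example self-contained.

```julia
using Statistics

target  = y -> log.(y)     # applied to the training target
inverse = z -> exp.(z)     # applied to the model's predictions

# Toy "model" (an assumption for this demo): always predict the training mean.
fit_mean(ytrain) = mean(ytrain)
predict_mean(fitresult, n) = fill(fitresult, n)

y = [100.0, 1000.0, 10000.0]
fitresult = fit_mean(target(y))               # trained on log.(y)
yhat = inverse(predict_mean(fitresult, 2))    # predictions mapped back with exp
```

The prediction is 1000, the geometric mean of y, not the arithmetic mean 3700: the model averages on the log scale.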
Creating a homogeneous ensemble of models
Reference: Homogeneous Ensembles
X, y = @load_iris
tree_model = @load DecisionTreeClassifier
forest_model = EnsembleModel(atom=tree_model, bagging_fraction=0.8, n=300)
forest = machine(forest_model, X, y)
evaluate!(forest, measure=cross_entropy)
┌───────────────┬───────────────┬──────────────────────────────────────────────┐
│ _.measure │ _.measurement │ _.per_fold │
├───────────────┼───────────────┼──────────────────────────────────────────────┤
│ cross_entropy │ 0.629 │ [3.66e-15, 3.66e-15, 0.3, 1.61, 1.56, 0.301] │
└───────────────┴───────────────┴──────────────────────────────────────────────┘
_.per_observation = [[[3.66e-15, 3.66e-15, ..., 3.66e-15], [3.66e-15, 3.66e-15, ..., 3.66e-15], [0.0305, 0.00334, ..., 3.66e-15], [3.66e-15, 0.135, ..., 3.66e-15], [3.66e-15, 0.0339, ..., 3.66e-15], [0.0339, 0.483, ..., 0.0583]]]
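The roles of n and bagging_fraction can be sketched with a toy ensemble. Everything here is an assumption for illustration: the atom simply predicts the mean of its training target, rows are drawn without replacement, and predictions are combined by averaging. It is not EnsembleModel's implementation.

```julia
using Random, Statistics

# Toy atom (an assumption for this demo): predicts the mean of its training target.
train_atom(y) = mean(y)

# Train n atoms, each on a random subset holding bagging_fraction of the rows.
function bagged_fit(y; n=300, bagging_fraction=0.8, rng=MersenneTwister(1))
    m = round(Int, bagging_fraction * length(y))    # rows seen by each atom
    # randperm gives a random ordering; the first m entries are a sample
    # without replacement (the sampling scheme is an assumption)
    return [train_atom(y[randperm(rng, length(y))[1:m]]) for _ in 1:n]
end

bagged_predict(atoms) = mean(atoms)                 # average the atoms' predictions

y = collect(1.0:10.0)
atoms = bagged_fit(y; n=50)
prediction = bagged_predict(atoms)
```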
Performance curves
Generate a plot of performance, as a function of some hyperparameter (building on the preceding example):
r = range(forest_model, :n, lower=1, upper=1000, scale=:log10)
curve = MLJ.learning_curve!(forest,
range=r,
resampling=Holdout(),
measure=cross_entropy,
n=4,
verbosity=0)
(parameter_name = "n",
parameter_scale = :log10,
parameter_values = [1, 2, 3, 4, 5, 7, 9, 11, 14, 17 … 117, 149, 189, 240, 304, 386, 489, 621, 788, 1000],
measurements = [13.616491280333147 12.014551129705719 9.611640903764574 8.009700753137146; 8.256153084002902 4.1588830833596715 9.611640903764574 5.683806880591551; … ; 1.247976052355591 0.5782817230195656 1.2574856741696 1.2316361904504538; 1.2569091425094638 0.5924629399391862 1.2631028701131062 1.2318603821954464],)
using Plots
plot(curve.parameter_values, curve.measurements, xlab=curve.parameter_name, xscale=curve.parameter_scale)