Common MLJ Workflows
This demo assumes you have certain packages in your active package environment. To activate a new environment, "MyNewEnv", with just these packages, do this in a new REPL session:
using Pkg
Pkg.activate("MyNewEnv")
Pkg.add(["MLJ", "RDatasets", "DataFrames", "MLJDecisionTreeInterface",
"MLJMultivariateStatsInterface", "NearestNeighborModels", "MLJGLMInterface",
"Plots"])
The following starts MLJ and shows the current version of MLJ (you can also use Pkg.status()):
using MLJ
MLJ_VERSION
v"0.20.7"
Data ingestion
import RDatasets
channing = RDatasets.dataset("boot", "channing")
first(channing, 4) |> pretty
┌──────────────────────────────────┬───────┬───────┬───────┬───────┐
│ Sex │ Entry │ Exit │ Time │ Cens │
│ CategoricalValue{String, UInt32} │ Int32 │ Int32 │ Int32 │ Int32 │
│ Multiclass{2} │ Count │ Count │ Count │ Count │
├──────────────────────────────────┼───────┼───────┼───────┼───────┤
│ Male │ 782 │ 909 │ 127 │ 1 │
│ Male │ 1020 │ 1128 │ 108 │ 1 │
│ Male │ 856 │ 969 │ 113 │ 1 │
│ Male │ 915 │ 957 │ 42 │ 1 │
└──────────────────────────────────┴───────┴───────┴───────┴───────┘
Inspecting metadata, including column scientific types:
schema(channing)
┌───────┬───────────────┬──────────────────────────────────┐
│ names │ scitypes │ types │
├───────┼───────────────┼──────────────────────────────────┤
│ Sex │ Multiclass{2} │ CategoricalValue{String, UInt32} │
│ Entry │ Count │ Int32 │
│ Exit │ Count │ Int32 │
│ Time │ Count │ Int32 │
│ Cens │ Count │ Int32 │
└───────┴───────────────┴──────────────────────────────────┘
Horizontally splitting data and shuffling rows.
Here y is the :Exit column and X a table with everything else:
y, X = unpack(channing, ==(:Exit), rng=123)
Here y is the :Exit column and X everything else except :Time:
y, X = unpack(channing,
==(:Exit),
!=(:Time);
rng=123);
scitype(y)
AbstractVector{Count} (alias for AbstractArray{Count, 1})
schema(X)
┌───────┬───────────────┬──────────────────────────────────┐
│ names │ scitypes │ types │
├───────┼───────────────┼──────────────────────────────────┤
│ Sex │ Multiclass{2} │ CategoricalValue{String, UInt32} │
│ Entry │ Count │ Int32 │
│ Cens │ Count │ Int32 │
└───────┴───────────────┴──────────────────────────────────┘
Fixing wrong scientific types in X:
X = coerce(X, :Entry=>Continuous, :Cens=>Multiclass);
schema(X)
┌───────┬───────────────┬──────────────────────────────────┐
│ names │ scitypes │ types │
├───────┼───────────────┼──────────────────────────────────┤
│ Sex │ Multiclass{2} │ CategoricalValue{String, UInt32} │
│ Entry │ Continuous │ Float64 │
│ Cens │ Multiclass{2} │ CategoricalValue{Int32, UInt32} │
└───────┴───────────────┴──────────────────────────────────┘
Loading a built-in supervised dataset:
table = load_iris();
schema(table)
┌──────────────┬───────────────┬──────────────────────────────────┐
│ names │ scitypes │ types │
├──────────────┼───────────────┼──────────────────────────────────┤
│ sepal_length │ Continuous │ Float64 │
│ sepal_width │ Continuous │ Float64 │
│ petal_length │ Continuous │ Float64 │
│ petal_width │ Continuous │ Float64 │
│ target │ Multiclass{3} │ CategoricalValue{String, UInt32} │
└──────────────┴───────────────┴──────────────────────────────────┘
Loading a built-in data set already split into X and y:
X, y = @load_iris;
selectrows(X, 1:4) # selectrows works whenever `Tables.istable(X)==true`.
(sepal_length = [5.1, 4.9, 4.7, 4.6],
sepal_width = [3.5, 3.0, 3.2, 3.1],
petal_length = [1.4, 1.4, 1.3, 1.5],
petal_width = [0.2, 0.2, 0.2, 0.2],)
y[1:4]
4-element CategoricalArray{String,1,UInt32}:
"setosa"
"setosa"
"setosa"
"setosa"
Splitting data vertically after row shuffling:
channing_train, channing_test = partition(channing, 0.6, rng=123);
Or, if already horizontally split:
(Xtrain, Xtest), (ytrain, ytest) = partition((X, y), 0.6, multi=true, rng=123)
(((sepal_length = [6.5, 5.1, 6.3, 5.4, 5.5, 5.0, 5.0, 4.9, 6.5, 5.7 … 6.9, 6.3, 6.0, 7.6, 5.7, 5.5, 6.8, 6.1, 4.8, 5.4], sepal_width = [2.8, 3.5, 3.3, 3.9, 4.2, 2.0, 3.5, 3.1, 3.0, 2.8 … 3.1, 2.5, 2.2, 3.0, 2.8, 2.3, 3.2, 2.6, 3.1, 3.4], petal_length = [4.6, 1.4, 4.7, 1.7, 1.4, 3.5, 1.3, 1.5, 5.2, 4.5 … 5.1, 5.0, 4.0, 6.6, 4.1, 4.0, 5.9, 5.6, 1.6, 1.7], petal_width = [1.5, 0.2, 1.6, 0.4, 0.2, 1.0, 0.3, 0.1, 2.0, 1.3 … 2.3, 1.9, 1.0, 2.1, 1.3, 1.3, 2.3, 1.4, 0.2, 0.2]), (sepal_length = [5.4, 6.9, 4.4, 7.4, 5.7, 5.8, 5.9, 7.2, 4.9, 4.3 … 6.3, 6.4, 5.8, 4.4, 5.0, 6.4, 4.7, 6.2, 6.4, 6.5], sepal_width = [3.9, 3.2, 3.0, 2.8, 3.0, 2.7, 3.2, 3.0, 2.5, 3.0 … 2.5, 3.2, 2.7, 3.2, 3.3, 2.7, 3.2, 2.8, 2.9, 3.2], petal_length = [1.3, 5.7, 1.3, 6.1, 4.2, 5.1, 4.8, 5.8, 4.5, 1.1 … 4.9, 5.3, 4.1, 1.3, 1.4, 5.3, 1.6, 4.8, 4.3, 5.1], petal_width = [0.4, 2.3, 0.2, 1.9, 1.2, 1.9, 1.8, 1.6, 1.7, 0.1 … 1.5, 2.3, 1.0, 0.2, 0.2, 1.9, 0.2, 1.8, 1.3, 2.0])), (CategoricalValue{String, UInt32}["versicolor", "setosa", "versicolor", "setosa", "setosa", "versicolor", "setosa", "setosa", "virginica", "versicolor" … "virginica", "virginica", "versicolor", "virginica", "versicolor", "versicolor", "virginica", "virginica", "setosa", "setosa"], CategoricalValue{String, UInt32}["setosa", "virginica", "setosa", "virginica", "versicolor", "virginica", "versicolor", "virginica", "virginica", "setosa" … "versicolor", "virginica", "versicolor", "setosa", "setosa", "virginica", "setosa", "virginica", "versicolor", "virginica"]))
Model Search
Reference: Model Search
Searching for a supervised model:
X, y = @load_boston
ms = models(matching(X, y))
70-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :constructor, :deep_properties, :docstring, :fit_data_scitype, :human_name, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :reporting_operations, :reports_feature_importances, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :target_in_fit, :transform_scitype, :input_scitype, :target_scitype, :output_scitype)}}:
(name = ARDRegressor, package_name = MLJScikitLearnInterface, ... )
(name = AdaBoostRegressor, package_name = MLJScikitLearnInterface, ... )
(name = BaggingRegressor, package_name = MLJScikitLearnInterface, ... )
(name = BayesianRidgeRegressor, package_name = MLJScikitLearnInterface, ... )
(name = CatBoostRegressor, package_name = CatBoost, ... )
(name = ConstantRegressor, package_name = MLJModels, ... )
(name = DecisionTreeRegressor, package_name = BetaML, ... )
(name = DecisionTreeRegressor, package_name = DecisionTree, ... )
(name = DeterministicConstantRegressor, package_name = MLJModels, ... )
(name = DummyRegressor, package_name = MLJScikitLearnInterface, ... )
⋮
(name = SGDRegressor, package_name = MLJScikitLearnInterface, ... )
(name = SRRegressor, package_name = SymbolicRegression, ... )
(name = SVMLinearRegressor, package_name = MLJScikitLearnInterface, ... )
(name = SVMNuRegressor, package_name = MLJScikitLearnInterface, ... )
(name = SVMRegressor, package_name = MLJScikitLearnInterface, ... )
(name = StableForestRegressor, package_name = SIRUS, ... )
(name = StableRulesRegressor, package_name = SIRUS, ... )
(name = TheilSenRegressor, package_name = MLJScikitLearnInterface, ... )
(name = XGBoostRegressor, package_name = XGBoost, ... )
ms[6]
(name = "ConstantRegressor",
package_name = "MLJModels",
is_supervised = true,
abstract_type = Probabilistic,
constructor = nothing,
deep_properties = (),
docstring = "```\nConstantRegressor\n```\n\nThis \"dummy\" probabilis...",
fit_data_scitype = Tuple{Table, AbstractVector{Continuous}},
human_name = "constant regressor",
hyperparameter_ranges = (nothing,),
hyperparameter_types = ("Type{D} where D<:Distributions.Sampleable",),
hyperparameters = (:distribution_type,),
implemented_methods = [:fitted_params, :predict],
inverse_transform_scitype = Unknown,
is_pure_julia = true,
is_wrapper = false,
iteration_parameter = nothing,
load_path = "MLJModels.ConstantRegressor",
package_license = "MIT",
package_url = "https://github.com/JuliaAI/MLJModels.jl",
package_uuid = "d491faf4-2d78-11e9-2867-c94bc002c0b7",
predict_scitype = AbstractVector{ScientificTypesBase.Density{Continuous}},
prediction_type = :probabilistic,
reporting_operations = (),
reports_feature_importances = false,
supports_class_weights = false,
supports_online = false,
supports_training_losses = false,
supports_weights = false,
target_in_fit = true,
transform_scitype = Unknown,
input_scitype = Table,
target_scitype = AbstractVector{Continuous},
output_scitype = Unknown)
models("Tree")
28-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :constructor, :deep_properties, :docstring, :fit_data_scitype, :human_name, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :reporting_operations, :reports_feature_importances, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :target_in_fit, :transform_scitype, :input_scitype, :target_scitype, :output_scitype)}}:
(name = ABODDetector, package_name = OutlierDetectionNeighbors, ... )
(name = AdaBoostStumpClassifier, package_name = DecisionTree, ... )
(name = COFDetector, package_name = OutlierDetectionNeighbors, ... )
(name = DNNDetector, package_name = OutlierDetectionNeighbors, ... )
(name = DecisionTreeClassifier, package_name = BetaML, ... )
(name = DecisionTreeClassifier, package_name = DecisionTree, ... )
(name = DecisionTreeRegressor, package_name = BetaML, ... )
(name = DecisionTreeRegressor, package_name = DecisionTree, ... )
(name = EvoTreeClassifier, package_name = EvoTrees, ... )
(name = EvoTreeCount, package_name = EvoTrees, ... )
⋮
(name = LOFDetector, package_name = OutlierDetectionNeighbors, ... )
(name = MultitargetKNNClassifier, package_name = NearestNeighborModels, ... )
(name = MultitargetKNNRegressor, package_name = NearestNeighborModels, ... )
(name = OneRuleClassifier, package_name = OneRule, ... )
(name = RandomForestClassifier, package_name = BetaML, ... )
(name = RandomForestClassifier, package_name = DecisionTree, ... )
(name = RandomForestRegressor, package_name = BetaML, ... )
(name = RandomForestRegressor, package_name = DecisionTree, ... )
(name = SMOTENC, package_name = Imbalance, ... )
A more refined search:
models() do model
matching(model, X, y) &&
model.prediction_type == :deterministic &&
model.is_pure_julia
end;
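The returned entries are named tuples of model metadata, so specific fields can be extracted directly. For instance, to list just name/package pairs for the filtered models (a sketch, assuming the same X and y as above):

```julia
# Collect (name, package) pairs from a filtered model query (sketch):
filtered = models() do model
    matching(model, X, y) &&
        model.prediction_type == :deterministic &&
        model.is_pure_julia
end
[(m.name, m.package_name) for m in filtered]
```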
Searching for an unsupervised model:
models(matching(X))
63-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :constructor, :deep_properties, :docstring, :fit_data_scitype, :human_name, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :reporting_operations, :reports_feature_importances, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :target_in_fit, :transform_scitype, :input_scitype, :target_scitype, :output_scitype)}}:
(name = ABODDetector, package_name = OutlierDetectionNeighbors, ... )
(name = ABODDetector, package_name = OutlierDetectionPython, ... )
(name = AffinityPropagation, package_name = MLJScikitLearnInterface, ... )
(name = AgglomerativeClustering, package_name = MLJScikitLearnInterface, ... )
(name = AutoEncoder, package_name = BetaML, ... )
(name = Birch, package_name = MLJScikitLearnInterface, ... )
(name = BisectingKMeans, package_name = MLJScikitLearnInterface, ... )
(name = CBLOFDetector, package_name = OutlierDetectionPython, ... )
(name = CDDetector, package_name = OutlierDetectionPython, ... )
(name = COFDetector, package_name = OutlierDetectionNeighbors, ... )
⋮
(name = RODDetector, package_name = OutlierDetectionPython, ... )
(name = RandomForestImputer, package_name = BetaML, ... )
(name = SODDetector, package_name = OutlierDetectionPython, ... )
(name = SOSDetector, package_name = OutlierDetectionPython, ... )
(name = SelfOrganizingMap, package_name = SelfOrganizingMaps, ... )
(name = SimpleImputer, package_name = BetaML, ... )
(name = SpectralClustering, package_name = MLJScikitLearnInterface, ... )
(name = Standardizer, package_name = MLJModels, ... )
(name = TSVDTransformer, package_name = TSVD, ... )
Getting the metadata entry for a given model type:
info("PCA")
info("RidgeRegressor", pkg="MultivariateStats") # a model type in multiple packages
(name = "RidgeRegressor",
package_name = "MultivariateStats",
is_supervised = true,
abstract_type = Deterministic,
constructor = nothing,
deep_properties = (),
docstring = "```\nRidgeRegressor\n```\n\nA model type for construct...",
fit_data_scitype =
Tuple{Table{<:AbstractVector{<:Continuous}}, AbstractVector{Continuous}},
human_name = "ridge regressor",
hyperparameter_ranges = (nothing, nothing),
hyperparameter_types = ("Union{Real, AbstractVecOrMat}", "Bool"),
hyperparameters = (:lambda, :bias),
implemented_methods = [:clean!, :fit, :fitted_params, :predict],
inverse_transform_scitype = Unknown,
is_pure_julia = true,
is_wrapper = false,
iteration_parameter = nothing,
load_path = "MLJMultivariateStatsInterface.RidgeRegressor",
package_license = "MIT",
package_url = "https://github.com/JuliaStats/MultivariateStats.jl",
package_uuid = "6f286f6a-111f-5878-ab1e-185364afe411",
predict_scitype = AbstractVector{Continuous},
prediction_type = :deterministic,
reporting_operations = (),
reports_feature_importances = false,
supports_class_weights = false,
supports_online = false,
supports_training_losses = false,
supports_weights = false,
target_in_fit = true,
transform_scitype = Unknown,
input_scitype = Table{<:AbstractVector{<:Continuous}},
target_scitype = AbstractVector{Continuous},
output_scitype = Unknown)
Extracting the model document string (output omitted):
doc("DecisionTreeClassifier", pkg="DecisionTree")
Instantiating a model
Reference: Getting Started, Loading Model Code
Assumes MLJDecisionTreeInterface is in your environment. Otherwise, try interactive loading with @iload:
Tree = @load DecisionTreeClassifier pkg=DecisionTree
tree = Tree(min_samples_split=5, max_depth=4)
DecisionTreeClassifier(
max_depth = 4,
min_samples_leaf = 1,
min_samples_split = 5,
min_purity_increase = 0.0,
n_subfeatures = 0,
post_prune = false,
merge_purity_threshold = 1.0,
display_depth = 5,
feature_importance = :impurity,
rng = Random.TaskLocalRNG())
or
tree = (@load DecisionTreeClassifier)()
tree.min_samples_split = 5
tree.max_depth = 4
Evaluating a model
Reference: Evaluating Model Performance
X, y = @load_boston # a table and a vector
KNN = @load KNNRegressor
knn = KNN()
evaluate(knn, X, y,
resampling=CV(nfolds=5),
measure=[RootMeanSquaredError(), LPLoss(1)])
PerformanceEvaluation object with these fields:
model, measure, operation,
measurement, per_fold, per_observation,
fitted_params_per_fold, report_per_fold,
train_test_rows, resampling, repeats
Extract:
┌───┬────────────────────────┬───────────┬─────────────┐
│ │ measure │ operation │ measurement │
├───┼────────────────────────┼───────────┼─────────────┤
│ A │ RootMeanSquaredError() │ predict │ 8.77 │
│ B │ LPLoss( │ predict │ 6.02 │
│ │ p = 1) │ │ │
└───┴────────────────────────┴───────────┴─────────────┘
┌───┬───────────────────────────────┬─────────┐
│ │ per_fold │ 1.96*SE │
├───┼───────────────────────────────┼─────────┤
│ A │ [8.53, 8.8, 10.7, 9.43, 5.59] │ 1.84 │
│ B │ [6.52, 5.7, 7.65, 6.09, 4.11] │ 1.26 │
└───┴───────────────────────────────┴─────────┘
Note RootMeanSquaredError() has the alias rms, and LPLoss(1) has the aliases l1 and mae. Run measures() to list all losses and scores and their aliases, or refer to the StatisticalMeasures.jl docs.
Basic fit/evaluate/predict by hand
Reference: Getting Started, Machines, Evaluating Model Performance, Performance Measures
crabs = load_crabs() |> DataFrames.DataFrame
schema(crabs)
┌───────┬───────────────┬──────────────────────────────────┐
│ names │ scitypes │ types │
├───────┼───────────────┼──────────────────────────────────┤
│ sp │ Multiclass{2} │ CategoricalValue{String, UInt32} │
│ sex │ Multiclass{2} │ CategoricalValue{String, UInt32} │
│ index │ Count │ Int64 │
│ FL │ Continuous │ Float64 │
│ RW │ Continuous │ Float64 │
│ CL │ Continuous │ Float64 │
│ CW │ Continuous │ Float64 │
│ BD │ Continuous │ Float64 │
└───────┴───────────────┴──────────────────────────────────┘
y, X = unpack(crabs, ==(:sp), !in([:index, :sex]); rng=123)
Tree = @load DecisionTreeClassifier pkg=DecisionTree
tree = Tree(max_depth=2)
DecisionTreeClassifier(
max_depth = 2,
min_samples_leaf = 1,
min_samples_split = 2,
min_purity_increase = 0.0,
n_subfeatures = 0,
post_prune = false,
merge_purity_threshold = 1.0,
display_depth = 5,
feature_importance = :impurity,
rng = Random.TaskLocalRNG())
Bind the model and data together in a machine which, when fit, will additionally store the learned parameters (fitresults):
mach = machine(tree, X, y)
untrained Machine; caches model-specific representations of data
model: DecisionTreeClassifier(max_depth = 2, …)
args:
1: Source @490 ⏎ Table{AbstractVector{Continuous}}
2: Source @167 ⏎ AbstractVector{Multiclass{2}}
Split row indices into training and evaluation rows:
train, test = partition(eachindex(y), 0.7); # 70:30 split
([1, 2, 3, 4, 5, 6, 7, 8, 9, 10 … 131, 132, 133, 134, 135, 136, 137, 138, 139, 140], [141, 142, 143, 144, 145, 146, 147, 148, 149, 150 … 191, 192, 193, 194, 195, 196, 197, 198, 199, 200])
Fit on the train data set and evaluate on the test data set:
fit!(mach, rows=train)
yhat = predict(mach, X[test,:])
LogLoss(tol=1e-4)(yhat, y[test])
0.5902424966321888
Note LogLoss() has the aliases log_loss and cross_entropy.
Predict on a new data set:
Xnew = (FL = rand(3), RW = rand(3), CL = rand(3), CW = rand(3), BD = rand(3))
predict(mach, Xnew) # a vector of distributions
3-element UnivariateFiniteVector{Multiclass{2}, String, UInt32, Float64}:
UnivariateFinite{Multiclass{2}}(B=>0.523, O=>0.477)
UnivariateFinite{Multiclass{2}}(B=>0.523, O=>0.477)
UnivariateFinite{Multiclass{2}}(B=>0.523, O=>0.477)
predict_mode(mach, Xnew) # a vector of point-predictions
3-element CategoricalArray{String,1,UInt32}:
"B"
"B"
"B"
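Individual class probabilities can be extracted from the predicted distributions using pdf (a sketch, assuming the trained machine and Xnew from above):

```julia
# Extract probabilities from probabilistic predictions (sketch):
yhat = predict(mach, Xnew)
pdf(yhat[1], "B")   # probability the first prediction assigns to class "B"
pdf.(yhat, "B")     # broadcast over all predictions
```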
More performance evaluation examples
Evaluating model + data directly:
evaluate(tree, X, y,
resampling=Holdout(fraction_train=0.7, shuffle=true, rng=1234),
measure=[LogLoss(), Accuracy()])
PerformanceEvaluation object with these fields:
model, measure, operation,
measurement, per_fold, per_observation,
fitted_params_per_fold, report_per_fold,
train_test_rows, resampling, repeats
Extract:
┌───┬──────────────────────┬──────────────┬─────────────┐
│ │ measure │ operation │ measurement │
├───┼──────────────────────┼──────────────┼─────────────┤
│ A │ LogLoss( │ predict │ 0.563 │
│ │ tol = 2.22045e-16) │ │ │
│ B │ Accuracy() │ predict_mode │ 0.567 │
└───┴──────────────────────┴──────────────┴─────────────┘
If a machine is already defined, as above:
evaluate!(mach,
resampling=Holdout(fraction_train=0.7, shuffle=true, rng=1234),
measure=[LogLoss(), Accuracy()])
PerformanceEvaluation object with these fields:
model, measure, operation,
measurement, per_fold, per_observation,
fitted_params_per_fold, report_per_fold,
train_test_rows, resampling, repeats
Extract:
┌───┬──────────────────────┬──────────────┬─────────────┐
│ │ measure │ operation │ measurement │
├───┼──────────────────────┼──────────────┼─────────────┤
│ A │ LogLoss( │ predict │ 0.563 │
│ │ tol = 2.22045e-16) │ │ │
│ B │ Accuracy() │ predict_mode │ 0.567 │
└───┴──────────────────────┴──────────────┴─────────────┘
Using cross-validation:
evaluate!(mach, resampling=CV(nfolds=5, shuffle=true, rng=1234),
measure=[LogLoss(), Accuracy()])
PerformanceEvaluation object with these fields:
model, measure, operation,
measurement, per_fold, per_observation,
fitted_params_per_fold, report_per_fold,
train_test_rows, resampling, repeats
Extract:
┌───┬──────────────────────┬──────────────┬─────────────┐
│ │ measure │ operation │ measurement │
├───┼──────────────────────┼──────────────┼─────────────┤
│ A │ LogLoss( │ predict │ 0.927 │
│ │ tol = 2.22045e-16) │ │ │
│ B │ Accuracy() │ predict_mode │ 0.665 │
└───┴──────────────────────┴──────────────┴─────────────┘
┌───┬──────────────────────────────────┬─────────┐
│ │ per_fold │ 1.96*SE │
├───┼──────────────────────────────────┼─────────┤
│ A │ [1.54, 1.4, 0.566, 0.576, 0.551] │ 0.489 │
│ B │ [0.6, 0.7, 0.65, 0.675, 0.7] │ 0.041 │
└───┴──────────────────────────────────┴─────────┘
With user-specified train/test pairs of row indices:
f1, f2, f3 = 1:13, 14:26, 27:36
pairs = [(f1, vcat(f2, f3)), (f2, vcat(f3, f1)), (f3, vcat(f1, f2))];
evaluate!(mach,
resampling=pairs,
measure=[LogLoss(), Accuracy()])
PerformanceEvaluation object with these fields:
model, measure, operation,
measurement, per_fold, per_observation,
fitted_params_per_fold, report_per_fold,
train_test_rows, resampling, repeats
Extract:
┌───┬──────────────────────┬──────────────┬─────────────┐
│ │ measure │ operation │ measurement │
├───┼──────────────────────┼──────────────┼─────────────┤
│ A │ LogLoss( │ predict │ 10.7 │
│ │ tol = 2.22045e-16) │ │ │
│ B │ Accuracy() │ predict_mode │ 0.583 │
└───┴──────────────────────┴──────────────┴─────────────┘
┌───┬───────────────────────┬─────────┐
│ │ per_fold │ 1.96*SE │
├───┼───────────────────────┼─────────┤
│ A │ [9.42, 13.1, 9.7] │ 2.82 │
│ B │ [0.739, 0.261, 0.731] │ 0.379 │
└───┴───────────────────────┴─────────┘
Changing a hyperparameter and re-evaluating:
tree.max_depth = 3
evaluate!(mach,
resampling=CV(nfolds=5, shuffle=true, rng=1234),
measure=[LogLoss(), Accuracy()])
PerformanceEvaluation object with these fields:
model, measure, operation,
measurement, per_fold, per_observation,
fitted_params_per_fold, report_per_fold,
train_test_rows, resampling, repeats
Extract:
┌───┬──────────────────────┬──────────────┬─────────────┐
│ │ measure │ operation │ measurement │
├───┼──────────────────────┼──────────────┼─────────────┤
│ A │ LogLoss( │ predict │ 1.76 │
│ │ tol = 2.22045e-16) │ │ │
│ B │ Accuracy() │ predict_mode │ 0.825 │
└───┴──────────────────────┴──────────────┴─────────────┘
┌───┬─────────────────────────────────────┬─────────┐
│ │ per_fold │ 1.96*SE │
├───┼─────────────────────────────────────┼─────────┤
│ A │ [2.16, 1.23, 3.72, 1.33, 0.353] │ 1.24 │
│ B │ [0.825, 0.825, 0.875, 0.725, 0.875] │ 0.06 │
└───┴─────────────────────────────────────┴─────────┘
Inspecting training results
Fit an ordinary least square model to some synthetic data:
x1 = rand(100)
x2 = rand(100)
X = (x1=x1, x2=x2)
y = x1 - 2x2 + 0.1*rand(100);
OLS = @load LinearRegressor pkg=GLM
ols = OLS()
mach = machine(ols, X, y) |> fit!
trained Machine; caches model-specific representations of data
model: LinearRegressor(fit_intercept = true, …)
args:
1: Source @049 ⏎ Table{AbstractVector{Continuous}}
2: Source @786 ⏎ AbstractVector{Continuous}
Get a named tuple representing the learned parameters, human-readable if appropriate:
fitted_params(mach)
(features = [:x1, :x2],
coef = [0.986756058739098, -2.0083723474502184],
intercept = 0.06390983129173478,)
Get other training-related information:
report(mach)
(stderror = [0.007751036889875875, 0.010131170265078242, 0.010268842128706981],
dof_residual = 97.0,
vcov = [6.007857286821667e-5 -5.523326824572504e-5 -5.350107765914804e-5; -5.523326824572504e-5 0.00010264061094000556 1.1650125441746104e-5; -5.350107765914804e-5 1.1650125441746104e-5 0.00010544911866430733],
deviance = 0.08663960255520886,
coef_table = ──────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────────
(Intercept) 0.0639098 0.00775104 8.25 <1e-12 0.0485262 0.0792935
x1 0.986756 0.0101312 97.40 <1e-97 0.966648 1.00686
x2 -2.00837 0.0102688 -195.58 <1e-99 -2.02875 -1.98799
──────────────────────────────────────────────────────────────────────────────,)
Basic fit/transform for unsupervised models
Load data:
X, y = @load_iris # a table and a vector
train, test = partition(eachindex(y), 0.97, shuffle=true, rng=123)
([55, 1, 57, 6, 34, 61, 41, 35, 148, 56 … 9, 95, 93, 137, 73, 116, 68, 43, 50, 112], [30, 127, 75, 111])
Instantiate and fit the model/machine:
PCA = @load PCA
pca = PCA(maxoutdim=2)
mach = machine(pca, X)
fit!(mach, rows=train)
trained Machine; caches model-specific representations of data
model: PCA(maxoutdim = 2, …)
args:
1: Source @323 ⏎ Table{AbstractVector{Continuous}}
Transform selected data bound to the machine:
transform(mach, rows=test)
(x1 = [2.6255922258705566, -1.2645297850885582, -0.7206635869470076, -1.668268683584315],
x2 = [-0.19331607089765088, -0.17741504590051646, 0.15137821608529245, 0.24438619364248584],)
Transform new data:
Xnew = (sepal_length=rand(3), sepal_width=rand(3),
petal_length=rand(3), petal_width=rand(3));
transform(mach, Xnew)
(x1 = [5.052862790612286, 4.353901783357102, 5.015333655480174],
x2 = [-4.662175856474163, -4.752533493164413, -5.043859925190777],)
Inverting learned transformations
y = rand(100);
stand = Standardizer()
mach = machine(stand, y)
fit!(mach)
z = transform(mach, y);
@assert inverse_transform(mach, z) ≈ y # true
[ Info: Training machine(Standardizer(features = Symbol[], …), …).
Nested hyperparameter tuning
Reference: Tuning Models
Define a model with nested hyperparameters:
Tree = @load DecisionTreeClassifier pkg=DecisionTree
tree = Tree()
forest = EnsembleModel(model=tree, n=300)
ProbabilisticEnsembleModel(
model = DecisionTreeClassifier(
max_depth = -1,
min_samples_leaf = 1,
min_samples_split = 2,
min_purity_increase = 0.0,
n_subfeatures = 0,
post_prune = false,
merge_purity_threshold = 1.0,
display_depth = 5,
feature_importance = :impurity,
rng = Random.TaskLocalRNG()),
atomic_weights = Float64[],
bagging_fraction = 0.8,
rng = Random.TaskLocalRNG(),
n = 300,
acceleration = CPU1{Nothing}(nothing),
out_of_bag_measure = Any[])
Define ranges for hyperparameters to be tuned:
r1 = range(forest, :bagging_fraction, lower=0.5, upper=1.0, scale=:log10)
NumericRange(0.5 ≤ bagging_fraction ≤ 1.0; origin=0.75, unit=0.25; on log10 scale)
r2 = range(forest, :(model.n_subfeatures), lower=1, upper=4) # nested
NumericRange(1 ≤ model.n_subfeatures ≤ 4; origin=2.5, unit=1.5)
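The grid points a bounded range generates can be previewed with MLJ's iterator function (a sketch; r1 and r2 as defined above):

```julia
# Preview grid points generated from the ranges (sketch):
iterator(r1, 4)  # 4 values between 0.5 and 1.0, log10-spaced
iterator(r2, 4)  # up to 4 integer values between 1 and 4
```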
Wrap the model in a tuning strategy:
tuned_forest = TunedModel(model=forest,
tuning=Grid(resolution=12),
resampling=CV(nfolds=6),
ranges=[r1, r2],
measure=BrierLoss())
ProbabilisticTunedModel(
model = ProbabilisticEnsembleModel(
model = DecisionTreeClassifier(max_depth = -1, …),
atomic_weights = Float64[],
bagging_fraction = 0.8,
rng = Random.TaskLocalRNG(),
n = 300,
acceleration = CPU1{Nothing}(nothing),
out_of_bag_measure = Any[]),
tuning = Grid(
goal = nothing,
resolution = 12,
shuffle = true,
rng = Random.TaskLocalRNG()),
resampling = CV(
nfolds = 6,
shuffle = false,
rng = Random.TaskLocalRNG()),
measure = BrierLoss(),
weights = nothing,
class_weights = nothing,
operation = nothing,
range = NumericRange{T, MLJBase.Bounded, Symbol} where T[NumericRange(0.5 ≤ bagging_fraction ≤ 1.0; origin=0.75, unit=0.25; on log10 scale), NumericRange(1 ≤ model.n_subfeatures ≤ 4; origin=2.5, unit=1.5)],
selection_heuristic = MLJTuning.NaiveSelection(nothing),
train_best = true,
repeats = 1,
n = nothing,
acceleration = CPU1{Nothing}(nothing),
acceleration_resampling = CPU1{Nothing}(nothing),
check_measure = true,
cache = true,
compact_history = true,
logger = nothing)
Bind the wrapped model to data:
mach = machine(tuned_forest, X, y)
untrained Machine; does not cache data
model: ProbabilisticTunedModel(model = ProbabilisticEnsembleModel(model = DecisionTreeClassifier(max_depth = -1, …), …), …)
args:
1: Source @443 ⏎ Table{AbstractVector{Continuous}}
2: Source @566 ⏎ AbstractVector{Multiclass{3}}
Fitting the resultant machine optimizes the hyperparameters specified in range, using the specified tuning and resampling strategies and performance measure (possibly a vector of measures), and retrains on all data bound to the machine:
fit!(mach)
trained Machine; does not cache data
model: ProbabilisticTunedModel(model = ProbabilisticEnsembleModel(model = DecisionTreeClassifier(max_depth = -1, …), …), …)
args:
1: Source @443 ⏎ Table{AbstractVector{Continuous}}
2: Source @566 ⏎ AbstractVector{Multiclass{3}}
Inspecting the optimal model:
F = fitted_params(mach)
(best_model = ProbabilisticEnsembleModel(model = DecisionTreeClassifier(max_depth = -1, …), …),
best_fitted_params = (fitresult = WrappedEnsemble(atom = DecisionTreeClassifier(max_depth = -1, …), …),),)
F.best_model
ProbabilisticEnsembleModel(
model = DecisionTreeClassifier(
max_depth = -1,
min_samples_leaf = 1,
min_samples_split = 2,
min_purity_increase = 0.0,
n_subfeatures = 4,
post_prune = false,
merge_purity_threshold = 1.0,
display_depth = 5,
feature_importance = :impurity,
rng = Random.TaskLocalRNG()),
atomic_weights = Float64[],
bagging_fraction = 0.5,
rng = Random.TaskLocalRNG(),
n = 300,
acceleration = CPU1{Nothing}(nothing),
out_of_bag_measure = Any[])
Inspecting details of tuning procedure:
r = report(mach);
keys(r)
(:best_model, :best_history_entry, :history, :best_report, :plotting)
r.history[[1,end]]
2-element Vector{@NamedTuple{model::MLJEnsembles.ProbabilisticEnsembleModel{MLJDecisionTreeInterface.DecisionTreeClassifier}, measure::Vector{StatisticalMeasuresBase.RobustMeasure{StatisticalMeasuresBase.FussyMeasure{StatisticalMeasuresBase.RobustMeasure{StatisticalMeasures._BrierLossType}, typeof(StatisticalMeasures.l2_check)}}}, measurement::Vector{Float64}, per_fold::Vector{Vector{Float64}}, evaluation::CompactPerformanceEvaluation{MLJEnsembles.ProbabilisticEnsembleModel{MLJDecisionTreeInterface.DecisionTreeClassifier}, Vector{StatisticalMeasuresBase.RobustMeasure{StatisticalMeasuresBase.FussyMeasure{StatisticalMeasuresBase.RobustMeasure{StatisticalMeasures._BrierLossType}, typeof(StatisticalMeasures.l2_check)}}}, Vector{Float64}, Vector{typeof(predict)}, Vector{Vector{Float64}}, Vector{Vector{Vector{Float64}}}, CV}}}:
(model = ProbabilisticEnsembleModel(model = DecisionTreeClassifier(max_depth = -1, …), …), measure = [BrierLoss()], measurement = [0.11625985185185157], per_fold = [[-0.0, -0.0, 0.16137333333333304, 0.15899199999999944, 0.1493466666666662, 0.22784711111111078]], evaluation = CompactPerformanceEvaluation(0.116,))
(model = ProbabilisticEnsembleModel(model = DecisionTreeClassifier(max_depth = -1, …), …), measure = [BrierLoss()], measurement = [0.10091674074074057], per_fold = [[-0.0, -0.0, 0.1321244444444443, 0.15357955555555505, 0.13605155555555531, 0.1837448888888888]], evaluation = CompactPerformanceEvaluation(0.101,))
Visualizing these results:
using Plots
plot(mach)
Predicting on new data using the optimized model trained on all data:
predict(mach, Xnew)
3-element UnivariateFiniteVector{Multiclass{3}, String, UInt32, Float64}:
UnivariateFinite{Multiclass{3}}(setosa=>1.0, versicolor=>0.0, virginica=>0.0)
UnivariateFinite{Multiclass{3}}(setosa=>0.723, versicolor=>0.257, virginica=>0.02)
UnivariateFinite{Multiclass{3}}(setosa=>1.0, versicolor=>0.0, virginica=>0.0)
Constructing linear pipelines
Reference: Linear Pipelines
Constructing a linear (unbranching) pipeline with a learned target transformation/inverse transformation:
X, y = @load_reduced_ames
KNN = @load KNNRegressor
knn_with_target = TransformedTargetModel(model=KNN(K=3), transformer=Standardizer())
TransformedTargetModelDeterministic(
model = KNNRegressor(
K = 3,
algorithm = :kdtree,
metric = Distances.Euclidean(0.0),
leafsize = 10,
reorder = true,
weights = NearestNeighborModels.Uniform()),
transformer = Standardizer(
features = Symbol[],
ignore = false,
ordered_factor = false,
count = false),
inverse = nothing,
cache = true)
pipe = (X -> coerce(X, :age=>Continuous)) |> OneHotEncoder() |> knn_with_target
DeterministicPipeline(
f = Main.var"#15#16"(),
one_hot_encoder = OneHotEncoder(
features = Symbol[],
drop_last = false,
ordered_factor = true,
ignore = false),
transformed_target_model_deterministic = TransformedTargetModelDeterministic(
model = KNNRegressor(K = 3, …),
transformer = Standardizer(features = Symbol[], …),
inverse = nothing,
cache = true),
cache = true)
Evaluating the pipeline (just as you would any other model):
pipe.one_hot_encoder.drop_last = true # mutate a nested hyper-parameter
evaluate(pipe, X, y, resampling=Holdout(), measure=RootMeanSquaredError(), verbosity=2)
PerformanceEvaluation object with these fields:
model, measure, operation,
measurement, per_fold, per_observation,
fitted_params_per_fold, report_per_fold,
train_test_rows, resampling, repeats
Extract:
┌────────────────────────┬───────────┬─────────────┐
│ measure │ operation │ measurement │
├────────────────────────┼───────────┼─────────────┤
│ RootMeanSquaredError() │ predict │ 51200.0 │
└────────────────────────┴───────────┴─────────────┘
Inspecting the learned parameters in a pipeline:
mach = machine(pipe, X, y) |> fit!
F = fitted_params(mach)
F.transformed_target_model_deterministic.model
(tree = NearestNeighbors.KDTree{StaticArraysCore.SVector{56, Float64}, Distances.Euclidean, Float64, StaticArraysCore.SVector{56, Float64}}
Number of points: 1456
Dimensions: 56
Metric: Distances.Euclidean(0.0)
Reordered: true,)
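Reports generated by component models are accessed analogously, keyed on the pipeline's automatically generated component names. A sketch (the exact field names are assumptions, to be checked against `keys(r)`):

```julia
r = report(mach)
keys(r)             # component names, e.g. :one_hot_encoder, ...
r.one_hot_encoder   # report of the OneHotEncoder component
```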
Constructing a linear (unbranching) pipeline with a static (unlearned) target transformation/inverse transformation:
Tree = @load DecisionTreeRegressor pkg=DecisionTree verbosity=0
tree_with_target = TransformedTargetModel(model=Tree(),
transformer=y -> log.(y),
inverse = z -> exp.(z))
pipe2 = (X -> coerce(X, :age=>Continuous)) |> OneHotEncoder() |> tree_with_target
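As before, the new pipeline can be evaluated just like any other model. A minimal sketch, reusing the resampling strategy and measure from the earlier pipeline evaluation:

```julia
evaluate(pipe2, X, y,
         resampling=Holdout(),
         measure=RootMeanSquaredError())
```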
Creating a homogeneous ensemble of models
Reference: Homogeneous Ensembles
X, y = @load_iris
Tree = @load DecisionTreeClassifier pkg=DecisionTree
tree = Tree()
forest = EnsembleModel(model=tree, bagging_fraction=0.8, n=300)
mach = machine(forest, X, y)
evaluate!(mach, measure=LogLoss())
PerformanceEvaluation object with these fields:
model, measure, operation,
measurement, per_fold, per_observation,
fitted_params_per_fold, report_per_fold,
train_test_rows, resampling, repeats
Extract:
┌──────────────────────┬───────────┬─────────────┐
│ measure │ operation │ measurement │
├──────────────────────┼───────────┼─────────────┤
│ LogLoss( │ predict │ 0.638 │
│ tol = 2.22045e-16) │ │ │
└──────────────────────┴───────────┴─────────────┘
┌────────────────────────────────────────────────┬─────────┐
│ per_fold │ 1.96*SE │
├────────────────────────────────────────────────┼─────────┤
│ [3.89e-15, 3.89e-15, 0.319, 1.65, 1.55, 0.308] │ 0.666 │
└────────────────────────────────────────────────┴─────────┘
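To use the ensemble for prediction, fit the machine on all the data and predict. A sketch (`predict_mode` returns the most probable class for each observation):

```julia
fit!(mach)                  # train the forest on all rows
yhat = predict(mach, X)     # probabilistic predictions
predict_mode(mach, X)[1:3]  # point predictions for the first three observations
```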
Performance curves
Generating a plot of performance as a function of some hyperparameter (building on the preceding example):
Single performance curve:
r = range(forest, :n, lower=1, upper=1000, scale=:log10)
curve = learning_curve(mach,
range=r,
resampling=Holdout(),
resolution=50,
measure=LogLoss(),
verbosity=0)
(parameter_name = "n",
parameter_scale = :log10,
parameter_values = [1, 2, 3, 4, 5, 6, 7, 8, 10, 11 … 281, 324, 373, 429, 494, 569, 655, 754, 869, 1000],
measurements = [4.004850376568572, 4.1126732713223415, 4.067922726718731, 4.049600921172184, 4.039561595661895, 4.03321150762541, 4.065589155789244, 4.08361146745936, 4.082148768470309, 4.095634751659402 … 1.2616322281076593, 1.267897311771652, 1.2657758739856044, 1.2579792206940394, 1.2596508547736494, 1.2627993787760856, 1.2544026227520193, 1.2523949871090791, 1.2533064841096802, 1.2542225428976999],)
using Plots
plot(curve.parameter_values, curve.measurements,
xlab=curve.parameter_name, xscale=curve.parameter_scale)
Multiple curves:
curve = learning_curve(mach,
range=r,
resampling=Holdout(),
measure=LogLoss(),
resolution=50,
rng_name=:rng,
rngs=4,
verbosity=0)
(parameter_name = "n",
parameter_scale = :log10,
parameter_values = [1, 2, 3, 4, 5, 6, 7, 8, 10, 11 … 281, 324, 373, 429, 494, 569, 655, 754, 869, 1000],
measurements = [7.20873067782343 4.805820451882287 4.004850376568572 8.009700753137146; 7.20873067782343 4.805820451882287 4.004850376568572 8.040507294495367; … ; 1.2543154364346034 1.2444775009706022 1.2585317606216673 1.2501519842667157; 1.2497159507192772 1.2446179852065242 1.2628839814044215 1.250266106166284],)
plot(curve.parameter_values, curve.measurements,
xlab=curve.parameter_name, xscale=curve.parameter_scale)
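Each column of `curve.measurements` corresponds to one of the four RNGs. To distinguish them in the plot, one might pass a row vector of series labels (the label text here is illustrative):

```julia
plot(curve.parameter_values, curve.measurements,
     xlab=curve.parameter_name,
     xscale=curve.parameter_scale,
     label=["rng 1" "rng 2" "rng 3" "rng 4"])
```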