Evaluating Model Performance
MLJ allows quick evaluation of a supervised model's performance against a battery of selected losses or scores. For more on available performance measures, see Performance Measures.
In addition to hold-out and cross-validation, the user can specify their own list of train/test pairs of row indices for resampling, or define their own re-usable resampling strategies.
For simultaneously evaluating multiple models and/or data sets, see Benchmarking.
Evaluating against a single measure
julia> using MLJ
julia> X = (a=rand(12), b=rand(12), c=rand(12));
julia> y = X.a + 2X.b + 0.05*rand(12);
julia> model = @load RidgeRegressor pkg=MultivariateStats
RidgeRegressor(
    lambda = 1.0) @ 8…84
julia> cv = CV(nfolds=3)
CV(
    nfolds = 3,
    shuffle = false,
    rng = MersenneTwister(UInt32[0x5e195684, 0x67952e4a, 0x3888593c, 0x4fe704ab]) @ 379) @ 1…97
julia> evaluate(model, X, y, resampling=cv, measure=l2, verbosity=0)
┌───────────┬───────────────┬────────────────────────┐
│ _.measure │ _.measurement │ _.per_fold │
├───────────┼───────────────┼────────────────────────┤
│ l2 │ 0.277 │ [0.0722, 0.536, 0.223] │
└───────────┴───────────────┴────────────────────────┘
_.per_observation = [[[0.00026, 0.0324, ..., 0.042], [0.69, 0.212, ..., 1.09], [0.134, 0.22, ..., 0.534]]]
Alternatively, instead of applying evaluate to a model + data, one may call evaluate! on an existing machine wrapping the model in data:
julia> mach = machine(model, X, y)
Machine{RidgeRegressor} @ 1…28
julia> evaluate!(mach, resampling=cv, measure=l2, verbosity=0)
┌───────────┬───────────────┬────────────────────────┐
│ _.measure │ _.measurement │ _.per_fold │
├───────────┼───────────────┼────────────────────────┤
│ l2 │ 0.277 │ [0.0722, 0.536, 0.223] │
└───────────┴───────────────┴────────────────────────┘
_.per_observation = [[[0.00026, 0.0324, ..., 0.042], [0.69, 0.212, ..., 1.09], [0.134, 0.22, ..., 0.534]]]
(The latter call is mutating, as the learned parameters stored in the machine may change.)
Multiple measures
julia> evaluate!(mach,
                 resampling=cv,
                 measure=[l1, rms, rmslp1], verbosity=0)
┌───────────┬───────────────┬───────────────────────┐
│ _.measure │ _.measurement │ _.per_fold │
├───────────┼───────────────┼───────────────────────┤
│ l1 │ 0.436 │ [0.216, 0.681, 0.41] │
│ rms │ 0.527 │ [0.269, 0.732, 0.473] │
│ rmslp1 │ 0.25 │ [0.139, 0.376, 0.165] │
└───────────┴───────────────┴───────────────────────┘
_.per_observation = [[[0.0161, 0.18, ..., 0.205], [0.83, 0.461, ..., 1.04], [0.366, 0.469, ..., 0.731]], missing, missing]
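The aggregated value in _.measurement is obtained from the _.per_fold entries via each measure's aggregation method. A minimal sanity check in plain Julia, assuming unweighted mean aggregation for l1 (consistent with the displayed numbers):

```julia
using Statistics

# Per-fold l1 values from the table above:
per_fold_l1 = [0.216, 0.681, 0.41]

# Assuming unweighted mean aggregation for l1, the aggregate
# measurement is the mean of the per-fold values:
agg = round(mean(per_fold_l1), digits=3)  # -> 0.436, matching _.measurement
```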
Custom measures and weighted measures
julia> my_loss(yhat, y) = maximum((yhat - y).^2);
julia> my_per_observation_loss(yhat, y) = abs.(yhat - y);
julia> MLJ.reports_each_observation(::typeof(my_per_observation_loss)) = true;
julia> my_weighted_score(yhat, y) = 1/mean(abs.(yhat - y));
julia> my_weighted_score(yhat, y, w) = 1/mean(abs.((yhat - y).^w));
julia> MLJ.supports_weights(::typeof(my_weighted_score)) = true;
julia> MLJ.orientation(::typeof(my_weighted_score)) = :score;
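As a quick check that these definitions behave as intended, one can apply them to toy vectors (hypothetical data, not from the example above; the measures are redefined here so the snippet is self-contained):

```julia
using Statistics

my_loss(yhat, y) = maximum((yhat - y).^2)
my_per_observation_loss(yhat, y) = abs.(yhat - y)
my_weighted_score(yhat, y) = 1/mean(abs.(yhat - y))

yhat = [1.0, 2.0, 3.0]   # hypothetical predictions
y    = [1.5, 1.0, 3.0]   # hypothetical ground truth

my_loss(yhat, y)                  # worst squared error: 1.0
my_per_observation_loss(yhat, y)  # absolute errors: [0.5, 1.0, 0.0]
my_weighted_score(yhat, y)        # 1/mean([0.5, 1.0, 0.0]) = 2.0
```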
julia> holdout = Holdout(fraction_train=0.8)
Holdout(
    fraction_train = 0.8,
    shuffle = false,
    rng = MersenneTwister(UInt32[0x5e195684, 0x67952e4a, 0x3888593c, 0x4fe704ab]) @ 379) @ 1…63
julia> weights = [1, 1, 2, 1, 1, 2, 3, 1, 1, 2, 3, 1];
julia> evaluate!(mach,
                 resampling=CV(nfolds=3),
                 measure=[my_loss, my_per_observation_loss, my_weighted_score, l1],
                 weights=weights, verbosity=0)
┌ Warning: Sample weights ignored in evaluations of the following measures, as unsupported:
│ my_loss, my_per_observation_loss
└ @ MLJBase ~/.julia/packages/MLJBase/8HOpr/src/resampling.jl:543
┌─────────────────────────┬───────────────┬───────────────────────┐
│ _.measure │ _.measurement │ _.per_fold │
├─────────────────────────┼───────────────┼───────────────────────┤
│ my_loss │ 0.613 │ [0.214, 1.09, 0.534] │
│ my_per_observation_loss │ 0.436 │ [0.216, 0.681, 0.41] │
│ my_weighted_score │ 3.8 │ [6.5, 1.86, 3.04] │
│ l1 │ 0.629 │ [0.332, 0.992, 0.565] │
└─────────────────────────┴───────────────┴───────────────────────┘
_.per_observation = [missing, [[0.0161, 0.18, ..., 0.205], [0.83, 0.461, ..., 1.04], [0.366, 0.469, ..., 0.731]], missing, [[0.0161, 0.18, ..., 0.205], [0.83, 0.922, ..., 1.04], [0.366, 0.939, ..., 0.731]]]
User-specified train/test sets
Users can either provide their own list of train/test pairs of row indices for resampling, as in this example:
julia> fold1 = 1:6; fold2 = 7:12;
julia> evaluate!(mach,
                 resampling=[(fold1, fold2), (fold2, fold1)],
                 measure=[l1, l2], verbosity=0)
┌───────────┬───────────────┬────────────────┐
│ _.measure │ _.measurement │ _.per_fold │
├───────────┼───────────────┼────────────────┤
│ l1 │ 0.404 │ [0.45, 0.358] │
│ l2 │ 0.218 │ [0.254, 0.183] │
└───────────┴───────────────┴────────────────┘
_.per_observation = [[[0.304, 0.843, ..., 0.672], [0.017, 0.143, ..., 0.4]], [[0.0925, 0.711, ..., 0.451], [0.000288, 0.0205, ..., 0.16]]]
Or define their own re-usable ResamplingStrategy objects; see Custom resampling strategies below.
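For more than two folds, such explicit (train, test) pairs are easily generated programmatically. A minimal sketch in plain Julia (no MLJ required), rotating each of three hypothetical consecutive folds into the test position:

```julia
rows  = 1:12
folds = [1:4, 5:8, 9:12]   # three hypothetical consecutive folds

# Each fold serves once as the test set; the remaining rows train:
pairs = [(setdiff(rows, test), test) for test in folds]

first(pairs)  # -> ([5, 6, 7, 8, 9, 10, 11, 12], 1:4)
```

The resulting vector of pairs can be passed directly as the resampling argument to evaluate!.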
Built-in resampling strategies
MLJBase.Holdout — Type

holdout = Holdout(; fraction_train=0.7,
                    shuffle=nothing,
                    rng=nothing)
Holdout resampling strategy, for use in evaluate!, evaluate and in tuning.

train_test_pairs(holdout, rows)

Returns the pair [(train, test)], where train and test are vectors such that rows=vcat(train, test) and length(train)/length(rows) is approximately equal to fraction_train.

Pre-shuffling of rows is controlled by rng and shuffle. If rng is an integer, then the Holdout keyword constructor resets it to MersenneTwister(rng). Otherwise some AbstractRNG object is expected.

If rng is left unspecified, rng is reset to Random.GLOBAL_RNG, in which case rows are only pre-shuffled if shuffle=true is explicitly specified.
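The effect of fraction_train can be sketched in a few lines of plain Julia. This is a simplified illustration only, assuming the split point is obtained by truncation (MLJ's partition may round differently), and holdout_pairs is a hypothetical helper name:

```julia
# Simplified sketch of Holdout's train/test split (illustrative only):
function holdout_pairs(rows, fraction_train)
    n_train = floor(Int, fraction_train * length(rows))
    return [(rows[1:n_train], rows[n_train+1:end])]
end

holdout_pairs(1:10, 0.7)  # -> [(1:7, 8:10)]
```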
MLJBase.CV — Type

cv = CV(; nfolds=6, shuffle=nothing, rng=nothing)
Cross-validation resampling strategy, for use in evaluate!, evaluate and tuning.

train_test_pairs(cv, rows)

Returns an nfolds-length iterator of (train, test) pairs of vectors (row indices), where each train and test is a sub-vector of rows. The test vectors are mutually exclusive and exhaust rows. Each train vector is the complement of the corresponding test vector. With no row pre-shuffling, the order of rows is preserved, in the sense that rows coincides precisely with the concatenation of the test vectors, in the order they are generated. The first r test vectors have length n + 1, where n, r = divrem(length(rows), nfolds), and the remaining test vectors have length n.

Pre-shuffling of rows is controlled by rng and shuffle. If rng is an integer, then the CV keyword constructor resets it to MersenneTwister(rng). Otherwise some AbstractRNG object is expected.

If rng is left unspecified, rng is reset to Random.GLOBAL_RNG, in which case rows are only pre-shuffled if shuffle=true is explicitly specified.
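The fold-size rule for CV (the first r test folds receive n + 1 rows and the rest n, where n, r = divrem(length(rows), nfolds)) can be checked in plain Julia; cv_test_fold_sizes is a hypothetical helper, for illustration only:

```julia
# Test-fold sizes implied by the rule n, r = divrem(length(rows), nfolds):
function cv_test_fold_sizes(nrows, nfolds)
    n, r = divrem(nrows, nfolds)
    return [k <= r ? n + 1 : n for k in 1:nfolds]
end

cv_test_fold_sizes(12, 5)  # -> [3, 3, 2, 2, 2]
```

Note that the sizes always sum to the total number of rows, so the test folds exhaust rows as stated above.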
MLJBase.StratifiedCV — Type

stratified_cv = StratifiedCV(; nfolds=6,
                               shuffle=false,
                               rng=Random.GLOBAL_RNG)
Stratified cross-validation resampling strategy, for use in evaluate!, evaluate and in tuning. Applies only to classification problems (OrderedFactor or Multiclass targets).

train_test_pairs(stratified_cv, rows, y)

Returns an nfolds-length iterator of (train, test) pairs of vectors (row indices) where each train and test is a sub-vector of rows. The test vectors are mutually exclusive and exhaust rows. Each train vector is the complement of the corresponding test vector.

Unlike regular cross-validation, the distribution of the levels of the target y corresponding to each train and test is constrained, as far as possible, to replicate that of y[rows] as a whole.

Specifically, the data is split into a number of groups on which y is constant, and each individual group is resampled according to the ordinary cross-validation strategy CV(nfolds=nfolds). To obtain the final (train, test) pairs of row indices, the per-group pairs are collated in such a way that each collated train and test respects the original order of rows (after shuffling, if shuffle=true).

Pre-shuffling of rows is controlled by rng and shuffle. If rng is an integer, then the StratifiedCV keyword constructor resets it to MersenneTwister(rng). Otherwise some AbstractRNG object is expected.

If rng is left unspecified, rng is reset to Random.GLOBAL_RNG, in which case rows are only pre-shuffled if shuffle=true is explicitly specified.
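The per-group resample-and-collate procedure described above can be sketched in plain Julia (test folds only, no shuffling; stratified_test_folds is a hypothetical helper, not part of MLJ):

```julia
function stratified_test_folds(rows, y, nfolds)
    tests = [Int[] for _ in 1:nfolds]
    for level in unique(y)
        # group: the rows on which y is constant
        group = [r for r in rows if y[r] == level]
        # split the group as ordinary CV(nfolds=nfolds) would:
        n, r = divrem(length(group), nfolds)
        i = 1
        for k in 1:nfolds
            len = k <= r ? n + 1 : n
            append!(tests[k], group[i:i+len-1])
            i += len
        end
    end
    # collation preserves the original order of rows within each fold:
    return [sort(t) for t in tests]
end

# With a perfectly balanced target, each test fold receives one "a" and one "b":
stratified_test_folds(1:6, ["a", "a", "a", "b", "b", "b"], 3)
# -> [[1, 4], [2, 5], [3, 6]]
```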
Custom resampling strategies
To define your own resampling strategy, make relevant parameters of your strategy the fields of a new type MyResamplingStrategy <: MLJ.ResamplingStrategy, and implement one of the following methods:

MLJ.train_test_pairs(my_strategy::MyResamplingStrategy, rows)
MLJ.train_test_pairs(my_strategy::MyResamplingStrategy, rows, y)
MLJ.train_test_pairs(my_strategy::MyResamplingStrategy, rows, X, y)

Each method takes a vector of indices rows and returns a vector [(t1, e1), (t2, e2), ..., (tk, ek)] of train/test pairs of row indices selected from rows. Here X, y are the input and target data (ignored in simple strategies, such as Holdout and CV).
Here is the code for the Holdout
strategy as an example:
struct Holdout <: ResamplingStrategy
    fraction_train::Float64
    shuffle::Bool
    rng::Union{Int,AbstractRNG}

    function Holdout(fraction_train, shuffle, rng)
        0 < fraction_train < 1 ||
            error("`fraction_train` must be between 0 and 1.")
        return new(fraction_train, shuffle, rng)
    end
end

# Keyword constructor:
function Holdout(; fraction_train::Float64=0.7, shuffle=nothing, rng=nothing)
    if rng isa Integer
        rng = MersenneTwister(rng)
    end
    if shuffle === nothing
        shuffle = ifelse(rng === nothing, false, true)
    end
    if rng === nothing
        rng = Random.GLOBAL_RNG
    end
    return Holdout(fraction_train, shuffle, rng)
end

function train_test_pairs(holdout::Holdout, rows)
    train, test = partition(rows, holdout.fraction_train,
                            shuffle=holdout.shuffle, rng=holdout.rng)
    return [(train, test)]
end
API
MLJBase.evaluate! — Function

evaluate!(mach,
          resampling=CV(),
          measure=nothing,
          weights=nothing,
          operation=predict,
          repeats=1,
          acceleration=default_resource(),
          force=false,
          verbosity=1,
          check_measure=true)
Estimate the performance of a machine mach wrapping a supervised model in data, using the specified resampling strategy (defaulting to 6-fold cross-validation) and measure, which can be a single measure or vector.
Do subtypes(MLJ.ResamplingStrategy) to obtain a list of available resampling strategies. If resampling is not an object of type MLJ.ResamplingStrategy, then a vector of pairs of the form (train_rows, test_rows) is expected. For example, setting

resampling = [(1:100, 101:200),
              (101:200, 1:100)]

gives two-fold cross-validation using the first 200 rows of data.
The resampling strategy is applied repeatedly if repeats > 1. With resampling = CV(nfolds=5), for example, repeats = n generates a total of 5n test folds for evaluation and subsequent aggregation.
If resampling isa MLJ.ResamplingStrategy, then one may optionally restrict the data used in evaluation by specifying rows.
An optional weights vector may be passed for measures that support sample weights (MLJ.supports_weights(measure) == true), which is ignored by those that don't.
Important: If mach already wraps sample weights w (as in mach = machine(model, X, y, w)), then these weights, which are used for training, are automatically passed to the measures for evaluation. However, for evaluation purposes, any weights specified as a keyword argument take precedence over w.
User-defined measures are supported; see the manual for details.
If no measure is specified, then default_measure(mach.model) is used; if this default is nothing, an error is thrown.
The acceleration keyword argument is used to specify the compute resource (a subtype of ComputationalResources.AbstractResource) that will be used to accelerate/parallelize the resampling operation.
Although evaluate! is mutating, mach.model and mach.args are untouched.
Return value
A property-accessible object of type PerformanceEvaluation with these properties:

measure: the vector of specified measures
measurements: the corresponding measurements, aggregated across the test folds using the aggregation method defined for each measure (do aggregation(measure) to inspect)
per_fold: a vector of vectors of individual test fold evaluations (one vector per measure)
per_observation: a vector of vectors of individual observation evaluations of those measures for which reports_each_observation(measure) is true, which is otherwise reported missing
See also evaluate.

MLJModelInterface.evaluate — Function

Some meta-models may choose to implement the evaluate operation.