Evaluating Model Performance
MLJ allows quick evaluation of a supervised model's performance against a battery of selected losses or scores. For more on available performance measures, see Performance Measures.
In addition to hold-out and cross-validation, the user can specify their own list of train/test pairs of row indices for resampling, or define their own re-usable resampling strategies.
For simultaneously evaluating multiple models and/or data sets, see Benchmarking.
Evaluating against a single measure
julia> using MLJ
julia> X = (a=rand(12), b=rand(12), c=rand(12));
julia> y = X.a + 2X.b + 0.05*rand(12);
julia> model = @load RidgeRegressor pkg=MultivariateStats
RidgeRegressor(
lambda = 1.0,
bias = true) @059
julia> cv=CV(nfolds=3)
CV(
nfolds = 3,
shuffle = false,
rng = MersenneTwister(UInt32[0xf52cdeeb, 0xb7b9fdb7, 0x56985e7f, 0xaa6313fc]) @ 184) @670
julia> evaluate(model, X, y, resampling=cv, measure=l2, verbosity=0)
┌───────────┬───────────────┬────────────────────────┐
│ _.measure │ _.measurement │ _.per_fold │
├───────────┼───────────────┼────────────────────────┤
│ l2 │ 0.227 │ [0.269, 0.364, 0.0473] │
└───────────┴───────────────┴────────────────────────┘
_.per_observation = [[[0.23, 0.324, ..., 0.0275], [0.314, 0.202, ..., 0.196], [0.00488, 0.0182, ..., 0.148]]]
_.fitted_params_per_fold = [ … ]
_.report_per_fold = [ … ]
Alternatively, instead of applying evaluate to a model + data, one may call evaluate! on an existing machine wrapping the model in data:
julia> mach = machine(model, X, y)
Machine{RidgeRegressor} @330 trained 0 times.
args:
1: Source @390 ⏎ `Table{AbstractArray{Continuous,1}}`
2: Source @107 ⏎ `AbstractArray{Continuous,1}`
julia> evaluate!(mach, resampling=cv, measure=l2, verbosity=0)
┌───────────┬───────────────┬────────────────────────┐
│ _.measure │ _.measurement │ _.per_fold │
├───────────┼───────────────┼────────────────────────┤
│ l2 │ 0.227 │ [0.269, 0.364, 0.0473] │
└───────────┴───────────────┴────────────────────────┘
_.per_observation = [[[0.23, 0.324, ..., 0.0275], [0.314, 0.202, ..., 0.196], [0.00488, 0.0182, ..., 0.148]]]
_.fitted_params_per_fold = [ … ]
_.report_per_fold = [ … ]
(The latter call is mutating, as the learned parameters stored in the machine potentially change.)
Multiple measures
julia> evaluate!(mach,
resampling=cv,
measure=[l1, rms, rmslp1], verbosity=0)
┌───────────┬───────────────┬────────────────────────┐
│ _.measure │ _.measurement │ _.per_fold │
├───────────┼───────────────┼────────────────────────┤
│ l1 │ 0.413 │ [0.479, 0.579, 0.181] │
│ rms │ 0.476 │ [0.518, 0.603, 0.218] │
│ rmslp1 │ 0.196 │ [0.224, 0.236, 0.0947] │
└───────────┴───────────────┴────────────────────────┘
_.per_observation = [[[0.48, 0.569, ..., 0.166], [0.56, 0.45, ..., 0.442], [0.0698, 0.135, ..., 0.385]], missing, missing]
_.fitted_params_per_fold = [ … ]
_.report_per_fold = [ … ]
Custom measures and weighted measures
Any function with signature f(yhat, y) (or f(yhat, y, w), for per-observation weights w) can be used as a measure; trait declarations, as below, tell MLJ how to interpret it:
julia> my_loss(yhat, y) = maximum((yhat - y).^2);
julia> my_per_observation_loss(yhat, y) = abs.(yhat - y);
julia> MLJ.reports_each_observation(::typeof(my_per_observation_loss)) = true;
julia> my_weighted_score(yhat, y) = 1/mean(abs.(yhat - y));
julia> my_weighted_score(yhat, y, w) = 1/mean(abs.((yhat - y).^w));
julia> MLJ.supports_weights(::typeof(my_weighted_score)) = true;
julia> MLJ.orientation(::typeof(my_weighted_score)) = :score;
julia> holdout = Holdout(fraction_train=0.8)
Holdout(
fraction_train = 0.8,
shuffle = false,
rng = MersenneTwister(UInt32[0xf52cdeeb, 0xb7b9fdb7, 0x56985e7f, 0xaa6313fc]) @ 184) @741
julia> weights = [1, 1, 2, 1, 1, 2, 3, 1, 1, 2, 3, 1];
julia> evaluate!(mach,
resampling=CV(nfolds=3),
measure=[my_loss, my_per_observation_loss, my_weighted_score, l1],
weights=weights, verbosity=0)
┌ Warning: Sample weights ignored in evaluations of the following measures, as unsupported:
│ my_loss, my_per_observation_loss
└ @ MLJBase ~/.julia/packages/MLJBase/Ov46j/src/resampling.jl:621
┌─────────────────────────┬───────────────┬───────────────────────┐
│ _.measure │ _.measurement │ _.per_fold │
├─────────────────────────┼───────────────┼───────────────────────┤
│ my_loss │ 0.462 │ [0.494, 0.743, 0.148] │
│ my_per_observation_loss │ 0.413 │ [0.479, 0.579, 0.181] │
│ my_weighted_score │ 4.31 │ [2.34, 2.17, 8.41] │
│ l1 │ 0.686 │ [0.655, 1.12, 0.282] │
└─────────────────────────┴───────────────┴───────────────────────┘
_.per_observation = [missing, [[0.48, 0.569, ..., 0.166], [0.56, 0.45, ..., 0.442], [0.0698, 0.135, ..., 0.385]], missing, [[0.48, 0.569, ..., 0.166], [0.56, 0.9, ..., 0.442], [0.0698, 0.27, ..., 0.385]]]
_.fitted_params_per_fold = [ … ]
_.report_per_fold = [ … ]
User-specified train/test sets
Users can either provide their own list of train/test pairs of row indices for resampling, as in this example:
julia> fold1 = 1:6; fold2 = 7:12;
julia> evaluate!(mach,
resampling = [(fold1, fold2), (fold2, fold1)],
measure=[l1, l2], verbosity=0)
┌───────────┬───────────────┬────────────────┐
│ _.measure │ _.measurement │ _.per_fold │
├───────────┼───────────────┼────────────────┤
│ l1 │ 0.386 │ [0.343, 0.428] │
│ l2 │ 0.196 │ [0.165, 0.227] │
└───────────┴───────────────┴────────────────┘
_.per_observation = [[[0.706, 0.482, ..., 0.437], [0.438, 0.523, ..., 0.48]], [[0.499, 0.233, ..., 0.191], [0.192, 0.273, ..., 0.23]]]
_.fitted_params_per_fold = [ … ]
_.report_per_fold = [ … ]
Or define their own re-usable ResamplingStrategy objects; see Custom resampling strategies below.
Built-in resampling strategies
MLJBase.Holdout — Type

holdout = Holdout(; fraction_train=0.7,
                    shuffle=nothing,
                    rng=nothing)

Holdout resampling strategy, for use in evaluate!, evaluate and in tuning.

train_test_pairs(holdout, rows)

Returns the pair [(train, test)], where train and test are vectors such that rows=vcat(train, test) and length(train)/length(rows) is approximately equal to fraction_train.

Pre-shuffling of rows is controlled by rng and shuffle. If rng is an integer, then the Holdout keyword constructor resets it to MersenneTwister(rng). Otherwise some AbstractRNG object is expected.

If rng is left unspecified, rng is reset to Random.GLOBAL_RNG, in which case rows are only pre-shuffled if shuffle=true is specified.
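For example, one can inspect the pair a strategy generates directly (a minimal sketch; it assumes train_test_pairs is accessible as MLJBase.train_test_pairs, as the signature above suggests, and the exact container types returned may vary):

import MLJBase

holdout = MLJBase.Holdout(fraction_train=0.75)

# A single (train, test) pair covering rows 1:8; with
# fraction_train=0.75 the train part gets roughly six of eight rows.
pairs = MLJBase.train_test_pairs(holdout, 1:8)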
MLJBase.CV — Type

cv = CV(; nfolds=6, shuffle=nothing, rng=nothing)

Cross-validation resampling strategy, for use in evaluate!, evaluate and tuning.

train_test_pairs(cv, rows)

Returns an nfolds-length iterator of (train, test) pairs of vectors (row indices), where each train and test is a sub-vector of rows. The test vectors are mutually exclusive and exhaust rows. Each train vector is the complement of the corresponding test vector. With no row pre-shuffling, the order of rows is preserved, in the sense that rows coincides precisely with the concatenation of the test vectors, in the order they are generated. The first r test vectors have length n + 1, where n, r = divrem(length(rows), nfolds), and the remaining test vectors have length n.

Pre-shuffling of rows is controlled by rng and shuffle. If rng is an integer, then the CV keyword constructor resets it to MersenneTwister(rng). Otherwise some AbstractRNG object is expected.

If rng is left unspecified, rng is reset to Random.GLOBAL_RNG, in which case rows are only pre-shuffled if shuffle=true is explicitly specified.
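The fold-size rule above can be checked directly (same assumption that train_test_pairs is reachable via MLJBase; a sketch, not verified output):

import MLJBase

cv = MLJBase.CV(nfolds=3)

# For 7 rows and 3 folds, divrem(7, 3) == (2, 1): the first test
# vector has length 3, the other two have length 2 and, with no
# shuffling, the test vectors concatenate back to 1:7.
pairs = MLJBase.train_test_pairs(cv, 1:7)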
MLJBase.StratifiedCV — Type

stratified_cv = StratifiedCV(; nfolds=6,
                               shuffle=false,
                               rng=Random.GLOBAL_RNG)

Stratified cross-validation resampling strategy, for use in evaluate!, evaluate and in tuning. Applies only to classification problems (OrderedFactor or Multiclass targets).

train_test_pairs(stratified_cv, rows, y)

Returns an nfolds-length iterator of (train, test) pairs of vectors (row indices) where each train and test is a sub-vector of rows. The test vectors are mutually exclusive and exhaust rows. Each train vector is the complement of the corresponding test vector.

Unlike regular cross-validation, the distribution of the levels of the target y corresponding to each train and test is constrained, as far as possible, to replicate that of y[rows] as a whole.

The stratified train_test_pairs algorithm is invariant to label renaming. For example, if you run replace!(y, 'a' => 'b', 'b' => 'a') and then re-run train_test_pairs, the returned (train, test) pairs will be the same.

Pre-shuffling of rows is controlled by rng and shuffle. If rng is an integer, then the StratifiedCV keyword constructor resets it to MersenneTwister(rng). Otherwise some AbstractRNG object is expected.

If rng is left unspecified, rng is reset to Random.GLOBAL_RNG, in which case rows are only pre-shuffled if shuffle=true is explicitly specified.
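Since stratification needs the class labels, y is passed as a third argument. A sketch (assuming coerce and Multiclass from the scientific types interface re-exported by MLJ; the fold composition described in the comment is approximate):

using MLJ
import MLJBase

# A categorical target with two levels:
y = coerce(["a", "b", "a", "b", "a", "b"], Multiclass)

stratified_cv = StratifiedCV(nfolds=2)

# Each test fold receives, as far as possible, the same proportion
# of "a"s and "b"s as y itself.
pairs = MLJBase.train_test_pairs(stratified_cv, 1:6, y)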
Custom resampling strategies
To define your own resampling strategy, make relevant parameters of your strategy the fields of a new type MyResamplingStrategy <: MLJ.ResamplingStrategy, and implement one of the following methods:

MLJ.train_test_pairs(my_strategy::MyResamplingStrategy, rows)
MLJ.train_test_pairs(my_strategy::MyResamplingStrategy, rows, y)
MLJ.train_test_pairs(my_strategy::MyResamplingStrategy, rows, X, y)

Each method takes a vector of indices rows and returns a vector [(t1, e1), (t2, e2), ..., (tk, ek)] of train/test pairs of row indices selected from rows. Here X, y are the input and target data (ignored in simple strategies, such as Holdout and CV).
Here is the code for the Holdout strategy as an example:
struct Holdout <: ResamplingStrategy
fraction_train::Float64
shuffle::Bool
rng::Union{Int,AbstractRNG}
function Holdout(fraction_train, shuffle, rng)
0 < fraction_train < 1 ||
error("`fraction_train` must be between 0 and 1.")
return new(fraction_train, shuffle, rng)
end
end
# Keyword Constructor
function Holdout(; fraction_train::Float64=0.7, shuffle=nothing, rng=nothing)
if rng isa Integer
rng = MersenneTwister(rng)
end
if shuffle === nothing
shuffle = ifelse(rng===nothing, false, true)
end
if rng === nothing
rng = Random.GLOBAL_RNG
end
return Holdout(fraction_train, shuffle, rng)
end
function train_test_pairs(holdout::Holdout, rows)
train, test = partition(rows, holdout.fraction_train,
shuffle=holdout.shuffle, rng=holdout.rng)
return [(train, test),]
end
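For comparison, here is a minimal user-defined strategy, following the recipe above (the name EvenOdd and its two-fold behaviour are illustrative only, not part of MLJ):

using MLJ

# A toy two-fold strategy: a model trained on the even-indexed rows
# is tested on the odd-indexed rows, and vice versa.
struct EvenOdd <: MLJ.ResamplingStrategy end

function MLJ.train_test_pairs(::EvenOdd, rows)
    evens = rows[2:2:end]
    odds  = rows[1:2:end]
    return [(evens, odds), (odds, evens)]
end

# usage: evaluate!(mach, resampling=EvenOdd(), measure=l2, verbosity=0)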
API
MLJBase.evaluate! — Function

evaluate!(mach,
          resampling=CV(),
          measure=nothing,
          rows=nothing,
          weights=nothing,
          operation=predict,
          repeats=1,
          acceleration=default_resource(),
          force=false,
          verbosity=1,
          check_measure=true)
Estimate the performance of a machine mach wrapping a supervised model in data, using the specified resampling strategy (defaulting to 6-fold cross-validation) and measure, which can be a single measure or vector.

Do subtypes(MLJ.ResamplingStrategy) to obtain a list of available resampling strategies. If resampling is not an object of type MLJ.ResamplingStrategy, then a vector of pairs of the form (train_rows, test_rows) is expected. For example, setting

resampling = [(1:100, 101:200),
              (101:200, 1:100)]
gives two-fold cross-validation using the first 200 rows of data.
The resampling strategy is applied repeatedly (Monte Carlo resampling) if repeats > 1. For example, if repeats = 10, then resampling = CV(nfolds=5, shuffle=true) generates a total of 50 (train, test) pairs for evaluation and subsequent aggregation.

If resampling isa MLJ.ResamplingStrategy then one may optionally restrict the data used in evaluation by specifying rows.

An optional weights vector may be passed for measures that support sample weights (MLJ.supports_weights(measure) == true), which is ignored by those that don't. These weights are not to be confused with any weights w bound to mach (as in mach = machine(model, X, y, w)). To pass these to the performance evaluation measures you must explicitly specify weights=w in the evaluate! call.
User-defined measures are supported; see the manual for details.
If no measure is specified, then default_measure(mach.model) is used, unless this default is nothing, in which case an error is thrown.

The acceleration keyword argument is used to specify the compute resource (a subtype of ComputationalResources.AbstractResource) that will be used to accelerate/parallelize the resampling operation.

Although evaluate! is mutating, mach.model and mach.args are untouched.
Summary of keyword arguments

- resampling - resampling strategy (default is CV(nfolds=6))
- measure/measures - measure or vector of measures (losses, scores, etc)
- rows - vector of observation indices from which both train and test folds are constructed (default is all observations)
- weights - per-sample weights for measures (not to be confused with weights used in training)
- operation - predict, predict_mean, predict_mode or predict_median; predict is the default but cannot be used with a deterministic measure if model isa Probabilistic
- repeats - default is 1; set to a higher value for repeated (Monte Carlo) resampling
- acceleration - parallelization option; currently supported options are instances of CPU1 (single-threaded computation), CPUThreads (multi-threaded computation) and CPUProcesses (multi-process computation); default is default_resource()
- force - default is false; set to true to force cold-restart of each training event
- verbosity - verbosity level, an integer defaulting to 1
- check_measure - default is true
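Combining several of these keyword arguments in a single call might look like this (a sketch reusing mach and weights from earlier on this page):

evaluate!(mach,
          resampling=CV(nfolds=5, shuffle=true, rng=1234),
          measure=[l1, l2],
          weights=weights,
          repeats=2,            # 2 × 5 = 10 (train, test) pairs in total
          acceleration=CPU1(),  # single-threaded compute resource
          verbosity=0)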
Return value

A property-accessible object of type PerformanceEvaluation with these properties:

- measure: the vector of specified measures
- measurements: the corresponding measurements, aggregated across the test folds using the aggregation method defined for each measure (do aggregation(measure) to inspect)
- per_fold: a vector of vectors of individual test fold evaluations (one vector per measure)
- per_observation: a vector of vectors of individual observation evaluations of those measures for which reports_each_observation(measure) is true, which is otherwise reported missing
- fitted_params_per_fold: a vector containing fitted_params(mach) for each machine mach trained during resampling
- report_per_fold: a vector containing report(mach) for each machine mach trained during resampling
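Because the return value is property-accessible, individual results can be read off directly (a sketch; property names as listed above):

e = evaluate!(mach, resampling=CV(nfolds=3), measure=[l1, l2], verbosity=0)

e.measure             # the two measures, l1 and l2
e.per_fold            # one three-element vector of fold results per measure
e.per_observation[1]  # per-observation l1 values, when reported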
MLJModelInterface.evaluate — Function

Some meta-models may choose to implement the evaluate operation.