Evaluating Model Performance
MLJ allows quick evaluation of a supervised model's performance against a battery of selected losses or scores. For more on available performance measures, see Performance Measures.
In addition to hold-out and cross-validation, the user can specify their own list of train/test pairs of row indices for resampling, or define their own re-usable resampling strategies.
For simultaneously evaluating multiple models and/or data sets, see Benchmarking.
Evaluating against a single measure
julia> using MLJ
julia> X = (a=rand(12), b=rand(12), c=rand(12));
julia> y = X.a + 2X.b + 0.05*rand(12);
julia> model = @load RidgeRegressor pkg=MultivariateStats
MLJModels.MultivariateStats_.RidgeRegressor(lambda = 1.0,) @ 1…61
julia> cv=CV(nfolds=3)
CV(nfolds = 3,
shuffle = false,
rng = MersenneTwister(UInt32[0x026ce58d, 0xdedad331, 0xee6917e9, 0xcb3e2c68]) @ 241,) @ 1…53
julia> evaluate(model, X, y, resampling=cv, measure=l2, verbosity=0)
(measure = MLJBase.L2[l2],
measurement = [0.20988159297625752],
per_fold = Array{Float64,1}[[0.03802450213646114, 0.2325617130366544, 0.35905856375565703]],
per_observation = Array{Array{Float64,1},1}[[[0.001391574564082668, 0.1432310430923185, 0.00741862776613265, 5.676312331076877e-5], [0.005921698614466024, 0.29286671033641093, 0.11993698694014059, 0.5115214562556001], [0.4876050015113042, 0.032870797791136656, 0.43108832586510365, 0.48467012985508373]]],)
Alternatively, instead of applying evaluate to a model + data, one may call evaluate! on an existing machine wrapping the model in data:
julia> mach = machine(model, X, y)
Machine{RidgeRegressor} @ 3…29
julia> evaluate!(mach, resampling=cv, measure=l2, verbosity=0)
(measure = MLJBase.L2[l2],
measurement = [0.20988159297625752],
per_fold = Array{Float64,1}[[0.03802450213646114, 0.2325617130366544, 0.35905856375565703]],
per_observation = Array{Array{Float64,1},1}[[[0.001391574564082668, 0.1432310430923185, 0.00741862776613265, 5.676312331076877e-5], [0.005921698614466024, 0.29286671033641093, 0.11993698694014059, 0.5115214562556001], [0.4876050015113042, 0.032870797791136656, 0.43108832586510365, 0.48467012985508373]]],)
(The latter call is mutating, as the learned parameters stored in the machine may change.)
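The fields of the returned named tuple are related by simple averaging, at least as far as the numbers printed above suggest. The following sketch uses only values copied from that output; the averaging rule is inferred from them and is not taken from the MLJ source.

```julia
using Statistics

# Per-observation l2 losses, copied from the `per_observation` field above
# (one inner vector per fold).
per_observation = [
    [0.001391574564082668, 0.1432310430923185, 0.00741862776613265, 5.676312331076877e-5],
    [0.005921698614466024, 0.29286671033641093, 0.11993698694014059, 0.5115214562556001],
    [0.4876050015113042, 0.032870797791136656, 0.43108832586510365, 0.48467012985508373],
]

# Mean loss within each fold; compare with the `per_fold` field.
per_fold = mean.(per_observation)

# Mean over the folds; compare with the `measurement` field.
measurement = mean(per_fold)
```

Here per_fold and measurement agree with the printed per_fold and measurement values up to floating-point rounding.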
Multiple measures
julia> evaluate!(mach,
resampling=cv,
measure=[l1, rms, rmslp1], verbosity=0)
(measure = MLJBase.Measure[l1, rms, rmslp1],
measurement = [0.3684520660876574, 0.45812835862480455, 0.20657881967954347],
per_fold = Array{Float64,1}[[0.12735704362857336, 0.41991266570667796, 0.5580864889277211], [0.19499872342264485, 0.4822465272416738, 0.5992149562182648], [0.09614912587579903, 0.23645566214490457, 0.25073590020879444]],
per_observation = Union{Missing, Array{Array{Float64,1},1}}[Array{Float64,1}[[0.03730381433691021, 0.3784587733060478, 0.08613145631029728, 0.007534130561038133], [0.0769525738001402, 0.5411716089526601, 0.3463191980530975, 0.715207282020814], [0.6982871912840047, 0.18130305510701317, 0.656573168706355, 0.6961825406135116]], missing, missing],)
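Not every measure aggregates per-fold values with a plain mean. Judging only from the numbers printed above (an inference, not a statement about the MLJ source), l1 is averaged over folds, whereas the aggregate rms is the square root of the mean of the squared per-fold values. A small check with values copied from the output:

```julia
using Statistics

# Per-fold values copied from the `per_fold` field above.
per_fold_l1  = [0.12735704362857336, 0.41991266570667796, 0.5580864889277211]
per_fold_rms = [0.19499872342264485, 0.4822465272416738, 0.5992149562182648]

# Plain mean over folds; compare with the first `measurement` entry.
aggregate_l1 = mean(per_fold_l1)

# Root of the mean of squares; compare with the second `measurement` entry.
aggregate_rms = sqrt(mean(per_fold_rms .^ 2))

# For contrast: a plain mean of the per-fold rms values is smaller.
plain_mean_rms = mean(per_fold_rms)
```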
Custom measures and weighted measures
julia> my_loss(yhat, y) = maximum((yhat - y).^2);
julia> my_per_observation_loss(yhat, y) = abs.(yhat - y);
julia> MLJ.reports_each_observation(::typeof(my_per_observation_loss)) = true;
julia> my_weighted_score(yhat, y) = 1/mean(abs.(yhat - y));
julia> my_weighted_score(yhat, y, w) = 1/mean(abs.((yhat - y).^w));
julia> MLJ.supports_weights(::typeof(my_weighted_score)) = true;
julia> MLJ.orientation(::typeof(my_weighted_score)) = :score;
julia> holdout = Holdout(fraction_train=0.8)
Holdout(fraction_train = 0.8,
shuffle = false,
rng = MersenneTwister(UInt32[0x026ce58d, 0xdedad331, 0xee6917e9, 0xcb3e2c68]) @ 241,) @ 1…48
julia> weights = [1, 1, 2, 1, 1, 2, 3, 1, 1, 2, 3, 1];
julia> evaluate!(mach,
resampling=CV(nfolds=3),
measure=[my_loss, my_per_observation_loss, my_weighted_score, l1],
weights=weights, verbosity=0)
┌ Warning: Sample weights ignored in evaluations of the following measures, as unsupported:
│ my_loss, my_per_observation_loss
└ @ MLJ ~/build/alan-turing-institute/MLJ.jl/src/resampling.jl:433
(measure = Any[Main.ex-evaluation_of_supervised_models.my_loss, Main.ex-evaluation_of_supervised_models.my_per_observation_loss, Main.ex-evaluation_of_supervised_models.my_weighted_score, l1],
measurement = [0.38078583361974094, 0.3684520660876574, 5.058719280180763, 0.3559066428224096],
per_fold = Array{Float64,1}[[0.1432310430923185, 0.5115214562556001, 0.4876050015113042], [0.12735704362857336, 0.41991266570667796, 0.5580864889277211], [9.286875978357664, 3.550622408232978, 2.338659453951646], [0.11911192616491813, 0.4162086668407953, 0.5323993354615153]],
per_observation = Union{Missing, Array{Array{Float64,1},1}}[missing, Array{Float64,1}[[0.03730381433691021, 0.3784587733060478, 0.08613145631029728, 0.007534130561038133], [0.0769525738001402, 0.5411716089526601, 0.3463191980530975, 0.715207282020814], [0.6982871912840047, 0.18130305510701317, 0.656573168706355, 0.6961825406135116]], missing, Array{Float64,1}[[0.02984305146952817, 0.30276701864483824, 0.13781033009647564, 0.0060273044488305064], [0.04397289931436583, 0.6184818388030402, 0.59369005380531, 0.40868987544046514], [0.3990212521622884, 0.2072034915508722, 1.1255540034966085, 0.3978185946362923]]],)
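Because the custom measures above are ordinary Julia functions, they can be tried directly on small vectors before being passed to evaluate!. The sketch below repeats three of the definitions and applies them to made-up predictions; the data are hypothetical and the MLJ trait declarations are omitted so that the snippet runs without MLJ.

```julia
using Statistics

# Definitions repeated from the example above.
my_loss(yhat, y) = maximum((yhat - y).^2)
my_per_observation_loss(yhat, y) = abs.(yhat - y)
my_weighted_score(yhat, y) = 1/mean(abs.(yhat - y))

yhat = [1.0, 2.0, 3.0]   # hypothetical predictions
y    = [1.0, 2.5, 2.0]   # hypothetical ground truth

aggregate = my_loss(yhat, y)                  # a single number
per_obs   = my_per_observation_loss(yhat, y)  # one value per observation
score     = my_weighted_score(yhat, y)        # larger is better
```

The first function returns one aggregate value, while the second returns a vector with one entry per observation, which is the distinction the reports_each_observation declaration communicates to MLJ.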
User-specified train/test sets
Users can provide their own list of train/test pairs of row indices for resampling, as in this example:
julia> fold1 = 1:6; fold2 = 7:12;
julia> evaluate!(mach,
resampling = [(fold1, fold2), (fold2, fold1)],
measure=[l1, l2], verbosity=0)
(measure = MLJBase.Measure[l1, l2],
measurement = [0.3951558812987779, 0.27273426964274006],
per_fold = Array{Float64,1}[[0.6226545270049493, 0.1676572355926065], [0.4824078918953378, 0.06306064739014229]],
per_observation = Array{Array{Float64,1},1}[[[0.3932251712408843, 0.9405619238100722, 0.6829129148956254, 0.06914163145280994, 0.9313979584138186, 0.7186875622164854], [0.024316949187180192, 0.36030826956059836, 0.0007248605052325718, 0.06064573681883245, 0.07080707764561467, 0.48914051983818085]], [[0.15462603529742278, 0.8846567325213041, 0.46637004933123977, 0.004780565199956196, 0.8675021569374293, 0.5165118120846747], [0.0005913140177719034, 0.1298220491137528, 5.254227520460193e-7, 0.00367790539429909, 0.005013642244712104, 0.2392584481475658]]],)
Alternatively, users can define their own re-usable ResamplingStrategy objects; see Custom resampling strategies below.
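A list of train/test pairs like the one in the example above can also be generated programmatically. The following is a minimal sketch in plain Julia; the variable names and the choice of three contiguous folds are illustrative assumptions.

```julia
rows = 1:12

# Three hypothetical test folds that together cover all rows.
folds = [1:4, 5:8, 9:12]

# Pair each test fold with its complement as the training rows.
fold_pairs = [(setdiff(rows, test), test) for test in folds]
```

Going by the description above, a vector such as fold_pairs could then be passed as the resampling keyword argument.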
Built-in resampling strategies
MLJ.Holdout — Type.
holdout = Holdout(; fraction_train=0.7,
                    shuffle=nothing,
                    rng=nothing)
Holdout resampling strategy, for use in evaluate!, evaluate and in tuning.
train_test_pairs(holdout, rows)
Returns the pair [(train, test)], where train and test are vectors such that rows=vcat(train, test) and length(train)/length(rows) is approximately equal to fraction_train.
Pre-shuffling of rows is controlled by rng and shuffle. If rng is an integer, then the Holdout keyword constructor resets it to MersenneTwister(rng). Otherwise some AbstractRNG object is expected.
If rng is left unspecified, rng is reset to Random.GLOBAL_RNG, in which case rows are only pre-shuffled if shuffle=true is specified.
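The split described above can be sketched in plain Julia. The helper holdout_pair below is hypothetical and is not the MLJ implementation; in particular, the rounding rule used to turn fraction_train into a number of training rows is an assumption.

```julia
using Random

# Hypothetical helper illustrating a holdout split; not part of MLJ.
function holdout_pair(rows, fraction_train; shuffle=false, rng=Random.default_rng())
    0 < fraction_train < 1 || error("`fraction_train` must be between 0 and 1.")
    rows = collect(rows)
    if shuffle
        # Qualified call, because the keyword `shuffle` shadows the function name.
        rows = Random.shuffle(rng, rows)
    end
    n_train = round(Int, fraction_train * length(rows))  # assumed rounding rule
    return [(rows[1:n_train], rows[n_train+1:end])]
end

train, test = holdout_pair(1:10, 0.7)[1]

# With shuffling, the same rows are split after a seeded permutation.
shuffled_train, shuffled_test =
    holdout_pair(1:10, 0.7; shuffle=true, rng=MersenneTwister(123))[1]
```

Without shuffling, vcat(train, test) reproduces rows in the original order, as stated above.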
MLJ.CV — Type.
cv = CV(; nfolds=6, shuffle=nothing, rng=nothing)
Cross-validation resampling strategy, for use in evaluate!, evaluate and tuning.
train_test_pairs(cv, rows)
Returns an nfolds-length iterator of (train, test) pairs of vectors (row indices), where each train and test is a sub-vector of rows. The test vectors are mutually exclusive and exhaust rows. Each train vector is the complement of the corresponding test vector. With no row pre-shuffling, the order of rows is preserved, in the sense that rows coincides precisely with the concatenation of the test vectors, in the order they are generated. All but the last test vector have equal length.
Pre-shuffling of rows is controlled by rng and shuffle. If rng is an integer, then the CV keyword constructor resets it to MersenneTwister(rng). Otherwise some AbstractRNG object is expected.
If rng is left unspecified, rng is reset to Random.GLOBAL_RNG, in which case rows are only pre-shuffled if shuffle=true is explicitly specified.
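The properties listed above (disjoint, exhaustive test vectors in the original order, with all but the last of equal length) can be reproduced with a short plain-Julia sketch. The helper cv_pairs is hypothetical, and the rule that the last fold absorbs the remainder is an assumption chosen to satisfy those properties; no shuffling is performed.

```julia
# Hypothetical helper illustrating unshuffled k-fold splitting; not part of MLJ.
function cv_pairs(rows, nfolds)
    rows = collect(rows)
    n = length(rows)
    1 < nfolds <= n ||
        error("`nfolds` must be greater than 1 and at most the number of rows.")
    k = div(n, nfolds)                 # common length of all but the last test fold
    return map(1:nfolds) do i
        lo = (i - 1)*k + 1
        hi = i == nfolds ? n : i*k     # the last fold takes any remaining rows
        test = rows[lo:hi]
        train = vcat(rows[1:lo-1], rows[hi+1:n])
        (train, test)
    end
end

fold_pairs = cv_pairs(1:10, 3)
```

For ten rows and three folds, the test vectors have lengths 3, 3 and 4, and concatenating them recovers the rows in order.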
MLJ.StratifiedCV — Type.
stratified_cv = StratifiedCV(; nfolds=6,
                               shuffle=false,
                               rng=Random.GLOBAL_RNG)
Stratified cross-validation resampling strategy, for use in evaluate!, evaluate and in tuning. Applies only to classification problems (OrderedFactor or Multiclass targets).
train_test_pairs(stratified_cv, rows, y)
Returns an nfolds-length iterator of (train, test) pairs of vectors (row indices) where each train and test is a sub-vector of rows. The test vectors are mutually exclusive and exhaust rows. Each train vector is the complement of the corresponding test vector.
Unlike regular cross-validation, the distribution of the levels of the target y corresponding to each train and test is constrained, as far as possible, to replicate that of y[rows] as a whole.
Specifically, the data is split into a number of groups on which y is constant, and each individual group is resampled according to the ordinary cross-validation strategy CV(nfolds=nfolds). To obtain the final (train, test) pairs of row indices, the per-group pairs are collated in such a way that each collated train and test respects the original order of rows (after shuffling, if shuffle=true).
Pre-shuffling of rows is controlled by rng and shuffle. If rng is an integer, then the StratifiedCV keyword constructor resets it to MersenneTwister(rng). Otherwise some AbstractRNG object is expected.
If rng is left unspecified, rng is reset to Random.GLOBAL_RNG, in which case rows are only pre-shuffled if shuffle=true is explicitly specified.
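The group-then-collate procedure described above can be sketched in plain Julia. Both helpers below are hypothetical and are not the MLJ implementation: rows are grouped by target level, each group is cut into contiguous blocks (the last block absorbing any remainder, an assumed rule), and the blocks are merged per fold while keeping the original row order. Shuffling is left out.

```julia
# Hypothetical helper: cut `rows` into `nfolds` contiguous blocks.
function contiguous_blocks(rows, nfolds)
    n = length(rows)
    k = div(n, nfolds)
    blocks = Vector{Vector{Int}}()
    for i in 1:nfolds
        lo = (i - 1)*k + 1
        hi = i == nfolds ? n : i*k
        push!(blocks, rows[lo:hi])
    end
    return blocks
end

# Hypothetical helper: stratified (train, test) pairs without shuffling.
function stratified_pairs(rows, y, nfolds)
    rows = collect(Int, rows)
    tests = [Int[] for _ in 1:nfolds]
    for level in unique(y[rows])
        group = [r for r in rows if y[r] == level]   # rows on which y is constant
        for (i, block) in enumerate(contiguous_blocks(group, nfolds))
            append!(tests[i], block)
        end
    end
    return map(tests) do test
        in_test = Set(test)
        # Filtering `rows` keeps both vectors in the original row order.
        ([r for r in rows if !(r in in_test)], [r for r in rows if r in in_test])
    end
end

y = [:a, :a, :a, :a, :b, :b, :b, :b]   # made-up two-class target
strat_pairs = stratified_pairs(1:8, y, 2)
```

In this toy example every train and test vector contains the two classes in equal proportion, mirroring the class balance of y as a whole.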
Custom resampling strategies
To define your own resampling strategy, make relevant parameters of your strategy the fields of a new type MyResamplingStrategy <: MLJ.ResamplingStrategy, and implement one of the following methods:
MLJ.train_test_pairs(my_strategy::MyResamplingStrategy, rows)
MLJ.train_test_pairs(my_strategy::MyResamplingStrategy, rows, y)
MLJ.train_test_pairs(my_strategy::MyResamplingStrategy, rows, X, y)
Each method takes a vector of indices rows and returns a vector [(t1, e1), (t2, e2), ... (tk, ek)] of train/test pairs of row indices selected from rows. Here X, y are the input and target data (ignored in simple strategies, such as Holdout and CV).
Here is the code for the Holdout strategy as an example:
struct Holdout <: ResamplingStrategy
    fraction_train::Float64
    shuffle::Bool
    rng::Union{Int,AbstractRNG}

    function Holdout(fraction_train, shuffle, rng)
        0 < fraction_train < 1 ||
            error("`fraction_train` must be between 0 and 1.")
        return new(fraction_train, shuffle, rng)
    end
end

# Keyword Constructor
function Holdout(; fraction_train::Float64=0.7, shuffle=nothing, rng=nothing)
    if rng isa Integer
        rng = MersenneTwister(rng)
    end
    if shuffle === nothing
        shuffle = ifelse(rng===nothing, false, true)
    end
    if rng === nothing
        rng = Random.GLOBAL_RNG
    end
    return Holdout(fraction_train, shuffle, rng)
end

function train_test_pairs(holdout::Holdout, rows)
    train, test = partition(rows, holdout.fraction_train,
                            shuffle=holdout.shuffle, rng=holdout.rng)
    return [(train, test),]
end
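As a further illustration of the same pattern, here is a sketch of a parameter-free strategy in which each row serves as a one-element test set in turn. The type MyLeaveOneOut is hypothetical. So that the snippet runs without MLJ, it declares a local stand-in abstract type and a local train_test_pairs function; in real use one would presumably subtype MLJ.ResamplingStrategy and extend MLJ.train_test_pairs, as described above.

```julia
# Stand-in for `MLJ.ResamplingStrategy`, so the sketch runs without MLJ.
abstract type ResamplingStrategy end

# Hypothetical strategy with no parameters, hence no fields.
struct MyLeaveOneOut <: ResamplingStrategy end

# One (train, test) pair per row: the row itself is the test set and
# all remaining rows form the training set.
function train_test_pairs(::MyLeaveOneOut, rows)
    rows = collect(rows)
    return [(deleteat!(copy(rows), i), [rows[i]]) for i in eachindex(rows)]
end

loo_pairs = train_test_pairs(MyLeaveOneOut(), 1:3)
```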
API
MLJ.evaluate! — Function.
evaluate!(mach,
          resampling=CV(),
          measure=nothing,
          weights=nothing,
          operation=predict,
          acceleration=DEFAULT_RESOURCE[],
          force=false,
          verbosity=1)
Estimate the performance of a machine mach wrapping a supervised model in data, using the specified resampling strategy (defaulting to 6-fold cross-validation) and measure, which can be a single measure or vector.
Do subtypes(MLJ.ResamplingStrategy) to obtain a list of available resampling strategies. If resampling is not an object of type MLJ.ResamplingStrategy, then a vector of pairs (of the form (train_rows, test_rows)) is expected. For example, setting
resampling = [(1:100, 101:200),
              (101:200, 1:100)]
gives two-fold cross-validation using the first 200 rows of data.
If resampling isa MLJ.ResamplingStrategy then one may optionally restrict the data used in evaluation by specifying rows.
An optional weights vector may be passed for measures that support sample weights (MLJ.supports_weights(measure) == true); it is ignored by measures that do not.
Important: If mach already wraps sample weights w (as in mach = machine(model, X, y, w)) then these weights, which are used for training, are automatically passed to the measures for evaluation. However, for evaluation purposes, any weights specified as a keyword argument will take precedence over w.
User-defined measures are supported; see the manual for details.
If no measure is specified, then default_measure(mach.model) is used, unless this default is nothing, in which case an error is thrown.
The acceleration keyword argument is used to specify the compute resource (a subtype of ComputationalResources.AbstractResource) that will be used to accelerate/parallelize the resampling operation.
Although evaluate! is mutating, mach.model and mach.args are untouched.
MLJBase.evaluate — Function.
evaluate(model, X, y; measure=nothing, options...)
evaluate(model, X, y, w; measure=nothing, options...)
Evaluate the performance of a supervised model model on input data X and target y, optionally specifying sample weights w for training, where supported. The same weights are passed to measures that support sample weights, unless this behaviour is overridden by explicitly specifying the option weights=....
See the machine version evaluate! for the complete list of options.