Evaluating Model Performance
MLJ allows quick evaluation of a supervised model's performance against a battery of selected losses or scores, using the evaluate or evaluate! methods. For more on available performance measures, see Performance Measures.
In addition to hold-out and cross-validation, the user can specify an explicit list of train/test pairs of row indices for resampling, or define new resampling strategies.
For simultaneously evaluating multiple models, see "Comparing models of different type and nested cross-validation".
For externally logging the outcomes of performance evaluation experiments, see Logging Workflows.
Evaluating against a single measure
julia> using MLJ
julia> X = (a=rand(12), b=rand(12), c=rand(12));
julia> y = X.a + 2X.b + 0.05*rand(12);
julia> model = (@load RidgeRegressor pkg=MultivariateStats verbosity=0)()
RidgeRegressor(
  lambda = 1.0,
  bias = true)
julia> cv = CV(nfolds=3)
CV(
  nfolds = 3,
  shuffle = false,
  rng = Random.TaskLocalRNG())
julia> evaluate(model, X, y, resampling=cv, measure=l2, verbosity=0)
PerformanceEvaluation object with these fields:
  model, measure, operation,
  measurement, per_fold, per_observation,
  fitted_params_per_fold, report_per_fold,
  train_test_rows, resampling, repeats
Extract:
┌──────────┬───────────┬─────────────┐
│ measure  │ operation │ measurement │
├──────────┼───────────┼─────────────┤
│ LPLoss(  │ predict   │ 0.2         │
│   p = 2) │           │             │
└──────────┴───────────┴─────────────┘
┌──────────────────────┬─────────┐
│ per_fold             │ 1.96*SE │
├──────────────────────┼─────────┤
│ [0.0491, 0.42, 0.13] │ 0.27    │
└──────────────────────┴─────────┘
Alternatively, instead of applying evaluate to a model + data, one may call evaluate! on an existing machine wrapping the model in data:
julia> mach = machine(model, X, y)
untrained Machine; caches model-specific representations of data
  model: RidgeRegressor(lambda = 1.0, …)
  args:
    1: Source @829 ⏎ Table{AbstractVector{Continuous}}
    2: Source @980 ⏎ AbstractVector{Continuous}
julia> evaluate!(mach, resampling=cv, measure=l2, verbosity=0)
PerformanceEvaluation object with these fields:
  model, measure, operation,
  measurement, per_fold, per_observation,
  fitted_params_per_fold, report_per_fold,
  train_test_rows, resampling, repeats
Extract:
┌──────────┬───────────┬─────────────┐
│ measure  │ operation │ measurement │
├──────────┼───────────┼─────────────┤
│ LPLoss(  │ predict   │ 0.2         │
│   p = 2) │           │             │
└──────────┴───────────┴─────────────┘
┌──────────────────────┬─────────┐
│ per_fold             │ 1.96*SE │
├──────────────────────┼─────────┤
│ [0.0491, 0.42, 0.13] │ 0.27    │
└──────────────────────┴─────────┘
(The latter call is a mutating call, as the learned parameters stored in the machine potentially change.)
Multiple measures
Multiple measures are specified as a vector:
julia> evaluate!(
           mach,
           resampling=cv,
           measures=[l1, rms, rmslp1],
           verbosity=0,
       )
PerformanceEvaluation object with these fields:
  model, measure, operation,
  measurement, per_fold, per_observation,
  fitted_params_per_fold, report_per_fold,
  train_test_rows, resampling, repeats
Extract:
┌───┬──────────────────────────────────────┬───────────┬─────────────┐
│   │ measure                              │ operation │ measurement │
├───┼──────────────────────────────────────┼───────────┼─────────────┤
│ A │ LPLoss(                              │ predict   │ 0.348       │
│   │   p = 1)                             │           │             │
│ B │ RootMeanSquaredError()               │ predict   │ 0.447       │
│ C │ RootMeanSquaredLogProportionalError( │ predict   │ 0.193       │
│   │   offset = 1)                        │           │             │
└───┴──────────────────────────────────────┴───────────┴─────────────┘
┌───┬────────────────────────┬─────────┐
│   │ per_fold               │ 1.96*SE │
├───┼────────────────────────┼─────────┤
│ A │ [0.194, 0.565, 0.285]  │ 0.268   │
│ B │ [0.222, 0.648, 0.361]  │ 0.301   │
│ C │ [0.0918, 0.299, 0.117] │ 0.157   │
└───┴────────────────────────┴─────────┘
Custom measures can also be provided.
Specifying weights
Per-observation weights can be passed to measures. If a measure does not support weights, the weights are ignored:
julia> holdout = Holdout(fraction_train=0.8)
Holdout(
  fraction_train = 0.8,
  shuffle = false,
  rng = Random.TaskLocalRNG())
julia> weights = [1, 1, 2, 1, 1, 2, 3, 1, 1, 2, 3, 1];
julia> evaluate!(
           mach,
           resampling=CV(nfolds=3),
           measure=[l2, rsquared],
           weights=weights,
       )
┌ Warning: Sample weights ignored in evaluations of the following measures, as unsupported:
│ RSquared()
└ @ MLJBase ~/.julia/packages/MLJBase/F8Zzu/src/resampling.jl:1042
Evaluating over 3 folds: 100%[=========================] Time: 0:00:00
PerformanceEvaluation object with these fields:
  model, measure, operation,
  measurement, per_fold, per_observation,
  fitted_params_per_fold, report_per_fold,
  train_test_rows, resampling, repeats
Extract:
┌───┬────────────┬───────────┬─────────────┐
│   │ measure    │ operation │ measurement │
├───┼────────────┼───────────┼─────────────┤
│ A │ LPLoss(    │ predict   │ 0.464       │
│   │   p = 2)   │           │             │
│ B │ RSquared() │ predict   │ 0.453       │
└───┴────────────┴───────────┴─────────────┘
┌───┬────────────────────────┬─────────┐
│   │ per_fold               │ 1.96*SE │
├───┼────────────────────────┼─────────┤
│ A │ [0.0577, 0.977, 0.358] │ 0.65    │
│ B │ [0.658, 0.297, 0.404]  │ 0.257   │
└───┴────────────────────────┴─────────┘
In classification problems, use class_weights=... to specify a class weight dictionary.
User-specified train/test sets
Users can provide an explicit list of train/test pairs of row indices for resampling, as in this example:
julia> fold1 = 1:6; fold2 = 7:12;
julia> evaluate!(
           mach,
           resampling = [(fold1, fold2), (fold2, fold1)],
           measures=[l1, l2],
           verbosity=0,
       )
PerformanceEvaluation object with these fields:
  model, measure, operation,
  measurement, per_fold, per_observation,
  fitted_params_per_fold, report_per_fold,
  train_test_rows, resampling, repeats
Extract:
┌───┬──────────┬───────────┬─────────────┐
│   │ measure  │ operation │ measurement │
├───┼──────────┼───────────┼─────────────┤
│ A │ LPLoss(  │ predict   │ 0.38        │
│   │   p = 1) │           │             │
│ B │ LPLoss(  │ predict   │ 0.235       │
│   │   p = 2) │           │             │
└───┴──────────┴───────────┴─────────────┘
┌───┬─────────────────┬─────────┐
│   │ per_fold        │ 1.96*SE │
├───┼─────────────────┼─────────┤
│ A │ [0.518, 0.242]  │ 0.382   │
│ B │ [0.385, 0.0843] │ 0.417   │
└───┴─────────────────┴─────────┘
Alternatively, users can define their own reusable ResamplingStrategy objects; see Custom resampling strategies below.
Built-in resampling strategies
MLJBase.Holdout — Type
holdout = Holdout(; fraction_train=0.7, shuffle=nothing, rng=nothing)
Instantiate a Holdout resampling strategy, for use in evaluate!, evaluate and in tuning.
train_test_pairs(holdout, rows)
Returns the pair [(train, test)], where train and test are vectors such that rows=vcat(train, test) and length(train)/length(rows) is approximately equal to fraction_train.
Pre-shuffling of rows is controlled by rng and shuffle. If rng is an integer, then the Holdout keyword constructor resets it to MersenneTwister(rng). Otherwise some AbstractRNG object is expected.
If rng is left unspecified, rng is reset to Random.GLOBAL_RNG, in which case rows are only pre-shuffled if shuffle=true is specified.
MLJBase.CV — Type
cv = CV(; nfolds=6, shuffle=nothing, rng=nothing)
Cross-validation resampling strategy, for use in evaluate!, evaluate and tuning.
train_test_pairs(cv, rows)
Returns an nfolds-length iterator of (train, test) pairs of vectors (row indices), where each train and test is a sub-vector of rows. The test vectors are mutually exclusive and exhaust rows. Each train vector is the complement of the corresponding test vector. With no row pre-shuffling, the order of rows is preserved, in the sense that rows coincides precisely with the concatenation of the test vectors, in the order they are generated. The first r test vectors have length n + 1, where n, r = divrem(length(rows), nfolds), and the remaining test vectors have length n.
Pre-shuffling of rows is controlled by rng and shuffle. If rng is an integer, then the CV keyword constructor resets it to MersenneTwister(rng). Otherwise some AbstractRNG object is expected.
If rng is left unspecified, rng is reset to Random.GLOBAL_RNG, in which case rows are only pre-shuffled if shuffle=true is explicitly specified.
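The fold-size rule for the unshuffled case can be checked with a few lines of plain Julia. The following is a minimal sketch of the layout described in the docstring, not MLJBase's own implementation; the helper names `cv_test_folds` and `cv_pairs` are hypothetical.

```julia
# Sketch of the unshuffled CV layout: the first r test folds get n + 1 rows,
# the rest get n rows, where n, r = divrem(length(rows), nfolds).
# `cv_test_folds` and `cv_pairs` are hypothetical names, not part of MLJ.
function cv_test_folds(rows::AbstractVector{<:Integer}, nfolds::Integer)
    n, r = divrem(length(rows), nfolds)
    n > 0 || error("Need at least as many rows as folds.")
    folds = Vector{Vector{Int}}()
    start = 1
    for k in 1:nfolds
        len = k <= r ? n + 1 : n          # first r folds get one extra row
        push!(folds, collect(rows[start:start + len - 1]))
        start += len
    end
    return folds
end

# Each train vector is the complement of its test vector within `rows`.
cv_pairs(rows, nfolds) =
    [(setdiff(rows, test), test) for test in cv_test_folds(rows, nfolds)]

tests = cv_test_folds(1:10, 3)
println(length.(tests))                        # [4, 3, 3]
println(reduce(vcat, tests) == collect(1:10))  # true
```

With 10 rows and 3 folds, divrem(10, 3) gives n = 3 and r = 1, so only the first test fold receives the extra row.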
MLJBase.StratifiedCV — Type
stratified_cv = StratifiedCV(; nfolds=6, shuffle=false, rng=Random.GLOBAL_RNG)
Stratified cross-validation resampling strategy, for use in evaluate!, evaluate and in tuning. Applies only to classification problems (OrderedFactor or Multiclass targets).
train_test_pairs(stratified_cv, rows, y)
Returns an nfolds-length iterator of (train, test) pairs of vectors (row indices) where each train and test is a sub-vector of rows. The test vectors are mutually exclusive and exhaust rows. Each train vector is the complement of the corresponding test vector.
Unlike regular cross-validation, the distribution of the levels of the target y corresponding to each train and test is constrained, as far as possible, to replicate that of y[rows] as a whole.
The stratified train_test_pairs algorithm is invariant to label renaming. For example, if you run replace!(y, 'a' => 'b', 'b' => 'a') and then re-run train_test_pairs, the returned (train, test) pairs will be the same.
Pre-shuffling of rows is controlled by rng and shuffle. If rng is an integer, then the StratifiedCV keyword constructor resets it to MersenneTwister(rng). Otherwise some AbstractRNG object is expected.
If rng is left unspecified, rng is reset to Random.GLOBAL_RNG, in which case rows are only pre-shuffled if shuffle=true is explicitly specified.
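To see the stratification constraint in action, one can count class labels in each test fold. This is a small sketch that assumes MLJ is installed and that `MLJ.train_test_pairs` is reachable under that qualified name, as the custom-strategy section of this page suggests; the toy labels are made up for illustration.

```julia
using MLJ

# Toy target: 8 observations of "a" and 4 of "b" (illustrative data only).
labels = repeat(["a", "a", "b"], 4)
y = coerce(labels, Multiclass)

# Stratified pairs need the target as a third argument.
tt = MLJ.train_test_pairs(StratifiedCV(nfolds=2), 1:12, y)

# Report the class counts in each test fold.
for (train, test) in tt
    na = count(==("a"), labels[test])
    nb = count(==("b"), labels[test])
    println("test fold: ", na, " a / ", nb, " b")
end
```

Because the class counts here divide evenly by the number of folds, each test fold should be able to reproduce the overall 2:1 ratio.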
MLJBase.TimeSeriesCV — Type
tscv = TimeSeriesCV(; nfolds=4)
Cross-validation resampling strategy, for use in evaluate!, evaluate and tuning, when observations are chronological and not expected to be independent.
train_test_pairs(tscv, rows)
Returns an nfolds-length iterator of (train, test) pairs of vectors (row indices), where each train and test is a sub-vector of rows. The rows are partitioned sequentially into nfolds + 1 approximately equal length partitions, where the first partition is the first train set, and the second partition is the first test set. The second train set consists of the first two partitions, and the second test set consists of the third partition, and so on for each fold.
The first partition (which is the first train set) has length n + r, where n, r = divrem(length(rows), nfolds + 1), and the remaining partitions (all of the test folds) have length n.
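The partition arithmetic can be reproduced in plain Julia. The sketch below follows the description in the docstring and is not MLJBase's implementation; `ts_pairs` is a hypothetical helper name.

```julia
# Sketch of the expanding-window layout: the first train set has n + r rows
# and every test set has n rows, where n, r = divrem(length(rows), nfolds + 1).
# `ts_pairs` is a hypothetical name, not part of MLJ.
function ts_pairs(rows::AbstractVector{<:Integer}, nfolds::Integer)
    n, r = divrem(length(rows), nfolds + 1)
    n > 0 || error("Need at least nfolds + 1 rows.")
    first_end = n + r                      # last index of the first train set
    return [(rows[1:first_end + (k - 1) * n],
             rows[first_end + (k - 1) * n + 1:first_end + k * n])
            for k in 1:nfolds]
end

for (train, test) in ts_pairs(1:10, 3)
    println(train, " => ", test)
end
# 1:4 => 5:6
# 1:6 => 7:8
# 1:8 => 9:10
```

For 10 rows and nfolds=3, divrem(10, 4) gives n = 2 and r = 2, so the first train set has 4 rows and each test set has 2.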
Examples
julia> MLJBase.train_test_pairs(TimeSeriesCV(nfolds=3), 1:10)
3-element Vector{Tuple{UnitRange{Int64}, UnitRange{Int64}}}:
(1:4, 5:6)
(1:6, 7:8)
(1:8, 9:10)
julia> model = (@load RidgeRegressor pkg=MultivariateStats verbosity=0)();
julia> data = @load_sunspots;
julia> X = (lag1 = data.sunspot_number[2:end-1],
lag2 = data.sunspot_number[1:end-2]);
julia> y = data.sunspot_number[3:end];
julia> tscv = TimeSeriesCV(nfolds=3);
julia> evaluate(model, X, y, resampling=tscv, measure=rmse, verbosity=0)
┌───────────────────────────┬───────────────┬────────────────────┐
│ _.measure │ _.measurement │ _.per_fold │
├───────────────────────────┼───────────────┼────────────────────┤
│ RootMeanSquaredError @753 │ 21.7 │ [25.4, 16.3, 22.4] │
└───────────────────────────┴───────────────┴────────────────────┘
_.per_observation = [missing]
_.fitted_params_per_fold = [ … ]
_.report_per_fold = [ … ]
_.train_test_rows = [ … ]
MLJBase.InSample — Type
in_sample = InSample()
Instantiate an InSample resampling strategy, for use in evaluate!, evaluate and in tuning. In this strategy the train and test sets are the same, and consist of all observations specified by the rows keyword argument. If rows is not specified, all supplied rows are used.
Example
using MLJBase, MLJModels
X, y = make_blobs() # a table and a vector
model = ConstantClassifier()
train, test = partition(eachindex(y), 0.7) # train:test = 70:30
Compute in-sample (training) loss:
evaluate(model, X, y, resampling=InSample(), rows=train, measure=brier_loss)
Compute the out-of-sample loss:
evaluate(model, X, y, resampling=[(train, test),], measure=brier_loss)
Or equivalently:
evaluate(model, X, y, resampling=Holdout(fraction_train=0.7), measure=brier_loss)
Custom resampling strategies
To define a new resampling strategy, make relevant parameters of your strategy the fields of a new type MyResamplingStrategy <: MLJ.ResamplingStrategy, and implement one of the following methods:
MLJ.train_test_pairs(my_strategy::MyResamplingStrategy, rows)
MLJ.train_test_pairs(my_strategy::MyResamplingStrategy, rows, y)
MLJ.train_test_pairs(my_strategy::MyResamplingStrategy, rows, X, y)
Each method takes a vector of indices rows and returns a vector [(t1, e1), (t2, e2), ... (tk, ek)] of train/test pairs of row indices selected from rows. Here X, y are the input and target data (ignored in simple strategies, such as Holdout and CV).
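As a minimal sketch of the first signature, the strategy below holds out the last few rows as a single test set. `LastBlockOut` and its `ntest` field are hypothetical names introduced for illustration; the sketch assumes MLJ is installed and reuses `make_blobs`, `ConstantClassifier` and `brier_loss` from the examples elsewhere on this page.

```julia
using MLJ

# Hypothetical strategy: train on everything except the last `ntest` rows.
struct LastBlockOut <: MLJ.ResamplingStrategy
    ntest::Int
end

# Only the two-argument method is implemented; X and y are not needed here.
function MLJ.train_test_pairs(strategy::LastBlockOut, rows)
    0 < strategy.ntest < length(rows) ||
        error("`ntest` must be positive and smaller than the number of rows.")
    cut = length(rows) - strategy.ntest
    return [(rows[1:cut], rows[cut + 1:end])]
end

# Inspect the generated pair directly.
tt = MLJ.train_test_pairs(LastBlockOut(3), 1:10)
println(tt)

# Use the strategy like any built-in one.
X, y = make_blobs()
e = evaluate(ConstantClassifier(), X, y;
             resampling=LastBlockOut(20), measure=brier_loss, verbosity=0)
println(length(e.train_test_rows))   # one train/test pair
```

Keeping the strategy's parameters as struct fields, as the text recommends, means the same object can be passed to evaluate, evaluate! or a tuning wrapper unchanged.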
Here is the code for the Holdout strategy as an example:
struct Holdout <: ResamplingStrategy
    fraction_train::Float64
    shuffle::Bool
    rng::Union{Int,AbstractRNG}
    function Holdout(fraction_train, shuffle, rng)
        0 < fraction_train < 1 ||
            error("`fraction_train` must be between 0 and 1.")
        return new(fraction_train, shuffle, rng)
    end
end
# Keyword Constructor
function Holdout(; fraction_train::Float64=0.7, shuffle=nothing, rng=nothing)
    if rng isa Integer
        rng = MersenneTwister(rng)
    end
    if shuffle === nothing
        shuffle = ifelse(rng===nothing, false, true)
    end
    if rng === nothing
        rng = Random.GLOBAL_RNG
    end
    return Holdout(fraction_train, shuffle, rng)
end
function train_test_pairs(holdout::Holdout, rows)
    train, test = partition(rows, holdout.fraction_train,
                            shuffle=holdout.shuffle, rng=holdout.rng)
    return [(train, test),]
end
Reference
MLJBase.evaluate! — Function
evaluate!(mach; resampling=CV(), measure=nothing, options...)
Estimate the performance of a machine mach wrapping a supervised model in data, using the specified resampling strategy (defaulting to 6-fold cross-validation) and measure, which can be a single measure or vector. Returns a PerformanceEvaluation object.
Available resampling strategies are CV, Holdout, InSample, StratifiedCV and TimeSeriesCV. If resampling is not an instance of one of these, then a vector of tuples of the form (train_rows, test_rows) is expected. For example, setting
resampling = [(1:100, 101:200),
              (101:200, 1:100)]
gives two-fold cross-validation using the first 200 rows of data.
Any measure conforming to the StatisticalMeasuresBase.jl API can be provided, assuming it can consume multiple observations.
Although evaluate! is mutating, mach.model and mach.args are not mutated.
Additional keyword options
rows - vector of observation indices from which both train and test folds are constructed (default is all observations)
operation/operations=nothing - One of predict, predict_mean, predict_mode, predict_median, or predict_joint, or a vector of these of the same length as measure/measures. Automatically inferred if left unspecified. For example, predict_mode will be used for a Multiclass target, if model is a probabilistic predictor, but measure expects literal (point) target predictions. Operations actually applied can be inspected from the operation field of the object returned.
weights - per-sample Real weights for measures that support them (not to be confused with weights used in training, such as the w in mach = machine(model, X, y, w)).
class_weights - dictionary of Real per-class weights for use with measures that support these, in classification problems (not to be confused with weights used in training, such as the w in mach = machine(model, X, y, w)).
repeats::Int=1: set to a higher value for repeated (Monte Carlo) resampling. For example, if repeats = 10, then resampling = CV(nfolds=5, shuffle=true) generates a total of 50 (train, test) pairs for evaluation and subsequent aggregation.
acceleration=CPU1(): acceleration/parallelization option; can be any instance of CPU1 (single-threaded computation), CPUThreads (multi-threaded computation) or CPUProcesses (multi-process computation); default is default_resource(). These types are owned by ComputationalResources.jl.
force=false: set to true to force cold-restart of each training event
verbosity::Int=1: logging level; can be negative
check_measure=true: whether to screen measures for possible incompatibility with the model. Will not catch all incompatibilities.
per_observation=true: whether to calculate estimates for individual observations; if false the per_observation field of the returned object is populated with missings. Setting to false may reduce compute time and allocations.
logger=default_logger() - a logger object for forwarding results to a machine learning tracking platform; see default_logger for details.
compact=false - if true, the returned evaluation object excludes these fields: fitted_params_per_fold, report_per_fold, train_test_rows.
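The repeats and compact options can be tried out as follows. This is a small sketch assuming MLJ is installed; the data and model are the toy `make_blobs` and `ConstantClassifier` used in other examples on this page, and the seed 123 is arbitrary.

```julia
using MLJ

X, y = make_blobs()
model = ConstantClassifier()

# Repeated (Monte Carlo) resampling: 5 folds x 10 repeats.
e = evaluate(model, X, y;
             resampling=CV(nfolds=5, shuffle=true, rng=123),
             repeats=10, measure=brier_loss, verbosity=0)
println(length(e.train_test_rows))   # number of (train, test) pairs generated
println(length(e.per_fold[1]))       # one per-fold value per pair

# Compact result: the bulky per-fold fields are left out.
ec = evaluate(model, X, y;
              resampling=CV(nfolds=5), measure=brier_loss,
              compact=true, verbosity=0)
println(:train_test_rows in fieldnames(typeof(ec)))
```

Shuffling matters here: without it every repeat would regenerate the same five folds, so repeating would add no information.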
See also evaluate, PerformanceEvaluation, CompactPerformanceEvaluation.
MLJModelInterface.evaluate — Function
Some meta-models may choose to implement the evaluate operations.
MLJBase.PerformanceEvaluation — Type
PerformanceEvaluation <: AbstractPerformanceEvaluation
Type of object returned by evaluate (for models plus data) or evaluate! (for machines). Such objects encode estimates of the performance (generalization error) of a supervised model or outlier detection model, and store other information ancillary to the computation.
If evaluate or evaluate! is called with the compact=true option, then a CompactPerformanceEvaluation object is returned instead.
When evaluate/evaluate! is called, a number of train/test pairs ("folds") of row indices are generated, according to the options provided, which are discussed in the evaluate! doc-string. Rows correspond to observations. The generated train/test pairs are recorded in the train_test_rows field of the PerformanceEvaluation struct, and the corresponding estimates, aggregated over all train/test pairs, are recorded in measurement, a vector with one entry for each measure (metric) recorded in measure.
When displayed, a PerformanceEvaluation object includes a value under the heading 1.96*SE, derived from the standard error of the per_fold entries. This value is suitable for constructing a formal 95% confidence interval for the given measurement. Such intervals should be interpreted with caution. See, for example, Bates et al. (2021).
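The displayed 1.96*SE values on this page are consistent with the formula sketched below, in which the sample standard deviation of the per-fold values is divided by the square root of the number of folds minus one. This formula is inferred from the printed numbers, not taken from the MLJBase source, so treat it as an assumption; `half_width` is a hypothetical helper name.

```julia
using Statistics

# Assumed formula, inferred from the numbers displayed on this page:
# 1.96 * std(per_fold) / sqrt(nfolds - 1). `half_width` is a hypothetical name.
half_width(per_fold) = 1.96 * std(per_fold) / sqrt(length(per_fold) - 1)

# Per-fold l2 values from the single-measure example (rounded for display).
println(round(half_width([0.0491, 0.42, 0.13]), digits=2))   # 0.27

# Per-fold l2 values from the user-specified train/test example.
println(round(half_width([0.385, 0.0843]), digits=3))        # 0.417
```

Because the per-fold values shown are themselves rounded, agreement with the displayed column is only approximate in general.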
Fields
These fields are part of the public API of the PerformanceEvaluation struct.
model: model used to create the performance evaluation. In the case of a tuning model, this is the best model found.
measure: vector of measures (metrics) used to evaluate performance
measurement: vector of measurements - one for each element of measure - aggregating the performance measurements over all train/test pairs (folds). The aggregation method applied for a given measure m is StatisticalMeasuresBase.external_aggregation_mode(m) (commonly Mean() or Sum())
operation (e.g., predict_mode): the operations applied for each measure to generate predictions to be evaluated. Possibilities are: predict, predict_mean, predict_mode, predict_median, or predict_joint.
per_fold: a vector of vectors of individual test fold evaluations (one vector per measure). Useful for obtaining a rough estimate of the variance of the performance estimate.
per_observation: a vector of vectors of vectors containing individual per-observation measurements: for an evaluation e, e.per_observation[m][f][i] is the measurement for the ith observation in the fth test fold, evaluated using the mth measure. Useful for some forms of hyper-parameter optimization. Note that an aggregated measurement for some measure measure is repeated across all observations in a fold if StatisticalMeasures.can_report_unaggregated(measure) == true. If e has been computed with the per_observation=false option, then e.per_observation is a vector of missings.
fitted_params_per_fold: a vector containing fitted_params(mach) for each machine mach trained during resampling - one machine per train/test pair. Use this to extract the learned parameters for each individual training event.
report_per_fold: a vector containing report(mach) for each machine mach trained in resampling - one machine per train/test pair.
train_test_rows: a vector of tuples, each of the form (train, test), where train and test are vectors of row (observation) indices for training and evaluation respectively.
resampling: the user-specified resampling strategy to generate the train/test pairs (or literal train/test pairs if that was directly specified).
repeats: the number of times the resampling strategy was repeated.
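The nesting of these fields is easiest to see by indexing into a returned object. The sketch below assumes MLJ is installed and uses the toy `make_blobs` data and `ConstantClassifier` model from other examples on this page; the variable names are illustrative.

```julia
using MLJ

X, y = make_blobs()   # 100 observations by default
e = evaluate(ConstantClassifier(), X, y;
             resampling=CV(nfolds=3),
             measures=[brier_loss, accuracy], verbosity=0)

println(length(e.measurement))     # one aggregate per measure
println(e.operation)               # operation used for each measure
println(e.per_fold[1])             # Brier loss on each of the 3 test folds

# per_observation is indexed as [measure][fold][observation].
first_fold = e.per_observation[1][1]
println(length(first_fold))        # size of the first test fold
println(first_fold[1])             # Brier loss for its first observation

# One (train, test) tuple and one set of learned parameters per fold.
train, test = e.train_test_rows[2]
println(length(e.fitted_params_per_fold))
```

Indexing by measure first mirrors the order of the measures vector passed to evaluate, so position 1 here always refers to brier_loss.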
See also CompactPerformanceEvaluation.
MLJBase.CompactPerformanceEvaluation — Type
CompactPerformanceEvaluation <: AbstractPerformanceEvaluation
Type of object returned by evaluate (for models plus data) or evaluate! (for machines) when called with the option compact = true. Such objects have the same structure as the PerformanceEvaluation objects returned by default, except that the following fields are omitted to save memory: fitted_params_per_fold, report_per_fold, train_test_rows.
For more on the remaining fields, see PerformanceEvaluation.
MLJBase.default_logger — Function
default_logger()
Return the current value of the default logger for use with supported machine learning tracking platforms, such as MLflow.
The default logger is used in calls to evaluate! and evaluate, and in the constructors TunedModel and IteratedModel, unless the logger keyword is explicitly specified.
When MLJBase is first loaded, the default logger is nothing.
default_logger(logger)
Reset the default logger.
Example
Suppose an MLflow tracking service is running on a local server at http://127.0.0.1:5000. Then in every evaluate call in which logger is not specified, the performance evaluation is automatically logged to the service, as here:
using MLJ
logger = MLJFlow.Logger("http://127.0.0.1:5000/api")
default_logger(logger)
X, y = make_moons()
model = ConstantClassifier()
evaluate(model, X, y, measures=[log_loss, accuracy])