# `obs` and Data Interfaces

The `obs` method takes data intended as input to `fit`, `predict` or `transform`, and transforms it to a learner-specific form guaranteed to implement a form of observation access designated by the learner. The transformed data can then be passed on to the relevant method in place of the original input (after first resampling it, if the learner supports this). Using `obs` may provide performance advantages over naive workflows in some cases (e.g., cross-validation).

```julia
obs(learner, data)  # can be passed to `fit` instead of `data`
obs(model, data)    # can be passed to `predict` or `transform` instead of `data`
```
## Typical workflows

LearnAPI.jl makes no universal assumptions about the form of `data` in a call like `fit(learner, data)`. However, if we define

```julia
observations = obs(learner, data)
```

then, assuming the typical case that `LearnAPI.data_interface(learner) == LearnAPI.RandomAccess()`, `observations` implements the MLUtils.jl `getobs`/`numobs` interface, for grabbing and counting observations. Moreover, we can pass `observations` to `fit` in place of the original data, or first resample it using `MLUtils.getobs`:
```julia
# equivalent to `model = fit(learner, data)`:
model = fit(learner, observations)

# with resampling:
resampled_observations = MLUtils.getobs(observations, 1:10)
model = fit(learner, resampled_observations)
```
In some implementations, the alternative pattern above can be used to avoid repeating unnecessary internal data preprocessing, or inefficient resampling. For example, here's how a user might call `obs` and `MLUtils.getobs` to perform efficient cross-validation:
```julia
using LearnAPI
import MLUtils

learner = <some supervised learner>
data = <some data that `fit` can consume, with 30 observations>

train_test_folds = map([1:10, 11:20, 21:30]) do test
    (setdiff(1:30, test), test)
end

fitobs = obs(learner, data)
never_trained = true

scores = map(train_test_folds) do (train, test)

    # train using model-specific representation of data:
    fitobs_subset = MLUtils.getobs(fitobs, train)
    model = fit(learner, fitobs_subset)

    # predict on the fold complement:
    if never_trained
        X = LearnAPI.features(learner, data)
        global predictobs = obs(model, X)
        global never_trained = false
    end
    predictobs_subset = MLUtils.getobs(predictobs, test)
    ŷ = predict(model, Point(), predictobs_subset)

    y = LearnAPI.target(learner, data)
    return <score comparing ŷ with y[test]>

end
```
## Implementation guide

| method               | comment                             | compulsory?   | fallback       |
|:---------------------|:------------------------------------|:--------------|:---------------|
| `obs(learner, data)` | here `data` is `fit`-consumable     | not typically | returns `data` |
| `obs(model, data)`   | here `data` is `predict`-consumable | not typically | returns `data` |

A sample implementation is given in Providing a separate data front end.
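For orientation only, here is a minimal hedged sketch of such a front end. The learner type `Ridge` and the wrapper `RidgeFitObs` are hypothetical names chosen for illustration, and the `fit`-consumable data is assumed to be a `(table, target_vector)` tuple:

```julia
using LearnAPI
import Tables

struct Ridge                 # hypothetical learner
    lambda::Float64
end

struct RidgeFitObs{T,U}      # hypothetical learner-specific data representation
    A::Matrix{T}             # feature matrix, observations as columns
    y::Vector{U}             # target vector
end

# convert user-supplied data to the learner-specific form:
function LearnAPI.obs(::Ridge, data)
    X, y = data
    A = Matrix(transpose(Tables.matrix(X)))   # put observations in columns
    return RidgeFitObs(A, y)
end

# pass already-converted data straight through, as required by the
# involutivity condition in the `obs` document string below:
LearnAPI.obs(::Ridge, observations::RidgeFitObs) = observations
```

In a complete implementation, `fit(::Ridge, ...)` would accept both forms of data, and `RidgeFitObs` would need to support the data interface declared by `LearnAPI.data_interface(::Ridge)` (see Data interfaces below).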
## Reference

### `LearnAPI.obs` — Function

```julia
obs(learner, data)
obs(model, data)
```

Return a learner-specific representation of `data`, suitable for passing to `fit` (first signature) or to `predict` and `transform` (second signature), in place of `data`. Here `model` is the return value of `fit(learner, ...)` for some LearnAPI.jl learner, `learner`.

The returned object is guaranteed to implement observation access as indicated by `LearnAPI.data_interface(learner)`, typically `LearnAPI.RandomAccess()`.

Calling `fit`/`predict`/`transform` on the returned objects may have performance advantages over calling directly on `data` in some contexts.
#### Example

Usual workflow, using data-specific resampling methods:

```julia
data = (X, y) # a DataFrame and a vector
data_train = (Tables.subset(X, 1:100), y[1:100])
model = fit(learner, data_train)
ŷ = predict(model, Point(), X[101:150, :])
```

Alternative workflow using `obs` and the MLUtils.jl method `getobs` to carry out subsampling (assumes `LearnAPI.data_interface(learner) == RandomAccess()`):

```julia
import MLUtils
fit_observations = obs(learner, data)
model = fit(learner, MLUtils.getobs(fit_observations, 1:100))
predict_observations = obs(model, X)
ẑ = predict(model, Point(), MLUtils.getobs(predict_observations, 101:150))
@assert ẑ == ŷ
```
See also `LearnAPI.data_interface`.
#### Extended help

##### New implementations

Implementation is typically optional.

For each supported form of `data` in `fit(learner, data)`, it must be true that `model = fit(learner, observations)` is equivalent to `model = fit(learner, data)`, whenever `observations = obs(learner, data)`. For each supported form of `data` in calls `predict(model, ..., data)` and `transform(model, data)`, where implemented, the calls `predict(model, ..., observations)` and `transform(model, observations)` must be supported alternatives with the same output, whenever `observations = obs(model, data)`.
If `LearnAPI.data_interface(learner) == RandomAccess()` (the default), then `fit`, `predict` and `transform` must additionally accept `obs` output that has been subsampled using `MLUtils.getobs`, with the obvious interpretation applying to the outcomes of such calls (e.g., if all observations are subsampled, then outcomes should be the same as if using the original data).
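To spell out that last parenthetical as a hedged sketch, with `learner`, `data` and `Xnew` standing in for an actual learner, compatible training data, and compatible prediction input:

```julia
using LearnAPI
import MLUtils

observations = obs(learner, data)
n = MLUtils.numobs(observations)

# fitting on a subsample containing every observation ...
model1 = fit(learner, MLUtils.getobs(observations, 1:n))

# ... should be equivalent to fitting on the original data:
model2 = fit(learner, data)

# for a deterministic learner supporting `Point()` predictions:
@assert predict(model1, Point(), Xnew) == predict(model2, Point(), Xnew)
```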
It is required that `obs(learner, _)` and `obs(model, _)` are involutive, meaning both the following hold:

```julia
obs(learner, obs(learner, data)) == obs(learner, data)
obs(model, obs(model, data)) == obs(model, data)
```

If one overloads `obs`, one typically needs additional overloadings to guarantee involutivity.
The fallback for `obs` is `obs(model_or_learner, data) = data`, and the fallback for `LearnAPI.data_interface(learner)` is `LearnAPI.RandomAccess()`. For details refer to the `LearnAPI.data_interface` document string.
In particular, if the `data` to be consumed by `fit`, `predict` or `transform` consists only of suitable tables and arrays, then `obs` and `LearnAPI.data_interface` do not need to be overloaded. However, the user will get no performance benefits by using `obs` in that case.
##### Sample implementation

Refer to the "Anatomy of an Implementation" section of the LearnAPI.jl manual.
## Data interfaces

New implementations must overload `LearnAPI.data_interface(learner)` if the output of `obs` does not implement `LearnAPI.RandomAccess()`. Arrays, most tables, and all tuples thereof, implement `RandomAccess()`.
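For example, a hypothetical learner whose `obs` output can only be iterated, but whose length is known, might declare:

```julia
using LearnAPI

struct MyStreamingLearner end   # hypothetical learner

LearnAPI.data_interface(::MyStreamingLearner) = LearnAPI.FiniteIterable()
```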
### `LearnAPI.RandomAccess` — Type

```julia
LearnAPI.RandomAccess
```

A data interface type. We say that `data` implements the `RandomAccess` interface if `data` implements the methods `getobs` and `numobs` from MLUtils.jl. The first method allows one to grab observations specified by an arbitrary index set, as in `MLUtils.getobs(data, [2, 3, 5])`, while the second method returns the total number of available observations, which is assumed to be known and finite.
All arrays implement `RandomAccess`, with the last index being the observation index (observations-as-columns in matrices).

A Tables.jl compatible table `data` implements `RandomAccess` if `Tables.istable(data)` is true and if `data` implements `DataAPI.nrow`. This includes many tables, and in particular, `DataFrame`s. Tables that are also tuples are explicitly excluded.

Any tuple of objects implementing `RandomAccess` also implements `RandomAccess`.
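To illustrate the array and tuple cases with MLUtils.jl directly (the data below is arbitrary):

```julia
import MLUtils

# a matrix: observations are the columns (the last index):
X = rand(3, 10)
MLUtils.numobs(X)              # 10
MLUtils.getobs(X, [2, 3, 5])   # 3×3 matrix made of columns 2, 3 and 5

# a tuple of `RandomAccess` objects is subsampled "in parallel":
y = rand(10)
MLUtils.numobs((X, y))         # 10
MLUtils.getobs((X, y), 1:3)    # (3×3 matrix, 3-element vector)
```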
If `LearnAPI.data_interface(learner)` takes the value `RandomAccess()`, then `obs(learner, ...)` is guaranteed to return objects implementing the `RandomAccess` interface, and the same holds for `obs(model, ...)`, whenever `LearnAPI.learner(model) == learner`.
#### Implementing `RandomAccess` for new data types

Typically, implementing `RandomAccess` for a new data type requires only implementing `Base.getindex` and `Base.length`, which are the fallbacks for `MLUtils.getobs` and `MLUtils.numobs`, and this avoids making MLUtils.jl a package dependency.
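As a hedged sketch, suppose observations live in a custom container (the `Corpus` type below is hypothetical); the two `Base` methods are then enough for MLUtils.jl subsampling to work via its fallbacks:

```julia
import MLUtils

# hypothetical container of text documents, for illustration only:
struct Corpus
    documents::Vector{String}
end

Base.getindex(corpus::Corpus, I::AbstractVector) = Corpus(corpus.documents[I])
Base.getindex(corpus::Corpus, i::Integer) = corpus.documents[i]
Base.length(corpus::Corpus) = length(corpus.documents)

corpus = Corpus(["a cat", "a dog", "a mouse", "a horse"])
MLUtils.numobs(corpus)           # 4, via the `Base.length` fallback
MLUtils.getobs(corpus, [1, 3])   # Corpus(["a cat", "a mouse"]), via `Base.getindex`
```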
See also `LearnAPI.FiniteIterable`, `LearnAPI.Iterable`.
### `LearnAPI.FiniteIterable` — Type

```julia
LearnAPI.FiniteIterable
```

A data interface type. We say that `data` implements the `FiniteIterable` interface if it implements Julia's `iterate` interface, including `Base.length`, and if `Base.IteratorSize(typeof(data)) == Base.HasLength()`. For example, this is true if:

- `data` implements the `LearnAPI.RandomAccess` interface (arrays and most tables); or
- `data isa MLUtils.DataLoader`, which includes output from `MLUtils.eachobs`.
If `LearnAPI.data_interface(learner)` takes the value `FiniteIterable()`, then `obs(learner, ...)` is guaranteed to return objects implementing the `FiniteIterable` interface, and the same holds for `obs(model, ...)`, whenever `LearnAPI.learner(model) == learner`.

See also `LearnAPI.RandomAccess`, `LearnAPI.Iterable`.
### `LearnAPI.Iterable` — Type

```julia
LearnAPI.Iterable
```

A data interface type. We say that `data` implements the `Iterable` interface if it implements Julia's basic `iterate` interface. (Such objects may not implement `MLUtils.numobs` or `Base.length`.)
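A `Channel` is one example of an object that can be iterated but has no predetermined length:

```julia
# an iterable whose `Base.IteratorSize` is `SizeUnknown()`:
observation_stream = Channel() do channel
    for i in 1:3
        put!(channel, (x = rand(2), y = rand()))
    end
end

for observation in observation_stream
    # consume one observation at a time
end
```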
If `LearnAPI.data_interface(learner)` takes the value `Iterable()`, then `obs(learner, ...)` is guaranteed to return objects implementing `Iterable`, and the same holds for `obs(model, ...)`, whenever `LearnAPI.learner(model) == learner`.

See also `LearnAPI.FiniteIterable`, `LearnAPI.RandomAccess`.