`obs` and Data Interfaces

The `obs` method takes data intended as input to `fit`, `predict` or `transform`, and transforms it to a learner-specific form guaranteed to implement a form of observation access designated by the learner. The transformed data can then be passed on to the relevant method in place of the original input (after first resampling it, if the learner supports this). Using `obs` may provide performance advantages over naive workflows in some cases (e.g., cross-validation).

```julia
obs(learner, data)  # can be passed to `fit` instead of `data`
obs(model, data)    # can be passed to `predict` or `transform` instead of `data`
```
Typical workflows
LearnAPI.jl makes no universal assumptions about the form of `data` in a call like `fit(learner, data)`. However, if we define

```julia
observations = obs(learner, data)
```

then, assuming the typical case that `LearnAPI.data_interface(learner) == LearnAPI.RandomAccess()`, `observations` implements the MLCore.jl `getobs`/`numobs` interface, for grabbing and counting observations. Moreover, we can pass `observations` to `fit` in place of the original data, or first resample it using `MLCore.getobs`:
```julia
# equivalent to `model = fit(learner, data)`:
model = fit(learner, observations)

# with resampling:
resampled_observations = MLCore.getobs(observations, 1:10)
model = fit(learner, resampled_observations)
```
In some implementations, this alternative pattern can be used to avoid unnecessary repetition of internal data preprocessing, or inefficient resampling. For example, here's how a user might call `obs` and `MLCore.getobs` to perform efficient cross-validation:
```julia
using LearnAPI
import MLCore

learner = <some supervised learner>
data = <some data that `fit` can consume, with 30 observations>

train_test_folds = map([1:10, 11:20, 21:30]) do test
    (setdiff(1:30, test), test)
end

fitobs = obs(learner, data)
never_trained = true

scores = map(train_test_folds) do (train, test)

    # train using model-specific representation of data:
    fitobs_subset = MLCore.getobs(fitobs, train)
    model = fit(learner, fitobs_subset)

    # predict on the fold complement:
    if never_trained
        X = LearnAPI.features(learner, data)
        global predictobs = obs(model, X)
        global never_trained = false
    end
    predictobs_subset = MLCore.getobs(predictobs, test)
    ŷ = predict(model, Point(), predictobs_subset)

    y = LearnAPI.target(learner, data)
    return <score comparing ŷ with y[test]>
end
```
Implementation guide
| method | comment | compulsory? | fallback |
|---|---|---|---|
| `obs(learner, data)` | here `data` is `fit`-consumable | not typically | returns `data` |
| `obs(model, data)` | here `data` is `predict`-consumable | not typically | returns `data` |
A sample implementation is given in the section "Providing a separate data front end".
Reference
LearnAPI.obs — Function

```julia
obs(learner, data)
obs(model, data)
```

Return a learner-specific representation of `data`, suitable for passing to `fit`, `update`, `update_observations`, or `update_features` (first signature), or to `predict` and `transform` (second signature), in place of `data`. Here `model` is the return value of `fit(learner, ...)` for some LearnAPI.jl learner, `learner`.

The returned object is guaranteed to implement observation access as indicated by `LearnAPI.data_interface(learner)`, typically `LearnAPI.RandomAccess()`.

Calling `fit`/`predict`/`transform` on the returned objects may have performance advantages over calling directly on `data` in some contexts.
Example
Usual workflow, using data-specific resampling methods:
```julia
data = (X, y) # a DataFrame and a vector
data_train = (Tables.subset(X, 1:100), y[1:100])
model = fit(learner, data_train)
ŷ = predict(model, Point(), Tables.subset(X, 101:150))
```
Alternative workflow, using `obs` and the MLCore.jl method `getobs` to carry out subsampling (assumes `LearnAPI.data_interface(learner) == RandomAccess()`):

```julia
import MLCore
fit_observations = obs(learner, data)
model = fit(learner, MLCore.getobs(fit_observations, 1:100))
predict_observations = obs(model, X)
ẑ = predict(model, Point(), MLCore.getobs(predict_observations, 101:150))
@assert ẑ == ŷ
```
See also `LearnAPI.data_interface`.
Extended help
New implementations
Implementation is typically optional.
For each supported form of `data` in `fit(learner, data)`, it must be true that `model = fit(learner, observations)` is equivalent to `model = fit(learner, data)`, whenever `observations = obs(learner, data)`. For each supported form of `data` in calls `predict(model, ..., data)` and `transform(model, data)`, where implemented, the calls `predict(model, ..., observations)` and `transform(model, observations)` must be supported alternatives with the same output, whenever `observations = obs(model, data)`.
If `LearnAPI.data_interface(learner) == RandomAccess()` (the default), then `fit`, `predict` and `transform` must additionally accept `obs` output that has been subsampled using `MLCore.getobs`, with the obvious interpretation applying to the outcomes of such calls (e.g., if all observations are subsampled, then outcomes should be the same as if using the original data).
It is required that `obs(learner, _)` and `obs(model, _)` are involutive, meaning both of the following hold:

```julia
obs(learner, obs(learner, data)) == obs(learner, data)
obs(model, obs(model, data)) == obs(model, data)
```

If one overloads `obs`, one typically needs additional overloadings to guarantee involutivity.
The fallback for `obs` is `obs(model_or_learner, data) = data`, and the fallback for `LearnAPI.data_interface(learner)` is `LearnAPI.RandomAccess()`. For details, refer to the `LearnAPI.data_interface` document string.
In particular, if the `data` to be consumed by `fit`, `predict` or `transform` consists only of suitable tables and arrays, then `obs` and `LearnAPI.data_interface` do not need to be overloaded. However, the user will get no performance benefits by using `obs` in that case.
Sample implementation
Refer to the "Anatomy of an Implementation" section of the LearnAPI.jl manual.
Available data interfaces

LearnAPI.DataInterface — Type

Abstract supertype for singleton types designating an interface for accessing observations within a LearnAPI.jl data object.

New learner implementations must overload `LearnAPI.data_interface(learner)` to return one of the instances below if the output of `obs` does not implement the default `LearnAPI.RandomAccess()` interface. Arrays, most tables, and all tuples thereof, implement `RandomAccess()`.

Available instances:

- `LearnAPI.RandomAccess` (default)
- `LearnAPI.FiniteIterable`
- `LearnAPI.Iterable`
LearnAPI.RandomAccess — Type

A data interface type. We say that `data` implements the `RandomAccess` interface if `data` implements the methods `getobs` and `numobs` from MLCore.jl. The first method allows one to grab observations specified by an arbitrary index set, as in `MLCore.getobs(data, [2, 3, 5])`, while the second method returns the total number of available observations, which is assumed to be known and finite.
All arrays implement `RandomAccess`, with the last index being the observation index (observations-as-columns in matrices).
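For instance (an illustration not in the original text), for a matrix the observation index is the second (and last) index, so subsampling agrees with column slicing:

```julia
# Observations-as-columns convention for matrices: the last index is the
# observation index, so `MLCore.getobs(X, inds)` agrees with `X[:, inds]`.
# Shown here with plain indexing only, so MLCore is not required:

X = [1 2 3; 4 5 6]              # 2 features, 3 observations (columns)

n_observations = size(X)[end]   # what `MLCore.numobs(X)` computes
second_observation = X[:, 2]    # what `MLCore.getobs(X, 2)` extracts

@assert n_observations == 3
@assert second_observation == [2, 5]
```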
A Tables.jl compatible table `data` implements `RandomAccess` if `Tables.istable(data)` is true and if `data` implements `DataAPI.nrow`. This includes many tables, and in particular, `DataFrame`s. Tables that are also tuples are explicitly excluded.

Any tuple of objects implementing `RandomAccess` also implements `RandomAccess`.
If `LearnAPI.data_interface(learner)` takes the value `RandomAccess()`, then `obs(learner, ...)` is guaranteed to return objects implementing the `RandomAccess` interface, and the same holds for `obs(model, ...)`, whenever `LearnAPI.learner(model) == learner`.
Implementing `RandomAccess` for new data types

Typically, implementing `RandomAccess` for a new data type requires only implementing `Base.getindex` and `Base.length`, which are the fallbacks for `MLCore.getobs` and `MLCore.numobs`; this avoids making MLCore.jl a package dependency.
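As a sketch of this route (the `Corpus` type and its contents are hypothetical, invented for illustration), one might support `RandomAccess` for a custom container like so:

```julia
# Hypothetical sketch: a custom data container supporting `RandomAccess`
# by implementing only `Base.getindex` and `Base.length`, the MLCore.jl
# fallbacks for `getobs` and `numobs`. No MLCore dependency is needed.

struct Corpus                      # hypothetical data type
    documents::Vector{String}
end

Base.getindex(c::Corpus, i::Integer) = c.documents[i]                 # one observation
Base.getindex(c::Corpus, I::AbstractVector) = Corpus(c.documents[I])  # an index set
Base.length(c::Corpus) = length(c.documents)                          # observation count

c = Corpus(["the cat", "sat on", "the mat"])
@assert length(c) == 3
@assert c[2] == "sat on"
@assert c[[1, 3]].documents == ["the cat", "the mat"]
```

Note that indexing by an index set returns another `Corpus`, so subsampled output remains a valid data object of the same kind.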
See also `LearnAPI.FiniteIterable`, `LearnAPI.Iterable`.
LearnAPI.FiniteIterable — Type

A data interface type. We say that `data` implements the `FiniteIterable` interface if it implements Julia's `iterate` interface, including `Base.length`, and if `Base.IteratorSize(typeof(data)) == Base.HasLength()`. For example, this is true if `data isa MLUtils.DataLoader`, which includes the output of `MLUtils.eachobs`.
If `LearnAPI.data_interface(learner)` takes the value `FiniteIterable()`, then `obs(learner, ...)` is guaranteed to return objects implementing the `FiniteIterable` interface, and the same holds for `obs(model, ...)`, whenever `LearnAPI.learner(model) == learner`.

See also `LearnAPI.RandomAccess`, `LearnAPI.Iterable`.
LearnAPI.Iterable — Type

A data interface type. We say that `data` implements the `Iterable` interface if it implements Julia's basic `iterate` interface. (Such objects may not implement `MLCore.numobs` or `Base.length`.)

If `LearnAPI.data_interface(learner)` takes the value `Iterable()`, then `obs(learner, ...)` is guaranteed to return objects implementing `Iterable`, and the same holds for `obs(model, ...)`, whenever `LearnAPI.learner(model) == learner`.

See also `LearnAPI.FiniteIterable`, `LearnAPI.RandomAccess`.