# `obs` and Data Interfaces

The `obs` method takes data intended as input to `fit`, `predict` or `transform`, and transforms it to a learner-specific form guaranteed to implement a form of observation access designated by the learner. The transformed data can then be passed on to the relevant method in place of the original input (after first resampling it, if the learner supports this). Using `obs` may provide performance advantages over naive workflows in some cases (e.g., cross-validation).

```julia
obs(learner, data)  # can be passed to `fit` instead of `data`
obs(model, data)    # can be passed to `predict` or `transform` instead of `data`
```
## Typical workflows

LearnAPI.jl makes no universal assumptions about the form of `data` in a call like `fit(learner, data)`. However, if we define

```julia
observations = obs(learner, data)
```

then, assuming the typical case that `LearnAPI.data_interface(learner) == LearnAPI.RandomAccess()`, `observations` implements the MLUtils.jl `getobs`/`numobs` interface, for grabbing and counting observations. Moreover, we can pass `observations` to `fit` in place of the original data, or first resample it using `MLUtils.getobs`:
```julia
# equivalent to `model = fit(learner, data)`:
model = fit(learner, observations)

# with resampling:
resampled_observations = MLUtils.getobs(observations, 1:10)
model = fit(learner, resampled_observations)
```
In some implementations, the alternative pattern above can be used to avoid repeating unnecessary internal data preprocessing, or inefficient resampling. For example, here's how a user might call `obs` and `MLUtils.getobs` to perform efficient cross-validation:
```julia
using LearnAPI
import MLUtils

learner = <some supervised learner>
data = <some data that `fit` can consume, with 30 observations>

train_test_folds = map([1:10, 11:20, 21:30]) do test
    (setdiff(1:30, test), test)
end

fitobs = obs(learner, data)
never_trained = true

scores = map(train_test_folds) do (train, test)

    # train using model-specific representation of data:
    fitobs_subset = MLUtils.getobs(fitobs, train)
    model = fit(learner, fitobs_subset)

    # predict on the fold complement:
    if never_trained
        X = LearnAPI.features(learner, data)
        global predictobs = obs(model, X)
        global never_trained = false
    end
    predictobs_subset = MLUtils.getobs(predictobs, test)
    ŷ = predict(model, Point(), predictobs_subset)

    y = LearnAPI.target(learner, data)
    return <score comparing ŷ with y[test]>

end
```
## Implementation guide

| method               | comment                             | compulsory?   | fallback       |
|:---------------------|:------------------------------------|:--------------|:---------------|
| `obs(learner, data)` | here `data` is `fit`-consumable     | not typically | returns `data` |
| `obs(model, data)`   | here `data` is `predict`-consumable | not typically | returns `data` |

A sample implementation is given in Providing a separate data front end.
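For orientation only, here is a minimal hedged sketch of such a front end. The learner type `Ridge` and the wrapper `RidgeFitObs` are hypothetical names chosen for illustration, and the `fit`-consumable data is assumed to be a `(table, target_vector)` tuple:

```julia
using LearnAPI
import Tables

struct Ridge                 # hypothetical learner
    lambda::Float64
end

struct RidgeFitObs{T,U}      # hypothetical learner-specific data representation
    A::Matrix{T}             # feature matrix, observations as columns
    y::Vector{U}             # target vector
end

# convert user-supplied data to the learner-specific form:
function LearnAPI.obs(::Ridge, data)
    X, y = data
    A = Matrix(transpose(Tables.matrix(X)))   # put observations in columns
    return RidgeFitObs(A, y)
end

# pass already-converted data straight through, as required by the
# involutivity condition in the `obs` document string below:
LearnAPI.obs(::Ridge, observations::RidgeFitObs) = observations
```

In a complete implementation, `fit(::Ridge, ...)` would accept both forms of data, and `RidgeFitObs` would need to support the data interface declared by `LearnAPI.data_interface(::Ridge)` (see Data interfaces below).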
## Reference

### `LearnAPI.obs` — Function

```julia
obs(learner, data)
obs(model, data)
```

Return a learner-specific representation of `data`, suitable for passing to `fit` (first signature) or to `predict` and `transform` (second signature), in place of `data`. Here `model` is the return value of `fit(learner, ...)` for some LearnAPI.jl learner, `learner`.

The returned object is guaranteed to implement observation access as indicated by `LearnAPI.data_interface(learner)`, typically `LearnAPI.RandomAccess()`.

Calling `fit`/`predict`/`transform` on the returned objects may have performance advantages over calling directly on `data` in some contexts.
#### Example

Usual workflow, using data-specific resampling methods:

```julia
data = (X, y) # a DataFrame and a vector
data_train = (Tables.subset(X, 1:100), y[1:100])
model = fit(learner, data_train)
ŷ = predict(model, Point(), X[101:150, :])
```

Alternative workflow using `obs` and the MLUtils.jl method `getobs` to carry out subsampling (assumes `LearnAPI.data_interface(learner) == RandomAccess()`):

```julia
import MLUtils
fit_observations = obs(learner, data)
model = fit(learner, MLUtils.getobs(fit_observations, 1:100))
predict_observations = obs(model, X)
ẑ = predict(model, Point(), MLUtils.getobs(predict_observations, 101:150))
@assert ẑ == ŷ
```
See also `LearnAPI.data_interface`.
#### Extended help

##### New implementations

Implementation is typically optional.

For each supported form of `data` in `fit(learner, data)`, it must be true that `model = fit(learner, observations)` is equivalent to `model = fit(learner, data)`, whenever `observations = obs(learner, data)`. For each supported form of `data` in calls `predict(model, ..., data)` and `transform(model, data)`, where implemented, the calls `predict(model, ..., observations)` and `transform(model, observations)` must be supported alternatives with the same output, whenever `observations = obs(model, data)`.
If `LearnAPI.data_interface(learner) == RandomAccess()` (the default), then `fit`, `predict` and `transform` must additionally accept `obs` output that has been subsampled using `MLUtils.getobs`, with the obvious interpretation applying to the outcomes of such calls (e.g., if all observations are subsampled, then outcomes should be the same as if using the original data).
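To spell out that last parenthetical as a hedged sketch, with `learner`, `data` and `Xnew` standing in for an actual learner, compatible training data, and compatible prediction input:

```julia
using LearnAPI
import MLUtils

observations = obs(learner, data)
n = MLUtils.numobs(observations)

# fitting on a subsample containing every observation ...
model1 = fit(learner, MLUtils.getobs(observations, 1:n))

# ... should be equivalent to fitting on the original data:
model2 = fit(learner, data)

# for a deterministic learner supporting `Point()` predictions:
@assert predict(model1, Point(), Xnew) == predict(model2, Point(), Xnew)
```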
It is required that `obs(learner, _)` and `obs(model, _)` are involutive, meaning both the following hold:

```julia
obs(learner, obs(learner, data)) == obs(learner, data)
obs(model, obs(model, data)) == obs(model, data)
```

If one overloads `obs`, one typically needs additional overloadings to guarantee involutivity.
The fallback for `obs` is `obs(model_or_learner, data) = data`, and the fallback for `LearnAPI.data_interface(learner)` is `LearnAPI.RandomAccess()`. For details refer to the `LearnAPI.data_interface` document string.
In particular, if the `data` to be consumed by `fit`, `predict` or `transform` consists only of suitable tables and arrays, then `obs` and `LearnAPI.data_interface` do not need to be overloaded. However, the user will get no performance benefits by using `obs` in that case.
##### Sample implementation

Refer to the "Anatomy of an Implementation" section of the LearnAPI.jl manual.
## Data interfaces

New implementations must overload `LearnAPI.data_interface(learner)` if the output of `obs` does not implement `LearnAPI.RandomAccess()`. Arrays, most tables, and all tuples thereof, implement `RandomAccess()`.
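For example, a hypothetical learner whose `obs` output can only be iterated, but whose length is known, might declare:

```julia
using LearnAPI

struct MyStreamingLearner end   # hypothetical learner

LearnAPI.data_interface(::MyStreamingLearner) = LearnAPI.FiniteIterable()
```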
### `LearnAPI.RandomAccess` — Type

```julia
LearnAPI.RandomAccess
```

A data interface type. We say that `data` implements the `RandomAccess` interface if `data` implements the methods `getobs` and `numobs` from MLUtils.jl. The first method allows one to grab observations specified by an arbitrary index set, as in `MLUtils.getobs(data, [2, 3, 5])`, while the second method returns the total number of available observations, which is assumed to be known and finite.
All arrays implement `RandomAccess`, with the last index being the observation index (observations-as-columns in matrices).

A Tables.jl compatible table `data` implements `RandomAccess` if `Tables.istable(data)` is true and if `data` implements `DataAPI.nrow`. This includes many tables, and in particular, `DataFrame`s. Tables that are also tuples are explicitly excluded.

Any tuple of objects implementing `RandomAccess` also implements `RandomAccess`.
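To illustrate the array and tuple cases with MLUtils.jl directly (the data below is arbitrary):

```julia
import MLUtils

# a matrix: observations are the columns (the last index):
X = rand(3, 10)
MLUtils.numobs(X)              # 10
MLUtils.getobs(X, [2, 3, 5])   # 3×3 matrix made of columns 2, 3 and 5

# a tuple of `RandomAccess` objects is subsampled "in parallel":
y = rand(10)
MLUtils.numobs((X, y))         # 10
MLUtils.getobs((X, y), 1:3)    # (3×3 matrix, 3-element vector)
```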
If `LearnAPI.data_interface(learner)` takes the value `RandomAccess()`, then `obs(learner, ...)` is guaranteed to return objects implementing the `RandomAccess` interface, and the same holds for `obs(model, ...)`, whenever `LearnAPI.learner(model) == learner`.
#### Implementing `RandomAccess` for new data types

Typically, implementing `RandomAccess` for a new data type requires only implementing `Base.getindex` and `Base.length`, which are the fallbacks for `MLUtils.getobs` and `MLUtils.numobs`, and this avoids making MLUtils.jl a package dependency.
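As a hedged sketch, suppose observations live in a custom container (the `Corpus` type below is hypothetical); the two `Base` methods are then enough for MLUtils.jl subsampling to work via its fallbacks:

```julia
import MLUtils

# hypothetical container of text documents, for illustration only:
struct Corpus
    documents::Vector{String}
end

Base.getindex(corpus::Corpus, I::AbstractVector) = Corpus(corpus.documents[I])
Base.getindex(corpus::Corpus, i::Integer) = corpus.documents[i]
Base.length(corpus::Corpus) = length(corpus.documents)

corpus = Corpus(["a cat", "a dog", "a mouse", "a horse"])
MLUtils.numobs(corpus)           # 4, via the `Base.length` fallback
MLUtils.getobs(corpus, [1, 3])   # Corpus(["a cat", "a mouse"]), via `Base.getindex`
```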
See also `LearnAPI.FiniteIterable`, `LearnAPI.Iterable`.
### `LearnAPI.FiniteIterable` — Type

```julia
LearnAPI.FiniteIterable
```

A data interface type. We say that `data` implements the `FiniteIterable` interface if it implements Julia's `iterate` interface, including `Base.length`, and if `Base.IteratorSize(typeof(data)) == Base.HasLength()`. For example, this is true if:

- `data` implements the `LearnAPI.RandomAccess` interface (arrays and most tables); or
- `data isa MLUtils.DataLoader`, which includes output from `MLUtils.eachobs`.
If `LearnAPI.data_interface(learner)` takes the value `FiniteIterable()`, then `obs(learner, ...)` is guaranteed to return objects implementing the `FiniteIterable` interface, and the same holds for `obs(model, ...)`, whenever `LearnAPI.learner(model) == learner`.

See also `LearnAPI.RandomAccess`, `LearnAPI.Iterable`.
### `LearnAPI.Iterable` — Type

```julia
LearnAPI.Iterable
```

A data interface type. We say that `data` implements the `Iterable` interface if it implements Julia's basic `iterate` interface. (Such objects may not implement `MLUtils.numobs` or `Base.length`.)
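A `Channel` is one example of an object that can be iterated but has no predetermined length:

```julia
# an iterable whose `Base.IteratorSize` is `SizeUnknown()`:
observation_stream = Channel() do channel
    for i in 1:3
        put!(channel, (x = rand(2), y = rand()))
    end
end

for observation in observation_stream
    # consume one observation at a time
end
```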
If `LearnAPI.data_interface(learner)` takes the value `Iterable()`, then `obs(learner, ...)` is guaranteed to return objects implementing `Iterable`, and the same holds for `obs(model, ...)`, whenever `LearnAPI.learner(model) == learner`.

See also `LearnAPI.FiniteIterable`, `LearnAPI.RandomAccess`.