obs and Data Interfaces

The obs method takes data intended as input to fit, predict or transform and converts it to a learner-specific form that is guaranteed to implement the kind of observation access designated by the learner. The converted data can then be passed to the relevant method in place of the original input (after first resampling it, if the learner supports this). Using obs may provide performance advantages over naive workflows in some cases (e.g., cross-validation).

obs(learner, data) # can be passed to `fit` instead of `data`
obs(model, data)   # can be passed to `predict` or `transform` instead of `data`

Typical workflows

LearnAPI.jl makes no universal assumptions about the form of data in a call like fit(learner, data). However, if we define

observations = obs(learner, data)

then, assuming the typical case that LearnAPI.data_interface(learner) == LearnAPI.RandomAccess(), observations implements the MLUtils.jl getobs/numobs interface, for grabbing and counting observations. Moreover, we can pass observations to fit in place of the original data, or first resample it using MLUtils.getobs:

# equivalent to `model = fit(learner, data)`
model = fit(learner, observations)

# with resampling:
resampled_observations = MLUtils.getobs(observations, 1:10)
model = fit(learner, resampled_observations)

In some implementations, the alternative pattern above can be used to avoid repeating unnecessary internal data preprocessing, or inefficient resampling. For example, here's how a user might call obs and MLUtils.getobs to perform efficient cross-validation:

using LearnAPI
import MLUtils

learner = <some supervised learner>

data = <some data that `fit` can consume, with 30 observations>

train_test_folds = map([1:10, 11:20, 21:30]) do test
    (setdiff(1:30, test), test)
end

fitobs = obs(learner, data)
never_trained = true

scores = map(train_test_folds) do (train, test)

    # train using model-specific representation of data:
    fitobs_subset = MLUtils.getobs(fitobs, train)
    model = fit(learner, fitobs_subset)

    # predict on the fold complement:
    if never_trained
        X = LearnAPI.features(learner, data)
        global predictobs = obs(model, X)
        global never_trained = false
    end
    predictobs_subset = MLUtils.getobs(predictobs, test)
    ŷ = predict(model, Point(), predictobs_subset)

    y = LearnAPI.target(learner, data)
    return <score comparing ŷ with y[test]>

end

Implementation guide

method             | comment                          | compulsory?   | fallback
-------------------|----------------------------------|---------------|-------------
obs(learner, data) | here data is fit-consumable      | not typically | returns data
obs(model, data)   | here data is predict-consumable  | not typically | returns data

A sample implementation is given in Providing a separate data front end.
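
To give a flavor of what such a front end looks like, here is a condensed, hypothetical sketch for a RidgeRegressor learner. The learner, the RidgeObs container, its field names, and the training logic are all placeholders and not part of LearnAPI.jl; only the pattern matters (convert once in obs, accept both forms in fit):

using LearnAPI
import Tables

# hypothetical learner and its learner-specific data container:
struct RidgeRegressor
    lambda::Float64
end

struct RidgeObs
    A::Matrix{Float64}   # features, with observations as columns
    y::Vector{Float64}   # targets
end

# random observation access, via the `Base` fallbacks for `MLUtils.getobs`/`numobs`:
Base.length(observations::RidgeObs) = length(observations.y)
Base.getindex(observations::RidgeObs, I::AbstractVector) =
    RidgeObs(observations.A[:, I], observations.y[I])

# data front end: convert user-facing `data = (X, y)` (table, vector) exactly once:
LearnAPI.obs(::RidgeRegressor, data) =
    RidgeObs(Tables.matrix(first(data))', float.(last(data)))
LearnAPI.obs(::RidgeRegressor, observations::RidgeObs) = observations  # idempotence

# `fit` is implemented for the learner-specific form; raw data is converted first:
function LearnAPI.fit(learner::RidgeRegressor, observations::RidgeObs; kwargs...)
    # ... train on `observations.A` and `observations.y` and return a model ...
end
LearnAPI.fit(learner::RidgeRegressor, data; kwargs...) =
    fit(learner, obs(learner, data); kwargs...)

Because RidgeObs implements Base.getindex and Base.length, it supports MLUtils.getobs and MLUtils.numobs without making MLUtils.jl a dependency, and a cross-validation workflow like the one above converts the table only once.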

Reference

LearnAPI.obs (Function)
obs(learner, data)
obs(model, data)

Return learner-specific representation of data, suitable for passing to fit (first signature) or to predict and transform (second signature), in place of data. Here model is the return value of fit(learner, ...) for some LearnAPI.jl learner, learner.

The returned object is guaranteed to implement observation access as indicated by LearnAPI.data_interface(learner), typically LearnAPI.RandomAccess().

Calling fit/predict/transform on the returned objects may have performance advantages over calling directly on data in some contexts.

Example

Usual workflow, using data-specific resampling methods:

import Tables

data = (X, y) # a DataFrame and a vector
data_train = (Tables.subset(X, 1:100), y[1:100])
model = fit(learner, data_train)
ŷ = predict(model, Point(), Tables.subset(X, 101:150))

Alternative workflow using obs and the MLUtils.jl method getobs to carry out subsampling (assumes LearnAPI.data_interface(learner) == RandomAccess()):

import MLUtils
fit_observations = obs(learner, data)
model = fit(learner, MLUtils.getobs(fit_observations, 1:100))
predict_observations = obs(model, X)
ẑ = predict(model, Point(), MLUtils.getobs(predict_observations, 101:150))
@assert ẑ == ŷ

See also LearnAPI.data_interface.

Extended help

New implementations

Implementation is typically optional.

For each supported form of data in fit(learner, data), it must be true that model = fit(learner, observations) is equivalent to model = fit(learner, data), whenever observations = obs(learner, data). For each supported form of data in calls predict(model, ..., data) and transform(model, data), where implemented, the calls predict(model, ..., observations) and transform(model, observations) must be supported alternatives with the same output, whenever observations = obs(model, data).

If LearnAPI.data_interface(learner) == RandomAccess() (the default), then fit, predict and transform must additionally accept obs output that has been subsampled using MLUtils.getobs, with the obvious interpretation applying to the outcomes of such calls (e.g., if all observations are subsampled, then outcomes should be the same as if using the original data).
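
For illustration only, and using the variables from the example above: the subsampling requirement means that both of the following calls train on the first 100 observations, and the resulting models must behave identically in subsequent predict/transform calls.

fit_observations = obs(learner, data)
model_a = fit(learner, MLUtils.getobs(fit_observations, 1:100))  # subsampled `obs` output
model_b = fit(learner, data_train)                               # the same 100 raw observations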

It is required that obs(learner, _) and obs(model, _) are idempotent, meaning both the following hold:

obs(learner, obs(learner, data)) == obs(learner, data)
obs(model, obs(model, data)) == obs(model, data)

If one overloads obs, one typically needs additional overloadings to guarantee this behavior.

The fallback for obs is obs(model_or_learner, data) = data, and the fallback for LearnAPI.data_interface(learner) is LearnAPI.RandomAccess(). For details refer to the LearnAPI.data_interface document string.

In particular, if the data to be consumed by fit, predict or transform consists only of suitable tables and arrays, then obs and LearnAPI.data_interface do not need to be overloaded. However, the user will get no performance benefits by using obs in that case.
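
For such a learner, the fallback noted above makes obs a no-op:

@assert obs(learner, data) === data   # `learner` here does not overload `obs`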

Sample implementation

Refer to the "Anatomy of an Implementation" section of the LearnAPI.jl manual.


Data interfaces

New implementations must overload LearnAPI.data_interface(learner) if the output of obs does not implement LearnAPI.RandomAccess(). Arrays, most tables, and all tuples thereof implement RandomAccess().
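
For example, a learner whose obs output can only be iterated might make the following declaration (MyBatchLearner being a hypothetical learner type):

LearnAPI.data_interface(::MyBatchLearner) = LearnAPI.FiniteIterable()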

LearnAPI.RandomAccess (Type)
LearnAPI.RandomAccess

A data interface type. We say that data implements the RandomAccess interface if data implements the methods getobs and numobs from MLUtils.jl. The first method allows one to grab observations specified by an arbitrary index set, as in MLUtils.getobs(data, [2, 3, 5]), while the second method returns the total number of available observations, which is assumed to be known and finite.

All arrays implement RandomAccess, with the last index being the observation index (observations-as-columns in matrices).

Tables.jl-compatible data implements RandomAccess if Tables.istable(data) is true and data implements DataAPI.nrow. This includes many tables, DataFrames in particular. Tables that are also tuples are explicitly excluded.

Any tuple of objects implementing RandomAccess also implements RandomAccess.
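
For example (a small illustration, assuming observations-as-columns):

import MLUtils

X = rand(3, 10)                              # 10 observations, stored as columns
y = rand(10)
data = (X, y)                                # a tuple of `RandomAccess` objects

MLUtils.numobs(data)                         # 10
Xsub, ysub = MLUtils.getobs(data, [2, 3, 5]) # 3×3 matrix and 3-element vector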

If LearnAPI.data_interface(learner) takes the value RandomAccess(), then obs(learner, ...) is guaranteed to return objects implementing the RandomAccess interface, and the same holds for obs(model, ...), whenever LearnAPI.learner(model) == learner.

Implementing RandomAccess for new data types

Typically, implementing RandomAccess for a new data type requires only implementing Base.getindex and Base.length, which are the fallbacks for MLUtils.getobs and MLUtils.numobs respectively; this avoids making MLUtils.jl a package dependency.
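
For example, a hypothetical Corpus type holding one document per observation could be given RandomAccess support like this (indexing by a single integer is omitted for brevity):

struct Corpus
    documents::Vector{String}
end

Base.length(corpus::Corpus) = length(corpus.documents)     # fallback for `MLUtils.numobs`
Base.getindex(corpus::Corpus, I::AbstractVector) =          # fallback for `MLUtils.getobs`
    Corpus(corpus.documents[I])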

See also LearnAPI.FiniteIterable, LearnAPI.Iterable.

LearnAPI.FiniteIterable (Type)
LearnAPI.FiniteIterable

A data interface type. We say that data implements the FiniteIterable interface if it implements Julia's iterate interface, including Base.length, and if Base.IteratorSize(typeof(data)) == Base.HasLength(). For example, this is true if:

  • data implements the LearnAPI.RandomAccess interface (arrays and most tables); or

  • data isa MLUtils.DataLoader, which includes output from MLUtils.eachobs.

If LearnAPI.data_interface(learner) takes the value FiniteIterable(), then obs(learner, ...) is guaranteed to return objects implementing the FiniteIterable interface, and the same holds for obs(model, ...), whenever LearnAPI.learner(model) == learner.

See also LearnAPI.RandomAccess, LearnAPI.Iterable.

LearnAPI.Iterable (Type)
LearnAPI.Iterable

A data interface type. We say that data implements the Iterable interface if it implements Julia's basic iterate interface. (Such objects may not implement MLUtils.numobs or Base.length.)
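
For example, a filtered iterator has unknown length, so it implements Iterable but not FiniteIterable:

itr = Iterators.filter(isodd, 1:1_000_000)
Base.IteratorSize(typeof(itr))    # Base.SizeUnknown()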

If LearnAPI.data_interface(learner) takes the value Iterable(), then obs(learner, ...) is guaranteed to return objects implementing Iterable, and the same holds for obs(model, ...), whenever LearnAPI.learner(model) == learner.

See also LearnAPI.FiniteIterable, LearnAPI.RandomAccess.
