`obs`

The MLUtils.jl package provides two methods, getobs and numobs, for resampling data that is divided into multiple observations, including arrays and tables. The data objects returned by the obs methods listed below are guaranteed to implement this interface, and can be passed to the relevant method (obsfit, obspredict or obstransform), possibly after resampling with MLUtils.getobs. This may provide performance advantages over naive workflows.

obs(fit, algorithm, data...) -> <combined data object for fit>
obs(predict, algorithm, data...) -> <combined data object for predict>
obs(transform, algorithm, data...) -> <combined data object for transform>
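To make the getobs/numobs interface concrete, here is a minimal sketch applying both methods to an ordinary matrix. It assumes MLUtils.jl is installed; the variables X and Xsub are illustrative only and are not part of LearnAPI.jl.

```julia
import MLUtils

# 3 features, 10 observations: for arrays, the last dimension indexes observations
X = reshape(collect(1.0:30.0), 3, 10)

n = MLUtils.numobs(X)            # number of observations in X
Xsub = MLUtils.getobs(X, 1:4)    # the first four observations (columns) of X
```

Objects returned by obs support the same two calls, whatever their internal representation.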

Typical workflows

LearnAPI.jl makes no assumptions about the form of data X and y in a call like fit(algorithm, X, y). The particular algorithm is free to articulate its own requirements. However, in this example, the definition

obsdata = obs(fit, algorithm, X, y)

combines X and y in a single object guaranteed to implement the MLUtils.jl getobs/numobs interface, which can be passed to obsfit instead of fit, as is, or after resampling using MLUtils.getobs:

# equivalent to `model = fit(algorithm, X, y)`:
model = obsfit(algorithm, obsdata)

# with resampling:
resampled_obsdata = MLUtils.getobs(obsdata, 1:100)
model = obsfit(algorithm, resampled_obsdata)

In some implementations, the alternative pattern above can be used to avoid needlessly repeating internal data preprocessing, or to avoid inefficient resampling. For example, here's how a user might call obs and MLUtils.getobs to perform efficient cross-validation:

using LearnAPI
import MLUtils

X = <some data frame with 30 rows>
y = <some categorical vector with 30 rows>
algorithm = <some LearnAPI-compliant algorithm>

train_test_folds = map([1:10, 11:20, 21:30]) do test
    (setdiff(1:30, test), test)
end

# create fixed model-specific representations of the whole data set:
fit_data = obs(fit, algorithm, X, y)
predict_data = obs(predict, algorithm, X)

scores = map(train_test_folds) do (train_indices, test_indices)

    # train using model-specific representation of data:
    train_data = MLUtils.getobs(fit_data, train_indices)
    model = obsfit(algorithm, train_data)

    # predict on the held-out test fold:
    test_data = MLUtils.getobs(predict_data, test_indices)
    ŷ = obspredict(model, LiteralTarget(), test_data)

    return <score comparing ŷ with y[test_indices]>

end

Note here that the output of obspredict will match the representation of y, i.e., there is no concept of an algorithm-specific representation of outputs, only inputs.
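The three hand-written folds in the cross-validation example generalize to any number of observations. The following sketch builds (train, test) index pairs using only Base Julia; the helper name train_test_pairs is hypothetical and is not part of LearnAPI.jl or MLUtils.jl.

```julia
# Hypothetical helper: split 1:n into contiguous test folds of (at most)
# cld(n, k) indices each, and pair every test fold with its complement.
# This yields at most k folds; the last fold may be shorter than the others.
function train_test_pairs(n::Integer, k::Integer)
    fold_size = cld(n, k)
    test_folds = collect(Iterators.partition(1:n, fold_size))
    return map(test_folds) do test
        (setdiff(1:n, test), test)   # (train indices, test indices)
    end
end

train_test_folds = train_test_pairs(30, 3)
```

Each element can be destructured as (train_indices, test_indices), matching the loop in the example.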

Implementation guide

method	compulsory?	fallback
`obs`	depends	slurps `data` argument

If the data consumed by fit, predict or transform consists only of tables and arrays (with last dimension the observation dimension) then overloading obs is optional. However, if an implementation overloads obs to return a (thinly wrapped) representation of user data that is closer to what the core algorithm actually uses, and overloads MLUtils.getobs (or, more typically Base.getindex) to make resampling of that representation efficient, then those optimizations become available to the user, without the user concerning herself with the details of the representation.

A sample implementation is given in the obs document-string below.

Reference

LearnAPI.obs — Function

obs(func, algorithm, data...)

Where func is fit, predict or transform, return a combined, algorithm-specific, representation of data..., which can be passed directly to obsfit, obspredict or obstransform, as shown in the example below.

The returned object implements the getobs/numobs observation-resampling interface provided by MLUtils.jl, even if data does not.

Calling func on the returned object may be cheaper than calling func directly on data... itself. Similarly, resampling the returned object using MLUtils.getobs may be cheaper than directly resampling the components of data (an operation not provided by the LearnAPI.jl interface).

Example

Usual workflow, using data-specific resampling methods:

X = <some `DataFrame`>
y = <some `Vector`>

Xtrain = X[1:100, :]
ytrain = y[1:100]
model = fit(algorithm, Xtrain, ytrain)
ŷ = predict(model, LiteralTarget(), X[101:150, :])

Alternative workflow using obs:

import MLUtils

fitdata = obs(fit, algorithm, X, y)
predictdata = obs(predict, algorithm, X)

model = obsfit(algorithm, MLUtils.getobs(fitdata, 1:100))
ẑ = obspredict(model, LiteralTarget(), MLUtils.getobs(predictdata, 101:150))
@assert ẑ == ŷ

Extended help

New implementations

If the data to be consumed in standard user calls to fit, predict or transform consists only of tables and arrays (with last dimension the observation dimension) then overloading obs is optional, but the user will get no performance benefits by using it. The implementation of obs is optional under more general circumstances stated at the end.

The fallback for obs just slurps the provided data:

obs(func, alg, data...) = data
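As an illustration of what slurping means here, the sketch below uses a stand-in function fallback_obs (a hypothetical name, used so that the example does not depend on LearnAPI.jl) with the same definition as the fallback. It assumes MLUtils.jl is installed. The slurped arguments form a tuple, and MLUtils.jl resamples a tuple of arrays component by component.

```julia
import MLUtils

# Stand-in with the same definition as the fallback (hypothetical name):
fallback_obs(func, alg, data...) = data

X = reshape(collect(1.0:10.0), 2, 5)   # 2 features, 5 observations
y = [10.0, 20.0, 30.0, 40.0, 50.0]

# `nothing` stands in for the function and algorithm arguments, which the
# fallback ignores:
obsdata = fallback_obs(nothing, nothing, X, y)   # the tuple (X, y)

n = MLUtils.numobs(obsdata)                 # common number of observations
Xsub, ysub = MLUtils.getobs(obsdata, 1:2)   # each component is resampled
```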

The only contractual obligation of obs is to return an object implementing the getobs/numobs interface. Generally it suffices to overload Base.getindex and Base.length. However, note that implementations of obsfit, obspredict, and obstransform depend on the form of output of obs.

If obs is overloaded, then obs must be included in the tuple returned by the LearnAPI.functions trait.

Sample implementation

Suppose that fit, for an algorithm of type Alg, is to have the primary signature

fit(algorithm::Alg, X, y)

where X is a table, y a vector. Internally, the algorithm is to call a lower level function

train(A, names, y)

where A = Tables.matrix(X)' and names are the column names of X. Then relevant parts of an implementation might look like this:

# thin wrapper for algorithm-specific representation of data:
struct ObsData{T}
    A::Matrix{T}
    names::Vector{Symbol}
    y::Vector{T}
end

# (indirect) implementation of `getobs/numobs`:
Base.getindex(data::ObsData, I) =
    ObsData(data.A[:,I], data.names, data.y[I])
Base.length(data::ObsData) = length(data.y)

# implementation of `obs`:
function LearnAPI.obs(::typeof(fit), ::Alg, X, y)
    table = Tables.columntable(X)
    names = Tables.columnnames(table) |> collect
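    # `transpose=true` returns a plain `Matrix` (features × observations),
    # as required by the `A::Matrix{T}` field:
    return ObsData(Tables.matrix(table, transpose=true), names, y)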
end

# implementation of `obsfit`:
function LearnAPI.obsfit(algorithm::Alg, data::ObsData; verbosity=1)
    coremodel = train(data.A, data.names, data.y)
    verbosity > 0 && @info "Training using these features: $(data.names)."
    <construct final `model` using `coremodel`>
    return model
end
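The resampling part of the pattern above can be tried out without LearnAPI.jl or Tables.jl. The following is a minimal sketch using only Base Julia, under these assumptions: ToyObsData mirrors the ObsData wrapper, toy_obs is a hypothetical stand-in for obs(fit, algorithm, X, y), and the table X is taken to be a named tuple of equal-length column vectors.

```julia
# Thin wrapper for the algorithm-specific representation of the data
# (same layout as `ObsData`; renamed to keep the sketch independent).
struct ToyObsData{T}
    A::Matrix{T}             # features × observations
    names::Vector{Symbol}
    y::Vector{T}
end

# Resampling selects observations: columns of `A` and entries of `y`.
Base.getindex(data::ToyObsData, I) =
    ToyObsData(data.A[:, I], data.names, data.y[I])
Base.length(data::ToyObsData) = length(data.y)

# Hypothetical stand-in for `obs(fit, algorithm, X, y)`.
function toy_obs(X::NamedTuple, y)
    names = collect(keys(X))
    A = permutedims(hcat(values(X)...))   # one row per column of the table
    return ToyObsData(A, names, y)
end

X = (a = [1.0, 2.0, 3.0, 4.0], b = [10.0, 20.0, 30.0, 40.0])
y = [0.5, 1.5, 2.5, 3.5]

data = toy_obs(X, y)
sub = data[2:3]   # observations 2 and 3, still in the wrapped representation
```

Because resampling returns another ToyObsData, the conversion from table to matrix happens once, in toy_obs, rather than once per resample.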

When is overloading obs optional?

Overloading obs is optional, for a given typeof(algorithm) and typeof(func), if the components of data in the standard call func(algorithm_or_model, data...) are already expected to separately implement the getobs/numobs interface. This is true for arrays whose last dimension is the observation dimension, and for suitable tables.
