# `obs`

The MLUtils.jl package provides two methods, `getobs` and `numobs`, for resampling data divided into multiple observations, including arrays and tables. The data objects returned below are guaranteed to implement this interface and can be passed to the relevant method (`obsfit`, `obspredict` or `obstransform`), possibly after resampling using `MLUtils.getobs`. This may provide performance advantages over naive workflows.
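For arrays, the `getobs`/`numobs` contract amounts to counting and indexing along the last dimension. Here is a minimal sketch of those semantics using only Base Julia (no MLUtils dependency); the variable names are illustrative only:

```julia
# For a matrix, the last dimension indexes observations, so columns are
# observations. These Base operations mirror what `MLUtils.numobs` and
# `MLUtils.getobs` provide for arrays:
A = reshape(1:12, 3, 4)     # 3 features × 4 observations

n = size(A)[end]            # number of observations, as in `numobs(A)`
sample = A[:, [2, 4]]       # observations 2 and 4, as in `getobs(A, [2, 4])`

@assert n == 4
@assert sample == [4 10; 5 11; 6 12]
```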

```
obs(fit, algorithm, data...) -> <combined data object for fit>
obs(predict, algorithm, data...) -> <combined data object for predict>
obs(transform, algorithm, data...) -> <combined data object for transform>
```

## Typical workflows

LearnAPI.jl makes no assumptions about the form of data `X` and `y` in a call like `fit(algorithm, X, y)`. The particular `algorithm` is free to articulate its own requirements. However, in this example, the definition

`obsdata = obs(fit, algorithm, X, y)`

combines `X` and `y` in a single object guaranteed to implement the MLUtils.jl `getobs`/`numobs` interface, which can be passed to `obsfit` instead of `fit`, as is, or after resampling using `MLUtils.getobs`:

```
# equivalent to `model = fit(algorithm, X, y)`:
model = obsfit(algorithm, obsdata)
# with resampling:
resampled_obsdata = MLUtils.getobs(obsdata, 1:100)
model = obsfit(algorithm, resampled_obsdata)
```

In some implementations, the alternative pattern above can be used to avoid repeating unnecessary internal data preprocessing, or inefficient resampling. For example, here's how a user might call `obs` and `MLUtils.getobs` to perform efficient cross-validation:

```
using LearnAPI
import MLUtils

X = <some data frame with 30 rows>
y = <some categorical vector with 30 rows>
algorithm = <some LearnAPI-compliant algorithm>

train_test_folds = map([1:10, 11:20, 21:30]) do test
    (setdiff(1:30, test), test)
end

# create fixed model-specific representations of the whole data set:
fit_data = obs(fit, algorithm, X, y)
predict_data = obs(predict, algorithm, X)

scores = map(train_test_folds) do (train_indices, test_indices)

    # train using model-specific representation of data:
    train_data = MLUtils.getobs(fit_data, train_indices)
    model = obsfit(algorithm, train_data)

    # predict on the fold complement:
    test_data = MLUtils.getobs(predict_data, test_indices)
    ŷ = obspredict(model, LiteralTarget(), test_data)

    return <score comparing ŷ with y[test_indices]>
end
```

Note here that the output of `obspredict` will match the representation of `y`, i.e., there is no concept of an algorithm-specific representation of *outputs*, only inputs.

## Implementation guide

method | compulsory? | fallback |
---|---|---|
`obs` | depends | slurps `data` argument |

If the `data` consumed by `fit`, `predict` or `transform` consists only of tables and arrays (with last dimension the observation dimension) then overloading `obs` is optional. However, if an implementation overloads `obs` to return a (thinly wrapped) representation of user data that is closer to what the core algorithm actually uses, and overloads `MLUtils.getobs` (or, more typically, `Base.getindex`) to make resampling of that representation efficient, then those optimizations become available to the user, without the user concerning herself with the details of the representation.

A sample implementation is given in the `obs` docstring below.

## Reference

`LearnAPI.obs` — Function

`obs(func, algorithm, data...)`

Where `func` is `fit`, `predict` or `transform`, return a combined, algorithm-specific, representation of `data...`, which can be passed directly to `obsfit`, `obspredict` or `obstransform`, as shown in the example below.

The returned object implements the `getobs`/`numobs` observation-resampling interface provided by MLUtils.jl, even if `data` does not.

Calling `func` on the returned object may be cheaper than calling `func` directly on `data...`. And resampling the returned object using `MLUtils.getobs` may be cheaper than directly resampling the components of `data` (an operation not provided by the LearnAPI.jl interface).

**Example**

Usual workflow, using data-specific resampling methods:

```
X = <some `DataFrame`>
y = <some `Vector`>
Xtrain = Tables.subset(X, 1:100)
ytrain = y[1:100]
model = fit(algorithm, Xtrain, ytrain)
ŷ = predict(model, LiteralTarget(), Tables.subset(X, 101:150))
```

Alternative workflow using `obs`:

```
import MLUtils
fitdata = obs(fit, algorithm, X, y)
predictdata = obs(predict, algorithm, X)
model = obsfit(algorithm, MLUtils.getobs(fitdata, 1:100))
ẑ = obspredict(model, LiteralTarget(), MLUtils.getobs(predictdata, 101:150))
@assert ẑ == ŷ
```

See also `obsfit`, `obspredict`, `obstransform`.

**Extended help**

**New implementations**

If the `data` to be consumed in standard user calls to `fit`, `predict` or `transform` consists only of tables and arrays (with last dimension the observation dimension) then overloading `obs` is optional, but the user will get no performance benefits by using it. The implementation of `obs` is optional under more general circumstances stated at the end.

The fallback for `obs` just slurps the provided data:

`obs(func, alg, data...) = data`
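A standalone sketch of what this slurping fallback does, with a stand-in `obs` and dummy arguments (not the actual LearnAPI definitions): the trailing data arguments are simply collected into a tuple, unchanged:

```julia
# Stand-in for the fallback: slurp trailing data arguments into a tuple.
obs(func, alg, data...) = data

X = rand(3, 5)
y = rand(5)

# `sum` and `nothing` are placeholders for `func` and `algorithm`:
@assert obs(sum, nothing, X, y) == (X, y)
```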

The only contractual obligation of `obs` is to return an object implementing the `getobs`/`numobs` interface. Generally it suffices to overload `Base.getindex` and `Base.length`. However, note that implementations of `obsfit`, `obspredict`, and `obstransform` depend on the form of output of `obs`.

If overloaded, you must include `obs` in the tuple returned by the `LearnAPI.functions` trait.

**Sample implementation**

Suppose that `fit`, for an algorithm of type `Alg`, is to have the primary signature

`fit(algorithm::Alg, X, y)`

where `X` is a table and `y` a vector. Internally, the algorithm is to call a lower level function

`train(A, names, y)`

where `A = Tables.matrix(X)'` and `names` are the column names of `X`. Then relevant parts of an implementation might look like this:

```
# thin wrapper for algorithm-specific representation of data:
struct ObsData{T}
    A::Matrix{T}
    names::Vector{Symbol}
    y::Vector{T}
end

# (indirect) implementation of `getobs/numobs`:
Base.getindex(data::ObsData, I) =
    ObsData(data.A[:, I], data.names, data.y[I])
Base.length(data::ObsData) = length(data.y)

# implementation of `obs`:
function LearnAPI.obs(::typeof(fit), ::Alg, X, y)
    table = Tables.columntable(X)
    names = Tables.columnnames(table) |> collect
    return ObsData(collect(Tables.matrix(table)'), names, y)
end

# implementation of `obsfit`:
function LearnAPI.obsfit(algorithm::Alg, data::ObsData; verbosity=1)
    verbosity > 0 && @info "Training using these features: $(data.names)."
    coremodel = train(data.A, data.names, data.y)
    <construct final `model` using `coremodel`>
    return model
end
```
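Since overloading `Base.getindex` and `Base.length` generally suffices for the `getobs`/`numobs` contract, the `getindex` overload above is what makes resampling of `ObsData` work. Here is a self-contained sketch of just that part, with hypothetical values and no package dependencies:

```julia
# Thin wrapper, as in the sample implementation above:
struct ObsData{T}
    A::Matrix{T}          # features × observations
    names::Vector{Symbol}
    y::Vector{T}
end
Base.getindex(data::ObsData, I) = ObsData(data.A[:, I], data.names, data.y[I])
Base.length(data::ObsData) = length(data.y)

# 3 features, 4 observations (columns), and a 4-element target:
data = ObsData(collect(reshape(1.0:12.0, 3, 4)), [:a, :b, :c], [10.0, 20.0, 30.0, 40.0])

sub = data[2:3]           # resample observations 2 and 3

@assert size(sub.A) == (3, 2)
@assert sub.y == [20.0, 30.0]
@assert length(sub) == 2
```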

**When is overloading `obs` optional?**

Overloading `obs` is optional, for a given `typeof(algorithm)` and `typeof(func)`, if the components of `data` in the standard call `func(algorithm_or_model, data...)` are already expected to separately implement the `getobs`/`numobs` interface. This is true for arrays whose last dimension is the observation dimension, and for suitable tables.