obs
The MLUtils.jl package provides two methods, `getobs` and `numobs`, for resampling data divided into multiple observations, including arrays and tables. The data objects returned below are guaranteed to implement this interface and can be passed to the relevant method (`obsfit`, `obspredict` or `obstransform`), possibly after resampling using `MLUtils.getobs`. This may provide performance advantages over naive workflows.
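For orientation, here is a minimal Base-only sketch (not MLUtils.jl itself) of what the `getobs`/`numobs` semantics mean for an array; the real MLUtils.jl methods additionally handle tables, tuples, and nested containers:

```julia
# Sketch of the observation-indexing semantics that `MLUtils.getobs`/`numobs`
# formalize, for a matrix whose columns are the observations:
X = reshape(1.0:12.0, 3, 4)        # 4 observations, each a length-3 column

numobs_sketch(A::AbstractMatrix) = size(A, 2)    # count observations
getobs_sketch(A::AbstractMatrix, I) = A[:, I]    # resample observations

numobs_sketch(X)         # 4
getobs_sketch(X, 2:3)    # the 3×2 submatrix holding observations 2 and 3
```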
obs(fit, algorithm, data...) -> <combined data object for fit>
obs(predict, algorithm, data...) -> <combined data object for predict>
obs(transform, algorithm, data...) -> <combined data object for transform>
Typical workflows
LearnAPI.jl makes no assumptions about the form of data `X` and `y` in a call like `fit(algorithm, X, y)`. The particular `algorithm` is free to articulate its own requirements. However, in this example, the definition

obsdata = obs(fit, algorithm, X, y)

combines `X` and `y` in a single object guaranteed to implement the MLUtils.jl `getobs`/`numobs` interface, which can be passed to `obsfit` instead of `fit`, as is, or after resampling using `MLUtils.getobs`:
# equivalent to `model = fit(algorithm, X, y)`:
model = obsfit(algorithm, obsdata)
# with resampling:
resampled_obsdata = MLUtils.getobs(obsdata, 1:100)
model = obsfit(algorithm, resampled_obsdata)
In some implementations, the alternative pattern above can be used to avoid repeating unnecessary internal data preprocessing, or inefficient resampling. For example, here's how a user might call `obs` and `MLUtils.getobs` to perform efficient cross-validation:
using LearnAPI
import MLUtils
X = <some data frame with 30 rows>
y = <some categorical vector with 30 rows>
algorithm = <some LearnAPI-compliant algorithm>
train_test_folds = map([1:10, 11:20, 21:30]) do test
    (setdiff(1:30, test), test)
end
# create fixed model-specific representations of the whole data set:
fit_data = obs(fit, algorithm, X, y)
predict_data = obs(predict, algorithm, X)

scores = map(train_test_folds) do (train_indices, test_indices)
    # train using model-specific representation of data:
    train_data = MLUtils.getobs(fit_data, train_indices)
    model = obsfit(algorithm, train_data)

    # predict on the fold complement:
    test_data = MLUtils.getobs(predict_data, test_indices)
    ŷ = obspredict(model, LiteralTarget(), test_data)

    return <score comparing ŷ with y[test_indices]>
end
Note here that the output of `obspredict` will match the representation of `y`, i.e., there is no concept of an algorithm-specific representation of outputs, only inputs.
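As an aside, the train/test fold pairs destructured as `(train_indices, test_indices)` in the loop can be built with only Base; a standalone sketch (names here are illustrative):

```julia
# Build (train, test) index pairs for 3-fold cross-validation on 30 rows:
test_folds = [1:10, 11:20, 21:30]
folds = map(test_folds) do test
    (setdiff(1:30, test), test)    # (train_indices, test_indices)
end

folds[1][1]    # the training indices complementary to 1:10
```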
Implementation guide
| method | compulsory? | fallback |
|---|---|---|
| `obs` | depends | slurps `data` argument |
If the `data` consumed by `fit`, `predict` or `transform` consists only of tables and arrays (with last dimension the observation dimension), then overloading `obs` is optional. However, if an implementation overloads `obs` to return a (thinly wrapped) representation of user data that is closer to what the core algorithm actually uses, and overloads `MLUtils.getobs` (or, more typically, `Base.getindex`) to make resampling of that representation efficient, then those optimizations become available to the user, without the user concerning herself with the details of the representation.
A sample implementation is given in the `obs` docstring below.
Reference
LearnAPI.obs — Function

obs(func, algorithm, data...)

Where `func` is `fit`, `predict` or `transform`, return a combined, algorithm-specific representation of `data...`, which can be passed directly to `obsfit`, `obspredict` or `obstransform`, as shown in the example below.
The returned object implements the `getobs`/`numobs` observation-resampling interface provided by MLUtils.jl, even if `data` does not.
Calling `func` on the returned object may be cheaper than calling `func` directly on `data...`. And resampling the returned object using `MLUtils.getobs` may be cheaper than directly resampling the components of `data` (an operation not provided by the LearnAPI.jl interface).
Example
Usual workflow, using data-specific resampling methods:
X = <some `DataFrame`>
y = <some `Vector`>
Xtrain = Tables.subset(X, 1:100)
ytrain = y[1:100]
model = fit(algorithm, Xtrain, ytrain)
ŷ = predict(model, LiteralTarget(), Tables.subset(X, 101:150))
Alternative workflow using `obs`:
import MLUtils
fitdata = obs(fit, algorithm, X, y)
predictdata = obs(predict, algorithm, X)
model = obsfit(algorithm, MLUtils.getobs(fitdata, 1:100))
ẑ = obspredict(model, LiteralTarget(), MLUtils.getobs(predictdata, 101:150))
@assert ẑ == ŷ
See also `obsfit`, `obspredict`, `obstransform`.
Extended help
New implementations
If the `data` to be consumed in standard user calls to `fit`, `predict` or `transform` consists only of tables and arrays (with last dimension the observation dimension), then overloading `obs` is optional, but the user will get no performance benefits by using it. The implementation of `obs` is optional under more general circumstances, stated at the end.
The fallback for `obs` just slurps the provided data:
obs(func, alg, data...) = data
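To see what the fallback produces, here is a standalone sketch (the name `obs_fallback` is illustrative, not part of LearnAPI.jl):

```julia
# With no overload, `obs` simply bundles the data arguments into a tuple:
obs_fallback(func, alg, data...) = data

X = [1 2 3; 4 5 6]
y = [0, 1, 0]
obs_fallback(sum, nothing, X, y)    # returns the tuple (X, y)
```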
The only contractual obligation of `obs` is to return an object implementing the `getobs`/`numobs` interface. Generally it suffices to overload `Base.getindex` and `Base.length`. However, note that implementations of `obsfit`, `obspredict`, and `obstransform` depend on the form of output of `obs`.
If overloaded, you must include `obs` in the tuple returned by the `LearnAPI.functions` trait.
Sample implementation
Suppose that `fit`, for an algorithm of type `Alg`, is to have the primary signature

fit(algorithm::Alg, X, y)

where `X` is a table and `y` a vector. Internally, the algorithm is to call a lower-level function

train(A, names, y)

where `A = Tables.matrix(X)'` and `names` are the column names of `X`. Then relevant parts of an implementation might look like this:
# thin wrapper for algorithm-specific representation of data:
struct ObsData{T}
    A::Matrix{T}
    names::Vector{Symbol}
    y::Vector{T}
end
# (indirect) implementation of `getobs/numobs`:
Base.getindex(data::ObsData, I) =
    ObsData(data.A[:, I], data.names, data.y[I])
Base.length(data::ObsData) = length(data.y)
# implementation of `obs`:
function LearnAPI.obs(::typeof(fit), ::Alg, X, y)
    table = Tables.columntable(X)
    names = Tables.columnnames(table) |> collect
    # `collect` materializes the adjoint as a `Matrix`, as the field type requires:
    return ObsData(collect(Tables.matrix(table)'), names, y)
end
# implementation of `obsfit`:
function LearnAPI.obsfit(algorithm::Alg, data::ObsData; verbosity=1)
    verbosity > 0 && @info "Training using these features: $(data.names)."
    coremodel = train(data.A, data.names, data.y)
    <construct final `model` using `coremodel`>
    return model
end
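A standalone sketch checking the wrapper's resampling behavior (repeating the struct and method definitions so it runs on its own; the data values are illustrative):

```julia
# Thin wrapper over an algorithm-specific data representation:
struct ObsData{T}
    A::Matrix{T}             # columns are observations
    names::Vector{Symbol}    # feature names (unchanged by resampling)
    y::Vector{T}             # target, one entry per observation
end
Base.getindex(data::ObsData, I) = ObsData(data.A[:, I], data.names, data.y[I])
Base.length(data::ObsData) = length(data.y)

data = ObsData([1.0 2.0 3.0; 4.0 5.0 6.0], [:a, :b], [10.0, 20.0, 30.0])
sub = data[2:3]    # resample observations 2 and 3
length(sub)        # 2
sub.y              # [20.0, 30.0]
```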
When is overloading `obs` optional?
Overloading `obs` is optional, for a given `typeof(algorithm)` and `typeof(func)`, if the components of `data` in the standard call `func(algorithm_or_model, data...)` are already expected to separately implement the `getobs`/`numobs` interface. This is true for arrays whose last dimension is the observation dimension, and for suitable tables.
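For example, plain arrays already satisfy the last-dimension convention without any wrapping:

```julia
# The last dimension of an array indexes observations, so slicing along it
# is already an observation-resampling operation:
X = rand(3, 5, 100)      # 100 observations, each a 3×5 slice
batch = X[:, :, 1:10]    # the first 10 observations
size(batch)              # (3, 5, 10)
```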