Learning Networks
MLJ has a flexible interface for building networks from multiple machine learning elements, whose complexity extends beyond the "pipelines" of other machine learning toolboxes.
Overview
In the future the casual MLJ user will be able to build common pipeline architectures, such as linear composites and stacks, with simple macro invocations. Handcrafting a learning network, as outlined below, is an advanced MLJ feature, assuming familiarity with the basics outlined in Getting Started. The syntax for building a learning network is essentially an extension of the basic syntax but with data containers replaced with nodes ("dynamic data").
In MLJ, a learning network is a graph whose nodes apply an operation, such as predict or transform, using a fixed machine (requiring training), or which, alternatively, apply a regular (untrained) mathematical operation to their input(s). In practice, a learning network works with fixed sources for its training/evaluation data, but can be built and tested in stages. By contrast, an exported learning network is a learning network exported as a stand-alone, re-usable Model object, to which all the MLJ Model meta-algorithms can be applied (ensembling, systematic tuning, etc).
As we shall see, exporting a learning network as a reusable model is quite simple. While one can entirely skip the build-and-train steps, experimenting with raw learning networks may be the best way to understand how the stand-alone models work under the hood.
In MLJ, learning networks treat the flow of information during training and predicting separately. For this reason, simpler examples may appear slightly more complicated than in other approaches. However, in more sophisticated examples, such as stacking, the separation is essential.
Building a simple learning network
The diagram above depicts a learning network which standardises the input data X, learns an optimal Box-Cox transformation for the target y, predicts new target values using ridge regression, and then inverse-transforms those predictions, for later comparison with the original test data. The machines are labelled in yellow. We first need to import the RidgeRegressor model (you will need MLJModels in your load path):
using MLJ
@load RidgeRegressor
To implement the network, we begin by loading data needed for training and evaluation into source nodes. For testing purposes, we'll use a small synthetic data set:
using Statistics, DataFrames
x1 = rand(300)
x2 = rand(300)
x3 = rand(300)
y = exp.(x1 - x2 -2x3 + 0.1*rand(300))
X = DataFrame(x1=x1, x2=x2, x3=x3)
ys = source(y)
Xs = source(X)
Source @ 3…40
We label nodes we will construct according to their outputs in the diagram. Notice that the nodes z and yhat use the same machine, namely box, for different operations.
To construct the W node we first need to define the machine stand that it will use to transform inputs.
stand_model = Standardizer()
stand = machine(stand_model, Xs)
NodalMachine @ 6…82 = machine(Standardizer{} @ 1…82, 3…40)
Because Xs is a node, instead of concrete data, we can call transform on the machine without first training it, and the result is the new node W, instead of concrete transformed data:
W = transform(stand, Xs)
Node @ 1…67 = transform(6…82, 3…40)
To get actual transformed data we call the node appropriately, which will require we first train the node. Training a node, rather than a machine, triggers training of all necessary machines in the network.
train, test = partition(eachindex(y), 0.8) # first 80% of rows for training, the rest for testing
fit!(W, rows=train)
W() # transform all data
W(rows=test) # transform only test data
W(X[3:4,:]) # transform any data, new or old
2×3 DataFrame
│ Row │ x1 │ x2 │ x3 │
│ │ Float64 │ Float64 │ Float64 │
├─────┼───────────┼──────────┼───────────┤
│ 1 │ -0.516373 │ 0.675257 │ 1.27734 │
│ 2 │ 0.63249 │ -1.70306 │ 0.0479891 │
If you like, you can think of W (and the other nodes we will define) as "dynamic data": W is data, in the sense that it can be called ("indexed") on rows, but dynamic, in the sense that the result depends on the outcome of training events.
The other nodes of our network are defined similarly:
box_model = UnivariateBoxCoxTransformer() # for making data look normally-distributed
box = machine(box_model, ys)
z = transform(box, ys)
ridge_model = RidgeRegressor(lambda=0.1)
ridge = machine(ridge_model, W, z)
zhat = predict(ridge, W)
yhat = inverse_transform(box, zhat)
Node @ 1…07 = inverse_transform(1…09, predict(2…66, transform(6…82, 3…40)))
We are ready to train and evaluate the completed network. Notice that the standardizer, stand, is not retrained, as MLJ remembers that it was trained earlier:
fit!(yhat, rows=train)
[ Info: Not retraining NodalMachine{Standardizer} @ 6…82. It is up-to-date.
[ Info: Training NodalMachine{UnivariateBoxCoxTransformer} @ 1…09.
[ Info: Training NodalMachine{RidgeRegressor} @ 2…66.
Node @ 1…07 = inverse_transform(1…09, predict(2…66, transform(6…82, 3…40)))
rms(y[test], yhat(rows=test)) # evaluate
0.022837595088079567
We can change a hyperparameter and retrain:
ridge_model.lambda = 0.01
fit!(yhat, rows=train)
[ Info: Not retraining NodalMachine{UnivariateBoxCoxTransformer} @ 1…09. It is up-to-date.
[ Info: Not retraining NodalMachine{Standardizer} @ 6…82. It is up-to-date.
[ Info: Updating NodalMachine{RidgeRegressor} @ 2…66.
Node @ 1…07 = inverse_transform(1…09, predict(2…66, transform(6…82, 3…40)))
And re-evaluate:
rms(y[test], yhat(rows=test))
0.039410306910269116
Notable feature. The machine, ridge::NodalMachine{RidgeRegressor}, is retrained, because its underlying model has been mutated. However, since the outcome of this training has no effect on the training inputs of the machines stand and box, these transformers are left untouched. (During construction, each node and machine in a learning network determines and records all machines on which it depends.) This behaviour, which extends to exported learning networks, means we can tune our wrapped regressor (using a holdout set) without re-computing transformations each time the hyperparameter is changed.
Exporting a learning network as a stand-alone model
Having satisfied ourselves that our learning network works on the synthetic data, we are ready to export it as a stand-alone model.
Method I: The @from_network macro
The following call simultaneously defines a new model subtype WrappedRidgeI <: Supervised and creates an instance of this type, wrapped_ridgeI:
wrapped_ridgeI = @from_network WrappedRidgeI(ridge=ridge_model) <= (Xs, ys, yhat)
Any MLJ workflow can be applied to this composite model:
julia> params(wrapped_ridgeI)
(ridge = (lambda = 0.01,),)
using CSV
X, y = load_boston()()
evaluate(wrapped_ridgeI, X, y, resampling=CV(), measure=rms, verbosity=0)
6-element Array{Float64,1}:
3.0225867093289347
4.755707358891049
5.011312664189936
4.226827668908119
8.93385968738185
3.4788524973220545
Notes:
- A deep copy of the ridge_model used in the original learning network has become the default value for the field ridge of the new WrappedRidgeI struct.
- The tuple (Xs, ys, yhat) must always follow the pattern (source node for inputs, source node for target, terminal prediction node), unless this is an unsupervised learning network, in which case the pattern is (source node for inputs, terminal transform node).
Method II:
In Method I above, only models appearing in the network will appear as hyperparameters of the exported composite model. There is a second, more flexible method for exporting the network, which allows finer control over the exported Model struct (see the example under Static operations on nodes below) and which also avoids macros. The two steps required are:
- Define a new mutable struct model type.
- Wrap the learning network code in a model fit method.
All learning networks that make deterministic (respectively, probabilistic) predictions export as models of subtype DeterministicNetwork (respectively, ProbabilisticNetwork):
mutable struct WrappedRidgeII <: DeterministicNetwork
    ridge_model
end
# keyword constructor (the default must be a model instance, not the type itself)
WrappedRidgeII(; ridge_model=RidgeRegressor()) = WrappedRidgeII(ridge_model);
We now simply cut and paste the network's defining code into a model fit method (as opposed to machine fit! methods, which internally dispatch model fit methods on bound data):
function MLJ.fit(model::WrappedRidgeII, verbosity::Integer, X, y)
    Xs = source(X)
    ys = source(y)
    stand_model = Standardizer()
    stand = machine(stand_model, Xs)
    W = transform(stand, Xs)
    box_model = UnivariateBoxCoxTransformer() # for making data look normally-distributed
    box = machine(box_model, ys)
    z = transform(box, ys)
    ridge_model = model.ridge_model ###
    ridge = machine(ridge_model, W, z)
    zhat = predict(ridge, W)
    yhat = inverse_transform(box, zhat)
    fit!(yhat, verbosity=0)
    return fitresults(Xs, ys, yhat)
end
The line marked ###, where the new exported model's hyperparameter ridge_model is spliced into the network, is the only modification.
What's going on here? MLJ's machine interface is built atop a more primitive model interface, implemented for each algorithm. Each supervised model type (e.g., RidgeRegressor) requires model fit and predict methods, which are called by the corresponding machine fit! and predict methods. We don't need to define a model predict method here because MLJ provides a fallback which simply calls the terminating node of the network built in fit on the data supplied.
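The relationship between the two layers can be sketched in plain Julia. The block below does not use MLJ at all: MeanRegressor, BoundMachine, modelfit, modelpredict, machfit! and machpredict are hypothetical names invented for this illustration, and the sketch is a deliberate simplification of what a real machine stores.

```julia
# A toy "model": a struct that would normally hold only hyperparameters.
struct MeanRegressor end

# Model-level fit: a function of the model and concrete data, returning a fit-result.
modelfit(::MeanRegressor, verbosity::Integer, X, y) = sum(y) / length(y)

# Model-level predict: uses the fit-result, never the training data.
modelpredict(::MeanRegressor, fitresult, Xnew) = fill(fitresult, length(Xnew))

# A toy "machine": binds a model to data and stores the latest fit-result.
mutable struct BoundMachine{M}
    model::M
    X
    y
    fitresult
end
BoundMachine(model, X, y) = BoundMachine(model, X, y, nothing)

# Machine-level fit!: selects rows of the bound data, then calls the model-level fit.
function machfit!(mach::BoundMachine; rows=eachindex(mach.y), verbosity=1)
    mach.fitresult = modelfit(mach.model, verbosity, mach.X[rows], mach.y[rows])
    return mach
end

# Machine-level predict: forwards to the model-level predict.
machpredict(mach::BoundMachine, Xnew) = modelpredict(mach.model, mach.fitresult, Xnew)

mach = BoundMachine(MeanRegressor(), [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
machfit!(mach, rows=1:2)           # trains on the first two observations only
machpredict(mach, [10.0, 20.0])    # returns [3.0, 3.0]
```

In this picture, exporting a learning network amounts to supplying the model-level fit, with the network itself playing the role of the fit-result.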
The export process is complete:
using CSV
X, y = load_boston()()
wrapped_ridgeII = WrappedRidgeII(ridge_model) # instantiate the new model type
evaluate(wrapped_ridgeII, X, y, resampling=CV(), measure=rms, verbosity=0)
6-element Array{Float64,1}:
3.0225867093289347
4.755707358891049
5.011312664189936
4.226827668908119
8.93385968738185
3.4788524973220545
Another example of an exported learning network is given in the next subsection.
Static operations on nodes
Continuing to view nodes as "dynamic data", we can, in addition to applying "dynamic" operations like predict and transform to nodes, overload ordinary "static" operations as well. Common operations, like addition, scalar multiplication, exp and log work out of the box. To demonstrate this, consider the code below defining a composite model that: (i) one-hot encodes the input table X; (ii) log transforms the continuous target y; (iii) fits specified K-nearest neighbour and ridge regressor models to the data; (iv) computes a weighted average of individual model predictions; and (v) inverse transforms (exponentiates) the blended predictions.
Note, in particular, the lines defining zhat and yhat, which combine several static node operations.
@load RidgeRegressor
mutable struct KNNRidgeBlend <: DeterministicNetwork
    knn_model
    ridge_model
    weights::Tuple{Float64, Float64}
end
function MLJ.fit(model::KNNRidgeBlend, verbosity::Integer, X, y)
    Xs = source(X)
    ys = source(y)
    hot = machine(OneHotEncoder(), Xs)
    # W, z, zhat and yhat are nodes in the network:
    W = transform(hot, Xs) # one-hot encode the input
    z = log(ys) # transform the target
    ridge_model = model.ridge_model
    knn_model = model.knn_model
    ridge = machine(ridge_model, W, z)
    knn = machine(knn_model, W, z)
    # weighted average of the ridge and KNN predictions (weights come from the model)
    zhat = model.weights[1]*predict(ridge, W) + model.weights[2]*predict(knn, W)
    # inverse the target transformation
    yhat = exp(zhat)
    fit!(yhat, verbosity=0)
    return fitresults(Xs, ys, yhat)
end
using CSV
X, y = load_reduced_ames()()
knn_model = KNNRegressor(K=2)
ridge_model = RidgeRegressor(lambda=0.1)
weights = (0.9, 0.1)
blended_model = KNNRidgeBlend(knn_model, ridge_model, weights)
evaluate(blended_model, X, y, resampling=Holdout(fraction_train=0.7), measure=rmsl)
┌ Info: Evaluating using a holdout set.
│ fraction_train=0.7
│ shuffle=false
│ measure=MLJ.rmsl
│ operation=StatsBase.predict
└ Resampling from all rows.
mach = NodalMachine{OneHotEncoder} @ 1…14
mach = NodalMachine{RidgeRegressor} @ 1…87
mach = NodalMachine{KNNRegressor} @ 1…02
0.13108966715886725
A node method allows us to overload a given function to node arguments. Here are some examples taken from MLJ source (at work in the example above):
Base.log(v::Vector{<:Number}) = log.(v)
Base.log(X::AbstractNode) = node(log, X)
import Base.+
+(y1::AbstractNode, y2::AbstractNode) = node(+, y1, y2)
+(y1, y2::AbstractNode) = node(+, y1, y2)
+(y1::AbstractNode, y2) = node(+, y1, y2)
Here AbstractNode is the common supertype of Node and Source.
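To see why such overloads are enough to make arithmetic on nodes "just work", here is a small self-contained sketch in plain Julia. It is not MLJ's implementation: LazyAbstractNode, LazySource, LazyOp and lazynode are hypothetical stand-ins for AbstractNode, Source, Node and node, and only the no-argument call is modelled.

```julia
# Hypothetical toy types standing in for MLJ's AbstractNode, Source and Node.
abstract type LazyAbstractNode end

struct LazySource <: LazyAbstractNode
    data
end
(s::LazySource)() = s.data                 # calling a source returns its data

struct LazyOp <: LazyAbstractNode
    f::Function
    args::Tuple
end
# Calling a static node first calls each argument, then applies f to the results.
(n::LazyOp)() = n.f((a() for a in n.args)...)

# Toy analogue of `node(f, args...)` for static operations.
lazynode(f, args::LazyAbstractNode...) = LazyOp(f, args)

# Overload ordinary functions so that they build new nodes instead of computing.
Base.log(x::LazyAbstractNode) = lazynode(v -> log.(v), x)
Base.:+(a::LazyAbstractNode, b::LazyAbstractNode) = lazynode((u, v) -> u .+ v, a, b)
Base.:*(c::Real, x::LazyAbstractNode) = lazynode(v -> c .* v, x)

ys = LazySource([1.0, exp(1.0), exp(2.0)])
z = log(ys)                      # a node; nothing has been computed yet
blend = 0.9 * z + 0.1 * z        # still a node, built from three static operations
blend()                          # evaluates the whole chain; approximately [0.0, 1.0, 2.0]
```

The design point is that each overload returns a node rather than a value, so an expression such as the weighted average above is recorded as a graph and evaluated only when the final node is called.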
As a final example, here's how to extend row shuffling to nodes:
using Random
Random.shuffle(X::AbstractNode) = node(Y -> MLJ.selectrows(Y, Random.shuffle(1:nrows(Y))), X)
X = (x1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
x2 = [:one, :two, :three, :four, :five, :six, :seven, :eight, :nine, :ten])
Xs = source(X)
W = shuffle(Xs)
Node @ 9…86 = #4(6…62)
W()
(x1 = [1, 4, 3, 6, 8, 5, 7, 2, 9, 10],
x2 = Symbol[:one, :four, :three, :six, :eight, :five, :seven, :two, :nine, :ten],)
The learning network API
Three Julia types are part of learning networks: Source, Node and NodalMachine. A NodalMachine is returned by the machine constructor when given nodal arguments instead of concrete data.
The definitions of Node and NodalMachine are coupled because every NodalMachine has Node objects in its args field (the training arguments specified in the constructor) and every Node must specify a NodalMachine, unless it is static (see below).
Formally, a learning network defines two labeled directed acyclic graphs (DAGs) whose nodes are Node or Source objects, and whose labels are NodalMachine objects. We obtain the first DAG from directed edges of the form $N_1 \to N_2$ whenever $N_1$ is an argument of $N_2$ (see below). Only this DAG is relevant when calling a node, as discussed in examples above and below. To form the second DAG (relevant when calling fit! on a node) one adds edges $N_1 \to N_2$ for which $N_1$ is a training argument of the machine which labels $N_2$. We call the second, larger DAG the complete learning network below (but note that only edges of the smaller network are explicitly drawn in diagrams, for simplicity).
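As an illustration of the two graphs, the ridge network built earlier can be encoded with plain symbols and dictionaries. This is a hand-written toy encoding, not MLJ's internal representation; ARGS, TRAIN_ARGS, upstream_sources, toy_origins and toy_sources are names made up for the sketch.

```julia
# First DAG: an edge N1 -> N2 whenever N1 is an argument of N2.
const ARGS = Dict(
    :W    => [:Xs],     # W = transform(stand, Xs)
    :z    => [:ys],     # z = transform(box, ys)
    :zhat => [:W],      # zhat = predict(ridge, W)
    :yhat => [:zhat],   # yhat = inverse_transform(box, zhat)
)

# Training arguments of the machine labelling each node
# (stand labels W; box labels z and yhat; ridge labels zhat).
const TRAIN_ARGS = Dict(
    :W    => [:Xs],
    :z    => [:ys],
    :zhat => [:W, :z],
    :yhat => [:ys],
)

arg_parents(n) = get(ARGS, n, Symbol[])
all_parents(n) = union(arg_parents(n), get(TRAIN_ARGS, n, Symbol[]))

# Collect the parentless nodes (sources) upstream of `n` in the graph given by `parents`.
function upstream_sources(n::Symbol, parents::Function)
    ps = parents(n)
    isempty(ps) && return Set([n])
    return union((upstream_sources(p, parents) for p in ps)...)
end

toy_origins(n) = upstream_sources(n, arg_parents)   # first DAG only
toy_sources(n) = upstream_sources(n, all_parents)   # complete learning network

toy_origins(:yhat)   # Set([:Xs])
toy_sources(:yhat)   # Set([:Xs, :ys])
```

In this encoding yhat has a single origin, Xs, which is why it can be called on new input data, while training it requires both Xs and ys.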
Source nodes
Only source nodes reference concrete data. A Source object has a single field, data.
MLJ.source — Method.
MLJ.rebind! — Function.
rebind!(s, X)
Attach new data X to an existing source node s.
MLJ.sources — Function.
sources(N)
A vector of all sources referenced by calls N() and fit!(N). These are the sources of the directed acyclic graph associated with the learning network terminating at N.
Not to be confused with origins(N), which refers to the same graph with edges corresponding to training arguments deleted.
MLJ.origins — Function.
origins(N)
Return a list of all origins of a node N accessed by a call N(). These are the source nodes of the directed acyclic graph (DAG) associated with the learning network terminating at N, if edges corresponding to training arguments are excluded. A Node object cannot be called on new data unless it has a unique origin.
Not to be confused with sources(N), which refers to the same graph but without the training edge deletions.
origins(X)
Access the origins (source nodes) of a given node.
Nodal machines
The key components of a NodalMachine object are:
- A model, specifying a learning algorithm and hyperparameters.
- Training arguments, which specify the nodes acting as proxies for training data on calls to fit!.
- A fit-result, for storing the outcomes of calls to fit!.
A nodal machine is trained in the same way as a regular machine with one difference: instead of training the model on the wrapped data indexed on rows, it is trained on the wrapped nodes called on rows, with calling being a recursive operation on nodes within a learning network (see below).
Nodes
The key components of a Node are:
- An operation, which will either be static (a fixed function) or dynamic (such as predict or transform, dispatched on a NodalMachine).
- A nodal machine on which to dispatch the operation (void if the operation is static).
- Upstream connections to other nodes (including source nodes) specified by arguments (one for each argument of the operation).
- A dependency tape, listing all upstream nodes in the complete learning network, with an order consistent with the learning network as a DAG.
MLJ.node — Type.
N = node(f::Function, args...)
Defines a Node object N wrapping a static operation f and arguments args. Each of the n elements of args must be a Node or Source object. The node N has the following calling behaviour:
N() = f(args[1](), args[2](), ..., args[n]())
N(rows=r) = f(args[1](rows=r), args[2](rows=r), ..., args[n](rows=r))
N(X) = f(args[1](X), args[2](X), ..., args[n](X))
J = node(f, mach::NodalMachine, args...)
Defines a dynamic Node object J wrapping a dynamic operation f (predict, predict_mean, transform, etc), a nodal machine mach and arguments args. Its calling behaviour, which depends on the outcome of training mach (and, implicitly, on training outcomes affecting its arguments) is this:
J() = f(mach, args[1](), args[2](), ..., args[n]())
J(rows=r) = f(mach, args[1](rows=r), args[2](rows=r), ..., args[n](rows=r))
J(X) = f(mach, args[1](X), args[2](X), ..., args[n](X))
Generally n=1 or n=2 in this latter case.
predict(mach, X::AbstractNode, y::AbstractNode)
predict_mean(mach, X::AbstractNode, y::AbstractNode)
predict_median(mach, X::AbstractNode, y::AbstractNode)
predict_mode(mach, X::AbstractNode, y::AbstractNode)
transform(mach, X::AbstractNode)
inverse_transform(mach, X::AbstractNode)
Shortcuts for J = node(predict, mach, X, y), etc.
Calling a node is a recursive operation which terminates in the call to a source node (or nodes). Calling nodes on new data X fails unless the number of such source nodes is one.
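The three calling forms can be mimicked with callable structs in plain Julia. The sketch below is only an analogy under simplifying assumptions (one machine, one upstream source, vectors instead of tables); ToySource, ToyStandardizer, ToyNode, toytransform and toyfit! are hypothetical names and none of this is MLJ code.

```julia
using Statistics

# A toy source: calling it returns its data, optionally restricted to `rows`.
struct ToySource
    data::Vector{Float64}
end
(s::ToySource)(; rows=Colon()) = s.data[rows]

# A toy "machine" holding a learned mean and standard deviation.
mutable struct ToyStandardizer
    mu::Float64
    sigma::Float64
    trained::Bool
end
ToyStandardizer() = ToyStandardizer(0.0, 1.0, false)

toytransform(m::ToyStandardizer, x) = (x .- m.mu) ./ m.sigma

# A toy dynamic node: one machine and one upstream argument.
struct ToyNode
    machine::ToyStandardizer
    arg::ToySource
end

# Training obtains its data by calling the upstream argument on `rows`.
function toyfit!(n::ToyNode; rows=Colon())
    x = n.arg(rows=rows)
    n.machine.mu, n.machine.sigma = mean(x), std(x)
    n.machine.trained = true
    return n
end

# N() and N(rows=r): call the argument, then apply the operation.
function (n::ToyNode)(; rows=Colon())
    n.machine.trained || error("train the node before calling it")
    return toytransform(n.machine, n.arg(rows=rows))
end

# N(X): new data takes the place of the single source.
function (n::ToyNode)(Xnew::AbstractVector)
    n.machine.trained || error("train the node before calling it")
    return toytransform(n.machine, Xnew)
end

xs = ToySource([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
w = ToyNode(ToyStandardizer(), xs)   # defined before any training has happened
toyfit!(w, rows=1:4)                 # learns mean 2.5 from rows 1:4
w()                                  # all six rows, transformed
w(rows=5:6)                          # only rows 5 and 6
w([2.5])                             # new data; returns [0.0]
```

With more than one source upstream, the last form would be ambiguous about which source the new data should replace, which is the reason for the restriction stated above.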
StatsBase.fit! — Method.
fit!(N; rows, verbosity, force)
Train the machines of all dynamic nodes in the learning network terminating at N in an appropriate order.
StatsBase.fit! — Method.
fit!(mach::Machine; rows=nothing, verbosity=1, force=false)
When called for the first time, call MLJBase.fit on mach.model and store the returned fit-result and report. Subsequent calls do nothing unless: (i) force=true, or (ii) the specified rows are different from those used the last time a fit-result was computed, or (iii) mach.model has changed since the last time a fit-result was computed (the machine is stale). In cases (i) or (ii) MLJBase.fit is called on mach.model. Otherwise, MLJBase.update is called.
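The decision rule just described can be written down as a small pure function. This is a paraphrase of the docstring in plain Julia, not MLJ's code: fit_action and its keyword flags are hypothetical, and the sketch assumes the caller has already worked out whether the rows or the model have changed.

```julia
# Returns :fit, :update or :nothing according to the rule in the docstring above.
function fit_action(; first_call::Bool, force::Bool=false,
                      rows_changed::Bool=false, model_changed::Bool=false)
    first_call && return :fit                   # nothing to reuse yet
    (force || rows_changed) && return :fit      # cases (i) and (ii): fit from scratch
    model_changed && return :update             # case (iii) alone: try an update
    return :nothing                             # the machine is up to date
end

fit_action(first_call=true)                                        # :fit
fit_action(first_call=false)                                       # :nothing
fit_action(first_call=false, model_changed=true)                   # :update
fit_action(first_call=false, model_changed=true, rows_changed=true) # :fit
```

Note the precedence: a change of rows forces a full refit even when the model has also changed.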
fit!(mach::NodalMachine; rows=nothing, verbosity=1, force=false)
When called for the first time, attempt to call MLJBase.fit on mach.model. This will fail if an argument of the machine depends ultimately on some other untrained machine for successful calling, but this is resolved by instead calling fit! on any node N for which mach in machines(N) is true, which trains all necessary machines in an appropriate order. Subsequent fit! calls do nothing unless: (i) force=true, or (ii) some machine on which mach depends has computed a new fit-result since mach last computed its fit-result, or (iii) the specified rows have changed since the last time a fit-result was computed, or (iv) mach is stale (see below). In cases (i), (ii) or (iii), MLJBase.fit is called. Otherwise MLJBase.update is called.
A machine mach is stale if mach.model has changed since the last time a fit-result was computed, or if one of its training arguments is stale. A node N is stale if N.machine is stale or one of its arguments is stale. Source nodes are never stale.
Note that a nodal machine obtains its training data by calling its node arguments on the specified rows (rather than indexing its arguments on those rows) and that this calling is a recursive operation on nodes upstream of those arguments.
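The recursive definition of staleness translates almost word for word into code. The sketch below uses toy types invented for the purpose (SketchNode, SketchSource, SketchMachine, SketchDynamic, isstale) and a boolean flag in place of real change detection; it mirrors the ridge network from the tutorial but is not how MLJ stores its state.

```julia
abstract type SketchNode end

struct SketchSource <: SketchNode end

mutable struct SketchMachine
    model_changed::Bool          # stands in for "model has changed since the last fit"
    args::Vector{SketchNode}     # training arguments
end

struct SketchDynamic <: SketchNode
    machine::SketchMachine
    args::Vector{SketchNode}     # arguments of the operation
end

# Source nodes are never stale.
isstale(::SketchSource) = false
# A machine is stale if its model changed or a training argument is stale.
isstale(m::SketchMachine) = m.model_changed || any(isstale, m.args)
# A node is stale if its machine is stale or one of its arguments is stale.
isstale(n::SketchDynamic) = isstale(n.machine) || any(isstale, n.args)

# Mirror of the ridge network: W, z, zhat and yhat, with machines stand, box and ridge.
Xs, ys = SketchSource(), SketchSource()
stand = SketchMachine(false, [Xs])
W = SketchDynamic(stand, [Xs])
box = SketchMachine(false, [ys])
z = SketchDynamic(box, [ys])
ridge = SketchMachine(false, [W, z])
zhat = SketchDynamic(ridge, [W])
yhat = SketchDynamic(box, [zhat])

ridge.model_changed = true   # analogue of mutating ridge_model.lambda

isstale(stand), isstale(box), isstale(ridge)   # (false, false, true)
isstale(W), isstale(zhat), isstale(yhat)       # (false, true, true)
```

Staleness flows downstream only, which matches the tutorial's observation that changing the ridge hyperparameter leaves the standardizer and the Box-Cox transformer untouched.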
MLJ.@from_network — Macro.
@from_network NewCompositeModel(fld1=model1, fld2=model2, ...) <= (Xs, N)
@from_network NewCompositeModel(fld1=model1, fld2=model2, ...) <= (Xs, ys, N)
Create, respectively, a new stand-alone unsupervised or supervised model type NewCompositeModel using a learning network as a blueprint. Here Xs, ys and N refer to the input source node, target source node and terminating node of the network. The model type NewCompositeModel is equipped with fields named :fld1, :fld2, ..., which correspond to component models model1, model2, ... appearing in the network (which must therefore be elements of models(N)). Deep copies of the specified component models are used as default values in an automatically generated keyword constructor for NewCompositeModel.
Return value: A new NewCompositeModel instance, with default field values.
For details and examples refer to the "Learning Networks" section of the documentation.