NeuralNetworkBinaryClassifier
A model type for constructing a neural network binary classifier, based on MLJFlux.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
NeuralNetworkBinaryClassifier = @load NeuralNetworkBinaryClassifier pkg=MLJFlux
Do model = NeuralNetworkBinaryClassifier() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in NeuralNetworkBinaryClassifier(builder=...).
NeuralNetworkBinaryClassifier is for training a data-dependent Flux.jl neural network for making probabilistic predictions of a binary (Multiclass{2} or OrderedFactor{2}) target, given a table of Continuous features. Users provide a recipe for constructing the network, based on properties of the data encountered, by specifying an appropriate builder. See the MLJFlux documentation for more on builders.
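For example, using the builder types named below, a network with explicit hidden-layer sizes might be requested as follows (a sketch; the layer sizes, dropout fraction and activation are illustrative choices, not defaults):
import MLJFlux
import Flux

## one hidden layer of 32 neurons with 20% dropout (illustrative values):
builder = MLJFlux.Short(n_hidden=32, dropout=0.2, σ=Flux.relu)

## or a multilayer perceptron with two hidden layers:
builder = MLJFlux.MLP(hidden=(32, 16), σ=Flux.relu)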
In addition to features with Continuous scientific element type, this model supports categorical features in the input table. If present, such features are embedded into dense vectors by an additional EntityEmbedder layer after the input, as described in "Entity Embeddings of Categorical Variables" by Cheng Guo and Felix Berkhahn (arXiv, 2016).
Training data
In MLJ or MLJBase, bind an instance model to data with
mach = machine(model, X, y)
Here:
- X provides input features and is either: (i) a Matrix with Continuous element scitype (typically Float32); or (ii) a table of input features (eg, a DataFrame) whose columns have Continuous, Multiclass or OrderedFactor element scitype; check column scitypes with schema(X). If any Multiclass or OrderedFactor features appear, the constructed network will use an EntityEmbedder layer to transform them into dense vectors. If X is a Matrix, it is assumed that columns correspond to features and rows correspond to observations.
- y is the target, which can be any AbstractVector whose element scitype is Multiclass{2} or OrderedFactor{2}; check the scitype with scitype(y).
Train the machine with fit!(mach, rows=...).
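As a concrete sketch of preparing such data and checking the scitypes (the column names and values here are hypothetical):
using MLJ
import DataFrames

X = DataFrames.DataFrame(
    colour = ["red", "green", "red", "blue"],   ## categorical feature
    x1 = Float32[0.1, 0.3, 0.5, 0.7],           ## continuous features
    x2 = Float32[1.0, 2.0, 3.0, 4.0],
)
X = coerce(X, :colour => Multiclass)
schema(X)    ## confirm column scitypes

y = coerce(["yes", "no", "yes", "yes"], OrderedFactor)
scitype(y)   ## AbstractVector{OrderedFactor{2}}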
Hyper-parameters
- builder=MLJFlux.Short(): An MLJFlux builder that constructs a neural network. Possible builders include: MLJFlux.Linear, MLJFlux.Short, and MLJFlux.MLP. See the MLJFlux.jl documentation for examples of user-defined builders. See also finaliser below.
- optimiser::Flux.Adam(): A Flux.Optimise optimiser. The optimiser performs the updating of the weights of the network. For further reference, see the Flux optimiser documentation. To choose a learning rate (the update rate of the optimiser), a good rule of thumb is to start out at 10e-3, and tune using powers of 10 between 1 and 1e-7.
- loss=Flux.binarycrossentropy: The loss function which the network will optimize. Should be a function which can be called in the form loss(yhat, y). Possible loss functions are listed in the Flux loss function documentation. For a classification task, the most natural loss functions are:
  - Flux.binarycrossentropy: Standard binary classification loss, also known as the log loss.
  - Flux.logitbinarycrossentropy: Mathematically equivalent to binarycrossentropy, but numerically more stable than finalising the outputs with σ and then calculating binarycrossentropy. You will need to specify finaliser=identity to remove MLJFlux's default sigmoid finaliser, and understand that the output of predict is then unnormalized (no longer probabilistic).
  - Flux.tversky_loss: Used with imbalanced data to give more weight to false negatives.
  - Flux.binary_focal_loss: Used with highly imbalanced data. Weights harder examples more than easier examples.
  Currently MLJ measures are not supported values of loss.
- epochs::Int=10: The duration of training, in epochs. Typically, one epoch represents one pass through the complete training dataset.
- batch_size::Int=1: The batch size to be used for training, representing the number of samples per update of the network weights. Typically, batch size is between 8 and 512. Increasing batch size may accelerate training if acceleration=CUDALibs() and a GPU is available.
- lambda::Float64=0: The strength of the weight regularization penalty. Can be any value in the range [0, ∞).
- alpha::Float64=0: The L2/L1 mix of regularization, in the range [0, 1]. A value of 0 represents L2 regularization, and a value of 1 represents L1 regularization.
- rng::Union{AbstractRNG, Int64}: The random number generator or seed used during training.
- optimizer_changes_trigger_retraining::Bool=false: Defines what happens when re-fitting a machine if the associated optimiser has changed. If true, the associated machine will retrain from scratch on the fit! call; otherwise it will not.
- acceleration::AbstractResource=CPU1(): Defines on what hardware training is done. For training on GPU, use CUDALibs().
- finaliser=Flux.σ: The final activation function of the neural network (applied after the network defined by builder). Defaults to Flux.σ.
- embedding_dims: A Dict whose keys are names of categorical features, given as symbols, and whose values are numbers representing the desired dimensionality of the entity embeddings of such features: an integer value of 7, say, sets the embedding dimensionality to 7; a float value of 0.5, say, sets the embedding dimensionality to ceil(0.5 * c), where c is the number of feature levels. Unspecified feature dimensionality defaults to min(c - 1, 10).
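To illustrate, a model overriding several of these defaults might be constructed as follows (a sketch; the particular values are illustrative, the model type is assumed already loaded via @load as above, and the :colour key assumes a categorical feature of that name):
import MLJFlux
import Optimisers

clf = NeuralNetworkBinaryClassifier(
    builder = MLJFlux.MLP(hidden=(16, 8)),   ## two hidden layers
    optimiser = Optimisers.Adam(0.001),      ## learning rate 1e-3
    batch_size = 32,
    epochs = 50,
    lambda = 0.01,                           ## weight regularization strength
    rng = 123,                               ## integer seed for reproducibility
    embedding_dims = Dict(:colour => 2),     ## embed hypothetical :colour feature in 2 dimensions
)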
Operations
- predict(mach, Xnew): return predictions of the target given new features Xnew, which should have the same scitype as X above. Predictions are probabilistic but uncalibrated.
- predict_mode(mach, Xnew): Return the modes of the probabilistic predictions returned above.
- transform(mach, Xnew): Assuming Xnew has the same schema as X, transform the categorical features of Xnew into dense Continuous vectors using the MLJFlux.EntityEmbedder layer present in the network. Does nothing in case the model was trained on an input X that lacks categorical features.
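Assuming a machine mach trained as in the Training data section, these operations might be used as follows (a sketch; the class label "yes" refers to the hypothetical target sketched earlier):
Xnew = X[1:3, :]             ## any table with the same schema as X

yhat = predict(mach, Xnew)   ## vector of UnivariateFinite distributions
pdf.(yhat, "yes")            ## probabilities assigned to the "yes" class
predict_mode(mach, Xnew)     ## most probable classes
transform(mach, Xnew)        ## categorical features replaced by dense embedding vectors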
Fitted parameters
The fields of fitted_params(mach) are:
- chain: The trained "chain" (Flux.jl model), namely the series of layers, functions, and activations which make up the neural network. This includes the final layer specified by finaliser (eg, σ).
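The chain can be applied directly to a Float32 matrix whose columns are observations, for example (a sketch, assuming the training data had no categorical features and n_features input features):
chain = fitted_params(mach).chain
chain(rand(Float32, n_features, 5))   ## outputs (after the finaliser) for 5 random observations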
Report
The fields of report(mach) are:
- training_losses: A vector of training losses (penalised if lambda != 0) in historical order, of length epochs + 1. The first element is the pre-training loss.
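For example, the losses can be retrieved and plotted after training (a sketch):
losses = report(mach).training_losses

using Plots
plot(0:length(losses) - 1, losses, xlab="epoch", ylab="training loss")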
Examples
In this example we build a classification model using the mtcars dataset, with the binary variable VS as target. This is a very basic example, using a default builder and no standardization. For a more advanced illustration, see NeuralNetworkRegressor or ImageClassifier, and examples in the MLJFlux.jl documentation.
using MLJ, Flux
import Optimisers
import RDatasets
First, we can load the data:
mtcars = RDatasets.dataset("datasets", "mtcars");
y, X = unpack(mtcars, ==(:VS), in([:MPG, :Cyl, :Disp, :HP, :WT, :QSec]));
Note that y is a vector and X a table.
y = categorical(y) ## classifier takes categorical input
X_f32 = Float32.(X) ## To match floating point type of the neural network layers
NeuralNetworkBinaryClassifier = @load NeuralNetworkBinaryClassifier pkg=MLJFlux
bclf = NeuralNetworkBinaryClassifier()
Next, we can train the model:
mach = machine(bclf, X_f32, y)
fit!(mach)
We can train the model in an incremental fashion, altering the learning rate as we go, provided optimizer_changes_trigger_retraining is false (the default). Here, we also change the number of (total) iterations:
julia> bclf.optimiser
Adam(0.001, (0.9, 0.999), 1.0e-8)
bclf.optimiser = Optimisers.Adam(eta = bclf.optimiser.eta * 2)
bclf.epochs = bclf.epochs + 5
fit!(mach, verbosity=2) ## trains 5 more epochs
We can inspect the mean training loss using the cross_entropy function:
training_loss = cross_entropy(predict(mach, X_f32), y)
And we can access the Flux chain (model) using fitted_params:
chain = fitted_params(mach).chain
Finally, we can see how the out-of-sample performance changes over time, using MLJ's learning_curve function:
r = range(bclf, :epochs, lower=1, upper=200, scale=:log10)
curve = learning_curve(
bclf,
X_f32,
y,
range=r,
resampling=Holdout(fraction_train=0.7),
measure=cross_entropy,
)
using Plots
plot(
curve.parameter_values,
curve.measurements,
xlab=curve.parameter_name,
xscale=curve.parameter_scale,
ylab = "Cross Entropy",
)
See also ImageClassifier.