Transformers and Other Unsupervised Models
Several unsupervised models used to perform common transformations, such as one-hot encoding, missing value imputation, and categorical encoding, are available in MLJ out-of-the-box (no need to load code with @load). They are detailed in Built-in transformers below.
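For example, a built-in transformer can be bound to data and used immediately. A minimal sketch (the toy table is illustrative):
using MLJ

X = (x1 = [10.0, 20.0, 30.0], x2 = [4.0, 6.0, 8.0])  # toy table
mach = machine(Standardizer(), X) |> fit!  # no @load required
W = transform(mach, X)                     # x1, x2 rescaled to mean 0, std 1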
A transformer is static if it has no learned parameters. While such a transformer is tantamount to an ordinary function, realizing it as an MLJ static transformer (a subtype of Static <: Unsupervised) can be useful, especially if the function depends on parameters the user would like to manipulate (which become hyper-parameters of the model). The necessary syntax for defining your own static transformers is described in Static transformers below.
Some unsupervised models, such as clustering algorithms, have a predict method in addition to a transform method. We give an example of this in Transformers that also predict below.
Built-in transformers
For tutorials on the transformers below, refer to the MLJTransforms documentation.
Transformer | Brief Description |
---|---|
Standardizer | Standardize (whiten) columns of numerical features |
UnivariateBoxCoxTransformer | Apply a Box-Cox transformation to a single vector |
InteractionTransformer | Create new interaction features from columns of numerical features |
UnivariateDiscretizer | Discretize a continuous vector into an ordered factor |
FillImputer | Fill in missing values of features belonging to any scientific type |
UnivariateFillImputer | Fill in missing values in a single vector |
UnivariateTimeTypeToContinuous | Transform a vector of time type into continuous type |
OneHotEncoder | Encode categorical variables into one-hot vectors |
ContinuousEncoder | Add type casting functionality to OneHotEncoder |
OrdinalEncoder | Encode categorical variables into ordered integers |
FrequencyEncoder | Encode categorical variables into their normalized or unnormalized frequencies |
TargetEncoder | Encode categorical variables into relevant target statistics |
ContrastEncoder | Allows defining a custom contrast encoder via a contrast matrix |
CardinalityReducer | Reduce cardinality of high cardinality categorical features by grouping infrequent categories |
MissingnessEncoder | Encode missing values of categorical features into new values |
Static transformers
A static transformer is a model for transforming data that does not generalize to new data (does not "learn") but which nevertheless has hyper-parameters. For example, the DBSCAN clustering model from Clustering.jl can assign labels to some collection of observations, but cannot directly assign a label to a new observation.
The general user may define their own static models. The main use-case is the insertion of some parameter-dependent transformation into Linear Pipelines. (If a static transformer has no hyper-parameters, it is tantamount to an ordinary function. An ordinary function can be inserted directly into a pipeline; the situation for learning networks is only slightly more complicated.)
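For instance, an ordinary function can be dropped straight into a pipeline. A minimal sketch (the coercion and feature name :x1 are illustrative):
using MLJ

# an ordinary (parameter-free) function inserted directly into a pipeline,
# followed by a built-in transformer:
pipe = (X -> coerce(X, :x1 => Continuous)) |> OneHotEncoder()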
The following example defines a new model type Averager to perform the weighted average of two vectors (target predictions, for example). We suppose the weighting is normalized, and therefore controlled by a single hyper-parameter, mix.
mutable struct Averager <: Static
mix::Float64
end
MLJ.transform(a::Averager, _, y1, y2) = (1 - a.mix)*y1 + a.mix*y2
Important. Note the sub-typing <: Static.
Such static transformers with (unlearned) parameters can have arbitrarily many inputs, but only one output. In the single input case, an inverse_transform can also be defined. Since they have no real learned parameters, you bind a static transformer to a machine without specifying training arguments; there is no need to fit! the machine:
mach = machine(Averager(0.5))
transform(mach, [1, 2, 3], [3, 2, 1])
3-element Vector{Float64}:
2.0
2.0
2.0
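As noted above, in the single input case an inverse_transform can also be defined. A minimal sketch, using a hypothetical Scaler type:
mutable struct Scaler <: Static
    factor::Float64
end

MLJ.transform(s::Scaler, _, x) = s.factor .* x
MLJ.inverse_transform(s::Scaler, _, z) = z ./ s.factor

mach = machine(Scaler(2.0))
z = transform(mach, [1.0, 2.0, 3.0])  # [2.0, 4.0, 6.0]
inverse_transform(mach, z)            # recovers [1.0, 2.0, 3.0]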
Let's see how we can include our Averager in a learning network to mix the predictions of two regressors, with one-hot encoding of the inputs. Here are two regressors for mixing, and some dummy data for testing our learning network:
ridge = (@load RidgeRegressor pkg=MultivariateStats)()
knn = (@load KNNRegressor)()
import Random.seed!
seed!(112)
X = (
x1=coerce(rand("ab", 100), Multiclass),
x2=rand(100),
)
y = X.x2 + 0.05*rand(100)
schema(X)
┌───────┬───────────────┬────────────────────────────────┐
│ names │ scitypes │ types │
├───────┼───────────────┼────────────────────────────────┤
│ x1 │ Multiclass{2} │ CategoricalValue{Char, UInt32} │
│ x2 │ Continuous │ Float64 │
└───────┴───────────────┴────────────────────────────────┘
And the learning network:
Xs = source(X)
ys = source(y)
averager = Averager(0.5)
mach0 = machine(OneHotEncoder(), Xs)
W = transform(mach0, Xs) # one-hot encode the input
mach1 = machine(ridge, W, ys)
y1 = predict(mach1, W)
mach2 = machine(knn, W, ys)
y2 = predict(mach2, W)
mach4 = machine(averager)
yhat = transform(mach4, y1, y2)
# test:
fit!(yhat)
Xnew = selectrows(X, 1:3)
yhat(Xnew)
3-element Vector{Float64}:
0.6403223210037916
0.9607694439597683
0.8159225346205365
We next "export" the learning network as a standalone composite model type. First we need a struct for the composite model. Since we are restricting to Deterministic
component regressors, the composite will also make deterministic predictions, and so gets the supertype DeterministicNetworkComposite
:
mutable struct DoubleRegressor <: DeterministicNetworkComposite
regressor1
regressor2
averager
end
As described in Learning Networks, we next paste the learning network into a prefit declaration, replace the component models with symbolic placeholders, and add a learning network "interface":
import MLJBase
function MLJBase.prefit(composite::DoubleRegressor, verbosity, X, y)
Xs = source(X)
ys = source(y)
mach0 = machine(OneHotEncoder(), Xs)
W = transform(mach0, Xs) # one-hot encode the input
mach1 = machine(:regressor1, W, ys)
y1 = predict(mach1, W)
mach2 = machine(:regressor2, W, ys)
y2 = predict(mach2, W)
mach4 = machine(:averager)
yhat = transform(mach4, y1, y2)
# learning network interface:
(; predict=yhat)
end
The new model type can be evaluated like any other supervised model:
X, y = @load_reduced_ames;
composite = DoubleRegressor(ridge, knn, Averager(0.5))
DoubleRegressor(
regressor1 = RidgeRegressor(
lambda = 1.0,
bias = true),
regressor2 = KNNRegressor(
K = 5,
algorithm = :kdtree,
metric = Distances.Euclidean(0.0),
leafsize = 10,
reorder = true,
weights = NearestNeighborModels.Uniform()),
averager = Averager(
mix = 0.5))
composite.averager.mix = 0.25 # adjust mix from default of 0.5
evaluate(composite, X, y, measure=l1)
PerformanceEvaluation object with these fields:
model, measure, operation,
measurement, per_fold, per_observation,
fitted_params_per_fold, report_per_fold,
train_test_rows, resampling, repeats
Extract:
┌──────────┬───────────┬─────────────┐
│ measure │ operation │ measurement │
├──────────┼───────────┼─────────────┤
│ LPLoss( │ predict │ 17200.0 │
│ p = 1) │ │ │
└──────────┴───────────┴─────────────┘
┌────────────────────────────────────────────────────────┬─────────┐
│ per_fold │ 1.96*SE │
├────────────────────────────────────────────────────────┼─────────┤
│ [15200.0, 15800.0, 18500.0, 16400.0, 18600.0, 18500.0] │ 1350.0 │
└────────────────────────────────────────────────────────┴─────────┘
A static transformer can also expose byproducts of the transform computation in the report of any associated machine. See Static transformers for details.
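Here is a minimal sketch of the mechanism, assuming the reporting_operations trait provided by MLJModelInterface (an MLJ dependency); the Binarizer type is hypothetical:
import MLJModelInterface

mutable struct Binarizer <: Static
    threshold::Float64
end

# declare that `transform` also returns a report:
MLJModelInterface.reporting_operations(::Type{<:Binarizer}) = (:transform,)

# `transform` now returns an (output, report) tuple:
function MLJ.transform(b::Binarizer, _, v)
    output = map(x -> x > b.threshold, v)
    return output, (n_positive = count(output),)
end

mach = machine(Binarizer(0.5))
transform(mach, [0.2, 0.7, 0.9])  # Bool[0, 1, 1]
report(mach)                      # exposes the n_positive byproduct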
Transformers that also predict
Some clustering algorithms learn to label data by identifying a collection of "centroids" in the training data. Any new input observation is labeled with the cluster to which it is closest (this is the output of predict) while the vector of all distances from the centroids defines a lower-dimensional representation of the observation (the output of transform). In the following example a K-means clustering algorithm assigns one of three labels 1, 2, 3 to the input features of the iris data set and compares them with the actual species recorded in the target (not seen by the algorithm).
import Random.seed!
seed!(123)
X, y = @load_iris
KMeans = @load KMeans pkg=Clustering
kmeans = KMeans()
mach = machine(kmeans, X) |> fit!
[ Info: For silent loading, specify `verbosity=0`.
import MLJClusteringInterface ✔
[ Info: Training machine(KMeans(k = 3, …), …).
Transforming:
Xsmall = transform(mach)
selectrows(Xsmall, 1:4) |> pretty
┌────────────┬────────────┬────────────┐
│ x1 │ x2 │ x3 │
│ Float64 │ Float64 │ Float64 │
│ Continuous │ Continuous │ Continuous │
├────────────┼────────────┼────────────┤
│ 11.6913 │ 0.021592 │ 25.599 │
│ 11.5503 │ 0.191992 │ 26.1626 │
│ 12.7403 │ 0.169992 │ 27.8716 │
│ 11.7129 │ 0.269192 │ 26.5595 │
└────────────┴────────────┴────────────┘
Predicting:
yhat = predict(mach)
compare = zip(yhat, y) |> collect
150-element Vector{Tuple{CategoricalValue{Int64, UInt32}, CategoricalValue{String, UInt32}}}:
(2, "setosa")
(2, "setosa")
(2, "setosa")
(2, "setosa")
(2, "setosa")
(2, "setosa")
(2, "setosa")
(2, "setosa")
(2, "setosa")
(2, "setosa")
⋮
(3, "virginica")
(1, "virginica")
(3, "virginica")
(3, "virginica")
(3, "virginica")
(1, "virginica")
(3, "virginica")
(3, "virginica")
(1, "virginica")
compare[1:8]
8-element Vector{Tuple{CategoricalValue{Int64, UInt32}, CategoricalValue{String, UInt32}}}:
(2, "setosa")
(2, "setosa")
(2, "setosa")
(2, "setosa")
(2, "setosa")
(2, "setosa")
(2, "setosa")
(2, "setosa")
compare[51:58]
8-element Vector{Tuple{CategoricalValue{Int64, UInt32}, CategoricalValue{String, UInt32}}}:
(1, "versicolor")
(1, "versicolor")
(3, "versicolor")
(1, "versicolor")
(1, "versicolor")
(1, "versicolor")
(1, "versicolor")
(1, "versicolor")
compare[101:108]
8-element Vector{Tuple{CategoricalValue{Int64, UInt32}, CategoricalValue{String, UInt32}}}:
(3, "virginica")
(1, "virginica")
(3, "virginica")
(3, "virginica")
(3, "virginica")
(3, "virginica")
(1, "virginica")
(3, "virginica")
Reference
MLJTransforms.Standardizer — Type
A model type for constructing a standardizer, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
Standardizer = @load Standardizer pkg=MLJTransforms
Do model = Standardizer() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in Standardizer(features=...).
Use this model to standardize (whiten) a Continuous vector, or relevant columns of a table. The rescalings applied by this transformer to new data are always those learned during the training phase, which are generally different from what would actually standardize the new data.
Training data
In MLJ or MLJBase, bind an instance model to data with
mach = machine(model, X)
where
- X: any Tables.jl compatible table or any abstract vector with Continuous element scitype (any abstract float vector). Only features in a table with Continuous scitype can be standardized; check column scitypes with schema(X).
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- features: one of the following, with the behavior indicated below:
  - [] (empty, the default): standardize all features (columns) having Continuous element scitype
  - non-empty vector of feature names (symbols): standardize only the Continuous features in the vector (if ignore=false) or Continuous features not named in the vector (ignore=true)
  - function or other callable: standardize a feature if the callable returns true on its name. For example, Standardizer(features = name -> name in [:x1, :x3], ignore = true, count=true) has the same effect as Standardizer(features = [:x1, :x3], ignore = true, count=true), namely to standardize all Continuous and Count features, with the exception of :x1 and :x3.
  Note this behavior is further modified if the ordered_factor or count flags are set to true; see below.
- ignore=false: whether to ignore or standardize specified features, as explained above
- ordered_factor=false: if true, standardize any OrderedFactor feature wherever a Continuous feature would be standardized, as described above
- count=false: if true, standardize any Count feature wherever a Continuous feature would be standardized, as described above
Operations
- transform(mach, Xnew): return Xnew with relevant features standardized according to the rescalings learned during fitting of mach
- inverse_transform(mach, Z): apply the inverse transformation to Z, so that inverse_transform(mach, transform(mach, Xnew)) is approximately the same as Xnew; unavailable if ordered_factor or count flags were set to true
Fitted parameters
The fields of fitted_params(mach) are:
- features_fit: the names of features that will be standardized
- means: the corresponding untransformed mean values
- stds: the corresponding untransformed standard deviations
Report
The fields of report(mach) are:
- features_fit: the names of features that will be standardized
Examples
using MLJ
X = (ordinal1 = [1, 2, 3],
ordinal2 = coerce([:x, :y, :x], OrderedFactor),
ordinal3 = [10.0, 20.0, 30.0],
ordinal4 = [-20.0, -30.0, -40.0],
nominal = coerce(["Your father", "he", "is"], Multiclass));
julia> schema(X)
┌──────────┬──────────────────┐
│ names │ scitypes │
├──────────┼──────────────────┤
│ ordinal1 │ Count │
│ ordinal2 │ OrderedFactor{2} │
│ ordinal3 │ Continuous │
│ ordinal4 │ Continuous │
│ nominal │ Multiclass{3} │
└──────────┴──────────────────┘
stand1 = Standardizer();
julia> transform(fit!(machine(stand1, X)), X)
(ordinal1 = [1, 2, 3],
ordinal2 = CategoricalValue{Symbol,UInt32}[:x, :y, :x],
ordinal3 = [-1.0, 0.0, 1.0],
ordinal4 = [1.0, 0.0, -1.0],
nominal = CategoricalValue{String,UInt32}["Your father", "he", "is"],)
stand2 = Standardizer(features=[:ordinal3, ], ignore=true, count=true);
julia> transform(fit!(machine(stand2, X)), X)
(ordinal1 = [-1.0, 0.0, 1.0],
ordinal2 = CategoricalValue{Symbol,UInt32}[:x, :y, :x],
ordinal3 = [10.0, 20.0, 30.0],
ordinal4 = [1.0, 0.0, -1.0],
nominal = CategoricalValue{String,UInt32}["Your father", "he", "is"],)
See also OneHotEncoder, ContinuousEncoder.
MLJTransforms.UnivariateBoxCoxTransformer — Type
A model type for constructing a single variable Box-Cox transformer, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
UnivariateBoxCoxTransformer = @load UnivariateBoxCoxTransformer pkg=MLJTransforms
Do model = UnivariateBoxCoxTransformer() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in UnivariateBoxCoxTransformer(n=...).
Box-Cox transformations attempt to make data look more normally distributed. This can improve performance and assist in the interpretation of models which suppose that data is generated by a normal distribution.
A Box-Cox transformation (with shift) is of the form
x -> ((x + c)^λ - 1)/λ
for some constant c and real λ, unless λ = 0, in which case the above is replaced with
x -> log(x + c)
Given user-specified hyper-parameters n::Integer and shift::Bool, the present implementation learns the parameters c and λ from the training data as follows: If shift=true and zeros are encountered in the data, then c is set to 0.2 times the data mean. If there are no zeros, then no shift is applied. Finally, n different values of λ between -0.4 and 3 are considered, with λ fixed to the value maximizing normality of the transformed data.
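A rough sketch of such a grid search; the normality score here (a probability-plot correlation) is illustrative and may differ from the package's criterion, and Distributions is an extra dependency of the sketch:
using Distributions, Statistics

# Box-Cox transform as defined above:
boxcox(x, λ, c=0.0) = λ == 0 ? log.(x .+ c) : ((x .+ c).^λ .- 1) ./ λ

# score normality as the correlation between the sorted transformed data
# and standard normal quantiles:
function normality(z)
    n = length(z)
    return cor(sort(z), quantile.(Normal(), ((1:n) .- 0.5) ./ n))
end

x = rand(1000) .+ 0.1             # positive toy data
λs = range(-0.4, 3, length=171)   # a grid of n=171 candidate exponents
λ_best = λs[argmax([normality(boxcox(x, λ)) for λ in λs])]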
Reference: Wikipedia entry for power transform.
Training data
In MLJ or MLJBase, bind an instance model to data with
mach = machine(model, x)
where
- x: any abstract vector with element scitype Continuous; check the scitype with scitype(x)
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- n=171: number of values of the exponent λ to try
- shift=false: whether to include a preliminary constant translation in transformations, in the presence of zeros
Operations
- transform(mach, xnew): apply the Box-Cox transformation learned when fitting mach
- inverse_transform(mach, z): reconstruct the vector whose transformation learned by mach is z
Fitted parameters
The fields of fitted_params(mach) are:
- λ: the learned Box-Cox exponent
- c: the learned shift
Examples
using MLJ
using UnicodePlots
using Random
Random.seed!(123)
transf = UnivariateBoxCoxTransformer()
x = randn(1000).^2
mach = machine(transf, x)
fit!(mach)
z = transform(mach, x)
julia> histogram(x)
┌ ┐
[ 0.0, 2.0) ┤███████████████████████████████████ 848
[ 2.0, 4.0) ┤████▌ 109
[ 4.0, 6.0) ┤█▍ 33
[ 6.0, 8.0) ┤▍ 7
[ 8.0, 10.0) ┤▏ 2
[10.0, 12.0) ┤ 0
[12.0, 14.0) ┤▏ 1
└ ┘
Frequency
julia> histogram(z)
┌ ┐
[-5.0, -4.0) ┤█▎ 8
[-4.0, -3.0) ┤████████▊ 64
[-3.0, -2.0) ┤█████████████████████▊ 159
[-2.0, -1.0) ┤█████████████████████████████▊ 216
[-1.0, 0.0) ┤███████████████████████████████████ 254
[ 0.0, 1.0) ┤█████████████████████████▊ 188
[ 1.0, 2.0) ┤████████████▍ 90
[ 2.0, 3.0) ┤██▊ 20
[ 3.0, 4.0) ┤▎ 1
└ ┘
Frequency
MLJTransforms.InteractionTransformer — Type
A model type for constructing an interaction transformer, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
InteractionTransformer = @load InteractionTransformer pkg=MLJTransforms
Do model = InteractionTransformer() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in InteractionTransformer(order=...).
Generates all polynomial interaction terms up to the given order for the subset of chosen columns. Any column that contains elements with scitype <:Infinite is a valid basis to generate interactions. If features is not specified, all such columns with scitype <:Infinite in the table are used as a basis.
In MLJ or MLJBase, you can transform features X with the single call
transform(machine(model), X)
See also the example below.
Hyper-parameters
- order: Maximum order of interactions to be generated.
- features: Restricts interaction generation to those columns.
Operations
- transform(machine(model), X): Generates polynomial interaction terms out of table X using the hyper-parameters specified in model.
Example
using MLJ
X = (
A = [1, 2, 3],
B = [4, 5, 6],
C = [7, 8, 9],
D = ["x₁", "x₂", "x₃"]
)
it = InteractionTransformer(order=3)
mach = machine(it)
julia> transform(mach, X)
(A = [1, 2, 3],
B = [4, 5, 6],
C = [7, 8, 9],
D = ["x₁", "x₂", "x₃"],
A_B = [4, 10, 18],
A_C = [7, 16, 27],
B_C = [28, 40, 54],
A_B_C = [28, 80, 162],)
it = InteractionTransformer(order=2, features=[:A, :B])
mach = machine(it)
julia> transform(mach, X)
(A = [1, 2, 3],
B = [4, 5, 6],
C = [7, 8, 9],
D = ["x₁", "x₂", "x₃"],
A_B = [4, 10, 18],)
MLJTransforms.UnivariateDiscretizer — Type
A model type for constructing a single variable discretizer, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
UnivariateDiscretizer = @load UnivariateDiscretizer pkg=MLJTransforms
Do model = UnivariateDiscretizer() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in UnivariateDiscretizer(n_classes=...).
Discretization converts a Continuous vector into an OrderedFactor vector. In particular, the output is a CategoricalVector (whose reference type is optimized).
The transformation is chosen so that the vector on which the transformer is fit has, in transformed form, an approximately uniform distribution of values. Specifically, if n_classes is the level of discretization, then 2*n_classes - 1 ordered quantiles are computed, the odd quantiles being used for transforming (discretization) and the even quantiles for inverse transforming.
Training data
In MLJ or MLJBase, bind an instance model to data with
mach = machine(model, x)
where
- x: any abstract vector with Continuous element scitype; check scitype with scitype(x).
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- n_classes: number of discrete classes in the output
Operations
- transform(mach, xnew): discretize xnew according to the discretization learned when fitting mach
- inverse_transform(mach, z): attempt to reconstruct from z a vector that transforms to give z
Fitted parameters
The fields of fitted_params(mach).fitresult include:
- odd_quantiles: quantiles used for transforming (length is n_classes - 1)
- even_quantiles: quantiles used for inverse transforming (length is n_classes)
Example
using MLJ
using Random
Random.seed!(123)
discretizer = UnivariateDiscretizer(n_classes=100)
mach = machine(discretizer, randn(1000))
fit!(mach)
julia> x = rand(5)
5-element Vector{Float64}:
0.8585244609846809
0.37541692370451396
0.6767070590395461
0.9208844241267105
0.7064611415680901
julia> z = transform(mach, x)
5-element CategoricalArrays.CategoricalArray{UInt8,1,UInt8}:
0x52
0x42
0x4d
0x54
0x4e
x_approx = inverse_transform(mach, z)
julia> x - x_approx
5-element Vector{Float64}:
0.008224506144777322
0.012731354778359405
0.0056265330571125816
0.005738175684445124
0.006835652575801987
MLJTransforms.FillImputer — Type
A model type for constructing a fill imputer, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
FillImputer = @load FillImputer pkg=MLJTransforms
Do model = FillImputer() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in FillImputer(features=...).
Use this model to impute missing values in tabular data. A fixed "filler" value is learned from the training data, one for each column of the table.
For imputing missing values in a vector, use UnivariateFillImputer instead.
Training data
In MLJ or MLJBase, bind an instance model to data with
mach = machine(model, X)
where
- X: any table of input features (eg, a DataFrame) whose features each have element scitypes Union{Missing, T}, where T is a subtype of Continuous, Multiclass, OrderedFactor or Count. Check scitypes with schema(X).
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- features: a vector of names of features (symbols) for which imputation is to be attempted; default is empty, which is interpreted as "impute all".
- continuous_fill: function or other callable to determine value to be imputed in the case of Continuous (abstract float) data; default is to apply median after skipping missing values
- count_fill: function or other callable to determine value to be imputed in the case of Count (integer) data; default is to apply rounded median after skipping missing values
- finite_fill: function or other callable to determine value to be imputed in the case of Multiclass or OrderedFactor data (categorical vectors); default is to apply mode after skipping missing values
Operations
- transform(mach, Xnew): return Xnew with missing values imputed with the fill values learned when fitting mach
Fitted parameters
The fields of fitted_params(mach) are:
- features_seen_in_fit: the names of features (columns) encountered during training
- univariate_transformer: the univariate model applied to determine the fillers (its fields contain the functions defining the filler computations)
- filler_given_feature: dictionary of filler values, keyed on feature (column) names
Examples
using MLJ
imputer = FillImputer()
X = (a = [1.0, 2.0, missing, 3.0, missing],
b = coerce(["y", "n", "y", missing, "y"], Multiclass),
c = [1, 1, 2, missing, 3])
schema(X)
julia> schema(X)
┌───────┬───────────────────────────────┐
│ names │ scitypes │
├───────┼───────────────────────────────┤
│ a │ Union{Missing, Continuous} │
│ b │ Union{Missing, Multiclass{2}} │
│ c │ Union{Missing, Count} │
└───────┴───────────────────────────────┘
mach = machine(imputer, X)
fit!(mach)
julia> fitted_params(mach).filler_given_feature
Dict{Symbol, Any} with 3 entries:
:a => 2.0
:b => "y"
:c => 2
julia> transform(mach, X)
(a = [1.0, 2.0, 2.0, 3.0, 2.0],
b = CategoricalValue{String, UInt32}["y", "n", "y", "y", "y"],
c = [1, 1, 2, 2, 3],)
See also UnivariateFillImputer.
MLJTransforms.UnivariateFillImputer — Type
A model type for constructing a single variable fill imputer, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
UnivariateFillImputer = @load UnivariateFillImputer pkg=MLJTransforms
Do model = UnivariateFillImputer() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in UnivariateFillImputer(continuous_fill=...).
Use this model to impute missing values in a vector with a fixed value learned from the non-missing values of the training vector.
For imputing missing values in tabular data, use FillImputer instead.
Training data
In MLJ or MLJBase, bind an instance model to data with
mach = machine(model, x)
where
- x: any abstract vector with element scitype Union{Missing, T} where T is a subtype of Continuous, Multiclass, OrderedFactor or Count; check scitype using scitype(x)
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- continuous_fill: function or other callable to determine value to be imputed in the case of Continuous (abstract float) data; default is to apply median after skipping missing values
- count_fill: function or other callable to determine value to be imputed in the case of Count (integer) data; default is to apply rounded median after skipping missing values
- finite_fill: function or other callable to determine value to be imputed in the case of Multiclass or OrderedFactor data (categorical vectors); default is to apply mode after skipping missing values
Operations
- transform(mach, xnew): return xnew with missing values imputed with the fill values learned when fitting mach
Fitted parameters
The fields of fitted_params(mach) are:
- filler: the fill value to be imputed in all new data
Examples
using MLJ
imputer = UnivariateFillImputer()
x_continuous = [1.0, 2.0, missing, 3.0]
x_multiclass = coerce(["y", "n", "y", missing, "y"], Multiclass)
x_count = [1, 1, 1, 2, missing, 3, 3]
mach = machine(imputer, x_continuous)
fit!(mach)
julia> fitted_params(mach)
(filler = 2.0,)
julia> transform(mach, [missing, missing, 101.0])
3-element Vector{Float64}:
2.0
2.0
101.0
mach2 = machine(imputer, x_multiclass) |> fit!
julia> transform(mach2, x_multiclass)
5-element CategoricalArray{String,1,UInt32}:
"y"
"n"
"y"
"y"
"y"
mach3 = machine(imputer, x_count) |> fit!
julia> transform(mach3, [missing, missing, 5])
3-element Vector{Int64}:
2
2
5
For imputing tabular data, use FillImputer.
MLJTransforms.UnivariateTimeTypeToContinuous — Type
A model type for constructing a single variable transformer that creates continuous representations of temporally typed data, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
UnivariateTimeTypeToContinuous = @load UnivariateTimeTypeToContinuous pkg=MLJTransforms
Do model = UnivariateTimeTypeToContinuous() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in UnivariateTimeTypeToContinuous(zero_time=...).
Use this model to convert vectors with a TimeType element type to vectors of Float64 type (Continuous element scitype).
Training data
In MLJ or MLJBase, bind an instance model to data with
mach = machine(model, x)
where
- x: any abstract vector whose element type is a subtype of Dates.TimeType
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- zero_time: the time that is to correspond to 0.0 under transformations, with the type coinciding with the training data element type. If unspecified, the earliest time encountered in training is used.
- step::Period=Hour(24): time interval to correspond to one unit under transformation
Operations
- transform(mach, xnew): apply the encoding inferred when mach was fit
Fitted parameters
fitted_params(mach).fitresult is the tuple (zero_time, step) actually used in transformations, which may differ from the user-specified hyper-parameters.
Example
using MLJ
using Dates
x = [Date(2001, 1, 1) + Day(i) for i in 0:4]
encoder = UnivariateTimeTypeToContinuous(zero_time=Date(2000, 1, 1),
step=Week(1))
mach = machine(encoder, x)
fit!(mach)
julia> transform(mach, x)
5-element Vector{Float64}:
52.285714285714285
52.42857142857143
52.57142857142857
52.714285714285715
52.857142857142854
MLJTransforms.OneHotEncoder — Type
A model type for constructing a one-hot encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
OneHotEncoder = @load OneHotEncoder pkg=MLJTransforms
Do model = OneHotEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in OneHotEncoder(features=...).
Use this model to one-hot encode the Multiclass and OrderedFactor features (columns) of some table, leaving other columns unchanged.
New data to be transformed may lack features present in the fit data, but no new features can be present.
Warning: This transformer assumes that levels(col) for any Multiclass or OrderedFactor column, col, is the same for training data and new data to be transformed.
To ensure all features are transformed into Continuous features, or dropped, use ContinuousEncoder instead.
Training data
In MLJ or MLJBase, bind an instance model to data with
mach = machine(model, X)
where
- X: any Tables.jl compatible table. Columns can be of mixed type but only those with element scitype Multiclass or OrderedFactor can be encoded. Check column scitypes with schema(X).
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- features: a vector of symbols (feature names). If empty (default) then all Multiclass and OrderedFactor features are encoded. Otherwise, encoding is further restricted to the specified features (ignore=false) or the unspecified features (ignore=true). This default behavior can be modified by the ordered_factor flag.
- ordered_factor=false: when true, OrderedFactor features are universally excluded
- drop_last=true: whether to drop the column corresponding to the final class of encoded features. For example, a three-class feature is spawned into three new features if drop_last=false, but just two features otherwise.
Fitted parameters
The fields of fitted_params(mach) are:
- all_features: names of all features encountered in training
- fitted_levels_given_feature: dictionary of the levels associated with each feature encoded, keyed on the feature name
- ref_name_pairs_given_feature: dictionary of pairs r => ftr (such as 0x00000001 => :grad__A) where r is a CategoricalArrays.jl reference integer representing a level, and ftr the corresponding new feature name; the dictionary is keyed on the names of features that are encoded
Report
The fields of report(mach) are:
- features_to_be_encoded: names of input features to be encoded
- new_features: names of all output features
Example
using MLJ
X = (name=categorical(["Danesh", "Lee", "Mary", "John"]),
grade=categorical(["A", "B", "A", "C"], ordered=true),
height=[1.85, 1.67, 1.5, 1.67],
n_devices=[3, 2, 4, 3])
julia> schema(X)
┌───────────┬──────────────────┐
│ names │ scitypes │
├───────────┼──────────────────┤
│ name │ Multiclass{4} │
│ grade │ OrderedFactor{3} │
│ height │ Continuous │
│ n_devices │ Count │
└───────────┴──────────────────┘
hot = OneHotEncoder(drop_last=true)
mach = fit!(machine(hot, X))
W = transform(mach, X)
julia> schema(W)
┌──────────────┬────────────┐
│ names │ scitypes │
├──────────────┼────────────┤
│ name__Danesh │ Continuous │
│ name__John │ Continuous │
│ name__Lee │ Continuous │
│ grade__A │ Continuous │
│ grade__B │ Continuous │
│ height │ Continuous │
│ n_devices │ Count │
└──────────────┴────────────┘
See also ContinuousEncoder.
MLJTransforms.ContinuousEncoder — Type
A model type for constructing a continuous encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
ContinuousEncoder = @load ContinuousEncoder pkg=MLJTransforms
Do model = ContinuousEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in ContinuousEncoder(drop_last=...).
Use this model to arrange all features (columns) of a table to have Continuous element scitype, by applying the following protocol to each feature ftr:
- If ftr is already Continuous, retain it.
- If ftr is Multiclass, one-hot encode it.
- If ftr is OrderedFactor, replace it with coerce(ftr, Continuous) (vector of floating point integers), unless one_hot_ordered_factors=true is specified, in which case one-hot encode it.
- If ftr is Count, replace it with coerce(ftr, Continuous).
- If ftr has some other element scitype, or was not observed in fitting the encoder, drop it from the table.
Warning: This transformer assumes that levels(col) for any Multiclass or OrderedFactor column, col, is the same for training data and new data to be transformed.
To selectively one-hot-encode categorical features (without dropping features) use OneHotEncoder instead.
Training data
In MLJ or MLJBase, bind an instance model to data with
mach = machine(model, X)
where
- X: any Tables.jl compatible table. Columns can be of mixed type but only those with element scitype Multiclass or OrderedFactor can be encoded. Check column scitypes with schema(X).
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- drop_last=true: whether to drop the column corresponding to the final class of one-hot encoded features. For example, a three-class feature is spawned into three new features if drop_last=false, but just two features otherwise.
- one_hot_ordered_factors=false: whether to one-hot any feature with OrderedFactor element scitype, or to instead coerce it directly to a (single) Continuous feature using the order
Fitted parameters
The fields of fitted_params(mach) are:
- features_to_keep: names of features that will not be dropped from the table
- one_hot_encoder: the OneHotEncoder model instance for handling the one-hot encoding
- one_hot_encoder_fitresult: the fitted parameters of the OneHotEncoder model
Report
The fields of report(mach) are:
- features_to_keep: names of input features that will not be dropped from the table
- new_features: names of all output features
Example
X = (name=categorical(["Danesh", "Lee", "Mary", "John"]),
grade=categorical(["A", "B", "A", "C"], ordered=true),
height=[1.85, 1.67, 1.5, 1.67],
n_devices=[3, 2, 4, 3],
comments=["the force", "be", "with you", "too"])
julia> schema(X)
┌───────────┬──────────────────┐
│ names │ scitypes │
├───────────┼──────────────────┤
│ name │ Multiclass{4} │
│ grade │ OrderedFactor{3} │
│ height │ Continuous │
│ n_devices │ Count │
│ comments │ Textual │
└───────────┴──────────────────┘
encoder = ContinuousEncoder(drop_last=true)
mach = fit!(machine(encoder, X))
W = transform(mach, X)
julia> schema(W)
┌──────────────┬────────────┐
│ names │ scitypes │
├──────────────┼────────────┤
│ name__Danesh │ Continuous │
│ name__John │ Continuous │
│ name__Lee │ Continuous │
│ grade │ Continuous │
│ height │ Continuous │
│ n_devices │ Continuous │
└──────────────┴────────────┘
julia> setdiff(schema(X).names, report(mach).features_to_keep) # dropped features
1-element Vector{Symbol}:
:comments
See also OneHotEncoder.
MLJTransforms.OrdinalEncoder — Type
A model type for constructing an ordinal encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
OrdinalEncoder = @load OrdinalEncoder pkg=MLJTransforms
Do model = OrdinalEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in OrdinalEncoder(features=...).
OrdinalEncoder implements ordinal encoding, which replaces the categorical values in the specified categorical features with integers (ordered arbitrarily). This will create an implicit ordering between categories, which may not be a proper modelling assumption.
Training data
In MLJ (or MLJBase) bind an instance model to data with
mach = machine(model, X)
Here:
- X is any table of input features (eg, a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- features=[]: A list of names of categorical features given as symbols to exclude or include from encoding, according to the value of ignore, or a single symbol (which is treated as a vector with one symbol), or a callable that returns true for features to be included/excluded
- ignore=true: Whether to exclude or include the features given in features
- ordered_factor=false: Whether to encode OrderedFactor or ignore them
- output_type: The numerical concrete type of the encoded features. Default is Float32.
Operations
- transform(mach, Xnew): Apply ordinal encoding to selected Multiclass or OrderedFactor features of Xnew specified by hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.
Fitted parameters
The fields of fitted_params(mach) are:
- index_given_feat_level: A dictionary that maps each level for each column in a subset of the categorical features of X into an integer.
Report
The fields of report(mach) are:
- encoded_features: The subset of the categorical features of X that were encoded
Examples
using MLJ
# Define categorical features
A = ["g", "b", "g", "r", "r",]
B = [1.0, 2.0, 3.0, 4.0, 5.0,]
C = ["f", "f", "f", "m", "f",]
D = [true, false, true, false, true,]
E = [1, 2, 3, 4, 5,]
# Combine into a named tuple
X = (A = A, B = B, C = C, D = D, E = E)
# Coerce A, C, D to multiclass and B to continuous and E to ordinal
X = coerce(X,
:A => Multiclass,
:B => Continuous,
:C => Multiclass,
:D => Multiclass,
:E => OrderedFactor,
)
# Check scitype coercion:
schema(X)
encoder = OrdinalEncoder(ordered_factor = false)
mach = fit!(machine(encoder, X))
Xnew = transform(mach, X)
julia> Xnew
(A = [2, 1, 2, 3, 3],
B = [1.0, 2.0, 3.0, 4.0, 5.0],
C = [1, 1, 1, 2, 1],
D = [2, 1, 2, 1, 2],
E = CategoricalArrays.CategoricalValue{Int64, UInt32}[1, 2, 3, 4, 5],)
See also TargetEncoder.
MLJTransforms.FrequencyEncoder — Type
A model type for constructing a frequency encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
FrequencyEncoder = @load FrequencyEncoder pkg=MLJTransforms
Do model = FrequencyEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in FrequencyEncoder(features=...).
FrequencyEncoder implements frequency encoding, which replaces the categorical values in the specified categorical features with their (normalized or raw) frequencies of occurrence in the dataset.
Training data
In MLJ (or MLJBase) bind an instance model to data with
mach = machine(model, X)
Here:
- X is any table of input features (eg, a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- features=[]: A list of names of categorical features given as symbols to exclude or include from encoding, according to the value of ignore, or a single symbol (which is treated as a vector with one symbol), or a callable that returns true for features to be included/excluded
- ignore=true: Whether to exclude or include the features given in features
- ordered_factor=false: Whether to encode OrderedFactor or ignore them
- normalize=false: Whether to use normalized frequencies that sum to 1 over category values or to use raw counts.
- output_type=Float32: The type of the output values. The default is Float32, but you can set it to Float64 or any other type that can hold the frequency values.
Operations
- transform(mach, Xnew): Apply frequency encoding to selected Multiclass or OrderedFactor features of Xnew specified by hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.
Fitted parameters
The fields of fitted_params(mach) are:
- statistic_given_feat_val: A dictionary that maps each level for each column in a subset of the categorical features of X into its frequency.
Report
The fields of report(mach) are:
- encoded_features: The subset of the categorical features of X that were encoded
Examples
using MLJ
# Define categorical features
A = ["g", "b", "g", "r", "r",]
B = [1.0, 2.0, 3.0, 4.0, 5.0,]
C = ["f", "f", "f", "m", "f",]
D = [true, false, true, false, true,]
E = [1, 2, 3, 4, 5,]
# Combine into a named tuple
X = (A = A, B = B, C = C, D = D, E = E)
# Coerce A, C, D to multiclass and B to continuous and E to ordinal
X = coerce(X,
:A => Multiclass,
:B => Continuous,
:C => Multiclass,
:D => Multiclass,
:E => OrderedFactor,
)
# Check scitype coercions:
schema(X)
encoder = FrequencyEncoder(ordered_factor = false, normalize=true)
mach = fit!(machine(encoder, X))
Xnew = transform(mach, X)
julia> Xnew
(A = [2, 1, 2, 2, 2],
B = [1.0, 2.0, 3.0, 4.0, 5.0],
C = [4, 4, 4, 1, 4],
D = [3, 2, 3, 2, 3],
E = CategoricalArrays.CategoricalValue{Int64, UInt32}[1, 2, 3, 4, 5],)
See also TargetEncoder.
MLJTransforms.TargetEncoder — Type
A model type for constructing a target encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
TargetEncoder = @load TargetEncoder pkg=MLJTransforms
Do model = TargetEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in TargetEncoder(features=...).
TargetEncoder implements target encoding as defined in [1] to encode categorical variables into continuous ones using statistics from the target variable.
Training data
In MLJ (or MLJBase) bind an instance model to data with
mach = machine(model, X, y)
Here:
- X is any table of input features (eg, a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.
- y is the target, which can be any AbstractVector whose element scitype is Continuous or Count for regression problems and Multiclass or OrderedFactor for classification problems; check the scitype with schema(y)
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- features=[]: A list of names of categorical features given as symbols to exclude or include from encoding, according to the value of ignore, or a single symbol (which is treated as a vector with one symbol), or a callable that returns true for features to be included/excluded
- ignore=true: Whether to exclude or include the features given in features
- ordered_factor=false: Whether to encode OrderedFactor or ignore them
- λ: Shrinkage hyper-parameter used to mix between posterior and prior statistics as described in [1]
- m: An integer hyper-parameter to compute shrinkage as described in [1]. If m=:auto then m will be computed using empirical Bayes estimation as described in [1]
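To make the roles of the prior, posterior and m concrete, here is a sketch of the shrinkage blend described in [1]; the numbers are illustrative, and this is not necessarily the exact MLJTransforms computation:
# the encoding for a level observed n times blends the level's target mean
# (posterior) with the overall target mean (prior):
prior = 0.5
posterior = 0.8
n, m = 10, 5
B = n / (n + m)                              # shrinkage factor in (0, 1)
encoding = B * posterior + (1 - B) * prior   # 0.7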
Operations
- transform(mach, Xnew): Apply target encoding to selected Multiclass or OrderedFactor features of Xnew specified by hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.
Fitted parameters
The fields of fitted_params(mach) are:
- task: Whether the task is Classification or Regression
- y_statistic_given_feat_level: A dictionary with the necessary statistics to encode each categorical feature. It maps each level in each categorical feature to a statistic computed over the target.
Report
The fields of report(mach) are:
- encoded_features: The subset of the categorical features of X that were encoded
Examples
using MLJ
# Define categorical features
A = ["g", "b", "g", "r", "r",]
B = [1.0, 2.0, 3.0, 4.0, 5.0,]
C = ["f", "f", "f", "m", "f",]
D = [true, false, true, false, true,]
E = [1, 2, 3, 4, 5,]
# Define the target variable
y = ["c1", "c2", "c3", "c1", "c2",]
# Combine into a named tuple
X = (A = A, B = B, C = C, D = D, E = E)
# Coerce A, C, D to multiclass and B to continuous and E to ordinal
X = coerce(X,
:A => Multiclass,
:B => Continuous,
:C => Multiclass,
:D => Multiclass,
:E => OrderedFactor,
)
y = coerce(y, Multiclass)
encoder = TargetEncoder(ordered_factor = false, lambda = 1.0, m = 0,)
mach = fit!(machine(encoder, X, y))
Xnew = transform(mach, X)
julia> schema(Xnew)
┌───────┬──────────────────┬─────────────────────────────────┐
│ names │ scitypes │ types │
├───────┼──────────────────┼─────────────────────────────────┤
│ A_1 │ Continuous │ Float64 │
│ A_2 │ Continuous │ Float64 │
│ A_3 │ Continuous │ Float64 │
│ B │ Continuous │ Float64 │
│ C_1 │ Continuous │ Float64 │
│ C_2 │ Continuous │ Float64 │
│ C_3 │ Continuous │ Float64 │
│ D_1 │ Continuous │ Float64 │
│ D_2 │ Continuous │ Float64 │
│ D_3 │ Continuous │ Float64 │
│ E │ OrderedFactor{5} │ CategoricalValue{Int64, UInt32} │
└───────┴──────────────────┴─────────────────────────────────┘
Reference
[1] Micci-Barreca, Daniele. “A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems” SIGKDD Explor. Newsl. 3, 1 (July 2001), 27–32.
See also OneHotEncoder.
MLJTransforms.ContrastEncoder — Type
A model type for constructing a contrast encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
ContrastEncoder = @load ContrastEncoder pkg=MLJTransforms
Do model = ContrastEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in ContrastEncoder(features=...).
ContrastEncoder implements the following contrast encoding methods for categorical features: dummy, sum, backward/forward difference, and Helmert coding. More generally, users can specify a custom contrast or hypothesis matrix, and each feature can be encoded using a different method.
Training data
In MLJ (or MLJBase) bind an instance model to data with
mach = machine(model, X)
Here:
- X is any table of input features (eg, a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- features=[]: A list of names of categorical features given as symbols to exclude or include from encoding, according to the value of ignore, or a single symbol (which is treated as a vector with one symbol), or a callable that returns true for features to be included/excluded
- mode=:dummy: The type of encoding to use. Can be one of :contrast, :dummy, :sum, :backward_diff, :forward_diff, :helmert or :hypothesis. If ignore=false (features to be encoded are listed explicitly in features), then this can be a vector of the same length as features to specify a different contrast encoding scheme for each feature
- buildmatrix=nothing: A function or other callable with signature buildmatrix(colname, k), where colname is the name of the feature and k is the number of its levels, and which returns a contrast or hypothesis matrix with row/column ordering consistent with the ordering of levels(col). Only relevant if mode is :contrast or :hypothesis.
- ignore=true: Whether to exclude or include the features given in features
- ordered_factor=false: Whether to encode OrderedFactor or ignore them
Operations
- transform(mach, Xnew): Apply contrast encoding to selected Multiclass or OrderedFactor features of Xnew specified by hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.
Fitted parameters
The fields of fitted_params(mach) are:
- vector_given_value_given_feature: A dictionary that maps each level for each column in a subset of the categorical features of X into its contrast encoding vector.
Report
The fields of report(mach) are:
- encoded_features: The subset of the categorical features of X that were encoded
Examples
using MLJ
# Define categorical dataset
X = (
name = categorical(["Ben", "John", "Mary", "John"]),
height = [1.85, 1.67, 1.5, 1.67],
favnum = categorical([7, 5, 10, 1]),
age = [23, 23, 14, 23],
)
# Check scitype coercions:
schema(X)
encoder = ContrastEncoder(
features = [:name, :favnum],
ignore = false,
mode = [:dummy, :helmert],
)
mach = fit!(machine(encoder, X))
Xnew = transform(mach, X)
julia> Xnew
(name_John = [1.0, 0.0, 0.0, 0.0],
name_Mary = [0.0, 1.0, 0.0, 1.0],
height = [1.85, 1.67, 1.5, 1.67],
favnum_5 = [0.0, 1.0, 0.0, -1.0],
favnum_7 = [2.0, -1.0, 0.0, -1.0],
favnum_10 = [-1.0, -1.0, 3.0, -1.0],
age = [23, 23, 14, 23],)
See also OneHotEncoder.
MLJTransforms.CardinalityReducer — Type
A model type for constructing a cardinality reducer, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
CardinalityReducer = @load CardinalityReducer pkg=MLJTransforms
Do model = CardinalityReducer() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in CardinalityReducer(features=...).
CardinalityReducer maps any level of a categorical feature that occurs with frequency < min_frequency into a new level (e.g., "Other"). This is useful when some categorical features have high cardinality and many levels are infrequent. This assumes that the categorical features have raw types that are in Union{AbstractString, Char, Number}.
Training data
In MLJ (or MLJBase) bind an instance model to data with
mach = machine(model, X)
Here:
- X is any table of input features (eg, a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- features=[]: A list of names of categorical features given as symbols to exclude or include from encoding, according to the value of ignore, or a single symbol (which is treated as a vector with one symbol), or a callable that returns true for features to be included/excluded
- ignore=true: Whether to exclude or include the features given in features
- ordered_factor=false: Whether to encode OrderedFactor or ignore them
- min_frequency::Real=3: Any level of a categorical feature that occurs with frequency < min_frequency will be mapped to a new level. Can be an integer or a float, which decides whether raw counts or normalized frequencies are used.
- label_for_infrequent::Dict{<:Type, <:Any} = Dict(AbstractString => "Other", Char => 'O'): A dictionary whose possible keys are the types AbstractString, Char and Number, and where each value signifies the new level to map into, given a column's raw super type. By default, if the raw type of the column subtypes AbstractString then the new value is "Other"; if the raw type subtypes Char then the new value is 'O'; and if the raw type subtypes Number then the new value is the lowest value in the column minus 1.
Operations
- transform(mach, Xnew): Apply cardinality reduction to selected Multiclass or OrderedFactor features of Xnew specified by hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.
Fitted parameters
The fields of fitted_params(mach) are:
- new_cat_given_col_val: A dictionary that maps each level in a categorical feature to a new level (either itself or the new level specified in label_for_infrequent)
Report
The fields of report(mach) are:
- encoded_features: The subset of the categorical features of X that were encoded
Examples
import StatsBase.proportionmap
using MLJ
# Define categorical features
A = [ ["a" for i in 1:100]..., "b", "b", "b", "c", "d"]
B = [ [0 for i in 1:100]..., 1, 2, 3, 4, 4]
# Combine into a named tuple
X = (A = A, B = B)
# Coerce A and B to Multiclass
X = coerce(X,
:A => Multiclass,
:B => Multiclass
)
encoder = CardinalityReducer(ordered_factor = false, min_frequency=3)
mach = fit!(machine(encoder, X))
Xnew = transform(mach, X)
julia> proportionmap(Xnew.A)
Dict{CategoricalArrays.CategoricalValue{String, UInt32}, Float64} with 3 entries:
"Other" => 0.0190476
"b" => 0.0285714
"a" => 0.952381
julia> proportionmap(Xnew.B)
Dict{CategoricalArrays.CategoricalValue{Int64, UInt32}, Float64} with 2 entries:
0 => 0.952381
-1 => 0.047619
See also FrequencyEncoder.
MLJTransforms.MissingnessEncoder — Type
A model type for constructing a missingness encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
MissingnessEncoder = @load MissingnessEncoder pkg=MLJTransforms
Do model = MissingnessEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in MissingnessEncoder(features=...).
MissingnessEncoder maps any missing level of a categorical feature into a new level (e.g., "Missing"). In this way, missingness will be treated as a new level by any subsequent model. This assumes that the categorical features have raw types that are in Char, AbstractString, and Number.
Training data
In MLJ (or MLJBase) bind an instance model to data with
mach = machine(model, X)
Here:
- X is any table of input features (eg, a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- features=[]: A list of names of categorical features given as symbols to exclude or include from encoding, according to the value of ignore, or a single symbol (which is treated as a vector with one symbol), or a callable that returns true for features to be included/excluded
- ignore=true: Whether to exclude or include the features given in features
- ordered_factor=false: Whether to encode OrderedFactor or ignore them
- label_for_missing::Dict{<:Type, <:Any} = Dict(AbstractString => "missing", Char => 'm'): A dictionary whose possible keys are the types AbstractString, Char and Number, and where each value signifies the new level to map missing values into, given a column's raw super type. By default, if the raw type of the column subtypes AbstractString then missing values will be replaced with "missing"; if the raw type subtypes Char then the new value is 'm'; and if the raw type subtypes Number then the new value is the lowest value in the column minus 1.
Operations
- transform(mach, Xnew): Apply missingness encoding to selected Multiclass or OrderedFactor features of Xnew specified by hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.
Fitted parameters
The fields of fitted_params(mach) are:
- label_for_missing_given_feature: A dictionary that, for each column, maps missing into some value according to label_for_missing
Report
The fields of report(mach) are:
- encoded_features: The subset of the categorical features of X that were encoded
Examples
import StatsBase.proportionmap
using MLJ
# Define a table with missing values
Xm = (
A = categorical(["Ben", "John", missing, missing, "Mary", "John", missing]),
B = [1.85, 1.67, missing, missing, 1.5, 1.67, missing],
C= categorical([7, 5, missing, missing, 10, 0, missing]),
D = [23, 23, 44, 66, 14, 23, 11],
E = categorical([missing, 'g', 'r', missing, 'r', 'g', 'p'])
)
encoder = MissingnessEncoder()
mach = fit!(machine(encoder, Xm))
Xnew = transform(mach, Xm)
julia> Xnew
(A = ["Ben", "John", "missing", "missing", "Mary", "John", "missing"],
B = Union{Missing, Float64}[1.85, 1.67, missing, missing, 1.5, 1.67, missing],
C = [7, 5, -1, -1, 10, 0, -1],
D = [23, 23, 44, 66, 14, 23, 11],
E = ['m', 'g', 'r', 'm', 'r', 'g', 'p'],)
See also CardinalityReducer.