Target Transformations

Some supervised models work best if the target variable has been standardized, i.e., rescaled to have zero mean and unit variance. Such a target transformation is learned from the values of the training target variable. In particular, one generally learns a different transformation when training on a proper subset of the training data. Good data hygiene prescribes that a new transformation be computed each time the supervised model is trained on new data, for example on each training fold in cross-validation.

Additionally, one generally wants to inverse transform the predictions of the supervised model for the final target predictions to be on the original scale.
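To make these two concerns concrete, here is a minimal sketch in plain Julia (no MLJ required) of learning a standardization from a target vector and inverting it. The helper names fit_standardizer, standardize and unstandardize are hypothetical, the data is made up, and the use of Julia's default sample standard deviation is an assumption for illustration rather than a statement about how Standardizer is implemented.

```julia
using Statistics

# Learn the standardization parameters (mean and standard deviation) from a target vector.
fit_standardizer(y) = (mu = mean(y), sigma = std(y))

# Apply the learned transformation, and its inverse.
standardize(p, y) = (y .- p.mu) ./ p.sigma
unstandardize(p, z) = z .* p.sigma .+ p.mu

y = [10.0, 12.0, 14.0, 16.0, 30.0, 50.0]   # illustrative target values

p_all  = fit_standardizer(y)        # parameters learned from all rows
p_fold = fit_standardizer(y[1:4])   # parameters learned from a proper subset

z = standardize(p_fold, y[1:4])     # zero mean, unit variance on the subset
y_back = unstandardize(p_fold, z)   # back on the original scale
```

The parameters learned from the subset differ from those learned from all rows, which is why the transformation has to be re-learned whenever the training rows change.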

All these concerns are addressed by wrapping the supervised model using TransformedTargetModel:

Ridge = @load RidgeRegressor pkg=MLJLinearModels verbosity=0
ridge = Ridge(fit_intercept=false)
ridge2 = TransformedTargetModel(ridge, transformer=Standardizer())
TransformedTargetModelDeterministic(
  model = RidgeRegressor(
        lambda = 1.0, 
        fit_intercept = false, 
        penalize_intercept = false, 
        scale_penalty_with_samples = true, 
        solver = nothing), 
  transformer = Standardizer(
        features = Symbol[], 
        ignore = false, 
        ordered_factor = false, 
        count = false), 
  inverse = nothing, 
  cache = true)

Note that all the original hyperparameters, as well as those of the Standardizer, are accessible as nested hyperparameters of the wrapped model, which can be trained or evaluated like any other:

X, y = make_regression(rng=1234, intercept=false)
y = y*10^5
mach = machine(ridge2, X, y)
fit!(mach, rows=1:60, verbosity=0)
predict(mach, rows=61:62)
2-element Vector{Float64}:
  -22108.94221844114
 -158721.15783508556

Training and predicting using ridge2 as above means:

  1. Standardizing the target y using the first 60 rows to get a new target z

  2. Training the original ridge model using the first 60 rows of X and z

  3. Calling predict on the machine trained in Step 2 on rows 61:62 of X

  4. Applying the inverse scaling learned in Step 1 to those predictions (to get the final output shown above)
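The four steps above can be sketched by hand in plain Julia. This is only an illustration under stated assumptions: ridge_fit is a hypothetical closed-form ridge solver without an intercept, standing in for (not reproducing) the MLJLinearModels implementation, and the synthetic data is not the output of make_regression.

```julia
using LinearAlgebra, Statistics, Random

# Hypothetical closed-form ridge regression without an intercept:
# solve (X'X + lambda*I) w = X'y. Not the MLJLinearModels implementation.
ridge_fit(X, y; lambda=1.0) = (X' * X + lambda * I) \ (X' * y)

# Synthetic data with a large-scale target (illustrative only).
rng = MersenneTwister(1234)
X = randn(rng, 100, 2)
y = 1e5 .* (X * [1.0, -2.0] .+ 0.1 .* randn(rng, 100))

train, test = 1:60, 61:62

# Step 1: learn the standardization from the training rows only.
mu, sigma = mean(y[train]), std(y[train])
z = (y[train] .- mu) ./ sigma

# Step 2: train the ridge model on the training rows of X and the new target z.
w = ridge_fit(X[train, :], z)

# Step 3: predict on the held-out rows (still on the standardized scale).
z_hat = X[test, :] * w

# Step 4: invert the scaling learned in Step 1 to return to the original scale.
y_hat = z_hat .* sigma .+ mu
```

Note that the held-out rows play no part in computing mu and sigma, which is the data-hygiene point made earlier.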

Since both ridge and ridge2 return predictions on the original scale, we can meaningfully compare the corresponding mean absolute errors, which are indeed different in this case.

evaluate(ridge, X, y, measure=l1)
PerformanceEvaluation object with these fields:
  model, measure, operation,
  measurement, per_fold, per_observation,
  fitted_params_per_fold, report_per_fold,
  train_test_rows, resampling, repeats
Extract:
┌──────────┬───────────┬─────────────┐
│ measure  │ operation │ measurement │
├──────────┼───────────┼─────────────┤
│ LPLoss(  │ predict   │ 81700.0     │
│   p = 1) │           │             │
└──────────┴───────────┴─────────────┘
┌──────────────────────────────────────────────────────────┬─────────┐
│ per_fold                                                 │ 1.96*SE │
├──────────────────────────────────────────────────────────┼─────────┤
│ [67400.0, 74300.0, 112000.0, 52800.0, 76800.0, 108000.0] │ 20600.0 │
└──────────────────────────────────────────────────────────┴─────────┘
evaluate(ridge2, X, y, measure=l1)
PerformanceEvaluation object with these fields:
  model, measure, operation,
  measurement, per_fold, per_observation,
  fitted_params_per_fold, report_per_fold,
  train_test_rows, resampling, repeats
Extract:
┌──────────┬───────────┬─────────────┐
│ measure  │ operation │ measurement │
├──────────┼───────────┼─────────────┤
│ LPLoss(  │ predict   │ 83200.0     │
│   p = 1) │           │             │
└──────────┴───────────┴─────────────┘
┌──────────────────────────────────────────────────────────┬─────────┐
│ per_fold                                                 │ 1.96*SE │
├──────────────────────────────────────────────────────────┼─────────┤
│ [81300.0, 74400.0, 112000.0, 50400.0, 77100.0, 105000.0] │ 19600.0 │
└──────────────────────────────────────────────────────────┴─────────┘

Ordinary functions can also be used in target transformations but an inverse must be explicitly specified:

ridge3 = TransformedTargetModel(ridge, transformer=y->log.(y), inverse=z->exp.(z))
X, y = @load_boston
evaluate(ridge3, X, y, measure=l1)
PerformanceEvaluation object with these fields:
  model, measure, operation,
  measurement, per_fold, per_observation,
  fitted_params_per_fold, report_per_fold,
  train_test_rows, resampling, repeats
Extract:
┌──────────┬───────────┬─────────────┐
│ measure  │ operation │ measurement │
├──────────┼───────────┼─────────────┤
│ LPLoss(  │ predict   │ 6.33        │
│   p = 1) │           │             │
└──────────┴───────────┴─────────────┘
┌──────────────────────────────────────┬─────────┐
│ per_fold                             │ 1.96*SE │
├──────────────────────────────────────┼─────────┤
│ [5.33, 6.05, 7.38, 6.39, 7.93, 4.89] │ 1.02    │
└──────────────────────────────────────┴─────────┘

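Without the log transform (i.e., using ridge) we get a mean absolute error, l1, of 3.9. Since a lower l1 is better, this is an improvement on the 6.33 shown above, so the log transform does not help in this case.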
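Because the inverse is supplied by hand when using ordinary functions, it is easy to pass one that does not actually invert the transformer. A quick round-trip check in plain Julia, run before wrapping the model, can catch this. The target values below are made up for illustration.

```julia
transformer = y -> log.(y)
inverse = z -> exp.(z)

y = [5.0, 13.6, 21.7, 50.0]   # illustrative positive targets; log requires y > 0

# The inverse should recover the original target up to floating-point error.
roundtrip_ok = isapprox(inverse(transformer(y)), y)
```

If roundtrip_ok is false, predictions from the wrapped model would not be on the original scale.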

MLJBase.TransformedTargetModel - Function
TransformedTargetModel(model; transformer=nothing, inverse=nothing, cache=true)

Wrap the supervised or semi-supervised model in a transformation of the target variable.

Here transformer is one of the following:

  • The Unsupervised model that is to transform the training target. By default (inverse=nothing) the parameters learned by this transformer are also used to inverse-transform the predictions of model, which means transformer must implement the inverse_transform method. If this is not the case, specify inverse=identity to suppress inversion.

  • A callable object for transforming the target, such as y -> log.(y). In this case a callable inverse, such as z -> exp.(z), should be specified.

Specify cache=false to prioritize memory over speed, or to guarantee data anonymity.

Specify inverse=identity if model is a probabilistic predictor, as inverse-transforming sample spaces is not supported. Alternatively, replace model with a deterministic model, such as Pipeline(model, y -> mode.(y)).

Examples

A model that standardizes the target before applying ridge regression, with predictions returned on the original scale:

Ridge = @load RidgeRegressor pkg=MLJLinearModels
model = Ridge()
tmodel = TransformedTargetModel(model, transformer=Standardizer())

A model that applies a static log transformation to the target, again returning predictions on the original scale:

tmodel2 = TransformedTargetModel(model, transformer=y->log.(y), inverse=z->exp.(z))