Transformers and Other Unsupervised Models
Several unsupervised models used to perform common transformations, such as one-hot encoding, missing value imputation, and categorical encoding, are available in MLJ out-of-the-box (no need to load code with @load). They are detailed in Built-in transformers below.
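For example, a built-in transformer can be bound to data and used immediately. A minimal sketch (the toy table is illustrative):
using MLJ

X = (x1 = [10.0, 20.0, 30.0], x2 = [4.0, 6.0, 8.0])  # toy table
mach = machine(Standardizer(), X) |> fit!  # no @load required
W = transform(mach, X)                     # x1, x2 rescaled to mean 0, std 1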
A transformer is static if it has no learned parameters. While such a transformer is tantamount to an ordinary function, realizing it as an MLJ static transformer (a subtype of Static <: Unsupervised) can be useful, especially if the function depends on parameters the user would like to manipulate (which become hyper-parameters of the model). The necessary syntax for defining your own static transformers is described in Static transformers below.
Some unsupervised models, such as clustering algorithms, have a predict method in addition to a transform method. We give an example of this in Transformers that also predict below.
Built-in transformers
For tutorials on the transformers below, refer to the MLJTransforms documentation.
Transformer | Brief Description |
---|---|
Standardizer | Standardize (whiten) columns of numerical features |
UnivariateBoxCoxTransformer | Apply a Box-Cox transformation to a single vector |
InteractionTransformer | Create new interaction features from columns of numerical features |
UnivariateDiscretizer | Discretize a continuous vector into an ordered factor |
FillImputer | Fill in missing values of features belonging to any scientific type |
UnivariateFillImputer | Fill in missing values in a single vector |
UnivariateTimeTypeToContinuous | Transform a vector of time type into continuous type |
OneHotEncoder | Encode categorical variables into one-hot vectors |
ContinuousEncoder | Add type casting functionality to OneHotEncoder |
OrdinalEncoder | Encode categorical variables into ordered integers |
FrequencyEncoder | Encode categorical variables into their normalized or unnormalized frequencies |
TargetEncoder | Encode categorical variables into relevant target statistics |
ContrastEncoder | Allows defining a custom contrast encoder via a contrast matrix |
CardinalityReducer | Reduce cardinality of high cardinality categorical features by grouping infrequent categories |
MissingnessEncoder | Encode missing values of categorical features into new values |
Static transformers
A static transformer is a model for transforming data that does not generalize to new data (does not "learn") but which nevertheless has hyper-parameters. For example, the DBSCAN clustering model from Clustering.jl can assign labels to some collection of observations, but cannot directly assign a label to a new observation.
The general user may define their own static models. The main use-case is the insertion of some parameter-dependent transformation into Linear Pipelines. (If a static transformer has no hyper-parameters, it is tantamount to an ordinary function. An ordinary function can be inserted directly into a pipeline; the situation for learning networks is only slightly more complicated.)
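For instance, an ordinary function can be dropped straight into a pipeline. A minimal sketch (the coercion and feature name :x1 are illustrative):
using MLJ

# an ordinary (parameter-free) function inserted directly into a pipeline,
# followed by a built-in transformer:
pipe = (X -> coerce(X, :x1 => Continuous)) |> OneHotEncoder()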
The following example defines a new model type Averager to perform the weighted average of two vectors (target predictions, for example). We suppose the weighting is normalized, and therefore controlled by a single hyper-parameter, mix.
mutable struct Averager <: Static
mix::Float64
end
MLJ.transform(a::Averager, _, y1, y2) = (1 - a.mix)*y1 + a.mix*y2
Important. Note the sub-typing <: Static.
Such static transformers with (unlearned) parameters can have arbitrarily many inputs, but only one output. In the single input case, an inverse_transform can also be defined. Since they have no real learned parameters, you bind a static transformer to a machine without specifying training arguments; there is no need to fit! the machine:
mach = machine(Averager(0.5))
transform(mach, [1, 2, 3], [3, 2, 1])
3-element Vector{Float64}:
2.0
2.0
2.0
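As noted above, in the single input case an inverse_transform can also be defined. A minimal sketch, using a hypothetical Scaler type:
mutable struct Scaler <: Static
    factor::Float64
end

MLJ.transform(s::Scaler, _, x) = s.factor .* x
MLJ.inverse_transform(s::Scaler, _, z) = z ./ s.factor

mach = machine(Scaler(2.0))
z = transform(mach, [1.0, 2.0, 3.0])  # [2.0, 4.0, 6.0]
inverse_transform(mach, z)            # recovers [1.0, 2.0, 3.0]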
Let's see how we can include our Averager in a learning network to mix the predictions of two regressors, with one-hot encoding of the inputs. Here are two regressors for mixing, and some dummy data for testing our learning network:
ridge = (@load RidgeRegressor pkg=MultivariateStats)()
knn = (@load KNNRegressor)()
import Random.seed!
seed!(112)
X = (
x1=coerce(rand("ab", 100), Multiclass),
x2=rand(100),
)
y = X.x2 + 0.05*rand(100)
schema(X)
┌───────┬───────────────┬────────────────────────────────┐
│ names │ scitypes │ types │
├───────┼───────────────┼────────────────────────────────┤
│ x1 │ Multiclass{2} │ CategoricalValue{Char, UInt32} │
│ x2 │ Continuous │ Float64 │
└───────┴───────────────┴────────────────────────────────┘
And the learning network:
Xs = source(X)
ys = source(y)
averager = Averager(0.5)
mach0 = machine(OneHotEncoder(), Xs)
W = transform(mach0, Xs) # one-hot encode the input
mach1 = machine(ridge, W, ys)
y1 = predict(mach1, W)
mach2 = machine(knn, W, ys)
y2 = predict(mach2, W)
mach4 = machine(averager)
yhat = transform(mach4, y1, y2)
# test:
fit!(yhat)
Xnew = selectrows(X, 1:3)
yhat(Xnew)
3-element Vector{Float64}:
0.6403223210037916
0.9607694439597683
0.8159225346205365
We next "export" the learning network as a standalone composite model type. First we need a struct for the composite model. Since we are restricting to Deterministic
component regressors, the composite will also make deterministic predictions, and so gets the supertype DeterministicNetworkComposite
:
mutable struct DoubleRegressor <: DeterministicNetworkComposite
regressor1
regressor2
averager
end
As described in Learning Networks, we next paste the learning network into a prefit declaration, replace the component models with symbolic placeholders, and add a learning network "interface":
import MLJBase
function MLJBase.prefit(composite::DoubleRegressor, verbosity, X, y)
Xs = source(X)
ys = source(y)
mach0 = machine(OneHotEncoder(), Xs)
W = transform(mach0, Xs) # one-hot encode the input
mach1 = machine(:regressor1, W, ys)
y1 = predict(mach1, W)
mach2 = machine(:regressor2, W, ys)
y2 = predict(mach2, W)
mach4 = machine(:averager)
yhat = transform(mach4, y1, y2)
# learning network interface:
(; predict=yhat)
end
The new model type can be evaluated like any other supervised model:
X, y = @load_reduced_ames;
composite = DoubleRegressor(ridge, knn, Averager(0.5))
DoubleRegressor(
regressor1 = RidgeRegressor(
lambda = 1.0,
bias = true),
regressor2 = KNNRegressor(
K = 5,
algorithm = :kdtree,
metric = Distances.Euclidean(0.0),
leafsize = 10,
reorder = true,
weights = NearestNeighborModels.Uniform()),
averager = Averager(
mix = 0.5))
composite.averager.mix = 0.25 # adjust mix from default of 0.5
evaluate(composite, X, y, measure=l1)
PerformanceEvaluation object with these fields:
model, measure, operation,
measurement, per_fold, per_observation,
fitted_params_per_fold, report_per_fold,
train_test_rows, resampling, repeats
Extract:
┌──────────┬───────────┬─────────────┐
│ measure │ operation │ measurement │
├──────────┼───────────┼─────────────┤
│ LPLoss( │ predict │ 17200.0 │
│ p = 1) │ │ │
└──────────┴───────────┴─────────────┘
┌────────────────────────────────────────────────────────┬─────────┐
│ per_fold │ 1.96*SE │
├────────────────────────────────────────────────────────┼─────────┤
│ [15200.0, 15800.0, 18500.0, 16400.0, 18600.0, 18500.0] │ 1350.0 │
└────────────────────────────────────────────────────────┴─────────┘
A static transformer can also expose byproducts of the transform computation in the report of any associated machine. See Static transformers for details.
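Here is a minimal sketch of the mechanism, assuming the reporting_operations trait provided by MLJModelInterface (an MLJ dependency); the Binarizer type is hypothetical:
import MLJModelInterface

mutable struct Binarizer <: Static
    threshold::Float64
end

# declare that `transform` also returns a report:
MLJModelInterface.reporting_operations(::Type{<:Binarizer}) = (:transform,)

# `transform` now returns an (output, report) tuple:
function MLJ.transform(b::Binarizer, _, v)
    output = map(x -> x > b.threshold, v)
    return output, (n_positive = count(output),)
end

mach = machine(Binarizer(0.5))
transform(mach, [0.2, 0.7, 0.9])  # Bool[0, 1, 1]
report(mach)                      # exposes the n_positive byproduct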
Transformers that also predict
Some clustering algorithms learn to label data by identifying a collection of "centroids" in the training data. Any new input observation is labeled with the cluster to which it is closest (this is the output of predict) while the vector of all distances from the centroids defines a lower-dimensional representation of the observation (the output of transform). In the following example a K-means clustering algorithm assigns one of three labels 1, 2, 3 to the input features of the iris data set and compares them with the actual species recorded in the target (not seen by the algorithm).
import Random.seed!
seed!(123)
X, y = @load_iris
KMeans = @load KMeans pkg=Clustering
kmeans = KMeans()
mach = machine(kmeans, X) |> fit!
[ Info: For silent loading, specify `verbosity=0`.
import MLJClusteringInterface ✔
[ Info: Training machine(KMeans(k = 3, …), …).
Transforming:
Xsmall = transform(mach)
selectrows(Xsmall, 1:4) |> pretty
┌────────────┬────────────┬────────────┐
│ x1 │ x2 │ x3 │
│ Float64 │ Float64 │ Float64 │
│ Continuous │ Continuous │ Continuous │
├────────────┼────────────┼────────────┤
│ 11.6913 │ 0.021592 │ 25.599 │
│ 11.5503 │ 0.191992 │ 26.1626 │
│ 12.7403 │ 0.169992 │ 27.8716 │
│ 11.7129 │ 0.269192 │ 26.5595 │
└────────────┴────────────┴────────────┘
Predicting:
yhat = predict(mach)
compare = zip(yhat, y) |> collect
150-element Vector{Tuple{CategoricalValue{Int64, UInt32}, CategoricalValue{String, UInt32}}}:
(2, "setosa")
(2, "setosa")
(2, "setosa")
(2, "setosa")
(2, "setosa")
(2, "setosa")
(2, "setosa")
(2, "setosa")
(2, "setosa")
(2, "setosa")
⋮
(3, "virginica")
(1, "virginica")
(3, "virginica")
(3, "virginica")
(3, "virginica")
(1, "virginica")
(3, "virginica")
(3, "virginica")
(1, "virginica")
compare[1:8]
8-element Vector{Tuple{CategoricalValue{Int64, UInt32}, CategoricalValue{String, UInt32}}}:
(2, "setosa")
(2, "setosa")
(2, "setosa")
(2, "setosa")
(2, "setosa")
(2, "setosa")
(2, "setosa")
(2, "setosa")
compare[51:58]
8-element Vector{Tuple{CategoricalValue{Int64, UInt32}, CategoricalValue{String, UInt32}}}:
(1, "versicolor")
(1, "versicolor")
(3, "versicolor")
(1, "versicolor")
(1, "versicolor")
(1, "versicolor")
(1, "versicolor")
(1, "versicolor")
compare[101:108]
8-element Vector{Tuple{CategoricalValue{Int64, UInt32}, CategoricalValue{String, UInt32}}}:
(3, "virginica")
(1, "virginica")
(3, "virginica")
(3, "virginica")
(3, "virginica")
(3, "virginica")
(1, "virginica")
(3, "virginica")
Reference
MLJTransforms.Standardizer — Type
A model type for constructing a standardizer, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
Standardizer = @load Standardizer pkg=MLJTransforms
Do model = Standardizer() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in Standardizer(features=...).
Use this model to standardize (whiten) a Continuous vector, or relevant columns of a table. The rescalings applied by this transformer to new data are always those learned during the training phase, which are generally different from what would actually standardize the new data.
Training data
In MLJ or MLJBase, bind an instance model to data with
mach = machine(model, X)
where
- X: any Tables.jl compatible table or any abstract vector with Continuous element scitype (any abstract float vector). Only features in a table with Continuous scitype can be standardized; check column scitypes with schema(X).
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- features: one of the following, with the behavior indicated below:
  - [] (empty, the default): standardize all features (columns) having Continuous element scitype
  - non-empty vector of feature names (symbols): standardize only the Continuous features in the vector (if ignore=false) or Continuous features not named in the vector (ignore=true)
  - function or other callable: standardize a feature if the callable returns true on its name. For example, Standardizer(features = name -> name in [:x1, :x3], ignore = true, count=true) has the same effect as Standardizer(features = [:x1, :x3], ignore = true, count=true), namely to standardize all Continuous and Count features, with the exception of :x1 and :x3.
  Note this behavior is further modified if the ordered_factor or count flags are set to true; see below.
- ignore=false: whether to ignore or standardize specified features, as explained above
- ordered_factor=false: if true, standardize any OrderedFactor feature wherever a Continuous feature would be standardized, as described above
- count=false: if true, standardize any Count feature wherever a Continuous feature would be standardized, as described above
Operations
- transform(mach, Xnew): return Xnew with relevant features standardized according to the rescalings learned during fitting of mach
- inverse_transform(mach, Z): apply the inverse transformation to Z, so that inverse_transform(mach, transform(mach, Xnew)) is approximately the same as Xnew; unavailable if ordered_factor or count flags were set to true
Fitted parameters
The fields of fitted_params(mach) are:
- features_fit: the names of features that will be standardized
- means: the corresponding untransformed mean values
- stds: the corresponding untransformed standard deviations
Report
The fields of report(mach) are:
- features_fit: the names of features that will be standardized
Examples
using MLJ
X = (ordinal1 = [1, 2, 3],
ordinal2 = coerce([:x, :y, :x], OrderedFactor),
ordinal3 = [10.0, 20.0, 30.0],
ordinal4 = [-20.0, -30.0, -40.0],
nominal = coerce(["Your father", "he", "is"], Multiclass));
julia> schema(X)
┌──────────┬──────────────────┐
│ names │ scitypes │
├──────────┼──────────────────┤
│ ordinal1 │ Count │
│ ordinal2 │ OrderedFactor{2} │
│ ordinal3 │ Continuous │
│ ordinal4 │ Continuous │
│ nominal │ Multiclass{3} │
└──────────┴──────────────────┘
stand1 = Standardizer();
julia> transform(fit!(machine(stand1, X)), X)
(ordinal1 = [1, 2, 3],
ordinal2 = CategoricalValue{Symbol,UInt32}[:x, :y, :x],
ordinal3 = [-1.0, 0.0, 1.0],
ordinal4 = [1.0, 0.0, -1.0],
nominal = CategoricalValue{String,UInt32}["Your father", "he", "is"],)
stand2 = Standardizer(features=[:ordinal3, ], ignore=true, count=true);
julia> transform(fit!(machine(stand2, X)), X)
(ordinal1 = [-1.0, 0.0, 1.0],
ordinal2 = CategoricalValue{Symbol,UInt32}[:x, :y, :x],
ordinal3 = [10.0, 20.0, 30.0],
ordinal4 = [1.0, 0.0, -1.0],
nominal = CategoricalValue{String,UInt32}["Your father", "he", "is"],)
See also OneHotEncoder, ContinuousEncoder.
MLJTransforms.UnivariateBoxCoxTransformer — Type
A model type for constructing a single variable Box-Cox transformer, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
UnivariateBoxCoxTransformer = @load UnivariateBoxCoxTransformer pkg=MLJTransforms
Do model = UnivariateBoxCoxTransformer() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in UnivariateBoxCoxTransformer(n=...).
Box-Cox transformations attempt to make data look more normally distributed. This can improve performance and assist in the interpretation of models which suppose that data is generated by a normal distribution.
A Box-Cox transformation (with shift) is of the form
x -> ((x + c)^λ - 1)/λ
for some constant c and real λ, unless λ = 0, in which case the above is replaced with
x -> log(x + c)
Given user-specified hyper-parameters n::Integer and shift::Bool, the present implementation learns the parameters c and λ from the training data as follows: If shift=true and zeros are encountered in the data, then c is set to 0.2 times the data mean. If there are no zeros, then no shift is applied. Finally, n different values of λ between -0.4 and 3 are considered, with λ fixed to the value maximizing normality of the transformed data.
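A rough sketch of such a grid search; the normality score here (a probability-plot correlation) is illustrative and may differ from the package's criterion, and Distributions is an extra dependency of the sketch:
using Distributions, Statistics

# Box-Cox transform as defined above:
boxcox(x, λ, c=0.0) = λ == 0 ? log.(x .+ c) : ((x .+ c).^λ .- 1) ./ λ

# score normality as the correlation between the sorted transformed data
# and standard normal quantiles:
function normality(z)
    n = length(z)
    return cor(sort(z), quantile.(Normal(), ((1:n) .- 0.5) ./ n))
end

x = rand(1000) .+ 0.1             # positive toy data
λs = range(-0.4, 3, length=171)   # a grid of n=171 candidate exponents
λ_best = λs[argmax([normality(boxcox(x, λ)) for λ in λs])]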
Reference: Wikipedia entry for power transform.
Training data
In MLJ or MLJBase, bind an instance model to data with
mach = machine(model, x)
where
- x: any abstract vector with element scitype Continuous; check the scitype with scitype(x)
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- n=171: number of values of the exponent λ to try
- shift=false: whether to include a preliminary constant translation in transformations, in the presence of zeros
Operations
- transform(mach, xnew): apply the Box-Cox transformation learned when fitting mach
- inverse_transform(mach, z): reconstruct the vector whose transformation learned by mach is z
Fitted parameters
The fields of fitted_params(mach) are:
- λ: the learned Box-Cox exponent
- c: the learned shift
Examples
using MLJ
using UnicodePlots
using Random
Random.seed!(123)
transf = UnivariateBoxCoxTransformer()
x = randn(1000).^2
mach = machine(transf, x)
fit!(mach)
z = transform(mach, x)
julia> histogram(x)
┌ ┐
[ 0.0, 2.0) ┤███████████████████████████████████ 848
[ 2.0, 4.0) ┤████▌ 109
[ 4.0, 6.0) ┤█▍ 33
[ 6.0, 8.0) ┤▍ 7
[ 8.0, 10.0) ┤▏ 2
[10.0, 12.0) ┤ 0
[12.0, 14.0) ┤▏ 1
└ ┘
Frequency
julia> histogram(z)
┌ ┐
[-5.0, -4.0) ┤█▎ 8
[-4.0, -3.0) ┤████████▊ 64
[-3.0, -2.0) ┤█████████████████████▊ 159
[-2.0, -1.0) ┤█████████████████████████████▊ 216
[-1.0, 0.0) ┤███████████████████████████████████ 254
[ 0.0, 1.0) ┤█████████████████████████▊ 188
[ 1.0, 2.0) ┤████████████▍ 90
[ 2.0, 3.0) ┤██▊ 20
[ 3.0, 4.0) ┤▎ 1
└ ┘
Frequency
MLJTransforms.InteractionTransformer — Type
A model type for constructing an interaction transformer, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
InteractionTransformer = @load InteractionTransformer pkg=MLJTransforms
Do model = InteractionTransformer() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in InteractionTransformer(order=...).
Generates all polynomial interaction terms up to the given order for the subset of chosen columns. Any column that contains elements with scitype <:Infinite is a valid basis to generate interactions. If features is not specified, all such columns with scitype <:Infinite in the table are used as a basis.
In MLJ or MLJBase, you can transform features X with the single call
transform(machine(model), X)
See also the example below.
Hyper-parameters
- order: Maximum order of interactions to be generated.
- features: Restricts interaction generation to those columns.
Operations
- transform(machine(model), X): Generates polynomial interaction terms out of table X using the hyper-parameters specified in model.
Example
using MLJ
X = (
A = [1, 2, 3],
B = [4, 5, 6],
C = [7, 8, 9],
D = ["x₁", "x₂", "x₃"]
)
it = InteractionTransformer(order=3)
mach = machine(it)
julia> transform(mach, X)
(A = [1, 2, 3],
B = [4, 5, 6],
C = [7, 8, 9],
D = ["x₁", "x₂", "x₃"],
A_B = [4, 10, 18],
A_C = [7, 16, 27],
B_C = [28, 40, 54],
A_B_C = [28, 80, 162],)
it = InteractionTransformer(order=2, features=[:A, :B])
mach = machine(it)
julia> transform(mach, X)
(A = [1, 2, 3],
B = [4, 5, 6],
C = [7, 8, 9],
D = ["x₁", "x₂", "x₃"],
A_B = [4, 10, 18],)
MLJTransforms.UnivariateDiscretizer — Type
A model type for constructing a single variable discretizer, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
UnivariateDiscretizer = @load UnivariateDiscretizer pkg=MLJTransforms
Do model = UnivariateDiscretizer() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in UnivariateDiscretizer(n_classes=...).
Discretization converts a Continuous vector into an OrderedFactor vector. In particular, the output is a CategoricalVector (whose reference type is optimized).
The transformation is chosen so that the vector on which the transformer is fit has, in transformed form, an approximately uniform distribution of values. Specifically, if n_classes is the level of discretization, then 2*n_classes - 1 ordered quantiles are computed, the odd quantiles being used for transforming (discretization) and the even quantiles for inverse transforming.
Training data
In MLJ or MLJBase, bind an instance model to data with
mach = machine(model, x)
where
- x: any abstract vector with Continuous element scitype; check scitype with scitype(x).
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- n_classes: number of discrete classes in the output
Operations
- transform(mach, xnew): discretize xnew according to the discretization learned when fitting mach
- inverse_transform(mach, z): attempt to reconstruct from z a vector that transforms to give z
Fitted parameters
The fields of fitted_params(mach).fitresult include:
- odd_quantiles: quantiles used for transforming (length is n_classes - 1)
- even_quantiles: quantiles used for inverse transforming (length is n_classes)
Example
using MLJ
using Random
Random.seed!(123)
discretizer = UnivariateDiscretizer(n_classes=100)
mach = machine(discretizer, randn(1000))
fit!(mach)
julia> x = rand(5)
5-element Vector{Float64}:
0.8585244609846809
0.37541692370451396
0.6767070590395461
0.9208844241267105
0.7064611415680901
julia> z = transform(mach, x)
5-element CategoricalArrays.CategoricalArray{UInt8,1,UInt8}:
0x52
0x42
0x4d
0x54
0x4e
x_approx = inverse_transform(mach, z)
julia> x - x_approx
5-element Vector{Float64}:
0.008224506144777322
0.012731354778359405
0.0056265330571125816
0.005738175684445124
0.006835652575801987
MLJTransforms.FillImputer — Type
A model type for constructing a fill imputer, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
FillImputer = @load FillImputer pkg=MLJTransforms
Do model = FillImputer() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in FillImputer(features=...).
Use this model to impute missing values in tabular data. A fixed "filler" value is learned from the training data, one for each column of the table.
For imputing missing values in a vector, use UnivariateFillImputer instead.
Training data
In MLJ or MLJBase, bind an instance model to data with
mach = machine(model, X)
where
- X: any table of input features (eg, a DataFrame) whose features each have element scitypes Union{Missing, T}, where T is a subtype of Continuous, Multiclass, OrderedFactor or Count. Check scitypes with schema(X).
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- features: a vector of names of features (symbols) for which imputation is to be attempted; default is empty, which is interpreted as "impute all".
- continuous_fill: function or other callable to determine value to be imputed in the case of Continuous (abstract float) data; default is to apply median after skipping missing values
- count_fill: function or other callable to determine value to be imputed in the case of Count (integer) data; default is to apply rounded median after skipping missing values
- finite_fill: function or other callable to determine value to be imputed in the case of Multiclass or OrderedFactor data (categorical vectors); default is to apply mode after skipping missing values
Operations
- transform(mach, Xnew): return Xnew with missing values imputed with the fill values learned when fitting mach
Fitted parameters
The fields of fitted_params(mach) are:
- features_seen_in_fit: the names of features (columns) encountered during training
- univariate_transformer: the univariate model applied to determine the fillers (its fields contain the functions defining the filler computations)
- filler_given_feature: dictionary of filler values, keyed on feature (column) names
Examples
using MLJ
imputer = FillImputer()
X = (a = [1.0, 2.0, missing, 3.0, missing],
b = coerce(["y", "n", "y", missing, "y"], Multiclass),
c = [1, 1, 2, missing, 3])
schema(X)
julia> schema(X)
┌───────┬───────────────────────────────┐
│ names │ scitypes │
├───────┼───────────────────────────────┤
│ a │ Union{Missing, Continuous} │
│ b │ Union{Missing, Multiclass{2}} │
│ c │ Union{Missing, Count} │
└───────┴───────────────────────────────┘
mach = machine(imputer, X)
fit!(mach)
julia> fitted_params(mach).filler_given_feature
Dict{Symbol, Any} with 3 entries:
:a => 2.0
:b => "y"
:c => 2
julia> transform(mach, X)
(a = [1.0, 2.0, 2.0, 3.0, 2.0],
b = CategoricalValue{String, UInt32}["y", "n", "y", "y", "y"],
c = [1, 1, 2, 2, 3],)
See also UnivariateFillImputer.
MLJTransforms.UnivariateFillImputer — Type
A model type for constructing a single variable fill imputer, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
UnivariateFillImputer = @load UnivariateFillImputer pkg=MLJTransforms
Do model = UnivariateFillImputer() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in UnivariateFillImputer(continuous_fill=...).
Use this model to impute missing values in a vector with a fixed value learned from the non-missing values of the training vector.
For imputing missing values in tabular data, use FillImputer instead.
Training data
In MLJ or MLJBase, bind an instance model to data with
mach = machine(model, x)
where
- x: any abstract vector with element scitype Union{Missing, T} where T is a subtype of Continuous, Multiclass, OrderedFactor or Count; check scitype using scitype(x)
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- continuous_fill: function or other callable to determine value to be imputed in the case of Continuous (abstract float) data; default is to apply median after skipping missing values
- count_fill: function or other callable to determine value to be imputed in the case of Count (integer) data; default is to apply rounded median after skipping missing values
- finite_fill: function or other callable to determine value to be imputed in the case of Multiclass or OrderedFactor data (categorical vectors); default is to apply mode after skipping missing values
Operations
- transform(mach, xnew): return xnew with missing values imputed with the fill values learned when fitting mach
Fitted parameters
The fields of fitted_params(mach) are:
- filler: the fill value to be imputed in all new data
Examples
using MLJ
imputer = UnivariateFillImputer()
x_continuous = [1.0, 2.0, missing, 3.0]
x_multiclass = coerce(["y", "n", "y", missing, "y"], Multiclass)
x_count = [1, 1, 1, 2, missing, 3, 3]
mach = machine(imputer, x_continuous)
fit!(mach)
julia> fitted_params(mach)
(filler = 2.0,)
julia> transform(mach, [missing, missing, 101.0])
3-element Vector{Float64}:
2.0
2.0
101.0
mach2 = machine(imputer, x_multiclass) |> fit!
julia> transform(mach2, x_multiclass)
5-element CategoricalArray{String,1,UInt32}:
"y"
"n"
"y"
"y"
"y"
mach3 = machine(imputer, x_count) |> fit!
julia> transform(mach3, [missing, missing, 5])
3-element Vector{Int64}:
2
2
5
For imputing tabular data, use FillImputer.
MLJTransforms.UnivariateTimeTypeToContinuous — Type
A model type for constructing a single variable transformer that creates continuous representations of temporally typed data, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
UnivariateTimeTypeToContinuous = @load UnivariateTimeTypeToContinuous pkg=MLJTransforms
Do model = UnivariateTimeTypeToContinuous() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in UnivariateTimeTypeToContinuous(zero_time=...).
Use this model to convert vectors with a TimeType element type to vectors of Float64 type (Continuous element scitype).
Training data
In MLJ or MLJBase, bind an instance model to data with
mach = machine(model, x)
where
- x: any abstract vector whose element type is a subtype of Dates.TimeType
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- zero_time: the time that is to correspond to 0.0 under transformations, with the type coinciding with the training data element type. If unspecified, the earliest time encountered in training is used.
- step::Period=Hour(24): time interval to correspond to one unit under transformation
Operations
- transform(mach, xnew): apply the encoding inferred when mach was fit
Fitted parameters
fitted_params(mach).fitresult is the tuple (zero_time, step) actually used in transformations, which may differ from the user-specified hyper-parameters.
Example
using MLJ
using Dates
x = [Date(2001, 1, 1) + Day(i) for i in 0:4]
encoder = UnivariateTimeTypeToContinuous(zero_time=Date(2000, 1, 1),
step=Week(1))
mach = machine(encoder, x)
fit!(mach)
julia> transform(mach, x)
5-element Vector{Float64}:
52.285714285714285
52.42857142857143
52.57142857142857
52.714285714285715
52.857142857142854
MLJTransforms.OneHotEncoder — Type
A model type for constructing a one-hot encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
OneHotEncoder = @load OneHotEncoder pkg=MLJTransforms
Do model = OneHotEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in OneHotEncoder(features=...).
Use this model to one-hot encode the Multiclass and OrderedFactor features (columns) of some table, leaving other columns unchanged.
New data to be transformed may lack features present in the fit data, but no new features can be present.
Warning: This transformer assumes that levels(col) for any Multiclass or OrderedFactor column, col, is the same for training data and new data to be transformed.
To ensure all features are transformed into Continuous features, or dropped, use ContinuousEncoder instead.
Training data
In MLJ or MLJBase, bind an instance model to data with
mach = machine(model, X)
where
- X: any Tables.jl compatible table. Columns can be of mixed type but only those with element scitype Multiclass or OrderedFactor can be encoded. Check column scitypes with schema(X).
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- features: a vector of symbols (feature names). If empty (default) then all Multiclass and OrderedFactor features are encoded. Otherwise, encoding is further restricted to the specified features (ignore=false) or the unspecified features (ignore=true). This default behavior can be modified by the ordered_factor flag.
- ordered_factor=false: when true, OrderedFactor features are universally excluded
- drop_last=true: whether to drop the column corresponding to the final class of encoded features. For example, a three-class feature is spawned into three new features if drop_last=false, but just two features otherwise.
Fitted parameters
The fields of fitted_params(mach) are:
- all_features: names of all features encountered in training
- fitted_levels_given_feature: dictionary of the levels associated with each feature encoded, keyed on the feature name
- ref_name_pairs_given_feature: dictionary of pairs r => ftr (such as 0x00000001 => :grad__A) where r is a CategoricalArrays.jl reference integer representing a level, and ftr the corresponding new feature name; the dictionary is keyed on the names of features that are encoded
Report
The fields of report(mach) are:
- features_to_be_encoded: names of input features to be encoded
- new_features: names of all output features
Example
using MLJ
X = (name=categorical(["Danesh", "Lee", "Mary", "John"]),
grade=categorical(["A", "B", "A", "C"], ordered=true),
height=[1.85, 1.67, 1.5, 1.67],
n_devices=[3, 2, 4, 3])
julia> schema(X)
┌───────────┬──────────────────┐
│ names │ scitypes │
├───────────┼──────────────────┤
│ name │ Multiclass{4} │
│ grade │ OrderedFactor{3} │
│ height │ Continuous │
│ n_devices │ Count │
└───────────┴──────────────────┘
hot = OneHotEncoder(drop_last=true)
mach = fit!(machine(hot, X))
W = transform(mach, X)
julia> schema(W)
┌──────────────┬────────────┐
│ names │ scitypes │
├──────────────┼────────────┤
│ name__Danesh │ Continuous │
│ name__John │ Continuous │
│ name__Lee │ Continuous │
│ grade__A │ Continuous │
│ grade__B │ Continuous │
│ height │ Continuous │
│ n_devices │ Count │
└──────────────┴────────────┘
See also ContinuousEncoder.
MLJTransforms.ContinuousEncoder — Type
A model type for constructing a continuous encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
ContinuousEncoder = @load ContinuousEncoder pkg=MLJTransforms
Do model = ContinuousEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in ContinuousEncoder(drop_last=...).
Use this model to arrange all features (columns) of a table to have Continuous element scitype, by applying the following protocol to each feature ftr:
- If ftr is already Continuous, retain it.
- If ftr is Multiclass, one-hot encode it.
- If ftr is OrderedFactor, replace it with coerce(ftr, Continuous) (vector of floating point integers), unless one_hot_ordered_factors=true is specified, in which case one-hot encode it.
- If ftr is Count, replace it with coerce(ftr, Continuous).
- If ftr has some other element scitype, or was not observed in fitting the encoder, drop it from the table.
Warning: This transformer assumes that levels(col) for any Multiclass or OrderedFactor column, col, is the same for training data and new data to be transformed.
To selectively one-hot-encode categorical features (without dropping features) use OneHotEncoder instead.
Training data
In MLJ or MLJBase, bind an instance model to data with
mach = machine(model, X)
where
- X: any Tables.jl compatible table. Columns can be of mixed type but only those with element scitype Multiclass or OrderedFactor can be encoded. Check column scitypes with schema(X).
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- drop_last=true: whether to drop the column corresponding to the final class of one-hot encoded features. For example, a three-class feature is spawned into three new features if drop_last=false, but just two features otherwise.
- one_hot_ordered_factors=false: whether to one-hot any feature with OrderedFactor element scitype, or to instead coerce it directly to a (single) Continuous feature using the order
Fitted parameters
The fields of fitted_params(mach) are:
- features_to_keep: names of features that will not be dropped from the table
- one_hot_encoder: the OneHotEncoder model instance for handling the one-hot encoding
- one_hot_encoder_fitresult: the fitted parameters of the OneHotEncoder model
Report
The fields of report(mach) are:
- features_to_keep: names of input features that will not be dropped from the table
- new_features: names of all output features
Example
X = (name=categorical(["Danesh", "Lee", "Mary", "John"]),
grade=categorical(["A", "B", "A", "C"], ordered=true),
height=[1.85, 1.67, 1.5, 1.67],
n_devices=[3, 2, 4, 3],
comments=["the force", "be", "with you", "too"])
julia> schema(X)
┌───────────┬──────────────────┐
│ names │ scitypes │
├───────────┼──────────────────┤
│ name │ Multiclass{4} │
│ grade │ OrderedFactor{3} │
│ height │ Continuous │
│ n_devices │ Count │
│ comments │ Textual │
└───────────┴──────────────────┘
encoder = ContinuousEncoder(drop_last=true)
mach = fit!(machine(encoder, X))
W = transform(mach, X)
julia> schema(W)
┌──────────────┬────────────┐
│ names │ scitypes │
├──────────────┼────────────┤
│ name__Danesh │ Continuous │
│ name__John │ Continuous │
│ name__Lee │ Continuous │
│ grade │ Continuous │
│ height │ Continuous │
│ n_devices │ Continuous │
└──────────────┴────────────┘
julia> setdiff(schema(X).names, report(mach).features_to_keep) # dropped features
1-element Vector{Symbol}:
:comments
See also OneHotEncoder.
MLJTransforms.OrdinalEncoder — Type
A model type for constructing an ordinal encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
OrdinalEncoder = @load OrdinalEncoder pkg=MLJTransforms
Do model = OrdinalEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in OrdinalEncoder(features=...).
OrdinalEncoder implements ordinal encoding, which replaces the categorical values in the specified categorical features with integers (ordered arbitrarily). This will create an implicit ordering between categories, which may not be a proper modelling assumption.
Training data
In MLJ (or MLJBase) bind an instance model to data with
mach = machine(model, X)
Here:
- X is any table of input features (eg, a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- features=[]: A list of names of categorical features given as symbols to exclude or include from encoding, according to the value of ignore, or a single symbol (which is treated as a vector with one symbol), or a callable that returns true for features to be included/excluded
- ignore=true: Whether to exclude or include the features given in features
- ordered_factor=false: Whether to encode OrderedFactor or ignore them
- output_type: The numerical concrete type of the encoded features. Default is Float32.
Operations
- transform(mach, Xnew): Apply ordinal encoding to selected Multiclass or OrderedFactor features of Xnew specified by hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.
Fitted parameters
The fields of fitted_params(mach) are:
- index_given_feat_level: A dictionary that maps each level for each column in a subset of the categorical features of X into an integer.
Report
The fields of report(mach) are:
- encoded_features: The subset of the categorical features of X that were encoded
Examples
using MLJ
# Define categorical features
A = ["g", "b", "g", "r", "r",]
B = [1.0, 2.0, 3.0, 4.0, 5.0,]
C = ["f", "f", "f", "m", "f",]
D = [true, false, true, false, true,]
E = [1, 2, 3, 4, 5,]
# Combine into a named tuple
X = (A = A, B = B, C = C, D = D, E = E)
# Coerce A, C, D to multiclass and B to continuous and E to ordinal
X = coerce(X,
:A => Multiclass,
:B => Continuous,
:C => Multiclass,
:D => Multiclass,
:E => OrderedFactor,
)
# Check scitype coercion:
schema(X)
encoder = OrdinalEncoder(ordered_factor = false)
mach = fit!(machine(encoder, X))
Xnew = transform(mach, X)
julia> Xnew
(A = [2, 1, 2, 3, 3],
B = [1.0, 2.0, 3.0, 4.0, 5.0],
C = [1, 1, 1, 2, 1],
D = [2, 1, 2, 1, 2],
E = CategoricalArrays.CategoricalValue{Int64, UInt32}[1, 2, 3, 4, 5],)
See also TargetEncoder.
MLJTransforms.FrequencyEncoder — Type
A model type for constructing a frequency encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
FrequencyEncoder = @load FrequencyEncoder pkg=MLJTransforms
Do model = FrequencyEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in FrequencyEncoder(features=...).
FrequencyEncoder implements frequency encoding, which replaces the categorical values in the specified categorical features with their (normalized or raw) frequencies of occurrence in the dataset.
Training data
In MLJ (or MLJBase) bind an instance model to data with
mach = machine(model, X)
Here:
- X is any table of input features (eg, a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- features=[]: A list of names of categorical features given as symbols to exclude or include from encoding, according to the value of ignore, or a single symbol (which is treated as a vector with one symbol), or a callable that returns true for features to be included/excluded
- ignore=true: Whether to exclude or include the features given in features
- ordered_factor=false: Whether to encode OrderedFactor or ignore them
- normalize=false: Whether to use normalized frequencies that sum to 1 over category values or to use raw counts.
- output_type=Float32: The type of the output values. The default is Float32, but you can set it to Float64 or any other type that can hold the frequency values.
Operations
- transform(mach, Xnew): Apply frequency encoding to selected Multiclass or OrderedFactor features of Xnew specified by hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.
Fitted parameters
The fields of fitted_params(mach) are:
- statistic_given_feat_val: A dictionary that maps each level for each column in a subset of the categorical features of X into its frequency.
Report
The fields of report(mach) are:
- encoded_features: The subset of the categorical features of X that were encoded
Examples
using MLJ
# Define categorical features
A = ["g", "b", "g", "r", "r",]
B = [1.0, 2.0, 3.0, 4.0, 5.0,]
C = ["f", "f", "f", "m", "f",]
D = [true, false, true, false, true,]
E = [1, 2, 3, 4, 5,]
# Combine into a named tuple
X = (A = A, B = B, C = C, D = D, E = E)
# Coerce A, C, D to multiclass and B to continuous and E to ordinal
X = coerce(X,
:A => Multiclass,
:B => Continuous,
:C => Multiclass,
:D => Multiclass,
:E => OrderedFactor,
)
# Check scitype coercions:
schema(X)
encoder = FrequencyEncoder(ordered_factor = false, normalize=true)
mach = fit!(machine(encoder, X))
Xnew = transform(mach, X)
julia> Xnew
(A = [2, 1, 2, 2, 2],
B = [1.0, 2.0, 3.0, 4.0, 5.0],
C = [4, 4, 4, 1, 4],
D = [3, 2, 3, 2, 3],
E = CategoricalArrays.CategoricalValue{Int64, UInt32}[1, 2, 3, 4, 5],)
See also TargetEncoder.
MLJTransforms.TargetEncoder — Type
A model type for constructing a target encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
TargetEncoder = @load TargetEncoder pkg=MLJTransforms
Do model = TargetEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in TargetEncoder(features=...).
TargetEncoder implements target encoding as defined in [1] to encode categorical variables into continuous ones using statistics from the target variable.
Training data
In MLJ (or MLJBase) bind an instance model to data with
mach = machine(model, X, y)
Here:
- X is any table of input features (eg, a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.
- y is the target, which can be any AbstractVector whose element scitype is Continuous or Count for regression problems and Multiclass or OrderedFactor for classification problems; check the scitype with schema(y)
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- features=[]: A list of names of categorical features given as symbols to exclude or include from encoding, according to the value of ignore, or a single symbol (which is treated as a vector with one symbol), or a callable that returns true for features to be included/excluded
- ignore=true: Whether to exclude or include the features given in features
- ordered_factor=false: Whether to encode OrderedFactor or ignore them
- λ: Shrinkage hyper-parameter used to mix between posterior and prior statistics as described in [1]
- m: An integer hyper-parameter to compute shrinkage as described in [1]. If m=:auto then m will be computed using empirical Bayes estimation as described in [1]
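To make the roles of the prior, posterior and m concrete, here is a sketch of the shrinkage blend described in [1]; the numbers are illustrative, and this is not necessarily the exact MLJTransforms computation:
# the encoding for a level observed n times blends the level's target mean
# (posterior) with the overall target mean (prior):
prior = 0.5
posterior = 0.8
n, m = 10, 5
B = n / (n + m)                              # shrinkage factor in (0, 1)
encoding = B * posterior + (1 - B) * prior   # 0.7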
Operations
- transform(mach, Xnew): Apply target encoding to selected Multiclass or OrderedFactor features of Xnew specified by hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.
Fitted parameters
The fields of fitted_params(mach) are:
- task: Whether the task is Classification or Regression
- y_statistic_given_feat_level: A dictionary with the necessary statistics to encode each categorical feature. It maps each level in each categorical feature to a statistic computed over the target.
Report
The fields of report(mach) are:
- encoded_features: The subset of the categorical features of X that were encoded
Examples
using MLJ
# Define categorical features
A = ["g", "b", "g", "r", "r",]
B = [1.0, 2.0, 3.0, 4.0, 5.0,]
C = ["f", "f", "f", "m", "f",]
D = [true, false, true, false, true,]
E = [1, 2, 3, 4, 5,]
# Define the target variable
y = ["c1", "c2", "c3", "c1", "c2",]
# Combine into a named tuple
X = (A = A, B = B, C = C, D = D, E = E)
# Coerce A, C, D to multiclass and B to continuous and E to ordinal
X = coerce(X,
:A => Multiclass,
:B => Continuous,
:C => Multiclass,
:D => Multiclass,
:E => OrderedFactor,
)
y = coerce(y, Multiclass)
encoder = TargetEncoder(ordered_factor = false, lambda = 1.0, m = 0,)
mach = fit!(machine(encoder, X, y))
Xnew = transform(mach, X)
julia> schema(Xnew)
┌───────┬──────────────────┬─────────────────────────────────┐
│ names │ scitypes │ types │
├───────┼──────────────────┼─────────────────────────────────┤
│ A_1 │ Continuous │ Float64 │
│ A_2 │ Continuous │ Float64 │
│ A_3 │ Continuous │ Float64 │
│ B │ Continuous │ Float64 │
│ C_1 │ Continuous │ Float64 │
│ C_2 │ Continuous │ Float64 │
│ C_3 │ Continuous │ Float64 │
│ D_1 │ Continuous │ Float64 │
│ D_2 │ Continuous │ Float64 │
│ D_3 │ Continuous │ Float64 │
│ E │ OrderedFactor{5} │ CategoricalValue{Int64, UInt32} │
└───────┴──────────────────┴─────────────────────────────────┘
Reference
[1] Micci-Barreca, Daniele. “A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems” SIGKDD Explor. Newsl. 3, 1 (July 2001), 27–32.
See also OneHotEncoder.
MLJTransforms.ContrastEncoder — Type
A model type for constructing a contrast encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
ContrastEncoder = @load ContrastEncoder pkg=MLJTransforms
Do model = ContrastEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in ContrastEncoder(features=...).
ContrastEncoder implements the following contrast encoding methods for categorical features: dummy, sum, backward/forward difference, and Helmert coding. More generally, users can specify a custom contrast or hypothesis matrix, and each feature can be encoded using a different method.
Training data
In MLJ (or MLJBase) bind an instance model to data with
mach = machine(model, X)
Here:
- X is any table of input features (eg, a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- features=[]: A list of names of categorical features given as symbols to exclude or include from encoding, according to the value of ignore, or a single symbol (which is treated as a vector with one symbol), or a callable that returns true for features to be included/excluded
- mode=:dummy: The type of encoding to use. Can be one of :contrast, :dummy, :sum, :backward_diff, :forward_diff, :helmert or :hypothesis. If ignore=false (features to be encoded are listed explicitly in features), then this can be a vector of the same length as features to specify a different contrast encoding scheme for each feature
- buildmatrix=nothing: A function or other callable with signature buildmatrix(colname, k), where colname is the name of the feature and k is the number of its levels, and which returns a contrast or hypothesis matrix with row/column ordering consistent with the ordering of levels(col). Only relevant if mode is :contrast or :hypothesis.
- ignore=true: Whether to exclude or include the features given in features
- ordered_factor=false: Whether to encode OrderedFactor or ignore them
Operations
- transform(mach, Xnew): Apply contrast encoding to selected Multiclass or OrderedFactor features of Xnew specified by hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.
Fitted parameters
The fields of fitted_params(mach) are:
- vector_given_value_given_feature: A dictionary that maps each level for each column in a subset of the categorical features of X into its contrast encoding vector.
Report
The fields of report(mach) are:
- encoded_features: The subset of the categorical features of X that were encoded
Examples
using MLJ
# Define categorical dataset
X = (
name = categorical(["Ben", "John", "Mary", "John"]),
height = [1.85, 1.67, 1.5, 1.67],
favnum = categorical([7, 5, 10, 1]),
age = [23, 23, 14, 23],
)
# Check scitype coercions:
schema(X)
encoder = ContrastEncoder(
features = [:name, :favnum],
ignore = false,
mode = [:dummy, :helmert],
)
mach = fit!(machine(encoder, X))
Xnew = transform(mach, X)
julia> Xnew
(name_John = [1.0, 0.0, 0.0, 0.0],
name_Mary = [0.0, 1.0, 0.0, 1.0],
height = [1.85, 1.67, 1.5, 1.67],
favnum_5 = [0.0, 1.0, 0.0, -1.0],
favnum_7 = [2.0, -1.0, 0.0, -1.0],
favnum_10 = [-1.0, -1.0, 3.0, -1.0],
age = [23, 23, 14, 23],)
See also OneHotEncoder.
MLJTransforms.CardinalityReducer — Type
A model type for constructing a cardinality reducer, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
CardinalityReducer = @load CardinalityReducer pkg=MLJTransforms
Do model = CardinalityReducer() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in CardinalityReducer(features=...).
CardinalityReducer maps any level of a categorical feature that occurs with frequency < min_frequency into a new level (e.g., "Other"). This is useful when some categorical features have high cardinality and many levels are infrequent. This assumes that the categorical features have raw types that are in Union{AbstractString, Char, Number}.
Training data
In MLJ (or MLJBase) bind an instance model to data with
mach = machine(model, X)
Here:
- X is any table of input features (eg, a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- features=[]: A list of names of categorical features given as symbols to exclude or include from encoding, according to the value of ignore, or a single symbol (which is treated as a vector with one symbol), or a callable that returns true for features to be included/excluded
- ignore=true: Whether to exclude or include the features given in features
- ordered_factor=false: Whether to encode OrderedFactor or ignore them
- min_frequency::Real=3: Any level of a categorical feature that occurs with frequency < min_frequency will be mapped to a new level. Can be an integer or a float, which decides whether raw counts or normalized frequencies are used.
- label_for_infrequent::Dict{<:Type, <:Any} = Dict(AbstractString => "Other", Char => 'O'): A dictionary whose possible keys are the types AbstractString, Char and Number, and where each value signifies the new level to map into, given a column's raw super type. By default, if the raw type of the column subtypes AbstractString then the new value is "Other"; if the raw type subtypes Char then the new value is 'O'; and if the raw type subtypes Number then the new value is the lowest value in the column minus 1.
Operations
- transform(mach, Xnew): Apply cardinality reduction to selected Multiclass or OrderedFactor features of Xnew specified by hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.
Fitted parameters
The fields of fitted_params(mach) are:
- new_cat_given_col_val: A dictionary that maps each level in a categorical feature to a new level (either itself or the new level specified in label_for_infrequent)
Report
The fields of report(mach) are:
- encoded_features: The subset of the categorical features of X that were encoded
Examples
import StatsBase.proportionmap
using MLJ
# Define categorical features
A = [ ["a" for i in 1:100]..., "b", "b", "b", "c", "d"]
B = [ [0 for i in 1:100]..., 1, 2, 3, 4, 4]
# Combine into a named tuple
X = (A = A, B = B)
# Coerce A and B to Multiclass
X = coerce(X,
:A => Multiclass,
:B => Multiclass
)
encoder = CardinalityReducer(ordered_factor = false, min_frequency=3)
mach = fit!(machine(encoder, X))
Xnew = transform(mach, X)
julia> proportionmap(Xnew.A)
Dict{CategoricalArrays.CategoricalValue{String, UInt32}, Float64} with 3 entries:
"Other" => 0.0190476
"b" => 0.0285714
"a" => 0.952381
julia> proportionmap(Xnew.B)
Dict{CategoricalArrays.CategoricalValue{Int64, UInt32}, Float64} with 2 entries:
0 => 0.952381
-1 => 0.047619
See also FrequencyEncoder.
MLJTransforms.MissingnessEncoder — Type
A model type for constructing a missingness encoder, based on MLJTransforms.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
MissingnessEncoder = @load MissingnessEncoder pkg=MLJTransforms
Do model = MissingnessEncoder() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in MissingnessEncoder(features=...).
MissingnessEncoder maps any missing level of a categorical feature into a new level (e.g., "Missing"). In this way, missingness will be treated as a new level by any subsequent model. This assumes that the categorical features have raw types that are in Char, AbstractString, and Number.
Training data
In MLJ (or MLJBase) bind an instance model to data with
mach = machine(model, X)
Here:
- X is any table of input features (eg, a DataFrame). Features to be transformed must have element scitype Multiclass or OrderedFactor. Use schema(X) to check scitypes.
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- features=[]: A list of names of categorical features given as symbols to exclude or include from encoding, according to the value of ignore, or a single symbol (which is treated as a vector with one symbol), or a callable that returns true for features to be included/excluded
- ignore=true: Whether to exclude or include the features given in features
- ordered_factor=false: Whether to encode OrderedFactor or ignore them
- label_for_missing::Dict{<:Type, <:Any} = Dict(AbstractString => "missing", Char => 'm'): A dictionary whose possible keys are the types AbstractString, Char and Number, and where each value signifies the new level to map missing values into, given a column's raw super type. By default, if the raw type of the column subtypes AbstractString then missing values will be replaced with "missing"; if the raw type subtypes Char then the new value is 'm'; and if the raw type subtypes Number then the new value is the lowest value in the column minus 1.
Operations
- transform(mach, Xnew): Apply missingness encoding to selected Multiclass or OrderedFactor features of Xnew specified by hyper-parameters, and return the new table. Features that are neither Multiclass nor OrderedFactor are always left unchanged.
Fitted parameters
The fields of fitted_params(mach) are:
- label_for_missing_given_feature: A dictionary that, for each column, maps missing into some value according to label_for_missing
Report
The fields of report(mach) are:
- encoded_features: The subset of the categorical features of X that were encoded
Examples
import StatsBase.proportionmap
using MLJ
# Define a table with missing values
Xm = (
A = categorical(["Ben", "John", missing, missing, "Mary", "John", missing]),
B = [1.85, 1.67, missing, missing, 1.5, 1.67, missing],
C= categorical([7, 5, missing, missing, 10, 0, missing]),
D = [23, 23, 44, 66, 14, 23, 11],
E = categorical([missing, 'g', 'r', missing, 'r', 'g', 'p'])
)
encoder = MissingnessEncoder()
mach = fit!(machine(encoder, Xm))
Xnew = transform(mach, Xm)
julia> Xnew
(A = ["Ben", "John", "missing", "missing", "Mary", "John", "missing"],
B = Union{Missing, Float64}[1.85, 1.67, missing, missing, 1.5, 1.67, missing],
C = [7, 5, -1, -1, 10, 0, -1],
D = [23, 23, 44, 66, 14, 23, 11],
E = ['m', 'g', 'r', 'm', 'r', 'g', 'p'],)
See also CardinalityReducer.