Working with Categorical Data
Scientific types for discrete data
Recall that models articulate their data requirements using scientific types (see Getting Started or the ScientificTypes.jl documentation). There are three scientific types discrete data can have: Count, OrderedFactor and Multiclass.
Count data
In MLJ you cannot use integers to represent (finite) categorical data. Integers are reserved for discrete data you want interpreted as Count <: Infinite:
scitype([1, 4, 5, 6])
AbstractVector{Count} (alias for AbstractArray{Count, 1})
The Count scientific type includes things like the number of phone calls, or city populations, and other "frequency" data of a generally unbounded nature.
That said, you may have data that is theoretically Count, but which you coerce to OrderedFactor to enable the use of more models, trusting to your knowledge of how those models work to inform an appropriate interpretation.
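As a minimal sketch of that coercion, assuming MLJ is loaded (the variable names are illustrative only), the same integers can be viewed first as Count and then as OrderedFactor:

```julia
using MLJ  # provides coerce, scitype, levels and the scientific types

counts = [1, 4, 5, 6]
scitype(counts)                       # AbstractVector{Count}

# Reinterpret the same integers as ordered categorical data
ordered = coerce(counts, OrderedFactor)
scitype(ordered)                      # AbstractVector{OrderedFactor{4}}
levels(ordered)                       # levels follow the natural integer order
```

Whether such a reinterpretation is sensible depends on the model consuming the data; the coercion itself only changes how MLJ is told to interpret the values.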
OrderedFactor and Multiclass data
Other integer data, such as the number of an animal's legs, or the number of rooms in homes, is generally coerced to OrderedFactor <: Finite. The other categorical scientific type is Multiclass <: Finite, which is for unordered categorical data. Coercing data to one of these two forms is discussed under Detecting and coercing improperly represented categorical data below.
Binary data
There is no separate scientific type for binary data. Binary data is either OrderedFactor{2} if ordered, or Multiclass{2} otherwise. Data with type OrderedFactor{2} is considered to have an intrinsic "positive" class, e.g., the outcome of a medical test, or the "pass/fail" outcome of an exam. MLJ measures, such as true_positive, assume the second class in the ordering is the "positive" class. Inspecting and changing order are discussed in the next section.
If data has type Bool it is considered Count data (as Bool <: Integer) and, generally, users will want to coerce such data to Multiclass or OrderedFactor.
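A short sketch of the Bool case, assuming MLJ is loaded (the vector below is made up for illustration):

```julia
using MLJ

likes_soup = [true, false, false, true]
scitype(likes_soup)                   # AbstractVector{Count}, since Bool <: Integer

# Coerce to an ordered binary factor
y = coerce(likes_soup, OrderedFactor)
scitype(y)                            # AbstractVector{OrderedFactor{2}}
levels(y)                             # false, then true: `true` is the "positive" class
```

Because false sorts before true, the default ordering already places true second, which is where MLJ measures expect the "positive" class.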
Detecting and coercing improperly represented categorical data
One inspects the scientific type of data using scitype as shown above. To inspect all column scientific types in a table simultaneously, use schema. (The scitype(X) of a table X contains a condensed form of this information used in type dispatch; see here.)
import DataFrames: DataFrame
X = DataFrame(
    name = ["Siri", "Robo", "Alexa", "Cortana"],
    gender = ["male", "male", "Female", "female"],
    likes_soup = [true, false, false, true],
    height = [152, missing, 148, 163],
    rating = [2, 5, 2, 1],
    outcome = ["rejected", "accepted", "accepted", "rejected"],
)
schema(X)
┌────────────┬───────────────────────┬───────────────────────┐
│ names      │ scitypes              │ types                 │
├────────────┼───────────────────────┼───────────────────────┤
│ name       │ Textual               │ String                │
│ gender     │ Textual               │ String                │
│ likes_soup │ Count                 │ Bool                  │
│ height     │ Union{Missing, Count} │ Union{Missing, Int64} │
│ rating     │ Count                 │ Int64                 │
│ outcome    │ Textual               │ String                │
└────────────┴───────────────────────┴───────────────────────┘
Coercing a single column:
X.outcome = coerce(X.outcome, OrderedFactor)
4-element CategoricalArray{String,1,UInt32}:
"rejected"
"accepted"
"accepted"
"rejected"
The machine type of the result is a CategoricalArray. For more on this type see Under the hood: CategoricalValue and CategoricalArray below.
Inspecting the order of the levels:
levels(X.outcome)
2-element Vector{String}:
"accepted"
"rejected"
Since we wish to regard "accepted" as the positive class, it should appear second, which we correct with the levels! function:
levels!(X.outcome, ["rejected", "accepted"])
levels(X.outcome)
2-element Vector{String}:
"rejected"
"accepted"
The order of levels should generally be changed early in your data science workflow and then not again. Similar remarks apply to adding levels (which is possible; see the CategoricalArrays.jl documentation). MLJ supervised and unsupervised models assume levels and their order do not change.
Coercing all remaining types simultaneously:
Xnew = coerce(X, :gender => Multiclass,
                 :likes_soup => OrderedFactor,
                 :height => Continuous,
                 :rating => OrderedFactor)
schema(Xnew)
┌────────────┬────────────────────────────┬──────────────────────────────────┐
│ names      │ scitypes                   │ types                            │
├────────────┼────────────────────────────┼──────────────────────────────────┤
│ name       │ Textual                    │ String                           │
│ gender     │ Multiclass{3}              │ CategoricalValue{String, UInt32} │
│ likes_soup │ OrderedFactor{2}           │ CategoricalValue{Bool, UInt32}   │
│ height     │ Union{Missing, Continuous} │ Union{Missing, Float64}          │
│ rating     │ OrderedFactor{3}           │ CategoricalValue{Int64, UInt32}  │
│ outcome    │ OrderedFactor{2}           │ CategoricalValue{String, UInt32} │
└────────────┴────────────────────────────┴──────────────────────────────────┘
For DataFrames there is also in-place coercion, using coerce!.
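A minimal sketch of in-place coercion, assuming MLJ and DataFrames are installed; the small table df is invented for the example:

```julia
using MLJ
import DataFrames: DataFrame

df = DataFrame(rating = [2, 5, 2, 1], height = [152, 160, 148, 163])

# Mutates `df` instead of returning a coerced copy
coerce!(df, :rating => OrderedFactor, :height => Continuous)
schema(df)
```

After the call, df.rating is an ordered categorical vector and df.height holds floating point values, so no reassignment such as df = coerce(df, ...) is needed.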
Tracking all levels
The key property of vectors of scientific type OrderedFactor and Multiclass is that the pool of all levels is not lost when separating out one or more elements:
v = Xnew.rating
4-element CategoricalArray{Int64,1,UInt32}:
2
5
2
1
levels(v)
3-element Vector{Int64}:
1
2
5
levels(v[1:2])
3-element Vector{Int64}:
1
2
5
levels(v[2])
3-element Vector{Int64}:
1
2
5
By tracking all classes in this way, MLJ avoids common pain points around categorical data, such as evaluating models on an evaluation set, only to crash your code because classes appear there which were not seen during training.
By drawing test, validation and training data from a common data structure (as described in Getting Started, for example) one ensures that all possible classes of categorical variables are tracked at all times. However, this does not mitigate problems with new production data, if categorical features there are missing classes or contain previously unseen classes.
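The point about a common data structure can be sketched as follows, assuming MLJ is loaded; the toy target y is illustrative, and partition is used without shuffling so the split is deterministic:

```julia
using MLJ

y = coerce(["a", "a", "a", "b"], Multiclass)

# Split the row indices; the first half is used for training
train, test = partition(eachindex(y), 0.5)

# "b" does not occur among the training rows, yet its level is still tracked
levels(y[train]) == levels(y)   # true
```

Had the training and test vectors been coerced separately, the training vector would only know about the class "a".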
New or missing levels in production data
Unpredictable behavior may result whenever Finite categorical data presents in a production set with different classes (levels) from those presented during training.
Consider, for example, the following naive workflow:
# train a one-hot encoder on some data:
x = coerce(["black", "white", "white", "black"], Multiclass)
X = DataFrame(x=x)
model = OneHotEncoder()
mach = machine(model, X) |> fit!
# one-hot encode new data with missing classes:
xproduction = coerce(["white", "white"], Multiclass)
Xproduction = DataFrame(x=xproduction)
Xproduction == X[2:3,:]
true
So far, so good. But the following operation throws an error:
julia> transform(mach, Xproduction) == transform(mach, X[2:3,:])
ERROR: Found category level mismatch in feature `x`. Consider using `levels!` to ensure fitted and transforming features have the same category levels.
The problem here is that levels(X.x) and levels(Xproduction.x) are different:
levels(X.x)
2-element Vector{String}:
"black"
"white"
levels(Xproduction.x)
1-element Vector{String}:
"white"
This could be anticipated by the fact that the training and production data have different schema:
schema(X)
┌───────┬───────────────┬──────────────────────────────────┐
│ names │ scitypes      │ types                            │
├───────┼───────────────┼──────────────────────────────────┤
│ x     │ Multiclass{2} │ CategoricalValue{String, UInt32} │
└───────┴───────────────┴──────────────────────────────────┘
schema(Xproduction)
┌───────┬───────────────┬──────────────────────────────────┐
│ names │ scitypes      │ types                            │
├───────┼───────────────┼──────────────────────────────────┤
│ x     │ Multiclass{1} │ CategoricalValue{String, UInt32} │
└───────┴───────────────┴──────────────────────────────────┘
One fix is to manually correct the levels of the production data:
levels!(Xproduction.x, levels(x))
transform(mach, Xproduction) == transform(mach, X[2:3,:])
true
Another solution is to pack all production data with dummy rows based on the training data (subsequently dropped) to ensure there are no missing classes. Currently, MLJ contains no general tooling to check and fix categorical levels in production data (although one can check that training data and production data have the same schema, to ensure the number of classes in categorical data is consistent).
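Since no such tooling ships with MLJ, a check has to be written by hand. The following is only a sketch: mismatched_levels is a hypothetical helper (not part of MLJ or CategoricalArrays.jl), and it assumes both tables are DataFrames sharing the same column names.

```julia
using DataFrames, CategoricalArrays

# Hypothetical helper, not part of MLJ: names of the categorical columns of
# `Xtrain` whose levels differ from those of the same column in `Xprod`.
function mismatched_levels(Xtrain, Xprod)
    categorical_cols = [c for c in names(Xtrain) if Xtrain[!, c] isa CategoricalArray]
    return [c for c in categorical_cols if levels(Xprod[!, c]) != levels(Xtrain[!, c])]
end

Xtrain = DataFrame(x = categorical(["black", "white", "white", "black"]))
Xprod = DataFrame(x = categorical(["white", "white"]))

mismatched_levels(Xtrain, Xprod)        # reports column "x"

# Repair with levels!, as in the manual fix, then check again
levels!(Xprod.x, levels(Xtrain.x))
mismatched_levels(Xtrain, Xprod)        # empty
```

Note that this only detects a mismatch. Production data containing classes never seen in training cannot be repaired with levels! alone and needs a modelling decision.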
Extracting an integer representation of Finite data
Occasionally, you may really want an integer representation of data that currently has scitype Finite. For example, you are a developer wrapping an algorithm from an external package for use in MLJ, and that algorithm uses integer representations. Use the int method for this purpose, and use decoder to construct decoders for reversing the transformation:
v = coerce(["one", "two", "three", "one"], OrderedFactor);
levels!(v, ["one", "two", "three"]);
v_int = int(v)
4-element Vector{UInt32}:
0x00000001
0x00000002
0x00000003
0x00000001
d = decoder(v); # or decoder(v[1])
d.(v_int)
4-element CategoricalArray{String,1,UInt32}:
"one"
"two"
"three"
"one"
Under the hood: CategoricalValue and CategoricalArray
In MLJ the objects with OrderedFactor or Multiclass scientific type have machine type CategoricalValue, from the CategoricalArrays.jl package. In some sense CategoricalValues are an implementation detail users can ignore for the most part, as shown above. However, you may want some basic understanding of these types, and those implementing MLJ's model interface for new algorithms will have to understand them. For the complete API, see the CategoricalArrays.jl documentation. Here are the basics:
To construct an OrderedFactor or Multiclass vector directly from raw labels, one uses categorical:
v = categorical(['A', 'B', 'A', 'A', 'C'])
typeof(v)
CategoricalVector{Char, UInt32, Char, CategoricalValue{Char, UInt32}, Union{}} (alias for CategoricalArray{Char, 1, UInt32, Char, CategoricalValue{Char, UInt32}, Union{}})
(Equivalent to the idiomatically MLJ v = coerce(['A', 'B', 'A', 'A', 'C'], Multiclass).)
scitype(v)
AbstractVector{Multiclass{3}} (alias for AbstractArray{Multiclass{3}, 1})
v = categorical(['A', 'B', 'A', 'A', 'C'], ordered=true, compress=true)
5-element CategoricalArray{Char,1,UInt8}:
'A'
'B'
'A'
'A'
'C'
scitype(v)
AbstractVector{OrderedFactor{3}} (alias for AbstractArray{OrderedFactor{3}, 1})
When you index a CategoricalVector you don't get a raw label, but instead an instance of CategoricalValue. As explained above, this value knows the complete pool of levels from the vector from which it came. Use get(val) to extract the raw label from a value val (in recent releases of CategoricalArrays.jl, get is deprecated for this purpose in favor of unwrap(val)).
Despite the distinction that exists between a value (element) and a label, the two are the same, from the point of view of == and in:
v[1] == 'A' # true
'A' in v # true
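The distinction between element and raw label can be made concrete with a short sketch. It assumes a recent release of CategoricalArrays.jl, in which unwrap is the accessor for the raw label:

```julia
using CategoricalArrays

v = categorical(['A', 'B', 'A', 'A', 'C'])
val = v[1]

val isa CategoricalValue     # true: indexing returns a wrapped element
raw = unwrap(val)            # the underlying label
raw isa Char                 # true
levels(val) == levels(v)     # true: the element still knows the full pool
```

The wrapped element compares equal to its label, but only the wrapped form carries the pool of levels.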
Probabilistic predictions of categorical data
Recall from Getting Started that probabilistic classifiers ordinarily predict UnivariateFinite distributions, not raw probabilities (which are instead accessed using the pdf method). Here's how to construct such a distribution yourself:
v = coerce(["yes", "no", "yes", "yes", "maybe"], Multiclass)
d = UnivariateFinite([v[2], v[1]], [0.9, 0.1])
UnivariateFinite{Multiclass{3}}(no=>0.9, yes=>0.1)
Or, equivalently,
d = UnivariateFinite(["no", "yes"], [0.9, 0.1], pool=v)
UnivariateFinite{Multiclass{3}}(no=>0.9, yes=>0.1)
This distribution tracks all levels, not just the ones to which you have assigned probabilities:
pdf(d, "maybe")
0.0
However, pdf(d, "dunno") will throw an error.
You can declare pool=missing, but then "maybe" will not be tracked:
d = UnivariateFinite(["no", "yes"], [0.9, 0.1], pool=missing)
levels(d)
2-element Vector{String}:
"no"
"yes"
To construct a whole vector of UnivariateFinite distributions, simply give the constructor a matrix of probabilities:
yes_probs = rand(5)
probs = hcat(1 .- yes_probs, yes_probs)
d_vec = UnivariateFinite(["no", "yes"], probs, pool=v)
5-element UnivariateFiniteVector{Multiclass{3}, String, UInt32, Float64}:
UnivariateFinite{Multiclass{3}}(no=>0.492, yes=>0.508)
UnivariateFinite{Multiclass{3}}(no=>0.801, yes=>0.199)
UnivariateFinite{Multiclass{3}}(no=>0.24, yes=>0.76)
UnivariateFinite{Multiclass{3}}(no=>0.429, yes=>0.571)
UnivariateFinite{Multiclass{3}}(no=>0.77, yes=>0.23)
Or, equivalently:
d_vec = UnivariateFinite(["no", "yes"], yes_probs, augment=true, pool=v)
For more options, see UnivariateFinite.
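Once such a vector exists, probabilities and point predictions are usually extracted by broadcasting. A minimal sketch, assuming MLJ is loaded and using fixed (rather than random) probabilities so the results are reproducible:

```julia
using MLJ

v = coerce(["yes", "no", "yes", "yes", "maybe"], Multiclass)

# Probabilities of "yes"; the "no" probabilities are filled in by augment=true
yes_probs = [0.1, 0.6, 0.9]
d_vec = UnivariateFinite(["no", "yes"], yes_probs, augment=true, pool=v)

pdf.(d_vec, "yes")      # probability of "yes" for each distribution
pdf.(d_vec, "maybe")    # all zero: in the pool, but outside the support
mode.(d_vec)            # most likely class for each distribution
```

The modes are categorical elements rather than raw strings, so they keep track of the full pool, including "maybe".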
Reference
CategoricalDistributions.UnivariateFinite — Type
UnivariateFinite(support,
                 probs;
                 pool=nothing,
                 augment=false,
                 ordered=false)
Construct a discrete univariate distribution whose finite support is the elements of the vector support, and whose corresponding probabilities are elements of the vector probs. Alternatively, construct an abstract array of UnivariateFinite distributions by choosing probs to be an array of one higher dimension than the array generated.
Here the word "probabilities" is an abuse of terminology as there is no requirement that the probabilities actually sum to one. The only requirement is that the probabilities have a common type T for which zero(T) is defined. In particular, UnivariateFinite objects implement arbitrary non-negative, signed, or complex measures over finite sets of labelled points. A UnivariateFinite distribution will be a bona fide probability measure when constructed using the augment=true option (see below) or when fit to data. And the probabilities of a UnivariateFinite object d must be non-negative, with a non-zero sum, for rand(d) to be defined and interpretable.
Unless pool is specified, support should have type AbstractVector{<:CategoricalValue} and all elements are assumed to share the same categorical pool, which may be larger than support.
Important. All levels of the common pool have associated probabilities, not just those in the specified support. However, the probabilities of levels outside support are always zero (see example below).
If probs is a matrix, it should have a column for each class in support (or one less, if augment=true). More generally, probs will be an array whose size is of the form (n1, n2, ..., nk, c), where c = length(support) (or one less, if augment=true) and the constructor then returns an array of UnivariateFinite distributions of size (n1, n2, ..., nk).
using CategoricalDistributions, CategoricalArrays, Distributions
samples = categorical(['x', 'x', 'y', 'x', 'z'])
julia> Distributions.fit(UnivariateFinite, samples)
UnivariateFinite{Multiclass{3}}
┌ ┐
x ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 0.6
y ┤■■■■■■■■■■■■ 0.2
z ┤■■■■■■■■■■■■ 0.2
└ ┘
julia> d = UnivariateFinite([samples[1], samples[end]], [0.1, 0.9])
UnivariateFinite{Multiclass{3}}
┌ ┐
x ┤■■■■ 0.1
z ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 0.9
└ ┘
julia> rand(d, 3)
3-element Array{Any,1}:
CategoricalValue{Char,UInt32} 'z'
CategoricalValue{Char,UInt32} 'z'
CategoricalValue{Char,UInt32} 'z'
julia> levels(samples)
3-element Vector{Char}:
'x'
'y'
'z'
julia> pdf(d, 'y')
0.0
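The general array form, with probs of size (n1, n2, ..., nk, c), can be sketched for a 2×3 grid of two-class distributions. The example assumes MLJ is loaded (it re-exports UnivariateFinite and pdf) and uses pool=missing so that raw labels can be given as support:

```julia
using MLJ

# Probabilities of 'x' on a 2×3 grid; 'y' gets the complement
p_x = [0.1 0.2 0.3;
       0.4 0.5 0.6]
probs = cat(p_x, 1 .- p_x, dims=3)    # size (2, 3, 2): last dimension indexes classes

d = UnivariateFinite(['x', 'y'], probs, pool=missing)
size(d)               # (2, 3)
pdf(d[2, 3], 'y')     # complement of p_x[2, 3]
```

Each element d[i, j] is an ordinary UnivariateFinite distribution built from the slice probs[i, j, :].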
Specifying a pool
Alternatively, support may be a list of raw (non-categorical) elements if pool is:
- some CategoricalArray, CategoricalValue or CategoricalPool, such that support is a subset of levels(pool)
- missing, in which case a new categorical pool is created which has support as its only levels.
In the last case, specify ordered=true if the pool is to be considered ordered.
julia> UnivariateFinite(['x', 'z'], [0.1, 0.9], pool=missing, ordered=true)
UnivariateFinite{OrderedFactor{2}}
┌ ┐
x ┤■■■■ 0.1
z ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 0.9
└ ┘
samples = categorical(['x', 'x', 'y', 'x', 'z'])
julia> d = UnivariateFinite(['x', 'z'], [0.1, 0.9], pool=samples)
┌ ┐
x ┤■■■■ 0.1
z ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 0.9
└ ┘
julia> pdf(d, 'y') # allowed as `'y' in levels(samples)`
0.0
v = categorical(['x', 'x', 'y', 'x', 'z', 'w'])
probs = rand(100, 3)
probs = probs ./ sum(probs, dims=2)
julia> d1 = UnivariateFinite(['x', 'y', 'z'], probs, pool=v)
100-element UnivariateFiniteVector{Multiclass{4},Char,UInt32,Float64}:
UnivariateFinite{Multiclass{4}}(x=>0.194, y=>0.3, z=>0.505)
UnivariateFinite{Multiclass{4}}(x=>0.727, y=>0.234, z=>0.0391)
UnivariateFinite{Multiclass{4}}(x=>0.674, y=>0.00535, z=>0.321)
⋮
UnivariateFinite{Multiclass{4}}(x=>0.292, y=>0.339, z=>0.369)
Probability augmentation
If augment=true the provided array is augmented by inserting appropriate elements ahead of those provided, along the last dimension of the array. This means the user only provides probabilities for the classes c2, c3, ..., cn. The class c1 probabilities are chosen so that each UnivariateFinite distribution in the returned array is a bona fide probability distribution.
julia> UnivariateFinite([0.1, 0.2, 0.3], augment=true, pool=missing)
3-element UnivariateFiniteArray{Multiclass{2}, String, UInt8, Float64, 1}:
UnivariateFinite{Multiclass{2}}(class_1=>0.9, class_2=>0.1)
UnivariateFinite{Multiclass{2}}(class_1=>0.8, class_2=>0.2)
UnivariateFinite{Multiclass{2}}(class_1=>0.7, class_2=>0.3)
d2 = UnivariateFinite(['x', 'y', 'z'], probs[:, 2:end], augment=true, pool=v)
julia> pdf(d1, levels(v)) ≈ pdf(d2, levels(v))
true
UnivariateFinite(prob_given_class; pool=nothing, ordered=false)
Construct a discrete univariate distribution whose finite support is the set of keys of the provided dictionary, prob_given_class, and whose values specify the corresponding probabilities.
The type requirements on the keys of the dictionary are the same as the elements of support given above with this exception: if non-categorical elements (raw labels) are used as keys, then pool=... must be specified and cannot be missing.
If the values (probabilities) are arrays instead of scalars, then an abstract array of UnivariateFinite elements is created, with the same size as the array.
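A minimal sketch of the dictionary form, assuming MLJ is loaded. The keys here are categorical elements taken from an existing vector, so no pool needs to be specified:

```julia
using MLJ

v = coerce(["yes", "no", "yes", "yes", "maybe"], Multiclass)

# v[2] is the element "no" and v[1] is the element "yes"
d = UnivariateFinite(Dict(v[2] => 0.9, v[1] => 0.1))

pdf(d, "no")       # 0.9
pdf(d, "maybe")    # 0.0: "maybe" is in the pool but was given no probability
```

As with the vector form, the distribution tracks every level of the pool shared by the keys.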