Categorical Encoders Performance: A Classic Comparison
Julia version is assumed to be 1.10.*
This demonstration is available as a Jupyter notebook or Julia script (along with the dataset) here.
This tutorial compares four fundamental categorical encoding approaches, OneHot, Frequency, Target, and Ordinal, each paired with an SVM classifier, on a milk quality dataset.
using Pkg;
Pkg.activate(@__DIR__);
using MLJ, LIBSVM, DataFrames, ScientificTypes
using Random, CSV, Plots
Activating project at `~/Documents/GitHub/MLJTransforms/docs/src/tutorials/classic_comparison`
Load and Prepare Data
Load the milk quality dataset which contains categorical features for quality prediction:
df = CSV.read("./milknew.csv", DataFrame)
first(df, 5)
| Row | pH | Temprature | Taste | Odor | Fat | Turbidity | Colour | Grade |
|---|---|---|---|---|---|---|---|---|
| | Float64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | String7 |
| 1 | 6.6 | 35 | 1 | 0 | 1 | 0 | 254 | high |
| 2 | 6.6 | 36 | 0 | 1 | 0 | 1 | 253 | high |
| 3 | 8.5 | 70 | 1 | 1 | 1 | 1 | 246 | low |
| 4 | 9.5 | 34 | 1 | 1 | 0 | 1 | 255 | low |
| 5 | 6.6 | 37 | 0 | 0 | 0 | 0 | 255 | medium |
Check the scientific types to understand our data structure:
ScientificTypes.schema(df)
┌────────────┬────────────┬─────────┐
│ names │ scitypes │ types │
├────────────┼────────────┼─────────┤
│ pH │ Continuous │ Float64 │
│ Temprature │ Count │ Int64 │
│ Taste │ Count │ Int64 │
│ Odor │ Count │ Int64 │
│ Fat │ Count │ Int64 │
│ Turbidity │ Count │ Int64 │
│ Colour │ Count │ Int64 │
│ Grade │ Textual │ String7 │
└────────────┴────────────┴─────────┘
Automatically coerce columns with few unique values to categorical:
df = coerce(df, autotype(df, :few_to_finite))
schema(df)
┌────────────┬───────────────────┬───────────────────────────────────┐
│ names │ scitypes │ types │
├────────────┼───────────────────┼───────────────────────────────────┤
│ pH │ OrderedFactor{16} │ CategoricalValue{Float64, UInt32} │
│ Temprature │ OrderedFactor{17} │ CategoricalValue{Int64, UInt32} │
│ Taste │ OrderedFactor{2} │ CategoricalValue{Int64, UInt32} │
│ Odor │ OrderedFactor{2} │ CategoricalValue{Int64, UInt32} │
│ Fat │ OrderedFactor{2} │ CategoricalValue{Int64, UInt32} │
│ Turbidity │ OrderedFactor{2} │ CategoricalValue{Int64, UInt32} │
│ Colour │ OrderedFactor{9} │ CategoricalValue{Int64, UInt32} │
│ Grade │ Multiclass{3} │ CategoricalValue{String7, UInt32} │
└────────────┴───────────────────┴───────────────────────────────────┘
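For intuition, the `:few_to_finite` rule suggests a finite scitype for any column whose number of unique values is small relative to the number of rows, and leaves the rest alone. A minimal sketch on a hypothetical toy table (the names `toy`, `a`, and `b` are illustrative, not part of this dataset):
toy = DataFrame(a = repeat([1, 2, 3], 20), b = collect(1.0:60.0))  # :a has 3 unique values in 60 rows; :b has 60
autotype(toy, :few_to_finite)  # should suggest e.g. :a => OrderedFactor; :b (all values unique) is left out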
Split Data
Separate features from target and create train/test split:
y, X = unpack(df, ==(:Grade); rng = 123)
train, test = partition(eachindex(y), 0.9, shuffle = true, rng = 100);
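As an optional sanity check (a sketch, not part of the original workflow), we can inspect the split sizes and the class balance of the target:
length(train), length(test)  # roughly 90% / 10% of the rows
combine(groupby(DataFrame(grade = y), :grade), nrow => :count)  # observations per class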
Setup Encoders and Classifier
Load the required models and create different encoding strategies:
SVC = @load SVC pkg = LIBSVM verbosity = 0
MLJLIBSVMInterface.SVC
Encoding Strategies Explained:
- OneHot: creates a binary indicator column for each category (sparse, interpretable)
- Frequency: replaces each category with its frequency of occurrence (a count, or a proportion when normalized)
- Target: replaces each category with a (shrunk) statistic of the target within that category
- Ordinal: assigns an integer code to each category (implicitly assumes an ordering)
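To build intuition, here is a minimal sketch of what two of these encoders produce on a toy column (names are illustrative; exact output types may differ slightly):
Xtoy = coerce(DataFrame(c = ["a", "b", "a", "a", "c"]), :c => Multiclass)
freq_mach = fit!(machine(FrequencyEncoder(normalize = false), Xtoy), verbosity = 0)
DataFrame(transform(freq_mach, Xtoy))  # :c replaced by occurrence counts: 3, 1, 3, 3, 1
ord_mach = fit!(machine(OrdinalEncoder(), Xtoy), verbosity = 0)
DataFrame(transform(ord_mach, Xtoy))   # :c replaced by integer codes, e.g. 1, 2, 1, 1, 3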
onehot_model = OneHotEncoder(drop_last = true, ordered_factor = true)    # drop one indicator column per feature to avoid redundancy
freq_model = FrequencyEncoder(normalize = false, ordered_factor = true)  # use raw counts rather than proportions
target_model = TargetEncoder(lambda = 0.9, m = 5, ordered_factor = true) # lambda, m control shrinkage toward the overall statistic
ordinal_model = OrdinalEncoder(ordered_factor = true)                    # ordered_factor = true: encode OrderedFactor columns too
svm = SVC()
SVC(
kernel = LIBSVM.Kernel.RadialBasis,
gamma = 0.0,
cost = 1.0,
cachesize = 200.0,
degree = 3,
coef0 = 0.0,
tolerance = 0.001,
shrinking = true)
Create four different pipelines to compare:
pipelines = [
("OneHot + SVM", onehot_model |> svm),
("FreqEnc + SVM", freq_model |> svm),
("TargetEnc + SVM", target_model |> svm),
("Ordinal + SVM", ordinal_model |> svm),
]
4-element Vector{Tuple{String, MLJBase.DeterministicPipeline{N, MLJModelInterface.predict} where N<:NamedTuple}}:
("OneHot + SVM", DeterministicPipeline(one_hot_encoder = OneHotEncoder(features = Symbol[], …), …))
("FreqEnc + SVM", DeterministicPipeline(frequency_encoder = FrequencyEncoder(features = Symbol[], …), …))
("TargetEnc + SVM", DeterministicPipeline(target_encoder = TargetEncoder(features = Symbol[], …), …))
("Ordinal + SVM", DeterministicPipeline(ordinal_encoder = OrdinalEncoder(features = Symbol[], …), …))
Evaluate Pipelines
Use 5-fold cross-validation to robustly estimate each pipeline's accuracy:
results = DataFrame(
pipeline = String[],
accuracy = Float64[],
std_error = Float64[],
ci_lower = Float64[],
ci_upper = Float64[],
)
for (name, pipe) in pipelines
println("Evaluating: $name")
eval_results = evaluate(
pipe,
X,
y,
resampling = CV(nfolds = 5, rng = 123),
measure = accuracy,
rows = train,
verbosity = 0,
)
acc = eval_results.measurement[1] # scalar mean
per_fold = eval_results.per_fold[1] # vector of fold results
se = std(per_fold) / sqrt(length(per_fold))
ci = 1.96 * se
push!(
results,
(
pipeline = name,
accuracy = acc,
std_error = se,
ci_lower = acc - ci,
ci_upper = acc + ci,
),
)
println(" Mean accuracy: $(round(acc, digits=4)) ± $(round(ci, digits=4))")
end
Evaluating: OneHot + SVM
Mean accuracy: 0.999 ± 0.0021
Evaluating: FreqEnc + SVM
Mean accuracy: 0.8804 ± 0.0286
Evaluating: TargetEnc + SVM
Mean accuracy: 0.9738 ± 0.0086
Evaluating: Ordinal + SVM
Mean accuracy: 0.9328 ± 0.0119
Sort results by accuracy (highest first) and display:
sort!(results, :accuracy, rev = true)
| Row | pipeline | accuracy | std_error | ci_lower | ci_upper |
|---|---|---|---|---|---|
| | String | Float64 | Float64 | Float64 | Float64 |
| 1 | OneHot + SVM | 0.998951 | 0.00105263 | 0.996888 | 1.00101 |
| 2 | TargetEnc + SVM | 0.973767 | 0.00441017 | 0.965123 | 0.982411 |
| 3 | Ordinal + SVM | 0.932844 | 0.00606985 | 0.920947 | 0.944741 |
| 4 | FreqEnc + SVM | 0.880378 | 0.0145961 | 0.851769 | 0.908986 |
Display the results with 95% confidence intervals:
println("\nResults with 95% Confidence Intervals (see caveats below):")
println("="^60)
for row in eachrow(results)
pipeline = row.pipeline
acc = round(row.accuracy, digits = 4)
ci_lower = round(row.ci_lower, digits = 4)
ci_upper = round(row.ci_upper, digits = 4)
println("$pipeline: $acc (95% CI: [$ci_lower, $ci_upper])")
end
results
| Row | pipeline | accuracy | std_error | ci_lower | ci_upper |
|---|---|---|---|---|---|
| | String | Float64 | Float64 | Float64 | Float64 |
| 1 | OneHot + SVM | 0.998951 | 0.00105263 | 0.996888 | 1.00101 |
| 2 | TargetEnc + SVM | 0.973767 | 0.00441017 | 0.965123 | 0.982411 |
| 3 | Ordinal + SVM | 0.932844 | 0.00606985 | 0.920947 | 0.944741 |
| 4 | FreqEnc + SVM | 0.880378 | 0.0145961 | 0.851769 | 0.908986 |
Results Analysis
Performance Summary
The results show OneHot encoding performing best, followed by Target encoding, with Ordinal and Frequency encoders showing lower performance.
The confidence intervals should be interpreted with caution: they primarily illustrate uncertainty rather than support formal significance tests (see Bengio & Grandvalet, 2004, "No Unbiased Estimator of the Variance of K-Fold Cross-Validation"). That said, reporting an interval is still more informative than reporting only the mean.
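For a sharper comparison between two specific pipelines, a common (still approximate) refinement is a paired comparison of per-fold scores computed on identical folds. A sketch, under the assumption that constructing `CV` with the same integer `rng` reproduces the same folds across calls:
using Statistics  # for mean/std (likely already in scope via MLJ)
e1 = evaluate(onehot_model |> svm, X, y;
    resampling = CV(nfolds = 5, rng = 123), measure = accuracy, rows = train, verbosity = 0)
e2 = evaluate(target_model |> svm, X, y;
    resampling = CV(nfolds = 5, rng = 123), measure = accuracy, rows = train, verbosity = 0)
diffs = e1.per_fold[1] .- e2.per_fold[1]              # paired per-fold accuracy differences
mean(diffs), 1.96 * std(diffs) / sqrt(length(diffs))  # mean difference and CI half-width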
Prepare the data for plotting:
labels = results.pipeline
mean_acc = results.accuracy
ci_lower = results.ci_lower
ci_upper = results.ci_upper
4-element Vector{Float64}:
1.0010138399514
0.9824109872813186
0.9447405610093282
0.9089860558215551
Error bars are the distances from the mean to the CI bounds:
lower_err = mean_acc .- ci_lower
upper_err = ci_upper .- mean_acc
bar(
labels,
mean_acc,
yerror = (lower_err, upper_err),
legend = false,
xlabel = "Encoder + SVM",
ylabel = "Accuracy",
title = "Mean Accuracy with 95% Confidence Intervals",
ylim = (0, 1.05),
color = :skyblue,
size = (700, 400),
);
Save the figure and load it:
savefig("encoder_comparison.png");
![Mean accuracy with 95% confidence intervals](encoder_comparison.png)
This page was generated using Literate.jl.