Categorical Encoders Performance: A Classic Comparison

Julia version is assumed to be 1.10.*

This demonstration, together with the dataset, is available here as a Jupyter notebook or a Julia script.

This tutorial compares four fundamental categorical encoding approaches on a milk quality dataset: OneHot, Frequency, Target, and Ordinal encoders paired with SVM classification.

using Pkg;
Pkg.activate(@__DIR__);

using MLJ, LIBSVM, DataFrames, ScientificTypes
using Random, CSV, Plots, Statistics  # Statistics provides `std`, used during evaluation
  Activating project at `~/Documents/GitHub/MLJTransforms/docs/src/tutorials/classic_comparison`

Load and Prepare Data

Load the milk quality dataset which contains categorical features for quality prediction:

df = CSV.read("./milknew.csv", DataFrame)

first(df, 5)
5×8 DataFrame
 Row │ pH       Temprature  Taste  Odor   Fat    Turbidity  Colour  Grade
     │ Float64  Int64       Int64  Int64  Int64  Int64      Int64   String7
─────┼─────────────────────────────────────────────────────────────────────
   1 │     6.6          35      1      0      1          0     254  high
   2 │     6.6          36      0      1      0          1     253  high
   3 │     8.5          70      1      1      1          1     246  low
   4 │     9.5          34      1      1      0          1     255  low
   5 │     6.6          37      0      0      0          0     255  medium

Check the scientific types to understand our data structure:

ScientificTypes.schema(df)
┌────────────┬────────────┬─────────┐
│ names      │ scitypes   │ types   │
├────────────┼────────────┼─────────┤
│ pH         │ Continuous │ Float64 │
│ Temprature │ Count      │ Int64   │
│ Taste      │ Count      │ Int64   │
│ Odor       │ Count      │ Int64   │
│ Fat        │ Count      │ Int64   │
│ Turbidity  │ Count      │ Int64   │
│ Colour     │ Count      │ Int64   │
│ Grade      │ Textual    │ String7 │
└────────────┴────────────┴─────────┘

Automatically coerce columns with few unique values to categorical:

df = coerce(df, autotype(df, :few_to_finite))

schema(df)
┌────────────┬───────────────────┬───────────────────────────────────┐
│ names      │ scitypes          │ types                             │
├────────────┼───────────────────┼───────────────────────────────────┤
│ pH         │ OrderedFactor{16} │ CategoricalValue{Float64, UInt32} │
│ Temprature │ OrderedFactor{17} │ CategoricalValue{Int64, UInt32}   │
│ Taste      │ OrderedFactor{2}  │ CategoricalValue{Int64, UInt32}   │
│ Odor       │ OrderedFactor{2}  │ CategoricalValue{Int64, UInt32}   │
│ Fat        │ OrderedFactor{2}  │ CategoricalValue{Int64, UInt32}   │
│ Turbidity  │ OrderedFactor{2}  │ CategoricalValue{Int64, UInt32}   │
│ Colour     │ OrderedFactor{9}  │ CategoricalValue{Int64, UInt32}   │
│ Grade      │ Multiclass{3}     │ CategoricalValue{String7, UInt32} │
└────────────┴───────────────────┴───────────────────────────────────┘
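If `autotype`'s guess is wrong for a column, you can coerce it explicitly and set the level ordering yourself. A minimal sketch on toy data (the column name `size` and its levels are hypothetical, not part of the milk dataset):

```julia
using DataFrames, ScientificTypes, CategoricalArrays

toy = DataFrame(size = ["S", "M", "L", "M"])

# Coerce a single column to an ordered factor
toy = coerce(toy, :size => OrderedFactor)

# By default the levels are sorted (["L", "M", "S"]); set the intended order
levels!(toy.size, ["S", "M", "L"])

levels(toy.size)   # ["S", "M", "L"]
```

Setting the ordering matters for encoders, such as OrdinalEncoder below, that exploit it.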

Split Data

Separate features from target and create train/test split:

y, X = unpack(df, ==(:Grade); rng = 123)
train, test = partition(eachindex(y), 0.9, shuffle = true, rng = 100);

Setup Encoders and Classifier

Load the required models and create different encoding strategies:

SVC = @load SVC pkg = LIBSVM verbosity = 0
MLJLIBSVMInterface.SVC

Encoding Strategies Explained:

  1. OneHot: Creates binary columns for each category (sparse, interpretable)
  2. Frequency: Replaces categories with their occurrence frequency
  3. Target: Uses target statistics for each category
  4. Ordinal: Assigns integer codes to categories (assumes ordering)
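To make strategies 2 and 4 concrete, here is a hand-rolled sketch of frequency and ordinal encoding on a toy vector. This is illustrative only; the actual encoders used below come from MLJTransforms and additionally handle unseen levels, normalization, and so on:

```julia
colors = ["red", "green", "red", "blue", "red"]

# Frequency encoding: category -> number of occurrences
counts = Dict{String,Int}()
for c in colors
    counts[c] = get(counts, c, 0) + 1
end
freq_encoded = [counts[c] for c in colors]   # [3, 1, 3, 1, 3]

# Ordinal encoding: category -> integer code (here, order of first appearance)
codes = Dict{String,Int}()
for c in colors
    haskey(codes, c) || (codes[c] = length(codes) + 1)
end
ord_encoded = [codes[c] for c in colors]     # [1, 2, 1, 3, 1]
```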
onehot_model = OneHotEncoder(drop_last = true, ordered_factor = true)
freq_model = FrequencyEncoder(normalize = false, ordered_factor = true)
target_model = TargetEncoder(lambda = 0.9, m = 5, ordered_factor = true)
ordinal_model = OrdinalEncoder(ordered_factor = true)
svm = SVC()
SVC(
  kernel = LIBSVM.Kernel.RadialBasis, 
  gamma = 0.0, 
  cost = 1.0, 
  cachesize = 200.0, 
  degree = 3, 
  coef0 = 0.0, 
  tolerance = 0.001, 
  shrinking = true)

Create four different pipelines to compare:

pipelines = [
    ("OneHot + SVM", onehot_model |> svm),
    ("FreqEnc + SVM", freq_model |> svm),
    ("TargetEnc + SVM", target_model |> svm),
    ("Ordinal + SVM", ordinal_model |> svm),
]
4-element Vector{Tuple{String, MLJBase.DeterministicPipeline{N, MLJModelInterface.predict} where N<:NamedTuple}}:
 ("OneHot + SVM", DeterministicPipeline(one_hot_encoder = OneHotEncoder(features = Symbol[], …), …))
 ("FreqEnc + SVM", DeterministicPipeline(frequency_encoder = FrequencyEncoder(features = Symbol[], …), …))
 ("TargetEnc + SVM", DeterministicPipeline(target_encoder = TargetEncoder(features = Symbol[], …), …))
 ("Ordinal + SVM", DeterministicPipeline(ordinal_encoder = OrdinalEncoder(features = Symbol[], …), …))

Evaluate Pipelines

Use 5-fold cross-validation to estimate each pipeline's accuracy:

results = DataFrame(
    pipeline = String[],
    accuracy = Float64[],
    std_error = Float64[],
    ci_lower = Float64[],
    ci_upper = Float64[],
)

for (name, pipe) in pipelines
    println("Evaluating: $name")
    eval_results = evaluate(
        pipe,
        X,
        y,
        resampling = CV(nfolds = 5, rng = 123),
        measure = accuracy,
        rows = train,
        verbosity = 0,
    )
    acc = eval_results.measurement[1]          # scalar mean
    per_fold = eval_results.per_fold[1]         # vector of fold results
    se = std(per_fold) / sqrt(length(per_fold))
    ci = 1.96 * se
    push!(
        results,
        (
            pipeline = name,
            accuracy = acc,
            std_error = se,
            ci_lower = acc - ci,
            ci_upper = acc + ci,
        ),
    )
    println("  Mean accuracy: $(round(acc, digits=4)) ± $(round(ci, digits=4))")
end
Evaluating: OneHot + SVM
  Mean accuracy: 0.999 ± 0.0021
Evaluating: FreqEnc + SVM
  Mean accuracy: 0.8804 ± 0.0286
Evaluating: TargetEnc + SVM
  Mean accuracy: 0.9738 ± 0.0086
Evaluating: Ordinal + SVM
  Mean accuracy: 0.9328 ± 0.0119

Sort results by accuracy (highest first) and display:

sort!(results, :accuracy, rev = true)
4×5 DataFrame
 Row │ pipeline         accuracy  std_error   ci_lower  ci_upper
     │ String           Float64   Float64     Float64   Float64
─────┼──────────────────────────────────────────────────────────
   1 │ OneHot + SVM     0.998951  0.00105263  0.996888  1.00101
   2 │ TargetEnc + SVM  0.973767  0.00441017  0.965123  0.982411
   3 │ Ordinal + SVM    0.932844  0.00606985  0.920947  0.944741
   4 │ FreqEnc + SVM    0.880378  0.0145961   0.851769  0.908986

Display the results with 95% confidence intervals:

println("\nResults with 95% Confidence Intervals (see caveats below):")
println("="^60)
for row in eachrow(results)
    pipeline = row.pipeline
    acc = round(row.accuracy, digits = 4)
    ci_lower = round(row.ci_lower, digits = 4)
    ci_upper = round(row.ci_upper, digits = 4)
    println("$pipeline: $acc (95% CI: [$ci_lower, $ci_upper])")
end

results
4×5 DataFrame
 Row │ pipeline         accuracy  std_error   ci_lower  ci_upper
     │ String           Float64   Float64     Float64   Float64
─────┼──────────────────────────────────────────────────────────
   1 │ OneHot + SVM     0.998951  0.00105263  0.996888  1.00101
   2 │ TargetEnc + SVM  0.973767  0.00441017  0.965123  0.982411
   3 │ Ordinal + SVM    0.932844  0.00606985  0.920947  0.944741
   4 │ FreqEnc + SVM    0.880378  0.0145961   0.851769  0.908986

Results Analysis

Performance Summary

The results show OneHot encoding performing best, followed by Target encoding, with Ordinal and Frequency encoders showing lower performance.

The confidence intervals should be interpreted with caution: per-fold CV scores are not independent, so there is no unbiased estimator of their variance (see Bengio & Grandvalet, 2004, "No Unbiased Estimator of the Variance of K-Fold Cross-Validation"). The OneHot interval's upper bound exceeding 1.0 also shows that the normal approximation breaks down near the accuracy boundary. That said, reporting an interval is still more informative than reporting the mean alone.
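For reference, the interval computed inside the evaluation loop is the usual normal-approximation formula. A self-contained sketch with made-up fold scores (not taken from the runs above):

```julia
using Statistics

# Illustrative per-fold accuracies
per_fold = [0.97, 0.99, 0.96, 0.98, 0.97]

acc = mean(per_fold)                          # sample mean, ≈ 0.974
se  = std(per_fold) / sqrt(length(per_fold))  # naive standard error of the mean
lo, hi = acc - 1.96se, acc + 1.96se           # approximate 95% interval
```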

Prepare data for plotting:

labels = results.pipeline
mean_acc = results.accuracy
ci_lower = results.ci_lower
ci_upper = results.ci_upper
4-element Vector{Float64}:
 1.0010138399514
 0.9824109872813186
 0.9447405610093282
 0.9089860558215551

Error bars: distance from mean to CI bounds

lower_err = mean_acc .- ci_lower
upper_err = ci_upper .- mean_acc

bar(
    labels,
    mean_acc,
    yerror = (lower_err, upper_err),
    legend = false,
    xlabel = "Encoder + SVM",
    ylabel = "Accuracy",
    title = "Mean Accuracy with 95% Confidence Intervals",
    ylim = (0, 1.05),
    color = :skyblue,
    size = (700, 400),
);

Save the figure and load it:

savefig("encoder_comparison.png");

`encoder_comparison.png`


This page was generated using Literate.jl.