Categorical Encoders Performance: A Classic Comparison

Julia version is assumed to be 1.10.*

This demonstration, together with the dataset, is available here as a Jupyter notebook or a Julia script.

This tutorial compares four fundamental categorical encoding approaches on a milk quality dataset: OneHot, Frequency, Target, and Ordinal encoders paired with SVM classification.

using Pkg;
Pkg.activate(@__DIR__);

using MLJ, LIBSVM, DataFrames, ScientificTypes
using Random, CSV, Plots, Statistics  # Statistics provides `std`, used during evaluation
  Activating project at `~/Documents/GitHub/MLJTransforms/docs/src/tutorials/classic_comparison`

Load and Prepare Data

Load the milk quality dataset which contains categorical features for quality prediction:

df = CSV.read("./milknew.csv", DataFrame)

first(df, 5)
5×8 DataFrame
 Row │ pH       Temprature  Taste  Odor   Fat    Turbidity  Colour  Grade
     │ Float64  Int64       Int64  Int64  Int64  Int64      Int64   String7
─────┼─────────────────────────────────────────────────────────────────────
   1 │     6.6          35      1      0      1          0     254  high
   2 │     6.6          36      0      1      0          1     253  high
   3 │     8.5          70      1      1      1          1     246  low
   4 │     9.5          34      1      1      0          1     255  low
   5 │     6.6          37      0      0      0          0     255  medium

Check the scientific types to understand our data structure:

ScientificTypes.schema(df)
┌────────────┬────────────┬─────────┐
│ names      │ scitypes   │ types   │
├────────────┼────────────┼─────────┤
│ pH         │ Continuous │ Float64 │
│ Temprature │ Count      │ Int64   │
│ Taste      │ Count      │ Int64   │
│ Odor       │ Count      │ Int64   │
│ Fat        │ Count      │ Int64   │
│ Turbidity  │ Count      │ Int64   │
│ Colour     │ Count      │ Int64   │
│ Grade      │ Textual    │ String7 │
└────────────┴────────────┴─────────┘

Automatically coerce columns with few unique values to categorical:

df = coerce(df, autotype(df, :few_to_finite))

schema(df)
┌────────────┬───────────────────┬───────────────────────────────────┐
│ names      │ scitypes          │ types                             │
├────────────┼───────────────────┼───────────────────────────────────┤
│ pH         │ OrderedFactor{16} │ CategoricalValue{Float64, UInt32} │
│ Temprature │ OrderedFactor{17} │ CategoricalValue{Int64, UInt32}   │
│ Taste      │ OrderedFactor{2}  │ CategoricalValue{Int64, UInt32}   │
│ Odor       │ OrderedFactor{2}  │ CategoricalValue{Int64, UInt32}   │
│ Fat        │ OrderedFactor{2}  │ CategoricalValue{Int64, UInt32}   │
│ Turbidity  │ OrderedFactor{2}  │ CategoricalValue{Int64, UInt32}   │
│ Colour     │ OrderedFactor{9}  │ CategoricalValue{Int64, UInt32}   │
│ Grade      │ Multiclass{3}     │ CategoricalValue{String7, UInt32} │
└────────────┴───────────────────┴───────────────────────────────────┘
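If `autotype`'s guess is wrong for a column, you can coerce it explicitly and set the level ordering yourself. A minimal sketch on toy data (the column name `size` and its levels are hypothetical, not part of the milk dataset):

```julia
using DataFrames, ScientificTypes, CategoricalArrays

toy = DataFrame(size = ["S", "M", "L", "M"])

# Coerce a single column to an ordered factor
toy = coerce(toy, :size => OrderedFactor)

# By default the levels are sorted (["L", "M", "S"]); set the intended order
levels!(toy.size, ["S", "M", "L"])

levels(toy.size)   # ["S", "M", "L"]
```

Setting the ordering matters for encoders, such as OrdinalEncoder below, that exploit it.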

Split Data

Separate features from target and create train/test split:

y, X = unpack(df, ==(:Grade); rng = 123)
train, test = partition(eachindex(y), 0.9, shuffle = true, rng = 100);

Setup Encoders and Classifier

Load the required models and create different encoding strategies:

SVC = @load SVC pkg = LIBSVM verbosity = 0
MLJLIBSVMInterface.SVC

Encoding Strategies Explained:

  1. OneHot: Creates binary columns for each category (sparse, interpretable)
  2. Frequency: Replaces categories with their occurrence frequency
  3. Target: Uses target statistics for each category
  4. Ordinal: Assigns integer codes to categories (assumes ordering)
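To make strategies 2 and 4 concrete, here is a hand-rolled sketch of frequency and ordinal encoding on a toy vector. This is illustrative only; the actual encoders used below come from MLJTransforms and additionally handle unseen levels, normalization, and so on:

```julia
colors = ["red", "green", "red", "blue", "red"]

# Frequency encoding: category -> number of occurrences
counts = Dict{String,Int}()
for c in colors
    counts[c] = get(counts, c, 0) + 1
end
freq_encoded = [counts[c] for c in colors]   # [3, 1, 3, 1, 3]

# Ordinal encoding: category -> integer code (here, order of first appearance)
codes = Dict{String,Int}()
for c in colors
    haskey(codes, c) || (codes[c] = length(codes) + 1)
end
ord_encoded = [codes[c] for c in colors]     # [1, 2, 1, 3, 1]
```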
onehot_model = OneHotEncoder(drop_last = true, ordered_factor = true)
freq_model = FrequencyEncoder(normalize = false, ordered_factor = true)
target_model = TargetEncoder(lambda = 0.9, m = 5, ordered_factor = true)
ordinal_model = OrdinalEncoder(ordered_factor = true)
svm = SVC()
SVC(
  kernel = LIBSVM.Kernel.RadialBasis, 
  gamma = 0.0, 
  cost = 1.0, 
  cachesize = 200.0, 
  degree = 3, 
  coef0 = 0.0, 
  tolerance = 0.001, 
  shrinking = true)

Create four different pipelines to compare:

pipelines = [
    ("OneHot + SVM", onehot_model |> svm),
    ("FreqEnc + SVM", freq_model |> svm),
    ("TargetEnc + SVM", target_model |> svm),
    ("Ordinal + SVM", ordinal_model |> svm),
]
4-element Vector{Tuple{String, MLJBase.DeterministicPipeline{N, MLJModelInterface.predict} where N<:NamedTuple}}:
 ("OneHot + SVM", DeterministicPipeline(one_hot_encoder = OneHotEncoder(features = Symbol[], …), …))
 ("FreqEnc + SVM", DeterministicPipeline(frequency_encoder = FrequencyEncoder(features = Symbol[], …), …))
 ("TargetEnc + SVM", DeterministicPipeline(target_encoder = TargetEncoder(features = Symbol[], …), …))
 ("Ordinal + SVM", DeterministicPipeline(ordinal_encoder = OrdinalEncoder(features = Symbol[], …), …))

Evaluate Pipelines

Use 5-fold cross-validation to estimate each pipeline's accuracy:

results = DataFrame(
    pipeline = String[],
    accuracy = Float64[],
    std_error = Float64[],
    ci_lower = Float64[],
    ci_upper = Float64[],
)

for (name, pipe) in pipelines
    println("Evaluating: $name")
    eval_results = evaluate(
        pipe,
        X,
        y,
        resampling = CV(nfolds = 5, rng = 123),
        measure = accuracy,
        rows = train,
        verbosity = 0,
    )
    acc = eval_results.measurement[1]          # scalar mean
    per_fold = eval_results.per_fold[1]         # vector of fold results
    se = std(per_fold) / sqrt(length(per_fold))
    ci = 1.96 * se
    push!(
        results,
        (
            pipeline = name,
            accuracy = acc,
            std_error = se,
            ci_lower = acc - ci,
            ci_upper = acc + ci,
        ),
    )
    println("  Mean accuracy: $(round(acc, digits=4)) ± $(round(ci, digits=4))")
end
Evaluating: OneHot + SVM
  Mean accuracy: 0.999 ± 0.0021
Evaluating: FreqEnc + SVM
  Mean accuracy: 0.8804 ± 0.0286
Evaluating: TargetEnc + SVM
  Mean accuracy: 0.9738 ± 0.0086
Evaluating: Ordinal + SVM
  Mean accuracy: 0.9328 ± 0.0119

Sort results by accuracy (highest first) and display:

sort!(results, :accuracy, rev = true)
4×5 DataFrame
 Row │ pipeline         accuracy  std_error   ci_lower  ci_upper
     │ String           Float64   Float64     Float64   Float64
─────┼──────────────────────────────────────────────────────────
   1 │ OneHot + SVM     0.998951  0.00105263  0.996888  1.00101
   2 │ TargetEnc + SVM  0.973767  0.00441017  0.965123  0.982411
   3 │ Ordinal + SVM    0.932844  0.00606985  0.920947  0.944741
   4 │ FreqEnc + SVM    0.880378  0.0145961   0.851769  0.908986

Display the results with 95% confidence intervals:

println("\nResults with 95% Confidence Intervals (see caveats below):")
println("="^60)
for row in eachrow(results)
    pipeline = row.pipeline
    acc = round(row.accuracy, digits = 4)
    ci_lower = round(row.ci_lower, digits = 4)
    ci_upper = round(row.ci_upper, digits = 4)
    println("$pipeline: $acc (95% CI: [$ci_lower, $ci_upper])")
end

results
4×5 DataFrame
 Row │ pipeline         accuracy  std_error   ci_lower  ci_upper
     │ String           Float64   Float64     Float64   Float64
─────┼──────────────────────────────────────────────────────────
   1 │ OneHot + SVM     0.998951  0.00105263  0.996888  1.00101
   2 │ TargetEnc + SVM  0.973767  0.00441017  0.965123  0.982411
   3 │ Ordinal + SVM    0.932844  0.00606985  0.920947  0.944741
   4 │ FreqEnc + SVM    0.880378  0.0145961   0.851769  0.908986

Results Analysis

Performance Summary

The results show OneHot encoding performing best, followed by Target encoding, with Ordinal and Frequency encoders showing lower performance.

The confidence intervals should be interpreted with caution: per-fold CV scores are not independent, so there is no unbiased estimator of their variance (see Bengio & Grandvalet, 2004, "No Unbiased Estimator of the Variance of K-Fold Cross-Validation"). The OneHot interval's upper bound exceeding 1.0 also shows that the normal approximation breaks down near the accuracy boundary. That said, reporting an interval is still more informative than reporting the mean alone.
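For reference, the interval computed inside the evaluation loop is the usual normal-approximation formula. A self-contained sketch with made-up fold scores (not taken from the runs above):

```julia
using Statistics

# Illustrative per-fold accuracies
per_fold = [0.97, 0.99, 0.96, 0.98, 0.97]

acc = mean(per_fold)                          # sample mean, ≈ 0.974
se  = std(per_fold) / sqrt(length(per_fold))  # naive standard error of the mean
lo, hi = acc - 1.96se, acc + 1.96se           # approximate 95% interval
```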

Prepare data for plotting:

labels = results.pipeline
mean_acc = results.accuracy
ci_lower = results.ci_lower
ci_upper = results.ci_upper
4-element Vector{Float64}:
 1.0010138399514
 0.9824109872813186
 0.9447405610093282
 0.9089860558215551

Error bars: distance from mean to CI bounds

lower_err = mean_acc .- ci_lower
upper_err = ci_upper .- mean_acc

bar(
    labels,
    mean_acc,
    yerror = (lower_err, upper_err),
    legend = false,
    xlabel = "Encoder + SVM",
    ylabel = "Accuracy",
    title = "Mean Accuracy with 95% Confidence Intervals",
    ylim = (0, 1.05),
    color = :skyblue,
    size = (700, 400),
);

Save the figure and load it:

savefig("encoder_comparison.png");

`encoder_comparison.png`


This page was generated using Literate.jl.