KMedoids

A model type for constructing a K-medoids clusterer, based on Clustering.jl, and implementing the MLJ model interface.

From MLJ, the type can be imported using

KMedoids = @load KMedoids pkg=Clustering

Do model = KMedoids() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in KMedoids(k=...).

K-medoids is a clustering algorithm that works by finding $k$ data points (called medoids) such that the total distance between each data point and the closest medoid is minimal.
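The objective described above can be written out directly. The following is a minimal sketch in plain Julia, not the Clustering.jl implementation: `total_cost` and `sq_dist` are hypothetical helper names, observations are assumed to be stored one per column, and squared Euclidean distance is assumed as the dissimilarity.

```julia
# Squared Euclidean distance between two observations (illustrative helper).
sq_dist(a, b) = sum(abs2, a .- b)

# Total k-medoids cost: every point contributes its distance to the nearest medoid.
# `points` holds one observation per column; `medoid_idx` indexes columns of `points`.
function total_cost(points::AbstractMatrix, medoid_idx::AbstractVector{<:Integer};
                    dist = sq_dist)
    return sum(1:size(points, 2)) do i
        minimum(dist(view(points, :, i), view(points, :, j)) for j in medoid_idx)
    end
end

# Four one-dimensional points forming two tight pairs.
points = [0.0 1.0 10.0 11.0]

good = total_cost(points, [1, 3])  # one medoid in each pair
bad  = total_cost(points, [1, 2])  # both medoids in the left pair
```

Placing one medoid in each pair gives a much lower cost than placing both in the same pair, which is the sense in which the algorithm looks for medoids that minimize the total distance.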

Training data

In MLJ or MLJBase, bind an instance model to data with

mach = machine(model, X)

Here:

  • X is any table of input features (e.g., a DataFrame) whose columns are of scitype Continuous; check column scitypes with schema(X)

Train the machine using fit!(mach, rows=...).
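A common stumbling block is a table with integer columns, which have scitype Count rather than Continuous. The sketch below uses made-up toy data and shows one way to inspect and coerce the columns before binding them to a machine.

```julia
using MLJ

# Toy column table (invented data): x2 is integer-valued, so its scitype is Count.
X = (x1 = [0.5, 1.5, 2.5, 3.5], x2 = [1, 2, 3, 4])
schema(X)

# Coerce every Count column to Continuous so the table meets the model's requirements.
Xc = coerce(X, Count => Continuous)
schema(Xc)
```

After coercion, Xc can be passed to machine(model, Xc) as described above.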

Hyper-parameters

  • k=3: The number of medoids (clusters) to use in clustering.

  • metric::SemiMetric=Distances.SqEuclidean: The distance used to compute the clustering. Must be an instance of a SemiMetric subtype from Distances.jl.

  • init (defaults to :kmpp): how the medoids are initialized; one of the following:

    • :kmpp: k-means++ seeding
    • :kmcen: K-medoids initialization based on centrality
    • :rand: random
    • an instance of Clustering.SeedingAlgorithm from Clustering.jl
    • an integer vector of length k that provides the indices of points to use as initial medoids.

    See the Clustering.jl documentation for details.
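As an illustration of the hyper-parameters listed above, the sketch below overrides each of them by keyword. It assumes Distances.jl is installed in the active environment; the choice of Distances.Cityblock() and the toy table are arbitrary and only for demonstration.

```julia
using MLJ
import Distances  # assumed to be available in the active environment

KMedoids = @load KMedoids pkg=Clustering verbosity=0

# Override all three hyper-parameters by keyword.
model = KMedoids(k=2, metric=Distances.Cityblock(), init=:rand)

# Alternatively, seed the medoids with explicit observation indices (length must equal k).
seeded = KMedoids(k=2, init=[1, 4])

# Toy column table (invented data) with Continuous features.
X = (x1 = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2],
     x2 = [1.0, 1.1, 0.9, 8.0, 8.1, 7.9])

mach = machine(seeded, X) |> fit!
report(mach).assignments  # one cluster index per training observation
```

Passing an index vector makes the starting medoids explicit, which can be convenient when a run needs to be reproducible without relying on a random seed.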

Operations

  • predict(mach, Xnew): return cluster label assignments, given new features Xnew having the same scitype as X above.
  • transform(mach, Xnew): instead return the distances from each new sample to each cluster medoid.

Fitted parameters

The fields of fitted_params(mach) are:

  • medoids: The coordinates of the cluster medoids.

Report

The fields of report(mach) are:

  • assignments: The cluster assignments of each point in the training data.
  • cluster_labels: The labels assigned to each cluster.

Examples

using MLJ
KMedoids = @load KMedoids pkg=Clustering

table = load_iris()
y, X = unpack(table, ==(:target), rng=123)
model = KMedoids(k=3)
mach = machine(model, X) |> fit!

yhat = predict(mach, X)
@assert yhat == report(mach).assignments

compare = zip(yhat, y) |> collect;
compare[1:8] ## clusters align with classes

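# Pass the medoids themselves as new observations (transposed to one medoid per row)
center_dists = transform(mach, fitted_params(mach).medoids')

# Each medoid is at distance zero from itself
@assert center_dists[1][1] == 0.0
@assert center_dists[2][2] == 0.0
@assert center_dists[3][3] == 0.0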

See also KMeans