DBSCAN

A model type for constructing a DBSCAN clusterer (density-based spatial clustering of applications with noise), based on Clustering.jl, and implementing the MLJ model interface.

From MLJ, the type can be imported using

DBSCAN = @load DBSCAN pkg=Clustering

Do model = DBSCAN() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in DBSCAN(radius=...).

DBSCAN is a clustering algorithm that groups points that are closely packed (points with many nearby neighbors) and marks as outliers those points that lie alone in low-density regions (whose nearest neighbors are too far away). More information is available at the Clustering.jl documentation. Use predict to get cluster assignments. Point types (core, boundary or noise) are accessed from the machine report (see below).
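To make the density idea concrete, here is a minimal brute-force sketch of DBSCAN in plain Julia. It is not the Clustering.jl implementation (which uses a k-d tree); the function name naive_dbscan is hypothetical, and the sketch assumes that a point counts as its own neighbor when compared against min_neighbors.

```julia
# Brute-force DBSCAN sketch; one point per column of `points`.
# Hypothetical helper for illustration, not the Clustering.jl algorithm.
# Assumption: a point is included in its own neighborhood.
function naive_dbscan(points::AbstractMatrix{<:Real}; radius=1.0, min_neighbors=1)
    n = size(points, 2)
    dist(i, j) = sqrt(sum(abs2, points[:, i] .- points[:, j]))
    neighbors = [[j for j in 1:n if dist(i, j) <= radius] for i in 1:n]
    is_core = [length(nb) >= min_neighbors for nb in neighbors]

    labels = zeros(Int, n)   # 0 means noise (or not yet assigned)
    ncluster = 0
    for i in 1:n
        (is_core[i] && labels[i] == 0) || continue
        ncluster += 1        # start a new cluster from an unassigned core point
        labels[i] = ncluster
        stack = [i]
        while !isempty(stack)
            p = pop!(stack)
            for q in neighbors[p]
                labels[q] == 0 || continue
                labels[q] = ncluster            # core or boundary point joins the cluster
                is_core[q] && push!(stack, q)   # only core points keep expanding it
            end
        end
    end
    return labels, is_core
end

# Two tight groups of three points and one isolated point:
pts = [0.0 0.1 0.2 5.0 5.1 5.2 10.0;
       0.0 0.0 0.0 0.0 0.0 0.0  0.0]
labels, is_core = naive_dbscan(pts; radius=0.15, min_neighbors=2)
```

In this toy data the isolated point keeps the label 0, matching the convention that noise points are labelled 0.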

This is a static implementation, i.e., it does not generalize to new data instances, and there is no training data. For clusterers that do generalize, see KMeans or KMedoids.

In MLJ or MLJBase, create a machine with

mach = machine(model)

Hyper-parameters

radius=1.0: query radius.
leafsize=20: number of points binned in each leaf node of the nearest neighbor k-d tree.
min_neighbors=1: minimum number of neighbors a point must have to qualify as a core point.
min_cluster_size=1: minimum number of points in a valid cluster.
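As a rough illustration of how these hyper-parameters interact with the data, the sketch below fits the model at a few radii and records the number of clusters found. It uses only calls that appear elsewhere in this document; the specific radii and the values of min_neighbors and min_cluster_size are illustrative assumptions, not recommendations, and the code assumes the package providing the model is installed in the active environment.

```julia
using MLJ

# Assumes the DBSCAN model code is available in the active environment.
DBSCAN = @load DBSCAN pkg=Clustering

X, _ = make_moons(400, noise=0.09, rng=1)   # synthetic two-cluster data

# Illustrative radii only; suitable values depend on the scale of the data.
nclusters_by_radius = map((0.05, 0.13, 0.5)) do r
    mach = machine(DBSCAN(radius=r, min_neighbors=3, min_cluster_size=5))
    predict(mach, X)                 # the report is populated by `predict`
    r => report(mach).nclusters
end
```

The result is a tuple of radius => nclusters pairs, which gives a quick way to see whether a radius is too small (many points left as noise) or too large (clusters merged).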

Operations

predict(mach, X): return cluster label assignments, as an unordered CategoricalVector. Here X is any table of input features (e.g., a DataFrame) whose columns are of scitype Continuous; check column scitypes with schema(X). Note that points of type noise always get the label 0.

Report

After calling predict(mach, X), the fields of report(mach) are:

point_types: A CategoricalVector with the DBSCAN point type classification, one element per row of X. Elements are either 'C' (core), 'B' (boundary), or 'N' (noise).
nclusters: The number of clusters (excluding the noise "cluster").
cluster_labels: The unique list of cluster labels.
clusters: A vector of Clustering.DbscanCluster objects from Clustering.jl, which have these fields:
- size: number of points in a cluster (core + boundary)
- core_indices: indices of points in the cluster core
- boundary_indices: indices of points on the cluster boundary
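The relationship between these report fields can be sketched without running a model. The snippet below uses named tuples as stand-ins for the cluster objects, with made-up indices, and rebuilds per-point types and labels from them. That the k-th cluster receives label k is an assumption made for illustration only.

```julia
# Stand-in for `report(mach).clusters`: named tuples with the three fields
# described above. Values are invented, not output from a real run.
clusters = [
    (size=4, core_indices=[1, 2, 3], boundary_indices=[4]),
    (size=3, core_indices=[6, 7], boundary_indices=[8]),
]
npoints = 8   # rows of X in this made-up example; row 5 belongs to no cluster

point_types = fill('N', npoints)   # every point is noise until a cluster claims it
labels = zeros(Int, npoints)       # 0 is the noise label
for (k, c) in enumerate(clusters)
    point_types[c.core_indices] .= 'C'
    point_types[c.boundary_indices] .= 'B'
    labels[c.core_indices] .= k     # assumption: k-th cluster gets label k
    labels[c.boundary_indices] .= k
end

# `size` counts core and boundary points together:
sizes_consistent = all(c.size == length(c.core_indices) + length(c.boundary_indices)
                       for c in clusters)
```

With a real machine, the same loop could be run over report(mach).clusters, since those objects expose the same three field names.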

Examples

using MLJ

X, labels = make_moons(400, noise=0.09, rng=1) ## synthetic data with 2 clusters
y = map(labels) do label
    label == 0 ? "cookie" : "monster"
end;
y = coerce(y, Multiclass);

DBSCAN = @load DBSCAN pkg=Clustering
model = DBSCAN(radius=0.13, min_cluster_size=5)
mach = machine(model)

## compute and output cluster assignments for observations in `X`:
yhat = predict(mach, X)

## get DBSCAN point types:
report(mach).point_types
report(mach).nclusters

## compare cluster labels with actual labels:
compare = zip(yhat, y) |> collect;
compare[1:10] ## clusters align with classes

## visualize clusters, noise in red:
points = zip(X.x1, X.x2) |> collect
colors = map(yhat) do i
   i == 0 ? :red :
   i == 1 ? :blue :
   i == 2 ? :green :
   i == 3 ? :yellow :
   :black
end
using Plots
scatter(points, color=colors)