DBSCAN
A model type for constructing a DBSCAN clusterer (density-based spatial clustering of applications with noise), based on Clustering.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
DBSCAN = @load DBSCAN pkg=Clustering
Do model = DBSCAN() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in DBSCAN(radius=...).
DBSCAN is a clustering algorithm that groups together points that are closely packed (points with many nearby neighbors), and marks as outliers points that lie alone in low-density regions (points whose nearest neighbors are too far away). More information is available in the Clustering.jl documentation. Use predict to get cluster assignments. Point types (core, boundary or noise) are accessed from the machine report (see below).
This is a static implementation, i.e., it does not generalize to new data instances, and there is no training data. For clusterers that do generalize, see KMeans or KMedoids.
In MLJ or MLJBase, create a machine with
mach = machine(model)
Hyper-parameters
- radius=1.0: query radius.
- leafsize=20: number of points binned in each leaf node of the nearest neighbor k-d tree.
- min_neighbors=1: minimum number of neighbors of a core point.
- min_cluster_size=1: minimum number of points in a valid cluster.
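As a minimal sketch of how the hyper-parameters listed above might be set, the snippet below constructs a model with all four keywords and then changes one field afterwards. The particular values are illustrative assumptions, not recommendations, and the snippet assumes that MLJ and the Clustering.jl interface package are installed in the active environment.

```julia
using MLJ

# Load the model type; `verbosity=0` suppresses the import message.
DBSCAN = @load DBSCAN pkg=Clustering verbosity=0

# Override every default; the values here are arbitrary illustrations.
model = DBSCAN(radius=0.5, leafsize=20, min_neighbors=5, min_cluster_size=5)

# Hyper-parameters are ordinary fields and can be changed after construction.
model.radius = 0.25

@show model.radius model.min_neighbors model.min_cluster_size
```

Because the model is static, changing a hyper-parameter takes effect the next time predict is called on a machine wrapping the model.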
Operations
- predict(mach, X): return cluster label assignments, as an unordered CategoricalVector. Here X is any table of input features (e.g., a DataFrame) whose columns are of scitype Continuous; check column scitypes with schema(X). Note that points of type noise will always get a label of 0.
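The scitype requirement above can be checked before calling predict. The following is a small sketch, assuming a hypothetical two-column table whose first column holds integers (scitype Count) and so needs coercing; the table and its column names are invented for illustration.

```julia
using MLJ

# Hypothetical input table: `x1` is integer-valued, so its scitype is `Count`.
X = (x1 = [1, 2, 3, 4], x2 = [0.5, 1.5, 2.5, 3.5])

# Inspect the column scitypes.
schema(X)

# Coerce every `Count` column to `Continuous`, as the model expects.
Xc = coerce(X, Count => Continuous)

# After coercion, all columns should be `Continuous`.
all_continuous = all(s -> s <: Continuous, schema(Xc).scitypes)
@show all_continuous
```

The coerced table Xc, rather than X, is what would then be passed to predict.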
Report
After calling predict(mach, X), the fields of report(mach) are:
- point_types: A CategoricalVector with the DBSCAN point type classification, one element per row of X. Elements are either 'C' (core), 'B' (boundary), or 'N' (noise).
- nclusters: The number of clusters (excluding the noise "cluster").
- cluster_labels: The unique list of cluster labels.
- clusters: A vector of Clustering.DbscanCluster objects from Clustering.jl, which have these fields:
  - size: number of points in a cluster (core + boundary)
  - core_indices: indices of points in the cluster core
  - boundary_indices: indices of points on the cluster boundary
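The report fields described above might be inspected along the following lines. This is a sketch only: it reuses the synthetic make_moons data and the hyper-parameter values from the Examples section, and the summary variables (n_core, n_boundary, n_noise) are names introduced here for illustration.

```julia
using MLJ

DBSCAN = @load DBSCAN pkg=Clustering verbosity=0

# Synthetic data, as in the Examples section.
X, _ = make_moons(400, noise=0.09, rng=1)

mach = machine(DBSCAN(radius=0.13, min_cluster_size=5))
yhat = predict(mach, X)   # the report is populated by this call
r = report(mach)

# Tally the point types: 'C' (core), 'B' (boundary), 'N' (noise).
n_core     = count(==('C'), r.point_types)
n_boundary = count(==('B'), r.point_types)
n_noise    = count(==('N'), r.point_types)
@show r.nclusters n_core n_boundary n_noise

# Per-cluster details, using the documented fields of each cluster object.
for (k, c) in enumerate(r.clusters)
    println("cluster ", k, ": size=", c.size,
            ", core=", length(c.core_indices),
            ", boundary=", length(c.boundary_indices))
end
```

Every row of X receives exactly one point type, so the three tallies add up to the number of rows.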
Examples
using MLJ
X, labels = make_moons(400, noise=0.09, rng=1) ## synthetic data with 2 clusters
y = map(labels) do label
    label == 0 ? "cookie" : "monster"
end;
y = coerce(y, Multiclass);
DBSCAN = @load DBSCAN pkg=Clustering
model = DBSCAN(radius=0.13, min_cluster_size=5)
mach = machine(model)
## compute and output cluster assignments for observations in `X`:
yhat = predict(mach, X)
## get DBSCAN point types:
report(mach).point_types
report(mach).nclusters
## compare cluster labels with actual labels:
compare = zip(yhat, y) |> collect;
compare[1:10] ## clusters align with classes
## visualize clusters, noise in red:
points = zip(X.x1, X.x2) |> collect
colors = map(yhat) do i
    i == 0 ? :red :
    i == 1 ? :blue :
    i == 2 ? :green :
    i == 3 ? :yellow :
    :black
end
using Plots
scatter(points, color=colors)