DBSCAN
A model type for constructing a DBSCAN clusterer (density-based spatial clustering of applications with noise), based on Clustering.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
DBSCAN = @load DBSCAN pkg=Clustering
Do model = DBSCAN() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in DBSCAN(radius=...).
DBSCAN is a clustering algorithm that groups closely packed points (points with many nearby neighbors) and marks as outliers the points that lie alone in low-density regions (points whose nearest neighbors are too far away). More information is available in the Clustering.jl documentation. Use predict to get cluster assignments. Point types (core, boundary or noise) are accessed from the machine report (see below).
This is a static implementation, i.e., it does not generalize to new data instances, and there is no training data. For clusterers that do generalize, see KMeans or KMedoids.
In MLJ or MLJBase, create a machine with
mach = machine(model)
Hyper-parameters
radius=1.0: query radius.
leafsize=20: number of points binned in each leaf node of the nearest neighbor k-d tree.
min_neighbors=1: minimum number of neighbors a point needs to be a core point.
min_cluster_size=1: minimum number of points in a valid cluster.
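To make the roles of radius and min_neighbors concrete, here is a minimal brute-force sketch of how points could be sorted into core, boundary and noise. It is not the Clustering.jl implementation (which uses a k-d tree, hence leafsize). The name classify_points is hypothetical, and the sketch assumes Euclidean distance, assumes that a point counts as its own neighbor, and ignores min_cluster_size.

```julia
using LinearAlgebra: norm

# Hypothetical helper, not part of MLJ or Clustering.jl.
# `points` is a vector of coordinate vectors. Returns one Char per point:
# 'C' (core), 'B' (boundary) or 'N' (noise).
# Assumption: a point counts as its own neighbor.
function classify_points(points; radius=1.0, min_neighbors=1)
    n = length(points)
    # neighbors[i] holds the indices of all points within `radius` of point i
    neighbors = [[j for j in 1:n if norm(points[i] - points[j]) <= radius] for i in 1:n]
    # a core point has at least `min_neighbors` points within `radius`
    is_core = [length(neighbors[i]) >= min_neighbors for i in 1:n]
    map(1:n) do i
        is_core[i] && return 'C'
        # a non-core point within reach of a core point lies on a cluster boundary
        any(j -> is_core[j], neighbors[i]) ? 'B' : 'N'
    end
end

points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.25, 0.0], [5.0, 5.0]]
types = classify_points(points; radius=0.2, min_neighbors=3)
# → ['C', 'C', 'C', 'B', 'N']
```

Under these assumptions, the default min_neighbors=1 makes every point a core point, so noise only appears once min_neighbors is raised or small clusters are discarded.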
Operations
predict(mach, X): return cluster label assignments, as an unordered CategoricalVector. Here X is any table of input features (e.g., a DataFrame) whose columns are of scitype Continuous; check column scitypes with schema(X). Note that points of type noise will always get a label of 0.
Report
After calling predict(mach, X), the fields of report(mach) are:
point_types: A CategoricalVector with the DBSCAN point type classification, one element per row of X. Elements are either 'C' (core), 'B' (boundary), or 'N' (noise).
nclusters: The number of clusters (excluding the noise "cluster").
cluster_labels: The unique list of cluster labels.
clusters: A vector of Clustering.DbscanCluster objects from Clustering.jl, which have these fields:
    size: number of points in a cluster (core + boundary)
    core_indices: indices of points in the cluster core
    boundary_indices: indices of points on the cluster boundary
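As a small illustration of working with point_types, the sketch below tallies how many points fall into each category. The helper tally_point_types is hypothetical (not part of MLJ), and the example runs on stand-in data. It assumes the elements of report(mach).point_types compare equal to the plain characters 'C', 'B' and 'N'.

```julia
# Hypothetical helper, not part of MLJ: count core, boundary and noise points.
function tally_point_types(point_types)
    (core     = count(==('C'), point_types),
     boundary = count(==('B'), point_types),
     noise    = count(==('N'), point_types))
end

# Stand-in data; with a machine one would pass `report(mach).point_types`
# (assumed to compare equal to the plain characters 'C', 'B' and 'N').
tally = tally_point_types(['C', 'C', 'B', 'N', 'C'])
# → (core = 3, boundary = 1, noise = 1)
```

A large noise count relative to the data size is one sign that radius may be too small or min_neighbors too large for the data at hand.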
Examples
using MLJ
X, labels = make_moons(400, noise=0.09, rng=1) ## synthetic data with 2 clusters; X
y = map(labels) do label
    label == 0 ? "cookie" : "monster"
end;
y = coerce(y, Multiclass);
DBSCAN = @load DBSCAN pkg=Clustering
model = DBSCAN(radius=0.13, min_cluster_size=5)
mach = machine(model)
## compute and output cluster assignments for observations in `X`:
yhat = predict(mach, X)
## get DBSCAN point types:
report(mach).point_types
report(mach).nclusters
## compare cluster labels with actual labels:
compare = zip(yhat, y) |> collect;
compare[1:10] ## clusters align with classes
## visualize clusters, noise in red:
points = zip(X.x1, X.x2) |> collect
colors = map(yhat) do i
    i == 0 ? :red :
    i == 1 ? :blue :
    i == 2 ? :green :
    i == 3 ? :yellow :
    :black
end
using Plots
scatter(points, color=colors)