Understanding the DBSCAN clustering algorithm

Richard Olson
6 min read · Aug 28, 2020


The DBSCAN algorithm (Density-Based Spatial Clustering of Applications with Noise) is a type of clustering algorithm. Clustering algorithms are an unsupervised learning method in machine learning: they try to put the data into groups, or clusters, of points that are similar to each other.

K-means clustering compared to DBSCAN

Below are two examples of clustering: one using k-means clustering and the other using DBSCAN. The same data points are used for both algorithms.

In the DBSCAN plot below, the grey dots are the outliers, which did not get grouped into any of the clusters.

DBSCAN using sklearn with eps=0.5 and min_samples=6
K-means clustering

DBSCAN differs from k-means clustering by grouping data points that lie in higher-density areas into clusters while leaving out points that fall in lower-density areas. This means that some data points in a DBSCAN result may not belong to any cluster, because the algorithm views them as outliers.

The k-means clustering algorithm, on the other hand, assigns each data point to the cluster with the nearest centroid. K-means never treats a data point as an outlier.
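To make the contrast concrete, here is a minimal sketch of the two algorithms run on the same data using scikit-learn. The two-moons dataset and the parameter values are my own choices for illustration, not the data from the figures above; the point to notice is that DBSCAN labels its outliers as -1, while k-means assigns every point to some cluster.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN, KMeans

# Synthetic data chosen purely for illustration.
X, _ = make_moons(n_samples=500, noise=0.08, random_state=0)

db_labels = DBSCAN(eps=0.2, min_samples=6).fit_predict(X)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN marks outliers with the label -1; k-means never does.
print("DBSCAN outliers:", np.sum(db_labels == -1))
print("K-means outliers:", np.sum(km_labels == -1))  # always 0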

Writing My Own DBSCAN algorithm

To help myself deepen my understanding of the DBSCAN algorithm, I decided to write my own implementation. To understand my implementation, it is necessary to first explain what the rules of clustering are for the DBSCAN algorithm.

DBSCAN Parameters

The DBSCAN algorithm takes in two parameters: “eps” and “min_samples”. These two parameters are used to determine how each data point is classified.

Eps: the maximum radius, or distance, from a point within which other points are considered neighbors of that point. These neighborhoods are what the algorithm uses to build the clusters.

Min_samples: the minimum number of data points needed to create a cluster. These data points need to be neighbors of each other.

I will further discuss the importance of these two parameters when going over the classification of the different types of points in a DBSCAN.
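As a quick illustration of how these two parameters steer the result, the sketch below runs scikit-learn's DBSCAN with a few different settings on a synthetic dataset (the dataset and the specific values are arbitrary examples of my own):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

for eps, min_samples in [(0.3, 6), (0.5, 6), (0.5, 15)]:
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 is the outlier label
    n_outliers = np.sum(labels == -1)
    print(f"eps={eps}, min_samples={min_samples}: "
          f"{n_clusters} clusters, {n_outliers} outliers")

Smaller eps or larger min_samples generally produces more outliers, because fewer points have enough close neighbors to qualify as core points.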

Classifying the Data Points in DBSCAN

The DBSCAN algorithm classifies each data point as one of three kinds of points in order to create the clusters: core points, border points, and outliers.

Core Point:

A point is considered to be a core point if its number of neighbors is equal to or greater than the minimum number of samples (min_samples). Core points are always found within a cluster. Again, the minimum number of samples is a number chosen by the user and passed in as a parameter to the DBSCAN class.

Border Point:

A border point is a point that does not have a sufficient number of neighbors to be considered a core point itself, but at least one of its neighbors is a core point. A border point becomes part of the cluster that its core-point neighbor belongs to.

Outlier:

An outlier is a point that does not have a sufficient number of neighbors to be considered a core point and none of its neighbors are core points. Outliers are not grouped into any clusters.
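Putting the three definitions together, the classification step can be sketched with a small brute-force function like the one below. The function name and structure are my own illustration, not the exact code from my implementation, and it counts a point as its own neighbor (which is how scikit-learn treats min_samples):

import numpy as np

def classify_points(X, eps, min_samples):
    """Return 'core', 'border', or 'outlier' for each row of X."""
    n = len(X)
    # Neighbors of each point: every point within eps (the point itself included).
    neighbors = [
        np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0] for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbors]

    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            labels.append("border")   # not core, but next to a core point
        else:
            labels.append("outlier")  # not core and no core neighbors
    return labels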

A Visual Explanation of the Types of Points

Below is an image from Wikipedia which shows a visual of what a core point, a border point, and an outlier are.

In this example, the parameter “min_samples” has been set to 3. Therefore, each point needs at least 3 neighbors within the distance of “eps” to be considered a core point. The image shows the “eps” parameter as the empty circle surrounding each data point.

In the image below, each of the red data points is a core point, while the yellow points are border points and the blue point is an outlier.

From Wikipedia

How I wrote my algorithm

The following is the pseudocode for the DBSCAN algorithm:

For each point in the data:
    Find the neighbors of the point.
    Label the point a core point if it has the required number of neighbors.

For each point labeled as a core point:
    If the point has not been visited, mark it as visited and start a cluster with the point.
    For each neighbor of the point:
        Add the neighbor to the cluster.
        If the neighbor is a core point and has not been visited:
            Mark it as visited and add its neighbors to the cluster.
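Translated into Python, a minimal runnable sketch of that pseudocode might look like the following. This is my own illustrative version rather than the exact code from my repository; it uses the same brute-force neighbor search discussed later in this post.

import numpy as np

def dbscan(X, eps, min_samples):
    """Brute-force DBSCAN sketch. Returns one label per point; -1 means outlier."""
    n = len(X)
    # Step 1: find each point's neighbors and flag the core points.
    neighbors = [
        np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0] for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbors]

    labels = np.full(n, -1)            # -1 = not assigned to any cluster yet
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0

    # Step 2: grow a cluster outward from each unvisited core point.
    for i in range(n):
        if not is_core[i] or visited[i]:
            continue
        visited[i] = True
        labels[i] = cluster_id
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id      # border or core point joins the cluster
            if is_core[j] and not visited[j]:
                visited[j] = True
                queue.extend(neighbors[j])  # only core points keep expanding the cluster
        cluster_id += 1
    return labels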

How does my implementation compare to the sklearn version of DBSCAN?

Below is a visualization of my algorithm and the sklearn algorithm run on the same 2,000 data points:

My DBSCAN — took about 31.230083 seconds
Sklearn DBSCAN — took about 0.011992 seconds

The main difference — Speed

If you look at both of these images, you can see that they are quite similar, with some slight differences in the clusters. The greatest difference is how long each algorithm took to run: the sklearn version took a mere 0.01 seconds, while my version took about 31.2 seconds to finish.

Part of the difference comes from how each algorithm finds the neighbors of each point. In my own algorithm, finding the neighbors was done by brute force: for each point in the data set, I looped through the complete data set again to find that point's neighbors. This gave my algorithm a time complexity of O(n²).
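In other words, the neighbor search in my first version amounted to a nested loop over all points, roughly like this sketch (illustrative, not the exact code from the repository):

import numpy as np

def find_neighbors_brute_force(X, eps):
    """For every point, scan the entire dataset again: O(n^2) distance checks."""
    neighbors = []
    for i in range(len(X)):                      # outer loop: every point
        nearby = []
        for j in range(len(X)):                  # inner loop: every point again
            if np.linalg.norm(X[i] - X[j]) <= eps:
                nearby.append(j)
        neighbors.append(nearby)
    return neighbors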

In the sklearn version, a parameter called “algorithm” is set to “auto” by default. This parameter chooses the method used to find the neighbors of each point. When it is set to “auto”, sklearn picks whichever neighbor-search method it deems most efficient.

By using a tree, either a “kd-tree” or a “ball-tree”, the DBSCAN algorithm no longer has to loop through the entire set of data points for each point, because the points are partitioned in the tree according to their coordinates.
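scikit-learn exposes this tree-based neighbor search directly through its NearestNeighbors class, so a rough sketch of the idea (the radius and dataset here are just example values of my own) looks like this:

from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=2000, centers=4, random_state=0)

# Build a kd-tree once, then answer each radius query by descending the tree
# instead of scanning all n points.
nn = NearestNeighbors(radius=0.5, algorithm="kd_tree").fit(X)
neighbors = nn.radius_neighbors(X, return_distance=False)

print(len(neighbors[0]), "neighbors found for the first point")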

My own KD-tree

I decided to implement my own “kd-tree” for my DBSCAN algorithm to see how much faster it would make the algorithm run. Below is a visualization of the same data points as before, but with my algorithm using the “kd-tree” I implemented. This time my algorithm took 0.803998 seconds, which was a vast improvement in speed.

My DBSCAN — took about 0.803998 seconds

It is also interesting to note that the clusters my algorithm produces when using the “kd-tree” look more like the clusters from the sklearn version of DBSCAN. This is likely because of the change in how the algorithm finds the neighbors of each point.
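For readers curious what a bare-bones “kd-tree” looks like, here is a heavily simplified sketch of the idea: build the tree by splitting on alternating axes at the median, then prune whole branches during a radius search whenever the splitting plane is farther than eps from the query point. My actual implementation differs in the details and lives in the repository linked below.

import numpy as np

class KDNode:
    def __init__(self, point, index, axis, left=None, right=None):
        self.point, self.index, self.axis = point, index, axis
        self.left, self.right = left, right

def build_kdtree(points, indices=None, depth=0):
    """Recursively split the points on alternating axes at the median."""
    if indices is None:
        indices = list(range(len(points)))
    if not indices:
        return None
    axis = depth % points.shape[1]
    indices.sort(key=lambda i: points[i][axis])
    mid = len(indices) // 2
    return KDNode(
        points[indices[mid]], indices[mid], axis,
        build_kdtree(points, indices[:mid], depth + 1),
        build_kdtree(points, indices[mid + 1:], depth + 1),
    )

def radius_search(node, target, eps, found):
    """Collect indices of all points within eps of target, pruning far branches."""
    if node is None:
        return
    if np.linalg.norm(node.point - target) <= eps:
        found.append(node.index)
    diff = target[node.axis] - node.point[node.axis]
    # Always search the side the target falls on; search the other side only
    # if the splitting plane itself is within eps of the target.
    near, far = (node.left, node.right) if diff <= 0 else (node.right, node.left)
    radius_search(near, target, eps, found)
    if abs(diff) <= eps:
        radius_search(far, target, eps, found)

# Example: 2,000 random 2-D points, radius query around the first point.
X = np.random.RandomState(0).rand(2000, 2)
tree = build_kdtree(X)
hits = []
radius_search(tree, X[0], 0.05, hits)
print(len(hits), "points within 0.05 of the first point")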

If you would like to look at my code it can be found on my Github page here.
