K-means++ & HDBSCAN

In this project, two clustering models were compared to see the advantages of each over the other. Both are unsupervised learning methods: they group the data without requiring true labels. The data was generated synthetically so that it contains complex shapes that test the capabilities of each model.

Dataset

For this project, two datasets were generated, each designed to probe the capabilities of the models in a different way. The first is a double crescent with two classes, and the second is a set of spots (blobs) with three classes.
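As a minimal sketch of how two datasets like these could be generated, the snippet below uses scikit-learn's `make_moons` and `make_blobs`. The library choice, sample counts, noise level, and random seed are assumptions for illustration; the original generation code is not shown in this document.

```python
from sklearn.datasets import make_blobs, make_moons

# Double crescent: two interleaving half-circles, two classes.
# n_samples, noise and random_state are illustrative values, not the project's.
X_moons, y_moons = make_moons(n_samples=500, noise=0.05, random_state=0)

# Spots: three Gaussian blobs, three classes.
X_blobs, y_blobs = make_blobs(n_samples=500, centers=3, cluster_std=0.6, random_state=0)

print(X_moons.shape, X_blobs.shape)
```

The true labels (`y_moons`, `y_blobs`) are not used by either clustering model; they are only useful afterwards for checking how well the groups were recovered.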

Analysis of data

Two main tests were carried out. The first uses the two-crescent shape, which contains two groups; the second uses agglomerates (spots) containing three groups. These cases were chosen to give two contrasting situations: K-means++ is known to perform poorly on crescents, while the spots and crescents together show the adaptive behavior of density-based clustering with HDBSCAN.

Training

K-means++

To train the K-means++ model, the initial centroids must first be positioned; a good initialization improves both the convergence time and the accuracy of the model. The figures show the initial centroids, placed far apart from each other: three for the spots dataset and two for the crescents dataset.
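The seeding step can be sketched in isolation with scikit-learn's `kmeans_plusplus` function. This is an illustration under assumptions (scikit-learn, a stand-in blob dataset, an arbitrary seed), not the project's own code. Note that k-means++ does not strictly maximize the distance between centroids: each new centroid is sampled with probability proportional to its squared distance from the nearest centroid already chosen, which tends to spread them out.

```python
import numpy as np
from sklearn.cluster import kmeans_plusplus
from sklearn.datasets import make_blobs

# Stand-in for the spots dataset (illustrative parameters).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# Pick 3 initial centroids with k-means++ seeding.
# `indices` gives the rows of X that were selected as centroids.
centers, indices = kmeans_plusplus(X, n_clusters=3, random_state=0)

print(centers)
```

For the crescents dataset the same call would be made with `n_clusters=2`.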

HDBSCAN

HDBSCAN needs no centroid initialization and no preset number of clusters. It builds a hierarchy of clusters from the density of the data and extracts the most stable clusters from that hierarchy, so the number of groups is found from the data itself.

Results

K-means++

The following figures show the results of the K-means++ model for the two datasets (spots and crescents). The left panel uses normalized data, the middle panel shows the original data, and the right panel uses unnormalized data. Both outer panels also show the resulting cluster centers. For blob-type data the model does well, with or without normalization; for the half-moon data it fails to recover the groups in either case.
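The comparison above can be reproduced in outline with the sketch below. It assumes scikit-learn, uses `StandardScaler` as the normalization (the document does not say which normalization was applied), and scores each run with the adjusted Rand index against the known synthetic labels. The helper `kmeans_ari` is a hypothetical name introduced for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler


def kmeans_ari(X, y, k, normalize):
    """Fit K-means with k-means++ seeding and return the adjusted Rand index."""
    if normalize:
        # Assumed normalization: zero mean, unit variance per feature.
        X = StandardScaler().fit_transform(X)
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    return adjusted_rand_score(y, model.fit_predict(X))


# Stand-in datasets (illustrative parameters).
X_blobs, y_blobs = make_blobs(n_samples=500, centers=3, cluster_std=0.6, random_state=0)
X_moons, y_moons = make_moons(n_samples=500, noise=0.05, random_state=0)

for normalize in (True, False):
    ari_blobs = kmeans_ari(X_blobs, y_blobs, k=3, normalize=normalize)
    ari_moons = kmeans_ari(X_moons, y_moons, k=2, normalize=normalize)
    print(f"normalize={normalize}: blobs ARI={ari_blobs:.2f}, moons ARI={ari_moons:.2f}")
```

An adjusted Rand index close to 1 means the clusters match the true groups; values near 0 mean the assignment is no better than chance.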

Figure panels (left to right): Normalized, Original data, Not normalized.

HDBSCAN

The following images show the best results of the HDBSCAN model for the two datasets. For the spots dataset, normalization gave the better result; for the crescent-shaped dataset, the better result was obtained without normalization.

Figure panels (left to right): Normalized, Original data, Not normalized.

3 neighbors and 30 elements minimum for the group.

Figure panels (left to right): Normalized, Original data, Not normalized.

5 neighbors and 30 elements minimum for the group.

Results comparison

HDBSCAN handles both datasets well, while K-means++ cannot handle complex, curved shapes: it assigns each point to its nearest centroid, and no placement of centroids can separate interleaved, non-convex groups such as the crescents. In conclusion, HDBSCAN is the better choice for grouping data with more complex structure, but it comes at a higher computational cost.

HDBSCAN, 5 neighbors and 30 elements minimum for a group.

Original data.

K-means++, normalized data, 2 groups.

HDBSCAN, 3 neighbors and 30 elements minimum for a group.

Original data.

K-means++, normalized data, 3 groups.