Data stream clustering refers to the clustering of data that arrives continuously, such as financial transactions, multimedia data, or telephone records. It is usually studied as a streaming algorithm, and the goal is to construct a good clustering of the stream using only a small amount of time and memory.
Technically, clustering is the act of partitioning a set of elements into groups. The main purpose of this separation is to unite items that are similar to each other, based on a comparison of a series of their characteristics. For data streams, the clustering methods can be separated into five categories, namely partitioning, hierarchical, density-based, grid-based, and model-based, which are detailed in the course of the article.
There is one more factor to take into account when talking about clusters: the distance between them. The inter-cluster distance can be measured in four ways: the minimum distance (single linkage), the maximum distance (complete linkage), the mean distance (between centroids), and the average distance (over all pairs of points). Each has its own characteristics regarding implementation cost and computational power, and the minimum and mean distances are the ones most commonly used in data stream clustering.
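The four distances above can be sketched in plain Python. The function names and the representation of points as tuples of floats are illustrative assumptions, not part of any standard library:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points given as tuples of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(c1, c2):
    """Minimum distance: closest pair of points, one from each cluster."""
    return min(euclidean(a, b) for a in c1 for b in c2)

def complete_linkage(c1, c2):
    """Maximum distance: farthest pair of points, one from each cluster."""
    return max(euclidean(a, b) for a in c1 for b in c2)

def average_linkage(c1, c2):
    """Average distance over all cross-cluster pairs of points."""
    return sum(euclidean(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def mean_distance(c1, c2):
    """Mean distance: distance between the two cluster centroids."""
    m1 = tuple(sum(xs) / len(c1) for xs in zip(*c1))
    m2 = tuple(sum(xs) / len(c2) for xs in zip(*c2))
    return euclidean(m1, m2)
```

Note that the mean distance needs only the two centroids, while single and complete linkage compare every cross-cluster pair, which is one reason the mean distance is attractive in a streaming setting.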
Similar to traditional data clustering, data stream clustering methods can be classified into the five categories listed below.
This type of method groups data into a fixed, previously defined number of clusters, using an iterative process to reassign data from one group to another. A notable example is STREAM, which can build clusters using limited time and memory. Like other partitioning methods, it is intrinsically tied to the factor k, the number of clusters fixed at the outset. Partitioning methods are best suited to discovering spherical clusters; that is, they set the centre of each cluster so that the distance between the objects in the cluster and its centre is minimised.
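A minimal sequential (online) k-means sketch illustrates the partitioning idea under limited memory: only k centres and k counts are kept, whatever the stream length. This is not the STREAM algorithm itself; the function name, the seeding from the first k points, and the decreasing learning rate are illustrative assumptions:

```python
def sequential_kmeans(stream, k):
    """Online k-means sketch: each arriving point pulls its nearest centre
    towards it with a step size of 1/count, so memory use stays O(k)."""
    stream = iter(stream)
    centres = [list(next(stream)) for _ in range(k)]  # first k points seed the centres
    counts = [1] * k
    for point in stream:
        # find the nearest centre by squared Euclidean distance
        j = min(range(k),
                key=lambda i: sum((c - x) ** 2 for c, x in zip(centres[i], point)))
        counts[j] += 1
        eta = 1.0 / counts[j]  # decreasing step size
        centres[j] = [c + eta * (x - c) for c, x in zip(centres[j], point)]
    return centres
```

Each centre ends up at the running mean of the points assigned to it, which is the sense in which the distance to the centre is minimised.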
Hierarchical methods group a data set by building a hierarchical tree of clusters, which can be constructed divisively (top-down) or agglomeratively (bottom-up). Some of the best-known hierarchical algorithms are BIRCH, CURE, ROCK, and CHAMELEON.
The result is a set of clusters that are distinct from each other, while within each cluster the objects are similar.
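The agglomerative (bottom-up) construction can be sketched as follows, here with single linkage. The naive all-pairs search is purely illustrative (the named algorithms above use far more efficient summaries):

```python
import math

def single_linkage(c1, c2):
    """Minimum distance between two clusters (closest cross-cluster pair)."""
    return min(math.dist(a, b) for a in c1 for b in c2)

def agglomerative(points, num_clusters, linkage):
    """Bottom-up clustering sketch: start with one cluster per point and
    repeatedly merge the closest pair under the given linkage function
    until num_clusters clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # merge the closest pair
    return clusters
```

Stopping the merging at different levels yields the different layers of the hierarchical tree.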
This type of method examines the density profile of the data to be grouped. Clusters are taken to be regions with a high concentration of objects, separated by regions where there is little or no data. One of the advantages of this approach is that it can find clusters of arbitrary shape, without requiring a predefined number of clusters. Examples of these methods include DBSCAN, OPTICS, and PreDeCon.
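A minimal DBSCAN-style sketch shows the density idea: core points have at least `min_pts` neighbours within radius `eps`, and clusters grow by expanding from core points. It assumes all points fit in memory, so it illustrates the principle rather than a true streaming variant; the parameter names follow common DBSCAN convention:

```python
import math

def dbscan(points, eps, min_pts):
    """DBSCAN sketch: returns one label per point; -1 marks noise."""
    n = len(points)
    labels = [None] * n

    def neighbours(i):
        # indices within eps of point i (the point itself is included)
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # noise (may still become a border point later)
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins, but does not expand
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbours(j)
            if len(jn) >= min_pts:  # j is also core: keep expanding
                queue.extend(jn)
    return labels
```

Because clusters are grown by chaining dense neighbourhoods, the result can follow any shape the dense regions take.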
This type of clustering approach differs greatly from the traditional ones, since it is not concerned with the data points themselves but with the value space that surrounds them. It can be described in five basic steps: creating the grid structure by choosing a finite number of cells; calculating the density of each cell; classifying the cells according to their densities; identifying cluster centres from these densities; and traversing neighbouring cells to merge them. Examples of this type of method include DENCLUE, which uses a grid of fixed cell size, STING, and WaveCluster.
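The five steps can be sketched as follows for two-dimensional points. The `cell_size` and `density_threshold` parameters and the 8-neighbour merging rule are hypothetical illustrations, not taken from any of the named algorithms:

```python
from collections import defaultdict

def grid_cluster(points, cell_size, density_threshold):
    """Grid-based sketch: bin points into cells, keep dense cells, and
    merge neighbouring dense cells into clusters of cells."""
    # Steps 1-2: create the grid and compute the density of each cell
    density = defaultdict(int)
    for x, y in points:
        density[(int(x // cell_size), int(y // cell_size))] += 1
    # Step 3: classify cells, keeping only those above the threshold
    dense = {c for c, d in density.items() if d >= density_threshold}
    # Steps 4-5: identify clusters by traversing neighbouring dense cells
    clusters, seen = [], set()
    for cell in dense:
        if cell in seen:
            continue
        group, queue = [], [cell]
        seen.add(cell)
        while queue:
            cx, cy = queue.pop()
            group.append((cx, cy))
            for dx in (-1, 0, 1):          # check the 8 surrounding cells
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        clusters.append(group)
    return clusters
```

Each point is touched only once, after which all work happens on the (much smaller) grid, which is why grid-based methods scale well with stream length.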
This type of clustering method aims to optimise the fit between the data and some statistical model. These methods can be divided into those based on statistical models, such as fuzzy clustering, and those based on neural networks. Some examples of model-based methods are Expectation-Maximization (EM) and SWEM, the latter applying EM over a sliding analysis window of the stream. It is worth pointing out that model-based methods are strongly associated with artificial intelligence, and not without reason: they are among the most used in that branch of computing.
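A minimal EM sketch for a two-component, one-dimensional Gaussian mixture shows the model-fitting idea. The initialisation from the data extremes and the fixed iteration count are illustrative assumptions; SWEM would additionally restrict the updates to a sliding window over the stream:

```python
import math

def em_gmm_1d(data, iters=50):
    """EM sketch for a 2-component 1-D Gaussian mixture.
    E-step: compute each component's responsibility for each point.
    M-step: re-estimate weights, means, and variances from those
    responsibilities. Returns (weights, means, variances)."""
    w = [0.5, 0.5]
    mu = [min(data), max(data)]  # crude but deterministic initialisation
    var = [1.0, 1.0]
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point
        resp = []
        for x in data:
            p = [w[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) for k in range(2)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step: re-estimate the model parameters
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, data)) / nk, 1e-6)
    return w, mu, var
```

The responsibilities give a soft assignment of each point to each cluster, which is what distinguishes model-based methods from the hard partitions produced by the earlier categories.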