Abstract
Group detection without prior knowledge, i.e., clustering, is one of the most important tasks in data analysis and has been addressed in many applications across various fields. Clustering becomes challenging when the group sizes differ greatly (so-called imbalanced data) and the groups also differ in density and shape. The task is even more difficult for high-dimensional data, since it is very hard to state any assumption about specific characteristics of the groups (sizes, densities, or shapes). However, many clustering techniques are built upon some of these assumptions. For instance, the popular k-means method can be shown to be a particular case of the EM algorithm for data generated by Gaussian mixtures. In addition, many clustering algorithms (including k-means) require the ad hoc specification of parameters, especially the number of clusters, which is almost impossible to know beforehand. Unfortunately, the final clustering solution usually depends on the choice of these predefined parameters.

We propose an algorithm that identifies clusters in imbalanced high-dimensional data. Our procedure uses an existing clustering method to detect a homogeneous set of initial clusters, which are then successively merged to build the final clusters. The merging of a pair of initial clusters is based on the Local Outlier Factor (LOF), which allows final clusters of arbitrary sizes to be captured without assumptions on cluster characteristics. In imbalanced data, the small size of some groups makes their observations atypical. Therefore, in addition to describing the data structure, we focus in particular on the ability to find these interesting groups. The usefulness of our approach is demonstrated on imbalanced media data sets, where it outperforms state-of-the-art methods.
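To make the role of the Local Outlier Factor more concrete, the following is a minimal sketch of the standard LOF definition in plain Python. It is not the merging procedure proposed in the paper, which the abstract does not specify. The function name `local_outlier_factor`, the neighbourhood size `k`, and the toy data are assumptions chosen purely for illustration.

```python
import math


def local_outlier_factor(points, k):
    """Return the LOF score of every point (standard textbook definition).

    Illustrative only: assumes no duplicate points and k < len(points).
    """
    n = len(points)
    # Pairwise Euclidean distances.
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors, k_dist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        kd = dist[i][order[k - 1]]  # distance to the k-th nearest neighbour
        k_dist.append(kd)
        # The k-neighbourhood keeps all points tied at the k-distance.
        neighbors.append([j for j in order if dist[i][j] <= kd])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean density of the neighbours relative to the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


# Toy data (assumed): a dense 5x5 grid plus one isolated observation.
points = [(float(x), float(y)) for x in range(5) for y in range(5)]
points.append((20.0, 20.0))

scores = local_outlier_factor(points, k=3)
most_atypical = max(range(len(points)), key=lambda i: scores[i])
print(points[most_atypical], scores[most_atypical])
```

Points inside a region of uniform density receive a score close to 1, while an observation whose neighbourhood is much sparser than that of its neighbours receives a score well above 1. This is the sense in which observations from very small groups can appear atypical.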
Reference
Brodinova, S., Zaharieva, M., Filzmoser, P., Ortner, T., & Breiteneder, C. (2016). Group Detection in the Context of Imbalanced Data. International Conference COMPUTER DATA ANALYSIS & MODELING, Minsk, Belarus, Non-EU. http://hdl.handle.net/20.500.12708/86323