Constrained or semi-supervised clustering uses constraints to influence the “pure” clustering process, i.e. one that only seeks to maximize the similarity of instances placed in the same cluster. The earliest work focused on instance-level constraints, specifying that two instances must or must not be placed in the same cluster. Later work added further constraints, e.g. on the minimum size or the diameter of clusters. The last 15 years have shown the appeal of declarative paradigms (Constraint Programming, SAT, Integer Linear Programming) for solving data mining problems (pattern mining, clustering) under constraints. Nevertheless, a trade-off has to be found between the expressiveness of the constraints and the efficiency of the solver. In contrast, we will address this expressiveness/efficiency compromise from the perspective of learning in dissimilarity spaces, which makes it possible to avoid searching for an “optimal” representation of constraints/data when these are described in large spaces mixing variables of different types. This requires defining a common representation so that data/constraints are projected into smaller feature/constraint spaces while preserving their neighborhood properties. The objective is to design such models using random forests, in order to overcome the major bottleneck identified above: selecting relevant representations for the different types of constraints/data available, and learning in that heterogeneous space. Random Forests, in which L. Heutte and S. Bernard already have strong expertise, are good candidates for overcoming this bottleneck, since they embed a similarity estimation mechanism that can be exploited to compute dissimilarities between samples while taking into account their labels (in supervised mode) or their constraints (in unsupervised mode) if needed. This mechanism has already proven efficient and flexible in a wide range of learning tasks, tackling both the heterogeneity and the high dimensionality of data by optimizing a function tailored to the supervised learning task.
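To make this mechanism concrete, here is a minimal sketch of a forest-based dissimilarity, assuming a scikit-learn Random Forest: two samples are deemed similar in proportion to the number of trees in which they reach the same leaf, a standard instantiation of the forest proximity idea (the project may rely on a different, refined variant).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

def forest_dissimilarity(forest, X):
    """Pairwise dissimilarity: 1 minus the fraction of trees in which
    two samples fall into the same leaf (the classic RF proximity)."""
    # leaves[i, t] = index of the leaf reached by sample i in tree t
    leaves = forest.apply(X)                      # shape (n_samples, n_trees)
    same_leaf = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
    return 1.0 - same_leaf

# Fit on labeled data, then reuse the induced metric for clustering.
X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
D = forest_dissimilarity(rf, X)                   # values in [0, 1]
```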
As for pattern mining, we will explore several directions to adapt this framework to unsupervised or semi-supervised data. The first aims to build a good dissimilarity space using random forest based dissimilarity learning on the supervised data. The second will investigate how different types of constraints can be used to guide the similarity/dissimilarity learning (e.g. a cannot-link constraint between two samples leads to maximum dissimilarity, whereas a must-link constraint leads to full similarity), and how they can be used to refine the clustering interactively, labeling unsupervised data according to user feedback. Additionally, we have to decide which kinds of feedback to exploit: a cluster being split, some instances being reassigned, some instances that cannot be put together, or, more interestingly, constraints on the patterns that describe the clusters.
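As an illustration of the second direction, the sketch below (a hypothetical helper, reusing the dissimilarity matrix D computed above) injects hard pairwise constraints into a learned dissimilarity before clustering: must-link pairs are forced to full similarity, cannot-link pairs to maximum dissimilarity.

```python
import numpy as np

def apply_pairwise_constraints(D, must_link=(), cannot_link=()):
    """Overwrite entries of a dissimilarity matrix with hard constraints:
    must-link -> full similarity (0), cannot-link -> max dissimilarity."""
    D = D.copy()
    d_max = D.max()
    for i, j in must_link:
        D[i, j] = D[j, i] = 0.0
    for i, j in cannot_link:
        D[i, j] = D[j, i] = d_max
    return D

# e.g. the user asserts samples 0 and 1 belong together, 0 and 100 do not
D_constrained = apply_pairwise_constraints(D, must_link=[(0, 1)],
                                           cannot_link=[(0, 100)])
```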
- Deriving constraints from user feedback to use in a Random Forest based Dissimilarity learning framework. Given user feedback, for instance that an existing cluster should be refined into two (or more) subclusters, one of which might be concretely defined by the user, the question is how to arrive at a clustering in which those subclusters arise naturally. Depending on the clustering method used, this might require changing the number of desired clusters (for k-Means, for instance), simply cutting the dendrogram somewhere else (in hierarchical clustering), or other changes to parameters. We will first have to clearly define the semantics behind the feedback, and then define the constraints or parameters needed to express them. Once we have found a way to express the constraints as similarities/dissimilarities between samples (which can be as simple as mentioned above, or can take other types of constraints into account), they will be used to guide the Random Forest Dissimilarity learning to refine the metric used for clustering, i.e. clustering the data using an adaptive dissimilarity (from a metric learning perspective), incrementally, benefiting from both supervised and unsupervised data. Furthermore, in the current state of the art, constraints (except must-link and cannot-link constraints) are usually imposed on all clusters, whereas feedback is often local to specific clusters. As random forests are ensembles of trees, it is easy to explore the neighborhoods of different (sub-)clustering solutions in smaller subspaces in a more meaningful manner, until the required changes have been obtained.
- Offering the user choice via the static/dynamic selection of (sub-)clusterings. Instead of a single clustering that the user gives feedback on, one can offer the user several choices and let him/her pick one, or merge several that show desired characteristics in parts of them. To do this, we will explore the static or dynamic selection of trees in the forest in order to compare several (sub-)clusterings, as sketched below. The random forest based dissimilarity learning framework makes it possible to return a single global similarity score for each sub-clustering, which the user can then interpret. Merging different clusterings under constraints given by the expert will also be tackled, both with purely declarative methods that guarantee finding a merged clustering satisfying all the constraints, to the detriment of efficiency, and with more efficient heuristic methods that may not satisfy all the constraints.
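The following sketch, under the same assumptions as above, illustrates the static-selection idea: each random subset of trees induces its own dissimilarity, and clustering each one (here, hierarchical clustering on the precomputed matrix) yields alternative (sub-)clusterings to present to the user.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
leaves = rf.apply(X)  # leaf indices, one column per tree (see above)

candidate_clusterings = []
for _ in range(5):
    # Statically select a random subset of trees ...
    subset = rng.choice(leaves.shape[1], size=25, replace=False)
    sub = leaves[:, subset]
    D_sub = 1.0 - (sub[:, None, :] == sub[None, :, :]).mean(axis=2)
    # ... and cluster with the dissimilarity that subset induces.
    # ("metric" was named "affinity" before scikit-learn 1.2)
    labels = AgglomerativeClustering(
        n_clusters=3, metric="precomputed", linkage="average"
    ).fit_predict(D_sub)
    candidate_clusterings.append(labels)
```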
- Using constraints as goodness criteria for clustering. Much existing work exploits constraints directly to control the clustering process towards solutions that satisfy them, often by directly manipulating the assignment of instances. Recent work, however, has instead proposed to generate a (potentially large) collection of candidate clusterings, with or without taking constraints into account, and to retain only those that satisfy the constraints best. The advantage of such solutions is that they allow much more flexibility in terms of the clustering techniques used and give a diversity of solutions; the drawback is that complete constraint satisfaction is not guaranteed, which might not be acceptable to a user. Following this idea, the random forest based dissimilarity learning framework by construction offers an elegant way to search for diverse (sub-)clustering ensembles using static or dynamic selection, so that some or all of the constraints are satisfied, thus bringing an additional feedback option for users (interactive clustering), as illustrated in the sketch below.
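To fix ideas, here is a minimal sketch of the retention step (a hypothetical scoring function, reusing the candidate clusterings and constraint pairs from the sketches above): each candidate is scored by the fraction of must-link and cannot-link constraints it satisfies, and the best-scoring ones are kept.

```python
def constraint_satisfaction(labels, must_link=(), cannot_link=()):
    """Fraction of pairwise constraints satisfied by a clustering."""
    checks = [labels[i] == labels[j] for i, j in must_link]
    checks += [labels[i] != labels[j] for i, j in cannot_link]
    return sum(checks) / len(checks) if checks else 1.0

# Keep the candidate clusterings that best satisfy the user's constraints.
scored = sorted(candidate_clusterings,
                key=lambda lab: constraint_satisfaction(
                    lab, must_link=[(0, 1)], cannot_link=[(0, 100)]),
                reverse=True)
best = scored[0]
```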