Clustering and pattern mining are tightly coupled in our project to support interpretability/explainability and feedback options as well as ease the task of finding results.
- Chemoinformatics use cases For several reasons, we have chosen to focus on a chemoinformatics use case. First, as described at the beginning of the document, explorative data mining in chemoinformatics is the prototypical unsupervised/semi-supervised mining task. Second, molecular data can be described in different, sometimes challenging, representations, allowing us to develop and test approaches going beyond attribute-value data. Finally, having large sets of well-curated data is not always easy but is possible in this setting. One of our objectives is to further the identification of associations between structural and chemical properties, and biochemical behavior. The success of the process is strongly related to the capability of a chemoinformatician to parametrize a clustering. As such, the integrated system will be continuously employed for explorative data mining on molecular data with feedback given w.r.t. ease of use, plausibility of delivered results, and intuitiveness of the system’s behavior when reacting to the solution. In addition, expert knowledge will be a vital source of constraint formulations, as detailed below. A user would need to be able to choose a clustering algorithm, a similarity measure, and possibly to select any attribute or to weight it (in the attribute-value case). This means that, instead of optimizing the resulting system towards a particular method or similarity measure that plays well with the constraints we intend to enforce, we will include several algorithms (e.g. k-Means, DBScan…) and similarity measures. Next, the user should be able to express constraints based on graph matching. For instance, the expert could express that it is necessary that instances in clusters have common structural characteristics (so-called chemical scaffolds or pattern). The application will concern the histone deacetylase (HDAC) inhibitors. Of the epigenetic modifications identified so far in the nervous system, histone acetylation has been unequivocally associated with facilitating learning and memory. A role for histone acetylation in synaptic plasticity and memory formation is clearly established. Our objective is to define and analyse the chemical space associated to HDAC inhibitors (via an interplay of pattern mining, clustering and chemical expertise).
- Interpretability A cluster (or several) can (and should) often be represented in terms of discrete descriptors of contained instances, which is one way around the dimensionality problem: instead of trying to project a cloud of instances onto a small (<=3) set of dimensions, the user is presented with a conjunction of attribute-value combinations, a sequence, subtree or subgraph, or disjunctions (or sets) thereof. For structured data, the standard setting involves mining a set of such descriptors, i.e. patterns, beforehand and selecting a subset based on the clustering, which entail a trade-off: too few descriptors and the description will not give insights, especially if clusters change according to user feedback, too many and the clustering process will take long and be subject to the curse of dimensionality, i.e. in high-dimensional spaces all instances seem equally (dis)similar. In SCHISM, we instead cluster on the original presentation, derive patterns from formed clusters, and adjust or remine patterns when clusters change due to user feedback. A pattern, as we have indicated above, needs to be presented in the context of other patterns and the data it was mined from to allow for meaningful interpretation and feedback. Compared to the preceding section, the roles are therefore reversed: where the cluster was the result the user is mainly interested in and (sets of) patterns the descriptors that help interpret it, it is now the patterns that users consider and clusters (both in terms of the contained instances and other patterns derived from the clusters) that offer the supporting information.
- Feedback While we will develop new feedback options in SCHISM, we can already discuss existing ways of giving feedback and how they should impact on the two representations. At the cluster level, a user can designate instances that should not be part of a cluster or that should be assigned to an existing cluster. Apart from the fact that this will restart the clustering process (or refine the clustering), it will also impact on representative patterns whose support and discriminatory power will change. We will therefore develop methods that take those additions and subtractions into account in real time, update patterns’ statistics (and potentially their position in a ranking), and restart the pattern mining process if statistics change too strongly. This last point will be a major research interest: how to decide automatically when the mining process should be restarted, with such decisions based on work in statistically sound pattern discovery. A generalization of the just mentioned is to demand that a cluster be split into subclusters or to actively designate a subset of a cluster that should be formed and separated from the rest. An indirect way of going about changing the number of clusters or the cluster membership of certain instances is changing constraint settings on the diameter, intra-cluster similarity, or minimum size of clusters. As before, this will require restarting or refining the clustering process but the effects on patterns are not straight-forward. We will develop theoretical guarantees on the effect such changes have on pattern statistics and develop algorithms exploiting them. At the pattern level, users can indicate that they do not want certain elements in the pattern to occur (which will cause a reassessment of the pattern’s elements and of the instances covered), a rejection of a complete pattern (which, notably, is not the same as rejecting the instances that support it), or an elevation or reduction in terms of its position in a list w.r.t. other patterns, which indirectly translates into a statement about the measures used to decide this position, and therefore about the mining process itself. Such feedback, i.e. limiting the vocabulary of the pattern language, or establishing relative preferences will be used as input to learning algorithms that indicate how to change constraints. Users can also constrain more general measures, such as the size of the pattern, which will impact on the mining process because it limits the parts of the search space that can be explored.
- Mining process Clusters are necessarily limited by the description of the instances that can be placed into them – if a pattern is not part of the description, we obviously cannot describe a cluster in terms of it. The flip side is that changing available patterns changes what clusterings can possibly be expressed. Changing patterns therefore means that only subsets of the data need to be reclustered, easing the computational load. Members of the consortium have introduced the notion of descriptive clustering modelled as a bi-objective optimization problem, namely the requirement for compact clusters w.r.t. a given distance and interpretable ones w.r.t. tags associated to objects. Three optimization criteria have been defined depending on the density of the tags. That work defines an algorithm that maximizes cluster coherence and description clarity at the same time, finding the necessary balance. Any feedback that excludes certain instances of parts of instance descriptions changes the search space for this algorithm and allows new solutions to be discovered. That work is limited to unstructured descriptors, however, and we will exploit ways of adapting it to the structured case. For structured data, excluding a single atom has the same effect as excluding a single attribute-value combination (AV) for unstructured data but excluding a bond involving two atoms excludes more than two AVs. The effect of excluding structured instances is analogous. Patterns mined in an unsupervised manner need criteria about their unexpectedness, statistical significance etc. As soon as at least partial information regarding a partition of the data is available, one can start exploiting measures related to discriminating different groups in the data, e.g. a particular cluster against the rest or a subcluster again other parts of the parent cluster, or describing predefined subsets.
- Interface The interface that we are going to develop will support existing feedback options: rejecting a pattern or cluster straight-up, rejecting or manually reassigning instances, indicating must-link and cannot-link pairs, reordering patterns in a list to indicate importance/preference. We will also add so far unexplored options such as selecting a cluster and requiring that it be split or merged with an (potentially unspecified) other, selecting a pattern and demanding a specialization of it to be included in the result set, selecting a subset of instances and demanding that they form a cluster etc. While there are obvious visual options for how to present those interaction options, we do not know at this point in time whether users will find them intuitive and easy to use. We will therefore go through several iterations of the interface that we will evaluate by letting members of the consortium (both computer scientists and chemoinformaticians) but also outsiders use and evaluate it, for instance in the form of demo presentations at international conferences.