SpeakEasy Features and Applications

A distinctive aspect of the SpeakEasy clustering process is that it detects communities by combining top-down and bottom-up approaches simultaneously. Specifically, each label propagation round takes into account both local and global information about label popularity. A table of global label frequencies is maintained, and in each local propagation step a node adopts, from among the labels it receives, the one that is most unexpected given those global frequencies. This selection rule produces stable communities and leads to fast convergence, without oscillations, in large networks.
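The label choice described above can be sketched for a single node as follows. This is a minimal illustration, not the SpeakEasy implementation: the function name `pick_label` and the scoring rule (a label's share among the received labels minus its share in the global table) are assumptions made for the example, and edge weights are ignored.

```python
from collections import Counter


def pick_label(neighbor_labels, global_counts):
    """Return the received label whose local share most exceeds its global share.

    Hypothetical helper: the "unexpectedness" score used here (observed local
    fraction minus expected global fraction) is an assumption for illustration.
    """
    if not neighbor_labels:
        raise ValueError("node received no labels")

    local_counts = Counter(neighbor_labels)
    total_local = len(neighbor_labels)
    total_global = sum(global_counts.values())

    def surprise(label):
        observed = local_counts[label] / total_local
        expected = global_counts.get(label, 0) / total_global
        return observed - expected

    # Sorting first makes tie-breaking deterministic in this sketch.
    return max(sorted(local_counts), key=surprise)


# Label "a" dominates the network globally, so seeing it locally is unsurprising.
global_counts = Counter({"a": 90, "b": 10})
received = ["a", "a", "a", "b", "b"]
print(pick_label(received, global_counts))  # prints b
```

In this example a plain majority vote would choose "a", whereas weighing the received labels against the global table favors the locally over-represented label "b".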

A second prominent feature of SpeakEasy is its use of consensus clustering to create the final clusters from many replicate partitions. These partitions, all derived from the same dataset, differ because the label propagation algorithm is stochastic, and the extent of their disagreement reflects how stable the community structure of that dataset is. We use this variability both to identify robust communities and to identify multi-community nodes. In contrast, traditional clustering techniques output a single partition whose stability is unclear. For instance, the parameters used in hierarchical clustering can dramatically alter results and may require manual “tuning” to perform well. Such tuning can lead to biological findings that are difficult to replicate because they are driven by data artifacts or specific parameter settings rather than by biological mechanisms. To output robust clusters that are not prone to data artifacts or noise, SpeakEasy runs stochastic label propagation on the same network many times. As the final partition, we select the replicate with the highest average Adjusted Rand Index (ARI) with respect to all other replicates. Subsequently, a co-occurrence matrix is analyzed for each node: if a node co-occurs with the nodes assigned to some other community more frequently than a user-selectable threshold, it is also assigned to that community and thus becomes a multi-community node.
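The two consensus steps above can be sketched in plain Python. This is an illustration under stated assumptions rather than the SpeakEasy code: partitions are assumed to be lists of cluster labels indexed by node, the function names are hypothetical, and co-occurrence with a community is taken here to be the average, over replicates and over that community's members, of how often the node shares a cluster with a member.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(p, q):
    """ARI between two label lists of equal length (at least two nodes)."""
    pairs_both = sum(comb(c, 2) for c in Counter(zip(p, q)).values())
    pairs_p = sum(comb(c, 2) for c in Counter(p).values())
    pairs_q = sum(comb(c, 2) for c in Counter(q).values())
    expected = pairs_p * pairs_q / comb(len(p), 2)
    max_index = (pairs_p + pairs_q) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster in both
        return 1.0
    return (pairs_both - expected) / (max_index - expected)


def representative_partition(partitions):
    """Index of the replicate with the highest mean ARI to all other replicates."""
    def mean_ari(i):
        scores = [adjusted_rand_index(partitions[i], q)
                  for j, q in enumerate(partitions) if j != i]
        return sum(scores) / len(scores)
    return max(range(len(partitions)), key=mean_ari)


def multi_community_nodes(partitions, final, threshold):
    """Map node -> extra communities whose co-occurrence frequency exceeds threshold."""
    members = {}
    for node, label in enumerate(final):
        members.setdefault(label, []).append(node)
    extra = {}
    for node in range(len(final)):
        for label, nodes in members.items():
            if label == final[node]:
                continue  # skip the node's own community
            # Count (replicate, member) pairs in which node and member share a cluster.
            together = sum(run[node] == run[m] for run in partitions for m in nodes)
            if together / (len(partitions) * len(nodes)) > threshold:
                extra.setdefault(node, []).append(label)
    return extra


# Three replicate partitions of six nodes; the middle one is closest to the others.
runs = [
    [0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
best = representative_partition(runs)
print(best, multi_community_nodes(runs, runs[best], threshold=0.4))
```

Because the ARI is invariant to how clusters are named, replicates do not need aligned labels before they are compared. In the toy data, nodes 2 and 3 switch sides between replicates, so they are the ones reported as multi-community nodes.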

SpeakEasy can process networks with any type of link: weighted or unweighted, directed or undirected, with positive or negative weights, or any combination of these features. Its running time scales linearly with the number of edges. Biological networks with several thousand nodes and full weighted connectivity, or sparse networks with several hundred thousand nodes, can be processed on a typical laptop in a few minutes.

Network Clustering.

Applications

We have applied SpeakEasy to biological datasets from multiple physical scales, generated by different data acquisition techniques. These datasets range from the molecular level (physical protein networks and statistical gene interaction networks) to the cellular level (cell differentiation, Figure 1) to the organ level (brain fMRI). Together they cover many of the most common uses of community detection in biology. In these applications, SpeakEasy's output often includes both known communities and novel communities that are validated through ontology databases.