Proteins are central components of cell machinery and life.
Several researchers have pointed out that to fully understand
cell machinery, studying proteins in isolation is not enough; (clusters of)
interactions need to be delineated as well, since it is strongly
believed that proteins work with other proteins to regulate and
support each other for specific functions.
Recent advances in technology have enabled scientists to determine,
identify and validate pair-wise protein interactions through a
range of experimental and in-silico methods.
Such data can be naturally represented in the form of multiple
interaction networks.
The task of extracting relevant groupings or functional modules from
such interaction networks, for the purposes of understanding
the behavior of organisms, protein function prediction
and drug design, is challenging and an active area of research.
The challenges are daunting. First is the issue of data integration and data quality. Different experimental and in-silico methods can be used to compute interactions, each with its own strengths and weaknesses. Often, the overlap, in terms of common interactions across experimental settings, is not very high. An added complexity is that the data obtained from such methods is believed to be quite noisy: many interactions obtained even by a single methodology are conjectured to be false positives. Second, even if the network is assumed to be noise free, partitioning the network using classical graph partitioning or clustering schemes is inherently difficult. A common characteristic of Protein-Protein Interaction (PPI) networks is that a few nodes (hubs) have very large degrees, while most other nodes have very few interactions. Applying traditional clustering approaches typically results in a poor clustering arrangement, containing one or a few giant core clusters and several tiny clusters (possibly singleton clusters). Third, some proteins are believed to be multi-functional. Effective strategies for grouping these essential proteins are needed, which dictates the need to leverage or adapt soft clustering approaches.
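To make the degree skew concrete, here is a minimal sketch that counts interactions per protein in an edge list and flags hubs. The protein names, the toy network and the `min_degree` cutoff are all hypothetical, chosen only for illustration; real PPI data would be far larger and noisier.

```python
from collections import defaultdict

def degrees(edges):
    """Map each node to its number of distinct interaction partners."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # ignore self-interactions
        neighbors[u].add(v)
        neighbors[v].add(u)
    return {node: len(nbrs) for node, nbrs in neighbors.items()}

def hubs(edges, min_degree):
    """Return the nodes whose degree is at least min_degree (an assumed cutoff)."""
    deg = degrees(edges)
    return sorted(node for node, d in deg.items() if d >= min_degree)

# Toy network (hypothetical names): one hub "H", a few low-degree proteins.
toy_edges = [("H", "A"), ("H", "B"), ("H", "C"), ("H", "D"), ("H", "E"),
             ("A", "B"), ("F", "G")]
toy_degrees = degrees(toy_edges)
toy_hubs = hubs(toy_edges, min_degree=4)
```

In this toy network a single node carries most of the edges, which is the kind of structure that tends to pull a traditional clustering into one giant cluster around the hub.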
In this talk, we make the case for an ensemble clustering framework to address these problems. For base clustering, we introduce two topology-based distance metrics to counteract the effects of noise. We develop a PCA-based consensus clustering technique, designed to reduce the dimensionality of the consensus problem and yield informative clusters. We also develop a soft consensus clustering variant to assign multifaceted proteins to multiple functional groups. We conduct an empirical evaluation of different consensus techniques using topology-based, information-theoretic and domain-specific validation metrics, and show that our approaches can provide significant benefits over other state-of-the-art approaches. Our analysis of the consensus clusters obtained demonstrates that ensemble clustering can (a) produce improved, biologically significant functional groupings; and (b) facilitate soft clustering by discovering multiple functional associations for proteins.
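The general idea of consensus clustering can be sketched as follows. This is not the PCA-based technique described in the talk; it substitutes a simpler, well-known co-association (evidence accumulation) scheme, and the function names, the `threshold` parameter and the toy labels are all assumptions made for illustration. Each base clustering votes on whether two proteins belong together, pairs that are co-clustered in a majority of runs are merged, and the averaged votes give a soft association of a protein with each consensus group.

```python
from itertools import combinations

def co_association(clusterings, nodes):
    """Fraction of base clusterings that place each pair of nodes together."""
    counts = {pair: 0 for pair in combinations(sorted(nodes), 2)}
    for labels in clusterings:
        for a, b in counts:
            if labels[a] == labels[b]:
                counts[(a, b)] += 1
    return {pair: c / len(clusterings) for pair, c in counts.items()}

def consensus_clusters(co, nodes, threshold=0.5):
    """Hard consensus: connected components over pairs with co-association > threshold."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for (a, b), weight in co.items():
        if weight > threshold:
            parent[find(a)] = find(b)
    groups = {}
    for v in nodes:
        groups.setdefault(find(v), []).append(v)
    return sorted(sorted(g) for g in groups.values())

def soft_membership(node, clusters, co):
    """Mean co-association of node with the other members of each cluster."""
    scores = []
    for cluster in clusters:
        others = [m for m in cluster if m != node]
        if not others:
            scores.append(0.0)
            continue
        total = sum(co[tuple(sorted((node, m)))] for m in others)
        scores.append(total / len(others))
    return scores

# Three hypothetical base clusterings of five proteins; "p3" switches groups.
nodes = ["p1", "p2", "p3", "p4", "p5"]
base = [
    {"p1": 0, "p2": 0, "p3": 0, "p4": 1, "p5": 1},
    {"p1": 0, "p2": 0, "p3": 1, "p4": 1, "p5": 1},
    {"p1": 0, "p2": 0, "p3": 1, "p4": 1, "p5": 1},
]
co = co_association(base, nodes)
clusters = consensus_clusters(co, nodes)
p3_scores = soft_membership("p3", clusters, co)
```

Here "p3" lands in the second consensus group under the hard assignment but keeps a nonzero score for the first group, which is the kind of multiple association that a soft consensus variant is meant to surface.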
Srinivasan Parthasarathy is an Associate Professor in the Computer Science and Engineering Department at the Ohio State University (OSU). He heads the data mining research laboratory and has a joint appointment in the Department of Biomedical Informatics at OSU. He is a recipient of an NSF CAREER Award, a DOE Early Career Award, an Ameritech Faculty Fellowship and an IBM Faculty Award. His papers have received several best paper awards from leading conferences in the field, including the IEEE Conference on Data Mining (ICDM), the SIAM Conference on Data Mining (SDM), the Very Large Databases Conference (VLDB) and the ACM Conference on Knowledge Discovery and Data Mining (SIGKDD). He is a member of the ACM and the IEEE and serves on the editorial boards of IEEE Intelligent Systems, Data Mining and Knowledge Discovery: An International Journal, and the International Journal of Data Mining and Bioinformatics. He also served as one of the program chairs of SIAM Data Mining in 2007.