Research Interests:

My main research interests are in bioinformatics and computational biology, focusing on machine learning, big data, genomics, and single-cell analytics. The following are some specific topics:

    Deep Learning for Predicting Protein Subcellular Localization

    Spatial partitioning and localization of biological functions is a phenomenon fundamental to life. At the cellular level, proteins function at specific times and locations. These subcellular locations provide the specific chemical environment and context that a protein needs to fulfill its function. Thus, knowledge of the spatial distribution of proteins at the subcellular level is essential for understanding protein function, interactions, and cellular mechanisms. We have been developing algorithms for predicting protein subcellular localization from fluorescence microscopy images. We participated in the CYTO 2017 Image Analysis Challenge, which focused on this topic, and achieved top prediction accuracy. Click here to view example images from the data and more information about the challenge.

    Dynamic Systems Modeling for Experimental Design and Model Reduction

    Mathematical modeling is an important tool for understanding complex biological processes. Typically, mathematical models of biological systems are highly complex, with a large number of unknown parameters, whereas the amount of experimental data is almost always limited and insufficient to constrain those parameters. Because of this information gap between model complexity and data, parameter estimation and analysis are ill-posed and very challenging problems. Two intuitive strategies for closing the gap are experimental design (obtaining more data) and model reduction (simplifying the model). We are working on a unified computational framework and geometric interpretation for both problems. We consider a mathematical model as a manifold living in a high-dimensional data space, and explore the projections and singularities of the manifold to perform experimental design and model reduction.
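
    The information gap described above can be illustrated with a small numerical sketch. It is not the framework under development; the two-exponential toy model, the parameter values, and the measurement times are all assumptions chosen for illustration. The singular values of the sensitivity matrix (predictions differentiated with respect to parameters) indicate how well the data constrain each parameter direction: a very small singular value marks a direction the data barely constrain.

```python
import numpy as np

def model(theta, t):
    """Toy model (assumed): sum of two exponential decays with rates theta[0], theta[1]."""
    return np.exp(-theta[0] * t) + np.exp(-theta[1] * t)

def jacobian(theta, t, eps=1e-6):
    """Central finite-difference sensitivities of the predictions w.r.t. each parameter."""
    theta = np.asarray(theta, dtype=float)
    cols = []
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = eps
        cols.append((model(theta + step, t) - model(theta - step, t)) / (2 * eps))
    return np.column_stack(cols)

theta = np.array([1.0, 1.2])                      # two similar decay rates
t_sparse = np.array([0.5, 1.0])                   # limited data: two time points
t_dense = np.concatenate([t_sparse, [2.0, 3.0, 4.0, 6.0]])  # same points plus four more

# Singular values, sorted from largest to smallest.
sv_sparse = np.linalg.svd(jacobian(theta, t_sparse), compute_uv=False)
sv_dense = np.linalg.svd(jacobian(theta, t_dense), compute_uv=False)

print("sparse design:", sv_sparse)
print("dense design: ", sv_dense)
```

    In this toy setting, adding measurement times raises the smallest singular value, which loosely corresponds to experimental design. If it stayed near zero, merging the two decay rates into one would be a natural model reduction.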

    SPADE Algorithm for Flow Cytometry and CyTOF Analysis

    Flow cytometry and the next-generation mass cytometry technologies capture the heterogeneity of biological systems by providing multiparametric measurements of single cells. Even as cytometry technology is rapidly advancing, methods for analyzing this complex data lag behind. Traditional flow cytometry analysis is often a subjective and labor-intensive process that requires the user to have a deep understanding of the cellular phenotypes underlying the data. Furthermore, the advent of mass cytometry is quickly increasing the dimensionality of the data, making the traditional analysis approaches a critical bottleneck. We developed a novel analytical approach, Spanning-tree Progression Analysis of Density-normalized Events (SPADE), to objectively analyze single-cell data in a robust and unsupervised manner. Briefly, SPADE views a single-cell cytometric dataset as a high-dimensional point cloud of cells, and uses topological methods to reveal the geometry of the cloud. Based on preliminary data, this geometry reveals distinct subpopulations of cells and a likely cellular hierarchy underlying the data. (Click here to download the SPADE software).
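
    The two ideas named in the acronym, density normalization and a spanning tree over the point cloud, can be sketched on synthetic data. This is a minimal illustration and not the SPADE software: the neighbor-count density estimate, the `radius` and `target` parameters, the helper names, and the two simulated populations are all assumptions, and the sketch omits every other step of the method.

```python
import numpy as np

def local_density(cells, radius):
    """Number of cells within `radius` of each cell (the cell itself included)."""
    dist = np.linalg.norm(cells[:, None, :] - cells[None, :, :], axis=2)
    return (dist <= radius).sum(axis=1)

def density_normalized_mask(cells, radius, target, rng):
    """Keep each cell with probability min(1, target / local density)."""
    keep_prob = np.minimum(1.0, target / local_density(cells, radius))
    return rng.random(len(cells)) < keep_prob

def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j) edges."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best_dist = dist[0].copy()          # cheapest known link from the tree to each node
    best_from = np.zeros(n, dtype=int)  # tree node that provides that link
    edges = []
    for _ in range(n - 1):
        j = int(np.argmin(np.where(in_tree, np.inf, best_dist)))
        edges.append((int(best_from[j]), j))
        in_tree[j] = True
        closer = dist[j] < best_dist
        best_dist[closer] = dist[j][closer]
        best_from[closer] = j
    return edges

# Synthetic data (assumed): one abundant and one rare population in two dimensions.
rng = np.random.default_rng(0)
abundant = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
rare = rng.normal(loc=5.0, scale=1.0, size=(50, 2))
cells = np.vstack([abundant, rare])
is_rare = np.arange(len(cells)) >= len(abundant)

keep = density_normalized_mask(cells, radius=0.5, target=20, rng=rng)
sampled = cells[keep]
edges = minimum_spanning_tree(sampled)

rare_before = is_rare.mean()       # share of rare cells in the raw data
rare_after = is_rare[keep].mean()  # share of rare cells after downsampling
print(rare_before, rare_after, len(edges))
```

    Thinning dense regions more aggressively than sparse ones raises the share of rare cells in the sample, so the rare population is less likely to be swamped when the tree is built.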

    Discovering Biological Progression underlying Gene Expression Data

    The majority of microarray data analysis methods in the literature focus on identifying differences between sample groups (normal vs. cancer, treated vs. control), for example through unsupervised clustering, supervised classification, and various forms of statistical tests. These methods essentially ask the same question: what is the difference between group A and group B? The differences among samples within the same group are ignored. To explore this information, we developed a new computational method, termed Sample Progression Discovery (SPD). SPD aims to identify an underlying progression among individual samples, both within and across sample groups. We view SPD as a hypothesis generation tool when applied to datasets where the progression is unclear. For example, when applied to a microarray dataset of cancer samples, SPD assumes that the cancer samples collected from individual patients represent different stages of an intrinsic progression underlying cancer development. The inferred relationship among the samples may therefore indicate a trajectory or hierarchy of cancer progression, which serves as a hypothesis to be tested. (Click here to download the SPD software).

    Simultaneous Classification and Class Discovery

    Classification methods are commonly divided into two categories: unsupervised and supervised. Because unsupervised methods do not use class label information, they can discover new classes, but they carry the risk of producing non-interpretable results. Supervised methods, on the other hand, will always find a decision rule that interprets the different classes. However, the class label information plays such an important role in supervised methods that it fixes the number of possible classes. Consequently, supervised methods cannot discover new classes. These limitations motivated us to propose a semi-supervised classification method, which gives the class label information a less dominant role so that class discovery and classification can be performed simultaneously. (Click here to download code).

    Information Theoretic Approaches for Reconstructing Gene Regulatory Networks

    Information theoretic approaches are increasingly being used for reconstructing gene regulatory networks from gene expression microarray data. Most information theoretic approaches start by computing the mutual information between all possible pairs of genes, resulting in a mutual information matrix, which is then manipulated to identify regulatory relationships. Computing the mutual information matrix is quite time-consuming. For an example dataset consisting of 336 samples and 9563 genes, the state-of-the-art algorithm, ARACNE, takes about 142 hours to compute the mutual information matrix. We present two independent methods to reduce the computation time: one is based on spectral graph theory, and the other reorders the calculations. The two methods reduce the computation time by 84% and 98%, respectively. (Click here to download code).
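
    To make the cost concrete, the pairwise computation described above can be sketched as follows. This is a naive baseline for illustration only; it implements neither of the two speed-up methods. The histogram estimator, the `bins` parameter, and the synthetic expression data are assumptions, and the estimator may differ from the one ARACNE uses. With 9563 genes, the double loop would visit about 45.7 million gene pairs.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram (plug-in) estimate of the mutual information between x and y, in nats."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)  # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)  # marginal of y, shape (1, bins)
    nonzero = pxy > 0
    return float(np.sum(pxy[nonzero] * np.log(pxy[nonzero] / (px @ py)[nonzero])))

def mutual_information_matrix(expr, bins=8):
    """expr is a genes x samples array; returns a symmetric genes x genes matrix."""
    n_genes = expr.shape[0]
    mi = np.zeros((n_genes, n_genes))
    for i in range(n_genes):
        for j in range(i, n_genes):
            mi[i, j] = mi[j, i] = mutual_information(expr[i], expr[j], bins)
    return mi

# Synthetic data (assumed): 5 genes, 336 samples, with gene 1 closely tracking gene 0.
rng = np.random.default_rng(0)
expr = rng.normal(size=(5, 336))
expr[1] = expr[0] + 0.1 * rng.normal(size=336)

mi = mutual_information_matrix(expr)
print(np.round(mi, 2))
```

    The entry for the dependent pair (genes 0 and 1) should stand out against the entries for independent pairs, which is the signal that network reconstruction methods build on.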