Research Interests:

My main research interests are in bioinformatics and computational biology, focusing on statistical signal processing, machine learning, control systems and optimization. The following are some specific topics that I've been working on.

  1. Extracting the cellular hierarchy underlying high-dimensional single-cell data

    Flow cytometry and the next-generation mass cytometry technologies capture the heterogeneity of biological systems by providing multiparametric measurements of single cells. Even as cytometry technology is rapidly advancing, methods for analyzing this complex data lag behind. Traditional flow cytometry analysis is often a subjective and labor-intensive process that requires the user to have a deep understanding of the cellular phenotypes underlying the data. Furthermore, the advent of mass cytometry is quickly increasing the dimensionality of the data, making the traditional analysis approaches a critical bottleneck. We developed a novel analytical approach, Spanning-tree Progression Analysis of Density-normalized Events (SPADE), to objectively analyze single-cell data in a robust and unsupervised manner. Briefly, SPADE views a single-cell cytometric dataset as a high-dimensional point cloud of cells, and uses topological methods to reveal the geometry of the cloud. Based on preliminary data, this geometry reveals distinct subpopulations of cells and a likely cellular hierarchy underlying the data.
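
    As a rough illustration of the spanning-tree idea only, the sketch below connects a handful of cluster centroids with a minimum spanning tree using Prim's algorithm. It is not the SPADE implementation: the function names are hypothetical, the centroids in the usage note are made-up toy values, and the sketch assumes the cells have already been grouped into clusters by some earlier step.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph over `points`.

    Returns a list of (i, j) index pairs, one per tree edge, where i is
    the node already in the tree and j is the node being attached.
    """
    n = len(points)
    if n == 0:
        return []
    # best[j] = (distance from j to its nearest tree node, that tree node).
    # The tree starts with node 0 only.
    best = {j: (euclidean(points[0], points[j]), 0) for j in range(1, n)}
    edges = []
    while best:
        # Attach the outside node that is closest to the current tree.
        j = min(best, key=lambda k: best[k][0])
        _, i = best.pop(j)
        edges.append((i, j))
        # The newly attached node may be a closer neighbor for the rest.
        for k in best:
            d = euclidean(points[j], points[k])
            if d < best[k][0]:
                best[k] = (d, j)
    return edges
```

    For example, calling minimum_spanning_tree([(0, 0), (1, 0), (2, 0), (2, 1)]) on these toy two-dimensional centroids links them in a single chain. With real cytometry data each centroid would have one coordinate per measured marker.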

  2. Discovering biological progression underlying gene expression data

    The majority of microarray data analysis methods in the literature focus on identifying differences between sample groups (normal vs. cancer, treated vs. control), using unsupervised clustering, supervised classification, and various forms of statistical tests. These methods are essentially asking the same question: what is the difference between group A and group B? The differences among samples within the same group have been ignored. To explore this information, we developed a new computational method, termed Sample Progression Discovery (SPD). SPD aims to identify an underlying progression among individual samples, both within and across sample groups. We view SPD as a hypothesis generation tool when applied to datasets where the progression is unclear. For example, when applied to a microarray dataset of cancer samples, SPD assumes that the cancer samples collected from individual patients represent different stages of an intrinsic progression underlying cancer development. The inferred relationship among the samples may therefore indicate a trajectory or hierarchy of cancer progression, which serves as a hypothesis to be tested.

  3. Simultaneous classification and class discovery

    Classification methods are commonly divided into two categories: unsupervised and supervised. Because unsupervised methods do not use class label information, they are able to discover new classes. However, they carry the risk of producing non-interpretable results. Supervised methods, on the other hand, will always find a decision rule that interprets the different classes. However, the class label information plays such a dominant role in supervised methods that it fixes the number of possible classes. Consequently, supervised methods cannot discover new classes. These limitations motivated us to propose a semi-supervised classification method, which gives the class label information a less dominant role so that class discovery and classification can be performed simultaneously.

  4. Information-theoretic approaches for reconstructing gene regulatory networks

    Information-theoretic approaches are increasingly being used for reconstructing gene regulatory networks from gene expression microarray data. Most information-theoretic approaches start by computing the mutual information between all possible pairs of genes, resulting in a mutual information matrix, which is then manipulated to identify regulatory relationships. Computing the mutual information matrix is quite time-consuming. For an example dataset consisting of 336 samples and 9563 genes, the state-of-the-art algorithm, ARACNE, takes about 142 hours to compute the mutual information matrix. We present two independent methods to reduce the computation time: one is based on spectral graph theory, and the other reformulates the order of the calculations. The two methods reduce the computation time by 84% and 98%, respectively.
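
    To make the cost concrete, the sketch below computes a pairwise mutual information matrix in the most direct way. It is only an illustration: the function names are hypothetical, and the equal-width binning estimator is a simple stand-in chosen for brevity, not the estimator used by ARACNE and not either of our two speed-up methods. The double loop visits every gene pair once, so 9563 genes give 9563 × 9562 / 2 = 45,720,703 pairs.

```python
import math
from collections import Counter


def discretize(values, n_bins=3):
    """Equal-width binning of one gene's expression values into bin indices."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant profile falls into a single bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(x, y):
    """Plug-in mutual information estimate (in nats) of two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in pxy.items():
        # p(a,b) * log( p(a,b) / (p(a) * p(b)) ), written with raw counts.
        mi += (c / n) * math.log(c * n / (px[a] * py[b]))
    return mi


def mi_matrix(expr, n_bins=3):
    """Symmetric mutual information matrix.

    `expr` is a list of genes, each a list of expression values with one
    entry per sample (an assumed layout for this sketch).
    """
    binned = [discretize(gene, n_bins) for gene in expr]
    g = len(binned)
    m = [[0.0] * g for _ in range(g)]
    for i in range(g):
        for j in range(i, g):
            m[i][j] = m[j][i] = mutual_information(binned[i], binned[j])
    return m
```

    With this estimator, two genes whose binned profiles are identical receive a mutual information equal to the entropy of that profile, and a gene with a constant profile has zero mutual information with every other gene.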

  5. Non-invasive PET parametric imaging

    Positron emission tomography (PET) is a method for imaging neural receptors, which helps monitor biochemical processes. In a PET study, an arterial input function is usually required, which involves very painful and risky invasive measurements. To avoid the input function, one possible approach is to manually derive a reference region. In our study, we provide an alternative approach. Based on a novel idea called activity-subspace, we are able to estimate the input function without requiring any further measurements from PET experiments.