My main research interests are in bioinformatics and computational biology, focusing on statistical signal processing, machine learning, control systems and optimization. The following are some specific topics that I've been working on.
Flow cytometry and the next-generation mass
cytometry technologies capture the heterogeneity of biological systems by providing
multiparametric measurements of single cells. Even as cytometry technology is rapidly
advancing, methods for analyzing this complex data lag behind. Traditional flow cytometry
analysis is often a subjective and labor-intensive process that requires users’ deep understanding
of the cellular phenotypes underlying the data. Furthermore, the advent of mass cytometry
is quickly increasing the dimensionality of the data, making the traditional analysis approaches
a critical bottleneck. We developed a novel analytical approach, Spanning-tree Progression Analysis
of Density-normalized Events (SPADE), to objectively analyze single-cell data in a robust and
unsupervised manner. Briefly, SPADE views a single-cell cytometric dataset as a high-dimensional
point cloud of cells, and uses topological methods to reveal the geometry of the cloud. Based on
preliminary data, this geometry reveals distinct subpopulations of cells and a likely cellular hierarchy
underlying the data.
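
As a rough illustration of the spanning-tree idea in SPADE's name, the following Python sketch connects a few hypothetical cluster centroids with a minimum spanning tree built by Prim's algorithm. The function names, the Euclidean distance, and the toy marker values are assumptions made for this example. It is not the SPADE implementation, and it leaves out the density normalization step entirely.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph over `points`.

    Returns a list of (i, j) index pairs, one per tree edge, where node i
    was already in the tree when node j was attached to it.
    """
    n = len(points)
    if n == 0:
        return []
    # best[j] = (distance from j to the current tree, tree node achieving it)
    best = {j: (euclidean(points[0], points[j]), 0) for j in range(1, n)}
    edges = []
    while best:
        # Attach the node that is currently closest to the tree.
        j = min(best, key=lambda k: best[k][0])
        _, i = best.pop(j)
        edges.append((i, j))
        # The newly attached node may offer a shorter link to remaining nodes.
        for k in best:
            d = euclidean(points[j], points[k])
            if d < best[k][0]:
                best[k] = (d, j)
    return edges


# Hypothetical centroids of four cell clusters in a two-marker space.
centroids = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
tree = minimum_spanning_tree(centroids)
print(tree)  # [(0, 1), (1, 2), (2, 3)]
```

In this toy case the tree is a simple chain, which is the kind of structure one might read as a progression from one cluster to the next. With real data the tree would branch, and the branches are what suggest a hierarchy.
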
The majority of microarray data analysis methods in the literature
focus on identifying differences between sample groups (normal vs. cancer, treated vs. control),
e.g. through unsupervised clustering, supervised classification and various forms of statistical tests. These
methods essentially ask the same question: what is the difference between group A and group B?
The differences among samples within the same group have been ignored. To explore this information,
we developed a new computational method, termed Sample Progression Discovery (SPD). SPD aims to
identify an underlying progression among individual samples, both within and across sample groups. We
view SPD as a hypothesis generation tool when applied to datasets where the progression is unclear. For
example, when applied to a microarray dataset of cancer samples, SPD assumes that the cancer samples
collected from individual patients represent different stages during an intrinsic progression underlying cancer
development. The inferred relationship among the samples may therefore indicate a trajectory or hierarchy of
cancer progression, which serves as a hypothesis to be tested.
Classification methods are commonly divided
into two categories: unsupervised versus supervised. Because the class label
information is not involved in unsupervised methods, they have the ability
to discover new classes. However, they carry the risk of producing
non-interpretable results. On the other hand, supervised methods will always
find a decision rule that interprets the different classes. However, in
supervised methods, the class label information plays such an important role
that it confines the supervised methods by defining the number of possible
classes. Consequently, supervised methods do not have the ability to
discover new classes. The limitations of unsupervised and supervised methods
motivated us to propose a semi-supervised classification method, which
assigns the class label information a less dominant role so as to
perform class discovery and classification simultaneously.
Information theoretic approaches are
increasingly being used for reconstructing gene regulatory networks from
gene expression microarray data. Most information theoretic approaches start
by computing the pairwise mutual information between all possible pairs of
genes, resulting in a mutual information matrix, which is then manipulated
to identify regulatory relationships. Computing the mutual information
matrix is quite time-consuming. For an example set consisting of 336 samples
and 9563 genes, the state-of-the-art algorithm, ARACNE, takes about 142 hours to
compute the mutual information matrix. We present two independent methods to
reduce the computation time: one is based on spectral graph theory, and the
other is based on reformulating the order of calculations. The two methods reduce
the computation time by 84% and 98%, respectively.
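
To give a sense of the scale involved, 9563 genes yield 9563 × 9562 / 2 = 45,720,703 distinct gene pairs, each needing its own mutual information estimate. The Python sketch below builds a small mutual information matrix with a simple equal-width histogram (plug-in) estimator. The estimator, the bin count, and the function names are assumptions chosen for illustration. This is neither ARACNE's estimator nor either of the two speed-up methods described above.

```python
import math
from collections import Counter


def discretize(values, n_bins=4):
    """Map continuous expression values to equal-width bin indices."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # a constant profile falls into one bin
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(x_bins, y_bins):
    """Plug-in estimate of I(X;Y), in nats, from two discretized profiles."""
    n = len(x_bins)
    px = Counter(x_bins)
    py = Counter(y_bins)
    pxy = Counter(zip(x_bins, y_bins))
    mi = 0.0
    for (a, b), c in pxy.items():
        # p(a,b) * log( p(a,b) / (p(a) * p(b)) ), written in terms of counts
        mi += (c / n) * math.log(c * n / (px[a] * py[b]))
    return mi


def mi_matrix(expression, n_bins=4):
    """Pairwise MI matrix; `expression` is a list of per-gene value lists."""
    binned = [discretize(gene, n_bins) for gene in expression]
    n_genes = len(binned)
    m = [[0.0] * n_genes for _ in range(n_genes)]
    for i in range(n_genes):
        for j in range(i, n_genes):  # MI is symmetric: fill both halves
            m[i][j] = m[j][i] = mutual_information(binned[i], binned[j])
    return m


# Toy data: three hypothetical genes measured in eight samples.
expression = [
    [0, 1, 2, 3, 0, 1, 2, 3],  # gene 0
    [0, 2, 4, 6, 0, 2, 4, 6],  # gene 1: a rescaled copy of gene 0
    [5, 5, 5, 5, 5, 5, 5, 5],  # gene 2: constant, carries no information
]
m = mi_matrix(expression)
```

The double loop makes the quadratic growth in the number of genes explicit, which is why the matrix computation dominates the running time on genome-scale data.
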
Positron emission tomography (PET) is a method for imaging neural receptors, which helps monitor biochemical processes. A PET study usually requires an arterial input function, which involves painful and risky invasive measurements. To avoid measuring the input function, one possible approach is to manually derive a reference region. In our study, we provide an alternative approach. Based on a novel idea called activity-subspace, we are able to estimate the input function without requiring any additional measurements from the PET experiments.