Sample Progression Discovery (SPD) version 1
Peng Qiu (1), Andrew Gentles (2), Sylvia Plevritis (2)
(1) Department of Bioinformatics and Computational Biology, University
of Texas MD Anderson Cancer Center
(2) Department of Radiology, Stanford University
We present a novel computational approach, Sample
Progression Discovery (SPD), to discover patterns of biological progression
underlying a microarray dataset. In contrast to the majority of microarray data
analysis methods, which focus on identifying differences between sample groups
(e.g. normal vs. cancer, treated vs. control), SPD aims to identify an
underlying progression among individual samples, both within and across sample
groups. We applied SPD to gene expression data of cell cycle, B-cell
differentiation, and mouse embryonic stem cell differentiation, where the
underlying progression was known but hidden from SPD. We show that SPD correctly
identified the progression among samples and the gene modules that are
associated with the progression. We view SPD as a hypothesis-generation tool
when applied to datasets where the progression is unclear. For example, when
applied to a microarray dataset of cancer samples, SPD would assume that the
cancer samples collected from individual patients represent different stages
during an intrinsic progression underlying cancer development. The inferred
relationship among the samples may therefore indicate a pathway or hierarchy of
cancer progression, which serves as a hypothesis to be tested.
The algorithm is implemented in Matlab 7 with a graphical user interface (GUI)
on top, designed and written by Peng Qiu.
This work is described in: Peng Qiu, Andrew Gentles and Sylvia Plevritis, "Discovering Biological Progression underlying Microarray Samples", PLoS Computational Biology, vol. 7, issue 4, e1001123, 2011.
This software is new and still under development. Updates and new versions will be available here.
If you have any questions or find any problems, please email me at peng.qiu@bme.gatech.edu
[Screenshot of the SPD software]
Installation instructions:
This package requires Matlab 7. To give users maximum freedom to modify the
software, the raw .m files are provided.
(1) Download the zip package: SPD.zip
(last updated on April 28, 2011)
(2) Unzip it to your local machine.
(3) Open Matlab 7 and change the directory to where the package is unzipped, or add that directory to the Matlab path.
(4) Type "progression_GUI" and press Enter; the GUI will appear.
We also provide several example data and result files, which are available at
SPD_example_data_files.zip
License conditions:
The SPD software is free for academic use. A patent for SPD has been applied for on behalf of Stanford University. For license conditions, please contact the Office of Technology Licensing at Stanford (Kirsten Leute, kirsten.leute@stanford.edu).
User Manual:
1. Prepare input file:
The input file must contain at least 3 variables:
probe_names : n * 1 cell array; each element is the name of one gene/feature
exp_names : 1 * m cell array; each element is the name of one sample/array
data : n * m expression data matrix
Two optional variables can be used to input clinical information or other sample
annotations:
color_code_names : k * 1 cell array
color_code_vectors : k * m matrix that stores clinical info
For example:
color_code_names = [ {'Gender'};
                     {'Tumor Grade'};
                   ];
color_code_vectors = [ 0 0 0 0 0 0 1 1 1 1 1 0 0 1 1;
                       1 2 3 2 2 2 1 1 3 3 3 2 1 1 1;
                     ];
An input file can contain other variables, but the software ignores them.
See examples at
SPD_example_data_files.zip
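As a minimal sketch of the layout described above, the following Matlab snippet builds a toy dataset and saves it to disk. It assumes the input is an ordinary .mat file; the file name, the gene and sample names, and the random values are made up for illustration and are not taken from the example files.

```matlab
% Toy dimensions (hypothetical): n genes, m samples
n = 5;
m = 4;

% n x 1 cell array of gene/feature names
probe_names = cell(n, 1);
for i = 1:n
    probe_names{i} = sprintf('gene_%d', i);
end

% 1 x m cell array of sample/array names
exp_names = cell(1, m);
for j = 1:m
    exp_names{j} = sprintf('sample_%d', j);
end

% n x m expression matrix (random values stand in for real data)
data = rand(n, m);

% Optional annotations: k x 1 names and k x m values (here k = 2)
color_code_names = {'Gender'; 'Tumor Grade'};
color_code_vectors = [0 0 1 1;
                      1 2 3 2];

% Assumption: the GUI reads these variables from a .mat file
save('my_spd_input.mat', 'probe_names', 'exp_names', 'data', ...
     'color_code_names', 'color_code_vectors');
```

Each row of color_code_vectors must have one value per sample, in the same order as exp_names.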
2. Load data
Users can load (1) a raw input data file defined above, or (2) a result file
generated by the software.
3. Filter genes
Filter genes according to a user-defined threshold on the standard deviation (STD)
and on the number of acceptable nulls per gene. Genes with an STD below the
threshold, or with more null entries than allowed, are excluded. A user can also
choose to exclude "_x_at" probesets.
The software shows some basic information after the filtering: number of genes
left, number of samples, number of null entries.
After filtering genes, null entries are imputed / filled using KNN imputation,
so that the subsequent steps do not need to handle null entries.
If more sophisticated filters are needed, users can apply those filters when
preparing the input file, and then ignore the gene filtering functions in the
GUI.
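For users who prefer to filter while preparing the input file, the two criteria above can be sketched roughly as follows. This is not the code used by the GUI: it assumes null entries are stored as NaN, and the threshold values, variable names, and the handling of values exactly at the threshold are illustrative choices. The KNN imputation step is left out.

```matlab
% Toy expression matrix (4 genes x 5 samples); NaN marks a null entry
data = [1.0 2.0 3.0 4.0 5.0;     % high variation, no nulls
        2.0 2.1 2.0 2.1 2.0;     % low variation
        1.0 NaN NaN NaN 5.0;     % too many nulls
        0.0 3.0 NaN 6.0 9.0];    % high variation, one null

std_threshold = 0.5;   % hypothetical minimum standard deviation
max_nulls = 1;         % hypothetical maximum number of nulls per gene

n = size(data, 1);
gene_std = zeros(n, 1);
null_count = zeros(n, 1);
for i = 1:n
    row = data(i, :);
    observed = row(~isnan(row));       % ignore nulls when computing the STD
    null_count(i) = sum(isnan(row));
    gene_std(i) = std(observed);
end

% Keep genes that pass both criteria
keep = (gene_std >= std_threshold) & (null_count <= max_nulls);
filtered_data = data(keep, :);

fprintf('genes left: %d, samples: %d, null entries: %d\n', ...
        size(filtered_data, 1), size(filtered_data, 2), ...
        sum(isnan(filtered_data(:))));
```

If data is filtered this way, probe_names must be subset with the same keep mask so that names and rows stay aligned.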
4. Clustering genes (Step 1 in SPD)
This software has two clustering algorithms, "Divisive" and "Agglomerative". I suggest always using "Agglomerative".
(1) Button "Divisive" launches the top-down consensus k-means clustering method
that is described in the manuscript. Users can define the number of k-means
iterations in each consensus k-means partitioning during the top-down process.
Default is 200. Users can also define the stopping criterion, which is the
desired module coherence. The default is 0.7, but this could be
dataset-dependent. The button "corr hist" can be used to view a histogram of all
the pairwise correlations (after genes are filtered). If the histogram has a
heavy tail, meaning many gene pairs share high correlation, a higher coherence
parameter might be more appropriate.
(2) Button "Agglomerative" launches a bottom-up (agglomerative) clustering
algorithm. This algorithm needs only one parameter, the module coherence.
The parameter "Minimum module size" can be used to exclude small clusters.
After the clustering algorithm finishes, the two buttons
"view modules Expr" and "Model quality" are enabled. You can use these two
buttons to visualize the clustering results.
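To get a feel for what the "corr hist" button shows, the distribution of pairwise gene correlations can be summarized outside the GUI. The snippet below is only a sketch on a toy matrix; it does not reproduce how SPD computes module coherence, and the variable names are illustrative.

```matlab
% Toy filtered data (4 genes x 6 samples)
data = [1 2 3 4  5  6;
        2 4 6 8 10 12;     % perfectly correlated with gene 1
        6 5 4 3  2  1;     % perfectly anti-correlated with gene 1
        1 3 2 5  4  6];

% corrcoef treats columns as variables, so transpose to correlate genes
C = corrcoef(data');

% Keep each gene pair once (strict upper triangle)
n = size(C, 1);
pair_mask = triu(true(n), 1);
pair_corr = C(pair_mask);

% Fraction of gene pairs at or above the default coherence value
coherence = 0.7;
frac_high = mean(pair_corr >= coherence);
fprintf('%d gene pairs, fraction with correlation >= %.1f: %.2f\n', ...
        numel(pair_corr), coherence, frac_high);

% Histogram counts over 20 bins; plot with bar(centers, counts) if desired
[counts, centers] = hist(pair_corr, 20);
```

A large fraction of pairs above the chosen value corresponds to the heavy tail mentioned above.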
5. Construct MSTs and compare modules and MSTs (step 2 and step 3 in SPD)
The "GO" button does the following:
(1) Constructs one MST (minimum spanning tree) based on each gene module.
(2) Computes the p-values of the statistical concordance between all the modules
and all the trees. Random permutation is performed to get the p-values. Users
can define the number of random permutations before clicking the GO button.
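The tree-construction part of this step can be illustrated with a small sketch. It builds a minimum spanning tree over samples with Prim's algorithm, assuming Euclidean distance between samples computed on the genes of one module; the distance metric, the toy data, and the variable names are assumptions, and the permutation test is not shown.

```matlab
% Toy module: 3 genes x 5 samples (samples are columns)
module_data = [0 1 2 3 10;
               0 1 2 3 10;
               0 0 0 0  0];

m = size(module_data, 2);

% Pairwise Euclidean distances between samples (assumed metric)
D = zeros(m);
for a = 1:m
    for b = 1:m
        D(a, b) = norm(module_data(:, a) - module_data(:, b));
    end
end

% Prim's algorithm: grow the tree starting from sample 1
in_tree = false(1, m);
in_tree(1) = true;
edges = zeros(m - 1, 2);
for e = 1:(m - 1)
    best = inf;
    % Find the shortest edge joining a tree sample to a non-tree sample
    for a = find(in_tree)
        for b = find(~in_tree)
            if D(a, b) < best
                best = D(a, b);
                edges(e, :) = [a b];
            end
        end
    end
    in_tree(edges(e, 2)) = true;
end

disp(edges);   % each row is one tree edge: [sample_a sample_b]
```

In this toy module the samples lie along a line, so the tree is a simple chain from sample 1 to sample 5.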
6. Identify modules similar in terms of progression (step 4 in SPD)
To generate a progression-similarity matrix between gene modules, users need to
define a p-value threshold that determines whether the fit between a module and
a tree is significant.
Then, click the button "Show gene module adjacency". A figure will pop up that
shows the similarity between modules in terms of progression.
Visually identify high-value blocks along the diagonal. There may be more than
one block along the diagonal.
Zoom in on this pop-up window to see which modules are associated with the
visually identified block(s).
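One way to picture the matrix in this figure is sketched below. It assumes that the similarity between two modules is the number of trees that both modules fit significantly; this definition, the p-value matrix, and the variable names are illustrative assumptions rather than the GUI's actual code.

```matlab
% Hypothetical p-value matrix: rows are modules, columns are trees
pvals = [0.001 0.002 0.500 0.800;
         0.003 0.001 0.600 0.700;
         0.900 0.400 0.001 0.002];

p_threshold = 0.01;   % user-defined significance threshold

% is_fit(i, j) is 1 when module i fits tree j significantly
is_fit = double(pvals < p_threshold);

% similarity(i, k) counts the trees that modules i and k both fit
similarity = is_fit * is_fit';

disp(similarity);
```

Here modules 1 and 2 form one high-value block along the diagonal and module 3 forms another, so they would be added as two separate progression patterns.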
Manually input the identified modules (a space-separated string that contains the IDs of the modules in the identified block) and click "add". After clicking "add", one
entry will be shown in the list-box. This entry is one progression pattern
defined by the modules that the user just added. (Multiple progression
patterns can be added, each corresponding to one block that the user visually
identified.)
To view the progression pattern, click "view progression". Another window will
pop up. If the number of samples is smaller than 50, this step takes about 30 seconds.
If the number of samples is large, this step may take a long time, because my tree-visualization algorithm is slow.
In the pop-up window:
(1) Nodes/samples can be dragged around.
(2) After moving the nodes, the new layout can be saved. The next time the
user views this progression pattern, the saved layout will be displayed instead
of running my slow tree-visualization algorithm. To save the new layout, you
need to click both the "Save Layout" button in the view_tree window and the
"Save Results" button in the main window.
(3) Nodes can be color-coded according to the modules that support this
progression pattern (blue means low value, red means high value).
(4) Nodes can also be color-coded by the clinical information from the input
data file: color_code_names, and color_code_vectors.
(5) Button "export gene list" writes two files in the current directory. Both
contain the same information, the list of genes in the modules that support the
progression.