Spectral Analysis for Class Discovery and Classification (SPACC) (version 2.0)
We propose SPACC, a classifier that performs both class discovery and
classification. The algorithm is implemented in Matlab 7, with a graphical
user interface (GUI) on top of it, designed and written by Peng Qiu.
Our motivation is as follows. In the literature, the existing classification
methods can be mainly divided into two categories, unsupervised and supervised
methods. In unsupervised methods, samples are grouped into clusters or tree
structures, where the class label information does not affect the clustering
process. Since the class labels are not used, unsupervised methods have the
potential to discover subclasses beyond the known class label
information. However, there is no systematic way to interpret the agreement
and disagreement between unsupervised clusters and known class labels. On the
other hand, for supervised methods, the aim is to find the boundary that best
separates different classes, where the class labels play an important role.
Since the total number of classes is defined by the class labels, supervised
classifiers are not able to pick up possible data substructures within each
known class. Another issue with supervised methods is robustness: the
adverse effect of outliers and mislabeled samples needs to be carefully
handled. The limitations of unsupervised and supervised methods motivated us to
propose a novel classification method that gives the class label
information a less prominent role, so as to perform class discovery and
classification simultaneously.
A manuscript of this work has been published:
Peng Qiu, and Sylvia K. Plevritis, "Simultaneous Class Discovery and Classification of Microarray Data using Spectral Analysis", Journal of Computational Biology, 16(7):935-944, 2009.
Installation instructions
This package requires Matlab 7. To give users maximum freedom to
modify the software, the raw .m files are provided.
The input file is a .mat file that contains the following variables:
training_samples       : each column is one sample, each row is one feature
training_labels        : class label of each training sample
training_samples_names : cell array of the names of each training sample (optional)
testing_samples        : each column is one sample, each row is one feature
testing_labels         : vector, class label of each testing sample
testing_samples_names  : cell array of the names of each testing sample (optional)
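The layout described above (features in rows, samples in columns, one label per column) can be checked before the .mat file is saved. The following is a minimal sketch in Python and is not part of the SPACC package, which is written in Matlab. The function name check_spacc_input and the use of nested lists are assumptions made for illustration only.

```python
def check_spacc_input(samples, labels, names=None):
    """Check one sample set laid out as described above.

    samples: list of rows, one row per feature, one column per sample.
    labels:  one class label per sample (per column).
    names:   optional list with one name per sample.
    Returns (n_features, n_samples) or raises ValueError.
    """
    if not samples or not samples[0]:
        raise ValueError("samples must contain at least one feature and one sample")
    n_features = len(samples)
    n_samples = len(samples[0])
    # Every feature row must hold exactly one value per sample.
    for i, row in enumerate(samples):
        if len(row) != n_samples:
            raise ValueError(f"feature row {i} has {len(row)} values, expected {n_samples}")
    if len(labels) != n_samples:
        raise ValueError("need exactly one class label per sample (column)")
    if names is not None and len(names) != n_samples:
        raise ValueError("need exactly one name per sample (column)")
    return n_features, n_samples


# Hypothetical toy data: 3 features (rows) x 4 samples (columns).
training_samples = [
    [0.1, 0.2, 0.9, 1.0],
    [1.5, 1.4, 0.3, 0.2],
    [0.7, 0.8, 0.6, 0.5],
]
training_labels = [1, 1, 2, 2]
shape = check_spacc_input(training_samples, training_labels)
print(shape)
```

The same check would apply to the testing variables, since they follow the same layout.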
There are a few example input files included in the zip package.
Note: (1) This software does not perform feature selection. The
user needs to select relevant features when preparing the input
file. (2) In principle, the algorithm can handle any number
of features, as long as they are relevant. (3) If the numerical
range of the input data is too large, errors may occur. These errors can be
avoided by normalization (reducing the numerical range of the input data).
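Note (3) does not prescribe a particular normalization. One common choice, shown here as an assumption and not as the package's own procedure, is to standardize each feature (row) to zero mean and unit standard deviation. The function name standardize_rows is hypothetical, and the sketch is written in Python for illustration.

```python
import statistics


def standardize_rows(samples):
    """Scale each feature (row) to zero mean and unit standard deviation.

    samples: list of rows, one row per feature, one column per sample.
    Constant rows are mapped to zeros to avoid dividing by zero.
    """
    scaled = []
    for row in samples:
        mean = statistics.fmean(row)
        std = statistics.pstdev(row)
        if std == 0:
            scaled.append([0.0 for _ in row])
        else:
            scaled.append([(x - mean) / std for x in row])
    return scaled


# Hypothetical data with very different numerical ranges per feature.
raw = [
    [1000.0, 2000.0, 3000.0],
    [0.001, 0.002, 0.003],
]
normalized = standardize_rows(raw)
```

After this step both features span the same range, regardless of their original units.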
(2) Browse and Load input file
In step 1 (top-left panel), if the input file is successfully loaded, the
number of samples in each class for the training set and testing set will be
displayed.
Each training and testing sample will be assigned an index that is defined
by the software. This information is shown in the message box at the bottom.
(3) Display training data
In step 2 (top-right panel), click the "show training samples" button. The
training samples will be displayed in the left plot. Each training sample is
one marker in the plot, and its class label is reflected by the shape of the
marker. The proposed algorithm is applied to the training
samples, with results displayed in the right plot. The red polygons show the
convex hull of each resulting cluster. It
is possible that a resulting cluster contains only one sample. Such clusters
are not indicated by red polygons.
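To illustrate what a convex hull of a cluster is, and why a one-sample cluster has no polygon, here is a minimal Python sketch using Andrew's monotone chain algorithm on 2-D plot coordinates. This is a standard textbook method chosen for illustration; it is not claimed to be how the GUI draws its polygons, and the point values are made up.

```python
def convex_hull(points):
    """Return the convex hull of 2-D points in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts  # a single point or a segment does not form a polygon

    def cross(o, a, b):
        # Positive when the turn o -> a -> b is counter-clockwise.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


# Hypothetical cluster: four corner samples and one interior sample.
cluster = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
hull = convex_hull(cluster)
single = convex_hull([(3, 3)])
```

The interior sample (1, 1) is not a hull vertex, and the one-sample cluster returns just its own point.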
The two plots are both PCA-plots, drawn using principal component analysis
(PCA). The horizontal and vertical axes are the coefficients of (by default)
the first two principal components. Users can choose other principal
components by editing the two EditBoxes to the left of the button.
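The coordinates used in a PCA-plot can be sketched as follows: center each feature, form the covariance matrix between features, find its leading eigenvectors, and project each sample onto them. The Python sketch below uses power iteration with deflation so that it needs only the standard library. It is an illustration under these assumptions, not the package's plotting code, and the function name top_components and the start vector are arbitrary choices.

```python
import math


def top_components(samples, n_components=2, n_iter=200):
    """Estimate leading principal components by power iteration with deflation.

    samples: list of rows, one row per feature, one column per sample.
    Returns (components, coefficients): components[k] is a unit vector over
    features; coefficients[k][j] is the coordinate of sample j on component k.
    """
    n_feat, n_samp = len(samples), len(samples[0])
    # Center each feature (row) at zero.
    centered = []
    for row in samples:
        mean = sum(row) / n_samp
        centered.append([x - mean for x in row])
    # Sample covariance matrix between features.
    cov = [[sum(a * b for a, b in zip(ri, rj)) / (n_samp - 1) for rj in centered]
           for ri in centered]
    components = []
    for k in range(n_components):
        # Arbitrary fixed start vector, normalized to unit length.
        start = [1.0 / (i + 1 + k) for i in range(n_feat)]
        s = math.sqrt(sum(x * x for x in start))
        v = [x / s for x in start]
        for _ in range(n_iter):
            w = [sum(c * x for c, x in zip(row, v)) for row in cov]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break  # no variance left in any direction
            v = [x / norm for x in w]
        eigval = sum(v[i] * sum(cov[i][j] * v[j] for j in range(n_feat))
                     for i in range(n_feat))
        # Deflate: remove the found direction so the next pass finds the next one.
        cov = [[cov[i][j] - eigval * v[i] * v[j] for j in range(n_feat)]
               for i in range(n_feat)]
        components.append(v)
    coefficients = [[sum(v[i] * centered[i][j] for i in range(n_feat))
                     for j in range(n_samp)] for v in components]
    return components, coefficients


# Hypothetical data: 2 features x 5 samples lying close to the line y = x.
data = [
    [-2.0, -1.0, 0.0, 1.0, 2.0],
    [-2.1, -0.9, 0.0, 1.1, 1.9],
]
comps, coords = top_components(data)
```

Here coords[0] and coords[1] would be the horizontal and vertical plot coordinates of the five samples.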
The left plot in step 2 is interactive. Users can delete or restore training
samples by clicking them in the figure.
NOTE: Although the samples are displayed using PCA, the
algorithm uses all the features provided in the input
file.
NOTE: One very important component of the algorithm is how
the Laplacian matrix is defined from the input data, because the Laplacian
matrix is the basis for the iterative partitioning. The Laplacian matrix is
defined in the file "calculate_dist_la_matrix.m". Users can modify this file
to define the Laplacian matrix in their own way.
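For readers unfamiliar with the term, the sketch below shows one textbook way to build a graph Laplacian from data: a Gaussian similarity W between samples, the diagonal degree matrix D of its row sums, and L = D - W. This is an assumption for illustration only. The definition actually used by SPACC is the one in "calculate_dist_la_matrix.m" and is not reproduced here; the function name gaussian_laplacian and the parameter sigma are hypothetical.

```python
import math


def gaussian_laplacian(samples, sigma=1.0):
    """Unnormalized graph Laplacian L = D - W from a Gaussian similarity.

    samples: list of rows, one row per feature, one column per sample.
    sigma:   kernel width (an illustrative parameter, not taken from SPACC).
    """
    n_samp = len(samples[0])
    # Turn the feature-by-sample layout into one vector per sample.
    columns = [[row[j] for row in samples] for j in range(n_samp)]
    # W[i][j] = exp(-||x_i - x_j||^2 / (2 sigma^2)), with zero self-similarity.
    W = [[0.0] * n_samp for _ in range(n_samp)]
    for i in range(n_samp):
        for j in range(i + 1, n_samp):
            d2 = sum((a - b) ** 2 for a, b in zip(columns[i], columns[j]))
            W[i][j] = W[j][i] = math.exp(-d2 / (2.0 * sigma ** 2))
    # Off-diagonal entries are -W; diagonal entries are the row sums of W.
    L = [[-W[i][j] for j in range(n_samp)] for i in range(n_samp)]
    for i in range(n_samp):
        L[i][i] = sum(W[i])
    return L


# Hypothetical data: two nearby samples and one distant sample.
toy = [
    [0.0, 0.1, 5.0],
    [0.0, 0.1, 5.1],
]
lap = gaussian_laplacian(toy)
```

Each row of the result sums to zero, and nearby samples are coupled by large negative off-diagonal entries, which is the property spectral partitioning methods rely on.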
(4) Display training and testing data
The plot in step 3 displays both training and testing data in a PCA-plot,
to give users an overview of the data.
(5) Classification
In step 4, clicking the "classification" button
performs classification on the testing set.
In step 5, the plot shows the classification of one testing sample (the red
marker). The shape of the red marker indicates the true label of the testing
sample. The blue polygon encloses the training samples whose majority vote
determines the classification decision.
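The majority vote itself can be sketched in a few lines of Python. How SPACC selects the voting training samples (those inside the blue polygon) is not described here, so the list of voter labels below is simply assumed as given. The function name majority_vote and the tie-breaking rule are illustrative choices.

```python
from collections import Counter


def majority_vote(voter_labels):
    """Return the most common class label among the voting training samples.

    Ties are broken in favor of the label encountered first, which is an
    arbitrary choice made for this sketch.
    """
    if not voter_labels:
        raise ValueError("at least one voting sample is required")
    return Counter(voter_labels).most_common(1)[0][0]


# Hypothetical example: labels of the training samples that vote.
voters = [1, 2, 2, 2, 1]
predicted = majority_vote(voters)
true_label = 2
outcome = "correct" if predicted == true_label else "incorrect"
```

Comparing the predicted label with the true label of the testing sample gives the "correct" or "incorrect" outcome.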
Using the EditBox and buttons above the plot, users can navigate through
all testing samples.
A list of incorrectly classified testing samples is shown in the list box.
Clicking an entry in the list box shows that instance of incorrect
classification in the right plot.
(6) Output
In step 4, the "output results" button generates an output file named
"tmp_result_file.mat". The output file contains the sample index (defined by the
software), the sample name, and the classification results.
In "tmp_result_file.mat", the classification result of each training sample is
"training_#", where the number identifies the cluster the sample was assigned to
(the red polygons in step 2). "-1" means outlier. The classification result of
each testing sample is either "correct" or "incorrect".