Spectral Analysis for Class Discovery and Classification (SPACC)
(version 2.0)

We propose SPACC, a classifier that performs both class discovery and classification. The algorithm is implemented in Matlab 7 with a graphical user interface (GUI) on top of it, designed and written by Peng Qiu.

Our motivation is as follows. Existing classification methods in the literature fall mainly into two categories: unsupervised and supervised. Unsupervised methods group samples into clusters or tree structures, and the class labels do not affect the clustering process. Because the labels are not used, unsupervised methods have the potential to discover subclasses beyond the known class labels. However, there is no systematic way to interpret the agreement and disagreement between unsupervised clusters and known class labels. Supervised methods, on the other hand, aim to find the boundary that best separates the classes, and the class labels play a central role. Because the total number of classes is fixed by the class labels, supervised classifiers cannot pick up possible substructure within each known class. Robustness is another issue for supervised methods: the adverse effect of outliers and mislabeled samples needs to be handled carefully. These limitations motivated us to propose a novel classification method that gives the class labels a less dominant role, so that class discovery and classification are performed simultaneously.

A manuscript of this work has been published:

Peng Qiu and Sylvia K. Plevritis, "Simultaneous Class Discovery and Classification of Microarray Data using Spectral Analysis", Journal of Computational Biology, 16(7):935-944, 2009.

Installation instructions

This package requires Matlab 7. To give users maximum freedom to modify the software, the raw .m files are provided.

(1) download the zip package at:
http://www.stanford.edu/~qiupeng/software/SPACC/SPACC_source_code.zip (last updated on July 1, 2009)

(2) unzip to your local machine

(3) open Matlab 7 and change the current directory to the folder where the package was unzipped

(4) type "network_classification" and press Enter; the GUI will appear

We've also compiled the software into a stand-alone executable:
http://www.stanford.edu/~qiupeng/software/SPACC/SPACC_standalone.zip (last updated on July 1, 2009)
Simply unzip and run SPACC.exe; the user interface will appear.

Manual

(1) Prepare input file:

The input file is a .mat file which has the following variables:
training_samples       : each column is one sample, each row is one feature
training_labels        : vector, class label of each training sample
training_samples_names : cell array of the names of each training sample (optional)
testing_samples        : each column is one sample, each row is one feature
testing_labels         : vector, class label of each testing sample
testing_samples_names  : cell array of the names of each testing sample (optional)
There are a few example input files included in the zip package.
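
As an illustration, an input file with the variables listed above could be assembled in Matlab roughly as follows. The file name "my_input_file.mat", the toy data, the numeric label values 1 and 2, and the row-vector orientation of the labels are assumptions made for this sketch; the example files in the zip package show the format the software actually expects.

```matlab
% Hypothetical toy data: 5 features, 2 classes. Sizes and values are
% placeholders, not taken from the SPACC example files.
num_features = 5;
num_train = 20;
num_test = 10;

% Class labels (assumed numeric, stored as row vectors).
training_labels = [ones(1, num_train/2), 2*ones(1, num_train/2)];
testing_labels  = [ones(1, num_test/2),  2*ones(1, num_test/2)];

% Each column is one sample, each row is one feature.
% Class 2 is shifted by 3 so that the two classes are separable.
training_samples = randn(num_features, num_train) + ...
    3 * repmat(training_labels - 1, num_features, 1);
testing_samples = randn(num_features, num_test) + ...
    3 * repmat(testing_labels - 1, num_features, 1);

% Optional sample names, stored as cell arrays of strings.
training_samples_names = cell(1, num_train);
for i = 1:num_train
    training_samples_names{i} = sprintf('train_%d', i);
end
testing_samples_names = cell(1, num_test);
for i = 1:num_test
    testing_samples_names{i} = sprintf('test_%d', i);
end

% Write all six variables into one .mat file.
save('my_input_file.mat', 'training_samples', 'training_labels', ...
     'training_samples_names', 'testing_samples', 'testing_labels', ...
     'testing_samples_names');
```

The resulting file can then be selected in the GUI when loading the input file.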
Note:
(1) This software does not perform feature selection. The user needs to select relevant features when preparing the input file.
(2) In principle, the algorithm imposes no upper limit on the number of features, provided the features are relevant.
(3) If the numerical range of the input data is too large, errors may occur. These errors can be avoided by normalizing the data, which reduces its numerical range.
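
One possible way to reduce the numerical range before saving the input file is to standardize each feature to zero mean and unit variance. The software does not prescribe a particular normalization, so this choice, the use of training-set statistics for the testing set, and the placeholder data below are assumptions of the sketch.

```matlab
% Placeholder data: 2 features with very different numerical ranges.
training_samples = [1000 2000 3000; 0.1 0.2 0.3];
testing_samples  = [1500 2500; 0.15 0.25];

n_train = size(training_samples, 2);
n_test  = size(testing_samples, 2);

% Per-feature (row-wise) mean and standard deviation of the training set.
mu = mean(training_samples, 2);
sigma = std(training_samples, 0, 2);
sigma(sigma == 0) = 1;   % leave constant features unscaled

% Apply the same shift and scale to training and testing samples.
% repmat is used so the code also runs on older Matlab releases.
training_samples = (training_samples - repmat(mu, 1, n_train)) ./ repmat(sigma, 1, n_train);
testing_samples  = (testing_samples  - repmat(mu, 1, n_test))  ./ repmat(sigma, 1, n_test);
```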

(2) Browse and Load input file

In step 1 (top-left panel), if the input file is successfully loaded, the number of samples in each class for the training set and testing set will be displayed.
Each training and testing sample will be assigned an index that is defined by the software. This information is shown in the message box at the bottom.

(3) Display training data

In step 2 (top-right panel), click the "show training samples" button. The training samples will be displayed in the left plot. Each training sample is one marker in the plot, and the shape of the marker reflects its class label. The proposed algorithm is applied to the training samples, and the results are displayed in the right plot. The red polygons show the convex hull of each resulting cluster. A resulting cluster may contain only one sample; such clusters are not indicated by red polygons.
Both plots are PCA-plots, drawn using principal component analysis (PCA). The horizontal and vertical axes are the coefficients of (by default) the first two principal components. Users can choose other principal components by modifying the two EditBoxes to the left of the button.
The left plot in step 2 is interactive. Users can delete/undelete training samples by clicking in the figure.
NOTE: Although the samples are displayed using PCA, the algorithm is based on all the features provided by the input file.
NOTE: A very important component of the algorithm is how the Laplacian matrix is defined from the input data, because the Laplacian matrix is the basis for the iterative partitioning. The Laplacian matrix is defined in the file "calculate_dist_la_matrix.m"; users can modify this file to define the Laplacian matrix in their own way.
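
For readers who want to experiment with their own definition, the following is a generic sketch of one common way to build a graph Laplacian from the samples and to derive a two-way partition from it. It does not reproduce the contents or the interface of "calculate_dist_la_matrix.m", which are not described here; the Gaussian affinity, the bandwidth sigma, the unnormalized Laplacian, the sign-based split, and the placeholder data are all assumptions.

```matlab
% Placeholder data: 2 features (rows), 6 samples (columns), two groups.
X = [0 0.1 0.2 3 3.1 3.2; ...
     0 0.1 0   3 3.1 3];
n = size(X, 2);

% Squared Euclidean distances between all pairs of samples.
sq_norms = sum(X.^2, 1);
D2 = repmat(sq_norms', 1, n) + repmat(sq_norms, n, 1) - 2 * (X' * X);
D2 = max(D2, 0);           % clip small negative values caused by round-off

% Gaussian affinity matrix; sigma is a hypothetical bandwidth parameter.
sigma = 1;
W = exp(-D2 / (2 * sigma^2));
W(1:n+1:end) = 0;          % remove self-affinities on the diagonal

% Unnormalized graph Laplacian: degree matrix minus affinity matrix.
L = diag(sum(W, 2)) - W;

% The eigenvector of the second-smallest eigenvalue (Fiedler vector)
% suggests a two-way split of the samples by the sign of its entries.
[V, E] = eig((L + L') / 2);            % symmetrize against round-off
[sorted_eigs, order] = sort(diag(E));
fiedler = V(:, order(2));
partition = fiedler > 0;
```

Applying such a split repeatedly to each resulting group is one way to obtain an iterative partitioning; the stopping rule used by SPACC is described in the paper cited above.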

(4) Display training and testing data

The plot in step 3 displays both training and testing data in a PCA-plot. Its purpose is to give users an overall picture of the data.
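
The coordinates of such a PCA-plot can be computed along the following lines. This is a sketch using a singular value decomposition; whether the software computes the principal components from the training samples alone or from training and testing samples together is not stated here, so pooling both sets is an assumption, as are the placeholder data and variable names.

```matlab
% Placeholder data: 3 features (rows); 4 training and 2 testing samples.
training_samples = [1 2 3 4; 2 4 6 8.5; 0 1 0 1];
testing_samples  = [5 6; 10 12; 1 0];

% Pool the samples and subtract the per-feature mean.
all_samples = [training_samples, testing_samples];
n = size(all_samples, 2);
centered = all_samples - repmat(mean(all_samples, 2), 1, n);

% Columns of U are the principal directions, ordered by decreasing variance.
[U, S, V] = svd(centered, 0);

% Principal components used for the two axes (first two by default).
pc_x = 1;
pc_y = 2;

% Row 1 holds the horizontal coordinates, row 2 the vertical coordinates.
coeffs = U(:, [pc_x pc_y])' * centered;
```

The columns of coeffs can then be passed to plot, for example plot(coeffs(1,:), coeffs(2,:), 'o'), with the first four columns corresponding to the training samples.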

(5) Classification

In step 4, click the "classification" button to classify the testing set.
In step 5, the plot shows the classification of one testing sample (the red marker). The shape of the red marker indicates the true label of the testing sample. The blue polygon encloses the training samples whose majority vote determines the classification decision.
Using the EditBox and buttons above the plot, users can navigate through all testing samples.
A list of incorrectly classified testing samples is shown in the list box. Clicking an entry in the list box shows that instance of incorrect classification in the right plot.
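
The majority vote itself can be sketched as below. How the software selects the voting training samples (the blue polygon) is not covered here, so the vector neighbor_labels is a hypothetical stand-in for the labels of those samples, and the tie-breaking behavior (the smallest label wins) is an artifact of this sketch rather than a documented property of SPACC.

```matlab
% Hypothetical labels of the training samples inside the blue polygon.
neighbor_labels = [1 2 2 1 2];
true_label = 2;               % true label of the testing sample

% Count the votes for each distinct class label.
[classes, first_idx, class_idx] = unique(neighbor_labels);
votes = accumarray(class_idx(:), 1);

% The class with the most votes is the predicted label.
[max_votes, winner] = max(votes);
predicted_label = classes(winner);

% Compare with the true label, as in the list of incorrect samples.
if predicted_label == true_label
    outcome = 'correct';
else
    outcome = 'incorrect';
end
```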

(6) Output

In step 4, the "output results" button generates an output file named "tmp_result_file.mat". The output file contains the sample index (defined by the software), the sample name, and the classification results.
In "tmp_result_file.mat", the classification result of each training sample is "training_#", where the number identifies the cluster the sample was assigned to (the red polygons in step 2). "-1" means outlier. The classification result of each testing sample is either "correct" or "incorrect".
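
Because the variable names stored in the output file are not listed here, a simple way to explore it is to list its contents before using them. The sketch below assumes the file is in the current directory and makes no assumption about the variable names.

```matlab
result_file = 'tmp_result_file.mat';

if exist(result_file, 'file')
    % List the variables stored in the file, then load them into a struct.
    whos('-file', result_file);
    results = load(result_file);
    disp(fieldnames(results));
else
    disp('tmp_result_file.mat not found; generate it from the GUI first.');
end
```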
