Source package for "Noise-Resistant Bicluster Recognition", accepted at ICDM 2013
--Prepared by Huan Sun, UCSB
--Please report any bugs or problems to huansun@cs.ucsb.edu
1.Main files:
Each M-file is heavily commented. Please refer to each specific M-file for details.
(a) preprocessing
triplize.m: to discretize the original continuous data to {-1,0,1} according to [1].
(b) training
initializeParameters.m: to initialize the parameters of the model at the beginning of the training process
TrainAutoDecoder.m: to train the AutoDecoder (AD) on gene expression data
ObjGradientBackPropagate.m: to compute the derivatives of the objective function by backpropagation (sigmoid in the hidden layer)
ObjGradientBackPropagate_tanh.m: to compute the derivatives of the objective function by backpropagation (tanh in the hidden layer)
minFunc Package: a function package to minimize an objective function, which can be downloaded from [2].
(c) finding biclusters
BiclusterEmbedding.m: to find the gene set and condition set in a bicluster given the learnt parameters
ClusterFilter_block.m: to filter out heavily overlapping biclusters
(d) evaluating biclusters
SampleRelevRecov.m: to compute the relevance and recovery scores of the sample sets contained in biclusters
GeneSetAnalysis.R: to evaluate the enrichment of the gene sets contained in biclusters
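To illustrate the preprocessing step in (a), the MATLAB sketch below maps each gene's expression values to {-1,0,1} with a simple rank-based rule. It is not the code in triplize.m: the toy matrix X, the tail fraction q and the rule itself are assumptions made for illustration only, and the actual procedure is the one described in [1].

```matlab
% Minimal sketch of a rank-based three-level discretization.
% Assumptions (not taken from triplize.m): rows are genes, columns are
% conditions, and q is a hypothetical fraction of conditions per tail.
X = [0.1 2.3 -1.7 0.4 0.0 3.1;
     5.0 4.8  5.1 1.2 9.7 5.2];
q = 1/3;                               % assumed tail fraction

[nGenes, nConds] = size(X);
k = max(1, round(q * nConds));         % conditions labelled in each tail
T = zeros(nGenes, nConds);             % default label 0

for g = 1:nGenes
    [~, order] = sort(X(g, :), 'ascend');
    T(g, order(1:k)) = -1;             % k lowest values of this gene  -> -1
    T(g, order(end-k+1:end)) = 1;      % k highest values of this gene -> +1
end

disp(T);
```

With this rule every gene receives the same number of -1 and +1 labels, which keeps the sketch short but ignores how far the values actually lie from the gene's typical level.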
2.Example illustration:
TestExample.m shows the pipeline for applying AD to the DLBCL dataset.
Training and evaluating AD takes four steps:
(1) Prepare your data and pass it to "TrainAutoDecoder.m" to learn the parameters of AD.
(2) Based on the learnt parameters, determine the membership of genes and conditions using "BiclusterEmbedding.m".
(3) Remove heavily overlapping biclusters by running "ClusterFilter_block.m".
(4) Evaluate the biclusters in terms of relevance and recovery using "SampleRelevRecov.m".
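To give a rough idea of what step (4) measures, the MATLAB sketch below computes relevance and recovery from the Jaccard index between sample sets, which is one common definition of these scores. It does not reproduce SampleRelevRecov.m: the variables found and truth, the toy index sets and the Jaccard-based matching are assumptions made for illustration.

```matlab
% Sketch of Jaccard-based relevance and recovery on sample sets.
% Assumption: each bicluster's sample set is a vector of column indices.
found = {[1 2 3 4], [7 8 9]};              % hypothetical discovered sample sets
truth = {[1 2 3], [7 8 9 10], [12 13]};    % hypothetical reference sample sets

jaccard = @(a, b) numel(intersect(a, b)) / numel(union(a, b));

% S(i, j) = Jaccard index between found{i} and truth{j}
S = zeros(numel(found), numel(truth));
for i = 1:numel(found)
    for j = 1:numel(truth)
        S(i, j) = jaccard(found{i}, truth{j});
    end
end

relevance = mean(max(S, [], 2));   % each found set vs. its best-matching true set
recovery  = mean(max(S, [], 1));   % each true set vs. its best-matching found set

fprintf('relevance = %.4f, recovery = %.4f\n', relevance, recovery);
```

In this toy case both found sets match a reference set with Jaccard index 0.75, so relevance is 0.75, while the third reference set is not matched at all, which lowers recovery to 0.5.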
3.Dataset sources:
Breast cancer, multiple tissue, DLBCL, lung cancer:
http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
Datasets used in our paper are also included in the "Datasets" directory.
References:
[1] G. Li, Q. Ma, H. Tang, A. Paterson, and Y. Xu, “QUBIC: a qualitative biclustering algorithm for analyses of gene expression data,” Nucleic Acids Res., vol. 37, no. 15, p. e101, 2009.
[2] http://www.cs.ubc.ca/~schmidtm