Parallel Computing in Statistical-Validation of Clustering Algorithm for the Analysis of High throughput Data
Date of Award
Master of Science (MS)
Mathematics and Statistics
Dr. Susmita Datta - Chair
Dr. Saied Belkasim
Dr. Gengsheng Qin
Currently, clustering applications use classical methods to partition a set of data (or objects) into a set of meaningful sub-classes, called clusters. A cluster is therefore a collection of objects that are “similar” to one another, and can thus be treated collectively as one group, and that are “dissimilar” to the objects belonging to other clusters. However, clustering poses a number of problems. Among them, as mentioned in [Datta03], handling a large number of dimensions and a large number of data items can be problematic because of the computational time required. In this thesis, we investigate all the clustering algorithms used in [Datta03] and present a parallel solution that reduces the computational time. We apply parallel programming techniques to the statistical algorithms as a natural extension of sequential programming in R. The proposed parallel model has been tested on a high-throughput dataset: microarray data on the transcriptional profile during sporulation in budding yeast, containing more than 6,000 genes. Our evaluation covers the scalability of the clustering algorithms on datasets of varying dimensions, the speedup factor, and the efficiency of the parallel model relative to the sequential implementation. Our experiments show that the gene expression data follow the pattern predicted in [Datta03]: Diana appears to be a solid performer, and the group means for each cluster coincide with those in [Datta03]. We show that our parallel model is applicable to the clustering algorithms and is especially useful in applications that deal with high-throughput data, such as gene expression data.
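The speedup factor and efficiency named in the abstract are not defined there. The sketch below assumes the conventional definitions: speedup is sequential time divided by parallel time, and efficiency is speedup divided by the number of workers. The function names and the timings are hypothetical and are not taken from the thesis, which implements its model in R; Python is used here only for illustration.

```python
def speedup(t_sequential, t_parallel):
    """Speedup S = T_sequential / T_parallel (conventional definition, assumed here)."""
    if t_sequential <= 0 or t_parallel <= 0:
        raise ValueError("timings must be positive")
    return t_sequential / t_parallel


def efficiency(t_sequential, t_parallel, n_workers):
    """Efficiency E = S / p, where p is the number of parallel workers."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return speedup(t_sequential, t_parallel) / n_workers


# Hypothetical timings in seconds; these are NOT measurements from the thesis.
t_seq, t_par, workers = 120.0, 40.0, 4
print(f"speedup    = {speedup(t_seq, t_par):.2f}")
print(f"efficiency = {efficiency(t_seq, t_par, workers):.2f}")
```

Under these definitions, an efficiency close to 1 would mean the workers are almost fully utilised, while lower values point to communication or scheduling overhead.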
Atlas, Mourad, "Parallel Computing in Statistical-Validation of Clustering Algorithm for the Analysis of High throughput Data." Thesis, Georgia State University, 2005.