Date of Award


Degree Type

Closed Dissertation

Degree Name

Doctor of Philosophy (PhD)


Computer Science

First Advisor

Alex Zelikovsky - Chair

Second Advisor

Andrey Perelygin

Third Advisor

Robert Harrison

Fourth Advisor

Anu Bourgeois


The accessibility of high-throughput biology data has brought a great deal of attention to disease association studies. High-density maps of single nucleotide polymorphisms (SNPs), as well as massive genotype data sets covering large numbers of individuals and SNPs, have become publicly available. So far, most analysis of the new data has been undertaken by the statistics community. In this dissertation, we pursue a different line of attack on genetic susceptibility to complex diseases, one that follows the computer science tradition, with an emphasis on design rather than analytical methodology. The main goal of disease association analysis is to identify gene variations contributing to the risk of and/or susceptibility to a particular disease. Susceptibility analysis has two main steps: (i) haplotyping of the population and (ii) predicting genetic susceptibility to diseases. Although many phasing methods exist for step (i), phasing and missing data recovery for family trio data lag behind, even though most disease association studies are based on family trios. This study is devoted to the problem of assessing accumulated information in order to predict genotype susceptibility to complex diseases with significantly high accuracy and statistical power. The dissertation proposes two new solution methods for step (i), based on greedy and integer linear programming approaches. We also propose several universal and ad hoc methods for step (ii). The quality of the susceptibility prediction algorithms has been assessed using leave-one-out and leave-many-out tests and shown to be statistically significant based on randomization tests. The prediction of disease status can also be viewed as an integrated risk factor. A combinatorial prediction complexity measure is proposed for case/control studies.
The best prediction rates achieved by the proposed algorithms are 69.5% for Crohn's disease and 61.3% for autoimmune disorder, which are significantly higher than those achieved by universal prediction methods such as the Support Vector Machine (SVM) and known statistical methods.
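The evaluation protocol mentioned above (leave-one-out testing followed by a randomization test) can be illustrated with a minimal sketch. This is not the dissertation's prediction algorithm: a simple 1-nearest-neighbor classifier under Hamming distance stands in for it, the function names are hypothetical, and the genotype vectors (coded 0/1/2 per SNP) are invented toy data.

```python
import random

def hamming(a, b):
    """Number of SNP positions at which two genotype vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def predict_1nn(train_x, train_y, query):
    """Stand-in classifier (an assumption): label of the closest training genotype."""
    best = min(range(len(train_x)), key=lambda i: hamming(train_x[i], query))
    return train_y[best]

def leave_one_out_accuracy(xs, ys):
    """Hold out each individual in turn, predict it from the rest, and score."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict_1nn(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)

def randomization_p_value(xs, ys, n_shuffles=200, seed=0):
    """Estimate how often shuffled case/control labels score at least as well."""
    rng = random.Random(seed)
    observed = leave_one_out_accuracy(xs, ys)
    hits = 0
    for _ in range(n_shuffles):
        shuffled = ys[:]
        rng.shuffle(shuffled)
        if leave_one_out_accuracy(xs, shuffled) >= observed:
            hits += 1
    # The +1 terms count the observed labeling itself, so p is never zero.
    return observed, (hits + 1) / (n_shuffles + 1)

# Invented toy data: four cases (label 1) and four controls (label 0).
genotypes = [
    [2, 2, 0, 1], [2, 2, 1, 1], [2, 1, 0, 1], [2, 2, 0, 0],
    [0, 0, 1, 2], [0, 1, 1, 2], [0, 0, 2, 2], [1, 0, 1, 2],
]
status = [1, 1, 1, 1, 0, 0, 0, 0]

accuracy, p_value = randomization_p_value(genotypes, status)
print(f"leave-one-out accuracy: {accuracy:.3f}, randomization p-value: {p_value:.3f}")
```

A small p-value indicates that the leave-one-out accuracy is unlikely to arise when the case/control labels carry no information about the genotypes. A leave-many-out variant would hold out a group of individuals per round instead of a single one.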