In this work we present novel computational methods for cost- and time-effective analysis of sequencing data from viral and RNA samples. In particular, we describe: i) computational methods for transcriptome reconstruction and quantification; ii) method for mass spectrometry data analysis; iii) combinatorial pooling method; iv) computational methods for analysis of intra-host viral populations.

]]>In this dissertation, the typical content dissemination scenarios in MSNs are investigated. According to the content properties, the corresponding user requirements are analyzed. First, a Bayesian framework is formulated to model the factors that influence users behavior on streaming video dissemination. An effective dissemination path detection algorithm is derived to detect the reliable and efficient video transmission paths. Second, the authorized content is investigated. We analyze the characteristics of the authorized content, and model the dissemination problem as a new graph problem, namely, Maximum Weighted Connected subgraph with node Quota (MWCQ), and propose two effective algorithms to solve it. Third, the authorized content dissemination problem in Opportunistic Social Networks(OSNs) is studied, based on the prediction of social connection pattern. We then analyze the influence of social connections on the content acquirement, and propose a novel approach, User Set Selection(USS) algorithm, to help social users to achieve fast and accurate content acquirement through social connections.

]]>The storage models of most existing graph database systems view graphs as indivisible structures and hence do not allow a hierarchical layering of the graph. This adversely affects query performance for large graphs as there is no way to filter the graph on a higher level without actually accessing the entire information from the disk. Distributing the storage and processing is one way to extract better performance. But current distributed solutions to this problem are not entirely effective, again due to the indivisible representation of graphs adopted in the storage format. This causes unnecessary latency due to increased inter-processor communication.

In this dissertation, we propose an optimized distributed graph storage system for scalable and faster querying of big graph data. We start with our unique physical storage model, in which the graph is decomposed into three different levels of abstraction, each with a different storage hierarchy. We use a hybrid storage model to store the most critical component and restrict the I/O trips to only when absolutely necessary. This lets us actively make use of multi-level filters while querying, without the need of comprehensive indexes. Our results show that our system outperforms established graph databases for several class of queries. We show that this separation also eases the difficulties in distributing graph data and go on propose a more efficient distributed model for querying general purpose graph data using the Spark framework.

]]>For smart grid, this work provides a complete solution for system management based on novel in-situ data analytics designs. We first propose methodologies for two important tasks of power system monitoring: grid topology change and power-line outage detection. To address the issue of low measurement redundancy in topology identification, particularly in the low-level distribution network, we develop a maximum a posterior based mechanism, which is capable of embedding prior information on the breakers status to enhance the identification accuracy. In power-line outage detection, existing approaches suer from high computational complexity and security issues raised from centralized implementation. Instead, this work presents a distributed data analytics framework, which carries out in-network processing and invokes low computational complexity, requiring only simple matrix-vector multiplications. To complete the system functionality, we also propose a new power grid restoration strategy involving data analytics for topology reconfiguration and resource planning after faults or changes.

In seismic imaging system, we develop several innovative in-situ seismic imaging schemes in which each sensor node computes the tomography based on its partial information and through gossip with local neighbors. The seismic data are generated in a distributed fashion originally. Dierent from the conventional approach involving data collection and then processing in order, our proposed in-situ data computing methodology is much more ecient. The underlying mechanisms avoid the bottleneck problem on bandwidth since all the data are processed distributed in nature and only limited decisional information is communicated. Furthermore, the proposed algorithms can deliver quicker insights than the state-of-arts in seismic imaging. Hence they are more promising solutions for real-time in-situ data analytics, which is highly demanded in disaster monitoring related applications. Through extensive experiments, we demonstrate that the proposed data computing methods are able to achieve near-optimal high quality seismic tomography, retain low communication cost, and provide real-time seismic data analytics.

]]>Topic modeling is one of the most recent techniquesthat discover hidden thematic structures from large data collections without human supervision. Several topic models have been proposed in various fields of study and have been utilized extensively for many applications. Latent Dirichlet Allocation (LDA) is the most well-known topic model that generates topics from large corpus of resources, such as text, images, and audio.It has been widely used in many areas in information retrieval and data mining, providing efficient way of identifying latent topics among document collections. However, LDA has a drawback that topic cohesion within a concept is attenuated when estimating infrequently occurring words. Moreover, LDAseems not to consider the meaning of words, but rather to infer hidden topics based on a statisticalapproach. However, LDA can cause either reduction in the quality of topic words or increase in loose relations between topics.

In order to solve the previous problems, we propose a domain specific topic model that combines domain concepts with LDA. Two domain specific algorithms are suggested for solving the difficulties associated with LDA. The main strength of our proposed model comes from the fact that it narrows semantic concepts from broad domain knowledge to a specific one which solves the unknown domain problem. Our proposed model is extensively tested on various applications, query expansion, classification, and summarization, to demonstrate the effectiveness of the model. Experimental results show that the proposed model significantly increasesthe performance of applications.

]]>The second part of this thesis presents a set of tools to differentially analyze metabolic pathways from RNA-Seq data. Metabolic pathways are series of chemical reactions occurring within a cell. We focus on two main problems in metabolic pathways differential analysis, namely, differential analysis of their inferred activity level and of their estimated abundance. We validate our approaches through differential expression analysis at the transcripts and genes levels and also through real-time quantitative PCR experiments. In part Four, we present the different packages created or updated in the course of this study. We conclude with our future work plans for further improving IsoDE 2.0.

]]>My dissertation research focuses on the problem of searching genome-wide association with considering three frequently encountered scenarios, i.e. one case one control, multi-cases multi-controls, and Linkage Disequilibrium (LD) block structure. For the first scenario, we present a simple and fast method, named DCHE, using dynamic clustering. Also, we design two methods, a Bayesian inference based method and a heuristic method, to detect genome-wide multi-locus epistatic interactions on multiple diseases. For the last scenario, we propose a block-based Bayesian approach to model the LD and conditional disease association simultaneously. Experimental results on both synthetic and real GWAS datasets show that the proposed methods improve the detection accuracy of disease-specific associations and lessen the computational cost compared with current popular methods.

]]>