The storage models of most existing graph database systems view graphs as indivisible structures and hence do not allow a hierarchical layering of the graph. This adversely affects query performance for large graphs as there is no way to filter the graph on a higher level without actually accessing the entire information from the disk. Distributing the storage and processing is one way to extract better performance. But current distributed solutions to this problem are not entirely effective, again due to the indivisible representation of graphs adopted in the storage format. This causes unnecessary latency due to increased inter-processor communication.

In this dissertation, we propose an optimized distributed graph storage system for scalable and faster querying of big graph data. We start with our physical storage model, in which the graph is decomposed into three levels of abstraction, each with a different storage hierarchy. We use a hybrid storage model to store the most critical component and restrict I/O trips to cases where they are absolutely necessary. This lets us make active use of multi-level filters while querying, without the need for comprehensive indexes. Our results show that our system outperforms established graph databases for several classes of queries. We show that this separation also eases the difficulties in distributing graph data, and we go on to propose a more efficient distributed model for querying general-purpose graph data using the Spark framework.

]]>For the smart grid, this work provides a complete solution for system management based on novel in-situ data analytics designs. We first propose methodologies for two important tasks of power system monitoring: grid topology change detection and power-line outage detection. To address the issue of low measurement redundancy in topology identification, particularly in the low-level distribution network, we develop a maximum a posteriori based mechanism, which is capable of embedding prior information on breaker status to enhance the identification accuracy. In power-line outage detection, existing approaches suffer from high computational complexity and from security issues arising from centralized implementation. Instead, this work presents a distributed data analytics framework, which carries out in-network processing and has low computational complexity, requiring only simple matrix-vector multiplications. To complete the system functionality, we also propose a new power grid restoration strategy involving data analytics for topology reconfiguration and resource planning after faults or changes.

For the seismic imaging system, we develop several innovative in-situ seismic imaging schemes in which each sensor node computes the tomography based on its partial information and through gossip with local neighbors. The seismic data are originally generated in a distributed fashion. Unlike the conventional approach, which collects the data first and processes them afterwards, our proposed in-situ data computing methodology is much more efficient. The underlying mechanisms avoid the bandwidth bottleneck, since all the data are processed in a distributed manner and only limited decisional information is communicated. Furthermore, the proposed algorithms can deliver quicker insights than the state of the art in seismic imaging. Hence they are more promising solutions for real-time in-situ data analytics, which is in high demand in disaster-monitoring applications. Through extensive experiments, we demonstrate that the proposed data computing methods are able to achieve near-optimal, high-quality seismic tomography, retain low communication cost, and provide real-time seismic data analytics.
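
The gossip step mentioned above can be illustrated with a minimal sketch of randomized pairwise averaging. This is not the dissertation's tomography algorithm. It only shows how nodes that exchange values with local neighbors can agree on a network-wide quantity without a central collector. The ring topology, the scalar values, and the names `gossip_average`, `values`, and `neighbors` are assumptions made for the illustration.

```python
import random

def gossip_average(values, neighbors, rounds=2000, seed=0):
    """Randomized pairwise gossip (illustrative, not the authors' scheme).

    In each round one node and one of its neighbors replace their values
    with the mean of the two. The network-wide sum is preserved, so every
    node drifts toward the global average using only local exchanges.
    """
    rng = random.Random(seed)
    x = list(values)
    nodes = list(range(len(x)))
    for _ in range(rounds):
        i = rng.choice(nodes)
        if not neighbors[i]:
            continue  # isolated node: nothing to exchange
        j = rng.choice(neighbors[i])
        avg = (x[i] + x[j]) / 2.0
        x[i] = x[j] = avg
    return x

# Hypothetical setup: six sensors on a ring, each holding a local scalar estimate.
values = [3.0, 9.0, 1.0, 7.0, 5.0, 11.0]
neighbors = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
estimates = gossip_average(values, neighbors)
```

Each exchange involves only two neighboring nodes, which is why this style of computation avoids shipping all raw data over a shared link.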

]]>Topic modeling is one of the most recent techniques that discover hidden thematic structures from large data collections without human supervision. Several topic models have been proposed in various fields of study and have been utilized extensively for many applications. Latent Dirichlet Allocation (LDA) is the most well-known topic model that generates topics from large corpora of resources, such as text, images, and audio. It has been widely used in many areas of information retrieval and data mining, providing an efficient way of identifying latent topics among document collections. However, LDA has the drawback that topic cohesion within a concept is attenuated when estimating infrequently occurring words. Moreover, LDA does not appear to consider the meaning of words, but rather infers hidden topics using a statistical approach. As a result, LDA can cause either a reduction in the quality of topic words or an increase in loose relations between topics.

To solve these problems, we propose a domain-specific topic model that combines domain concepts with LDA. Two domain-specific algorithms are suggested for solving the difficulties associated with LDA. The main strength of our proposed model comes from the fact that it narrows semantic concepts from broad domain knowledge to a specific one, which solves the unknown domain problem. Our proposed model is extensively tested on various applications, namely query expansion, classification, and summarization, to demonstrate the effectiveness of the model. Experimental results show that the proposed model significantly increases the performance of these applications.

]]>The second part of this thesis presents a set of tools to differentially analyze metabolic pathways from RNA-Seq data. Metabolic pathways are series of chemical reactions occurring within a cell. We focus on two main problems in the differential analysis of metabolic pathways, namely, differential analysis of their inferred activity level and of their estimated abundance. We validate our approaches through differential expression analysis at the transcript and gene levels and also through real-time quantitative PCR experiments. In Part Four, we present the different packages created or updated in the course of this study. We conclude with our future work plans for further improving IsoDE 2.0.

]]>My dissertation research focuses on the problem of searching for genome-wide associations under three frequently encountered scenarios, i.e., one case and one control, multiple cases and multiple controls, and Linkage Disequilibrium (LD) block structure. For the first scenario, we present a simple and fast method, named DCHE, using dynamic clustering. We also design two methods, a Bayesian inference based method and a heuristic method, to detect genome-wide multi-locus epistatic interactions on multiple diseases. For the last scenario, we propose a block-based Bayesian approach to model the LD and conditional disease association simultaneously. Experimental results on both synthetic and real GWAS datasets show that the proposed methods improve the detection accuracy of disease-specific associations and reduce the computational cost compared with current popular methods.

]]>In addition to these parallel algorithms, the other main contributions in this dissertation are 1) multi-core and many-core implementations for clipping a pair of polygons and 2) MPI-GIS and Hadoop Topology Suite for distributed polygon overlay using a cluster of nodes. Nvidia GPU and CUDA are used for the many-core implementation. The MPI-based system achieves a 44X speedup while processing about 600K polygons in two real-world GIS shapefiles (1) USA Detailed Water Bodies and 2) USA Block Group Boundaries) within 20 seconds on a 32-node (8 cores each) IBM iDataPlex cluster interconnected by InfiniBand technology.

]]>The success of a DES system lies in two factors: the quality of the base learners and the optimality of the ensemble selection. DES-RE, proposed in our work, addresses these two challenges respectively. 1) Local expertise enhancement. A novel data sampling and weighting strategy that combines the advantages of bagging and boosting is employed to increase the local expertise of the base learners in order to facilitate the later ensemble selection. 2) Competence region optimization. DES-RE tries to learn a distance metric to form better competence regions (i.e., neighborhoods) that promote strong base learners with respect to a specific query pattern. In addition to performing local expertise enhancement and competence region optimization independently, we propose an expectation–maximization (EM) framework that combines the two procedures. For all the proposed algorithms, extensive simulations are conducted to validate their performance.
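
To make the idea of a competence region concrete, the following is a minimal sketch of neighborhood-based ensemble selection. It is not DES-RE. It substitutes plain Euclidean distance for the learned metric described above, and the function names, the toy base learners, and the validation data are all hypothetical.

```python
from collections import Counter

def squared_distance(a, b):
    """Plain squared Euclidean distance (stand-in for a learned metric)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def competence_region(query, val_X, k):
    """Indices of the k validation points nearest to the query."""
    order = sorted(range(len(val_X)),
                   key=lambda i: squared_distance(query, val_X[i]))
    return order[:k]

def select_and_predict(query, learners, val_X, val_y, k=3):
    """Keep the learners with the best accuracy inside the query's
    competence region, then take a majority vote among them."""
    region = competence_region(query, val_X, k)
    scores = []
    for clf in learners:
        correct = sum(1 for i in region if clf(val_X[i]) == val_y[i])
        scores.append(correct / len(region))
    best = max(scores)
    selected = [clf for clf, s in zip(learners, scores) if s == best]
    votes = Counter(clf(query) for clf in selected)
    return votes.most_common(1)[0][0]

# Hypothetical 1-D validation set and three toy base learners.
val_X = [(0.0,), (1.0,), (2.0,), (8.0,), (9.0,), (10.0,)]
val_y = [0, 0, 0, 1, 1, 1]
learners = [
    lambda x: 0,                # locally expert on the left cluster only
    lambda x: 1,                # locally expert on the right cluster only
    lambda x: int(x[0] > 1.5),  # misplaced threshold, wrong near x = 2
]
left_label = select_and_predict((1.2,), learners, val_X, val_y)
right_label = select_and_predict((9.5,), learners, val_X, val_y)
```

In this toy setup a learner that is poor globally can still be selected where it is locally reliable, which is the behavior that a better-shaped competence region is meant to encourage.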

]]>In the first part of the dissertation, we present a path-based ILP model for the VNE problem. Our solution employs a branch-and-bound framework to resolve the integrality constraints, while embedding the column generation process to effectively obtain the lower bound for branch pruning. Unlike existing approaches, the proposed solution can obtain either an optimal solution or a near-optimal solution with a guarantee on the solution quality.

A common strategy in VNE algorithm design is to decompose the problem into two sequential sub-problems: node assignment (NA) and link mapping (LM). This approach inevitably sacrifices solution quality, since the NA is not holistic and is irreversible. In the second part, we are motivated to answer the question: Is it possible to maintain the simplicity of the Divide-and-Conquer strategy while still achieving optimality? Our answer is based on a decomposition framework supported by the Primal-Dual analysis of the path-based ILP model.
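
The two-stage strategy described above can be sketched as follows. This is a generic greedy heuristic written for illustration, not the decomposition framework proposed in the dissertation. The capacity model (CPU per node, bandwidth per link), the function names, and the toy substrate are assumptions.

```python
from collections import deque

def assign_nodes(virtual_cpu, substrate_cpu):
    """Stage 1 (NA): greedily place each virtual node, most demanding
    first, on the unused substrate node with the most residual CPU."""
    residual = dict(substrate_cpu)
    mapping = {}
    for v in sorted(virtual_cpu, key=virtual_cpu.get, reverse=True):
        candidates = [s for s in residual
                      if s not in mapping.values()
                      and residual[s] >= virtual_cpu[v]]
        if not candidates:
            return None  # no feasible host for this virtual node
        best = max(candidates, key=lambda s: residual[s])
        mapping[v] = best
        residual[best] -= virtual_cpu[v]
    return mapping

def shortest_path(bandwidth, src, dst, demand):
    """Breadth-first search over links with enough residual bandwidth."""
    adj = {}
    for (a, b), bw in bandwidth.items():
        if bw >= demand:
            adj.setdefault(a, []).append(b)
            adj.setdefault(b, []).append(a)
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

def map_links(virtual_links, node_map, bandwidth):
    """Stage 2 (LM): route each virtual link between the hosts fixed in
    stage 1. The node assignment is never revisited here."""
    residual = dict(bandwidth)
    paths = {}
    for (u, v), demand in virtual_links.items():
        path = shortest_path(residual, node_map[u], node_map[v], demand)
        if path is None:
            return None  # stage 1 choices left no feasible route
        for a, b in zip(path, path[1:]):
            key = (a, b) if (a, b) in residual else (b, a)
            residual[key] -= demand
        paths[(u, v)] = path
    return paths

# Hypothetical substrate (a 4-node ring) and a 2-node virtual request.
substrate_cpu = {"A": 10, "B": 8, "C": 6, "D": 4}
bandwidth = {("A", "B"): 5, ("B", "C"): 5, ("C", "D"): 5, ("A", "D"): 5}
virtual_cpu = {"x": 5, "y": 4}
virtual_links = {("x", "y"): 3}

node_map = assign_nodes(virtual_cpu, substrate_cpu)
paths = map_links(virtual_links, node_map, bandwidth)
```

Because `map_links` takes `node_map` as fixed input, a poor placement in the first stage cannot be repaired in the second, which is the loss of solution quality that the paragraph above refers to.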

This dissertation also attempts to address issues in two frontiers of network virtualization: survivability and integration of the optical substrate. In the third part, we address the survivable network embedding (SNE) problem from a network flow perspective, considering both splittable and non-splittable flows. In addition, the explosive growth of Internet traffic calls for the support of a bandwidth-abundant optical substrate, despite the extra dimensions of complexity caused by the heterogeneity of optical resources and the physical features of optical transmission. In the fourth part, we present a holistic view of the motivation, architecture, and challenges on the way towards a virtualized optical substrate that supports network virtualization.

]]>Wavelet descriptors have been widely used in multi-resolution image analysis. However, making the wavelet transform shift and rotation invariant produces redundancy and requires complex matching processes. Other multi-resolution descriptors usually depend on additional theories or information, such as filtering functions or prior domain knowledge, which not only increases the computational complexity but also introduces errors.

We propose a novel multi-resolution scheme that is capable of transforming any kind of image descriptor into its multi-resolution structure with high computational accuracy and efficiency. Our multi-resolution scheme is based on sub-sampling an image into an odd-even image tree. By applying image descriptors to the odd-even image tree, we obtain the corresponding multi-resolution image descriptors. Multi-resolution analysis is based on downsampling expansion with maximum energy extraction followed by upsampling reconstruction. Since the maximum energy is usually retained in the lowest-frequency coefficients, we perform maximum energy extraction by keeping the lowest coefficients from each resolution level.
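
As a rough illustration of the sub-sampling idea, the sketch below splits an image into four sub-images by row and column parity and recurses to build a tree, then applies a descriptor at every node. The abstract does not specify the exact construction, so the parity-based split, the tree layout, and the names `odd_even_split`, `build_tree`, and `describe_tree` are assumptions, and the mean-intensity descriptor is only a placeholder.

```python
def odd_even_split(image):
    """Split a 2-D list of pixels into four sub-images by the parity
    of the row index and the column index (assumed construction)."""
    return {
        "ee": [row[0::2] for row in image[0::2]],  # even rows, even cols
        "eo": [row[1::2] for row in image[0::2]],  # even rows, odd cols
        "oe": [row[0::2] for row in image[1::2]],  # odd rows, even cols
        "oo": [row[1::2] for row in image[1::2]],  # odd rows, odd cols
    }

def build_tree(image, depth):
    """Recursively sub-sample the image down to the requested depth."""
    node = {"image": image, "children": {}}
    if depth > 0 and len(image) >= 2 and len(image[0]) >= 2:
        for name, sub in odd_even_split(image).items():
            node["children"][name] = build_tree(sub, depth - 1)
    return node

def describe_tree(node, descriptor):
    """Apply the same descriptor to the image stored at every node."""
    return {
        "value": descriptor(node["image"]),
        "children": {name: describe_tree(child, descriptor)
                     for name, child in node["children"].items()},
    }

def mean_intensity(image):
    """Placeholder descriptor: average pixel value."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)

# Hypothetical 4x4 image whose pixel values are 0..15 in row-major order.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tree = build_tree(image, depth=1)
features = describe_tree(tree, mean_intensity)
```

The split only rearranges pixels and never alters their values, so each level of the tree is an exact lower-resolution view of its parent, and the four children of a node can be processed independently.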

Our multi-resolution scheme can analyze images recursively and effectively without introducing artifacts or changes to the original images, produce multi-resolution representations, obtain higher-resolution images using only information from lower resolutions, compress data, filter noise, extract effective image features, and be implemented with parallel processing.

]]>