For the smart grid, this work provides a complete solution for system management based on novel in-situ data analytics designs. We first propose methodologies for two important power system monitoring tasks: grid topology change detection and power-line outage detection. To address the low measurement redundancy in topology identification, particularly in the low-level distribution network, we develop a maximum a posteriori (MAP) based mechanism, which is capable of embedding prior information on breaker statuses to enhance identification accuracy. In power-line outage detection, existing approaches suffer from high computational complexity and from the security issues raised by centralized implementation. Instead, this work presents a distributed data analytics framework that carries out in-network processing and incurs low computational complexity, requiring only simple matrix-vector multiplications. To complete the system functionality, we also propose a new power grid restoration strategy that applies data analytics to topology reconfiguration and resource planning after faults or changes.
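As a minimal illustration of the MAP idea (not the actual grid model), the following sketch scores two hypothetical breaker-status hypotheses, each implying a different measurement matrix, by log-likelihood plus log-prior; the matrices, prior probabilities, and noise level are all invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: two candidate breaker-status hypotheses, each implying a
# different (hypothetical) measurement matrix H. We observe
# y = H_true @ x + noise and score each hypothesis by
# log-likelihood + log-prior, i.e. the MAP rule.
x = np.array([1.0, -0.5])
H = {  # hypothetical topologies
    "closed": np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]),
    "open":   np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]),
}
prior = {"closed": 0.9, "open": 0.1}  # prior belief on breaker status
sigma = 0.05

y = H["closed"] @ x + sigma * rng.standard_normal(3)

def map_score(name):
    r = y - H[name] @ x
    log_lik = -0.5 * np.sum(r**2) / sigma**2
    return log_lik + np.log(prior[name])

best = max(H, key=map_score)
print(best)
```

Embedding the prior is what helps under low measurement redundancy: when the likelihoods of two hypotheses are close, the prior term breaks the tie toward the more plausible breaker configuration.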

For the seismic imaging system, we develop several innovative in-situ seismic imaging schemes in which each sensor node computes the tomography based on its partial information and through gossip with local neighbors. The seismic data are generated in a distributed fashion to begin with. Different from the conventional approach of collecting the data first and processing them afterwards, our proposed in-situ data computing methodology is much more efficient. The underlying mechanisms avoid the bandwidth bottleneck, since all data are processed in a distributed manner and only limited decisional information is communicated. Furthermore, the proposed algorithms deliver quicker insights than the state of the art in seismic imaging. Hence they are more promising solutions for real-time in-situ data analytics, which is in high demand in disaster monitoring applications. Through extensive experiments, we demonstrate that the proposed data computing methods achieve near-optimal, high-quality seismic tomography, retain low communication cost, and provide real-time seismic data analytics.
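The gossip primitive underlying such schemes can be sketched in a few lines. In this hypothetical example, six nodes on a ring each hold a local partial estimate and repeatedly average with a neighbor; the values converge to the global mean without any central data collection, which is the mechanism that avoids the bandwidth bottleneck:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sketch: 6 sensor nodes on a ring, each starting from a
# local partial estimate. Randomized pairwise gossip: at each step a
# random node averages its value with a ring neighbor. All values
# converge to the global mean using only local exchanges.
values = np.array([4.0, 0.0, 2.0, 6.0, 1.0, 5.0])
target = values.mean()
n = len(values)

for _ in range(5000):
    i = rng.integers(n)
    j = (i + 1) % n                      # ring neighbor
    avg = 0.5 * (values[i] + values[j])  # local exchange only
    values[i] = values[j] = avg

print(np.allclose(values, target, atol=1e-6))
```

Only the current scalar estimate crosses a link at each step, which is the "limited decisional information" pattern; raw seismic traces never leave the node.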

Topic modeling is one of the most recent techniques for discovering hidden thematic structures in large data collections without human supervision. Several topic models have been proposed in various fields of study and have been utilized extensively for many applications. Latent Dirichlet Allocation (LDA) is the most well-known topic model; it generates topics from large corpora of resources, such as text, images, and audio. It has been widely used in many areas of information retrieval and data mining, providing an efficient way of identifying latent topics among document collections. However, LDA has a drawback: topic cohesion within a concept is attenuated when estimating infrequently occurring words. Moreover, LDA does not consider the meaning of words, but rather infers hidden topics based on a statistical approach. As a result, LDA can cause either a reduction in the quality of topic words or an increase in loose relations between topics.

In order to solve these problems, we propose a domain-specific topic model that combines domain concepts with LDA. Two domain-specific algorithms are suggested for overcoming the difficulties associated with LDA. The main strength of our proposed model comes from the fact that it narrows semantic concepts from broad domain knowledge down to a specific one, which solves the unknown domain problem. Our proposed model is extensively tested on various applications (query expansion, classification, and summarization) to demonstrate its effectiveness. Experimental results show that the proposed model significantly increases the performance of these applications.
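To make the statistical inference LDA performs concrete, here is a minimal collapsed Gibbs sampler on a tiny synthetic corpus with an obvious two-topic structure (animal words vs. computer words); the vocabulary, documents, and hyperparameters are invented for illustration and this is standard LDA, not the proposed domain-specific model:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy corpus: documents are lists of word ids over a tiny vocabulary.
vocab = ["cat", "dog", "fur", "cpu", "gpu", "ram"]
docs = [
    [0, 1, 2, 0, 1],   # animal-themed documents
    [2, 0, 1, 2, 0],
    [3, 4, 5, 3, 4],   # computer-themed documents
    [5, 3, 4, 5, 3],
]
K, V, alpha, beta = 2, len(vocab), 0.1, 0.01

# Count tables: doc-topic, topic-word, topic totals; random init.
ndk = np.zeros((len(docs), K)); nkw = np.zeros((K, V)); nk = np.zeros(K)
z = []
for d, doc in enumerate(docs):
    zs = []
    for w in doc:
        k = rng.integers(K)
        zs.append(k); ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    z.append(zs)

for _ in range(200):                      # collapsed Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]                   # remove current assignment
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            k = int(rng.choice(K, p=p / p.sum()))
            z[d][i] = k                   # resample and restore counts
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

# The two document groups should land in different dominant topics.
dominant = ndk.argmax(axis=1)
print(dominant)
```

The purely count-based sampling update is also a compact way to see the drawback discussed above: word identity enters only through co-occurrence counts, never through word meaning.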

The second part of this thesis presents a set of tools to differentially analyze metabolic pathways from RNA-Seq data. Metabolic pathways are series of chemical reactions occurring within a cell. We focus on two main problems in metabolic pathway differential analysis, namely, differential analysis of their inferred activity levels and of their estimated abundances. We validate our approaches through differential expression analysis at the transcript and gene levels and also through real-time quantitative PCR experiments. In the fourth part, we present the different packages created or updated in the course of this study. We conclude with our plans for future work on further improving IsoDE 2.0.

My dissertation research focuses on the problem of searching for genome-wide associations under three frequently encountered scenarios, i.e., one case and one control, multiple cases and multiple controls, and Linkage Disequilibrium (LD) block structure. For the first scenario, we present a simple and fast method, named DCHE, using dynamic clustering. Also, we design two methods, a Bayesian inference based method and a heuristic method, to detect genome-wide multi-locus epistatic interactions on multiple diseases. For the last scenario, we propose a block-based Bayesian approach to model the LD and conditional disease associations simultaneously. Experimental results on both synthetic and real GWAS datasets show that the proposed methods improve the detection accuracy of disease-specific associations and lessen the computational cost compared with currently popular methods.
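For readers unfamiliar with the one-case/one-control setting, the basic single-SNP association test it builds on can be sketched as follows; the genotype counts are invented, and this plain Pearson chi-square test is background material, not the DCHE method itself:

```python
import numpy as np

# Hypothetical genotype counts (AA, Aa, aa) for one SNP in cases vs.
# controls, scored with a Pearson chi-square statistic on the 2x3
# contingency table. A large statistic suggests association.
table = np.array([
    [10, 40, 50],   # cases
    [35, 45, 20],   # controls
], dtype=float)

row = table.sum(axis=1, keepdims=True)
col = table.sum(axis=0, keepdims=True)
expected = row * col / table.sum()          # counts under independence
chi2 = ((table - expected) ** 2 / expected).sum()
print(round(float(chi2), 2))                # → 27.04
```

Genome-wide searches repeat such a test (or a clustering-based refinement of it, as in DCHE) across millions of SNPs, which is why computational cost is a central concern.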

In addition to these parallel algorithms, the other main contributions of this dissertation are 1) multi-core and many-core implementations for clipping a pair of polygons and 2) MPI-GIS and Hadoop Topology Suite for distributed polygon overlay using a cluster of nodes. An Nvidia GPU and CUDA are used for the many-core implementation. The MPI-based system achieves a 44X speedup while processing about 600K polygons in two real-world GIS shapefiles (USA Detailed Water Bodies and USA Block Group Boundaries) within 20 seconds on a 32-node (8 cores each) IBM iDataPlex cluster interconnected by InfiniBand technology.
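The sequential kernel that such systems parallelize is pairwise polygon clipping. As a minimal sketch, assuming a convex clip polygon, here is the classic Sutherland-Hodgman algorithm clipping a triangle against the unit square; the coordinates are illustrative, not taken from the cited shapefiles:

```python
# Hypothetical sketch of the core clipping kernel: Sutherland-Hodgman
# clipping of a subject polygon against a convex clip polygon.

def clip(subject, clip_poly):
    """Clip `subject` against convex `clip_poly` (CCW vertex lists)."""
    out = list(subject)
    n = len(clip_poly)
    for i in range(n):
        a, b = clip_poly[i], clip_poly[(i + 1) % n]

        def inside(p):      # on or left of the directed edge a->b
            return (b[0]-a[0])*(p[1]-a[1]) - (b[1]-a[1])*(p[0]-a[0]) >= 0

        def intersect(p, q):   # segment p-q against the line through a-b
            dx1, dy1 = q[0]-p[0], q[1]-p[1]
            dx2, dy2 = b[0]-a[0], b[1]-a[1]
            t = ((a[0]-p[0])*dy2 - (a[1]-p[1])*dx2) / (dx1*dy2 - dy1*dx2)
            return (p[0] + t*dx1, p[1] + t*dy1)

        inp, out = out, []
        for j in range(len(inp)):
            p, q = inp[j], inp[(j + 1) % len(inp)]
            if inside(q):
                if not inside(p):
                    out.append(intersect(p, q))
                out.append(q)
            elif inside(p):
                out.append(intersect(p, q))
    return out

def area(poly):            # shoelace formula
    return 0.5 * abs(sum(p[0]*q[1] - q[0]*p[1]
                         for p, q in zip(poly, poly[1:] + poly[:1])))

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
triangle = [(-0.5, 0.5), (0.5, 0.0), (0.5, 1.0)]
print(area(clip(triangle, square)))   # the part of the triangle with x >= 0
```

Each polygon pair can be clipped independently, which is exactly the data parallelism the multi-core, GPU, and MPI implementations exploit.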

The success of a DES system lies in two factors: the quality of the base learners and the optimality of ensemble selection. The DES-RE approach proposed in our work addresses these two challenges respectively. 1) Local expertise enhancement: a novel data sampling and weighting strategy that combines the advantages of bagging and boosting is employed to increase the local expertise of the base learners in order to facilitate the later ensemble selection. 2) Competence region optimization: DES-RE learns a distance metric to form better competence regions (aka neighborhoods) that promote strong base learners with respect to a specific query pattern. In addition to performing local expertise enhancement and competence region optimization independently, we propose an expectation-maximization (EM) framework that combines the two procedures. For all the proposed algorithms, extensive simulations are conducted to validate their performance.
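The core DES step that both enhancements build on can be sketched as follows: for a query point, form a competence region from its k nearest validation neighbors and select the base learner with the highest local accuracy there. The data, the two threshold-rule "learners", and all parameters are invented for illustration; DES-RE additionally learns the distance metric rather than using plain Euclidean distance:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical validation set on [-1, 1] with label y = 1 iff x > 0, and
# two deliberately imperfect base learners, each an expert on one side.
X_val = rng.uniform(-1, 1, size=(200, 1))
y_val = (X_val[:, 0] > 0).astype(int)

def learner_a(x): return (x > -0.2).astype(int)  # good for x > 0
def learner_b(x): return (x > 0.2).astype(int)   # good for x < 0

def des_predict(xq, k=15):
    d = np.abs(X_val[:, 0] - xq)
    idx = np.argsort(d)[:k]                      # competence region
    accs = [np.mean(f(X_val[idx, 0]) == y_val[idx])
            for f in (learner_a, learner_b)]     # local accuracies
    best = (learner_a, learner_b)[int(np.argmax(accs))]
    return int(best(np.array([xq]))[0])

print(des_predict(0.1), des_predict(-0.1))
```

Each query is routed to the learner that is locally competent, so the ensemble is correct on both sides even though neither base learner is globally correct.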

In the first part of the dissertation, we present a path-based ILP model for the VNE problem. Our solution employs a branch-and-bound framework to resolve the integrality constraints, while embedding the column generation process to effectively obtain the lower bound for branch pruning. Different from existing approaches, the proposed solution can obtain either an optimal solution or a near-optimal solution with a guarantee on the solution quality.
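The branch-and-bound pattern itself, independent of the VNE model, can be sketched on a tiny 0/1 knapsack instance: branch on one binary variable at a time and prune any branch whose LP-relaxation bound cannot beat the incumbent. The instance is invented, and the fractional-knapsack bound stands in for the column-generation bound of the actual model:

```python
# Hypothetical branch-and-bound sketch on a 0/1 knapsack.
# Items are pre-sorted by value/weight ratio (6, 5, 4).
values = [60, 100, 120]
weights = [10, 20, 30]
capacity = 50

def lp_bound(i, cap, val):
    # LP-relaxation (fractional-knapsack) upper bound over items i..
    for v, w in zip(values[i:], weights[i:]):
        if w <= cap:
            cap -= w; val += v
        else:
            return val + v * cap / w   # take a fraction of one item
    return val

best = 0
def branch(i, cap, val):
    global best
    if i == len(values):
        best = max(best, val)          # leaf: update incumbent
        return
    if lp_bound(i, cap, val) <= best:
        return                         # prune: bound cannot beat incumbent
    if weights[i] <= cap:              # branch 1: take item i
        branch(i + 1, cap - weights[i], val + values[i])
    branch(i + 1, cap, val)            # branch 0: skip item i

branch(0, capacity, 0)
print(best)   # optimal value 220 (items 2 and 3)
```

The quality of the relaxation bound drives how much of the tree is pruned, which is why the dissertation invests in column generation to tighten it.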

A common strategy in VNE algorithm design is to decompose the problem into two sequential sub-problems: node assignment (NA) and link mapping (LM). With this approach, some solution quality is inevitably sacrificed, since the NA step is neither holistic nor reversible. In the second part, we are motivated to answer the question: is it possible to maintain the simplicity of the divide-and-conquer strategy while still achieving optimality? Our answer is based on a decomposition framework supported by the primal-dual analysis of the path-based ILP model.

This dissertation also attempts to address issues on two frontiers of network virtualization: survivability, and the integration of an optical substrate. In the third part, we address the survivable network embedding (SNE) problem from a network flow perspective, considering both splittable and non-splittable flows. In addition, the explosive growth of Internet traffic calls for the support of a bandwidth-abundant optical substrate, despite the extra dimensions of complexity caused by the heterogeneity of optical resources and the physical features of optical transmission. In the fourth part, we present a holistic view of the motivation, architecture, and challenges on the way towards a virtualized optical substrate that supports network virtualization.

Wavelet descriptors have been widely used in multi-resolution image analysis. However, making the wavelet transform shift- and rotation-invariant produces redundancy and requires complex matching processes. As for other multi-resolution descriptors, they usually depend on additional theories or information, such as filtering functions or prior domain knowledge; this not only increases the computational complexity but also introduces errors.

We propose a novel multi-resolution scheme that is capable of transforming any kind of image descriptor into its multi-resolution structure with high computational accuracy and efficiency. Our multi-resolution scheme is based on sub-sampling an image into an odd-even image tree. By applying image descriptors to the odd-even image tree, we obtain the corresponding multi-resolution image descriptors. Multi-resolution analysis is based on downsampling expansion with maximum energy extraction followed by upsampling reconstruction. Since the maximum energy is usually retained in the lowest-frequency coefficients, we perform maximum energy extraction by keeping the lowest coefficients from each resolution level.
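One level of the odd-even subsampling can be sketched directly: a parent image splits into four subimages by (even/odd row, even/odd column), and because the children exactly partition the parent's pixels, no artifacts are introduced and energy is preserved across the level. The image here is random toy data:

```python
import numpy as np

rng = np.random.default_rng(4)

# One level of a hypothetical odd-even image tree: split an image into
# four subimages by (even/odd row, even/odd column). The children are an
# exact partition of the parent's pixels, so total energy is preserved.
img = rng.standard_normal((8, 8))

def odd_even_split(a):
    return [a[0::2, 0::2], a[0::2, 1::2],
            a[1::2, 0::2], a[1::2, 1::2]]

children = odd_even_split(img)
energy_parent = np.sum(img ** 2)
energy_children = sum(np.sum(c ** 2) for c in children)
print(np.isclose(energy_parent, energy_children),
      all(c.shape == (4, 4) for c in children))
```

Applying the split recursively to each child yields the full tree, and any image descriptor evaluated per node yields its multi-resolution counterpart.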

Our multi-resolution scheme can analyze images recursively and effectively without introducing artifacts or changes to the original images. It produces multi-resolution representations, reconstructs higher-resolution images using only information from lower resolutions, compresses data, filters noise, extracts effective image features, and can be implemented with parallel processing.

In order to accurately predict HIV drug resistance, two main tasks need to be solved: how to encode the protein structure, extracting the most useful information and feeding it into the machine learning tools; and which kind of machine learning tool to choose. In our research, we first proposed a new protein encoding algorithm, which can convert proteins of various sizes into fixed-size vectors. This algorithm makes it possible to feed protein structure information to most state-of-the-art machine learning algorithms. In the next step, we also proposed a new classification algorithm based on sparse representation. Following that, mean shift and quantile regression were included to help extract feature information from the data. Our results show that encoding protein structure using our newly proposed method is very efficient and yields consistently higher accuracy regardless of the type of machine learning tool. Furthermore, our new classification algorithm based on sparse representation is the first application of sparse representation to biological data, and its results are comparable to other state-of-the-art classification algorithms, for example ANN, SVM, and multiple regression. Finally, mean shift and quantile regression provided us with the potentially most important drug-resistant mutants, and such results may help biologists and chemists determine which mutants are the most representative candidates for further research.
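The general sparse representation classification pattern can be sketched as follows: a query is coded as a sparse combination of training columns (here via a few steps of orthogonal matching pursuit) and assigned to the class whose columns yield the smallest reconstruction residual. The dictionary is random toy data, and OMP is one common stand-in for the sparse coder, not necessarily the dissertation's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical dictionary: 15 training columns per class in R^20.
d, n_per_class = 20, 15
A = np.hstack([rng.standard_normal((d, n_per_class)),   # class 0
               rng.standard_normal((d, n_per_class))])  # class 1
A /= np.linalg.norm(A, axis=0)                          # unit-norm atoms
labels = np.array([0] * n_per_class + [1] * n_per_class)

def omp(A, y, k=5):
    # Orthogonal matching pursuit: greedily grow a support of size k.
    support, r = [], y.copy()
    for _ in range(k):
        support.append(int(np.argmax(np.abs(A.T @ r))))
        x_s, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        r = y - A[:, support] @ x_s
    x = np.zeros(A.shape[1]); x[support] = x_s
    return x

def classify(y):
    x = omp(A, y)
    residuals = [np.linalg.norm(y - A[:, labels == c] @ x[labels == c])
                 for c in (0, 1)]        # per-class reconstruction error
    return int(np.argmin(residuals))

# A query built from class-1 atoms should be labeled 1.
y = A[:, 18] + 0.5 * A[:, 22] + 0.01 * rng.standard_normal(d)
print(classify(y))
```

The per-class residual rule is what makes the sparse code interpretable: the class whose training samples best explain the query wins.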

Our contributions include (1) transcript and gene expression level estimation methods, (2) methods for genome-guided and annotation-guided transcriptome reconstruction, and (3) *de novo* assembly and annotation of real data sets. Transcript expression level estimation, also referred to as transcriptome quantification, tackles the problem of estimating the expression level of each transcript. Transcriptome quantification analysis is crucial for determining similar transcripts and for unraveling gene functions and transcription regulation mechanisms. We propose a novel simulated regression based method for transcriptome frequency estimation from RNA-Seq reads. Transcriptome reconstruction refers to the problem of reconstructing the transcript sequences from RNA-Seq data. We present genome-guided and annotation-guided transcriptome reconstruction methods. Empirical results on both synthetic and real RNA-Seq datasets show that the proposed methods improve transcriptome quantification and reconstruction accuracy compared to current state-of-the-art methods. We further present the assembly and annotation of the *Bugula neritina* transcriptome (a marine colonial animal) and the Tallapoosa darter genome (a freshwater fish from a species-rich radiation).
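The central difficulty in transcriptome quantification is that many reads map ambiguously to several transcripts. A standard baseline (sketched here for context, not the proposed simulated regression method) resolves this with EM: fractionally assign ambiguous reads given current abundances (E-step), then re-estimate abundances from the soft assignments (M-step). The compatibility matrix is a toy example, not real RNA-Seq data:

```python
import numpy as np

# Rows are reads, columns are transcripts; entry 1 means the read maps
# to that transcript. Reads 3 and 4 are ambiguous. (Toy data.)
compat = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 1],
], dtype=float)

theta = np.full(3, 1 / 3)                  # initial abundances
for _ in range(200):
    w = compat * theta                     # E-step: soft read assignment
    w /= w.sum(axis=1, keepdims=True)
    theta = w.sum(axis=0) / len(compat)    # M-step: renormalized counts

print(np.round(theta, 3))
```

On this instance the maximum-likelihood solution assigns all the ambiguous mass away from the middle transcript, converging to abundances of 3/7, 0, and 4/7.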

This work develops algorithms and software to perform data assimilation for dynamic data driven simulation through non-parametric statistical inference based on sequential Monte Carlo (SMC) methods (also called particle filters). A bootstrap particle filter based data assimilation framework is developed first, where the proposal distribution is constructed from the simulation models and the statistical characteristics of the noise. The bootstrap particle filter based framework is relatively easy to implement. However, it is ineffective when the uncertainty of the simulation model is much larger than that of the observation model (i.e., a peaked likelihood) or when rare events happen. To improve the effectiveness of data assimilation, a new data assimilation framework, named the SenSim framework, is then proposed; it has a more advanced proposal distribution that uses knowledge from both the simulation models and the sensor readings. Both the bootstrap particle filter based framework and the SenSim framework are applied and evaluated in two case studies: wildfire spread simulation and lane-based traffic simulation. Experimental results demonstrate the effectiveness of the proposed data assimilation methods. A software package is also created to encapsulate the different components of SMC methods for supporting data assimilation of general simulation models.
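The bootstrap particle filter's predict-weight-resample cycle can be sketched on a one-dimensional random-walk state with noisy observations; the model, noise levels, and particle count are invented stand-ins for the wildfire and traffic simulation models:

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical 1-D state-space model: random-walk state, noisy
# observations. Bootstrap filter: propose via the transition model,
# weight by the observation likelihood, then resample.
T, N = 50, 1000
proc_std, obs_std = 0.3, 0.5

x_true = np.cumsum(proc_std * rng.standard_normal(T))  # ground truth
y_obs = x_true + obs_std * rng.standard_normal(T)      # sensor readings

particles = np.zeros(N)
estimates = []
for t in range(T):
    particles += proc_std * rng.standard_normal(N)          # predict
    w = np.exp(-0.5 * ((y_obs[t] - particles) / obs_std) ** 2)
    w /= w.sum()                                            # weight
    estimates.append(np.sum(w * particles))                 # posterior mean
    particles = particles[rng.choice(N, size=N, p=w)]       # resample

rmse_filter = np.sqrt(np.mean((np.array(estimates) - x_true) ** 2))
rmse_raw = np.sqrt(np.mean((y_obs - x_true) ** 2))
print(rmse_filter < rmse_raw)
```

Because the proposal here is the transition model alone, a very peaked likelihood would leave most particles with near-zero weight; conditioning the proposal on the sensor readings as well, as SenSim does, is the standard remedy.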
