Computer Science Faculty Publications

Recent Submissions

  • Item
    Parallel Progressive Multiple Sequence Alignment on Reconfigurable Meshes
    (2011-01-01) Nguyen, Ken; Pan, Yi; Nong, Ge; Georgia State University

    Background: One of the most fundamental and challenging tasks in bioinformatics is to identify related sequences and their hidden biological significance. The most popular and proven method for accomplishing this task is aligning multiple sequences together. However, multiple sequence alignment is a computationally intensive task. In addition, advances in DNA/RNA and protein sequencing techniques have produced a vast number of sequences to be analyzed, exceeding the capability of traditional computing models. Therefore, an effective parallel multiple sequence alignment model capable of resolving these issues is in great demand.

    Results: We design O(1) run-time solutions for both local and global dynamic programming pair-wise alignment algorithms on the reconfigurable mesh computing model. To align m sequences with maximum length n, we combine the parallel pair-wise dynamic programming solutions with newly designed parallel components. We reduce the run-time complexity of the progressive multiple sequence alignment algorithm from O(m × n^4) to O(m) using O(m × n^3) processing units for scoring schemes that use three distinct values for match/mismatch/gap extension. The general solution to the multiple sequence alignment algorithm takes O(m × n^4) processing units and completes in O(m) time.

    Conclusions: To our knowledge, this is the first time the progressive multiple sequence alignment algorithm has been completely parallelized with O(m) run-time. We also provide a new parallel algorithm for the Longest Common Subsequence (LCS) with O(1) run-time using O(n^3) processing units. This improves on the current best constant-time algorithm, which uses O(n^4) processing units.
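
    For orientation, the LCS problem mentioned above can be illustrated with the textbook sequential dynamic program. This is a minimal sketch of the classical O(n^2) baseline, not the constant-time reconfigurable mesh algorithm described in the abstract; the function name `lcs_length` is chosen here for illustration.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (sequential DP)."""
    # prev[j] is the LCS length of the already processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]  # LCS with an empty prefix of b is always 0
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the best alignment of the two shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Drop the last character of either a's prefix or b's prefix.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(lcs_length("AGGTAB", "GXTXAYB"))
```

    The sequential version visits every cell of an (n+1) × (n+1) table one at a time; the parallel results in the abstract concern evaluating this kind of table on many processing units at once.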

  • Item
    Biological Network Motif Detection and Evaluation
    (2011-01-01) Kim, Wooyoung; Li, Min; Wang, Jianxin; Pan, Yi; Georgia State University

    Background: Molecular-level biological data can be assembled into system-level data in the form of biological networks. Network motifs are defined as over-represented small connected subgraphs in networks, and they have been used in many biological applications. Since network motif discovery involves computationally challenging processes, previous algorithms have focused on computational efficiency. However, we believe that the biological quality of network motifs is also very important.

    Results: In this paper, we define biological network motifs as biologically significant subgraphs and distinguish them from traditional network motifs, which we call structural network motifs. We develop five algorithms, namely EDGEGO-BNM, EDGEBETWEENNESS-BNM, NMF-BNM, NMFGO-BNM and VOLTAGE-BNM, for efficient detection of biological network motifs, and introduce several evaluation measures, including motifs included in complex, motifs included in functional module, and GO term clustering score. Experimental results show that EDGEGO-BNM and EDGEBETWEENNESS-BNM perform better than existing algorithms and that all of our algorithms are also applicable to finding structural network motifs.

    Conclusion: We provide new approaches to finding network motifs in biological networks. Our algorithms efficiently detect biological network motifs and further improve existing algorithms to find high-quality structural network motifs, which would be impossible using the existing algorithms alone. The performance of the algorithms is compared using our new evaluation measures in biological contexts. We believe that our work offers some guidelines for network motif research in biological networks.

  • Item
    A Comparison of the Functional Modules Identified from Time Course and Static PPI Network Data
    (2011-01-01) Tang, Xiwei; Wang, Jianxin; Liu, Binbin; Li, Min; Chen, Gang; Pan, Yi; Georgia State University

    Background: Cellular systems are highly dynamic and responsive to cues from the environment. Cellular function and response patterns to external stimuli are regulated by biological networks. A protein-protein interaction (PPI) network with static connectivity is dynamic in the sense that the nodes implement so-called functional activities that evolve in time. The shift from static to dynamic network analysis is essential for further understanding of molecular systems.

    Results: In this paper, Time Course Protein Interaction Networks (TC-PINs) are reconstructed by incorporating time series gene expression into PPI networks. Then, a clustering algorithm is used to create functional modules from three kinds of networks: the TC-PINs, a static PPI network and a pseudorandom network. For the functional modules from the TC-PINs, repetitive modules and modules contained within bigger modules are removed. Finally, matching and GO enrichment analyses are performed to compare the functional modules detected from those networks.

    Conclusions: The comparative analyses show that the functional modules from the TC-PINs have much more significant biological meaning than those from static PPI networks. Moreover, this implies that many studies performed on static PPI networks can be carried out on the TC-PINs, with much more satisfactory experimental results. The 36 PPI networks corresponding to 36 time points, identified as part of this study, and other materials are available at http://bioinfo.csu.edu.cn/txw/TC-PINs.
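
    The idea of combining a static PPI network with time series expression can be sketched as follows. This is only an illustration under stated assumptions, not the authors' procedure: it assumes a protein is "active" at a time point when its expression value reaches a single hypothetical cutoff `threshold`, and that an interaction is kept at that time point when both endpoints are active. The function name and data layout are also hypothetical.

```python
def time_course_networks(edges, expression, threshold):
    """Return one edge list per time point.

    edges:      iterable of (protein_a, protein_b) pairs from the static network.
    expression: non-empty dict mapping protein -> list of expression values,
                one value per time point (all lists the same length).
    threshold:  hypothetical activity cutoff (an assumption of this sketch).
    """
    edges = list(edges)
    n_points = len(next(iter(expression.values())))
    networks = []
    for t in range(n_points):
        # Proteins whose expression reaches the cutoff at time point t.
        active = {p for p, series in expression.items() if series[t] >= threshold}
        # Keep only interactions whose two endpoints are both active at t.
        networks.append([(a, b) for a, b in edges if a in active and b in active])
    return networks


if __name__ == "__main__":
    static_edges = [("A", "B"), ("B", "C")]
    expr = {"A": [1.0, 0.2], "B": [0.9, 0.8], "C": [0.1, 0.9]}
    for t, net in enumerate(time_course_networks(static_edges, expr, 0.5)):
        print(t, net)
```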

  • Item
    A New Essential Protein Discovery Method Based on the Integration of Protein-protein Interaction and Gene Expression Data
    (2012-01-01) Li, Min; Zhang, Hanhui; Wang, Jian-xin; Pan, Yi; Georgia State University

    The article describes a study of the essential protein discovery method PeC, which is based on the integration of protein-protein interaction and gene expression data. PeC was developed on the basis of the definitions of the edge clustering coefficient (ECC) and Pearson's correlation coefficient (PCC). A list of essential proteins of Saccharomyces cerevisiae was collected for the study.

    Background: Identification of essential proteins is always a challenging task since it requires experimental approaches that are time-consuming and laborious. With the advances in high-throughput technologies, a large number of protein-protein interactions are available, which has produced unprecedented opportunities for detecting protein essentiality at the network level. A series of computational approaches has been proposed for predicting essential proteins based on network topologies. However, the network topology-based centrality measures are very sensitive to the robustness of the network. Therefore, a new robust essential protein discovery method would be of great value.

    Results: In this paper, we propose a new centrality measure, named PeC, based on the integration of protein-protein interaction and gene expression data. The performance of PeC is validated on the protein-protein interaction network of Saccharomyces cerevisiae. The experimental results show that the prediction precision of PeC clearly exceeds that of fifteen previously proposed centrality measures: Degree Centrality (DC), Betweenness Centrality (BC), Closeness Centrality (CC), Subgraph Centrality (SC), Eigenvector Centrality (EC), Information Centrality (IC), Bottle Neck (BN), Density of Maximum Neighborhood Component (DMNC), Local Average Connectivity-based method (LAC), Sum of ECC (SoECC), Range-Limited Centrality (RL), L-index (LI), Leader Rank (LR), Normalized α-Centrality (NC), and Moduland-Centrality (MC). In particular, the improvement of PeC over the classic centrality measures (BC, CC, SC, EC, and BN) is more than 50% when predicting no more than 500 proteins.

    Conclusions: We demonstrate that the integration of protein-protein interaction network and gene expression data can help improve the precision of predicting essential proteins. The new centrality measure, PeC, is an effective essential protein discovery method.
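
    As a rough illustration of how an edge clustering coefficient (ECC) and Pearson's correlation coefficient (PCC) might be combined into a per-protein score, consider the sketch below. The abstract does not give the formula, so the combination used here (summing ECC × PCC over a protein's neighbors) and the ECC definition (shared neighbors divided by the smaller of the two degrees minus one) are assumptions of this example, and all function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def edge_clustering_coefficient(adj, u, v):
    """Assumed ECC: triangles through edge (u, v) over the maximum possible."""
    triangles = len(adj[u] & adj[v])
    denom = min(len(adj[u]) - 1, len(adj[v]) - 1)
    return triangles / denom if denom > 0 else 0.0


def combined_scores(adj, expression):
    """Assumed combination: sum of ECC * PCC over each protein's neighbors."""
    return {
        u: sum(
            edge_clustering_coefficient(adj, u, v)
            * pearson(expression[u], expression[v])
            for v in adj[u]
        )
        for u in adj
    }


if __name__ == "__main__":
    # Symmetric adjacency sets: a triangle A-B-C with D attached to A.
    adj = {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"}, "D": {"A"}}
    expr = {"A": [1, 2, 3], "B": [2, 4, 6], "C": [1, 3, 2], "D": [1, 2, 3]}
    print(combined_scores(adj, expr))
```

    In this toy network the pendant protein D scores zero even though its expression matches A exactly, because the edge A-D lies in no triangle; this shows how a topology term and a co-expression term can temper each other.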

  • Item
    SLC30A3 (ZnT3) Oligomerization by Dityrosine Bonds Regulates Its Subcellular Localization and Metal Transport Capacity
    (2009-01-01) Salazar, Gloria; Falcon-Perez, Juan M.; Harrison, Robert W.; Faundez, Victor; Georgia State University

    Non-covalent and covalent homo-oligomerization of membrane proteins regulates their subcellular localization and function. Here, we describe a novel oligomerization mechanism affecting solute carrier family 30 member 3/zinc transporter 3 (SLC30A3/ZnT3). Oligomerization was mediated by intermolecular covalent dityrosine bonds. Using mutagenized ZnT3 expressed in PC12 cells, we identified two critical tyrosine residues necessary for dityrosine-mediated ZnT3 oligomerization. The Y372F mutation prevented ZnT3 oligomerization, decreased ZnT3 targeting to synaptic-like microvesicles (SLMVs), and decreased resistance to zinc toxicity. Strikingly, ZnT3 harboring the Y357F mutation behaved as a “gain-of-function” mutant, as it displayed increased ZnT3 oligomerization, increased targeting to SLMVs, and increased resistance to zinc toxicity. Single and double tyrosine ZnT3 mutants indicate that the predominant dimeric species is formed between tyrosine 357 and 372. ZnT3 tyrosine dimerization was detected under normal conditions and was enhanced by oxidative stress. Covalent species were also detected in other SLC30A zinc transporters localized in different subcellular compartments. These results indicate that covalent tyrosine dimerization of a SLC30A family member modulates its subcellular localization and zinc transport capacity. We propose that dityrosine-dependent membrane protein oligomerization may regulate the function of diverse membrane proteins in normal and disease states.

  • Item
    Iteration Method for Predicting Essential Proteins Based on Orthology and Protein-protein Interaction Networks
    (2012-01-01) Peng, Wei; Wang, Jianxin; Wang, Weiping; Liu, Qing; Wu, Fang-Xiang; Pan, Yi; Georgia State University

    Background: Identification of essential proteins plays a significant role in understanding the minimal requirements for cellular survival and development. Many computational methods have been proposed for predicting essential proteins by using the topological features of protein-protein interaction (PPI) networks. However, most of these methods ignore the intrinsic biological meaning of proteins. Moreover, PPI data contain many false positives and false negatives. To overcome these limitations, many research groups have recently started to focus on identifying essential proteins by integrating PPI networks with other biological information. However, none of their methods has been widely acknowledged.

    Results: Considering that essential proteins are more evolutionarily conserved than nonessential proteins and that essential proteins frequently bind each other, we propose an iteration method, named ION, for predicting essential proteins by integrating orthology with PPI networks. Unlike other methods, ION identifies essential proteins based not only on the connections between proteins but also on their orthologous properties and the features of their neighbors. ION is applied to predict essential proteins in S. cerevisiae. Experimental results show that ION achieves higher identification accuracy than eight other existing centrality methods in terms of area under the curve (AUC). Moreover, ION identifies a large number of essential proteins that are ignored by the eight other existing centrality methods because of their low connectivity. Many proteins ranked in the top 100 by ION are both essential and belong to complexes with certain biological functions. Furthermore, no matter how many reference organisms are selected, ION outperforms all eight other existing centrality methods, while using as many reference organisms as possible can further improve the performance of ION. Additionally, ION also shows good prediction performance in E. coli K-12.

    Conclusions: The accuracy of predicting essential proteins can be improved by integrating the orthology with PPI networks.
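
    The general shape of an iteration that blends a per-protein orthology score with scores propagated from network neighbors can be sketched as below. This is a generic fixed-point iteration written for illustration, not ION's actual update rule, which the abstract does not state. The mixing weight `alpha`, the degree-normalized propagation, and the function name are all assumptions.

```python
def iterate_scores(adj, orthology, alpha=0.5, tol=1e-9, max_iter=1000):
    """Iterate s = (1 - alpha) * orthology + alpha * (neighbor contribution).

    adj:       dict mapping protein -> set of neighbors (symmetric).
    orthology: dict mapping protein -> initial orthology-based score.
    alpha:     hypothetical weight in [0, 1) given to the network term.
    """
    scores = dict(orthology)
    for _ in range(max_iter):
        new_scores = {}
        for u in adj:
            # Each neighbor v shares its score equally among its own neighbors.
            neighbor_part = sum(scores[v] / len(adj[v]) for v in adj[u])
            new_scores[u] = (1 - alpha) * orthology[u] + alpha * neighbor_part
        # Stop once no score changes by more than the tolerance.
        if max(abs(new_scores[u] - scores[u]) for u in adj) < tol:
            return new_scores
        scores = new_scores
    return scores


if __name__ == "__main__":
    adj = {"A": {"B"}, "B": {"A"}}
    print(iterate_scores(adj, {"A": 1.0, "B": 0.0}))
```

    With `alpha` below 1 the update is a contraction, so the scores settle to a unique fixed point regardless of the starting values; a protein with no orthology evidence of its own (B above) still receives a nonzero score through its neighbor.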

  • Item
    Cloud Computing for Detecting High-Order Genome-Wide Epistatic Interaction via Dynamic Clustering
    (2014-04-01) Guo, Xuan; Meng, Yu; Yu, Ning; Pan, Yi; Georgia State University; Georgia State University; Georgia State University

    Background: Taking advantage of high-throughput single nucleotide polymorphism (SNP) genotyping technology, large genome-wide association studies (GWASs) have been considered to hold promise for unravelling complex relationships between genotype and phenotype. At present, traditional single-locus-based methods are insufficient to detect multi-locus interactions, which are widespread in complex traits. In addition, statistical tests for high-order epistatic interactions involving more than 2 SNPs pose computational and analytical challenges because the computation increases exponentially as the cardinality of SNP combinations gets larger.

    Results: In this paper, we provide a simple, fast and powerful method using dynamic clustering and cloud computing to detect genome-wide multi-locus epistatic interactions. We have constructed systematic experiments to compare its power against some recently proposed algorithms, including TEAM, SNPRuler, EDCF and BOOST. Furthermore, we have applied our method to two real GWAS datasets, the age-related macular degeneration (AMD) and rheumatoid arthritis (RA) datasets, where we find some novel potential disease-related genetic factors that do not show up in the detection of 2-locus epistatic interactions.

    Conclusions: Experimental results on simulated data demonstrate that our method is more powerful than some recently proposed methods on both two- and three-locus disease models. Our method has discovered many novel high-order associations that are significantly enriched in cases from two real GWAS datasets. Moreover, the running times of the cloud implementation of our method for detecting two-locus interactions on the AMD and RA datasets are roughly 2 hours and 50 hours, respectively, on a cluster with forty small virtual machines. Therefore, we believe that our method is suitable and effective for the full-scale analysis of multiple-locus epistatic interactions in GWAS.
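
    The growth in the number of SNP combinations noted in the Background can be made concrete with a small calculation. The sketch below simply counts how many distinct SNP sets of a given order an exhaustive search would have to test; the SNP count used in the example is arbitrary and not taken from the study.

```python
from math import comb


def num_snp_combinations(num_snps: int, order: int) -> int:
    """Number of distinct unordered SNP sets of the given interaction order."""
    return comb(num_snps, order)


if __name__ == "__main__":
    # Arbitrary illustrative SNP count, not a figure from the paper.
    for order in (2, 3, 4):
        print(order, num_snp_combinations(1000, order))
```

    Even for only 1,000 SNPs, moving from pairs to triples multiplies the number of candidate tests by several hundred, which is why exhaustive high-order testing at genome-wide scale is impractical without pruning or parallelism.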

  • Item
    Towards the Identification of Protein Complexes and Functional Modules by Integrating PPI Network and Gene Expression Data
    (2012-01-01) Li, Min; Wu, Xuehong; Wang, Jianxin; Pan, Yi; Georgia State University

    Background: Identification of protein complexes and functional modules from protein-protein interaction (PPI) networks is crucial to understanding the principles of cellular organization and predicting protein functions. In the past few years, many computational methods have been proposed. However, most of them considered the PPI networks as static graphs and overlooked the dynamics inherent within these networks. Moreover, few of them can distinguish between protein complexes and functional modules.

    Results: In this paper, a new framework is proposed to distinguish between protein complexes and functional modules by integrating gene expression data into protein-protein interaction (PPI) data. A series of time-sequenced subnetworks (TSNs) is constructed according to the time at which the interactions were activated. The algorithm TSN-PCD was then developed to identify protein complexes from these TSNs. As protein complexes are significantly related to functional modules, a new algorithm, DFM-CIN, is proposed to discover functional modules based on the identified complexes. The experimental results show that the combination of temporal gene expression data with PPI data contributes to identifying protein complexes more precisely. A quantitative comparison based on f-measure reveals that our algorithm TSN-PCD outperforms previous protein complex discovery algorithms. Furthermore, we evaluate the identified functional modules by using “Biological Process” annotated in GO (Gene Ontology). The validation shows that the identified functional modules are statistically significant in terms of “Biological Process”. More importantly, the relationship between protein complexes and functional modules is studied.

    Conclusions: The proposed framework based on the integration of PPI data and gene expression data makes it possible to identify protein complexes and functional modules more effectively. Moreover, the proposed new framework and algorithms can distinguish between protein complexes and functional modules. Our findings suggest that functional modules are closely related to protein complexes and that a functional module may consist of one or multiple protein complexes. The program is available at http://netlab.csu.edu.cn/bioinfomatics/limin/DFM-CIN/index.

  • Item
    Accurate Viral Population Assembly From Ultra-Deep Sequencing Data
    (2014-06-01) Mangul, Serghei; Wu, Nicholas C.; Mancuso, Nicholas; Zelikovskiy, Alexander; Sun, Ren; Eskin, Eleazar; University of California, Los Angeles; University of California, Los Angeles; Georgia State University; Georgia State University; University of California, Los Angeles; University of California, Los Angeles

    Motivation: Next-generation sequencing technologies sequence viruses with ultra-deep coverage, thus promising to revolutionize our understanding of the underlying diversity of viral populations. While the sequencing coverage is high enough that even rare viral variants are sequenced, the presence of sequencing errors makes it difficult to distinguish between rare variants and sequencing errors.

    Results: In this article, we present a method to overcome the limitations of sequencing technologies and assemble a diverse viral population that allows for the detection of previously undiscovered rare variants. The proposed method consists of a high-fidelity sequencing protocol and an accurate viral population assembly method, referred to as Viral Genome Assembler (VGA). The proposed protocol is able to eliminate sequencing errors by using individual barcodes attached to the sequencing fragments. Highly accurate data in combination with deep coverage allow VGA to assemble rare variants. VGA uses an expectation–maximization algorithm to estimate abundances of the assembled viral variants in the population. Results on both synthetic and real datasets show that our method is able to accurately assemble an HIV viral population and detect rare variants previously undetectable due to sequencing errors. VGA outperforms state-of-the-art methods for genome-wide viral assembly. Furthermore, our method is the first viral assembly method that scales to millions of sequencing reads.

  • Item
    Distributed Power-Line Outage Detection Based on Wide Area Measurement System
    (2014-07-01) Zhao, Liang; Song, Wen-Zhan; Georgia State University; Georgia State University

    In modern power grids, the fast and reliable detection of power-line outages is an important functionality, which prevents cascading failures and facilitates accurate state estimation for monitoring the real-time conditions of the grids. However, most of the existing approaches for outage detection suffer from two drawbacks: (i) high computational complexity; and (ii) reliance on a centralized implementation. The high computational complexity restricts practical outage detection to the case of single-line or double-line outages. Meanwhile, the centralized implementation raises security and privacy issues. Considering these drawbacks, the present paper proposes a distributed framework, which carries out in-network information processing and only shares estimates on boundaries with the neighboring control areas. This novel framework relies on a convex-relaxed formulation of the line outage detection problem and leverages the alternating direction method of multipliers (ADMM) for its distributed solution. The proposed framework has low computational complexity, requiring only linear and simple matrix-vector operations. We also extend this framework to incorporate the sparse property of the measurement matrix and employ the LSQR algorithm to enable a warm start, which further accelerates the algorithm. Analysis and simulation tests validate the correctness and effectiveness of the proposed approaches.

  • Item
    Rechecking the Centrality-Lethality Rule in the Scope of Protein Subcellular Localization Interaction Networks
    (2015-06-01) Peng, Xiaoqing; Wang, Jianxin; Wang, Jun; Wu, FangXiang; Pan, Yi; Central South University; Central South University; Baylor College of Medicine; University of Saskatchewan; Georgia State University

    Essential proteins are indispensable for living organisms to maintain life activities and play important roles in the studies of pathology, synthetic biology, and drug design. Therefore, besides experimental methods, many computational methods have been proposed to identify essential proteins. Based on the centrality-lethality rule, various centrality methods are employed to predict essential proteins in a Protein-protein Interaction Network (PIN). However, because they neglect the temporal and spatial features of protein-protein interactions, the centrality scores calculated by centrality methods are not effective enough for measuring the essentiality of proteins in a PIN. Moreover, many methods that overfit the features of essential proteins for one species may perform poorly for other species. In this paper, we demonstrate that the centrality-lethality rule also exists in Protein Subcellular Localization Interaction Networks (PSLINs). To do this, a method based on Localization Specificity for Essential protein Detection (LSED) is proposed, which can be combined with any centrality method to calculate improved centrality scores by taking into consideration the PSLINs in which proteins play their roles. In this study, LSED was combined with eight centrality methods separately to calculate Localization-specific Centrality Scores (LCSs) for proteins based on the PSLINs of four species (Saccharomyces cerevisiae, Homo sapiens, Mus musculus and Drosophila melanogaster). Compared to the proteins with high centrality scores measured from the global PINs, more proteins with high LCSs measured from PSLINs are essential. This indicates that proteins with high LCSs measured from PSLINs are more likely to be essential and that the performance of centrality methods can be improved by LSED. Furthermore, LSED provides a widely applicable prediction model to identify essential proteins for different species.

  • Item
    A Novel Algorithm for Detecting Protein Complexes with the Breadth First Search
    (2014-01-01) Tang, Xiwei; Wang, Jianxin; Li, Min; He, Yiming; Pan, Yi; Central South University; Central South University; Central South University; Central South University; Georgia State University

    Most biological processes are carried out by protein complexes. A substantial number of false positives in protein-protein interaction (PPI) data can compromise the utility of the datasets for complex reconstruction. In order to reduce the impact of such discrepancies, a number of data integration and affinity scoring schemes have been devised. These methods encode the reliabilities (confidence) of physical interactions between pairs of proteins. The challenge now is to identify novel and meaningful protein complexes from the weighted PPI network. To address this problem, a novel protein complex mining algorithm, ClusterBFS (Cluster with Breadth-First Search), is proposed. Based on the weighted density, ClusterBFS detects protein complexes in the weighted network by a breadth-first search that originates from a given seed protein used as the starting point. The experimental results show that ClusterBFS performs significantly better than the other computational approaches in terms of the identification of protein complexes.
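
    One way to picture a breadth-first search that grows a cluster from a seed protein under a weighted density criterion is sketched below. This is an illustration under assumptions and not the published ClusterBFS algorithm: the density definition (sum of internal edge weights divided by the number of possible pairs), the acceptance rule (a hypothetical `min_density` cutoff), and the function names are all chosen for this example.

```python
from collections import deque


def weighted_density(cluster, weights):
    """Assumed density: internal edge weight sum over the number of node pairs."""
    n = len(cluster)
    if n < 2:
        return 0.0
    total = sum(w for (a, b), w in weights.items() if a in cluster and b in cluster)
    return total / (n * (n - 1) / 2)


def grow_cluster(seed, weights, min_density):
    """Grow a cluster from seed by BFS, keeping its density above min_density.

    weights: dict mapping (protein_a, protein_b) -> confidence weight,
             with each undirected edge listed once.
    """
    # Build an adjacency map from the weighted edge list.
    adj = {}
    for a, b in weights:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    cluster = {seed}
    visited = {seed}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for nb in sorted(adj.get(node, ())):  # sorted for deterministic order
            if nb in visited:
                continue
            visited.add(nb)
            # Accept the neighbor only if the enlarged cluster stays dense enough.
            if weighted_density(cluster | {nb}, weights) >= min_density:
                cluster.add(nb)
                queue.append(nb)
    return cluster


if __name__ == "__main__":
    w = {("A", "B"): 0.9, ("A", "C"): 0.8, ("B", "C"): 0.9, ("C", "D"): 0.1}
    print(sorted(grow_cluster("A", w, 0.7)))
```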

  • Item
    TRIP: A method for novel transcript reconstruction from paired-end RNA-seq reads
    (2012-01-01) Mangul, Serghei; Caciula, Adrian; Brinza, Dumitru; Măndoiu, Ion I; Zelikovskiy, Alexander; Georgia State University

    Preliminary experimental results on synthetic datasets generated with various sequencing parameters and distribution assumptions show that TRIP has increased transcriptome reconstruction accuracy compared to previous methods that ignore fragment length distribution information.

  • Item
    Identifying Dynamic Protein Complexes Based on Gene Expression Profiles and PPI Networks
    (2014-01-01) Li, Min; Chen, Weijie; Wang, Jianxin; Wu, Fang-Xiang; Pan, Yi; Central South University; Central South University; Central South University; University of Saskatchewan; Georgia State University

    Identification of protein complexes from protein-protein interaction networks has become a key problem for understanding cellular life in the postgenomic era. Many computational methods have been proposed for identifying protein complexes. Up to now, the existing computational methods have mostly been applied to static PPI networks. However, proteins and their interactions are dynamic in reality. Identifying dynamic protein complexes is more meaningful and challenging. In this paper, a novel algorithm, named DPC, is proposed to identify dynamic protein complexes by integrating PPI data and gene expression profiles. According to the Core-Attachment assumption, those proteins that are always active in the molecular cycle are regarded as core proteins. The protein-complex cores are identified from these always-active proteins by detecting dense subgraphs. Final protein complexes are extended from the protein-complex cores by adding attachments based on a topological character of “closeness” and dynamic meaning. The protein complexes produced by our algorithm DPC contain two parts: a static core expressed throughout the molecular cycle and short-lived dynamic attachments. The proposed algorithm DPC was applied to the data of Saccharomyces cerevisiae, and the experimental results show that DPC outperforms CMC, MCL, SPICi, HC-PIN, COACH, and Core-Attachment based on the validation of matching with known complexes and hF-measures.

  • Item
    Estimation of Alternative Splicing Isoform Frequencies from RNA-Seq Data
    (2011-01-01) Nicolae, Marius; Mangul, Serghei; Măndoiu, Ion I; Zelikovskiy, Alexander; Georgia State University

    Background: Massively parallel whole transcriptome sequencing, commonly referred to as RNA-Seq, is quickly becoming the technology of choice for gene expression profiling. However, due to the short read length delivered by current sequencing technologies, estimation of expression levels for alternative splicing gene isoforms remains challenging.

    Results: In this paper we present a novel expectation-maximization algorithm for inference of isoform- and gene-specific expression levels from RNA-Seq data. Our algorithm, referred to as IsoEM, is based on disambiguating information provided by the distribution of insert sizes generated during sequencing library preparation, and takes advantage of base quality scores, strand and read pairing information when available. The open source Java implementation of IsoEM is freely available at http://dna.engr.uconn.edu/software/IsoEM/.

    Conclusions: Empirical experiments on both synthetic and real RNA-Seq datasets show that IsoEM has scalable running time and outperforms existing methods of isoform and gene expression level estimation. Simulation experiments confirm previous findings that, for a fixed sequencing cost, using reads longer than 25-36 bases does not necessarily lead to better accuracy for estimating expression levels of annotated isoforms and genes.
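
    The expectation-maximization idea behind frequency estimation from ambiguous reads can be shown with a deliberately simplified sketch. It is not IsoEM: it ignores isoform lengths, insert size distributions, base quality scores, and strand and pairing information, and treats every isoform compatible with a read as equally likely a priori. The function name and input layout are hypothetical.

```python
def em_frequencies(read_compat, isoforms, n_iter=200):
    """Estimate isoform frequencies from reads with ambiguous origin.

    read_compat: list of sets; each set holds the isoforms a read is
                 compatible with (assumed non-empty).
    isoforms:    list of all isoform names.
    """
    # Start from a uniform guess.
    freq = {iso: 1.0 / len(isoforms) for iso in isoforms}
    for _ in range(n_iter):
        # E-step: split each read among its compatible isoforms in
        # proportion to the current frequency estimates.
        counts = {iso: 0.0 for iso in isoforms}
        for compat in read_compat:
            total = sum(freq[iso] for iso in compat)
            if total == 0:
                continue
            for iso in compat:
                counts[iso] += freq[iso] / total
        # M-step: renormalize the fractional read counts into frequencies.
        norm = sum(counts.values())
        freq = {iso: c / norm for iso, c in counts.items()}
    return freq


if __name__ == "__main__":
    reads = [{"A"}] * 6 + [{"B"}] * 2 + [{"A", "B"}] * 2
    print(em_frequencies(reads, ["A", "B"]))
```

    In the example, the two ambiguous reads are gradually attributed mostly to isoform A because the unambiguous reads already favor it, and the estimates converge to a 3:1 ratio.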

  • Item
    Inferring viral quasispecies spectra from 454 pyrosequencing reads
    (2011-01-01) Astrovskaya, Irina; Tork, Bassam; Mangul, Serghei; Westbrooks, Kelly; Măndoiu, Ion; Balfe, Peter; Zelikovskiy, Alexander; Georgia State University

    Background: RNA viruses infecting a host usually exist as a set of closely related sequences, referred to as quasispecies. The genomic diversity of viral quasispecies is a subject of great interest, particularly for chronic infections, since it can lead to resistance to existing therapies. High-throughput sequencing is a promising approach to characterizing viral diversity, but unfortunately standard assembly software was originally designed for single genome assembly and cannot be used to simultaneously assemble and estimate the abundance of multiple closely related quasispecies sequences.

    Results: In this paper, we introduce a new Viral Spectrum Assembler (ViSpA) method for quasispecies spectrum reconstruction and compare it with the state-of-the-art ShoRAH tool on both simulated and real 454 pyrosequencing shotgun reads from HCV and HIV quasispecies. Experimental results show that ViSpA outperforms ShoRAH on simulated error-free reads, correctly assembling 10 out of 10 quasispecies and 29 sequences out of 40 quasispecies. While ShoRAH has a significant advantage over ViSpA on reads simulated with sequencing errors due to its advanced error correction algorithm, ViSpA is better at assembling the simulated reads after they have been corrected by ShoRAH. ViSpA also outperforms ShoRAH on real 454 reads. Indeed, the 7 most frequent sequences reconstructed by ViSpA from a real HCV dataset are viable (do not contain internal stop codons), and the most frequent sequence was within 1% of the actual open reading frame obtained by cloning and Sanger sequencing. In contrast, only one of the sequences reconstructed by ShoRAH is viable. On a real HIV dataset, ShoRAH correctly inferred only 2 quasispecies sequences with at most 4 mismatches, whereas ViSpA correctly reconstructed 5 quasispecies with at most 2 mismatches, and 2 out of 5 sequences were inferred without any mismatches. ViSpA source code is available at http://alla.cs.gsu.edu/~software/VISPA/vispa.html.

    Conclusions: ViSpA enables accurate viral quasispecies spectrum reconstruction from 454 pyrosequencing reads. We are currently exploring extensions applicable to the analysis of high-throughput sequencing data from bacterial metagenomic samples and ecological samples of eukaryote populations.

  • Item
    Guest Editors' Introduction
    (2012-01-01) Chen, Jianer; Măndoiu, Ion; Sunderraman, Raj; Wang, Jianxin; Zelikovskiy, Alexander; Georgia State University; Georgia State University

    This Supplement includes a selection of papers presented at the 7th International Symposium on Bioinformatics Research and Application (ISBRA), which was held on May 27-29, 2011 at Central South University in Changsha, China. The technical program of the symposium included 36 extended abstracts presented orally and published in volume 6674 of Springer Verlag’s Lecture Notes in Bioinformatics series. Additionally, the program included 38 short abstracts presented either orally or as posters. Authors of both extended and short abstracts presented at the symposium were invited to submit full versions of their work to this Supplement. Following a rigorous review process, 19 of the 40 full papers submitted were selected for publication.

    Selected papers cover a broad range of bioinformatics topics, ranging from algorithms for structural biology to phylogenetics and biological networks.

  • Item
    The Effect of Wavelet Families on Watermarking
    (2009-01-01) Brannock, Evelyn; Weeks, Michael; Harrison, Robert W.; Georgia State University; Georgia State University

    With the advance of technologies such as the Internet, Wi-Fi Internet availability and mobile access, it is becoming harder than ever to safeguard intellectual property in digital form. Digital watermarking is a steganographic technique that is used to protect creative content. Copyrighted work can be accessed from many different computing platforms; the same image can exist on a handheld personal digital assistant, as well as on a laptop and a desktop server computer. For those who want to pirate, it is simple to copy, modify and redistribute digital media. Because this adversely impacts business profits, digital watermarking has been a highly researched field in recent years. This paper examines a technique for digital watermarking which utilizes properties of the Discrete Wavelet Transform (DWT). The digital watermarking algorithm is explained. The algorithm is tested on a database of 40 images of different types. These images, including greyscale, black and white, and color, were chosen for their diverse characteristics. Eight families of wavelets, both orthogonal and biorthogonal, are compared for their effectiveness. Three distinct watermarks are tested. Since compressing an image is a common occurrence, the images are compressed to determine the significance of such an action. Different types of noise are also added. The PSNR for each image and each wavelet family is used to measure the efficacy of the algorithm. This objective measure is also used to determine the influence of the mother wavelet. The paper asks the question: “Is the wavelet family chosen to implement the algorithm of consequence?” In summary, the results support the concept that the simpler wavelet transforms, e.g. the Haar wavelet, consistently outperform the more complex ones when using a non-colored watermark.
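
    The PSNR measure mentioned above has a standard definition, 10 · log10(MAX^2 / MSE), which the following sketch computes for two greyscale images stored as nested lists. The 8-bit peak value of 255 and the function name are assumptions of this example; the abstract does not say how the authors computed PSNR for color images.

```python
from math import inf, log10


def psnr(original, distorted, max_value=255):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are lists of rows of pixel intensities; max_value is the peak
    intensity (255 assumed here for 8-bit greyscale).
    """
    flat_o = [p for row in original for p in row]
    flat_d = [p for row in distorted for p in row]
    if len(flat_o) != len(flat_d):
        raise ValueError("images must contain the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(flat_o, flat_d)) / len(flat_o)
    # Identical images have zero error, i.e. unbounded PSNR.
    return inf if mse == 0 else 10 * log10(max_value ** 2 / mse)


if __name__ == "__main__":
    print(psnr([[100, 100]], [[110, 90]]))
```

    Higher values mean the watermarked or attacked image is closer to the reference, which is why PSNR can serve as an objective basis for comparing wavelet families.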

  • Item
    Efficient error correction for next-generation sequencing of viral amplicons
    (2012-01-01) Skums, Pavel; Dimitrova, Zoya; Campo, David S; Vaughan, Gilberto; Rossi, Livia; Forbi, Joseph C; Yokosawa, Jonny; Zelikovskiy, Alexander; Khudyakov, Yury; Georgia State University

    Background: Next-generation sequencing allows the analysis of an unprecedented number of viral sequence variants from infected patients, presenting a novel opportunity for understanding virus evolution, drug resistance and immune escape. However, sequencing in bulk is error prone. Thus, the generated data require error identification and correction. Most error-correction methods to date are not optimized for amplicon analysis and assume that the error rate is randomly distributed. Recent quality assessment of amplicon sequences obtained using 454-sequencing showed that the error rate is strongly linked to the presence and size of homopolymers, position in the sequence and length of the amplicon. All these parameters are strongly sequence specific and should be incorporated into the calibration of error-correction algorithms designed for amplicon sequencing.

    Results: In this paper, we present two new efficient error correction algorithms optimized for viral amplicons: (i) k-mer-based error correction (KEC) and (ii) empirical frequency threshold (ET). Both were compared to a previously published clustering algorithm (SHORAH), in order to evaluate their relative performance on 24 experimental datasets obtained by 454-sequencing of amplicons with known sequences. All three algorithms show similar accuracy in finding true haplotypes. However, KEC and ET were significantly more efficient than SHORAH in removing false haplotypes and estimating the frequency of true ones.

    Conclusions: Both algorithms, KEC and ET, are highly suitable for rapid recovery of error-free haplotypes obtained by 454-sequencing of amplicons from heterogeneous viruses. The implementations of the algorithms and data sets used for their testing are available at: http://alan.cs.gsu.edu/NGS/?q=content/pyrosequencing-error-correction-algorithm
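
    The abstract does not describe how KEC works internally. As a rough illustration of the general idea behind k-mer-based approaches only, the sketch below counts k-mers across a set of reads and flags those seen fewer than a chosen number of times as candidate errors. It is not the KEC algorithm; the function names and the `min_count` threshold are hypothetical.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def rare_kmers(reads, k, min_count):
    """Return k-mers seen fewer than min_count times.

    min_count is a hypothetical, user-chosen threshold; rare k-mers are
    only candidates for sequencing errors, not confirmed errors.
    """
    counts = kmer_counts(reads, k)
    return {kmer for kmer, count in counts.items() if count < min_count}
```

    For example, with three copies of the read `ACGTAC` and one read `ACGAAC`, calling `rare_kmers(reads, 3, 2)` flags only the k-mers that overlap the single differing base.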

  • Item
    Multicasting in Multihop Optical WDM Networks with Limited Wavelength Conversion
    (2003-01-01) Shen, Hong; Pan, Yi; Sum, John; Horiguchi, Susumu; Georgia State University

    This paper provides an overview of efficient algorithms for multicasting in optical networks supported by Wavelength Division Multiplexing (WDM) with limited wavelength conversion. We classify the multicast problems as off-line or on-line, in both reliable and unreliable networks. In each problem class, we present efficient algorithms for multicast and multiple multicast and show their performance. We also present efficient schemes for dynamic multicast group membership updating. We conclude the paper by showing possible extensions of the presented algorithms for QoS provisioning.