Project Presentations: CSI5126 ALGORITHMS IN BIOINFORMATICS CBYA605 - Tuesday December 7 - 9:00-11:30 9:05-9:25 Houman Abasian "Implementing and Comparing Algorithms for K-difference Problem" 9:30-9:50 Balasingham Balamohan "Efficient Alignment of three sequences" 9:55-10:15 Erico Neves "Comparison Between MUSCLE and HMMer" 10:20-10:40 Ruoying Gong "Implementation of " 10:45-11:05 Andrew Schoenrock "Predicting Novel Protein Motif Pairs Responsible for Protein-Protein Interaction" 11:10-11:30 Xiaoguang (Benjamin) Wang "Using Multiple Sequence Alignments to Improve Gene Prediction Methods" ============= ABSTRACTS =================================================== "Implementing and Comparing Algorithms for K-difference Problem" Houman Abasian Abstract: K-mismatch S in T is a substring of T that matches S with maximum k errors (just mismatches, excluding spaces). The running time for the naive method to solve this problem is O(nm). But using suffix tree it is possible to solve this problem in O(kn) [1]. We now consider the k-difference problem in which spaces are allowed in addition to mis- matches. This problem can be solved using dynamic programming in O(mn) [2]. The k-difference problem can not be solved using suffix tree because the structure of suffix tree is not designed to handle spaces[3]. The solution for this problem is called "hybrid dynamic programming" in which suffix tree and dynamic programming are combined[3, 2]. In this work we study and implement three different algorithms for k-difference problem and compare them together. The first algorithm is based on dynamic programming. The second and the third algorithms are based on diagonal wise dynamic programming in which the Longest common extension (LCE) should be computed. The second algorithm calculates LCE using naive method O(m) and the third one calculates it using suffix trees in O(1). In this work we simulated the third algorithm to get the LCE in constant time. Then we use two artificial examples to see the output of the algorithms as well as comparing three algorithms in terms of average running time. "Efficient Alignment of three sequences" Balasingham Balamohan Abstract : Sequence alignment is an important problem in bioinformatics. In this talk, I present an efficient algorithm I developed to align three sequences for a restricted class of cost matrices under the sum of pair distance model. "Comparison Between MUSCLE and HMMer" Erico Neves Abstract: Evaluation of tools to execute multiple alignment is a very important issue. Two known classes of multiple alignment algorithms are iterative and probabilistic. One known iterative algorithm is MUSCLE, and one probabilistic algorithm is HMM. In other papers, only MUSCLE was evaluated in relation to other algorithms, but HMM tools still miss comparison tests. The comparison was done using a standard database known as BAliBASE. Results present an advantage in HMM in relation to MUSCLE. "Implementation of " Ruoying Gong Repeated DNA sequences play a crucial role in genome organization and evolution. There are three types of repeated sequences, direct, inverted, and mirror repeats. Finding common repeats in bioinformatics research is an important task. In this paper, author gave an algorithm which uses a single suffix array to handle the process of finding the three types common repeats in a given subset of DNA sequences. My project is to implement this algorithm using C language. The problem is described as this: Given a set of strings U={T1, T2, …, Tn}, find the longest common substrings that appear defined times (D) in a defined subset (K) of U considering direct, inverted and mirror repeats. The framework of this algorithm is as follows: Input: DNA sequences, T1, T2, …, Tn o Step 1: create a new string T’i for each 1 ≤ i ≤ n to consider inverted and mirror repeats, and another string T’’ which is concatenation of all T’i’s. o Step2: construct the suffice array of T’’. o Step3: Find the common substrings and their critical ranges Output: Calculated common substrings within the DNA sequences, T1, T2, …, Tn Constructing the suffix array T’’ without constructing the suffix tree in step 2 uses the Quicksort algorithm which has an average time complex O(nlogn). This algorithm is quite simple to handle three types common repeats and uses only one suffix array which will save a lot of space and running time. "Predicting Novel Protein Motif Pairs Responsible for Protein-Protein Interaction" Andrew Shoenrock Understanding and inhibiting protein-protein interactions play a large role in modern drug design. However, to inhibit protein-protein interactions one must not only know which proteins interact, but where the interaction physically takes place on the proteins themselves. The detection and prediction of these interaction sites are becoming an interest of many in the scientific community. To this end, the goal of this project is to predict novel protein motif pairs responsible for protein-protein interaction using PIPE-Sites, a recent extension of the Protein-protein Interaction Prediction Engine (PIPE), which has not been used to generate any novel predictions to date. The talk will consist of a general overview of the PIPE algorithm itself followed by the idea behind and a description of the PIPE-Sites approach to predicting protein-protein interaction sites. This will lead into a discussion about how this software, in conjunction with tools used to detect known protein motifs such as InterPro, can be used to predict novel protein motif pairs responsible for interaction. "Using Multiple Sequence Alignments to Improve Gene Prediction Methods" Xiaoguang (Benjamin) Wang abstract: Gene Prediction typically refers to the area of computational biology that is concerned with algorithmically identifying stretches of sequence, usually genomic DNA, that are biologically functional. In recent years, Machine learning techniques like support vector machines are widely and successfully used on gene prediction, such as approaches like mSplicer, CONTRAST, or mGene. In our project we are trying to solve two specific problems using machine learning algorithms: 1) Given a sequence of DNA, to recognize the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out). 2) Given a sequence of DNA, to recognize it is a promoter instance or non-promoter instance. We found that using Multiple Sequence Alignments (MSAs) method to reorganize the sequences of DNA as a data preprocess may improve the performance of the prediction. We implement the Center Star method which is one the MSAs algorithm and compare it with other well known MSAs algorithms include Clustalw and MUSCLE. We found that our hypothesis is supported by the result of our experiments on both small size sequences and large size sequences.