Project Presentations: CSI5126 ALGORITHMS IN BIOINFORMATICS
CBYA605 - Tuesday December 7 - 9:00-11:30

9:05-9:25 Houman Abasian "Implementing and Comparing Algorithms for K-difference Problem"

9:30-9:50 Balasingham Balamohan "Efficient Alignment of three sequences"

9:55-10:15 Erico Neves "Comparison Between MUSCLE and HMMer"

10:20-10:40 Ruoying Gong "Implementation of <A Simple Algorithm for Finding Exact Common Repeat>"

10:45-11:05 Andrew Schoenrock "Predicting Novel Protein Motif Pairs Responsible for Protein-Protein Interaction"

11:10-11:30 Xiaoguang (Benjamin) Wang 
"Using Multiple Sequence Alignments to Improve Gene Prediction Methods"

============= ABSTRACTS ===================================================

"Implementing and Comparing Algorithms for K-difference Problem"
Houman Abasian 
Abstract:
K-mismatch S in T is a substring of T that matches S with maximum k errors
(just mismatches, excluding spaces). The running time for the naive method
to solve this problem is O(nm). But using suffix tree it is possible to
solve this problem in O(kn) [1]. We now consider the k-difference problem
in which spaces are allowed in addition to mis- matches. This problem can
be solved using dynamic programming in O(mn) [2]. The k-difference problem
can not be solved using suffix tree because the structure of suffix tree
is not designed to handle spaces[3]. The solution for this problem is
called "hybrid dynamic programming" in which suffix tree and dynamic
programming are combined[3, 2]. In this work we study and implement three
different algorithms for k-difference problem and compare them together.
The first algorithm is based on dynamic programming. The second and the
third algorithms are based on diagonal wise dynamic programming in which
the Longest common extension (LCE) should be computed. The second
algorithm calculates LCE using naive method O(m) and the third one
calculates it using suffix trees in O(1). In this work we simulated the
third algorithm to get the LCE in constant time. Then we use two
artificial examples to see the output of the algorithms as well as
comparing three algorithms in terms of average running time.

"Efficient Alignment of three sequences"
Balasingham Balamohan
Abstract : Sequence alignment is an important problem in bioinformatics.
In this talk, I present an efficient algorithm I developed
 to align three sequences for a restricted class of cost matrices
 under the sum of pair distance model. 

"Comparison Between MUSCLE and HMMer"
Erico Neves
Abstract:
Evaluation of tools to execute multiple alignment is a very important
issue. Two known classes of multiple alignment algorithms are iterative 
and probabilistic. One known iterative algorithm is MUSCLE, and one
probabilistic algorithm is HMM. In other papers, only MUSCLE was evaluated
in relation to other algorithms, but HMM tools still miss comparison tests. 
The comparison was done using a standard database known as BAliBASE.
Results present an advantage in HMM in relation to MUSCLE.

"Implementation of <A Simple Algorithm for Finding Exact Common Repeat>"Ruoying GongRepeated DNA sequences play a crucial role in genome organization and evolution. There are three types of repeated sequences, direct, inverted, and mirror repeats. Finding common repeats in bioinformatics research is an important task. In this paper, author gave an algorithm which uses a single suffix array to handle the process of finding the three types common repeats in a given subset of DNA sequences.My project is to implement this algorithm using C language. The problem is described as this: Given a set of strings U={T1, T2, …, Tn}, find the longest common substrings that appear defined times (D) in a defined subset (K) of  U considering direct, inverted and mirror repeats. The framework of this algorithm is as follows:Input: DNA sequences, T1, T2, …, Tno	Step 1: create a new string T’i for each 1 ≤ i ≤ n to consider inverted and mirror repeats, and another string T’’ which is concatenation of all T’i’s.o	Step2: construct the suffice array of T’’. o	Step3: Find the common substrings and their critical rangesOutput: Calculated common substrings within the DNA sequences, T1, T2, …, TnConstructing the suffix array T’’ without constructing the suffix tree in step 2 uses the Quicksort algorithm which has an average time complex O(nlogn). This algorithm is quite simple to handle three types common repeats and uses only one suffix array which will save a lot of space and running time.

"Predicting Novel Protein Motif Pairs Responsible for Protein-Protein  
Interaction"
Andrew Shoenrock
Understanding and inhibiting protein-protein interactions play a large  
role in modern drug design. However, to inhibit protein-protein  
interactions one must not only know which proteins interact, but where  
the interaction physically takes place on the proteins themselves. The  
detection and prediction of these interaction sites are becoming an  
interest of many in the scientiﬁc community. To this end, the goal of  
this project is to predict novel protein motif pairs responsible for  
protein-protein interaction using PIPE-Sites, a recent extension of  
the Protein-protein Interaction Prediction Engine (PIPE), which has  
not been used to generate any novel predictions to date. The talk will  
consist of a general overview of the PIPE algorithm itself followed by  
the idea behind and a description of the PIPE-Sites approach to  
predicting protein-protein interaction sites. This will lead into a  
discussion about how this software, in conjunction with tools used to  
detect known protein motifs such as InterPro, can be used to predict  
novel protein motif pairs responsible for interaction.


"Using Multiple Sequence Alignments to Improve Gene Prediction Methods"
Xiaoguang (Benjamin) Wang 
abstract:Gene Prediction typically refers to the area of computational biology that is concerned with algorithmically identifying stretches of sequence, usually genomic DNA, that are biologically functional.In recent years, Machine learning techniques like support vector machines are widely and successfully used on gene prediction, such as approaches like mSplicer, CONTRAST, or mGene. In our project we are trying to solve two specific problems using machine learning algorithms: 1) Given a sequence of DNA, to recognize the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out). 2) Given a sequence of DNA, to recognize it is a promoter instance or non-promoter instance.We found that using Multiple Sequence Alignments (MSAs) method to reorganize the sequences of DNA as a data preprocess may improve the performance of the prediction. We implement the Center Star method which is one the MSAs algorithm and compare it with other well known MSAs algorithms include Clustalw and MUSCLE. We found that our hypothesis is supported by the result of our experiments on both small size sequences and large size sequences.