CSI5126. Algorithms in bioinformatics
Fall 2017

Assignment 1

Deadline: October 9, 2017, 18:00

Learning outcome

In the real world, you would use an existing application or API to perform the tasks of this assignment (see the Resources section). However, I believe there is a real advantage in writing simple programs yourselves to carry out these tasks, so that you can learn more about the biology. For all the questions, assume that the information is stored in FASTA format.
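Since every question reads its input in FASTA format, a small reader may be a convenient starting point. The following is only a minimal sketch in Python: the function name `parse_fasta` is hypothetical, and it assumes that header lines start with `>`, that a sequence may span several lines, and that any sequence text appearing before the first header can be ignored.

```python
def parse_fasta(lines):
    """Parse FASTA text given as an iterable of lines.

    Returns a list of (header, sequence) pairs. The sequence lines that
    follow a header are joined into one upper-case string.
    """
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Assumption: sequence text before the first header is ignored.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

An open file object is an iterable of lines, so `parse_fasta(open(path))` would work for a file on disk (`path` being whatever file name the program receives).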

1 Transcription (5 marks)

Write a simple program taking as input a DNA sequence stored in a file. The program must transcribe the input to RNA. The result is displayed on the standard output. For instance, given a file with the following DNA content:

> Unknown  
ACTGTTGTTCGGTGATCATCAGTTGTACAACGTCCTAACAACATCACATGCAATGCTTATGATATTCTTC

Your program would display the following information on the output:

ACUGUUGUUCGGUGAUCAUCAGUUGUACAACGUCCUAACAACAUCACAUGCAAUGCUUAUGAUAUUCUUC
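In the example above, the output is the input sequence with every T replaced by U, that is, the input is treated as the coding strand. One possible sketch in Python follows; the function name `transcribe` is hypothetical, and the code assumes the sequence has already been extracted from the FASTA file as a plain string.

```python
def transcribe(dna):
    """Return the RNA transcript of a coding-strand DNA sequence.

    Assumption: the input is the coding (sense) strand, so transcription
    amounts to replacing thymine (T) with uracil (U).
    """
    return dna.upper().replace("T", "U")


print(transcribe("ACTGTTGTTCGG"))  # prints ACUGUUGUUCGG
```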

2 Reverse complement (5 marks)

Write a simple program taking as input a DNA sequence stored in a file. The program must display the reverse complement sequence. For instance, given a file with the following DNA content:

> Unknown  
ACTGTTGTTCGGTGATCATCAGTTGTACAACGTCCTAACAACATCACATGCAATGCTTATGATATTCTTC

Your program would display the following information on the output:

GAAGAATATCATAAGCATTGCATGTGATGTTGTTAGGACGTTGTACAACTGATGATCACCGAACAACAGT
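The reverse complement is obtained by complementing each base (A with T, C with G) and reading the result backwards. A minimal sketch in Python, assuming the sequence contains only the letters A, C, G and T; the names `COMPLEMENT` and `reverse_complement` are hypothetical.

```python
# Translation table mapping each base to its Watson-Crick complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence.

    Characters other than A, C, G and T are left unchanged (assumption:
    the input contains no ambiguity codes that need special handling).
    """
    return dna.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("ACTGTTGTT"))  # prints AACAACAGT
```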

3 All six reading frames (15 marks)

Write a simple program taking as input a DNA sequence stored in a file. The program must display all six translation reading frames. For example, given the following DNA content:

> Unknown  
ACTGTTGTTCGGTGATCATCAGTTGTACAACGTCCTAACAACATCACATGCAATGCTTATGATATTCTTC

Your program would produce the following output. Here, the star (*) represents a stop codon.

> 5’3’ Frame 1  
T V V R * S S V V Q R P N N I T C N A Y D I L  
 
> 5’3’ Frame 2  
L L F G D H Q L Y N V L T T S H A M L M I F F  
 
> 5’3’ Frame 3  
C C S V I I S C T T S * Q H H M Q C L * Y S  
 
> 3’5’ Frame 1  
E E Y H K H C M * C C * D V V Q L M I T E Q Q  
 
> 3’5’ Frame 2  
K N I I S I A C D V V R T L Y N * * S P N N S  
 
> 3’5’ Frame 3  
R I S * A L H V M L L G R C T T D D H R T T
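The six frames are the three codon offsets of the input strand followed by the three codon offsets of its reverse complement. The sketch below, in Python, is one way to produce them under the standard genetic code. All names (`CODON_TABLE`, `translate`, `six_frames`) are hypothetical; trailing bases that do not fill a codon are dropped, codons with unexpected characters are shown as `X` (an assumption, since the assignment does not say how to treat them), and the frame labels use a plain ASCII apostrophe.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order
# (third base varying fastest); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate a DNA string codon by codon, ignoring an incomplete tail."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    # Assumption: codons containing unexpected characters are shown as 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Return a list of (label, protein) pairs for all six reading frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = []
    for label, strand in (("5'3'", dna), ("3'5'", reverse)):
        for offset in range(3):
            name = f"{label} Frame {offset + 1}"
            frames.append((name, translate(strand[offset:])))
    return frames


# Print the frames of a short sequence in the layout used above.
for name, protein in six_frames("ACTGTTGTTCGGTGA"):
    print(">", name)
    print(" ".join(protein))
    print()
```

Building the codon table from the two strings keeps the code short; writing out a 64-entry dictionary by hand would be equally valid and perhaps easier to check against a printed table.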

4 Database search (5 marks)

One of our life science colleagues has just sequenced the DNA fragment below. We would like to know whether it corresponds to a protein-coding sequence and, if so, whether it matches a known protein sequence. To solve this problem, you must translate this DNA sequence into all six possible reading frames and search each of them using UniProt, a well-known resource for protein sequence information.

> Unknown  
ACTGTTGTTCGGTGATCATCAGTTGTACAACGTCCTAACAACATCACATGCAATGCTTATGATATTCTTC  
TTCATCATGCCAGGCACGATGGCAGGACTAGGCAACTTACTAGTGCCATTCCAGATGAGTGTACCGGAGT  
TAGTATTCCCAAAGATTAATAACATCGGTATATGATTTTTAGTATGTGGTCTACTTTTGATTACGGGTTC  
ATCTTGGATGGAGGAAGGTTCAGGAACGGCCTGAACCGTCTATCCACCACTAGCGCTCACTGCAAGTCAT  
AGCGGACTTGCTGTAGATACGTTCATTATCGCATTGCACATGGCCGGTGCAAGCTCCCTTACAGGAAGCA  
TCAACCTTATATGTACAATCGCCTATGCCCGCCGTTCACTCATGGCGATGCTGCAGTCATCACTTTATCC  
CTGATCCATTACAATCACTGCAGCGTTACTCATAGGAGTTGTGCCTGTGCTAGCAGGTGCTATCACGATG  
CTACTCACTGATAGAAGTTGGAGTACCAGCTTCTATGACAGTTCGGCAGGCGGTGATCCTATGTTGTATC  
AGCACTTATTCTGGGTGTTTGGGCATCCAGAAGTCTATATCATCATACTTCCAGTATTCGGTATAGTCAG

5 Genetic Code (20 marks)

Since its discovery 50 years ago, the genetic code has never ceased to amaze. For instance, we now know that biases in codon usage play key roles in the subtle regulation of gene expression.

For this question, write a simple program to analyze the genetic code. In particular, your program must output the following information:

Resources

Directives

A Frequently Asked Questions [FAQ]

  1. “None.”

    For now.

Modified October 3, 2017