
BNFO301 – Introduction to Bioinformatics
Genome Sequences - Some Resources for Analysis

While obtaining a nearly complete genomic sequence is an arduous task, reaping its rewards requires much additional work. The insights that can be gained from a genomic sequence come from analyzing which genes and which other sequences are present. The tour below is intended to give a taste of the kinds of analysis done on genomic sequences. (Descriptions of many of the resources visited in this tour, and more besides, can be found in this list.)

Gene finding
Amidst the 122+ million base pairs of the assembled Drosophila genome lie the genes that in large part make a fly a fly. The task of picking out genes within a genome, already difficult for prokaryotic genomes, is complicated in eukaryotic genomes by the presence of introns that often break up the coding region. How to find those genes?

You'll remember from the tour What is a Gene? that it is not a trivial matter to find the beginning of a gene by inspection. You can't merely scan for ATGs, for example. Gene finding programs use a more sophisticated approach, examining the nucleotide sequences of known genes from an organism and extracting sequence tendencies that the programs can use to predict whether a genome segment is part of a gene. The primary tool in this analysis is the hidden Markov model, which perhaps we'll have time to discuss.
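To make "extracting sequence tendencies" concrete, here is a minimal Python sketch. It is not GeneMark's method: it uses a plain first-order Markov chain (no hidden states), and the tiny training sequences are invented purely for illustration. The function names `train_markov` and `log_odds` are hypothetical. The idea is to learn how often each base follows each other base in known genes versus other sequence, then score a candidate segment by the log of the ratio.

```python
import math

BASES = "ACGT"

def train_markov(seqs, pseudocount=1.0):
    """Estimate P(next base | current base) from a list of training sequences."""
    counts = {a: {b: pseudocount for b in BASES} for a in BASES}
    for s in seqs:
        s = s.upper()
        for a, b in zip(s, s[1:]):
            if a in counts and b in counts:
                counts[a][b] += 1
    return {a: {b: counts[a][b] / sum(counts[a].values()) for b in BASES}
            for a in BASES}

def log_odds(seq, gene_model, background_model):
    """Sum of log(P_gene / P_background) over adjacent base pairs.
    Positive scores mean the segment looks more like the gene training set."""
    seq = seq.upper()
    score = 0.0
    for a, b in zip(seq, seq[1:]):
        if a in gene_model and b in gene_model:
            score += math.log(gene_model[a][b] / background_model[a][b])
    return score

# Toy training data (made up for this sketch, not real Drosophila sequence)
gene_model = train_markov(["ATGGCCGCCGGC", "ATGGCGGCCGCG"])
background_model = train_markov(["ATATTTAATATA", "TTAATATATTAA"])

print(log_odds("GCCGCG", gene_model, background_model))  # positive: gene-like
print(log_odds("TATATA", gene_model, background_model))  # negative: background-like
```

Real gene finders go well beyond this, using longer contexts, codon position, and hidden states for exons, introns, and intergenic regions, but the scoring principle is similar.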

Let's look at gene finding in action. Suppose you're considering a segment of the Drosophila genome, which you can find here. Where are the genes? To get a prediction:

Download the sequence file, giving it a name
Go to GeneMark, a program that uses hidden Markov models to predict gene locations
Click on GeneMark.hmm under Gene Prediction in Eukaryotes
Upload the sequence, clicking the Browse button
Choose D. melanogaster as the species (i.e. the species whose genes were used as the training set)
Make sure the following are checked: Generate PDF graphics and Translate predicted genes into protein
Click on Start GeneMark.hmm

The output from the program includes coordinates for the putative gene(s) found. Click on View PDF Graphical Output and compare it with the coordinate listing. GeneMark considers all six possible reading frames, three on the forward strand and three on the reverse strand. If you go to the coordinates shown in the listing, you should find a high degree of similarity to the Markov model generated from the Drosophila training set.
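The six reading frames can be sketched in a few lines of Python. This is only an illustration of the concept, not anything GeneMark exposes; the function names and the short example sequence are assumptions made for the sketch. The three reverse-strand frames are read from the reverse complement of the sequence.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def codons(seq, offset):
    """Split seq into complete codons, starting at the given 0-based offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]

def six_frames(seq):
    """Map frame labels (+1..+3 forward, -1..-3 reverse) to lists of codons."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = codons(seq, offset)
        frames[f"-{offset + 1}"] = codons(rc, offset)
    return frames

for label, frame in sorted(six_frames("ATGAAATAG").items()):
    print(label, " ".join(frame))
```

In this toy sequence only frame +1 reads as a start codon followed by a stop codon (ATG AAA TAG); the other five frames break the same bases into different codons.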

SQ1. How many genes are predicted by GeneMark within this segment?
SQ2. Are the predicted exons of each gene found within the same open reading frames?
SQ3. Take a guess what symbols are used in the graph to represent start and stop codons.

To generate a listing of the segment, go into BioBIKE and upload and display the sequence in the following way:

Click on the down arrow of the Tool box at the bottom of the page and select Upload file
Click on the Browse button and find your file
Click on the Upload button
X out of the window, returning to the Listener
Bring the sequence into BioBIKE as follows:
(DEFINE (title seq) AS (READ-FASTA-FILE "name-of-file"))
Display the sequence as follows:
(DISPLAY-SEQUENCE seq)
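If you'd like to check coordinates outside BioBIKE, the same steps can be sketched in Python. This is an assumption-laden illustration, not part of the tour's tools: `read_fasta` and `codon_at` are hypothetical helper names, the demo record is made up, and coordinates are taken to be 1-based positions on the forward strand.

```python
import io

def read_fasta(handle):
    """Return (title, sequence) for the first record in a FASTA text stream."""
    title, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if title is not None:
                break  # stop at the start of a second record
            title = line[1:]
        elif title is not None:
            chunks.append(line.upper())
    return title, "".join(chunks)

def codon_at(seq, coordinate):
    """Three bases starting at a 1-based coordinate on the forward strand."""
    return seq[coordinate - 1:coordinate + 2]

# Made-up record standing in for the downloaded file
demo = io.StringIO(">demo segment\nccATGaaa\nTAG\n")
title, seq = read_fasta(demo)
print(title, seq, codon_at(seq, 3))
```

For your own file, replace the `io.StringIO` demo with `open("name-of-file")`. For a gene predicted on the reverse strand, the start codon would have to be read from the reverse complement instead.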

SQ4. Do the putative genes (per GeneMark) begin with start codons?

Protein features
Copy the amino acid sequence of the larger of the two proteins predicted by GeneMark. Does it have any transmembrane regions? Kyte-Doolittle hydropathy plots provide a simple and easily comprehended way to predict such regions. The algorithm merely totals the hydrophobicity of amino acids within a window that slides along the length of the amino acid sequence.
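The sliding-window calculation can be sketched as follows. This is a minimal illustration, not the web site's code: the function name `hydropathy_profile` is hypothetical, the per-residue values are the published Kyte-Doolittle hydropathy scale, and the sketch assumes the window total is divided by the window size so that scores stay on the same scale whatever window you choose.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic)
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(protein, window=19):
    """Average hydropathy for each window position along the protein.
    Returns one score per window; an empty list if the protein is too short.
    Assumes the sequence contains only the 20 standard amino acid letters."""
    scores = [KD[aa] for aa in protein.upper()]
    if len(scores) < window:
        return []
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

# Made-up peptide: a hydrophilic stretch followed by a hydrophobic one
profile = hydropathy_profile("K" * 19 + "L" * 19, window=19)
print(round(profile[0], 2), round(profile[-1], 2))
```

A window of 19 is roughly the number of residues needed to span a membrane as a helix, which is why it is the usual choice when looking for transmembrane regions.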

Go to the Kyte-Doolittle web site (provided by Malcolm Campbell to accompany his excellent book, Discovering Genomics, Proteomics, and Bioinformatics). Paste in the amino acid sequence you copied earlier, set the window size to be 19, and click Submit (you might also spend a moment looking at the background info link). Positive scores represent regions of hydrophobicity, negative scores regions of hydrophilicity (you can read more about the graph at the bottom of its page).

SQ5. How do you interpret the graph?

A program called DAS (Dense Alignment Surface method) is more sophisticated in predicting transmembrane regions, comparing candidate amino acid sequences to the amino acid frequencies of known membrane-spanning proteins. Go to DAS, paste in the amino acid sequence, and click Submit. (If DAS is having a problem showing graphical output, as it sometimes does, click here as a last resort.)

SQ6. How do you interpret the graph?

Many different kinds of information about a protein sequence can be obtained from several sites, typified by SMART. Go to that site, choose Normal mode, paste in the amino acid sequence, check the Pfam domains and Signal Peptides boxes, and click Sequence SMART. The most interesting thing in the output is the gray box marked Pfam 7tm_1, indicating that a region of high similarity was found between the amino acid sequence and a known protein family. Click on the box and then click on the 7tm_1 link to find out about the protein family.

SQ7. What kind of protein is your amino acid sequence similar to?

Sequence Features
Enough about the coding sequence. What about the rest of the DNA? In particular, what part of it determines the mRNA transcript? This is important in two regards. First, the coding region must lie within the mRNA transcript, so determining the transcript is a check on the predicted amino acid sequence. Second, the region immediately upstream of the start of transcription is likely to be responsible for the regulation of transcription, an important feature of the gene. To find the transcript, go back to your friend, Blast.

Click Blast
Click Nucleotide-nucleotide Blast (BlastN)
Paste in the nucleotide sequence (not the amino acid sequence) from the beginning of this tour.
Select Drosophila melanogaster from the list of organisms under Options
Remove the check mark from Low complexity (since you want to see strong similarities even to repeated sequences)
Click Blast
Click Format
When output appears (it might take up to a minute), note the red bars, representing sequences with high similarity
Mouse over each red bar and note the identification of the similar sequence

SQ8. The first three red bars extend over most or all of the 13 kb sequence. What are these hits and why do they extend so far?

SQ9. The fourth and fifth hits have patchy regions of similarity. Why?

SQ10. Examine the coordinates for these hits and interpret them in light of previous results.