Biol 591 
Introduction to Bioinformatics
Fall 2002 

Scenario 3: Comparison of protein sequence against database
Our Story

You are working at the Centers for Disease Control when blood samples taken from patients exhibiting symptoms of anthrax cross your desk. The patients are related by their place of work, and you suspect a single causative agent. Obviously you want to find out what that agent is. You follow the usual routine: PCR out a region of lef, the gene encoding the lethal factor from Bacillus anthracis. You do this first of all to see if the gene is present. The PCR gives a fragment of the expected size, so there's little doubt about the diagnosis of anthrax. The second step is to sequence the fragment and compare it against known sequences of the gene, hoping to get clues as to the specific strain responsible. You can see the sequence of one of the samples (DG47) by clicking here.

The standard way of comparing the sequence against others is to search it against the non-redundant (NR) sequence collection in GenBank, using the Blast facility at NCBI. You do this, and much to your surprise you get the result displayed by clicking here. All you get are hits with E-values well above 1, clearly noise.

This result doesn't make sense! This region of the lef gene is highly conserved. If PCR picked up the fragment, then Blast should pick up the sequence. Maybe you've encountered a novel toxin gene sufficiently related to lef to permit PCR amplification, but not related enough to match known lef sequences in GenBank. This doesn't make a whole lot of sense either, but you don't have any better idea at the moment. With this in mind, you translate the DG47 sequence and submit the translated sequence to Blast to search for similar proteins, getting the results displayed by clicking here.
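If you want to see what "translating" the sequence involves, here is a minimal sketch in Python, assuming the standard genetic code and a reading frame you already know. The function name translate and the toy fragment are made up for illustration; the fragment is not the DG47 sequence.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def translate(dna, frame=0):
    """Translate a DNA string in the given reading frame (0, 1, or 2).

    Stop codons become '*'; codons with unexpected letters become 'X'.
    A trailing partial codon is ignored.
    """
    dna = dna.upper()
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


# Hypothetical toy fragment for illustration only (NOT the DG47 sequence).
toy_fragment = "ATGGCTAAAGGT"
print(translate(toy_fragment))
```

In practice you rarely know the correct frame in advance, so you would either try all three frames (and the reverse strand) or let a translated search do that for you.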

Now you are flabbergasted! How is it that searching for similar DNA sequences yields no significant similarity to lef, but searching for protein sequences gives you perfect identity? Something must be very strange about this sequence, so you compare it by hand with the corresponding sequence from lef, getting the alignment shown here.

Things are not getting any clearer. The gene fragment is highly similar to lef, about 90% identical. That's consistent with the identity of the protein sequences, BUT WHY DIDN'T BLAST FIND THE DNA SEQUENCE??? In desperation, you compare the two DNA sequences by pairwise Blast. Now, you already know the answer, but you want to see if Blast has gone crazy. It has, for here are the results.
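A figure like "about 90% identical" is easy to check for yourself once two sequences are lined up position by position. The following is a rough sketch, assuming the two sequences are already aligned without gaps and have the same length. The function name percent_identity and the two toy sequences are invented for illustration and are not the DG47 or lef sequences.

```python
def percent_identity(seq_a, seq_b):
    """Percent of positions at which two equal-length, ungapped sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return 100.0 * matches / len(seq_a)


# Hypothetical toy sequences (NOT DG47 or lef); they differ at one position.
toy_a = "ATGGCTAAAGGT"
toy_b = "ATGGCAAAAGGT"
print(round(percent_identity(toy_a, toy_b), 1))  # 11 of 12 positions match
```

Note that a single percentage says nothing about where the mismatches fall along the sequence, which is worth keeping in mind as you think about the puzzle.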

In brief, when you translate this sequence, it is 100% identical to anthrax toxin. When you compare it by eye, it is almost identical to the lef gene that encodes part of the anthrax toxin. When you ask Blast, the most heavily used program in all of bioinformatics, to do a simple comparison of the two DNA fragments, it finds nothing.

WHAT'S GOING ON???

Problem
How to align sequences using Blast and know what you're doing.

Tools
Blast
Basic Local Alignment Search Tool. A tool to find matching regions between two sequences that may share only limited similarity.