Biol 591 Introduction to Bioinformatics (Fall 2002)

Programming Problem Set 1P

Here is SequenceSearch2.pl, a version of SequenceSearch.pl that counts the matches rather than printing them. When you run it, you should see:
Exact matches:
147
Matches with one possible mismatch
3734
P1P.1. SequenceSearch2.pl searches for the pattern GTA.{8}TAC.{20,24}TA.{3}T, which stands for GTA followed by a gap of eight positions, then TAC followed by a gap of 20 to 24 positions, then TA followed by a gap of three positions and a T.
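
To make the pattern notation concrete, here is a minimal sketch in Python (not the course's Perl program) that counts non-overlapping matches of the original pattern in a string. The function name count_matches and the test sequence are made up for illustration, and SequenceSearch2.pl may count matches differently.

```python
import re

# The pattern from the problem, written exactly as in the text.
PATTERN = re.compile(r"GTA.{8}TAC.{20,24}TA.{3}T")

def count_matches(sequence):
    """Count non-overlapping matches of PATTERN in sequence."""
    return sum(1 for _ in PATTERN.finditer(sequence))

# Hypothetical sequence built to contain one match:
# GTA + 8 filler bases + TAC + 22 filler bases + TA + 3 filler bases + T
example = ("CC" + "GTA" + "C" * 8 + "TAC" + "C" * 22
           + "TA" + "C" * 3 + "T" + "CC")
print(count_matches(example))
```

Each `.` stands for any single base, and `{8}` or `{20,24}` says how many times the preceding `.` repeats, so only the nine letters written out in the pattern are constrained.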

Suppose we decide to narrow the search: we are interested only in those sequences counted above in which GTA is immediately followed by another G. How would you alter the pattern?

Check: Perhaps you're pretty sure you've got the right pattern and would like some confirmation. Here's what I get when I run the program with my solution:

Exact matches:
34
Matches with one possible mismatch
1017
P1P.2. Now suppose that we narrow the search further: as well as a G following the initial GTA, we want TAC to be immediately followed by a C. What pattern should we use?

Check: this pattern should give 9 exact and 294 inexact matches.

P1P.3. Go back again to the original pattern, GTA.{8}TAC.{20,24}TA.{3}T. How would you change this pattern to search only for the consensus NtcA binding site, without caring whether it's in range of a promoter or not?

Check: this pattern should give 1497 exact and 28731 inexact matches.

P1P.4. Figure 3 on page 5 of the Scenario 1 Molecular Biology notes (PDF) lists the sequences upstream from 20 cyanobacterial genes regulated by nitrogen deprivation. In each sequence the bases corresponding to the consensus NtcA binding site and to the promoter sequence are printed in bold. Sometimes the correspondence is exact; the upstream sequence has all nine of the bold bases. But that's not always true. For instance, in the second line (nirB-ntcB), the second base of the promoter is a T rather than an A, so that line has only eight bold bases.
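
One way to picture a match with one possible mismatch is to relax each of the nine fixed bases in turn to "any base" and ask whether any of the nine relaxed patterns fits. The Python sketch below does this for a single candidate window. It is an assumption about how such a test could be written, not a description of how SequenceSearch2.pl works, and the names and the window are hypothetical.

```python
import re

# Fixed (bold) parts of the original pattern and the gaps between them.
FIXED = ["GTA", "TAC", "TA", "T"]
GAPS = [".{8}", ".{20,24}", ".{3}"]

def build_pattern(fixed):
    """Interleave the fixed parts with the gaps."""
    parts = [fixed[0]]
    for gap, piece in zip(GAPS, fixed[1:]):
        parts += [gap, piece]
    return "".join(parts)

def one_mismatch_patterns():
    """Yield nine patterns, each with one fixed base replaced by '.'."""
    for i, piece in enumerate(FIXED):
        for j in range(len(piece)):
            relaxed = list(FIXED)
            relaxed[i] = piece[:j] + "." + piece[j + 1:]
            yield build_pattern(relaxed)

def matches_with_one_mismatch(window):
    """True if the whole window fits with at most one fixed base wrong."""
    return any(re.fullmatch(p, window) for p in one_mismatch_patterns())

# Hypothetical window whose promoter reads TT instead of TA,
# like the mismatch described above for the second line of Figure 3.
near = "GTA" + "C" * 8 + "TAC" + "C" * 22 + "TT" + "C" * 3 + "T"
print(re.fullmatch(build_pattern(FIXED), near) is not None)
print(matches_with_one_mismatch(near))
```

Because `.` also matches the expected base, every exact match passes this test too, which fits the wording "one possible mismatch".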

Examine the sixth sequence (amt1). Find a reason for wondering if our simulation might underestimate the probability of finding a match for the consensus binding sequence and promoter. (Consider exact and inexact matches.)

P1P.5. Examine the first two bold columns of Figure 3. Find a reason for wondering if our simulation might overestimate the probability of finding a match. (Again consider exact and inexact matches.)