Write a simulation to find NtcA sites in a random sequence

Biol 591

Introduction to Bioinformatics
Notes for Scenario 2: Implementing a simulation

Fall 2003

I. Defining the problem
II. Code to produce the model
III. Code to look for identical matches in the simulated sequence
IV. Program to look for specific sequences in a real genome
V. Finishing up the simulation program

I. Defining the problem

Now that we have a good handle on how DiceRoll.pl works (we do, don't we?), we're prepared to return to the question at hand: How likely is it that something like an NtcA site would pop up at random in a sequence we're looking at? If DiceRoll.pl represents at least in form a typical simulation (I think it does), then a good strategy may be to let that program guide us in writing a simulation that answers our question. To do this, however, there are a few subquestions we need to ask ourselves:

What does it mean to be something like an NtcA-binding site?
What is the sequence we're looking at?
What kind of random sequence would be sufficiently similar to the sequence we're looking at?

Each of these questions may admit more than one reasonable answer. Let's take them on one at a time.

What does it mean to be something like an NtcA-binding site?
The ultimate answer to that question is that an NtcA-binding site is any DNA sequence to which the protein NtcA binds. Unfortunately, to cast the answer that way requires that we do a biochemical experiment for every sequence under consideration, precisely what we're trying to avoid. We therefore need a computational alternative definition.

If you review what we know about NtcA-binding sites (Regulatory Protein and Their Binding Sites, Figure 3), a some alternative definitions suggest themselves: (a) we could define an NtcA-binding site as the very DNA sequence that you found upstream from hetQ and suspected of binding NtcA (GTAacatgagaTACacaatagcatttatatttgcttTAgtaT; see the scenario, panel 3). Clearly, this would be most unreasonable, since you would have been equally amazed to find many similar sequences (for example an identical sequence changed only at the fourth position: GTAtca...). A better idea would be to define a site as any sequence that fits the consensus sequence indicated in Figure 3 (GTA.{8}TAC.{20-24}TA...T). However, examining that figure, you see that half of the biochemically proven NtcA-binding sites do not exactly match this consensus sequence. Perhaps you should allow that a sequence might have one mismatch relative to the consensus and still be considered an NtcA-binding site. Let's go with that more conservative definition.

What is the sequence we're looking at?
How likely is it that you buy a winning lottery ticket? That depends. If you buy several million tickets, your chances aren't that bad. If you buy just one, that's an entirely different matter. If you were looking through a genome's-worth of sequence, it would be pretty remarkable if you did not find an NtcA-binding site. It would be considerably more surprising if you found one looking at a particular 43-bp sequence. We need to carefully consider, what were you looking at? Would you have been equally excited if you found a possible NtcA-binding site a few nucleotides over? Did you look a few nucleotides over? The scenario describes you as examining the upstream sequence to hetQ. The sequence from hetQ happens to be 847 nucleotides (trust me on that one). So we could define the problem as whether there's a DNA sequence no more than one-nucleotide removed from a consensus NtcA-binding site within a string consisting of 847 random nucleotides.

But did we look at 847 nucleotides? Or 847 cyanobacterial nucleotides? Or 847 cyanobacterial nucleotides upstream from a gene? Or 847 cyanobacterial nucleotides upstream from hetQ? Does it matter? That brings us to the next question...

What kind of random sequence would be sufficiently similar to the sequence we're looking at?
We could simulate 847 nucleotides by rolling a four-sided die 847 times. Simple enough, but perhaps too simple. The numbers on a die come up at an equal frequency. Do A, C, G, T come up with an equal frequency in the cyanobacterial strain we work with? As it happens, no: the GC-fraction of Nostoc DNA is about 41%. But that's from an analysis of the entire genome. Perhaps the frequenies of nucleotides are different in regions upstream from genes. Or perhaps the region upstream from hetQ is still more unusual. We'll start simply and add complications as we proceed.

II. Code to produce the model

Bring up DiceRoll.pl , which will serve as the template for the simulation program we're going to write to assess the likelihood of NtcA-binding sites. First thing: Give it a new name and save it. Call it, say, FindNtcA.pl. Now scroll down to the Main Program:

################### MAIN PROGRAM ####################

   foreach$trial (1..$number_of_trials) {
      Roll_dice();
      if (Any_matches()) { $successes = $successes + 1 }
   }

   print "Number of successes: ", $successes, $LF;
   print "Number of trials:    ", $number_of_trials, $LF;
   print "Fraction successful: ", $successes/$number_of_trials, $LF;

FindNtcA will do pretty much the same thing. Instead of rolling a dice, it will make up a random sequence. Instead of checking for matches between dice, it will check for matches in the random sequence to a consensus NtcA-binding site (allowing for one mismatch). For now, the only thing that needs to be changed is the name of the function that builds the model. Change the function name Roll_dice() to, say, Make_random_sequence(), both in the invocation in the Main Program and in the subroutine definition several lines below in the Subroutine section.

Unfortunately, just changing the name won't change what the subroutine does, so we need to think about that. Scroll down to what used to be Roll_dice in the Subroutine section, where you should see:

#### MAKE_RANDOM_SEQUENCE
# Simulates roll of N dice (where N is $number_of_dice)
#    Counts ones, twos, etc, as it goes along
# Counts must be initialized to zero before each roll
sub Make_random_sequence {
   my $die;               # A counter, going from die #1 to the last die
   my $die_value;         # Number of spots shown, i.e. a number between 1 and 6
   $number_of_ones = 0;
   ...
   $number_of_sixes = 0;
   foreach$die (1..$number_of_dice) {
      $die_value = Random_integer(1,6);
      if ($die_value == 1) { $number_of_ones = $number_of_ones + 1 }
      ...
      if ($die_value == 6) { $number_of_sixes = $number_of_sixes + 1 }
   }
}

We have a lot to change. First, let's define what we want the subroutine to do. We want it to make a random nucleotide sequence 847 nucleotides long -- well, let's make it more general -- as long as specified by the variable $sequence_length. We need to be sure to initialize the sequence to nothing ("") before each trial.

Now look at the loop. It reads "For each die taken from a list of numbers going from 1 to the number of dice, do the following:...". Then, each die got a random number from 1 to 6. Our analogous problem is to proceed through each nucleotide position of a sequence and determine at random what nucleotide should occur there.

SQ1. Change the line beginning foreachso that it is appropriate for Make_random_sequence.

Inside the loop, we don't want a die value but rather a nucleotide and don't want a random integer between 1 and 6 but rather a random nucleotide that is either A, C, G, or T. For now, let's pretend that there exists a subroutine called Random_nucleotide ( ) that provides a random nucleotide on demand. There will be one in a moment. For now, scroll down further and insert a fake subroutine that looks like this:

#### RANDOM_NUCLEOTIDE
# Returns A, C, G, or T by some decision making process to be determined later
# Right now, the subroutine is hardwired to always return "G"
sub Random_nucleotide {
return "G";
}

Once you grab a random nucleotide, you can add it to the growing sequence with the .= operator. It works by adding characters on the right to the string on the left. Consider the following example:

my $string = "act";
$string .= "ion"
print $string;

SQ2. What does this program print?

SQ3. Modify Make_random_sequence and test it. To do this you'll need to make several changes besides those already mentioned:

Declare any new variables you made up (my $variable;) some variables (those analagous to $die and $die_value) will be declared in the Make_random_sequence, but the variable holding the sequence itself should be declared in the main list of variables, since it will be used outside the subroutine.
Comment out (precede by #) the statement referring to Any_matches, since that subroutine won't work any more.
Print out the sequence made by Make_random_sequence.
Do something so that the program does not go through 10000 trials!

If that worked, then all that's left to do is to make a more realistic Random_nucleotide subroutine. For now, let's take the easy road and presume that all nucleotides occur at the same frequency. Roll_dice treated an analogous problem this way:

$die_value = Random_integer(1,6);

Inspired by that, here's one rather tedious solution:

sub Random_nucleotide {
   my $nuc_type = Random_integer(1,4);
   if ($nuc_type == 1) {return "A"}
   elsif ($nuc_type == 2) {return "C"}
   elsif ($nuc_type == 3) {return "G"}
   else {return "T"}
}

Somewhat more shorter though perhaps a bit more difficult to understand:

sub Random_nucleotide {
   my $nucleotides = "ACGT";
   my $nuc_type = Random_integer(1,4);
   returnsubstr($nucleotides, $nuc_type, 1)
}

We'll see other ways when we learn about hashes next month.

SQ4. Run Make_random_sequence to see if your new Random_nucleotide subroutine works.

The beauty of putting the randomizing algorithm in a separate subroutine as we have done is that we can make it increasingly more complicated to address the concerns voiced in Section I without affecting any other part of the program.

If all has gone well, you have now completed half the problem: making the model.

III. Code to look for identical matches in the simulated sequence

According to our definition of the problem, we are looking within the simulated sequence for either a match to the consensus NtcA-binding site or a near match, i.e. one that misses by one nucleotide. We don't yet know how to handle mismatches, so for the first version of the program, let's just look for exact matches. This is something you should feel good at: how do you find a match of a pattern in a sequence? First of all, set up the pattern you want to find and assign it to a variable, say $NtcA_site:

my $NtcA_site = "your pattern goes here";

Then, in the subroutine Any_matches, you can put a statement that begins:

if ($sequence =~ /$NtcA_site/) {

SQ5. Make the suggested changes, completing Any_matches so that it returns a true value when a match has been found and a false value otherwise. Then test the program to see if it works thus far.

All that's left is to (a) teach the program how to find nearly identical sites, i.e. those with one mismatch, and (b) make the random sequence generator more realistic. We'll hold off on those tasks for the moment.

IV. Program to look for specific sequences in a real genome

As I tried to imply in Section I, it is always possible to tinker with the model to make it closer to reality. How can you know if the model is close enough? Often you can't, but when there IS the possibility of a comparison with reality, grab it! As it happens, we have in hand the genomic sequence of the cyanobacterial strain, Anabaena/Nostoc PCC 7120, that is the object of our study. It is therefore possible to search the genome exhaustively, looking for putative NtcA-binding sites or for any other sequence that strikes our fancy.

How can we use this ability to test our model? Not an easy question because we haven't made very clear what purpose we hope that the model will serve. The model is to help us discern what frequency we should expect to encounter NtcA-binding sites (or any other sequence) solely by chance, without regard to selection for or against function of the site. "Solely by chance" we will see is not so easy to implement. But, pressing on, what if we find that NtcA-binding sites are more common in the real genome than in our model sequences? Should we conclude that our model then no good? No, because perhaps there is selection for NtcA-binding sites in the real genome, because there is selection for the function of those sites and their ability to regulate gene expression. OK, what if we find that NtcA-binding sites are less common in the real genome than in our model sequences? Still no problem, because pehaps there is selection against extraneous NtcA-binding sites so as not to titrate out NtcA and diminish its ability to find functional NtcA-binding sites in the genome. In fact, given our stated purpose, comparing the frequency in the genome vs the simulation of a sequence known to have biological function is not very informative.

It might be more useful to compare the frequency of a sequence similar to NtcA-binding sites but having no (known) function. You'd think that if our model captures all the overall features of natural DNA, then it should be able to predict accurately the frequency of most DNA sequences but not those whose frequencies are specifically affected by their function.

SQ6. Think about what specific sequence you might make up that is as similar as possible to putative NtcA-binding sites but probably do not bind NtcA (or do anything else of biological significance). What sequence(s) can you come up with?

FindMatches.pl is a program that searches through a large sequence looking for identical matches of a short sequence and (optionally) matches that differ by no more than one nucleotide. It comes with two data files: AnabaenaChromosome.nt, containing the entire chromosome of the cyanobacterial strain we study, and AnabSeq.nt, containing a small subset of the chromosome, just for testing purposes.

SQ7. Download the program and the data files. What sequence file is it set up to use? What pattern is it set up to use? Where do you expect to find the output (i.e. in a file or on the screen)? When you're satisfied you know the answers to these questions, run the program.

SQ8. Did the program work? How can you tell? If you can't, then devise a test that will convince you that the program does what it claims to do.

SQ9. Run the program with a sequence that you came up with in SQ5. (If you choose to use the full chromosome, expect a wait a bit for the program to complete the search -- 8 min on my computer)

V. Finishing up the simulation program

We left two tasks undone to complete FindNtcA. The first was to teach the program how to find nearly identical sites. Fortunately, we now see that Find_matches provides the solution.

SQ10. Identify the sections of FindMatches that finds nonidentical matches and copy them into FindNtcA. Copy also the section of the program that evokes these subroutines and any other statement necessary to make them work.

SQ11. Look over these sections. From their descriptions, you may find that it identifies identical matches in a different way than you did in FindNtcA. How so? Which way is better? Try to get FindNtcA to work with these new additions.

The second task we have remaining is to make the random sequence generator more realistic. To get the program running, we assumed, with no justification, that all nucleotides occur with the same frequency. In fact, the frequency of nucleotides in the chromosome of Anabaena is [G] = [C] = 21% and [A] = [T] = 29%.

SQ12. Change Make_random_sequence so that it creates a sequence based on the true nucleotide frequencies. (Hint: You might find it useful to get a random integer from 1 to 100)

Even this model is too simplistic (as discussed in Section I). We'll continue tweaking in the problem set.