New Perl features in FindMotif


The FindMotif.pl program uses position-specific scoring matrices to search for sequences similar to a consensus sequence. Here we'll go over some new Perl features that the program uses. You can download the program, or look at a formatted version of it. To run the program you'll need 71NpNtsm.txt, the set of motif sequences (and also NostocChromosome.nt, which you should already have).

Hashes

Hashes are sometimes called associative arrays. Like arrays, they associate a key or index with a value.  Unlike arrays, the key or index is not an integer but a string, a sequence of characters in quotes.  Here is the definition of a hash from the FindMotif.pl program:
my %bg_frequency =         # nucleotide frequencies of overall sequences
("A" => 0.33264,
"T" => 0.33264,
"G" => 0.16736,
"C" => 0.16736);
This statement stores four key-value pairs in %bg_frequency. We use curly brackets to refer to them: $bg_frequency{"A"} is 0.33264, $bg_frequency{"G"} is 0.16736. and so forth.  We can work with hashes much like we do with arrays.  The following piece of code
   foreach my $nucleotide ("A", "C", "G", "T") {
$log_p{$nucleotide} = log($bg_frequency{$nucleotide});
}
is quite similar to what we might do with arrays, if we had decide to code nucleotides as numbers, using 0 for "A" , 1 for "C", 2 for "G" and 3 for "T":
   foreach my $nucleotide (0, 1, 2, 3) {
$log_p_array[$nucleotide] = log($bg_frequency_array[$nucleotide]);
}
Two differences are the square brackets, and the fact that using arrays we need to remember the coding we used.  Another difference is that we'd likely have coded the array version a bit more compactly, using just the starting and ending numbers:
   foreach my $nucleotide (0 .. 3) {
$log_p_array[$nucleotide] = log($bg_frequency_array[$nucleotide]);
}
That's not possible with hashes -- we need to specify each key.

Study Question 1:  How many words are there in the dictionary between "CAT" and "DOG"? Is that even a reasonable question?  How many sequences fall between "CAT" and "GGG"? How often would an abbreviation like "CAT".."DOG" or "CAT".."GGG" be useful?


We can have two dimensional hashes just as we can have two-dimensional arrays:
   $dictionary{"Webster"}{"cat"} = "noun";
 is a perfectly good statement.  In fact, we can mix and match: In the FindMotif.pl program, the expression $n{"A"}[23] stands for the number of times we find the nucleotide adenosine at column 23 of the motif sequences. Here's how that's calculated:
   foreach my $position (0..$motif_length-1) {
my $nucleotide = substr($motif, $position, 1);
$n{$nucleotide}[$position] = $n{$nucleotide}[$position] + 1;
}
The inner two lines are much shorter than the equivalent:

my $nucleotide = substr($motif, $position, 1);
if ($nucleotide eq "A") {
$n{"A"}[$position] = $n{"A"}[$position] + 1;
}
elsif ($nucleotide eq "C") {
$n{"C"}[$position] = $n{"C"}[$position] + 1;
}
elsif ($nucleotide eq "G") {
$n{"G"}[$position] = $n{"G"}[$position] + 1;
}
elsif ($nucleotide eq "T") {
$n{"T"}[$position] = $n{"T"}[$position] + 1;
}

--which is the point of using hashes. If we used arrays and coded bases numerically we'd be forced to be lengthy:
   my $nucleotide = substr($motif, $position, 1);
if ($nucleotide eq "A") {
$n_array[0][$position] = $n_array[0][$position] + 1;
}
elsif ($nucleotide eq "C") {
$n_array[1][$position] = $n_array[1][$position] + 1;
}
elsif ($nucleotide eq "G") {
$n_array[2][$position] = $n_array[2][$position] + 1;
}
elsif ($nucleotide eq "T") {
$n_array[3][$position] = $n_array[3][$position] + 1;
}

Understanding Loops

Whether or not we use hashes, figuring out what's going on in a loop can take some close analysis. Here's a piece of complicated code:
   # Calculate %log_q
#
$N = @motif_sequences;
foreach my $nucleotide ("A", "C", "G", "T") {
foreach my $position (0..$motif_length) {
$log_q{$nucleotide}[$position] =
log( ($n{$nucleotide}[$position] + $B * $bg_frequency{$nucleotide})
/ ($N + $B)
);
}
}
It might help to think about it in two steps: first the outer part, which tells us where we're calculating, has a pattern something like this:
   foreach my $nucleotide ("A", "C", "G", "T") {
foreach my $position (0..$motif_length) {
Do Something (with this nucleotide, at this position)
}
}
Without knowing what  Do Something is, we can still guess that we're calculating some quantity for each nucleotide, at each position of the motif.  With that in mind, we can look at the formula itself:
   $log_q{$nucleotide}[$position] =
log( ($n{$nucleotide}[$position] + $B * $bg_frequency{$nucleotide})
/ ($N + $B)
);
Here we see that we're calculating log_q for a particular nucleotide and position. To do the calculation we use information from n, which depends on the nucleotide and position; from bg_frequency, which depends only on the nucleotide, and is the same at all positions; and from B and N, which don't vary at all: they're the same for every nucleotide and position.

Knowing those dependencies, we can think of the formula this way:
   log_q = log((n + B * bg_frequency) / (N + B))
That's simpler for humans to read, and we can temporarily overlook the computer details to concentrate on the math.  For this formula we have some hints from Monday: first, B and bg_frequency are primarily there to make sure the frequency doesn't become zero; so (n + B * bg_frequency) / (N + B) is an adjustment of the much simpler  n/N , where n is the number of times a particular nucleotide occurs, and N is the total number of sequences in which it could occur. Ah! n/N is the frequency of the nucleotide, and if we trust our sources that (n + B * bg_frequency) / (N + B) is a reasonable adjustment of the frequency, we can think of it as being more or less the same as the frequency. So we're left with something like:
   log_q = log(the frequency of the nucleotide at this position)
And from Monday we also remember that we're using logs here just to speed things us: we'd really like to multiply frequencies over all positions. Since adding the logs of the frequencies is equivalent to doing the multiplications, and adding is faster on computers than multiplying, we store the log of the frequency instead of the frequency itself.

So now we can go back to our outer loop:
   foreach my $nucleotide ("A", "C", "G", "T") {
foreach my $position (0..$motif_length) {
Do Something (with this nucleotide, at this position)
}
}
and change the Do Something to reflect our understanding:
   foreach my $nucleotide ("A", "C", "G", "T") {
foreach my $position (0..$motif_length) {
Calculate the frequency (of this nucleotide, at this position)
}
}
To make even more of a summary, we can write:
   for each nucleotide  {
for each position {
Calculate the frequency of this nucleotide, at this position
}
}
This (or something like it) was what the programmer had in mind when the loop was written. It would be nice if that were all the computer needed to know! But unfortunately we need to give it explicit directions on just how log_q and n depend on the nucleotide and position, and how bg_frequency depends on the nucleotide.

Sorting

Here's another bit of complicated code:
sub sort_and_discard_excess_values {
@hit_info = reverse sort { $$a[0] <=> $$b[0] } @hit_info;
while (@hit_info > $number_to_save) { pop @hit_info }
}
What the programmer had in mind here was something like this:
   sort @hit_info (by score);
throw away the low scoring hits
Let's take it on faith for the moment that the first statement sorts the @hit_info array with the highest scoring hits to the left, so that $hit_info[0] contains the best hit found so far, $hit_info[1] has the next best hit, and so forth. If that's true, we can tackle the second part: throwing away the low scoring hits. Another way to look at it is that we're keeping the high-scoring hits, and that $number_to_save tells us how many to keep. So what we want to do is reduce the size of @hit_info to $number_to_save by throwing away the items at the end.  There are various ways to do this. Here's one that builds on what we know. The pop command decreases the size of an array by exactly one; so if the size of @hit_info is larger than $number_to_save, we can go one step in the right direction with pop @hit_info. So
   while (@hit_info > $number_to_save) { pop @hit_info }
keeps checking the size of @hit_info, and popping one item off the array, until we've thrown away all but $number_to_save hits.

Now we can go back to the first part:
   @hit_info = reverse sort { $$a[0] <=> $$b[0] } @hit_info;
There are four things going on here:
  1. We're sorting the @hit_info array (sort);
  2. We're telling sort how to sort the array ({ $$a[0] <=> $$b[0] }), which as we'll see sorts @hit_info so that the lowest scores come first;
  3. We're reversing the order of the result, so that the highest scores come first; and finally
  4. We're putting the sorted and reversed result back into @hit_info.
This is a little more complicated than the usual sort, which sort by dictionary order.  For instance,
   @a = sort ("the", "quick", "red", "fox");
print join(" ", @a), "\n"
prints
   fox quick red the
That's O.K. for words, but doesn't work as well for numbers:
   @a = sort (1, 5, 17, 18, 100, -4, -5);
print join(" ", @a), "\n"
prints
   -4 -5 1 100 17 18 5
sorting the numbers in order as if they were words.  The correct numeric order is
   -5 -4 1 5 17 18 100
Perl lets us specify numeric order with a special block of code:
   @a = sort { $a <=> $b } (1, 5, 17, 18, 100, -4, -5);
print join(" ", @a), "\n"
Here $a and $b are special identifiers -- they aren't really connected to the rest of the program. The block of code should evaluate to -1 if $a < $b, 0 if $a = $b, and 1 if $a > $b.

As it happens, <=> is a special Perl operator that does just that, for numbers. The equivalent operator for strings is cmp, so
   @a = sort ("the", "quick", "red", "fox");
and
   @a = sort { $a cmp $b } ("the", "quick", "red", "fox");
do exactly the same thing.

Study question 2: What does the following code do? (Notice the order of $a and $b)
   @a = sort { $b <=> $a } (1, 5, 17, 18, 100, -4, -5);
How about this?
   @a = sort { $b cmp $a } ("the", "quick", "red", "fox");

Now that we know about , we could sort an array of numeric scores.  If @hit_info had only numbers in it, we could sort it like this:
   sort { $a <=> $b } @hit_info
But that's not what we see. The program actually says:
   sort { $$a[0] <=> $$b[0] } @hit_info;
To understand this, it helps to look at what goes into @hit_info. The crucial statement is in the store_hit_information routine:
sub store_hit_information {

my ($name, $score, $start, $subregion) = @_;
push @hit_info, [$score, $name, $start, $subregion];
...
}
Here we're actually creating a two-dimensional array, row by row. Each time the routine is called it adds another row to @hit_info, namely:
   [$score, $name, $start, $subregion]
Each row stores the information for one hit. So $hit_info[0][0] is the score of the first hit, $hit_info[0][1] is the name, $hit_info[0][2] is the starting position of the hit, and $hit_info[0][3] is the hit itself, the subregion that corresponds to the consensus sequence.

Unfortunately Perl treats array like
   [$score, $name, $start, $subregion]
differently from the usual arrays.  We say
   my @a = ($score, $name, $start, $subregion);
print $a[0];
but
   my $a = [$score, $name, $start, $subregion];
print $$a[0];
The reasons for the difference are, as they say, beyond the scope of this web page. But now we can make some sense out of the code block for the sort:
   sort { $$a[0] <=> $$b[0] } @hit_info;
We are sorting the rows of @hit_info. Each row contains one hit, and the key for the sort is item 0 of the row, which is the score of the hit:$$a[0] is the score of one hit and $$b[0] is the score of the othert. Since the score are numbers, we use the numeric comparison operator, <=>. Finally, since our code block sorts the array smallest-first, we use reverse to change that to largest first:
   @hit_info = reverse sort { $$a[0] <=> $$b[0] } @hit_info;