Arrays and for loops


Here's a small piece of code that will print out the contents of an array:
foreach my $color (@rainbow) {
   print "$color\n";
}
This is the kind of loop we've seen so far. There's a different way to do the same thing:
for (my $j = 0; $j < @rainbow; $j = $j+1) {
   print "$rainbow[$j]\n";
}
What's going on here? If @rainbow holds, for instance,
("red", "orange", "green", "blue", "indigo", "violet")
then we can use numbers to pick out any item in the @rainbow array, counting from 0: $rainbow[0] is "red", $rainbow[1] is "orange", and $rainbow[5] is "violet". The number in brackets is called an index, or a subscript.
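Here is a minimal sketch of indexing, assuming the six-color @rainbow above. The variable names $first and $last are just for illustration, and the negative index on the last line is standard Perl that goes beyond what these notes have introduced.

```perl
#!/usr/bin/perl -w
use strict;

my @rainbow = ("red", "orange", "green", "blue", "indigo", "violet");

# Indexes count from 0, so a six-item array uses indexes 0 through 5.
my $first = $rainbow[0];
my $last  = $rainbow[5];
print "first: $first\n";    # first: red
print "last: $last\n";      # last: violet

# Not covered in these notes: index -1 counts back from the end of the array.
print "also last: $rainbow[-1]\n";   # also last: violet
```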

So one way of printing the rainbow would be:
print "$rainbow[0]\n";
print "$rainbow[1]\n";
print "$rainbow[2]\n";
print "$rainbow[3]\n";
print "$rainbow[4]\n";
print "$rainbow[5]\n";

That's a lot of typing, and we can see a simple pattern: on each line we increase the index by 1. We could have written the above as
my $j = 0;
print "$rainbow[$j]\n";
$j = $j+1;
print "$rainbow[$j]\n";
$j = $j+1;
print "$rainbow[$j]\n";
$j = $j+1;
print "$rainbow[$j]\n";
$j = $j+1;
print "$rainbow[$j]\n";
$j = $j+1;
print "$rainbow[$j]\n";

Well, that's even more typing; but now the pattern is even simpler. A for loop packages up the identical parts so we don't have to type them all out. When Perl sees a for loop like this:
for (my $j = 0; ... ; $j = $j + 1) {
   print "$rainbow[$j]\n";
}

it does the first section of the first line (my $j = 0) exactly once, before the rest of the loop. It does the third part of the first line ($j = $j + 1) between every execution of the main body of the loop (print "$rainbow[$j]\n").

There's just one more thing that our for loop needs: a way to know when to stop and when to keep going.  That's what $j < @rainbow does. When we mention an array like @rainbow in a place we'd normally expect to see a number, Perl  uses the number of items in the array.  So $j < @rainbow is true when $j is less than the number of items in @rainbow. In the loop
for (my $j = 0; $j < @rainbow; $j = $j+1) {
   print "$rainbow[$j]\n";
}

we keep going as long as $j < @rainbow is true. In our example @rainbow has six elements. Since $j is 0 the first time through the loop and we increase it by one each time through, $j takes on the values 0, 1, 2, 3, 4, and 5.
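To check this for ourselves, here is a small sketch that prints the item count and records every value $j takes. The names $size and @seen are made up for this example, and push (which adds an item to the end of an array) may not have come up in these notes yet.

```perl
#!/usr/bin/perl -w
use strict;

my @rainbow = ("red", "orange", "green", "blue", "indigo", "violet");

# Assigning an array to a scalar variable also gives the item count,
# not the items themselves.
my $size = @rainbow;
print "size: $size\n";          # size: 6

# Record each value $j takes while the test $j < @rainbow is true.
my @seen;
for (my $j = 0; $j < @rainbow; $j = $j + 1) {
    push @seen, $j;
}
print "j values: @seen\n";      # j values: 0 1 2 3 4 5
```

The loop stops as soon as $j reaches 6, because 6 < 6 is false, so the body never runs with an index past the end of the array.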

That's better than typing out all the print lines, but still more complicated than
foreach my $color (@rainbow) {
   print "$color\n";
}
Why bother? One reason is that sometimes we want to look at only part of an array. For instance, we could print all but the endpoints of the rainbow with:
for (my $j = 1; $j < @rainbow-1; $j = $j+1) {
   print "$rainbow[$j]\n";
}

SQ1: What are the endpoints of the rainbow? How does the loop avoid printing them?

Another reason is that we might want to compare different array items, or make a computation using them. For instance, suppose we wanted to count the number of duplicate pairs in an array, where two adjacent items are identical.  The @rainbow array has no duplicates; the array
@a = ("red", "red", "green", "blue", "blue", "red", "orange")
has two: $a[0] and $a[1] are both "red", and $a[3] and $a[4] are both "blue".  (We don't count $a[5] as a duplicate because it's not next to another "red" item.)

Here's a subroutine that will count duplicate pairs:
sub duplicate_pairs {
   my (@subject) = @_;
   my $count = 0;
   for (my $j = 1; $j < @subject; $j = $j + 1) {
      if ($subject[$j] eq $subject[$j-1]) {
         $count = $count + 1;
      }
   }
   return $count;
}
SQ2: Why does this loop start from 1 (my $j = 1)? What would happen if that were $j = 0 instead?

SQ3: What does eq mean here? The notes don't mention it, but they do say what the subroutine is supposed to do. Can you deduce the meaning of eq?

Here's a program to test the subroutine:
#!/usr/bin/perl -w
use strict;

my @rainbow = ("red", "orange", "green", "blue", "indigo", "violet");
my @a = ("red", "red", "green", "blue", "blue", "red", "orange");
print "The rainbow has ", duplicate_pairs(@rainbow), " duplicate pairs\n";
print "The array \@a has ", duplicate_pairs(@a), " duplicate pairs\n";

It prints:
The rainbow has 0 duplicate pairs
The array @a has 2 duplicate pairs
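To run the test program, the duplicate_pairs subroutine has to be in the same file. The sketch below does that and adds a few edge cases; the arrays @empty, @single and @triple are made up for this example and are not part of the original test.

```perl
#!/usr/bin/perl -w
use strict;

my @empty  = ();
my @single = ("red");
my @triple = ("red", "red", "red");

print "empty: ",  duplicate_pairs(@empty),  "\n";   # empty: 0
print "single: ", duplicate_pairs(@single), "\n";   # single: 0
print "triple: ", duplicate_pairs(@triple), "\n";   # triple: 2

# Copied unchanged from the notes: counts adjacent identical items.
sub duplicate_pairs {
   my (@subject) = @_;
   my $count = 0;
   for (my $j = 1; $j < @subject; $j = $j + 1) {
      if ($subject[$j] eq $subject[$j-1]) {
         $count = $count + 1;
      }
   }
   return $count;
}
```

Note that three identical items in a row count as two pairs, because the middle item belongs to both.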

Now suppose we want to answer a slightly different question: how many duplicates do we have that are separated by a distance of 2? Or 3? Or some other distance? Our duplicate_pairs subroutine answers the question for a distance of 1.

Here is the start of a subroutine to answer the more general question:
sub distant_pairs {
   my ($distance, @subject) = @_;

We would call this as distant_pairs(4, @a), for instance, to calculate the number of duplicate pairs in @a with a distance of  4; and distant_pairs(1, @a) should give the same answer as duplicate_pairs(@a).

How should we change the body of the duplicate_pairs subroutine? Looking at it,
   my $count = 0;
   for (my $j = 1; $j < @subject; $j = $j + 1) {
      if ($subject[$j] eq $subject[$j-1]) {
         $count = $count + 1;
      }
   }
   return $count;

we see that there is a 1 in three places: in my $j = 1, in $j = $j + 1, and in $subject[$j - 1]. Since this loop finds duplicate pairs with a distance of 1 between them, and the distant_pairs routine is to find pairs that are $distance apart, where $distance may or may not be 1, it seems like changing 1 to $distance in some or all of those three places might do the trick.

The program distant-pairs.pl is set up to help you test which of those three places to change.

SQ4: Download distant-pairs.pl
#!/usr/bin/perl -w
use strict;

my @a = ("red", "red", "green", "blue", "blue", "red", "orange");

print "Distance: 1; pairs: ", distant_pairs(1, @a), "\n";
print "Distance: 2; pairs: ", distant_pairs(2, @a), "\n";
print "Distance: 3; pairs: ", distant_pairs(3, @a), "\n";
print "Distance: 4; pairs: ", distant_pairs(4, @a), "\n";
print "Distance: 5; pairs: ", distant_pairs(5, @a), "\n";
print "Distance: 6; pairs: ", distant_pairs(6, @a), "\n";
print "Distance: 7; pairs: ", distant_pairs(7, @a), "\n";
print "Distance: 8; pairs: ", distant_pairs(8, @a), "\n";

sub distant_pairs {
   my ($distance, @subject) = @_;
   my $count = 0;
   for (my $j = 1; $j < @subject; $j = $j + 1) {
      if ($subject[$j] eq $subject[$j-1]) {
         $count = $count + 1;
      }
   }
   return $count;
}

Replace some or all of the 1's in distant_pairs with $distance. The correct program will print
Distance: 1; pairs: 2
Distance: 2; pairs: 0
Distance: 3; pairs: 0
Distance: 4; pairs: 1
Distance: 5; pairs: 1
Distance: 6; pairs: 0
Distance: 7; pairs: 0
Distance: 8; pairs: 0

Which places should you replace? (Test by running the changed program. Your humble author got it wrong on his first try.) Why are those the right ones to change?

There is much repetition in the Distance... lines. Perhaps we can help that with a for loop also.

SQ5: Replace the eight lines print "Distance:..., etc. with a for loop that does the same thing.

Here's a possible use for such a program: what duplicate pairs are to a list of colors, repeated sequences are to DNA. Suppose that you had looked at the beginning of a large number of genes, each, of course, beginning with a start codon:

atpA: ATGAGCATTTCAATTAGACCTGACGAAATCAGCAGTATTATTCAGCAGCA . . .
atpC: ATGCCTAATCTCAAATCAATACGCGATCGCATTCAGTCGGTCAAAAACAC . . .
atpD: ATGACAAGTAAAGTAGCAAACACTGAGGTAGCTCAACCTTACGCTCAGGC . . .
. . .
zam:  ATGGAATTTTCAATCGCTACACTCCTTGCCAATTTCACCGATGATAAATT . . .
You're curious whether there's a pattern of nucleotides within genes. Do A's tend to come in clusters? Or are they spaced in a patterned fashion? You might write a program that counts nucleotides at each position:
Position    1    2    3    4    5    6    7    8    9  . . .
A          74    0    0   27   25   34   26   30   39
C           0    0    0   23   22   18   20   19   15
G          18    0    0   22   24   13   24   22   14
T           8  100  100   38   29   35   29   28   32
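A counting program along those lines might look like the sketch below. It is only an illustration: the four six-letter sequences in @genes are invented (they are not the data behind the table), and it uses substr and a hash of arrays, neither of which these notes have introduced.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical, shortened gene starts, made up for this example.
my @genes  = ("ATGAGC", "ATGCCT", "GTGACA", "ATGGAA");
my $length = 6;

# $count{$base}[$pos] is how many genes have $base at position $pos,
# with positions counted from 0 like any other array index.
my %count;
foreach my $base ("A", "C", "G", "T") {
    for (my $pos = 0; $pos < $length; $pos = $pos + 1) {
        $count{$base}[$pos] = 0;
    }
}

foreach my $gene (@genes) {
    for (my $pos = 0; $pos < $length; $pos = $pos + 1) {
        my $base = substr($gene, $pos, 1);   # the single letter at $pos
        $count{$base}[$pos] = $count{$base}[$pos] + 1;
    }
}

foreach my $base ("A", "C", "G", "T") {
    my @row = @{ $count{$base} };
    print "$base: @row\n";
}
# Prints:
# A: 3 0 0 2 1 2
# C: 0 0 0 1 2 1
# G: 1 0 4 1 1 0
# T: 0 4 0 0 0 1
```

Each column adds up to 4, the number of sequences, which is a quick way to check that no letter was missed.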

You might also wonder whether A's tend to follow each other directly, or with a spacing of 1, or 2, or 3, or 27, or ...?

Here is a program, microsat.pl, which does that job.