Biol 591 
Introduction to Bioinformatics
Notes for Scenario 2: Arrays and loops
Fall 2003 

  I. Arrays
 II. Loops

Suggested Reading: Beginning Perl (Simon Cozens), Chapter 3, pp.75-104; Chapter 4, pp.129-131, 141-143

I. Arrays
I.A. The need for arrays

We want to write a simulation that will help us judge whether an NtcA-binding site in some sequence we examined is a cause for rejoicing. Before we tackle that problem, let's take a look at a simple simulation. The program DiceRoll.pl was written to simulate the throw of five dice, to predict the likelihood that you'll find at least one pair of dice. This isn't really far from our problem. We're going to simulate the throw of some number of nucleotides, to predict the likelihood that you'll find in them an NtcA-binding site.

First of all, download and run the program to see what it does. In a few seconds you should see a report that success occurred approximately 91% of the time. Is this a reasonable prediction of the probability of getting at least one pair in a roll of five dice? We'll examine that question later. For now, take a look at DiceRoll. Here's the Main Program:

################### MAIN PROGRAM ####################

   foreach $trial (1..$number_of_trials) {
      Roll_dice();
      if (Any_matches()) { $successes = $successes + 1 }
   }

   print "Number of successes: ", $successes, $LF;
   print "Number of trials:    ", $number_of_trials, $LF;
   print "Fraction successful: ", $successes/$number_of_trials, $LF;

A word on the naming conventions in these listings: variables are always preceded by $ and always lower case; Perl key words (or functions) are always lower case; and my functions/subroutines always begin with an upper case letter and are followed by a pair of parentheses. (In the original notes these three groups are colored red, blue, and purple, respectively.)

To translate this into English, we're going to go through each trial, from the first to the number of trials specified. During each trial, we'll roll the dice, and if there are any matches, then we'll increase the number of successes by one. After that's done, we'll print some statistics.

Sounds like a plan, but obviously we're going to have to teach Perl what "Roll_dice" means and how to determine whether there are any matches. Scrolling down (do it!), I encounter a program unit called Roll_dice. Here's the main part of that subroutine:

   foreach $die (1..$number_of_dice) {
      $die_value = Random_integer(1,6);

      if ($die_value == 1) { $number_of_ones = $number_of_ones + 1 }
      if ($die_value == 2) { $number_of_twos = $number_of_twos + 1 }
      if ($die_value == 3) { $number_of_threes = $number_of_threes + 1 }
      if ($die_value == 4) { $number_of_fours = $number_of_fours + 1 }
      if ($die_value == 5) { $number_of_fives = $number_of_fives + 1 }
      if ($die_value == 6) { $number_of_sixes = $number_of_sixes + 1 }
   }

Translate this into English: We're going to go through each die, from the first die to however many dice there are. During each roll, we'll assign to the die an integer from 1 to 6. If the value is 1, then we'll score one more for the ones. If the value is 2, then we'll score one more for the twos. If the value is 3, then... hey, this is getting tedious!

Fortunately, there is a better way. Suppose you were rolling not five but, say, five hundred dice (so you couldn't perceive the numbers at a glance). In that case, you might use a table like the one below, and each time the die shows a certain number, you'd add one to the number in the appropriate column.
 

   value    1   2   3   4   5   6
   count    0   1   0   2   1   1
   offset   0   1   2   3   4   5

Perl makes this strategy available to you, letting you refer to the entire table as a single array, which you might call @count. Perl identifies array variables with the prefix @ just as it identifies scalar variables with the now familiar $.

SQ1. Back to the question of dice probability... Can you calculate the probability of getting at least one matched pair in a roll of five dice?

I.B. Accessing values within arrays

We need to be able to access individual elements within the array. This is done by subscripts, just as in matrix notation. You might think that the number of times a 1 appeared on the dice would be represented by @count1 (i.e. count-sub-1). This is not the way Perl does it. First, subscripts need a more practical means of representation, and Perl uses square brackets (e.g. [1]) to indicate the subscript. So now we can refer to the number of times a one has appeared on a die by @count[1]... well, no. Perl uses @ to refer only to the array as a whole. When you wish to refer to a single element within an array, the scalar sign, $, is used. Finally, Perl does not count as humans would, starting from 1, but rather starts from 0. This is because of how arrays are stored and accessed. Computers remember the memory location of the first element of the array and then calculate the position of later elements:

   Position of element = Position of first element + offset*(bytes per element)

This relationship works only if "offset" begins with 0, as shown in the table above. Bottom line: Array elements start counting from 0. The number of times a one has appeared would then be $count[0].
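The zero-based numbering is easy to check for yourself. The short sketch below loads the counts from the table above into @count and looks up a few elements by offset. The names $first_element, $fourth_element, and $last_index are made up for this illustration and don't appear in DiceRoll.pl.

```perl
use strict;
use warnings;

# Counts from the table above: the value 1 appeared 0 times, the value 2 once, etc.
my @count = (0, 1, 0, 2, 1, 1);

my $first_element  = $count[0];   # offset 0: the count for the value 1
my $fourth_element = $count[3];   # offset 3: the count for the value 4
my $last_index     = $#count;     # $#count gives the highest offset in @count

print "count for value 1: $first_element\n";
print "count for value 4: $fourth_element\n";
print "highest offset:    $last_index\n";
```

Six elements, so the offsets run from 0 to 5, just as in the bottom row of the table.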

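Returning to DiceRoll, we can eliminate all those $number_of_... variables. The version below is written in human style, keeping the tally for ones in $count[1] and so on up to $count[6], and leaving $count[0] unused: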

   foreach $die (1..$number_of_dice) {
      $die_value = Random_integer(1,6);

      if ($die_value == 1) { $count[1] = $count[1] + 1 }
      if ($die_value == 2) { $count[2] = $count[2] + 1 }
      if ($die_value == 3) { $count[3] = $count[3] + 1 }
      if ($die_value == 4) { $count[4] = $count[4] + 1 }
      if ($die_value == 5) { $count[5] = $count[5] + 1 }
      if ($die_value == 6) { $count[6] = $count[6] + 1 }
   }

Doesn't save a whole lot. Arrays come into their own when variables are used as subscripts. Doing this, the code above can be condensed considerably, to just:

   foreach $die (1..$number_of_dice) {
      $die_value = Random_integer(1,6);
      $count[$die_value] = $count[$die_value] + 1;
   }
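To try the condensed loop on its own, it can be wrapped in a small stand-alone script. This is only a sketch: the body of Random_integer below is my assumption about what such a subroutine might look like (DiceRoll.pl has its own version, which isn't shown in these notes), and the seven-zero initialization anticipates the next section.

```perl
use strict;
use warnings;

# Stand-in for the Random_integer subroutine of DiceRoll.pl (an assumption).
# Returns a random integer between $low and $high, inclusive.
sub Random_integer {
   my ($low, $high) = @_;
   return $low + int(rand($high - $low + 1));
}

my $number_of_dice = 5;
my @count = (0) x 7;    # seven zeros: $count[0] is unused, $count[1] .. $count[6] hold the tallies

# One tally line replaces the six if statements
foreach my $die (1..$number_of_dice) {
   my $die_value = Random_integer(1,6);
   $count[$die_value] = $count[$die_value] + 1;
}

# Report the tally for each possible value
foreach my $value (1..6) {
   print "value $value appeared $count[$value] time(s)\n";
}
```

The output changes from run to run, but the six tallies always add up to the number of dice.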

SQ2. Suppose there were an array @letters that contained all the uppercase letters of the alphabet in alphabetical order. How would you refer to the element that contained the letter "J"?

I.C. Assigning values to arrays

Every time Roll_dice is called, it's necessary to zero out the totals from the previous roll. How to do that? The sure but tedious way would be:

   $count[0] = 0;
   $count[1] = 0;

and so forth. This is crude beyond words. The loop solution is only a bit better:

   foreach $value (0..6) {
      $count[$value] = 0;
   }

Perl provides a few ways to assign values to arrays all at once. Here are three:

   @count = (0, 0, 0, 0, 0, 0, 0);        # Assigns 0 to $count[0] .. $count[6]
or
   @count = (0) x 7;                      # Assigns 0 to $count[0] .. $count[6]
or
   @digits = reverse (0..9);              # Assigns 9 to $digits[0], 8 to $digits[1], etc

The first method describes a temporary unnamed array (the contents between the parentheses) and assigns the contents of that array to @count. The second method does the same thing but in a more general fashion, repeating the element (0) as many times as specified after the operator x. Note that in both cases I assigned seven zeros, not six as you might expect from the number of sides on a die. That's because the program is written in human style rather than Perl style, defining $count from [1] to [6]. So I have to remember to provide one zero for $count[0] and six for $count[1] through $count[6]. The third method uses m .. n notation, meaning every integer from m to n, and then reverses the order of the resulting list.
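Here is a small sketch that runs the three kinds of assignment side by side and prints the results. The array names @count_listed and @count_repeated are invented for this illustration so that the two ways of making seven zeros can be compared.

```perl
use strict;
use warnings;

my @count_listed   = (0, 0, 0, 0, 0, 0, 0);   # seven zeros written out one by one
my @count_repeated = (0) x 7;                 # seven zeros made with the x operator
my @digits         = reverse (0..9);          # 9, 8, 7, ... down to 0

# An array inside double quotes is printed with spaces between its elements
print "listed:   @count_listed\n";
print "repeated: @count_repeated\n";
print "digits:   @digits\n";
```

The first two lines of output should be identical, which is the point: the x operator simply saves typing.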

Another way to assign values to an array is to push them on. Here's an example (taken from BlastParser.pl):

   push @query_info, $subject_name, $subject_description, $subject_length, $expectation;

The push function adds values to the end of an array, without your having to know where that end is. The pop function lops off values from the end. You should imagine a spring at the beginning of the array. Each push presses a new value onto the high end of the array, while pop releases a value. Two other functions, shift and unshift, work from the other end but have names derived from an entirely different metaphor (perhaps moving boxes along a conveyor belt). Any element that's either popped off the high end or shifted off the low end is lost to the array.
 

This example illustrates that arrays can contain numbers, strings, or anything else that can be represented in Perl.
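The four functions can be seen at work in the sketch below, which uses a made-up array of numbers (@values and the two scalar names are not from any of the course programs). The comments show what the array should contain after each step.

```perl
use strict;
use warnings;

my @values = (10, 20, 30);

push @values, 40;                # add to the high end:        10 20 30 40
unshift @values, 5;              # add to the low end:       5 10 20 30 40
my $from_high = pop @values;     # remove from the high end: 5 10 20 30
my $from_low  = shift @values;   # remove from the low end:    10 20 30

print "popped $from_high, shifted $from_low, remaining: @values\n";
```

Note that pop and shift hand back the element they remove, so you can capture it in a scalar before it is lost to the array.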

SQ3. In the example below, predict what lines will be printed, then run the full program push_shift.pl to find out.

   @protein = ("cytochrome oxidase", "hexokinase", "glutamine synthetase");
   push @protein, "phosphofructokinase", "albumin";
   $protein[1] = "deleted";
   unshift @protein, "globin";
   $name1 = pop @protein;
   $name2 = shift @protein;
   $name3 = shift @protein;
   print "name1 = $name1  name2 = $name2  name3 = $name3", $LF;
   print "current protein[2] = $protein[2]", $LF;
   print "remaining names: ", join(", ", @protein);

SQ4. Rewrite the following lines of DiceRoll.pl so that they use an array.

   if ($number_of_ones >= $matches_wanted) { return $true }
   if ($number_of_twos >= $matches_wanted) { return $true }
    . . .
   if ($number_of_sixes >= $matches_wanted) { return $true }

I.D. List context vs scalar context

You might praise it as an accommodation to human modes of thinking, you might decry it as one more source of confusion, but Perl has a habit of defining the behavior of operations differently depending on the context in which they find themselves. Operations acting on arrays can be particularly confusing, and to help sort matters out, Perl defines two mutually exclusive contexts: scalar context and list context. For example, consider the following program snippet:

   my @array = (1,2,3,4);
   print "The array: ", @array, "\n";
   print "Two times the array? ", 2 * @array;

What would you expect to be printed? Surprisingly, the output is:

   The array: 1234
   Two times the array? 8

You might have expected the second line to end 2468, but no. The reason is that in the first print statement, @array is in list context and so produces the list within it. In the second print statement, @array is in scalar context (because Perl doesn't permit arithmetic directly on lists). Now @array produces not its contents but its size, which happens to be 4 (take care to distinguish the size, 4, from the highest index, 3).
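The sketch below puts the two contexts next to each other, and shows one way to get the doubled list you might have been hoping for: visit the elements one at a time. The names $size, $last_index, $first, and @doubled are chosen for this illustration only.

```perl
use strict;
use warnings;

my @array = (1,2,3,4);

my $size       = @array;     # scalar context: the number of elements
my $last_index = $#array;    # the highest index, one less than the size
my ($first)    = @array;     # list context (note the parentheses): the first element

# To double every element, go through the array one element at a time
my @doubled;
foreach my $element (@array) {
   push @doubled, 2 * $element;
}

print "size = $size, highest index = $last_index, first = $first\n";
print "doubled: @doubled\n";
```

The only difference between the lines defining $size and $first is the pair of parentheses on the left, which is enough to switch the assignment from scalar context to list context.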

II. Loops

Doing something over and over is one thing that computers do well, and Perl provides several ways of teaching computers how to do this. They can be divided into two general classes:

In idiomatic Perl, for / foreach loops are intimately related to arrays. The syntax is:

   foreach $variable (list) {
      operations to be performed repetitively
   }

You can read the first line: For each [variable] within the list, do the following. Here's an example from DiceRoll:

   foreach $die (1..$number_of_dice) {

Read: For each die in the set of integers from 1 to the number of dice, do the following. Notice that the notation within parentheses (m .. n) is the same used to assign values to arrays... because that's precisely what is happening! What lies between the parentheses is a list (in effect, a temporary unnamed array), which you can define as above or can provide as an array variable. The two lines below are equivalent to the previous example:

   my @dice_numbers = (1..$number_of_dice);
   foreach $die (@dice_numbers) {
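To convince yourself that the two forms really are equivalent, here is a sketch that runs both as complete loops and records which values of $die each one visits. The arrays @visited_by_range and @visited_by_array are invented for this check and have nothing to do with DiceRoll.pl.

```perl
use strict;
use warnings;

my $number_of_dice = 5;
my @dice_numbers = (1..$number_of_dice);

# Loop over a list written out with m .. n notation
my @visited_by_range;
foreach my $die (1..$number_of_dice) {
   push @visited_by_range, $die;
}

# Loop over the same values held in an array variable
my @visited_by_array;
foreach my $die (@dice_numbers) {
   push @visited_by_array, $die;
}

print "range: @visited_by_range\n";
print "array: @visited_by_array\n";
```

Both lines of output should list the same five values in the same order.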
Those of you with experience in another computer language may be bothered by this syntax, because you might feel that for loops should have a defined beginning and a defined end. For you, Perl provides the following alternate syntax:

   for (statement before first iteration; statement before each iteration; statement after each iteration) {

These three statements are generally used as follows:

   for ($variable = initial value; test of $variable; modification of $variable) {

For example:

   for ($die = 1; $die <= $number_of_dice; $die += 1) {

Read: For each value of die, starting with 1 and continuing while die is less than or equal to the number of dice, adding one to die each time, do the following. Note that $die += 1 is shorthand for $die = $die + 1, a form beloved by C programmers and their descendants.
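As a quick check that the alternate syntax does the same job, the sketch below adds up the die numbers twice, once with each style of loop. The task (summing) and the names $sum_for and $sum_foreach are my own choices for illustration.

```perl
use strict;
use warnings;

my $number_of_dice = 5;

# Three-part for loop: initialize, test before each pass, modify after each pass
my $sum_for = 0;
for (my $die = 1; $die <= $number_of_dice; $die += 1) {
   $sum_for += $die;
}

# The same work done with foreach over a list
my $sum_foreach = 0;
foreach my $die (1..$number_of_dice) {
   $sum_foreach += $die;
}

print "for: $sum_for  foreach: $sum_foreach\n";
```

Which form to use is largely a matter of taste. The foreach form leaves less room for mistakes in the test, while the three-part form is handy when the step is something other than one.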

SQ5. Write a loop that prints out a table of numbers from 1 to 20 and their squares.

SQ6. Rewrite DiceRoll.pl to eliminate the use of $number_of_ones and similar variables, replacing them with an array.