Biol 591 
Introduction to Bioinformatics
Notes for Scenario 3: Blast and 2-dimensional arrays
Fall 2003 

I. Two-dimensional arrays

A two-dimensional array can be thought of as a table. Here's one:
 
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0 A A A G T T G T A G T T T C T G T T
1 T T T T T A G T A G C A A T T G C T
2 G A A A A A G T A G C A G T T G C T
3 T T G C T G T A G C A G T A A C T
4 T T T T A T G T A T C A G C T G T T
5 C G A A C T G T T A C A T C G A T T
6 C G T T C T G T A A C A A A G A C T
7 A A T T T T G T A G C T A C T T A T
8 A A T T T A G T A T C A A A A A T A
9 A A A G C T G T A A C A A A A T C T
10 T C A T T T G T A C A G T C T G T T

The numbers on the edges are the indexes of the array (starting with zero as usual).  We call out an item from the array by using two numbers in brackets: the row number, then the column number. If our array is called @sequences, then in the top row, $sequences[0][0] is "A", $sequences[0][1] is also "A", $sequence[0][6] is "G", and so forth.  Moving away from the top row, $sequences[8][12] is "A", and $sequences[9][6] if "G".

As with a one-dimensional array, one of the most straightforward things we can do is to print out the contents of an array. Here's a start:

foreach $row (0 .. scalar(@sequences) - 1) {
    print the contents of the row
}
Now we need to replace the comment in blue with Perl code. To print one item, we could use print $sequences[$row][$column]; so to print a row we would use something like
foreach $row (0 .. scalar(@sequences) - 1) {
    foreach $column (something here about columns) {
        print $sequences[$row][$column];
    }
}
We need to vary the $column in the same way we varied $row. Perl doesn't make it as easy to get the number of columns in an array as the number of rows. In our example, since each row is a sequence, the number of columns is the length of the sequence. Let's assume that all sequences are the same length, and the length is stored in a variable $sequence_length.
foreach $row (0 .. scalar(@sequences) - 1) {
    foreach $column (0 .. $sequence_length - 1) {
        print $sequences[$row][$column];
    }
}
We're missing one piece: we never go to a new line.  Somewhere in there we need a "\n", or our printing will run off the edge of the screen, or the paper.

If $sequence_length is not too big, we can add the linefeed at the end of each row:

foreach $row (0 .. scalar(@sequences) - 1) {
    foreach $column (0 .. $sequence_length - 1) {
        print $sequences[$row][$column];
    }
    print $LF;
}
SQ1. Why is it that these loops use scalar(@sequences) - 1 rather than just scalar(@sequences)? Why $sequence_length - 1 rather than $sequence_length? And what does scalar(@sequences) mean anyway?

The consensus.pl program has just this piece of code. Consensus.pl solves a problem that arose in Scenario 2. You might recall that the notes presented a table of NtcA-binding sites and then claimed that NtcA bound to sequences that looked like GTA........TAC. Where did THAT come from? Well, it came from crude, fuzzy impressions of the general features of the table. We can do better than that. And consensus.pl does.

The program calculates a consensus sequence by looking at each column of a table of sequences. If 80% of the bases are A, for instance, then the program will print a capital A in that column of the consensus sequence. If at least 50% of the bases in the column are A, then the program prints a lowercase a in that column. (Otherwise it prints a period.)  Download NtcA-binding.txt to try it out. (This file contains the NtcA-binding sequences from Scenario 2)

SQ2. In consensus.pl, why does the Calculate_consensus_sequence routine have $column in the outside for loop, and $row on the inside for loop, the opposite of the printing routine?

for $column (0 .. $sequence_length - 1) {
   @count = (0, 0, 0, 0);
   for $row (0 .. Size(@sequences) - 1) {
          . . .
   }
}
SQ3. Run consensus.pl, first as is and then altering the extent of the arrays, e.g. (0 .. $sequence_length), NOT minus 1, to see if the results are altered as you predict.

II. Homegrown BlastN

The BlastN.pl program (which you can test with the DG47 and lef sequences you got when going through the scenario) uses two dimensional arrays extensively, but its printing routines are rather complicated.  What it doesn't have, unfortunately, is a routine to print out two-dimensional arrays like the scoring matrix.  If we had such a routine, we might be able to watch BlastN as it calculated the scores, and so get a better idea of how it works.

SQ4. First of all, get BlastN.pl to run. Does it give you the same results as BlastN from NCBI? Take a look at the program and get a basic idea of how it works.

SQ5. Write a subroutine that will print out part of the scoring matrix. It should look something like this:

sub print_score {        
# prints out scoring matrix
# It's called by print_score(a,b,c,d)
# where a is assigned to $first_target
#       b is assigned to $last_target
#       c is assigned to $first_query
#       d is assigned to $last_query
# The table produced should look something like what's shown below
# (you don't have to print labels, just the numbers)

                #                 Query
                #        1st                     last
                # T  1st score score score  ...  score
                # a      score score score  ...  score
                # r      score score score  ...  score
                # g      score score score  ...  score
                # e       ...   ...   ...   ...   ...
                # t last score score score  ...  score

my ($first_target,$last_target,$first_query,$last_query) = @_;
   foreach $t (  CONTINUE THIS LINE
      foreach $q (  CONTINUE THIS LINE
         print  CONTINUE THIS LINE
      }
      print  CONTINUE THIS LINE
   }
}
A couple of hints: SQ6. Where in BlastN.pl would be an appropriate place to call the subroutine you wrote to print the scoring table? Where could you put calls to the subroutine so that you can capture the process of making the scoring table and thereby learn how BlastN does it?

SQ7. Play with BlastN, both the local and the NCBI varieties (use Blast 2 sequences for the latter), trying to characterize how the two implementations of the program differ. Consider, for example, filtering, scoring, and, of course, how their answers compare.

SQ8. Make changes in the parameters of the programs (e.g. the rewards and penalties). Do these changes affect the results?