Biol 591 
Introduction to Bioinformatics
Scenario 2, Problem Set 2P
Fall 2003 

All prior and future Study Questions are deemed members of a problem set. This makes them fair game on days on which we discuss problem sets (and also when I devise questions for exams).

PS2P-1. Discover! Download and run ps2p-1.pl, part of which is reproduced below. Fool around with it, change anything you can think of, and make simple programs to teach yourself what each of these lines does:

@letters = split //, $sequence;
while($sequence =~ /(...)/g) {
   push @triplets, $1;
}
print "@triplets", $LF;
print join("|", @letters);
print "First triplet: ", $triplets[0], $LF;
print "Last triplet: ", $triplets[-1], $LF, $LF;
In particular:
1a. What does split //, $sequence do?
1b. What does $sequence =~ /(...)/g do?
1c. What does join(" ", @array) do?
1d. What does $triplets[-1] refer to?
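
One way to start experimenting is a small stand-alone script like the sketch below. The values given to $sequence and $LF are assumptions made for illustration (the excerpt above does not show how ps2p-1.pl sets them). Change the sequence, the pattern, or the separator and watch what happens to the output.

```perl
use strict;
use warnings;

# Assumed values for illustration; ps2p-1.pl may define these differently.
my $sequence = "ATGGCTAAATGA";
my $LF       = "\n";

my @letters = split //, $sequence;
my @triplets;
while ($sequence =~ /(...)/g) {
    push @triplets, $1;
}

# Counting the elements is a quick way to see what each construct produced.
print "Number of letters:  ", scalar(@letters),  $LF;
print "Number of triplets: ", scalar(@triplets), $LF;
print join("|", @letters), $LF;
print "@triplets", $LF;
print "Last triplet: ", $triplets[-1], $LF;
```

Try a sequence whose length is not a multiple of three and see what becomes of the leftover letters.
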
PS2P-2. Discover! Download ps2p-2.pl, part of which is reproduced below, together with the accompanying data files ps2p-2-data1.txt and ps2p-2-data2.txt, and run it. Fool around with it, changing anything you can think of, until you understand what each of these lines does:

  @combined_list = (@input_list1, @input_list2);
  @final_list = sort (@combined_list);

In particular:

2a. What does the syntax (@array1, @array2) do?
2b. What does sort do?
2c. What's the use of this program?
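
If you would rather experiment without the data files, a sketch like the following may help. The two input lists here are made-up stand-ins, not the contents of ps2p-2-data1.txt and ps2p-2-data2.txt.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the lists read from the two data files.
my @input_list1 = ("gamma", "alpha", "epsilon");
my @input_list2 = ("delta", "beta");

my @combined_list = (@input_list1, @input_list2);
my @final_list    = sort(@combined_list);

print "Combined: @combined_list\n";
print "Sorted:   @final_list\n";
```

Try replacing the words with numbers such as 9, 10, and 100 and see whether the order is what you expect.
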
PS2P-3. Make an array that contains the letters from A to Z, where each element contains a single letter. Use elements of the array to spell out your name.

PS2P-4. Use two loops to print on the screen something like graph paper:

       A B C D E F G H I J K L M N O P Q R S T
     1
     2
     3
    ...
    20
PS2P-5. Download and run ps2p-5.pl, then look at the program.
5a. As written, the program would read more clearly if the two statements of the form:
for (specifications)
were replaced with statements of the form:
foreach $variable (list)
Make the two replacements.

5b. If you increase the value of $last_integer to something more interesting, like 10e7, then the time of execution becomes of great interest. The program would run about twice as fast if the loops didn't consider even numbers. Change the loops so that they do not. Make sure that the output of the program is the same, except, perhaps, for the first number.

PS2P-6. Write a quick program, using two for/foreach loops, to print out every possible palindrome of length 4 (that is, every sequence identical to its own reverse complement). You might make good use of the following subroutine:
#### REVERSE_COMPLEMENT (sequence)
#    Returns reverse complement of sequence
#    In other words, takes strand 5' -> 3' and returns other strand 5' -> 3'
#    Preserves case of input sequence
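#    Reverse-complements degenerate symbols (e.g. R) as well
#    (S and W are their own complements, so they map to themselves)
sub Reverse_complement {
   my ($sequence) = @_;
   $sequence =~ tr/ACGTKMRSWYBDHVNXacgtkmrswybdhvnx/TGCAMKYSWRVHDBNXtgcamkyswrvhdbnx/;
   # scalar forces string reversal even if the subroutine is called in list context
   return scalar reverse($sequence);
}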
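
A quick way to check the subroutine before relying on it is to feed it sequences whose reverse complements you already know. The sketch below repeats the subroutine so that it runs on its own. It maps S and W to themselves (each is its own complement) and forces scalar context on reverse so that the string is reversed however the subroutine is called. The test sequences are made up for illustration.

```perl
use strict;
use warnings;

sub Reverse_complement {
    my ($sequence) = @_;
    $sequence =~ tr/ACGTKMRSWYBDHVNXacgtkmrswybdhvnx/TGCAMKYSWRVHDBNXtgcamkyswrvhdbnx/;
    return scalar reverse($sequence);
}

# Made-up test sequences
foreach my $test ("GAATTC", "ATGC", "aacgtt") {
    my $rc      = Reverse_complement($test);
    my $verdict = ($rc eq $test) ? "palindrome" : "not a palindrome";
    print "$test -> $rc ($verdict)\n";
}
```

The comparison $rc eq $test is the whole test for a palindrome. What remains for the problem is generating the candidate sequences.
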
PS2P-7. Consider ps2p-7.pl (a derivative of BlastParser). Its use of shift/unshift is confusing (at least to me). Eliminate all uses of unshift and shift, replacing them with push and pop. Make sure that the output of the program you produce is the same as that of the program you're given!

PS2P-8. Modify ps2p-7.pl so that it keeps track of the lengths of open reading frames (ORFs). It should store in an array @orf_lengths the number of times ORFs of a given size occur in the data set. For example, if there are 27 instances of an ORF of length 169, then $orf_lengths[169] should equal 27. For each length, print the length and the corresponding element of @orf_lengths to a file, then use Excel to display a graph of ORF length vs frequency.
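
The counting idea can be sketched apart from ps2p-7.pl as follows. The list @lengths_seen is hypothetical and stands in for the ORF lengths the real program would extract from its data. The sketch prints to the screen, whereas the assignment calls for printing to a file.

```perl
use strict;
use warnings;

# Hypothetical ORF lengths; in ps2p-7.pl these would come from the parsed data.
my @lengths_seen = (169, 85, 169, 300, 85, 169);

# Use the length itself as the array index and count each occurrence.
my @orf_lengths;
foreach my $length (@lengths_seen) {
    $orf_lengths[$length]++;
}

# Tab-separated lines open cleanly in Excel. Lengths never seen are skipped.
foreach my $length (0 .. $#orf_lengths) {
    next unless defined $orf_lengths[$length];
    print "$length\t$orf_lengths[$length]\n";
}
```

Skipping the undefined elements is a choice made here for brevity. Printing a zero for each unseen length instead would give Excel an evenly spaced axis.
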

PS2P-9. Change Find_NtcA.pl (which you wrote according to the specifications in the notes) so that it prints out only the total number of exact matches and inexact matches.

PS2P-10. Change the model in Find_NtcA so that it simulates an intergenic sequence. In Anabaena, intergenic sequences have a GC fraction of 36.1%. How much of a difference in the predicted number of NtcA-binding sites does it make to presume (a) equal nucleotide frequencies, (b) nucleotide frequencies equal to those of the overall genome, or (c) nucleotide frequencies equal to those of intergenic regions?
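
One possible way to turn a GC fraction into simulated sequence is sketched below. It assumes that G and C are equally frequent and that A and T are equally frequent, which the problem does not state. The names %frequency and Random_nucleotide are hypothetical, and your Find_NtcA.pl may organize its model quite differently.

```perl
use strict;
use warnings;

my $gc_fraction = 0.361;   # intergenic GC fraction given in the problem

# Assumption: G and C share the GC fraction equally; A and T share the rest.
my %frequency = (
    G => $gc_fraction / 2,
    C => $gc_fraction / 2,
    A => (1 - $gc_fraction) / 2,
    T => (1 - $gc_fraction) / 2,
);

# Draw one nucleotide at random according to %frequency.
sub Random_nucleotide {
    my $draw       = rand();
    my $cumulative = 0;
    foreach my $nt (sort keys %frequency) {
        $cumulative += $frequency{$nt};
        return $nt if $draw < $cumulative;
    }
    return "T";   # guard against rounding error at the top of the range
}

# Simulate a short stretch of intergenic sequence (the length 60 is arbitrary).
my $simulated = join "", map { Random_nucleotide() } 1 .. 60;
print "$simulated\n";
```

Changing $gc_fraction to 0.5 gives the equal-frequency case (a).
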

PS2P-11. Bottom line: Do you think it would be worthwhile spending the next year in the lab testing the hetQ region for binding to NtcA?