Open Reading Frames Problems

The following problems all assume you will modify orf-ps.pl. If I've written the problems correctly they are independent. Each one improves orf-ps.pl in some way, and together they make it the best program that I was able to come up with to detect ORFs without sophisticated Markov models.

You'll need to download the Synechocystis genome (Synecho.nt), if you haven't already. It may help to compare the output of your program at various points with Kasuza's list of ORFs for Synechocystis, in the file Synecho-orfs.txt.

P7P.1 Wrapping around. If you look at the first line of genome file you'll see

   >6803.flat  3573470 bases

If you look at the last line of Kasuza's list of ORFs you'll see

   3573271    3574242    d

So we've got an ORF that starts at position 3573271 out of 3573470, very near the end of the genome, and stops at position 3574242 -- after the end. How can that be?

The answer is that the genome is circular, so position 3573470 is followed by position 1, and position 3574242 is the same as position 773.

Modify the program to take the circularity of the genome into account. Hint: the expression $sequence . $sequence stands for $sequence followed by an identical copy of itself. Warning: Be sure not to report ORFs whose beginning is after the end of the sequence.

P7P.2 Searching inside ORFs. Study questions 11 through 13 dealt with the problem of finding false stop codons. A complementary problem is ignoring legitimate start codons. Consider the sequence ATG CCA TGA: a start codon, a codon for proline, and a stop codon. Way too short to take any notice of. But hiding in there is a start codon for another frame: AT GCC ATG A. That start codon could be followed by an open reading frame thousands of bases long, but the program won't find it, because it will resume searching after the last A.

We can change where the program searches by setting pos($sequence), which we can do by using it on the left-hand side of an assignment

   pos($sequence) = ... ;

P7P.2a: Replace the ... with an expression giving the right position to resume searching, and put the statement in the right place in the following code:
   while ($sequence =~ /((ATG|GTG|TTG)(...)*?(TGA|TAG|TAA))/g) {

      my $orf_length = length($1);
      my $orf_end = pos($sequence) - 1;
      my $orf_start = pos($sequence) - $orf_length;

      next if $orf_length < $threshold;

      push @orfs_found, [$orf_start, $orf_end, $direction];

   }
P7P.2b: Solving P7P.2a will find some out-of-frame start codons that had been hidden; it will also find some in-frame start codons that we don't really want. For instance, inATG CCC ATG ... TAA there are two start codons. Assuming that the ... stands for a string three hundred or more bases long, the program would report something like this:
   3261    4232    d
   3267    4232    d
Having seen the first line, we probably don't need to see the second. Modify orfs_in_direction so that it reports no more than one ORF with a given endpoint. Hint: use a hash to record the endpoints we've seen already.

P7P.3: Reverse complements. The orf-ps.pl program provides you with a reverse complement function. However, there is a fly in the ointment. We'd like to have the positions reported relative to the direct genome, but when orfs_in_direction calculates the starting and ending position of an ORF, it does the calculation relative to the reversed genome. After we call orfs_in_direction to find reverse ORFs, an row in @reverse_orfs with a starting position of 0 represents an ORF right at the end of the genome in terms of direct positions.

Uncomment the two statements that are commented out in the following section of ocde, and add statements to calculate the direct start and direct end of the ORFs:

   #   @reverse_orfs = orfs_in_direction("c", reverse_complement($genome));

   foreach my $info (@reverse_orfs) {
      my ($reverse_start, $reverse_end, $direction) = @$info;

      # @$info = ($direct_start, $direct_end, $direction);
   }