Finding Open Reading Frames

Our job is to find open reading frames (ORFs) that might be genes. We'll do the simplest possible search: look for a start codon followed, in the same reading frame, by a stop codon.
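To make the idea concrete before we start on the real program, here is a minimal sketch of that search on a toy sequence. It is not the program we're about to build, only an illustration under some stated assumptions: the helper name orfs_in_frame is hypothetical, ATG is treated as the only start codon, the stop codons are TAA, TAG and TGA, positions are counted from 0, and the reported end is the last base of the stop codon. The program we build may well make different choices.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: scan one reading frame of $dna
# and return a list of [start, end] pairs (0-based, inclusive), one for each
# ATG that is followed in the same frame by a stop codon.
# Assumptions: ATG is the only start codon; the end includes the stop codon.
sub orfs_in_frame {
    my ($dna, $frame) = @_;
    my %is_stop = map { $_ => 1 } qw(TAA TAG TGA);
    my @found;
    my $start;    # position of the ATG we are currently extending, if any

    # Step through the sequence one codon (three bases) at a time.
    for (my $i = $frame; $i + 3 <= length $dna; $i += 3) {
        my $codon = uc substr($dna, $i, 3);
        if (!defined $start) {
            $start = $i if $codon eq 'ATG';    # open a candidate ORF
        }
        elsif ($is_stop{$codon}) {
            push @found, [$start, $i + 2];     # close it at the stop codon
            undef $start;
        }
    }
    return @found;
}

# Try all three frames of a short made-up sequence.
my $toy = 'ATGAAATGACCATGCCCTAG';
for my $frame (0 .. 2) {
    for my $orf (orfs_in_frame($toy, $frame)) {
        my ($s, $e) = @$orf;
        print "frame $frame: $s..$e ", substr($toy, $s, $e - $s + 1), "\n";
    }
}
```

Notice that the sketch only reads the sequence as given. A real search also has to look at the reverse complement, and it needs some way to ignore ORFs too short to be interesting.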

We're going to test this with the Synechocystis genome from the Kazusa DNA Research Institute. To check our work we'll look at a list of known good ORFs, also from Kazusa. We should expect some discrepancies: our program will find some ORFs that Kazusa doesn't list (false positives), and sometimes Kazusa will list a shorter ORF than ours, ending at the same place but beginning later than ours.

Before you begin, download the Synechocystis genome and save it in a file called Synecho.nt.

We'll write the program piece by piece, stopping from time to time for study questions. Sometimes a study question will ask you to think about how to use the Perl you already know to solve a problem; other times we'll need a new Perl feature, and the study question will ask you to imagine what that feature might be.

Here's our starting point, with three crucial variables filled in.
#!/usr/bin/perl -w
use strict;

########################### Variables #################################

my $threshold = 300; # An ORF (open reading frame) with at least this many
# bases might be a gene.

my $genome; # The genome we're investigating.

my @orfs; # Lists the ORFs we find in the genome. Each row of
# @orfs is a triple
#
# [$start, $end, $direction]
#
# where $start tells us the beginning of the ORF in
# the genome, $end tells us the last position, and
# $direction is either "d" (for direct), meaning the
# ORF came from the genome as given, or "c" (for
# complement), meaning the ORF came from the reverse
# complement of the genome.

############################# Files ###################################


########################## Main Program ###############################


######################## Subroutines ##################################
SQ1: Read the comments, and decide what you would put in the Files and Main Program sections. Then go to the next page and compare your outline to the one there.