Biol 591 
Introduction to Bioinformatics
Problem Set 8 - Use of Databases
Fall 2002 

PS8.1: Write an outline that describes the strategy employed by the subroutine print-context within find-context.pl to locate the appropriate orfs for each hit in motif.hits. What assumption(s) does the routine make? (Since the program works, these assumptions are evidently correct.)

PS8.2: The program find-context.pl assumes that memory is not a limiting resource and proceeds to read into memory all the information from the data file it will ever need. This is often not a bad assumption, but as we want to consider more and more information it can become bad. Suppose that you’re working on a PC with 64 Mbytes of memory (call me old-fashioned, but that’s my PC), and, for safety reasons, you decide you don’t want to use more than 32 Mbytes of it on running this program.
 

PS8.2a. Calculate how much memory you’re using in storing all the information you read from 7120DB.dat. It's not always easy to tell how much memory Perl will use for its data.  Try something like this: 20 bytes for each number, 25 bytes for each string, plus 1 byte for each character in the string, and 72 bytes for each array (where each row [ ... ] counts as an array). You may find that you’re still lacking one key piece of information needed to calculate this number. If so, then write a quick program (actually, alter find-context.pl) so that it gives you this piece of information.
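
To show how the rule of thumb above adds up, here is a minimal sketch in Python (not Perl, and not part of find-context.pl). The row contents are hypothetical; the fields that find-context.pl actually stores, and the number of rows, are for you to determine.

```python
# Rule-of-thumb sizes from the problem statement (bytes).
BYTES_PER_NUMBER = 20
BYTES_PER_STRING = 25      # fixed overhead per string
BYTES_PER_CHAR = 1         # plus one byte per character in the string
BYTES_PER_ARRAY = 72       # each row [ ... ] counts as an array


def row_bytes(row):
    """Estimate the memory for one row holding numbers and strings."""
    total = BYTES_PER_ARRAY
    for value in row:
        if isinstance(value, str):
            total += BYTES_PER_STRING + BYTES_PER_CHAR * len(value)
        else:
            total += BYTES_PER_NUMBER
    return total


# Hypothetical row: two coordinates, a gene name, and a short description.
example_row = [704213, 704679, "alr0607", "NirA"]
print(row_bytes(example_row))  # 72 + 20 + 20 + (25 + 7) + (25 + 4) = 173
```

Multiplying an estimate like this by the number of rows gives the total, which is why the row count is the key missing piece.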

PS8.2b. Suppose that you want to run the program using a similar database file that covers all human genes. Now how much memory do you need?

PS8.2c. Suppose that you want the program to give you not only the coordinates of the relevant genes but also the 3000 bp adjacent to each of the identified orfs. This would be important if you wanted to investigate the sequences of complex regulatory sequences governing the expression of the genes you identify. NOW how much memory do you need?

PS8.3: The program find-context.pl treats 7120DB.dat as a stream file, which is to say, it reads it sequentially from the beginning. Suppose that practical considerations (see previous problem) convince you that you can’t do it this way. Instead, you will access ONLY those records you need, when you need them, treating 7120DB.dat as a random access file. Outline the strategy or draw a flow chart (don’t actually write any code) of a program that would enable you to do this. It will consist of the following steps:
PS8.4: Same problem as above, but use the following strategy instead:
PS8.4a. Presuming that you search 7120DB.dat sequentially, what is the average number of records you’ll have to go through to find some random coordinate read from motif.hits?

PS8.4b. Suppose that searching through this number of records leads to unacceptably long waits (it does on my machine, with my level of patience). Devise a search strategy that cuts the search time down enormously. [Hint: consider how you would find a name in the phone book – not by scanning from page 1!]

PS8.4c. There are (at least!) two such strategies. One benefits from assuming that motif.hits is sorted (it's not, but clearly we could sort it and print it out again). Another strategy works whether or not motif.hits is sorted. Which was yours? What would the other one look like?
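
One way to read the phone-book hint is a binary search: open in the middle, decide which half holds the target, and repeat. The sketch below is in Python (not Perl) and uses the standard bisect module. The coordinate list is hypothetical, and the sketch assumes the records are sorted by start coordinate, which you should check against 7120DB.dat before relying on it.

```python
import bisect

# Hypothetical ORF start coordinates, assumed sorted in ascending order.
orf_starts = [100, 950, 2100, 3500, 5200, 7000]


def flanking_orfs(starts, coordinate):
    """Return indices of the ORFs starting at or before, and after, coordinate.

    Either index is None when the coordinate lies beyond that end of the list.
    """
    right = bisect.bisect_right(starts, coordinate)
    left = right - 1
    return (left if left >= 0 else None,
            right if right < len(starts) else None)


print(flanking_orfs(orf_starts, 3000))  # (2, 3)
```

Each step halves the remaining range, so the number of records examined grows with the logarithm of the file size rather than with the file size itself.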

PS8.5: Run find-context.pl using the input file test-motif.hits. It gives the error message:
Can't yet print context without right-hand ORF
PS8.5a. Why?
PS8.5b. Modify find-context.pl so that it works properly with test-motif.hits.
PS8.6: The program find-context.pl does not quite do everything we would want of it. It would be useful to have the output in a form that's easily read by Excel and in a form that makes it easy to sort by the direction of genes (something we're very interested in). Here’s an example of the desired output:
<--*all0606*PetC: cytochrome b6/*104*704137*704212*467*-->*alr0607*NirA: nitrite reduct*25.5
(replace * with tab). This places the motif (coordinates 704137 to 704212) upstream from all0606 (104 bases from the left end of the motif) and upstream from alr0607 (467 bp from the right end of the motif). The motif has a high score (25.5) and it's positioned the right way (upstream from at least one gene), so it may well be a functional NtcA binding site. The output in this form can be read into Excel where it can be searched and displayed in a number of ways.
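
To make the column layout concrete, the sketch below (Python, not Perl) turns the example line into its tab-delimited form and splits it into fields. The field names are labels invented for this illustration; they do not come from find-context.pl.

```python
example = ("<--*all0606*PetC: cytochrome b6/*104*704137*704212*467*"
           "-->*alr0607*NirA: nitrite reduct*25.5")

# Replace each * with a tab, as the text instructs.
line = example.replace("*", "\t")
fields = line.split("\t")

# Hypothetical column labels, chosen only for this sketch.
names = ["left_dir", "left_orf", "left_desc", "left_dist",
         "motif_start", "motif_end", "right_dist",
         "right_dir", "right_orf", "right_desc", "score"]
record = dict(zip(names, fields))

print(len(fields))
print(record["left_orf"], record["motif_start"], record["motif_end"], record["score"])
```

Excel places each tab-separated field in its own column, so the two direction arrows end up in columns you can sort on.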
PS8.6a. Modify find-context.pl so that it gives the desired output.

PS8.6b. Examine the output in Excel. What motifs look interesting? By “interesting” I mean that they are placed upstream from at least one gene of known function. To facilitate your search, sort the data so that hits between two upstream regions are at the top of the list, then those with one upstream region, then those with no upstream region (or with the NtcA site within a gene).
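
The sort order described above can be thought of as sorting on a count of upstream regions. The sketch below (Python, not Perl) assumes, from the example line in PS8.6, that a left-hand gene marked "<--" and a right-hand gene marked "-->" each have the motif upstream. The hits are made up, and the case of a motif inside a gene is not handled.

```python
def upstream_count(left_dir, right_dir):
    """Count flanking genes that read away from the motif.

    Assumption drawn from the example line: a left-hand gene marked "<--"
    and a right-hand gene marked "-->" both have the motif upstream.
    """
    return int(left_dir == "<--") + int(right_dir == "-->")


# Hypothetical hits: (left arrow, right arrow, score).
hits = [("-->", "-->", 18.0), ("-->", "<--", 22.1), ("<--", "-->", 25.5)]

# Two upstream regions first, then one, then none.
ranked = sorted(hits, key=lambda h: upstream_count(h[0], h[1]), reverse=True)
for hit in ranked:
    print(upstream_count(hit[0], hit[1]), hit)
```

In Excel the same effect can be had by sorting on the two arrow columns, or on a helper column that holds this count.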

This, if you can do it, is the ultimate solution of the problem posed by Scenario 8.

PS8.7: Consider the unpack statements (related to input) and printf statements (related to output) you have run across. Never mind graphic input and output, which add another order of magnitude of complexity. How much easier would programming be if you could let someone else worry about input and output?