Problem Set 5

Biol 591

Introduction to Bioinformatics
Problem Set 5 - PSSMs

Fall 2003

PS5.1. Construct a PSSM using the sequences found in the file 71NpNtSm.txt, a value of B = the square root of N, and background frequencies of [A] = [T] = 0.32 and [C] = [G] = 0.18. I would do this in Excel.
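For anyone who wants to check the Excel result with a script, here is a minimal Perl sketch of one common pseudocount scheme. It rests on assumptions that may not match the formula used in class: N is taken to be the number of aligned sequences, the adjusted frequency is (count + B*background) / (N + B), and the score is the natural log of that frequency divided by the background frequency. The subroutine name build_pssm and the three toy sequences are made up for illustration and are not taken from 71NpNtSm.txt.

```perl
use strict;
use warnings;

# Background frequencies given in PS5.1
my %background = (A => 0.32, T => 0.32, C => 0.18, G => 0.18);

# Build a PSSM (one hash of log-odds scores per column) from aligned sequences.
# ASSUMPTION: adjusted frequency = (count + B*q) / (N + B), score = ln(freq / q).
# B must be greater than 0, otherwise a zero count would give log(0).
sub build_pssm {
    my ($seqs, $B) = @_;
    my $N      = scalar @$seqs;          # number of aligned sequences
    my $length = length $seqs->[0];      # all sequences assumed equal length
    my @pssm;
    for my $col (0 .. $length - 1) {
        my %count = map { $_ => 0 } keys %background;
        for my $seq (@$seqs) {
            my $nt = uc substr($seq, $col, 1);
            $count{$nt}++ if exists $count{$nt};   # gaps and other symbols are ignored
        }
        my %score;
        for my $nt (keys %background) {
            my $freq = ($count{$nt} + $B * $background{$nt}) / ($N + $B);
            $score{$nt} = log($freq / $background{$nt});
        }
        push @pssm, \%score;
    }
    return \@pssm;
}

# Hypothetical toy alignment (not from 71NpNtSm.txt)
my @toy  = qw(GTAC GTTC GAAC);
my $pssm = build_pssm(\@toy, sqrt(scalar @toy));
for my $col (0 .. $#$pssm) {
    printf "%d\t%s\n", $col + 1,
        join "\t", map { sprintf "%s=%.3f", $_, $pssm->[$col]{$_} } sort keys %{ $pssm->[$col] };
}
```

Each printed row is one column of the alignment; a positive score means the nucleotide is more frequent in that column than the background predicts.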

PS5.2. (Still in Excel) Calculate the uncertainty and the information content for the same data.
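The same two quantities can be sketched in Perl as a cross-check. This assumes the usual Shannon definitions: uncertainty H = -sum(p * log2 p) over the four nucleotides in a column, and information content = 2 bits - H, with no small-sample correction and no adjustment for the unequal background frequencies. If the course notes define information content differently, follow the notes. The subroutine names and the two example columns are hypothetical.

```perl
use strict;
use warnings;

# Uncertainty (Shannon entropy, in bits) of one alignment column.
# Takes a hash ref of nucleotide => frequency; zero frequencies contribute 0.
sub uncertainty {
    my ($freq) = @_;
    my $H = 0;
    for my $p (values %$freq) {
        next if $p <= 0;
        $H -= $p * log($p) / log(2);     # log2(p) = ln(p) / ln(2)
    }
    return $H;
}

# ASSUMPTION: information = maximum uncertainty (2 bits for 4 nucleotides) - observed
sub information {
    my ($freq) = @_;
    return log(4) / log(2) - uncertainty($freq);
}

# Hypothetical column frequencies
my %conserved = (A => 0,    C => 0,    G => 1,    T => 0);
my %uniform   = (A => 0.25, C => 0.25, G => 0.25, T => 0.25);
printf "conserved: H = %.2f bits, I = %.2f bits\n",
    uncertainty(\%conserved), information(\%conserved);
printf "uniform:   H = %.2f bits, I = %.2f bits\n",
    uncertainty(\%uniform), information(\%uniform);
```

A perfectly conserved column has no uncertainty and the full 2 bits of information; a column with all four nucleotides equally frequent has 2 bits of uncertainty and no information.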

PS5.3. (Still in Excel) Calculate the log(Normalized Score) of the sequence given in 71NpNtSm.txt for the NtcA-binding region from glnA of Anabaena PCC 7120 (labeled 71-glnA) using the PSSM you constructed in PS5.1. Then do the same using only those columns whose information content exceeds a value of your choosing.

PS5.4. Complete the subroutine Determine_informational_positions within FindMotif and try it out with different informational threshold values. What is the effect of changing this parameter?

PS5.5. How does the performance of FindMotif change when the B value is changed from 0.1 to the square root of N? Why?

PS5.6. It’s all very good to calculate the informational content of an alignment, quite another thing to see it in the flesh. Let’s do the latter. In brief, logos are graphical representations of the information content of an alignment. You can find out more about them by going to Tom Schneider's web page. To make a logo of NtcA-binding sites, start from the file of aligned binding sites (71NpNtSm.txt) and remove the first 22 nucleotides, leaving a smaller portion of 55 nucleotides.

One way to do this is to bring the file into Word and cut out the excess sequence on the left. This is facilitated by Word's ability to define blocks. Depress the Alt key, then click on the upper left nucleotide of the aligned sequences. Then drag the mouse to the lower right nucleotide of the block you want to delete and release the mouse. Then press delete.
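Another way is a few lines of Perl. The sketch below assumes each line holds only an aligned sequence with nothing in front of it; if 71NpNtSm.txt puts a label such as 71-glnA on the same line, the label would have to be split off first. The subroutine name trim_left is made up for illustration, and the demonstration trims 3 columns from two invented sequences rather than 22 from the real file.

```perl
use strict;
use warnings;

# Remove the first $n characters from each aligned sequence.
# ASSUMPTION: each element is a bare sequence with no label or leading whitespace.
# Sequences no longer than $n come back as empty strings.
sub trim_left {
    my ($n, @seqs) = @_;
    return map { length($_) > $n ? substr($_, $n) : '' } @seqs;
}

# Hypothetical usage on made-up sequences, trimming 3 columns instead of 22
my @trimmed = trim_left(3, 'AAAGTAC', 'CCCGTTC');
print "$_\n" for @trimmed;
```

To apply this to the real file, read its lines into an array, chomp them, and call trim_left(22, @lines).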

Paste this sequence into the sequence box at WebLogo. With "Compute defaults" selected, click "generate". Once the form has been generated, you can fool around with the settings below the sequence box (Height of vertical bar is a good parameter to change), or you can just accept the default values. Then, with "Graphical view" or "Postscript text" selected, click "generate" again. You will need a program that can read PostScript files. Adobe Photoshop is one such program. Ghostscript is also available as freeware.

Which positions seem particularly informative? (For some positions you might be able to suggest why.)

PS5.7. What sequences found by your modification of FindMotif.pl do you think would be worth investigating further as possible NtcA-binding sites?

PS5.8. Discovery! Consider the program Translate_aa_codes.pl. Download and run it to see what it does.

5.8a. The program is defective in that it forgot proline (whose one-letter code is P). Fix the program.
5.8b. Look at the sixth line of the subroutine Print_all_codes:
for $code ( sort (keys (%amino_acid) ) ) {
Fool around with the program until you understand what the keys function does.
5.8c. Why is it necessary to sort the keys? (After all, the hash was defined with the keys in alphabetical order). What happens if you omit sort?
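As a starting point for experimenting, here is a tiny standalone sketch with the same loop shape as the line quoted in 5.8b. The hash %codon_count and its contents are made up for illustration and have nothing to do with the %amino_acid hash in Translate_aa_codes.pl.

```perl
use strict;
use warnings;

# Hypothetical hash, unrelated to the one in Translate_aa_codes.pl
my %codon_count = (ATG => 12, TAA => 3, GGC => 7);

# keys returns a list of the hash's keys; sort puts that list in string order
my @sorted_codes = sort keys %codon_count;
for my $code (@sorted_codes) {
    print "$code => $codon_count{$code}\n";
}
```

Try changing the entries, or the line that builds @sorted_codes, and compare the output from several runs.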

PS5.9. In Exam1 you wrote a program that determined the nucleotide frequencies of a genome. Simplify that program by using a hash that holds the total counts of each nucleotide.

PS5.10. Suppose you didn’t know anything about NtcA, but you DID know that glnA, nirA, and other genes are controlled by nitrogen availability and suspect that there is a regulatory protein that binds to their upstream sequences. You’ve collected the upstream sequences from a variety of genes of Anabaena PCC 7120 and Nostoc punctiforme that you think are coregulated (in the file 71NpNtcA.nt).

5.10a. Ask either MEME or Gibbs Sampler (or both) to find a common motif. Try requiring that every sequence contain a motif. Try also letting the program find motifs in only some sequences of the set (zero or one motif per sequence).
5.10b. That’s asking a lot of mere software. Repeat the request using the file 71NpNtSm.txt, which contains what happen to be bona fide NtcA binding sites with less extraneous sequence. Can MEME or Gibbs Sampler find the motif now?