Problem Set 5 - Bioinformatics
PS5B.1. Construct a PSSM using the sequences found in the file 71NpNtsm.txt, a value of B = Square root(N) and background frequencies of [A] = [T] = 0.32 and [C] = [G] = 0.18. Don't do an entire table. Columns 34 through 36 are enough (column 34 has all A's).
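The arithmetic for a single PSSM column can be sketched as follows. This is only an illustration under stated assumptions: it uses the common pseudocount formula, adjusted frequency = (count + B * background) / (N + B), and a base-2 log-odds score. The course notes may define the score differently (for example with a different log base). The function name pssm_column and the toy column are hypothetical and are not taken from 71NpNtsm.txt.

```python
import math

# Background frequencies as given in PS5B.1
BACKGROUND = {"A": 0.32, "C": 0.18, "G": 0.18, "T": 0.32}

def pssm_column(column, background=BACKGROUND, b=None):
    """Log-odds scores for one alignment column (assumed formula).

    column: string holding one base per aligned sequence, e.g. "AAAA".
    b: total pseudocount weight B; defaults to sqrt(N).
    """
    n = len(column)
    if b is None:
        b = math.sqrt(n)
    scores = {}
    for base, bg in background.items():
        count = column.count(base)
        # Pseudocount-adjusted frequency of this base in the column
        freq = (count + b * bg) / (n + b)
        # Log-odds relative to the background frequency (base 2 is an assumption)
        scores[base] = math.log2(freq / bg)
    return scores

# Hypothetical column of nine A's, so N = 9 and B = 3
toy_column = "AAAAAAAAA"
scores = pssm_column(toy_column)
```

With this formula, a base that never appears in the column gets an odds ratio of B / (N + B) regardless of its background frequency, which is one reason the choice of B matters so much for conserved columns.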
PS5B.2. Calculate the uncertainty and the information content for the same three columns.
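As a rough guide to the quantities asked for in PS5B.2, the sketch below computes Shannon uncertainty and information content for one column. It assumes the simplest definitions: uncertainty H = -sum(p * log2 p) over the observed base frequencies (no pseudocounts), and information = 2 bits - H, where 2 bits is the maximum uncertainty for four equally likely bases. If the notes use pseudocount-adjusted frequencies or a background-specific maximum, adjust accordingly. The function names are hypothetical.

```python
import math

def uncertainty(column):
    """Shannon uncertainty in bits, from the observed base frequencies."""
    n = len(column)
    h = 0.0
    for base in "ACGT":
        p = column.count(base) / n
        if p > 0:  # a base that never occurs contributes nothing
            h -= p * math.log2(p)
    return h

def information(column, h_max=2.0):
    """Information content: maximum uncertainty minus observed uncertainty."""
    return h_max - uncertainty(column)

# Hypothetical columns, not taken from 71NpNtsm.txt
conserved = "AAAAAAAA"
mixed = "AACCAACC"
```

Under these assumptions a perfectly conserved column, such as one containing only A's, has zero uncertainty and therefore the maximum information content.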
PS5B.3. Calculate the log(Normalized Score) of the sequence GCTTAGTA using just the first 8 columns of the matrices found in Table 1 of the notes for Monday, 21 October, and using a B value of 0.1. Note that this is a bit different from what's shown in Table 2, since I'm asking for the log of the normalized score. Of course, you could go through the calculation shown in Table 2 to obtain the normalized score and then take the log of it, but that requires many divisions. It's much easier to use the log odds table... but then how do you normalize? (Hint: real easy. You need to take two logs and use them many times.)
PS5B.4. You want to play with the program FindMotif, but you don't want to wait 15 minutes each run as the program goes through the entire Nostoc genome. You could alter the sequence file so that it's shorter, but here's a faster solution. Modify FindMotif so that it reads only the first 800 lines before saving the best hits to a file and quitting. (Hint: you already know a quick way to stop a program: die.)
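The control flow being asked for (count lines, stop at a limit, and only then save and quit) can be sketched in a language-neutral way. The example below is in Python purely for illustration; FindMotif itself is not shown here, and the names scan and score_line are hypothetical stand-ins, not part of the real program.

```python
import io

def scan(handle, score_line, max_lines=800):
    """Score at most max_lines lines from handle and return the hits, best first."""
    hits = []
    for line_number, line in enumerate(handle, start=1):
        hits.append((score_line(line), line_number))
        if line_number >= max_lines:
            break  # stop reading, but fall through so the hits can still be saved
    hits.sort(reverse=True)
    return hits

# Stand-in for a long sequence file: 2000 identical lines
fake_genome = io.StringIO("ACGT\n" * 2000)
hits = scan(fake_genome, score_line=len)
```

The point to notice is the order of operations: the loop is left first, the results are handled next, and the program ends last. Quitting at the moment the line limit is reached would discard the hits collected so far.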
PS5B.5. What is the effect of changing the B value from 0.1 to Square root(N)? Why? (Don't answer on general principles. Run the program with the different values, THEN trot out the principles.)
PS5B.6. What is the effect of changing the threshold used to determine which columns are used to 1.8? to 0.8? Why?
PS5B.7. What is the effect of changing the threshold given as $threshold to -2**60? To -2? Why?
PS5B.8. Suppose you didn't know anything about NtcA, but you DID know that glnA, nirA, and other genes are controlled by nitrogen availability and suspect that there is a regulatory protein that binds to their upstream sequences. You've collected the upstream sequences from a variety of genes of Nostoc PCC 7120 and Nostoc punctiforme that you think are coregulated (in the file 71NpNtcA.nt).
PS5B.8a. Ask either Meme or Gibbs Sampler (or both) to find a common motif. Try requiring that all sequences have motifs. Try also letting the program find motifs only in certain sequences of the set (zero or one). By the way, Meme asks you for an e-mail address and generally gets you the output the next day. Gibbs Sampler gives you output immediately, but in my experience at least, it's found less than Meme.
PS5B.8b. That's asking a lot of mere software. Repeat the request but using the file 71NpNtsm.txt, containing what happens to be bona fide NtcA binding sites with less extraneous sequence. Can Meme or Gibbs Sampler find it now?