Problem Set 5B: PSSM

Biol 591

Introduction to Bioinformatics
Problem Set 5 - Bioinformatics

Fall 2002

PS5B.1. Construct a PSSM using the sequences found in the file 71NpNtsm.txt, a value of B = sqrt(N), and background frequencies of [A] = [T] = 0.32 and [C] = [G] = 0.18. Don’t do an entire table: columns 34 through 36 are enough (column 34 has all A’s).
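
As a rough illustration of the arithmetic (not a reference solution), the sketch below computes pseudocount-adjusted frequencies and log-odds scores for a single column. It assumes the common form f = (n + B*p) / (N + B) with a score of log2(f/p); the exact formula and log base in the class notes may differ. The function name `pssm_column` and the toy column are made up for the example and are not taken from 71NpNtsm.txt.

```python
import math

# Background frequencies given in PS5B.1.
BACKGROUND = {"A": 0.32, "C": 0.18, "G": 0.18, "T": 0.32}

def pssm_column(column, background=BACKGROUND, b=None):
    """Return (frequencies, log-odds scores) for one alignment column.

    Assumes the pseudocount form f = (n + B*p) / (N + B) and log base 2.
    Both are assumptions, not taken from the class notes.
    """
    n_seqs = len(column)
    if b is None:
        b = math.sqrt(n_seqs)  # B = sqrt(N), as the problem specifies
    freqs = {}
    scores = {}
    for base, p in background.items():
        count = column.count(base)
        freqs[base] = (count + b * p) / (n_seqs + b)
        scores[base] = math.log2(freqs[base] / p)
    return freqs, scores

# Toy column of 16 sequences, all A (hypothetical data).
toy_freqs, toy_scores = pssm_column("A" * 16)
```

Note that even in an all-A column the other bases keep a small nonzero frequency, which is what keeps their log-odds scores finite.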

PS5B.2. Calculate the uncertainty and the information content for the same three columns.
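
For orientation only, here is a minimal sketch of the two quantities, assuming uncertainty is the Shannon entropy H = -sum(p * log2 p) and information content is 2 - H bits (the maximum for four equally likely bases). The notes may instead measure information against the skewed background frequencies, and may or may not use pseudocount-adjusted frequencies, so treat both function definitions as assumptions. The two example columns are hypothetical.

```python
import math

def uncertainty(freqs):
    """Shannon uncertainty H = -sum(p * log2 p) in bits; terms with p = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)

def information(freqs, h_max=2.0):
    """Information content as h_max - H; h_max = 2 bits assumes four equally likely bases."""
    return h_max - uncertainty(freqs)

# Hypothetical columns: one perfectly conserved, one completely uninformative.
all_a = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
```

Under these assumptions a perfectly conserved column has zero uncertainty and the full 2 bits of information, while a uniform column has 2 bits of uncertainty and no information.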

PS5B.3. Calculate the log(Normalized Score) of the sequence GCTTAGTA using just the first 8 columns of the matrices found in Table 1 of the notes for Monday, 21 October, and using a B value of 0.1. Note that this is a bit different from what’s shown in Table 2, since I’m asking for the log of the normalized score. Of course, you could go through the calculation shown in Table 2 to obtain the normalized score and then take the log of it, but that requires many divisions. It’s much easier to use the log odds table… but then how do you normalize? (Hint: real easy. You need to take two logs and use them many times.)
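
The reason logs save work is the identity log(product of f/p) = sum(log f) - sum(log p). The sketch below checks that identity numerically. It uses made-up frequencies for a three-column motif, not the values in Table 1, and base-10 logs as an assumption. The definition of "normalized score" in the notes may include details not modeled here, and both function names are hypothetical.

```python
import math

BACKGROUND = {"A": 0.32, "C": 0.18, "G": 0.18, "T": 0.32}

# Hypothetical per-position frequencies for a 3-column motif (not Table 1).
toy_freqs = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.4},
]

def log_of_ratio_product(seq, freqs, background=BACKGROUND):
    """Multiply the ratios f/p position by position, then take one log."""
    product = 1.0
    for base, column in zip(seq, freqs):
        product *= column[base] / background[base]
    return math.log10(product)

def sum_of_logs(seq, freqs, background=BACKGROUND):
    """Add log f and subtract log p at each position; no divisions needed."""
    return sum(
        math.log10(column[base]) - math.log10(background[base])
        for base, column in zip(seq, freqs)
    )
```

Both functions return the same value for any sequence, but the second needs only table lookups, additions, and subtractions.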

PS5B.4. You want to play with the program FindMotif, but you don’t want to wait 15 minutes each run as the program goes through the entire Nostoc genome. You could alter the sequence file so that it’s shorter, but here’s a faster solution. Modify FindMotif so that it reads only the first 800 lines before saving the best hits to a file and quitting. (Hint: you already know a quick way to stop a program: die.)
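
FindMotif's source is not reproduced here, so the following is not a patch to it. It is only a sketch, in Python, of the general idea of capping how many lines are read from an input file. The helper name `read_first_lines` and the stand-in file contents are hypothetical.

```python
import io
from itertools import islice

def read_first_lines(handle, max_lines=800):
    """Return at most max_lines lines from an open file, stripped of newlines."""
    return [line.rstrip("\n") for line in islice(handle, max_lines)]

# Stand-in for a sequence file: 1000 short lines (hypothetical data).
fake_file = io.StringIO("ACGT\n" * 1000)
lines = read_first_lines(fake_file)
```

The point of the exercise is the same in any language: stop reading once a line counter reaches the limit, make sure the best hits are written out, and only then quit.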

PS5B.5. What is the effect of changing the B value from 0.1 to sqrt(N)? Why? (Don't answer on general principles. Run the program with the different values, THEN trot out the principles.)

PS5B.6. The program uses a threshold to determine which columns are used. What is the effect of changing that threshold to 1.8? To 0.8? Why?

PS5B.7. What is the effect of changing the threshold given as $threshold to -2**60? To -2? Why?

PS5B.8. Suppose you didn’t know anything about NtcA, but you DID know that glnA, nirA, and other genes are controlled by nitrogen availability and suspect that there is a regulatory protein that binds to their upstream sequences. You’ve collected the upstream sequences from a variety of genes of Nostoc PCC 7120 and Nostoc punctiforme that you think are coregulated (in the file 71NpNtcA.nt).

PS5B.8a. Ask either Meme or Gibbs Sampler (or both) to find a common motif. Try requiring that all sequences have motifs. Try also letting the program find motifs only in certain sequences of the set (zero or one motif per sequence). By the way, Meme asks you for an e-mail address and generally gets you the output the next day. Gibbs Sampler gives you output immediately, but in my experience at least, it has found less than Meme has.

PS5B.8b. That’s asking a lot of mere software. Repeat the request, but this time use the file 71NpNtsm.txt, which contains what happen to be bona fide NtcA binding sites with less extraneous sequence. Can Meme or Gibbs Sampler find the motif now?