Biol 591 
Introduction to Bioinformatics
Problem Set 7 - Identification of features through HMMs
Part 2
Fall 2002 

PS7.4. Use pattern matching and capture to search through a DNA sequence and find only those open reading frames that are at least 100 amino acids in length and:

PS7.4a. Contain at least one tryptophan codon. Tryptophan is the least commonly encountered of the 20 standard amino acids.

PS7.4b. Contain:

{arginine or lysine} {2 or 3 any amino acids} {aspartate or glutamate} {2 or 3 any amino acids} {tyrosine}
This is the signature of tyrosine-specific protein kinases, a family of enzymes of great importance in regulating fundamental processes in eukaryotic cells (e.g. cell division).
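The motif above translates almost word for word into a regular expression. Here is a minimal sketch in Python (the pattern itself should work unchanged in Perl). It assumes the open reading frame has already been translated into one-letter amino acid codes, which is not what the problem hands you: you start from DNA. The names KINASE_SIGNATURE and find_signature are made up for this illustration.

```python
import re

# One-letter amino acid codes: R = arginine, K = lysine, D = aspartate,
# E = glutamate, Y = tyrosine. "." stands for any single amino acid.
# Assumption: the input is a translated protein string, not DNA.
KINASE_SIGNATURE = re.compile(r"[RK].{2,3}[DE].{2,3}Y")

def find_signature(protein):
    """Return (start, matched_text) for the first match, or None if absent."""
    match = KINASE_SIGNATURE.search(protein)
    if match is None:
        return None
    return match.start(), match.group(0)
```

For example, find_signature("MARAADEGY") finds the motif starting at position 2, while find_signature("MARADY") returns None because nothing follows the aspartate. Working directly on DNA means either translating first or rewriting each amino acid as its set of codons.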

PS7.5. A true story. Last weekend someone came to my office asking whether I had FirstPublisher (I think that's the name). She had saved a file at home onto her floppy disk, hoping to print it out at school, but she could find no available computer that had the program. She did not want to waste an hour driving home and back again. I didn't have the program either, but looking at the file in NotePad I saw the following:
ì¥Á M    ð¿             w    bjbjâ=â=                    "  €W  €W  w                              ÿÿ         ÿÿ         ÿÿ                 l     î       î   î       î       î       î       î                    4      4      4      4     @           ¡  €  [and more of the same unprintable junk]
There will be a benefit concert... [more of the same for some paragraphs]
[more unprintable junk]
We established that she would be happy if she could just get the text into a word processor. Write a quick Perl program that will take her FirstPublisher file and give her a text file with just English in it.
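One way to think about the job: the English lives in long runs of ordinary printable characters, while the junk rarely produces more than a few printable characters in a row. A minimal sketch of that idea in Python follows (the same character class works in a Perl regular expression). It assumes the text is stored as plain one-byte-per-character runs, as the NotePad view suggests; the real file format is unknown to me. The name printable_runs and the cutoff of 20 characters are arbitrary choices for illustration.

```python
import re

def printable_runs(data, min_len=20):
    """Return the runs of printable ASCII in `data` (bytes) that are at
    least `min_len` characters long. Shorter runs are treated as junk."""
    # Tab, newline, carriage return, and the characters from space to "~".
    pattern = re.compile(rb"[\t\n\r\x20-\x7e]{" + str(min_len).encode() + rb",}")
    return [run.decode("ascii") for run in pattern.findall(data)]
```

To use it on a file, read the file in binary mode (open(name, "rb").read()) and write the returned runs to a new text file. Raising min_len discards more junk but risks losing short lines of real text.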

PS7.6. Alter Hamlet.pl so that it uses a FIFTH-order rather than a FOURTH-order Markov chain.
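If it helps to see what "order" means in isolation, here is a minimal sketch of a character-level Markov chain of arbitrary order in Python. It is not the code of Hamlet.pl, which is not reproduced here, and it assumes that "order k" means the next character depends on the previous k characters. The names build_model and generate are hypothetical.

```python
import random

def build_model(text, order):
    """Map every `order`-character context in `text` to the list of
    characters that follow it (repeats kept, so frequencies are preserved)."""
    model = {}
    for i in range(len(text) - order):
        context = text[i:i + order]
        model.setdefault(context, []).append(text[i + order])
    return model

def generate(model, seed, length, rng=random):
    """Extend `seed` one character at a time until `length` is reached
    or the current context was never seen. The order is len(seed)."""
    order = len(seed)
    out = seed
    while len(out) < length:
        followers = model.get(out[-order:])
        if not followers:
            break
        out += rng.choice(followers)
    return out
```

Going from fourth order to fifth order changes only the length of the context. The price is a larger table and output that sticks more closely to the training text.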

PS7.7. How long would it take a monkey to type Hamlet? Too long. Let's try something simpler. How long would it take the program Hamlet.pl to type out a line beginning "To be or not to be"? To find out, modify the program so that it churns out speech after speech until one begins the way you want it. This could take a while, so make sure that every so often the program displays how many iterations it has gone through. To give the program a fighting chance, use the fifth-order version and specify that the speech must begin with "To ".
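The bookkeeping for "keep trying, and report progress now and then" can be sketched separately from the speech generator. In the Python below, make_speech stands for whatever routine produces one speech; it is a placeholder, not a routine from Hamlet.pl. The names attempts_until, report_every, and max_attempts are likewise made up for this illustration.

```python
import itertools

def attempts_until(make_speech, target, report_every=1000, max_attempts=None):
    """Call make_speech() until a speech starts with `target`.
    Returns (number_of_attempts, speech), or (number_of_attempts, None)
    if `max_attempts` is reached first."""
    for attempt in itertools.count(1):
        speech = make_speech()
        if speech.startswith(target):
            return attempt, speech
        # Progress report so a long run does not look like a hung program.
        if attempt % report_every == 0:
            print(f"{attempt} speeches so far, no match yet")
        if max_attempts is not None and attempt >= max_attempts:
            return attempt, None
```

The optional max_attempts is a safety valve: without it, a target the model can never produce would keep the loop running forever.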

PS7.8. Remember Scenario 5? A PSSM seemed to do a decent job finding NtcA sites (or did it? Stay tuned next week!). Perhaps an HMM could do even better? Write/steal a program that will construct an HMM from the sequences within 71NpNtsm.txt and produce faux NtcA binding sites. Run it a few times. You may get an error message in the execution of the program. If so, the accompanying message will probably give you a big hint how to proceed.

PS7.8a. Did you get the error messages shown below? If so, then go to the line of the program in question. In what routine did the error occur? From the description of how the routine works, what problem do you anticipate in using the sequences you gave the program? Alter the program to fix the problem.
Use of uninitialized value in string eq at test.pl line 153, <INPUT_FILE> line 10.
substr outside of string at test.pl line 153, <INPUT_FILE> line 10.
PS7.8b. Once you get sequence output, do you find any likely NtcA binding sites? (If you don't know what a likely NtcA binding site looks like, go back in time and find out.)

PS7.8c. Why aren't there a lot of good NtcA binding sites? Step back a moment. From how HMMs work, what kind of problem do you think an HMM would be bad at solving? What kind of problem would it be good at solving?

(I would like to take this opportunity to remind you that producing fake protein binding sites based on Markov models is not a very common procedure, while detecting protein binding sites is very common.)