BNFO301 – Introduction to Bioinformatics
Sequence Motif Discovery

PSSMs and p53

We've been looking at a particular tool, Position-Specific Scoring Matrices (PSSMs), as a way of identifying conserved sequences. When you find DNA or protein sequences that have been conserved over millions of years, the obvious cause is selection: the conserved sequences are functionally important, and so organisms with mutations that deviate from the conserved sequences are at a selective disadvantage. We can therefore let nature tell us what sequences are important, by discerning what has been conserved.

I want to show you an example of a PSSM in action, doing something useful. To that end, we'll read together the following article:

Hoh J, Jin S, Parrado T, Edington J, Levine AJ, and Ott J (2002).
The p53MH algorithm and its application in detecting p53-responsive genes.
Proceedings of the National Academy of Sciences 99:8467-8472

You'll want to go to the link to that article and either follow along online or print out a copy. If you choose the latter route, click on the Full Text (PDF) link under This Article. While the article has its own charm, I will focus solely on the parts dealing with PSSMs, which make up a small fraction of the article.

...except for this paragraph of general introduction. The authors are trying to find a way of predicting the target genes regulated by the protein p53. This protein is essential in safeguarding the integrity of cellular DNA. Mutations in p53 are found in about half of all human cancers. Since p53 acts by transcriptionally regulating genes, it is of great interest to find the site to which p53 binds, so that we can predict which genes it regulates. The article by Hoh et al is a step in that direction.

If you'd like to learn more about p53, try this review article:

Levine AJ (1997). p53, the cellular gatekeeper for growth and division.
Cell 88:323-331

Introduction of Hoh et al (2002)

Read the first paragraph and the beginning of the second. The second paragraph describes the binding site of p53 in terms that may not be familiar to you. Click here for a table that may help you understand the description of the binding site here and elsewhere in the article. You'll notice that Pu (purine) means "A or G" and Py (pyrimidine) means "T or C". 
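
If it helps to see the Pu/Py shorthand at work, here is a minimal sketch in Python. It is my own illustration, not anything from the article: the names CODES, pattern_to_regex, and fits are made up, the single letters R and Y stand in for Pu and Py, and the pattern tested is a toy one, deliberately not the p53 site.

```python
import re

# Degenerate nucleotide symbols. R (purine) and Y (pyrimidine) play the
# role of the article's Pu and Py; W (A or T) is another standard code.
CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]",   # Pu: A or G
    "Y": "[CT]",   # Py: T or C
    "W": "[AT]",
}

def pattern_to_regex(pattern):
    """Translate a degenerate pattern such as 'RRCY' into a compiled regex."""
    return re.compile("".join(CODES[symbol] for symbol in pattern))

def fits(pattern, sequence):
    """True if the whole sequence is one of the sequences the pattern allows."""
    return pattern_to_regex(pattern).fullmatch(sequence) is not None

# Toy pattern (hypothetical): two purines, a C, then a pyrimidine.
print(fits("RRCY", "AGCT"))   # True
print(fits("RRCY", "ATCT"))   # False: T is not a purine
```

A single degenerate pattern stands for many concrete sequences, which is worth keeping in mind as you work through the study questions below.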

SQ1. Is the given consensus sequence for the p53 binding site palindromic?

SQ2. What is the meaning of the sets of arrows in the paragraph? The three dots?

SQ3. Write out two sequences that fit the described p53 binding site, making them as dissimilar to each other as possible. What percent of their nucleotides are identical to each other?

SQ4. What problems do you see in identifying unknown p53 binding sites upstream from genes?

Skim through the last two paragraphs of the Introduction. Understand as much or as little of it as you like. In brief, RT-PCR (real-time polymerase chain reaction) is a method to quantitate the amount of a specific mRNA. By comparing the expression of a gene in a cell line under conditions where p53 is active to expression under conditions where p53 is inactive, one can determine the degree to which the gene is regulated by p53.

Materials and Methods

p53MH Algorithm
Read the first paragraph. The important thing to see is that the algorithm they describe makes the same noises as a PSSM. Don't bother going through the second paragraph, which describes how they score candidate sequences (using a different method than the one you examined last week).

Table 1: Weight and filtering matrices
The first part of this table looks like a count matrix (as described in last week's notes). Is it? The accompanying text says that the matrix came from two sets of data. Examine the data and determine if indeed the table is a count matrix:

Data from el-Deiry et al (1992) [Nature Genetics 1:45-48]
Data from Funk et al (1992) [Molecular and Cellular Biology 12:2866-2871]
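
As a reminder of what a count matrix is, here is a minimal sketch that builds one from a handful of aligned sequences. The sequences are invented for illustration (they are not taken from either data set), and count_matrix is a hypothetical helper name. It assumes the sites are already aligned, all the same length, and contain only A, C, G, and T.

```python
def count_matrix(aligned_sites):
    """Count how often each nucleotide occurs in each column of the alignment."""
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all sites must have the same length")
    counts = {nt: [0] * length for nt in "ACGT"}
    for site in aligned_sites:
        for position, nt in enumerate(site.upper()):
            counts[nt][position] += 1   # KeyError if nt is not A, C, G, or T
    return counts

# Invented toy alignment: four sites, five positions each.
toy_sites = ["AGGCA", "GGACA", "AGACT", "GAACT"]
matrix = count_matrix(toy_sites)
for nt in "ACGT":
    print(nt, matrix[nt])
# A [2, 1, 3, 0, 2]
# C [0, 0, 0, 4, 0]
# G [2, 3, 1, 0, 0]
# T [0, 0, 0, 0, 2]
```

Every column sums to the number of sequences (here 4). That is a quick sanity check you can apply to any table that claims to be a count matrix.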

SQ5. Explain how the numbers shown in Table 1 are derived from the two data sets. For example, why is the value for A in the upper left corner of Table 1 14?

SQ6. Construct by yourself the first column of the matrix from each individual data set, to end up with two first columns (one based on 20 sequences and the other based on 17). Are the results comparable?

SQ7. Why does the table sometimes have half counts? (See, for example, the 7th column.)

Why are the two data sets so different? Why do the sequences of el-Deiry et al often have sequences interpolated between the two halves of the 20-nucleotide binding site and sometimes even within the two halves? Why don't the sequences of Funk et al have these defects? As it happens, the sequences of el-Deiry et al were obtained by catching natural human DNA sequences with p53 and sequencing them. The sequences of Funk et al are random sequences that were enriched for those recognized by p53 by multiple rounds of binding to the protein. Evidently, the optimal p53-binding sequences don't have the defects often seen in real p53-binding sequences. Let that be a lesson: real sequences are messy.

What about the filter table at the bottom of Table 1? That's intended to capture information about excluded nucleotides at specific positions: if the filter table shows a 1, then the nucleotide at that position is allowed. If it shows a 0, it isn't. For example, the fourth column shows zeros for A and G. Indeed, there is no case in the data of el-Deiry and Funk in which an A or G appears in that position. Candidate p53-binding sequences are first passed through this filter: if a nucleotide at a given position is excluded (has a 0 at that position), then the candidate is thrown away.
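
The filtering step just described can be sketched as follows. This is an illustration under my own assumptions, not the authors' code: passes_filter is a hypothetical name, and the 0/1 table is a made-up four-position example rather than the filter from Table 1. It assumes each candidate is exactly as long as the filter.

```python
def passes_filter(candidate, filter_table):
    """True if every nucleotide of the candidate is allowed (1) at its position."""
    return all(filter_table[nt][pos] == 1 for pos, nt in enumerate(candidate))

# Made-up filter with four positions. The fourth position (index 3)
# excludes A and G; the second position (index 1) excludes T.
toy_filter = {
    "A": [1, 1, 1, 0],
    "C": [1, 1, 1, 1],
    "G": [1, 1, 1, 0],
    "T": [1, 0, 1, 1],
}

candidates = ["AGGC", "AGGA", "ATGC", "GGAT"]
kept = [c for c in candidates if passes_filter(c, toy_filter)]
print(kept)   # ['AGGC', 'GGAT']
```

Note that the filter is all-or-nothing: one excluded nucleotide discards the candidate, no matter how well the rest of it would have scored.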

This behavior may puzzle you, since you saw that 0 counts are transformed into 0 frequencies, which when multiplied, give 0 as the final product... unless pseudocounts are applied. So why go through the bother of constructing a special filter table?
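
To make the point about zeros concrete, here is a small sketch of the multiply-the-frequencies score with and without pseudocounts. The counts are invented, the function names are hypothetical, and adding the same pseudocount to every cell is just one simple convention that I am assuming here; it may not be the exact scheme from last week's notes.

```python
def frequencies(counts, pseudocount=0.0):
    """Turn a count matrix into column-wise frequencies, optionally with pseudocounts."""
    length = len(counts["A"])
    freqs = {nt: [] for nt in "ACGT"}
    for pos in range(length):
        total = sum(counts[nt][pos] + pseudocount for nt in "ACGT")
        for nt in "ACGT":
            freqs[nt].append((counts[nt][pos] + pseudocount) / total)
    return freqs

def product_score(sequence, freqs):
    """Multiply the frequency of each nucleotide at its position."""
    score = 1.0
    for pos, nt in enumerate(sequence):
        score *= freqs[nt][pos]
    return score

# Invented counts from four imaginary sequences, three positions each.
toy_counts = {
    "A": [3, 0, 1],
    "C": [0, 4, 0],
    "G": [1, 0, 0],
    "T": [0, 0, 3],
}

candidate = "ACG"   # G was never observed at the third position
print(product_score(candidate, frequencies(toy_counts)))                 # 0.0
print(product_score(candidate, frequencies(toy_counts, pseudocount=1)))  # 0.0390625
```

Without pseudocounts, a single unobserved nucleotide zeroes the whole product, which behaves much like a filter. With pseudocounts the candidate survives with a low but nonzero score.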

SQ8. Is there an exact correspondence between positions that have zero in the "weight" table and positions that have zeros in the "filter" table?

The filter table was constructed using additional information besides the counts derived from the data of el-Deiry and Funk. The authors believed on the basis of other evidence that certain nucleotides at certain positions prevented the binding of p53 to its target sequence. There was no such evidence for some nucleotides at some positions, even if they had never been observed.

SQ9. What is another way to handle this situation, making use of methods you've already examined?

SQ10. Take a look at the explanation for xi (looks like a curly E) at the bottom left of the second page of the article. What does that explanation sound like?

Results

The rest of the article is rather garbled, and you would probably find it difficult to pick out what the authors actually did. In brief, they took 5000 nucleotides upstream from available human genes and 5000 nucleotides downstream from the same genes and applied the weight table in a way we don't have to dwell on now. Genes with regions that scored well are shown in Table 3 (genes already known to be regulated by p53) and Table 4 (genes now predicted to be regulated by p53). 
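
Whatever the details of the scoring, searching a long stretch of DNA for good sites generally comes down to sliding a window along the sequence and scoring each window. Here is a minimal sketch of that idea only. It is not the p53MH scoring scheme: scan and toy_score are hypothetical names, the scorer is a deliberately crude stand-in (the fraction of positions matching a made-up word), and the sequence is invented.

```python
def scan(sequence, score_window, width, threshold):
    """Slide a window along the sequence; keep (start, window, score) at or above threshold."""
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = score_window(window)
        if score >= threshold:
            hits.append((start, window, score))
    return hits

def toy_score(window, preferred="GGCA"):
    """Crude stand-in scorer: fraction of positions that match a made-up word."""
    return sum(a == b for a, b in zip(window, preferred)) / len(preferred)

upstream = "TTGGCATAGGAACC"   # invented sequence, not real upstream DNA
print(scan(upstream, toy_score, width=4, threshold=0.75))
# [(2, 'GGCA', 1.0), (8, 'GGAA', 0.75)]
```

In practice you would pass in a PSSM-based scoring function in place of toy_score; the scanning loop itself would not need to change.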

SQ11. Look at the first sequence given in Table 3 (for the sequence upstream from a gene called Snk). How does it relate to Table 1?

SQ12. Some sequences in Table 3 have dots in the middle, some don't. How do you interpret the dot?

Incredibly, the article does not list the putative p53 binding sites for the genes shown in Table 4. To find them, you have to go to the outside web site given in the paper:

Go to the authors' web site devoted to p53 results
Click on Pathways
Click on the link (shown as "15") given on the row Tumor suppressor/Apoptosis and the column Perfect Match

There you should see the sequences for those genes in Table 4 that happen to be tumor suppressors or related to cell death (apoptosis).

SQ13. What fraction of the genes with the highest degree of similarity to proven p53 binding sites (i.e. with perfect scores) are related to tumor suppression or cell death? Is this suspicious?

Make a PSSM to check the results

Let's check how good those sites really are. To do that we're going to use the data of el-Deiry and Funk to construct a PSSM and then use that PSSM to score the sequences...

... but that's the subject of the next set of notes.