Biol 591 
Introduction to Bioinformatics
Fall 2002 

Scenario 5: Definition of a set of coregulated genes
Our Story

As you recall (see Scenario 1, story), you were trying to parlay a humdrum graduate thesis project into a research career bound for Stockholm. The key was your discovery of a DNA sequence that looked very much like a binding site for the nitrogen-responsive regulator NtcA and that lay upstream from the gene hetX. You decided to spend months determining whether NtcA actually binds to the site – and it does! Unfortunately, hetX itself turned out to be less interesting than you had anticipated. It is indeed regulated by NtcA, but this regulation turned out not to be important for heterocyst differentiation. Bottom line: you got an article out of the work, but no earthquakes.

OK, so if hetX is not the missing link that connects nitrogen starvation to the induction of the master regulatory gene hetR, turning on heterocyst differentiation, then what is? Somewhere there must be a gene regulated by NtcA that, in turn, regulates hetR expression, and you resolve to find it. You turn again to bioinformatics, intending to search exhaustively for all genes preceded by sequences that look like NtcA-binding sites. You recall, however, that your program found a few thousand such sites. It isn’t credible that NtcA regulates a few thousand genes (i.e. most of the genes in the organism!), so you need a more powerful means of predicting in silico what NtcA binds to in vivo.

You enlist outside help. The program Meme is designed to extract from a set of sequences patterns common to them, returning not only a consensus sequence but also the statistical underpinnings of its opinion, in the form of a position-specific scoring matrix (PSSM). You feed Meme the proven NtcA-binding sites of Nostoc and the sequences surrounding them (plus similar sequences near orthologous genes from a related strain), and Meme returns the conserved region within the sequences and the underlying PSSM. You now have a tool with which to scan the Nostoc genome looking for new genes preceded by putative NtcA-binding sites, a tool much richer than merely comparing sequences against a consensus sequence.
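
To make the idea of a PSSM concrete, here is a minimal Python sketch of how one might be computed from a handful of sites. It is not Meme and not part of the course software: Meme has to discover the conserved region within unaligned sequences, whereas this sketch assumes the sites are already aligned and of equal length. The function name build_pssm, the uniform background (0.25 per base), the pseudocount of 1, and the toy sites are all assumptions made for illustration; the toy sites are invented and are not real NtcA-binding sites.

```python
import math

BASES = "ACGT"

def build_pssm(sites, background=None, pseudocount=1.0):
    """Return a log-odds PSSM: one dict of base -> score per aligned column."""
    if background is None:
        # Assumption: every base is equally likely in the genome.
        background = {b: 0.25 for b in BASES}
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pssm = []
    for i in range(length):
        # Start from the pseudocount so an unseen base never gets log(0).
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[i].upper()] += 1
        total = sum(counts.values())
        # Score = log2(observed frequency / background frequency).
        pssm.append({b: math.log2((counts[b] / total) / background[b])
                     for b in BASES})
    return pssm

# Invented toy sites, for illustration only.
toy_sites = ["GTAACA", "GTATCA", "GTAGCA", "GTAACT"]
toy_pssm = build_pssm(toy_sites)

for column in toy_pssm:
    print({b: round(score, 2) for b, score in column.items()})
```

Columns in which every site agrees (such as the leading G) give a strongly positive score to the conserved base and negative scores to the others, while a variable column gives scores near zero. That position-by-position weighting is what makes a PSSM richer than a consensus sequence.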

That still leaves the problem of how to apply the PSSM to the genome to obtain the best candidates. How do you do that?

Problem
How can we scan a genome, scoring each portion of it with the PSSM derived from proven DNA regulatory sites, to find all plausible candidates for sequences functionally similar to those in a training set?

Tools
Sequence scanning using position-specific scoring matrices
We'll use a homegrown program to illustrate how PSSMs are used to scan genomic sequences.
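
The homegrown program itself is not reproduced here. As a rough illustration of the general technique, the Python sketch below slides a PSSM along a sequence, scores each window on both strands, and keeps the windows that reach a cutoff. The function names, the three-column toy matrix, the toy sequence, and the threshold are all hypothetical; a real scan would use the PSSM returned by Meme and a cutoff chosen from the scores of the proven sites.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[b] for b in reversed(seq.upper()))

def scan(sequence, pssm, threshold):
    """Return (position, strand, score, window) for windows scoring >= threshold.

    Positions are 0-based and always given in plus-strand coordinates.
    Hits are sorted from highest to lowest score.
    """
    width = len(pssm)
    hits = []
    for strand, seq in (("+", sequence.upper()),
                        ("-", reverse_complement(sequence))):
        for start in range(len(seq) - width + 1):
            window = seq[start:start + width]
            if any(b not in "ACGT" for b in window):
                continue  # skip windows containing ambiguous bases
            # The window score is the sum of the per-column scores.
            score = sum(col[b] for col, b in zip(pssm, window))
            if score >= threshold:
                pos = start if strand == "+" else len(seq) - width - start
                hits.append((pos, strand, score, window))
    return sorted(hits, key=lambda h: h[2], reverse=True)

# Hypothetical 3-column matrix that favors the motif GTA.
toy_pssm = [
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 1.0},
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
]

hits = scan("CCGTACC", toy_pssm, threshold=3.0)
for hit in hits:
    print(hit)
```

Scanning both strands matters because a regulatory site can lie on either one. The choice of threshold controls the trade-off described in the story: too lenient and you are back to thousands of candidates, too strict and you may miss real sites.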