Biol 591 
Introduction to Bioinformatics
Fall 2002 

Scenario: Identification of a possible regulatory site in genomic DNA
Our Story

Nitrogen-fixing cyanobacteria: Eat air and prosper!
Certain cyanobacteria, Nostoc PCC 7120 among them, are some of the only creatures on earth able to survive on CO2 as a source of carbon, N2 as a source of nitrogen, water as a source of electrons, and sunlight as a source of energy. This is quite a trick, because the process of fixing carbon with electrons from water necessarily produces O2 as a byproduct, and the process of fixing N2 is irreversibly inactivated by tiny amounts of oxygen. Nostoc protects the machinery of nitrogen fixation from inactivation by producing specialized cells, called heterocysts, that rigorously exclude oxygen.
(Image of Nostoc filament with heterocyst)
Filament of Nostoc. The green cells are photosynthetic vegetative cells. The pale cell is a heterocyst, specialized for nitrogen fixation.

The cost of fixing nitrogen: How to pay only when necessary?
Heterocysts are expensive to make and maintain, however, and you are interested in studying the mechanism by which Nostoc regulates the appearance of heterocysts. When an alternative source of nitrogen is present, Nostoc makes no heterocysts. When that source is consumed or removed, vegetative cells differentiate into heterocysts within about 18 hours. How do the cells sense nitrogen deprivation and translate that perception into the induction of the genes necessary for heterocyst differentiation? At present, the answer to this question is not known.

The discovery: starvation ==> *** NtcA-BINDING ***  ==> heterocyst differentiation
You are studying the regulation of the gene hetR, whose product is known to be critical in controlling heterocyst differentiation. You're focusing on the protein HetQ, which you believe regulates the expression of hetR. Your plan is to make random mutations in hetQ (which encodes HetQ), hoping to understand from the resulting mutant protein how the regulation is achieved. In examining the sequence upstream of hetQ, you happen to notice the presence of the sequence:

atctGTAacatgagaTACacaatagcatttatatttgcttTAgtaTctct
The capital letters, you recognize, meet all the requirements of a binding site for the protein NtcA, known to mediate the expression of many genes sensitive to nitrogen deprivation. Maybe, just maybe, you have accidentally discovered the missing link that connects nitrogen deprivation to the regulation of heterocyst genes!
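To make "the capital letters" concrete, here is a minimal sketch in Python that turns the mixed-case sequence into a searchable pattern. It rests on an assumption not stated above: that the capitals are the only constrained positions and that the lowercase spacers must keep exactly the lengths seen in this one example. Real NtcA sites may tolerate other spacings, and the helper name pattern_from_capitals is made up for illustration.

```python
import re

# Sequence found upstream of hetQ; capitals mark the positions that matter.
site = "atctGTAacatgagaTACacaatagcatttatatttgcttTAgtaTctct"

def pattern_from_capitals(seq):
    """Build a regex from a mixed-case sequence.

    ASSUMPTION: capitals are fixed bases and each lowercase run between
    capitals is a spacer of exactly that length (any base allowed).
    """
    upper_positions = [i for i, base in enumerate(seq) if base.isupper()]
    core = seq[upper_positions[0]:upper_positions[-1] + 1]
    parts = []
    for run in re.findall(r"[A-Z]+|[a-z]+", core):
        if run.isupper():
            parts.append(run)                        # fixed bases
        else:
            parts.append("[ACGT]{%d}" % len(run))    # spacer of any bases
    return "".join(parts)

pattern = pattern_from_capitals(site)
print(pattern)  # GTA[ACGT]{8}TAC[ACGT]{22}TA[ACGT]{3}T

# The original sequence should of course fit its own pattern.
match = re.search(pattern, site.upper())
print(match.start(), match.end())
```

Under these assumptions the site is GTA, 8 free bases, TAC, 22 free bases, TA, 3 free bases, T: nine constrained positions spread over 42 bases.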

The discovery? How do you know?
Unfortunately, you need hard evidence that NtcA actually binds to that site before anyone will believe your theory. And hard evidence means spending the better part of a year measuring the binding of NtcA to your sequence in the test tube. If it DOESN'T bind, then you've wasted a lot of time. Is there any way to assess the LIKELIHOOD that NtcA will bind to your sequence without actually having to do time-consuming experiments? How can you tell whether the sequence you found might not have arisen by chance without regard to function?

Problem
Use bioinformatic tools to assess the likelihood of encountering a specific DNA sequence by chance.

Tools
Simulation
Make up a large number of random sequences. Ask in each case whether the sequence fits the criteria for an NtcA binding site. Count how many times it does and how many times it doesn't.
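The simulation might be sketched in Python as follows. The pattern is an assumption: it is read off the capital letters in the hetQ upstream sequence, with the spacers held at exactly 8, 22, and 3 bases. The function names, the seed, and the equal base frequencies are also illustrative choices, not part of the course material. To get enough tests in a reasonable time, each made-up sequence is 1000 bases long and every 42-base window within it is treated as one trial.

```python
import random
import re

# ASSUMPTION: pattern inferred from the capitals in the example sequence.
# The lookahead (?=...) lets overlapping windows be counted separately.
SITE = re.compile(r"(?=GTA[ACGT]{8}TAC[ACGT]{22}TA[ACGT]{3}T)")
SITE_LENGTH = 3 + 8 + 3 + 22 + 2 + 3 + 1  # 42 bases from GTA to the final T

def count_sites(seq):
    """Count the windows in seq that fit the pattern."""
    return sum(1 for _ in SITE.finditer(seq))

def simulate(n_sequences, length, rng, weights=(0.25, 0.25, 0.25, 0.25)):
    """Make up random sequences; return (windows that fit, windows that don't).

    weights gives the frequencies of A, C, G, T (equal by default).
    """
    windows_per_sequence = length - SITE_LENGTH + 1
    hits = 0
    for _ in range(n_sequences):
        seq = "".join(rng.choices("ACGT", weights=weights, k=length))
        hits += count_sites(seq)
    return hits, n_sequences * windows_per_sequence - hits

rng = random.Random(591)  # fixed seed so the run can be repeated
hits, misses = simulate(2000, 1000, rng)
print("fit:", hits, " do not fit:", misses)
print("observed frequency per window:", hits / (hits + misses))
# Nine constrained positions, each matched with probability 1/4:
print("expected with equal base frequencies:", 0.25 ** 9)
```

Because a chance match is rare, even two million random bases yield only a handful of hits, so the observed frequency is noisy. Passing different weights lets you repeat the experiment with the base composition of the genome you care about.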

Pattern recognition
Scan the genome of Nostoc PCC 7120 and count how many sequences fit the pattern of an NtcA binding site.
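A scan of this kind could look like the sketch below. It makes several assumptions: the pattern is inferred from the capital letters in the hetQ upstream sequence with fixed spacer lengths, the genome is available as a single-record FASTA file (the file name in the comment is hypothetical), and a site on either strand counts. A toy sequence stands in for the genome so the example runs on its own.

```python
import re

# ASSUMPTION: pattern inferred from the capitals in the example sequence.
SITE = re.compile(r"(?=GTA[ACGT]{8}TAC[ACGT]{22}TA[ACGT]{3}T)")
SITE_LENGTH = 42  # bases from GTA to the final T
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def read_fasta(path):
    """Read a single-record FASTA file into one uppercase string."""
    with open(path) as handle:
        return "".join(line.strip() for line in handle
                       if not line.startswith(">")).upper()

def find_sites(genome):
    """Return (strand, position) for every window that fits the pattern.

    Positions are 0-based and always refer to the leftmost base of the
    window on the forward strand.
    """
    genome = genome.upper()
    n = len(genome)
    hits = [("+", m.start()) for m in SITE.finditer(genome)]
    reverse = reverse_complement(genome)
    hits += [("-", n - m.start() - SITE_LENGTH) for m in SITE.finditer(reverse)]
    return sorted(hits, key=lambda hit: hit[1])

# Toy "genome": the hetQ upstream sequence with a few flanking bases.
toy = "ccgg" + "atctGTAacatgagaTACacaatagcatttatatttgcttTAgtaTctct" + "ttaa"
print(find_sites(toy))  # [('+', 8)]

# With a real genome file (hypothetical name):
# genome = read_fasta("nostoc_pcc7120.fasta")
# print(len(find_sites(genome)))
```

Comparing the number of sites found in the real genome with the number expected by chance tells you whether the pattern is more common than random sequence would predict.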