Welcome to BioBIKE's tutorial
What is a Gene?

What determines the beginning of
a protein-encoding gene?

You may be pleased with the elements you found, satisfied that you've achieved a glimpse into the mind of God...

...but wait a second. Perhaps those triplets are red herrings! They are correlated with the beginnings of genes, but do they determine the beginning? After all, capital letters don't ALWAYS indicate the beginning of a sentence.

  1. Do these triplet elements occur elsewhere in genes? If so, then, just as in English, internal cues must be supplemented by external cues in order to determine the beginning. If the triplets are sufficient to determine the beginning of a gene, then there should be the same number of the triplets as there are genes. You can answer this question by using COUNT-OF in two ways. First, it can count the number of items in a list:

    Second, it can count the number of times a string (a group of characters) appears in another string (including a DNA sequence):

    If every ATG marks the beginning of some gene (though not all genes necessarily begin with ATG), then the number of ATGs should be no more than the number of genes. Is it?
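For readers who want to see the logic of the comparison outside BioBIKE, here is a minimal Python sketch. It is not BioBIKE code and does not reproduce COUNT-OF itself; the toy sequence, the gene start positions, and the helper names `count_items` and `count_occurrences` are all invented for illustration.

```python
# Toy data, invented for illustration (not a real genome).
genome = "CCATGAAATGCCTAGGGATGTTTATGACCTGA"
gene_starts = [2, 17]  # hypothetical 0-based start positions of two genes


def count_items(items):
    """Count the number of items in a list."""
    return len(items)


def count_occurrences(pattern, sequence):
    """Count non-overlapping occurrences of pattern in sequence.

    ATG cannot overlap with itself, so for this triplet the
    non-overlapping count equals the total count.
    """
    return sequence.count(pattern)


gene_count = count_items(gene_starts)
atg_count = count_occurrences("ATG", genome)

print("genes:", gene_count)
print("ATG triplets:", atg_count)
print("more ATGs than genes?", atg_count > gene_count)
```

In this toy sequence there are more ATG triplets than genes, which is the situation the question above asks you to test for in the real genome. The sketch looks at one strand only.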

     
  2. We seem to have a problem. How to escape? Well, in English, not every capital letter begins a sentence. You can distinguish between a capital letter beginning a sentence and a capital letter internal to the sentence by punctuation – a preceding period, question mark, or exclamation point.

    Let's see if there is a similar indicator preceding genes, but for some variety, switch to the freshwater cyanobacterium Synechococcus PCC 7942 (abbreviation S7942B). First verify that genes of this organism start with the same triplets as the genes of ss120.


  3. OK, they do. What comes before the beginning of S7942B's genes? You can see the nucleotides preceding the genes by using (SEQUENCE OF...) as you did before; just use a negative number after FROM. Using this tool, examine the 15 nucleotides prior to each gene.
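The idea of a negative FROM value can be sketched in plain Python as a slice that ends where the gene begins. This is only an analogy under stated assumptions: the function name `upstream`, the toy sequence, and the 0-based start coordinate are hypothetical, and the sketch handles the forward strand only.

```python
def upstream(genome, gene_start, length=15):
    """Return up to `length` nucleotides immediately before gene_start.

    gene_start is the 0-based index of the first nucleotide of the gene.
    If the gene lies closer than `length` to the start of the sequence,
    a shorter string is returned.
    """
    begin = max(0, gene_start - length)
    return genome[begin:gene_start]


# Toy sequence, invented for illustration: 15 nucleotides, then ATG.
toy_genome = "ACGTACGTACGTACGATGCCC"
toy_gene_start = toy_genome.index("ATG")

print(upstream(toy_genome, toy_gene_start))     # the 15 nucleotides before ATG
print(upstream(toy_genome, toy_gene_start, 5))  # only the last 5 of them
```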

    You might be able to find some sort of pattern in all those nucleotides (the human mind can find a pattern in anything), but certainly nothing jumps out as did the initial triplets.
    Some statistical analysis might focus our attention on areas of interest. Suppose we built a table that looked like what you see to the right.

    If there were particular biases for or against certain nucleotides at specific positions before genes, maybe the table would make them apparent. Let's find out. First, give the set of sequences a name, something like:


  4. Now we need to go through each sequence and tally the nucleotides at each position. If you were to do this by hand, you would go through each sequence and at position -15 count whether the nucleotide is an A, C, G, or T, putting a mark in the appropriate square of the table. Then you'd do the same with position -14, and so forth. After finishing this sequence, you'd go on to the next, and the next, and the next,... until you had finished the sequence for each of the 2890 genes in S7942B.

    Not something you want to do when the weather's nice outside. Fortunately, BioBIKE can do this automatically, using a function called MAKE-PSSM-FROM. A PSSM (Position-Specific Scoring Matrix) is a table of the sort we imagined, except that frequencies instead of counts are given.

    Find this function in the STRING-SEQUENCES menu and Bioinformatic-tools submenu, giving you in the end something like the following:

    You could click on the aligned-list gray argument box and type in the name upstream-sequences, but here's a more foolproof method: click on the gray box, then click on the VARIABLES menu, and finally click on upstream-sequences. That will transfer the name of the variable to the argument box with no possibility of misspelling it. Finally, execute the function. You will get a message indicating that you generated a two-dimensional table, as expected.
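The hand tally described in this step can be sketched in Python as follows. This is an illustration of the idea only, not the implementation of MAKE-PSSM-FROM, which may differ (for example in how it treats ambiguous nucleotides or whether it applies further scoring). The function name `make_frequency_table` and the four toy sequences are invented for the example.

```python
def make_frequency_table(aligned_sequences):
    """Tally nucleotides position by position and convert counts to frequencies.

    All sequences must have the same length. Returns a dict mapping each
    nucleotide (A, C, G, T) to a list of frequencies, one per position.
    Characters other than A, C, G, T are ignored in the tally.
    """
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("all sequences must have the same length")

    counts = {nt: [0] * length for nt in "ACGT"}
    for seq in aligned_sequences:
        for pos, nt in enumerate(seq.upper()):
            if nt in counts:
                counts[nt][pos] += 1

    total = len(aligned_sequences)
    return {nt: [c / total for c in row] for nt, row in counts.items()}


# Toy upstream sequences, invented for illustration (4 nucleotides each).
toy_upstream = ["ACGT", "ACGA", "TCGA", "ACTA"]
table = make_frequency_table(toy_upstream)

# Label columns by position relative to the gene start (-4 ... -1).
width = len(toy_upstream[0])
print("   " + " ".join(f"{p:>5}" for p in range(-width, 0)))
for nt in "ACGT":
    print(f"{nt}  " + " ".join(f"{f:5.2f}" for f in table[nt]))
```

With 15-nucleotide sequences the same function would produce columns labeled -15 through -1, matching the layout of the table imagined earlier.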

     

  5. To display the contents of the table, grab the DISPLAY-TABLE function from the INPUT-OUTPUT menu, and supply it with PREVIOUS-RESULT from the OTHER-COMMANDS menu, giving:


     

PROBLEM 4:

What do you make of the results? Can you find any nucleotide counts at any position that stand out? Can you imagine any explanation for the pattern?

PROBLEM 5:

Examine lots of sequences before genes of S7942B. Now that you know what to look for, do you see by eye what your program detected as an aggregate? How (in principle) could you test whether the pattern is significant or your eye is just inventing it?

If you got this far,
then you have found a tantalizing
hint of a signal that you probably
would not have noticed just by
random browsing!

Use your browser to go back
one page to the table of contents.