BNFO 650: Sequence Analysis in Biological Systems

RNA Secondary Structure and small non-coding RNA

An integrated programming environment

Link to mfold RNA scoring tables

Prediction of RNA secondary structure

Presentations

Problems

These problems make use of some special functions within BioBIKE. To access them, mouse over the OTHER-COMMANDS button and click ENTER. Then type bnfo601 in the package-name box. You should find two functions, FILL-IN-ENERGY-TABLE and SEQUENCE-FROM, in your FUNCTIONS menu.

1.

Determine the predicted free energies of formation of the optimal secondary structures of different RNAs, each built from a dinucleotide as described in 1a and 1b. First use the mfold RNA scoring tables to make your own assessment of what the free energies ought to be. Then use mfold to get another opinion.

 1a. The RNA should consist of 10 repetitions of the dinucleotide, followed by 10 N's, followed by the inverse (that is, the reverse complement) of the first 20 nucleotides. For example, considering the dinucleotide "AC", you would test:
     ACACACACACACACACACACNNNNNNNNNNGTGTGTGTGTGTGTGTGTGT

 1b. The RNA should consist of 10 repetitions of the first nucleotide followed by 10 repetitions of the second, followed by 10 N's, followed by the inverse (reverse complement) of the first 20 nucleotides. For example, considering the same dinucleotide "AC", you would test:
     AAAAAAAAAACCCCCCCCCCNNNNNNNNNNGGGGGGGGGGTTTTTTTTTT

 1c. Enter both free-energy values into the tables that will represent findings from everyone, using the FILL-IN-ENERGY-TABLE function. What general conclusions do you draw from a comparison of the free energies of the different oligonucleotides?
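The two constructions in 1a and 1b can be sketched outside BioBIKE as follows. This is only an illustrative sketch in Python, not part of the course tools: the function names `reverse_complement`, `alternating_hairpin`, and `block_hairpin` are made up for this example, and the DNA alphabet (T rather than U) is used simply because the examples above use it.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a sequence written with A, C, G, T, N."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(pairs[base] for base in reversed(seq.upper()))


def alternating_hairpin(dinucleotide: str, repeats: int = 10, loop: int = 10) -> str:
    """Problem 1a: repeated dinucleotide, a run of N's, then the reverse complement."""
    stem = dinucleotide * repeats
    return stem + "N" * loop + reverse_complement(stem)


def block_hairpin(dinucleotide: str, repeats: int = 10, loop: int = 10) -> str:
    """Problem 1b: a block of each nucleotide, a run of N's, then the reverse complement."""
    first, second = dinucleotide
    stem = first * repeats + second * repeats
    return stem + "N" * loop + reverse_complement(stem)


if __name__ == "__main__":
    print(alternating_hairpin("AC"))
    print(block_hairpin("AC"))
```

Run as a script, this prints the two "AC" test sequences given in 1a and 1b; substituting another dinucleotide gives the corresponding pair of sequences to paste into mfold.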

2.

How can you tell whether a predicted structure is biologically significant?

 2a. Use mfold to predict the secondary structure of the mystery sequence, seq1. Does it look significant to you? Note the free energy.

 2b. Predict the secondary structure of a random sequence with the same nucleotide composition as seq1. How does it look? Note the free energy.
 Possibly useful function:     SHUFFLE (in STRING/SEQUENCES, String-production)

 2c. Assign the free energy of the folding of the random sequence to a variable called energy. All of our values will be aggregated. What general conclusions do you draw from a comparison of the free energies of the random oligonucleotides to that of seq1?
 Possibly useful function:     DEFINE (in DEFINITION)
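The idea behind 2b, a random sequence with exactly the same nucleotide composition, amounts to permuting the letters of seq1. A minimal Python sketch of that idea follows; it is not BioBIKE's SHUFFLE, and the function name `shuffled` and its `seed` parameter are assumptions made for this illustration.

```python
import random
from collections import Counter


def shuffled(seq: str, seed=None) -> str:
    """Return a random permutation of seq; the letter counts are unchanged."""
    rng = random.Random(seed)  # a seed makes the shuffle repeatable
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


if __name__ == "__main__":
    original = "ACGUACGUAAGG"  # placeholder sequence, not seq1
    control = shuffled(original, seed=1)
    print(control)
    print(Counter(control) == Counter(original))  # composition is preserved
```

Because only the order of the letters changes, any difference in free energy between the original and the shuffled sequence reflects the arrangement of nucleotides rather than their composition.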

3.

Problems with predicting folding within large sequences.

 3a. Use mfold to predict the secondary structure of the mystery sequence, seq2. Does it look significant to you? Note the free energy.

 3b. Repeat the analysis with several 100-nt segments from within seq2. Note the free energies.
 Possibly useful function:     SEQUENCE-OF (in STRING/SEQUENCES)

 3c. Take the 100-nt segment that gave the best structure and use it to query GenBank. What do you come up with? What general conclusions do you draw?
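One systematic way to pick the 100-nt segments in 3b is a sliding window. The sketch below is an assumption about how you might organize this, not a prescribed method: the function name `windows` and the 50-nt step are arbitrary choices for illustration, and coordinates are reported 1-based and inclusive.

```python
def windows(seq: str, size: int = 100, step: int = 50):
    """Yield (start, end, segment) for each full-length window of seq.

    start and end are 1-based, inclusive coordinates within seq.
    """
    for offset in range(0, len(seq) - size + 1, step):
        yield offset + 1, offset + size, seq[offset:offset + size]


if __name__ == "__main__":
    demo = "ACGU" * 75  # 300-nt placeholder, not seq2
    for start, end, segment in windows(demo):
        print(start, end, segment[:10] + "...")
```

Each segment can then be folded separately, and its free energy recorded alongside its coordinates so that the best window is easy to find again.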

4.

Is RNAz, implemented in BioBIKE as FIND-CONSERVED-RNA-IN, capable of (re)discovering small noncoding RNAs in real genomes?

 4a. Pick out a test case. Choose your favorite gene amongst the noncoding genes of Prochlorococcus marinus MED4 (nicknamed PMED4).
 Possibly useful function:     NON-CODING-GENES-OF (in GENOME)

 4b. Create the gene anchors, between which you will search for a putative noncoding RNA. Define a variable, upstream-anchor, as the gene upstream from the test gene. Similarly, define a variable, downstream-anchor, as the gene downstream from the test gene. These will be used to find regions in other organisms where the putative RNA gene might lie.
 Possibly useful functions:     GENE-UPSTREAM-OF (in GENES/PROTEIN, Gene-neighborhood)     GENE-DOWNSTREAM-OF (in GENES/PROTEIN, Gene-neighborhood)

 4c. Create the sequence between the anchor genes, within which you'll search for the putative noncoding RNA.
 Possibly useful function:     SEQUENCE-FROM (in your FUNCTIONS menu)

 4d. Create a list of possible target sequences in several organisms (not to exceed 6). To do this, create a FOR-EACH loop that loops over each organism within a list of organisms. Within the body of the loop, define one variable as the ortholog of the upstream-anchor in the current organism, a second variable as the ortholog of the downstream-anchor in the current organism, and a third variable as the sequence from the first gene to the second. Within the Results section, collect this sequence. I suggest you use the first six organisms of *marine-prochlorococcus*.
 Possibly useful functions:     FOR-EACH (in FLOW/LOGIC)     *marine-prochlorococcus* (in DATA, Marine Unicellular Cyano)     FIRST (in LISTS-TABLES, List-extraction)     ORTHOLOG-OF (in GENES/PROTEIN)

 4e. Try to find the conserved RNA by running RNAz (as implemented through FIND-CONSERVED-RNA-IN) over the list you created, or slices within the list. Note that FIND-CONSERVED-RNA-IN allows you to work with a FROM/TO range within the aligned sequences.
 Possibly useful function:     FIND-CONSERVED-RNA-IN (in STRING/SEQUENCES, Bioinformatic Tools)
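The loop described in 4d has a simple shape: for each organism, look up the two anchor orthologs, take the sequence between them, and collect it. The Python sketch below shows only that shape. Everything in it is hypothetical: the toy genomes, the organism names, the gene coordinates, and the helpers `sequence_between` and `collect_targets` are invented for illustration and do not correspond to BioBIKE's FOR-EACH, ORTHOLOG-OF, or SEQUENCE-FROM. It also assumes that only the intergenic stretch is wanted and that gene coordinates are 1-based and inclusive.

```python
def sequence_between(genome_seq: str, gene_a: tuple, gene_b: tuple) -> str:
    """Return the sequence lying strictly between two genes.

    Each gene is a (start, end) pair, 1-based and inclusive; order does not matter.
    """
    left_end = min(gene_a[1], gene_b[1])       # last base of the leftmost gene
    right_start = max(gene_a[0], gene_b[0])    # first base of the rightmost gene
    return genome_seq[left_end:right_start - 1]


def collect_targets(genomes: dict, limit: int = 6) -> list:
    """Collect one between-anchor sequence per organism, skipping missing orthologs."""
    targets = []
    for name, genome in list(genomes.items())[:limit]:
        upstream = genome["orthologs"].get("upstream-anchor")
        downstream = genome["orthologs"].get("downstream-anchor")
        if upstream is None or downstream is None:
            continue  # no ortholog found in this organism
        targets.append(sequence_between(genome["sequence"], upstream, downstream))
    return targets


# Toy data, invented purely for illustration.
toy_genomes = {
    "orgA": {"sequence": "AAAACCCCGGGGTTTT",
             "orthologs": {"upstream-anchor": (1, 4), "downstream-anchor": (13, 16)}},
    "orgB": {"sequence": "TTTTGGGGAACCAAAA",
             "orthologs": {"upstream-anchor": (1, 4), "downstream-anchor": (13, 16)}},
    "orgC": {"sequence": "ACGTACGTACGTACGT",
             "orthologs": {"upstream-anchor": (1, 4)}},  # downstream ortholog missing
}

if __name__ == "__main__":
    print(collect_targets(toy_genomes))
```

The check for a missing ortholog is worth keeping in mind when you build the real loop, since not every organism need contain orthologs of both anchor genes.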

5.

You have experimentally identified two sRNAs (sRNA1 and sRNA2) in the cyanobacterium Prochlorococcus marinus MED4 (nicknamed PMED4). Do the RNAs exist in the related cyanobacteria PMIT9313 and S8102?

Possibly useful functions:
    SEQUENCE-SIMILAR-TO (in GENES/PROTEIN)
    CONTEXT-OF (in GENES/PROTEIN, Gene-neighborhood)
    SEQUENCE-UPSTREAM-OF (in GENES/PROTEIN, Gene-neighborhood)

6.

Do you predict that the region between alr1152 (dmtB) and alr1153 (trpD2) contains an sRNA?