Intro to Bioinformatics: Myers et al (2000)

BNFO 301

Introduction to Bioinformatics
Topic: Genome Sequencing
Tour of Myers et al (2000)

Spring 2013

Where do genome sequences come from? There is no better way to find out than to follow the process of one being made. One of the first eukaryotic genomes sequenced was that of the fruit fly, Drosophila melanogaster. An account of the sequencing and the analysis of the sequence was published in a series of articles. You'll want to be reading along in the first of them:

Myers EW et al (2000). A whole-genome assembly of Drosophila.
Science 287:2196-2204.

It's all too easy to flow along an article saying yes, yes, that's all right, I understand, when really not much is connecting. True for me at any rate. The cure for this disease is asking questions, and since no one else is around while you're reading, the questions must be directed at yourself and at the article.

I've read this article along with you. You may use these notes as a companion as you read the article yourself. I've listed many questions that occurred to me. I hope many more occurred to you. If they remain unanswered by the time you emerge from the article, jot them down and send them to me.

Warning! It is easy to get overwhelmed by the various terms. Keep on track by focusing on the big question: How did they obtain the DNA sequence of the Drosophila genome? Build a visual picture of the process -- don't be content with words. If there are holes in your growing mental image, search the article for help in filling them, even if it means ignoring 90% of what the article says. You control the article. Don't let the article control you.

Introduction
Sequencing genomes ultimately depends on the sequencing of DNA fragments. In the first paragraph, Myers et al speak of dideoxy sequencing, invented by Fred Sanger's group. See previous notes for links to explanations of the method. Click here to see the first published sequence determined by dideoxy sequencing.

SQ1. In which direction is the sequencing gel read (top to bottom or bottom to top) in order to get a sequence that is 5' to 3'? To answer this question, you'll need to consider the basis of dideoxy sequencing.
SQ2. What is the sequence (5' to 3') represented by the gel? Not easy, huh? That was life in the old days. Here's some help if you need it.
SQ3. From what organism/molecule does this first published dideoxy sequence come? Yeah, you could probably look it up, but instead use available bioinformatic resources to find out.

Dideoxy sequencing is still the standard, but sequencing no longer employs radioactivity and autoradiograms as in the original Sanger paper. Instead DNA is tagged by fluorescence, a different color used for each of the dideoxynucleotides. Here is an example.

SQ4. The authors will refer later in the article to the "high-quality region" of a read. Where in this example is the high-quality region?
SQ5. Why can't the whole read be "high-quality"? Look at Sanger's gel for inspiration.

In paragraph 3, you read that the success of shotgun sequencing depended on "...pairs of reads, called mates, from the ends of 2-kbp and 10-kbp inserts randomly sampled from the genome." Mates wouldn't be critical if dideoxy sequencing could determine very long sequences directly.

SQ6. Why are read pairs so valuable in determining the final sequence?

Early genome-sequencing efforts used ordered sets of cosmids, each carrying a 35-45 kbp DNA fragment. Later efforts used P1 libraries or bacterial artificial chromosome (BAC) libraries.

SQ7. Why mess around with libraries of cosmids, BACs, or anything else? Why not just sequence the chromosomal DNA?

At the end of the Introduction, the authors describe their strategy to sequence the Drosophila genome. They talk about "10X oversampling" and "15X coverage".

SQ8. What do "oversampling" and "coverage" mean?
SQ9. Why different numbers -- 10X and 15X? A misprint?
SQ10. Why not 1X? How much of the sequence would thereby be determined? (You'll no doubt need a calculator or a computer.)
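
For SQ10, one way to get a feel for the numbers is an idealized model in which reads land uniformly at random on the genome, so the number of reads covering any given base is approximately Poisson-distributed. That model is an assumption brought in for these notes, not something taken from the article, and the function name is made up for illustration. A minimal Python sketch:

```python
import math

def expected_fraction_covered(coverage):
    """Expected fraction of bases hit by at least one read.

    Assumes reads land uniformly at random, so the number of reads
    covering a base is Poisson with mean `coverage` (an idealization,
    not a claim about how the article's authors reasoned).
    """
    return 1.0 - math.exp(-coverage)

# Tabulate a few coverage levels, including the ones the article mentions.
for c in (1, 2, 5, 10, 15):
    frac = expected_fraction_covered(c)
    print(f"{c:>2}X coverage: {frac:.6f} of bases expected to be covered")
```

Run it and compare what 1X leaves uncovered with what 10X leaves uncovered. Real libraries are not perfectly random, so treat the output as a rough guide.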

Celera Assembler Design Principles
The authors (and anyone else who has to assemble sequences) have considerable respect for repeated sequences. There are many repeated sequences in the Drosophila genome that are larger than the maximum read size of several hundred nucleotides.

SQ11. How is it possible to assemble the genome in the face of large repeats?
SQ12. The authors say that the assembler used "external data". What do they mean by this? Don't try to figure it out from what they say in this section, but answer the question after reading further.

The Drosophila Data Sets

SQ13. Consider each of the data types listed in Table 1. Why is each important in obtaining a completed sequence?
SQ14. From figures given in the text and in Table 1, check the accuracy of each of the following statements:
     a. "We produced 3.156 million reads that yielded 1.76 Gbp of sequence. . ."
     b. ". . .trillions of overlaps between reads are examined."
     c. ". . .to produce 654,000 of the 2-kbp mates and 497,000 of the 10-kbp mates."
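
The arithmetic behind SQ14 is easy to script. The sketch below uses only the figures quoted in (a) through (c); the function names are invented for illustration, and counting every possible pair of reads is an assumed upper bound on the work, not a description of what the assembler actually does. You will still need Table 1 to finish the check.

```python
def mean_read_length(total_bases, n_reads):
    """Average number of bases contributed per read."""
    return total_bases / n_reads

def all_pairs(n):
    """Number of distinct pairs among n items: n choose 2."""
    return n * (n - 1) // 2

n_reads = 3_156_000          # from quote (a)
total_bases = 1_760_000_000  # 1.76 Gbp, from quote (a)
mates_2k = 654_000           # from quote (c)
mates_10k = 497_000          # from quote (c)

print("mean bases per read:", round(mean_read_length(total_bases, n_reads), 1))
print("possible read pairs:", f"{all_pairs(n_reads):.3e}")
print("total mate pairs:   ", mates_2k + mates_10k)
print("reads in mate pairs:", 2 * (mates_2k + mates_10k))
```

Ask yourself whether the mean read length is plausible for the high-quality region of a dideoxy read, and whether "trillions" is a fair word for the number of pairs.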

Celera Assembler's Algorithmic Design: Introduction

SQ15. From the information in the opening paragraph in this section, how many mistakes would you expect in a read 500 nt long? If there is no further correction mechanism, how many mistakes would you expect in the entire Drosophila genome?
SQ16. The authors stressed that in their design, "[a]ny prefix sequence of the high-quality region matching the sequencing vector. . . was aggressively removed." What is meant by "sequencing vector", and why bother to remove it?
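
For SQ15, the calculation is just an error rate multiplied by a length. Here is a small Python sketch of that arithmetic. Both numbers in it are placeholders: the per-base error rate is hypothetical (take the real one from the article's opening paragraph of this section), and the genome size of roughly 120 Mbp for the euchromatic portion is an assumed round figure that you should replace with whatever the article gives.

```python
def expected_errors(error_rate, n_bases):
    """Expected number of wrong bases, assuming each base is
    independently wrong with probability `error_rate`."""
    return error_rate * n_bases

# HYPOTHETICAL per-base error rate; substitute the article's figure.
placeholder_error_rate = 0.01
# ASSUMED genome size in bases; substitute the article's figure.
assumed_genome_size = 120_000_000

print("errors in a 500-nt read:", expected_errors(placeholder_error_rate, 500))
print("errors genome-wide:     ", expected_errors(placeholder_error_rate, assumed_genome_size))
```

The genome-wide number is what you would get with no further correction. Keep it in mind when the article explains how overlapping reads are used.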

Celera Assembler's Algorithmic Design: Screener
The authors spent a good deal of effort screening repeated sequences out of the assembly process.

SQ17. Why do it?
SQ18. In this regard, what is the significant difference between a "hard screen" and a "soft screen"?
SQ19. Screening out sequences implies that they won't be part of the final assembly. The authors justified their procedure, saying that it was consistent with ". . . implicit goal of all sequencing efforts, that is, to determine the sequence of the euchromatic segments of the genome." Why do they consider this the goal of all sequencing efforts? Why euchromatin?

Celera Assembler's Algorithmic Design: Overlapper

SQ20. Consider Fig. 2. Explain how two overlapping fragments are consistent with both scenarios. Add arrows to the diagram labeled (ii) to indicate the orientation of the two repeated segments relative to one another.

Celera Assembler's Algorithmic Design: Unitigger

SQ21. Consider Fig. 3. Explain how at this stage in the assembly process the diagram labeled Target may be produced.

Celera Assembler's Algorithmic Design: Scaffolder

SQ22. The authors say that relying on the left and right reads of a mate to connect unitigs is accurate only 99.66% of the time. Actually that doesn't sound so bad! Usually in biology, a confidence level of 95% is considered OK and 99% is great. Why are they so hard on themselves? What if they accepted this error rate?
SQ23. How do they drive the error rate down even further? Where did they get the estimate "1 in 10^15"?
SQ24. What process relates reads to unitigs? What process relates unitigs to. . . to what? What is the analogous concept?
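
To see what is at stake in SQ22, it helps to turn the percentage into a count. The Python sketch below multiplies the 0.34% failure rate by the number of mate pairs quoted earlier (654,000 plus 497,000). Applying one rate to every mate pair is a simplifying assumption made here, and the function name is invented for illustration. The sketch does not attempt to reproduce the article's "1 in 10^15" estimate; that is left for SQ23.

```python
def expected_bad_links(n_mate_pairs, accuracy):
    """Expected number of misleading mate links, assuming every pair
    independently fails with probability (1 - accuracy)."""
    return n_mate_pairs * (1.0 - accuracy)

n_mate_pairs = 654_000 + 497_000   # 2-kbp and 10-kbp mates quoted above
accuracy = 0.9966                  # per-mate accuracy quoted in SQ22

print("mate pairs:         ", n_mate_pairs)
print("expected bad links: ", round(expected_bad_links(n_mate_pairs, accuracy)))
```

Now imagine each of those bad links joining two unitigs that do not belong together, and ask again whether 99.66% is good enough.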

Characteristics of the Drosophila assembly

SQ25. 83% of the scaffolds are unconnected to other Drosophila sequences known from this project or any other project. That sounds pretty bad. Where did I get that fraction? Why aren't the authors concerned about it?
SQ26. The authors finally make use of the wealth of Sequence Tagged Sites (STS) information available to them. What is an STS map? If you have no clue, try this one: What is a genetic map?
SQ27. How did the authors use the information given by the STS map?