Topic: Genome Sequencing
Tour of Myers et al (2000)
Myers EW et al (2000). A whole-genome assembly of Drosophila.
It's all too easy to flow along an article saying yes, yes, that's all right, I understand, when really not much is connecting. True for me at any rate. The cure for this disease is asking questions, and since no one else is around while you're reading, the questions must be directed at yourself and at the article.
I've read this article along with you. You may use this tour as a companion as you read the article yourself. I've listed many questions that occurred to me. I hope many more occurred to you. If they remain unanswered by the time you emerge from the article, jot them down and send them to me.
SQ1. In which direction is the sequencing gel read (top to bottom or bottom to top) in order to get a sequence that is 5' to 3'? To answer this question, you'll need to consider the basis of dideoxy sequencing.
Dideoxy sequencing is still the standard, but sequencing no longer employs radioactivity and autoradiograms as in the original Sanger paper. Instead DNA is tagged by fluorescence, a different color used for each of the dideoxynucleotides. Here is an example.
SQ4. The authors will refer later in the article to the "high-quality region" of a read. Where in this example is the high-quality region?
In paragraph 3, you read that the success of shotgun sequencing depended on "...pairs of reads, called mates, from the ends of 2-kbp and 16-kbp inserts randomly sampled from the genome." Mates wouldn't be critical if dideoxy sequencing could determine very long sequences directly.
SQ6. Why are read pairs so valuable in determining the final sequence?
Early genome-sequencing efforts used ordered sets of cosmids, each carrying a 35-45 kbp DNA fragment. Later efforts used P1 libraries or bacterial artificial chromosome (BAC) libraries.
SQ7. Why mess around with libraries of cosmids, BACs, or anything else? Why not just sequence the chromosomal DNA?
At the end of the Introduction, the authors describe their strategy to sequence the Drosophila genome. They talk about "10X oversampling" and "15X coverage".
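If you'd like to put numbers on figures written as "10X", here is a minimal Python sketch. It assumes the common definition of sequence coverage (total bases read divided by genome length) and the usual idealization that reads land uniformly at random, in which case the chance that a given base is missed by every read is roughly e^(-coverage). The function names and the toy genome size, read length, and read count are my own assumptions, not values from the article. How the authors use their two terms is still for you to work out.

```python
import math


def sequence_coverage(n_reads: int, read_len: int, genome_size: int) -> float:
    """Total bases sequenced divided by genome length (the 'NX' figure)."""
    return n_reads * read_len / genome_size


def expected_fraction_unsampled(coverage: float) -> float:
    """Approximate fraction of bases hit by no read at all.

    Assumes reads start at uniformly random positions, so the number of
    reads over any one base is roughly Poisson with mean `coverage`.
    """
    return math.exp(-coverage)


# Toy numbers, chosen only to make the arithmetic easy (hypothetical).
genome_size = 1_000_000   # bases
read_len = 500            # bases per read
n_reads = 20_000

c = sequence_coverage(n_reads, read_len, genome_size)
print(f"coverage: {c:.1f}X")
print(f"expected fraction of bases in no read: {expected_fraction_unsampled(c):.2e}")
```

Try halving the number of reads and see how quickly the unsampled fraction grows.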
SQ8. What do "oversampling" and "coverage" mean?
Celera Assembler Design Principles
The authors (and anyone else who has to assemble sequences) have considerable respect for repeated sequences. There are many repeated sequences in the Drosophila genome that are larger than the maximum read size of several hundred nucleotides.
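To see why long repeats earn that respect, here is a toy Python sketch. It builds two made-up "genomes" that differ only in the order of the unique segments between three copies of one repeat, then compares the complete collection of error-free reads of a given length from each. Everything here (the sequences, the names, the idea of taking every possible read) is an assumption chosen for illustration and has nothing to do with the real Drosophila data.

```python
from collections import Counter


def all_reads(genome: str, read_len: int) -> Counter:
    """Every error-free read of length `read_len`, counted with multiplicity."""
    return Counter(
        genome[i:i + read_len] for i in range(len(genome) - read_len + 1)
    )


# Hypothetical toy sequences: one 10-base repeat and four unique segments.
REPEAT = "ATGCATGCAT"
X, Y, Z, W = "TTGAC", "GGATC", "CTTAG", "AACCG"

# Same pieces, but the two middle unique segments are swapped.
genome_a = X + REPEAT + Y + REPEAT + Z + REPEAT + W
genome_b = X + REPEAT + Z + REPEAT + Y + REPEAT + W

for read_len in (11, 12):
    same = all_reads(genome_a, read_len) == all_reads(genome_b, read_len)
    print(f"read length {read_len}: identical read sets? {same}")
```

With reads of 11 bases no read reaches from one unique segment across the whole repeat into the next, and the two genomes yield exactly the same reads. At 12 bases some reads span the repeat with a base on each side, and the read sets differ. Keep this picture in mind for the next question.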
SQ11. How is it possible to assemble the genome in the face of large repeats?
The Drosophila Data Sets
SQ13. Consider each of the data types listed in Table 1. Why is each important in obtaining a completed sequence?
Celera Assembler's Algorithmic Design: Introduction
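The arithmetic in the question below is an expected-value calculation, and a small Python sketch may help you check your own numbers. The error rate and genome size used here are placeholders I made up; substitute the per-base figure from the opening paragraph of this section of the article and whatever genome size you think is appropriate. The sketch also assumes errors strike each base independently, which is a simplification.

```python
def expected_errors(n_bases: int, per_base_error_rate: float) -> float:
    """Expected number of miscalled bases among `n_bases` bases."""
    return n_bases * per_base_error_rate


def prob_at_least_one_error(n_bases: int, per_base_error_rate: float) -> float:
    """Chance that a stretch of `n_bases` contains one or more errors,
    assuming each base is miscalled independently."""
    return 1.0 - (1.0 - per_base_error_rate) ** n_bases


# Placeholder values (hypothetical): replace with the article's figures.
error_rate = 0.01
read_len = 500
genome_size = 1_000_000

print("expected errors per read:  ", expected_errors(read_len, error_rate))
print("expected errors per genome:", expected_errors(genome_size, error_rate))
print("P(read has >= 1 error):    ", prob_at_least_one_error(read_len, error_rate))
```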
SQ15. From the information in the opening paragraph in this section, how many mistakes would you expect in a read 500 nt long? If there is no further correction mechanism, how many mistakes would you expect in the entire Drosophila genome?
Celera Assembler's Algorithmic Design: Screener
The authors spent a good deal of effort screening repeated sequences out of the assembly process.
SQ17. Why do it?
Celera Assembler's Algorithmic Design: Overlapper
SQ20: Consider Fig. 2. Explain how two overlapping fragments are consistent with both scenarios. Add arrows to the diagram labeled (ii) to indicate the orientation of the two repeated segments relative to one another.
Celera Assembler's Algorithmic Design: Unitigger
SQ21: Consider Fig. 3. Explain how at this stage in the assembly process the diagram labeled Target may be produced.
Celera Assembler's Algorithmic Design: Scaffolder
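A per-link accuracy looks different once you multiply it by the number of links, and a short Python sketch makes that concrete for the question below. The 99.66% figure is the one quoted there; the number of links and the rule of demanding several agreeing mate pairs are hypothetical choices of mine for illustration, and the sketch assumes that links go wrong independently of one another.

```python
LINK_ACCURACY = 0.9966  # per-mate accuracy quoted in the question below


def expected_bad_links(n_links: int, accuracy: float = LINK_ACCURACY) -> float:
    """Expected number of wrong joins if every single mate link is trusted."""
    return n_links * (1.0 - accuracy)


def prob_all_wrong(k: int, accuracy: float = LINK_ACCURACY) -> float:
    """Chance that k independent mate pairs supporting the same join
    are all wrong."""
    return (1.0 - accuracy) ** k


n_links = 100_000  # hypothetical number of mate links, for scale only
print(f"expected wrong joins among {n_links} links: {expected_bad_links(n_links):.0f}")
for k in (1, 2, 3):
    print(f"{k} agreeing mate pair(s): P(all wrong) = {prob_all_wrong(k):.2e}")
```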
SQ22: The authors say that relying on the left and right reads of a mate to connect unitigs is accurate only 99.66% of the time. Actually that doesn't sound so bad! Usually in biology, a confidence level of 95% is considered OK and 99% is great. Why are they so hard on themselves? What if they accepted this error rate?
Characteristics of the Drosophila assembly
SQ25: 83% of the scaffolds are unconnected to other Drosophila sequences known from this project or any other project. That sounds pretty bad. Where did I get that fraction? Why aren't the authors concerned about it?