Introduction to Bioinformatics: Genome Sequencing

BNFO 301

Introduction to Bioinformatics
Topic: Genome Sequences
Where they come from

Spring 2010

Where genome sequences come from

Note that the path leading to Stockholm described in the story underlying this module relied on the existence of Drosophila genes and proteins in an accessible database. Before 2000, no database contained entries for more than a small fraction of genes and proteins from Drosophila. Before 1995, no database contained entries for more than a small fraction of genes from any organism. The fact that GenBank and other similar databases provide so rich a source of information results from the hundreds of genome sequencing projects that have sprung up since 1995.

One can break up a genome project in many ways. Here's one:

Obtaining the raw sequence of a genome

Identifying genes within the genome

Deducing the functions of the proteins encoded by the genes

In the next set of notes we'll examine the first step, using as an example the elucidation of the Drosophila genome, as described in:

Myers EW et al (2000). A whole-genome assembly of Drosophila. Science 287:2196-2204.

You'll eventually want to read this article lots of times, at least the first four pages (up to but not including Characteristics of the Drosophila assembly). For now, just skim those pages to get an idea what's in store.

To get the article, proceed as described in How to Find Articles or go directly to NCBI (perhaps through the Links page on the course web site) and use PubMed to search for the article. If you have a problem getting to the full article, solve it! Librarians are very helpful people. I also know how to do it, as do many of your peers. If you choose to print out the article, choose the Full Text (PDF) link under Article Views. As you read the article, generate questions, particularly on issues that are essential for you to understand how genome sequences are elucidated. I have tried to anticipate some questions and provide some ways for you to answer them.

Warning! It is easy to get overwhelmed by the various terms. Keep on track by focusing on the big question: How did they obtain the DNA sequence of the Drosophila genome? Build a visual picture of the process -- don't be content with words. If there are holes in your growing mental image, search the article for help in filling them, even if it means ignoring 90% of what the article says. You control the article. Don't let the article control you.

The main task for today is to understand some of the techniques used in the paper. I know you are capable of finding background on the web, but I've saved you some trouble by gathering together some useful links (I won't always be so helpful). Use them or anything else you like to get the basic idea.

What is shotgun sequencing?

Sequencing Whole Genomes: Hierarchical Shotgun Sequencing v. Shotgun Sequencing, Malcolm Campbell, Davidson College
Shotgun Approach to Genome Sequencing (link not reliable; Flash animation), Sociedad Mexicana de Ciencias Genómicas
Genome Sequencing Assembly Primer, Center for Bioinformatics & Computational Biology (U. Maryland)
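
If a picture in code helps, here is a toy sketch in Python of the basic shotgun idea: break a sequence into overlapping reads, forget where they came from, and rebuild the sequence by repeatedly merging the two pieces with the longest overlap. This is only an illustration, not the method used in the Myers et al. article. The 25-base "genome", the read length, and the function names overlap and greedy_assemble are all made up for this example, and the sketch assumes error-free reads and a sequence with no repeats, which real genomes do not offer.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair of pieces until no overlaps remain."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # remaining pieces do not overlap: separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical toy "genome" (chosen so that no 3-base word occurs twice).
genome = "ATGCGTACCTGAAGTCTTAGGCACA"

# "Shotgun" step: 10-base reads starting every 5 bases, then scrambled,
# because in a real experiment the order of the reads is unknown.
reads = [genome[i:i + 10] for i in range(0, len(genome) - 9, 5)]
reads.reverse()

contigs = greedy_assemble(reads)
print(len(contigs), contigs[0] == genome)  # prints: 1 True
```

Try deleting one of the middle reads before assembling: the program then returns two pieces instead of one, which is a small-scale version of a gap between contigs.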

What is dideoxy sequencing?

Sanger Method for DNA Sequencing, Elizabeth Canfield, Davidson College
Nucleic Acid Sequencing (link not reliable; Flash animation), Sociedad Mexicana de Ciencias Genómicas

What are BAC libraries? What are P1 inserts?

Monaco AP and Larin Z (1994). YACs, BACs, PACs, and MACs: Artificial chromosomes as research tools. Trends in Biotechnology 12:280-286. (class password required)
(You're getting a link to the article because VCU doesn't subscribe to the journal)
Shizuya H et al (1992). Cloning and stable maintenance of 300-kilobase-pair fragments of human DNA in Escherichia coli using an F-factor-based vector. Proceedings of the National Academy of Sciences USA 89:8794-8797.