Biol 591 
Introduction to Bioinformatics
Fall 2002 

Scenario: Comparison of genomes to look for genes responsible for pathogenesis
Our Story

E. coli: The good, the bad, and the ugly.
You’ve probably heard of E. coli (the common abbreviation for this bacterium’s full name, Escherichia coli) in one or both of two contexts. The first is news reports about people getting very sick or even dying after eating undercooked hamburger meat, non-pasteurized fruit juice, or alfalfa sprouts contaminated with E. coli. The second place you might have encountered this bacterium is in a molecular biology laboratory. E. coli was mentioned in a previous problem set as a bacterium that might be used for producing a protein of interest. In fact, E. coli cells can be found in most molecular biology laboratories, where they are mostly used as factories for making recombinant DNA or proteins.

As an aside, when we isolate a piece of DNA and put it into E. coli to make more of it, we call this “cloning.” This is because we are making many identical copies of something (usually a gene). This process is sometimes referred to as “molecular cloning” and should not be confused with “organismal cloning”, as in cloning sheep or humans.
E. coli are also favorites for use in experiments in student labs. So why are scientists and teachers exposing themselves and their students to these deadly bacteria?

It turns out that there are several varieties or “strains” of E. coli. They are all related, but can have different properties. We all have a great many E. coli bacteria in our large intestines, or colons (hence, the name “coli”). These cells are usually strains that are not only harmless in that environment, but beneficial, providing us with some vitamins and helping to prevent more harmful bacteria that we might eat from taking up residence and causing disease. So E. coli are mostly good for us. Common laboratory strains, such as E. coli strain K-12, are also harmless. We like to use these strains because they’re easy to grow and because we have lots of experience using them to make DNA and protein. So these bacteria are good, too. Some laboratory E. coli cells even carry cloned genes that allow them to produce light.

The E. coli that are responsible for the illnesses mentioned in the news are most often a different strain, known as O157:H7. The name comes from the particular varieties of two surface structures possessed by this bacterium. This is akin to describing a criminal suspect as having short, brown hair and a dragon tattoo on his right bicep. This strain is definitely bad. (The “ugly” strains are probably those that just cause diarrhea. They usually don’t kill people, but they sometimes make people wish they were dead.)

What’s the difference?
So now that we know that E. coli can be good or bad, harmless or deadly, the question we would like answered is, “What is it about the O157:H7 strain that makes it harmful?” If we can learn this, we might be able to come up with better diagnostic reagents for tracking these bacteria in our food. We might also be able to devise a vaccine or drug that would target this deadly bacterium without targeting the beneficial ones. One obvious suggestion is that the O157 and H7 surface components are responsible for pathogenesis. Alas, this is not the case, just as having short, brown hair and a tattoo doesn’t dictate that someone will be a criminal. The problem of identifying the components of a microbe that are responsible for its pathogenicity (known as “virulence factors”) comes up often in the study of infectious diseases. One general approach to answering this question is to compare a harmful strain, or “pathogen,” to an innocuous strain, or “non-pathogen.” The proteins possessed by the pathogen but absent from the non-pathogen may hold the key to the virulence of the pathogen.


While sophisticated techniques may be used to identify the proteins actually produced by a given bacterial strain (we’ll encounter such proteomic analysis later on), you have all the information you need right now to identify all the proteins E. coli may POTENTIALLY produce. The DNA sequences of the entire genomes of a pathogenic E. coli strain (O157:H7) and a nonpathogenic strain (K-12) have been determined, and from these sequences one can predict with a high degree of accuracy the full complement of functional genes and the proteins they encode. Since the two strains are closely related, you’d expect most of their DNA to be the same, and therefore most of their encoded proteins as well. The proteins that are encoded by E. coli O157:H7 but not by E. coli K-12 may be responsible for its virulence.
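
The reasoning above amounts to a set difference: take everything the pathogen encodes and remove everything the non-pathogen also encodes. Here is a minimal sketch in Python. The protein labels are invented purely for illustration, and the sketch assumes that matching proteins in the two strains already carry the same label. Real proteins must be matched by comparing their sequences, not their names.

```python
# Hypothetical protein labels, invented for illustration only.
k12_proteins = {"protA", "protB", "protC"}
o157_proteins = {"protA", "protB", "protC", "protX", "protY"}


def unique_to(pathogen, non_pathogen):
    """Return, in sorted order, the proteins found in the pathogen
    but not in the non-pathogen."""
    return sorted(set(pathogen) - set(non_pathogen))


# Candidate virulence factors: present in O157:H7, absent from K-12.
candidates = unique_to(o157_proteins, k12_proteins)
print(candidates)
```

In this toy case only the two proteins missing from the K-12 set remain as candidates. Reversing the arguments would instead list proteins found only in K-12.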

What to do about too much success?
Excited by the prospect of identifying the proteins unique to E. coli O157:H7, understanding the basis for its pathogenesis, and winning the gratitude of Burger Kings everywhere, you use a standard bioinformatics tool, BLAST, to compare the set of proteins encoded by E. coli K-12 with the set encoded by E. coli O157:H7. Unfortunately, the output you get from the program is a file several tens of megabytes in size, much bigger than the complete plays of Shakespeare, and much less interesting reading. Surely the answer you seek is in that huge file. How can you rework the output into something you can comprehend?

Problem
Use BLAST to identify those proteins encoded by the pathogen O157:H7 and not by the non-pathogenic laboratory strain K-12, and parse the output into a usable form.

Tools
BLAST
This is a standard tool for comparing sequences, which we'll look at much more closely later. For now, the task is to install the program on your own computer so that you can run entire genomes through it.
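
Once BLAST is installed, a comparison of two protein sets typically takes two commands: one to turn the K-12 proteins into a searchable database, and one to search the O157:H7 proteins against it. The sketch below only builds and prints those commands. It assumes the current NCBI BLAST+ command-line programs (`makeblastdb` and `blastp`), which may differ from the version you install for this course, and all file names and the E-value cutoff are hypothetical placeholders.

```python
import shlex


def build_blast_commands(query_fasta, subject_fasta, db_name, out_file,
                         evalue=1e-5):
    """Return two argument lists: one that builds a protein database from
    subject_fasta, and one that searches query_fasta against it."""
    make_db = ["makeblastdb", "-in", subject_fasta,
               "-dbtype", "prot", "-out", db_name]
    search = ["blastp", "-query", query_fasta, "-db", db_name,
              "-evalue", str(evalue),  # assumed cutoff, adjust as needed
              "-outfmt", "6",          # tab-separated, one hit per line
              "-out", out_file]
    return make_db, search


# Hypothetical file names for the two sets of predicted proteins.
make_db, search = build_blast_commands(
    "o157_proteins.faa", "k12_proteins.faa", "k12_db", "o157_vs_k12.tsv")

for cmd in (make_db, search):
    print(shlex.join(cmd))
```

To actually run the commands rather than print them, each list could be passed to `subprocess.run(cmd, check=True)`. The tab-separated output format is chosen here because it is far easier for a program to read than the default report.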

Parsing program
A parsing program goes through the output and picks out just the items you're interested in, saving them in a convenient format. Writing parsing programs is one of the most common activities of people who do bioinformatics.
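
As an illustration of what such a parser might look like, the sketch below reads BLAST results in tab-separated form and reports the query proteins that have no convincing match. It assumes the twelve-column tabular layout (query ID first, percent identity third, E-value eleventh). The thresholds, the function name, and the sample rows are all made up for this example and are not a recommended standard.

```python
def proteins_without_hits(all_query_ids, blast_lines,
                          max_evalue=1e-5, min_identity=90.0):
    """Return the query IDs that have no hit passing both thresholds.

    blast_lines is any iterable of tab-separated result lines,
    such as an open file or a list of strings.
    """
    matched = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id = fields[0]
        identity = float(fields[2])   # percent identity
        evalue = float(fields[10])    # expectation value
        if evalue <= max_evalue and identity >= min_identity:
            matched.add(query_id)
    # Keep the original order of the query list.
    return [q for q in all_query_ids if q not in matched]


# Made-up result rows: one strong match and one weak match.
sample = [
    "o157_0001\tk12_0001\t99.5\t200\t1\t0\t1\t200\t1\t200\t1e-120\t410",
    "o157_0002\tk12_0417\t35.0\t80\t52\t2\t5\t84\t10\t89\t0.5\t30.1",
]
query_ids = ["o157_0001", "o157_0002", "o157_0003"]

unique = proteins_without_hits(query_ids, sample)
print(unique)
```

Note that a protein can be missing from the results entirely (like the third ID here) or present only with weak matches (like the second). Both cases count as "not found in K-12" under these assumed thresholds. With a real results file, the open file handle can be passed in place of the sample list, so the whole file never has to be held in memory.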

References
Perna NT et al (2001) Genome sequence of enterohaemorrhagic Escherichia coli O157:H7. Nature 409:529-533.

Hayashi T et al (2001) Complete genome sequence of enterohemorrhagic Escherichia coli O157:H7 and genomic comparison with a laboratory strain K-12. DNA Research 8:11-22.