Biol 591 
Introduction to Bioinformatics
Scenarios
Fall 2003 
Comparison of genomes to look for genes responsible for pathogenesis

Scientific story (html)

In brief: You hit on the idea of understanding the basis of pathogenesis in the deadly E. coli O157:H7 by comparing its total complement of proteins with that of the nonpathogenic strain E. coli K12. Unfortunately, the comparison nets you a file bigger than anything you could go through in a year. How can you extract the useful information from the file and put it in a form a human could understand?
Bioinformatic tools
Blast
     Standard program to find similarities between sequences or sets of sequences.
Parsing program
     Scans output, looking for items of interest as you define them. Outputs them to a separate file.
Perl focus: Pattern matching and extraction of strings through regular expressions
        Suggested Reading: Beginning Perl (Simon Cozens), chapter 5, pp.147-162 (or if you like, 147-165)
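
To make the idea of a parsing program concrete, here is a minimal Perl sketch of regular-expression extraction. It is not the course's BlastParser.pl. The sample lines are invented and only loosely modeled on Blast text output, and the variable names (@lines, @hits, $query) are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample lines, loosely modeled on Blast text output.
# The protein names and numbers are invented for illustration.
my @lines = (
    'Query= protein_A',
    ' Score = 250 bits (638), Expect = 2e-66',
    'Query= protein_B',
    ' Score = 31.2 bits (69), Expect = 0.45',
);

my @hits;    # one record per Score line found
my $query;   # most recent query name seen

for my $line (@lines) {
    if ($line =~ /^Query=\s*(\S+)/) {
        # Parentheses capture the query name into $1
        $query = $1;
    }
    elsif ($line =~ /Score\s*=\s*([\d.]+)\s+bits.*Expect\s*=\s*(\S+)/) {
        # $1 is the bit score, $2 is the expect value
        push @hits, { query => $query, score => $1, expect => $2 };
    }
}

# Write the items of interest as tab-separated columns
printf "%s\t%s\t%s\n", $_->{query}, $_->{score}, $_->{expect} for @hits;
```

Each pattern describes one kind of line, and the parentheses mark the part to keep. A real parser would read lines from the Blast output file rather than from a list in the program, and would likely need patterns adjusted to the actual output.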

Notes and papers

Molecular biology (PDF) (Questions)
Blast/Parsing program (html) (Questions)
Regular expressions (html) (Questions)
Programs
Blast (obtainable from NCBI site - see instructions on how to download and run the program)
Most people run this program over the web. The point of interest for now is learning how to download the program so that you can tailor it to your own purposes.

Protein databases (obtainable from TIGR-CMR site - see instructions on how to download and which set to download).
Files containing all proteins deduced from completed DNA sequences of E. coli strains, used by Blast.

Parsing program:
     BlastParser.pl - takes output from Blast, extracts information
     BlastParserNative.pl - same as above, but more as a Perl programmer would write it
          71vsnps.txt - data the program was designed to handle
          EdVsK12s.txt - small portion of expected output from BlastAll
     matchtest.pl - allows quick tests of regular expressions
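
As an illustration of what a quick regular-expression test can look like, here is a small Perl sketch. It is not the course's matchtest.pl. The helper name try_match and the test string are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: try a pattern against a string and report
# the portion of the string that matched, if any.
sub try_match {
    my ($pattern, $text) = @_;
    if ($text =~ /$pattern/) {
        # $& holds the text matched by the last successful match
        return "matched '$&'";
    }
    return 'no match';
}

my $text = 'E. coli O157:H7';
print try_match('O\d+:H\d+', $text), "\n";   # serotype-like pattern
print try_match('K12', $text), "\n";         # absent from this string
```

Trying a pattern on a short string like this before using it on a large file makes it easier to see whether it matches too much, too little, or nothing at all.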

Problem Set - Molecular biology (PDF)
Problem Set - Programming (html)