Biol 591 
Introduction to Bioinformatics
Notes for Scenario 1: Parsing programs
Fall 2003 

I. Overview of problem
II. Figuring out what BlastParser produces
III. Modifying BlastParser

I. Overview of problem

It sounds so easy. Two closely related strains of E. coli, K12 and O175:H7, differ in their ability to form disease and the protein they encode. No doubt these two facts are related to each other. If we could identify which proteins the pathogenic strain has that the nonpathogenic strain lacks, we may understand the means by which E. coli causes disease.

How to determine the pathogen-specific protein? Also easy. You've downloaded a program (Blast) designed to compare proteins, set it up to compare a set of proteins from a pathogenic strain to a set of proteins from E. coli K12, and run the comparison.

If you have not yet done this, then do it now:
Why did you compare the pathogenic strain against a database of E. coli K12 protein? Why not the reverse? Well, what you want in the end is a list of protein found in the pathogenic strain and not in K12. The way BlastAll works is it takes each protein in a file (the protein is called the query sequence) and compares it against the database (called the subject). If a match is found, it is reported. If not, Blast says "No match found". If you use the set of protein from the pathogenic strain as the query sequences, then those protein not found in K12 will produce "No match found" messages.
SQ1. What will happen if you use the set of protein from K12 as the query sequences for comparison against a database of O157 protein? What kind of protein will produce "No match found" messages? If a protein is in O157 but not in K12, what kind of message will be produced?
So you now have the answer you seek... somewhere in a file of roughly 40 million characters. You could go through the file by hand, picking out the proteins you're looking for, but since you can't seem to find a free 10-year block anywhere in your schedule, you need another approach.

Still easy. You've been given a program, BlastParser, that does automatically just what you would do if you had the time: go through the output of Blast and pick out just the names of each protein used in the comparison and what the comparison found, in brief. Unfortunately, the program was written to work on comparisons of two different sets of protein, not from E. coli. You can't expect a free program to do exactly what you want, but it shouldn't be too difficult to modify it slightly to work on your E. coli comparisons. You hope.

[By the way, there are two equivalent versions of the program: BlastParser may be somewhat more comprehensible to most, while  BlastParserNative is written in more idiomatic Perl. Take your pick, or compare the two.]

II. Figuring out what BlastParser produces

For starters, why not try running the program? Well, that could be dangerous. I wouldn't run a program someone handed me without looking at it to see if it did anything I might regret later. But trust me on this one and download BlastParser to your favorite directory,  run perl blastparser.pl, and stand back!

[a moment of silence while you run the program]

... not as satisfying as one might have hoped, but not unexpected either.

Can't open 71vsnps.txt: No such file or directory
As I said, BlastParser was not written to work on your Blast comparison but on a different one. Perl said it couldn't open something called 71vsnps.txt. I'll bet you don't have that file in your directory. Now you see you should have asked for a file the program works on for testing purposes. OK, here it is, 71vsnps.txt. Put this file in your directory and run the program again.

This time you get a lot of stuff rapidly scrolling off your screen. Presumably, the program sent it also to a file. Check out the directory, looking for a newly created file. You'll find one called 71vsnps-out.txt. If you take a look at it, you'll see that it contains some bioinformaticky looking things, gene names and such. That's good. But you'll need to understand what the output means, and, most important, you'll need to get the program to work on YOUR file, i.e. the output of blasting one E. coli against the other.

We can look at the program... we WILL look at the program, but it might be easier to compare the input and the output to get an idea of what comes from what. So bring up 71vsnps.txt and 71vsnps-out.txt. Looking the two files over, I deduce the following:

Here's what the input file has:

Output from Blast
BLASTP 2.1.3 [Apr-1-2001]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs",  Nucleic Acids Res. 25:3389-3402.

Query= all0001 ID:6715 {-311 <--- 918} unknown protein
         (409 letters)

Database: Npunctiforme_protein
           7432 sequences; 2,397,140 total letters

>Contig647.revised.gene32.protein
          Length = 438

 Score =  256 bits (655), Expect = 3e-069
 Identities = 169/438 (38%), Positives = 235/438 (53%), Gaps = 45/438 (10%)

Query: 2   KFVRNIVILLSSLAAVLTVCENANAGKGSLNVGAK-IPPGQERQLLQYQLQNDGKGLHRI 60
           K VR   I  SSLA  LT C++A A    LN+G   I  G E + LQYQ+ N  + L+++
Sbjct: 13  KAVRFATIAFSSLAVGLTTCQSAFA---QLNIGRYGIQQGLESEYLQYQINN--QNLNQM 67

Query: 61  RVIPECSVGFALGCNKTGTILEQVIQSNGGPTYQQMLMRAAGGEDNYRKFSSFYGYNPN- 119
           RVIP C+VGF L CNKTG++++++++ NGG TYQ +L+RAAGGE N++ F+SFYG NPN
Sbjct: 68  RVIPGCNVGFGLTCNKTGSVIQRLVELNGGSTYQNLLIRAAGGEKNFQNFASFYGNNPNL 127

etc for many megabytes... [I cut it off at about 50 kbytes]

And here's what I see in 71vsps-out.txt:
Output from BlastParser working on the output from Blast

all0001, ID:6715 {-311 <--- 918} unknown protein, 409, 647.032, , 438, 3e-069
all0002, ID:2 {981 <--- 1718} unknown protein, 245, 647.030, , 213, 6e-060

etc for many bytes...

Putting the two together, I get the following interpretation (color coded):
  1. Query name: Some symbolic identification of a gene used to probe the database
  2. Query description: Something in plain English
  3. Query length: the length of the query sequence
  4. Subject name: symbolic identification of a gene within the database similar to the query
  5. Subject description: something in plain English [A description ought to be there, even though it's not]
  6. Subject length: the length of that protein
  7. E-value: The probability that a match of this quality might have arisen by chance
Items 4-6 are repeated if there is more than one match and omitted if there aren't any matches.
Does BlastParser do what you need it to do? Actually, it may do too much. You want to know those queries that produce no matches. You're not particularly interested in the Subject name, length, and E-value corresponding to the queries that do produce matches. On the other hand, perhaps you'll judge later that very weak matches are interesting as well. So you decide to follow two routes: (1) use BlastParser as is and scan the output for queries that produce no matches (or bring the output into Excel and search for such queries), and (2) modify BlastParser so as to output only those queries that produce no matches.
SQ2. Why are there two commas in a row in the output after 647.032?
SQ3. What, in plain English, does "3e-069" mean?
III. Modifying BlastParser
The time has come to make the program work on the file you generated from Blast. So, it will be necessary to go into BlastParser and modify it to look for your file instead of the one it was built for.  Let's do it.

Use Notepad to examine BlastParser and try to find the section of code that identifies the name of the file to be parsed. Happily, the program very kindly put neon lights around the pertinent section, and right below the comment

#### OPEN THE BLAST FILE
there's a very suspicious two lines of code:
my $blastPath = "71vsnps.txt";
open BLAST, "<$blastPath" or die "Can't open $blastPath: $!\n";
Obviously, that's where the error message you got came from.
SQ4. Change my $blastPath so that it is assigned the correct file name, i.e. the one created when YOU ran Blast.
Now run the program again. It might take several seconds before you start seeing any output, but when you do...
...that's more like it! But wait... Comparing the output with the file produced by BlastAll (do the comparison), it looks like I'm getting the query name, the query description, and the query length, but nothing for the subject. Could it be that there are NO matches between the two sets of protein? I doubt it.

More likely, the program runs fine with the other Blast output, but there's something perhaps subtly wrong that we need to fix before it can run with OUR Blast output. Perhaps you were afraid of that.

No time like the present. Let's try to get a minimal idea how this program works and what might need fixing. First, go to the Main Program section.

$line = <BLAST>;                                 # Read first line
while (defined $line) {                          # Only if not end of file
   if ($line =~ /^Query=/) {                     # If 'Query' matches beginning of line...
      print_previous_query();
      start_new_query();
   }
   if ($line =~ /^>Contig/) { record_subject() } ...
There's much here that's mysterious, but so long as I don't get dragged down by that, there's also much here that I recognize. For example, /^Query=/. Never mind the / and ^, I recognize Query=. That appears in the output whenever query information is presented. And I recognize >Contig. That appears in the output -- the output that BlastParser was designed for -- and is followed by subject information. BUT >Contig DOESN'T APPEAR IN YOUR BLAST OUTPUT! That's clearly a problem. You need to figure out what signals the presence of subject information in your output and modify the program accordingly.
SQ5. Have any idea how to change the program so that it recognizes the beginning of the subject line?
Even if you make an appropriate change, you're going to find that the program STILL doesn't work. Why? You'll have to look at the parts of the program that actually do the work: the routines called start_new_query and record_match. There's a lot mysterious here, but with surprisingly little explanation, you can make a good deal of sense out of this code. You need to know about the following areas: The next set of notes will take you through these Perl issues.