Introduction to Bioinformatics: Parsing programs

Biol 591

Introduction to Bioinformatics
Notes for Scenario 2: Parsing programs

Fall 2002

I. Overview of problem
II. Figuring out what BlastParser produces
III. Modifying BlastParser
IV. Regular expressions
V. Capturing matched patterns
V. Arrays

Suggested Reading: Beginning Perl for Bioinformatics, pages 334-339 and 50-55.

I. Overview of problem

It sounds so easy. Two closely related strains of E. coli, K12 and O175:H7, differ in their ability to form disease and the protein they encode. No doubt these two facts are related to each other. If we could identify which proteins the pathogenic strain has that the nonpathogenic strain lacks, we may understand the means by which E. coli causes disease.

How to determine the pathogen-specific protein? Also easy. You've downloaded a program (Blast) designed to compare proteins, set it up to compare a set of proteins from a pathogenic strain to a set of proteins from E. coli K12, and run the comparison.

If you have not yet done this, then do it now:

Instructions on how to download Blast

List of course participants and which pathogenic strain each should consider

Instructions on how to download the set of proteins encoded by E. coli K12 and a pathogenic strain

Instructions on how to set up Blast and to run it

Why did you compare the pathogenic strain against a database of E. coli K12 protein? Why not the reverse? Well, what you want in the end is a list of protein found in the pathogenic strain and not in K12. The way BlastAll works is it takes each protein in a file (the protein is called the query sequence) and compares it against the database (called the subject). If a match is found, it is reported. If not, Blast says "No match found". If you use the set of protein from the pathogenic strain as the query sequences, then those protein not found in K12 will produce "No match found" messages.

SQ1. What will happen if you use the set of protein from K12 as the query sequences for comparison against a database of O157 protein? What kind of protein will produce "No match found" messages? If a protein is in O157 but not in K12, what kind of message will be produced?

So you now have the answer you seek... somewhere in a file of roughly 40 million characters. You could go through the file by hand, picking out the proteins you're looking for, but since you can't seem to find a free 10-year block anywhere in your schedule, you need another approach.

Still easy. You've been given a program, BlastParser, that does automatically just what you would do if you had the time: go through the output of Blast and pick out just the names of each protein used in the comparison and what the comparison found, in brief. Unfortunately, the program was written to work on comparisons of two different sets of protein, not from E. coli. You can't expect a free program to do exactly what you want, but it shouldn't be too difficult to modify it slightly to work on your E. coli comparisons. You hope.

(Those of you familar with programming should start with a more challenging version, BlastParser2, that will turn out to need an additional modification before it works with the E. coli data.)

II. Figuring out what BlastParser produces

The first step is examine what BlastParser produces from the Blast output it's given. If it extracts the kind of information you're looking for, then your job will be much easier. Notice I haven't said anything about understanding how the program works. That comes later.

Below is a comparison of the output of Blast BlastParser was designed to deal with, followed by what BlastParser did with it. Notice that it picks out the following pieces of information:

Query name: Some symbolic identification of a gene used to probe the database

Query description: Something in plain English

Query length: the length of the query sequence

Subject name: symbolic identification of a gene within the database similar to the query

Subject description: something in plain English (note that example below lacks a descriptor)

Subject length: the length of that protein

E-value: The probability that a match of this quality might have arisen by chance

Items 3-5 are repeated if there is more than one match and omitted if there aren't any matches.

Output from Blast
BLASTP 2.1.3 [Apr-1-2001]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.

Query= all0001 ID:6715 {-311 <--- 918} unknown protein
         (409 letters)

Database: Npunctiforme_protein
           7432 sequences; 2,397,140 total letters

>Contig647.revised.gene32.protein
          Length = 438

Score = 256 bits (655), Expect = 3e-069
Identities = 169/438 (38%), Positives = 235/438 (53%), Gaps = 45/438 (10%)

Query: 2   KFVRNIVILLSSLAAVLTVCENANAGKGSLNVGAK-IPPGQERQLLQYQLQNDGKGLHRI 60
           K VR   I SSLA LT C++A A    LN+G   I G E + LQYQ+ N + L+++
Sbjct: 13 KAVRFATIAFSSLAVGLTTCQSAFA---QLNIGRYGIQQGLESEYLQYQINN--QNLNQM 67

Query: 61 RVIPECSVGFALGCNKTGTILEQVIQSNGGPTYQQMLMRAAGGEDNYRKFSSFYGYNPN- 119
           RVIP C+VGF L CNKTG++++++++ NGG TYQ +L+RAAGGE N++ F+SFYG NPN
Sbjct: 68 RVIPGCNVGFGLTCNKTGSVIQRLVELNGGSTYQNLLIRAAGGEKNFQNFASFYGNNPNL 127

etc for many megabytes...

Output from BlastParser working on the output from Blast

all0001, ID:6715 {-311 <--- 918} unknown protein, 409, 647.032, , 438, 3e-069
all0002, ID:2 {981 <--- 1718} unknown protein, 245, 647.030, , 213, 6e-060

etc for many bytes...

Does BlastParser do what you need it to do? Actually, it may do too much. You want to know those queries that produce no matches. You're not particularly interested in the Subject name, length, and E-value corresponding to the queries that do produce matches. On the other hand, perhaps you'll judge later that very weak matches are interesting as well. So you decide to follow two routes: (1) use BlastParser as is and scan the output for queries that produce no matches (or bring the output into Excel and search for such queries), and (2) modify BlastParser so as to output only those queries that produce no matches.

SQ2. Why are there two commas in a row in the output after 647.032?
SQ3. What, in plain English, does "3e-069" mean?

III. Modifying BlastParser

OK, the time has come to run the program. Download BlastParser to your computer, get into a DOS window and into the appropriate directory, run perl blastparser.pl, and stand back!

[a moment of silence while you run the program]

... not as satisfying as one might have hoped, but not unexpected either. As I said, BlastParser was not written to work on your Blast comparison but on a different one. Perl said it couldn't open something called 71vsnps.txt. I'll bet you don't have that file in your directory, and you can't expect Perl to know the name you gave to the file you generated from Blast. So, it will be necessary to go into BlastParser and modify it to look for your file instead of the one it was built for. Let's do it.

Use Notepad to examine BlastParser and try to find the section of code that identifies the name of the file to be parsed. Happily, the programmer (Fritz) very kindly put neon lights around the pertinent section, and right below the comment

#### Open the BLAST file

there's a very suspicious two lines of code:

my $blastPath = "71vsnps.txt";
open BLAST, "<$blastPath" or die "Can't open $blastPath: $!\n";

Obviously, that's where the error message you got came from.

SQ4. Change my $blastPath so that it is assigned the correct file name, i.e. the one created when YOU ran Blast.

Now run the program again. It might take several seconds before you start seeing any output, but when you do...
...that's more like it! But wait... Comparing the output with the file produced by BlastAll (do the comparison), it looks like I'm getting the query name, the query description, and the query length, but nothing for the subject. Could it be that there are NO matches between the two sets of protein? I doubt it.

More likely, the program runs fine with the other Blast output, but there's something perhaps subtly wrong that we need to fix before it can run with OUR Blast output. Perhaps you were afraid of that.

No time like the present. Let's try to get a minimal idea how this program works and what might need fixing. First, go to the Main Program section.

my @query_info;
while (<BLAST>) {
   if (/^Query=/) {
      print_previous_query();
      start_new_query();
   }
   if (/^>Contig/) { record_match() }

There's much here that's mysterious, but so long as I don't get dragged down by that, there's also much here that I recognize. For example, /^Query=/. Never mind the / and ^, I recognize Query=. That appears in the output whenever query information is presented. And I recognize >Contig. That appears in the output -- the output that BlastParser was designed for -- and is followed by subject information. BUT >Contig DOESN'T APPEAR IN YOUR BLAST OUTPUT! That's clearly a problem. You need to figure out what signals the presence of subject information in your output and modify the program accordingly.

SQ5. Have any idea how to change the program so that it recognizes the beginning of the subject line?

Even if you make an appropriate change, you're going to find that the program STILL doesn't work. Why? You'll have to look at the parts of the program that actually do the work: the routines called start_new_query and record_match. There's a lot mysterious here, but with surprisingly little explanation, you can make a good deal of sense out of this code. You need to know about the following areas:

Regular expressions: the code between two / as in my ($match_length) = /Length = (\d+)/;
Capturing matched patterns: as in my $match_description = $3;
Arrays: as in @query_info = ($query, $query_description, $query_length);

IV. Regular expressions

Relevant section of Beginning Perl for Bioinformatics: pages 334-339

IV.1 Character classes

We've seen how to search for a word like /Roosevelt/, or a pattern with some don't-care positions like /TT.T/. Sometimes we do care, but can't narrow things down to a single character. To match a digit, for instance, we use \d. The pattern /X\d/, for instance, matches X0, X1, X2, X3, X4, X5, X6, X7, X8, or X9.

Other character classes are \w for a "word" character -- either a letter, a digit, or the underscore ( _ ); and \W for a character which is not a word character.

The \s character class matches any space character. That includes the ordinary blank between words; it also include tab characters (which we write in Perl as \t ); and newlines, the character which causes the computer to start printing on the next line. We write newlines in Perl as \n .

The \S character class matches a non-space character.

Finally, the caret ( ^ ) matches the beginning of a line.

IV.2 Repetitions

We've also seen how to search for a certain number of repetitions of the same thing: .{3} means three don't-care positions, and .{20,24} means between 20 and 24 don't-care positions.

SQ6: What does /A{12,15}/ match?

Sometimes, though, we don't know the limits. We humans have no trouble recognizing numbers like 1 and 42 and 1492. We could use /\d{1,4}/ to match each of those numbers -- but that pattern doesn't match 40000000. To match any number of digits in a row we add a + after \d , like this: /\d+/ .

SQ7: What does /\d+\s+\d+/ match?

V. Capturing matched patterns

Often we want to print or save part of the pattern we've matched. For instance, we match something like /^Section \d+/ -- the word "Section" at the beginning of the line followed by a space and a number -- and want to save the number.

To do that we enclose the number in parentheses -- /^Section (\d+)/ -- and the part in parentheses is saved (temporarily) in the special variable $1.

For more permanent storage we can assign $1 to another variable:

/^Section (\d+)/;
my $section_number = $1;

or more directly

my ($section_number) = /^Section (\d+)/;

To capture the section and subsection numbers from something like Section 12-20 we can use two sets of parentheses: /^Section (\d+)-(\d+)/ and store it in two variables either by using $1 and $2:

/^Section (\d+)-(\d+)/;
my $section_number = $1;
my $subsection_number = $1;

or the more direct method

my ($section_number, $subsection_number) = /^Section (\d+)-(\d+)/;

VI. Arrays

Relevant section of the book: pages 50-55.

In Perl, a variable which begins with a dollar sign (like $x ), can hold a number like 42 or text like "four-score and seven years ago".

Perl also lets us talk about lists of values, like

(1776, "four score and seven", 1863)

and has variables (called arrays) to hold such lists. They begin with an at sign ( @ ). For instance,

@a = (1776, "four score and seven", 1863);

We turn an array into text with the join function. For instance, join(" ", @a) is

"1776 four score and seven 1863"

and join(":", @a) is

"1776:four score and seven:1863"

We can add one or more values to an array with the push command. For instance, after

push @a, "Lincoln", "Gettysburg";

@a would hold

(1776, "four score and seven", 1863, "Lincoln", "Gettysburg")

and join(":", @a) would be

"1776:four score and seven:1863:Lincoln:Gettysburg"

SQ8: What do the following program lines print?

@b = ("one", "two", "four");
push @b, "twenty-three";
print join(":", @b);