Regular Expressions

Biol 591

Introduction to Bioinformatics
Notes for Scenario 1: Regular expressions

Fall 2003

I. Regular expressions
II. Capturing matched patterns
III. Arrays (we'll wait for later on this)

Suggested Reading: Beginning Perl (Simon Cozens), chapter 5, pp.147-162 (or if you like, 147-165)

I. Regular expressions

I.A. Strategy for solving problem

Your problem is that you're faced with megabytes of output from Blast, most of which is garbage. You want to automatically sift through it to pick out the useful information, i.e. what sequence from one strain matches what sequence from another.

STEP 1: If you had infinite time and patience, how would you do this by hand?

If you can figure out how you would do it, then you can teach the computer to do it, only faster. Here's some output from Blast (taken again from 71vsnps.txt:

Output from Blast

BLASTP 2.1.3 [Apr-1-2001]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.
Query= all0001 ID:6715 {-311 <--- 918} unknown protein
         (409 letters)
Database: Npunctiforme_protein
           7432 sequences; 2,397,140 total letters
>Contig647.revised.gene32.protein
          Length = 438
Score = 256 bits (655), Expect = 3e-069
Identities = 169/438 (38%), Positives = 235/438 (53%), Gaps = 45/438 (10%)
Query: 2   KFVRNIVILLSSLAAVLTVCENANAGKGSLNVGAK-IPPGQERQLLQYQLQNDGKGLHRI 60
           K VR   I SSLA LT C++A A    LN+G   I G E + LQYQ+ N + L+++
Sbjct: 13 KAVRFATIAFSSLAVGLTTCQSAFA---QLNIGRYGIQQGLESEYLQYQINN--QNLNQM 67

OK, how would you identify the lines bearing the names of the sequences used to probe the database for matches? Even if you've never used Blast before (a situation that should be remedied very soon), there aren't too many possibilities in the output shown above. In fact, there are only two real possibilities:

Query= all0001 ID:6715 {-311 <--- 918} unknown protein

and

>Contig647.revised.gene32.protein

As it turns out, the first identifies the name of the sequence used to probe the database and the second identifies the name of the sequence that was found in the database. Now that you know how to do it, how could you explain it to a computer? How about: Look for a line beginning 'Query'. That will contain the name. Unfortunately, this strategy doesn't quite work, since the following line fits the description but certainly does not contain the name of a sequence:

Query: 2 KFVRNIVILLSSLAAVLTVCENANAGKGSLNVGAK-IPPGQERQLLQYQLQNDGKGLHRI 60

You can exclude this line by adding ... and 'Query' must be followed by '='. That might do it.

SQ6. What strategy would you devise to pick out a line containing the number of letters within a query?

I.B. Simple regular expressions, anchors, and character sets

Perl is at its most powerful in finding patterns in character strings. You can capture what you want in Perl with the statement:

$line =~ /Query=/

which might read Do the contents of the variable $line contain 'Query=' ? The text between / and / is called a regular expression. Don't mistake the Perl =~ operator for an equal sign. Cross that thought out of your mind. It has nothing to do with equal sign. It merely identifies the string of characters within which Perl is asked to look for a match.

Suppose you were worried that somewhere in the huge file you generated is a line something like:

>Eco0029 Unidentified gene found when Query=Sty0047

Now, your strategy fails again, because there it is, Query=, in an unwanted line. You can refine your instructions by specifying that the word lies at the beginning of the line, which it will always do in the cases you're interested in. Perl uses ^ to specify "beginning of line" (and $ to specify "end of line"). So your instruction now becomes:

$line =~ /^Query=/

You may still be concerned about getting unwanted matches with this limited instruction and notice that all bona fide query lines have the form

Query= aaadddd ID:dddd ...

where "a" represents a letter and "d" represents a digit. Perl allows you search for lines of this form:

$line =~ /^Query= ....... ID:..../

Now, understanding . to mean "any character", the line reads Do the contents of the variable $line contain 'Query=' followed by a space, then 7 characters, then another space, then "ID:", and then 4 characters? You might object that my translation into Perl didn't quite capture the pattern, since 7 characters could be "all0001" or "0001all" or "@#$%^&*" or even " ". That's my fault. Perl is certainly capable of representing the more complicated pattern:

$line =~ /^Query= [a-z]{3}[0-9]{4} ID:[0-9]{4}/

which reads Do the contents of the variable $line contain 'Query=' followed by 3 instances of some character between a and z, then 4 instances of some character between 0 and 9, then a space, then "ID:", and then 4 instances of some characters between 0 and 9?

This is beginning to get complicated, but you can see that with a few more tools, you can match just about any pattern you want. Here's a summary of what we have so far:

/ text /	/ marks the boundaries of the regular expression
^	ANCHOR: Beginning of line
$	ANCHOR: End of line
.	Any character
[abcd]	SET: Set of specified characters
[^abc]	SET: Set of excluded characters
[a-b]	SET: Set defined by first and last characters
{d}	REPETITIONS: Previous character repeated d times
{d,e}	REPETITIONS: Previous character repeated from d to e times

Before getting to more tools, you need to have a way of trying them out. It's no use reading about them. You'll learn by doing. One way is to write (or copy and paste) a quick and dirty program and run it:

#!/usr/bin/perl -w
use strict;
my $line = " put text here " ;
if ($line =~ / put regular expression here /) {print "FOUND"} else {print "NOT FOUND"};

Or you can download and run matchtest.pl, a program that solicits the line of text and the regular expression and compares the two. Either way, try lots of tests to make sure you understand how to use regular expressions as described thus far. If you need some inspiration, here are some ideas:

SQ7. Devise regular expressions that could:

Determine whether text contains a social security number (form: ddd-dd-dddd)
Determine whether text contains a date (form dd/dd/dd)
Determine whether a line of Blast output (see above) contains the fraction of identities
Determine whether the following nucleotide (GGTGCTAAATGGTCCTCTAGC) sequence contains a site recognized by the restriction enzyme AvaII (which recognizes sequences of the form GG[AT]CC, where [AT] means either A or T)

I.B. Indefinite repetition and built-in character sets

How far can we get understanding BlastParser using what we now know? Here are some lines from the Main Program that might now be less mysterious:

if ($line =~ /^Query=/) { # If 'Query' matches beginning of line...
if ($line =~ /^>Contig/) { record_subject() } # If '>Contig' matches beginning of line...

OK, but what about these two later on in the program:

($query_name, $query_description) = ($line =~ /^Query=\s+(\w+)\W+(.+)/);
($query_length) = ($line =~ /(\d+) letters/);

The first regular expression is intended to match the line carrying the query name and query description and the second is intended to match the line carrying the query length. These lines are:

Query= all0001 ID:6715 {-311 <--- 918} unknown protein
(409 letters)

Well, there are no \ or + signs in sight. How does the match work? \ is a special character in Perl regular expressions, indicating that the next character has a special meaning. So \d is read together to mean a digit; \d is equivalent to [0-9]. Try it! Use either matchtest or the quick program you copied to determine whether \d and [0-9] pick out the same patterns.

There are three other characters that follow \ in the example lines above. \s means a white space character. You might think that \s is equivalent to " ", but Perl extends the meaning to include tab and return, i.e. anything that adds space without printing. \w means any character commonly appearing in a word, which Perl defines as upper case letters, lower case letters, digits, or underscore ("_"). Curiously, the regular expression matching query name and description uses both \w and \W. Does case matter? Try it!

SQ8. Which of the following regular expressions will yield a match?

      TEXT: Four has four letters.
      EXPRESSION: \w\w\w\w
or   EXPRESSION: \W\W\W\W

In general any special character (preceded by \) that is capitalized has exactly the opposite meaning of the same special character in lower case. So \S means any character that is NOT a space, tab, or new line.

How is it that \w+ seems intended to match "all0001"? You can see that \w{7} would work, since it specifies a pattern of seven word-characters in a row. What if you didn't know how long the query name was? You might try \w{1-100}, which specifies a pattern of anywhere from 1 to 100 word-characters in a row, but even that is restrictive: how do you know the name is no more than 100 characters long? Perl supplies a simple solution for this and similar situations with the three special characters: +, *, and ?. The + symbol is applied to the previous character (or group of characters) and specifies that there may be anywhere from 1 to an infinite number of repetitions. The * symbol is the same but allows even 0 repetitions. The ? symbol requires that the previous character appear exactly 0 or 1 time.

In summary:

\d	CHAR CLASS: Any digit (= `[0-9]`)
\D	CHAR CLASS: Any nondigit (= `[^0-9]`)
\s	CHAR CLASS: Any white space character (= `[ \t\n]`)
\S	CHAR CLASS: Any non-white space character (= `[^ \t\n]`)
\w	CHAR CLASS: Any word-character (= `[a-zA-Z0-9]`)
\W	CHAR CLASS: Any non-word-character (= `[^a-zA-Z0-9]`)
\t	SPEC CHAR: tab
\n	SPEC CHAR: new line (LF)
\.	SPEC CHAR: period (.), because . itself has a special meaning
?	REPETITIONS: 0 or 1 time (= `{0-1}`)
*	REPETITIONS: any number of times (= `{0-`infinity`}`)
+	REPETITIONS: at least one time (= `{1-`infinity`}`)

SQ9. Determine which (if either) of the two regular expressions below match the text (and why).

    a. TEXT: Query: 2   KFVRNIVILLSSLAAVLTVCENANAGKGSLNVGAK-IPPGQERQLLQYQLQNDGKGLHRI 60
        EXPRESSION: \w+: \d \s+\w+ \d{2}
    b. TEXT: Query= all0001 ID:6715 {-311 <--- 918} unknown protein
    EXPRESSION: .+<--.+

II. Capturing matched patterns

In BlastParser, it isn't enough to find lines with the information you want. The program must also extract that information. Perl regular expressions allows you to do this by grouping the information within parentheses. The first line of the pair:

if ($line =~ /^>Contig(\d+)\.revised\.gene(\d+)\.protein\W+(.*)/) {
$subject_name = sprintf "%03d.%03d", $1, $2;

not only finds a line that begins with ">Contig" followed by at least one digit, followed by...

SQ10. Translate the first line above into English.

it also captures three pieces of information, contained within three pairs of parentheses. Each captured piece of information is assigned to a built-in variable: $1, $2, $3,... (as many as are needed). These variables can be used by subsequent statements or even within the same statement that created it. The second line above uses these variables (never mind sprintf for the moment -- it simply formats the variables prior to assignment to $subject_name). For example:

$line = ">Contig647.revised.gene32.protein";
if ($line =~ /^>Contig(\d+)\.revised\.gene(\d+)\.protein\W+(.*)/) {
$subject_name = sprintf "%03d.%03d", $1, $2;
print $subject_name, "\t", $1, "\t", $2;
}

would print:

647.32 647 32

Note that $1 is given the value 647, which is just the information within the first set of parentheses, since \d+ captures all the digits after "Contig".

SQ11. Determine what is the output of the following program scrap (and why).

my $line;
$line = "Query: 2 KFVRNIVILLSSLAAVLTVCENANAGKGSLNVGAK-IPPGQERQLLQYQLQNDGKGLHRI 60";
$line =~ /^Query:\s+(\d+)\s+(\S+)\s+(\d+)/;
print $1, "-", $3, ":\t", $2;