Introduction to Bioinformatics: Parsing Questions

Biol 591

Introduction to Bioinformatics
Programming Problem Set 2

Fall 2002

I. Arrays

P2P.1. Run the following program:

#!/usr/bin/perl -w
use strict;
# Asks for input of three words, printing each as it's entered
my ($word_count, $word);
for ( $word_count= 1; $word_count < 4; ++$word_count) {
    # $word_count begins at 1
                    # So long as word_count is less than 4, do what follows
                                     # Increment $word_count by 1 each iteration
$word = <STDIN>;          # STDIN = Standard input = keyboard
chomp $word;
print $word, "\n";
}

Change the program so that it collects all three words and prints them out at once, separated by commas. Do this by:

a. Add the following line just before for ($word_count, $word);
      my @word_list;
b. Replace the line print $word, "\n";with a line beginning:
      push my @word_list
Remember to put a variable after @word_list (as in BlastParser).
    Remember to end the statement with a semicolon.
c. Add to the end of the program a line beginning:
      my $result_line = join(
Use an analogous line in BlastParser as a model.
c. Add a final line beginning:
      print

Be sure you understand the reason behind each of the changes.

2P.2. Most of the lines that BlastParser prints sumarize matches between the pathogenic O175:H7 and non-pathogenic K12 sets of proteins. The interesting queries, though, are the ones that don't find a match. How can you list only those? The following feature of Perl may prove useful in answering this question:

It's often useful to know the size of an array. One way to do this is an assignment command:

my @a = ("red", "green", "blue");
my $size = @a
print $size, "\n";
will print 3. Here's another example:
my $size = @a;
if ($size == 0) {
   print "This array contains no items\n";
}
if ($size == 1) {
   print "This array contains one item\n";
}
if ($size > 1) {
   print "This array contains multiple items\n";
}
(See Beginning Perl for Bioinformatics, page 340.)
P2P.2a. The size of which array in BlastParser is of interest in picking out only those queries with no matches? What size will that array have when there are zero matches? When there is one match?
P2P.2b. Which section of the program would you change to test for the size of the array? (For instance, the part labeled MAIN PROGRAM? The section following sub print_previous_query? The section following
sub start_new_query? The section following sub record_subject? Somewhere else?)
P2P.2c Make the change and run the program.

II. Regular expressions

P2P.3. Match the following regular expressions with the value they will produce when they act on the line of text:

LOCUS BACLEFB 3291 bp DNA linear BCT 12-OCT-1995

Regular expression Value of $part or @part

1. (my $part) = /.+(\d+) bp/; a.

2. (my @part) = /.+(\d+)-(\w+)-(\d+)/ b. 1

3. (my $part) = /\w+(\W+)/; c. 3291

4. (my $part) = /.+\d+(.+)\s+/; d. 2, Oct, 1995

5. (my $part) = /^DNA\s+(W+)/; e. bp DNA linear BCT

Hint: Use a program (click here) to check each expression.

P2P.4. The query descriptions that BlastParser prints are so long as to make the output ugly. And the last part of the descriptions are useless -- they all say "Escherichia coli..." We'd like to modify the program so that "Escherichia coli" is suppressed. The following feature of Perl may prove useful in answering this question:

Sometimes we don't simply want to search for a pattern in a line; we want to change the line. For instance, a command to transcribe DNA to RNA by changing every T to a U is

$RNA =~ s/T/U/g;
(see pages 40-41 of Beginning Perl for Bioinformatics.)
Sometimes, though, we want to get rid of what's matched. To eliminate all T's we could use

$RNA =~ s/T//g;

(Not that this makes any biological sense!) We can also eliminate entire words.
For instance, if $petholds"My Siamese cat", then we can use
$pet =~ s/Siamese//g;
to produce "My cat". Notice that there is an extra space in "My cat". We can compress multiple spaces to single spaces with s/\s+/ /, so
$pet =~ s/Siamese//g;
$pet =~ s/\s+//g;
would produce "My cat".
P2P.4a. Which section of the program would you modify to get rid of "Escherichia coli"?
P2P.4b. Which variable would you put on the left of the =~operator?
P2P.4c. Make the change.
P2P.4d. How could you eliminate both "Escherichia coli" and everything after it in the description?

P2P.5. Remember the four warning messages that occurred when installing the E.coli K12 database using FormatDB and running BlastAll? I (JE) went into the text file, and sure enough, the protein sequences were missing. This did not seem right, so I contacted TIGR/CMR asking what was going on. ... In brief, it turns out that these four proteins use a very rare amino acid (selenocysteine) that is not part of the conventional 20. TIGR's program choked on the sequences.

What could make a program choke? I decided to download one of the offending sequences and take a closer look at it. I chose fdnG, encoding formate dehydrogenase, alpha subunit. How is its amino acid sequence different from other amino acid sequences, in such a way that a program used to normal amino acid sequences would have a problem?

I scanned the amino acid sequence but learned nothing from my effort. Maybe I just missed the strange part. So, I wrote a quick program, and there it was. There WHAT was? Well, you can find out by:

P2P.5a. Download the sequence as a GenBank file (click here for instructions).
P2P.5b. Extract the amino acid sequence from the GenBank file to create a file in FastA format. You can do this by hand in Word (or similar). Or see P2P.6 below. You should end up with a file that looks like this:
>NP_415991: formate dehydrogenase-N, nitrate-inducible, alpha subunit
mkkvvtvcpycasgckinlvvdngkivraeaaqgktnqgtlclkgyygwdfindtqiltp
etc
P2P.5c. Write (or actually 90% steal) a program that reads in the FastA file you created and searches for a letter that is NOT amongst the conventional 20 amino acids, i.e. [acdefghiklmnpqrstvwy]. There are more elegant ways of doing this, but you might just eyeball the list and see that the only letters left are [bjouxz]. So the program looks for lines in which there is no match with [bjouxz]. The program should print out the nonconventional symbol used to represent selenocysteine..

P2P.6. Draw on BlastParser to help you write a program that will take as input any protein file in GenBank format and spit out an amino acid sequence in FastA format, as shown in P2P.5b.

P2P.7. Perl is not merely useful in bioinformatics. You can use it for a variety of useful purposes! (Will it dice those hard to cut onions?... never mind). Write a quick program to take a list of names (click here) and reverse last names with first names. The next step would be to alphabetize them, but for the moment you can do that in Word or Excel if you like.

Regular expression	Value of $part or @part
`1. (my $part) = /.+(\d+) bp/;`	`a.`
`2. (my @part) = /.+(\d+)-(\w+)-(\d+)/`	`b. 1`
`3. (my $part) = /\w+(\W+)/;`	`c. 3291`
`4. (my $part) = /.+\d+(.+)\s+/;`	`d. 2, Oct, 1995`
`5. (my $part) = /^DNA\s+(W+)/;`	`e. bp DNA linear BCT`