Problem Set 6: Microarray Analysis

Biol 591

Introduction to Bioinformatics
Problem Set 6 - Microarray Analysis (part 1)

Fall 2002

PS6.1. Write a program that will take the training data used by Golub et al and find a set of 50 genes that best predicts the distinction between acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML)... OK, that might be a bit much. Let's break it down:

PS6.1a. Starting with the template found by clicking here, fill in the MAIN PROGRAM portion as best you can at the moment. It should consist of subroutine calls with descriptive names so that the logic of your program is apparent to anyone reading this section. There will probably be a loop of some sort. Save the resulting protoprogram as predict.pl.
Hint 1: Outline the steps of what needs to be done to translate each line in the training set into a single number representing the correlation between the expression levels of that gene for each patient and the classification of those patients as suffering from ALL or AML. DON'T try to put any equation in the MAIN PROGRAM. Leave that for some future subroutine, which you call from the MAIN PROGRAM.
Hint 2: It's not likely that you're going to be able to come up with the ultimate MAIN PROGRAM immediately from scratch. Still, get something down, realizing that you'll probably change it before the program is through. Don't get stuck at this point. Jotting down a few feeble lines and moving on is better than staring miserably at a blank screen. Your thoughts will no doubt evolve as you go through the problem.
PS6.1b. Go through the MAIN PROGRAM you just wrote and note every variable used, both scalar variables ($...) and array variables (@...). Declare each of these variables with a my statement in the VARIABLES section of the program.
PS6.1c. Save and run the program.
- If you get an error message like:
Global symbol "$mean_ALL" requires explicit package name at predict.pl line 25.
then you forgot to declare the indicated variable in a my statement. Fix it.
- If you get an error message like:
syntax error at predict.pl line 11, near "$mean_ALL               # mean expression of gene in ALL patients      my"
then you forgot to put in a semicolon.
- If you get an error message like:
Bareword "std_dev_AML" not allowed while "strict subs" in use at predict.pl line 29.
then you forgot to put a $ or @ before the name of a variable.
- If you get an error message like:
Undefined subroutine &main::Sort_Correlations called at predict.pl line 37.
then you need to make a fake subroutine to keep Perl happy. For example:
Sort_Correlations ();
...
sub Sort_Correlations {
}
This empty subroutine is called a stub. You'll fill in later how to sort the correlations.
- If you get an error message like:
readline() on unopened filehandle TRAINING_SET at predict.pl line 27.
then you're actually doing fine. Perl is complaining that you haven't defined the input file. That comes later.
PS6.1d. Once you've gotten rid of any syntax errors in your proto-program, fill in the FILES section. Do this by stealing the appropriate lines from any one of a number of programs you've encountered. Of course you'll want to change the name of the file (BlastParser.pl is a good example).
You'll need the training set if you don't have it already. To get it, go to the web site given in Golub et al (footnote 14), click on datasets, and scroll way down to the section entitled Molecular Classification of Cancer.... You'll find there a downloadable file called data_set_ALL_AML_train.txt. Download it.
Save and run the program. If you get no error messages, you win. (Of course the program doesn't DO anything)
PS6.1e. Now we have to teach the program how to do the tasks outlined in the MAIN PROGRAM. Start with reading in the data (since we can't do anything else without the data).
WARNING: Reading things in (input) and printing them out (output) are generally the most complicated parts of a program, and this program is no exception. Don't get ground down by this section. If you feel like you are at the last extremity, then click here for instructions on how to bypass the need for writing any code to parse the input file.
Somewhere in your MAIN PROGRAM you had a step that said something like: Read_Line_From_Training_File, or Parse_Line_From_Training_File(). Now's the time to write that subroutine. For now, let's just teach it to read in one line and then immediately print it out without any processing.
Hint: Perl has the idiosyncracy that referring to the file within a while statement causes a line to be automatically read in from the file. For example:
while <TRAINING_SET> {
causes one line to be read in from the file defined in an open statement as TRAINING_SET into the special Perl variable called $_ (I really wish Perl did not try to be so helpful in invisible ways. It makes it difficult to see where things happen in a program). This automatic reading from the file occurs ONLY in while statements. Otherwise, you have to explicitely cause Perl to read from a file, like:
$input_line = <TRAINING_SET>;

Run the program... if nothing happens, that's good, but...
PS6.1.f. ...how do you know the input worked? Test by adding a print statement within the subroutine. The result should be that the program reads in every line and prints out every line. That's a lot of printing. Remember you can stop execution of the program at any time by pressing <ctrl-c> or <ctrl-Break>. Or better, run it with through the | more pipe.
Hints:
- Do you see each set of data running together with the next? Then you forgot to put "\n" at the end of the print statement to move to the next line.
- Do you see each set of data separated by a BLANK line? How did that get there? You probably forgot to put in the almost obligatory chomp; statement after every read from the input file to get rid of the return/linefeed that comes along with the input line.
-Do you see each set of data separated from the next, with no BLANK lines? Maybe you did everything right OR maybe you have the two above errors, which cancel each other out!
PS6.1g. Now we need to teach the program how to extract the different pieces of information from each line. First, we have to figure out how WE would do it if we saw the same information that that the program sees. To see what that is, get into MS Word, read in the training file, and click on the Show/Hide button (looks like a paragraph symbol, ¶). Now you should see how items are separated and how lines are separated. The arrow symbol represents a tab ("\t" in Perl language) and the paragraph symbol represents a return/line feed ("\n" in Perl language). So now you know how you could tell where one item stops and the next one begins.
How can you split up the input line, breaking it up as you've learned how, and putting each item into an array so that you can access each individually? You've done the reverse process many times:
my $result_line = join("\t", @query_info);          # Taken from BlastParser.pl
The Perl join function joins each element of an array into a single string, separating each element by a character you specify (in this case a tab). Is there any way to do the reverse operation of splitting at a specific character the string holding the input line and putting each item into an array? One approach is to check Beginning Perl for Bioinformatics for help on splitting strings. Another is to search the various Perl programs you've accumulated for the word "split" (you can do this in Windows through the File search facility).
After educating yourself in this way, write lines to be placed in the subroutine that reads/parses the input line that puts each item in the line into an array. Remember that if you used Perl's automatic input, then the input line is sitting in a variable called $_. Then, to make sure everything has worked properly, print out the first four elements of the array (remembering to start with $array[0]). Run the program to make sure it works up to this point.
PS6.1h. Save the name of the gene from the array you just made as a separate variable. Here's a new variable. You'll need to define it, and I suggest doing so in the VARIABLES section of the program. You'll notice (from the file displayed in Word) that there are two versions of the name for each gene: a long version and a short version. Choose the short version. Save it in a variable and then print that variable to make sure that you saved the correct information. (By the way, by this time you've generated a lot of print statements. It's probably a good time to get rid of the earlier ones).
PS6.1i. Did your output begin with Gene Accession Number? That's not the name of a gene... it's a header. If you examine the training set file in Word, you'll see that the first line is garbage. Get rid of it! You can do this by reading in the line and ignoring what you've read. Where is the right place to do this? If you do it within the subroutine that reads/parses the input line, then you'll skip a line for every line you read -- not a good idea. You want to skip a line only once and at the very beginning. Where in the program can you ensure that the first line of the input file will be skipped? Go to that point and insert a line like:
my $garbage = <TRAINING_SET>;
Run the program to see if the header line is no longer given as a gene name.
PS6.1j. Now we want to save the information on each line after the gene name. What information is there to save? Look at the file in Word. You'll see a general rule: number, letter, number, letter, .... To tell the truth, I don't know exactly what the letters mean. I think A is for absent, P is for present, the rare M is for My God, what's this mean? I suggest that we save the numbers (i.e. the gene expression level) in one array and the letters (probably related to the quality of the data) in a separate array. So $expression[1] will contain the first number, $quality[1] will contain the first letter, $expression[2] will contain the second number, etc. Fritz will blow a fuse at this numbering scheme, since arrays begin at [0], but this way the subscript can correspond to the patient number. Who would want to be patient 0?
Write a for loop (not a foreach loop) that goes through the array you made in PS6.1g and assigns the appropriate values to elements of the array @expression and @quality. This is a bit tricky, because there are twice as many elements (plus two for the gene names) in the first array than in @expression and @quality, because you're going from:
                                    0                        1                    2          3           4         5           6         7       ...
@input_items: Long_gene_name Short_gene_namenumber1 letter1number2letter2number3 letter3 ...
to:
                                              0                                    1                     2                      3
@expression:                   (undefined)                        number1           number2          number3             ...
                                              0                                                1                     2                      3
@quality:                         (undefined)                                     letter1              letter2               letter3   ...
Fortunately, the for loop structure in Perl can handle this:
for (variable = first value;
      conditions under which loop continues;
      variable = variable + increment) {
so you can go through the input array two items at a time, giving the first to $expression[ ] and the second to $quality[ ]. Do it, and print out at the end of the loop (outside of it) $expression[1] and $quality[1] to make sure that your loop worked.
If you were able to get this far, the rest is a piece of cake

PS6.1k. After all that heavy lifting, it will be a delight to do something simple like calculate a mean. Now teach the program how to do this, in the subroutine you specified for that purpose.
Hint 1: You might have written your MAIN PROGRAM like:
$mean_ALL = Calculate_mean_of_ALL_patients();
$mean_AML = Calculate_mean_of_AML_patients();
but this is both inefficient and prone to error, because whenever you correct an error in the first subroutine, you'll probably have to remember to correct it in the second one as well. It's better to write a single subroutine that can act on any set of patients, like:
$mean_ALL = Calculate_mean(@ALL_patients);
$mean_ALL = Calculate_mean(@AML_patients);

The time has come to tell the program who are the ALL patients are and who are the AML patients... you don't know yourself? In that case go to the web site given in Golub et al (footnote 14), click on datasets, and scroll way down to the section entitled Molecular Classification of Cancer.... You'll find there a downloadable file called Files_descriptions.txt. Unfortunately, the file is not as descriptive as it might be. However, it does point you to a file table_ALL_AML_samples.txt that will give you what you want to know. What you want to know is which patients in the initial set of 38 have ALL and which have AML. Make two arrays that capture this information.
Hint 2: You could do this in at least two ways. Perl allows you to define arrays like this:
@AML_patients = (28,29,30,31,32,33,34,35,36,37,38);
You could achieve the same effect by the following:
@AML_patients = (28 .. 38);
Either line assigns to the array the patient numbers for those patients with AML.
Now write the routine, adding up the expression values for each patient in the array given to the subroutine, counting how many patients you considered, and, in the end, dividing the sum by that number. Put a print statement somewhere so that you can tell if the subroutine is working.
Hint 3: You've probably been using perl predict.pl | more to start the program. That's OK, but it means that you have to wait for the program to completely finish execution before you get any output. An alternative is to put after a diagnostic print statement something to temporarily stop execution, something like:
print [some useful variable], "\n";
<STDIN>;
This has the effect of waiting for some input from the keyboard before proceeding. A third alternative is to step through the program with a debugger.
Hint 4: The most common mistakes in writing a subroutine (at least for me) are:
(a) Forgetting to precede the first instance of a variable that's used only in the subroutine with my.
     If you do this, you might get an error like:
   Global symbol "@patients" requires explicit package name at predict.pl line 66.
(b) Forgetting to initialize a variable. If you do this, you might get an error like:
   Use of uninitialized value in addition (+) at predict.pl line 71
(c) Forgetting to put a semicolon at the end of a statement. If you do this, you'll get an unpredictable
      error. The linenumber of the earliest error (+ or - 1) may help.
(d) Forgetting to return a value at the end of the subroutine. If you do this, you won't get any error
     message, but the program may act as if the subroutine was never called.

PS6.1l. Now teach the program how to do calculate the standard deviation, in the subroutine you specified for that purpose. The process will be almost identical as what you did for the mean, except, of course, that the quantitites you calculate are different.
PS6.1m. (almost done) Now to teach how to calculate the correlation coefficient. Recall the formula:
(mean_ALL - mean_AML)
correlation coefficient = ---------------------------------------
(std_dev_ALL + std_dev_AML)
This subroutine will not need any parameters (i.e. nothing within the parentheses in the statement that calls the subroutine). In fact, the subroutine could be as little as a single line. As usual, put a print statement somewhere to test the efficacy of your subroutine.
PS6.1n. Now teach it to save the data in an array. You've done this before, most notably in FindMotif (store_hit_information). Steal that code and modify it to suit.
PS6.1o. Teach the program how to sort the array that contains all the correlation coefficients. You've done this before, also in FindMotif (sort_and_discard_excess_values) and in the notes a week or so ago. It was pretty mysterious stuff, but it isn't difficult to copy and paste and modify the names of the variables.
Hint: There is one thing about the sort statement we need to understand (except where the variables go). Recall that the subscript in the statement says which variable is to be the key for sorting: $$a[0] <=> $$b[0] says that the first variable is the key, while $$a[1] <=> $$b[1] says that the second variable is the key.
PS6.1p. FINALLY, the last step is to print out the best genes. You want to print out the genes with the 25 highest correlations (expression correlates best with expression from ALL patients) and those with the 25 lowest correlations (expression correlates worst with expression from ALL patients, which is still good for prediction).
Hint 1: There are clever ways of doing this, but the straightforward way is to write two loops, the first going from the first score to the 25th, and the second going from the last score to the 25th from the end. The second loop poses a special challenge: What is the subscript of the last score? Consider the usual trick of saying:
$size = @correlation_info;
and recall as well the power of for loops to control the amount by which the variable is iterated.
Hint 2: Remember that arrays in Perl start from the subscript [0], not [1].
Hint 3: Count the number of genes you got. It should be 25 with positive correlations and 25 with negative correlations. Is it? If not, why not? The most likely error is that you got one too many negative correlations.

PS6.1q. Compare your results to those in Fig. 3B of Golub et al.
Hint: It might be convenient to print out your results. To do this, redirect Perls output to a file:
Perl predict.pl > correlation.txt
Now you can read the file into Word or Excel and print out the results. By the way, when I do this, I get the error shown below, but the output still seems OK.
Use of uninitialized value in print at predict.pl line 108,...