Expression Arrays: Quality Assessment, Normalization and Summaries

The aims of this section are to teach you the mechanics of working with probe-level data, to give you a set of approaches to normalization (each of which may work only some of the time), and to get you to think critically about what normalization is supposed to do and how you would tell whether a particular normalization is working well.


Course: Quality Assessment
Course: Microarray Normalization
Henrik Bengtsson's Two-color slides
Pan Du's Illumina presentation

Web Resources


My brief paper on regional artifacts on arrays was one of two that drew attention to the possibility of detecting internal anomalies and performing QA by statistical means on Affymetrix's enclosed system.
Gordon Smyth and Terry Speed wrote a synopsis of normalization for two-color arrays as of 2003. Many of the issues they raise are still with us, although some particular methods, such as print-tip loess, are used less now because of the shift to oligonucleotide arrays.
Leek and Storey described one way to take advantage of correlated errors to improve normalization. [PRESENTABLE]
My recent review article on expression array normalization.
Shirley Liu's MAT method used technical information about probes to improve normalization. [PRESENTABLE]
Carvalho's method for SNP arrays used a sophisticated statistical procedure to take advantage of technical information about probes. [PRESENTABLE]

HW 1

Expression Data
Affymetrix HapMap subset on HG-Focus chips, and sample IDs.
Agilent MAQC data samples B & D - one color
Illumina MAQC data samples B & D
Illumina HapMap (Genetical Genomics) data subset - Yoruba samples. These are four replicates of four samples.
Notes for HW 1
  1. The MAQC data comes from two samples of RNA (A & B) mixed in proportions 75:25 (sample C) and 25:75 (sample D). Each sample was done five times by three different labs. The full data set is available from GEO at GSE5350.
  2. The CEPH samples come from a group of extended families in Utah, and are often used for genetic studies. These expression data sets are actually measurements of expression in immortalized cell lines derived from lymphocytes from the CEPH individuals.
Warm-up exercises

Use the affy package to read in the Affy data, but read in the other formats using read.table(). It will be convenient to use a loop and extract only the data you need from each table; the Agilent files in particular contain a lot of information you won't keep. You'll need the probe information (Probe ID, row and column), and it makes sense to extract that only once. The Agilent files are a bit tricky because they contain single quote (') and # characters, which read.table() by default treats as a quoting character and a comment marker, respectively. Try adding quote="", comment.char="" to your read.table() command. Use the following code to read in Illumina MAQC, Illumina HapMap and Agilent.
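As an illustration of the read.table() advice above, here is a minimal sketch that reads a tiny made-up tab-delimited file in a loop and keeps only the columns of interest. The file contents, the column names (ProbeName, Row, Col, gProcessedSignal), the helper read_one() and the array names are all assumptions for the example; check the header of your own files before relying on any of them.

```r
# Hypothetical miniature of a tab-delimited export; real files have many
# more columns and may have extra header lines to skip.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "ProbeName\tRow\tCol\tDescription\tgProcessedSignal",
  "A_23_P1\t1\t1\t5' UTR probe\t120.5",
  "A_23_P2\t1\t2\tclone #17\t98.2",
  "A_23_P3\t2\t1\t3' end\t310.0"
), tmp)

# Read one file and keep only the requested columns.
# quote = "" stops ' from being treated as a quoting character;
# comment.char = "" stops # from truncating the rest of the line.
read_one <- function(path, keep) {
  tab <- read.table(path, header = TRUE, sep = "\t",
                    quote = "", comment.char = "",
                    stringsAsFactors = FALSE)
  tab[, keep, drop = FALSE]
}

# Stand-in for the real list of files (the same file twice here).
files <- c(arrayA = tmp, arrayB = tmp)

# Probe annotation is the same on every array, so extract it once.
probe_info <- read_one(files[1], c("ProbeName", "Row", "Col"))

# One signal column per array, collected into a probes-by-arrays matrix.
signal <- sapply(files, function(f) read_one(f, "gProcessedSignal")[[1]])
rownames(signal) <- probe_info$ProbeName
```

Keeping the annotation in one data frame and the signals in one matrix makes the later steps (box plots, deviations, normalization) simple column operations.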

  1. Construct box plots of the intensities for each chip: try making box plots with the data on the original scale, after a cube root transform, and after a log2 transform. Here is some sample ratio intensity code.
  2. Make plots of the density of signal intensities for each experiment: overlay the density plots for different chips in different colors. For the CEPH data choose colors that are identical for replicate chips.
  3. For each chip construct the deviations of that chip from the average chip profile. It's easiest to do this after log2 transforming everything. Plot the deviations of each chip relative to the average against the average log2 intensities. You may find the plots easier to look at using the smoothScatter function from library geneplotter.
  4. Use the code provided here (for Affy) and here (for Agilent) to examine regional variation over individual chips.
  5. (optional) Match up the sequences for each chip type and then compute the relative proportions of each base in each probe. For each chip, plot the log2 deviations against the proportion of each base. Compute the probe melting temperatures (Tm) using the code provided here and plot the log2 deviations against Tm.
  6. Construct a quantile normalization procedure in R.
  7. Construct a technical regression normalization procedure using loess() against the ideal reference chip.
  8. Normalize all the experiments using quantile and loess procedures.
  9. Compare the cluster plots of the samples before and after quantile and regression normalization.
  10. Construct a measure of how effective the normalization procedures are by comparing biological variation to technical variation in each experiment.
Here are some graphical summaries of what you should get.
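For orientation, steps 1 to 3 above might be sketched as follows on simulated data. This is only an outline under stated assumptions: the intensity matrix, the replicate_group vector and the output file name are invented for the example, and plain plot() is used for the deviation plots (smoothScatter can be substituted with the same first two arguments).

```r
set.seed(3)
# Simulated probes-by-chips intensity matrix (hypothetical data).
intensity <- matrix(2^rnorm(4000, mean = 8, sd = 1.5), ncol = 4,
                    dimnames = list(NULL, paste0("chip", 1:4)))
replicate_group <- c(1, 1, 2, 2)  # assumed: chips 1-2 and 3-4 are replicates

log_int <- log2(intensity)
avg <- rowMeans(log_int)          # the "average chip" profile
dev <- log_int - avg              # deviation of each chip from the average

pdf(file.path(tempdir(), "qa_plots.pdf"))

# Step 1: box plots on three scales.
boxplot(as.data.frame(intensity), main = "original scale")
boxplot(as.data.frame(intensity^(1 / 3)), main = "cube root")
boxplot(as.data.frame(log_int), main = "log2")

# Step 2: overlaid densities, replicate chips sharing a color.
dens <- lapply(seq_len(ncol(log_int)), function(j) density(log_int[, j]))
ymax <- max(sapply(dens, function(d) max(d$y)))
plot(dens[[1]], col = replicate_group[1], ylim = c(0, ymax),
     main = "log2 intensity densities", xlab = "log2 intensity")
for (j in 2:ncol(log_int)) lines(dens[[j]], col = replicate_group[j])

# Step 3: deviation from the average chip against average log2 intensity.
for (j in seq_len(ncol(dev))) {
  plot(avg, dev[, j], pch = ".", main = colnames(dev)[j],
       xlab = "average log2 intensity", ylab = "deviation from average")
  abline(h = 0, col = "red")
}
dev.off()
```

By construction the deviations of each probe sum to zero across chips, so a trend or curvature in any one of these plots is a sign of an intensity-dependent difference between that chip and the others.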

You've now constructed (most of) a microarray pre-processing pipeline!
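The two normalization steps at the center of that pipeline (steps 6 and 7) could look roughly like the sketch below. It is one possible construction, not a reference solution: the function names, the span value and the simulated data are assumptions, and the per-probe average across chips stands in for the "ideal reference chip", which you may prefer to define differently. Input is assumed to be a log2 probes-by-chips matrix with no missing values.

```r
# Quantile normalization: give every chip the same distribution, namely
# the mean of the sorted intensities across chips.
quantile_normalize <- function(x) {
  ranks <- apply(x, 2, rank, ties.method = "first")
  sorted <- apply(x, 2, sort)
  target <- rowMeans(sorted)  # mean of the k-th smallest values across chips
  out <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(x)
  out
}

# Loess normalization: for each chip, fit the deviation from the reference
# as a smooth function of the reference, then subtract the fitted trend.
loess_normalize <- function(x, span = 0.4) {
  ref <- rowMeans(x)  # assumption: the average chip is the reference
  out <- x
  for (j in seq_len(ncol(x))) {
    d <- x[, j] - ref
    fit <- loess(d ~ ref, span = span, degree = 1)
    out[, j] <- x[, j] - fitted(fit)
  }
  out
}

# Simulated log2 data with a linear and a curved distortion (hypothetical).
set.seed(1)
truth <- rnorm(500, mean = 8, sd = 1.5)
raw <- cbind(chip1 = truth + rnorm(500, sd = 0.1),
             chip2 = 0.9 * truth + 1.5 + rnorm(500, sd = 0.1),
             chip3 = truth + 0.05 * (truth - 8)^2 + rnorm(500, sd = 0.1))

qn <- quantile_normalize(raw)
ln <- loess_normalize(raw)
```

Quantile normalization forces the chip distributions to be identical, which is a strong assumption; the loess version only removes a smooth intensity-dependent trend, so comparing the two on the same data is a useful check on how much each one changes.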

HW 1

Hand in a comparison of median normalization, quantile normalization, and loess normalization for the four data sets. This should be about 2 pages of text and tables, and several pages of graphs for illustration. The key issue is the ratio of biological to technical variation. You may want to evaluate only gene signals that are most likely to reflect real signals (how would you tell?).
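One way to put a number on the ratio of biological to technical variation, when each biological sample has replicate chips, is a per-probe ratio of between-sample variance to pooled within-sample (replicate) variance. The sketch below is only one candidate measure under that assumption; the function name bio_tech_ratio(), the simulated data and the choice of the median as a summary are all hypothetical, and median normalization is shown simply as the easiest of the three methods to write down.

```r
# Per-probe ratio of between-sample variance to within-sample variance.
# x: log2 probes-by-chips matrix; sample_id: biological sample of each chip.
bio_tech_ratio <- function(x, sample_id) {
  sample_id <- factor(sample_id)
  groups <- levels(sample_id)
  # Mean of each probe within each biological sample (probes x samples).
  means <- sapply(groups, function(g)
    rowMeans(x[, sample_id == g, drop = FALSE]))
  # Technical variance: pooled variance of replicates around their mean.
  resid <- x - means[, as.integer(sample_id), drop = FALSE]
  tech <- rowSums(resid^2) / (ncol(x) - length(groups))
  # Between-sample variance of the sample means (this still contains a
  # share of technical variance, tech divided by the number of replicates).
  bio <- apply(means, 1, var)
  bio / tech
}

# Simulated design: four samples, four replicate chips each (hypothetical).
set.seed(2)
n_probe <- 200
sample_id <- rep(c("s1", "s2", "s3", "s4"), each = 4)
sample_effect <- matrix(rnorm(n_probe * 4, sd = 0.5), n_probe, 4)
x <- 8 + sample_effect[, rep(1:4, each = 4)] +
  rnorm(n_probe * 16, sd = 0.2)

# Add a chip-wide shift to each chip, then remove it by median normalization.
shifted <- sweep(x, 2, rnorm(16, sd = 0.5), "+")
median_normalized <- sweep(shifted, 2, apply(shifted, 2, median), "-")

ratio_before <- median(bio_tech_ratio(shifted, sample_id))
ratio_after <- median(bio_tech_ratio(median_normalized, sample_id))
```

The same function can be applied to the quantile- and loess-normalized matrices, and restricted to probes with high average intensity if you want to focus on signals more likely to be real.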