The aims of this section are to train you in the mechanics of working with probe-level data, to give you a set of approaches to normalization (which may work some of the time), and to get you to think critically about what normalization is supposed to do and how you would tell whether any particular normalization is working well.

- Displaying residuals for QA diagnostics
- Covariation of residuals with technical covariates
- Comparing current normalization methods for expression data
- Advantages and drawbacks of quantile normalization
- Systematic errors and correlated differences
- Normalization by singular value decomposition
- Approaches to normalization by estimating technical distortion as a function of technical covariates

The first step in any data analysis is quality checking. Microarrays are very complex and delicately balanced measuring instruments. We have very limited access to diagnostic information about the process of preparation (although that is changing), but we do have very rich data, about which we may form reasonable expectations based on what we think *should* have happened during the preparation of the array.

A standard technique in statistics is to check the validity of a model by examining the residuals from the fit. We don't have a statistical model here, but we do have expectations that can be formulated as a crude model, and we can plot the residuals from that model. These residual plots are often quite informative about problems during the preparation process. See the Opinionated Guide.

The simplest kinds of covariation plots are ratio-intensity plots and spatial aberration plots. Ratio-intensity plots show the ratios of signals on one chip to the average intensity in the sample (usually on a log2-transformed scale). For samples all taken from one tissue type, most intensities are roughly constant, and so average probe intensity is a sensitive indicator of probe saturation and quenching. Spatial aberration plots represent the ratios of intensities over the spatial extent of the chip.
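As a rough illustration of a ratio-intensity plot, the sketch below builds the log2 ratio of each chip to the average chip and plots it against the average log2 intensity. The data are simulated and the names `log_int`, `A` and `M` are chosen here for illustration; with real data `log_int` would hold log2-transformed probe intensities, with probes in rows and chips in columns.

```r
# Simulated log2 intensities: 1000 probes (rows) on 4 chips (columns).
# These numbers are invented purely so that the sketch runs on its own.
set.seed(1)
true_level <- rnorm(1000, mean = 8, sd = 2)
log_int <- sapply(1:4, function(j) true_level + rnorm(1000, sd = 0.2))
colnames(log_int) <- paste0("chip", 1:4)

# A: average log2 intensity of each probe across chips (the "average chip").
A <- rowMeans(log_int)
# M: log2 ratio of each chip to the average chip, one column per chip.
M <- log_int - A

# Ratio-intensity plot for the first chip; a well-behaved chip scatters
# evenly around the horizontal line at zero over the whole intensity range.
plot(A, M[, 1], pch = ".", xlab = "average log2 intensity",
     ylab = "log2 ratio to average", main = colnames(log_int)[1])
abline(h = 0, col = "red")
```

A curved or tilted cloud in such a plot suggests an intensity-dependent distortion on that chip, such as saturation at the high end.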

There has been a great deal published over the past ten years about microarray normalization (to the point where a leading journal, *Bioinformatics*, will not accept any more papers on the subject), without a consensus forming as to what is the appropriate normalization. The problem is not merely statistical. There seems to be a great deal of non-random error in array data, and it seems to depend on a great many factors, and so how to model this error is a matter of art, rather than elegant science.

However, we can compare normalization methods on a variety of standard data sets, where we have some idea of what a good normalization should do. The key idea is that, while we may not for some time be able to know the 'truth' about absolute gene expression levels, we can at least ask how accurate the relative measures of expression are. See the OGMDA web site for further information.

By 2003 statisticians were inventing very complex normalization procedures. Benjamin Bolstad, one of Terry Speed's students, proposed cutting through all the complexity by a simple non-parametric normalization procedure, at least for one-color arrays. He proposed to shoe-horn the intensities of all probes on each chip into one standard distribution shape, which is determined by pooling all the individual chip distributions. The algorithm mapped every value on any one chip to the corresponding quantile of the standard distribution; hence the method is called quantile normalization. This simple 'between-chip' procedure worked as well as most of the more complex procedures then current, and certainly better than the regression method, which was then the manufacturer's default for Affymetrix chips. This method was also made available as the default in the affy package of Bioconductor, which has become the most widely used suite of freeware tools for microarrays (see www.bioconductor.org). For all these reasons quantile normalization has become the most common normalization procedure.
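The mapping described above can be sketched in a few lines of base R. This is a minimal version that assumes a complete numeric matrix with probes in rows and chips in columns; the function name `quantile_normalize` is made up here, and ties are broken by order of appearance, which is a simplification rather than a reproduction of any packaged implementation.

```r
quantile_normalize <- function(x) {
  # x: numeric matrix, probes in rows, chips in columns, no missing values.
  # Rank of every probe within its own chip (ties broken by position).
  ranks <- apply(x, 2, rank, ties.method = "first")
  # Sort each chip, then average across chips at each rank to obtain the
  # pooled "standard" distribution.
  sorted <- apply(x, 2, sort)
  reference <- rowMeans(sorted)
  # Replace each value by the reference value at the same rank.
  normalized <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

# Small worked example: 4 probes on 3 chips.
x <- matrix(c(5, 2, 3, 4,  4, 1, 4, 2,  3, 4, 6, 8), nrow = 4,
            dimnames = list(paste0("probe", 1:4), paste0("chip", 1:3)))
qn <- quantile_normalize(x)
```

After normalization every column of `qn` contains the same set of values; only the order within each chip, inherited from the original ranks, differs.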

One of the biggest changes in thinking needed to analyze high-throughput data is the idea of correlated, or systematic, errors. The majority of technical variation in a microarray experiment can be represented by only a few principal components. Since the beginning, statisticians have assumed that errors in repeated measures are independent, and most procedures for high-density data have been built on methods that embody, at least tacitly, the same assumption. We have only recently begun drawing on the deep traditions of statistics to systematically address the kinds of correlated errors that characterize high-throughput data.

One of the ground-breaking studies of this sort was Leek and Storey (2007). This paper showed how to perform a singular value decomposition of the data, using the design matrix, and thereby derive a set of inferred ('surrogate') covariates, which can then function in a traditional analysis of covariance.
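The core idea can be sketched on simulated data: remove the effects described by the design matrix, decompose what is left by SVD, and treat the leading right singular vector as an inferred covariate. This is only a simplified illustration of that first step, not the published algorithm, which includes further steps omitted here. The variables `group`, `batch` and `y` are all invented for the example.

```r
set.seed(2)
n_probes <- 500
n_chips <- 12
group <- factor(rep(c("B", "D"), each = 6))   # known biological design
batch <- rep(c(-1, 1), times = 6)             # hidden technical factor
design <- model.matrix(~ group)

# Simulated log2 data: a group effect on 50 probes, plus a batch effect
# whose size differs from probe to probe.
y <- matrix(rnorm(n_probes * n_chips, sd = 0.3), n_probes, n_chips)
y[1:50, group == "D"] <- y[1:50, group == "D"] + 1
y <- y + outer(rnorm(n_probes), batch)

# Remove the designed effects probe by probe (lm.fit wants chips in rows),
# then decompose the residual matrix.
fit <- lm.fit(design, t(y))
resid <- t(fit$residuals)                     # back to probes x chips
s <- svd(resid)

# Fraction of residual variance carried by each component.
var_explained <- s$d^2 / sum(s$d^2)
# Leading right singular vector: one value per chip, an inferred covariate.
inferred <- s$v[, 1]

# The inferred covariate can sit alongside the design in a linear model.
design_adj <- cbind(design, inferred)
```

In this simulation `inferred` tracks the hidden `batch` variable closely (up to sign), even though `batch` was never given to the procedure.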

Another approach to normalization proceeds by generalizing the LOESS procedure that Terry Speed introduced.
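A basic version of this idea, sketched below on simulated data, fits a loess curve to the deviations of one chip from a reference profile as a function of intensity, and subtracts the fitted curve. The function name `loess_normalize`, the `span` value and the simulated distortion are assumptions made for illustration.

```r
set.seed(3)
n <- 2000
reference <- rnorm(n, mean = 8, sd = 2)       # reference log2 profile
# One chip with an invented intensity-dependent distortion plus noise.
chip <- reference + 0.05 * (reference - 8)^2 - 0.2 + rnorm(n, sd = 0.15)

loess_normalize <- function(chip, reference, span = 0.4) {
  # Deviation of the chip from the reference, probe by probe.
  dev <- chip - reference
  # Smooth estimate of the distortion as a function of intensity.
  fit <- loess(dev ~ reference, span = span, degree = 1)
  # Subtract the estimated distortion from the chip.
  chip - fitted(fit)
}

normalized <- loess_normalize(chip, reference)

# Spread of the deviations before and after the correction.
sd_before <- sd(chip - reference)
sd_after <- sd(normalized - reference)
```

Since `loess()` accepts more than one numeric predictor, the same construction could in principle be extended by adding further technical covariates to the right-hand side of the formula.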

Course: Microarray Normalization

Henrik Bengtsson's Two-color slides

Pan Du's Illumina presentation

- Ben Bolstad's Rogue's Gallery of Affymetrix Chips.

Gordon Smyth and Terry Speed wrote a synopsis of normalization for two-color arrays as of 2003. Many of the issues they raise are still with us, although some particular methods, such as print-tip loess, are not used as much because of the shift to oligonucleotide arrays.

Leek and Storey described one way to take advantage of correlated errors to improve normalization. [PRESENTABLE]

My recent review article on expression array normalization.

Shirley Liu's MAT method used technical information about probes to improve normalization. [PRESENTABLE]

Carvalho's method for SNP arrays used a sophisticated statistical procedure to take advantage of technical information about probes. [PRESENTABLE]

Agilent MAQC data samples B & D - one color

Illumina MAQC data samples B & D

Illumina HapMap (Genetical Genomics) data subset - Yoruba samples. These are four replicates of four samples.

- The MAQC data comes from two samples of RNA (A & B) mixed in proportions 75:25 (sample C) and 25:75 (sample D). Each sample was done five times by three different labs. The full data set is available from GEO at GSE5350.
- The CEPH samples come from a group of extended families in Utah, and are often used for genetic studies. These expression data sets are actually measurements of expression in immortalized cell lines derived from lymphocytes from the CEPH individuals.

Use the affy package to read in the Affy data, but read in the other formats using read.table(). It will be convenient to use a loop and extract only the data you need from each table; the Agilent files in particular contain a lot of information you won't keep. You'll need the probe information (Probe ID, row and column), and it makes sense to extract that once. The Agilent files are a bit tricky because they contain single quote (') and # characters, both of which R treats specially when reading. Try adding quote="", comment.char="" to your read.table() command. Code is provided for reading in the Illumina MAQC, Illumina HapMap and Agilent data.
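The loop described above might look roughly like the following sketch. It is not the course's own code: to stay self-contained it first writes two small tab-delimited files to a temporary directory, and the column names (`ProbeID`, `Row`, `Col`, `Description`, `Signal`) are invented, so they will need to be replaced by the actual column names in your files.

```r
# Write two toy files containing ' and # characters so the sketch runs on
# its own; the layout and column names are invented for illustration.
tmp <- tempfile("arrays")
dir.create(tmp)
for (id in c("chip1", "chip2")) {
  toy <- data.frame(ProbeID = c("p1", "p2", "p3"),
                    Row = 1:3, Col = c(1, 1, 2),
                    Description = c("3' end", "locus #2", "plain"),
                    Signal = round(runif(3, 50, 5000)))
  write.table(toy, file.path(tmp, paste0(id, ".txt")),
              sep = "\t", quote = FALSE, row.names = FALSE)
}

files <- list.files(tmp, pattern = "\\.txt$", full.names = TRUE)

# Probe annotation is the same on every chip, so read it once.
first <- read.table(files[1], header = TRUE, sep = "\t",
                    quote = "", comment.char = "", stringsAsFactors = FALSE)
probes <- first[, c("ProbeID", "Row", "Col")]

# Loop over the files and keep only the signal column from each.
signal <- matrix(NA_real_, nrow(probes), length(files),
                 dimnames = list(probes$ProbeID,
                                 sub("\\.txt$", "", basename(files))))
for (j in seq_along(files)) {
  tab <- read.table(files[j], header = TRUE, sep = "\t",
                    quote = "", comment.char = "", stringsAsFactors = FALSE)
  signal[, j] <- tab$Signal
}
```

Without `quote = ""` and `comment.char = ""`, the `'` and `#` characters in the `Description` column would cause `read.table()` to misparse these rows.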

- Construct box plots of the intensities for each chip: try making box plots with the data on the original scale, transformed by cube root, and transformed by log2. Here is some sample ratio intensity code.
- Make plots of the density of signal intensities for each experiment: overlay the density plots for different chips in different colors. For the CEPH data choose colors that are identical for replicate chips.
- For each chip construct the *deviations* of that chip from the average chip profile. It's easiest to do this after log2 transforming everything. Plot the deviations of each chip relative to the average against the average log2 intensities. You may find the plots easier to look at using the `smoothScatter` function from library *geneplotter*.
- Use the code provided here (for Affy) and here (for Agilent) to examine regional variation over individual chips.
- (optional) Match up the sequences for each chip type and then compute the relative proportions of each base. For each chip plot the log2 deviation ratios against the proportion of each base for each chip. Compute the probe melting temperatures (Tm) using the code provided here and plot the log deviations against Tm.
- Construct a quantile normalization procedure in R.
- Construct a technical regression normalization procedure using loess() against the ideal reference chip.
- Normalize all the experiments using quantile and loess procedures.
- Compare the cluster plots of the samples before and after quantile and regression normalization.
- Construct a measure of how effective the normalization procedures are by comparing biological variation to technical variation in each experiment.
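
For the last step, one possible measure (among many) is the ratio of typical between-sample variance to typical within-replicate variance, computed probe by probe. The sketch below is only a starting point: the function name `variance_ratio` is hypothetical, the data are simulated, and a simple median centering of each chip stands in for a real normalization method.

```r
variance_ratio <- function(x, sample_id) {
  # x: probes x chips matrix of log2 values.
  # sample_id: the biological sample that each chip measures.
  sample_id <- factor(sample_id)
  # Mean of each probe within each biological sample (probes x samples).
  means <- sapply(levels(sample_id),
                  function(s) rowMeans(x[, sample_id == s, drop = FALSE]))
  # Technical variation: pooled variance of replicates around their mean.
  within <- rowSums((x - means[, as.integer(sample_id)])^2) /
    (ncol(x) - nlevels(sample_id))
  # Biological variation: variance of the sample means across samples.
  between <- apply(means, 1, var)
  median(between) / median(within)
}

# Simulated example: 4 samples x 4 replicates, 300 probes, with an invented
# chip-wide shift acting as the technical distortion.
set.seed(4)
sample_id <- rep(paste0("s", 1:4), each = 4)
bio <- matrix(rnorm(300 * 4, sd = 0.5), 300, 4)[, rep(1:4, each = 4)]
chip_shift <- rnorm(16, sd = 0.5)
raw <- 8 + bio + rnorm(300 * 16, sd = 0.1) + rep(chip_shift, each = 300)

# Stand-in normalization: subtract each chip's median.
centered <- sweep(raw, 2, apply(raw, 2, median))

ratio_raw <- variance_ratio(raw, sample_id)
ratio_centered <- variance_ratio(centered, sample_id)
```

A normalization that removes technical variation without flattening biological differences should raise this ratio; a method that raises it by inflating between-sample differences would need a different check.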

You've now constructed (most of) a microarray pre-processing pipeline!