Summarizing Oligonucleotide Expression Data

Affymetrix and Nimblegen design oligonucleotide arrays based on the idea that many probes targeting a single gene yield many measures whose combination will give a better estimate of true gene expression than any single measure. Affymetrix designs most of its probes to match targets near the 3' end of genes, because the 3'UTR is often more distinctive among related genes in a paralog family and also the 3' end of transcripts are more heavily represented in the cRNA's made from the cDNA templates coming from the initial poly-T primed reverse transcription step. Nimblegen uses random priming during the initial construction of the cDNA library, and so the cRNA's come from all over the transcript. Therefore Nimblegen probes are often placed in consitutive exons in the middle of a gene.

Statisticians have enjoyed considerable debate over how to construct single expression estimates based on multiple-probe hybridization data. Over fifty different methods have been published to synthesize the different readings from the various probes for a gene, into a single estimate of transcript abundance.

Bioinformatic issues

Affymetrix makes more of the chips used in laboratories than any other manufacturer. According to their website there are 1,950 Affymetrix facilities worldwide, and more than 21,000 papers have appeared based on data from Affymetrix chips. Although Affy has developed new designs in the past few years,their human genome U133A and U133Plus2 chips are still used by many laboratories, and form the bulk of Affymetrix human data in the GEO repository. However Affymetrix invested heavily in the masks for photolithography required by their technology, and hence has been reluctant to update the probe set annotations on these chips since UniGene build 133 in 2001. Of course the human genome assembly has improved a great deal since 2001. Early builds of the genome often confused regions or genes and their complements.

Beginning in 2005 several groups (Harbig, 2005; Dai, 2005 ) published detailed studies of the accuracy of Affymetrix probe mappings, and some provided alternate mappings. Bioconductor tries to make available newer mappings of Affymetrix probes when loading the affy package.

The figures at right represent intensities across 100 varied tissue samples for two Affy probe sets. The top one maps to an rRNA gene and bottom to the (non-transcribed) complementary strand (from (Harbig, 2005)). Note that while the intensity of the probe for complement is clearly lower, there seem to be some samples with much higher signal intensity than seems reasonable from cross-hybridization alone. Perhaps the initial cDNA step ran too long, so that some cDNA copies of the intended cDNAs were made.

A related issue is SNPs. Several studies have shown that a single mismatch on 25-mer decreases signal quite substantially. When Affymetrix designed the U133 probe sets, few of the SNPs were known.

Alternate splicing and poly-adenylation issues

When Affymetrix designed its U133 probe sets, many fewer genes were known to be alternately spliced. If the individual probes were themselves accurate then the older generation of arrays could be used to probe alternate termination and some splice variation. The Affymetrix probes are (at nominally 25 base pairs, but usually less in practice due to abrasion) too short to be reliable individually. Affymetrix has come out with an array specifically targeting individual exons. However at the moment the data coming from the most widely used algorithms for processing these chips have too much technical variation between probes for neighboring exons for these measures to be reliable indicators of splice variation.

A related issue is alternate poly-adenylation. Since Affymetrix designed their widely used U133 probe sets, many studies have shown that transcription of most genes may terminate (with polyadenylation) at several different sites in the 3' UTR. Thus transcripts with similar exon composition may not all contribute to the intensities on all the probes in a probe set; i.e. the signals from an Affymetrix probe set may not all be measuring the same population of transcripts.

Statistical Issues

The early Affymetrix algorithms for summarization used various forms of averaging the intensities of probes within a probe set on each individual chip. This simple commonsense approach was not very accurate because the intensities of probes in a probe set varied quite substantially. (Li and Wong, PNAS, 2000) introduced a linear model method called dChip (DNA Chip); a similar model was published by (Irizarry, 2003). Since then numerous other authors have proposed models based on the linear model framework. All these methods presume that the probes in a probe set match a common unique transcript. This presumption seems unlikely to be true based on current knowledge of splice variation, although Affymetrix did make a reasonable effort to design probe sets to match the specific splice variants that were known at the time they their design.

Linear Multi-Chip Models

The idea behind the linear, or multi-chip, models is that the amount of signal from one probe in a gene's probe set should depend both on the amount of that gene in the sample, and on the specific affinity of the probe for that gene's mRNA. Probes for the same transcript will differ in their affinity depending on technical factors like the numbers of C/G bases versus A/T bases. The statistical motivation for multi-chip models is that the signals from individual probes often move in parallel across a set of chips; the signals have roughly the same pattern across the different samples, as shown in Figure 1. The animations of probe sets in dChip show this quite compellingly.

Figure 1. Probe signals from a spike-in experiment. The concentrations are plotted along the horizontal axis (log scale), and the probe signals are plotted on the vertical axis (log scale). Each probe is represented by one color. The different probe signals change in parallel. Image courtesy of Terry Speed

Most approaches in the past decade have tried to estimate jointly both the characteristics of the probe (affinity for the target or better avidity) and the abundance of the transcripts in the samples. The simplest strategy is the statistical linear two-factor model: that model assumes that the errors in each data point have similar variances, and the two factors combine in a simple way. If the signal from each probe is proportional to both probe affinity, and gene abundance, then it must be proportional to the product of these factors multiplied together. Suppose for one target gene, the chip has a set of probes p1,…,pk; each probe pj binds to the target with affinity fk. Suppose in each sample i the gene occurs in amount ai. Then the intensity of probe j on chip i should be proportional to fk ai. See figure 2.

Figure 2. Ideal linear model relationship among intensity (height of green bars), abundance of transcript (ai), and probe affinity (j).

In practice, there are frequent outliers in microarray data: a few values far beyond the usual random fluctuations in signal intensities. These extreme values may be due to scratches, or uneven hybridization, or other artefacts. Often up to 10% of probes in an acceptable Affymetrix chip are outliers. Most standard methods to fit multivariate data flounder badly on data with this many outliers. The robust approach is to try to identify the outliers, and exclude or down-weight them.

Constant Variance – the Li and Wong Model

Li and Wong originally suggested the model PMij – MM ij = fk ai + εij, by analogy with MAS 4.0. They have found better fits with the model PMij = fk ai + εij, (PM-only). Li-Wong assumes that the noise in all the probe measures is roughly same size. In practice all biological measures exhibit intensity-dependent noise. The effect of their assumption is that probes with smaller variation are ignored, even though this variation may be measuring real differences. Fortunately the bright probes are often the most specific, and it does little harm to ignore the majority of probes, if the bright probes are good. They have tuned their fitting procedure to try to reduce the emphasis on the very bright probes, but in my experience their algorithm often throws out a good bright probe probe as an outlier.

Figure 3. A probe set with values (represented by red lines) fitted to actual PM-MM values (in blue); from dChip.
Proportional Variance – RMA

This is largely the work of Terry Speed's group at Berkeley, especially Ben Bolstad, and Rafael Irizarry. They work only with PM values, and ignore MM entirely. They take a log transform of equation () and find

With errors proportional to intensity in the original scale, the errors on the log scale have constant variance. After background subtraction and normalization they fit the model:
where nlog is their terminology for 'normalize and then take logarithm'. They fit this model by iteratively re-weighted least squares, or by median polish. Code is available in the affy package on BioConductor, together with quantile normalization.

At the moment RMA appears to be the best widely used method, considerably better than the Affymetrix PLIER model. All approaches do a decent job on high abundance genes, but the multi-chip models do better on low-abundance genes, which include many genes of interest, such as transcription factors, and signalling proteins.

Improvements to the linear model approach - probe weighting

Several authors have noted that probes within a probe set are not all equally accurate. If those probes can be identified and down-weighted, the linear model approach would give more accurate results. Several approaches have been taken. In particular the factor analysis approach of (Hochreiter, 2006) attempts to find a common factor among all the probes; this approach effectively down-weights those probes, which are not highly correlated with the majority of the others. Another more naive approach suggested by (Chen, 2006) simply down-weights probes based on their higher variance. Surprisingly this seems to work very well, at least on the spike-in data set used in the AffyComp comparison of pre-processing methods.

Background Adjustment

The 25 base oligomers used by Affymetrix are prone to cross-hybridization. In fact the MM value exceeded the PM for about 1/3 of all probes on the older chips. Cross-hybridization varies with GC content: the more GC-rich probes seem to have higher non-specific signal, as they have higher specific signal for the same amount of cRNA. Jean Wu and Rafa Irizarry proposed a way to address this issue by estimating the cross-hybridization for each probe. The best way to estimate is to actually have cross-hybridization data. Usually this kind of data is not available and Wu proposed a way to estimate this from MM data, using a model where the GC content over the probe is explicitly represented. Hence they called this method GC-RMA (or gcRMA).

The gcRMA did very well in the AffyComp comparison of pre-processing methods (Irizarry, 2006). However I have heard anecdotal reports from many practitioners, saying that they found it worked badly on their data in comparison to 'plain vanilla' RMA. My explanation for this is that the gcRMA model attempts to estimate the cross-hybridization background for each probe individually, across all the chips. This works very well on the very clean Affymetrix spike-in data. But when I look at probes where I am confident all the signal comes from cross-hybridization, I see very different patterns across different chips. Hence I suspect gcRMA's background adjustment introduces more noise than RMA into the typical 'noisy' chips produced in most labs.

Model-based Quality Control of Affymetrix Chips

One of the particular values of a multi-probe system is that all probes effectively act like positive controls for each other. With a good multi-chip model the hybridization problems show up as outliers from the fitted model. Plotting these outliers shows up problems in distinct areas of some chips. For example, the following figure 5 was made from dChip, and shows outliers on one chip. These outliers mostly occur in two rings.

Figure 5. Outliers are shown in pink

Software Available

Li and Wong's method is available through their program dChip, at www.dchip.org. They maintain an active help thread. Academic licenses are free.

The RMA method is available as part of the affy package in the Bioconductor tools suite: see www.bioconductor.org. A commercial software vendor, Iobion, has incorporated RMA into their GeneTrafic product.

Appendix:

Affymetrix MAS 5.0

Affymetrix has upgraded their Genome CO System (GCOS) - formerly MicroArray Suite (MAS)- software several times over the short history of their product. MAS 5.0 was replaced by PLIER in 2004 but data processed with MAS5.0 is still in much of the literature. MAS 5.0 basically averages log-transformed signals from all the individual probes in a probe set. Log-transformation equalizes the contribution of different probes. An estimate of background based on the mismatch probe (MM) is used (although not MM itself): log(PMj/MMj).

The idea of averaging different probe intensities for the same gene seems natural but is somewhat misleading in practice. The individual probes in a probe set have very different hybridization kinetics. Taking an average of their intensities, is like averaging the readings from scans taken at very different settings. A good algorithm should compare information about probe characteristics, based on the performance of each probe across chips, and use this to make a better estimate. These principles are the basis of the large class of multi-chip models. Affymetrix saw the evidence and released their own multi-chip model in 2004. Affymetrix chips processed with MAS 5.0 give reasonable estimates for high-abundance genes, and the algorithm reliably distinguishes high from low abundance genes. However fold-changes in low abundance genes are not always comparable across samples.