The Purposes of Statistical Tests 
Microarray studies often aim to identify genes
that are differentially regulated across different classes of samples; examples
are: finding the genes affected by a treatment, or finding marker genes that
discriminate diseased from healthy subjects. Statistical tests, rather than
cluster analysis, are the right tool for this purpose. Statisticians have developed
many new procedures that seem more useful than conventional tests for microarray
data. Some studies also aim to identify groups of genes that act together, or
to uncover molecular similarities among subsets of samples. The section on
Exploratory Analysis describes appropriate techniques for these studies. Some
investigators confirm results found by exploratory analysis by using formal
statistical methods, or do cluster analysis on differentially expressed genes.
Such procedure can be circular in logic; better procedure is to identify
differences between groups defined on biological grounds. 
Microarray data is often used as a guide to
further, more precise studies of gene expression by qtPCR or other methods.
Then the goal of the statistical analysis is heuristic: to provide the
experimenter with an ordered list of good candidate genes to follow up.
Sometimes the experimenter plans to publish microarray results as evidence for
changes in gene abundance; in this case it is important to state the correct
degree of evidence: the ‘pvalue’. Many microarray papers still present pvalues,
as if each gene had been tested in isolation, even though many genes were
actually tested in parallel; these pvalues are wrong in the context of testing
thousands of genes. Perhaps a better way to specify the confidence of
microarray results is the ‘false discovery rate’. 

Transforms 
Often the first step is transforming the values to log
scale, and doing all subsequent steps on the logtransformed values. Although
taking logarithms is common practice, and helpful in several ways, there are
other options. The main justification for transforms in statistics is to better
detect differences between groups whose withingroup variances are very different.
Most commonly the withingroup variances are higher in those groups where the
mean is also higher. A different kind of variation, the measurement error in expression
level estimates, grows with the mean level. If the measurement error is proportional
to the mean, then the logtransformed values will have consistent variance for
all genes. For both reasons many researchers argue that gene expression measures
should be analyzed on a logarithmic scale. 
However the gene abundances on the log scale usually show
greater variation near the bottom end, giving rise to the characteristic
‘funnel’ shape of many microarray plots. The log transform is too ‘strong’, and
partly reverses the inequality of measurement variances. A weaker transform,
such as a cube root, would bring the variances closer to equal. Some researchers
have devised ‘variancestabilizing’ transformations, which attempt to equalize
measurement errors for all genes. 
However we are not in principle comparing different genes,
but rather the same genes across different groups, and in most experiments, few
genes change more than threefold in mean levels. For studies where gene levels
are fairly constant, and changes are expected to be modest, such as
neuroscience studies, there is no need to transform data. For studies of
cancerous tissue, where often some genes are elevated tenfold or more, and these
increases are highly variable between individuals, the log transform is very
useful. For studies with at most moderate fold changes – such as most
experimental treatments on healthy animals – it would be better to use a weaker
transform, such as a cuberoot transform or a ‘variancestabilizing’ transform’.
However most people find these less common transforms confusing; then as a practical
alternative, it is worth doing the analysis in parallel in both log and true
scale, and discarding genes whose significance disappears entirely under one or
the other. However there are other reasons to equalize variances for all genes;
roughly equal variances are necessary for SAM and for the empirical Bayes
methods described below. 
These issues are discussed more fully in the section on Transforms
in Distributions of Microarray Intensities. 

Comparison of Two Groups of Samples 
The simplest and most common experimental
setup is to compare two groups: for example, Treatment vs. Control, or Mutant
vs. Wild type. The issues arising in simple comparisons arise also in more
complex settings; it is easier to explain these in the simpler context. 
The longtime standard test statistic for
comparing two groups is the tstatistic: 
t = (x_{i,1} – x_{i,2})
/ s_{i}, 
where x_{i,1} is the mean value of gene
i in group 1, x_{i,2} is the mean in group 2, and s_{i} is the
(nonpooled) withingroups standard error (SE) for gene i. 
Usually the majority of genes are not
consistently changed between groups. For the unchanged genes, the values of
their tstatistic would follow the tdistribution indexed by n1+n2 – 2 degrees
of freedom, in repeated trials, provided there are no outliers in the samples.
As a matter of practice, in almost all cases, the values of the tstatistics
for all of the unchanged genes in a single experiment also follow a
tdistribution. If the researcher expects only a small number of genes are
greatly changed, then a convenient way of assessing the most likely true
positives is to plot the tscores obtained, against the tdistribution. Usually
the tscores of the unchanged genes follow the tdistribution; if the unchanged
genes are the overwhelming majority, then this plot will be a straight line at
45^{o}. The tscores for the few really significant genes are more extreme,
and stand out from the straight line, where they can be easily seen. A small R
program that does this is here. 


Figure 1. A quantile plot 
A drawback of the tstatistic for microarray
datasets is that most experiments have only a few samples in each group (n1 and
n2 are small), and so the standard error s_{i} is not very reliable. In
a modest fraction of cases, s_{i} is (by normal sampling variation)
greatly underestimated, and genes that are little changed give rise to extreme
tvalues, and therefore false positives. The theory compensate for some of
these false positives, in that the tails of the tdistribution are much further
out than those of a Normal curve. This causes a reverse problem: a truly
changed gene has to change a great deal to give convincing evidence; therefore
many of the moderately changed genes won’t show convincing evidence of change. 
Tusher and Tibshirani suggested (a form of) the
following test statistic t* rather than the usual t statistic[1]. The SE in the
denominator is adjusted toward a common value for all genes; this prevents gross
underestimation of SE’s, and so eliminates from the gene list those genes that
change only a little: 
t* = (x_{i,1} – x_{i,2})
/ ((s_{i} + s_{0})/2) 
where s_{0} is the median of the
distribution of standard errors for all genes. The range of values of the
revised denominator is narrower than the range of sample SEs; it is as if all
the SE’s are ‘shrunk’ toward the common value s_{0}. Unfortunately no
theoretical distribution is known for this statistic t*, and hence the
significance levels of this statistic must be computed by a permutation method
(see below). A similar statistic also crops up in empirical Bayes analysis (see
below). 
Another approach to detecting more of the
differentially expressed genes is to use a more precise estimate of the
variation between individuals, for each gene, in tests of that gene. If a good
deal of prior data exist on the tissue and strain used in the wildtype (or
control) group, measured on the same microarray platform – and this is
sometimes the case now – then it is defensible to pool the estimates of wildtype
variation from each of the prior studies, and use this as the denominator in
the tscores. The tscores should then be compared to the tdistribution on a
number of degrees of freedom, equal to that used in computing the pooled
standard error. A variant of this approach may be used in a study where many
groups are compared in parallel. The withingroup variances for each gene may
be pooled across the different groups to obtain a more accurate estimate of
variation. This presumes that treatments applied to different groups affect
mostly the mean expression levels, and not the variation among individuals. Of
course one should test that the discrepancies in variance estimates are not too
large for many of the genes that are selected as differentially expressed. This
may be done by computing the ratios of variances between groups (Fratios), and
comparing to an Fdistribution. 
Since gene expression values are not
distributed according to a Normal curve, some researchers have used
nonparametric tests. The most commonly used test of this type is the Wilcoxon
signedrank test. Unfortunately this is not really a good choice, because the
Wilcoxon test assumes that the distribution of the gene expression values,
although not Normal, is strictly symmetric; the pvalues from this test are
inaccurate, if this assumption is badly wrong; and in practice the distribution
of values of a single gene is frequently quite asymmetric. A thoroughly robust
nonparametric test is the rank, or binomial test, which counts the number of
values in each group greater than the median value, and compares this number to
the binomial distribution. This test has very little power to detect
differences in small groups, which are typical of microarray studies, and is
never used in practice to detect differential expression. 
Gene expression values don’t follow a normal
curve, but the pvalues from a standard ttest aren’t far away from truth when
the distribution is moderately asymmetric. However the ttest does fall down
badly when there are outliers (values more than twice as far from the mean as
all the others). For this reason, when doing a ttest, it is wise to confirm
that neither group contains outliers. In practice, it often happens that genes
detected as different between groups, are actually expressed very highly in
only one individual of the higher group. Some methods for doing this
preanalysis are discussed in the Exploratory Analysis section. Probably the
best way to avoid the problem with outliers is to use robust estimates of mean
(trimmed mean) and variability (mean absolute deviation). However no
theoretical distribution is known for these; and their significance levels
(pvalues) would have to be computed by permutations. 

Permutation Tests 
Permutation testing is an approach that is
widely applicable and copes with distributions that are far from Normal; this
approach is particularly useful for microarray studies because it can be easily
adapted to estimate significance levels for many genes in parallel. The major
drawback for experimentalists is that these tests usually require some
programming. Some recent software packages, notably SAM, implement permutation
testing in a menudriven interface. 
The meaning of a pvalue from a permutation
procedure differs from the meaning of a modelbased pvalue. The modelbased
pvalue is the probability of the test statistic, assuming that the gene levels
in both the treatment and control groups follow the model (eg. a Normal
distribution). A permutationbased pvalue tells how rare that test statistic
is, among all the random partitions of the actual samples into pseudotreatment
and pseudocontrol groups. The steps in a permutationbased computation of the
significance level of a test statistic are as follows: 
 Choose a test statistic, eg. a tscore for a comparison of two groups,
 Compute the test statistic for the gene of interest,
 Permute the labels on samples at random, and recompute the test statistic for the
rearranged labels; repeat for a large number (perhaps 1,000) permutations, and finally,
 Compute the fraction of cases in
which the test statistics from iii) exceed the real test statistic from ii).

The pvalue
for the gene is the fraction of cases in which the randomly permuted samples
give a test statistic for that gene, at least as extreme as the one that occurs
in the properly labelled samples. The idea is that if the gene is distributed
similarly in both treatment and control groups, then the difference statistic
(a tstatistic or any other) will appear about as big in the permuted
arrangement, as in the true arrangement. If the gene levels in the treatment
group are higher than any levels in the control group, then no value of the
permuted statistic will be as great as the true value. 
A permutation
test needs at least two groups of six samples, in order to have enough
different permutations. For two groups of six, there are C(12,6) = 924 permutations that give
different groups; although half of these permutations are mirror images of the
other half, so the true number of distinct pseudoscores is 462. Some
statisticians use balanced permutations: where each pseudogroup has roughly
equal representation from both the true treatment and the true control group.
The true test statistics typically stand out better from this group of permutations,
giving more extreme pvalues, but at the cost of requiring larger numbers of
samples; for example for two groups of six there are only C(6,3)^{2} /2
= 200 distinct balanced pseudogroupings. 

Volcano Plot 
However one chooses to compute the significance
values (pvalues) of the genes, it is interesting to compare the size of the
fold change to the statistical significance level. The ‘volcano plot’ arrange
genes along dimensions of biological and statistical significance. The first (horizontal)
dimension is the fold change between the two groups (on a log scale, so that up and down
regulation appear symmetric), and the second (vertical) axis represents the
pvalue for a ttest of differences between samples (most conveniently on a
negative log scale – so smaller pvalues appear higher up). The first axis
indicates biological impact of the change; the second indicates the statistical
evidence, or reliability of the change. The researcher can then make judgements
about the most promising candidates for followup studies, by trading off both
these criteria by eye. With a good interactive program, it is possible to attach names to genes that appear promising. 


Figure 2. A volcano plot 
GenomeWide Comparisons, Corrected PValues, and False Discovery Rates 
PValues and False Discovery Rates 
Most scientific papers quote pvalues, however
few papers discuss their meaning. In order to understand what the problem is
with quoting pvalues for massively parallel comparisons, we need to be
precise. Let’s consider, for example, a ttest of differences between two
samples. If there is no systematic (real, reproducible) difference between
groups, nevertheless the tscore for differences between groups is never
exactly 0. Common sense cannot decide whether a particular value provides
strong evidence for a real difference. The natural question to ask is: how
often a random sampling of a single group would produce a t value as far from 0
as the t we observed. When you declare an effect is significant at 5%, you say
you are willing to let one false positive sneak in, roughly every twenty tests.
We don’t accept this for critical decisions; we won’t long continue to cross
the street, if we do so on a 95% confidence that there is a break in traffic.
We may call this the false positive rate (FPR); the FPR of a procedure is the
fraction of truly unchanged genes which appear as (false) positives. 
If the aim of the microarray study is to select
a few genes for more precise study, then the goal is an ordered list of genes,
most of which are really different (true positives). Another way to say this is
that the expected number of false positives is some reasonable fraction (for
example less than .3) of the genes selected. This goal leads naturally to
specifying the false discovery rate (FDR) for a list, rather than significance
level (FPR). The FDR is the expected fraction of false positives in a list of
genes selected following a particular statistical procedure. For example, doing
a ttest on all genes, and selecting those whose tscore is above a threshold
of the .05point of the tdistribution would be such a procedure. If following
this procedure in many experiments would give gene lists including twenty per
cent false positives, on average, then the procedure's FDR is 20%. The FDR is
distinct from the false positive rate (FPR), which is the rate at which truly
unchanged genes appear as false positives. If following a particular
statistical procedure on samples with no (really) changed genes gives one
positive (falsely) in every fifth experiment, then the procedure's FPR is 20%. 
Related to the FDR is the qvalue, introduced
by Storey [2]. The qvalue is the smallest FDR at
which a particular gene would just stay on the list of positives. This is not
identical to the pvalue, which is the smallest false positive rate (FPR) at
which the gene appears positive. The pvalue is much stricter than the qvalue. Most of
the researchers, who compute significance of genes by permutations, are actually computing
the qvalue, rather than the pvalue. 
As of yet no conventions have been established
for false discovery rate in published work. An FDR of 5 or 10% should be
acceptable for journal publication of gene lists, in keeping with the practice
of accepting 5% of erroneous single results. For individual followup
experiments, the investigator must decide what's an acceptable waste of time.
If the investigator is selecting only half of the genes for followup based on
a priori biological plausibility, then perhaps a less stringent criterion is
acceptable. 
Multiple Testing PValues and False Positives 
Suppose you compare two groups of samples drawn
from the same larger group, using a chip with 10,000 genes on it. On average
500 genes will appear ‘significantly different’ at a 5% threshold. For these genes,
the variation between samples will be large relative to the variation within
groups due to random, but uneven allocation of the expression values to the
treatment and control groups. Therefore the pvalue appropriate to a single
test situation is inappropriate to presenting evidence for a set of changed
genes. 
The quantile plot can single out a modest
number of genes that are really different, but it is not a rigorous or
systematic procedure, and can’t really handle large numbers of differentially
regulated genes. Statisticians have devised several procedures for adjusting
pvalues to correct for the multiple comparisons problem. The oldest is the
Bonferroni correction; this is available as an option in many microarray
software packages. The corrected pvalue, p_{i}* for gene i
is set to: p_{i}* = Np_{i}, if Np_{i}
< 1, or 1, if Np_{i} > 1; where p_{i} is
the pvalue for a single test of gene i, and N is the number of genes
being tested (which may be less than the number of genes on the array). This is
correct, but too conservative. In practice, few genes meet this strict
criterion, including many, which are known differentially expressed from other
work. Some researchers use a related ‘singlestep’ adjustment procedure: they
find the smallest pvalue, p_{(1)}, which is corrected to Np_{(1)};
then the next smallest pvalue p_{(2)} is corrected to (N1)p_{(2)},
unless this is smaller than Np_{(1)}, in which case it is
corrected to the identical value Np_{(1)}; successively larger
pvalues are corrected in similar fashion. This procedure is still too
conservative; in fact since N is usually over 10,000, and the number of
genes is a few hundred, this procedure gives corrected pvalues that are almost
the same as the Bonferroni. The reason that both these procedures are too
conservative is that test statistics are correlated. 
To give some idea of why correlation makes a
difference, imagine an extreme case: suppose all genes are perfectly
correlated, and not changed between groups. In that case the tests for all the
genes give identical results: the pvalues for one gene are the pvalues for
all. In this case no correction for multiplecomparisons is needed, because all
the tscores either exceed or fall short of the threshold together. For
example, if the researcher sets a threshold at the 5% point, then no genes will
appear as (false) positives in 95% of all experiments testing for differential
expression. Of course, when false positives occur, a great number occur: in 5%
of all experiments, all genes will appear as (false) positives; the false positives
are ‘clumped’. In practice real gene expression levels are highly correlated,
because genes are coregulated; hence the probability of false positives is
much less than calculated by the Bonferroni procedure; on the other hand, when
false positives do occur, they tend to occur in abundance. The number of
extreme test statistics, and therefore apparently significant pvalues, for
correlated genes will be more variable than for independent genes, although it
will have the same longrun average. Therefore for realistically correlated
data, the multiple testing correction of pvalues should be weaker than the
correction for independent genes. For example, suppose repeated experiments are
done with the two groups drawn repeatedly from the same cell line; suppose that
2% of the (unchanged) genes appear significant at the .05 threshold, in 90% of
the experiments, and 40% of the genes appear significant at the .05 threshold,
in 10% of experiments. Suppose we adopt a procedure of selecting
‘differentially expressed’ genes, only if more than 2% of the genes
individually appear significant at the .05 threshold; this procedure will
select false positives only in 10% of the experiments. That is, the corrected
pvalue for more than 2% genes at the .05 level, should be 0.10 (10%). 
This gives us an approach to correcting for
multiple testing: for a group of genes, which appear to differ between sample
types, we ask how often would a group this size exceed the threshold that these
genes exceed? To be precise: for a specific number k and a threshold α,
how often will random sampling
from the same group give at least k single test pvalues will fall under
the threshold for significance level α? In practice we don’t have the
luxury of repeating experiments that
could give us these estimates. However we can calculate the results of an
experiment, where we resample from our existing samples, and calculate how
often large groups of genes appear significant. 
Calculating PermutationBased Corrected Pvalues 
To calculate corrected pvalues, first
calculate singlestep pvalues for all genes: p_{1}, …, p_{N}.
Then order the pvalues: p_{(1)}, …, p_{(N)},
from least to greatest. Next permute the sample labels at random, and compute the
test statistics for all genes between the two (randomized) groups. For each
position k, keep track of how often you get at least one pvalue more
significant than p_{(k)}, from gene k, or from any of the
genes further down on the list: k+1, k+2, …, N. ?After all
permutations, compute the fraction of permutations with at least one apparently
more significant pvalue less than p_{(k)}. This is the
corrected pvalue for gene k. Although this procedure is complicated, it
is much more powerful than the other corrections: that is, the procedure gives
a much smaller corrected pvalue for each gene than the Bonferroni procedure,
and therefore a bigger list of significant genes at any corrected significance
level (specified risk of false positives). This is known as the Westfall–Young
correction <refs>. 
There is a related procedure for computing
false discovery rates in the presence of correlation. The FDR procedure is
simpler, however it is still under some debate. When this debate is resolved,
computing an FDR may be the best way to indicate degree of evidence within the
traditions of classical statistics. However there are several emerging
approaches from the tradition of Bayesian statistics, which seem even more
powerful. 
Empirical Bayes Methods 
Bayesian approaches make assumptions about the
parameters to be estimated (such as the differences between gene levels in
treatment and control groups); used intelligently these assumptions can make
use of prior experience with microarray data. A pure Bayes approach assumes specific
distributions (prior distributions) for the mean differences of gene levels,
and their standard deviations. The empirical Bayes approach assumes less,
usually that the form of these distributions is known; the parameters of the
prior distribution are estimated from data. As we get more detailed knowledge
of the variability of individual genes, it should become possible to make
detailed useful prior estimates based on past experience. A simple empirical
Bayes approach to identifying changed genes runs as follows, in two stages. 
We
believe that all unchanged genes have a range of variances that follow a known
prior distribution. We estimate the mean of that distribution as the mean of
the experimental variances for all the genes. We estimate the variance of that
distribution as the variance of the gene variances. Both of these estimates
assume that a small minority of genes are actually changed. Then we consider
each gene individually. We would like an estimate of the variability of that
gene that is more reliable than its’ sample standard deviation. One way to
derive that is to work backward from the assumption that the gene variances
follow the known distribution. We can compute the probability of any value for
sample variance, given a supposed value for the (unknown) true variance of that
gene. If we combine that with the prior assignment of probability for each true
variance of a gene, we can work out the probabilities that various possible
true values underlie the one observed value of the variance. Not surprisingly
it is more probable that a very high sample variance comes about by an
overestimate of a moderately high variance, than a true estimate of a very
high variance, simply because there are many more moderate variances than high
variances, and overestimates are not very uncommon compared to precise
estimates. In the Bayesian tradition we construct a density function for the
probability function for the true value of the gene variance (called a
posterior distribution). To come up with a single value we might estimate the
expected value of the posterior distribution. By considerable algebra, this
expected value turns out to be a weighted average of the sample variance for
the gene, and the mean of the prior distribution, which was set to the mean of
all the sample variances. Thus we can obtain a theoretically more reliable
estimate of variation, by pooling information about variation, derived from all
genes simultaneously. This more reliable estimate of variation should then
translate into more reliable tscores, with more power to pick up moderate
differences in gene abundance between samples. 
In practice this works not badly at all. The
empirical Bayes estimate of variance is closer to the mean variance, more often
than the sample variance, which means that the moderated tscore is more likely
to have a moderate (not very small, not very large) value than a true tscore.
However the tails of the moderated t distribution are fairly close to the tails
of a t distribution based on a more reliable variance estimate, that is, with a
higher number of degrees of freedom. This means that a given difference in
means often gives a more significant pvalue, and that more genes are selected
at a particular threshold. However there is more possible, again if we are
willing to make another fairly precise assumption. 
The second stage of the empirical Bayes
procedure is to use prior beliefs about the number of genes that will be
changed. Suppose that based on past experience we believe that some fraction p1
of genes are actually changed by the treatment, and the remaining fraction p0 =
1  p1 are unchanged. Then we examine the distribution of the pvalues from all
the tscores from all the genes in the experiment (not just the significant
ones). This is better done with the raw (rather than moderated) tscores, since
the distribution of moderated tscores doesn’t quite match near 0, where the
larger pvalues are. The way that pvalues are constructed for a tscore, we
should find even numbers of pvalues in any interval of the same length. We
find generally that there are fewer pvalues in the range 0.5 to 1.0 than there
should be (there should be N/2 such pvalues), and correspondingly more in the
range 0.0 to 0.5. The natural assumption is that some genes are missing from
the range 0.5 to 1.0, because they are really differentially expressed. The
number of genes remaining, out of the total N/2 that we would expect, gives an
estimate of p0, the fraction of genes that are unchanged. Let's look at the
fraction of pvalues between .5 and 1, say M. Then p_{0} ≈ M/(2*N). 
Once we have prior empirical estimates of p1
and p0, we can construct the odds ratio for change in expression versus
unchanged; that is, we can estimate the probability of getting any particular
tscore if the gene is changed, and compare that to the probability if the gene
is unchanged. The ratio of probabilities is the odds, and we imagine placing a
bet on those genes with odds ratios in favor of change, above a particular
threshold. The probability of getting a particular tscore from an unchanged
distribution is the tdistribution. The probability of a particular tscore
given a changed gene depends on the magnitude of the change. This would be easy
to work out, if say we were interested only in genes with a twofold or greater
change. Then the probability distribution can be worked out. If we want to
detect genes of any change, then we need to specify a prior distribution for
the size of the foldchanges among genes. This is very difficult to estimate empirically. 
Another approach to this does away with the
assumption of a specific fold change in the changed genes, at the cost of
considerably increased complexity. The distribution of tscores from the
unchanged genes is of course a tdistribution. The distribution of tscores
from changed genes is some other unknown distribution, presumably with wider
tails, and possibly bimodal (see figure). The distribution of all scores is the
mixture of these two distributions with mixture parameters p_{0}, p_{1}.
The density function of all the tscores is p_{0}f_{0}(t) + p_{1}f_{1}(t).
Then for a given gene with tscore t_{i}, the posterior probability of
belonging to class 1, is given by Bayes' formula: P(1t) = P(t & 1) / P(t)
= p_{1}f_{1}(t) / f(t) = 1  p_{0}f_{0}(t)/f(t),
where f_{0} is the known tdistribution and f is the empirical
distribution. In practice this is reasonably robust provided that the tails of
f are fairly thick, and that a wellconditioned estimate for the density is
used. It's best to fit the distribution function with a smooth curve and take a
first derivative, insisting that it be monotone. 
Several Groups – Analysis of Variance 
Many current microarray studies compare more
than two groups. Sometimes the question is to determine differences among three
or more cell lines, or strains of experimental animal. Another common design
compares the effect of a particular treatment (often a ligand for a receptor),
on cell lines (or animals) with wildtype and mutant versions of the receptor.
Usually the experimenter wants to know which genes are actively regulated
during treatment in both cell lines, or wants some criterion for selecting
those that are differentially regulated among groups. These questions belong in
the tradition of analysis of variance (ANOVA). 
Generally, all of the procedures that were
discussed above in the context of twosample comparisons, carry over to
analogues in ANOVA. However the ANOVA analogues are more complex, because many
hypotheses are being tested, and some nested within others. Especially in the
factorial design case, there is no obvious way to do permutation testing to
obtain genomewide pvalues for the interaction (2^{nd} order) effects.
Several researchers have suggested permuting the residuals from a fit of the
main effects to obtain a permutation distribution. 
Analysis
of variance is complex and many researchers don’t bother; they do ttests on
the contrasts that interest them. This procedure is less effective than
analysis of variance for two reasons. The power of a ttest to pick up a
difference increases with greater confidence in the denominator – the estimate
of variability. ANOVA computes a consensus estimate of variability within
groups, based on all the groups. This estimate, based on more information, has
more confidence (greater degrees of freedom) than the variability estimated
between two samples alone. Thus for the same degree of difference, and the same
variability estimate, the ANOVA will pick up more differences than the ttest.
It’s possible to use the consensus estimate of variability in the tscore
denominators for all tests. 
Sometimes
when researchers compare treatment effects on WT and mutant, they look for
genes that are significant on one list and not on the other. This procedure is
very fallible; if genes are changed by treatment to the same extent in both
samples, according to statistical theory, the significance levels should
fluctuate considerably. Many should appear on one list but not on the other.
The ANOVA for a factorial design is the most efficient way of identifying the
true changes in regulation under treatment among the many noisy genes. 


1. Tusher,
V.G., R. Tibshirani, and G. Chu, Significance
analysis of microarrays applied to the ionizing radiation response. Proc
Natl Acad Sci U S A, 2001. 98(9): p.511621. 
2. Storey,
J.D. and R. Tibshirani, Statistical
significance for genomewide studies. Proc Natl Acad Sci U S A, 2003. 100
(16): p. 94405. 