Exploratory Analysis


Exploratory analysis aims to find patterns in the data that aren’t predicted by the experimenter’s current knowledge or pre-conceptions. Some typical goals are to identify groups of genes expression patterns across samples are closely related; or to find unknown subgroups among samples. A useful first step in all analyses is to identify outliers among samples – those that appear suspiciously far from others in their group. To address these questions, researchers have turned to methods such as cluster analysis, and principal components analysis, although these have often been used inappropriately.
The first widely publicized microarray studies aimed to find uncharacterised genes, which act at specific points during the cell cycle. Clustering is the natural first step in doing this. Unfortunately many people got the impression that clustering is the 'right' thing to do with microarray data; the confusion has been perpetuated, since many software packages have catered to this impression. The proper way to analyze data is the way that addresses the goal at which the study was aimed. Clustering is a useful exploratory technique for suggesting resemblances among groups of genes, but it’s not a way of identifying the differentially regulated genes in an experimental study.


After that disclaimer, and after all preprocessing steps, there are two main types of analysis; Unsupervised Analysis and Supervised Analysis. It is considered Supervised Analysis when class lables are known a priori and are used in the analysis is such a way as to identify differentially expressed genes, or to develope a class prediction model. Unsupervised Analysis methods do not use any external class or phenotypic variables external to the gene expression data that are used in the analysis; that is to say we wish to learn from the gene expression data only. Also, unsupervised analyses are commonly used to determine whether there are previously unrecognized phenotypic groups identifiable only using gene expression data. Suppose that we want to find groups of similar genes or similar samples, how do we go about it? Clustering depends on the idea that differences between gene expression profiles are like distances; however the user must make (somewhat arbitrary) choices to compute a single measure of distance from many individual differences. It is important to determine what distance measure is appropriate for the data at hand before choosing a clustering algorithm, because the resulting preprocessed microarray values themselves are not directly used in the clustering algorithm. Instead, these expression values are converted to distance, or dissimilarities, and then clustering is applied. For visualization purposes it may be helpfull to see how the data from a gene expression microarray experiment is stored as a matrix (see figure one).
Figure 1. Gene Expression Matrix
For the above figure (figure 1) note that the Y–axis may be either genes or probe sets. Also, realize that for the first value (the square in the upper right hand corner of the matrix) corresponds to the expression for the first gene for the first sample. Moving one square to the right, represents the first gene for the second sample. This same convention applies moving veritcally within the matrix as well.
Different procedures emphasize different types of similarities, and give different resulting clusters. Four choices you have to make are:
  • what scale to use: original scale, log scale, or another transform,
  • whether to use all genes or to make a selection of genes,
  • what metric (proximity measure) to use to combine the scaled values of the selected genes, and
  • what clustering algorithm to use.
It is not clear what is the most appropriate scale for multivariate exploratory techniques, such as clustering and PCA (see below). Differences measured on the linear scale will be strongly influenced by the one hundred or so highly expressed genes, while being only moderately affected by the hundreds of moderate–abundance genes. The thousands of low–abundance genes will contribute little. Often the high-abundance genes are 'housekeeping' genes; these may or may not be diagnostic for the kinds of differences being sought. A housekeeping gene is any gene involved in basic functions needed for the sustenance of the cell; housekeeping genes are constitutively expressed, that is to say they are always turned 'ON'. On the other hand, the log scale will amplify the noise among genes with low expression levels. If low-abundance genes are included (see below) then they should be down-weighted. In the author's opinion, the most useful measure of a single gene difference is the difference between two samples, relative to that gene's variability within experimental groups; this is like a t-score for difference between two individuals.
Gene Selection
It would be wise not to place much emphasis on genes whose values are uncertain. These are usually those with low signals in relation to noise, or which fail spot-level quality control. If the estimation software provides a measure of confidence in each gene estimate, this can be used to weight the contribution to distance of that gene overall. It's not wise to simply omit (that is, set to 0) distances which are not known accurately, but it is wise to down-weight relative distances if several are probably in error. A general rule is that genes whose signal falls within the background noise range are probably contributing just noise to your clustering (and any other global procedure); therefore one may discard them.
Metrics (Proximity Measures)
Proximities are the imputs to a clustering algorithm, such as distances, dissimilarities, and similarities. In order for a distance to be considered metric we require:
Most cluster programs give you a menu of distance measures: Euclidean, Manhattan distances, and some relational measures: correlation, and sometimes relative distance, and mutual information. The most commonly used distance measure in many clustering algorithms is Euclidean Distance, that is the straight-line distance; some may prefer to say the root of the sum of squares, as in geometry. More specifically the Euclidean distance is computed as follows:
Some biologist prefer to stadardize (when clustering genes) the gene expression vectors first, then calculate d(i,j), more specifically:
, where S^2 is the pooled variance.
, where R is the range.
Manhattan Distance is the sum of linear distances (like navigating in Manhattan), some refer to it as the 'city block' distance. More specifically:
The aforementioned distance measures are all part of general class known as Minskowski Distances, this class of distances are all of the form:
Another type of proximity measure is a similarity measure. Similarities are used when gene expression profiles represent comparative expression measures, causing the Euclidean, and other, distance measures to not be as meaningful. Therefore, we commonly use a distance measure that scores based on similarity between genes. It is important to note that for similarities, the more objects i and j are alike, the larger the measure, this is to say:
The correlation distance measure is actually 1-r, where r is the correlation coefficient. Probably a more useful version is 1 – |r|; negative correlation is as informative as positive correlation. One common type of correlation is Pearson's Correlation which must be transformed to meet properties of a similarity measure. More specifically the Pearson's correlation is calculated as follows:
, where
Typically transformations are of the form:
, negative p's are dissimilar
, negative p's are similar.
The choice of which transformation to use should be based on subject matter knowledge. It is important to note that if the similarities have been calculated and one wishes to cluster the data, similarities may be converted to dissimilarities as follows:
The mutual information (MI) is defined in terms of entropy: H = Σp(x)log2(p(x))for a discrete distribution {p}. Then MI(g1,g2) = H(g1) + H(g2) – H(g1,g2) for genes g. This measure is robust – not affected by outliers. However it is tedious to program, because it requires adaptive binning to construct a meaningful discrete distribution.
By and large there are no theoretical reasons to pick one over the other, since we don't really know what we mean by 'distance' between expression profiles. The author prefers to use 'Manhattan' metrics for clustering samples by similarity of gene expression levels, and to use a correlation measure to cluster genes. Most of these measures are fairly sensitive to outliers, except mutual information. Robust versions of these measures can easily be constructed by a competent statistician, but are not available in most software packages. However we do get different results depending on the algorithm we use, as shown below for a study with 10 samples: two normal samples and two groups of tumor samples.
Clustering of the same data set using four different distance measures. All genes were on a logarithmic scale, and only genes with an MAS 5 'Present' call in 8 out of 10 samples were used (Affymetrix data). The four measures are listed in the titles; 'relative' is |x-y|/|x+y|.
Clustering Algorithms
To better understand clustering it may help to discuss filtering briefly. filtering is an attempt to exclude probe–sets or genes based on how much they vary. Some different types of filtering use different criteria. Some may retain genes whose absolute deviation exceeds a predefined threshold (the type of array is important to keep in mind when setting this threshold). Some methods retain genes who have an expression value above a predefined threshold. Others retain probe–sets declared 'Present'among a certain proportion of Affymetrix GeneChips, say 70%. While still other methods retain only well characterized genes. Most biologists find hierarchical clustering more familiar, and other algorithms somewhat magical. Some different types of hierarchical clustering are agglomerative hierarchical clustering, divisive heirarchical clustering, and Ward's method. With agglomerative hierarchical clustering each object starts as a cluster of size one. Once the distances between all pairs of clusters is calculated there are several ways to group the clusters. Hierarchical clustering groups the two clusters that are most similar (that is, the two that have the smallest distance. Another way to cluster would be using the single–linkage approach, also thought of as the nearest neighbor. There is also the average–linkage approach which averages all pairwise distances. Finally, there is a complete–linkage approach, which uses a maximum distance. Divisive hierarchical clustering starts with all objects in one large cluster, then proceeds by removing that the object from the cluster that is most dissimilar. Ward's method seeks to minimize the within–cluster sum of squares at each step; the union of every possible cluster pair is considered and the two clusters whose fusion results in a minimum increase in "information loss" (that is the error sum of squares) are combined. One potentially misleading artifact of hierarchical clustering is at the end, all objects are ultimately in one large cluster; it is possibly that one, or some, of the objects are totally unrelated, yet the algorithm will display them together. Statisticians object to hierarchical clustering because it seems (falsely) to imply descent; however this is a quibble: all of the common clustering methods are based on models which don't really apply to microarray data. There is no reason to suspect that genes (or samples) naturally cluster in a hierarchical way. That is, it may not be natural to impose a nesting structure if the primary purpose is to ID naturally occuring groups in the data. Another approach to clustering is use partitioning methods. Partitioning methods partition the set of n objects into a prespecified number of clusters; two such methods are K-means and Partitioning Around Medoids (PAM). The K-means method requires that K, the number of clusters, be pre–specified. Once K has been determined, initial clusters are defined (typically at random), each object is assigned to be a member of one of the K-clusters or K-random clusters are identified and objects are assigned to the closest of k clusters. The specific steps taken when using the K-means clustering method are as follows. (ONE) Specify K (recall K is the random assignment of objects to be clusters). (TWO) Calculate the center of each cluster. (THREE) For each object, calculate the distance between that object and the K centers. (FOUR) Assign each object to that cluster which minimizes the distances. The aforementioned steps are repeated untill no assignments change. The PAM method differs slightly, rather than use means of the currently defined clusters, the cluster center is restricted to be one of the observations within the cluster. While K–means requires the raw data to be available, PAM only needs the distances/dissimilarities rather than the raw data. The specific steps taken when using the PAM clustering method are as follows. (ONE) Specify K. (TWO) For a given cluster assignment C, find the observations in the cluster that minimizes the total distance to all other points within the cluster. (THREE) For a current set of K–medoids, minimize the total error by assigning each observation to that cluster with minimum distance (that is assign each observation to the closest medoid). The aforementioned steps are repeated untill no assignments change or you exceed the maximum number of iterations allowed by the software. It is important to note that since K is a parameter decided on by the technician or statistician that not all values of K will result in natural groupings. It is advisable to run the algorithm several times with different values of K and select that K for which certain characteristics are optimal, or to retain the clustering that gives rise to the most meaningful interpretation. There are some ways to check the appropriateness, or goodness of fit, for a given clustering. One such check is to compare the size of the cluster against the distance of the nearest cluster. One may also deem a cluster assignment as trust worthy if the inter–cluster distance is much larger than the size of the clusters. Yet another method for estimating the appropriate number of clusters is to estimate a homogeneity or a seperation quantity for various values of K and then look for a leveling off. The equations for the aforemention quatities are as follows. For the homogeneity quantity the average distance between each observation and the center of the cluster to which it belongs is calculated by:
For the separtation quantity the weighted average distances between the clusters is calculated by:
One may also use a silhouette, a composite index reflecting the compactness of, as well as, the seperation between clusters. Useful statistics resulting from this are the medoids (and the co-ordinates), the diameters and separation of the clusters. Broadly speaking, the differences between clustering methods show up in how ambiguous cases are assigned; if you have very many ambiguous cases you'll see great differences; if so, then maybe clustering isn't appropriate anyway, because the data don't separate into groups naturally. The k-means and SOM methods require the user to specify in advance the number of clusters to be constructed. Of course you don’t know ahead of time, most people end up trying out many values. A criterion that some people use to decide how many clusters to use is to track how much the intra-group variance drops at each addition of another cluster; then going back to the point where the rate of decrease really slows down. More advanced methods allow clusters to overlap, as often happens in biology, or to omit some outlying genes. To recap, a list of all commonalities between the various clustering methods discussed is presented here:
  • Objects to be clusters must be selected;
  • The variables to be used must be selected, and need to contain enough information for accurate clustering;
  • The researcher must decide whether or not to standardize the raw data;
  • A dissimilarity measure must be chossen;
  • A clustering technique needs to be ascertained;
  • The number of clusters needs to be ascertained;
  • One must interpret the meaning of the clusters in terms of research objectives.
Statistical significance of clusters by bootstrapping
An important question, but rarely asked, is whether the clusters obtained from a procedure depend on the overall pattern of gene expression, or on a few samples; they could be very different if one or two samples are omitted. One approach to address this is the Bootstrap: you re-cluster many times, each time re-sampling conditions or genes from your original data, and then derive new clusters of genes or conditions. A variant is known as Jack-knife analysis. Branches in a hierarchical cluster that are supported by a large fraction of the re-sampled data sets are considered fairly reliable. A reasonable figure is 70%, but this is arbitrary, like the often-quoted 5% level of significance.
With any exploratory technique, one should think about what technical variable may underlie the groups discovered this way, before going to the lab to confirm findings. The author finds that clustering most often identifies systematic differences in collection procedures or experimental protocol. These are important but not biologically significant. Even when the difference is biological, it may not be a discovery. Most sets of breast cancer data segregate into ER+ and ER- in clustering, which is re-assuring but hardly news.

Principal Components and Multi-dimensional scaling

Several other good multivariate techniques can help with exploratory analysis. Many authors suggest principal components analysis (PCA) or singular value decomposition to find coherent patterns of genes, or ‘metagenes’, that discriminate groups. These techniques with a long history in the statistical arsenal rely on the idea that most variation in a data set can be explained by a smaller number of transformed variables; they each form linear combinations of the data, which represent most of the variation, and in principle these approaches, are well-suited for this purpose. However this author believes these approaches are delicate enough that they are not very often useful for deep exploratory work; often the largest coherent component, such as the first principal component (PC), reflects mostly systematic error in the data. In fact some researchers have seriously suggested normalizing microarray data, by removing the first PC. PCA is not terribly robust to outliers, which are common. Like cluster analysis, the results of PCA are sensitive to transforms, which are somewhat arbitrary.
These multivariate approaches are more useful for exploring relations among samples, and particularly for a diagnostic look at samples before formal statistical tests. Multi-dimensional scaling (MDS) makes the most useful graphical displays. Classical MDS is identical to PCA for most data sets; however if you fix a dimension, a modern iterative algorithm can place the points in an arrangement that is more representative of true distances, than are the same number of principle components. It’s worth getting the ‘strain’ parameter for the MDS fit: this parameter measures the discrepancy between the distances computed by the metric, and the distances represented in the picture. When this parameter is much over 15%, the picture can be misleading. Often if you omit outlying samples, or take a major subgroup, the picture is a more accurate representation of the computed distances.
When the data conform to expectation the MDS plot shows well-defined groups. Figure ** shows an MDS plot of seven groups in a comparative study; the control group is in black; three of the groups are quite similar to controls and each other, while three others are quite distinct from controls and each other. This experiment showed many genes under very clear regulation.
Things aren’t always so happy. The following figure ** shows a group of experiments which might have been misinterpreted. Replicate experiments comparing three treatments against three controls were done on two different dates. Results were not consistent between experiments, but the MDS plot shows that we can work with the day 2 data quite confidently, and separately from day 1. The cluster analysis on the left doesn’t show this nearly as clearly.
Representing two batches of 3 treatment (T) and 3 controls (C), done on two different dates: 1T3 represents 3rd treated sample on day one. We can see that the day 1 chips cluster together, and are displayed together by MDS. However the day 2 genes seem to fall into two distinct clusters, which don’t divide neatly along T and C. The MDS plot shows that the C samples on day 2 are in fact quite close, whereas the 3 T’s are more disparate but all quite different from the C’s.