This section covers the low-level preprocessing steps from the point at which the chip is scanned to the point at which reliable estimates of the relative abundance of each gene in all of the samples are obtained. Broadly, these steps may be classified as image analysis, quality control, background correction, and normalization, although the procedures are interdependent and not always performed in this order. It is important to understand that the primary goal of array image processing is to measure the intensity of the spots and then use these intensity measures to quantify the gene expression values. It is also important to have a means of testing and assessing data reliability. A few moments will be taken to explain some basics of digital imaging.
Basics of Digital Imaging
Converting an analog image to a digital image is a combination of sampling and quantification. Sampling is the process of reading values at regular intervals from a continuous function. It may help to think of an analog image as a continuous 2-dimensional function. When this analog image is scanned in, say, the horizontal direction, the variation of intensity along that direction forms a one-dimensional intensity function. Scanning the 2-dimensional function in both the horizontal and the vertical directions (in other words, sampling the image) captures a digital image, that is, a rectangular array of discrete values (see figure 1).
Figure 1. A digital image is a 2-dimensional array of pixels. Each pixel has an intensity value (represented by a digital number) and a location address (referenced by its row and column numbers). In this figure the yellow square represents one pixel.
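The sampling process described above can be made concrete with a minimal sketch, assuming a hypothetical continuous intensity function (here a Gaussian "spot", `analog_spot`, with made-up parameters): reading the function at regular intervals produces the rectangular array of discrete pixel values that is the digital image.

```python
import math

def sample_image(f, width, height, step=1.0):
    """Sample a continuous 2-D intensity function f(x, y) at regular
    intervals, producing a rectangular array of discrete pixel values."""
    return [[round(f(col * step, row * step))
             for col in range(width)]
            for row in range(height)]

# A hypothetical analog "spot": a bright Gaussian bump centered at (5, 5).
def analog_spot(x, y):
    return 255 * math.exp(-((x - 5) ** 2 + (y - 5) ** 2) / 8)

pixels = sample_image(analog_spot, width=11, height=11)
print(pixels[5][5])   # the pixel at the spot center holds the peak intensity, 255
```

Shrinking `step` (sampling more finely) raises the resolution of the resulting array; sampling too coarsely is what produces the aliasing distortions discussed below.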
Each picture element is called a pixel, and this process of converting an analog image to a digital one is called digitization. It is not necessary to understand the details of how computer systems store images; however, one should know that a more capable imaging system can produce higher quality images. Distortions, referred to as aliasing effects, will occur if a system cannot sample the analog image at a sufficient rate. In other words, the clarity and accuracy of a digital image depend on the capabilities of the technology used to create it. Naturally, the best available resolution is preferred. A general rule of thumb is to have the pixel size be approximately 1/10 of the spot diameter. Given an average spot diameter of at least 10 to 12 pixels, each spot will have a hundred pixels or more in the signal area, which is usually adequate for appropriate statistical analysis and helps with segmentation. Segmentation is the process of classifying pixels as either foreground (signal pixels) or background (this is reviewed further in the next section).
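The arithmetic behind this rule of thumb can be checked directly (the spot and pixel dimensions below are illustrative values, not a standard): a spot 10 to 12 pixels across encloses roughly the area of a circle of radius 5 to 6 pixels.

```python
import math

def signal_pixels(spot_diameter_um, pixel_size_um):
    """Approximate number of pixels inside a circular spot:
    the area of the circle in units of one-pixel areas."""
    radius_px = (spot_diameter_um / pixel_size_um) / 2
    return math.pi * radius_px ** 2

# Rule of thumb: pixel size roughly 1/10 of the spot diameter.
print(round(signal_pixels(100, 10)))   # 10-pixel diameter -> ~79 signal pixels
print(round(signal_pixels(120, 10)))   # 12-pixel diameter -> ~113 signal pixels
```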
Microarray Image Processing
A typical two-channel (two-color) microarray experiment involves two samples, often a sample of interest and a control. These might be a disease sample and a healthy sample, or an experimentally treated sample and an untreated sample. RNA from the two samples is reverse transcribed to obtain cDNA. The products are then labeled with fluorescent dyes; say, red for the sample of interest and green for the control. The labeled cDNAs are then hybridized to the probes on the glass slides, which are in turn scanned to produce digital images (as discussed above). When a gene is expressed abundantly in the sample of interest and hardly at all in the control sample, it appears as a red spot. When a gene is expressed in the control sample and not in the sample of interest, the spot appears green. When a gene is expressed in both samples it appears as a yellow spot, and genes expressed in neither sample appear as black spots. These possibilities are illustrated below (see figure 2).
Figure 2. A general scheme for a cDNA microarray. (1) mRNA is extracted from both a control and a test cell. (2) Using reverse transcriptase, the mRNA is made into cDNA and tagged with fluorescent colors. (3) The cDNA is exposed to the microarray, already spotted with the genes of interest. (4) Computers scan the microarray, looking for patterns of color. (Image courtesy of Bioteach)
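The four color possibilities reduce to a simple rule on the two channel intensities, sketched below; the intensity values and the `threshold` cutoff separating "expressed" from "not expressed" are arbitrary illustrative choices, not values from any particular scanner.

```python
def spot_color(red, green, threshold=200):
    """Classify a spot by its two channel intensities; `threshold` is an
    arbitrary cutoff separating 'expressed' from 'not expressed'."""
    r_on, g_on = red > threshold, green > threshold
    if r_on and g_on:
        return "yellow"   # expressed in both samples
    if r_on:
        return "red"      # expressed only in the sample of interest
    if g_on:
        return "green"    # expressed only in the control
    return "black"        # expressed in neither

print(spot_color(5000, 80))    # red
print(spot_color(4000, 3500))  # yellow
```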
The final product of the process illustrated above (found in the lower right-hand corner of the diagram) is a synthetic image obtained by overlaying the two channels. A great deal of information can be contained in these images (see figure 3).
Figure 3. A two-color microarray experiment, showing how gene expression can be altered by a disease such as cancer.
Hybridized arrays, like the one above, are scanned to produce high resolution TIFF files. These files are then processed with specialized software to produce quantitative intensity values for the spots and their local background on each channel. The goal is to produce a large matrix or data frame of the expression data; traditionally, the genes are represented by the rows, while the conditions are represented by the columns. Ultimately, through analysis and data mining, one should be able to extract biologically relevant information from the quantitative data.
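A minimal sketch of building such a genes-by-conditions matrix, using entirely made-up intensity values for two hypothetical genes on two arrays; each entry is the log2 ratio of the red to the green channel, a common summary for two-color data.

```python
import math

# Hypothetical quantified intensities: {gene: (red, green)} for each array.
arrays = {
    "array1": {"geneA": (8000, 1000), "geneB": (500, 500)},
    "array2": {"geneA": (6000, 1500), "geneB": (400, 800)},
}

genes = sorted({g for a in arrays.values() for g in a})
conditions = sorted(arrays)

# Rows are genes, columns are conditions; entries are log2(red/green).
matrix = [[round(math.log2(arrays[c][g][0] / arrays[c][g][1]), 2)
           for c in conditions]
          for g in genes]

print(genes, conditions)
print(matrix)   # geneA is up-regulated (positive entries); geneB is not
```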
Although the lab technician usually handles this step (using the default settings of an image quantification program), the program and the settings can have a noticeable impact on the noise level of the subsequent estimates. There are several steps in image quantification:
- Array Localization – laying the grid; finding where the printed spots ought to be in the image;
- Image Segmentation – identifying the extent of each spot, and separating foreground from background;
- Quantification – summarizing the varying brightnesses of the pixels in the foreground of each spot;
- Dealing with scanner saturation; and
- Dealing with variable backgrounds.
It would seem that array localization, or spot finding, would be simple and straightforward, given that one knows in advance the number of spots, their size, and the pattern used to print them. In the simplest scheme, the technician interactively positions a circle of the appropriate size over each spot, and the quantification program then classifies all pixels falling within the circle as foreground (signal pixels) and all pixels falling outside the circle as background. In practice the program must deal with several problems: the probes are rarely uniform in size and shape, so most programs try to adapt to the sizes, and sometimes shapes, of individual spots. Three different methods used by programs for addressing (gridding), the process of identifying the locations of the probes on the surface of the microarray, are fixed circle, adaptive circle, and adaptive shape. Fixed circle uses the manufacturer's diameter specification for each spot; all pixels within the circle are considered foreground, while all pixels outside it are considered background. Adaptive circle starts with a fixed circle diameter, then re-estimates the diameter of each spot using the intensities of the surrounding pixels. Adaptive shape uses spot coordinates and intensities to find the "best fitting shape" for each spot. Sometimes poorly stabilized neighboring probes bleed into the surrounding region, which may invert the relationship between foreground and background (see figure 4). Three different types of spot finding are:
- Manual Spot Finding – With computer assistance, a lab technician determines the circle size for each individual spot. In the early days of microarray analysis this was the only method, but it is now essentially obsolete: it is extremely time consuming and introduces possible noise through human error.
- Semi-Automatic Spot Finding – This method still requires some technician assistance, but it is vastly superior to manual spot finding in time and labor. Algorithms automatically adjust the location of the grid lines, or of individual grid points, after the technician specifies the approximate location of the grid.
- Automatic Spot Finding – This method requires the least time and labor on the part of the technician, who only has to provide an array configuration, that is, the number of rows and columns of spots. Given this information, the processing system searches the image for the grid position. This method minimizes human error while providing the most consistency.
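One common way such a system can locate the grid automatically is with projection profiles: summing pixel intensities along each row and each column of the image, so that peaks in the resulting 1-D profiles mark the rows and columns where spots lie. The tiny synthetic image and the peak threshold below are illustrative assumptions, not part of any particular software package.

```python
def projection_profiles(image):
    """Sum pixel intensities along each row and each column; peaks in
    these 1-D profiles mark the likely grid rows and columns of spots."""
    row_profile = [sum(row) for row in image]
    col_profile = [sum(col) for col in zip(*image)]
    return row_profile, col_profile

def peaks(profile, threshold):
    """Indices whose summed intensity exceeds the threshold."""
    return [i for i, v in enumerate(profile) if v > threshold]

# Tiny synthetic image: four bright spots on rows 1 and 3, columns 1 and 3.
img = [[0, 0, 0, 0, 0],
       [0, 9, 0, 9, 0],
       [0, 0, 0, 0, 0],
       [0, 9, 0, 9, 0],
       [0, 0, 0, 0, 0]]

rows, cols = projection_profiles(img)
print(peaks(rows, 5), peaks(cols, 5))   # [1, 3] [1, 3]
```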
Image segmentation is the process of partitioning an image into a set of non-overlapping regions whose union is the entire image. Put differently, it is the process of classifying pixels as foreground (signal pixels) or background. It is important to realize that only the pixel intensities determined to be foreground are used in calculating the signal; pixels determined to be background are considered noise, which should, of course, be eliminated. There are several different types of segmentation:
- Pure Spatial-Based Signal Segmentation – This is perhaps the least reliable method of segmentation listed here. Two concentric circles, one slightly smaller than the other, are placed over each spot location (by a technician or by software). The software considers any pixels inside the smaller circle to be foreground, and any pixels outside the larger circle to be background (a limiting square encases both circles, and the area between the larger circle and the edges of this square is counted as background). One obvious shortcoming of this method is that some spots will be smaller than the smaller circle, causing empty space to be counted as signal intensity. Also, the background intensity values may be distorted by foreign contaminants, or by a spot whose outer diameter is greater than that of the larger circle.
- Intensity-Based Segmentation – This method is more reliable than the previous one, though it too has its shortcomings. Even on a computing system of only moderate power, it gives results with more speed and simplicity than other segmentation methods. Specifically, this method determines which pixels to use as signal by considering pixel intensity alone. Working from the assumption that foreground pixels will be, on average, brighter than background pixels, the software takes the brightest 20% of the pixels for a given spot as foreground, or signal pixels, and treats the remaining pixels as background. Spots with low intensities, or arrays with foreign contamination, can really throw this method off, especially if exterior pixels (which should be classified as background) are artificially bright.
- Mann–Whitney Segmentation – This method is a combination of the two mentioned previously. Once again a circle is placed in a pre-specified target area where a spot is known to be. Treating the pixels outside the circle as background, one may use the statistical properties of those background pixels (in this case via a Mann–Whitney test) to ascertain which foreground pixels should be used as signal pixels. The Mann–Whitney test gives technicians an intensity level that is used to separate signal pixels from the rest. This method will classify a pixel as signal regardless of whether it lies within the area of the spot. As always, any type of foreign contamination will drastically affect the ability to accurately determine which pixels are signal pixels to be used for further analysis.
- Combined Intensity–Spatial Segmentation (or the trimmed measurements approach) – This method is extremely similar to the Mann–Whitney method. Given this similarity, it is simply mentioned here as another possible method; one may read further about it in any microarray text.
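The intensity-based approach above can be sketched in a few lines; the pixel values below are made up, and real software would operate on the pixels extracted from a gridded spot region rather than a hand-written list.

```python
def intensity_segmentation(pixels, signal_fraction=0.2):
    """Classify the brightest `signal_fraction` of pixels as foreground
    (signal) and the rest as background."""
    ranked = sorted(pixels, reverse=True)
    n_signal = max(1, int(len(ranked) * signal_fraction))
    return ranked[:n_signal], ranked[n_signal:]

# Ten pixels from a hypothetical spot region: two clearly bright, rest dim.
pixels = [12, 15, 240, 8, 10, 235, 14, 9, 11, 13]
fg, bg = intensity_segmentation(pixels)
print(fg)   # [240, 235] -- the brightest 20% become the signal pixels
```

Note how a single artificially bright background pixel (say, a dust fleck of intensity 250) would displace a genuine signal pixel from `fg`, which is exactly the contamination weakness described above.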
Figure 4. Some problems that quantification must deal with
The different programs each try different approaches to these problems, and those approaches make a difference in the reproducibility of gene expression measures. A study done in California showed considerable differences among the results of different quantification programs applied to 8 arrays that compared the same two samples; several different settings were used on most programs. Since the ratios should be identical, the standard deviation is an (inverse) measure of quality.
Figure 5. Noise levels from the same 8 cDNA slides, as
quantified in many different ways (courtesy Jean Yang, UCSF).
The previous figure (figure 5) shows that background subtraction added to the noise in those measures; many researchers have had similar experiences. On the other hand, it seems wrong in principle to ignore background, and some substrates (e.g. poly-L-lysine) show substantial additive background; ignoring this leads to substantial bias in the estimates of most gene ratios. At the moment there is no consensus, but two suggestions may help for the time being.
It may be pragmatic to skip background correction if the goal is to detect a few differentially expressed genes among a multitude of noisy, similarly expressed genes; for this purpose noise control is most important. On the other hand, when it comes time to estimate the fold change, subtracting background, or applying some other form of background correction, gives more accurate estimates.
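The bias from ignoring an additive background is simple arithmetic; the intensities below are invented to make the effect obvious.

```python
# Hypothetical true signals of 400 and 100, each channel inflated by an
# additive background of 100. The true fold change is 400/100 = 4.0.
background = 100
red_raw, green_raw = 400 + background, 100 + background

naive_ratio = red_raw / green_raw                                    # 2.5
corrected_ratio = (red_raw - background) / (green_raw - background)  # 4.0

print(naive_ratio, corrected_ratio)   # the uncorrected ratio understates the fold change
```

The additive term shrinks ratios toward 1, so the bias is worst for dim spots, where the background is a large share of the raw intensity.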
Perhaps the resolution of this paradox will come with more sophisticated forms of background correction. The Spot program and the Agilent software both do something other than local background estimation, and both give less noisy estimates than subtraction. In principle, the raw intensity of a spot is made up of fluorescence from labeled transcripts plus reflection or emission from the substrate. The label fluorescence comes from the target transcript and also from a mixture of other transcripts that have bound non-specifically to the spot. The local background is made up of reflection from the substrate and stray bits of labeled transcript that have bound to the surface (sometimes, if a nearby probe has spread, it includes labeled transcript from that other gene). The make-up of the local background therefore differs from the background we want to correct on the spot. On the other hand, the negative controls show just non-specific hybridization, which is the bulk of what we want to correct on each gene probe. Perhaps a better approach would be to subtract a weighted average of the local background and the values of nearby negative controls.
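That suggestion can be sketched as follows; the weight `w`, the intensity values, and the equal-weights default are all illustrative assumptions rather than a published correction method.

```python
def corrected_intensity(spot, local_background, negative_controls, w=0.5):
    """Subtract a weighted average of the local background and the mean of
    nearby negative-control spots; `w` weights the local background."""
    neg_mean = sum(negative_controls) / len(negative_controls)
    estimate = w * local_background + (1 - w) * neg_mean
    return spot - estimate

# Hypothetical values: spot intensity 900, local background 120,
# three nearby negative controls averaging 80.
print(corrected_intensity(900, 120, [75, 85, 80]))   # 900 - (60 + 40) = 800.0
```

Setting `w = 1` recovers plain local background subtraction, and `w = 0` corrects using only the negative controls, so the weight interpolates between the two strategies discussed above.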