The Assay

Summarization of ChIP-chip Data

Chromatin immunoprecipitation (ChIP) has become a major assay for assessing protein interactions with DNA, as well as DNA methylation. The two major kinds of protein interaction are sporadic specific interactions, where a protein binds to specific sequences on the DNA, and histone modifications. Generally the former involves identifying a small number of sites, and as of 2008 it was already cheaper to do this kind of assay using high-throughput sequencing (ChIP-Seq), if that technology is available, than to use ChIP-chip arrays. For the latter kind of assay, which involves many reads, ChIP-Seq is catching up, but for most labs it is still cheaper and faster to get an initial look via array. Methylation may be assayed in many ways, of which methylated-DNA immunoprecipitation (MeDIP) has been one of the most common.

There are some normalization issues specific to ChIP arrays, and indeed some major array normalization approaches were tried first on ChIP arrays, and are still best known there. See ChIP-chip Normalization.

In the early days of ChIP-chip the most common practice was to call an enrichment peak as either present or absent, a binary decision. While this kind of call was convenient for later follow-up studies, most labs have come to the conclusion that most kinds of chromatin modification or protein binding are continuous, quantitative affairs rather than discrete 'all or nothing' events. Nowadays the standard approach is to estimate the degree and statistical significance of enrichment at a site.

Moving Average and Convolution Methods

The assay itself has a resolution of several hundred base pairs, so measures obtained on neighboring probes should reflect very similar degrees of enrichment. The simplest way to compensate for the errors in any individual probe is therefore to smooth the signal over neighboring probes, yielding a smooth curve. The only arithmetic complication is that the probes are usually at uneven distances from one another, in order to avoid intervening regions of highly repetitive sequence.

Usually some local weighting scheme (in mathematical jargon, a kernel) is chosen to determine the local averaging procedure. Many people use a Gaussian kernel, f(x) = exp(-(x - m)^2), where x is the position of a probe and m is the point where the estimated enrichment is desired, so that the weights assigned to the probes are given by a Gaussian density function centered at m. Note that the peak of enrichment need not occur at the site of a probe, for example if two neighboring probes have roughly equal high signal.
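As a rough illustration, kernel-weighted averaging over unevenly spaced probes might be sketched as follows. This is a minimal sketch, not a reference implementation: the function name gaussian_smooth and the bandwidth parameter (a kernel width in base pairs, which the formula above leaves implicit) are assumptions introduced here for illustration.

```python
import math

def gaussian_smooth(positions, values, targets, bandwidth=300.0):
    """Kernel-weighted average of probe values at each target coordinate.

    positions : probe coordinates in bp, possibly unevenly spaced
    values    : probe signals (e.g. log-ratios), same length as positions
    targets   : coordinates at which enrichment is to be estimated
    bandwidth : hypothetical kernel width in bp (an assumption, not from the text)
    """
    smoothed = []
    for m in targets:
        # Gaussian weight for every probe, centered at the target m.
        weights = [math.exp(-((x - m) / bandwidth) ** 2) for x in positions]
        total = sum(weights)
        if total == 0.0:
            # No probe is close enough to contribute; no estimate possible.
            smoothed.append(float("nan"))
        else:
            weighted = sum(w * v for w, v in zip(weights, values))
            smoothed.append(weighted / total)
    return smoothed

# Two neighboring probes with equal high signal: the estimate midway
# between them (150) exceeds the estimate at either probe site (100).
at_probe, midway = gaussian_smooth(
    [0, 100, 200, 300], [0.0, 2.0, 2.0, 0.0], [100, 150], bandwidth=100.0
)
```

Dividing by the sum of the weights is what absorbs the uneven probe spacing: each estimate is a proper weighted mean of whichever probes happen to lie near the target, however many there are.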

One-Color and Two-Color Methods

The appropriate signal to work with in a two-color assay is the log-ratio of the two channels.
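A per-probe log-ratio could be computed along these lines. The sketch assumes that the immunoprecipitated channel goes in the numerator and the reference channel in the denominator, uses base 2, and adds a small pseudocount to guard against zero intensities; all three choices, and the names log_ratios, ip, control and pseudocount, are illustrative assumptions rather than anything prescribed here.

```python
import math

def log_ratios(ip, control, pseudocount=1.0):
    """Per-probe log2 ratio of IP intensity to control intensity.

    ip, control : raw intensities for the two channels, same length
    pseudocount : hypothetical offset added to both channels so that
                  zero intensities do not produce log(0) or division by zero
    """
    return [
        math.log2((a + pseudocount) / (b + pseudocount))
        for a, b in zip(ip, control)
    ]

# A probe with IP intensity 7 and control intensity 1:
# log2((7 + 1) / (1 + 1)) = log2(4) = 2.0
example = log_ratios([7.0], [1.0])
```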

Handling Gaps

One of the peculiarities of ChIP-chip data is the presence of extremely low intensity values in some regions on some arrays. This may appear as a drop in signal in a one-color assay and as a sharp decrease in log-ratio in a two-color assay. With histone data the most natural explanation is that these dips represent a region that lacks histones in many or all of the cells in the preparation, as is typical of chromatin where transcription is being initiated. This interpretation is borne out by the fact that in two-color ChIP data, most of the extremely low log-ratios occur in regions where there is very low signal in both channels as well. In one-color data, it is impossible to separate these regions.

The presence of gaps somewhat complicates the moving average strategy. A natural strategy is to allow more sudden transitions by shortening the effective averaging range.
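One possible way to shorten the averaging range is to switch to a narrower kernel near probes that have been flagged as lying in a gap. The sketch below assumes the flags are already available (for instance from low signal in both channels of a two-color array); the function name adaptive_smooth, the wide and narrow widths, and the rule for deciding that a probe is near a gap are all hypothetical choices made for illustration.

```python
import math

def adaptive_smooth(positions, values, is_gap, wide=300.0, narrow=75.0):
    """Gaussian-weighted average at each probe, narrowed near flagged gaps.

    positions : probe coordinates in bp
    values    : probe signals (e.g. log-ratios)
    is_gap    : booleans marking probes believed to lie in a gap
    wide      : hypothetical kernel width (bp) used away from gaps
    narrow    : hypothetical kernel width (bp) used within `wide` bp of a gap
    """
    gap_positions = [x for x, g in zip(positions, is_gap) if g]
    smoothed = []
    for m in positions:
        # Shorten the averaging range if any gap probe is nearby.
        near_gap = any(abs(m - g) <= wide for g in gap_positions)
        h = narrow if near_gap else wide
        weights = [math.exp(-((x - m) / h) ** 2) for x in positions]
        total = sum(weights)  # > 0: the probe at m itself has weight 1
        smoothed.append(sum(w * v for w, v in zip(weights, values)) / total)
    return smoothed

# Eleven probes 100 bp apart with a single sharp dip at position 500.
positions = list(range(0, 1100, 100))
values = [2.0] * 11
values[5] = -3.0
flags = [i == 5 for i in range(11)]

with_gap = adaptive_smooth(positions, values, flags)        # dip preserved
no_gap = adaptive_smooth(positions, values, [False] * 11)   # dip smoothed away
```

In this toy example the fixed-width smoother pulls the dip at position 500 back above zero, while the adaptive version keeps it clearly negative; probes more than `wide` base pairs from the gap are smoothed identically in both cases.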