GreyListChIP is an R package, available from Bioconductor, for identifying genomic regions with high signal in the control samples (known as inputs) of ChIP experiments. It was developed by Gord Brown within the Bioinformatics Core and is now maintained by Rory Stark.

Many cell lines and tumour samples show anomalous signal in the input or control sample in some regions. These regions also show high signal in the corresponding ChIPs. Peak callers are not, in general, well-behaved in these regions, tending to call many spurious peaks. The purpose of this package is to identify those regions, so that reads in those regions may be removed prior to peak calling, allowing for more accurate insert size estimation and reducing the number of false-positive peaks.

As part of the ENCODE project, Anshul Kundaje identified regions that show enrichment in ChIP experiments independent of which factor is being ChIPped or which cell line the sample comes from. The regions were labelled “signal artefact” regions, or colloquially “black lists”. We call our lists of high-signal regions “grey lists” to distinguish them from ENCODE’s black lists: they are not universal but cell line (or sample) specific, and they can be tuned depending on the stringency required.

The GreyListChIP package provides functions to support the following steps involved in generating a grey list:

  1. Generate a tiling of the genome

  2. Count reads from a BAM file for the tiling

  3. Sample from the counts and fit a negative binomial distribution to each sample to calculate a read count threshold

  4. Filter the tiling to identify regions of high signal

  5. Export the resulting set to a BED file that can be used in an analysis of ChIP-seq data
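
The five steps above can be illustrated with a small, self-contained Python sketch. This is not the package's R implementation or its API: all function names below are hypothetical, reads are represented as plain start positions rather than a BAM file, the tile width and step are illustrative values, and the negative binomial is fitted by the method of moments (a simple stand-in that requires the variance to exceed the mean). Averaging the fitted parameters across samples and using a "count at or above threshold" rule are likewise assumptions made for the sketch.

```python
import math
import random
from bisect import bisect_left
from statistics import mean, variance


def tile_genome(chrom_sizes, width=1024, step=512):
    """Step 1: overlapping fixed-width tiles as (chrom, start, end), 0-based half-open."""
    return [(chrom, s, s + width)
            for chrom, size in chrom_sizes.items()
            for s in range(0, size - width + 1, step)]


def count_reads(tiles, read_starts):
    """Step 2: count reads starting in each tile.

    read_starts maps chromosome name to a sorted list of read start positions
    (a stand-in for reading a BAM file).
    """
    counts = []
    for chrom, start, end in tiles:
        positions = read_starts.get(chrom, [])
        counts.append(bisect_left(positions, end) - bisect_left(positions, start))
    return counts


def fit_nbinom(sample):
    """Method-of-moments estimate of the negative binomial (size, mu)."""
    mu = mean(sample)
    var = variance(sample)
    if var <= mu:
        raise ValueError("sample is not overdispersed; cannot fit by moments")
    return mu * mu / (var - mu), mu


def nbinom_quantile(q, size, mu):
    """Smallest k with P(X <= k) >= q, for 0 < q < 1."""
    p = size / (size + mu)
    k, cdf = 0, 0.0
    while True:
        log_pmf = (math.lgamma(k + size) - math.lgamma(k + 1) - math.lgamma(size)
                   + size * math.log(p) + k * math.log1p(-p))
        cdf += math.exp(log_pmf)
        if cdf >= q:
            return k
        k += 1


def calc_threshold(counts, reps=100, sample_size=1000, q=0.99, seed=0):
    """Step 3: fit repeated random samples, average the parameters, take a quantile."""
    rng = random.Random(seed)
    n = min(sample_size, len(counts))
    fits = [fit_nbinom(rng.sample(counts, n)) for _ in range(reps)]
    size = mean(f[0] for f in fits)
    mu = mean(f[1] for f in fits)
    return nbinom_quantile(q, size, mu)


def make_grey_list(tiles, counts, threshold, max_gap=0):
    """Step 4: keep tiles at or above the threshold, merging tiles within max_gap.

    Assumes tiles are ordered by chromosome and start, as tile_genome returns them.
    """
    regions = []
    for (chrom, start, end), n in zip(tiles, counts):
        if n < threshold:
            continue
        if regions and regions[-1][0] == chrom and start <= regions[-1][2] + max_gap:
            regions[-1][2] = max(regions[-1][2], end)
        else:
            regions.append([chrom, start, end])
    return [tuple(r) for r in regions]


def to_bed(regions):
    """Step 5: format regions as three-column BED text."""
    return "".join(f"{chrom}\t{start}\t{end}\n" for chrom, start, end in regions)
```

Calling `tile_genome`, `count_reads`, `calc_threshold`, `make_grey_list` and `to_bed` in that order mirrors the five steps. Raising or lowering `q` is one way to picture how the stringency of a grey list can be tuned.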