CG CLUSTERS

For some time, people have struggled with the definition of CpG islands in the genome.

Although the original definition was based on the very limited sequence data in the GenBank database as it stood in 1985, it has proved to be extremely robust over the last two decades as a means of predicting gene promoters, imprinted genes, and sites at which cytosine methylation changes in cancer.

When you look at the CpG island track on a genome browser, you'll see a nice, limited number of loci, of the same order of magnitude as the number of genes in the genome. However, this track excludes all of the CpG islands located within transposons, removing >90% of the CpG islands in doing so. This raises the question of why the same kind of sequence should be unimportant in transposons yet essential in unique sequence. In an attempt to address this, Takai and Jones made the base compositional definition more stringent and improved the ability to discriminate promoters from transposons.
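To make the idea of a base compositional definition concrete, here is a minimal Python sketch of the kind of test involved. It is an illustration only, not the code behind any browser track. The function names are hypothetical, and the thresholds in the usage lines are the commonly cited values for the original criteria (at least 200 bp, GC content above 50%, observed/expected CpG above 0.6) and for the Takai and Jones criteria (at least 500 bp, GC content of at least 55%, observed/expected CpG of at least 0.65); they are not stated on this page, and the sketch uses >= throughout for simplicity.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    # Observed CpG count divided by the count expected from C and G frequencies.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def passes_composition(seq, min_len, min_gc, min_obs_exp):
    """Hypothetical helper: test one window against base compositional criteria."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction >= min_gc and obs_exp >= min_obs_exp


# Usage with the assumed thresholds described above.
window = "CG" * 100  # a toy 200 bp window
original_style = passes_composition(window, 200, 0.50, 0.60)
stringent_style = passes_composition(window, 500, 0.55, 0.65)
```

In this toy case the window satisfies the looser criteria but fails the more stringent ones on length alone, which is the sense in which a stricter definition excludes short CG-rich sequences.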

However, another problem was beginning to emerge as high-throughput techniques were applied to the study of cytosine methylation: not all CpG islands were unmethylated in normal cells, as had previously been believed. This was a concern to those who used the genomic annotation as a surrogate for testing normal cells when studying dysregulation of cytosine methylation in disease. We reviewed the topic in detail in 2004.

We have now used a simple approach to redefine these interesting sequence elements. We ignored base composition entirely and required only that CG dinucleotides cluster to define the element. We find that the new approach is better than CpG islands at defining functionally important sites in the genome (transcription start sites, hypomethylated loci), and allows a species-specific definition that reveals much greater conservation of these sequence features than is seen for CpG islands.
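As a rough illustration of what clustering CG dinucleotides without regard to base composition can look like, here is a minimal Python sketch. It is not the method from the manuscript. The function name and both parameters (max_gap, the largest allowed distance between neighbouring CGs, and min_cgs, the fewest CGs that count as a cluster) are hypothetical, and their default values are arbitrary placeholders rather than published thresholds.

```python
def cg_clusters(seq, max_gap=50, min_cgs=5):
    """Group CG dinucleotides into clusters by spacing alone.

    Returns a list of (start, end, n_cgs) tuples using 0-based, end-exclusive
    coordinates. max_gap and min_cgs are illustrative assumptions.
    """
    seq = seq.upper()
    # Start position of every CG dinucleotide in the sequence.
    positions = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]

    clusters = []
    run = []  # positions of the CGs in the cluster currently being built
    for pos in positions:
        # A gap larger than max_gap closes the current run.
        if run and pos - run[-1] > max_gap:
            if len(run) >= min_cgs:
                clusters.append((run[0], run[-1] + 2, len(run)))
            run = []
        run.append(pos)
    # Close the final run at the end of the sequence.
    if len(run) >= min_cgs:
        clusters.append((run[0], run[-1] + 2, len(run)))
    return clusters


# Usage: six closely spaced CGs, then an isolated CG far downstream.
toy = "A" * 100 + "CGAT" * 6 + "A" * 100 + "CG" + "A" * 100
found = cg_clusters(toy)
```

Note that GC content is never computed here; only the positions of the CG dinucleotides matter. Because the spacing and count thresholds are free parameters, they could in principle be tuned separately for each genome, which is one way to read the species-specific definition described above.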

The manuscript describing this approach is now published at NAR. We are making the annotations of the human and mouse genomes freely available through this link. We welcome your feedback about the CG clusters annotation.