CG CLUSTERS
For some time, people have struggled with the definition of
CpG islands in the genome.
Although the original definition was based on very limited
sequence data and the 1985 GenBank database, it has proved
extremely robust over the last two decades as a means of
predicting gene promoters, imprinted genes, and sites at
which cytosine methylation changes in cancer.
When you look at the CpG island track on a genome browser,
you see a manageable number of loci, on the order of the
number of genes in the genome. However, this track
excludes all of the CpG islands located within
transposons, which removes >90% of the CpG islands.
This raises the question of why the same kind of sequence
should be unimportant in transposons yet essential in
unique sequence. In an attempt to address this,
Takai and Jones made the base compositional definition
more stringent and improved the ability to discriminate
promoters from transposons.
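To make the idea of a base compositional definition concrete, here is a minimal sketch in Python. It computes the two quantities such definitions typically rely on: G+C fraction and the observed/expected CpG ratio. The default thresholds (length of at least 500 bp, G+C of at least 55%, observed/expected of at least 0.65) are the values commonly cited for the Takai and Jones criteria and are not stated on this page, so treat them as assumptions. The function names are hypothetical.

```python
def base_composition(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so this counts every CpG
    gc_fraction = (c + g) / n if n else 0.0
    # Observed/expected CpG = (#CpG * N) / (#C * #G); defined as 0 if C or G is absent
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def passes_criteria(seq, min_len=500, min_gc=0.55, min_obs_exp=0.65):
    """Check one candidate region against assumed base compositional thresholds."""
    gc_fraction, obs_exp = base_composition(seq)
    return len(seq) >= min_len and gc_fraction >= min_gc and obs_exp >= min_obs_exp
```

A real island caller would slide a window along the chromosome and merge passing windows; this sketch only scores a single candidate region.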
However, another problem began to emerge as high-throughput
techniques were applied to the study of cytosine methylation:
not all CpG islands were unmethylated in normal cells, as had
previously been believed. This was a concern to those for whom
the genomic annotation served as a surrogate for testing
normal cells when studying dysregulation of cytosine
methylation in disease. We reviewed the topic in detail in 2004.
We have now used a simple approach to redefine these
interesting sequence elements. We ignored base composition
entirely and only required that CG dinucleotides cluster to
define the element. We find that the new approach is better
than CpG islands at defining functionally important sites
in the genome (transcription start sites, hypomethylated
loci), and allows a species-specific definition that
reveals much greater conservation of these sequence
features than is seen for CpG islands.
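The clustering idea can be sketched in a few lines of Python. This is only an illustration of grouping CG dinucleotides by proximity while ignoring base composition. The parameters max_gap (largest allowed distance between consecutive CGs) and min_cgs (fewest CGs per cluster) and their default values are hypothetical, chosen for the example, and are not the criteria used in the published annotation.

```python
def find_cg_clusters(seq, max_gap=50, min_cgs=5):
    """Group CG dinucleotides that lie close together.

    Returns a list of (start, end, cg_count) tuples, 0-based with end exclusive.
    max_gap and min_cgs are illustrative assumptions, not published values.
    """
    seq = seq.upper()

    # Collect the start position of every CG dinucleotide.
    positions = []
    i = seq.find("CG")
    while i != -1:
        positions.append(i)
        i = seq.find("CG", i + 1)

    clusters = []
    if not positions:
        return clusters

    def close_run(run):
        # Keep a run only if it contains enough CGs.
        if len(run) >= min_cgs:
            clusters.append((run[0], run[-1] + 2, len(run)))

    run = [positions[0]]
    for p in positions[1:]:
        if p - run[-1] <= max_gap:
            run.append(p)      # still within the current cluster
        else:
            close_run(run)     # gap too large: finish this cluster
            run = [p]
    close_run(run)
    return clusters


if __name__ == "__main__":
    demo = "CG" * 5 + "A" * 100 + "CG" * 3
    print(find_cg_clusters(demo))
    print(find_cg_clusters(demo, min_cgs=3))
```

Because the only inputs are CG positions and two distance and count parameters, a definition of this kind can in principle be tuned separately for each genome, which is one way to read the species-specific definition mentioned above.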
The manuscript describing this approach is now
published in NAR. We are making the annotations of the
human and mouse genomes freely available through
this link. We welcome your feedback about the CG clusters
annotation.