International Human Genome Sequencing Consortium
Authors:
Addresses:
Correspondence:
Abstract:
Introduction:
Summary of Findings:
Methods:
Supplementary Information:
...
Repeat Content of the Human Genome:
Transposon-Derived Repeats:
Classes of Transposable Elements:
Figure 17. Classes of interspersed repeat transposons in the human genome:
LINEs:
SINEs:
LTR retroposons:
DNA transposons:
Census of human repeats:
Table 11. Number of copies and fraction of genome for interspersed repeat:
Age distribution:
Figure 18. Age distribution of interspersed repeats in the human and mouse genomes:
Comparisons with other organisms:
Table 12. Number and nature of interspersed repeats in eukaryote genomes:
Figure 20. Comparison of the age of interspersed repeats in eukaryote genomes:
Variation in the distribution of repeats:
...
Active transposons:
Transposons as a creative force:
Simple sequence repeats:
...
Gene Content of the Human Genome:
Noncoding RNAs:
Transfer RNA genes:
Ribosomal RNA genes:
Small nucleolar RNA genes:
Spliceosomal RNAs and other ncRNA genes:
Protein-coding genes:
Exploring properties of known genes:
...
The next steps:
Finishing the human sequence:
Developing the IGI and IPI:
Large-scale identification of regulatory regions:
Sequencing of additional large genomes:
Completing the catalogue of human variation:
From sequence to function:
Concluding thoughts:
DNA Sequence Databases:
Supplementary Information:
Acknowledgements:
References:
Background References:
Additional References:
The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.
The rediscovery of Mendel's laws of heredity in the opening weeks of the 20th century [1-3] sparked a scientific quest to understand the nature and content of genetic information that has propelled biology for the last hundred years. The scientific progress made falls naturally into four main phases, corresponding roughly to the four quarters of the century. The first established the cellular basis of heredity: the chromosomes. The second defined the molecular basis of heredity: the DNA double helix. The third unlocked the informational basis of heredity, with the discovery of the biological mechanism by which cells read the information contained in genes and with the invention of the recombinant DNA technologies of cloning and sequencing by which scientists can do the same.
The last quarter of a century has been marked by a relentless drive to decipher first genes and then entire genomes, spawning the field of genomics. The fruits of this work already include the genome sequences of 599 viruses and viroids, 205 naturally occurring plasmids, 185 organelles, 31 eubacteria, seven archaea, one fungus, two animals and one plant.
Here we report the results of a collaboration involving 20 groups from the United States, the United Kingdom, Japan, France, Germany and China to produce a draft sequence of the human genome. The draft genome sequence was generated from a physical map covering more than 96% of the euchromatic part of the human genome and, together with additional sequence in public databases, it covers about 94% of the human genome. The sequence was produced over a relatively short period, with coverage rising from about 10% to more than 90% over roughly fifteen months. The sequence data have been made available without restriction and updated daily throughout the project. The task ahead is to produce a finished sequence, by closing all gaps and resolving all ambiguities. Already about one billion bases are in final form and the task of bringing the vast majority of the sequence to this standard is now straightforward and should proceed rapidly.
The sequence of the human genome is of interest in several respects. It is the largest genome to be extensively sequenced so far, being 25 times as large as any previously sequenced genome and eight times as large as the sum of all such genomes. It is the first vertebrate genome to be extensively sequenced. And, uniquely, it is the genome of our own species.
Much work remains to be done to produce a complete finished sequence,
but the vast trove of information that has become available through this
collaborative effort allows a global perspective on the human genome. Although
the details will change as the sequence is finished, many points are already
clear.
In this paper, we start by presenting background information on the project and describing the generation, assembly and evaluation of the draft genome sequence. We then focus on an initial analysis of the sequence itself: the broad chromosomal landscape; the repeat elements and the rich palaeontological record of evolutionary and biological processes that they provide; the human genes and proteins and their differences and similarities with those of other organisms; and the history of genomic segments. (Comparisons are drawn throughout with the genomes of the budding yeast Saccharomyces cerevisiae, the nematode worm Caenorhabditis elegans, the fruitfly Drosophila melanogaster and the mustard weed Arabidopsis thaliana; we refer to these for convenience simply as yeast, worm, fly and mustard weed.) Finally, we discuss applications of the sequence to biology and medicine and describe next steps in the project.
A full description of the methods is provided as Supplementary Information on Nature's web site.
Repeat content of the human genome:
A puzzling observation in the early days of molecular biology was that genome size does not correlate well with organismal complexity. For example, Homo sapiens has a genome that is 200 times as large as that of the yeast S. cerevisiae, but 200 times as small as that of Amoeba dubia [139, 140]. This mystery (the C-value paradox) was largely resolved with the recognition that genomes can contain a large quantity of repetitive sequence, far in excess of that devoted to protein-coding genes (reviewed in refs [140, 141]).
In the human, coding sequences comprise less than 5% of the genome (see below), whereas repeat sequences account for at least 50% and probably much more. Broadly, the repeats fall into five classes: (1) transposon-derived repeats, often referred to as interspersed repeats; (2) inactive (partially) retroposed copies of cellular genes (including protein-coding genes and small structural RNAs), usually referred to as processed pseudogenes; (3) simple sequence repeats, consisting of direct repetitions of relatively short k-mers such as (A)n, (CA)n or (CGG)n; (4) segmental duplications, consisting of blocks of around 10–300 kb that have been copied from one region of the genome into another region; and (5) blocks of tandemly repeated sequences, such as at centromeres, telomeres, the short arms of acrocentric chromosomes and ribosomal gene clusters. (These regions are intentionally under-represented in the draft genome sequence and are not discussed here.)
Repeats are often described as 'junk' and dismissed as uninteresting. However, they actually represent an extraordinary trove of information about biological processes. The repeats constitute a rich palaeontological record, holding crucial clues about evolutionary events and forces. As passive markers, they provide assays for studying processes of mutation and selection. It is possible to recognize cohorts of repeats 'born' at the same time and to follow their fates in different regions of the genome or in different species. As active agents, repeats have reshaped the genome by causing ectopic rearrangements, creating entirely new genes, modifying and reshuffling existing genes, and modulating overall GC content. They also shed light on chromosome structure and dynamics, and provide tools for medical genetic and population genetic studies.
The human is the first repeat-rich genome to be sequenced, and so we investigated what information could be gleaned from this majority component of the human genome. Although some of the general observations about repeats were suggested by previous studies, the draft genome sequence provides the first comprehensive view, allowing some questions to be resolved and new mysteries to emerge.
Most human repeat sequence is derived from transposable elements [142, 143]. We can currently recognize about 45% of the genome as belonging to this class. Much of the remaining 'unique' DNA must also be derived from ancient transposable element copies that have diverged too far to be recognized as such. To describe our analyses of interspersed repeats, it is necessary briefly to review the relevant features of human transposable elements.
Classes of transposable elements.
In mammals, almost all transposable elements fall into one of four types (Fig. 17), of which three transpose through RNA intermediates and one transposes directly as DNA. These are long interspersed elements (LINEs), short interspersed elements (SINEs), LTR retrotransposons and DNA transposons.
Figure 17. Almost all transposable elements in mammals fall into
one of four classes. (See text for details).
SINEs are wildly successful freeloaders on the backs of LINE elements. They are short (about 100–400 bp), harbour an internal polymerase III promoter and encode no proteins. These non-autonomous transposons are thought to use the LINE machinery for transposition. Indeed, most SINEs 'live' by sharing the 3' end with a resident LINE element [144]. The promoter regions of all known SINEs are derived from tRNA sequences, with the exception of a single monophyletic family of SINEs derived from the signal recognition particle component 7SL. This family, which also does not share its 3' end with a LINE, includes the only active SINE in the human genome: the Alu element. By contrast, the mouse has both tRNA-derived and 7SL-derived SINEs. The human genome contains three distinct monophyletic families of SINEs: the active Alu, and the inactive MIR and Ther2/MIR3.
LTR retroposons are flanked by long terminal direct repeats that contain all of the necessary transcriptional regulatory elements. The autonomous elements (retrotransposons) contain gag and pol genes, which encode a protease, reverse transcriptase, RNAse H and integrase. Exogenous retroviruses seem to have arisen from endogenous retrotransposons by acquisition of a cellular envelope gene (env) [147]. Transposition occurs through the retroviral mechanism with reverse transcription occurring in a cytoplasmic virus-like particle, primed by a tRNA (in contrast to the nuclear location and chromosomal priming of LINEs). Although a variety of LTR retrotransposons exist, only the vertebrate-specific endogenous retroviruses (ERVs) appear to have been active in the mammalian genome. Mammalian retroviruses fall into three classes (I–III), each comprising many families with independent origins. Most (85%) of the LTR retroposon-derived 'fossils' consist only of an isolated LTR, with the internal sequence having been lost by homologous recombination between the flanking LTRs.
DNA transposons resemble bacterial transposons, having terminal inverted repeats and encoding a transposase that binds near the inverted repeats and mediates mobility through a 'cut-and-paste' mechanism. The human genome contains at least seven major classes of DNA transposon, which can be subdivided into many families with independent origins [148] (see RepBase, http://www.girinst.org/~server/repbase.html). DNA transposons tend to have short life spans within a species. This can be explained by contrasting the modes of transposition of DNA transposons and LINE elements. LINE transposition tends to involve only functional elements, owing to the cis-preference by which LINE proteins assemble with the RNA from which they were translated. By contrast, DNA transposons cannot exercise a cis-preference: the encoded transposase is produced in the cytoplasm and, when it returns to the nucleus, it cannot distinguish active from inactive elements. As inactive copies accumulate in the genome, transposition becomes less efficient. This checks the expansion of any DNA transposon family and in due course causes it to die out. To survive, DNA transposons must eventually move by horizontal transfer to virgin genomes, and there is considerable evidence for such transfer [149-153].
Transposable elements employ different strategies to ensure their evolutionary survival. LINEs and SINEs rely almost exclusively on vertical transmission within the host genome [154] (but see refs 148, 155). DNA transposons are more promiscuous, requiring relatively frequent horizontal transfer. LTR retroposons use both strategies, with some being long-term active residents of the human genome (such as members of the ERVL family) and others having only short residence times.
We began by taking a census of the transposable elements in the draft
genome sequence, using a recently updated version of the RepeatMasker program
(version 09092000) run under sensitive settings (see
http://repeatmasker.genome.washington.edu).
This program scans sequences to identify full-length and partial members
of all known repeat families represented in RepBase Update (version 5.08;
see http://www.girinst.org/~server/repbase.html
and ref. [156]). Table
11 shows the number of copies and fraction of the draft genome sequence
occupied by each of the four major classes and the main subclasses.
| | Number of copies (x1,000) | Total number of bases in the draft genome sequence (Mb) | Fraction of the draft genome sequence (%) | Number of families (subfamilies) |
| SINEs | 1,558 | 359.6 | 13.14 | |
| Alu | 1,090 | 290.1 | 10.60 | 1(~20) |
| MIR | 393 | 60.1 | 2.20 | 1(1) |
| MIR3 | 75 | 9.3 | 0.34 | 1(1) |
| LINEs | 868 | 558.8 | 20.42 | |
| LINE1 | 516 | 462.1 | 16.89 | 1(~55) |
| LINE2 | 315 | 88.2 | 3.22 | 1(2) |
| LINE3 | 37 | 8.4 | 0.31 | 1(2) |
| LTR elements | 443 | 227.0 | 8.29 | |
| ERV-class I | 112 | 79.2 | 2.89 | 72(132) |
| ERV(K)-class II | 8 | 8.5 | 0.31 | 10(20) |
| ERV(L)-class III | 83 | 39.5 | 1.44 | 21(42) |
| MaLR | 240 | 99.8 | 3.65 | 1(31) |
| DNA elements | 294 | 77.6 | 2.84 | |
| hAT group | | | | |
| MER1-Charlie | 182 | 38.1 | 1.39 | 25(50) |
| Zaphod | 13 | 4.3 | 0.16 | 4(10) |
| Tc-1 group | | | | |
| MER2-Tigger | 57 | 28.0 | 1.02 | 12(28) |
| Tc2 | 4 | 0.9 | 0.03 | 1(5) |
| Mariner | 14 | 2.6 | 0.10 | 4(5) |
| PiggyBac-like | 2 | 0.5 | 0.02 | 10(20) |
| Unclassified | 22 | 3.2 | 0.12 | 7(7) |
| Unclassified | 3 | 3.8 | 0.14 | 3(4) |
| Total interspersed repeats | | 1,226.8 | 44.83 | |
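As a quick consistency check on Table 11, the class-level rows can be summed and compared with the reported totals. The sketch below copies only the five class-level rows from the table; the dictionary and function names are illustrative and are not part of the original analysis.

```python
# Class-level rows copied from Table 11:
# (bases in the draft genome sequence in Mb, fraction of the draft sequence in %).
CLASS_TOTALS = {
    "SINEs": (359.6, 13.14),
    "LINEs": (558.8, 20.42),
    "LTR elements": (227.0, 8.29),
    "DNA elements": (77.6, 2.84),
    "Unclassified": (3.8, 0.14),
}


def column_sums(rows):
    """Sum the Mb and % columns, rounded to the precision used in the table."""
    total_mb = round(sum(mb for mb, _ in rows.values()), 1)
    total_pct = round(sum(pct for _, pct in rows.values()), 2)
    return total_mb, total_pct


total_mb, total_pct = column_sums(CLASS_TOTALS)
print(f"Total interspersed repeats: {total_mb} Mb, {total_pct}% of the draft sequence")
```

The sums reproduce the totals in the last row of the table (1,226.8 Mb and 44.83%).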
The age distribution of the repeats in the human genome provides a rich 'fossil record' stretching over several hundred million years. The ancestry and approximate age of each fossil can be inferred by exploiting the fact that each copy is derived from, and therefore initially carried the sequence of, a then-active transposon and, being generally under no functional constraint, has accumulated mutations randomly and independently of other copies. We can infer the sequence of the ancestral active elements by clustering the modern derivatives into phylogenetic trees and building a consensus based on the multiple sequence alignment of a cluster of copies. Using available consensus sequences for known repeat subfamilies, we calculated the per cent divergence from the inferred ancestral active transposon for each of three million interspersed repeats in the draft genome sequence.
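The idea of inferring an ancestral element from its modern copies and then scoring each copy against it can be illustrated with a deliberately simplified sketch. It assumes the copies are already aligned to equal length and uses a plain majority-rule consensus; it does not reproduce the phylogenetic clustering or the RepeatMasker-based pipeline used for the draft genome sequence, and the toy sequences and function names are hypothetical.

```python
from collections import Counter


def consensus(aligned_copies):
    """Majority-rule consensus of equal-length aligned sequences ('-' marks a gap)."""
    result = []
    for column in zip(*aligned_copies):
        bases = [b for b in column if b != "-"]
        # Take the most frequent base in the column; keep a gap if no copy has a base here.
        result.append(Counter(bases).most_common(1)[0][0] if bases else "-")
    return "".join(result)


def percent_divergence(copy, cons):
    """Mismatches per 100 aligned positions, ignoring positions gapped in either sequence."""
    pairs = [(a, b) for a, b in zip(copy, cons) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    mismatches = sum(a != b for a, b in pairs)
    return 100.0 * mismatches / len(pairs)


# Hypothetical toy alignment of four copies of one repeat subfamily.
copies = ["ACGTACGTAC", "ACGTACGAAC", "ACCTACGTAC", "ACGTACGTAC"]
ancestor = consensus(copies)
for seq in copies:
    print(seq, percent_divergence(seq, ancestor))
```

Because mutations in different copies are assumed to be independent, a site that has changed in one copy is usually outvoted by the unchanged copies, which is why the consensus approximates the ancestral sequence.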
The percentage of sequence divergence can be converted into an approximate age in millions of years (Myr) on the basis of evolutionary information. Care is required in calibrating the clock, because the rate of sequence divergence may not be constant over time or between lineages [139]. The relative-rate test [157] can be used to calculate the sequence divergence that accumulated in a lineage after a given timepoint, on the basis of comparison with a sibling species that diverged at that time and an outgroup species. For example, the substitution rate over roughly the last 25 Myr in the human lineage can be calculated by using old world monkeys (which diverged about 25 Myr ago) as a sibling species and new world monkeys as an outgroup. We have used currently available calibrations for the human lineage, but the issue should be revisited as sequence information becomes available from different mammals.
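One common way to express the three-species comparison is the three-point calculation sketched below, which assumes that pairwise distances are additive along the tree. The distances in the example are hypothetical placeholders, not values from the draft genome analysis, and the function name is illustrative.

```python
def lineage_branch_lengths(d_ab, d_ao, d_bo):
    """Divergence accumulated in each sibling lineage since their split.

    d_ab: distance between sibling species A and B
    d_ao: distance between A and the outgroup O
    d_bo: distance between B and the outgroup O
    Assumes additive distances on the tree ((A, B), O).
    """
    branch_a = (d_ab + d_ao - d_bo) / 2
    branch_b = (d_ab + d_bo - d_ao) / 2
    return branch_a, branch_b


# Hypothetical distances in substitutions per site:
# A = human, B = old world monkey, O = new world monkey.
human_branch, monkey_branch = lineage_branch_lengths(d_ab=0.07, d_ao=0.12, d_bo=0.13)

# Using the split time quoted in the text (about 25 Myr) to obtain a rate per Myr.
split_time_myr = 25
print(human_branch / split_time_myr, monkey_branch / split_time_myr)
```

If the two sibling branches come out unequal, the lineages have evolved at different rates since the split, which is the situation the calibration has to account for.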
Figure 18a shows the representation of various
classes of transposable elements in categories reflecting equal amounts
of sequence divergence. In Fig. 18b the data are grouped into four bins
corresponding to successive 25-Myr periods, on the basis of an approximate
clock. Figure 19 shows the mean ages of various subfamilies of DNA transposons.
Figure 18. Age distribution of interspersed repeats in the human
and mouse genomes. Bases covered by interspersed repeats were sorted by
their divergence from their consensus sequence (which approximates the
repeat's original sequence at the time of insertion). The average number
of substitutions per 100 bp (substitution level, K) was calculated from
the mismatch level p assuming equal frequency of all substitutions (the
one-parameter Jukes–Cantor model, K = -(3/4) ln(1 - (4/3)p)). This model
tends to underestimate higher substitution levels. CpG dinucleotides in
the consensus were excluded from the substitution level calculations because
the C → T transition rate in CpG pairs is about tenfold higher than other
transitions and causes distortions in comparing transposable elements with
high and low CpG content. a, The distribution, for the human genome, in
bins corresponding to 1% increments in substitution levels. b, The data
grouped into bins representing roughly equal time periods of 25 Myr. c, d,
Equivalent data for available mouse genomic sequence. There is a different
correspondence between substitution levels and time periods owing to different
rates of nucleotide substitution in the two species. The correspondence
between substitution levels and time periods was largely derived from three-way
species comparisons (relative rate test [139, 157]) with the age
estimates based on fossil data. Human divergence from gibbon 20–30
Myr; old world monkey 25–35 Myr; prosimians 55–80 Myr; eutherian
mammalian radiation, 100 Myr.
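The Jukes–Cantor correction quoted in the legend to Fig. 18 is simple enough to sketch directly. In the sketch below, p is taken as a fraction of mismatched sites (so the result is in substitutions per site and would be multiplied by 100 to obtain substitutions per 100 bp); the function name and the example values are illustrative.

```python
import math


def jukes_cantor(p):
    """Substitution level K from mismatch fraction p: K = -(3/4) ln(1 - (4/3) p)."""
    if not 0.0 <= p < 0.75:
        # The logarithm is undefined once p reaches 3/4, the level expected for random sequence.
        raise ValueError("mismatch fraction must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


for p in (0.01, 0.05, 0.10, 0.20, 0.30):
    print(f"p = {p:.2f}  K = {jukes_cantor(p):.4f}")
```

K is always at least as large as p, and the gap widens as p grows, because repeated substitutions at the same site are hidden in the raw mismatch count.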
Several facts are apparent from these graphs. First, most interspersed repeats in the human genome predate the eutherian radiation. This is a testament to the extremely slow rate with which nonfunctional sequences are cleared from vertebrate genomes (see below concerning comparison with the fly).
Second, LINE and SINE elements have extremely long lives. The monophyletic LINE1 and Alu lineages are at least 150 and 80 Myr old, respectively. In earlier times, the reigning transposons were LINE2 and MIR [148, 158]. The SINE MIR was perfectly adapted for reverse transcription by LINE2, as it carried the same 50-base sequence at its 3' end. When LINE2 became extinct 80–100 Myr ago, it spelled the doom of MIR.
Third, there were two major peaks of DNA transposon activity (Fig. 19). The first involved Charlie elements and occurred long before the eutherian radiation; the second involved Tigger elements and occurred after this radiation. Because DNA transposons can produce large-scale chromosome rearrangements [159-162], it is possible that widespread activity could be involved in speciation events.
Fourth, there is no evidence for DNA transposon activity in the past
50 Myr in the human genome. The youngest two DNA transposon families that
we can identify in the draft genome sequence (MER75 and MER85) show 6–7%
divergence from their respective consensus sequences representing the ancestral
element (Fig. 19), indicating that they were active before the divergence
of humans and new world monkeys. Moreover, these elements were relatively
unsuccessful, together contributing just 125 kb to the draft genome
sequence.
Finally, LTR retroposons appear to be teetering on the brink of extinction, if they have not already succumbed. For example, the most prolific elements (ERVL and MaLRs) flourished for more than 100 Myr but appear to have died out about 40 Myr ago [163, 164]. Only a single LTR retroposon family (HERVK10) is known to have transposed since our divergence from the chimpanzee 7 Myr ago, with only one known copy (in the HLA region) that is not shared between all humans [165]. In the draft genome sequence, we can identify only three full-length copies with all ORFs intact (the final total may be slightly higher owing to the imperfect state of the draft genome sequence).
More generally, the overall activity of all transposons has declined
markedly over the past 35–50 Myr, with the possible exception of LINE1
(Fig. 18). Indeed, apart from an exceptional burst
of activity of Alus peaking around 40 Myr ago, there would appear to have
been a fairly steady decline in activity in the hominid lineage since the
mammalian radiation. The extent of the decline must be even greater than
it appears because old
repeats are gradually removed by random deletion and because old
repeat families are harder to recognize and likely to be under-represented
in the repeat databases. (We confirmed that the decline in transposition
is not an artefact arising from errors in the draft genome sequence, which,
in principle, could increase the divergence level in recent elements. First,
the sequence error rate (Table 9) is far too low to have a significant
effect on the apparent age of recent transposons; and second, the same
result is seen if one considers only finished sequence.)
What explains the decline in transposon activity in the lineage leading to humans? We return to this question below, in the context of the observation that there is no similar decline in the mouse genome.
Comparison with other organisms
We compared the complement of transposable elements in the human
genome with those of the other sequenced eukaryotic genomes. We
analysed the fly, worm and mustard weed genomes for the number and nature
of repeats (Table 12) and the age distribution (Fig.
20). (For the fly, we analysed the 114 Mb of
unfinished 'large' contigs produced by the whole-genome shotgun
assembly [166], which are reported to represent euchromatic
sequence. Similar results were obtained by analysing 30 Mb of finished
euchromatic sequence.) The human genome stands in stark contrast to the
genomes of the other organisms.
The complete genomes of fly, worm, and chromosomes 2 and 4 of mustard
weed (as deposited at http://www.ncbi.nlm.nih.gov/genbank/genomes)
were screened against the repeats in RepBase Update 5.02
(September 2000) with RepeatMasker at sensitive settings.
Figure 20. Comparison of the age of interspersed repeats in eukaryotic genomes. The copies of repeats were pooled by their nucleotide substitution level from the consensus.
(1) The euchromatic portion of the human genome has a much higher density of transposable element copies than the euchromatic DNA of the other three organisms. The repeats in the other organisms may have been slightly underestimated because the repeat databases for the other organisms are less complete than for the human, especially with regard to older elements; on the other hand, recent additions to these databases appear to increase the repeat content only marginally.
(2) The human genome is filled with copies of ancient transposons, whereas the transposons in the other genomes tend to be of more recent origin. The difference is most marked with the fly, but is clear for the other genomes as well. The accumulation of old repeats is likely to be determined by the rate at which organisms engage in 'housecleaning' through genomic deletion. Studies of pseudogenes have suggested that small deletions occur at a rate that is 75-fold higher in flies than in mammals; the half-life of such nonfunctional DNA is estimated at 12 Myr for flies and 800 Myr for mammals [167]. The rate of large deletions has not been systematically compared, but seems likely also to differ markedly.
(3) Whereas in the human two repeat families (LINE1 and Alu) account for 60% of all interspersed repeat sequence, the other organisms have no dominant families. Instead, the worm, fly and mustard weed genomes all contain many transposon families, each consisting of typically hundreds to thousands of elements. This difference may be explained by the observation that the vertically transmitted, long-term residential LINE and SINE elements represent 75% of interspersed repeats in the human genome, but only 5–25% in the other genomes. In contrast, the horizontally transmitted and shorter-lived DNA transposons represent only a small portion of all interspersed repeats in humans (6%) but a much larger fraction in fly, mustard weed and worm (25%, 49% and 87%, respectively). These features of the human genome are probably general to all mammals. The relative lack of horizontally transmitted elements may have its origin in the well developed immune system of mammals, as horizontal transfer requires infectious vectors, such as viruses, against which the immune system guards.
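The half-life estimates quoted in point (2) give a feel for why old repeats persist in mammals but not in flies. The sketch below assumes simple exponential decay at a constant rate, which is what a half-life implies but is an idealization; the function name and the choice of 100 Myr as the example age are illustrative.

```python
def fraction_remaining(age_myr, half_life_myr):
    """Expected fraction of nonfunctional DNA still present after age_myr,
    assuming exponential loss with the given half-life."""
    return 0.5 ** (age_myr / half_life_myr)


# Half-lives quoted in the text: 12 Myr for flies, 800 Myr for mammals.
age = 100  # Myr, roughly the age of the eutherian radiation
print("fly:   ", fraction_remaining(age, 12))
print("mammal:", fraction_remaining(age, 800))
```

Under this assumption, well under 1% of 100-Myr-old nonfunctional sequence would survive in a fly genome, whereas more than 90% would survive in a mammalian genome.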
We also looked for differences among mammals, by comparing the transposons in the human and mouse genomes. As with the human genome, care is required in calibrating the substitution clock for the mouse genome. There is considerable evidence that the rate of substitution per Myr is higher in rodent lineages than in the hominid lineages [139, 168, 169]. In fact, we found clear evidence for different rates of substitution by examining families of transposable elements whose insertions predate the divergence of the human and mouse lineages. In an analysis of 22 such families, we found that the substitution level was an average of 1.7-fold higher in mouse than human (not shown). (This is likely to be an underestimate because of an ascertainment bias against the most diverged copies.) The faster clock in mouse is also evident from the fact that the ancient LINE2 and MIR elements, which transposed before the mammalian radiation and are readily detectable in the human genome, cannot be readily identified in available mouse genomic sequence (Fig. 18).
We used the best available estimates to calibrate substitution levels and time [169]. The ratio of substitution rates varied from about 1.7-fold higher over the past 100 Myr to about 2.6-fold higher over the past 25 Myr.
The analysis shows that, although the overall density of the four
transposon types in human and mouse is similar, the age distribution is
strikingly different (Fig. 18). Transposon activity
in the mouse genome has not undergone the decline seen in humans and proceeds
at a much higher rate. In contrast to their possible extinction in humans,
LTR retroposons are alive and well in the mouse with such representatives
as the active IAP family and putatively active members of the long-lived
ERVL and MaLR families. LINE1 and a
variety of SINEs are quite active. These evolutionary findings are
consistent with the empirical observations that new spontaneous mutations
are 30 times more likely to be caused by LINE insertions in mouse than
in human (3% versus 0.1%) [170] and 60 times more likely
to be caused by transposable elements in general. It is estimated that
around 1 in 600 mutations in human are due to transpositions, whereas 10%
of mutations in mouse are due to transpositions (mostly IAP insertions).
The contrast between human and mouse suggests that the explanation
for the decline of transposon activity in humans may lie in some fundamental
difference between hominids and rodents. Population structure and dynamics
would seem to be likely suspects. Rodents tend to have large populations,
whereas hominid populations tend to be small and may undergo frequent bottlenecks.
Evolutionary forces affected by such factors include inbreeding and genetic
drift, which might affect the persistence of active transposable
elements [171]. Studies in additional mammalian
lineages may shed light on the forces responsible for the differences in
the activity of transposable elements [172].
Variation in the distribution of repeats.
We next explored variation in the distribution of repeats across the draft genome sequence, by calculating the repeat density in windows of various sizes across the genome. There is striking variation at smaller scales.
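A windowed repeat-density calculation of this kind can be sketched as follows. The sketch assumes non-overlapping windows and a list of non-overlapping, half-open repeat intervals that lie within the sequence (for example, parsed from repeat annotation); the function name, the interval format and the toy coordinates are assumptions for illustration, not the procedure actually used.

```python
def repeat_density(seq_length, repeat_intervals, window):
    """Fraction of each non-overlapping window covered by repeats.

    repeat_intervals: iterable of (start, end) pairs, 0-based and half-open,
    assumed not to overlap one another and to lie within [0, seq_length).
    """
    n_windows = (seq_length + window - 1) // window
    covered = [0] * n_windows
    for start, end in repeat_intervals:
        pos = start
        while pos < end:
            w = pos // window
            # Clip the interval at the window boundary and credit the overlap to window w.
            piece_end = min((w + 1) * window, end)
            covered[w] += piece_end - pos
            pos = piece_end
    densities = []
    for w in range(n_windows):
        # The last window may be shorter than the others.
        w_len = min(window, seq_length - w * window)
        densities.append(covered[w] / w_len)
    return densities


# Hypothetical toy example: a 300-bp sequence, 100-bp windows, two annotated repeats.
print(repeat_density(300, [(50, 150), (250, 300)], window=100))
```

Running the same function with several window sizes shows how local extremes are averaged away as the window grows.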
Some regions of the genome are extraordinarily dense in repeats.
The prizewinner appears
to be a 525-kb region on chromosome Xp11, with an overall transposable
element density of
89%. This region contains a 200-kb segment with 98% density, as
well as a segment of
100 kb in which LINE1 sequences alone comprise 89% of the sequence.
In addition, there
are regions of more than 100 kb with extremely high densities of
Alu (> 56% at three loci,
including one on 7q11 with a 50-kb stretch of > 61% Alu) and the
ancient transposons MIR
(> 15% on chromosome 1p36) and LINE2 (> 18% on chromosome 22q12).
In contrast, some genomic regions are nearly devoid of repeats. The
absence of repeats
may be a sign of large-scale cis-regulatory elements that cannot
tolerate being interrupted
by insertions. The four regions with the lowest density of interspersed
repeats in the human
genome are the four homeobox gene clusters, HOXA, HOXB, HOXC and
HOXD (Fig.
21). Each locus contains regions of around 100 kb containing less
than 2% interspersed
repeats. Ongoing sequence analysis of the four HOX clusters in mouse,
rat and baboon
shows a similar absence of transposable elements, and reveals a
high density of conserved
noncoding elements (K. Dewar and B. Birren, manuscript in preparation).
The presence of
a complex collection of regulatory regions may explain why individual
HOX genes carried in
transgenic mice fail to show proper regulation.
It may be worth investigating other repeat-poor regions, such as
a region on chromosome
8q21 (1.5% repeat over 63 kb) containing a gene encoding a homeodomain
zinc-finger
protein (homologous to mouse pID 9663936), a region on chromosome
1p36 (5% repeat
over 100 kb) with no obvious genes and a region on chromosome 18q22
(4% over 100 kb)
containing three genes of unknown function (among which is KIAA0450).
It will be
interesting to see whether the homologous regions in the mouse genome
have similarly
resisted the insertion of transposable elements during rodent evolution.
We next focused on the correlation between the nature of
the transposons in a region and its GC content. We calculated the
density of each repeat
type as a function of the GC content in 50-kb windows (Fig. 22).
As has been reported [142,
173-176], LINE sequences occur at much higher
density in AT-rich regions (roughly fourfold
enriched), whereas SINEs (MIR, Alu) show the opposite trend (for
Alu, up to fivefold
lower in AT-rich DNA). LTR retroposons and DNA transposons show
a more uniform
distribution, dipping only in the most GC-rich regions.
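The GC content of fixed-size windows, against which the repeat densities are compared, can be computed as in the sketch below. It assumes non-overlapping windows and ignores ambiguous bases such as N; the function names are illustrative, and the 50-kb default simply mirrors the window size mentioned in the text.

```python
def gc_fraction(seq):
    """Fraction of G and C among unambiguous bases; None if the window has none."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        return None
    return gc / (gc + at)


def gc_by_window(sequence, window=50_000):
    """GC fraction of consecutive non-overlapping windows along a sequence."""
    return [gc_fraction(sequence[i:i + window]) for i in range(0, len(sequence), window)]


# Hypothetical toy sequence with 8-bp windows instead of 50 kb.
print(gc_by_window("GGCCAATT" + "ATATATAT", window=8))
```

Pairing each window's GC fraction with the repeat density of the same window, and grouping windows by GC content, gives the kind of relationship plotted in Fig. 22.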
The preference of LINEs for AT-rich DNA seems like a reasonable way
for a genomic
parasite to accommodate its host, by targeting gene-poor AT-rich
DNA and thereby
imposing a lower mutational burden. Mechanistically, selective targeting
is nicely explained
by the fact that the preferred cleavage site of the LINE endonuclease
is TTTT/A (where
the slash indicates the point of cleavage), which is used to prime
reverse transcription from
the poly(A) tail of LINE RNA [177].
The contrary behaviour of SINEs, however, is baffling. How do SINEs
accumulate in
GC-rich DNA, particularly if they depend on the LINE transposition
machinery [178]?
Notably, the same pattern is seen for the Alu-like B1 and the tRNA-derived
SINEs in
mouse and for MIR in human [142]. One possibility
is that SINEs somehow target GC-rich
DNA for insertion. The alternative is that SINEs initially insert
with the same proclivity for
AT-rich DNA as LINEs, but that the distribution is subsequently
reshaped by evolutionary
forces [142, 179].
We used the draft genome sequence to investigate this mystery by
comparing the
proclivities of young, adolescent, middle-aged and old Alus (Fig.
23). Strikingly, recent Alus
show a preference for AT-rich DNA resembling that of LINEs, whereas
progressively
older Alus show a progressively stronger bias towards GC-rich DNA.
These results
indicate that the GC bias must result from strong pressure: Fig.
23 shows that a 13-fold
enrichment of Alus in GC-rich DNA has occurred within the last 30
Myr, and possibly more
recently.
These results raise a new mystery. What is the force that produces
the great and rapid
enrichment of Alus in GC-rich DNA? One explanation may be that deletions
are more
readily tolerated in gene-poor AT-rich regions than in gene-rich
GC-rich regions, resulting in
older elements being enriched in GC-rich regions. Such an enrichment
is seen for
transposable elements such as DNA transposons (Fig. 24). However,
this effect seems too
slow and too small to account for the observed remodelling of the
Alu distribution. This can
be seen by performing a similar analysis for LINE elements (Fig.
25). There is no
significant change in the LINE distribution over the past 100 Myr,
in contrast to the rapid
change seen for Alu. There is an eventual shift after more than
100 Myr, although its
magnitude is still smaller than seen for Alus.
These observations indicate that there may be some force acting particularly
on Alus. This
could be a higher rate of random loss of Alus in AT-rich DNA, negative
selection against
Alus in AT-rich DNA or positive selection in favour of Alus in GC-rich
DNA. The first two
possibilities seem unlikely because AT-rich DNA is gene-poor and
tolerates the
accumulation of other transposable elements. The third seems more
feasible, in that it
involves selecting in favour of the minority of Alus in GC-rich
regions rather than against
the majority that lie in AT-rich regions. But positive selection
for Alus in GC-rich regions
would imply that they benefit the organism.
Schmid [180] has proposed such a function for
SINEs. This hypothesis is based on the
observation that in many species SINEs are transcribed under conditions
of stress, and the
resulting RNAs specifically bind a particular protein kinase (PKR)
and block its ability to
inhibit protein translation [181-183]. SINE
RNAs would thus promote protein translation under
stress. SINE RNA may be well suited to such a role in regulating
protein translation,
because it can be quickly transcribed in large quantities from thousands
of elements and it
can function without protein translation. Under this theory, there
could be positive selection
for SINEs in readily transcribed open chromatin such as is found
near genes. This could
explain the retention of Alus in gene-rich GC-rich regions. It is
also consistent with the
observation that SINE density in AT-rich DNA is higher near genes
[142].
Further insight about Alus comes from the relationship between Alu
density and GC content
on individual chromosomes (Fig. 26). There are two outliers. Chromosome
19 is even richer
in Alus than predicted by its (high) GC content; the chromosome
comprises 2% of the
genome, but contains 5% of Alus. On the other hand, chromosome Y
shows the lowest
density of Alus relative to its GC content, being higher than average
for GC content less
than 40% and lower than average for GC content over 40%. Even in
AT-rich DNA, Alus
are under-represented on chromosome Y compared with other young
interspersed repeats
(see below). These phenomena may be related to an unusually high
gene density on
chromosome 19 and an unusually low density of somatically active
genes on chromosome Y
(both relative to GC content). This would be consistent with the
idea that Alu correlates not
with GC content but with actively transcribed genes.
Our results may support the controversial idea that SINEs actually
earn their keep in the
genome. Clearly, much additional work will be needed to prove or
disprove the hypothesis
that SINEs are genomic symbionts.
Indirect studies have suggested that nucleotide substitution is not
uniform across mammalian genomes [184-187]. By studying
sets of repeat elements
belonging to a common cohort, one can directly measure nucleotide
substitution rates in
different regions of the genome. We find strong evidence that the
pattern of neutral
substitution differs as a function of local GC content (Fig. 27).
Because the results are
observed in repetitive elements throughout the genome, the variation
in the pattern of
nucleotide substitution seems likely to be due to differences in
the underlying mutational
process rather than to selection.
The effect can be seen most clearly by focusing on the substitution
process γ ↔ α, where
γ denotes GC or CG base pairs and α denotes AT or
TA base pairs. If K is the equilibrium
constant in the direction of α base pairs (defined by the ratio
of the forward, γ → α, and reverse, α → γ,
rates), then the equilibrium GC content should be 1/(1 + K). Two
observations emerge.
First, there is a regional bias in substitution patterns. The equilibrium
constant varies as a
function of local GC content: γ (GC) base pairs are more likely to
mutate towards α (AT) base pairs in
AT-rich regions than in GC-rich regions. For the analysis in Fig.
27, the equilibrium constant
K is 2.5, 1.9 and 1.2 when the draft genome sequence is partitioned
into three bins with
average GC content of 37, 43 and 50%, respectively. This bias could
be due to a reported
tendency for GC-rich regions to replicate earlier in the cell cycle
than AT-rich regions and
for guanine pools, which are limiting for DNA replication, to become
depleted late in the cell
cycle, thereby resulting in a small but significant shift in substitution
towards α (AT) base pairs [186,
188]. Another theory proposes that many substitutions
are due to differences in DNA repair
mechanisms, possibly related to transcriptional activity and thereby
to gene density and GC
content [185, 189, 190].
There is also an absolute bias in substitution patterns resulting
in directional pressure
towards lower GC content throughout the human genome. The genome
is not at equilibrium
with respect to the pattern of nucleotide substitution: the expected
equilibrium GC content
corresponding to the values of K above is 29, 35 and 44% for regions
with average GC
contents of 37, 43 and 50%, respectively. Recent observations on
SNPs [190] confirm that the
mutation pattern in GC-rich DNA is biased towards α (AT) base pairs;
it should be possible to
perform similar analyses throughout the genome with the availability
of 1.4 million SNPs [97,
191]. On the basis solely of nucleotide substitution
patterns, the GC content would be
expected to be about 7% lower throughout the genome.
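The relation between the equilibrium constant and the expected GC content can be checked with a few lines of arithmetic. The sketch below assumes, as in the text, that K is the ratio of the GC-to-AT rate to the AT-to-GC rate, so that the equilibrium GC fraction is 1/(1 + K); the function name is ours. Because the quoted K values are rounded, the results agree with the quoted equilibrium values only to within about a percentage point.

```python
def equilibrium_gc(k):
    """Equilibrium GC fraction for K = rate(GC -> AT) / rate(AT -> GC)."""
    if k < 0:
        raise ValueError("K must be non-negative")
    return 1.0 / (1.0 + k)


# (observed average GC fraction of the bin, K reported for that bin)
bins = [(0.37, 2.5), (0.43, 1.9), (0.50, 1.2)]

for observed_gc, k in bins:
    eq = equilibrium_gc(k)
    print(
        f"observed GC {observed_gc:.0%}, K = {k}: "
        f"equilibrium GC {eq:.1%}, shortfall {observed_gc - eq:.1%}"
    )
```

In each bin the equilibrium value lies below the observed GC content, which is the directional pressure towards lower GC content described above.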
What accounts for the higher GC content? One possible explanation
is that in GC-rich
regions, a considerable fraction of the nucleotides is likely to
be under functional constraint
owing to the high gene density. Selection on coding regions and
regulatory CpG islands may
maintain the higher-than-predicted GC content. Another is that throughout
the rest of the
genome, a constant influx of transposable elements tends to increase
GC content (Fig. 28).
Young repeat elements clearly have a higher GC content than their
surrounding regions,
except in extremely GC-rich regions. Moreover, repeat elements clearly
shift with age
towards a lower GC content, closer to that of the neighbourhood
in which they reside. Much
of the 'non-repeat' DNA in AT-rich regions probably consists of
ancient repeats that are not
detectable by current methods and that have had more time to approach
the local
equilibrium value.
The repeats can also be used to study how the mutation process is
affected by the
immediately adjacent nucleotide. Such 'context effects' will be
discussed elsewhere (A. Kas
and A. F. A. Smit, unpublished results).
The pattern of interspersed repeats can be used to shed
light on the unusual evolutionary history of chromosome Y. Our analysis
shows that the
genetic material on chromosome Y is unusually young, probably owing
to a high tolerance
for gain of new material by insertion and loss of old material by
deletion. Several lines of
evidence support this picture. For example, LINE elements on chromosome
Y are on
average much younger than those on autosomes (not shown). Similarly,
MaLR-family
retroposons on chromosome Y are younger than those on autosomes,
with the
representation of subfamilies showing a strong inverse correlation
with the age of the
subfamily. Moreover, chromosome Y has a relative over-representation
of the younger
retroviral class II (ERVK) and a relative under-representation of
the primarily older class
III (ERVL) compared with other chromosomes. Overall, chromosome
Y seems to maintain
a youthful appearance by rapid turnover.
Interspersed repeats on chromosome Y can also be used to estimate
the relative mutation
rates, m and f, in the male and female germlines. Chromosome Y always
resides in males,
whereas chromosome X resides in females twice as often as in males.
The substitution
rates, Y and X, on these two chromosomes should thus be in the ratio
Y:X =
(m):(m + 2f)/3, provided that one considers equivalent neutral sequences.
Several authors
have estimated the mutation rate in the male germline to be fivefold
higher than in the
female germline, by comparing the rates of evolution of X- and Y-linked
genes in humans
and primates. However, Page and colleagues [192]
have challenged these estimates as too
high. They studied a 39-kb region that is apparently devoid of genes
and resides within a
large segmental duplication from X to Y that occurred 3–4 Myr ago
in the human lineage.
On the basis of phylogenetic analysis of the sequence on human Y
and human, chimp and
gorilla X, they obtained a much lower estimate of Y:X = 1.36, corresponding
to m:f =
1.7. They suggested that the other estimates may have been higher
because they were
based on much longer evolutionary periods or because the genes studied
may have been
under selection.
Our database of human repeats provides a powerful resource for addressing
this question.
We identified the repeat elements from recent subfamilies (effectively,
birth cohorts dating
from the past 50 Myr) and measured the substitution rates for subfamily
members on
chromosomes X and Y (Fig. 29). There is a clear linear relationship
with a slope of Y:X
= 1.57 corresponding to m:f = 2.1. The estimate is in reasonable
agreement with that of
Page et al., although it is based on much more total sequence (360
kb on Y, 1.6 Mb on X)
and a much longer time period. In particular, the discrepancy with
earlier reports is not
explained by recent changes in the human lineage. Various theories
have been proposed for
the higher mutation rate in the male germline, including the greater
number of cell divisions
in the formation of sperm than eggs and different repair mechanisms
in sperm and eggs.
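The conversion between the observed Y:X substitution ratio and the male:female mutation ratio follows directly from the relation given above. Writing r = m/f, the ratio Y:X = m:(m + 2f)/3 becomes Y/X = 3r/(r + 2), which rearranges to r = 2(Y/X)/(3 − Y/X). The sketch below implements both directions; the function names are ours. With the rounded slope of 1.57 it returns a value near 2.2, close to the 2.1 quoted in the text; the small difference may reflect rounding of the slope.

```python
def y_to_x_ratio(m_to_f):
    """Expected Y:X substitution ratio for a given male:female mutation ratio."""
    if m_to_f < 0:
        raise ValueError("m:f must be non-negative")
    return 3.0 * m_to_f / (m_to_f + 2.0)


def male_to_female_ratio(y_to_x):
    """Invert Y/X = 3r / (r + 2) to recover r = m/f.

    Y/X approaches 3 as r grows without bound, so only 0 < Y/X < 3 is valid.
    """
    if not 0.0 < y_to_x < 3.0:
        raise ValueError("Y:X must lie strictly between 0 and 3")
    return 2.0 * y_to_x / (3.0 - y_to_x)


for label, ratio in [("39-kb X-to-Y duplication", 1.36), ("repeat cohorts", 1.57)]:
    print(f"{label}: Y:X = {ratio} gives m:f = {male_to_female_ratio(ratio):.2f}")
```

The upper bound of 3 on Y:X shows why the estimate of m:f becomes very sensitive to measurement error when the male bias is large.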
We were interested in identifying the youngest retrotransposons in the draft genome sequence. This set should contain the currently active retrotransposons, as well as the insertion sites that are still polymorphic in the human population.
The youngest branch in the phylogenetic tree of human LINE1 elements
is called L1Hs
(ref. 158); it differs in its 3' untranslated
region (UTR) by 12 diagnostic substitutions from
the next oldest subfamily (L1PA2). Within the L1Hs family, there
are two subsets referred
to as Ta and pre-Ta, defined by a diagnostic trinucleotide [193,
194]. All active L1 elements are
thought to belong to these two subsets, because they account for
all 14 known cases of
human disease arising from new L1 transposition (with 13 belonging
to the Ta subset and
one to the pre-Ta subset) [195, 196]. These
subsets are also of great interest for population
genetics because at least 50% are still segregating as polymorphisms
in the human
population [194, 197]; they
provide powerful markers for tracing population history because
they represent unique (non-recurrent and non-revertible) genetic
events that can be used
(along with similarly polymorphic Alus) for reconstructing human
migrations.
LINE1 elements that are retrotransposition-competent should consist
of a full-length
sequence and should have both ORFs intact. Eleven such elements
from the Ta subset have
been identified, including the likely progenitors of mutagenic insertions
into the factor VIII
and dystrophin genes [198-202]. A cultured cell
retrotransposition assay has revealed that eight
of these elements remain retrotransposition-competent [200,
202,
203].
We searched the draft genome sequence and identified 535 LINEs belonging
to the Ta
subset and 415 belonging to the pre-Ta subset. These elements provide
a large collection of
tools for probing human population history. We also identified those
consisting of full-length
elements with intact ORFs, which are candidate active LINEs. We
found 39 such elements
belonging to the Ta subset and 22 belonging to the pre-Ta subset;
this substantially increases
the number in the first category and provides the first known examples
in the second
category. These elements can now be tested for retrotransposition
competence in the cell
culture assay. Preliminary analysis resulted in the identification
of two of these elements as
the likely progenitors of mutagenic insertions into the β-globin
and RP2 genes (R. Badge
and J. V. Moran, unpublished data). Similar analyses should allow
the identification of the
progenitors of most, if not all, other known mutagenic L1 insertions.
L1 elements can carry extra DNA if transcription extends through
the native transcriptional
termination site into flanking genomic DNA. This process, termed
L1-mediated
transduction, provides a means for the mobilization of DNA sequences
around the genome
and may be a mechanism for 'exon shuffling' [204].
Twenty-one per cent of the 71 full-length
L1s analysed contained non-L1-derived sequences before the 3' target-site
duplication site,
in cases in which the site was unambiguously recognizable. The length
of the transduced
sequence was 30–970 bp, supporting the suggestion that 0.5–1.0%
of the human genome
may have arisen by LINE-based transduction of 3' flanking sequences
[205, 206].
Our analysis also turned up two instances of 5' transduction (145
bp and 215 bp). Although
this possibility had been suggested on the basis of cell culture
models [195, 203], these are the
first documented examples. Such events may arise from transcription
initiating in a cellular
promoter upstream of the L1 elements. L1 transcription is generally
confined to the
germline [207, 208], but transcription from
other promoters could explain a somatic L1
retrotransposition event that resulted in colon cancer [206].
Transposons as a creative force.
The primary force for the origin and expansion of most transposons
has been selection for their ability to create progeny, and not a selective
advantage for the host. However, these selfish pieces of DNA have been
responsible for important innovations in many genomes, for example by contributing
regulatory elements
and even new genes.
Twenty human genes have been recognized as probably derived from
transposons [142, 209].
These include the RAG1 and RAG2 recombinases and the major centromere-binding
protein CENPB. We scanned the draft genome sequence and identified
another 27 cases,
bringing the total to 47 (Table 13; refs 142,
209).
All but four are derived from DNA
transposons, which give rise to only a small proportion of the interspersed
repeats in the
genome. Why there are so many DNA transposase-like genes, many of
which still contain
the critical residues for transposase activity, is a mystery.
To illustrate this concept, we describe the discovery of one of the
new examples. We
searched the draft genome sequence to identify the autonomous DNA
transposon
responsible for the distribution of the non-autonomous MER85 element,
one of the most
recently (40–50 Myr ago) active DNA transposons. Most non-autonomous
elements are
internal deletion products of a DNA transposon. We identified one
instance of a large
(1,782 bp) ORF flanked by the 5' and 3' halves of a MER85 element.
The ORF encodes a
novel protein (partially published as pID 6453533) whose closest
homologue is the
transposase of the piggyBac DNA transposon, which is found in insects
and has the same
characteristic TTAA target-site duplications [210]
as MER85. The ORF is actively transcribed
in fetal brain and in cancer cells. That it has not been lost to
mutation in 40–50 Myr of
evolution (whereas the flanking, noncoding, MER85-like termini show
the typical divergence
level of such elements) and is actively transcribed provides strong
evidence that it has been
adopted by the human genome as a gene. Its function is unknown.
LINE1 activity clearly has also had fringe benefits. We mentioned
above the possibility of
exon reshuffling by cotranscription of neighbouring DNA. The LINE1
machinery can also
cause reverse transcription of genic mRNAs, which typically results
in nonfunctional
processed pseudogenes but can, occasionally, give rise to functional
processed genes. There
are at least eight human and eight mouse genes for which evidence
strongly supports such
an origin [211] (see http://www-ifi.uni-muenster.de/exapted-retrogenes/tables.html).
Many
other intronless genes may have been created in the same way.
Transposons have made other creative contributions to the genome.
A few hundred genes,
for example, use transcriptional terminators donated by LTR retroposons
(data not shown).
Other genes employ regulatory elements derived from repeat elements
[211].
Simple sequence repeats (SSRs) are a rather different type of repetitive
structure that is
common in the human genome—perfect or slightly imperfect tandem
repeats of a particular
k-mer. SSRs with a short repeat unit (k = 1–13 bases) are often
termed microsatellites,
whereas those with longer repeat units (k = 14–500 bases) are often
termed minisatellites.
With the exception of poly(A) tails from reverse transcribed messages,
SSRs are thought to
arise by slippage during DNA replication [212, 213].
We compiled a catalogue of all SSRs over a given length in the human
draft genome
sequence, and studied their properties (Table 14). SSRs comprise
about 3% of the human
genome, with the greatest single contribution coming from dinucleotide
repeats (0.5%). (The
precise criteria for the number of repeat units and the extent of
divergence allowed in an
SSR affect the exact census, but not the qualitative conclusions.)
There is approximately one SSR per 2 kb (the number of nonoverlapping
tandem repeats is
437 per Mb). The catalogue confirms various properties of SSRs that
have been inferred
from sampling approaches (Table 15). The most frequent dinucleotide
repeats are AC and
AT (50 and 35% of dinucleotide repeats, respectively), whereas AG
repeats (15%) are less
frequent and GC repeats (0.1%) are greatly under-represented. The
most frequent
trinucleotides are AAT and AAC (33% and 21%, respectively), whereas
ACC (4.0%),
AGC (2.2%), ACT (1.4%) and ACG (0.1%) are relatively rare. Overall,
trinucleotide SSRs
are much less frequent than dinucleotide SSRs [214].
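A catalogue of this kind can be approximated for perfect repeats with a short scan. The sketch below is an illustration only: it finds exact tandem repeats with a regular expression, and the thresholds (unit length of 1 to 6 bases, at least four copies) and the function name are assumptions chosen for the example, not the criteria used for Table 14. It does not detect slightly imperfect repeats, and it does not merge equivalent units (for example AC and CA) into one class.

```python
import re


def find_perfect_ssrs(seq, min_unit=1, max_unit=6, min_copies=4):
    """Return (start, end, unit, copies) for perfect tandem repeats.

    Coordinates are 0-based with an exclusive end. The lazy quantifier makes
    the shortest repeating unit win, so AAAAAA is reported as unit A rather
    than AA, and matches do not overlap.
    """
    if min_copies < 2:
        raise ValueError("a tandem repeat needs at least two copies")
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = (match.end() - match.start()) // len(unit)
        hits.append((match.start(), match.end(), unit, copies))
    return hits


example = "GGT" + "AC" * 5 + "T" * 8 + "G"
for start, end, unit, copies in find_perfect_ssrs(example):
    print(f"{start}-{end}: ({unit}){copies}")
```

Dividing the number of hits by the sequence length gives a density comparable in form to the per-megabase figure quoted above, subject to the chosen thresholds.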
SSRs have been extremely important in human genetic studies, because
they show a high
degree of length polymorphism in the human population owing to frequent
slippage by DNA
polymerase during replication. Genetic markers based on SSRs, particularly
(CA)n
repeats, have been the workhorse of most human disease-mapping studies [101,
102]. The
availability of a comprehensive catalogue of SSRs is thus a boon
for human genetic studies.
The SSR catalogue also allowed us to resolve a mystery regarding
mammalian genetic
maps. Such genetic maps in rat, mouse and human have a deficit of
polymorphic (CA)n
repeats on chromosome X [30, 101]. There are two possible explanations
for this deficit. There
may simply be fewer (CA)n repeats on chromosome X; or (CA)n repeats
may be as dense
on chromosome X but less polymorphic in the population. In fact,
analysis of the draft
genome sequence shows that chromosome X has the same density of
(CA)n repeats per
Mb as the autosomes (data not shown). Thus, the deficit of polymorphic
markers relative
to autosomes results from population genetic forces. Possible explanations
include that
chromosome X has a smaller effective population size, experiences
more frequent selective
sweeps reducing diversity (owing to its hemizygosity in males),
or has a lower mutation rate
(owing to its more frequent passage through the less mutagenic female
germline). The
availability of the draft genome sequence should provide ways to
test these alternative
explanations.
A remarkable feature of the human genome is the segmental duplication
of portions of
genomic sequence [215-217]. Such duplications involve the transfer
of 1–200-kb blocks of
genomic sequence to one or more locations in the genome. The locations
of both donor and
recipient regions of the genome are often not tandemly arranged,
suggesting mechanisms
other than unequal crossing-over for their origin. They are relatively
recent, inasmuch as
strong sequence identity is seen in both exons and introns (in contrast
to regions that are
considered to show evidence of ancient duplications, characterized
by similarities only in
coding regions). Indeed, many such duplications appear to have arisen
in very recent
evolutionary time, as judged by high sequence identity and by their
absence in closely
related species.
Segmental duplications can be divided into two categories. First,
interchromosomal
duplications are defined as segments that are duplicated among nonhomologous
chromosomes. For example, a 9.5-kb genomic segment of the adrenoleukodystrophy
locus
from Xq28 has been duplicated to regions near the centromeres of
chromosomes 2, 10, 16
and 22 (refs 218, 219). Anecdotal observations suggest that many
interchromosomal
duplications map near the centromeric and telomeric regions of human
chromosomes [218-233].
The second category is intrachromosomal duplications, which occur
within a particular
chromosome or chromosomal arm. This category includes several duplicated
segments, also
known as low copy repeat sequences, that mediate recurrent chromosomal
structural
rearrangements associated with genetic disease [215, 217]. Examples
on chromosome 17
include three copies of a roughly 200-kb repeat separated by around
5 Mb and two
copies of a roughly 24-kb repeat separated by 1.5 Mb. The copies
are so similar (99%
identity) that paralogous recombination events can occur, giving
rise to contiguous gene
syndromes: Smith–Magenis syndrome and Charcot–Marie–Tooth syndrome
1A,
respectively [34, 234]. Several other examples are known and are also
suspected to be
responsible for recurrent microdeletion syndromes (for example,
Prader–Willi/Angelman,
velocardiofacial/DiGeorge and Williams' syndromes [215, 235-240]).
Until now, the identification and characterization of segmental duplications
have been based
on anecdotal reports—for example, finding that certain probes hybridize
to multiple
chromosomal sites or noticing duplicated sequence at certain recurrent
chromosomal
breakpoints. The availability of the entire genomic sequence will
make it possible to explore
the nature of segmental duplications more systematically. This analysis
can begin with the
current state of the draft genome sequence, although caution is
required because some
apparent duplications may arise from a failure to merge sequence
contigs from overlapping
clones. Conversely, erroneous assembly of closely related sequences
from nonoverlapping
clones may cause the true frequency of such features to be underestimated, particularly
among those
segments with the highest sequence similarity. Accordingly, we adopted
a conservative
approach for estimating such duplication from the available draft
genome sequence.
Pericentromeres and subtelomeres.
We began by re-evaluating the finished sequences
of chromosomes 21 and 22. The initial papers on these chromosomes [93,
94] noted some
instances of interchromosomal duplication near each centromere.
With the ability now to
compare these chromosomes to the vast majority of the genome, it
is apparent that the
regions near the centromeres consist almost entirely of interchromosomal
duplicated
segments, with little or no unique sequence. Smaller regions of
interchromosomal duplication
are also observed near the telomeres.
Chromosome 22 contains a region of 1.5 Mb adjacent to the centromere
in which 90% of
sequence can now be recognized to consist of interchromosomal duplication
(Fig. 30).
Conversely, 52% of the interchromosomal duplications on chromosome
22 were located in
this region, which comprises only 5% of the chromosome. Also, the
subtelomeric end
consists of a 50-kb region consisting almost entirely of interchromosomal
duplications.
Chromosome 21 presents a similar landscape (Fig. 31). The first 1
Mb after the centromere
is composed of interchromosomal repeats, as well as the largest
(> 200 kb) block of
intrachromosomally duplicated material. Again, most interchromosomal
duplications on the
chromosome map to this region and the most subtelomeric region (30
kb) shows extensive
duplication among nonhomologous chromosomes.
The pericentromeric regions are structurally very complex, as illustrated
for chromosome 21
in Fig. 32a. The pericentromeric regions appear to have been bombarded
by successive
insertions of duplications; the insertion events must be fairly
recent because the degree of
sequence conservation with the genomic source loci is fairly high
(90–100%, with an
apparent peak around 96%). Distinct insertions are typically separated
by AT-rich or
GC-rich minisatellite-like repeats that have been hypothesized to
have a functional role in
targeting duplications to these regions [233, 241].
A single genomic source locus often gives rise to pericentromeric
copies on multiple
chromosomes, with each having essentially the same breakpoints and
the same degree of
divergence. An example of such a source locus on Xq28 is shown in
Fig. 32b. Phylogenetic
analysis has suggested a two-step mechanism for the origin and dispersal
of these
segments, whereby an initial segmental duplication in the pericentromeric
region of one
chromosome occurs and is then redistributed as part of a larger
cassette to other such
regions [242].
A comprehensive analysis for all chromosomes will have to await complete
sequencing of
the genome, but the evidence from the draft genome sequence indicates
that the same
picture is likely to be seen throughout the genome. Several papers
have analysed finished
segments within pericentromeric regions of chromosomes 2 (160 kb),
10 (400 kb) and 16
(300 kb), all of which show extensive interchromosomal segmental
duplication [215, 219, 232,
233]. An example from another pericentromeric region on chromosome
11 is shown in Fig.
32c. Interchromosomal duplications in subtelomeric regions also
appear to be a fairly
general phenomenon, as illustrated by a large tract (500 kb) of
complex duplication on
chromosome 7 (Fig. 32d).
The explanation for the clustering of segmental duplications may
be that the genome has a
damage-control mechanism whereby chromosomal breakage products are
preferentially
inserted into pericentromeric and, to a lesser extent, subtelomeric
regions. The possibility of
a specific mechanism for the insertion of these sequences has been
suggested on the basis
of the unusual sequences found flanking the insertions. Although
it is also possible that these
regions simply have greater tolerance for large insertions, many
large gene-poor 'deserts'
have been identified [93] and there is no accumulation of duplicated
segments within these
regions. Along with the fact that transitions between duplicons
(from different regions of the
genome) occur at specific sequences, this suggests that active recruitment
of duplications to
such regions may occur. In any case, the duplicated regions are
in general young (with
many duplications showing <6% nucleotide divergence from their
source loci) and in
constant flux, both through additional duplications and by large-scale
exchange among
similar chromosomal environments. There is evidence of structural
polymorphism in the
human population, such as the presence or absence of olfactory receptor
segments located
within the telomeric regions of several human chromosomes [226, 227].
Genome-wide analysis of segmental duplications.
We also performed a global
genome-wide analysis to characterize the amount of segmental duplication
in the genome.
We 'repeat-masked' the known interspersed repeats in the draft genome
sequence and
compared the remaining draft genomic sequence with itself in a massive
all-by-all BLASTN
similarity search. We excluded matches in which the sequence identity
was so high that it
might reflect artefactual duplications resulting from a failure
to overlap sequence contigs
correctly in assembling the draft genome sequence. Specifically,
we considered only
matches with less than 99.5% identity for finished sequence and
less than 98% identity for
unfinished sequence.
We took several approaches to avoid counting artefactual duplications
in the sequence. In
the first approach, we studied only finished sequence. We compared
the finished sequence
with itself, to identify segments of at least 1 kb and 90–99.5%
sequence identity. This
analysis will underestimate the extent of segmental duplication,
because it requires that at
least two copies of the segment are present in the finished sequence
and because some true
duplications have over 99.5% identity.
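The filtering rules described above can be expressed compactly. The sketch below is an illustration under stated assumptions rather than the procedure actually run: the Match record and its field names are hypothetical, a match is treated as 'finished' only when both aligned segments come from finished sequence, and the 1-kb minimum length is applied to every match, although the text states it only for the finished-sequence analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """One pairwise similarity hit (hypothetical record layout)."""
    query_chrom: str
    subject_chrom: str
    length: int          # aligned length in bp
    identity: float      # per cent identity
    finished: bool       # assumed: True only if both segments are finished


def is_candidate_duplication(match, min_length=1000, min_identity=90.0):
    """Keep matches of 90% identity or more that fall below the artefact cutoff.

    Matches at or above 99.5% (finished) or 98% (unfinished) are excluded,
    because they may reflect contigs that failed to merge during assembly.
    """
    max_identity = 99.5 if match.finished else 98.0
    return (
        match.length >= min_length
        and min_identity <= match.identity < max_identity
    )


def classify(match):
    """Label a match by whether both copies lie on the same chromosome."""
    if match.query_chrom == match.subject_chrom:
        return "intrachromosomal"
    return "interchromosomal"
```

Because the upper cutoff removes genuine duplications of very high identity along with assembly artefacts, counts produced by a filter of this form are lower bounds, as noted in the text.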
The finished sequence consists of at least 3.3% segmental duplication
(Table 16).
Interchromosomal duplication accounts for about 1.5% and intrachromosomal
duplication
for about 2%, with some overlap (0.2%) between these categories.
We analysed the
lengths and divergence of the segmental duplications (Fig. 33).
The duplications tend to be
large (10–50 kb) and highly similar in sequence, especially for the interchromosomal
segments. The
sequence identity for the interchromosomal duplications appears
to peak between 96.5%
and 97.5%. This may indicate that interchromosomal duplications
occurred in a punctuated
manner. It will be intriguing to investigate whether such genomic
upheaval has a role in
speciation events.
In a second approach, we compared the entire human draft genome sequence
(finished and
unfinished) with itself to identify duplications with 90–98% sequence
identity (Table 17).
The draft genome sequence contains at least 3.6% segmental duplication.
The actual
proportion will be significantly higher, because we excluded many
true matches with more
than 98% sequence identity (at least 1.1% of the finished sequence).
Although exact
measurement must await a finished sequence, the human genome seems
likely to contain
about 5% segmental duplication, with most of this sequence in large
blocks (> 10 kb). Such
a high proportion of large duplications clearly distinguishes the
human genome from other
sequenced genomes, such as the fly and worm (Table 18).
The structure of large highly paralogous regions presents one of
the 'serious and
unanticipated challenges' to producing a finished sequence of the
genome [46]. The absence of
unique STS or fingerprint signatures over large genomic distances
(1 Mb) and the high
degree of sequence similarity make the distinction between paralogous
sequence variation
and allelic polymorphism problematic. Furthermore, the fact that
such regions frequently
harbour intron–exon structures of genuine unique sequence will complicate
efforts to
generate a genome-wide SNP map. The data indicate that a modest
portion of the human
genome may be relatively recalcitrant to genomic-based methods for
SNP detection. Owing
to their repetitive nature and their location in the genome, segmental
duplications may well
be underestimated by the current analysis. An understanding of the
biology, pathology and
evolution of these duplications will require specialized efforts
within these exceptional
regions of the human genome. The presence and distribution of such
segments may provide
evolutionary fodder for processes of exon shuffling and a general
increase in protein
diversity associated with domain accretion. It will be important
to consider both
genome-wide duplication events and more restricted punctuated events
of genome
duplication as forces in the evolution of vertebrate genomes.
...
Gene content of the human genome
Genes (or at least their coding regions) comprise only a tiny fraction
of human DNA, but
they represent the major biological function of the genome and the
main focus of interest by
biologists. They are also the most challenging feature to identify
in the human genome
sequence.
The ultimate goal is to compile a complete list of all human genes
and their encoded
proteins, to serve as a 'periodic table' for biomedical research
[243]. But this is a difficult task.
In organisms with small genomes, it is straightforward to identify
most genes by the
presence of long ORFs. In contrast, human genes tend to have small
exons (encoding an
average of only 50 codons) separated by long introns (some exceeding
10 kb). This creates
a signal-to-noise problem, with the result that computer programs
for direct gene prediction
have only limited accuracy. Instead, computational prediction of
human genes must rely
largely on the availability of cDNA sequences or on sequence conservation
with genes and
proteins from other organisms. This approach is adequate for strongly
conserved genes
(such as histones or ubiquitin), but may be less sensitive to rapidly
evolving genes (including
many crucial to speciation, sex determination and fertilization).
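The contrast between long ORFs in small genomes and short, interrupted exons in human DNA can be made concrete with a minimal sketch. The function below is an illustration only and is not one of the gene-prediction programs used in this analysis; the function name and the choice to scan only the forward strand are assumptions made for brevity.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf_codons(seq: str) -> int:
    """Length in codons (ATG included, stop excluded) of the longest
    ATG-to-stop ORF in the three forward reading frames.
    ORFs that run off the end of the sequence are not counted."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None  # index of the first ATG since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOP_CODONS:
                if start is not None:
                    best = max(best, (i - start) // 3)
                start = None
            elif codon == "ATG" and start is None:
                start = i
    return best
```

Applied to an uninterrupted gene, such a scan returns one long ORF. Applied to a human gene whose exons average only about 50 codons and are separated by long introns, it returns only short fragments that are hard to distinguish from chance ORFs, which is the signal-to-noise problem described above.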
Here we describe our efforts to recognize both the RNA genes and
protein-coding genes in
the human genome. We also study the properties of the predicted
human protein set,
attempting to discern how the human proteome differs from those
of invertebrates such as
worm and fly.
Noncoding RNAs
Although biologists often speak of a tight coupling between 'genes
and their encoded protein
products', it is important to remember that thousands of human genes
produce noncoding
RNAs (ncRNAs) as their ultimate product [244].
There are several major classes of ncRNA.
(1) Transfer RNAs (tRNAs) are the adapters that translate the triplet
nucleic acid code of
RNA into the amino-acid sequence of proteins; (2) ribosomal RNAs
(rRNAs) are also
central to the translational machinery, and recent X-ray crystallography
results strongly
indicate that peptide bond formation is catalysed by rRNA, not protein
[245, 246]; (3) small
nucleolar RNAs (snoRNAs) are required for rRNA processing and base
modification in the
nucleolus [247, 248]; and
(4) small nuclear RNAs (snRNAs) are critical components of
spliceosomes, the large ribonucleoprotein (RNP) complexes that splice
introns out of
pre-mRNAs in the nucleus. Humans have both a major, U2 snRNA-dependent
spliceosome
that splices most introns, and a minor, U12 snRNA-dependent spliceosome
that splices a
rare class of introns that often have AT/AC dinucleotides at the
splice sites instead of the
canonical GT/AG splice site consensus [249].
Other ncRNAs include both RNAs of known biochemical function (such
as telomerase
RNA and the 7SL signal recognition particle RNA) and ncRNAs of enigmatic
function
(such as the large Xist transcript implicated in X dosage compensation
[250], or the small vault
RNAs found in the bizarre vault ribonucleoprotein complex [251],
which is three times the
mass of the ribosome but has unknown function).
ncRNAs do not have translated ORFs, are often small and are not polyadenylated.
Accordingly, novel ncRNAs cannot readily be found by computational
gene-finding
techniques (which search for features such as ORFs) or experimental
sequencing of cDNA
or EST libraries (most of which are prepared by reverse transcription
using a primer
complementary to a poly(A) tail). Even if the complete finished
sequence of the human
genome were available, discovering novel ncRNAs would still be challenging.
We can,
however, identify genomic sequences that are homologous to known
ncRNA genes, using
BLASTN or, in some cases, more specialized methods.
It is sometimes difficult to tell whether such homologous genes are
orthologues, paralogues
or closely related pseudogenes (because inactivating mutations are
much less obvious than
for protein-coding genes). For tRNA, there is sufficiently detailed
information about the
cloverleaf secondary structure to allow true genes and pseudogenes
to be distinguished with
high sensitivity. For many other ncRNAs, there is much less structural
information and so
we employ an operational criterion of high sequence similarity (>
95% sequence identity and
> 95% full length) to distinguish true genes from pseudogenes. These
assignments will
eventually need to be reconciled with experimental data.
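The operational criterion can be written as a simple filter over similarity-search hits. The sketch below is a hypothetical illustration: the `Hit` record and its field names are assumptions and do not correspond to the output format of BLASTN or of any tool used here.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    identity: float       # fraction of identical bases in the alignment (0 to 1)
    aligned_length: int   # number of query bases covered by the alignment
    query_length: int     # full length of the known ncRNA query

def classify_hit(hit: Hit, min_identity: float = 0.95,
                 min_coverage: float = 0.95) -> str:
    """Apply the > 95% identity and > 95% full-length criterion."""
    coverage = hit.aligned_length / hit.query_length
    if hit.identity > min_identity and coverage > min_coverage:
        return "true gene"
    return "putative pseudogene"
```

Both thresholds must be exceeded for a hit to be called a true gene; a near-identical but truncated copy is therefore classed as a putative pseudogene.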
Transfer RNA genes.
The classical experimental estimate of the number of human tRNA
genes is 1,310 (ref. 252). In the draft genome
sequence, we find only 497 human tRNA
genes (Tables 19, 20). How do we account for this discrepancy? We
believe that the
original estimate is likely to have been inflated in two respects.
First, it came from a
hybridization experiment that probably counted closely related pseudogenes;
by analysis of
the draft genome sequence, there are in fact 324 tRNA-derived putative
pseudogenes
(Table 20). Second, the earlier estimate assumed too high a value
for the size of the human
genome; repeating the calculation using the correct value yields
an estimate of about 890
tRNA-related loci, which is in reasonable accord with our count
of 821 tRNA genes and
pseudogenes in the draft genome sequence.
The human tRNA gene set predicted from the draft genome sequence
appears to include
most of the known human tRNA species. The draft genome sequence
contains 37 of 38
human tRNA species listed in a tRNA database [253],
allowing for up to one mismatch. This
includes one copy of the known gene for a specialized selenocysteine
tRNA, one of several
components of a baroque translational mechanism that reads UGA as
a selenocysteine
codon in certain rare mRNAs that carry a specific cis-acting RNA
regulatory site (a
so-called SECIS element) in their 3' UTRs. The one tRNA gene in
the database not found
in the draft genome sequence is DE9990, a tRNAGlu species, which
differs in two positions
from the most related tRNA gene in the human genome. Possible explanations
are that the
database version of this tRNA contains two errors, the gene is polymorphic
or this is a
genuine functional tRNA that is missing from the draft genome sequence.
(The database
also lists one additional tRNA gene (DS9994), but this is apparently
a contaminant, most
similar to bacterial tRNAs; the parent entry (Z13399) was withdrawn
from the DNA
database, but the tRNA entry has not yet been removed from the tRNA
database.)
Although the human set appears substantially complete by this test,
the tRNA gene numbers
in Table 19 should be considered tentative and used with caution.
The human and fly (but
not the worm) are known to be missing significant amounts of heterochromatic
DNA, and
additional tRNA genes could be located there.
With this caveat, the results indicate that the human has fewer tRNA
genes than the worm,
but more than the fly. This may seem surprising, but tRNA gene number
in metazoans is
thought to be related not to organismal complexity, but more to
idiosyncrasies of the demand
for tRNA abundance in certain tissues or stages of embryonic development.
For example,
the frog Xenopus laevis, which must load each oocyte with a remarkable
40 ng of tRNA,
has thousands of tRNA genes [254].
The degeneracy of the genetic code has allowed an inspired economy
of tRNA anticodon
usage. Although 61 sense codons need to be decoded, not all 61 different
anticodons are
present in tRNAs. Rather, tRNAs generally follow stereotyped and
conserved wobble
rules [255-257]. Wobble reduces the number of
required anticodons substantially, and provides
a connection between the genetic code and the hybridization stability
of modified and
unmodified RNA bases. In eukaryotes, the rules proposed by Guthrie
and Abelson [256]
predict that about 46 tRNA species will be sufficient to read the
61 sense codons (counting
the initiator and elongator methionine tRNAs as two species). According
to these rules, in
the codon's third (wobble) position, U and C are generally decoded
by a single tRNA
species, whereas A and G are decoded by two separate tRNA species.
In 'two-codon boxes' of the genetic code (where codons ending with
U/C encode a
different amino acid from those ending with A/G), the U/C wobble
position should be
decoded by a G at position 34 in the tRNA anticodon. Thus, in the
top left of Fig. 34, there
is no tRNA with an AAA anticodon for Phe, but the GAA anticodon
can recognize both
UUU and UUC codons in the mRNA. In 'four-codon boxes' of the genetic
code (where U,
C, A and G in the wobble position all encode the same amino acid),
the U/C wobble position
is almost always decoded by I34 (inosine) in the tRNA, where the
inosine is produced by
post-transcriptional modification of an adenine (A). In the bottom
left of Fig. 34, for
example, the GUU and GUC codons of the four-codon Val box are decoded
by a tRNA
with an anticodon of AAC, which is no doubt modified to IAC. Presumably
this pattern,
which is strikingly conserved in eukaryotes, has to do with the
fact that IA base pairs are
also possible; thus the IAC anticodon for a Val tRNA could recognize
GUU, GUC and
even GUA codons. Were this same I34 to be utilized in two-codon
boxes, however,
misreading of the NNA codon would occur, resulting in translational
havoc. Eukaryotic
glycine tRNAs represent a conserved exception to this last rule;
they use a GCC anticodon
to decode GGU and GGC, rather than the expected ICC anticodon.
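The rules for the U/C wobble position can be summarized in a short sketch. It covers only the cases described in this paragraph (G34 in two-codon boxes, a gene-encoded A34 that is modified to inosine in four-codon boxes, and the glycine exception); the function name and the string encoding of the box type are assumptions made for illustration.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def expected_anticodon(first_two: str, box: str) -> str:
    """Gene-encoded anticodon (5'->3') expected to read the NNU and NNC
    codons of a codon box.

    first_two: first two codon bases, e.g. "UU" (Phe) or "GU" (Val).
    box: "two" for a two-codon box, "four" for a four-codon box.
    """
    if box not in ("two", "four"):
        raise ValueError("box must be 'two' or 'four'")
    # Anticodon positions 35 and 36 pair with codon bases 2 and 1.
    tail = COMPLEMENT[first_two[1]] + COMPLEMENT[first_two[0]]
    if first_two == "GG":
        wobble = "G"   # conserved glycine exception: GCC rather than ICC
    elif box == "two":
        wobble = "G"   # G34 reads both U and C
    else:
        wobble = "A"   # encoded as A, modified to inosine (I34)
    return wobble + tail
```

For the examples in the text, the sketch gives GAA for the Phe two-codon box, AAC (read as IAC after modification) for the Val four-codon box and GCC for glycine.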
Satisfyingly, the human tRNA set follows these wobble rules almost
perfectly (Fig. 34).
Only three unexpected tRNA species are found: single genes for a
tRNATyr-AUA,
tRNAIle-GAU, and tRNAAsn-AUU. Perhaps these are pseudogenes, but
they appear to
be plausible tRNAs. We also checked the possibility of sequencing
errors in their
anticodons, but each of these three genes is in a region of high
sequence accuracy, with
PHRAP quality scores higher than 70 for every base in their anticodons.
As in all other organisms, human protein-coding genes show codon
bias—preferential use of
one synonymous codon over another [258] (Fig.
34). In less complex organisms, such as yeast
or bacteria, highly expressed genes show the strongest codon bias.
Cytoplasmic abundance
of tRNA species is correlated with both codon bias and overall amino-acid
frequency (for
example, tRNAs for preferred codons and for more common amino acids
are more
abundant). This is presumably driven by selective pressure for efficient
or accurate
translation [259]. In many organisms, tRNA abundance
in turn appears to be roughly
correlated with tRNA gene copy number, so tRNA gene copy number
has been used as a
proxy for tRNA abundance [260]. In vertebrates,
however, codon bias is not so obviously
correlated with gene expression level. Differences in codon bias between
human genes are more
a function of their location in regions of different GC composition
[261]. In agreement with the
literature, we see only a very rough correlation of human tRNA gene
number with either
amino-acid frequency or codon bias (Fig. 34). The most obvious outliers
in these weak
correlations are the strongly preferred CUG leucine codon, with
a mere six tRNALeu-CAG
genes producing a tRNA to decode it, and the relatively rare cysteine
UGU and UGC
codons, with 30 tRNA genes to decode them.
The tRNA genes are dispersed throughout the human genome. However,
this dispersal is
nonrandom. tRNA genes have sometimes been seen in clusters at small
scales [262, 263] but
we can now see striking clustering on a genome-wide scale. More
than 25% of the tRNA
genes (140) are found in a region of only about 4 Mb on chromosome
6. This small region,
only about 0.1% of the genome, contains an almost sufficient set
of tRNA genes all by
itself. The 140 tRNA genes contain a representative for 36 of the
49 anticodons found in
the complete set; and of the 21 isoacceptor types, only tRNAs to
decode Asn, Cys, Glu and
selenocysteine are missing. Many of these tRNA genes, meanwhile,
are clustered
elsewhere; 18 of the 30 Cys tRNAs are found in a 0.5-Mb stretch
of chromosome 7 and
many of the Asn and Glu tRNA genes are loosely clustered on chromosome
1. More than
half of the tRNA genes (280 out of 497) reside on either chromosome
1 or chromosome 6.
Chromosomes 3, 4, 8, 9, 10, 12, 18, 20, 21 and X appear to have
fewer than 10 tRNA genes
each; and chromosomes 22 and Y have none at all (each has a single
pseudogene).
Ribosomal RNA genes.
The ribosome, the protein synthetic machine of the cell, is made
up of two subunits and contains four rRNA species and many proteins.
The large ribosomal
subunit contains 28S and 5.8S rRNAs (collectively called 'large
subunit' (LSU) rRNA) and
also a 5S rRNA. The small ribosomal subunit contains 18S rRNA ('small
subunit' (SSU)
rRNA). The genes for LSU and SSU rRNA occur in the human genome
as a 44-kb tandem
repeat unit [264]. There are thought to be about
150–200 copies of this repeat unit arrayed on
the short arms of acrocentric chromosomes 13, 14, 15, 21 and 22
(refs 254, 264). There are
no true complete copies of the rDNA tandem repeats in the draft
genome sequence, owing
to the deliberate bias in the initial phase of the sequencing effort
against sequencing BAC
clones whose restriction fragment fingerprints showed them to contain
primarily tandemly
repeated sequence. Sequence similarity analysis with the BLASTN
computer program
does, however, detect hundreds of rDNA-derived sequence fragments
dispersed throughout
the complete genome, including one 'full-length' copy of an individual
5.8S rRNA gene not
associated with a true tandem repeat unit (Table 20).
The 5S rDNA genes also occur in tandem arrays, the largest of which
is on chromosome 1
between 1q41.11 and 1q42.13, close to the telomere [265,
266].
There are 200–300 true 5S
genes in these arrays [265, 267]. The number of 5S-related sequences
in the genome, including
numerous dispersed pseudogenes, is classically cited as 2,000 (refs
252,
254).
The long
tandem array on chromosome 1 is not yet present in the draft genome
sequence because
there are no EcoRI or HindIII sites present, and thus it was not
cloned in the most heavily
utilized BAC libraries (Table 1). We expect to recover it during
the finishing stage. We do
detect four individual copies of 5S rDNA by our search criteria
(≥ 95% identity and ≥ 95%
full length). We also find many more distantly related dispersed
sequences (520 at
P ≤ 0.001), which we interpret as probable pseudogenes (Table
20).
Small nucleolar RNA genes.
Eukaryotic rRNA is extensively processed and modified in
the nucleolus. Much of this activity is directed by numerous snoRNAs.
These come in two
families: C/D box snoRNAs (mostly involved in guiding site-specific
2'-O-ribose
methylations of other RNAs) and H/ACA snoRNAs (mostly involved in
guiding
site-specific pseudouridylations) [247, 248].
We compiled a set of 97 known human snoRNA
gene sequences; 84 of these (87%) have at least one copy in the
draft genome sequence
(Table 20), almost all as single-copy genes.
It is thought that all 2'-O-ribose methylations and pseudouridylations
in eukaryotic rRNA are
guided by snoRNAs. There are 105–107 methylations and around 95
pseudouridylations in
human rRNA [268]. Only about half of these have
been tentatively assigned to known guide
snoRNAs. There are also snoRNA-directed modifications on other stable
RNAs, such as
U6 (ref. 269), and the extent of this is just
beginning to be explored. Sequence similarity has
so far proven insufficient to recognize all snoRNA genes. We therefore
expect that there
are many unrecognized snoRNA genes that are not detected by BLAST
queries.
Spliceosomal RNAs and other ncRNA genes.
We also looked for copies of other known ncRNA genes. We found at least one copy of 21 (95%) of 22 known ncRNAs, including the spliceosomal snRNAs. There were multiple copies for several ncRNAs, as expected; for example, we find 44 dispersed genes for U6 snRNA, and 16 for U1 snRNA (Table 20).
For some of these RNA genes, homogeneous multigene families that
occur in tandem
arrays are again under-represented owing to the restriction enzymes
used in constructing
the BAC libraries and, in some instances, the decision to delay
the sequencing of BAC
clones with low complexity fingerprints indicative of tandemly repeated
DNA. The U2
RNA genes are located at the RNU2 locus, a tandem array of 10–20
copies of nearly
identical 6.1-kb units at 17q21–q22 (refs 270,
271,
272).
Similarly, the U3 snoRNA genes
(included in the aggregate count of C/D snoRNAs in Table 20) are
clustered at the RNU3
locus at 17p11.2, not in a tandem array, but in a complex inverted
repeat structure of about
5–10 copies per haploid genome [273]. The U1
RNA genes are clustered with about 30 copies
at the RNU1 locus at 1p36.1, but this cluster is thought to be loose
and irregularly organized;
no two U1 genes have been cloned on the same cosmid [271].
In the draft genome sequence,
we see six copies of U2 RNA that meet our criteria for true genes,
three of which appear
to be in the expected position on chromosome 17. For U3, so far
we see one true copy at
the correct place on chromosome 17p11.2. For U1, we see 16 true
genes, 6 of which are
loosely clustered within 0.6 Mb at 1p36.1 and another 6 are elsewhere
on chromosome 1.
Again, these and other clusters will be a matter for the finishing
process.
Our observations also confirm the striking proliferation of ncRNA-derived
pseudogenes
(Table 20). There are hundreds or thousands of sequences in the
draft genome sequence
related to some of the ncRNA genes. The most prolific pseudogene
counts generally come
from RNA genes transcribed by RNA polymerase III promoters, including
U6, the hY
RNAs and SRP-RNA. These ncRNA pseudogenes presumably arise through
reverse
transcription. The frequency of such events gives insight into how
ncRNA genes can evolve
into SINE retroposons, such as the tRNA-derived SINEs found in many
vertebrates and the
SRP-RNA-derived Alu elements found in humans.
Protein-coding genes
Identifying the protein-coding genes in the human genome is one of
the most important
applications of the sequence data, but also one of the most difficult
challenges. We describe
below our efforts to create an initial human gene and protein index.
Exploring properties of known genes.
Before attempting to identify new genes, we explored what could be learned by aligning the cDNA sequences of known genes to the draft genome sequence. Genomic alignments allow one to study exon–intron structure and local GC content, and are valuable for biomedical studies because they connect genes with the genetic and cytogenetic map, link them with regulatory sequences and facilitate the development of polymerase chain reaction (PCR) primers to amplify exons. Until now, genomic alignment was available for only about a quarter of known genes.
The 'known' genes studied were those in the RefSeq database [110],
a manually curated
collection designed to contain nonredundant representatives of most
full-length human
mRNA sequences in GenBank (RefSeq intentionally contains some alternative
splice forms
of the same genes). The version of RefSeq used contained 10,272
mRNAs.
The RefSeq genes were aligned with the draft genome sequence, using
both the Spidey (S.
Wheelan, personal communication) and Acembly (D. Thierry-Mieg and
J. Thierry-Mieg,
unpublished; http://www.acedb.org)
computer programs. Because this sequence is
incomplete and contains errors, not all genes could be fully aligned
and some may have been
incorrectly aligned. More than 92% of the RefSeq entries could be
aligned at high
stringency over at least part of their length, and 85% could be
aligned over more than half
of their length. Some genes (16%) had high stringency alignments
to more than one location
in the draft genome sequence owing, for example, to paralogues or
pseudogenes. In such
cases, we considered only the best match. In a few of these cases,
the assignment may not
be correct because the true matching region has not yet been sequenced.
Three per cent of
entries appeared to be alternative splice products of the same gene,
on the basis of their
alignment to the same location in the draft genome sequence. In
all, we obtained at least
partial genomic alignments for 9,212 distinct known genes and essentially
complete
alignment for 5,364 of them.
Previous efforts to study human gene structure [116,
274,
275] have been hampered by limited
sample sizes and strong biases in favour of compact genes. Table
21 gives the mean and
median values of some basic characteristics of gene structures.
Some of the values may be
underestimates. In particular, the UTRs given in the RefSeq database
are likely to be
incomplete; they are considerably shorter, for example, than those
derived from careful
reconstructions on chromosome 22. Intron sizes were measured only
for genes in finished
genomic sequence, to mitigate the bias arising from the fact that
long introns are more likely
than short introns to be interrupted by gaps in the draft genome
sequence. Nonetheless,
there may be some residual bias against long genes and long introns.
There is considerable variation in overall gene size and intron size,
with both distributions
having very long tails. Many genes are over 100 kb long, the largest
known example being
the dystrophin gene (DMD) at 2.4 Mb. The variation in the size distribution
of coding
sequences and exons is less extreme, although there are still some
remarkable outliers. The
titin gene [276] has the longest currently known coding sequence at
80,780 bp; it also has the
largest number of exons (178) and longest single exon (17,106 bp).
It is instructive to compare the properties of human genes with those
from worm and fly.
For all three organisms, the typical length of a coding sequence
is similar (1,311 bp for
worm, 1,497 bp for fly and 1,340 bp for human), and most internal
exons fall within a
common peak between 50 and 200 bp (Fig. 35a). However, the worm
and fly exon
distributions have a fatter tail, resulting in a larger mean size
for internal exons (218 bp for
worm versus 145 bp for human). The conservation of preferred exon
size across all three
species supports suggestions of a conserved exon-based component
of the splicing
machinery [277]. Intriguingly, the few extremely
short human exons show an unusual base
composition. In 42 detected human exons of less than 19 bp, the
nucleotide frequencies of
A, G, T and C are 39, 33, 15 and 12%, respectively, showing a strong
purine bias.
Purine-rich sequences may enhance splicing [278,
279], and it is possible that such sequences
are required or strongly selected for to ensure correct splicing
of very short exons. Previous
studies have shown that short exons require intronic, but not exonic,
splicing enhancers [280].
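The purine bias follows directly from the reported frequencies (39% A plus 33% G gives 72% purines). A minimal sketch of how such a composition could be tallied from a set of exon sequences is given below; the function names are assumptions and the sequences used to exercise it are not taken from the data.

```python
from collections import Counter

def base_composition(exons):
    """Fraction of A, C, G and T pooled over all exon sequences."""
    counts = Counter()
    for exon in exons:
        counts.update(exon.upper())
    total = sum(counts[b] for b in "ACGT")
    return {b: counts[b] / total for b in "ACGT"}

def purine_fraction(composition):
    """Combined fraction of the purines A and G."""
    return composition["A"] + composition["G"]
```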
In contrast to the exons, the intron size distributions differ substantially
among the three
species (Fig. 35b, c). The worm and fly each have a reasonably tight
distribution, with most
introns near the preferred minimum intron length (47 bp for worm,
59 bp for fly) and an
extended tail (overall average length of 267 bp for worm and 487
bp for fly). Intron size is
much more variable in humans, with a peak at 87 bp but a very long
tail resulting in a mean
of more than 3,300 bp. The variation in intron size results in great
variation in gene size.
The variation in gene size and intron size can partly be explained
by the fact that GC-rich
regions tend to be gene-dense with many compact genes, whereas AT-rich
regions tend to
be gene-poor with many sprawling genes containing large introns.
The correlation of gene
density with GC content is shown in Fig. 36a, b; the relative density
increases more than
tenfold as GC content increases from 30% to 50%. The correlation
appears to be due
primarily to intron size, which drops markedly with increasing GC
content (Fig. 36c). In
contrast, coding properties such as exon length (Fig. 36c) or exon
number (data not shown)
vary little. Intergenic distance is also probably lower in high-GC
areas, although this is hard
to prove directly until all genes have been identified.
The large number of confirmed human introns allows us to analyse
variant splice sites,
confirming and extending recent reports [281].
Intron positions were confirmed by applying a
stringent criterion that EST or mRNA sequence show an exact match
of 8 bp in the
flanking exonic sequence on each side. Of 53,295 confirmed introns,
98.12% use the
canonical dinucleotides GT at the 5' splice site and AG at the 3'
site (GT–AG pattern).
Another 0.76% use the related GC–AG. About 0.10% use AT–AC, which
is a rare
alternative pattern primarily recognized by the variant U12 splicing
machinery [282]. The
remaining 1% belong to 177 types, some of which undoubtedly reflect
sequencing or
alignment errors.
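The classification of introns by their terminal dinucleotides can be sketched as follows. This is an illustrative tally only, not the procedure used to confirm the 53,295 introns; the function names are assumptions. The final lines recompute the residual category from the percentages quoted above.

```python
from collections import Counter

def splice_pattern(intron: str) -> str:
    """Return the 5' and 3' terminal dinucleotides of an intron, e.g. 'GT-AG'."""
    intron = intron.upper()
    return f"{intron[:2]}-{intron[-2:]}"

def tally_patterns(introns):
    """Fraction of introns showing each splice-site dinucleotide pattern."""
    counts = Counter(splice_pattern(s) for s in introns)
    total = sum(counts.values())
    return {pattern: n / total for pattern, n in counts.items()}

# Percentages reported in the text; the remainder is the 'other' category.
REPORTED = {"GT-AG": 98.12, "GC-AG": 0.76, "AT-AC": 0.10}
other = 100 - sum(REPORTED.values())   # about 1.02%
```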
Finally, we looked at alternative splicing of human genes. Alternative
splicing can allow
many proteins to be produced from a single gene and can be used
for complex gene
regulation. It appears to be prevalent in humans, with lower estimates
of about 35% of
human genes being subject to alternative splicing [283-285].
These studies may have
underestimated the prevalence of alternative splicing, because they
examined only EST
alignments covering only a portion of a gene.
To investigate the prevalence of alternative splicing, we analysed
reconstructed mRNA
transcripts covering the entire coding regions of genes on chromosome
22 (omitting small
genes with coding regions of less than 240 bp). Potential transcripts
identified by alignments
of ESTs and cDNAs to genomic sequence were verified by human inspection.
We found
642 transcripts, covering 245 genes (average of 2.6 distinct transcripts
per gene). Two or
more alternatively spliced transcripts were found for 145 (59%)
of these genes. A similar
analysis for the gene-rich chromosome 19 gave 1,859 transcripts,
corresponding to 544
genes (average 3.2 distinct transcripts per gene). Because we are
sampling only a subset of
all transcripts, the true extent of alternative splicing is likely
to be greater. These figures are
considerably higher than those for worm, in which analysis reveals
alternative splicing for
22% of genes for which ESTs have been found, with an average of
1.34 (12,816/9,516)
splice variants per gene. (The apparently higher extent of alternative
splicing seen in human
than in worm was not an artefact resulting from much deeper coverage
of human genes by
ESTs and mRNAs. Although there are many times more ESTs available
for human than
worm, these ESTs tend to have shorter average length (because many
were the product of
early sequencing efforts) and many match no human genes. We calculated
the actual
coverage per bp used in the analysis of the human and worm genes;
the coverage is only
modestly higher (about 50%) for the human, with a strong bias towards
3' UTRs which tend
to show much less alternative splicing. We also repeated the analysis
using equal coverage
for the two organisms and confirmed that higher levels of alternative
splicing were still seen
in human.)
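The two summary statistics used here, the mean number of distinct transcripts per gene and the fraction of genes with two or more transcripts, can be computed from a gene-to-transcript mapping as in the sketch below. The data structure and function name are assumptions made for illustration.

```python
def splicing_summary(transcripts_by_gene):
    """Return (mean transcripts per gene, fraction of genes with >= 2 transcripts).

    transcripts_by_gene maps a gene identifier to its list of distinct transcripts.
    """
    n_genes = len(transcripts_by_gene)
    n_transcripts = sum(len(t) for t in transcripts_by_gene.values())
    n_alternative = sum(1 for t in transcripts_by_gene.values() if len(t) >= 2)
    return n_transcripts / n_genes, n_alternative / n_genes
```

For chromosome 22, the reported totals correspond to 642/245, or about 2.6 transcripts per gene, and 145/245, or about 59% of genes with alternative transcripts.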
Seventy per cent of alternative splice forms found in the genes on
chromosomes 19 and 22
affect the coding sequence, rather than merely changing the 3' or
5' UTR. (This estimate
may be affected by the incomplete representation of UTRs in the
RefSeq database and in
the transcripts studied.) Alternative splicing of the terminal exon
was seen for 20% of 6,105
mRNAs that were aligned to the draft genome sequence and correspond
to confirmed 3'
EST clusters. In addition to alternative splicing, we found evidence
of the terminal exon
employing alternative polyadenylation sites (separated by > 100
bp) in 24% of cases.
...
Segmental history of the human genome
In bacteria, genomic segments often convey important information
about function: genes
located close to one another often encode proteins in a common pathway
and are regulated
in a common operon. In mammals, genes found close to each other
only rarely have
common functions, but they are still interesting because they have
a common history. In
fact, the study of genomic segments can shed light on biological
events as long as 500 Myr
ago and as recently as 20,000 years ago.
Conserved segments between human and mouse
Humans and mice shared a common ancestor about 100 Myr ago. Despite
the 200 Myr of
evolutionary distance between the species (about 100 Myr along each lineage), a significant fraction
of genes show synteny
between the two, being preserved within conserved segments. Genes
tightly linked in one
mammalian species tend to be linked in others. In fact, conserved
segments have been
observed in even more distant species: humans show conserved segments
with fish [350, 351]
and even with invertebrates such as fly and worm [352]. In general,
the likelihood that a
syntenic relationship will be disrupted correlates with the physical
distance between the loci
and the evolutionary distance between the species.
Studying conserved segments between human and mouse has several uses.
First,
conservation of gene order has been used to identify likely orthologues
between the species,
particularly when investigating disease phenotypes. Second, the
study of conserved
segments among genomes helps us to deduce evolutionary ancestry.
And third, detailed
comparative maps may assist in the assembly of the mouse sequence,
using the human
sequence as a scaffold.
Two types of linkage conservation are commonly described [353]. 'Conserved
synteny'
indicates that at least two genes that reside on a common chromosome
in one species are
also located on a common chromosome in the other species. Syntenic
loci are said to lie in a
'conserved segment' when not only the chromosomal position but the
linear order of the loci
has been preserved, without interruption by other chromosomal rearrangements.
An initial survey of homologous loci in human and mouse [354] suggested
that the total number
of conserved segments would be about 180. Subsequent estimates based
on increasingly
detailed comparative maps have remained close to this projection [353,
355, 356]
(http://www.informatics.jax.org).
The distribution of segment lengths has corresponded
reasonably well to the truncated negative exponential curve predicted
by the random
breakage model [357].
The availability of a draft human genome sequence allows the first
global human–mouse
comparison in which human physical distances can be measured in
Mb, rather than cM or
orthologous gene counts. We identified likely orthologues by reciprocal
comparison of the
human and mouse mRNAs in the LocusLink database, using megaBLAST.
For each
orthologous pair, we mapped the location of the human gene in the
draft genome sequence
and then checked the location of the mouse gene in the Mouse Genome
Informatics
database (http://www.informatics.jax.org).
Using a conservative threshold, we identified
3,920 orthologous pairs in which the human gene could be mapped
on the draft genome
sequence with high confidence. Of these, 2,998 corresponding mouse
genes had a known
position in the mouse genome. We then searched for definitive conserved
segments, defined
as human regions containing orthologues of at least two genes from
the same mouse
chromosome region (< 15 cM) without interruption by segments
from other chromosomes.
We identified 183 definitive conserved segments (Fig. 46). The average
segment length was
15.4 Mb, with the largest segment being 90.5 Mb and the smallest
24 kb. There were also
141 'singletons', segments that contained only a single locus; these
are not counted in the
statistics. Although some of these could be short conserved segments,
they could also
reflect incorrect choices of orthologues or problems with the human
or mouse maps.
Because of this conservative approach, the observed number of definitive
segments is likely
to be lower than the true total. One piece of evidence for this
conclusion comes from a
more detailed analysis of human chromosome 7 (ref. 358), which identified
20 conserved
segments, of which three were singletons. Our analysis revealed
only 13 definitive segments
on this chromosome, with nine singletons.
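The definition of a definitive conserved segment given above can be sketched as a single pass along one human chromosome. This is a simplified illustration under stated assumptions: each orthologue is reduced to a human position (Mb), a mouse chromosome and a mouse map position (cM), and a locus extends the current run only if it lies on the same mouse chromosome within 15 cM of the previous locus. The Locus type and the toy coordinates are hypothetical. In this sketch any intervening locus ends a run, whereas the discussion of human chromosome 20 suggests that isolated singletons did not split a segment in the actual analysis.

```python
from typing import List, NamedTuple, Tuple


class Locus(NamedTuple):
    human_mb: float   # position of the human gene on one human chromosome (Mb)
    mouse_chr: str    # chromosome of the mouse orthologue
    mouse_cm: float   # genetic map position of the mouse orthologue (cM)


MAX_CM_GAP = 15.0  # the '< 15 cM' criterion from the text


def split_runs(loci: List[Locus]) -> List[List[Locus]]:
    """Group loci, in human order, into runs sharing a mouse chromosome region."""
    runs: List[List[Locus]] = []
    for locus in sorted(loci, key=lambda l: l.human_mb):
        if runs:
            prev = runs[-1][-1]
            same_region = (
                locus.mouse_chr == prev.mouse_chr
                and abs(locus.mouse_cm - prev.mouse_cm) < MAX_CM_GAP
            )
            if same_region:
                runs[-1].append(locus)
                continue
        runs.append([locus])
    return runs


def classify(loci: List[Locus]) -> Tuple[List[List[Locus]], List[Locus]]:
    """Return (definitive segments with at least two loci, singletons)."""
    runs = split_runs(loci)
    segments = [run for run in runs if len(run) >= 2]
    singletons = [run[0] for run in runs if len(run) == 1]
    return segments, singletons


# Toy example with invented coordinates.
toy_loci = [
    Locus(1.0, "11", 10.0),
    Locus(2.0, "11", 12.0),
    Locus(3.0, "13", 40.0),   # lone locus from another chromosome
    Locus(4.0, "11", 14.0),
    Locus(5.0, "11", 20.0),
    Locus(6.0, "11", 50.0),   # same chromosome, but 30 cM away
]
toy_segments, toy_singletons = classify(toy_loci)
# Human extent of each segment in Mb.
toy_lengths = [seg[-1].human_mb - seg[0].human_mb for seg in toy_segments]
```

The toy input yields two definitive segments and two singletons, one from a different mouse chromosome and one that is too distant on the mouse map.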
The frequency of observing a particular gene count in a conserved
segment is plotted on a
logarithmic scale in Fig. 47. If chromosomal breaks occur in a random
fashion (as has been
proposed) and differences in gene density are ignored, a roughly
straight line should result.
There is a clear excess for n = 1, suggesting that 50% or more of
the singletons are indeed
artefactual. Thus, we estimate that the true number of conserved segments
is around 190–230,
in good agreement with the original Nadeau–Taylor prediction (ref. 354).
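The reasoning about the excess at n = 1 can be illustrated with a short sketch: if the logarithm of the frequency falls on a straight line in n, a line fitted to the counts for n of 2 or more can be extrapolated back to n = 1 to give the number of singletons expected under random breakage. The counts below are invented for illustration and are not the data plotted in Fig. 47; the function name and the choice of an ordinary least-squares fit are also assumptions.

```python
import math
from typing import Dict


def expected_singletons(counts_by_n: Dict[int, int]) -> float:
    """Fit log(count) = a + b*n for n >= 2 and extrapolate to n = 1."""
    points = [
        (n, math.log(c)) for n, c in counts_by_n.items() if n >= 2 and c > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two gene-count classes with n >= 2")
    mean_n = sum(n for n, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((n - mean_n) ** 2 for n, _ in points)
    sxy = sum((n - mean_n) * (y - mean_y) for n, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_n
    return math.exp(intercept + slope * 1)


# Hypothetical counts: number of segments containing n orthologues.
toy_counts = {1: 300, 2: 64, 3: 32, 4: 16, 5: 8, 6: 4}
toy_expected = expected_singletons(toy_counts)
# Share of observed singletons not explained by the fitted line.
toy_excess_fraction = 1.0 - toy_expected / toy_counts[1]
```

With these toy counts the fitted line predicts about 128 singletons against 300 observed, so a little over half of the observed singletons would be attributed to artefacts.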
Figure 48 shows a plot of the frequency of lengths of conserved segments,
where the x-axis
scale is shown in Mb. As before, there is a fair amount of scatter
in the data for the larger
segments (where the numbers are small), but the trend appears to
be consistent with a
random breakage model.
We attempted to ascertain whether the breakpoint regions have any
special characteristics.
This analysis was complicated by imprecision in the positioning
of these breaks, which will
tend to blur any relationships. With 2,998 orthologues, the average
interval within which a
break is known to have occurred is about 1.1 Mb. We compared the
aggregate features of
these breakpoint intervals with the genome as a whole. The mean
gene density was lower
in breakpoint regions than in the conserved segments (13.8 versus
18.6 per Mb). This
suggests that breakpoints may be more likely to occur or to undergo
fixation in gene-poor
intervals than in gene-rich intervals. The occurrence of breakpoints
may be promoted by
homologous recombination among repeated sequences (ref. 359). When the sequence
of the
mouse genome is finished, this analysis can be revisited more precisely.
A number of examples of extended conserved segments and syntenies
are apparent in Fig.
46. As has been noted, almost all human genes on chromosome 17 are
found on mouse
chromosome 11, with two members of the placental lactogen family
from mouse 13
inserted. Apart from two singleton loci, human chromosome 20 appears
to be entirely
orthologous to mouse chromosome 2, apparently in a single segment.
The largest apparently
contiguous conserved segment in the human genome is on chromosome
4, including roughly
90.5 Mb of human DNA that is orthologous to mouse chromosome 5.
This analysis also
allows us to infer the likely location of thousands of mouse genes
for which the human
orthologue has been located in the draft genome sequence but the
mouse locus has not yet
been mapped.
With about 200 conserved segments between mouse and human and about
100 Myr of
evolution from their common ancestor (ref. 360), we obtain an estimated
rate of about 1.0
chromosomal rearrangements fixed per Myr. However, there is
good evidence that the
rate of chromosomal rearrangement (like the rate of nucleotide substitutions;
see above)
differs between the two species. Among mammals, rodents may show
unusually rapid
chromosome alteration. By comparison, very few rearrangements have
been observed
among primates, and studies of a broader array of mammalian orders,
including cats, cows,
sheep and pigs, suggest an average rate of chromosome alteration
of only about 0.2
rearrangements per Myr in these lineages (ref. 361). Additional evidence
that rodents are outliers
comes from a recent analysis of synteny between the human and zebrafish
genomes. From
a study of 523 orthologues, it was possible to project 418 conserved
segments (ref. 350). Assuming
400 Myr since a common vertebrate ancestor of zebrafish and humans (ref. 362),
we obtain an
estimate of 0.52 rearrangements per Myr. Recent estimates of rearrangement
rates in
plants have suggested bimodality, with some pairs showing rates
of 0.15–0.41
rearrangements per Myr, and others showing higher rates of 1.1–1.3
rearrangements per
Myr (ref. 363). With additional detailed genome maps of multiple species,
it should be possible to
determine whether this particular molecular clock is truly operating
at a different rate in
various branches of the evolutionary tree, and whether variations
in that rate are bimodal or
continuous. It should also be possible to reconstruct the karyotypes
of common ancestors.
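The rate estimates quoted above can be reproduced with simple arithmetic. The sketch below assumes that the number of conserved segments approximates the number of fixed rearrangements and that these are shared between the two lineages descending from the common ancestor, so the elapsed time is counted twice. That factor of two is not stated explicitly in the text; it is adopted here because it reproduces both quoted figures. The function name is hypothetical.

```python
def rearrangements_per_myr(
    conserved_segments: float, divergence_myr: float, lineages: int = 2
) -> float:
    """Rearrangements fixed per Myr, averaged over the diverging lineages.

    Assumes the segment count approximates the number of rearrangements and
    that each lineage has evolved for divergence_myr since the split.
    """
    if divergence_myr <= 0 or lineages <= 0:
        raise ValueError("divergence time and lineage count must be positive")
    return conserved_segments / (lineages * divergence_myr)


# About 200 segments and about 100 Myr of divergence for human and mouse.
mouse_human_rate = rearrangements_per_myr(200, 100)       # 1.0
# 418 projected segments and 400 Myr for human and zebrafish.
zebrafish_human_rate = rearrangements_per_myr(418, 400)   # 0.5225
```

Because the result is an average over both lineages, it cannot by itself show which lineage contributed more rearrangements; that distinction rests on the comparisons with other mammals described above.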
Ancient duplicated segments in the human genome
Another approach to genomic history is to study segmental duplications
within the human
genome. Earlier, we discussed examples of recent duplications of
genomic segments to
pericentromeric and subtelomeric regions. Most of these events appear
to be evolutionary
dead-ends resulting in nonfunctional pseudogenes; however, segmental
duplication