Published in Nature vol. 409, no. 6822, pp. 860-921 (February 15, 2001):



"Initial Sequencing and Analysis of the Human Genome".

International Human Genome Sequencing Consortium


Authors:
Addresses:
Correspondence:
Abstract:
Introduction:
Summary of Findings:
Methods:
Supplementary Information:
...
Repeat Content of the Human Genome:
    Transposon-Derived Repeats:
        Classes of Transposable Elements:
           Figure 17. Classes of interspersed repeat transposons in the human genome:
           LINEs:
           SINEs:
           LTR retroposons:
           DNA transposons:
        Census of human repeats:
           Table 11. Number of copies and fraction of genome for classes of interspersed repeat:
        Age distribution:
           Figure 18. Age distribution of interspersed repeats in the human and mouse genomes:
        Comparisons with other organisms:
           Table 12. Number and nature of interspersed repeats in eukaryote genomes:
           Figure 20. Comparison of the age of interspersed repeats in eukaryote genomes:
         Variation in the distribution of repeats:
          ...
          Active transposons:
           Transposons as a creative force:
    Simple sequence repeats:
    ...
Gene Content of the Human Genome:
    Noncoding RNAs:
         Transfer RNA genes:
         Ribosomal RNA genes:
         Small nucleolar RNA genes:
         Spliceosomal RNAs and other ncRNA genes:
     Protein-coding genes:
          Exploring properties of known genes:
...
The next steps:
     Finishing the human sequence:
     Developing the IGI and IPI:
     Large-scale identification of regulatory regions:
     Sequencing of additional large genomes:
     Completing the catalogue of human variation:
     From sequence to function:
     Concluding thoughts:
DNA Sequence Databases:
Supplementary Information:
Acknowledgements:
References:
Background References:
Additional References:
Other Sites:
Feedback:



Abstract:

The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.


The rediscovery of Mendel's laws of heredity in the opening weeks of the 20th century [1-3] sparked a scientific quest to understand the nature and content of genetic information that has propelled biology for the last hundred years. The scientific progress made falls naturally into four main phases, corresponding roughly to the four quarters of the century. The first established the cellular basis of heredity: the chromosomes. The second defined the molecular basis of heredity: the DNA double helix. The third unlocked the informational basis of heredity, with the discovery of the biological mechanism by which cells read the information contained in genes and with the invention of the recombinant DNA technologies of cloning and sequencing by which scientists can do the same.

The last quarter of a century has been marked by a relentless drive to decipher first genes and then entire genomes, spawning the field of genomics. The fruits of this work already include the genome sequences of 599 viruses and viroids, 205 naturally occurring plasmids, 185 organelles, 31 eubacteria, seven archaea, one fungus, two animals and one plant.

Here we report the results of a collaboration involving 20 groups from the United States, the United Kingdom, Japan, France, Germany and China to produce a draft sequence of the human genome. The draft genome sequence was generated from a physical map covering more than 96% of the euchromatic part of the human genome and, together with additional sequence in public databases, it covers about 94% of the human genome. The sequence was produced over a relatively short period, with coverage rising from about 10% to more than 90% over roughly fifteen months. The sequence data have been made available without restriction and updated daily throughout the project. The task ahead is to produce a finished sequence, by closing all gaps and resolving all ambiguities. Already about one billion bases are in final form and the task of bringing the vast majority of the sequence to this standard is now straightforward and should proceed rapidly.

The sequence of the human genome is of interest in several respects. It is the largest genome to be extensively sequenced so far, being 25 times as large as any previously sequenced genome and eight times as large as the sum of all such genomes. It is the first vertebrate genome to be extensively sequenced. And, uniquely, it is the genome of our own species.

Much work remains to be done to produce a complete finished sequence, but the vast trove of information that has become available through this collaborative effort allows a global perspective on the human genome. Although the details will change as the sequence is finished, many points are already clear.
 

  • The genomic landscape shows marked variation in the distribution of a number of features, including genes, transposable elements, GC content, CpG islands and recombination rate. This gives us important clues about function. For example, the developmentally important HOX gene clusters are the most repeat-poor regions of the human genome, probably reflecting the very complex coordinate regulation of the genes in the clusters.

  • There appear to be about 30,000-40,000 protein-coding genes in the human genome, only about twice as many as in worm or fly. However, the genes are more complex, with more alternative splicing generating a larger number of protein products.

  • The full set of proteins (the 'proteome') encoded by the human genome is more complex than those of invertebrates. This is due in part to the presence of vertebrate-specific protein domains and motifs (an estimated 7% of the total), but more to the fact that vertebrates appear to have arranged pre-existing components into a richer collection of domain architectures.

  • Hundreds of human genes appear likely to have resulted from horizontal transfer from bacteria at some point in the vertebrate lineage. Dozens of genes appear to have been derived from transposable elements.

  • Although about half of the human genome derives from transposable elements, there has been a marked decline in the overall activity of such elements in the hominid lineage. DNA transposons appear to have become completely inactive and long-terminal repeat (LTR) retroposons may also have done so.

  • The pericentromeric and subtelomeric regions of chromosomes are filled with large recent segmental duplications of sequence from elsewhere in the genome. Segmental duplication is much more frequent in humans than in yeast, fly or worm.

  • Analysis of the organization of Alu elements explains the longstanding mystery of their surprising genomic distribution, and suggests that there may be strong selection in favour of preferential retention of Alu elements in GC-rich regions and that these 'selfish' elements may benefit their human hosts.

  • The mutation rate is about twice as high in male as in female meiosis, showing that most mutation occurs in males.

  • Cytogenetic analysis of the sequenced clones confirms suggestions that large GC-poor regions are strongly correlated with 'dark G-bands' in karyotypes.

  • Recombination rates tend to be much higher in distal regions (around 20 megabases (Mb)) of chromosomes and on shorter chromosome arms in general, in a pattern that promotes the occurrence of at least one crossover per chromosome arm in each meiosis.

  • More than 1.4 million single nucleotide polymorphisms (SNPs) in the human genome have been identified. This collection should allow the initiation of genome-wide linkage disequilibrium mapping of the genes in the human population.


    In this paper, we start by presenting background information on the project and describing the generation, assembly and evaluation of the draft genome sequence. We then focus on an initial analysis of the sequence itself: the broad chromosomal landscape; the repeat elements and the rich palaeontological record of evolutionary and biological processes that they provide; the human genes and proteins and their differences and similarities with those of other organisms; and the history of genomic segments. (Comparisons are drawn throughout with the genomes of the budding yeast Saccharomyces cerevisiae, the nematode worm Caenorhabditis elegans, the fruitfly Drosophila melanogaster and the mustard weed Arabidopsis thaliana; we refer to these for convenience simply as yeast, worm, fly and mustard weed.) Finally, we discuss applications of the sequence to biology and medicine and describe next steps in the project.

    A full description of the methods is provided as Supplementary Information on Nature's web site.


    Repeat content of the human genome:

    A puzzling observation in the early days of molecular biology was that genome size does not correlate well with organismal complexity. For example, Homo sapiens has a genome that is 200 times as large as that of the yeast S. cerevisiae, but 200 times as small as that of Amoeba dubia [139, 140]. This mystery (the C-value paradox) was largely resolved with the recognition that genomes can contain a large quantity of repetitive sequence, far in excess of that devoted to protein-coding genes (reviewed in refs [140, 141]).

    In the human, coding sequences comprise less than 5% of the genome (see below), whereas repeat sequences account for at least 50% and probably much more. Broadly, the repeats fall into five classes: (1) transposon-derived repeats, often referred to as interspersed repeats; (2) inactive (partially) retroposed copies of cellular genes (including protein-coding genes and small structural RNAs), usually referred to as processed pseudogenes; (3) simple sequence repeats, consisting of direct repetitions of relatively short k-mers such as (A)n, (CA)n or (CGG)n; (4) segmental duplications, consisting of blocks of around 10–300 kb that have been copied from one region of the genome into another region; and (5) blocks of tandemly repeated sequences, such as at centromeres, telomeres, the short arms of acrocentric chromosomes and ribosomal gene clusters. (These regions are intentionally under-represented in the draft genome sequence and are not discussed here.)

    Repeats are often described as 'junk' and dismissed as uninteresting. However, they actually represent an extraordinary trove of information about biological processes. The repeats constitute a rich palaeontological record, holding crucial clues about evolutionary events and forces. As passive markers, they provide assays for studying processes of mutation and selection. It is possible to recognize cohorts of repeats 'born' at the same time and to follow their fates in different regions of the genome or in different species. As active agents, repeats have reshaped the genome by causing ectopic rearrangements, creating entirely new genes, modifying and reshuffling existing genes, and modulating overall GC content. They also shed light on chromosome structure and dynamics, and provide tools for medical genetic and population genetic studies.

    The human is the first repeat-rich genome to be sequenced, and so we investigated what information could be gleaned from this majority component of the human genome. Although some of the general observations about repeats were suggested by previous studies, the draft genome sequence provides the first comprehensive view, allowing some questions to be resolved and new mysteries to emerge.

    Transposon-derived repeats

    Most human repeat sequence is derived from transposable elements [142, 143]. We can currently recognize about 45% of the genome as belonging to this class. Much of the remaining 'unique' DNA must also be derived from ancient transposable element copies that have diverged too far to be recognized as such. To describe our analyses of interspersed repeats, it is necessary briefly to review the relevant features of human transposable elements.

    Classes of transposable elements.

    In mammals, almost all transposable elements fall into one of four types (Fig. 17), of which three transpose through RNA intermediates and one transposes directly as DNA. These are long interspersed elements (LINEs), short interspersed elements (SINEs), LTR retrotransposons and DNA transposons.


    Figure 17. Almost all transposable elements in mammals fall into one of four classes. (See text for details). 



    LINEs are one of the most ancient and successful inventions in eukaryotic genomes. In humans, these transposons are about 6 kb long, harbour an internal polymerase II promoter and encode two open reading frames (ORFs). Upon translation, a LINE RNA assembles with its own encoded proteins and moves to the nucleus, where an endonuclease activity makes a single-stranded nick and the reverse transcriptase uses the nicked DNA to prime reverse transcription from the 3' end of the LINE RNA. Reverse transcription frequently fails to proceed to the 5' end, resulting in many truncated, nonfunctional insertions. Indeed, most LINE-derived repeats are short, with an average size of 900 bp for all LINE1 copies, and a median size of 1,070 bp for copies of the currently active LINE1 element (L1Hs). New insertion sites are flanked by a small target site duplication of 7–20 bp. The LINE machinery is believed to be responsible for most reverse transcription in the genome, including the retrotransposition of the non-autonomous SINEs [144] and the creation of processed pseudogenes [145, 146]. Three distantly related LINE families are found in the human genome: LINE1, LINE2 and LINE3. Only LINE1 is still active.

    SINEs are wildly successful freeloaders on the backs of LINE elements. They are short (about 100–400 bp), harbour an internal polymerase III promoter and encode no proteins. These non-autonomous transposons are thought to use the LINE machinery for transposition. Indeed, most SINEs 'live' by sharing the 3' end with a resident LINE element [144]. The promoter regions of all known SINEs are derived from tRNA sequences, with the exception of a single monophyletic family of SINEs derived from the signal recognition particle component 7SL. This family, which also does not share its 3' end with a LINE, includes the only active SINE in the human genome: the Alu element. By contrast, the mouse has both tRNA-derived and 7SL-derived SINEs. The human genome contains three distinct monophyletic families of SINEs: the active Alu, and the inactive MIR and Ther2/MIR3.

    LTR retroposons are flanked by long terminal direct repeats that contain all of the necessary transcriptional regulatory elements. The autonomous elements (retrotransposons) contain gag and pol genes, which encode a protease, reverse transcriptase, RNAse H and integrase. Exogenous retroviruses seem to have arisen from endogenous retrotransposons by acquisition of a cellular envelope gene (env) [147]. Transposition occurs through the retroviral mechanism with reverse transcription occurring in a cytoplasmic virus-like particle, primed by a tRNA (in contrast to the nuclear location and chromosomal priming of LINEs). Although a variety of LTR retrotransposons exist, only the vertebrate-specific endogenous retroviruses (ERVs) appear to have been active in the mammalian genome. Mammalian retroviruses fall into three classes (I–III), each comprising many families with independent origins. Most (85%) of the LTR retroposon-derived 'fossils' consist only of an isolated LTR, with the internal sequence having been lost by homologous recombination between the flanking LTRs.

    DNA transposons resemble bacterial transposons, having terminal inverted repeats and encoding a transposase that binds near the inverted repeats and mediates mobility through a 'cut-and-paste' mechanism. The human genome contains at least seven major classes of DNA transposon, which can be subdivided into many families with independent origins [148] (see RepBase, http://www.girinst.org/~server/repbase.html). DNA transposons tend to have short life spans within a species. This can be explained by contrasting the modes of transposition of DNA transposons and LINE elements. LINE transposition tends to involve only functional elements, owing to the cis-preference by which LINE proteins assemble with the RNA from which they were translated. By contrast, DNA transposons cannot exercise a cis-preference: the encoded transposase is produced in the cytoplasm and, when it returns to the nucleus, it cannot distinguish active from inactive elements. As inactive copies accumulate in the genome, transposition becomes less efficient. This checks the expansion of any DNA transposon family and in due course causes it to die out. To survive, DNA transposons must eventually move by horizontal transfer to virgin genomes, and there is considerable evidence for such transfer [149-153].

    Transposable elements employ different strategies to ensure their evolutionary survival. LINEs and SINEs rely almost exclusively on vertical transmission within the host genome [154] (but see refs [148, 155]). DNA transposons are more promiscuous, requiring relatively frequent horizontal transfer. LTR retroposons use both strategies, with some being long-term active residents of the human genome (such as members of the ERVL family) and others having only short residence times.

    Census of human repeats.

    We began by taking a census of the transposable elements in the draft genome sequence, using a recently updated version of the RepeatMasker program (version 09092000) run under sensitive settings (see http://repeatmasker.genome.washington.edu). This program scans sequences to identify full-length and partial members of all known repeat families represented in RepBase Update (version 5.08; see http://www.girinst.org/~server/repbase.html and ref. [156]). Table 11 shows the number of copies and fraction of the draft genome sequence occupied by each of the four major classes and the main subclasses.
     
    Table 11. Number of copies and fraction of genome for classes of interspersed repeat.

                                  Number of   Total number of      Fraction of the   Number of
                                  copies      bases in the draft   draft genome      families
                                  (x1,000)    genome sequence      sequence          (subfamilies)
                                              (Mb)                 (%)
    SINEs                         1,558       359.6                13.14
       Alu                        1,090       290.1                10.60             1(~20)
       MIR                        393         60.1                 2.20              1(1)
       MIR3                       75          9.3                  0.34              1(1)
    LINEs                         868         558.8                20.42
       LINE1                      516         462.1                16.89             1(~55)
       LINE2                      315         88.2                 3.22              1(2)
       LINE3                      37          8.4                  0.31              1(2)
    LTR elements                  443         227.0                8.29
       ERV-class I                112         79.2                 2.89              72(132)
       ERV(K)-class II            8           8.5                  0.31              10(20)
       ERV(L)-class III           83          39.5                 1.44              21(42)
       MaLR                       240         99.8                 3.65              1(31)
    DNA elements                  294         77.6                 2.84
       hAT group
          MER1-Charlie            182         38.1                 1.39              25(50)
          Zaphod                  13          4.3                  0.16              4(10)
       Tc-1 group
          MER2-Tigger             57          28.0                 1.02              12(28)
          Tc2                     4           0.9                  0.03              1(5)
          Mariner                 14          2.6                  0.10              4(5)
       PiggyBac-like              2           0.5                  0.02              10(20)
       Unclassified               22          3.2                  0.12              7(7)
    Unclassified                  3           3.8                  0.14              3(4)
    Total interspersed repeats                1,226.8              44.83

    The number of copies and base pair contributions of the major classes and subclasses of transposable elements in the human genome. Data extracted from a RepeatMasker analysis of the draft genome sequence (RepeatMasker version 09092000, sensitive settings, using RepBase Update 5.08). In calculating percentages, RepeatMasker excluded the runs of Ns linking the contigs in the draft genome sequence. In the last column, separate consensus sequences in the repeat databases are considered subfamilies, rather than families, when the sequences are closely related or related through intermediate subfamilies.
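
    The class totals in Table 11 can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the numbers are transcribed from the table, while the function names are hypothetical and not part of the original analysis. It sums the class totals, derives the amount of non-N sequence implied by the reported percentage, and computes the share of interspersed repeat sequence contributed by LINE1 and Alu.

```python
# Class totals transcribed from Table 11: (Mb of draft sequence, % of draft sequence).
CLASS_TOTALS = {
    "SINEs": (359.6, 13.14),
    "LINEs": (558.8, 20.42),
    "LTR elements": (227.0, 8.29),
    "DNA elements": (77.6, 2.84),
    "Unclassified": (3.8, 0.14),
}
REPORTED_TOTAL_MB = 1226.8   # "Total interspersed repeats" row
REPORTED_TOTAL_PCT = 44.83
SUBCLASS_MB = {"LINE1": 462.1, "Alu": 290.1}


def summed_totals(classes):
    """Return (Mb, %) summed over all classes (hypothetical helper)."""
    mb = sum(v[0] for v in classes.values())
    pct = sum(v[1] for v in classes.values())
    return mb, pct


def implied_sequence_mb(repeat_mb, repeat_pct):
    """Amount of non-N draft sequence implied by a (Mb, %) pair."""
    return repeat_mb / (repeat_pct / 100.0)


def share_of_repeats(mb, total_mb=REPORTED_TOTAL_MB):
    """Percentage of all interspersed repeat sequence contributed by `mb`."""
    return 100.0 * mb / total_mb


if __name__ == "__main__":
    mb, pct = summed_totals(CLASS_TOTALS)
    print(f"summed classes: {mb:.1f} Mb, {pct:.2f}%")
    print(f"implied non-N sequence: {implied_sequence_mb(mb, pct):.0f} Mb")
    both = SUBCLASS_MB["LINE1"] + SUBCLASS_MB["Alu"]
    print(f"LINE1 + Alu share of repeats: {share_of_repeats(both):.1f}%")
```

    Under these figures the class rows add up to the reported total, and LINE1 plus Alu account for roughly three fifths of all interspersed repeat sequence.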



    The precise count of repeats is obviously underestimated because the genome sequence is not finished, but their density and other properties can be stated with reasonable confidence. Currently recognized SINEs, LINEs, LTR retroposons and DNA transposon copies comprise 13%, 20%, 8% and 3% of the sequence, respectively. We expect these densities to grow as more repeat families are recognized, among which will be lower copy number LTR elements and DNA transposons, and possibly high copy number ancient (highly diverged) repeats.

    Age distribution.

    The age distribution of the repeats in the human genome provides a rich 'fossil record' stretching over several hundred million years. The ancestry and approximate age of each fossil can be inferred by exploiting the fact that each copy is derived from, and therefore initially carried the sequence of, a then-active transposon and, being generally under no functional constraint, has accumulated mutations randomly and independently of other copies. We can infer the sequence of the ancestral active elements by clustering the modern derivatives into phylogenetic trees and building a consensus based on the multiple sequence alignment of a cluster of copies. Using available consensus sequences for known repeat subfamilies, we calculated the per cent divergence from the inferred ancestral active transposon for each of three million interspersed repeats in the draft genome sequence.
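
    The idea of inferring an ancestral element from its modern copies and then measuring each copy's divergence can be sketched as follows. This is a deliberately simplified illustration, not the procedure used for the draft genome analysis: it assumes the copies are already aligned to equal length, and it takes a plain column-wise majority vote in place of the phylogenetic clustering described above. All function names are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_copies):
    """Column-wise majority base over equal-length aligned copies ('-' marks a gap)."""
    length = len(aligned_copies[0])
    if any(len(c) != length for c in aligned_copies):
        raise ValueError("copies must be aligned to the same length")
    consensus = []
    for column in zip(*aligned_copies):
        counts = Counter(base for base in column if base != "-")
        # Ties are broken arbitrarily; a real analysis would handle them explicitly.
        consensus.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(consensus)


def percent_divergence(copy, consensus):
    """Mismatches per 100 compared (non-gap) positions."""
    compared = 0
    mismatches = 0
    for a, b in zip(copy, consensus):
        if a == "-" or b == "-":
            continue
        compared += 1
        mismatches += a != b
    return 100.0 * mismatches / compared if compared else 0.0


if __name__ == "__main__":
    # Toy aligned copies of one hypothetical repeat subfamily.
    copies = ["ACGTACGTAC", "ACGTTCGTAC", "ACGAACGTAC", "ACGTACGTCC"]
    consensus = majority_consensus(copies)
    for c in copies:
        print(c, f"{percent_divergence(c, consensus):.1f}%")
```

    Because each copy mutates independently, a mutation in any one copy is outvoted by the others, which is why the consensus approximates the sequence of the element at the time it was active.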

    The percentage of sequence divergence can be converted into an approximate age in millions of years (Myr) on the basis of evolutionary information. Care is required in calibrating the clock, because the rate of sequence divergence may not be constant over time or between lineages [139]. The relative-rate test [157] can be used to calculate the sequence divergence that accumulated in a lineage after a given timepoint, on the basis of comparison with a sibling species that diverged at that time and an outgroup species. For example, the substitution rate over roughly the last 25 Myr in the human lineage can be calculated by using old world monkeys (which diverged about 25 Myr ago) as a sibling species and new world monkeys as an outgroup. We have used currently available calibrations for the human lineage, but the issue should be revisited as sequence information becomes available from different mammals.
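
    One common way to turn a three-way species comparison into a lineage-specific divergence is the three-point formula for an additive tree, sketched below. This is offered as an illustration of the reasoning, not as the calibration actually used; the pairwise distances in the example are invented, and the function names are hypothetical.

```python
def lineage_branch_lengths(d_ab, d_ao, d_bo):
    """Branch lengths for species A, sibling B and outgroup O.

    d_ab, d_ao, d_bo are pairwise distances (substitutions per site).
    On an additive three-taxon tree, the branch leading to A since the
    A/B split is (d_ab + d_ao - d_bo) / 2, and similarly for B and O.
    """
    a = (d_ab + d_ao - d_bo) / 2.0
    b = (d_ab + d_bo - d_ao) / 2.0
    o = (d_ao + d_bo - d_ab) / 2.0
    return a, b, o


def rate_per_myr(branch_length, divergence_myr):
    """Average substitutions per site per Myr along one branch."""
    return branch_length / divergence_myr


if __name__ == "__main__":
    # Invented distances: A = human, B = old world monkey, O = new world monkey.
    human, owm, outgroup = lineage_branch_lengths(d_ab=0.06, d_ao=0.11, d_bo=0.13)
    print(f"human branch: {human:.3f} substitutions/site")
    print(f"old world monkey branch: {owm:.3f} substitutions/site")
    # 25 Myr is the approximate human / old world monkey divergence given in the text.
    print(f"human rate: {rate_per_myr(human, 25.0):.5f} per site per Myr")
```

    If the two sibling branches come out unequal, as in this toy example, the lineages have evolved at different rates since their split, which is the reason the clock has to be calibrated per lineage.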

    Figure 18a shows the representation of various classes of transposable elements in categories reflecting equal amounts of sequence divergence. In Fig. 18b the data are grouped into four bins corresponding to successive 25-Myr periods, on the basis of an approximate clock. Figure 19 shows the mean ages of various subfamilies of DNA transposons.

    Figure 18. Age distribution of interspersed repeats in the human and mouse genomes. Bases covered by interspersed repeats were sorted by their divergence from their consensus sequence (which approximates the repeat's original sequence at the time of insertion). The average number of substitutions per 100 bp (substitution level, K) was calculated from the mismatch level p assuming equal frequency of all substitutions (the one-parameter Jukes-Cantor model, K = -(3/4) ln(1 - (4/3)p)). This model tends to underestimate higher substitution levels. CpG dinucleotides in the consensus were excluded from the substitution level calculations because the C → T transition rate in CpG pairs is about tenfold higher than other transitions and causes distortions in comparing transposable elements with high and low CpG content. a, The distribution, for the human genome, in bins corresponding to 1% increments in substitution levels. b, The data grouped into bins representing roughly equal time periods of 25 Myr. c,d, Equivalent data for available mouse genomic sequence. There is a different correspondence between substitution levels and time periods owing to different rates of nucleotide substitution in the two species. The correspondence between substitution levels and time periods was largely derived from three-way species comparisons (relative rate test [139, 157]) with the age estimates based on fossil data. Human divergence from gibbon 20–30 Myr; old world monkey 25–35 Myr; prosimians 55–80 Myr; eutherian mammalian radiation, 100 Myr.
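
    The substitution-level calculation described in the legend of Figure 18 can be sketched in a few lines. The Jukes-Cantor formula is taken from the legend; the CpG handling below (skipping both positions of every CG dinucleotide in the consensus) is one plausible reading of "excluded" and should be treated as an assumption, as should the hypothetical function names and the toy sequences.

```python
import math


def jukes_cantor(p):
    """Substitutions per site, K = -(3/4) ln(1 - (4/3) p), for mismatch level p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("mismatch level must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


def cpg_positions(consensus):
    """Indices of both bases of every CG dinucleotide in the consensus."""
    positions = set()
    for i in range(len(consensus) - 1):
        if consensus[i:i + 2] == "CG":
            positions.update((i, i + 1))
    return positions


def mismatch_level(copy, consensus, skip=frozenset()):
    """Fraction of mismatching positions, ignoring gaps and skipped indices."""
    compared = 0
    mismatches = 0
    for i, (a, b) in enumerate(zip(copy, consensus)):
        if i in skip or a == "-" or b == "-":
            continue
        compared += 1
        mismatches += a != b
    return mismatches / compared if compared else 0.0


if __name__ == "__main__":
    consensus = "ACGTTAGCAT"   # toy consensus with one CpG at positions 1-2
    copy = "ATGTTAGCAA"        # toy copy with a C -> T change inside the CpG
    p_all = mismatch_level(copy, consensus)
    p_no_cpg = mismatch_level(copy, consensus, skip=cpg_positions(consensus))
    print(f"p (all sites) = {p_all:.3f}, p (CpG excluded) = {p_no_cpg:.3f}")
    # Multiply by 100 to express K as substitutions per 100 bp.
    print(f"K per 100 bp = {100 * jukes_cantor(p_no_cpg):.1f}")
```

    The correction always gives K at least as large as p, because it accounts for sites that have been hit more than once, and the gap between the two widens as p grows.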


    Several facts are apparent from these graphs. First, most interspersed repeats in the human genome predate the eutherian radiation. This is a testament to the extremely slow rate with which nonfunctional sequences are cleared from vertebrate genomes (see below concerning comparison with the fly).

    Second, LINE and SINE elements have extremely long lives. The monophyletic LINE1 and Alu lineages are at least 150 and 80 Myr old, respectively. In earlier times, the reigning transposons were LINE2 and MIR [148, 158]. The SINE MIR was perfectly adapted for reverse transcription by LINE2, as it carried the same 50-base sequence at its 3' end. When LINE2 became extinct 80–100 Myr ago, it spelled the doom of MIR.

    Third, there were two major peaks of DNA transposon activity (Fig. 19). The first involved Charlie elements and occurred long before the eutherian radiation; the second involved Tigger elements and occurred after this radiation. Because DNA transposons can produce large-scale chromosome rearrangements [159-162], it is possible that widespread activity could be involved in speciation events.

    Fourth, there is no evidence for DNA transposon activity in the past 50 Myr in the human genome. The youngest two DNA transposon families that we can identify in the draft genome sequence (MER75 and MER85) show 6–7% divergence from their respective consensus sequences representing the ancestral element (Fig. 19), indicating that they were active before the divergence of humans and new world monkeys. Moreover, these elements were relatively unsuccessful, together contributing just 125 kb to the draft genome sequence.

    Finally, LTR retroposons appear to be teetering on the brink of extinction, if they have not already succumbed. For example, the most prolific elements (ERVL and MaLRs) flourished for more than 100 Myr but appear to have died out about 40 Myr ago [163, 164]. Only a single LTR retroposon family (HERVK10) is known to have transposed since our divergence from the chimpanzee 7 Myr ago, with only one known copy (in the HLA region) that is not shared between all humans [165]. In the draft genome sequence, we can identify only three full-length copies with all ORFs intact (the final total may be slightly higher owing to the imperfect state of the draft genome sequence).

    More generally, the overall activity of all transposons has declined markedly over the past 35–50 Myr, with the possible exception of LINE1 (Fig. 18). Indeed, apart from an exceptional burst of activity of Alus peaking around 40 Myr ago, there would appear to have been a fairly steady decline in activity in the hominid lineage since the mammalian radiation. The extent of the decline must be even greater than it appears because old repeats are gradually removed by random deletion and because old repeat families are harder to recognize and likely to be under-represented in the repeat databases. (We confirmed that the decline in transposition is not an artefact arising from errors in the draft genome sequence, which, in principle, could increase the divergence level in recent elements. First, the sequence error rate (Table 9) is far too low to have a significant effect on the apparent age of recent transposons; and second, the same result is seen if one considers only finished sequence.)

    What explains the decline in transposon activity in the lineage leading to humans? We return to this question below, in the context of the observation that there is no similar decline in the mouse genome.

    Comparison with other organisms.

    We compared the complement of transposable elements in the human genome with those of the other sequenced eukaryotic genomes. We analysed the fly, worm and mustard weed genomes for the number and nature of repeats (Table 12) and the age distribution (Fig. 20). (For the fly, we analysed the 114 Mb of unfinished 'large' contigs produced by the whole-genome shotgun assembly [166], which are reported to represent euchromatic sequence. Similar results were obtained by analysing 30 Mb of finished euchromatic sequence.) The human genome stands in stark contrast to the genomes of the other organisms.


    The complete genomes of fly, worm, and chromosomes 2 and 4 of mustard weed (as deposited at http://www.ncbi.nlm.nih.gov/genbank/genomes) were screened against the repeats in RepBase Update 5.02 (September 2000) with RepeatMasker at sensitive settings.




    Figure 20. Comparison of the age of interspersed repeats in eukaryotic genomes. The copies of repeats were pooled by their nucleotide substitution level from the consensus.


    (1) The euchromatic portion of the human genome has a much higher density of transposable element copies than the euchromatic DNA of the other three organisms. The repeats in the other organisms may have been slightly underestimated because the repeat databases for the other organisms are less complete than for the human, especially with regard to older elements; on the other hand, recent additions to these databases appear to increase the repeat content only marginally.

    (2) The human genome is filled with copies of ancient transposons, whereas the transposons in the other genomes tend to be of more recent origin. The difference is most marked with the fly, but is clear for the other genomes as well. The accumulation of old repeats is likely to be determined by the rate at which organisms engage in 'housecleaning' through genomic deletion. Studies of pseudogenes have suggested that small deletions occur at a rate that is 75-fold higher in flies than in mammals; the half-life of such nonfunctional DNA is estimated at 12 Myr for flies and 800 Myr for mammals [167]. The rate of large deletions has not been systematically compared, but seems likely also to differ markedly.
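
    The half-lives quoted above can be turned into a rough picture of how much ancient nonfunctional DNA each lineage would still carry. The sketch below assumes simple exponential decay, which the term 'half-life' implies but which the text does not spell out, and it considers small deletions only; the function name is hypothetical.

```python
FLY_HALF_LIFE_MYR = 12.0      # half-life of nonfunctional DNA in flies, from the text
MAMMAL_HALF_LIFE_MYR = 800.0  # half-life of nonfunctional DNA in mammals, from the text


def fraction_remaining(elapsed_myr, half_life_myr):
    """Fraction of nonfunctional sequence surviving after `elapsed_myr`,
    assuming exponential decay with the given half-life."""
    if half_life_myr <= 0:
        raise ValueError("half-life must be positive")
    return 0.5 ** (elapsed_myr / half_life_myr)


if __name__ == "__main__":
    for age in (25, 50, 100):
        fly = fraction_remaining(age, FLY_HALF_LIFE_MYR)
        mammal = fraction_remaining(age, MAMMAL_HALF_LIFE_MYR)
        print(f"{age:>3} Myr: fly {fly:.4f}, mammal {mammal:.4f}")
```

    On this simple model, a repeat inserted around the time of the eutherian radiation (about 100 Myr ago) would be almost entirely erased in a fly genome but largely intact in a mammalian one, in line with the observation that the human genome is filled with ancient transposon copies.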

    (3) Whereas in the human two repeat families (LINE1 and Alu) account for 60% of all interspersed repeat sequence, the other organisms have no dominant families. Instead, the worm, fly and mustard weed genomes all contain many transposon families, each consisting of typically hundreds to thousands of elements. This difference may be explained by the observation that the vertically transmitted, long-term residential LINE and SINE elements represent 75% of interspersed repeats in the human genome, but only 5–25% in the other genomes. In contrast, the horizontally transmitted and shorter-lived DNA transposons represent only a small portion of all interspersed repeats in humans (6%) but a much larger fraction in fly, mustard weed and worm (25%, 49% and 87%, respectively). These features of the human genome are probably general to all mammals. The relative lack of horizontally transmitted elements may have its origin in the well developed immune system of mammals, as horizontal transfer requires infectious vectors, such as viruses, against which the immune system guards.

    We also looked for differences among mammals, by comparing the transposons in the human and mouse genomes. As with the human genome, care is required in calibrating the substitution clock for the mouse genome. There is considerable evidence that the rate of substitution per Myr is higher in rodent lineages than in the hominid lineages [139, 168, 169]. In fact, we found clear evidence for different rates of substitution by examining families of transposable elements whose insertions predate the divergence of the human and mouse lineages. In an analysis of 22 such families, we found that the substitution level was an average of 1.7-fold higher in mouse than human (not shown). (This is likely to be an underestimate because of an ascertainment bias against the most diverged copies.) The faster clock in mouse is also evident from the fact that the ancient LINE2 and MIR elements, which transposed before the mammalian radiation and are readily detectable in the human genome, cannot be readily identified in available mouse genomic sequence (Fig. 18).

    We used the best available estimates to calibrate substitution levels and time [169]. The ratio of substitution rates varied from about 1.7-fold higher over the past 100 Myr to about 2.6-fold higher over the past 25 Myr.

    The analysis shows that, although the overall density of the four transposon types in human and mouse is similar, the age distribution is strikingly different (Fig. 18). Transposon activity in the mouse genome has not undergone the decline seen in humans and proceeds at a much higher rate. In contrast to their possible extinction in humans, LTR retroposons are alive and well in the mouse with such representatives as the active IAP family and putatively active members of the long-lived ERVL and MaLR families. LINE1 and a variety of SINEs are quite active. These evolutionary findings are consistent with the empirical observations that new spontaneous mutations are 30 times more likely to be caused by LINE insertions in mouse than in human (3% versus 0.1%) [170] and 60 times more likely to be caused by transposable elements in general. It is estimated that around 1 in 600 mutations in human are due to transpositions, whereas 10% of mutations in mouse are due to transpositions (mostly IAP insertions).

    The contrast between human and mouse suggests that the explanation for the decline of transposon activity in humans may lie in some fundamental difference between hominids and rodents. Population structure and dynamics would seem to be likely suspects. Rodents tend to have large populations, whereas hominid populations tend to be small and may undergo frequent bottlenecks. Evolutionary forces affected by such factors include inbreeding and genetic drift, which might affect the persistence of active transposable elements [171]. Studies in additional mammalian lineages may shed light on the forces responsible for the differences in the activity of transposable elements [172].

    Variation in the distribution of repeats.

    We next explored variation in the distribution of repeats across the draft genome sequence, by calculating the repeat density in windows of various sizes across the genome. There is striking variation at smaller scales.
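
    As an illustration of the windowed calculation described above, the following is a minimal Python sketch, not the pipeline used for the draft sequence. The function name repeat_density, the half-open (start, end) interval convention and the toy coordinates are all assumptions made for the example. Overlapping annotations are merged first so that no base is counted twice.

```python
def repeat_density(repeats, seq_len, window):
    """Fraction of each window covered by repeat annotations.

    repeats: iterable of (start, end) half-open, 0-based intervals
             (hypothetical input format chosen for this sketch).
    """
    # Merge overlapping or abutting intervals so bases are counted once.
    merged = []
    for start, end in sorted(repeats):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    densities = []
    for w_start in range(0, seq_len, window):
        w_end = min(w_start + window, seq_len)
        # Sum the overlap of every merged interval with this window.
        covered = sum(
            max(0, min(end, w_end) - max(start, w_start))
            for start, end in merged
        )
        densities.append(covered / (w_end - w_start))
    return densities


# Toy example: a 200-base sequence scanned in 100-base windows.
toy = repeat_density([(0, 50), (40, 100), (150, 175)], seq_len=200, window=100)
```

    Running the same function with several window sizes is one simple way to compare variation at different scales.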

    Some regions of the genome are extraordinarily dense in repeats. The prizewinner appears
    to be a 525-kb region on chromosome Xp11, with an overall transposable element density of
    89%. This region contains a 200-kb segment with 98% density, as well as a segment of
    100 kb in which LINE1 sequences alone comprise 89% of the sequence. In addition, there
    are regions of more than 100 kb with extremely high densities of Alu (> 56% at three loci,
    including one on 7q11 with a 50-kb stretch of > 61% Alu) and the ancient transposons MIR
    (> 15% on chromosome 1p36) and LINE2 (> 18% on chromosome 22q12).

    In contrast, some genomic regions are nearly devoid of repeats. The absence of repeats
    may be a sign of large-scale cis-regulatory elements that cannot tolerate being interrupted
    by insertions. The four regions with the lowest density of interspersed repeats in the human
    genome are the four homeobox gene clusters, HOXA, HOXB, HOXC and HOXD (Fig.
    21). Each locus contains regions of around 100 kb containing less than 2% interspersed
    repeats. Ongoing sequence analysis of the four HOX clusters in mouse, rat and baboon
    shows a similar absence of transposable elements, and reveals a high density of conserved
    noncoding elements (K. Dewar and B. Birren, manuscript in preparation). The presence of
    a complex collection of regulatory regions may explain why individual HOX genes carried in
    transgenic mice fail to show proper regulation.

    It may be worth investigating other repeat-poor regions, such as a region on chromosome
    8q21 (1.5% repeat over 63 kb) containing a gene encoding a homeodomain zinc-finger
    protein (homologous to mouse pID 9663936), a region on chromosome 1p36 (5% repeat
    over 100 kb) with no obvious genes and a region on chromosome 18q22 (4% over 100 kb)
    containing three genes of unknown function (among which is KIAA0450). It will be
    interesting to see whether the homologous regions in the mouse genome have similarly
    resisted the insertion of transposable elements during rodent evolution.

    Distribution by GC content.

    We next focused on the correlation between the nature of
    the transposons in a region and its GC content. We calculated the density of each repeat
    type as a function of the GC content in 50-kb windows (Fig. 22). As has been reported [142,
    173-176], LINE sequences occur at much higher density in AT-rich regions (roughly fourfold
    enriched), whereas SINEs (MIR, Alu) show the opposite trend (for Alu, up to fivefold
    lower in AT-rich DNA). LTR retroposons and DNA transposons show a more uniform
    distribution, dipping only in the most GC-rich regions.
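
    The GC content of fixed-size windows can be computed along the following lines. This is a hedged sketch only: the function name gc_content_by_window and the choice to ignore ambiguous bases (N) are assumptions for the example, and the 50-kb default simply mirrors the window size quoted above.

```python
def gc_content_by_window(seq, window=50_000):
    """GC fraction of consecutive windows, ignoring ambiguous bases.

    Returns None for a window that contains no A, C, G or T at all.
    """
    seq = seq.upper()
    fractions = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        at = chunk.count("A") + chunk.count("T")
        fractions.append(gc / (gc + at) if gc + at else None)
    return fractions


# Toy example with 8-base windows instead of 50 kb.
toy_gc = gc_content_by_window("GGCCAATT" + "ATATATAT" + "NNNN", window=8)
```

    Pairing each window's GC fraction with the density of a given repeat class in the same window gives the kind of relationship plotted in Fig. 22.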

    The preference of LINEs for AT-rich DNA seems like a reasonable way for a genomic
    parasite to accommodate its host, by targeting gene-poor AT-rich DNA and thereby
    imposing a lower mutational burden. Mechanistically, selective targeting is nicely explained
    by the fact that the preferred cleavage site of the LINE endonuclease is TTTT/A (where
    the slash indicates the point of cleavage), which is used to prime reverse transcription from
    the poly(A) tail of LINE RNA [177].

    The contrary behaviour of SINEs, however, is baffling. How do SINEs accumulate in
    GC-rich DNA, particularly if they depend on the LINE transposition machinery [178]?
    Notably, the same pattern is seen for the Alu-like B1 and the tRNA-derived SINEs in
    mouse and for MIR in human [142]. One possibility is that SINEs somehow target GC-rich
    DNA for insertion. The alternative is that SINEs initially insert with the same proclivity for
    AT-rich DNA as LINEs, but that the distribution is subsequently reshaped by evolutionary
    forces [142, 179].

    We used the draft genome sequence to investigate this mystery by comparing the
    proclivities of young, adolescent, middle-aged and old Alus (Fig. 23). Strikingly, recent Alus
    show a preference for AT-rich DNA resembling that of LINEs, whereas progressively
    older Alus show a progressively stronger bias towards GC-rich DNA. These results
    indicate that the GC bias must result from strong pressure: Fig. 23 shows that a 13-fold
    enrichment of Alus in GC-rich DNA has occurred within the last 30 Myr, and possibly more
    recently.

    These results raise a new mystery. What is the force that produces the great and rapid
    enrichment of Alus in GC-rich DNA? One explanation may be that deletions are more
    readily tolerated in gene-poor AT-rich regions than in gene-rich GC-rich regions, resulting in
    older elements being enriched in GC-rich regions. Such an enrichment is seen for
    transposable elements such as DNA transposons (Fig. 24). However, this effect seems too
    slow and too small to account for the observed remodelling of the Alu distribution. This can
    be seen by performing a similar analysis for LINE elements (Fig. 25). There is no
    significant change in the LINE distribution over the past 100 Myr, in contrast to the rapid
    change seen for Alu. There is an eventual shift after more than 100 Myr, although its
    magnitude is still smaller than seen for Alus.

    These observations indicate that there may be some force acting particularly on Alus. This
    could be a higher rate of random loss of Alus in AT-rich DNA, negative selection against
    Alus in AT-rich DNA or positive selection in favour of Alus in GC-rich DNA. The first two
    possibilities seem unlikely because AT-rich DNA is gene-poor and tolerates the
    accumulation of other transposable elements. The third seems more feasible, in that it
    involves selecting in favour of the minority of Alus in GC-rich regions rather than against
    the majority that lie in AT-rich regions. But positive selection for Alus in GC-rich regions
    would imply that they benefit the organism.

    Schmid [180] has proposed such a function for SINEs. This hypothesis is based on the
    observation that in many species SINEs are transcribed under conditions of stress, and the
    resulting RNAs specifically bind a particular protein kinase (PKR) and block its ability to
    inhibit protein translation [181-183]. SINE RNAs would thus promote protein translation under
    stress. SINE RNA may be well suited to such a role in regulating protein translation,
    because it can be quickly transcribed in large quantities from thousands of elements and it
    can function without protein translation. Under this theory, there could be positive selection
    for SINEs in readily transcribed open chromatin such as is found near genes. This could
    explain the retention of Alus in gene-rich GC-rich regions. It is also consistent with the
    observation that SINE density in AT-rich DNA is higher near genes [142].

    Further insight about Alus comes from the relationship between Alu density and GC content
    on individual chromosomes (Fig. 26). There are two outliers. Chromosome 19 is even richer
    in Alus than predicted by its (high) GC content; the chromosome comprises 2% of the
    genome, but contains 5% of Alus. On the other hand, chromosome Y shows the lowest
    density of Alus relative to its GC content, being higher than average for GC content less
    than 40% and lower than average for GC content over 40%. Even in AT-rich DNA, Alus
    are under-represented on chromosome Y compared with other young interspersed repeats
    (see below). These phenomena may be related to an unusually high gene density on
    chromosome 19 and an unusually low density of somatically active genes on chromosome Y
    (both relative to GC content). This would be consistent with the idea that Alu correlates not
    with GC content but with actively transcribed genes.

    Our results may support the controversial idea that SINEs actually earn their keep in the
    genome. Clearly, much additional work will be needed to prove or disprove the hypothesis
    that SINEs are genomic symbionts.

    Biases in human mutation.

    Indirect studies have suggested that nucleotide substitution is not uniform across mammalian genomes [184-187]. By studying sets of repeat elements
    belonging to a common cohort, one can directly measure nucleotide substitution rates in
    different regions of the genome. We find strong evidence that the pattern of neutral
    substitution differs as a function of local GC content (Fig. 27). Because the results are
    observed in repetitive elements throughout the genome, the variation in the pattern of
    nucleotide substitution seems likely to be due to differences in the underlying mutational
    process rather than to selection.

    The effect can be seen most clearly by focusing on the substitution process γ ↔ α, where
    γ denotes GC or CG base pairs and α denotes AT or TA base pairs. If K is the equilibrium
    constant in the direction of α base pairs (defined by the ratio of the forward and reverse
    rates), then the equilibrium GC content should be 1/(1 + K). Two observations emerge.

    First, there is a regional bias in substitution patterns. The equilibrium constant varies as a
    function of local GC content: γ base pairs are more likely to mutate towards α base pairs in
    AT-rich regions than in GC-rich regions. For the analysis in Fig. 27, the equilibrium constant
    K is 2.5, 1.9 and 1.2 when the draft genome sequence is partitioned into three bins with
    average GC content of 37, 43 and 50%, respectively. This bias could be due to a reported
    tendency for GC-rich regions to replicate earlier in the cell cycle than AT-rich regions and
    for guanine pools, which are limiting for DNA replication, to become depleted late in the cell
    cycle, thereby resulting in a small but significant shift in substitution towards α base pairs [186,
    188]. Another theory proposes that many substitutions are due to differences in DNA repair
    mechanisms, possibly related to transcriptional activity and thereby to gene density and GC
    content [185, 189, 190].

    Second, there is an absolute bias in substitution patterns resulting in directional pressure
    towards lower GC content throughout the human genome. The genome is not at equilibrium
    with respect to the pattern of nucleotide substitution: the expected equilibrium GC content
    corresponding to the values of K above is 29, 35 and 44% for regions with average GC
    contents of 37, 43 and 50%, respectively. Recent observations on SNPs [190] confirm that the
    mutation pattern in GC-rich DNA is biased towards α base pairs; it should be possible to
    perform similar analyses throughout the genome with the availability of 1.4 million SNPs [97,
    191]. On the basis solely of nucleotide substitution patterns, the GC content would be
    expected to be about 7% lower throughout the genome.
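
    The relation between K and the equilibrium GC content can be checked numerically. At equilibrium the flux from GC-type to AT-type base pairs equals the reverse flux, so the ratio of AT-type to GC-type base pairs is K and the GC fraction is 1/(1 + K). The sketch below is illustrative only (the function name equilibrium_gc is an assumption); it uses the rounded K values quoted in the text, so it reproduces the quoted percentages only approximately.

```python
def equilibrium_gc(k):
    """Equilibrium GC fraction for equilibrium constant K.

    K is taken as (rate GC->AT) / (rate AT->GC), so AT:GC = K at equilibrium.
    """
    return 1.0 / (1.0 + k)


# (observed average GC fraction, K) for the three bins quoted in the text.
bins = [(0.37, 2.5), (0.43, 1.9), (0.50, 1.2)]

for observed_gc, k in bins:
    eq = equilibrium_gc(k)
    print(
        f"observed GC {observed_gc:.0%}  K = {k}  "
        f"equilibrium GC {eq:.1%}  gap {observed_gc - eq:+.1%}"
    )
```

    In every bin the equilibrium value lies below the observed GC content, which is the directional pressure described above.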

    What accounts for the higher GC content? One possible explanation is that in GC-rich
    regions, a considerable fraction of the nucleotides is likely to be under functional constraint
    owing to the high gene density. Selection on coding regions and regulatory CpG islands may
    maintain the higher-than-predicted GC content. Another is that throughout the rest of the
    genome, a constant influx of transposable elements tends to increase GC content (Fig. 28).
    Young repeat elements clearly have a higher GC content than their surrounding regions,
    except in extremely GC-rich regions. Moreover, repeat elements clearly shift with age
    towards a lower GC content, closer to that of the neighbourhood in which they reside. Much
    of the 'non-repeat' DNA in AT-rich regions probably consists of ancient repeats that are not
    detectable by current methods and that have had more time to approach the local
    equilibrium value.

    The repeats can also be used to study how the mutation process is affected by the
    immediately adjacent nucleotide. Such 'context effects' will be discussed elsewhere (A. Kas
    and A. F. A. Smit, unpublished results).
     

    Fast living on chromosome Y.

    The pattern of interspersed repeats can be used to shed
    light on the unusual evolutionary history of chromosome Y. Our analysis shows that the
    genetic material on chromosome Y is unusually young, probably owing to a high tolerance
    for gain of new material by insertion and loss of old material by deletion. Several lines of
    evidence support this picture. For example, LINE elements on chromosome Y are on
    average much younger than those on autosomes (not shown). Similarly, MaLR-family
    retroposons on chromosome Y are younger than those on autosomes, with the
    representation of subfamilies showing a strong inverse correlation with the age of the
    subfamily. Moreover, chromosome Y has a relative over-representation of the younger
    retroviral class II (ERVK) and a relative under-representation of the primarily older class
    III (ERVL) compared with other chromosomes. Overall, chromosome Y seems to maintain
    a youthful appearance by rapid turnover.

    Interspersed repeats on chromosome Y can also be used to estimate the relative mutation
    rates, α_m and α_f, in the male and female germlines. Chromosome Y always resides in males,
    whereas chromosome X resides in females twice as often as in males. The substitution
    rates, μ_Y and μ_X, on these two chromosomes should thus be in the ratio μ_Y:μ_X =
    (α_m):(α_m + 2α_f)/3, provided that one considers equivalent neutral sequences. Several authors
    have estimated the mutation rate in the male germline to be fivefold higher than in the
    female germline, by comparing the rates of evolution of X- and Y-linked genes in humans
    and primates. However, Page and colleagues [192] have challenged these estimates as too
    high. They studied a 39-kb region that is apparently devoid of genes and resides within a
    large segmental duplication from X to Y that occurred 3–4 Myr ago in the human lineage.
    On the basis of phylogenetic analysis of the sequence on human Y and human, chimp and
    gorilla X, they obtained a much lower estimate of μ_Y:μ_X = 1.36, corresponding to α_m:α_f =
    1.7. They suggested that the other estimates may have been higher because they were
    based on much longer evolutionary periods or because the genes studied may have been
    under selection.

    Our database of human repeats provides a powerful resource for addressing this question.
    We identified the repeat elements from recent subfamilies (effectively, birth cohorts dating
    from the past 50 Myr) and measured the substitution rates for subfamily members on
    chromosomes X and Y (Fig. 29). There is a clear linear relationship with a slope of μ_Y:μ_X
    = 1.57 corresponding to α_m:α_f = 2.1. The estimate is in reasonable agreement with that of
    Page et al., although it is based on much more total sequence (360 kb on Y, 1.6 Mb on X)
    and a much longer time period. In particular, the discrepancy with earlier reports is not
    explained by recent changes in the human lineage. Various theories have been proposed for
    the higher mutation rate in the male germline, including the greater number of cell divisions
    in the formation of sperm than eggs and different repair mechanisms in sperm and eggs.
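
    The conversion from the Y:X substitution ratio to the male:female mutation ratio follows from the relation given above. Writing R for the Y:X ratio and a for the male:female ratio, R = 3a/(a + 2), which rearranges to a = 2R/(3 − R). The Python sketch below is a worked check rather than the analysis itself; the function name male_to_female_ratio is an assumption. With the rounded inputs quoted in the text it returns roughly 1.66 for R = 1.36 and roughly 2.2 for R = 1.57, close to the reported 1.7 and 2.1; the small differences presumably reflect rounding of the published values.

```python
def male_to_female_ratio(y_to_x):
    """Solve R = 3a / (a + 2) for a, the male:female mutation-rate ratio.

    R (y_to_x) is the ratio of substitution rates on chromosomes Y and X.
    The model only admits values strictly between 0 and 3.
    """
    if not 0 < y_to_x < 3:
        raise ValueError("Y:X ratio must lie strictly between 0 and 3")
    return 2 * y_to_x / (3 - y_to_x)


for ratio in (1.36, 1.57):
    print(f"Y:X = {ratio}  ->  male:female = {male_to_female_ratio(ratio):.2f}")
```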

    Active transposons.

    We were interested in identifying the youngest retrotransposons in the draft genome sequence. This set should contain the currently active retrotransposons, as well as the insertion sites that are still polymorphic in the human population.

    The youngest branch in the phylogenetic tree of human LINE1 elements is called L1Hs
    (ref. 158); it differs in its 3' untranslated region (UTR) by 12 diagnostic substitutions from
    the next oldest subfamily (L1PA2). Within the L1Hs family, there are two subsets referred
    to as Ta and pre-Ta, defined by a diagnostic trinucleotide [193, 194]. All active L1 elements are
    thought to belong to these two subsets, because they account for all 14 known cases of
    human disease arising from new L1 transposition (with 13 belonging to the Ta subset and
    one to the pre-Ta subset) [195, 196]. These subsets are also of great interest for population
    genetics because at least 50% are still segregating as polymorphisms in the human
    population [194, 197]; they provide powerful markers for tracing population history because
    they represent unique (non-recurrent and non-revertible) genetic events that can be used
    (along with similarly polymorphic Alus) for reconstructing human migrations.

    LINE1 elements that are retrotransposition-competent should consist of a full-length
    sequence and should have both ORFs intact. Eleven such elements from the Ta subset have
    been identified, including the likely progenitors of mutagenic insertions into the factor VIII
    and dystrophin genes [198-202]. A cultured cell retrotransposition assay has revealed that eight
    of these elements remain retrotransposition-competent [200, 202, 203].

    We searched the draft genome sequence and identified 535 LINEs belonging to the Ta
    subset and 415 belonging to the pre-Ta subset. These elements provide a large collection of
    tools for probing human population history. We also identified those consisting of full-length
    elements with intact ORFs, which are candidate active LINEs. We found 39 such elements
    belonging to the Ta subset and 22 belonging to the pre-Ta subset; this substantially increases
    the number in the first category and provides the first known examples in the second
    category. These elements can now be tested for retrotransposition competence in the cell
    culture assay. Preliminary analysis resulted in the identification of two of these elements as
    the likely progenitors of mutagenic insertions into the β-globin and RP2 genes (R. Badge
    and J. V. Moran, unpublished data). Similar analyses should allow the identification of the
    progenitors of most, if not all, other known mutagenic L1 insertions.
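
    The screening criterion described above (a full-length element with both ORFs intact) can be sketched as follows. This is an assumption-laden illustration, not the procedure used in the analysis: the function names, the 6,000-bp length threshold and the requirement that each ORF be supplied as an already extracted, in-frame sequence are all hypothetical choices made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_is_intact(orf_seq):
    """True if the sequence reads as one uninterrupted open reading frame:
    starts with ATG, ends with a stop codon, and has no stop in between."""
    seq = orf_seq.upper()
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return (
        codons[0] == "ATG"
        and codons[-1] in STOP_CODONS
        and not any(c in STOP_CODONS for c in codons[1:-1])
    )


def is_candidate_active_l1(element_len, orf1_seq, orf2_seq, min_full_length=6000):
    """Flag an element as a candidate active LINE1.

    min_full_length is an illustrative threshold, not a value from the text.
    """
    return (
        element_len >= min_full_length
        and orf_is_intact(orf1_seq)
        and orf_is_intact(orf2_seq)
    )
```

    Elements passing such a filter are only candidates; as noted above, competence has to be confirmed in the cell culture assay.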

    L1 elements can carry extra DNA if transcription extends through the native transcriptional
    termination site into flanking genomic DNA. This process, termed L1-mediated
    transduction, provides a means for the mobilization of DNA sequences around the genome
    and may be a mechanism for 'exon shuffling' [204]. Twenty-one per cent of the 71 full-length
    L1s analysed contained non-L1-derived sequences before the 3' target-site duplication site,
    in cases in which the site was unambiguously recognizable. The length of the transduced
    sequence was 30–970 bp, supporting the suggestion that 0.5–1.0% of the human genome
    may have arisen by LINE-based transduction of 3' flanking sequences [205, 206].

    Our analysis also turned up two instances of 5' transduction (145 bp and 215 bp). Although
    this possibility had been suggested on the basis of cell culture models [195, 203], these are the
    first documented examples. Such events may arise from transcription initiating in a cellular
    promoter upstream of the L1 elements. L1 transcription is generally confined to the
    germline [207, 208], but transcription from other promoters could explain a somatic L1
    retrotransposition event that resulted in colon cancer [206].

    Transposons as a creative force.

    The primary force for the origin and expansion of most transposons has been selection for their ability to create progeny, and not a selective advantage for the host. However, these selfish pieces of DNA have been responsible for important innovations in many genomes, for example by contributing regulatory elements
    and even new genes.

    Twenty human genes have been recognized as probably derived from transposons [142, 209].
    These include the RAG1 and RAG2 recombinases and the major centromere-binding
    protein CENPB. We scanned the draft genome sequence and identified another 27 cases,
    bringing the total to 47 (Table 13; refs 142, 209). All but four are derived from DNA
    transposons, which give rise to only a small proportion of the interspersed repeats in the
    genome. Why there are so many DNA transposase-like genes, many of which still contain
    the critical residues for transposase activity, is a mystery.

    To illustrate this concept, we describe the discovery of one of the new examples. We
    searched the draft genome sequence to identify the autonomous DNA transposon
    responsible for the distribution of the non-autonomous MER85 element, one of the most
    recently (40–50 Myr ago) active DNA transposons. Most non-autonomous elements are
    internal deletion products of a DNA transposon. We identified one instance of a large
    (1,782 bp) ORF flanked by the 5' and 3' halves of a MER85 element. The ORF encodes a
    novel protein (partially published as pID 6453533) whose closest homologue is the
    transposase of the piggyBac DNA transposon, which is found in insects and has the same
    characteristic TTAA target-site duplications [210] as MER85. The ORF is actively transcribed
    in fetal brain and in cancer cells. That it has not been lost to mutation in 40–50 Myr of
    evolution (whereas the flanking, noncoding, MER85-like termini show the typical divergence
    level of such elements) and is actively transcribed provides strong evidence that it has been
    adopted by the human genome as a gene. Its function is unknown.

    LINE1 activity clearly has also had fringe benefits. We mentioned above the possibility of
    exon reshuffling by cotranscription of neighbouring DNA. The LINE1 machinery can also
    cause reverse transcription of genic mRNAs, which typically results in nonfunctional
    processed pseudogenes but can, occasionally, give rise to functional processed genes. There
    are at least eight human and eight mouse genes for which evidence strongly supports such
    an origin [211] (see http://www-ifi.uni-muenster.de/exapted-retrogenes/tables.html). Many
    other intronless genes may have been created in the same way.

    Transposons have made other creative contributions to the genome. A few hundred genes,
    for example, use transcriptional terminators donated by LTR retroposons (data not shown).
    Other genes employ regulatory elements derived from repeat elements [211].

    Simple sequence repeats

    Simple sequence repeats (SSRs) are a rather different type of repetitive structure that is
    common in the human genome—perfect or slightly imperfect tandem repeats of a particular
    k-mer. SSRs with a short repeat unit (n = 1–13 bases) are often termed microsatellites,
    whereas those with longer repeat units (n = 14–500 bases) are often termed minisatellites.
    With the exception of poly(A) tails from reverse transcribed messages, SSRs are thought to
    arise by slippage during DNA replication [212, 213].

    We compiled a catalogue of all SSRs over a given length in the human draft genome
    sequence, and studied their properties (Table 14). SSRs comprise about 3% of the human
    genome, with the greatest single contribution coming from dinucleotide repeats (0.5%). (The
    precise criteria for the number of repeat units and the extent of divergence allowed in an
    SSR affect the exact census, but not the qualitative conclusions.)
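
    To make the idea of an SSR census concrete, the sketch below finds perfect tandem repeats of a given unit length with a regular expression. It is a simplified illustration: the function name find_perfect_ssrs and its parameters unit_len and min_units are assumptions, it does not handle the slightly imperfect repeats included in the catalogue, and it does not reproduce the criteria behind Table 14. Units that are themselves repeats of a shorter motif (for example AA) are skipped so that a mononucleotide run is not counted as a dinucleotide repeat.

```python
import re


def find_perfect_ssrs(seq, unit_len, min_units):
    """Return (start, end, unit, n_units) for perfect tandem repeats.

    Coordinates are 0-based and half-open.
    """
    # One unit of unit_len bases followed by at least min_units - 1 copies.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_units - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        # Skip units that are repeats of a shorter motif, e.g. "AA" or "ACAC".
        if unit in (unit + unit)[1:-1]:
            continue
        n_units = (match.end() - match.start()) // unit_len
        hits.append((match.start(), match.end(), unit, n_units))
    return hits


# Toy example: six AC units embedded in flanking sequence.
toy_hits = find_perfect_ssrs("GGT" + "AC" * 6 + "TTG", unit_len=2, min_units=5)
```

    Changing min_units changes how many loci are reported, which is the sensitivity to criteria mentioned above.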

    There is approximately one SSR per 2 kb (the number of nonoverlapping tandem repeats is
    437 per Mb). The catalogue confirms various properties of SSRs that have been inferred
    from sampling approaches (Table 15). The most frequent dinucleotide repeats are AC and
    AT (50 and 35% of dinucleotide repeats, respectively), whereas AG repeats (15%) are less
    frequent and GC repeats (0.1%) are greatly under-represented. The most frequent
    trinucleotides are AAT and AAC (33% and 21%, respectively), whereas ACC (4.0%),
    AGC (2.2%), ACT (1.4%) and ACG (0.1%) are relatively rare. Overall, trinucleotide SSRs
    are much less frequent than dinucleotide SSRs [214].

    SSRs have been extremely important in human genetic studies, because they show a high
    degree of length polymorphism in the human population owing to frequent slippage by DNA
    polymerase during replication. Genetic markers based on SSRs, particularly (CA)n
    repeats, have been the workhorse of most human disease-mapping studies [101, 102]. The
    availability of a comprehensive catalogue of SSRs is thus a boon for human genetic studies.

    The SSR catalogue also allowed us to resolve a mystery regarding mammalian genetic
    maps. Such genetic maps in rat, mouse and human have a deficit of polymorphic (CA)n
    repeats on chromosome X [30, 101]. There are two possible explanations for this deficit. There
    may simply be fewer (CA)n repeats on chromosome X; or (CA)n repeats may be as dense
    on chromosome X but less polymorphic in the population. In fact, analysis of the draft
    genome sequence shows that chromosome X has the same density of (CA)n repeats per
    Mb as the autosomes (data not shown). Thus, the deficit of polymorphic markers relative
    to autosomes results from population genetic forces. Possible explanations include that
    chromosome X has a smaller effective population size, experiences more frequent selective
    sweeps reducing diversity (owing to its hemizygosity in males), or has a lower mutation rate
    (owing to its more frequent passage through the less mutagenic female germline). The
    availability of the draft genome sequence should provide ways to test these alternative
    explanations.

    Segmental duplications

    A remarkable feature of the human genome is the segmental duplication of portions of
    genomic sequence [215-217]. Such duplications involve the transfer of 1–200-kb blocks of
    genomic sequence to one or more locations in the genome. The locations of both donor and
    recipient regions of the genome are often not tandemly arranged, suggesting mechanisms
    other than unequal crossing-over for their origin. They are relatively recent, inasmuch as
    strong sequence identity is seen in both exons and introns (in contrast to regions that are
    considered to show evidence of ancient duplications, characterized by similarities only in
    coding regions). Indeed, many such duplications appear to have arisen in very recent
    evolutionary time, as judged by high sequence identity and by their absence in closely
    related species.

    Segmental duplications can be divided into two categories. First, interchromosomal
    duplications are defined as segments that are duplicated among nonhomologous
    chromosomes. For example, a 9.5-kb genomic segment of the adrenoleukodystrophy locus
    from Xq28 has been duplicated to regions near the centromeres of chromosomes 2, 10, 16
    and 22 (refs 218, 219). Anecdotal observations suggest that many interchromosomal
    duplications map near the centromeric and telomeric regions of human
    chromosomes [218-233].

    The second category is intrachromosomal duplications, which occur within a particular
    chromosome or chromosomal arm. This category includes several duplicated segments, also
    known as low copy repeat sequences, that mediate recurrent chromosomal structural
    rearrangements associated with genetic disease [215, 217]. Examples on chromosome 17
    include three copies of a roughly 200-kb repeat separated by around 5 Mb and two
    copies of a roughly 24-kb repeat separated by 1.5 Mb. The copies are so similar (99%
    identity) that paralogous recombination events can occur, giving rise to contiguous gene
    syndromes: Smith–Magenis syndrome and Charcot–Marie–Tooth syndrome 1A,
    respectively [34, 234]. Several other examples are known and are also suspected to be
    responsible for recurrent microdeletion syndromes (for example, Prader–Willi/Angelman,
    velocardiofacial/DiGeorge and Williams' syndromes [215, 235-240]).

    Until now, the identification and characterization of segmental duplications have been based
    on anecdotal reports—for example, finding that certain probes hybridize to multiple
    chromosomal sites or noticing duplicated sequence at certain recurrent chromosomal
    breakpoints. The availability of the entire genomic sequence will make it possible to explore
    the nature of segmental duplications more systematically. This analysis can begin with the
    current state of the draft genome sequence, although caution is required because some
    apparent duplications may arise from a failure to merge sequence contigs from overlapping
    clones. Alternatively, erroneous assembly of closely related sequences from nonoverlapping
    clones may underestimate the true frequency of such features, particularly among those
    segments with the highest sequence similarity. Accordingly, we adopted a conservative
    approach for estimating such duplication from the available draft genome sequence.

    Pericentromeres and subtelomeres.

    We began by re-evaluating the finished sequences
    of chromosomes 21 and 22. The initial papers on these chromosomes [93, 94] noted some
    instances of interchromosomal duplication near each centromere. With the ability now to
    compare these chromosomes to the vast majority of the genome, it is apparent that the
    regions near the centromeres consist almost entirely of interchromosomal duplicated
    segments, with little or no unique sequence. Smaller regions of interchromosomal duplication
    are also observed near the telomeres.

    Chromosome 22 contains a region of 1.5 Mb adjacent to the centromere in which 90% of
    sequence can now be recognized to consist of interchromosomal duplication (Fig. 30).
    Conversely, 52% of the interchromosomal duplications on chromosome 22 were located in
    this region, which comprises only 5% of the chromosome. Also, the subtelomeric end
    is a 50-kb region consisting almost entirely of interchromosomal duplications.

    Chromosome 21 presents a similar landscape (Fig. 31). The first 1 Mb after the centromere
    is composed of interchromosomal repeats, as well as the largest (> 200 kb) block of
    intrachromosomally duplicated material. Again, most interchromosomal duplications on the
    chromosome map to this region and the most subtelomeric region (30 kb) shows extensive
    duplication among nonhomologous chromosomes.

    The pericentromeric regions are structurally very complex, as illustrated for chromosome 21
    in Fig. 32a. The pericentromeric regions appear to have been bombarded by successive
    insertions of duplications; the insertion events must be fairly recent because the degree of
    sequence conservation with the genomic source loci is fairly high (90–100%, with an
    apparent peak around 96%). Distinct insertions are typically separated by AT-rich or
    GC-rich minisatellite-like repeats that have been hypothesized to have a functional role in
    targeting duplications to these regions [233, 241].

    A single genomic source locus often gives rise to pericentromeric copies on multiple
    chromosomes, with each having essentially the same breakpoints and the same degree of
    divergence. An example of such a source locus on Xq28 is shown in Fig. 32b. Phylogenetic
    analysis has suggested a two-step mechanism for the origin and dispersal of these
    segments, whereby an initial segmental duplication in the pericentromeric region of one
    chromosome occurs and is then redistributed as part of a larger cassette to other such
    regions [242].

    A comprehensive analysis for all chromosomes will have to await complete sequencing of
    the genome, but the evidence from the draft genome sequence indicates that the same
    picture is likely to be seen throughout the genome. Several papers have analysed finished
    segments within pericentromeric regions of chromosomes 2 (160 kb), 10 (400 kb) and 16
    (300 kb), all of which show extensive interchromosomal segmental duplication [215, 219,
    232, 233]. An example from another pericentromeric region on chromosome 11 is shown in Fig.
    32c. Interchromosomal duplications in subtelomeric regions also appear to be a fairly
    general phenomenon, as illustrated by a large tract (500 kb) of complex duplication on
    chromosome 7 (Fig. 32d).

    The explanation for the clustering of segmental duplications may be that the genome has a
    damage-control mechanism whereby chromosomal breakage products are preferentially
    inserted into pericentromeric and, to a lesser extent, subtelomeric regions. The possibility of
    a specific mechanism for the insertion of these sequences has been suggested on the basis
    of the unusual sequences found flanking the insertions. Although it is also possible that these
    regions simply have greater tolerance for large insertions, many large gene-poor 'deserts'
    have been identified [93] and there is no accumulation of duplicated segments within these
    regions. Along with the fact that transitions between duplicons (from different regions of the
    genome) occur at specific sequences, this suggests that active recruitment of duplications to
    such regions may occur. In any case, the duplicated regions are in general young (with
    many duplications showing <6% nucleotide divergence from their source loci) and in
    constant flux, both through additional duplications and by large-scale exchange among
    similar chromosomal environments. There is evidence of structural polymorphism in the
    human population, such as the presence or absence of olfactory receptor segments located
    within the telomeric regions of several human chromosomes [226, 227].

    Genome-wide analysis of segmental duplications.

    We also performed a global
    genome-wide analysis to characterize the amount of segmental duplication in the genome.
    We 'repeat-masked' the known interspersed repeats in the draft genome sequence and
    compared the remaining draft genomic sequence with itself in a massive all-by-all BLASTN
    similarity search. We excluded matches in which the sequence identity was so high that it
    might reflect artefactual duplications resulting from a failure to overlap sequence contigs
    correctly in assembling the draft genome sequence. Specifically, we considered only
    matches with less than 99.5% identity for finished sequence and less than 98% identity for
    unfinished sequence.

    We took several approaches to avoid counting artefactual duplications in the sequence. In
    the first approach, we studied only finished sequence. We compared the finished sequence
    with itself, to identify segments of at least 1 kb and 90–99.5% sequence identity. This
    analysis will underestimate the extent of segmental duplication, because it requires that at
    least two copies of the segment are present in the finished sequence and because some true
    duplications have over 99.5% identity.
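
    As an illustration only, the identity and length thresholds described above can be written
    as a small filter over alignment hits. This is a minimal sketch, not the pipeline used for
    the analysis: the Hit record and its field names are hypothetical, and applying the 1-kb
    minimum length to unfinished sequence as well as finished sequence is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one self-alignment match."""
    length_bp: int        # aligned length in base pairs
    identity_pct: float   # percent sequence identity of the match
    finished: bool        # True if the match lies in finished sequence


MIN_LENGTH_BP = 1_000             # "at least 1 kb"
MIN_IDENTITY_PCT = 90.0           # lower bound of the identity window
MAX_IDENTITY_FINISHED = 99.5      # finished sequence: keep matches below 99.5%
MAX_IDENTITY_UNFINISHED = 98.0    # unfinished sequence: keep matches below 98%


def is_candidate_duplication(hit: Hit) -> bool:
    """Return True if the hit falls inside the conservative identity window."""
    ceiling = MAX_IDENTITY_FINISHED if hit.finished else MAX_IDENTITY_UNFINISHED
    return (
        hit.length_bp >= MIN_LENGTH_BP
        and MIN_IDENTITY_PCT <= hit.identity_pct < ceiling
    )
```

    A match at 98.5% identity would be kept in finished sequence but discarded in unfinished
    sequence, which is why the stricter ceiling loses some true duplications.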

    The finished sequence consists of at least 3.3% segmental duplication (Table 16).
    Interchromosomal duplication accounts for about 1.5% and intrachromosomal duplication
    for about 2%, with some overlap (0.2%) between these categories. We analysed the
    lengths and divergence of the segmental duplications (Fig. 33). The duplications tend to be
    large (10–50 kb) and highly homologous, especially for the interchromosomal segments. The
    sequence identity for the interchromosomal duplications appears to peak between 96.5%
    and 97.5%. This may indicate that interchromosomal duplications occurred in a punctuated
    manner. It will be intriguing to investigate whether such genomic upheaval has a role in
    speciation events.
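
    The 3.3% total quoted above for finished sequence is consistent with adding the two
    categories and subtracting their overlap. The short check below uses the rounded
    percentages from the text, so it is approximate by construction; the function name is
    illustrative.

```python
def union_pct(inter_pct: float, intra_pct: float, overlap_pct: float) -> float:
    """Inclusion-exclusion: total = inter + intra - overlap."""
    return inter_pct + intra_pct - overlap_pct


# Rounded values taken from the text ("about 1.5%", "about 2%", overlap 0.2%).
total_pct = union_pct(1.5, 2.0, 0.2)
```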

    In a second approach, we compared the entire human draft genome sequence (finished and
    unfinished) with itself to identify duplications with 90–98% sequence identity (Table 17).
    The draft genome sequence contains at least 3.6% segmental duplication. The actual
    proportion will be significantly higher, because we excluded many true matches with more
    than 98% sequence identity (at least 1.1% of the finished sequence). Although exact
    measurement must await a finished sequence, the human genome seems likely to contain
    about 5% segmental duplication, with most of this sequence in large blocks (> 10 kb). Such
    a high proportion of large duplications clearly distinguishes the human genome from other
    sequenced genomes, such as the fly and worm (Table 18).

    The structure of large highly paralogous regions presents one of the 'serious and
    unanticipated challenges' to producing a finished sequence of the genome [46]. The absence of
    unique STS or fingerprint signatures over large genomic distances (1 Mb) and the high
    degree of sequence similarity make the distinction between paralogous sequence variation
    and allelic polymorphism problematic. Furthermore, the fact that such regions frequently
    harbour intron–exon structures of genuine unique sequence will complicate efforts to
    generate a genome-wide SNP map. The data indicate that a modest portion of the human
    genome may be relatively recalcitrant to genomic-based methods for SNP detection. Owing
    to their repetitive nature and their location in the genome, segmental duplications may well
    be underestimated by the current analysis. An understanding of the biology, pathology and
    evolution of these duplications will require specialized efforts within these exceptional
    regions of the human genome. The presence and distribution of such segments may provide
    evolutionary fodder for processes of exon shuffling and a general increase in protein
    diversity associated with domain accretion. It will be important to consider both
    genome-wide duplication events and more restricted punctuated events of genome
    duplication as forces in the evolution of vertebrate genomes.

    ...

    Gene content of the human genome

    Genes (or at least their coding regions) comprise only a tiny fraction of human DNA, but
    they represent the major biological function of the genome and the main focus of interest by
    biologists. They are also the most challenging feature to identify in the human genome
    sequence.

    The ultimate goal is to compile a complete list of all human genes and their encoded
    proteins, to serve as a 'periodic table' for biomedical research [243]. But this is a difficult task.
    In organisms with small genomes, it is straightforward to identify most genes by the
    presence of long ORFs. In contrast, human genes tend to have small exons (encoding an
    average of only 50 codons) separated by long introns (some exceeding 10 kb). This creates
    a signal-to-noise problem, with the result that computer programs for direct gene prediction
    have only limited accuracy. Instead, computational prediction of human genes must rely
    largely on the availability of cDNA sequences or on sequence conservation with genes and
    proteins from other organisms. This approach is adequate for strongly conserved genes
    (such as histones or ubiquitin), but may be less sensitive to rapidly evolving genes (including
    many crucial to speciation, sex determination and fertilization).
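
    To make the contrast concrete, the ORF-based approach that works in small genomes can be
    sketched in a few lines. This is a deliberately naive illustration, not one of the
    gene-prediction programs referred to above: it scans only the forward strand, assumes an
    ATG start codon and the three standard stop codons, and ignores introns entirely, which
    is why such a scan is of limited use on human genomic DNA.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str) -> str:
    """Return the longest ATG-to-stop ORF in the three forward reading frames.

    Returns an empty string if no complete ORF is found.
    """
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None  # index of the current open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None
    return best
```

    With exons averaging only about 50 codons, a length cut-off strict enough to suppress
    chance ORFs would also discard most genuine human exons.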

    Here we describe our efforts to recognize both the RNA genes and protein-coding genes in
    the human genome. We also study the properties of the predicted human protein set,
    attempting to discern how the human proteome differs from those of invertebrates such as
    worm and fly.

    Noncoding RNAs

    Although biologists often speak of a tight coupling between 'genes and their encoded protein
    products', it is important to remember that thousands of human genes produce noncoding
    RNAs (ncRNAs) as their ultimate product [244]. There are several major classes of ncRNA.
    (1) Transfer RNAs (tRNAs) are the adapters that translate the triplet nucleic acid code of
    RNA into the amino-acid sequence of proteins; (2) ribosomal RNAs (rRNAs) are also
    central to the translational machinery, and recent X-ray crystallography results strongly
    indicate that peptide bond formation is catalysed by rRNA, not protein [245, 246]; (3) small
    nucleolar RNAs (snoRNAs) are required for rRNA processing and base modification in the
    nucleolus [247, 248]; and (4) small nuclear RNAs (snRNAs) are critical components of
    spliceosomes, the large ribonucleoprotein (RNP) complexes that splice introns out of
    pre-mRNAs in the nucleus. Humans have both a major, U2 snRNA-dependent spliceosome
    that splices most introns, and a minor, U12 snRNA-dependent spliceosome that splices a
    rare class of introns that often have AT/AC dinucleotides at the splice sites instead of the
    canonical GT/AG splice site consensus [249].

    Other ncRNAs include both RNAs of known biochemical function (such as telomerase
    RNA and the 7SL signal recognition particle RNA) and ncRNAs of enigmatic function
    (such as the large Xist transcript implicated in X dosage compensation [250], or the small vault
    RNAs found in the bizarre vault ribonucleoprotein complex [251], which is three times the
    mass of the ribosome but has unknown function).

    ncRNAs do not have translated ORFs, are often small and are not polyadenylated.
    Accordingly, novel ncRNAs cannot readily be found by computational gene-finding
    techniques (which search for features such as ORFs) or experimental sequencing of cDNA
    or EST libraries (most of which are prepared by reverse transcription using a primer
    complementary to a poly(A) tail). Even if the complete finished sequence of the human
    genome were available, discovering novel ncRNAs would still be challenging. We can,
    however, identify genomic sequences that are homologous to known ncRNA genes, using
    BLASTN or, in some cases, more specialized methods.

    It is sometimes difficult to tell whether such homologous genes are orthologues, paralogues
    or closely related pseudogenes (because inactivating mutations are much less obvious than
    for protein-coding genes). For tRNA, there is sufficiently detailed information about the
    cloverleaf secondary structure to allow true genes and pseudogenes to be distinguished with
    high sensitivity. For many other ncRNAs, there is much less structural information and so
    we employ an operational criterion of high sequence similarity (> 95% sequence identity and
    > 95% full length) to distinguish true genes from pseudogenes. These assignments will
    eventually need to be reconciled with experimental data.
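
    The operational criterion can be stated compactly in code. The sketch below is
    illustrative only: the function and parameter names are hypothetical, "full length" is
    interpreted as aligned length divided by the length of the known ncRNA, and treating
    the 95% boundaries as exclusive is an assumption.

```python
def classify_ncrna_hit(
    identity_pct: float,
    aligned_len: int,
    query_len: int,
    min_identity_pct: float = 95.0,
    min_coverage_pct: float = 95.0,
) -> str:
    """Label a genomic match to a known ncRNA as 'gene' or 'pseudogene'."""
    if query_len <= 0:
        raise ValueError("query_len must be positive")
    coverage_pct = 100.0 * aligned_len / query_len
    if identity_pct > min_identity_pct and coverage_pct > min_coverage_pct:
        return "gene"
    return "pseudogene"
```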

    Transfer RNA genes.

    The classical experimental estimate of the number of human tRNA
    genes is 1,310 (ref. 252). In the draft genome sequence, we find only 497 human tRNA
    genes (Tables 19, 20). How do we account for this discrepancy? We believe that the
    original estimate is likely to have been inflated in two respects. First, it came from a
    hybridization experiment that probably counted closely related pseudogenes; by analysis of
    the draft genome sequence, there are in fact 324 tRNA-derived putative pseudogenes
    (Table 20). Second, the earlier estimate assumed too high a value for the size of the human
    genome; repeating the calculation using the correct value yields an estimate of about 890
    tRNA-related loci, which is in reasonable accord with our count of 821 tRNA genes and
    pseudogenes in the draft genome sequence.
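
    The arithmetic behind this reconciliation can be checked directly. The snippet below
    uses only the numbers given in the text; the interpretation of the ratio 890/1,310 as
    the factor by which the genome size was overestimated assumes that the hybridization
    estimate scales linearly with genome size.

```python
CLASSICAL_ESTIMATE = 1310   # classical experimental estimate of tRNA genes
RESCALED_ESTIMATE = 890     # same estimate after correcting the genome size
TRUE_GENES = 497            # tRNA genes found in the draft genome sequence
PSEUDOGENES = 324           # tRNA-derived putative pseudogenes

# Genes plus pseudogenes found in the draft sequence.
total_loci = TRUE_GENES + PSEUDOGENES

# Implied correction factor (roughly 0.68), under the linear-scaling assumption.
implied_ratio = RESCALED_ESTIMATE / CLASSICAL_ESTIMATE

# Remaining gap between the rescaled estimate and the observed count (under 8%).
relative_gap = (RESCALED_ESTIMATE - total_loci) / RESCALED_ESTIMATE
```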

    The human tRNA gene set predicted from the draft genome sequence appears to include
    most of the known human tRNA species. The draft genome sequence contains 37 of 38
    human tRNA species listed in a tRNA database [253], allowing for up to one mismatch. This
    includes one copy of the known gene for a specialized selenocysteine tRNA, one of several
    components of a baroque translational mechanism that reads UGA as a selenocysteine
    codon in certain rare mRNAs that carry a specific cis-acting RNA regulatory site (a
    so-called SECIS element) in their 3' UTRs. The one tRNA gene in the database not found
    in the draft genome sequence is DE9990, a tRNAGlu species, which differs in two positions
    from the most related tRNA gene in the human genome. Possible explanations are that the
    database version of this tRNA contains two errors, the gene is polymorphic or this is a
    genuine functional tRNA that is missing from the draft genome sequence. (The database
    also lists one additional tRNA gene (DS9994), but this is apparently a contaminant, most
    similar to bacterial tRNAs; the parent entry (Z13399) was withdrawn from the DNA
    database, but the tRNA entry has not yet been removed from the tRNA database.)
    Although the human set appears substantially complete by this test, the tRNA gene numbers
    in Table 19 should be considered tentative and used with caution. The human and fly (but
    not the worm) are known to be missing significant amounts of heterochromatic DNA, and
    additional tRNA genes could be located there.

    With this caveat, the results indicate that the human has fewer tRNA genes than the worm,
    but more than the fly. This may seem surprising, but tRNA gene number in metazoans is
    thought to be related not to organismal complexity, but more to idiosyncrasies of the demand
    for tRNA abundance in certain tissues or stages of embryonic development. For example,
    the frog Xenopus laevis, which must load each oocyte with a remarkable 40 ng of tRNA,
    has thousands of tRNA genes [254].

    The degeneracy of the genetic code has allowed an inspired economy of tRNA anticodon
    usage. Although 61 sense codons need to be decoded, not all 61 different anticodons are
    present in tRNAs. Rather, tRNAs generally follow stereotyped and conserved wobble
    rules [255-257]. Wobble reduces the number of required anticodons substantially, and provides
    a connection between the genetic code and the hybridization stability of modified and
    unmodified RNA bases. In eukaryotes, the rules proposed by Guthrie and Abelson [256]
    predict that about 46 tRNA species will be sufficient to read the 61 sense codons (counting
    the initiator and elongator methionine tRNAs as two species). According to these rules, in
    the codon's third (wobble) position, U and C are generally decoded by a single tRNA
    species, whereas A and G are decoded by two separate tRNA species.
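
    One way to arrive at a figure of 46 is to apply the rule just stated to the standard
    genetic code: one tRNA species for each U/C pair, one for each sense codon ending in A
    and one for each sense codon ending in G, plus a separate initiator methionine tRNA.
    The sketch below is a reading of the rule as summarized here, not a reproduction of the
    cited analysis, and it does not count the selenocysteine tRNA.

```python
BASES = "UCAG"
# Standard genetic code, codons ordered U, C, A, G at each position; '*' is stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def count_trna_species(count_initiator_separately: bool = True) -> int:
    """Count tRNA species needed under the simplified wobble rule."""
    species = 0
    for b1 in BASES:
        for b2 in BASES:
            prefix = b1 + b2
            # Codons ending in U and C share a single tRNA species.
            if CODON_TABLE[prefix + "U"] != "*":
                species += 1
            # Codons ending in A and G each need their own species.
            for b3 in "AG":
                if CODON_TABLE[prefix + b3] != "*":
                    species += 1
    if count_initiator_separately:
        species += 1  # initiator Met counted apart from elongator Met
    return species
```

    The three stop codons (UAA, UAG and UGA) all end in A or G, so they remove three of the
    48 slots, leaving 45 elongator species and 46 in total.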

    In 'two-codon boxes' of the genetic code (where codons ending with U/C encode a
    different amino acid from those ending with A/G), the U/C wobble position should be
    decoded by a G at position 34 in the tRNA anticodon. Thus, in the top left of Fig. 34, there
    is no tRNA with an AAA anticodon for Phe, but the GAA anticodon can recognize both
    UUU and UUC codons in the mRNA. In 'four-codon boxes' of the genetic code (where U,
    C, A and G in the wobble position all encode the same amino acid), the U/C wobble position
    is almost always decoded by I34 (inosine) in the tRNA, where the inosine is produced by
    post-transcriptional modification of an adenine (A). In the bottom left of Fig. 34, for
    example, the GUU and GUC codons of the four-codon Val box are decoded by a tRNA
    with an anticodon of AAC, which is no doubt modified to IAC. Presumably this pattern,
    which is strikingly conserved in eukaryotes, has to do with the fact that IA base pairs are
    also possible; thus the IAC anticodon for a Val tRNA could recognize GUU, GUC and
    even GUA codons. Were this same I34 to be utilized in two-codon boxes, however,
    misreading of the NNA codon would occur, resulting in translational havoc. Eukaryotic
    glycine tRNAs represent a conserved exception to this last rule; they use a GCC anticodon
    to decode GGU and GGC, rather than the expected ICC anticodon.
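
    The rules for the U/C wobble position can be expressed as a small function that, given
    the first two bases of a codon box, returns the anticodon expected to read the codons
    ending in U and C. This is a sketch of the rules as described in this section only; the
    function name is hypothetical, and boxes that are neither strict two-codon nor
    four-codon boxes (such as the AUN box for Ile and Met) are deliberately left undecided.

```python
BASES = "UCAG"
# Standard genetic code, codons ordered U, C, A, G at each position; '*' is stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}
GLYCINE = "G"  # one-letter amino-acid code, not the base


def uc_anticodon(prefix: str):
    """Anticodon (5'->3', positions 34-36) expected to decode prefix+U and prefix+C.

    Returns None when the box is neither a two-codon nor a four-codon box.
    """
    aa_u, aa_c, aa_a, aa_g = (CODON_TABLE[prefix + b3] for b3 in BASES)
    if aa_u != aa_c or aa_u == "*":
        raise ValueError("U- and C-ending codons must encode the same amino acid")
    if aa_u == aa_a == aa_g:
        # Four-codon box: inosine at position 34, except for glycine.
        wobble = "G" if aa_u == GLYCINE else "I"
    elif aa_a != aa_u and aa_g != aa_u:
        # Two-codon box: G at position 34.
        wobble = "G"
    else:
        return None
    # Position 35 pairs with codon base 2, position 36 with codon base 1.
    return wobble + COMPLEMENT[prefix[1]] + COMPLEMENT[prefix[0]]
```

    For the examples in the text, the Phe box (UU) gives GAA, the Val box (GU) gives IAC and
    the Gly box (GG) gives GCC.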

    Satisfyingly, the human tRNA set follows these wobble rules almost perfectly (Fig. 34).
    Only three unexpected tRNA species are found: single genes for a tRNATyr-AUA,
    tRNAIle-GAU, and tRNAAsn-AUU. Perhaps these are pseudogenes, but they appear to
    be plausible tRNAs. We also checked the possibility of sequencing errors in their
    anticodons, but each of these three genes is in a region of high sequence accuracy, with
    PHRAP quality scores higher than 70 for every base in their anticodons.

    As in all other organisms, human protein-coding genes show codon bias—preferential use of
    one synonymous codon over another [258] (Fig. 34). In less complex organisms, such as yeast
    or bacteria, highly expressed genes show the strongest codon bias. Cytoplasmic abundance
    of tRNA species is correlated with both codon bias and overall amino-acid frequency (for
    example, tRNAs for preferred codons and for more common amino acids are more
    abundant). This is presumably driven by selective pressure for efficient or accurate
    translation [259]. In many organisms, tRNA abundance in turn appears to be roughly
    correlated with tRNA gene copy number, so tRNA gene copy number has been used as a
    proxy for tRNA abundance [260]. In vertebrates, however, codon bias is not so obviously
    correlated with gene expression level. Differences in codon bias between human genes are
    more a function of their location in regions of different GC composition [261]. In agreement with the
    literature, we see only a very rough correlation of human tRNA gene number with either
    amino-acid frequency or codon bias (Fig. 34). The most obvious outliers in these weak
    correlations are the strongly preferred CUG leucine codon, with a mere six tRNALeu-CAG
    genes producing a tRNA to decode it, and the relatively rare cysteine UGU and UGC
    codons, with 30 tRNA genes to decode them.

    The tRNA genes are dispersed throughout the human genome. However, this dispersal is
    nonrandom. tRNA genes have sometimes been seen in clusters at small scales [262, 263] but
    we can now see striking clustering on a genome-wide scale. More than 25% of the tRNA
    genes (140) are found in a region of only about 4 Mb on chromosome 6. This small region,
    only about 0.1% of the genome, contains an almost sufficient set of tRNA genes all by
    itself. The 140 tRNA genes contain a representative for 36 of the 49 anticodons found in
    the complete set; and of the 21 isoacceptor types, only tRNAs to decode Asn, Cys, Glu and
    selenocysteine are missing. Many of these tRNA genes, meanwhile, are clustered
    elsewhere; 18 of the 30 Cys tRNAs are found in a 0.5-Mb stretch of chromosome 7 and
    many of the Asn and Glu tRNA genes are loosely clustered on chromosome 1. More than
    half of the tRNA genes (280 out of 497) reside on either chromosome 1 or chromosome 6.
    Chromosomes 3, 4, 8, 9, 10, 12, 18, 20, 21 and X appear to have fewer than 10 tRNA genes
    each; and chromosomes 22 and Y have none at all (each has a single pseudogene).
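
    The degree of clustering can be put in rough numerical terms. The counts below come from
    the text; the genome size of 3,200 Mb is an assumption introduced here for illustration
    and is not stated in this section, so the enrichment figure is indicative only.

```python
TOTAL_TRNA_GENES = 497
CHR6_CLUSTER_GENES = 140      # tRNA genes in the ~4-Mb region on chromosome 6
CHR6_CLUSTER_MB = 4.0
CHR1_OR_CHR6_GENES = 280      # tRNA genes on chromosome 1 or chromosome 6
ASSUMED_GENOME_MB = 3200.0    # assumption: approximate haploid genome size

# Share of all tRNA genes that lie in the chromosome 6 cluster (roughly 0.28).
cluster_gene_share = CHR6_CLUSTER_GENES / TOTAL_TRNA_GENES

# Share of the genome occupied by that region (roughly 0.1%).
cluster_genome_share = CHR6_CLUSTER_MB / ASSUMED_GENOME_MB

# Fold enrichment of tRNA genes in the region over a uniform distribution.
fold_enrichment = cluster_gene_share / cluster_genome_share

# Share of tRNA genes on chromosome 1 or 6 (more than half).
chr1_chr6_share = CHR1_OR_CHR6_GENES / TOTAL_TRNA_GENES
```

    Under these assumptions the region carries tRNA genes at a density some two hundred
    times the genome-wide average.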

    Ribosomal RNA genes.

    The ribosome, the protein synthetic machine of the cell, is made
    up of two subunits and contains four rRNA species and many proteins. The large ribosomal
    subunit contains 28S and 5.8S rRNAs (collectively called 'large subunit' (LSU) rRNA) and
    also a 5S rRNA. The small ribosomal subunit contains 18S rRNA ('small subunit' (SSU)
    rRNA). The genes for LSU and SSU rRNA occur in the human genome as a 44-kb tandem
    repeat unit [264]. There are thought to be about 150–200 copies of this repeat unit arrayed on
    the short arms of acrocentric chromosomes 13, 14, 15, 21 and 22 (refs 254, 264). There are
    no true complete copies of the rDNA tandem repeats in the draft genome sequence, owing
    to the deliberate bias in the initial phase of the sequencing effort against sequencing BAC
    clones whose restriction fragment fingerprints showed them to contain primarily tandemly
    repeated sequence. Sequence similarity analysis with the BLASTN computer program
    does, however, detect hundreds of rDNA-derived sequence fragments dispersed throughout
    the complete genome, including one 'full-length' copy of an individual 5.8S rRNA gene not
    associated with a true tandem repeat unit (Table 20).

    The 5S rDNA genes also occur in tandem arrays, the largest of which is on chromosome 1
    between 1q41.11 and 1q42.13, close to the telomere [265, 266]. There are 200–300 true 5S
    genes in these arrays [265, 267]. The number of 5S-related sequences in the genome, including
    numerous dispersed pseudogenes, is classically cited as 2,000 (refs 252, 254). The long
    tandem array on chromosome 1 is not yet present in the draft genome sequence because
    there are no EcoRI or HindIII sites present, and thus it was not cloned in the most heavily
    utilized BAC libraries (Table 1). We expect to recover it during the finishing stage. We do
    detect four individual copies of 5S rDNA by our search criteria (> 95% identity and > 95%
    full length). We also find many more distantly related dispersed sequences (520 at
    P ≤ 0.001), which we interpret as probable pseudogenes (Table 20).

    Small nucleolar RNA genes.

    Eukaryotic rRNA is extensively processed and modified in
    the nucleolus. Much of this activity is directed by numerous snoRNAs. These come in two
    families: C/D box snoRNAs (mostly involved in guiding site-specific 2'-O-ribose
    methylations of other RNAs) and H/ACA snoRNAs (mostly involved in guiding
    site-specific pseudouridylations) [247, 248]. We compiled a set of 97 known human snoRNA
    gene sequences; 84 of these (87%) have at least one copy in the draft genome sequence
    (Table 20), almost all as single-copy genes.

    It is thought that all 2'-O-ribose methylations and pseudouridylations in eukaryotic rRNA are
    guided by snoRNAs. There are 105–107 methylations and around 95 pseudouridylations in
    human rRNA [268]. Only about half of these have been tentatively assigned to known guide
    snoRNAs. There are also snoRNA-directed modifications on other stable RNAs, such as
    U6 (ref. 269), and the extent of this is just beginning to be explored. Sequence similarity has
    so far proven insufficient to recognize all snoRNA genes. We therefore expect that there
    are many unrecognized snoRNA genes that are not detected by BLAST queries.

    Spliceosomal RNAs and other ncRNA genes.

    We also looked for copies of other known ncRNA genes. We found at least one copy of 21
    (95%) of 22 known ncRNAs, including the spliceosomal snRNAs. There were multiple copies
    for several ncRNAs, as expected; for example, we find 44 dispersed genes for U6 snRNA,
    and 16 for U1 snRNA (Table 20).

    For some of these RNA genes, homogeneous multigene families that occur in tandem
    arrays are again under-represented owing to the restriction enzymes used in constructing
    the BAC libraries and, in some instances, the decision to delay the sequencing of BAC
    clones with low complexity fingerprints indicative of tandemly repeated DNA. The U2
    RNA genes are located at the RNU2 locus, a tandem array of 10–20 copies of nearly
    identical 6.1-kb units at 17q21–q22 (refs 270, 271, 272). Similarly, the U3 snoRNA genes
    (included in the aggregate count of C/D snoRNAs in Table 20) are clustered at the RNU3
    locus at 17p11.2, not in a tandem array, but in a complex inverted repeat structure of about
    5–10 copies per haploid genome [273]. The U1 RNA genes are clustered with about 30 copies
    at the RNU1 locus at 1p36.1, but this cluster is thought to be loose and irregularly organized;
    no two U1 genes have been cloned on the same cosmid [271]. In the draft genome sequence,
    we see six copies of U2 RNA that meet our criteria for true genes, three of which appear
    to be in the expected position on chromosome 17. For U3, so far we see one true copy at
    the correct place on chromosome 17p11.2. For U1, we see 16 true genes, 6 of which are
    loosely clustered within 0.6 Mb at 1p36.1 and another 6 are elsewhere on chromosome 1.
    Again, these and other clusters will be a matter for the finishing process.

    Our observations also confirm the striking proliferation of ncRNA-derived pseudogenes
    (Table 20). There are hundreds or thousands of sequences in the draft genome sequence
    related to some of the ncRNA genes. The most prolific pseudogene counts generally come
    from RNA genes transcribed from RNA polymerase III promoters, including U6, the hY
    RNAs and SRP-RNA. These ncRNA pseudogenes presumably arise through reverse
    transcription. The frequency of such events gives insight into how ncRNA genes can evolve
    into SINE retroposons, such as the tRNA-derived SINEs found in many vertebrates and the
    SRP-RNA-derived Alu elements found in humans.

    Protein-coding genes

    Identifying the protein-coding genes in the human genome is one of the most important
    applications of the sequence data, but also one of the most difficult challenges. We describe
    below our efforts to create an initial human gene and protein index.

    Exploring properties of known genes.

    Before attempting to identify new genes, we explored what could be learned by aligning the
    cDNA sequences of known genes to the draft genome sequence. Genomic alignments allow one
    to study exon–intron structure and local GC content, and are valuable for biomedical
    studies because they connect genes with the genetic and cytogenetic map, link them with
    regulatory sequences and facilitate the development of polymerase chain reaction (PCR)
    primers to amplify exons. Until now, genomic alignment was available for only about a
    quarter of known genes.

    The 'known' genes studied were those in the RefSeq database [110], a manually curated
    collection designed to contain nonredundant representatives of most full-length human
    mRNA sequences in GenBank (RefSeq intentionally contains some alternative splice forms
    of the same genes). The version of RefSeq used contained 10,272 mRNAs.

    The RefSeq genes were aligned with the draft genome sequence, using both the Spidey (S.
    Wheelan, personal communication) and Acembly (D. Thierry-Mieg and J. Thierry-Mieg,
    unpublished; http://www.acedb.org) computer programs. Because this sequence is
    incomplete and contains errors, not all genes could be fully aligned and some may have been
    incorrectly aligned. More than 92% of the RefSeq entries could be aligned at high
    stringency over at least part of their length, and 85% could be aligned over more than half
    of their length. Some genes (16%) had high stringency alignments to more than one location
    in the draft genome sequence owing, for example, to paralogues or pseudogenes. In such
    cases, we considered only the best match. In a few of these cases, the assignment may not
    be correct because the true matching region has not yet been sequenced. Three per cent of
    entries appeared to be alternative splice products of the same gene, on the basis of their
    alignment to the same location in the draft genome sequence. In all, we obtained at least
    partial genomic alignments for 9,212 distinct known genes and essentially complete
    alignment for 5,364 of them.

    Previous efforts to study human gene structure [116, 274, 275] have been hampered by limited
    sample sizes and strong biases in favour of compact genes. Table 21 gives the mean and
    median values of some basic characteristics of gene structures. Some of the values may be
    underestimates. In particular, the UTRs given in the RefSeq database are likely to be
    incomplete; they are considerably shorter, for example, than those derived from careful
    reconstructions on chromosome 22. Intron sizes were measured only for genes in finished
    genomic sequence, to mitigate the bias arising from the fact that long introns are more likely
    than short introns to be interrupted by gaps in the draft genome sequence. Nonetheless,
    there may be some residual bias against long genes and long introns.

    There is considerable variation in overall gene size and intron size, with both distributions
    having very long tails. Many genes are over 100 kb long, the largest known example being
    the dystrophin gene (DMD) at 2.4 Mb. The variation in the size distribution of coding
    sequences and exons is less extreme, although there are still some remarkable outliers. The
    titin gene [276] has the longest currently known coding sequence at 80,780 bp; it also has the
    largest number of exons (178) and longest single exon (17,106 bp).

    It is instructive to compare the properties of human genes with those from worm and fly.
    For all three organisms, the typical length of a coding sequence is similar (1,311 bp for
    worm, 1,497 bp for fly and 1,340 bp for human), and most internal exons fall within a
    common peak between 50 and 200 bp (Fig. 35a). However, the worm and fly exon
    distributions have a fatter tail, resulting in a larger mean size for internal exons (218 bp for
    worm versus 145 bp for human). The conservation of preferred exon size across all three
    species supports suggestions of a conserved exon-based component of the splicing
    machinery [277]. Intriguingly, the few extremely short human exons show an unusual base
    composition. In 42 detected human exons of less than 19 bp, the nucleotide frequencies of
    A, G, T and C are 39, 33, 15 and 12%, respectively, showing a strong purine bias.
    Purine-rich sequences may enhance splicing [278, 279], and it is possible that such sequences
    are required or strongly selected for to ensure correct splicing of very short exons. Previous
    studies have shown that short exons require intronic, but not exonic, splicing enhancers [280].
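
    The composition figures quoted above are pooled base frequencies. As a minimal sketch (the
    function names and the toy exons are hypothetical and are not taken from the analysis
    pipeline), the purine fraction of a set of short exons could be tallied as follows:

```python
from collections import Counter

PURINES = {"A", "G"}


def base_composition(exons):
    """Return the fraction of A, C, G and T pooled over all sequences in `exons`."""
    counts = Counter()
    for seq in exons:
        # Ignore anything that is not an unambiguous base (e.g. N).
        counts.update(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no A/C/G/T bases found")
    return {base: counts[base] / total for base in "ACGT"}


def purine_fraction(composition):
    """Sum the fractions of the purines (A and G) in a composition mapping."""
    return sum(frac for base, frac in composition.items() if base in PURINES)


# Hypothetical short exons, invented purely for illustration.
toy_exons = ["GAAGAGA", "AGGAAT", "GAAC"]
toy_comp = base_composition(toy_exons)
print(toy_comp, purine_fraction(toy_comp))

# Percentages reported in the text for exons shorter than 19 bp.
reported = {"A": 0.39, "G": 0.33, "T": 0.15, "C": 0.12}
print(round(purine_fraction(reported), 2))
```

    Applied to the reported percentages, A and G together account for about 72% of bases.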

    In contrast to the exons, the intron size distributions differ substantially among the three
    species (Fig. 35b, c). The worm and fly each have a reasonably tight distribution, with most
    introns near the preferred minimum intron length (47 bp for worm, 59 bp for fly) and an
    extended tail (overall average length of 267 bp for worm and 487 bp for fly). Intron size is
    much more variable in humans, with a peak at 87 bp but a very long tail resulting in a mean
    of more than 3,300 bp. The variation in intron size results in great variation in gene size.

    The variation in gene size and intron size can partly be explained by the fact that GC-rich
    regions tend to be gene-dense with many compact genes, whereas AT-rich regions tend to
    be gene-poor with many sprawling genes containing large introns. The correlation of gene
    density with GC content is shown in Fig. 36a, b; the relative density increases more than
    tenfold as GC content increases from 30% to 50%. The correlation appears to be due
    primarily to intron size, which drops markedly with increasing GC content (Fig. 36c). In
    contrast, coding properties such as exon length (Fig. 36c) or exon number (data not shown)
    vary little. Intergenic distance is also probably lower in high-GC areas, although this is hard
    to prove directly until all genes have been identified.

    The large number of confirmed human introns allows us to analyse variant splice sites,
    confirming and extending recent reports [281]. Intron positions were confirmed by applying a
    stringent criterion that EST or mRNA sequence show an exact match of 8 bp in the
    flanking exonic sequence on each side. Of 53,295 confirmed introns, 98.12% use the
    canonical dinucleotides GT at the 5' splice site and AG at the 3' site (GT–AG pattern).
    Another 0.76% use the related GC–AG. About 0.10% use AT–AC, which is a rare
    alternative pattern primarily recognized by the variant U12 splicing machinery [282]. The
    remaining 1% belong to 177 types, some of which undoubtedly reflect sequencing or
    alignment errors.
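
    The census of splice-site patterns amounts to reading the first and last two bases of each
    confirmed intron. The sketch below illustrates that bookkeeping only; the function names
    and the four toy introns are hypothetical, and the EST/mRNA confirmation step described
    above is assumed to have been done already.

```python
from collections import Counter


def classify_intron(intron_seq):
    """Return the donor-acceptor dinucleotide pattern of an intron, e.g. 'GT-AG'."""
    seq = intron_seq.upper()
    if len(seq) < 4:
        raise ValueError("intron too short to carry both splice-site dinucleotides")
    return f"{seq[:2]}-{seq[-2:]}"


def splice_site_census(introns):
    """Return the fraction of introns using each dinucleotide pattern."""
    counts = Counter(classify_intron(seq) for seq in introns)
    total = sum(counts.values())
    return {pattern: n / total for pattern, n in counts.most_common()}


# Hypothetical intron sequences (5' to 3'), invented for illustration only.
toy_introns = [
    "GTAAGTTTTCAG",  # GT-AG, canonical
    "GTGAGTCCCTAG",  # GT-AG, canonical
    "GCAAGTTTTTAG",  # GC-AG, related variant
    "ATATCCTTTTAC",  # AT-AC, mainly U12-type
]
census = splice_site_census(toy_introns)
for pattern, fraction in census.items():
    print(pattern, f"{fraction:.2%}")
```

    On real data, rare patterns produced this way would need to be checked against alignment
    quality before being treated as genuine variant splice sites.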

    Finally, we looked at alternative splicing of human genes. Alternative splicing can allow
    many proteins to be produced from a single gene and can be used for complex gene
    regulation. It appears to be prevalent in humans, with lower estimates of about 35% of
    human genes being subject to alternative splicing [283-285]. These studies may have
    underestimated the prevalence of alternative splicing, because they examined only EST
    alignments covering only a portion of a gene.

    To investigate the prevalence of alternative splicing, we analysed reconstructed mRNA
    transcripts covering the entire coding regions of genes on chromosome 22 (omitting small
    genes with coding regions of less than 240 bp). Potential transcripts identified by alignments
    of ESTs and cDNAs to genomic sequence were verified by human inspection. We found
    642 transcripts, covering 245 genes (average of 2.6 distinct transcripts per gene). Two or
    more alternatively spliced transcripts were found for 145 (59%) of these genes. A similar
    analysis for the gene-rich chromosome 19 gave 1,859 transcripts, corresponding to 544
    genes (average 3.2 distinct transcripts per gene). Because we are sampling only a subset of
    all transcripts, the true extent of alternative splicing is likely to be greater. These figures are
    considerably higher than those for worm, in which analysis reveals alternative splicing for
    22% of genes for which ESTs have been found, with an average of 1.34 (12,816/9,516)
    splice variants per gene. (The apparently higher extent of alternative splicing seen in human
    than in worm was not an artefact resulting from much deeper coverage of human genes by
    ESTs and mRNAs. Although there are many times more ESTs available for human than
    worm, these ESTs tend to have shorter average length (because many were the product of
    early sequencing efforts) and many match no human genes. We calculated the actual
    coverage per bp used in the analysis of the human and worm genes; the coverage is only
    modestly higher (about 50%) for the human, with a strong bias towards 3' UTRs which tend
    to show much less alternative splicing. We also repeated the analysis using equal coverage
    for the two organisms and confirmed that higher levels of alternative splicing were still seen
    in human.)

    Seventy per cent of alternative splice forms found in the genes on chromosomes 19 and 22
    affect the coding sequence, rather than merely changing the 3' or 5' UTR. (This estimate
    may be affected by the incomplete representation of UTRs in the RefSeq database and in
    the transcripts studied.) Alternative splicing of the terminal exon was seen for 20% of 6,105
    mRNAs that were aligned to the draft genome sequence and correspond to confirmed 3'
    EST clusters. In addition to alternative splicing, we found evidence of the terminal exon
    employing alternative polyadenylation sites (separated by > 100 bp) in 24% of cases.

    ...

    Segmental history of the human genome

    In bacteria, genomic segments often convey important information about function: genes
    located close to one another often encode proteins in a common pathway and are regulated
    in a common operon. In mammals, genes found close to each other only rarely have
    common functions, but they are still interesting because they have a common history. In
    fact, the study of genomic segments can shed light on biological events as long as 500 Myr
    ago and as recently as 20,000 years ago.

    Conserved segments between human and mouse

    Humans and mice shared a common ancestor about 100 Myr ago. Despite the 200 Myr of
    evolutionary distance between the species, a significant fraction of genes show synteny
    between the two, being preserved within conserved segments. Genes tightly linked in one
    mammalian species tend to be linked in others. In fact, conserved segments have been
    observed in even more distant species: humans show conserved segments with fish [350, 351]
    and even with invertebrates such as fly and worm [352]. In general, the likelihood that a
    syntenic relationship will be disrupted correlates with the physical distance between the loci
    and the evolutionary distance between the species.

    Studying conserved segments between human and mouse has several uses. First,
    conservation of gene order has been used to identify likely orthologues between the species,
    particularly when investigating disease phenotypes. Second, the study of conserved
    segments among genomes helps us to deduce evolutionary ancestry. And third, detailed
    comparative maps may assist in the assembly of the mouse sequence, using the human
    sequence as a scaffold.

    Two types of linkage conservation are commonly described [353]. 'Conserved synteny'
    indicates that at least two genes that reside on a common chromosome in one species are
    also located on a common chromosome in the other species. Syntenic loci are said to lie in a
    'conserved segment' when not only the chromosomal position but the linear order of the loci
    has been preserved, without interruption by other chromosomal rearrangements.

    An initial survey of homologous loci in human and mouse [354] suggested that the total number
    of conserved segments would be about 180. Subsequent estimates based on increasingly
    detailed comparative maps have remained close to this projection [353, 355, 356]
    (http://www.informatics.jax.org). The distribution of segment lengths has corresponded
    reasonably well to the truncated negative exponential curve predicted by the random
    breakage model [357].

    The availability of a draft human genome sequence allows the first global human–mouse
    comparison in which human physical distances can be measured in Mb, rather than cM or
    orthologous gene counts. We identified likely orthologues by reciprocal comparison of the
    human and mouse mRNAs in the LocusLink database, using megaBLAST. For each
    orthologous pair, we mapped the location of the human gene in the draft genome sequence
    and then checked the location of the mouse gene in the Mouse Genome Informatics
    database (http://www.informatics.jax.org). Using a conservative threshold, we identified
    3,920 orthologous pairs in which the human gene could be mapped on the draft genome
    sequence with high confidence. Of these, 2,998 corresponding mouse genes had a known
    position in the mouse genome. We then searched for definitive conserved segments, defined
    as human regions containing orthologues of at least two genes from the same mouse
    chromosome region (< 15 cM) without interruption by segments from other chromosomes.
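
    The segment definition above can be sketched as a single pass along each human chromosome.
    This is an illustrative reading, not the consortium's code: the record type, the function
    names and the toy loci are hypothetical, the 15 cM limit is applied here to the gap between
    adjacent loci, and any locus from another mouse chromosome (including a singleton) is
    assumed to end the current run.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Ortho:
    gene: str
    human_chr: str
    human_mb: float   # position in the human draft sequence, in Mb
    mouse_chr: str
    mouse_cm: float   # position on the mouse genetic map, in cM


def conserved_runs(pairs, max_gap_cm=15.0):
    """Split orthologues, ordered along each human chromosome, into runs
    that stay on one mouse chromosome with adjacent loci < max_gap_cm apart."""
    runs = []
    ordered = sorted(pairs, key=lambda p: (p.human_chr, p.human_mb))
    for _, group in groupby(ordered, key=lambda p: p.human_chr):
        current = []
        for p in group:
            if current and (
                p.mouse_chr != current[-1].mouse_chr
                or abs(p.mouse_cm - current[-1].mouse_cm) >= max_gap_cm
            ):
                runs.append(current)
                current = []
            current.append(p)
        if current:
            runs.append(current)
    return runs


def split_definitive(runs):
    """Separate runs of two or more loci from single-locus 'singletons'."""
    definitive = [r for r in runs if len(r) >= 2]
    singletons = [r for r in runs if len(r) == 1]
    return definitive, singletons


# Hypothetical loci, invented for illustration only.
toy = [
    Ortho("g1", "20", 1.0, "2", 70.0),
    Ortho("g2", "20", 5.0, "2", 74.0),
    Ortho("g3", "20", 9.0, "13", 10.0),
    Ortho("g4", "20", 12.0, "2", 80.0),
    Ortho("g5", "20", 20.0, "2", 85.0),
]
definitive, singletons = split_definitive(conserved_runs(toy))
lengths_mb = [r[-1].human_mb - r[0].human_mb for r in definitive]
print(len(definitive), len(singletons), lengths_mb)
```

    Treating every interrupting singleton as a break is the conservative choice; it splits
    segments that a more permissive rule would merge.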

    We identified 183 definitive conserved segments (Fig. 46). The average segment length was
    15.4 Mb, with the largest segment being 90.5 Mb and the smallest 24 kb. There were also
    141 'singletons', segments that contained only a single locus; these are not counted in the
    statistics. Although some of these could be short conserved segments, they could also
    reflect incorrect choices of orthologues or problems with the human or mouse maps.
    Because of this conservative approach, the observed number of definitive segments is likely
    to be lower than the correct total. One piece of evidence for this conclusion comes from a
    more detailed analysis on human chromosome 7 (ref. 358), which identified 20 conserved
    segments, of which three were singletons. Our analysis revealed only 13 definitive segments
    on this chromosome, with nine singletons.

    The frequency of observing a particular gene count in a conserved segment is plotted on a
    logarithmic scale in Fig. 47. If chromosomal breaks occur in a random fashion (as has been
    proposed) and differences in gene density are ignored, a roughly straight line should result.
    There is a clear excess for n = 1, suggesting that 50% or more of the singletons are indeed
    artefactual. Thus, we estimate that the true number of conserved segments is around 190–230,
    in good agreement with the original Nadeau–Taylor prediction [354].

    Figure 48 shows a plot of the frequency of lengths of conserved segments, where the x-axis
    scale is shown in Mb. As before, there is a fair amount of scatter in the data for the larger
    segments (where the numbers are small), but the trend appears to be consistent with a
    random breakage model.

    We attempted to ascertain whether the breakpoint regions have any special characteristics.
    This analysis was complicated by imprecision in the positioning of these breaks, which will
    tend to blur any relationships. With 2,998 orthologues, the average interval within which a
    break is known to have occurred is about 1.1 Mb. We compared the aggregate features of
    these breakpoint intervals with the genome as a whole. The mean gene density was lower
    in breakpoint regions than in the conserved segments (13.8 versus 18.6 per Mb). This
    suggests that breakpoints may be more likely to occur or to undergo fixation in gene-poor
    intervals than in gene-rich intervals. The occurrence of breakpoints may be promoted by
    homologous recombination among repeated sequences [359]. When the sequence of the
    mouse genome is finished, this analysis can be revisited more precisely.

    A number of examples of extended conserved segments and syntenies are apparent in Fig.
    46. As has been noted, almost all human genes on chromosome 17 are found on mouse
    chromosome 11, with two members of the placental lactogen family from mouse 13
    inserted. Apart from two singleton loci, human chromosome 20 appears to be entirely
    orthologous to mouse chromosome 2, apparently in a single segment. The largest apparently
    contiguous conserved segment in the human genome is on chromosome 4, including roughly
    90.5 Mb of human DNA that is orthologous to mouse chromosome 5. This analysis also
    allows us to infer the likely location of thousands of mouse genes for which the human
    orthologue has been located in the draft genome sequence but the mouse locus has not yet
    been mapped.

    With about 200 conserved segments between mouse and human and about 100 Myr of
    evolution from their common ancestor [360], we obtain an estimated rate of about 1.0
    chromosomal rearrangement being fixed per Myr. However, there is good evidence that the
    rate of chromosomal rearrangement (like the rate of nucleotide substitutions; see above)
    differs between the two species. Among mammals, rodents may show unusually rapid
    chromosome alteration. By comparison, very few rearrangements have been observed
    among primates, and studies of a broader array of mammalian orders, including cats, cows,
    sheep and pigs, suggest an average rate of chromosome alteration of only about 0.2
    rearrangements per Myr in these lineages [361]. Additional evidence that rodents are outliers
    comes from a recent analysis of synteny between the human and zebrafish genomes. From
    a study of 523 orthologues, it was possible to project 418 conserved segments [350]. Assuming
    400 Myr since a common vertebrate ancestor of zebrafish and humans [362], we obtain an
    estimate of 0.52 rearrangements per Myr. Recent estimates of rearrangement rates in
    plants have suggested bimodality, with some pairs showing rates of 0.15–0.41
    rearrangements per Myr, and others showing higher rates of 1.1–1.3 rearrangements per
    Myr [363]. With additional detailed genome maps of multiple species, it should be possible to
    determine whether this particular molecular clock is truly operating at a different rate in
    various branches of the evolutionary tree, and whether variations in that rate are bimodal or
    continuous. It should also be possible to reconstruct the karyotypes of common ancestors.
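
    The rate estimates quoted above are reproduced if the number of conserved segments is
    divided by the total evolutionary time separating the two species, that is, twice the time
    since their common ancestor. The formula is inferred from the reported figures rather than
    stated in the text, and the function name is hypothetical.

```python
def rearrangement_rate(conserved_segments, myr_since_ancestor):
    """Approximate rearrangements fixed per Myr, assuming (as inferred from the
    reported numbers) rate = segments / (2 * time since the common ancestor)."""
    if myr_since_ancestor <= 0:
        raise ValueError("time since common ancestor must be positive")
    return conserved_segments / (2 * myr_since_ancestor)


# Human-mouse: about 200 segments, about 100 Myr since the common ancestor.
human_mouse = rearrangement_rate(200, 100)
# Human-zebrafish: 418 projected segments, 400 Myr assumed.
human_zebrafish = rearrangement_rate(418, 400)

print(round(human_mouse, 2), round(human_zebrafish, 2))
```

    This treats each conserved segment as roughly one fixed rearrangement and averages over
    both lineages, so it cannot by itself show which lineage rearranged faster.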

    Ancient duplicated segments in the human genome

    Another approach to genomic history is to study segmental duplications within the human
    genome. Earlier, we discussed examples of recent duplications of genomic segments to
    pericentromeric and subtelomeric regions. Most of these events appear to be evolutionary
    dead-ends resulting in nonfunctional pseudogenes; however, segmental duplication is also an
    important mode of evolutionary innovation: a duplication permits one copy of each gene to
    drift and potentially to acquire a new function.

    Segmental duplications can occur through unequal crossing over to create gene families in
    specific chromosomal regions. This mechanism can create both small families, such as the
    five related genes of the β-globin cluster on chromosome 11, and large ones, such as the
    olfactory receptor gene clusters, which together contain nearly 1,000 genes and
    pseudogenes.

    The most extreme mechanism is whole-genome duplication (WGD), through a
    polyploidization event in which a diploid organism becomes tetraploid. Such events are
    classified as autopolyploidy or allopolyploidy, depending on whether they involve
    hybridization between members of the same species or different species. Polyploidization is
    common in the plant kingdom, with many known examples among wild and domesticated
    crop species. Alfalfa (Medicago sativa) is a naturally occurring autotetraploid [364], and
    Nicotiana tabacum, some species of cotton (Gossypium) and several of the common
    brassicas are allotetraploids containing pairs of 'homeologous' chromosome pairs.

    In principle, WGD provides the raw material for great bursts of innovation by allowing the
    duplication and divergence of entire pathways. Ohno [365] suggested that WGD has played a
    key role in evolution. There is evidence for an ancient WGD event in the ancestry of yeast
    and several independent such events in the ancestry of mustard weed [366-369]. Such ancient
    WGD events can be hard to detect because only a minority of the duplicated loci may be
    retained, with the result that the genes in duplicated segments cannot be aligned in a
    one-to-one correspondence but rather require many gaps. In addition, duplicated segments
    may be subsequently rearranged. For example, the ancient duplication in the yeast genome
    appears to have been followed by loss of more than 90% of the newly duplicated genes [366].

    One of the most controversial hypotheses about vertebrate evolution is the proposal that two
    WGD events occurred early in the vertebrate lineage, around the time of jawed fishes some
    500 Myr ago. Some authors [370-372] have seen support for this theory in the fact that
    many human genes occur in sets of four homologues—most notably the four extensive
    HOX gene clusters on chromosomes 2, 7, 12 and 17, whose duplication dates to around the
    correct time. However, other authors have disputed this interpretation [373], suggesting that
    these cases may reflect unrelated duplications of specific regions rather than successive
    WGD.

    We analysed the draft genome sequence for evidence that might bear on this question. The
    analysis provides many interesting observations, but no convincing evidence of ancient
    WGD. We looked for evidence of pairs of chromosomal regions containing many
    homologous genes. Although we found many pairs containing a few homologous genes, the
    human genome does not appear to contain any pairs of regions where the density of
    duplicated genes approaches the densities seen in yeast or mustard weed [366-369].

    We also examined human proteins in the IPI for which the orthologues among fly or worm
    proteins occur in the ratios 2:1:1, 3:1:1, 4:1:1 and so on (Fig. 49). The number of such
    families falls smoothly, with no peak at four and some instances of five or more
    homologues. Although this does not rule out two rounds of WGD followed by extensive
    gene loss and some unrelated gene duplication, it provides no support for the theory. More
    probatively, if two successive rounds of genome duplication occurred, phylogenetic analysis
    of the proteins having 4:1:1 ratios between human, fly and worm would be expected to show
    more trees with the topology (A,B)(C,D) for the human sequences than (A,(B,(C,D))) [374].
    However, of 57 sets studied carefully, only 24% of the trees constructed from the 4:1:1 set
    have the former topology; this is not significantly different from what would be expected
    under the hypothesis of random sequential duplication of individual loci.
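
    To see why the topology test is informative, it helps to work out how often the symmetric
    tree arises when four copies are produced by three independent duplications. The sketch
    below enumerates every duplication history under one simple assumption that is not taken
    from the text: at each step, each existing copy is equally likely to be the one duplicated.
    All names are hypothetical.

```python
from fractions import Fraction


def duplicate(tree, target, new_leaf):
    """Replace leaf `target` in a nested-tuple tree with the pair (target, new_leaf)."""
    if tree == target:
        return (tree, new_leaf)
    if isinstance(tree, tuple):
        return tuple(duplicate(child, target, new_leaf) for child in tree)
    return tree


def leaves(tree):
    """List the leaf labels of a nested-tuple tree."""
    if isinstance(tree, tuple):
        return [leaf for child in tree for leaf in leaves(child)]
    return [tree]


def is_balanced(tree):
    """True for the (A,B)(C,D) shape: both children of the root are pairs."""
    left, right = tree
    return isinstance(left, tuple) and isinstance(right, tuple)


def balanced_probability():
    """Exact probability of the symmetric four-leaf topology under
    sequential duplication of a uniformly chosen existing copy."""
    trees = [("A", Fraction(1))]
    for new_leaf in "BCD":
        next_trees = []
        for tree, prob in trees:
            current = leaves(tree)
            for target in current:
                next_trees.append(
                    (duplicate(tree, target, new_leaf), prob / len(current))
                )
        trees = next_trees
    return sum(prob for tree, prob in trees if is_balanced(tree))


p_balanced = balanced_probability()
print(p_balanced)
```

    Under this simple model one tree in three is symmetric, whereas two clean rounds of
    genome duplication with no gene loss would make every tree symmetric. The observed 24%
    is therefore far closer to the sequential-duplication expectation.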

    We also searched for sets of four chromosomes where there are multiple genes with
    homologues on each of the four. The strongest example was chromosomes 2, 7, 12 and 17,
    containing the HOX clusters as well as additional genes. These four chromosomes appear
    to have an excess of quadruplicated genes. The genes are not all clustered in a single
    region; this may reflect intrachromosomal rearrangement since the duplication of these
    genes, or it may indicate that they result from several independent events. Of the genes
    with homologues on chromosomes 2, 12 and 17, many of those missing on chromosome 7
    are clustered on chromosome 3, suggesting a translocation. Several additional examples of
    groups of four chromosomes were found, although they were connected by fewer
    homologous genes.

    Although the analyses are sensitive to the imperfect quality of the gene predictions, our
    results so far are insufficient to settle whether two rounds of WGD occurred around 500
    Myr ago. It may be possible to resolve the issue by systematically estimating the time of
    each of the many gene duplication events on the basis of sequence divergence, although this
    is beyond the scope of this report. Another approach to determining whether a widespread
    duplication occurred at a particular time in vertebrate evolution would be to sequence the
    genomes of organisms whose lineages diverged from vertebrates at appropriate times, such
    as amphioxus.

    Recent history from human polymorphism

    The recent history of genomic segments can be probed by studying the properties of SNPs
    segregating in the current human population. The sequence information generated in the
    course of this project has yielded a huge collection of SNPs. These SNPs were extracted in
    two ways: by comparing overlapping large-insert clones derived from distinct haplotypes
    (either different individuals or different chromosomes within an individual) and by comparing
    random reads from whole-genome shotgun libraries derived from multiple individuals. The
    analysis confirms an average heterozygosity rate in the human population of about 1 in
    1,300 bp (ref. 97).

    More than 1.42 million SNPs have been assembled into a genome-wide map and are
    analysed in detail in an accompanying paper [97]. SNP density is also displayed across the
    genome in Fig. 9. The SNPs have an average spacing of 1.9 kb and 63% of 5-kb intervals
    contain a SNP. These polymorphisms are of immediate utility for medical genetic studies.
    Whereas investigators studying a gene previously had to expend considerable effort to
    discover polymorphisms across the region of interest, the current collection now provides
    them with about 15 SNPs for gene loci of average size.

    The density of SNPs (adjusted for ascertainment—that is, polymorphisms per base
    screened) varies considerably across the genome [97] and sheds light on the unique properties
    and history of each genomic region. The average heterozygosity at a locus will tend to
    increase in proportion to the local mutation rate and the 'age' of the locus (which can be
    defined as the average number of generations since the most recent common ancestor of
    two randomly chosen copies in the population). For example, positive selection can cause a
    locus to be unusually 'young' and balancing selection can cause it to be unusually 'old'. An
    extreme example is the HLA region, in which a high SNP density is observed, reflecting the
    fact that diverse HLA haplotypes have been maintained for many millions of years by
    balancing selection and greatly predate the origin of the human species.

    SNPs can also be used to study linkage disequilibrium in the human genome [375]. Linkage
    disequilibrium refers to the persistence of ancestral haplotypes—that is, genomic segments
    carrying particular combinations of alleles descended from a common ancestor. It can
    provide a powerful tool for mapping disease genes [376, 377] and for probing population
    history [378-380]. There has been considerable controversy concerning the typical distance
    over which linkage disequilibrium extends in the human genome [381-386]. With the collection
    of SNPs now available, it should be possible to resolve this important issue.

    Applications to medicine and biology

    In most research papers, the authors can only speculate about future applications of the
    work. Because the genome sequence has been released on a daily basis over the past four
    years, however, we can already cite many direct applications. We focus on a handful of
    applications chosen primarily from medical research.

    Disease genes

    A key application of human genome research has been the ability to find disease genes of
    unknown biochemical function by positional cloning [387]. This method involves mapping the
    chromosomal region containing the gene by linkage analysis in affected families and then
    scouring the region to find the gene itself. Positional cloning is powerful, but it has also been
    extremely tedious. When the approach was first proposed in the early 1980s [9], a researcher
    wishing to perform positional cloning had to generate genetic markers to trace inheritance;
    perform chromosomal walking to obtain genomic DNA covering the region; and analyse a
    region of around 1 Mb by either direct sequencing or indirect gene identification methods.
    The first two barriers were eliminated with the development in the mid-1990s of
    comprehensive genetic and physical maps of the human chromosomes, under the auspices
    of the Human Genome Project. The remaining barrier, however, has continued to be
    formidable.

    All that is changing with the availability of the human draft genome sequence. The human
    genomic sequence in public databases allows rapid identification in silico of candidate
    genes, followed by mutation screening of relevant candidates, aided by information on gene
    structure. For a mendelian disorder, a gene search can now often be carried out in a matter
    of months with only a modestly sized team.

    At least 30 disease genes [55, 388-421] (Table 26) have been positionally cloned in research
    efforts that depended directly on the publicly available genome sequence. As most of the
    human sequence has only arrived in the past twelve months, it is likely that many similar
    discoveries are not yet published. In addition, there are many cases in which the genome
    sequence played a supporting role, such as providing candidate microsatellite markers for
    finer genetic linkage analysis.

    The genome sequence has also helped to reveal the mechanisms leading to some common
    chromosomal deletion syndromes. In several instances, recurrent deletions have been found
    to result from homologous recombination and unequal crossing over between large, nearly
    identical intrachromosomal duplications. Examples include the DiGeorge/velocardiofacial
    syndrome region on chromosome 22 (ref. 238) and the Williams–Beuren syndrome
    recurrent deletion on chromosome 7 (ref. 239).

    The availability of the genome sequence also allows rapid identification of paralogues of
    disease genes, which is valuable for two reasons. First, mutations in a paralogous gene may
    give rise to a related genetic disease. A good example, discovered through use of the
    genome sequence, is achromatopsia (complete colour blindness). The CNGA3 gene,
    encoding the α-subunit of the cone photoreceptor cyclic GMP-gated channel, had been
    shown to harbour mutations in some families with achromatopsia. Computational searching
    of the genome sequences revealed the paralogous gene encoding the corresponding
    β-subunit, CNGB3 (which had not been apparent from EST databases). The CNGB3 gene
    was rapidly shown to be the cause of achromatopsia in other families [406, 407]. Another
    example is provided by the presenilin-1 and presenilin-2 genes, in which mutations can
    cause early-onset Alzheimer's disease [422, 423]. Second, the paralogue may provide an
    opportunity for therapeutic intervention, as exemplified by attempts to reactivate the fetally
    expressed haemoglobin genes in individuals with sickle cell disease or β-thalassaemia,
    caused by mutations in the β-globin gene [424].

    We undertook a systematic search for paralogues of 971 known human disease genes with
    entries in both the Online Mendelian Inheritance in Man (OMIM) database
    (http://www.ncbi.nlm.nih.gov/Omim/) and either the SwissProt or TrEMBL protein
    databases. We identified 286 potential paralogues (with the requirement of a match of at
    least 50 amino acids with identity greater than 70% but less than 90% if on the same
    chromosome, and less than 95% if on a different chromosome). Although this analysis may
    have identified some pseudogenes, 89% of the matches showed homology over more than
    one exon in the new target sequence, suggesting that many are functional. This analysis
    shows the potential for rapid identification of disease gene paralogues in silico.
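
    The match criteria used in this search can be written as a small predicate. The sketch
    below only restates the thresholds given above; the function and parameter names are
    hypothetical, and identity is assumed to be expressed as a percentage over the aligned
    amino acids.

```python
def is_candidate_paralogue(aligned_aa, identity_pct, same_chromosome):
    """Apply the stated criteria: at least 50 aligned amino acids, identity
    above 70%, and below 90% (same chromosome) or 95% (different chromosome)."""
    if aligned_aa < 50:
        return False
    upper = 90.0 if same_chromosome else 95.0
    return 70.0 < identity_pct < upper


# Hypothetical matches, invented for illustration only.
examples = [
    (120, 85.0, True),    # accepted
    (120, 92.0, True),    # rejected: too similar for a same-chromosome match
    (120, 92.0, False),   # accepted: below the 95% limit for other chromosomes
    (40, 85.0, False),    # rejected: alignment shorter than 50 amino acids
]
results = [is_candidate_paralogue(*e) for e in examples]
print(results)
```

    The upper bounds presumably exclude matches of a gene to itself or to very recent
    duplicates, with a stricter limit on the same chromosome where such copies are most likely.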

    Drug targets

    Over the past century, the pharmaceutical industry has largely depended upon a limited set
    of drug targets to develop new therapies. A recent compendium [425, 426] lists 483 drug
    targets as accounting for virtually all drugs on the market. Knowing the complete set of
    human genes and proteins will greatly expand the search for suitable drug targets. Although
    only a minority of human genes may be drug targets, it has been predicted that the number
    will exceed several thousand, and this prospect has led to a massive expansion of genomic
    research in pharmaceutical research and development. A few examples will illustrate the
    point.

    (1) The neurotransmitter serotonin (5-HT) mediates rapid excitatory responses through
    ligand-gated channels. The previously identified 5-HT3A receptor gene produces functional
    receptors, but with a much smaller conductance than observed in vivo. Cross-hybridization
    experiments and analysis of ESTs failed to reveal any other homologues of the known
    receptor. Recently, however, by searching the human draft genome sequence at low
    stringency, a putative homologue was identified within a PAC clone from the long arm of
    chromosome 11 (ref. 428). The homologue was shown to be expressed in the amygdala,
    caudate and hippocampus, and a full-length cDNA was subsequently obtained. The gene,
    which codes for a serotonin receptor, was named 5-HT3B. When assembled in a
    heterodimer with 5-HT3A, it was shown to account for the large-conductance neuronal
    serotonin channel. Given the central role of the serotonin pathway in mood disorders and
    schizophrenia, the discovery of a major new therapeutic target is of considerable interest.

    (2) The contractile and inflammatory actions of the cysteinyl leukotrienes, formerly known
    as the slow reacting substance of anaphylaxis (SRS-A), are mediated through specific
    receptors. The second such receptor, CysLT2, was identified using the combination of a rat
    EST and the human genome sequence. This led to the cloning of a gene with 38%
    amino-acid identity to the only other receptor that had previously been identified [428]. This
    new receptor, which shows high-affinity binding to several leukotrienes, maps to a region of
    chromosome 13 that is linked to atopic asthma. The gene is expressed in airway smooth
    muscles and in the heart. As the leukotriene pathway has been a significant target for the
    development of drugs against asthma, the discovery of a new receptor has obvious and
    important consequences.

    (3) Abundant deposition of β-amyloid in senile plaques is the hallmark of Alzheimer's
    disease. β-Amyloid is generated by proteolytic processing of the amyloid precursor protein
    (APP). One of the enzymes involved is the β-site APP-cleaving enzyme (BACE), which is
    a transmembrane aspartyl protease. Computational searching of the public human draft
    genome sequence recently identified a new sequence homologous to BACE, encoding a
    protein now named BACE2 [429, 430]. BACE2, which has 52% amino-acid sequence identity
    to BACE, contains two active protease sites and maps to the obligatory Down's syndrome
    region of chromosome 21, as does APP. This raises the question of whether the extra
    copies of both BACE2 and APP may contribute to accelerated deposition of β-amyloid in
    the brains of Down's syndrome patients. The development of antagonists to BACE and
    BACE2 represents a promising approach to preventing Alzheimer's disease.

    Given these examples, we undertook a systematic effort to identify paralogues of the classic
    drug target proteins in the draft genome sequence. The target list [426] was used to identify
    603 entries in the SwissProt database with unique accession numbers. These were then
    searched against the current genome sequence database, using the requirement that a
    match should have 70–100% identity to at least 50 amino acids. Matches to named proteins
    were ignored, as we assumed that these represented known homologues.
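
    The filtering step described above can be sketched in a few lines of Python. This is a
    minimal illustration, not the consortium's actual pipeline: the Hit record, its field
    names, the accession strings and the example values are all hypothetical, and the sketch
    assumes that a separate alignment step has already reported identity and aligned length
    for each match. Only the thresholds (70–100% identity over at least 50 amino acids, with
    matches to named proteins ignored) come from the text.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One protein-versus-genome match (all field names are hypothetical)."""
    query_accession: str         # accession of the drug-target query protein
    percent_identity: float      # identity over the aligned region, 0-100
    aligned_residues: int        # number of amino acids in the aligned region
    matches_named_protein: bool  # True if the matched region is an already-named protein


def is_candidate_paralogue(hit, min_identity=70.0, min_length=50):
    """Apply the thresholds quoted in the text to a single hit."""
    if hit.matches_named_protein:
        return False  # assumed to be a known homologue, so ignored
    return (min_identity <= hit.percent_identity <= 100.0
            and hit.aligned_residues >= min_length)


def candidate_paralogues(hits):
    """Keep only hits that pass the identity, length and novelty filters."""
    return [h for h in hits if is_candidate_paralogue(h)]


# Invented example records, one per filter outcome.
example_hits = [
    Hit("TARGET_A", 82.0, 120, False),  # passes every filter
    Hit("TARGET_B", 65.0, 200, False),  # identity below 70%
    Hit("TARGET_C", 90.0, 40, False),   # fewer than 50 aligned residues
    Hit("TARGET_D", 100.0, 300, True),  # matches a named protein
]
print([h.query_accession for h in candidate_paralogues(example_hits)])
```

    With these invented records only TARGET_A survives. Candidates retained by such a
    filter would still need the kind of follow-up checks described next, such as EST
    support and exon structure.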

    We found 18 putative novel paralogues (Table 27), including apparent dopamine receptors,
    purinergic receptors and insulin-like growth factor receptors. In six cases, the novel
    paralogue matches at least one EST, adding confidence that this search process can identify
    novel functional genes. For the remaining 12 putative paralogues without an EST match, all
    have long ORFs and all but one show similarity spanning multiple exons separated by
    introns, so these are not processed pseudogenes. They are likely to represent interesting
    new candidate drug targets.

    Basic biology

    Although the examples above reflect medical applications, there are also many similar
    applications to basic physiology and cell biology. To cite one satisfying example, the publicly
    available sequence was used to solve a mystery that had vexed investigators for several
    decades: the molecular basis of bitter taste [431]. Humans and other animals are polymorphic
    for response to certain bitter tastes. Recently, investigators mapped this trait in both humans
    and mice and then searched the relevant region of the human draft genome sequence for
    G-protein coupled receptors. These studies led, in quick succession, to the discovery of a
    new family of such proteins, the demonstration that they are expressed almost exclusively in
    taste buds, and the experimental confirmation that the receptors in cultured cells respond to
    specific bitter substances [432-434].

    The next steps

    Considerable progress has been made in human sequencing, but much remains to be done to
    produce a finished sequence. Even more work will be required to extract the full
    information contained in the sequence. Many of the key next steps are already underway.

    Finishing the human sequence

    The human sequence will serve as a foundation for biomedical research in the years ahead,
    and it is thus crucial that the remaining gaps be filled and ambiguities be resolved as quickly
    as possible. This will involve a three-step program.

    The first stage involves producing finished sequence from clones spanning the current
    physical map, which covers more than 96% of the euchromatic regions of the genome.
    About 1 Gb of finished sequence is already completed. Almost all of the remaining clones
    are already sequenced to at least draft coverage, and the rest have been selected for
    sequencing. All clones are expected to reach 'full shotgun' coverage (8–10-fold
    redundancy) by about mid-2001 and finished form (99.99% accuracy) not long thereafter,
    using established and increasingly automated protocols.

    The next stage will be to screen additional libraries to close gaps between clone contigs.
    Directed probing of additional large-insert clone libraries should close many of the remaining
    gaps. Unclosed gaps will be sized by FISH techniques or other methods. Two
    chromosomes, 22 and 21, have already been brought to 'essentially complete' form in
    this manner [93, 94], and chromosomes 20, Y, 19, 14 and 7 are likely to reach this status in the
    next few months. All chromosomes should be essentially completed by 2003, if not sooner.

    Finally, techniques must be developed to close recalcitrant gaps. Several hundred such gaps
    in the euchromatic sequence will probably remain in the genome after exhaustive screening
    of existing large-insert libraries. New methodologies will be needed to recover sequence
    from these segments, and to define biological reasons for their lack of representation in
    standard libraries. Ideally, it would be desirable to obtain complete sequence from all
    heterochromatic regions, such as centromeres and ribosomal gene clusters, although most of
    this sequence will consist of highly polymorphic tandem repeats containing few
    protein-coding genes.

    Developing the IGI and IPI

    The draft genome sequence has provided an initial look at the human gene content, but
    many ambiguities remain. A high priority will be to refine the IGI and IPI to the point where
    they accurately reflect every gene and every alternatively spliced form. Several steps are
    needed to reach this ambitious goal.

    Finishing the human sequence will assist in this effort, but the experiences gained on
    chromosomes 21 and 22 show that sequence alone is not enough to allow complete gene
    identification. One powerful approach is cross-species sequence comparison with related
    organisms at suitable evolutionary distances. The sequence coverage from the pufferfish T.
    nigroviridis has already proven valuable in identifying potential exons [292]; this work is
    expected to continue from its current state of onefold coverage to reach at least fivefold
    coverage later this year. The genome sequence of the laboratory mouse will provide a
    particularly powerful tool for exon identification, as sequence similarity is expected to
    identify 95–97% of the exons, as well as a significant number of regulatory domains [435-437].
    A public-private consortium is speeding this effort, by producing freely accessible
    whole-genome shotgun coverage that can be readily used for cross-species comparison [438].
    More than onefold coverage from the C57BL/6J strain has already been completed and
    threefold is expected within the next few months. In the slightly longer term, a program is
    under way to produce a finished sequence of the laboratory mouse.

    Another important step is to obtain a comprehensive collection of full-length human cDNAs,
    both as sequences and as actual clones. The Mammalian Gene Collection project has been
    underway for a year [18] and expects to produce 10,000–15,000 human full-length cDNAs
    over the coming year, which will be available without restrictions on use. The Genome
    Exploration Group of the RIKEN Genomic Sciences Center is similarly developing a
    collection of cDNA clones from mouse [309], which is a valuable complement because of the
    availability of tissues from all developmental time points. A challenge will be to define the
    gene-specific patterns of alternative splicing, which may affect half of human genes.
    Existing collections of ESTs and cDNAs may allow identification of the most abundant of
    these isoforms, but systematic exploration of this problem may require exhaustive analysis
    of cDNA libraries from multiple tissues or perhaps high-throughput reverse
    transcription–PCR studies. Deep understanding of gene function will probably require
    knowledge of the structure, tissue distribution and abundance of these alternative forms.

    Large-scale identification of regulatory regions

    The one-dimensional script of the human genome, shared by essentially all cells in all
    tissues, contains sufficient information to provide for differentiation of hundreds of different
    cell types, and the ability to respond to a vast array of internal and external influences.
    Much of this plasticity results from the carefully orchestrated symphony of transcriptional
    regulation. Although much has been learned about the cis-acting regulatory motifs of some
    specific genes, the regulatory signals for most genes remain uncharacterized. Comparative
    genomics of multiple vertebrates offers the best hope for large-scale identification of such
    regulatory sites [439]. Previous studies of sequence alignment of regulatory domains of
    orthologous genes in multiple species have shown a remarkable correlation between
    sequence conservation, dubbed 'phylogenetic footprints' [440], and the presence of binding
    motifs for transcription factors. This approach could be particularly powerful if combined
    with expression array technologies that identify cohorts of genes that are coordinately
    regulated, implicating a common set of cis-acting regulatory sequences [441-444]. It will also
    be of considerable interest to study epigenetic modifications such as cytosine methylation on
    a genome-wide scale, and to determine their biological consequences [445, 446]. Towards this
    end, a pilot Human Epigenome Project has been launched [447, 448].
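
    The idea behind phylogenetic footprinting can be illustrated with a toy sketch. The code
    below assumes the orthologous sequences have already been aligned (equal length, '-' for
    gaps) and reports runs of columns that are identical in every species. Exact identity
    across all species is a deliberately crude stand-in for the statistical conservation
    scores a real analysis would use; the function names, the minimum run length and the
    three short sequences are invented for illustration.

```python
def conserved_columns(alignment):
    """Return one boolean per column: True where all sequences share the same base."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    flags = []
    for column in zip(*alignment):
        bases = set(column)
        # A column counts as conserved only if it holds one base and no gap.
        flags.append(len(bases) == 1 and "-" not in bases)
    return flags


def footprints(alignment, min_length=6):
    """Return (start, end) half-open intervals of conserved runs of at least min_length."""
    runs, start = [], None
    # The trailing False closes any run that reaches the end of the alignment.
    for i, conserved in enumerate(conserved_columns(alignment) + [False]):
        if conserved and start is None:
            start = i
        elif not conserved and start is not None:
            if i - start >= min_length:
                runs.append((start, i))
            start = None
    return runs


# Invented upstream sequences from three hypothetical species, already aligned.
toy_alignment = [
    "ACGTTGACGTCAGGCT",
    "ATGTTGACGTCAGACT",
    "ACCTTGACGTCA-GCT",
]
for start, end in footprints(toy_alignment):
    print(start, end, toy_alignment[0][start:end])  # 3 12 TTGACGTCA
```

    Conserved blocks found this way are only candidates: whether a block corresponds to a
    transcription-factor binding motif would still have to be tested against motif
    collections or by experiment.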

    Sequencing of additional large genomes

    More generally, comparative genomics allows biologists to peruse evolution's laboratory
    notebook—to identify conserved functional features and recognize new innovations in
    specific lineages. Determination of the genome sequence of many organisms is very
    desirable. Already, projects are underway to sequence the genomes of the mouse, rat,
    zebrafish and the pufferfishes T. nigroviridis and Takifugu rubripes. Plans are also under
    consideration for sequencing additional primates and other organisms that will help define
    key developments along the vertebrate and nonvertebrate lineages.

    To realize the full promise of comparative genomics, however, it needs to become simple
    and inexpensive to sequence the genome of any organism. Sequencing costs have dropped
    100-fold over the last 10 years, corresponding to a roughly twofold decrease every 18
    months. This rate is similar to 'Moore's law' concerning improvements in semiconductor
    manufacture. In both sequencing and semiconductors, such improvement does not happen
    automatically, but requires aggressive technological innovation fuelled by major investment.
    Improvements are needed to move current dideoxy sequencing to smaller volumes and
    more rapid sequencing times, based upon advances such as microchannel technology. More
    revolutionary methods, such as mass spectrometry, single-molecule sequencing and
    nanopore approaches [76], have not yet been fully developed, but hold great promise and
    deserve strong encouragement.
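
    The arithmetic linking a 100-fold drop over 10 years to a halving roughly every 18
    months can be checked directly. The short sketch below assumes a steady exponential
    decline, which is a simplification of any real cost history; the function names are
    ours.

```python
import math


def halving_time_months(total_fold, years):
    """Months needed for cost to halve, given a total fold decrease over `years`."""
    months = years * 12
    return months * math.log(2) / math.log(total_fold)


def total_fold_decrease(halving_months, years):
    """Total fold decrease over `years` if cost halves every `halving_months`."""
    return 2 ** (years * 12 / halving_months)


print(round(halving_time_months(100, 10), 1))  # 18.1
print(round(total_fold_decrease(18, 10)))      # 102
```

    A 100-fold drop in 120 months implies a halving time of about 18.1 months, and an
    18-month halving time compounds to roughly a 102-fold drop over a decade, so the two
    figures quoted in the text are consistent with each other.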

    Completing the catalogue of human variation

    The human draft genome sequence has already allowed the identification of more than 1.4
    million SNPs, comprising a substantial proportion of all common human variation. This
    program should be extended to obtain a nearly complete catalogue of common variants and
    to identify the common ancestral haplotypes present in the population. In principle, these
    genetic tools should make it possible to perform association studies and linkage
    disequilibrium studies [375] to identify the genes that confer even relatively modest risk for
    common diseases. Launching such an intense era of human molecular epidemiology will
    also require major advances in the cost efficiency of genotyping technology, in the collection
    of carefully phenotyped patient cohorts and in statistical methods for relating large-scale
    SNP data to disease phenotype.

    From sequence to function

    The scientific program outlined above focuses on how the genome sequence can be mined
    for biological information. In addition, the sequence will serve as a foundation for a broad
    range of functional genomic tools to help biologists to probe function in a more systematic
    manner. These will need to include improved techniques and databases for the global
    analysis of: RNA and protein expression, protein localization, protein–protein interactions
    and chemical inhibition of pathways. New computational techniques will be needed to use
    such information to model cellular circuitry. A full discussion of these important directions is
    beyond the scope of this paper.

    Concluding thoughts

    The Human Genome Project is but the latest increment in a remarkable scientific program
    whose origins stretch back a hundred years to the rediscovery of Mendel's laws and whose
    end is nowhere in sight. In a sense, it provides a capstone for efforts in the past century to
    discover genetic information and a foundation for efforts in the coming century to
    understand it.

    We find it humbling to gaze upon the human sequence now coming into focus. In principle,
    the string of genetic bits holds long-sought secrets of human development, physiology and
    medicine. In practice, our ability to transform such information into understanding remains
    woefully inadequate. This paper simply records some initial observations and attempts to
    frame issues for future study. Fulfilling the true promise of the Human Genome Project will
    be the work of tens of thousands of scientists around the world, in both academia and
    industry. It is for this reason that our highest priority has been to ensure that genome data
    are available rapidly, freely and without restriction.

    The scientific work will have profound long-term consequences for medicine, leading to the
    elucidation of the underlying molecular mechanisms of disease and thereby facilitating the
    design in many cases of rational diagnostics and therapeutics targeted at those mechanisms.
    But the science is only part of the challenge. We must also involve society at large in the
    work ahead. We must set realistic expectations that the most important benefits will not be
    reaped overnight. Moreover, understanding and wisdom will be required to ensure that these
    benefits are implemented broadly and equitably. To that end, serious attention must be paid
    to the many ethical, legal and social implications (ELSI) raised by the accelerated pace of
    genetic discovery. This paper has focused on the scientific achievements of the human
    genome sequencing efforts. This is not the place to engage in a lengthy discussion of the
    ELSI issues, which have also been a major research focus of the Human Genome Project,
    but these issues are of comparable importance and could appropriately fill a paper of equal
    length.

    Finally, it has not escaped our notice that the more we learn about the human genome, the
    more there is to explore.

    "We shall not cease from exploration. And the end of all our exploring will be to arrive
    where we started, and know the place for the first time."—T. S. Eliot [450]


    DNA sequence databases

    GenBank, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bldg. 38A, 8600 Rockville Pike, Bethesda, Maryland 20894, USA

    EMBL, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

    DNA Data Bank of Japan, Center for Information Biology, National Institute of Genetics, 1111 Yata, Mishima-shi, Shizuoka-ken 411-8540, Japan


    Supplementary Information on Nature's web site
    Supplementary Information is available on Nature's World-Wide Web site
    (http://www.nature.com) or as paper copy from the London editorial office of Nature. 


    International Human Genome Sequencing Consortium Genome Sequencing Centres: Listed in order of total genomic sequence contributed, with a partial list of personnel. A full list of contributors at each centre is
    available as Supplementary Information.

    Whitehead Institute for Biomedical Research, Center for Genome Research:

    ERIC S. LANDER1*, LAUREN M. LINTON1, BRUCE BIRREN1*, CHAD NUSBAUM1*,
    MICHAEL C. ZODY1*, JENNIFER BALDWIN1, KERI DEVON1, KEN DEWAR1,
    MICHAEL DOYLE1, WILLIAM FITZHUGH1*, ROEL FUNKE1, DIANE GAGE1,
    KATRINA HARRIS1, ANDREW HEAFORD1, JOHN HOWLAND1, LISA KANN1,
    JESSICA LEHOCZKY1, ROSIE LEVINE1, PAUL MCEWAN1, KEVIN MCKERNAN1,
    JAMES MELDRIM1, JILL P. MESIROV1, CHER MIRANDA1, WILLIAM MORRIS1,
    JEROME NAYLOR1, CHRISTINA RAYMOND1, MARK ROSETTI1, RALPH SANTOS1,
    ANDREW SHERIDAN1, CARRIE SOUGNEZ1, NICOLE STANGE-THOMANN1,
    NIKOLA STOJANOVIC1, ARAVIND SUBRAMANIAN1 & DUDLEY WYMAN1

    The Sanger Centre:

    JANE ROGERS2, JOHN SULSTON2*, RACHAEL AINSCOUGH2, STEPHAN BECK2,
    DAVID BENTLEY2, JOHN BURTON2, CHRISTOPHER CLEE2, NIGEL CARTER2,
    ALAN COULSON2, REBECCA DEADMAN2, PANOS DELOUKAS2, ANDREW DUNHAM2,
    IAN DUNHAM2, RICHARD DURBIN2*, LISA FRENCH2, DARREN GRAFHAM2,
    SIMON GREGORY2, TIM HUBBARD2*, SEAN HUMPHRAY2, ADRIENNE HUNT2,
    MATTHEW JONES2, CHRISTINE LLOYD2, AMANDA MCMURRAY2, LUCY MATTHEWS2,
    SIMON MERCER2, SARAH MILNE2, JAMES C. MULLIKIN2, ANDREW MUNGALL2,
    ROBERT PLUMB2, MARK ROSS2, RATNA SHOWNKEEN2 & SARAH SIMS2

    Washington University Genome Sequencing Center:

    ROBERT H. WATERSTON3*, RICHARD K. WILSON3, LADEANA W. HILLIER3,
    JOHN D. MCPHERSON3, MARCO A. MARRA3, ELAINE R. MARDIS3,
    LUCINDA A. FULTON3, ASIF T. CHINWALLA3, KYMBERLIE H. PEPIN3,
    WARREN R. GISH3, STEPHANIE L. CHISSOE3, MICHAEL C. WENDL3,
    KIM D. DELEHAUNTY3, TRACIE L. MINER3, ANDREW DELEHAUNTY3,
    JASON B. KRAMER3, LISA L. COOK3, ROBERT S. FULTON3, DOUGLAS L. JOHNSON3,
    PATRICK J. MINX3 & SANDRA W. CLIFTON3

    US DOE Joint Genome Institute:

    TREVOR HAWKINS4, ELBERT BRANSCOMB4, PAUL PREDKI4, PAUL RICHARDSON4,
    SARAH WENNING4, TOM SLEZAK4, NORMAN DOGGETT4, JAN-FANG CHENG4,
    ANNE OLSEN4, SUSAN LUCAS4, CHRISTOPHER ELKIN4, EDWARD UBERBACHER4 &
    MARVIN FRAZIER4

    Baylor College of Medicine Human Genome Sequencing Center:

    RICHARD A. GIBBS5*, DONNA M. MUZNY5, STEVEN E. SCHERER5, JOHN B. BOUCK5,
    ERICA J. SODERGREN5, KIM C. WORLEY5, CATHERINE M. RIVES5,
    JAMES H. GORRELL5, MICHAEL L. METZKER5, SUSAN L. NAYLOR6,
    RAJU S. KUCHERLAPATI7, DAVID L. NELSON & GEORGE M. WEINSTOCK8

    RIKEN Genomic Sciences Center:

    YOSHIYUKI SAKAKI9, ASAO FUJIYAMA9, MASAHIRA HATTORI9, TETSUSHI YADA9,
    ATSUSHI TOYODA9, TAKEHIKO ITOH9, CHIHARU KAWAGOE9, HIDEMI WATANABE9,
    YASUSHI TOTOKI9 & TODD TAYLOR9

    Genoscope and CNRS UMR-8030:

    JEAN WEISSENBACH10, ROLAND HEILIG10, WILLIAM SAURIN10,
    FRANCOIS ARTIGUENAVE10, PHILIPPE BROTTIER10, THOMAS BRULS10,
    ERIC PELLETIER10, CATHERINE ROBERT10 & PATRICK WINCKER10

    GTC Sequencing Center:

    DOUGLAS R. SMITH11, LYNN DOUCETTE-STAMM11, MARC RUBENFIELD11,
    KEITH WEINSTOCK11, HONG MEI LEE11 & JOANN DUBOIS11

    Department of Genome Analysis, Institute of Molecular Biotechnology:

    ANDRÉ ROSENTHAL12, MATTHIAS PLATZER12, GERALD NYAKATURA12,
    STEFAN TAUDIEN12 & ANDREAS RUMP12

    Beijing Genomics Institute/Human Genome Center:

    HUANMING YANG13, JUN YU13, JIAN WANG13, GUYANG HUANG14 & JUN GU15

    Multimegabase Sequencing Center, The Institute for Systems Biology:

    LEROY HOOD16, LEE ROWEN16, ANUP MADAN16 & SHIZEN QIN16

    Stanford Genome Technology Center:

    RONALD W. DAVIS17, NANCY A. FEDERSPIEL17, A. PIA ABOLA17 &
    MICHAEL J. PROCTOR17

    Stanford Human Genome Center:

    RICHARD M. MYERS18, JEREMY SCHMUTZ18, MARK DICKSON18, JANE GRIMWOOD18
    & DAVID R. COX18

    University of Washington Genome Center:

    MAYNARD V. OLSON19, RAJINDER KAUL19 & CHRISTOPHER RAYMOND19

    Department of Molecular Biology, Keio University School of Medicine:

    NOBUYOSHI SHIMIZU20, KAZUHIKO KAWASAKI20 & SHINSEI MINOSHIMA20

    University of Texas Southwestern Medical Center at Dallas:

    GLEN A. EVANS21†, MARIA ATHANASIOU21 & ROGER SCHULTZ21

    University of Oklahoma's Advanced Center for Genome Technology:

    BRUCE A. ROE22, FENG CHEN22 & HUAQIN PAN22

    Max Planck Institute for Molecular Genetics:

    JULIANE RAMSER23, HANS LEHRACH23 & RICHARD REINHARDT23

    Cold Spring Harbor Laboratory, Lita Annenberg Hazen Genome Center:

    W. RICHARD MCCOMBIE24, MELISSA DE LA BASTIDE24 & NEILAY DEDHIA24

    GBF—German Research Centre for Biotechnology:

    HELMUT BLÖCKER25, KLAUS HORNISCHER25 & GABRIELE NORDSIEK25 


    *Genome Analysis Group (listed in alphabetical order, also includes
    individuals listed under other headings):

    RICHA AGARWALA26, L. ARAVIND26, JEFFREY A. BAILEY27, ALEX BATEMAN2,
    SERAFIM BATZOGLOU1, EWAN BIRNEY28, PEER BORK29, 30, DANIEL G. BROWN1,
    CHRISTOPHER B. BURGE31, LORENZO CERUTTI28, HSIU-CHUAN CHEN26,
    DEANNA CHURCH26, MICHELE CLAMP2, RICHARD R. COPLEY30,
    TOBIAS DOERKS29, 30, SEAN R. EDDY32, EVAN E. EICHLER27, TERRENCE S. FUREY33,
    JAMES GALAGAN1, JAMES G. R. GILBERT2, CYRUS HARMON34,
    YOSHIHIDE HAYASHIZAKI35, DAVID HAUSSLER36, HENNING HERMJAKOB28,
    KARSTEN HOKAMP37, WONHEE JANG26, L. STEVEN JOHNSON32,
    THOMAS A. JONES32, SIMON KASIF38, AREK KASPRYZK28, SCOT KENNEDY39,
    W. JAMES KENT40, PAUL KITTS26, EUGENE V. KOONIN26, IAN KORF3, DAVID KULP34,
    DORON LANCET41, TODD M. LOWE42, AOIFE MCLYSAGHT37, TARJEI MIKKELSEN38,
    JOHN V. MORAN43, NICOLA MULDER28, VICTOR J. POLLARA1, CHRIS P. PONTING44,
    GREG SCHULER26, JÖRG SCHULTZ30, GUY SLATER28, ARIAN F. A. SMIT45,
    ELIA STUPKA28, JOSEPH SZUSTAKOWKI38, DANIELLE THIERRY-MIEG26,
    JEAN THIERRY-MIEG26, LUKAS WAGNER26, JOHN WALLIS3, RAYMOND WHEELER34,
    ALAN WILLIAMS34, YURI I. WOLF26, KENNETH H. WOLFE37, SHIAW-PYNG YANG3 &
    RU-FANG YEH31 


    Scientific management: National Human Genome Research Institute, US National Institutes of Health:

    FRANCIS COLLINS46*, MARK S. GUYER46, JANE PETERSON46, ADAM FELSENFELD46*
    & KRIS A. WETTERSTRAND46

    Office of Science, US Department of Energy:

    ARISTIDES PATRINOS47

    The Wellcome Trust:

    MICHAEL J. MORGAN48

    1
      Whitehead Institute for Biomedical Research, Center for Genome Research, Nine Cambridge Center,
      Cambridge, Massachusetts 02142, USA
     2
      The Sanger Centre, The Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1RQ, United
      Kingdom
     3
      Washington University Genome Sequencing Center, Box 8501, 4444 Forest Park Avenue, St. Louis, Missouri
      63108, USA
     4
      US DOE Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, California 94598, USA
     5
      Baylor College of Medicine Human Genome Sequencing Center, Department of Molecular and Human
      Genetics, One Baylor Plaza, Houston, Texas 77030, USA;
     6
      Department of Cellular and Structural Biology, The University of Texas Health Science Center at San Antonio,
      7703 Floyd Curl Drive, San Antonio, Texas 78229-3900, USA;
     7
      Department of Molecular Genetics, Albert Einstein College of Medicine, 1635 Poplar Street, Bronx, New York
      10461, USA;
     8
      Baylor College of Medicine Human Genome Sequencing Center and the Department of Microbiology &
      Molecular Genetics, University of Texas Medical School, PO Box 20708, Houston, Texas 77225, USA
     9
      RIKEN Genomic Sciences Center, 1-7-22 Suehiro-cho, Tsurumi-ku Yokohama-city, Kanagawa 230-0045, Japan
    10
      Genoscope and CNRS UMR-8030, 2 Rue Gaston Cremieux, CP 5706, 91057 Evry Cedex, France
    11
      GTC Sequencing Center, Genome Therapeutics Corporation, 100 Beaver Street, Waltham, Massachusetts
      02453-8443, USA
    12
      Department of Genome Analysis, Institute of Molecular Biotechnology, Beutenbergstrasse 11, D-07745 Jena,
      Germany
    13
      Beijing Genomics Institute/Human Genome Center, Institute of Genetics, Chinese Academy of Sciences,
      Beijing 100101, China;
    14
      Southern China National Human Genome Research Center, Shanghai 201203, China;
    15
      Northern China National Human Genome Research Center, Beijing 100176, China
    16
      Multimegabase Sequencing Center, The Institute for Systems Biology, 4225 Roosevelt Way, NE Suite 200,
      Seattle, Washington 98105, USA
    17
      Stanford Genome Technology Center, 855 California Avenue, Palo Alto, California 94304, USA
    18
      Stanford Human Genome Center and Department of Genetics, Stanford University School of Medicine,
      Stanford, California 94305-5120, USA
    19
      University of Washington Genome Center, 225 Fluke Hall on Mason Road, Seattle, Washington 98195, USA
    20
      Department of Molecular Biology, Keio University School of Medicine, 35 Shinanomachi, Shinjuku-ku, Tokyo
      160-8582, Japan
    21
      University of Texas Southwestern Medical Center at Dallas, 6000 Harry Hines Blvd., Dallas, Texas 75235-8591,
      USA
    22
      University of Oklahoma's Advanced Center for Genome Technology, Dept. of Chemistry and Biochemistry,
      University of Oklahoma, 620 Parrington Oval, Rm 311, Norman, Oklahoma 73019, USA
    23
      Max Planck Institute for Molecular Genetics, Ihnestrasse 73, 14195 Berlin, Germany
    24
      Cold Spring Harbor Laboratory, Lita Annenberg Hazen Genome Center, 1 Bungtown Road, Cold Spring Harbor,
      New York 11724, USA
    25
      GBF - German Research Centre for Biotechnology, Mascheroder Weg 1, D-38124 Braunschweig, Germany
    26
      National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bldg.
      38A, 8600 Rockville Pike, Bethesda, Maryland 20894, USA;
    27
      Department of Genetics, Case Western Reserve School of Medicine and University Hospitals of Cleveland, BRB
      720, 10900 Euclid Ave., Cleveland, Ohio 44106, USA;
    28
      EMBL European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD,
      United Kingdom;
    29
      Max Delbrück Center for Molecular Medicine, Robert-Rossle-Strasse 10, 13125 Berlin-Buch, Germany;
    30
      EMBL, Meyerhofstrasse 1, 69012 Heidelberg, Germany;
    31
      Dept. of Biology, Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, Massachusetts
      02139-4307, USA;
    32
      Howard Hughes Medical Institute, Dept. of Genetics, Washington University School of Medicine, Saint Louis,
      Missouri 63110, USA;
    33
      Dept. of Computer Science, University of California at Santa Cruz, Santa Cruz, California 95064, USA;
    34
      Affymetrix, Inc., 2612 8th St, Berkeley, California 94710, USA;
    35
      Genome Exploration Research Group, Genomic Sciences Center, RIKEN Yokohama Institute, 1-7-22
      Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan;
    36
      Howard Hughes Medical Institute, Department of Computer Science, University of California at Santa Cruz,
      California 95064, USA;
    37
      University of Dublin, Trinity College, Department of Genetics, Smurfit Institute, Dublin 2, Ireland;
    38
      Cambridge Research Laboratory, Compaq Computer Corporation and MIT Genome Center, 1 Cambridge
      Center, Cambridge, Massachusetts 02142, USA;
    39
      Dept. of Mathematics, University of California at Santa Cruz, Santa Cruz, California 95064, USA;
    40
      Dept. of Biology, University of California at Santa Cruz, Santa Cruz, California 95064, USA;
    41
      Crown Human Genetics Center and Department of Molecular Genetics, The Weizmann Institute of Science,
      Rehovot 71600, Israel;
    42
      Dept. of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA;
    43
      The University of Michigan Medical School, Departments of Human Genetics and Internal Medicine, Ann Arbor,
      Michigan 48109, USA;
    44
      MRC Functional Genetics Unit, Department of Human Anatomy and Genetics, University of Oxford, South Parks
      Road, Oxford OX1 3QX, UK;
    45
      Institute for Systems Biology, 4225 Roosevelt Way NE, Seattle, WA 98105, USA
    46
      National Human Genome Research Institute, US National Institutes of Health, 31 Center Drive, Bethesda,
      Maryland 20892, USA;
    47
      Office of Science, US Department of Energy, 19901 Germantown Road, Germantown, Maryland 20874, USA;
    48
      The Wellcome Trust, 183 Euston Road, London, NW1 2BE, UK.
     

    †Present addresses: Genome Sequencing Project, Egea Biosciences, Inc., 4178 Sorrento Valley Blvd., Suite F,
    San Diego, CA 92121, USA (G.A.E.); INRA, Station d'Amélioration des Plantes, 63039 Clermont-Ferrand Cedex 2, France (L.C.). 


    Correspondence and requests for materials should be addressed to E. S. Lander (e-mail:
    lander@genome.wi.mit.edu), R. H. Waterston (e-mail: bwaterst@watson.wustl.edu), J. Sulston (e-mail:
    jes@sanger.ac.uk) or F. S. Collins (e-mail: fc23a@nih.gov).


    Received 7 December 2000; accepted 9 January 2001



    References:

    1. Correns, C. "Untersuchungen über die Xenien bei Zea mays", Berichte der Deutschen Botanischen Gesellschaft 17: 410-418 (1899).

    2. De Vries, H. "Sur la loi de disjonction des hybrides", Comptes Rendus Hebdomadaires, Acad. Sci. Paris 130: 845-847 (1900).

    3. von Tschermak, E. "Über künstliche Kreuzung bei Pisum sativum", Berichte der Deutschen Botanischen Gesellschaft 18: 232-239 (1900).

    4. Sanger, F. et al. Nucleotide sequence of bacteriophage φX174 DNA. Nature 265, 687-695 (1977).

    5. Sanger, F. et al. The nucleotide sequence of bacteriophage φX174. J. Mol. Biol. 125, 225-246 (1978).

    6. Sanger, F., Coulson, A. R., Hong, G. F., Hill, D. F. & Petersen, G. B. Nucleotide-sequence of bacteriophage Lambda DNA. J. Mol. Biol. 162, 729-773 (1982).

    7. Fiers, W. et al. Complete nucleotide sequence of SV40 DNA. Nature 273, 113-120 (1978).

    8. Anderson, S. et al. Sequence and organization of the human mitochondrial genome. Nature 290, 457-465 (1981).

    9. Botstein, D., White, R. L., Skolnick, M. & Davis, R. W. Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am. J. Hum. Genet. 32, 314-331 (1980).

    10. Olson, M. V. et al. Random-clone strategy for genomic restriction mapping in yeast. Proc. Natl Acad. Sci. USA 83, 7826-7830 (1986).

    11. Coulson, A., Sulston, J., Brenner, S. & Karn, J. Toward a physical map of the genome of the nematode Caenorhabditis elegans. Proc. Natl Acad. Sci. USA 83, 7821-7825 (1986).

    12. Putney, S. D., Herlihy, W. C. & Schimmel, P. A new troponin T and cDNA clones for 13 different muscle proteins, found by shotgun sequencing. Nature 302, 718-721 (1983).

    13. Milner, R. J. & Sutcliffe, J. G. Gene expression in rat brain. Nucleic Acids Res. 11, 5497-5520 (1983).

    14. Adams, M. D. et al. Complementary DNA sequencing: expressed sequence tags and human genome project. Science 252, 1651-1656 (1991).

    15. Adams, M. D. et al. Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence. Nature 377, 3-174 (1995).

    16. Okubo, K. et al. Large scale cDNA sequencing for analysis of quantitative and qualitative aspects of gene expression. Nature Genet. 2, 173-179 (1992).

    17. Hillier, L. D. et al. Generation and analysis of 280,000 human expressed sequence tags. Genome Res. 6, 807-828 (1996).

    18. Strausberg, R. L., Feingold, E. A., Klausner, R. D. & Collins, F. S. The mammalian gene collection. Science 286, 455-457 (1999).

    19. Berry, R. et al. Gene-based sequence-tagged-sites (STSs) as the basis for a human gene map. Nature Genet. 10, 415-423 (1995).

    20. Houlgatte, R. et al. The Genexpress Index: a resource for gene discovery and the genic map of the human genome. Genome Res. 5, 272-304 (1995).

    21. Sinsheimer, R. L. The Santa Cruz Workshop--May 1985. Genomics 5, 954-956 (1989).

    22. Palca, J. Human genome--Department of Energy on the map. Nature 321, 371 (1986).

    23. National Research Council Mapping and Sequencing the Human Genome (National Academy Press, Washington DC, 1988).

    24. Bishop, J. E. & Waldholz, M. Genome (Simon and Schuster, New York, 1990).

    25. Kevles, D. J. & Hood, L. (eds) The Code of Codes: Scientific and Social Issues in the Human Genome Project (Harvard Univ. Press, Cambridge, Massachusetts, 1992).

    26. Cook-Deegan, R. The Gene Wars: Science, Politics, and the Human Genome (W. W. Norton & Co., New York, London, 1994).

    27. Donis-Keller, H. et al. A genetic linkage map of the human genome. Cell 51, 319-337 (1987).

    28. Gyapay, G. et al. The 1993-94 Genethon human genetic linkage map. Nature Genet. 7, 246-339 (1994).

    29. Hudson, T. J. et al. An STS-based map of the human genome. Science 270, 1945-1954 (1995).

    30. Dietrich, W. F. et al. A comprehensive genetic map of the mouse genome. Nature 380, 149-152 (1996).

    31. Nusbaum, C. et al. A YAC-based physical map of the mouse genome. Nature Genet. 22, 388-393 (1999).

    32. Oliver, S. G. et al. The complete DNA sequence of yeast chromosome III. Nature 357, 38-46 (1992).

    33. Wilson, R. et al. 2.2 Mb of contiguous nucleotide sequence from chromosome III of C. elegans. Nature 368, 32-38 (1994).

    34. Chen, E. Y. et al. The human growth hormone locus: nucleotide sequence, biology, and evolution. Genomics 4, 479-497 (1989).

    35. McCombie, W. R. et al. Expressed genes, Alu repeats and polymorphisms in cosmids sequenced from chromosome 4p16.3. Nature Genet. 1, 348-353 (1992).

    36. Martin-Gallardo, A. et al. Automated DNA sequencing and analysis of 106 kilobases from human chromosome 19q13.3. Nature Genet. 1, 34-39 (1992).

    37. Edwards, A. et al. Automated DNA sequencing of the human HPRT locus. Genomics 6, 593-608 (1990).

    38. Marshall, E. A strategy for sequencing the genome 5 years early. Science 267, 783-784 (1995).
      39.
         Project to sequence human genome moves on to the starting blocks. Nature 375,
         93-94 (1995).
      40.
         Shizuya, H. et al. Cloning and stable maintenance of 300-kilobase-pair fragments of
         human DNA in Escherichia coli using an F-factor-based vector. Proc. Natl Acad. Sci.
         USA 89, 8794-8797 (1992). | PubMed |
      41.
         Burke, D. T., Carle, G. F. & Olson, M. V. Cloning of large segments of exogenous
         DNA into yeast by means of artificial chromosome vectors. Science 236, 806-812
         (1987). | PubMed |
      42.
         Marshall, E. A second private genome project. Science 281, 1121 (1998). | PubMed |
      43.
         Marshall, E. NIH to produce a 'working draft' of the genome by 2001. Science 281,
         1774-1775 (1998). | PubMed |
      44.
         Pennisi, E. Academic sequencers challenge Celera in a sprint to the finish. Science
         283, 1822-1823 (1999). | Article | PubMed |
      45.
         Bouck, J., Miller, W., Gorrell, J. H., Muzny, D. & Gibbs, R. A. Analysis of the quality
         and utility of random shotgun sequencing at low redundancies. Genome Res. 8,
         1074-1084 (1998). | PubMed |
      46.
         Collins, F. S. et al. New goals for the U.S. Human Genome Project: 1998-2003.
         Science 282, 682-689 (1998). | PubMed |
      47.
         Sanger, F. & Coulson, A. R. A rapid method for determining sequences in DNA by
         primed synthesis with DNA polymerase. J. Mol. Biol. 94, 441-448 (1975). | PubMed |
      48.
         Maxam, A. M. & Gilbert, W. A new method for sequencing DNA. Proc. Natl Acad.
         Sci. USA 74, 560-564 (1977). | PubMed |
      49.
         Anderson, S. Shotgun DNA sequencing using cloned DNase I-generated fragments.
         Nucleic Acids Res. 9, 3015-3027 (1981). | PubMed |
      50.
         Gardner, R. C. et al. The complete nucleotide sequence of an infectious clone of
         cauliflower mosaic virus by M13mp7 shotgun sequencing. Nucleic Acids Res. 9,
         2871-2888 (1981). | PubMed |
      51.
         Deininger, P. L. Random subcloning of sonicated DNA: application to shotgun DNA
         sequence analysis. Anal. Biochem. 129, 216-223 (1983). | PubMed |
      52.
         Chissoe, S. L. et al. Sequence and analysis of the human ABL gene, the BCR gene,
         and regions involved in the Philadelphia chromosomal translocation. Genomics 27,
         67-82 (1995). | Article | PubMed |
      53.
         Rowen, L., Koop, B. F. & Hood, L. The complete 685-kilobase DNA sequence of the
         human beta T cell receptor locus. Science 272, 1755-1762 (1996). | PubMed |
      54.
         Koop, B. F. et al. Organization, structure, and function of 95 kb of DNA spanning the
         murine T-cell receptor C alpha/C delta region. Genomics 13, 1209-1230
         (1992). | PubMed |
      55.
         Wooster, R. et al. Identification of the breast cancer susceptibility gene BRCA2.
         Nature 378, 789-792 (1995). | PubMed |
      56.
         Fleischmann, R. D. et al. Whole-genome random sequencing and assembly of
         Haemophilus influenzae Rd. Science 269, 496-512 (1995). | PubMed |
      57.
         Lander, E. S. & Waterman, M. S. Genomic mapping by fingerprinting random clones:
         a mathematical analysis. Genomics 2, 231-239 (1988). | PubMed |
      58.
         Weber, J. L. & Myers, E. W. Human whole-genome shotgun sequencing. Genome
         Res. 7, 401-409 (1997). | PubMed |
      59.
         Green, P. Against a whole-genome shotgun. Genome Res. 7, 410-417
         (1997). | PubMed |
      60.
         Venter, J. C. et al. Shotgun sequencing of the human genome. Science 280,
         1540-1542 (1998). | Article | PubMed |
      61.
         Venter, J. C. et al. The sequence of the human genome. Science 291, 1304-1351
         (2001).
      62.
         Smith, L. M. et al. Fluorescence detection in automated DNA sequence analysis.
         Nature 321, 674-679 (1986). | PubMed |
      63.
         Ju, J. Y., Ruan, C. C., Fuller, C. W., Glazer, A. N. & Mathies, R. A. Fluorescence
         energy-transfer dye-labeled primers for DNA sequencing and analysis. Proc. Natl
         Acad. Sci. USA 92, 4347-4351 (1995). | PubMed |
      64.
         Lee, L. G. et al. New energy transfer dyes for DNA sequencing. Nucleic Acids Res.
         25, 2816-2822 (1997). | Article | PubMed |
      65.
         Rosenblum, B. B. et al. New dye-labeled terminators for improved DNA sequencing
         patterns. Nucleic Acids Res. 25, 4500-4504 (1997). | Article | PubMed |
      66.
         Metzker, M. L., Lu, J. & Gibbs, R. A. Electrophoretically uniform fluorescent dyes for
         automated DNA sequencing. Science 271, 1420-1422 (1996). | PubMed |
      67.
         Prober, J. M. et al. A system for rapid DNA sequencing with fluorescent
         chain-terminating dideoxynucleotides. Science 238, 336-341 (1987). | PubMed |
      68.
         Reeve, M. A. & Fuller, C. W. A novel thermostable polymerase for DNA sequencing.
         Nature 376, 796-797 (1995). | PubMed |
      69.
         Tabor, S. & Richardson, C. C. Selective inactivation of the exonuclease activity of
         bacteriophage T7 DNA polymerase by in vitro mutagenesis. J. Biol. Chem. 264,
         6447-6458 (1989). | PubMed |
      70.
         Tabor, S. & Richardson, C. C. DNA sequence analysis with a modified bacteriophage
         T7 DNA polymerase--effect of pyrophosphorolysis and metal ions. J. Biol. Chem. 265,
         8322-8328 (1990). | PubMed |
      71.
         Murray, V. Improved double-stranded DNA sequencing using the linear polymerase
         chain reaction. Nucleic Acids Res. 17, 8889 (1989). | PubMed |
      72.
         Guttman, A., Cohen, A. S., Heiger, D. N. & Karger, B. L. Analytical and
         micropreparative ultrahigh resolution of oligonucleotides by polyacrylamide-gel
         high-performance capillary electrophoresis. Anal. Chem. 62, 137-141 (1990).
      73.
         Luckey, J. A. et al. High-speed DNA sequencing by capillary electrophoresis. Nucleic
         Acids Res. 18, 4417-4421 (1990). | PubMed |
      74.
         Swerdlow, H., Wu, S., Harke, H. & Dovichi, N. J. Capillary gel-electrophoresis for DNA
         sequencing--laser-induced fluorescence detection with the sheath flow cuvette. J.
         Chromatogr. 516, 61-67 (1990). | PubMed |
      75.
         Meldrum, D. Automation for genomics, part one: preparation for sequencing. Genome
         Res. 10, 1081-1092 (2000). | PubMed |
      76.
         Meldrum, D. Automation for genomics, part two: sequencers, microarrays, and future
         trends. Genome Res. 10, 1288-1303 (2000). | PubMed |
      77.
         Ewing, B. & Green, P. Base-calling of automated sequencer traces using phred. II.
         Error probabilities. Genome Res. 8, 186-194 (1998). | PubMed |
      78.
         Ewing, B., Hillier, L., Wendl, M. C. & Green, P. Base-calling of automated sequencer
         traces using phred. I. Accuracy assessment. Genome Res. 8, 175-185
         (1998). | PubMed |
      79.
         Bentley, D. R. Genomic sequence information should be released immediately and
         freely in the public domain. Science 274, 533-534 (1996). | Article | PubMed |
      80.
         Guyer, M. Statement on the rapid release of genomic DNA sequence. Genome Res.
         8, 413 (1998). | PubMed |
      81.
         Dietrich, W. et al. A genetic map of the mouse suitable for typing intraspecific
         crosses. Genetics 131, 423-447 (1992). | PubMed |
      82.
         Kim, U. J. et al. Construction and characterization of a human bacterial artificial
         chromosome library. Genomics 34, 213-218 (1996). | Article | PubMed |
      83.
         Osoegawa, K. et al. Bacterial artificial chromosome libraries for mouse sequencing
         and functional analysis. Genome Res. 10, 116-128 (2000). | PubMed |
      84.
         Marra, M. A. et al. High throughput fingerprint analysis of large-insert clones. Genome
         Res. 7, 1072-1084 (1997). | PubMed |
      85.
         Marra, M. et al. A map for sequence analysis of the Arabidopsis thaliana genome.
         Nature Genet. 22, 265-270 (1999). | Article | PubMed |
      86.
         The International Human Genome Mapping Consortium. A physical map of the human
         genome. Nature 409, 934-941 (2001). | Article |
      87.
         Zhao, S. et al. Human BAC ends quality assessment and sequence analyses.
         Genomics 63, 321-332 (2000). | Article | PubMed |
      88.
         Mahairas, G. G. et al. Sequence-tagged connectors: A sequence approach to
         mapping and scanning the human genome. Proc. Natl Acad. Sci. USA 96, 9739-9744
         (1999). | Article | PubMed |
      89.
         Tilford, C. A. et al. A physical map of the human Y chromosome. Nature 409, 943-945
         (2001). | Article |
      90.
         Bentley, D. R. et al. The physical maps for sequencing human chromosomes 1, 6, 9,
         10, 13, 20 and X. Nature 409, 942-943 (2001). | Article |
      91.
         Montgomery, K. T. et al. A high-resolution map of human chromosome 12. Nature
         409, 945-946 (2001). | Article |
      92.
         Brüls, T. et al. A physical map of human chromosome 14. Nature 409, 947-948
         (2001). | Article |
      93.
         Hattori, M. et al. The DNA sequence of human chromosome 21. Nature 405, 311-319
         (2000). | Article | PubMed |
      94.
         Dunham, I. et al. The DNA sequence of human chromosome 22. Nature 402, 489-495
         (1999). | Article | PubMed |
      95.
         Cox, D. et al. Radiation hybrid map of the human genome. Science (in the press).
      96.
         Osoegawa, K. et al. An improved approach for construction of bacterial artificial
         chromosome libraries. Genomics 52, 1-8 (1998). | Article | PubMed |
      97.
         The International SNP Map Working Group. A map of human genome sequence
         variation containing 1.42 million single nucleotide polymorphisms. Nature 409, 928-933
         (2001). | Article |
      98.
         Collins, F. S., Brooks, L. D. & Chakravarti, A. A DNA polymorphism discovery
         resource for research on human genetic variation. Genome Res. 8, 1229-1231
         (1998). | PubMed |
      99.
         Stewart, E. A. et al. An STS-based radiation hybrid map of the human genome.
         Genome Res. 7, 422-433 (1997). | PubMed |
     100.
         Deloukas, P. et al. A physical map of 30,000 human genes. Science 282, 744-746
         (1998). | PubMed |
     101.
         Dib, C. et al. A comprehensive genetic map of the human genome based on 5,264
         microsatellites. Nature 380, 152-154 (1996). | PubMed |
     102.
         Broman, K. W., Murray, J. C., Sheffield, V. C., White, R. L. & Weber, J. L.
         Comprehensive human genetic maps: individual and sex-specific variation in
         recombination. Am. J. Hum. Genet. 63, 861-869 (1998). | Article | PubMed |
     103.
         The BAC Resource Consortium. Integration of cytogenetic landmarks into the draft
         sequence of the human genome. Nature 409, 953-958 (2001). | Article |
     104.
         Kent, W. J. & Haussler, D. GigAssembler: an algorithm for the initial assembly of the
         human working draft. Technical Report UCSC-CRL-00-17 (Univ. California at Santa
         Cruz, Santa Cruz, California, 2001).
     105.
         Morton, N. E. Parameters of the human genome. Proc. Natl Acad. Sci. USA 88,
         7474-7476 (1991). | PubMed |
     106.
         Podugolnikova, O. A. & Blumina, M. G. Heterochromatic regions on chromosomes 1,
         9, 16, and Y in children with some disturbances occurring during embryo development.
         Hum. Genet. 63, 183-188 (1983). | PubMed |
     107.
         Lundgren, R., Berger, R. & Kristoffersson, U. Constitutive heterochromatin C-band
         polymorphism in prostatic cancer. Cancer Genet. Cytogenet. 51, 57-62
         (1991). | PubMed |
     108.
         Lee, C., Wevrick, R., Fisher, R. B., Ferguson-Smith, M. A. & Lin, C. C. Human
         centromeric DNAs. Hum. Genet. 100, 291-304 (1997). | Article | PubMed |
     109.
         Riethman, H. C. et al. Integration of telomere sequences with the draft human genome
         sequence. Nature 409, 948-951 (2001). | Article |
     110.
         Pruitt, K. D. & Maglott, D. R. RefSeq and LocusLink: NCBI gene-centered resources.
         Nucleic Acids Res. 29, 137-140 (2001). | PubMed |
     111.
         Wolfsberg, T. G., McEntyre, J. & Schuler, G. D. Guide to the draft human genome.
         Nature 409, 824-826 (2001). | Article |
     112.
         Hurst, L. D. & Eyre-Walker, A. Evolutionary genomics: reading the bands. Bioessays
         22, 105-107 (2000). | Article | PubMed |
     113.
         Saccone, S. et al. Correlations between isochores and chromosomal bands in the
         human genome. Proc. Natl Acad. Sci. USA 90, 11929-11933 (1993). | PubMed |
     114.
         Zoubak, S., Clay, O. & Bernardi, G. The gene distribution of the human genome. Gene
         174, 95-102 (1996). | Article | PubMed |
     115.
         Gardiner, K. Base composition and gene distribution: critical patterns in mammalian
         genome organization. Trends Genet. 12, 519-524 (1996). | Article | PubMed |
     116.
         Duret, L., Mouchiroud, D. & Gautier, C. Statistical analysis of vertebrate sequences
         reveals that long genes are scarce in GC-rich isochores. J. Mol. Evol. 40, 308-317
         (1995). | PubMed |
     117.
         Saccone, S., De Sario, A., Della Valle, G. & Bernardi, G. The highest gene
         concentrations in the human genome are in telomeric bands of metaphase
         chromosomes. Proc. Natl Acad. Sci. USA 89, 4913-4917 (1992). | PubMed |
     118.
         Bernardi, G. et al. The mosaic genome of warm-blooded vertebrates. Science 228,
         953-958 (1985). | PubMed |
     119.
         Bernardi, G. Isochores and the evolutionary genomics of vertebrates. Gene 241, 3-17
         (2000). | Article | PubMed |
     120.
         Fickett, J. W., Torney, D. C. & Wolf, D. R. Base compositional structure of genomes.
         Genomics 13, 1056-1064 (1992). | PubMed |
     121.
         Churchill, G. A. Stochastic models for heterogeneous DNA sequences. Bull. Math.
         Biol. 51, 79-94 (1989). | PubMed |
     122.
         Bird, A., Taggart, M., Frommer, M., Miller, O. J. & Macleod, D. A fraction of the
         mouse genome that is derived from islands of nonmethylated, CpG-rich DNA. Cell 40,
         91-99 (1985). | PubMed |
     123.
         Bird, A. P. CpG islands as gene markers in the vertebrate nucleus. Trends Genet. 3,
         342-347 (1987).
     124.
         Chan, M. F., Liang, G. & Jones, P. A. Relationship between transcription and DNA
         methylation. Curr. Top. Microbiol. Immunol. 249, 75-86 (2000). | PubMed |
     125.
         Holliday, R. & Pugh, J. E. DNA modification mechanisms and gene activity during
         development. Science 187, 226-232 (1975). | PubMed |
     126.
         Larsen, F., Gundersen, G., Lopez, R. & Prydz, H. CpG islands as gene markers in
         the human genome. Genomics 13, 1095-1107 (1992). | PubMed |
     127.
         Tazi, J. & Bird, A. Alternative chromatin structure at CpG islands. Cell 60, 909-920
         (1990). | PubMed |
     128.
         Gardiner-Garden, M. & Frommer, M. CpG islands in vertebrate genomes. J. Mol. Biol.
         196, 261-282 (1987). | PubMed |
     129.
         Antequera, F. & Bird, A. Number of CpG islands and genes in human and mouse.
         Proc. Natl Acad. Sci. USA 90, 11995-11999 (1993). | PubMed |
     130.
         Ewing, B. & Green, P. Analysis of expressed sequence tags indicates 35,000 human
         genes. Nature Genet. 25, 232-234 (2000). | Article | PubMed |
     131.
         Yu, A. et al. Comparison of human genetic and sequence-based physical maps. Nature
         409, 951-953 (2001). | Article |
     132.
         Kaback, D. B., Guacci, V., Barber, D. & Mahon, J. W. Chromosome size-dependent
         control of meiotic recombination. Science 256, 228-232 (1992). | PubMed |
     133.
         Riles, L. et al. Physical maps of the 6 smallest chromosomes of Saccharomyces
         cerevisiae at a resolution of 2.6-kilobase pairs. Genetics 134, 81-150
         (1993). | PubMed |
     134.
         Lynn, A. et al. Patterns of meiotic recombination on the long arm of human
         chromosome 21. Genome Res. 10, 1319-1332 (2000). | PubMed |
     135.
         Laurie, D. A. & Hulten, M. A. Further studies on bivalent chiasma frequency in human
         males with normal karyotypes. Ann. Hum. Genet. 49, 189-201 (1985). | PubMed |
     136.
         Roeder, G. S. Meiotic chromosomes: it takes two to tango. Genes Dev. 11,
         2600-2621 (1997). | PubMed |
     137.
         Wu, T.-C. & Lichten, M. Meiosis-induced double-strand break sites determined by
         yeast chromatin structure. Science 263, 515-518 (1994). | PubMed |
     138.
         Gerton, J. L. et al. Global mapping of meiotic recombination hotspots and coldspots in
         the yeast Saccharomyces cerevisiae. Proc. Natl Acad. Sci. USA 97, 11383-11390
         (2000). | PubMed |
    139.
         Li, W.-H. Molecular Evolution (Sinauer, Sunderland, Massachusetts, 1997).
    140.
         Gregory, T. R. & Hebert, P. D. The modulation of DNA content: proximate causes and
         ultimate consequences. Genome Res. 9, 317-324 (1999). | PubMed |
    141.
         Hartl, D. L. Molecular melodies in high and low C. Nature Rev. Genet. 1, 145-149
         (2000). | Article |
    142.
         Smit, A. F. Interspersed repeats and other mementos of transposable elements in
         mammalian genomes. Curr. Opin. Genet. Dev. 9, 657-663 (1999). | Article | PubMed |
    143.
         Prak, E. T. L. & Kazazian, H. H. Jr Mobile elements and the human genome. Nature Rev.
         Genet. 1, 134-144 (2000). | Article |
    144.
         Okada, N., Hamada, M., Ogiwara, I. & Ohshima, K. SINEs and LINEs share common
         3' sequences: a review. Gene 205, 229-243 (1997). | Article | PubMed |
    145.
         Esnault, C., Maestre, J. & Heidmann, T. Human LINE retrotransposons generate
         processed pseudogenes. Nature Genet. 24, 363-367 (2000). | Article | PubMed |
    146.
         Wei, W. et al. Human L1 retrotransposition: cis-preference vs. trans-complementation.
         Mol. Cell. Biol. 21, 1429-1439 (2001). | PubMed |
    147.
         Malik, H. S., Henikoff, S. & Eickbush, T. H. Poised for contagion: evolutionary origins
         of the infectious abilities of invertebrate retroviruses. Genome Res. 10, 1307-1318
         (2000). | PubMed |
    148.
         Smit, A. F. The origin of interspersed repeats in the human genome. Curr. Opin.
         Genet. Dev. 6, 743-748 (1996). | PubMed |
    149.
         Clark, J. B. & Tidwell, M. G. A phylogenetic perspective on P transposable element
         evolution in Drosophila. Proc. Natl Acad. Sci. USA 94, 11428-11433
         (1997). | Article | PubMed |
    150.
         Haring, E., Hagemann, S. & Pinsker, W. Ancient and recent horizontal invasions of
         Drosophilids by P elements. J. Mol. Evol. 51, 577-586 (2000). | PubMed |
    151.
         Koga, A. et al. Evidence for recent invasion of the medaka fish genome by the Tol2
         transposable element. Genetics 155, 273-281 (2000). | PubMed |
    152.
         Robertson, H. M. & Lampe, D. J. Recent horizontal transfer of a mariner transposable
         element among and between Diptera and Neuroptera. Mol. Biol. Evol. 12, 850-862
         (1995). | PubMed |
    153.
         Simmons, G. M. Horizontal transfer of hobo transposable elements within the
         Drosophila melanogaster species complex: evidence from DNA sequencing. Mol. Biol.
         Evol. 9, 1050-1060 (1992). | PubMed |
    154.
         Malik, H. S., Burke, W. D. & Eickbush, T. H. The age and evolution of non-LTR
         retrotransposable elements. Mol. Biol. Evol. 16, 793-805 (1999). | PubMed |
    155.
         Kordis, D. & Gubensek, F. Bov-B long interspersed repeated DNA (LINE) sequences
         are present in Vipera ammodytes phospholipase A2 genes and in genomes of
         Viperidae snakes. Eur. J. Biochem. 246, 772-779 (1997). | PubMed |
    156.
         Jurka, J. Repbase update: a database and an electronic journal of repetitive elements.
         Trends Genet. 16, 418-420 (2000). | Article | PubMed |
    157.
         Sarich, V. M. & Wilson, A. C. Generation time and genomic evolution in primates.
         Science 179, 1144-1147 (1973). | PubMed |
    158.
         Smit, A. F., Toth, G., Riggs, A. D. & Jurka, J. Ancestral, mammalian-wide
         subfamilies of LINE-1 repetitive sequences. J. Mol. Biol. 246, 401-417
         (1995). | Article | PubMed |
    159.
         Lim, J. K. & Simmons, M. J. Gross chromosome rearrangements mediated by
         transposable elements in Drosophila melanogaster. Bioessays 16, 269-275
         (1994). | PubMed |
    160.
         Caceres, M., Ranz, J. M., Barbadilla, A., Long, M. & Ruiz, A. Generation of a
         widespread Drosophila inversion by a transposable element. Science 285, 415-418
         (1999). | Article | PubMed |
    161.
         Gray, Y. H. It takes two transposons to tango: transposable-element-mediated
         chromosomal rearrangements. Trends Genet. 16, 461-468 (2000). | PubMed |
    162.
         Zhang, J. & Peterson, T. Genome rearrangements by nonlinear transposition in maize.
         Genetics 153, 1403-1410 (1999). | PubMed |
    163.
         Smit, A. F. Identification of a new, abundant superfamily of mammalian
         LTR-transposons. Nucleic Acids Res. 21, 1863-1872 (1993). | PubMed |
    164.
         Cordonnier, A., Casella, J. F. & Heidmann, T. Isolation of novel human endogenous
         retrovirus-like elements with foamy virus-related pol sequence. J. Virol. 69, 5890-5897
         (1995). | PubMed |
    165.
         Medstrand, P. & Mager, D. L. Human-specific integrations of the HERV-K endogenous
         retrovirus family. J. Virol. 72, 9782-9787 (1998). | PubMed |
    166.
         Myers, E. W. et al. A whole-genome assembly of Drosophila. Science 287, 2196-2204
         (2000). | Article | PubMed |
    167.
         Petrov, D. A., Lozovskaya, E. R. & Hartl, D. L. High intrinsic rate of DNA loss in
         Drosophila. Nature 384, 346-349 (1996). | PubMed |
    168.
         Li, W. H., Ellsworth, D. L., Krushkal, J., Chang, B. H. & Hewett-Emmett, D. Rates of
         nucleotide substitution in primates and rodents and the generation-time effect
         hypothesis. Mol. Phylogenet. Evol. 5, 182-187 (1996). | Article | PubMed |
    169.
         Goodman, M. et al. Toward a phylogenetic classification of primates based on DNA
         evidence complemented by fossil evidence. Mol. Phylogenet. Evol. 9, 585-598
         (1998). | Article | PubMed |
    170.
         Kazazian, H. H. Jr & Moran, J. V. The impact of L1 retrotransposons on the human
         genome. Nature Genet. 19, 19-24 (1998). | PubMed |
    171.
         Malik, H. S. & Eickbush, T. H. NeSL-1, an ancient lineage of site-specific non-LTR
         retrotransposons from Caenorhabditis elegans. Genetics 154, 193-203
         (2000). | PubMed |
    172.
         Casavant, N. C. et al. The end of the LINE?: lack of recent L1 activity in a group of
         South American rodents. Genetics 154, 1809-1817 (2000). | PubMed |
    173.
         Meunier-Rotival, M., Soriano, P., Cuny, G., Strauss, F. & Bernardi, G. Sequence
         organization and genomic distribution of the major family of interspersed repeats of
         mouse DNA. Proc. Natl Acad. Sci. USA 79, 355-359 (1982). | PubMed |
    174.
         Soriano, P., Meunier-Rotival, M. & Bernardi, G. The distribution of interspersed
         repeats is nonuniform and conserved in the mouse and human genomes. Proc. Natl
         Acad. Sci. USA 80, 1816-1820 (1983). | PubMed |
    175.
         Goldman, M. A., Holmquist, G. P., Gray, M. C., Caston, L. A. & Nag, A. Replication
         timing of genes and middle repetitive sequences. Science 224, 686-692
         (1984). | PubMed |
    176.
         Manuelidis, L. & Ward, D. C. Chromosomal and nuclear distribution of the HindIII
         1.9-kb human DNA repeat segment. Chromosoma 91, 28-38 (1984). | PubMed |
    177.
         Feng, Q., Moran, J. V., Kazazian, H. H. Jr & Boeke, J. D. Human L1 retrotransposon
         encodes a conserved endonuclease required for retrotransposition. Cell 87, 905-916
         (1996). | PubMed |
    178.
         Jurka, J. Sequence patterns indicate an enzymatic involvement in integration of
         mammalian retroposons. Proc. Natl Acad. Sci. USA 94, 1872-1877
         (1997). | Article | PubMed |
    179.
         Arcot, S. S. et al. High-resolution cartography of recently integrated human
         chromosome 19-specific Alu fossils. J. Mol. Biol. 281, 843-856
         (1998). | Article | PubMed |
    180.
         Schmid, C. W. Does SINE evolution preclude Alu function? Nucleic Acids Res. 26,
         4541-4550 (1998). | Article | PubMed |
    181.
         Chu, W. M., Ballard, R., Carpick, B. W., Williams, B. R. & Schmid, C. W. Potential
         Alu function: regulation of the activity of double-stranded RNA-activated kinase PKR.
         Mol. Cell. Biol. 18, 58-68 (1998). | PubMed |
    182.
         Li, T., Spearow, J., Rubin, C. M. & Schmid, C. W. Physiological stresses increase
         mouse short interspersed element (SINE) RNA expression in vivo. Gene 239, 367-372
         (1999). | Article | PubMed |
    183.
         Liu, W. M., Chu, W. M., Choudary, P. V. & Schmid, C. W. Cell stress and
         translational inhibitors transiently increase the abundance of mammalian SINE
         transcripts. Nucleic Acids Res. 23, 1758-1765 (1995). | PubMed |
    184.
         Filipski, J. Correlation between molecular clock ticking, codon usage, fidelity of DNA
         repair, chromosome banding and chromatin compactness in germline cells. FEBS
         Lett. 217, 184-186 (1987). | PubMed |
    185.
         Sueoka, N. Directional mutation pressure and neutral molecular evolution. Proc. Natl
         Acad. Sci. USA 85, 2653-2657 (1988). | PubMed |
    186.
         Wolfe, K. H., Sharp, P. M. & Li, W. H. Mutation rates differ among regions of the
         mammalian genome. Nature 337, 283-285 (1989). | PubMed |
    187.
         Bains, W. Local sequence dependence of rate of base replacement in mammals.
         Mutat. Res. 267, 43-54 (1992). | PubMed |
    188.
         Mathews, C. K. & Ji, J. DNA precursor asymmetries, replication fidelity, and variable
         genome evolution. Bioessays 14, 295-301 (1992). | PubMed |
    189.
         Holmquist, G. P. & Filipski, J. Organization of mutations along the genome: a prime
         determinant of genome evolution. Trends Ecol. Evol. 9, 65-68 (1994).
    190.
         Eyre-Walker, A. Evidence of selection on silent site base composition in mammals:
         potential implications for the evolution of isochores and junk DNA. Genetics 152,
         675-683 (1999). | PubMed |
    191.
         The International SNP Map Working Group. An SNP map of the human genome
         generated by reduced representation shotgun sequencing. Nature 407, 513-516
         (2000). | Article |
    192.
         Bohossian, H. B., Skaletsky, H. & Page, D. C. Unexpectedly similar rates of
         nucleotide substitution found in male and female hominids. Nature 406, 622-625
         (2000). | Article | PubMed |
    193.
         Skowronski, J., Fanning, T. G. & Singer, M. F. Unit-length LINE-1 transcripts in
         human teratocarcinoma cells. Mol. Cell. Biol. 8, 1385-1397 (1988). | PubMed |
    194.
         Boissinot, S., Chevret, P. & Furano, A. V. L1 (LINE-1) retrotransposon evolution and
         amplification in recent human history. Mol. Biol. Evol. 17, 915-928 (2000). | PubMed |
    195.
         Moran, J. V. Human L1 retrotransposition: insights and peculiarities learned from a
         cultured cell retrotransposition assay. Genetica 107, 39-51 (1999). | PubMed |
    196.
         Kazazian, H. H. Jr et al. Haemophilia A resulting from de novo insertion of L1
         sequences represents a novel mechanism for mutation in man. Nature 332, 164-166
         (1988). | PubMed |
    197.
         Sheen, F.-m. et al. Reading between the LINEs: Human genomic variation introduced
         by LINE-1 retrotransposition. Genome Res. 10, 1496-1508 (2000). | PubMed |
    198.
         Dombroski, B. A., Mathias, S. L., Nanthakumar, E., Scott, A. F. & Kazazian, H. H. Jr
         Isolation of an active human transposable element. Science 254, 1805-1808
         (1991). | PubMed |
    199.
         Holmes, S. E., Dombroski, B. A., Krebs, C. M., Boehm, C. D. & Kazazian, H. H. Jr A
         new retrotransposable human L1 element from the LRE2 locus on chromosome 1q
         produces a chimaeric insertion. Nature Genet. 7, 143-148 (1994). | PubMed |
    200.
         Sassaman, D. M. et al. Many human L1 elements are capable of retrotransposition.
         Nature Genet. 16, 37-43 (1997). | PubMed |
    201.
         Dombroski, B. A., Scott, A. F. & Kazazian, H. H. Jr Two additional potential
         retrotransposons isolated from a human L1 subfamily that contains an active
         retrotransposable element. Proc. Natl Acad. Sci. USA 90, 6513-6517
         (1993). | PubMed |
    202.
         Kimberland, M. L. et al. Full-length human L1 insertions retain the capacity for high
         frequency retrotransposition in cultured cells. Hum. Mol. Genet. 8, 1557-1560
         (1999). | Article | PubMed |
    203.
         Moran, J. V. et al. High frequency retrotransposition in cultured mammalian cells. Cell
         87, 917-927 (1996). | PubMed |
    204.
         Moran, J. V., DeBerardinis, R. J. & Kazazian, H. H. Jr Exon shuffling by L1
         retrotransposition. Science 283, 1530-1534 (1999). | Article | PubMed |
    205.
         Pickeral, O. K., Makalowski, W., Boguski, M. S. & Boeke, J. D. Frequent human
         genomic DNA transduction driven by LINE-1 retrotransposition. Genome Res. 10,
         411-415 (2000). | PubMed |
    206.
         Miki, Y. et al. Disruption of the APC gene by a retrotransposal insertion of L1
         sequence in a colon cancer. Cancer Res. 52, 643-645 (1992). | PubMed |
    207.
         Branciforte, D. & Martin, S. L. Developmental and cell type specificity of LINE-1
         expression in mouse testis: implications for transposition. Mol. Cell. Biol. 14,
         2584-2592 (1994). | PubMed |
    208.
         Trelogan, S. A. & Martin, S. L. Tightly regulated, developmentally specific expression
         of the first open reading frame from LINE-1 during mouse embryogenesis. Proc. Natl
         Acad. Sci. USA 92, 1520-1524 (1995). | PubMed |
    209.
         Jurka, J. & Kapitonov, V. V. Sectorial mutagenesis by transposable elements.
         Genetica 107, 239-248 (1999). | PubMed |
    210.
         Fraser, M. J., Ciszczon, T., Elick, T. & Bauser, C. Precise excision of TTAA-specific
         lepidopteran transposons piggyBac (IFP2) and tagalong (TFP3) from the baculovirus
         genome in cell lines from two species of Lepidoptera. Insect Mol. Biol. 5, 141-151
         (1996). | PubMed |
    211.
         Brosius, J. Genomes were forged by massive bombardments with retroelements and
         retrosequences. Genetica 107, 209-238 (1999). | PubMed |
     212.
         Kruglyak, S., Durrett, R. T., Schug, M. D. & Aquadro, C. F. Equilibrium distribution of
         microsatellite repeat length resulting from a balance between slippage events and
         point mutations. Proc. Natl Acad. Sci. USA 95, 10774-10778
         (1998). | Article | PubMed |
     213.
         Toth, G., Gaspari, Z. & Jurka, J. Microsatellites in different eukaryotic genomes:
         survey and analysis. Genome Res. 10, 967-981 (2000). | PubMed |
     214.
         Ellegren, H. Heterogeneous mutation processes in human microsatellite DNA
         sequences. Nature Genet. 24, 400-402 (2000). | Article | PubMed |
     215.
         Ji, Y., Eichler, E. E., Schwartz, S. & Nicholls, R. D. Structure of chromosomal
         duplicons and their role in mediating human genomic disorders. Genome Res. 10,
         597-610 (2000). | PubMed |
     216.
         Eichler, E. E. Masquerading repeats: paralogous pitfalls of the human genome.
         Genome Res. 8, 758-762 (1998). | PubMed |
     217.
         Mazzarella, R. & Schlessinger, D. Pathological consequences of sequence
         duplications in the human genome. Genome Res. 8, 1007-1021 (1998). | PubMed |
     218.
         Eichler, E. E. et al. Interchromosomal duplications of the adrenoleukodystrophy locus:
         a phenomenon of pericentromeric plasticity. Hum. Mol. Genet. 6, 991-1002
         (1997). | Article | PubMed |
     219.
         Horvath, J. E., Schwartz, S. & Eichler, E. E. The mosaic structure of human
         pericentromeric DNA: a strategy for characterizing complex regions of the human
         genome. Genome Res. 10, 839-852 (2000). | PubMed |
     220.
         Brand-Arpon, V. et al. A genomic region encompassing a cluster of olfactory receptor
         genes and a myosin light chain kinase (MYLK) gene is duplicated on human
         chromosome regions 3q13-q21 and 3p13. Genomics 56, 98-110
         (1999). | Article | PubMed |
     221.
         Arnold, N., Wienberg, J., Ermert, K. & Zachau, H. G. Comparative mapping of DNA
         probes derived from the V kappa immunoglobulin gene regions on human and great
         ape chromosomes by fluorescence in situ hybridization. Genomics 26, 147-150
         (1995). | PubMed |
     222.
         Eichler, E. E. et al. Duplication of a gene-rich cluster between 16p11.1 and Xq28: a
         novel pericentromeric-directed mechanism for paralogous genome evolution. Hum.
         Mol. Genet. 5, 899-912 (1996). | Article | PubMed |
     223.
         Potier, M. et al. Two sequence-ready contigs spanning the two copies of a 200-kb
         duplication on human 21q: partial sequence and polymorphisms. Genomics 51,
         417-426 (1998). | Article | PubMed |
     224.
         Regnier, V. et al. Emergence and scattering of multiple neurofibromatosis
         (NF1)-related sequences during hominoid evolution suggest a process of
         pericentromeric interchromosomal transposition. Hum. Mol. Genet. 6, 9-16
         (1997). | Article | PubMed |
     225.
         Ritchie, R. J., Mattei, M. G. & Lalande, M. A large polymorphic repeat in the
         pericentromeric region of human chromosome 15q contains three partial gene
         duplications. Hum. Mol. Genet. 7, 1253-1260 (1998). | Article | PubMed |
     226.
         Trask, B. J. et al. Members of the olfactory receptor gene family are contained in large
         blocks of DNA duplicated polymorphically near the ends of human chromosomes.
         Hum. Mol. Genet. 7, 13-26 (1998). | Article | PubMed |
     227.
         Trask, B. J. et al. Large multi-chromosomal duplications encompass many members
         of the olfactory receptor gene family in the human genome. Hum. Mol. Genet. 7,
         2007-2020 (1998). | Article | PubMed |
     228.
         van Deutekom, J. C. et al. Identification of the first gene (FRG1) from the FSHD region
         on human chromosome 4q35. Hum. Mol. Genet. 5, 581-590
         (1996). | Article | PubMed |
     229.
         Zachau, H. G. The immunoglobulin kappa locus--or--what has been learned from
         looking closely at one-tenth of a percent of the human genome. Gene 135, 167-173
         (1993). | PubMed |
     230.
         Zimonjic, D. B., Kelley, M. J., Rubin, J. S., Aaronson, S. A. & Popescu, N. C.
         Fluorescence in situ hybridization analysis of keratinocyte growth factor gene
         amplification and dispersion in evolution of great apes and humans. Proc. Natl Acad.
         Sci. USA 94, 11461-11465 (1997). | Article | PubMed |
     231.
         van Geel, M. et al. The FSHD region on human chromosome 4q35 contains potential
         coding regions among pseudogenes and a high density of repeat elements. Genomics
         61, 55-65 (1999). | Article | PubMed |
     232.
         Horvath, J. E. et al. Molecular structure and evolution of an alpha satellite/non-alpha
         satellite junction at 16p11. Hum. Mol. Genet. 9, 113-123 (2000). | Article | PubMed |
     233.
         Guy, J. et al. Genomic sequence and transcriptional profile of the boundary between
         pericentromeric satellites and genes on human chromosome arm 10q. Hum. Mol.
         Genet. 9, 2029-2042 (2000). | Article | PubMed |
     234.
         Reiter, L. T., Murakami, T., Koeuth, T., Gibbs, R. A. & Lupski, J. R. The human
         COX10 gene is disrupted during homologous recombination between the 24 kb
         proximal and distal CMT1A-REPs. Hum. Mol. Genet. 6, 1595-1603
         (1997). | Article | PubMed |
     235.
         Amos-Landgraf, J. M. et al. Chromosome breakage in the Prader-Willi and Angelman
         syndromes involves recombination between large, transcribed repeats at proximal and
         distal breakpoints. Am. J. Hum. Genet. 65, 370-386 (1999). | Article | PubMed |
     236.
         Christian, S. L., Fantes, J. A., Mewborn, S. K., Huang, B. & Ledbetter, D. H. Large
         genomic duplicons map to sites of instability in the Prader-Willi/Angelman syndrome
         chromosome region (15q11-q13). Hum. Mol. Genet. 8, 1025-1037
         (1999). | Article | PubMed |
     237.
         Edelmann, L., Pandita, R. K. & Morrow, B. E. Low-copy repeats mediate the common
         3-Mb deletion in patients with velo-cardio-facial syndrome. Am. J. Hum. Genet. 64,
         1076-1086 (1999). | Article | PubMed |
     238.
         Shaikh, T. H. et al. Chromosome 22-specific low copy repeats and the 22q11.2
         deletion syndrome: genomic organization and deletion endpoint analysis. Hum. Mol.
         Genet. 9, 489-501 (2000). | Article | PubMed |
     239.
         Francke, U. Williams-Beuren syndrome: genes and mechanisms. Hum. Mol. Genet.
         8, 1947-1954 (1999). | Article | PubMed |
    240.
         Peoples, R. et al. A physical map, including a BAC/PAC clone contig, of the
         Williams-Beuren syndrome-deletion region at 7q11.23. Am. J. Hum. Genet. 66, 47-68
         (2000). | Article | PubMed |
    241.
         Eichler, E. E., Archidiacono, N. & Rocchi, M. CAGGG repeats and the
         pericentromeric duplication of the hominoid genome. Genome Res. 9, 1048-1058
         (1999). | Article | PubMed |
    242.
         O'Keefe, C. & Eichler, E. in Comparative Genomics: Empirical and Analytical
         Approaches to Gene Order Dynamics, Map Alignment and the Evolution of Gene
         Families (eds Sankoff, D. & Nadeau, J.) 29-46 (Kluwer Academic, Dordrecht, 2000).
    243.
         Lander, E. S. The new genomics: Global views of biology. Science 274, 536-539
         (1996). | Article | PubMed |
    244.
         Eddy, S. R. Noncoding RNA genes. Curr. Opin. Genet. Dev. 9, 695-699
         (1999). | Article | PubMed |
    245.
         Ban, N., Nissen, P., Hansen, J., Moore, P. B. & Steitz, T. A. The complete atomic
         structure of the large ribosomal subunit at 2.4 angstrom resolution. Science 289,
         905-920 (2000). | Article | PubMed |
    246.
         Nissen, P., Hansen, J., Ban, N., Moore, P. B. & Steitz, T. A. The structural basis of
         ribosome activity in peptide bond synthesis. Science 289, 920-930
         (2000). | Article | PubMed |
    247.
         Weinstein, L. B. & Steitz, J. A. Guided tours: from precursor snoRNA to functional
         snoRNP. Curr. Opin. Cell Biol. 11, 378-384 (1999). | Article | PubMed |
    248.
         Bachellerie, J.-P. & Cavaille, J. in Modification and Editing of RNA (eds Grosjean, H. &
         Benne, R.) 255-272 (ASM, Washington DC, 1998).
    249.
         Burge, C. & Sharp, P. A. Classification of introns: U2-type or U12-type. Cell 91,
         875-879 (1997). | PubMed |
    250.
         Brown, C. J. et al. The Human Xist gene--analysis of a 17 kb inactive X-specific RNA
         that contains conserved repeats and is highly localized within the nucleus. Cell 71,
         527-542 (1992). | PubMed |
    251.
         Kickhoefer, V. A., Vasu, S. K. & Rome, L. H. Vaults are the answer, what is the
         question? Trends Cell Biol. 6, 174-178 (1996). | Article |
    252.
         Hatlen, L. & Attardi, G. Proportion of the HeLa cell genome complementary to the
         transfer RNA and 5S RNA. J. Mol. Biol. 56, 535-553 (1971). | PubMed |
    253.
         Sprinzl, M., Horn, C., Brown, M., Ioudovitch, A. & Steinberg, S. Compilation of tRNA
         sequences and sequences of tRNA genes. Nucleic Acids Res. 26, 148-153
         (1998). | Article | PubMed |
    254.
         Long, E. O. & Dawid, I. B. Repeated genes in eukaryotes. Annu. Rev. Biochem. 49,
         727-764 (1980). | PubMed |
    255.
         Crick, F. H. Codon-anticodon pairing: the wobble hypothesis. J. Mol. Biol. 19, 548-555
         (1966). | PubMed |
    256.
         Guthrie, C. & Abelson, J. in The Molecular Biology of the Yeast Saccharomyces:
         Metabolism and Gene Expression (eds Strathern, J. & Broach, J.) 487-528 (Cold
         Spring Harbor Laboratory Press, Cold Spring Harbor, New York, 1982).
    257.
         Soll, D. & RajBhandary, U. (eds) tRNA: Structure, Biosynthesis, and Function (ASM,
         Washington DC, 1995).
    258.
         Ikemura, T. Codon usage and tRNA content in unicellular and multicellular organisms.
         Mol. Biol. Evol. 2, 13-34 (1985). | PubMed |
    259.
         Bulmer, M. Coevolution of codon usage and transfer-RNA abundance. Nature 325,
         728-730 (1987). | PubMed |
    260.
         Duret, L. tRNA gene number and codon usage in the C. elegans genome are
         co-adapted for optimal translation of highly expressed genes. Trends Genet. 16,
         287-289 (2000). | Article | PubMed |
    261.
         Sharp, P. M. & Matassi, G. Codon usage and genome evolution. Curr. Opin. Genet.
         Dev. 4, 851-860 (1994). | PubMed |
    262.
         Buckland, R. A. A primate transfer-RNA gene cluster and the evolution of human
         chromosome 1. Cytogenet. Cell Genet. 61, 1-4 (1992). | PubMed |
    263.
         Gonos, E. S. & Goddard, J. P. Human tRNA-Glu genes: their copy number and
         organization. FEBS Lett. 276, 138-142 (1990). | PubMed |
    264.
         Sylvester, J. E. et al. The human ribosomal RNA genes: structure and organization of
         the complete repeating unit. Hum. Genet. 73, 193-198 (1986). | PubMed |
    265.
         Sorensen, P. D. & Frederiksen, S. Characterization of human 5S ribosomal RNA
         genes. Nucleic Acids Res. 19, 4147-4151 (1991). | PubMed |
    266.
         Timofeeva, M. et al. [Organization of a 5S ribosomal RNA gene cluster in the human
         genome]. Mol. Biol. (Mosk.) 27, 861-868 (1993).
    267.
         Little, R. D. & Braaten, D. C. Genomic organization of human 5S rDNA and sequence
         of one tandem repeat. Genomics 4, 376-383 (1989). | PubMed |
    268.
         Maden, B. E. The numerous modified nucleotides in eukaryotic ribosomal RNA.
         Prog. Nucleic Acid Res. Mol. Biol. 39, 241-303 (1990). | PubMed |
    269.
         Tycowski, K. T., You, Z. H., Graham, P. J. & Steitz, J. A. Modification of U6
         spliceosomal RNA is guided by other small RNAs. Mol. Cell 2, 629-638
         (1998). | PubMed |
    270.
         Pavelitz, T., Liao, D. Q. & Weiner, A. M. Concerted evolution of the tandem array
         encoding primate U2 snRNA (the RNU2 locus) is accompanied by dramatic
         remodeling of the junctions with flanking chromosomal sequences. EMBO J. 18,
         3783-3792 (1999). | Article | PubMed |
    271.
         Lindgren, V., Ares, M. Jr, Weiner, A. M. & Francke, U. Human genes for U2 small
         nuclear RNA map to a major adenovirus 12 modification site on chromosome 17.
         Nature 314, 115-116 (1985). | PubMed |
    272.
         Van Arsdell, S. W. & Weiner, A. M. Human genes for U2 small nuclear RNA are
         tandemly repeated. Mol. Cell. Biol. 4, 492-499 (1984). | PubMed |
    273.
         Gao, L. I., Frey, M. R. & Matera, A. G. Human genes encoding U3 snRNA associate
         with coiled bodies in interphase cells and are clustered on chromosome 17p11.2 in a
         complex inverted repeat structure. Nucleic Acids Res. 25, 4740-4747
         (1997). | Article | PubMed |
    274.
         Hawkins, J. D. A survey on intron and exon lengths. Nucleic Acids Res. 16,
         9893-9908 (1988). | PubMed |
    275.
         Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA.
         J. Mol. Biol. 268, 78-94 (1997). | Article | PubMed |
    276.
         Labeit, S. & Kolmerer, B. Titins: giant proteins in charge of muscle ultrastructure and
         elasticity. Science 270, 293-296 (1995). | PubMed |
    277.
         Sterner, D. A., Carlo, T. & Berget, S. M. Architectural limits on split genes. Proc. Natl
         Acad. Sci. USA 93, 15081-15085 (1996). | Article | PubMed |
    278.
         Sun, Q., Mayeda, A., Hampson, R. K., Krainer, A. R. & Rottman, F. M. General
         splicing factor SF2/ASF promotes alternative splicing by binding to an exonic splicing
         enhancer. Genes Dev. 7, 2598-2608 (1993). | PubMed |
    279.
         Tanaka, K., Watakabe, A. & Shimura, Y. Polypurine sequences within a downstream
         exon function as a splicing enhancer. Mol. Cell. Biol. 14, 1347-1354
         (1994). | PubMed |
    280.
         Carlo, T., Sterner, D. A. & Berget, S. M. An intron splicing enhancer containing a
         G-rich repeat facilitates inclusion of a vertebrate micro-exon. RNA 2, 342-353
         (1996). | PubMed |
    281.
         Burset, M., Seledtsov, I. A. & Solovyev, V. V. Analysis of canonical and non-canonical
         splice sites in mammalian genomes. Nucleic Acids Res. 28, 4364-4375
         (2000). | PubMed |
    282.
         Burge, C. B., Padgett, R. A. & Sharp, P. A. Evolutionary fates and origins of U12-type
         introns. Mol. Cell 2, 773-785 (1998).
    283.
         Mironov, A. A., Fickett, J. W. & Gelfand, M. S. Frequent alternative splicing of human
         genes. Genome Res. 9, 1288-1293 (1999). | Article | PubMed |
    284.
         Hanke, J. et al. Alternative splicing of human genes: more the rule than the exception?
         Trends Genet. 15, 389-390 (1999). | Article | PubMed |
    285.
         Brett, D. et al. EST comparison indicates 38% of human mRNAs contain possible
         alternative splice forms. FEBS Lett. 474, 83-86 (2000). | Article | PubMed |
    286.
         Dunham, I. The gene guessing game. Yeast 17, 218-224 (2000). | Article | PubMed |
    287.
         Lewin, B. Gene Expression (Wiley, New York, 1980).
    288.
         Lewin, B. Genes IV 466-481 (Oxford Univ. Press, Oxford, 1990).
    289.
         Smaglik, P. Researchers take a gamble on the human genome. Nature 405, 264
         (2000). | Article | PubMed |
    290.
         Fields, C., Adams, M. D., White, O. & Venter, J. C. How many genes in the human
         genome? Nature Genet. 7, 345-346 (1994). | PubMed |
     291.
         Liang, F. et al. Gene index analysis of the human genome estimates approximately
         120,000 genes. Nature Genet. 25, 239-240 (2000). | Article | PubMed |
     292.
         Roest Crollius, H. et al. Estimate of human gene number provided by genome-wide
         analysis using Tetraodon nigroviridis DNA sequence. Nature Genet. 25, 235-238
         (2000). | Article | PubMed |
     293.
         The C. elegans Sequencing Consortium. Genome sequence of the nematode C.
         elegans: A platform for investigating biology. Science 282, 2012-2018 (1998).
     294.
         Rubin, G. M. et al. Comparative genomics of the eukaryotes. Science 287, 2204-2215
         (2000). | Article | PubMed |
     295.
         Green, P. et al. Ancient conserved regions in new gene sequences and the protein
         databases. Science 259, 1711-1716 (1993). | PubMed |
     296.
         Fraser, A. G. et al. Functional genomic analysis of C. elegans chromosome I by
         systematic RNA interference. Nature 408, 325-330 (2000). | Article | PubMed |
     297.
         Mott, R. EST_GENOME: a program to align spliced DNA sequences to unspliced
         genomic DNA. Comput. Appl. Biosci. 13, 477-478 (1997). | PubMed |
     298.
         Florea, L., Hartzell, G., Zhang, Z., Rubin, G. M. & Miller, W. A computer program for
         aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 8, 967-974
         (1998). | PubMed |
     299.
         Bailey, L. C. Jr, Searls, D. B. & Overton, G. C. Analysis of EST-driven gene annotation
         in human genomic sequence. Genome Res. 8, 362-376 (1998). | PubMed |
     300.
         Birney, E., Thompson, J. D. & Gibson, T. J. PairWise and SearchWise: finding the
         optimal alignment in a simultaneous comparison of a protein profile against all DNA
         translation frames. Nucleic Acids Res. 24, 2730-2739 (1996). | Article | PubMed |
     301.
         Gelfand, M. S., Mironov, A. A. & Pevzner, P. A. Gene recognition via spliced
         sequence alignment. Proc. Natl Acad. Sci. USA 93, 9061-9066
         (1996). | Article | PubMed |
     302.
         Kulp, D., Haussler, D., Reese, M. G. & Eeckman, F. H. A generalized hidden Markov
         model for the recognition of human genes in DNA. ISMB 4, 134-142 (1996). | PubMed |
     303.
         Reese, M. G., Kulp, D., Tammana, H. & Haussler, D. Genie--gene finding in
         Drosophila melanogaster. Genome Res. 10, 529-538 (2000). | PubMed |
     304.
         Solovyev, V. & Salamov, A. The Gene-Finder computer tools for analysis of human
         and model organisms genome sequences. ISMB 5, 294-302 (1997). | PubMed |
     305.
         Guigo, R., Agarwal, P., Abril, J. F., Burset, M. & Fickett, J. W. An assessment of
         gene prediction accuracy in large DNA sequences. Genome Res. 10, 1631-1642
         (2000). | PubMed |
     306.
         Hubbard, T. & Birney, E. Open annotation offers a democratic solution to genome
         sequencing. Nature 403, 825 (2000). | Article | PubMed |
     307.
         Bateman, A. et al. The Pfam protein families database. Nucleic Acids Res. 28,
         263-266 (2000). | Article | PubMed |
     308.
         Birney, E. & Durbin, R. Using GeneWise in the Drosophila annotation experiment.
         Genome Res. 10, 547-548 (2000). | PubMed |
     309.
         The RIKEN Genome Exploration Research Group Phase II Team and the FANTOM
         Consortium. Functional annotation of a full-length mouse cDNA collection. Nature 409,
         685-690 (2001). | Article |
     310.
         Basrai, M. A., Hieter, P. & Boeke, J. D. Small open reading frames: beautiful needles
         in the haystack. Genome Res. 7, 768-771 (1997). | PubMed |
     311.
         Janin, J. & Chothia, C. Domains in proteins: definitions, location, and structural
         principles. Methods Enzymol. 115, 420-430 (1985). | PubMed |
     312.
         Ponting, C. P., Schultz, J., Copley, R. R., Andrade, M. A. & Bork, P. Evolution of
         domain families. Adv. Protein Chem. 54, 185-244 (2000). | PubMed |
     313.
         Doolittle, R. F. The multiplicity of domains in proteins. Annu. Rev. Biochem. 64,
         287-314 (1995). | PubMed |
     314.
         Bateman, A. & Birney, E. Searching databases to find protein domain organization.
         Adv. Protein Chem. 54, 137-157 (2000). | PubMed |
     315.
         Futreal, P. A. et al. Cancer and genomics. Nature 409, 850-852 (2001). | Article |
     316.
         Nestler, E. J. & Landsman, D. Learning about addiction from the human draft genome.
         Nature 409, 834-835 (2001). | Article |
     317.
         Tupler, R., Perini, G. & Green, M. R. Expressing the human genome. Nature 409,
         832-833 (2001). | Article |
     318.
         Fahrer, A. M., Bazan, J. F., Papathanasiou, P., Nelms, K. A. & Goodnow, C. C. A
         genomic view of immunology. Nature 409, 836-838 (2001). | Article |
     319.
         Li, W.-H., Gu, Z., Wang, H. & Nekrutenko, A. Evolutionary analyses of the human
         genome. Nature 409, 847-849 (2001). | Article |
     320.
         Bock, J. B., Matern, H. T., Peden, A. A. & Scheller, R. H. A genomic perspective on
         membrane compartment organization. Nature 409, 839-841 (2001). | Article |
     321.
         Pollard, T. D. Genomics, the cytoskeleton and motility. Nature 409, 842-843
         (2001). | Article |
     322.
         Murray, A. W. & Marks, D. Can sequencing shed light on cell cycling? Nature 409,
         844-846 (2001). | Article |
     323.
         Clayton, J. D., Kyriacou, C. P. & Reppert, S. M. Keeping time with the human
         genome. Nature 409, 829-831 (2001). | Article |
     324.
         Chervitz, S. A. et al. Comparison of the complete protein sets of worm and yeast:
         orthology and divergence. Science 282, 2022-2028 (1998). | PubMed |
     325.
         Aravind, L. & Subramanian, G. Origin of multicellular eukaryotes--insights from
         proteome comparisons. Curr. Opin. Genet. Dev. 9, 688-694
         (1999). | Article | PubMed |
     326.
         Attwood, T. K. et al. PRINTS-S: the database formerly known as PRINTS. Nucleic
         Acids Res. 28, 225-227 (2000). | Article | PubMed |
     327.
         Hofmann, K., Bucher, P., Falquet, L. & Bairoch, A. The PROSITE database, its status
         in 1999. Nucleic Acids Res. 27, 215-219 (1999). | Article | PubMed |
     328.
         Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein
         database search programs. Nucleic Acids Res. 25, 3389-3402
         (1997). | Article | PubMed |
     329.
         Wolf, Y. I., Kondrashov, F. A. & Koonin, E. V. No footprints of primordial introns in a
         eukaryotic genome. Trends Genet. 16, 333-334 (2000). | PubMed |
     330.
         Brunner, H. G., Nelen, M., Breakefield, X. O., Ropers, H. H. & van Oost, B. A.
         Abnormal behavior associated with a point mutation in the structural gene for
         monoamine oxidase A. Science 262, 578-580 (1993). | PubMed |
     331.
         Cases, O. et al. Aggressive behavior and altered amounts of brain serotonin and
         norepinephrine in mice lacking MAOA. Science 268, 1763-1766 (1995). | PubMed |
     332.
         Brunner, H. G. et al. X-linked borderline mental retardation with prominent behavioral
         disturbance: phenotype, genetic localization, and evidence for disturbed monoamine
         metabolism. Am. J. Hum. Genet. 52, 1032-1039 (1993). | PubMed |
     333.
         Deckert, J. et al. Excess of high activity monoamine oxidase A gene promoter alleles
         in female patients with panic disorder. Hum. Mol. Genet. 8, 621-624
         (1999). | Article | PubMed |
     334.
         Smith, T. F. & Waterman, M. S. Identification of common molecular subsequences. J.
         Mol. Biol. 147, 195-197 (1981). | PubMed |
     335.
         Tatusov, R. L., Koonin, E. V. & Lipman, D. J. A genomic perspective on protein
         families. Science 278, 631-637 (1997). | Article | PubMed |
     336.
         Ponting, C. P., Aravind, L., Schultz, J., Bork, P. & Koonin, E. V. Eukaryotic signalling
         domain homologues in archaea and bacteria. Ancient ancestry and horizontal gene
         transfer. J. Mol. Biol. 289, 729-745 (1999). | Article | PubMed |
     337.
         Zhang, J., Dyer, K. D. & Rosenberg, H. F. Evolution of the rodent
         eosinophil-associated RNase gene family by rapid gene sorting and positive selection.
         Proc. Natl Acad. Sci. USA 97, 4701-4706 (2000). | Article | PubMed |
     338.
         Shashoua, V. E. Ependymin, a brain extracellular glycoprotein, and CNS plasticity.
         Ann. NY Acad. Sci. 627, 94-114 (1991). | PubMed |
     339.
         Schultz, J., Copley, R. R., Doerks, T., Ponting, C. P. & Bork, P. SMART: a
         web-based tool for the study of genetically mobile domains. Nucleic Acids Res. 28,
         231-234 (2000). | Article | PubMed |
     340.
         Koonin, E. V., Aravind, L. & Kondrashov, A. S. The impact of comparative genomics
         on our understanding of evolution. Cell 101, 573-576 (2000). | PubMed |
     341.
         Bateman, A., Eddy, S. R. & Chothia, C. Members of the immunoglobulin superfamily
         in bacteria. Protein Sci. 5, 1939-1941 (1996). | PubMed |
     342.
         Sutherland, D., Samakovlis, C. & Krasnow, M. A. Branchless encodes a Drosophila
         FGF homolog that controls tracheal cell migration and the pattern of branching. Cell
         87, 1091-1101 (1996). | PubMed |
     343.
         Warburton, D. et al. The molecular basis of lung morphogenesis. Mech. Dev. 92,
         55-81 (2000). | Article | PubMed |
     344.
         Fuchs, T., Glusman, G., Horn-Saban, S., Lancet, D. & Pilpel, Y. The human olfactory
         subgenome: from sequence to structure to evolution. Hum. Genet. 108, 1-13 (2001).
     345.
         Glusman, G. et al. The olfactory receptor gene family: data mining, classification and
         nomenclature. Mamm. Genome 11, 1016-1023 (2000). | PubMed |
     346.
         Rouquier, S. et al. Distribution of olfactory receptor genes in the human genome.
         Nature Genet. 18, 243-250 (1998). | PubMed |
     347.
         Sharon, D. et al. Primate evolution of an olfactory receptor cluster: Diversification by
         gene conversion and recent emergence of a pseudogene. Genomics 61, 24-36
         (1999). | Article | PubMed |
     348.
         Gilad, Y. et al. Dichotomy of single-nucleotide polymorphism haplotypes in olfactory
         receptor genes and pseudogenes. Nature Genet. 26, 221-224 (2000). | PubMed |
     349.
         Gerhart, J. & Kirschner, M. Cells, Embryos, and Evolution (Blackwell Science,
         Malden, Massachusetts, 1997).
     350.
         Barbazuk, W. B. et al. The syntenic relationship of the zebrafish and human
         genomes. Genome Res. 10, 1351-1358 (2000). | PubMed |
     351.
         McLysaght, A., Enright, A. J., Skrabanek, L. & Wolfe, K. H. Estimation of synteny
         conservation and genome compaction between pufferfish (Fugu) and human. Yeast 17,
         22-36 (2000). | Article | PubMed |
     352.
         Trachtulec, Z. et al. Linkage of TATA-binding protein and proteasome subunit C5
         genes in mice and humans reveals synteny conserved between mammals and
         invertebrates. Genomics 44, 1-7 (1997). | Article | PubMed |
     353.
         Nadeau, J. H. Maps of linkage and synteny homologies between mouse and man.
         Trends Genet. 5, 82-86 (1989). | PubMed |
     354.
         Nadeau, J. H. & Taylor, B. A. Lengths of chromosomal segments conserved since
         divergence of man and mouse. Proc. Natl Acad. Sci. USA 81, 814-818
         (1984). | PubMed |
     355.
         Copeland, N. G. et al. A genetic linkage map of the mouse: current applications and
         future prospects. Science 262, 57-66 (1993). | PubMed |
     356.
         DeBry, R. W. & Seldin, M. F. Human/mouse homology relationships. Genomics 33,
         337-351 (1996). | Article | PubMed |
     357.
         Nadeau, J. H. & Sankoff, D. The lengths of undiscovered conserved segments in
         comparative maps. Mamm. Genome 9, 491-495 (1998). | PubMed |
     358.
         Thomas, J. W. et al. Comparative genome mapping in the sequence-based era: early
         experience with human chromosome 7. Genome Res. 10, 624-633 (2000). | PubMed |
     359.
         Pletcher, M. T. et al. Chromosome evolution: The junction of mammalian
         chromosomes in the formation of mouse chromosome 10. Genome Res. 10,
         1463-1467 (2000). | PubMed |
     360.
         Novacek, M. J. Mammalian phylogeny: shaking the tree. Nature 356, 121-125
         (1992). | PubMed |
     361.
         O'Brien, S. J. et al. Genome maps 10. Comparative genomics. Mammalian radiations.
         Wall chart. Science 286, 463-478 (1999). | Article | PubMed |
     362.
         Romer, A. S. Vertebrate Paleontology (Univ. Chicago Press, Chicago and New York,
         1966).
     363.
         Paterson, A. H. et al. Toward a unified genetic map of higher plants, transcending the
         monocot-dicot divergence. Nature Genet. 14, 380-382 (1996). | PubMed |
     364.
         Jenczewski, E., Prosperi, J. M. & Ronfort, J. Differentiation between natural and
         cultivated populations of Medicago sativa (Leguminosae) from Spain: analysis with
         random amplified polymorphic DNA (RAPD) markers and comparison to allozymes.
         Mol. Ecol. 8, 1317-1330 (1999). | Article | PubMed |
     365.
         Ohno, S. Evolution by Gene Duplication (George Allen and Unwin, London, 1970).
     366.
         Wolfe, K. H. & Shields, D. C. Molecular evidence for an ancient duplication of the
         entire yeast genome. Nature 387, 708-713 (1997). | Article | PubMed |
     367.
         Blanc, G., Barakat, A., Guyot, R., Cooke, R. & Delseny, M. Extensive duplication and
         reshuffling in the arabidopsis genome. Plant Cell 12, 1093-1102 (2000). | PubMed |
     368.
         Paterson, A. H. et al. Comparative genomics of plant chromosomes. Plant Cell 12,
         1523-1540 (2000). | PubMed |
     369.
         Vision, T., Brown, D. & Tanksley, S. The origins of genome duplications in
         Arabidopsis. Science 290, 2114-2117 (2000). | Article | PubMed |
     370.
         Sidow, A. & Bowman, B. H. Molecular phylogeny. Curr. Opin. Genet. Dev. 1, 451-456
         (1991). | PubMed |
     371.
         Sidow, A. & Thomas, W. K. A molecular evolutionary framework for eukaryotic model
         organisms. Curr. Biol. 4, 596-603 (1994).
     372.
         Sidow, A. Gen(om)e duplications in the evolution of early vertebrates. Curr. Opin.
         Genet. Dev. 6, 715-722 (1996).
     373.
         Spring, J. Vertebrate evolution by interspecific hybridisation--are we polyploid? FEBS
         Lett. 400, 2-8 (1997).
     374.
         Skrabanek, L. & Wolfe, K. H. Eukaryote genome duplication--where's the evidence?
         Curr. Opin. Genet. Dev. 8, 694-700 (1998).
     375.
         Hughes, A. L. Phylogenies of developmentally important proteins do not support the
         hypothesis of two rounds of genome duplication early in vertebrate history. J. Mol.
         Evol. 48, 565-576 (1999).
     376.
         Lander, E. S. & Schork, N. J. Genetic dissection of complex traits. Science 265,
         2037-2048 (1994).
     377.
         Horikawa, Y. et al. Genetic variability in the gene encoding calpain-10 is associated
         with type 2 diabetes mellitus. Nature Genet. 26, 163-175 (2000).
     378.
         Hastbacka, J. et al. The diastrophic dysplasia gene encodes a novel sulfate
         transporter: positional cloning by fine-structure linkage disequilibrium mapping. Cell
         78, 1073-1087 (1994).
     379.
         Tishkoff, S. A. et al. Global patterns of linkage disequilibrium at the CD4 locus and
         modern human origins. Science 271, 1380-1387 (1996).
     380.
         Kidd, J. R. et al. Haplotypes and linkage disequilibrium at the phenylalanine
         hydroxylase locus PAH, in a global representation of populations. Am. J. Hum. Genet.
         66, 1882-1899 (2000).
     381.
         Mateu, E. et al. Worldwide genetic analysis of the CFTR region. Am. J. Hum. Genet.
         68, 103-117 (2001).
     382.
         Abecasis, G. R. et al. Extent and distribution of linkage disequilibrium in three
         genomic regions. Am. J. Hum. Genet. 68, 191-197 (2001).
     383.
         Taillon-Miller, P. et al. Juxtaposed regions of extensive and minimal linkage
         disequilibrium in Xq25 and Xq28. Nature Genet. 25, 324-328
         (2000).
     384.
         Martin, E. R. et al. SNPing away at complex diseases: analysis of single-nucleotide
         polymorphisms around APOE in Alzheimer disease. Am. J. Hum. Genet. 67, 383-394
         (2000).
     385.
         Collins, A., Lonjou, C. & Morton, N. E. Genetic epidemiology of single-nucleotide
         polymorphisms. Proc. Natl Acad. Sci. USA 96, 15173-15177
         (1999).
     386.
         Dunning, A. M. et al. The extent of linkage disequilibrium in four populations with
         distinct demographic histories. Am. J. Hum. Genet. 67, 1544-1554
         (2000).
     387.
         Rieder, M. J., Taylor, S. L., Clark, A. G. & Nickerson, D. A. Sequence variation in the
         human angiotensin converting enzyme. Nature Genet. 22, 59-62
         (1999).
     388.
         Collins, F. S. Positional cloning moves from perditional to traditional. Nature Genet. 9,
         347-350 (1995).
     389.
         Nagamine, K. et al. Positional cloning of the APECED gene. Nature Genet. 17,
         393-398 (1997).
     390.
         Reuber, B. E. et al. Mutations in PEX1 are the most common cause of peroxisome
         biogenesis disorders. Nature Genet. 17, 445-448 (1997).
     391.
         Portsteffen, H. et al. Human PEX1 is mutated in complementation group 1 of the
         peroxisome biogenesis disorders. Nature Genet. 17, 449-452 (1997).
     392.
         Everett, L. A. et al. Pendred syndrome is caused by mutations in a putative sulphate
         transporter gene (PDS). Nature Genet. 17, 411-422 (1997).
     393.
         Coffey, A. J. et al. Host response to EBV infection in X-linked lymphoproliferative
         disease results from mutations in an SH2-domain encoding gene. Nature Genet. 20,
         129-135 (1998).
     394.
         Van Laer, L. et al. Nonsyndromic hearing impairment is associated with a mutation in
         DFNA5. Nature Genet. 20, 194-197 (1998).
     395.
         Sakuntabhai, A. et al. Mutations in ATP2A2, encoding a Ca2+ pump, cause Darier
         disease. Nature Genet. 21, 271-277 (1999).
     396.
         Gedeon, A. K. et al. Identification of the gene (SEDL) causing X-linked
         spondyloepiphyseal dysplasia tarda. Nature Genet. 22, 400-404
         (1999).
     397.
         Hurvitz, J. R. et al. Mutations in the CCN gene family member WISP3 cause
         progressive pseudorheumatoid dysplasia. Nature Genet. 23, 94-98 (1999).
     398.
         Laberge-le Couteulx, S. et al. Truncating mutations in CCM1, encoding KRIT1, cause
         hereditary cavernous angiomas. Nature Genet. 23, 189-193 (1999).
     399.
         Sahoo, T. et al. Mutations in the gene encoding KRIT1, a Krev-1/rap1a binding protein,
         cause cerebral cavernous malformations (CCM1). Hum. Mol. Genet. 8, 2325-2333
         (1999).
     400.
         McGuirt, W. T. et al. Mutations in COL11A2 cause non-syndromic hearing loss
         (DFNA13). Nature Genet. 23, 413-419 (1999).
     401.
         Moreira, E. S. et al. Limb-girdle muscular dystrophy type 2G is caused by mutations
         in the gene encoding the sarcomeric protein telethonin. Nature Genet. 24, 163-166
         (2000).
     402.
         Ruiz-Perez, V. L. et al. Mutations in a new gene in Ellis-van Creveld syndrome and
         Weyers acrodental dysostosis. Nature Genet. 24, 283-286 (2000).
     403.
         Kaplan, J. M. et al. Mutations in ACTN4, encoding alpha-actinin-4, cause familial focal
         segmental glomerulosclerosis. Nature Genet. 24, 251-256 (2000).
     404.
         Escayg, A. et al. Mutations of SCN1A, encoding a neuronal sodium channel, in two
         families with GEFS+2. Nature Genet. 24, 343-345 (2000).
     405.
         Sacksteder, K. A. et al. Identification of the alpha-aminoadipic semialdehyde synthase
         gene, which is defective in familial hyperlysinemia. Am. J. Hum. Genet. 66, 1736-1743
         (2000).
     406.
         Kalaydjieva, L. et al. N-myc downstream-regulated gene 1 is mutated in hereditary
         motor and sensory neuropathy-Lom. Am. J. Hum. Genet. 67, 47-58
         (2000).
     407.
         Sundin, O. H. et al. Genetic basis of total colourblindness among the Pingelapese
         islanders. Nature Genet. 25, 289-293 (2000).
     408.
         Kohl, S. et al. Mutations in the CNGB3 gene encoding the beta-subunit of the cone
         photoreceptor cGMP-gated channel are responsible for achromatopsia (ACHM3)
         linked to chromosome 8q21. Hum. Mol. Genet. 9, 2107-2116
         (2000).
     409.
         Avela, K. et al. Gene encoding a new RING-B-box-coiled-coil protein is mutated in
         mulibrey nanism. Nature Genet. 25, 298-301 (2000).
     410.
         Verpy, E. et al. A defect in harmonin, a PDZ domain-containing protein expressed in
         the inner ear sensory hair cells, underlies Usher syndrome type 1C. Nature Genet. 26,
         51-55 (2000).
     411.
         Bitner-Glindzicz, M. et al. A recessive contiguous gene deletion causing infantile
         hyperinsulinism, enteropathy and deafness identifies the Usher type 1C gene. Nature
         Genet. 26, 56-60 (2000).
     412.
         The May-Hegglin/Fechtner Syndrome Consortium. Mutations in MYH9 result in the
         May-Hegglin anomaly, and Fechtner and Sebastian syndromes. Nature Genet. 26,
         103-105 (2000).
     413.
         Kelley, M. J., Jawien, W., Ortel, T. L. & Korczak, J. F. Mutation of MYH9, encoding
         non-muscle myosin heavy chain A, in May-Hegglin anomaly. Nature Genet. 26,
         106-108 (2000).
     414.
         Kirschner, L. S. et al. Mutations of the gene encoding the protein kinase A type I-alpha
         regulatory subunit in patients with the Carney complex. Nature Genet. 26, 89-92
         (2000).
     415.
         Lalwani, A. K. et al. Human nonsyndromic hereditary deafness DFNA17 is due to a
         mutation in non-muscle myosin MYH9. Am. J. Hum. Genet. 67, 1121-1128
         (2000).
     416.
         Matsuura, T. et al. Large expansion of the ATTCT pentanucleotide repeat in
         spinocerebellar ataxia type 10. Nature Genet. 26, 191-194 (2000).
     417.
         Delettre, C. et al. Nuclear gene OPA1, encoding a mitochondrial dynamin-related
         protein, is mutated in dominant optic atrophy. Nature Genet. 26, 207-210
         (2000).
     418.
         Pusch, C. M. et al. The complete form of X-linked congenital stationary night
         blindness is caused by mutations in a gene encoding a leucine-rich repeat protein.
         Nature Genet. 26, 324-327 (2000).
     419.
         The ADHR Consortium. Autosomal dominant hypophosphataemic rickets is
         associated with mutations in FGF23. Nature Genet. 26, 345-348 (2000).
     420.
         Bomont, P. et al. The gene encoding gigaxonin, a new member of the cytoskeletal
         BTB/kelch repeat family, is mutated in giant axonal neuropathy. Nature Genet. 26,
         370-374 (2000).
     421.
         Tullio-Pelet, A. et al. Mutant WD-repeat protein in triple-A syndrome. Nature Genet.
         26, 332-335 (2000).
     422.
         Nicole, S. et al. Perlecan, the major proteoglycan of basement membranes, is altered
         in patients with Schwartz-Jampel syndrome (chondrodystrophic myotonia). Nature
         Genet. 26, 480-483 (2000).
     423.
         Rogaev, E. I. et al. Familial Alzheimer's disease in kindreds with missense mutations
         in a gene on chromosome 1 related to the Alzheimer's disease type 3 gene. Nature
         376, 775-778 (1995).
     424.
         Sherrington, R. et al. Cloning of a gene bearing missense mutations in early-onset
         familial Alzheimer's disease. Nature 375, 754-760 (1995).
     425.
         Olivieri, N. F. & Weatherall, D. J. The therapeutic reactivation of fetal haemoglobin.
         Hum. Mol. Genet. 7, 1655-1658 (1998).
     426.
         Drews, J. Research & development. Basic science and pharmaceutical innovation.
         Nature Biotechnol. 17, 406 (1999).
     427.
         Drews, J. Drug discovery: a historical perspective. Science 287, 1960-1964
         (2000).
     428.
         Davies, P. A. et al. The 5-HT3B subunit is a major determinant of serotonin-receptor
         function. Nature 397, 359-363 (1999).
     429.
         Heise, C. E. et al. Characterization of the human cysteinyl leukotriene 2 receptor. J.
         Biol. Chem. 275, 30531-30536 (2000).
     430.
         Fan, W. et al. BACE maps to chromosome 11 and a BACE homolog, BACE2, reside
         in the obligate Down Syndrome region of chromosome 21. Science 286, 1255a
         (1999).
     431.
         Saunders, A. J., Kim, T. -W. & Tanzi, R. E. BACE maps to chromosome 11 and a
         BACE homolog, BACE2, reside in the obligate Down Syndrome region of
         chromosome 21. Science 286, 1255a (1999).
     432.
         Firestein, S. The good taste of genomics. Nature 404, 552-553
         (2000).
     433.
         Matsunami, H., Montmayeur, J. P. & Buck, L. B. A family of candidate taste
         receptors in human and mouse. Nature 404, 601-604 (2000).
     434.
         Adler, E. et al. A novel family of mammalian taste receptors. Cell 100, 693-702
         (2000).
     435.
         Chandrashekar, J. et al. T2Rs function as bitter taste receptors. Cell 100, 703-711
         (2000).
     436.
         Hardison, R. C. Conserved non-coding sequences are reliable guides to regulatory
         elements. Trends Genet. 16, 369-372 (2000).
     437.
         Onyango, P. et al. Sequence and comparative analysis of the mouse 1-megabase
         region orthologous to the human 11p15 imprinted domain. Genome Res. 10,
         1697-1710 (2000).
     438.
         Bouck, J. B., Metzker, M. L. & Gibbs, R. A. Shotgun sample sequence comparisons
         between mouse and human genomes. Nature Genet. 25, 31-33
         (2000).
     439.
         Marshall, E. Public-private project to deliver mouse genome in 6 months. Science 290,
         242-243 (2000).
     440.
         Wasserman, W. W., Palumbo, M., Thompson, W., Fickett, J. W. & Lawrence, C. E.
         Human-mouse genome comparisons to locate regulatory sites. Nature Genet. 26,
         225-228 (2000).
     441.
         Tagle, D. A. et al. Embryonic epsilon and gamma globin genes of a prosimian primate
         (Galago crassicaudatus). Nucleotide and amino acid sequences, developmental
         regulation and phylogenetic footprints. J. Mol. Biol. 203, 439-455 (1988).
     442.
         McGuire, A. M., Hughes, J. D. & Church, G. M. Conservation of DNA regulatory motifs
         and discovery of new motifs in microbial genomes. Genome Res. 10, 744-757
         (2000).
     443.
         Roth, F. P., Hughes, J. D., Estep, P. W. & Church, G. M. Finding DNA regulatory
         motifs within unaligned noncoding sequences clustered by whole-genome mRNA
         quantitation. Nature Biotechnol. 16, 939-945 (1998).
     444.
         Cheng, Y. & Church, G. M. Biclustering of expression data. ISMB 8, 93-103
         (2000).
     445.
         Cohen, B. A., Mitra, R. D., Hughes, J. D. & Church, G. M. A computational analysis
         of whole-genome expression data reveals chromosomal domains of gene expression.
         Nature Genet. 26, 183-186 (2000).
     446.
         Feil, R. & Khosla, S. Genomic imprinting in mammals: an interplay between
         chromatin and DNA methylation? Trends Genet. 15, 431-434
         (1999).
     447.
         Robertson, K. D. & Wolffe, A. P. DNA methylation in health and disease. Nature Rev.
         Genet. 1, 11-19 (2000).
     448.
         Beck, S., Olek, A. & Walter, J. From genomics to epigenomics: a loftier view of life.
         Nature Biotechnol. 17, 1144 (1999).
     449.
         Hagmann, M. Mapping a subtext in our genetic book. Science 288, 945-946
         (2000).
     450.
         Eliot, T. S. in T. S. Eliot. Collected Poems 1909-1962 (Harcourt Brace, New York,
         1963).
     451.
         Soderlund, C., Longden, I. & Mott, R. FPC: a system for building contigs from
         restriction fingerprinted clones. Comput. Appl. Biosci. 13, 523-535 (1997).
     452.
         Mott, R. & Tribe, R. Approximate statistics of gapped alignments. J. Comp. Biol. 6,
         91-112 (1999).



    Acknowledgements.
    Beyond the authors, many people contributed to the success of this
    work. E. Jordan provided helpful advice throughout the sequencing effort. We thank D.
    Leja and J. Shehadeh for their expert assistance on the artwork in this paper, especially the
    foldout figure; K. Jegalian for editorial assistance; J. Schloss, E. Green and M. Seldin for
    comments on an earlier version of the manuscript; P. Green and F. Ouelette for critiques of
    the submitted version; C. Caulcott, A. Iglesias, S. Renfrey, B. Skene and J. Stewart of the
    Wellcome Trust, P. Whittington and T. Dougans of NHGRI and M. Meugnier of
    Genoscope for staff support for meetings of the international consortium; and the University
    of Pennsylvania for facilities for a meeting of the genome analysis group. We thank
    Compaq Computer Corporation's High Performance Technical Computing Group for
    providing a Compaq Biocluster (a 27 node configuration of AlphaServer ES40s, containing
    108 CPUs, serving as compute nodes and a file server with one terabyte of secondary
    storage) to assist in the annotation and analysis. Compaq provided the systems and
    implementation services to set up and manage the cluster for continuous use by members of
    the sequencing consortium. Platform Computing Ltd. provided its LSF scheduling and
    loadsharing software without license fee. In addition to the data produced by the members
    of the International Human Genome Sequencing Consortium, the draft genome sequence
    includes published and unpublished human genomic sequence data from many other groups,
    all of whom gave permission to include their unpublished data. Four of the groups that
    contributed particularly significant amounts of data were: M. Adams et al. of the Institute
    for Genomic Research; E. Chen et al. of the Center for Genetic Medicine and Applied
    Biosystems; S.-F. Tsai of National Yang-Ming University, Institute of Genetics, Taipei,
    Taiwan, Republic of China; and Y. Nakamura, K. Koyama et al. of the Institute of Medical
    Science, University of Tokyo, Human Genome Center, Laboratory of Molecular Medicine,
    Minato-ku, Tokyo, Japan. Many other groups provided smaller numbers of database
    entries. We thank them all; a full list of the contributors of unpublished sequence is available
    as Supplementary Information. This work was supported in part by the National Human
    Genome Research Institute of the US NIH; The Wellcome Trust; the US Department of
    Energy, Office of Biological and Environmental Research, Human Genome Program; the
    UK MRC; the Human Genome Sequencing Project from the Science and Technology
    Agency (STA) Japan; the Ministry of Education, Science, Sport and Culture, Japan; the
    French Ministry of Research; the Federal German Ministry of Education, Research and
    Technology (BMBF) through Projektträger DLR, in the framework of the German Human
    Genome Project; BEO, Projektträger Biologie, Energie, Umwelt des BMBF und BMWT;
    the Max-Planck-Society; DFG—Deutsche Forschungsgemeinschaft; TMWFK, Thüringer
    Ministerium für Wissenschaft, Forschung und Kunst; EC BIOMED2—European
    Commission, Directorate Science, Research and Development; Chinese Academy of
    Sciences (CAS), Ministry of Science and Technology (MOST), National Natural Science
    Foundation of China (NSFC); US National Science Foundation EPSCoR and The SNP
    Consortium Ltd. Additional support for members of the Genome Analysis group came, in
    part, from an ARCS Foundation Scholarship to T.S.F., a Burroughs Wellcome Foundation
    grant to C.B.B. and P.A.S., a DFG grant to P.B., DOE grants to D.H., E.E.E. and T.S.F.,
    an EU grant to P.B., a Marie-Curie Fellowship to L.C., an NIH-NHGRI grant to S.R.E., an
    NIH grant to E.E.E., an NIH SBIR to D.K., an NSF grant to D.H., a Swiss National
    Science Foundation grant to L.C., the David and Lucile Packard Foundation, the Howard
    Hughes Medical Institute, the University of California at Santa Cruz and the W. M. Keck
    Foundation.


    Background References:

    1. "Genome Discovery Shocks Scientists: Genetic blueprint contains far fewer genes than thought --- DNA's importance downplayed".

    2. "Selective Control of DNA Helix Openings During Gene Regulation".

    3. "RNAs from All Categories Generate Retrosequences that May be Exapted as Novel Genes or Regulatory Elements".  Gene 238: 115-134 (1999).
     

    Additional References:

    1. "RNomics: An Experimental Approach that Identifies 201 Candidates for Novel, Small, Non-Messenger RNAs in Mouse",  EMBO J. 20: 2943-2953 (June 1, 2001).




    For Further Information and Feedback:
    E-mail:  frenster@euchromatin.net
    Phone:  +1 650 367 6483
    Fax:  +1 650 364 1773

    euchromatin: "the most active portion of the genome within the cell nucleus".