"The Sequence of the Human Genome".
Venter, JC, et al.
Celera Genomics
45 West Gude Drive
Rockville, Maryland 20850, USA
E-mail: humangenome@celera.com
J. Craig Venter,1*
Mark D. Adams,1 Eugene W. Myers,1 Peter W. Li,1 Richard J. Mural,1 Granger
G. Sutton,1 Hamilton O. Smith,1 Mark Yandell,1 Cheryl A. Evans,1 Robert
A. Holt,1 Jeannine D. Gocayne,1 Peter Amanatides,1 Richard M. Ballew,1
Daniel H. Huson,1 Jennifer Russo Wortman,1 Qing Zhang,1 Chinnappa D. Kodira,1
Xiangqun H. Zheng,1 Lin Chen,1 Marian Skupski,1 Gangadharan Subramanian,1
Paul D. Thomas,1 Jinghui Zhang,1 George L. Gabor Miklos,2
Catherine Nelson,3 Samuel Broder,1 Andrew G. Clark,4
Joe Nadeau,5 Victor A. McKusick,6 Norton
Zinder,7 Arnold J. Levine,7 Richard
J. Roberts,8 Mel Simon,9 Carolyn
Slayman,10 Michael Hunkapiller,11
Randall Bolanos,1 Arthur Delcher,1 Ian Dew,1 Daniel Fasulo,1 Michael Flanigan,1
Liliana Florea,1 Aaron Halpern,1 Sridhar Hannenhalli,1 Saul Kravitz,1 Samuel
Levy,1 Clark Mobarry,1 Knut Reinert,1 Karin Remington,1 Jane Abu-Threideh,1
Ellen Beasley,1 Kendra Biddick,1 Vivien Bonazzi,1 Rhonda Brandon,1 Michele
Cargill,1 Ishwar Chandramouliswaran,1 Rosane Charlab,1 Kabir Chaturvedi,1
Zuoming Deng,1 Valentina Di Francesco,1 Patrick Dunn,1 Karen Eilbeck,1
Carlos Evangelista,1 Andrei E. Gabrielian,1 Weiniu Gan,1 Wangmao Ge,1 Fangcheng
Gong,1 Zhiping Gu,1 Ping Guan,1 Thomas J. Heiman,1 Maureen E. Higgins,1
Rui-Ru Ji,1 Zhaoxi Ke,1 Karen A. Ketchum,1 Zhongwu Lai,1 Yiding Lei,1 Zhenya
Li,1 Jiayin Li,1 Yong Liang,1 Xiaoying Lin,1 Fu Lu,1 Gennady V. Merkulov,1
Natalia Milshina,1 Helen M. Moore,1 Ashwinikumar K Naik,1 Vaibhav A. Narayan,1
Beena Neelam,1 Deborah Nusskern,1 Douglas B. Rusch,1 Steven Salzberg,12
Wei Shao,1 Bixiong Shue,1 Jingtao Sun,1 Zhen Yuan Wang,1 Aihui Wang,1 Xin
Wang,1 Jian Wang,1 Ming-Hui Wei,1 Ron Wides,13
Chunlin Xiao,1 Chunhua Yan,1 Alison Yao,1 Jane Ye,1 Ming Zhan,1
Weiqing Zhang,1 Hongyu Zhang,1 Qi Zhao,1 Liansheng Zheng,1 Fei Zhong,1
Wenyan Zhong,1 Shiaoping C. Zhu,1 Shaying Zhao,12 Dennis
Gilbert,1
Suzanna Baumhueter,1 Gene Spier,1 Christine Carter,1 Anibal
Cravchik,1 Trevor Woodage,1 Feroze Ali,1
Huijin An,1 Aderonke Awe,1 Danita Baldwin,1 Holly Baden,1
Mary Barnstead,1 Ian Barrow,1 Karen Beeson,1 Dana Busam,1 Amy Carver,1
Angela Center,1 Ming Lai Cheng,1 Liz Curry,1 Steve Danaher,1
Lionel Davenport,1 Raymond Desilets,1 Susanne Dietz,1 Kristina
Dodson,1 Lisa Doup,1 Steven Ferriera,1
Neha Garg,1 Andres Gluecksmann,1 Brit Hart,1 Jason Haynes,1
Charles Haynes,1 Cheryl Heiner,1
Suzanne Hladun,1 Damon Hostin,1 Jarrett Houck,1 Timothy Howland,1
Chinyere Ibegwam,1 Jeffery Johnson,1 Francis Kalush,1 Lesley Kline,1 Shashi
Koduru,1 Amy Love,1 Felecia Mann,1 David May,1 Steven McCawley,1 Tina McIntosh,1
Ivy McMullen,1 Mee Moy,1 Linda Moy,1 Brian Murphy,1 Keith Nelson,1
Cynthia Pfannkoch,1 Eric Pratts,1 Vinita Puri,1 Hina Qureshi,1
Matthew Reardon,1 Robert Rodriguez,1
Yu-Hui Rogers,1 Deanna Romblad,1 Bob Ruhfel,1 Richard Scott,1
Cynthia Sitter,1 Michelle Smallwood,1
Erin Stewart,1 Renee Strong,1 Ellen Suh,1 Reginald Thomas,1
Ni Ni Tint,1 Sukyee Tse,1 Claire Vech,1
Gary Wang,1 Jeremy Wetter,1 Sherita Williams,1 Monica Williams,1
Sandra Windsor,1 Emily Winn-Deen,1
Keriellen Wolfe,1 Jayshree Zaveri,1 Karena Zaveri,1 Josep
F. Abril,14 Roderic Guigó,14
Michael J. Campbell,1 Kimmen V. Sjolander,1 Brian Karlak,1 Anish Kejariwal,1
Huaiyu Mi,1 Betty Lazareva,1 Thomas Hatton,1 Apurva Narechania,1 Karen
Diemer,1 Anushya Muruganujan,1 Nan Guo,1 Shinji Sato,1 Vineet Bafna,1 Sorin
Istrail,1 Ross Lippert,1 Russell Schwartz,1 Brian Walenz,1 Shibu Yooseph,1
David Allen,1 Anand Basu,1 James Baxendale,1 Louis Blick,1 Marcelo Caminha,1
John Carnes-Stine,1 Parris Caulk,1 Yen-Hui Chiang,1 My Coyne,1 Carl Dahlke,1
Anne Deslattes Mays,1 Maria Dombroski,1 Michael Donnelly,1 Dale Ely,1 Shiva
Esparham,1 Carl Fosler,1 Harold Gire,1 Stephen Glanowski,1 Kenneth Glasser,1
Anna Glodek,1 Mark Gorokhov,1 Ken Graham,1 Barry Gropman,1 Michael Harris,1
Jeremy Heil,1 Scott Henderson,1 Jeffrey Hoover,1 Donald Jennings,1 Catherine
Jordan,1 James Jordan,1 John Kasha,1 Leonid Kagan,1 Cheryl Kraft,1 Alexander
Levitsky,1 Mark Lewis,1 Xiangjun Liu,1 John Lopez,1 Daniel Ma,1
William Majoros,1 Joe McDaniel,1 Sean Murphy,1 Matthew Newman,1
Trung Nguyen,1 Ngoc Nguyen,1
Marc Nodell,1 Sue Pan,1 Jim Peck,1 Marshall Peterson,1 William
Rowe,1 Robert Sanders,1 John Scott,1
Michael Simpson,1 Thomas Smith,1 Arlan Sprague,1 Timothy Stockwell,1
Russell Turner,1 Eli Venter,1
Mei Wang,1 Meiyuan Wen,1 David Wu,1 Mitchell Wu,1 Ashley Xia,1
Ali Zandieh,1 Xiaohong Zhu1
A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies--a whole-genome assembly and a regional chromosome assembly--were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional ~12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history. 
Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.
WebFigure 1: Legend Key for Annotation of the Celera Human Genome Assembly (Supplemental Data).
The remaining ~80% of the genome, the euchromatic component, is divisible into G-, R-, and T-bands (67). These cytogenetic bands have been presumed to differ in their nucleotide composition and gene density, although we have been unable to determine precise band boundaries at the molecular level. T-bands are the most G+C- and gene-rich, and G-bands are G+C-poor (68). Bernardi has also offered a description of the euchromatin at the molecular level as long stretches of DNA of differing base composition, termed isochores (denoted L, H1, H2, and H3), which are >300 kbp in length (69). Bernardi defined the L (light) isochores as G+C-poor (<43%), whereas the H (heavy) isochores fall into three G+C-rich classes representing 24, 8, and 5% of the genome. Gene concentration has been claimed to be very low in the L isochores and 20-fold more enriched in the H2 and H3 isochores (70). By examining contiguous 50-kbp windows of G+C content across the assembly, we found that regions of G+C content >48% (H3 isochores) averaged 273.9 kbp in length, those with G+C content between 43 and 48% (H1+H2 isochores) averaged 202.8 kbp in length, and the average span of regions with <43% (L isochores) was 1078.6 kbp. The correlation between G+C content and gene density was also examined in 50-kbp windows along the assembled sequence (Table 9 and Figs. 10 and 11). We found that the density of genes was greater in regions of high G+C than in regions of low G+C content, as expected. However, the correlation between G+C content and gene density was not as skewed as previously predicted (69). A higher proportion of genes was located in the G+C-poor regions than had been expected.
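The windowed G+C analysis described above can be illustrated with a minimal sketch. This is not the pipeline used for the assembly; the function names are hypothetical, and the treatment of the exact 43% and 48% boundaries (here assigned to the H1+H2 class) is an assumption, since the text does not state it.

```python
def gc_fraction(seq):
    """Fraction of G or C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return gc / total if total else 0.0


def classify_window(gc):
    """Map a G+C fraction to the three classes used in the text.

    Assumption: values of exactly 0.43 or 0.48 fall in H1+H2.
    """
    if gc > 0.48:
        return "H3"
    if gc >= 0.43:
        return "H1+H2"
    return "L"


def isochore_runs(seq, window=50_000):
    """Return (class, length_bp) for each run of adjacent same-class windows."""
    runs = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        label = classify_window(gc_fraction(chunk))
        if runs and runs[-1][0] == label:
            # Extend the current run instead of starting a new one.
            runs[-1] = (label, runs[-1][1] + len(chunk))
        else:
            runs.append((label, len(chunk)))
    return runs


def mean_run_length(runs):
    """Average run length in bp for each class."""
    totals = {}
    for label, length in runs:
        count, total = totals.get(label, (0, 0))
        totals[label] = (count + 1, total + length)
    return {label: total / count for label, (count, total) in totals.items()}
```

On real data the window would stay at 50 kbp; a tiny window is convenient only for checking the logic on a toy string.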
Chromosomes 17, 19, and 22, which have a disproportionate number of H3-containing bands, had the highest gene density (Table 10). Conversely, the chromosomes that we found to have the lowest gene density (X, 4, 18, 13, and Y) also have the fewest H3 bands. Chromosome 15, which also has few H3 bands, did not have a particularly low gene density in our analysis. In addition, chromosome 8, which we found to have a low gene density, does not appear to be unusual in its H3 banding.
How valid is Ohno's postulate (71) that mammalian genomes consist of oases of genes in otherwise essentially empty deserts? It appears that the human genome does indeed contain deserts, or large, gene-poor regions. If we define a desert as a region >500 kbp without a gene, then we see that 605 Mbp, or about 20% of the genome, is in deserts. These are not uniformly distributed over the various chromosomes. Gene-rich chromosomes 17, 19, and 22 have only about 12% of their collective 171 Mbp in deserts, whereas gene-poor chromosomes 4, 13, 18, and X have 27.5% of their 492 Mbp in deserts (Table 11). The apparent lack of predicted genes in these regions does not necessarily imply that they are devoid of biological function.
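The desert definition above (a region >500 kbp without a gene) lends itself to a short sketch. The function name, the half-open interval convention, and the toy coordinates are illustrative assumptions, not part of the original analysis.

```python
def desert_stats(genes, chrom_length, min_desert=500_000):
    """Find gene-free regions longer than `min_desert` on one chromosome.

    genes: list of (start, end) intervals, 0-based and half-open (assumed).
    Returns (list of desert intervals, fraction of the chromosome in deserts).
    """
    deserts = []
    cursor = 0  # end of the right-most gene seen so far
    for start, end in sorted(genes):
        if start - cursor > min_desert:
            deserts.append((cursor, start))
        cursor = max(cursor, end)  # tolerate overlapping or nested genes
    if chrom_length - cursor > min_desert:
        deserts.append((cursor, chrom_length))
    total = sum(end - start for start, end in deserts)
    return deserts, total / chrom_length


# Toy chromosome of 3 Mbp with three genes (hypothetical coordinates).
toy_genes = [(600_000, 700_000), (750_000, 800_000), (2_000_000, 2_100_000)]
toy_deserts, toy_fraction = desert_stats(toy_genes, 3_000_000)
```

Applied per chromosome and summed, a calculation of this kind would give the genome-wide desert fraction; the result depends on the gene annotation supplied.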
We mapped the location of the markers that constitute the Genethon linkage map to the genome. The rate of recombination, expressed as cM per Mbp, was calculated for 3-Mbp windows as shown in Table 12. Higher rates of recombination in the telomeric region of the chromosomes have been previously documented (73). From this mapping result, there is a difference of 4.99 between lowest rates and highest rates and the largest difference of 4.4 between males and females (4.99 to 0.47 on chromosome 16). This indicates that the variability in recombination rates among regions of the genome exceeds the differences in recombination rates between males and females. The human genome has recombination hotspots, where recombination rates vary fivefold or more over a space of 1 kbp, so the picture one gets of the magnitude of variability in recombination rate will depend on the size of the window examined. Unfortunately, too few meiotic crossovers have occurred in Centre d'Étude du Polymorphisme Humain (CEPH) and other reference families to provide a resolution any finer than about 3 Mbp. The next challenge will be to determine a sequence basis of recombination at the chromosomal level. An accurate predictor of variation in recombination rates between any pair of markers would be extremely useful in designing markers to narrow a region of linkage, such as in positional cloning projects.
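A minimal sketch of a windowed recombination-rate calculation follows. The text does not specify how the rate within each 3-Mbp window was estimated; here it is assumed, for illustration only, to be the genetic distance between the first and last marker in the window divided by their physical distance. The function name and marker values are hypothetical.

```python
def recombination_rates(markers, window_bp=3_000_000):
    """Estimate cM per Mbp in fixed physical windows on one chromosome.

    markers: list of (position_bp, genetic_position_cM).
    Returns {window_index: rate} for windows holding at least two markers
    at distinct physical positions.
    """
    by_window = {}
    for pos, cm in sorted(markers):
        by_window.setdefault(pos // window_bp, []).append((pos, cm))
    rates = {}
    for index, points in by_window.items():
        (pos0, cm0), (pos1, cm1) = points[0], points[-1]
        if pos1 > pos0:
            rates[index] = (cm1 - cm0) / ((pos1 - pos0) / 1_000_000)
    return rates


# Hypothetical markers: (physical position in bp, map position in cM).
toy_markers = [(0, 0.0), (1_000_000, 1.0), (2_000_000, 3.0),
               (3_500_000, 4.0), (5_500_000, 5.0)]
toy_rates = recombination_rates(toy_markers)
```

Running the same function on sex-specific maps would allow male and female rates to be compared window by window.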
Experimental methods have been used that resulted in an estimate of 30,000 to 45,000 CpG islands in the human genome (74,80) and an estimate of 499 CpG islands on human chromosome 22 (81). Larsen et al. (76) and Gardiner-Garden and Frommer (75) used a computational method to identify CpG islands and defined them as regions of DNA of >200 bp that have a G+C content of >50% and a ratio of observed versus expected frequency of CG dinucleotide >0.6.
It is difficult to make a direct comparison of experimental definitions of CpG islands with computational definitions because computational methods do not consider the methylation state of cytosine and experimental methods do not directly select regions of high G+C content. However, we can determine the correlation of CpG islands with gene starts, given a set of annotated genomic transcripts and the whole genome sequence. We analyzed the publicly available annotation of chromosome 22, as well as the entire human genome in our assembly with its computationally annotated genes. We compared a variation of the CpG island computation with that of Larsen et al. (76). The main differences are that we use a sliding window of 200 bp, consecutive windows are merged only if they overlap, and we recompute the CpG value upon merging, thus rejecting any potential island if it scores less than the threshold.
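The sliding-window procedure described above can be sketched as follows. This is an illustrative reading of the description, not the code used for the analysis: the function names are hypothetical, the window step of 1 bp is an assumption, and the observed/expected ratio is computed with the commonly used formula (number of CG × N) / (number of C × number of G), which is assumed to match the definition cited in the text.

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CG ratio) for a sequence."""
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cg = seq.count("CG")
    gc_frac = (c + g) / n if n else 0.0
    ratio = cg * n / (c * g) if c and g else 0.0
    return gc_frac, ratio


def passes(seq, min_gc=0.5, min_ratio=0.6):
    """True if the sequence meets both CpG island criteria."""
    gc_frac, ratio = cpg_stats(seq)
    return gc_frac > min_gc and ratio > min_ratio


def find_cpg_islands(seq, window=200, step=1, min_ratio=0.6):
    """Return (start, end) half-open intervals of candidate CpG islands."""
    seq = seq.upper()
    merged = []
    for start in range(0, len(seq) - window + 1, step):
        if not passes(seq[start:start + window], min_ratio=min_ratio):
            continue
        end = start + window
        if merged and start < merged[-1][1]:
            merged[-1][1] = end  # overlaps the previous region: merge
        else:
            merged.append([start, end])
    # Re-score each merged region as a whole; drop it if it falls below threshold.
    return [(s, e) for s, e in merged if passes(seq[s:e], min_ratio=min_ratio)]
```

Passing `min_ratio=0.8` corresponds to a stricter threshold on the CG ratio. The sketch rescans every window from scratch, which is adequate for short test sequences but would need running counts to be practical at genome scale.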
To compute various CpG statistics, we used two different thresholds of CG dinucleotide likelihood ratio. Besides using the original threshold of 0.6 (method 1), we used a higher threshold of CG dinucleotide likelihood ratio of 0.8 (method 2), which yields a number of CpG islands on chromosome 22 close to the number of annotated genes on this chromosome. The main results are summarized in Table 13. CpG islands computed with method 1 predicted only 2.6% of the compartmentalized shotgun assembly (CSA) sequence as CpG, but 40% of the gene starts (start codons) are contained inside a CpG island. This is comparable to ratios reported by others (82). The last two rows of the table show the observed and expected average distance, respectively, of the closest CpG island from the first exon. The observed average distances to the closest CpG islands are smaller than the corresponding expected distances, confirming an association between CpG islands and the first exon.
We also looked at the distribution of CpG island nucleotides among various sequence classes such as intergenic regions, introns, exons, and first exons. We computed the likelihood score for each sequence class as the ratio of the observed fraction of CpG island nucleotides in that sequence class and the expected fraction of CpG island nucleotides in that sequence class. The results of applying method 1 to the CSA were scores of 0.89 for intergenic regions, 1.2 for introns, 5.86 for exons, and 13.2 for first exons. The same trend was also found for chromosome 22 and after the application of a higher threshold (method 2) on both data sets. In sum, genome-wide analysis has extended earlier analysis and suggests a strong correlation between CpG islands and first coding exons.
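The likelihood score defined above is an observed-to-expected ratio, which the following sketch makes explicit. Two points are assumptions made for illustration: the expected fraction is taken to be the class's share of all nucleotides, and the classes are treated as a non-overlapping partition (in the text, first exons are a subset of exons and would be scored separately). The function name and the toy counts are hypothetical and are not taken from Table 13.

```python
def enrichment_scores(class_bp, island_bp_in_class):
    """Observed/expected share of CpG island nucleotides per sequence class.

    class_bp: total nucleotides in each class.
    island_bp_in_class: CpG island nucleotides falling in each class.
    """
    total_bp = sum(class_bp.values())
    total_island = sum(island_bp_in_class.values())
    scores = {}
    for name, bp in class_bp.items():
        observed = island_bp_in_class.get(name, 0) / total_island
        expected = bp / total_bp  # assumed: proportional to class size
        scores[name] = observed / expected
    return scores


# Toy counts chosen only to exercise the arithmetic.
toy_scores = enrichment_scores(
    {"intergenic": 750, "intron": 240, "exon": 10},
    {"intergenic": 60, "intron": 30, "exon": 10},
)
```

A score above 1 means that CpG island nucleotides are over-represented in that class relative to its size, and a score below 1 means they are under-represented.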
8.1 The whole-genome sequencing approach versus BAC by BAC:
Experience in applying the whole-genome shotgun sequencing approach to a diverse group of organisms with a wide range of genome sizes and repeat content allows us to assess its strengths and weaknesses. With the success of the method for a large number of microbial genomes, Drosophila, and now the human, there can be no doubt concerning the utility of this method. The large number of microbial genomes that have been sequenced by this method (15, 80, 152) demonstrates that megabase-sized genomes can be sequenced efficiently without any input other than the de novo mate-paired sequences. With more complex genomes like those of Drosophila or human, map information, in the form of well-ordered markers, has been critical for long-range ordering of scaffolds. For joining scaffolds into chromosomes, the quality of the map (in terms of the order of the markers) is more important than the number of markers per se. Although this mapping could have been performed concurrently with sequencing, the prior existence of mapping data was beneficial. During the sequencing of the A. thaliana genome, sequencing of individual BAC clones permitted extension of the sequence well into centromeric regions and allowed high-quality resolution of complex repeat regions. Likewise, in Drosophila, the BAC physical map was most useful in regions near the highly repetitive centromeres and telomeres. Whole-genome assembly (WGA) has been found to deliver excellent-quality reconstructions of the unique regions of the genome. As the genome size, and more importantly the repetitive content, increases, the WGA approach delivers less of the repetitive sequence.
The cost and overall efficiency of clone-by-clone approaches make
them difficult to justify as a stand-alone
strategy for future large-scale genome-sequencing projects. Specific
applications of BAC-based or other clone mapping and sequencing strategies
to resolve ambiguities in sequence assembly that cannot be efficiently
resolved with computational approaches alone are clearly worth exploring.
Hybrid approaches to whole-genome sequencing will only work if there is
sufficient coverage in both the whole-genome shotgun phase and the BAC
clone sequencing phase. Our experience with human genome assembly suggests
that this will require at least 3× coverage of both whole-genome
and BAC shotgun sequence data.
8.2 The low gene number in humans:
We have sequenced and assembled ~95% of the euchromatic sequence of H. sapiens and used a new automated gene prediction method to produce a preliminary catalog of the human genes. This has provided a major surprise: We have found far fewer genes (26,000 to 38,000) than the earlier molecular predictions (50,000 to over 140,000). Whatever the reasons for this current disparity, only detailed annotation, comparative genomics (particularly using the Mus musculus genome), and careful molecular dissection of complex phenotypes will clarify this critical issue of the basic "parts list" of our genome. Certainly, the analysis is still incomplete and considerable refinement will occur in the years to come as the precise structure of each transcription unit is evaluated. A good place to start is to determine why the gene estimates derived from EST data are so discordant with our predictions. It is likely that the following contribute to an inflated gene number derived from ESTs: the variable lengths of 3'- and 5'-untranslated leaders and trailers; the little-understood vagaries of RNA processing that often leave intronic regions in an unspliced condition; the finding that nearly 40% of human genes are alternatively spliced (153); and finally, the unsolved technical problems in EST library construction where contamination from heterogeneous nuclear RNA and genomic DNA are not uncommon. Of course, it is possible that there are genes that remain unpredicted owing to the absence of EST or protein data to support them, although our use of mouse genome data for predicting genes should limit this number. As was true at the beginning of genome sequencing, ultimately it will be necessary to measure mRNA in specific cell types to demonstrate the presence of a gene.
J. B. S. Haldane speculated in 1937 that a population of organisms might have to pay a price for the number of genes it can possibly carry. He theorized that when the number of genes becomes too large, each zygote carries so many new deleterious mutations that the population simply cannot maintain itself. On the basis of this premise, and on the basis of available mutation rates and x-ray-induced mutations at specific loci, Muller, in 1967 (154), calculated that the mammalian genome would contain a maximum of not much more than 30,000 genes (155). An estimate of 30,000 gene loci for humans was also arrived at by Crow and Kimura (156). Muller's estimate for D. melanogaster was 10,000 genes, compared to 13,000 derived by annotation of the fly genome (26, 27). These arguments for the theoretical maximum gene number were based on simplified ideas of genetic load--that all genes have a certain low rate of mutation to a deleterious state. However, it is clear that many mouse, fly, worm, and yeast knockout mutations lead to almost no discernible phenotypic perturbations.
The modest number of human genes means that we must look elsewhere for the mechanisms that generate the complexities inherent in human development and the sophisticated signaling systems that maintain homeostasis. There are a large number of ways in which the functions of individual genes and gene products are regulated. The degree of "openness" of chromatin structure and hence transcriptional activity is regulated by protein complexes that involve histone and DNA enzymatic modifications. We enumerate many of the proteins that are likely involved in nuclear regulation in Table 19. The location, timing, and quantity of transcription are intimately linked to nuclear signal transduction events as well as to the tissue-specific expression of many of these proteins. Equally important are regulatory DNA elements that include insulators, repeats, and endogenous viruses (157); methylation of CpG islands in imprinting (158); and promoter-enhancer and intronic regions that modulate transcription. The spliceosomal machinery consists of multisubunit proteins (Table 19) as well as structural and catalytic RNA elements (159) that regulate transcript structure through alternative start and termination sites and splicing. Hence, there is a need to study different classes of RNA molecules (160) such as small nucleolar RNAs, antisense riboregulator RNA, RNA involved in X-dosage compensation, and other structural RNAs to appreciate their precise role in regulating gene expression. The phenomenon of RNA editing in which coding changes occur directly at the level of mRNA is of clinical and biological relevance (161). Finally, examples of translational control include internal ribosomal entry sites that are found in the mRNAs of proteins involved in cell cycle regulation and apoptosis (162). At the protein level, minor alterations in the nature of protein-protein interactions, protein modifications, and localization can have dramatic effects on cellular physiology (163).
This dynamic system therefore has many ways to modulate activity, which suggests that definition of complex systems by analysis of single genes is unlikely to be entirely successful.
In situ studies have shown that the human genome is asymmetrically populated with G+C content, CpG islands, and genes (68). However, the genes are not distributed quite as unequally as had been predicted (Table 9) (69). The most G+C-rich fraction of the genome, H3 isochores, constitute more of the genome than previously thought (about 9%), and are the most gene-dense fraction, but contain only 25% of the genes, rather than the predicted ~40%. The low G+C L isochores make up 65% of the genome, and 48% of the genes. This inhomogeneity, the net result of millions of years of mammalian gene duplication, has been described as the "desertification" of the vertebrate genome (71). Why are there clustered regions of high and low gene density, and are these accidents of history or driven by selection and evolution? If these deserts are dispensable, it ought to be possible to find mammalian genomes that are far smaller in size than the human genome. Indeed, many species of bats have genome sizes that are much smaller than that of humans; for example, Miniopterus, a species of Italian bat, has a genome size that is only 50% that of humans (164). Similarly, Muntiacus, a species of Asian barking deer, has a genome size that is ~70% that of humans.
8.3 Human DNA sequence variation and its distribution across the genome:
This is the first eukaryotic genome in which a nearly uniform ascertainment of polymorphism has been completed. Although we have identified and mapped more than 3 million SNPs, this by no means implies that the task of finding and cataloging SNPs is complete. These represent only a fraction of the SNPs present in the human population as a whole. Nevertheless, this first glimpse at genome-wide variation has revealed strong inhomogeneities in the distribution of SNPs across the genome. Polymorphism in DNA carries with it a snapshot of the past operation of population genetic forces, including mutation, migration, selection, and genetic drift. The availability of a dense array of SNPs will allow questions related to each of these factors to be addressed on a genome-wide basis. SNP studies can establish the range of haplotypes present in subjects of different ethnogeographic origins, providing insights into population history and migration patterns. Although such studies have suggested that modern human lineages derive from Africa, many important questions regarding human origins remain unanswered, and more analyses using detailed SNP maps will be needed to settle these controversies. In addition to providing evidence for population expansions, migration, and admixture, SNPs can serve as markers for the extent of evolutionary constraint acting on particular genes. The correlation between patterns of intraspecies and interspecies genetic variation may prove to be especially informative to identify sites of reduced genetic diversity that may mark loci where sequence variations are not tolerated.
The remarkable heterogeneity in SNP density implies that there are a variety of forces acting on polymorphism--sparse regions may have lower SNP density because the mutation rate is lower, because most of those regions have a lower fraction of mutations that are tolerated, or because recent strong selection in favor of a newly arisen allele "swept" the linked variation out of the population (165). The effect of random genetic drift also varies widely across the genome. The non-recombining portion of the Y chromosome faces the strongest pressure from random drift because there are roughly one-quarter as many Y chromosomes in the population as there are autosomal chromosomes, and the level of polymorphism on the Y is correspondingly less. Similarly, the X chromosome has a smaller effective population size than the autosomes, and its nucleotide diversity is also reduced. But even across a single autosome, the effective population size can vary because the density of deleterious mutations may vary. Regions of high density of deleterious mutations will see a greater rate of elimination by selection, and the effective population size will be smaller (166). As a result, the density of even completely neutral SNPs will be lower in such regions. There is a large literature on the association between SNP density and local recombination rates in Drosophila, and it remains an important task to assess the strength of this association in the human genome, because of its impact on the design of local SNP densities for disease-association studies. It also remains an important task to validate SNPs on a genomic scale in order to assess the degree of heterogeneity among geographic and ethnic populations.
We will soon be in a position to move away from the cataloging
of individual components of the system, and
beyond the simplistic notions of "this binds to that, which then
docks on this, and then the complex moves there. ..." (167)
to the exciting area of network perturbations, nonlinear responses and
thresholds, and their pivotal role in human diseases.
The enumeration of other "parts lists" reveals that in organisms
with complex nervous systems, neither gene
number, neuron number, nor number of cell types correlates in any
meaningful manner with even simplistic measures of structural or behavioral
complexity. Nor would they be expected to; this is the realm of nonlinearities
and epigenesis (168). The 520 million neurons of
the common octopus exceed the neuronal number in the brain of a mouse by
an order of magnitude. It is apparent from a comparison of genomic data
on the mouse and human, and from comparative mammalian neuroanatomy (169),
that the morphological and behavioral diversity found in mammals is underpinned
by a similar gene repertoire and similar neuroanatomies. For example, when
one compares a pygmy marmoset (which is only 4 inches tall and weighs about
6 ounces) to a chimpanzee, the brain volume of this minute primate is found
to be only about 1.5 cm³, two orders of magnitude less than
that of a chimp and three orders less than that of humans. Yet the neuroanatomies
of all three brains are strikingly similar, and the behavioral characteristics
of the pygmy marmoset are little different from those of chimpanzees. Between
humans and chimpanzees, the gene number, gene structures and functions,
chromosomal and genomic organizations, and cell types and neuroanatomies
are almost indistinguishable, yet the developmental modifications that
predisposed human lineages to cortical expansion and development of the
larynx, giving rise to language, culminated in a massive singularity that
by even the simplest of criteria made humans more complex in a behavioral
sense.
Simple examination of the number of neurons, cell types, or genes or of the genome size does not alone account for the differences in complexity that we observe. Rather, it is the interactions within and among these sets that result in such great variation. In addition, it is possible that there are "special cases" of regulatory gene networks that have a disproportionate effect on the overall system. We have presented several examples of "regulatory genes" that are significantly increased in the human genome compared with the fly and worm. These include extracellular ligands and their cognate receptors (e.g., wnt, frizzled, TGF-β, ephrins, and connexins), as well as nuclear regulators (e.g., the KRAB and homeodomain transcription factor families), where a few proteins control broad developmental processes. The answers to these "complexities" perhaps lie in these expanded gene families and differences in the regulatory control of ancient genes, proteins, pathways, and cells.
While few would disagree with the intuitive conclusion that Einstein's
brain was more complex than that of
Drosophila, closer comparisons such as whether the set of
predicted human proteins is more complex than the protein set of Drosophila,
and if so, to what degree, are not straightforward, since protein, protein
domain, or protein-protein interaction measures do not capture context-dependent
interactions that underpin the dynamics underlying phenotype.
Currently, there are more than 30 different mathematical descriptions of complexity (170). However, we have yet to understand the mathematical dependency relating the number of genes with organism complexity. One pragmatic approach to the analysis of biological systems, which are composed of nonidentical elements (proteins, protein complexes, interacting cell types, and interacting neuronal populations), is through graph theory (171). The elements of the system can be represented by the vertices of complex topographies, with the edges representing the interactions between them. Examination of large networks reveals that they can self-organize, but more important, they can be particularly robust. This robustness is not due to redundancy, but is a property of inhomogeneously wired networks. The error tolerance of such networks comes with a price; they are vulnerable to the selection or removal of a few nodes that contribute disproportionately to network stability. Gene knockouts provide an illustration. Some knockouts may have minor effects, whereas others have catastrophic effects on the system. In the case of vimentin, a supposedly critical component of the cytoplasmic intermediate filament network of mammals, the knockout of the gene in mice reveals them to be reproductively normal, with no obvious phenotypic effects (172), and yet the usually conspicuous vimentin network is completely absent. On the other hand, ~30% of knockouts in Drosophila and mice correspond to critical nodes whose reduction in gene product, or total elimination, causes the network to crash most of the time, although even in some of these cases, phenotypic normalcy ensues, given the appropriate genetic background. Thus, there are no "good" genes or "bad" genes, but only networks that exist at various levels and at different connectivities, and at different states of sensitivity to perturbation. 
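The contrast between tolerated and catastrophic knockouts in an inhomogeneously wired network can be illustrated on a toy graph. The sketch below is not drawn from any biological data set; the node names and topology are invented, and the size of the largest connected component is used as a simple, assumed stand-in for "network stability."

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from a list of (a, b) edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def largest_component(adj, removed=frozenset()):
    """Size of the largest connected component after deleting `removed` nodes."""
    seen = set(removed)
    best = 0
    for node in adj:
        if node in seen:
            continue
        seen.add(node)
        queue = deque([node])
        size = 0
        while queue:  # breadth-first search over surviving nodes
            current = queue.popleft()
            size += 1
            for neighbor in adj[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        best = max(best, size)
    return best


# Toy network: two hubs joined by one edge, each hub with five leaf nodes.
edges = [("hubA", "hubB")]
edges += [("hubA", f"a{i}") for i in range(5)]
edges += [("hubB", f"b{i}") for i in range(5)]
network = build_graph(edges)

intact = largest_component(network)                        # 12
leaf_knockout = largest_component(network, {"a0"})         # 11
hub_knockout = largest_component(network, {"hubA"})        # 6
```

Removing a leaf barely changes the connected network, whereas removing one hub halves it; in this toy case the outcome depends on which node is removed rather than on how many.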
Sophisticated mathematical analysis needs to be constantly evaluated against hard biological data sets that specifically address network dynamics. Nowhere is this more critical than in attempts to come to grips with "complexity," particularly because deconvoluting and correcting complex networks that have undergone perturbation, and have resulted in human diseases, is the most significant challenge now facing us.
It has been predicted for the last 15 years that complete sequencing of the human genome would open up new strategies for human biological research and would have a major impact on medicine, and through medicine and public health, on society. Effects on biomedical research are already being felt. This assembly of the human genome sequence is but a first, hesitant step on a long and exciting journey toward understanding the role of the genome in human biology. It has been possible only because of innovations in instrumentation and software that have allowed automation of almost every step of the process from DNA preparation to annotation. The next steps are clear: We must define the complexity that ensues when this relatively modest set of about 30,000 genes is expressed. The sequence provides the framework upon which all the genetics, biochemistry, physiology, and ultimately phenotype depend. It provides the boundaries for scientific inquiry. The sequence is only the first level of understanding of the genome. All genes and their control elements must be identified; their functions, in concert as well as in isolation, defined; their sequence variation worldwide described; and the relation between genome variation and specific phenotypic characteristics determined. Now we know what we have to explain.
Another paramount challenge awaits: public discussion of this information and its potential for improvement of personal health. Many diverse sources of data have shown that any two individuals are more than 99.9% identical in sequence, which means that all the glorious differences among individuals in our species that can be attributed to genes fall in a mere 0.1% of the sequence. There are two fallacies to be avoided: determinism, the idea that all characteristics of the person are "hard-wired" by the genome; and reductionism, the view that with complete knowledge of the human genome sequence, it is only a matter of time before our understanding of gene functions and interactions will provide a complete causal description of human variability. The real challenge of human biology, beyond the task of finding out how genes orchestrate the construction and maintenance of the miraculous mechanism of our bodies, will lie ahead as we seek to explain how our minds have come to organize thoughts sufficiently well to investigate our own existence.
15. C. J. Bult, et al., Science 273, 1058 (1996); J. F. Tomb, et al., Nature 388, 539 (1997); H. P. Klenk, et al., Nature 390, 364 (1997).
26. M. D. Adams, et al., Science 287, 2185 (2000).
27. G. M. Rubin, et al., Science 287, 2204 (2000).
64. G. L. Miklos and B. John, Am. J. Hum. Genet. 31, 264 (1979); U. Francke, Cytogenet. Cell Genet. 65, 206 (1994).
65. P. E. Warburton, H. F. Willard, in Human Genome Evolution, M. S. Jackson, T. Strachan, G. Dover, Eds. (BIOS Scientific, Oxford, 1996), pp. 121-145.
66. J. E. Horvath, S. Schwartz, E. E. Eichler, Genome Res. 10, 839 (2000).
67. W. A. Bickmore and A. T. Sumner, Trends Genet. 5, 144 (1989).
68. G. P. Holmquist, Am. J. Hum. Genet. 51, 17 (1992).
69. G. Bernardi, Gene 241, 3 (2000).
70. S. Zoubak, O. Clay, G. Bernardi, Gene 174, 95 (1996).
71. S. Ohno, Trends Genet. 1, 160 (1985).
72. K. W. Broman, J. C. Murray, V. C. Sheffield, R. L. White, J. L. Weber, Am. J. Hum. Genet. 63, 861 (1998).
73. M. J. McEachern, A. Krauskopf, E. H. Blackburn, Annu. Rev. Genet. 34, 331 (2000).
74. A. Bird, Trends Genet. 3, 342 (1987).
75. M. Gardiner-Garden and M. Frommer, J. Mol. Biol. 196, 261 (1987).
76. F. Larsen, G. Gundersen, R. Lopez, H. Prydz, Genomics 13, 1095 (1992).
77. S. H. Cross and A. Bird, Curr. Opin. Genet. Dev. 5, 309 (1995).
78. J. Peters, Genome Biol. 1, reviews1028.1 (2000) (http://genomebiology.com/2000/1/5/reviews/1028).
79. C. Grunau, W. Hindermann, A. Rosenthal, Hum. Mol. Genet. 9, 2651 (2000).
80. F. Antequera and A. Bird, Proc. Natl. Acad. Sci. U.S.A. 90, 11995 (1993).
81. S. H. Cross, et al., Mamm. Genome 11, 373 (2000).
82. D. Slavov, et al., Gene 247, 215 (2000).
83. A. F. Smit and A. D. Riggs, Nucleic Acids Res. 23, 98 (1995).
152. C. M. Fraser, et al., Science 281, 375 (1998); H. Tettelin, et al., Science 287, 1809 (2000).
153. D. Brett, et al., FEBS Lett. 474, 83 (2000).
154. H. J. Muller and H. Kern, Z. Naturforsch. B22, 1330 (1967).
155. H. J. Muller, in Heritage from Mendel, R. A. Brink, Ed. (Univ. of Wisconsin Press, Madison, WI, 1967), p. 419.
156. J. F. Crow, M. Kimura, Introduction to Population Genetics Theory (Harper & Row, New York, 1970).
157. K. Kobayashi, et al., Nature 394, 388 (1998).
158. A. P. Feinberg, Curr. Top. Microbiol. Immunol. 249, 87 (2000).
159. C. A. Collins and C. Guthrie, Nature Struct. Biol. 7, 850 (2000).
160. S. R. Eddy, Curr. Opin. Genet. Dev. 9, 695 (1999).
161. Q. Wang, J. Khillan, P. Gadue, K. Nishikura, Science 290, 1765 (2000).
162. M. Holcik, N. Sonenberg, R. G. Korneluk, Trends Genet. 16, 469 (2000).
163. T. A. McKinsey, C. L. Zhang, J. Lu, E. N. Olson, Nature 408, 106 (2000).
164. E. Capanna and M. G. M. Romanini, Caryologia 24, 471 (1971).
165. J. Maynard Smith, J. Theor. Biol. 128, 247 (1987).
166. D. Charlesworth, B. Charlesworth, M. T. Morgan, Genetics 141, 1619 (1995).
167. J. E. Bailey, Nature Biotechnol. 17, 616 (1999).
168. R. Maleszka, H. G. de Couet, G. L. Miklos, Proc. Natl. Acad. Sci. U.S.A. 95, 3731 (1998).
169. G. L. Miklos, J. Neurobiol. 24, 842 (1993).
170. J. P. Crutchfield and K. Young, Phys. Rev. Lett. 63, 105 (1989); M. Gell-Mann and S. Lloyd, Complexity 2, 44 (1996).
171. A. L. Barabasi and R. Albert, Science 286, 509 (1999).
172. E. Colucci-Guyon, et al., Cell 79, 679 (1994).
1 Celera Genomics, 45 West Gude Drive, Rockville,
MD 20850, USA.
2 GenetixXpress, 78 Pacific Road, Palm Beach,
Sydney 2108, Australia.
3 Berkeley Drosophila Genome Project, University
of California, Berkeley, CA 94720, USA.
4 Department of Biology, Penn State University,
208 Mueller Lab, University Park, PA 16802, USA.
5 Department of Genetics, Case Western Reserve
University School of Medicine, BRB-630, 10900 Euclid Avenue, Cleveland,
OH 44106, USA.
6 Johns Hopkins University School of Medicine,
Johns Hopkins Hospital, 600 North Wolfe Street, Blalock
1007, Baltimore, MD 21287-4922, USA.
7 Rockefeller University, 1230 York Avenue, New
York, NY 10021-6399, USA.
8 New England BioLabs, 32 Tozer Road, Beverly,
MA 01915, USA.
9 Division of Biology, 147-75, California Institute
of Technology, 1200 East California Boulevard, Pasadena, CA 91125, USA.
10 Yale University School of Medicine, 333 Cedar
Street, P.O. Box 208000, New Haven, CT 06520-8000, USA.
11 Applied Biosystems, 850 Lincoln Centre Drive,
Foster City, CA 94404, USA.
12 The Institute for Genomic Research, 9712 Medical
Center Drive, Rockville, MD 20850, USA.
13 Faculty of Life Sciences, Bar-Ilan University,
Ramat-Gan, 52900 Israel.
14 Grup de Recerca en Informàtica Mèdica,
Institut Municipal d'Investigació Mèdica, Universitat Pompeu
Fabra, 08003-Barcelona, Catalonia, Spain.
1. "Genome Discovery Shocks Scientists: Genetic blueprint contains far fewer genes than thought --- DNA's importance downplayed".
2. "Initial Sequencing and Analysis of the Human Genome".
3. "Selective Control of DNA Helix Openings During Gene Regulation".