"Our Genome Unveiled".
David Baltimore
California Institute of Technology, 1200 East California Boulevard,
Mail Code 204-31,
Pasadena, California 91125, USA.
e-mail: baltimo@caltech.edu
The draft sequences of the human genome are remarkable achievements.
They
provide an outline of the information needed to create a human being
and show, for
the first time, the overall organization of a vertebrate's DNA.
I've seen a lot of exciting biology emerge over the past 40 years.
But chills still ran down my
spine when I first read the paper that describes the outline of
our genome and now appears on
page 860 of this issue [1]. Not that many questions
are definitively answered; for conceptual
impact, it does not hold a candle to Watson and Crick's 1953 paper
[2] describing the structure of
DNA. Nonetheless, it is a seminal paper, launching the era of post-genomic
science.
This milestone of biology's megaproject is the long-promised draft
DNA sequence from the
International Human Genome Sequencing Consortium (the public project).
The sequence itself
is available to all those connected to the Internet [3].
In the paper in this issue, we are presented
with a description of the strategy used to decipher the structures
of the huge DNA molecules
that constitute the genome, and with analyses of the content encoded
in the genome. It is the
achievement of a coordinated effort involving 20 laboratories and
hundreds of people around
the world. It reflects the scientific community at its best: working
collaboratively, pooling its
resources and skills, keeping its focus on the goal, and making
its results available to all as they
were acquired.
Simultaneously, another draft sequence is being published [4].
It is less freely available because it
was generated by a company, Celera Genomics, that hopes to sell
the information. This week's
Science contains an account of the history of that project and the
analyses of its data, while
another of the papers in this issue contains a comparison of the
quality of the two sequences [5].
To those who saw this as a competitive sport, the papers make it
appear to be roughly a tie.
However, it is important to remember that Celera had the advantage
of all of the public
project's data. Nevertheless, Celera's achievement of producing
a draft sequence in only a year
of data-gathering is a testament to what can be realized today with
the new capillary
sequencers, sufficient computing power and the faith of investors.
Answers
What have we learned from all of these AGCTs? The best way to answer
the question is to
read the analytical sections of the papers. I will only make some
general comments. It is
important to remember that no statements can be made with high precision
because the draft
sequences have holes and imperfections, and the tools for analysis
remain limited (as described
in a further paper [6] in this issue, page 828).
However, the answers provided by the draft will be
of interest to many investigators, and the value of having the draft
published in its imperfect
form is unquestionable.
The sequences are about 90% complete for the euchromatic (weakly
staining, gene-rich)
regions of the human chromosomes. The estimated total size of the
genome is 3.2 Gb (that is
gigabases, the latest escalation of units needed to contain the
fruits of modern technology). Of
that, about 2.95 Gb is euchromatic. Only 1.1% to 1.4% is
sequence that actually encodes
protein; that is just 5% of the 28% of the sequence that is transcribed
into RNA. Over half of
the DNA consists of repeated sequences of various types: 45% in
four classes of parasitic
DNA elements, 3% in repeats of just a few bases, and about 5% in
recent duplications of large
segments of DNA. The amounts in the first and third classes will
certainly grow as our ability
to characterize them increases in effectiveness and we examine the
darkly staining,
heterochromatic regions of chromosomes. As the co-discoverer
of reverse transcriptase (the
enzyme that reverses the common mode of information transfer from
DNA to RNA), I find it
striking that most of the parasitic DNA came about by reverse transcription
from RNA. In
places, the genome looks like a sea of reverse-transcribed DNA with
a small admixture of
genes.
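The proportions quoted above can be checked with a little arithmetic. The following is only a rough sketch: it assumes that every percentage refers to the estimated 3.2-Gb total (the text does not say so explicitly), and the function and variable names are illustrative, not taken from either project's analysis.

```python
# Figures quoted in the text; all percentages are ASSUMED to be
# fractions of the estimated 3.2-Gb total genome.
GENOME_GB = 3.2
EUCHROMATIC_GB = 2.95
CODING_PCT_RANGE = (1.1, 1.4)   # protein-coding sequence, low and high estimates
TRANSCRIBED_PCT = 28.0          # sequence transcribed into RNA
REPEAT_PCT = {
    "parasitic DNA elements": 45.0,
    "repeats of a few bases": 3.0,
    "recent large duplications": 5.0,
}


def coding_share_of_transcribed(coding_pct, transcribed_pct=TRANSCRIBED_PCT):
    """Return coding sequence as a percentage of the transcribed sequence."""
    return 100.0 * coding_pct / transcribed_pct


def pct_to_megabases(pct, genome_gb=GENOME_GB):
    """Convert a percentage of the genome into megabases."""
    return pct / 100.0 * genome_gb * 1000.0


# "Over half of the DNA consists of repeated sequences": 45 + 3 + 5.
total_repeat_pct = sum(REPEAT_PCT.values())

# Share of the estimated genome that is euchromatic.
euchromatic_fraction = EUCHROMATIC_GB / GENOME_GB

if __name__ == "__main__":
    low, high = CODING_PCT_RANGE
    print(f"Repeats: {total_repeat_pct:.0f}% of the genome")
    print(f"Euchromatic: {euchromatic_fraction:.1%} of the genome")
    print(f"Coding: {pct_to_megabases(low):.1f} to {pct_to_megabases(high):.1f} Mb")
    print(f"Coding share of transcribed sequence (high estimate): "
          f"{coding_share_of_transcribed(high):.1f}%")
```

On these assumptions, the "just 5%" figure corresponds to the upper coding estimate (1.4% of 28%); the lower estimate gives a slightly smaller share.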
Repeats
By contrast, the puffer fish, another vertebrate, has a genome
that contains very few
repeats. But it encodes a perfectly functional creature, so it seems
likely that most of the
repeats are simply parasitic, selfish DNA elements that use the
genome as a convenient host.
People call this 'junk DNA', but from the DNA's point of view it
deserves more respect. In
most places in the human genome the selfish elements are tolerated,
and in some places (near
the ends of chromosomes, for instance, or near the chromosome constrictions
called
centromeres) they build up to form huge segments. However, the repeated
DNA may have
both negative and positive effects. For instance, the paucity of
repeats in certain highly
regulated regions of the genome suggests that insertions
there can disrupt gene regulation and
are deleterious. Conversely, the enrichment of the so-called Alu
class of repeated sequences in
the gene-rich, high-GC regions of the genome implies that they have
a positive function. The
repeats can also be fodder for evolving new functions and act as
loci for gene rearrangements.
In humans, virtually all of the parasitic DNA repeats seem old and
enfeebled, with little
evidence of continuing reinsertions. However, there has been very
little evolutionary scouring
of these repeats from the human genome, making it a rich record
of evolutionary history. The
mouse genome, by contrast, has many actively reinserting parasitic
sequences and is scoured
more intensely, making it a much younger and more dynamic genome.
This difference might
reflect the shorter generation time of mice or something about their
physiology, but I find it an
intriguingly enigmatic observation.
Much of what we learn about the global organization of the genome
is an elaboration of
previous notions. For instance, we knew that the genome had regions
with a relatively high
content of GC bases and regions high in AT, but now we have a very
complete appreciation of
this architecture. What maintains the patchiness of the GC/AT ratio
in the genome remains an
unanswered question. As was expected, most genes are located outside
the heterochromatic
regions; interestingly, however, in regions of the genome rich in
GC bases, the gene density is
greater and the average intron size is lower. These introns, made
up of largely meaningless
sequence that breaks up the protein-coding sequences (exons) of
genes, are much longer in
human DNA than in the genomes previously sequenced. Their dilution
of the coding sequence
is one element that makes finding genes by computer so difficult
in human DNA.
A major interest of the genome sequence to many biologists will be
the opportunity it provides
to discover new genes in their favourite systems: for instance,
cell biologists will search for
new genes for signalling proteins, and neurobiologists will look
for new ion channels. This
data-mining exercise was carried out by various groups, which report
their initial findings in
papers that appear on pages 824-859 of this issue. They found some
new and interesting
genes, but surprisingly few, and occasionally could not find the
full extent of genes that they
knew were there. The paucity of discoveries reflects their concentration
on systems that were
previously heavily studied.
Gene-regulatory sequences are now there for all to see, but
initial attempts to find them were
also disappointing. This is where the genomic sequences of other
species (in which the
regulatory sequences, but not the functionally insignificant
DNA, are likely to be much the
same) will open up a cornucopia. Basically, the human sequence,
at its present level of
analysis, allows us to answer many global questions fairly well,
but the detailed questions remain
open for the future.
What interested me most about the genome? The number of genes is
high on the list. The
public project estimates that there are 31,000 protein-encoding
genes in the human genome, of
which they can now provide a list of 22,000. Celera finds about
26,000. There are also about
740 identified genes that make the non-protein-coding RNAs
involved in various cell
housekeeping duties, with many more to be found. The number of coding
genes in the human
sequence compares with 6,000 for a yeast cell, 13,000 for a fly,
18,000 for a worm and 26,000
for a plant. None of the numbers for the multicellular organisms
is highly accurate because of
the limitations of gene-finding programs. But unless the human genome
contains a lot of genes
that are opaque to our computers, it is clear that we do not gain
our undoubted complexity over
worms and plants by using many more genes. Understanding what does
give us our complexity
(our enormous behavioural repertoire, ability to produce conscious
action, remarkable
physical coordination (shared with other vertebrates), precisely
tuned alterations in response to
external variations of the environment, learning, memory ... need
I go on?) remains a
challenge for the future.
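To put these counts side by side, here is a small sketch that expresses the public project's estimate of 31,000 human genes as a multiple of each of the other organisms' counts quoted above. The inputs are the rounded figures from the text, none of which is highly accurate, and the names used are illustrative.

```python
# Approximate protein-coding gene counts quoted in the text.
GENE_COUNTS = {
    "yeast": 6_000,
    "fly": 13_000,
    "worm": 18_000,
    "plant": 26_000,
}

# Human figures quoted in the text.
HUMAN_PUBLIC_ESTIMATE = 31_000
HUMAN_PUBLIC_LISTED = 22_000
HUMAN_CELERA = 26_000


def fold_over(human_count, other_count):
    """Return how many times larger the human gene count is."""
    return human_count / other_count


# Human estimate as a multiple of each organism's count, rounded to 2 places.
ratios = {
    organism: round(fold_over(HUMAN_PUBLIC_ESTIMATE, count), 2)
    for organism, count in GENE_COUNTS.items()
}

if __name__ == "__main__":
    for organism, ratio in ratios.items():
        print(f"human / {organism}: {ratio}x")
```

Even on the highest human estimate, the count is less than twice that of the worm and only about a fifth higher than that of the plant.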
Complexity
Where do our genes come from? Mostly from the distant evolutionary
past. In fact, only 94 of
1,278 protein families in our genome appear to be specific to vertebrates.
The most elementary
of cellular functions (basic metabolism, transcription of DNA into
RNA, translation of RNA
into protein, DNA replication and the like) evolved just once and
have stayed pretty well
fixed since the evolution of single-celled yeast and bacteria. The
biggest difference between
humans and worms or flies is the complexity of our proteins:
more domains (modules) per
protein and novel combinations of domains. The history is one of
new architectures being built
from old pieces. A few of our genes seem to have come directly from
bacteria, rather than by
evolution from bacteria; apparently bacterial genomes can be direct
donors of genes to
vertebrates. So DNA chimaeras consisting of the genes from
several organisms can arise
naturally as well as artificially (opponents of 'genetically modified
foods' take note).
The most exciting new vista to come from the human genome is not
tackling the question
"What makes us human?", but addressing a different one: "What differentiates
one organism
from another?". The first question, imprecise as it is, cannot be
answered by staring at a
genome. The second, however, can be answered this way because our
differences from plants,
worms and flies are mainly a consequence of our genetic endowments.
The Celera team [4]
presents the more detailed analysis of the numbers of different
protein
motifs and protein types,
in extensive tables. From them, it is easy to see what types of
proteins and motifs have been
amplified for specific types of organisms. In vertebrates, not surprisingly,
we see elaboration
and the de novo appearance of two types of genes: those for specific
vertebrate abilities (such
as neuronal complexity, blood-clotting and the acquired immune response),
and those that
provide increased general capabilities (such as genes for intra-
and intercellular signalling,
development, programmed cell death, and control of gene transcription).
Someday soon we will
have the mouse genome, and then those of fish and dogs, and
probably the kangaroo genome
from the Australians. Each of these will fill in a piece of the
evolutionary puzzle and will
provide exciting comparisons.
We wait with bated breath to see the chimpanzee genome. But
knowing now how few genes
humans have, I wonder if we will learn much about the origins of
speech, the elaboration of the
frontal lobes and the opposable thumb, the advent of upright posture,
or the sources of abstract
reasoning ability, from a simple genomic comparison of human and
chimp. It seems likely that
these features and abilities have mainly come from subtle changes
(for example, in gene
regulation, in the efficiency with which introns are spliced out
of RNA, and in protein-protein
interactions) that are not now easily visible to our computers
and will require much more
experimental study to tease out. Another half-century of work by
armies of biologists may be
needed before this key step of evolution is fully elucidated.
What is next? Lots of hard work, but with new tools and new aims.
First,
we have to stay the
course and get the most precise representation of the genome that
we can: this is a matter of
filling the cracks, cleaning up the errors, and getting rid of the
uncertainties that plague each of
the analytical methods. Second, we need to see more genomes,
with each one giving us a
deeper insight into our own. Third, we need to learn how
to take advantage of this book of life.
Tools for scanning the activity levels of genes in different cells,
tissues and settings are
becoming available and are already revolutionizing how we do biological
investigation. But we
will have to move back from the general to the particular, because
each gene is a story in itself
and its full significance can be learned only from concentrating
on its particular properties.
Fourth, we need to turn our new genomic information into an
engine of pharmaceutical
discovery. Individual humans differ from one another by about one
base pair per thousand.
These 'single nucleotide polymorphisms' (SNPs) are markers
that can allow epidemiologists to
uncover the genetic basis of many diseases. They can also provide
information about our
personal responses to medicines; in this way, the pharmaceutical
industry will get new
targets and new tools to sharpen drug specificity. Moreover, the
analysis of SNPs will provide
us with the power to uncover the genetic basis of our individual
capabilities such as
mathematical ability, memory, physical coordination, and even, perhaps,
creativity.
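As a rough sense of scale, a difference of about one base pair per thousand can be multiplied out over the genome. The sketch below assumes, purely for illustration, that the rate applies uniformly across the estimated 3.2-Gb genome and that one genome copy is compared with another; neither assumption is made in the text, and the names are hypothetical.

```python
# Estimated genome size quoted earlier in the text (3.2 Gb), in base pairs.
GENOME_BP = 3_200_000_000

# "About one base pair per thousand" differs between individuals.
BASES_PER_DIFFERENCE = 1_000


def expected_differences(genome_bp=GENOME_BP, bases_per_difference=BASES_PER_DIFFERENCE):
    """Return the approximate number of differing positions between two
    genome copies, ASSUMING the rate is uniform across the genome."""
    return genome_bp // bases_per_difference


if __name__ == "__main__":
    print(f"Roughly {expected_differences():,} differing positions")
```

On these assumptions, two genome copies would differ at roughly three million positions.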
Biology today enters a new era, mainly with a new methodology for
answering old questions.
Those questions are some of the deepest and simplest: "Daddy, where
did I come from?";
"Mommy, why am I different from Sally?". As these and other questions
get robust answers,
biology will become an engine of transformation of our society.
Instead of guessing about how
we differ one from another, we will understand and be able to tailor
our life experiences to our
inheritance. We will also be able, to some extent, to control that
inheritance. We are creating a
world in which it will be imperative for each individual person
to have sufficient scientific
literacy to understand the new riches of knowledge, so that we can
apply them wisely.
References
1. International Human Genome Sequencing Consortium, Nature 409, 860-921 (2001).
2. Watson, J. D. & Crick, F. H. C., Nature 171, 737-738 (1953).
3. http://genome.cse.ucsc.edu/
4. Venter, J. C. et al., Science 291, 1304-1351 (2001).
5. Aach, J. et al., Nature 409, 856-859 (2001).
6. Birney, E., Bateman, A., Clamp, M. E. & Hubbard, T. J., Nature 409, 827-828 (2001).
Additional Reference:
1. "Selective Control of DNA Helix Openings During Gene Regulation".