Gerardo Jimenez-Sanchez*, Barton Childs* & David Valle*†
* Department of Pediatrics, McKusick-Nathans Institute of Genetic
Medicine, and
† Howard Hughes Medical Institute, Johns Hopkins University School
of Medicine, Baltimore, Maryland 21205, USA
Correspondence should be addressed to D.V. (e-mail: dvalle@jhmi.edu).
To test the proposal that classifying disease genes and their products according to function will provide general insight into disease processes [1, 2], we have compiled and classified a list of disease genes. To assemble the list, we began with 269 genes identified in a survey of the 7th edition of Metabolic and Molecular Bases of Inherited Disease [2]. We then searched the 'morbid map' and allelic variants listed in the Online Mendelian Inheritance in Man [3] (OMIM), an online resource documenting human diseases and their associated genes (http://www.ncbi.nlm.nih.gov), and increased the total disease gene set to 923. This sample included genes that cause monogenic disease (97% of the sample) and genes that increase susceptibility to complex traits. We excluded genes associated only with somatic genetic disease (such as non-inherited forms of cancer) and genes encoded by the mitochondrial genome.
We categorized each disease gene according to the function of its protein product (see Supplementary Information). Our approach differed in two ways from that used by the International Human Genome Sequencing Consortium (IHGSC) to annotate the working draft human sequence [4]. First, we focused on the function of the protein itself without consideration of its biological context, whereas the IHGSC used the classification employed by the Gene Ontology project [5], which integrates three aspects of function: biochemical activity, biological process and subcellular location. Second, our functional designations were largely informed by features of pathology, whereas those of the IHGSC were almost entirely based on sequence homology to proteins of known function in model organisms. We also scored each disease gene for features related to clinical presentation, including age of onset, mode of inheritance, frequency, severity, extent of tissue involvement and association with malformations.
The results of our functional classification of the proteins encoded
by 923 disease genes are shown in Fig. 1a.
Figure 1: The functions of the protein products of disease
genes. a, The entire disease gene set. b–f, Disease genes stratified according
to the typical age of onset of the disease phenotype. The fraction of disease
genes encoding transcription factors in the in utero onset disorders
(25%) differs from the fraction encoding transcription factors for disorders
with onset after birth (6%; χ² = 49.4, P < 0.001).
Similarly, the fraction of disease genes encoding enzymes causing a disorder
with onset in the first year of life (47%) is different from the fraction
encoding enzymes causing disorders with other ages of onset (25.8%; χ²
= 35.8, P < 0.001).
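The chi-square statistics quoted in the Fig. 1 legend compare two proportions, which can be laid out as a 2 x 2 table (category of interest versus all other categories, in one age-of-onset group versus the rest). The sketch below shows one common way to compute such a statistic, the Pearson chi-square test without continuity correction. It is an assumption that this matches the exact test variant used here, and the counts in the example are hypothetical placeholders chosen only to mimic proportions of 25% and 6%; they are not the counts behind the published values and will not reproduce them.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table
    [[a, b], [c, d]]. Returns (statistic, p_value) for 1 degree of freedom."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts (NOT taken from the paper): genes encoding
# transcription factors versus all other genes, split by age of onset.
tf_in_utero, other_in_utero = 25, 75        # 25% transcription factors
tf_after_birth, other_after_birth = 48, 752  # 6% transcription factors

stat, p = chi_square_2x2(tf_in_utero, other_in_utero,
                         tf_after_birth, other_after_birth)
print(f"chi-square = {stat:.1f}, P = {p:.2e}")
```

With the real per-category gene counts (see Supplementary Information) substituted for the placeholders, the same function would give the corresponding statistic for either comparison in the legend.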
The largest functional category, comprising genes encoding enzymes,
accounts for 31.2% of the total. This represents about twice as many as
the next highest category, designated modulators of protein function (13.6%),
which includes proteins that stabilize, activate, fold or otherwise
influence the function of a second protein. Each of the remaining 12 categories
accounts for less than 10% of the total sample. The abundance of enzymes
in the disease gene set may reflect some historical bias towards metabolic
disorders in the study of human inherited disease [2].
In contrast, only 15% of 114 positionally cloned genes (updated from http://genome.nhgri.nih.gov/clone/)
encode enzymes, but this set may have its own biases (see Supplementary
Information). Indeed, protein domains associated with enzymes were
identified in 27% of 8,360 Drosophila
proteins scored for these motifs [6]. This observation
suggests that in higher eukaryotes the fraction of genes encoding enzymes
may be 25–30%, or close to the fraction identified in our disease gene
set.
Gene function and disease characteristics
We analysed the disease gene set for evidence of correlations between
the function of a gene product and the age of onset of its associated disease
(Fig. 1). Several aspects of this analysis are of
interest. First, diseases associated with genes encoding proteins in all
the functional categories can appear at any stage of life. The only apparent
exception is for diseases presenting after 50 years of age (Fig.
1f), but the sample of genes in this category is small, and a more general
distribution of protein function may emerge as the number increases. Second,
genes encoding transcription factors are over-represented among genes causing
genetic disease with onset in utero (Fig. 1b). This
concentration of diseases resulting from abnormalities of transcription
factors probably reflects the important role of these proteins in orchestrating
development. It is therefore not surprising that genes encoding transcription
factors account for more than 30% of the genes associated with
malformation phenotypes (see Supplementary
Information).
An extraordinarily high fraction of diseases with onset in the first
year of life are caused by defects in genes encoding enzymes (47%; Fig.
1c). This too fits with biological expectations and clinical evidence.
The developing fetus has access to its mother's metabolic homeostatic systems
through the placenta. Thus, infants with inborn errors caused by enzyme
deficiencies are typically normal at birth and develop symptoms only
after the defect in their homeostatic system is exposed by demands
on their own metabolism [7]. The fraction of disease
genes encoding enzymes falls with later disease onset (Fig.
1d–f). Disorders with onset after age 50 are an apparent exception,
with the fraction of genes encoding enzymes increasing to more than 33%.
But the number of disorders in this category (18) is small, and our understanding
of the genes that contribute to complex traits with onset in this age range
is limited. For three of the six genes encoding enzymes in this age-of-onset
category, the variants identified are susceptibility alleles rather than
true disease-producing alleles.
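To illustrate why a category of 18 is too small to support firm conclusions, the sketch below computes an approximate 95% confidence interval for the enzyme fraction in the over-50 group, taking 6 enzyme-encoding genes out of 18 from the text. The Wilson score interval is our choice for this illustration and is not a method used in the analysis above; the function name and the default z value are likewise assumptions of the sketch.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion.
    z = 1.96 corresponds to an approximate 95% confidence level."""
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need total > 0 and 0 <= successes <= total")
    p = successes / total
    z2 = z * z
    denom = 1 + z2 / total
    centre = (p + z2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total
                               + z2 / (4 * total * total)) / denom
    return centre - half_width, centre + half_width


# 6 of the 18 entries in the onset-after-50 category encode enzymes.
low, high = wilson_interval(6, 18)
print(f"observed fraction = {6 / 18:.1%}, "
      f"approx. 95% interval = {low:.1%} to {high:.1%}")
```

The interval spans several tens of percentage points, so the observed fraction is compatible both with a continued decline and with a rise in the share of enzyme-encoding genes at this age of onset.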
We divided the disease gene list by function
and compared disease characteristics including frequency, mode of inheritance,
age at onset and reduction in life expectancy. Figure 2 shows the results
for the four largest functional categories.
Figure 2 Characteristics of disease arranged by function
of the protein encoded by the disease gene. a, Disease genes encoding enzymes;
b, disease genes encoding modifiers of protein function; c, disease genes
encoding receptors; d, disease genes encoding transcription factors. The
columns of disease features are labelled at the top. AR, autosomal recessive;
AD, autosomal dominant; early adulthood, puberty to <50 years; late
adulthood, >50 years.
Interestingly, each of the four functional
categories has a different peak age at onset. For transcription factors
the peak is in utero; for enzymes it is in year 1; for receptors it is
between year 1 and puberty; and for modifiers of protein function it is
in early adulthood. These correlations provide biological support for the
validity of the functional characterization and they hint at additional
principles of disease. Perhaps disorders of
receptors are most likely to present in childhood because this is
a time of rapid growth and, especially during puberty, of intense signalling
activity between various cells and tissues. Similarly, disorders involving
modifiers of protein function may present later in life because the homeostatic
systems are not completely disrupted by these defects; rather, they respond
in ways that are less congruent with the demands placed on the organism
and so become symptomatic more gradually. Finally, there is no apparent
relationship between functional category and reduction in life expectancy.
This may reflect a true lack of correlation or it may indicate that larger
numbers and more sophisticated characterization of disease severity are
required to discern such relationships.
Better functional annotation of the human genome and a comprehensive list of human disease genes should lead to much greater integration of medicine and biology. We believe that increasing knowledge of the genes associated with diseases will allow researchers to address more complicated issues, including the relative contributions to disease of genes in the core biological set shared by all species and those encoding proteins specific to humans [8]; how sequence features (such as conservation and polymorphism) relate to disease characteristics; and how protein function relates to the outcome of clinical treatment [9].
Supplementary information
is available on Nature's World-Wide Web site
(http://www.nature.com)
or as paper copy from the London editorial office of Nature.
1. "Selective Control of DNA Helix Openings During Gene Regulation".
3. "The Sequence of the Human Genome".