1 Department of Human Genetics,
2 Department of Pediatric Oncology, Emma Children's Hospital,
Academic Medical Center, University of Amsterdam, Post Office Box 22700,
1100 DE Amsterdam, Netherlands.
3 Bioinformatics Laboratory,
4 Neurozintuigen Laboratory,
5 Department of Clinical Epidemiology and Biostatistics,
Academic Medical Center, University of Amsterdam, Amsterdam, Netherlands.
6 Department of Pathology and Department of Genetics,
Duke University Medical Center, Durham, NC 27710, USA.
The chromosomal position of human genes is rapidly being established.
We integrated these mapping data with genome-wide messenger RNA
expression profiles as provided by SAGE (serial analysis of gene
expression). Over 2.45 million SAGE transcript tags, including 160,000
tags of neuroblastomas, are presently known for 12 tissue types.
We developed algorithms to assign these tags to UniGene clusters
and their chromosomal position. The resulting Human Transcriptome Map generates
gene expression profiles for any chromosomal region in 12 normal and pathologic
tissue types. The map reveals a clustering of highly expressed genes to
specific chromosomal regions. It provides a tool to search for genes that
are overexpressed or silenced in cancer.
GeneMap'99 (1) gives the chromosomal position of 45,049 human expressed sequence tags (ESTs) and genes belonging to 24,106 UniGene clusters. To obtain an expression profile of these genes, we made use of the SAGE technology and databases. SAGE can quantitatively identify all transcripts expressed in a tissue or cell line (2). It is based on the extraction of a 10-base pair (bp) tag from a fixed position in each transcript and the sequencing of thousands of these tags. Software programs and databases support the identification of the mRNAs corresponding to the tags in a SAGE library. However, this step is prone to errors, and tag assignment requires manual verification. The National Center for Biotechnology Information (NCBI) SAGEmap database has electronically extracted tags from mRNAs and ESTs in UniGene clusters. A manual check of 156 tags extracted from 30 UniGene clusters showed that wrong tags mainly stemmed from sequence errors in ESTs and from errors in their 5' and 3' orientations. We developed algorithms to select 3'-end clones of 713,489 ESTs assigned to UniGene clusters and identified their tags. Sequence comparison algorithms discarded tags caused by sequence errors while preserving tags from alternative transcripts or single nucleotide polymorphisms [see supplementary information for AMC tagmap details (3)]. We identified reliable tags for 18,954 of the 24,106 UniGene clusters mapped on GeneMap'99. Manual analysis of 287 tags extracted from 86 UniGene clusters from intervals of chromosomes 1 and 22 showed an error rate of 6.2% in our electronic tag identification algorithms. To check for errors in UniGene clustering, we verified tags on the available sequenced P1-derived artificial chromosomes (PACs) of the mapped markers and annotated them accordingly [see legend to Fig. 2 and supplementary information (3)].
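The tag extraction step described above can be illustrated with a minimal sketch. It is not the AMC tagmap algorithm: it assumes the common SAGE protocol in which the anchoring enzyme NlaIII cuts at CATG and the tag is the 10 bases directly downstream of the 3'-most site. The function name and the example sequence are hypothetical, and no sequence-error filtering is attempted.

```python
from typing import Optional

ANCHOR_SITE = "CATG"  # assumption: NlaIII anchoring enzyme recognition site
TAG_LENGTH = 10       # 10-bp tag, as stated in the text


def extract_sage_tag(transcript: str) -> Optional[str]:
    """Return the 10-bp tag following the 3'-most anchor site, or None.

    None is returned when the transcript has no anchor site or when
    fewer than 10 bases follow the 3'-most site.
    """
    seq = transcript.upper()
    pos = seq.rfind(ANCHOR_SITE)  # 3'-most occurrence of the anchor site
    if pos == -1:
        return None
    start = pos + len(ANCHOR_SITE)
    tag = seq[start:start + TAG_LENGTH]
    return tag if len(tag) == TAG_LENGTH else None


# Hypothetical 3'-end sequence with two CATG sites; only the last one is used.
example = "GGCATGAAACCCGGGTTTACATGTTTGGGCCCAAATTT"
print(extract_sage_tag(example))
```

Because the tag is read from the 3'-most site, an EST sequenced in the wrong orientation or carrying a sequence error near that site yields a wrong tag, which is the failure mode the manual check identified.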
The Human Transcriptome Map [for Web site, see (4)]
uses these tag assignments to relate 2.31 million tags in public SAGE
libraries (NCBI SAGEmap database) (5) and 160,000 tags
in our neuroblastoma SAGE libraries to the UniGene clusters mapped in GeneMap'99.
The Human Transcriptome Map shows expression profiles for any chromosomal
region in 12 tissue types. SAGE libraries of a specific tissue were combined
into tissue-specific libraries (e.g., normal colon). We included tissues
for which 100,000 or more tags were available, as most transcripts in a
tissue are represented in a library of this size (6).
Five libraries represent normal tissues (colon epithelium, brain,
mammary gland, ovary, and prostate), and seven libraries represent
tumor tissues (neuroblastoma, glioblastoma, medulloblastoma,
and carcinomas of colon, ovary, breast, and prostate). The
Human Transcriptome Map has three levels of resolution. The
"whole chromosome view" shows gene expression per chromosome
(Fig. 1).
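The pooling of SAGE libraries into tissue-specific libraries, and the 100,000-tag inclusion threshold, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the data layout (a list of tissue label and tag-count pairs) and both function names are hypothetical.

```python
from collections import Counter

MIN_TAGS = 100_000  # inclusion threshold stated in the text


def pool_by_tissue(libraries):
    """Sum the tag counts of all libraries that share a tissue label.

    `libraries` is a list of (tissue, Counter mapping tag -> count) pairs.
    """
    pooled = {}
    for tissue, counts in libraries:
        # Counter.update adds counts instead of overwriting them.
        pooled.setdefault(tissue, Counter()).update(counts)
    return pooled


def select_tissues(pooled, min_tags=MIN_TAGS):
    """Keep only tissues whose pooled library holds at least `min_tags` tags."""
    return {t: c for t, c in pooled.items() if sum(c.values()) >= min_tags}


# Hypothetical libraries: two normal colon libraries and one small library.
libs = [
    ("normal colon", Counter({"AAAAAAAAAA": 60_000})),
    ("normal colon", Counter({"AAAAAAAAAA": 30_000, "CCCCCCCCCC": 20_000})),
    ("kidney", Counter({"GGGGGGGGGG": 500})),
]
pooled = pool_by_tissue(libs)
kept = select_tissues(pooled)
print(sorted(kept))
```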
Fig. 1. Whole chromosome view of expression levels of the 1208 UniGene
clusters mapped to chromosome 11 on the GB4 radiation hybrid map
of GeneMap'99. Each unit on the vertical axis represents one UniGene cluster.
UniGene clusters mapped by several markers are only shown once, at the
position of the highest lod score (the logarithm of the odds ratio for
linkage). Only clusters for which we could extract a tag with our algorithms
are included. Expression is shown for SAGE libraries of 8 out of the 12
available tissue types. Expression levels in the libraries are normalized
per 100,000 tags. Expression levels from 0 to 15 tags are shown by horizontal
blue bars. Tag frequencies over 15 are shown by red bars. The blue-only
section to the right represents a moving median with a window size of 39
UniGene clusters generated from the expression levels in "all tissues."
Green bars indicate RIDGEs. The boxed region shows the tissue-specific
expression of a cluster of five metalloproteinases and two apoptosis inhibitors
in normal breast tissue and breast cancer tissue.
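The normalization and color coding described in the legend to Fig. 1 amount to a simple rescaling. The sketch below assumes raw tag counts and a known library size; the function names are hypothetical, and the threshold of 15 tags is taken from the legend.

```python
def normalize_per_100k(tag_count, library_size):
    """Rescale a raw tag count to a frequency per 100,000 tags."""
    if library_size <= 0:
        raise ValueError("library_size must be positive")
    return tag_count * 100_000 / library_size


def bar_color(level, threshold=15):
    """Blue for levels from 0 to 15 tags, red for frequencies over 15."""
    return "red" if level > threshold else "blue"


# Hypothetical example: a tag seen 32 times in a library of 160,000 tags.
level = normalize_per_100k(32, 160_000)
print(level, bar_color(level))
```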
The whole chromosome views reveal a higher order organization of the genome, as there is a strong clustering of highly expressed genes. Chromosome 11 has several large regions of high gene expression, interspersed with regions where gene expression is low (Fig. 1). This pattern is observed in all 12 tissues. Applying a moving median with a window size of 39 genes to the chromosome 11 map visualizes the expression differences even more clearly (Fig. 1, blue graph to the right). Most chromosomes show these clusters of highly expressed genes, which we call RIDGEs (regions of increased gene expression) (Fig. 3):
Fig. 3. Regional expression profiles for 23 human chromosomes show
a clustering of highly expressed genes in RIDGEs. Expression levels are
shown as a moving median with a window size of 39 genes. There are 74 regions
with one or more consecutive moving medians that have a lower limit of
four times the genomic median; 27 of them have a length of at least 10
consecutive moving medians (indicated by green bars).
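The moving median and the RIDGE criterion from the legend to Fig. 3 can be sketched as below. Several details are assumptions, since the text does not specify them: only full windows are evaluated (no padding at chromosome ends), the genomic median is passed in by the caller, and the function names are hypothetical.

```python
from statistics import median


def moving_median(levels, window=39):
    """Median of every full window of `window` consecutive expression levels."""
    if window < 1 or window > len(levels):
        return []
    return [median(levels[i:i + window])
            for i in range(len(levels) - window + 1)]


def find_runs(medians, genomic_median, factor=4.0, min_length=10):
    """Return (start, end) index pairs, end exclusive, of runs in which
    every moving median is at least `factor` times the genomic median
    and the run spans at least `min_length` consecutive moving medians.
    """
    threshold = factor * genomic_median
    runs, start = [], None
    for i, value in enumerate(medians):
        if value >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_length:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the chromosome.
    if start is not None and len(medians) - start >= min_length:
        runs.append((start, len(medians)))
    return runs


# Toy example with a small window so the result is easy to follow.
levels = [1] * 5 + [10] * 8 + [1] * 5
meds = moving_median(levels, window=3)
print(find_runs(meds, genomic_median=median(levels), min_length=5))
```

With `min_length=1` the same function would report every region counted among the 74; with the default of 10 it reports only the longer regions marked by green bars.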
Analysis of RIDGEs for physical characteristics suggests that many
of them have a high gene density. Chromosome 18 is, on average,
weakly expressed, and only 385 genes have been mapped to it
on GeneMap'99. The equally large chromosome 19 consists of a
succession of RIDGEs and harbors 937 mapped genes (Fig. 3). Although
many human genes are still unmapped, the difference in gene density of
chromosomes 18 and 19 is supported by CpG island density analyses (7). The
correlation between RIDGEs and gene density is even more suggestive
for chromosomes 3 and 6 (Fig. 4):
Fig. 4. Comparison of median gene expression levels and gene density
for chromosomes 3 and 6. The left diagrams of each chromosome show
the expression levels as a moving median with a window size of 39 UniGene
clusters. The right diagram of each chromosome shows gene density. For
each UniGene cluster, we calculated the average distance between adjacent
clusters in a window of 39 adjacent UniGene clusters. The inverse of this
value is shown (inverse centirays per gene).
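The gene density measure in the legend to Fig. 4 can be written as a short function. This sketch assumes that cluster positions are given in centirays and sorted along the chromosome, that only full windows are evaluated, and that a window of zero length is reported as infinite density; these choices and the function name are assumptions, not details from the text.

```python
def gene_density(positions_cr, window=39):
    """Inverse of the mean distance between adjacent clusters per window.

    `positions_cr` must be sorted map positions in centirays. The sum of
    the adjacent gaps in a window telescopes to (last - first), so the
    mean gap is that span divided by (window - 1).
    """
    if window < 2:
        raise ValueError("window must contain at least two clusters")
    densities = []
    for i in range(len(positions_cr) - window + 1):
        span = positions_cr[i + window - 1] - positions_cr[i]
        mean_gap = span / (window - 1)
        densities.append(float("inf") if mean_gap == 0 else 1.0 / mean_gap)
    return densities


# Toy example: clusters spaced 2 cR apart give a density of 0.5 per cR.
print(gene_density([0, 2, 4, 6, 8], window=3))
```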
The Human Transcriptome Map provides a tool to identify candidate genes that are overexpressed or silenced in cancer tissue. Neuroblastomas frequently show amplification of the distal chromosome 2p region, which targets the N-myc oncogene (10). Comparison of the whole chromosome views of chromosome 2p shows overexpression of two adjacent genes in neuroblastoma SAGE libraries. The extended interval view identifies these genes as N-myc and the often coamplified neighboring gene DDX-1 (Fig. 2). Therefore, global positional information of chromosomal defects is sufficient to identify candidate oncogenes (11). Also, tumor-specific down-regulation can be detected. Examples are a cluster of five matrix metalloproteinases on chromosome 11 [348 to 353 centirays (cR)] that are down-regulated in breast cancer tissue (Fig. 1, box); the E-cadherin tumor suppressor gene on chromosome 16 (406 cR) that is down-regulated in breast cancer tissue, as compared to normal breast tissue; and five carcinoembryonic antigen-related cell adhesion molecule genes on chromosome 19 (238 to 244 cR) that are down-regulated in colon carcinoma tissue, as compared to normal colon tissue (4).
Potential error sources in the Human Transcriptome Map are clustering errors in UniGene and the assignment of wrong tags to UniGene clusters. Our algorithms assign ~6.2% erroneous tags to UniGene clusters. The influence of these errors is probably attenuated. Assuming a total of 100,000 genes with 2 tags each, 200,000 tags would represent all human genes. Because there are >1 million variants of a 10-bp tag sequence, ~80% of the erroneously extracted tags will not match tags present in SAGE libraries and therefore will not influence overall expression profiles. However, individual tags and expression levels of UniGene clusters may harbor errors and require experimental confirmation. To test whether errors in UniGene clustering and mapping to GeneMap'99 may influence our observation of RIDGEs, we constructed a sequence-based expression map for the annotated chromosome 21 sequence and for a 4.3-Mb annotated contig of the MHC region on chromosome 6 (12,13). These maps also showed that the MHC region is a pronounced RIDGE, whereas chromosome 21 is devoid of RIDGEs and has an overall weak gene expression [see Web fig. 4 for maps (3)]. Therefore, the higher order structure of the genome observed with the Human Transcriptome Map will largely be correct. The existence of RIDGEs is unanticipated, as a comparable SAGE-based transcriptome map for yeast showed an even distribution over the genome of highly and weakly expressed genes (8). Because the Human Transcriptome Map identifies different types of transcription domains, it can now be analyzed as to how they relate to known nuclear substructures, such as nuclear speckles, PML bodies, and coiled bodies (14-16). Definition of the position of tags to the full chromosomal sequences will further increase the resolution of the transcriptome map. Incorporation of the growing number of SAGE libraries from different tissues and various developmental stages will extend the overview of gene expression profiles in the human body.
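The attenuation estimate given above (about 80% of erroneous tags never matching a library tag) can be reproduced with a short calculation. The sketch assumes, as a simplification not stated in the text, that erroneous tags are uniformly random 10-mers and that the 200,000 real tags are all distinct; the variable names are hypothetical.

```python
TAG_LENGTH = 10
possible_tags = 4 ** TAG_LENGTH  # 1,048,576 distinct 10-bp sequences

genes = 100_000      # total gene number assumed in the text
tags_per_gene = 2    # tags per gene assumed in the text
real_tags = genes * tags_per_gene  # 200,000 tags

# Chance that a random erroneous tag coincides with some real tag.
p_match = real_tags / possible_tags
p_no_match = 1 - p_match

print(f"{p_no_match:.1%} of erroneous tags match no real tag")
```

Under these assumptions roughly one erroneous tag in five could collide with a real tag, which is why individual expression values still call for experimental confirmation.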