1ARC Special Research Centre for Molecular and Cellular Biology, 2Department of Mathematics, University of Queensland, Brisbane, Queensland, Australia, 3Department of Evolutionary Biology, University of Copenhagen, Copenhagen, Denmark. (E-mail to J.S.M. at: j.mattick@cmcb.uq.edu.au).
In recent years, the discovery of many regulatory elements within introns, the recognition of the regulatory potential of intronic and other non-protein coding RNAs, and the concept of a cellular 'ribotype' resulting from differences in RNA processing in different cells and tissues have led to increasing interest in the role of introns in enhancing eukaryotic genetic complexity, via alternative splicing and as both the recipient and donor of cis-acting and trans-acting elements (1-4).
To explore the evolution and function of introns in eukaryotes, we have developed an intron sequence information system (ISIS; http://isis.bit.uq.edu.au/) which contains information on over 170,000 spliceosomal introns. Data in ISIS version 1 are based on intron-containing sequences from GenBank release 111. ISIS contains phylogenetic and protein homology categories, information about individual sequences and various bioinformatic analyses of taxonomic groupings of sequences using non-redundant subsets of the data. The database is searchable by Blast, GenBank attributes and elements that we have annotated within introns, and gives graphical views of gene structure and elements such as alternative coding regions, EST matches and repetitive sequences.
During analysis of this database, we found many EST matches within sequences annotated as introns, indicating that there are many previously unrecognized alternatively spliced exons, especially as many of these exons are conserved between species. Alternative splicing was first predicted by Walter Gilbert (5), and subsequently verified by the discovery of cDNA isoforms exhibiting the addition or exclusion of whole or partial exons (6,7), although identification of such splice variants has largely occurred on an ad hoc basis. The development of large human EST (partial cDNA) sequence libraries over recent years, however, provides an opportunity to examine the incidence of alternative splicing globally by searching these libraries for exon skipping, exon truncation or inclusion of sequences currently described as intronic.
We examined the incidence of unrecognized exons in introns in 2,698 non-redundant human genes in ISIS which contain at least one complete sequenced intron and two flanking exons. We excluded hypervariable immune-related genes and any genes with previously annotated transcripts in GenBank. After removal of known repetitive sequences in the introns, we identified 3,119 EST clusters from 1,122 genes (42%) containing sequences previously annotated as intronic. The presence of such sequences in the EST databases may be the result of genuine alternative splicing events or pre-mRNA contamination of preparations used to construct libraries. We therefore discarded any cases of whole-intron retention (although some may represent genuine splice variants (8,9)), as well as all other ambiguous cases in which the EST sequences were indistinguishable from the genomic sequence.
The remaining 209 clusters (186 genes) had unequivocal alternative splicing events, with cryptic exons, 5' exon extensions and 3' exon extensions occurring in roughly equal proportions.
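The event classes named here can be illustrated with a simple interval comparison. The following is a minimal sketch, not the procedure used to build ISIS: it assumes that each EST has already been aligned to the genomic sequence as a spliced alignment, that coordinates are half-open intervals on the plus strand, and that the 5' and 3' labels refer to the upstream and downstream ends of the annotated exon. The function name `classify_block` and the label strings are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), plus strand (assumption)


def classify_block(block: Interval, exons: List[Interval]) -> str:
    """Classify one block of a spliced EST alignment against annotated exons."""
    start, end = block
    # Annotated exons that share at least one base with the EST block.
    overlapping = [(s, e) for s, e in exons if start < e and s < end]

    if not overlapping:
        # The block lies entirely in sequence annotated as non-exonic.
        return "cryptic exon"
    if len(overlapping) > 1:
        # The block runs through a whole intron; such cases were discarded
        # in the text because they may reflect pre-mRNA contamination.
        return "whole-intron retention (discarded)"

    s, e = overlapping[0]
    if (start, end) == (s, e):
        return "annotated exon"
    if start < s and end == e:
        return "5' exon extension"
    if start == s and end > e:
        return "3' exon extension"
    if start >= s and end <= e:
        return "exon truncation"
    return "ambiguous (discarded)"


# Hypothetical three-exon gene and a few EST alignment blocks.
exons = [(100, 200), (500, 600), (900, 1000)]
for block in [(300, 350), (450, 600), (500, 650), (100, 600)]:
    print(block, classify_block(block, exons))
```

The sketch treats anything that does not fit a single clean category as ambiguous, mirroring the conservative filtering described in the text.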
We also examined the frequency of exon skipping and length variation by analysing EST sequences that crossed exon/intron boundaries, identifying 507 genes in which known exons were absent or altered in length relative to the GenBank annotation. Combining these two sets, we identified 582 different human genes showing unequivocal alternative splicing via exon insertion, extension, truncation or deletion. This represents 22% of the genes analysed, many of which exhibit more than one alternative splicing event, and reveals an unexpectedly high frequency of alternative splicing in genes not previously known to be alternatively spliced. Moreover, this analysis provides a very conservative estimate, given the fragmentary EST coverage, the 3' bias of most EST libraries and the removal of all ambiguous or indeterminate cases. Furthermore, our analysis did not detect any new exons or exon truncations/extensions at the 5' and 3' ends of transcripts, which are common (10,11).
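The percentages quoted in the text follow directly from the gene counts, as the short check below shows. The overlap figure is an inference rather than a number reported in the text: it assumes that the 582 genes are the union of the 186-gene and 507-gene sets, so that inclusion-exclusion gives the number of genes found by both approaches. The variable names are illustrative.

```python
# Gene counts taken from the text.
genes_total = 2698                 # non-redundant human genes analysed
genes_with_intronic_est = 1122     # genes with ESTs matching annotated introns
genes_insert_or_extend = 186       # cryptic exons and exon extensions
genes_skip_or_length = 507         # exon skipping or length variation
genes_combined = 582               # distinct genes across both sets

# Percentages as reported (42% and 22%).
pct_intronic = round(100 * genes_with_intronic_est / genes_total)
pct_combined = round(100 * genes_combined / genes_total)

# Assumption: genes_combined is the union of the two sets, so the overlap
# follows from inclusion-exclusion. This value is not stated in the text.
overlap = genes_insert_or_extend + genes_skip_or_length - genes_combined

print(pct_intronic, pct_combined, overlap)
```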
A recent comparison of 475 disease-associated human protein sequences with the human EST database (12) suggested that as many as one in three may be alternatively spliced, although because this study was not based on a data set of whole genomic DNA for each gene, the background of incomplete splicing/pre-mRNA contamination could not be fully assessed. Similar estimates were obtained for another sample of 392 genes (10). Given the incomplete EST and intron sequence coverage in the databases, we anticipate that the real frequency of alternative splicing in human genes will be much greater than we have been able to measure, suggesting that there may be several hundred thousand different mRNAs and protein isoforms produced in different cells and tissues at various stages of development and under different physiological conditions.
This complexity also suggests that microarray systems for analysing gene expression will need to be expanded beyond a single coding sequence per gene to include all possible alternative exons on the array, if we are to properly understand the genetic output and cellular circuitry of higher organisms.
Full details of the experimental methods and results are available (http://isis.bit.uq.edu.au/a_splicers.html).
We thank B. Huang and D. Kennedy for helpful comments and discussion.
1. Mattick JS, Curr. Opin. Genet. Dev. 4: 823-831 (1994).
2. Herbert A, and Rich A, Nature Genet. 21: 265-269 (1999).
3. Fire A, Trends Genet. 15: 358-363 (1999).
4. Chabot B, Trends Genet. 12: 472-478 (1996).
5. Gilbert W, Nature 271: 501 (1978).
6. Breitbart R, Andreadis A, and Nadal-Ginard B, Annu. Rev. Biochem. 56: 467-495 (1987).
7. Adams M, Rudner D, and Rio D, Curr. Opin. Cell Biol. 8: 331-339 (1996).
8. McKeown M, Annu. Rev. Cell Biol. 8: 133-155 (1992).
9. Smith CWJ, Patton JG, and Nadal-Ginard B, Annu. Rev. Genet. 23: 527-577 (1989).
10. Mironov A, Fickett J, and Gelfand M, Genome Res. 9: 1288-1293 (1999).
11. Gautheret D, Poirot O, Lopez F, Audic S, and Claverie JM, Genome Res. 8: 524-530 (1998).
12. Hanke J, et al., Trends Genet. 15: 389-390 (1999).