"Molecular Portraits of Human Breast Tumours".
Charles M. Perou1*, Therese Sorlie2*, Michael B. Eisen1, Matt van de Rijn3, Stefanie S. Jeffrey4, Christian A. Rees1, Jonathan R. Pollack5, Douglas T. Ross5, Hilde Johnsen2, Lars A. Akslen6, Oystein Fluge7, Alexander Pergamenschikov1, Cheryl Williams1, Shirley X. Zhu3, Per E. Lonning8, Anne-Lise Borresen-Dale2, Patrick O. Brown5,9 & David Botstein1
1 Department of Genetics, Stanford University School of
Medicine, Stanford, California 94305, USA
2 Department of Genetics, The Norwegian Radium Hospital,
N-0310 Montebello Oslo, Norway
3 Department of Pathology, Stanford University School
of Medicine, Stanford, California 94305, USA
4 Department of Surgery, Stanford University School of
Medicine, Stanford, California 94305, USA
5 Department of Biochemistry, Stanford University School
of Medicine, Stanford, California 94305, USA
6 Department of Pathology, The Gade Institute, Haukeland
University Hospital, N-5021 Bergen, Norway
7 Department of Molecular Biology, University of Bergen,
N-5020 Bergen, Norway
8 Department of Oncology, Haukeland University Hospital,
N-5021 Bergen, Norway
9 Howard Hughes Medical Institute, Stanford University
School of Medicine, Stanford, California 94305, USA
*These authors (C.M.P. and T.S.) contributed equally to this work.
Correspondence and requests for materials should be addressed to D.B. (e-mail: botstein@genome.stanford.edu) or P.O.B. (e-mail: pbrown@cmgm.stanford.edu).
Human breast tumours are diverse in their natural history and in their responsiveness to treatments [1]. Variation in transcriptional programs accounts for much of the biological diversity of human cells and tumours. In each cell, signal transduction and regulatory systems transduce information from the cell's identity to its environmental status, thereby controlling the level of expression of every gene in the genome. Here we have characterized variation in gene expression patterns in a set of 65 surgical specimens of human breast tumours from 42 different individuals, using complementary DNA microarrays representing 8,102 human genes. These patterns provided a distinctive molecular portrait of each tumour. Twenty of the tumours were sampled twice, before and after a 16-week course of doxorubicin chemotherapy, and two tumours were paired with a lymph node metastasis from the same patient. Gene expression patterns in two tumour samples from the same individual were almost always more similar to each other than either was to any other sample. Sets of co-expressed genes were identified for which variation in messenger RNA levels could be related to specific features of physiological variation. The tumours could be classified into subtypes distinguished by pervasive differences in their gene expression patterns.
We proposed that the phenotypic diversity of breast tumours might be accompanied by a corresponding diversity in gene expression patterns that we could capture using cDNA microarrays. Systematic investigation of gene expression patterns in human breast tumours might then provide the basis for an improved molecular taxonomy of breast cancers. We analysed gene expression patterns in grossly dissected normal or malignant human breast tissues from 42 individuals (36 infiltrating ductal carcinomas, 2 lobular carcinomas, 1 ductal carcinoma in situ, 1 fibroadenoma and 3 normal breast samples). Fluorescently labelled (Cy5) cDNA was prepared from mRNA from each experimental sample. We prepared cDNA, labelled using a second distinguishable fluorescent nucleotide (Cy3), from a pool of mRNAs isolated from 11 different cultured cell lines (see Supplementary Information Table 1); this common 'reference' sample provided an internal standard against which the gene expression of each experimental sample was compared [2, 3].
Twenty of the forty breast tumours examined were sampled twice, as part of a larger study on locally advanced breast cancers (T3/T4, and/or N2 tumours; see ref. [4]). After an open surgical biopsy to obtain the 'before' sample, each of these patients was treated with doxorubicin for an average of 16 weeks (range 12-23), followed by resection of the remaining tumour. In addition, primary tumours from two patients were also paired with a lymph node metastasis from the same patient. To help interpret the variation in expression patterns seen in the tumour samples, we also characterized 17 cultured cell lines (with one cell line cultured under three different conditions), which provided models for many of the cell types encountered in these tissue samples. In total, we analysed 84 cDNA microarray experiments (see Supplementary Information, Table 2; the primary data tables can be obtained at http://genome-www.stanford.edu/molecularportraits/).
A hierarchical clustering method
was used to group genes on the basis of similarity in the pattern with
which their expression varied over all samples [5].
The same clustering method was used to group the experimental samples (cell
lines and tissues separately) on the basis of similarity in their patterns
of expression. We focus first on a set of 1,753 genes (about 22% of the
8,102 genes analysed), whose transcripts varied in abundance by at least
fourfold from their median abundance in this sample set in at least three
of the samples (Fig. 1; see Supplementary
Information Fig. 4 for the complete cluster diagram).
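The selection rule just described can be illustrated with a short Python sketch. It is not the authors' code: the function name `select_variable_genes`, the dictionary layout and the toy ratios are all assumptions made for illustration, and the comparison is done on log2 ratios on the further assumption that a fourfold change in either direction counts.

```python
import math
from statistics import median

def select_variable_genes(ratios, fold=4.0, min_samples=3):
    """Return genes whose ratio differs from the gene's own median by at
    least `fold` (up or down) in at least `min_samples` samples.

    `ratios` maps a gene name to a list of expression ratios, one per
    sample; None marks a missing measurement (hypothetical layout).
    """
    threshold = math.log2(fold)
    selected = []
    for gene, values in ratios.items():
        logs = [math.log2(v) for v in values if v is not None and v > 0]
        if len(logs) < min_samples:
            continue  # too few usable measurements to qualify
        centre = median(logs)
        hits = sum(1 for x in logs if abs(x - centre) >= threshold)
        if hits >= min_samples:
            selected.append(gene)
    return selected

# Toy data (invented): only gene_A deviates fourfold in three samples.
example = {
    "gene_A": [1.0, 1.0, 1.0, 8.0, 8.0, 0.125, 1.0, 1.0, 1.0],
    "gene_B": [1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "gene_C": [1.0, 1.0, 1.0, 16.0, 16.0, None, 1.0, 1.0, 1.0],
}
print(select_variable_genes(example))  # ['gene_A']
```

Working in log space keeps a fourfold increase and a fourfold decrease equidistant from the median, which is one reasonable reading of "at least fourfold from their median abundance".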
Figure 1. Variation in expression of 1,753 genes in 84 experimental
samples. Data are presented in a matrix format; each row represents a single
gene, and each column an experimental sample. In each sample, the ratio
of the abundance of transcripts of each gene to the median abundance of
the gene's transcript among all the cell lines (right panel), is represented
by the colour of the corresponding cell in the matrix. Green squares, transcript
levels below the median; black squares, transcript levels equal to the
median; red squares, transcript levels greater than the median; grey squares,
technically inadequate or missing data. Colour saturation reflects the
magnitude of the ratio relative to the median for each set of samples (see
scale, bottom left; and Supplementary Information,
Fig. 4).
(a). Dendrogram representing similarities in
the expression patterns between experimental samples. All 'before and after'
chemotherapy pairs that were clustered on terminal branches are highlighted
in red; the two primary tumour/lymph node metastasis pairs in light blue;
the three clustered normal breast samples in light green. Branches representing
the four breast luminal epithelial cell lines are shown in dark blue; breast
basal epithelial cell lines in orange; the endothelial cell lines in dark
yellow; the mesenchymal-like cell lines in dark green; and the lymphocyte-derived
cell lines in brown.
(b). Scaled-down representation of the 1,753-gene
cluster diagram; coloured bars to the right identify the locations of the
inserts displayed in (c)-(j).
(c). Endothelial cell gene expression cluster;
(d). stromal/fibroblast cluster;
(e). breast basal epithelial cluster;
(f). B-cell cluster;
(g). adipose-enriched/normal breast;
(h). macrophage;
(i). T-cell;
(j). breast luminal epithelial cell.
Three striking features of the gene expression patterns of these tumours are evident in Fig. 1. First, the tumours show great variation in their patterns of gene expression. Second, this variation is multidimensional; that is, many different sets of genes show mainly independent patterns of variation. Third, these patterns have a pervasive order reflecting relationships among the genes, relationships among the tumours and connections between specific genes and specific tumours.
The hierarchical clustering algorithm organizes the experimental samples only on the basis of overall similarity in their gene expression patterns; these relationships are summarized in a dendrogram (Fig. 1a), in which the pattern and length of the branches reflect the relatedness of the samples [5]. Fifteen of the twenty before and after doxorubicin pairs (red dendrogram branches), and both primary tumour/lymph node metastasis pairs (light blue branches) were clustered together on terminal branches in the dendrogram; that is, despite an interval of 16 weeks, independent surgical procedures and cytotoxic chemotherapy, independent samples taken from the same tumour were in most cases recognizably more similar to each other than either was to any of the other samples. In three instances (Norway 47, 61 and 101), the 'after' chemotherapy specimens clustered in a branch of the dendrogram that also contained the three normal breast samples; we know from the clinical data that these tumours were 3 of the 20 tumours that were classified as doxorubicin 'responders' (data not shown). An analysis of the relationship between gene expression and correlations with clinical data will be reported elsewhere (T.S. et al., manuscript in preparation).
The 'molecular portraits' revealed in the patterns of gene expression not only uncovered similarities and differences among the tumours, but in many cases pointed to a biological interpretation. Variation in growth rate, in the activity of specific signalling pathways, and in the cellular composition of the tumours were all reflected in the corresponding variation in the expression of specific subsets of genes. The largest distinct cluster of genes within the 1,753-gene cluster diagram was the 'proliferation cluster' (Supplementary Information Fig. 5), which is a group of genes whose levels of expression correlate with cellular proliferation rates [3, 6]. Expression of this cluster of genes varied widely among the tumour samples, and was generally well correlated with the mitotic index. As one might expect, this cluster also included the genes encoding two widely used immunohistochemical markers of cell proliferation (Ki-67 and PCNA).
Several groups of co-expressed genes provided views of the activities of specific signalling and/or regulatory systems. A large cluster of genes regulated by the interferon pathway (including STAT1) showed substantial variation in expression among the tumours, as was previously observed in a smaller set of breast tumours [6]. Variation in expression of the oestrogen receptor gene (ER) correlated well with the direct clinical measurement of the ER protein levels in the tumours (Supplementary Information Table 3; concordance in 36/38 samples), and paralleled variation in the expression of a larger group of genes that included three other transcription factors (GATA-binding protein 3 (refs [7, 8]), X-box binding protein 1 and hepatocyte nuclear factor 3α). HER2/neu, also known as Erb-B2, is overexpressed in 20-30% of all breast tumours, usually associated with DNA amplification of the Erb-B2 locus [9, 10]. Notably, most of the other genes contained within the Erb-B2 cluster were located in this same region of chromosome 17, and were also amplified on the genomic DNA level (ref. [10]; and J.R.P., unpublished data). Finally, a cluster of genes that included c-Fos and JunB co-varied in expression among the tumour specimens. We have found that this subset of genes is characteristically induced by prolonged handling of the samples after surgical resection (M.v.d.R. and C.M.P., unpublished data).
Human breast tumours are histologically complex tissues, containing a variety of cell types in addition to the carcinoma cells [11]. In analysing the gene expression patterns in solid human tumours, we used two lines of reasoning to infer the lineage of the cells that accounted for the apparently cell-type-specific expression of particular clustered groups of genes. First, such clusters included genes whose expression patterns have been previously characterized and that consistently pointed to a specific cell type. Second, these inferences were often corroborated by comparable expression of the same cluster in one or more of the cultured cell lines. Thus, eight independent clusters of genes appeared to reflect variation in specific cell types present within the tumours (Fig. 1c-j).
(1) Endothelial cells: a cluster of genes characteristically expressed by endothelial cells, including CD34, CD31 and von Willebrand factor, was also strongly expressed in the two endothelial cell lines HUVEC and HMVEC (Fig. 1c).
(2) Stromal cells: a previously characterized cluster of genes that included several isoforms of collagen showed significant variation in expression among samples (Fig. 1d) [3, 6].
(3) Adipose-enriched/normal breast cells: a cluster of genes including fatty-acid binding protein 4 and PPARγ may represent the presence of adipose cells (Fig. 1g).
(4) B lymphocytes: variation in expression of a cluster of genes that were highly expressed in the multiple myeloma-derived cell line RPMI-8226, including many immunoglobulin genes, appears to represent variable B-cell infiltration (Fig. 1f).
(5) T lymphocytes: a cluster of genes including CD3 and two subunits of the T-cell receptor was highly expressed in the T-cell leukaemia-derived cell line MOLT-4 and probably indicates T-cell infiltrates (Fig. 1i).
(6) Macrophages: a cluster of genes that appeared to be markers of macrophage/monocytes included CD68, acid phosphatase 5, chitinase and lysozyme (Fig. 1h).
Two distinct types of epithelial
cell are found in the human mammary gland: basal (and/or myoepithelial)
cells and luminal epithelial cells [12]. These two
cell types are conveniently distinguished immunohistochemically; basal
epithelial cells can be stained with antibodies to keratin 5/6 (Fig.
2a), whereas luminal epithelial cells stain with antibodies against
keratins 8/18 (Fig. 2b). Many genes were expressed
by one of these two cell lineages, but not by the other (Fig. 1e
and j). The gene expression cluster characteristic of basal epithelial
cells included keratin 5, keratin 17, integrin-β4 and laminin (Fig.
1e) [11]. The gene expression cluster characteristic
of the luminal cells was anchored by the previously noted cluster of transcription
factors that included ER (Fig. 1j).
Figure 2. Breast tissue immunohistochemistry.
(a). Normal mammary duct using antibodies against
the basal keratins 5/6.
(b). Normal mammary duct using antibodies against
the luminal keratins 8/18 (adjacent tissue sections were used in (a) and
(b)).
(c). Tumour Stanford 16 using antibodies against
keratins 8/18.
(d). Tumour New York 3 using antibodies against
keratins 5/6.
One goal of this study was to develop a system for classifying
tumours on the basis of their gene expression patterns. The subset of genes
shown in Fig. 1 was not necessarily optimal for this
purpose, as the choice of genes whose expression levels provided the basis
for the ordering of the tumour samples determined which phenotypic relationships
among the tumours were reflected in the clustering patterns. We therefore
selected an alternative subset of genes to use as the basis for a new clustering
analysis.
The rationale behind this alternative
gene subset was that specific features of a gene expression pattern that
are to be used to classify tumours should be similar in any sample taken
from the same tumour, and they should vary among different tumours. The
22 paired samples provided a unique opportunity for a deliberate and systematic
search for such genes. From the genes whose expression was well measured
in the 65 tissue samples, we selected a subset of 496 genes (termed the
'intrinsic' gene subset) that consisted of genes with significantly greater
variation in expression between different tumours than between paired samples
from the same tumour (see Supplementary Information).
When variation in expression of this set of genes was used to order the
tissue samples (Fig. 3; and Supplementary
Information Fig. 6), 17 of the 20 'before and after' doxorubicin pairs were
grouped together, as were both of the tumour/lymph node metastasis pairs.
Qualitatively similar sample clustering patterns were obtained when a second
gene subset that focused on genes expressed by epithelial cell types, and
which had only 25% overlap with the intrinsic gene subset, was used (data
not shown).
Figure 3. Cluster analysis using the 'intrinsic' gene subset. Two
large branches were apparent in the dendrogram, and within these large branches
were smaller branches for which common biological themes could be inferred.
Branches are coloured accordingly: basal-like, orange; Erb-B2+,
pink; normal-breast-like, light green; and luminal epithelial/ER+, dark
blue.
(a). Experimental sample associated cluster dendrogram. Small black
bars beneath the dendrogram identify the 17 pairs that were matched by this
hierarchical clustering; larger green bars identify the positions of the
three pairs that were not matched by the clustering.
(b). Scaled-down representation of the intrinsic cluster diagram
(see Supplementary Information Fig. 6).
(c). Luminal epithelial/ER gene cluster.
(d). Erb-B2 overexpression cluster.
(e). Basal epithelial cell associated cluster containing keratins
5 and 17.
(f). A second basal epithelial-cell-enriched gene cluster.
The division of the tissue samples into two subgroups was a striking feature of the intrinsic gene subset cluster analysis (Fig. 3a). As a test of the robustness of this division, we applied the 'weighted voting' method [13]. This algorithm recapitulated the sorting of the tissue samples between these two subgroups for all but 1 of the 65 samples (data not shown). It is important to note, however, that there is extensive residual variation in expression patterns within each of these two broad subgroups. Indeed, many of the finer subdivisions probably have important biological properties (see below).
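For readers unfamiliar with weighted voting, the following Python sketch shows the scheme in one commonly described form: each gene gets a signal-to-noise weight and a decision boundary halfway between the two class means, and a sample is assigned by the sign of the summed votes. Whether ref. [13] and the analysis reported here used exactly these formulas is an assumption; the function names and toy values are invented for illustration.

```python
from statistics import mean, stdev

def train_weighted_voting(class1, class2):
    """Return one (weight, boundary) tuple per gene.

    class1 and class2 are lists of samples; each sample is a list of
    expression values in the same gene order. The weight is the
    signal-to-noise ratio (m1 - m2) / (sd1 + sd2); it is undefined
    (ZeroDivisionError) if a gene is constant in both classes.
    """
    model = []
    for g in range(len(class1[0])):
        x1 = [sample[g] for sample in class1]
        x2 = [sample[g] for sample in class2]
        m1, m2 = mean(x1), mean(x2)
        weight = (m1 - m2) / (stdev(x1) + stdev(x2))
        model.append((weight, (m1 + m2) / 2.0))
    return model

def predict(model, sample):
    """Sum the per-gene votes; a positive total means class 1, otherwise class 2."""
    total = sum(w * (x - b) for (w, b), x in zip(model, sample))
    return 1 if total > 0 else 2

# Toy training data (invented): gene 0 separates the classes, gene 1 does not.
group1 = [[5.0, 1.0], [6.0, 2.0], [7.0, 1.5]]
group2 = [[1.0, 1.2], [2.0, 1.8], [0.0, 1.4]]
model = train_weighted_voting(group1, group2)
print(predict(model, [6.5, 1.0]), predict(model, [0.5, 2.0]))  # 1 2
```

Because uninformative genes receive weights near zero, they contribute little to the total, which is why the vote is dominated by genes that separate the two groups cleanly.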
The two dendrogram branches in Fig. 3 largely separate the tumour samples into those that were clinically described as ER positive (blue) and those that were ER negative (other colours). The tumours in the ER+ group were characterized by the relatively high expression of many genes expressed by breast luminal cells (Fig. 3c). This connection was further corroborated using immunohistochemical analysis and antibodies against the luminal cell keratins 8/18 (Fig. 2c). With one exception, none of the tumours in this group expressed Erb-B2 at high levels (Fig. 3d).
Many of the genes characteristic of breast basal epithelial cells were also highly expressed in a group of six clustered tumours (Fig. 3e). To corroborate the 'basal-like' characteristics of these tumours, we carried out immunohistochemistry using antibodies against the breast basal cell keratins 5/6 and 17. All six of these tumours showed staining for either keratins 5/6 or 17 or both (Fig. 2d). Notably, these six tumours also failed to express ER and most of the other genes that were usually co-expressed with it (Fig. 3c). Breast tumours that stain positive for basal keratins have been described [14-16], and such tumours may account for 3-15% of all breast tumours [15, 17-19]; in this study, the incidence was 15% (6/40).
As mentioned above, overexpression of the Erb-B2 oncogene was associated with the high expression of a specific subset of genes. We identified a cluster of tumours that was partially characterized by the high level of expression of this subset of genes (Fig. 3d). These tumours also showed low levels of expression of ER [20, 21] and of almost all of the other genes associated with ER expression - a trait they share with the basal-like tumours.
Several tumour samples and the single fibroadenoma tested (Fig. 3, light green), were clustered with a group of samples that also contained the three normal breast specimens (Fig. 3a). The 'normal breast' gene expression pattern is typified by the high expression of genes characteristic of basal epithelial cells and adipose cells, and the low expression of genes characteristic of luminal epithelial cells.
The number of clearly different molecular phenotypes observed among the breast tumours suggests that we are far from having a complete picture of the diversity of breast tumours. When hundreds (instead of tens) of breast tumours have been characterized, a more defined tumour classification is likely, and statistically significant relationships with clinical parameters should be uncovered. We were, however, able to identify four groups of samples that might be related to different molecular features of mammary epithelial biology (that is, ER+/luminal-like, basal-like, Erb-B2+ and normal breast). An important implication of this study is that the clinical designation of 'oestrogen receptor negative' breast carcinoma encompasses at least two biologically distinct subtypes of tumours (basal-like and Erb-B2 positive), which may need to be treated as distinct diseases.
A striking conclusion from these data concerns the stability, homogeneity and uniqueness of the 'molecular portraits' provided by the quantitative analysis of gene expression patterns. We infer that these portraits faithfully represent the 'tumour' itself, and not merely the particular tumour 'sample', because we could recognize the distinctive expression pattern of a tumour in independent samples. The finding that a metastasis and primary tumour were as similar in their overall pattern of gene expression as were repeated samplings of the same primary tumour, suggests that the molecular program of a primary tumour may generally be retained in its metastases. Finally, we have explicitly discussed only a tiny fraction of the genes whose expression patterns varied among these tumours. Attention to the thousands of individual genes that define the molecular portraits of each tumour, and learning to interpret their patterns of variation, will undoubtedly lead to a deeper and more complete understanding of breast cancers.
Most of the techniques used in this work have been described elsewhere [2, 3, 22, 23] and detailed protocols
are available at:
http://cmgm.Stanford.EDU/pbrown/
The methods and protocols are also
included in the Supplementary Information, and the primary data tables
can be obtained at:
http://genome-www.stanford.edu/molecularportraits/
Supplementary information is available on Nature's World-Wide Web
site: http://www.nature.com/nature/journal/v406/n6797/suppinfo/406747a0.html
or as paper copy from the London editorial office of Nature.
Supplementary Information on Methods and Protocols:
cDNA clones and microarray production:
mRNA Isolations, Fluorescent cDNA Production and Hybridization:
Common Reference Sample:
Breast Tumor Pathology:
Microarray Data Analysis:
Selection of Genes for the 'Intrinsic' Gene Subset:
Supplementary Information Methods References:
Supplementary Information Figure Legends:
Figure 4:
Figure 5:
Figure 6:
Table 1:
Table 2:
Table 3:
cDNA Clones and Microarray Production.
The 8102 human cDNA genes/clones used in this study were obtained from Research Genetics (Huntsville, AL, USA) and were chosen from a set of 15,000 cDNA clones that corresponded to the Research Genetics Human Gene Filters sets GF200-202 (http://www.resgen.com/). This set of genes contained some redundancy (approximately 300 genes were printed more than once on each array) and contained approximately 4000 named genes, 2000 genes with homology to named genes in other species, and approximately 2000 ESTs of unknown function. In addition to these sequence-verified clones, a small set (396) of clones whose sequence had not been verified were also included; all of these clones were excluded from all analyses presented in the figures. All of the non-sequence-verified clones are identified in the primary data tables by the prefix SID (Stanford Identifier). The cDNA microarrays used in this study were made as previously described [1,2]. Detailed protocols are available at http://cmgm.stanford.edu/pbrown/array.html and http://cmgm.stanford.edu/pbrown/mguide/index.html. All 84 microarray experiments in this study were conducted using microarrays from a single printing.
mRNA Isolations, Fluorescent cDNA Production and Hybridizations.
Following their excision, breast tumor samples were rapidly frozen in liquid N2 and then stored at -80 °C until use. mRNA was isolated from breast tumors as described in Perou et al. [3], using the Trizol Reagent (Gibco-BRL) and Invitrogen FastTrack 2.0 Kit (all Stanford samples, and see http://genome-www.stanford.edu/sbcmp/web.shtml for the detailed protocol), or using the Trizol Reagent followed by Dynal bead separation for the mRNA purification step (all Norway tissue samples). Isolation of mRNA from cell lines, and all mRNA labeling reactions (1.5-2 micrograms of mRNA/reaction) and microarray hybridizations were performed as described in Perou et al. [3]. We identified a systematic difference in the R/G ratio of a subset of genes/cDNA elements that perfectly correlated with where the mRNA samples were prepared (i.e. Stanford or Norway). This "source" artifact affected a small subset of the cDNA elements and caused these genes/cDNAs to have a higher R/G ratio in one set of samples versus the other. The molecular cause of these results is unknown, but may be due to differences in the mRNA isolation protocols, or may reflect biological differences between the samples; at this time we cannot distinguish between these possibilities.
Common Reference Sample.
Each of the 84 experimental samples tested here was analyzed by a comparative hybridization, using a common "reference" mRNA pool as a standard; this reference sample was composed of equal mixtures of mRNA isolated from 11 established human cell lines (MCF7, Hs578T, OVCAR3, HepG2, NTERA2, MOLT4, RPMI-8226, NB4+ATRA, UACC-62, SW872, and Colo205: see Supplementary Information Table 1 for more details). The 11 cell lines were all grown to 70-90% confluence in RPMI medium containing 10% Fetal Calf Serum and Penicillin/Streptomycin. The cells were harvested either by scraping or centrifugation, and quickly resuspended in RNA lysis buffer, and mRNA was prepared as described in Perou et al. [3]. In each case, multiple individual mRNA preparations were collected for each cell line, which were then pooled together and analyzed via Northern analysis before final mixing to ensure the quality of the input mRNAs. The 11 mRNA samples were then mixed together in equal amounts, aliquoted in 10 mM Tris (pH 7.4), and stored at -80 °C until use. Two micrograms of common reference sample was used per microarray hybridization and was always labeled using Cy3.
Breast Tumor Pathology.
The 39 individual breast tumor samples and the single fibroadenoma used in this study were collected at either Stanford University in Stanford, CA, USA, or at the Haukeland University Hospital in Bergen, Norway. Twenty of the forty breast tumors analyzed here were sampled twice as part of a larger Norwegian study on locally advanced breast cancers (T3/T4 and/or N2 tumors) and have been described previously [4]; these patients underwent an open surgical biopsy before treatment with doxorubicin monotherapy (range 12-23 weeks), followed by the definitive surgical resection of the remaining tumor after therapy, and were evaluated for clinical responses according to UICC criteria [5]. In addition to the 20 pairs, there were 8 additional "before" specimens from Norway and 12 tissue specimens from Stanford (all Stanford tumors tested had a diameter of 3 cm or larger). Finally, 2 of the 10 Stanford tumor specimens assayed were also paired with a lymph node metastasis from the same patient. A single pathologist (M.v.d.R.) reviewed H&E sections of each tumor, including all before and after pairs, and made a histological evaluation of each while blinded to the source. Tumors were graded using a modified version of the Bloom-Richardson method [6]. These data are displayed in Supplementary Information Table 3 and a representative H&E section of each tumor is posted on our website at http://genome-www.stanford.edu/molecularportraits/. Immunohistochemistry was performed as described previously [3,7]; the antibodies used included CAM5.2 (specific for keratins 8/18, Becton Dickinson), anti-keratin 5/6 (Boehringer Mannheim), and anti-keratin 17 (Dako).
Microarray Data Analysis.
The cDNA microarrays were scanned with either a General Scanning (Watertown, MA) ScanArray 3000 at 20 micron resolution, or with a prototype Axon Instruments (Foster City, CA) GenePix Scanner at 10 micron resolution. The output files, which were TIFF images, were then analyzed using the program ScanAlyze (M. Eisen; available at http://www.microarrays.org/software). Fluorescent ratios and quantitative data on spot quality (see ScanAlyze manual) were stored in a prototype of the AMAD database (M. Eisen; available at http://www.microarrays.org/software). Areas of the array with obvious blemishes were manually flagged and excluded from subsequent analyses. The primary data tables can be downloaded at http://genome-www.stanford.edu/molecularportraits/, in text/tab delimited format after obtaining a password.
Hierarchical-clustering gene selection criteria for Figure 1 (Supplementary Materials Figure 4).
Data were extracted from the database in a single table, with each row representing an array element, each column a hybridization, and each cell the observed fluorescent ratio for the array element in the appropriate hybridization. This table had 9216 rows and 84 columns. Previously flagged spots were excluded, as were spots that did not pass the quality control ScanAlyze parameter of "%pixels > background of at least 0.55 in both the red and green channels". Array elements were also removed if they did not meet the above-mentioned "%pixel" quality control measure in at least 80% of the hybridizations analyzed. The data table was then split into tissues and cell lines, and the two subtables were separately median polished (the rows and columns were iteratively adjusted to have median 0) before being rejoined into a single table. Finally, we selected the subset of genes whose expression varied by at least 4-fold from the median in this sample set in at least three of the samples tested (1753 genes satisfied these conditions).
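The median polishing step can be sketched as follows. This is an illustrative Python version, not the code used in the study: it assumes the table holds log-transformed ratios with no missing values, and the function name `median_polish`, the iteration cap and the stopping tolerance are choices made here for the example.

```python
from statistics import median

def median_polish(table, max_iter=20, tol=1e-9):
    """Iteratively subtract row medians, then column medians, until every
    row and column median is (numerically) zero or max_iter is reached.

    `table` is a list of rows (genes) of equal length (samples).
    Returns a new table; the input is left untouched.
    """
    rows = [list(row) for row in table]
    n_cols = len(rows[0])
    for _ in range(max_iter):
        largest = 0.0  # largest median removed during this sweep
        for row in rows:
            m = median(row)
            largest = max(largest, abs(m))
            for j in range(n_cols):
                row[j] -= m
        for j in range(n_cols):
            m = median([row[j] for row in rows])
            largest = max(largest, abs(m))
            for row in rows:
                row[j] -= m
        if largest <= tol:
            break
    return rows

# Toy table (invented): three genes by three samples.
toy = [[1.0, 2.0, 3.0],
       [2.0, 3.0, 4.0],
       [4.0, 5.0, 9.0]]
polished = median_polish(toy)
print(polished)  # [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
```

In the toy table only the bottom-right value survives polishing, which shows the intent of the step: overall gene and sample offsets are removed and only the deviations specific to a gene-sample combination remain.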
We applied average-linkage hierarchical clustering, as implemented in the program Cluster (M. Eisen; http://www.microarrays.org/software), separately to both the genes and arrays. The results were analyzed, and figures generated, using TreeView (M. Eisen; http://www.microarrays.org/software).
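A compact illustration of average-linkage clustering is given below. It is a plain-Python sketch of the textbook procedure (repeatedly merge the two clusters with the smallest mean pairwise distance), using 1 minus the Pearson correlation as the distance; the Cluster program may differ in its similarity metric and in how it represents merged nodes, so this should be read as an assumption-laden teaching example. The sample names and values are invented.

```python
from itertools import combinations
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length lists (both must vary)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def average_linkage(profiles):
    """Cluster the named profiles; return the merges in order as
    (members_of_cluster_1, members_of_cluster_2, distance) tuples."""
    names = list(profiles)
    dist = {frozenset(p): 1.0 - pearson(profiles[p[0]], profiles[p[1]])
            for p in combinations(names, 2)}

    def between(c1, c2):
        # Average-linkage distance: mean over all cross-cluster pairs.
        return mean([dist[frozenset((a, b))] for a in c1 for b in c2])

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: between(*pair))
        merges.append((c1, c2, between(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Toy profiles (invented): two tumours, each sampled twice.
profiles = {
    "T1_before": [1.0, 2.0, 3.0, 4.0],
    "T1_after": [1.1, 2.0, 3.2, 3.9],
    "T2_before": [4.0, 3.0, 2.0, 1.0],
    "T2_after": [4.2, 2.9, 2.1, 0.8],
}
merges = average_linkage(profiles)
for left, right, d in merges:
    print(left, right, round(d, 3))
```

In this toy case the first two merges join each tumour's own pair of samples, the pattern that "clustered together on terminal branches" describes for the real before and after pairs.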
Selection of genes for the "intrinsic" gene subset (Figure 3 and Supplementary Materials Figure 6).
To select a set of genes from the entire 8102 gene set whose variation in expression optimally represented differences between tumors rather than just differences between tumor samples (i.e. the "intrinsic" gene subset used in Figure 3), we assigned a "within-between" score to each gene equal to the mean effect of the gene on the pairwise correlation coefficients of the 22 matched tumor pairs less the mean effect of the gene on the remaining 210 tumor-tumor pairwise correlation coefficients. The "effect" of a gene on a pairwise correlation was defined as the difference in the correlation coefficient with and without data for the gene included. Higher "within-between" scores indicated that the gene had a good tendency to group together paired samples. The 496 genes with a score one standard deviation above the mean score were selected and defined as the "intrinsic" gene subset. To confirm the existence of an "intrinsic" set of genes and to verify that the "within-between" score identified these genes, we examined the predictive quality of the score using a type of "leave-one-out" cross-validation analysis. The entire analysis was repeated 22 times, each with one of the 22 matched pairs completely removed from the analysis. If an "intrinsic" set of genes exists, and if the "within-between" score successfully identifies these genes, we would expect the genes with high scores in each reduced dataset to produce relatively high correlations in the excluded pair. Indeed, when the genes were sorted based on their "within-between" score in each reduced dataset, the correlation coefficient of the excluded matched pair in sliding windows of 250 genes increased progressively with increasing "within-between" score for nearly all of the matched pairs, while no such increase was found when randomly matched pairs were used.
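The "within-between" score can be made concrete with a small Python sketch that follows the definition given above: a gene's effect on a sample pair is the pairwise correlation computed with the gene minus the correlation computed without it, and the score is the mean effect over matched pairs minus the mean effect over all other pairs. The function names, data layout and toy values are assumptions for illustration, the choice of the sample standard deviation for the cutoff is a guess, and the brute-force loops are only suitable for tiny examples.

```python
from itertools import combinations
from statistics import mean, stdev

def pearson(x, y):
    """Pearson correlation of two equal-length lists (both must vary)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def within_between_scores(samples, matched_pairs):
    """Return one score per gene.

    `samples` maps a sample name to a list of expression values (same
    gene order everywhere); `matched_pairs` lists (name, name) tuples
    taken from the same tumour.
    """
    matched = {frozenset(p) for p in matched_pairs}
    all_pairs = list(combinations(samples, 2))
    full = {p: pearson(samples[p[0]], samples[p[1]]) for p in all_pairs}
    n_genes = len(next(iter(samples.values())))
    scores = []
    for g in range(n_genes):
        # Recompute every correlation with gene g left out.
        reduced = {name: v[:g] + v[g + 1:] for name, v in samples.items()}
        within, between = [], []
        for a, b in all_pairs:
            effect = full[(a, b)] - pearson(reduced[a], reduced[b])
            (within if frozenset((a, b)) in matched else between).append(effect)
        scores.append(mean(within) - mean(between))
    return scores

def select_intrinsic(scores):
    """Indices of genes scoring more than one SD above the mean score."""
    cutoff = mean(scores) + stdev(scores)
    return [g for g, s in enumerate(scores) if s > cutoff]

# Toy data (invented): gene 0 is stable within a tumour and differs between
# tumours; genes 1 and 3 flip between the two samples of the same tumour.
samples = {
    "A1": [3.0, 1.0, 0.0, -1.0],
    "A2": [3.0, -1.0, 0.0, 1.0],
    "B1": [-3.0, 1.0, 0.0, -1.0],
    "B2": [-3.0, -1.0, 0.0, 1.0],
}
scores = within_between_scores(samples, [("A1", "A2"), ("B1", "B2")])
print([round(s, 3) for s in scores], select_intrinsic(scores))
```

Gene 0 receives by far the highest score and is the only one selected, while the genes that differ between two samples of the same tumour score below zero.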
Supplementary Information Methods References
1. Ross, D. T. et al. Systematic variation in gene expression patterns in human cancer cell lines. Nat Genet 24, 227-235 (2000).
2. Alizadeh, A. A. et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503-511 (2000).
3. Perou, C. M. et al. Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc Natl Acad Sci U S A 96, 9212-9217 (1999).
4. Aas, T. et al. Specific P53 mutations are associated with de novo resistance to doxorubicin in breast cancer patients. Nat Med 2, 811-814 (1996).
5. Hayward, J. L. et al. Assessment of response to therapy in advanced breast cancer. Br J Cancer 35, 292-298 (1977).
6. Robbins, P. et al. Histological grading of breast carcinomas: a study of interobserver agreement. Hum Pathol 26, 873-879 (1995).
7. Bindl, J. M. & Warnke, R. A. Advantages of detecting monoclonal antibody binding to tissue sections with biotin and avidin reagents in Coplin jars. Am J Clin Pathol 85, 490-493 (1986).
Supplementary Information Figure Legends
Supplementary Information Figure 4.
The complete cluster diagram of 84 experimental samples versus 1753 genes (detailed in Figure 1), with all gene names included. The ratios of gene expression relative to the category-specific (cell line or tissue) median R/G ratio are shown. Green squares represent lower-than-median levels of gene expression; black squares represent median expression; red squares represent greater-than-median levels of expression; gray squares indicate insufficient or missing data. The color saturation also reflects the magnitude of the ratio (see scale at top right).
Supplementary Information Figure 5.
Close-up of the "proliferation" cluster subset of genes taken from the large 1753 gene cluster diagram presented in Supplementary Information Figure 4. Displayed above the enlargement on the right is a color representing the "mitotic grade" of each tumor as assayed by standard pathological methods. Genes displayed in pink lettering have previously been implicated in chromosomal instability, genes in blue encode known proliferation-associated antigens, and genes in orange encode known chemotherapy targets.
Supplementary Information Figure 6.
The complete 496 gene cluster diagram formed when using the "intrinsic" gene subset (detailed in Figure 3).
Supplementary Information Table 1.
A listing and description of the 11 cell lines used to create the common "reference" sample.
Supplementary Information Table 2.
A complete listing of the 84 experimental samples that were assayed versus the common "reference" sample.
Supplementary Information Table 3.
A listing of the tumors used in this study, along with a data table containing additional information about each tumor/patient.
1. Tavassoli, F. A. & Schnitt, S. J. Pathology of the Breast (Elsevier, New York, 1992).
2. Eisen, M. B. & Brown, P. O. DNA arrays for analysis of gene expression. Methods Enzymol. 303: 179-205 (1999).
3. Ross, D. T. et al. Systematic variation in gene expression patterns in human cancer cell lines. Nature Genet. 24: 227-235 (2000).
4. Aas, T. et al. Specific P53 mutations are associated with de novo resistance to doxorubicin in breast cancer patients. Nature Med. 2: 811-814 (1996).
5. Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA 95: 14863-14868 (1998).
6. Perou, C. M. et al. Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc. Natl Acad. Sci. USA 96: 9212-9217 (1999).
7. Yang, G. P., Ross, D. T., Kuang, W. W., Brown, P. O. & Weigel, R. J. Combining SSH and cDNA microarrays for rapid identification of differentially expressed genes. Nucleic Acids Res. 27: 1517-1523 (1999).
8. Hoch, R. V., Thompson, D. A., Baker, R. J. & Weigel, R. J. GATA-3 is expressed in association with estrogen receptor in breast cancer. Int. J. Cancer 84: 122-128 (1999).
9. Pauletti, G., Godolphin, W., Press, M. F. & Slamon, D. J. Detection and quantitation of HER-2/neu gene amplification in human breast cancer archival material using fluorescence in situ hybridization. Oncogene 13: 63-72 (1996).
10. Pollack, J. R. et al. Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nature Genet. 23: 41-46 (1999).
11. Ronnov-Jessen, L., Petersen, O. W. & Bissell, M. J. Cellular changes involved in conversion of normal to malignant breast: importance of the stromal reaction. Physiol. Rev. 76: 69-125 (1996).
12. Taylor-Papadimitriou, J. et al. Keratin expression in human mammary epithelial cells cultured from normal and malignant tissue: relation to in vivo phenotypes and influence of medium. J. Cell Sci. 94: 403-413 (1989).
13. Golub, T. R. et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286: 531-537 (1999).
14. Dairkee, S. H., Mayall, B. H., Smith, H. S. & Hackett, A. J. Monoclonal marker that predicts early recurrence of breast cancer. Lancet 1: 514 (1987).
15. Dairkee, S. H., Puett, L. & Hackett, A. J. Expression of basal and luminal epithelium-specific keratins in normal, benign, and malignant breast tissue. J. Natl Cancer Inst. 80: 691-695 (1988).
16. Malzahn, K., Mitze, M., Thoenes, M. & Moll, R. Biological and prognostic significance of stratified epithelial cytokeratins in infiltrating ductal breast carcinomas. Virchows Arch. 433: 119-129 (1998).
17. Guelstein, V. I. et al. Monoclonal antibody mapping of keratins 8 and 17 and of vimentin in normal human mammary gland, benign tumors, dysplasias and breast cancer. Int. J. Cancer 42: 147-153 (1988).
18. Gusterson, B. A. et al. Distribution of myoepithelial cells and basement membrane proteins in the normal breast and in benign and malignant breast diseases. Cancer Res. 42: 4763-4770 (1982).
19. Nagle, R. B. et al. Characterization of breast carcinomas by two monoclonal antibodies distinguishing myoepithelial from luminal epithelial cells. J. Histochem. Cytochem. 34: 869-881 (1986).
20. Berns, E. M. et al. Prevalence of amplification of the oncogenes c-myc, HER2/neu, and int-2 in one thousand human breast tumors: correlation with steroid receptors. Eur. J. Cancer 28: 697-700 (1992).
21. Heintz, N. H., Leslie, K. O., Rogers, L. A. & Howard, P. L. Amplification of the c-erb B-2 oncogene and prognosis of breast adenocarcinoma. Arch. Pathol. Lab. Med. 114: 160-163 (1990).
22. DeRisi, J. L., Iyer, V. R. & Brown, P. O. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278: 680-686 (1997).
23. Alizadeh, A. A. et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403: 503-511 (2000).
We thank W. Gerald and L. Norton for the three New York tumour specimens; M. Stampfer and P. Yaswen for the 184 sample mRNAs; and members of the P. O. Brown, D. Botstein and A.-L. Borresen-Dale labs for discussions. We are grateful to the NCI and the Howard Hughes Medical Institute who provided support for this research. C.M.P. is a SmithKline Beecham Pharmaceuticals Fellow of the Life Sciences Research Foundation. T.S. is a research fellow of the Norwegian Cancer Society. M.B.E. is an Alfred P. Sloan Foundation Postdoctoral Fellow in Computational Molecular Biology. D.T.R. is a Walter and Idun Berry Fellow. P.O.B. is an Associate Investigator of the Howard Hughes Medical Institute.
1. "Mated Models of Gene Regulation in Eukaryotes".
2. "Oncogenes as Molecular Targets within Active Chromatin".