Published ahead of print on February 23, 2006, doi:10.1165/rcmb.2004-0261OC
© 2006 American Thoracic Society
DOI: 10.1165/rcmb.2004-0261OC
Human Lung Project: Evaluating Variance of Gene Expression in the Human Lung
Division of Pulmonary Sciences and Critical Care Medicine and Section of Biometrics and Informatics, University of Colorado Health Sciences Center; and National Jewish Medical and Research Center, Denver, Colorado
Correspondence and requests for reprints should be addressed to Mark W. Geraci, M.D., University of Colorado Health Sciences Center, Division of Pulmonary Sciences and Critical Care Medicine, 4200 East Ninth Ave, C-272, Denver, CO 80262. E-mail: mark.geraci{at}uchsc.edu
Nondiseased tissue is an important reference for microarray studies of pulmonary disease. We obtained 23 single lungs from multiorgan donors at time of procurement. Donors varied in age, sex, smoking history, and ethnicity. Lungs were dissected into upper and lower lobe peripheral sections for RNA extraction. Microarray analysis was performed using Affymetrix Hu-133 Plus 2.0 arrays. We observed that the relative variability of gene expression increased rapidly from technical (lowest), to regional, to population (highest). In addition, age and sex have measurable effects on gene expression. Gene expression variability is heterogeneously distributed among biologic categories. We conclude that gene expression variability is greater between individuals than within individuals and that population variability is the most important factor in the study design of microarray experiments of the human lung. Classes of genes with high population variability are biologically important and provide a novel perspective into lung physiology and pathobiology. Our study represents the first comprehensive analysis of nondiseased lung tissue. The generation of this robust dataset has important implications for the design and implementation of future comparative expression analysis with pulmonary disease states.
Key Words: lung; microarray; genomics; variability
The use of gene expression microarrays as a high-throughput means to obtain qualitative and quantitative expression profiles on thousands of gene transcripts has revolutionized the field of translational medicine. Gene expression profiling has become a powerful tool in the armamentarium of clinical lung cancer research as a means to define clinical subtypes (1–3), prognosis (4–6), molecular biomarkers (7–9), and novel therapeutic interventions (10, 11). As a result of the application of this technology to lung cancer, the use of expression profiling can be widely applied to other noncancer-related lung diseases. Recently, gene expression profiling has been used to provide insights into the pathogenesis of idiopathic pulmonary fibrosis (12–16), primary pulmonary hypertension (17, 18), smoking-related lung disease (19, 20), acute respiratory distress syndrome (21, 22), asthma (23, 24), and cystic fibrosis (25, 26). Once used primarily to investigate changes in gene expression within in vitro models such as cell culture or clonal cell populations, microarray technology is increasingly applied to case-control models to derive gene expression patterns as descriptors of pathobiology or clinical outcomes. However, fundamental knowledge of the extent, nature, and sources of gene expression variation within nondiseased individuals is lacking. This lack of nondiseased comparative tissue can result in selection bias, confounding patient groups and Type I and Type II statistical errors. The majority of human microarray studies comparing diseased with nondiseased tissues published in the medical literature neither describe nor adequately characterize the comparative control population. Therefore, a better understanding of the normal variation between and within individuals including covariables such as age and sex will advance the use of microarray technology as a powerful investigational tool.
Human microarray studies in nondiseased states have been used to analyze variations in gene expression in peripheral blood (27), retina (28, 29), cornea (30), brain (31), kidney (32), and muscle (33, 34). From these studies and others (35–37), it is increasingly apparent that tissue heterogeneity, intrinsic host factors, and sample processing have direct, measurable effects on gene expression. Additionally, microarray studies of human muscle (34) and retina (29) have demonstrated variations in gene expression specifically related to age and sex. A major challenge of microarray expression profiling and bioinformatics is to maximize true discovery while limiting false discovery. Although fundamental in concept, the immense amount of expression data generated from a single microarray experiment often yields hundreds of "differentially expressed" transcripts that may represent normal biologic variability between samples, tissue sample heterogeneity, or technical variability from tissue processing or the array platform and not a true pathobiologic discovery between comparative groups.
In this study, we characterize the human lung transcriptome for the first time using microarray expression profiling. We analyze all major sources of variation in gene expression in postmortem human lung samples with particular focus on technical, anatomic, and individual variability. Additionally, we provide a novel description of the expression variability in the nondiseased lung. These results expand our understanding of normal human gene expression variation and pulmonary physiology and have an important impact on the future design of case-control microarray experiments involving the human lung. This robust "control-tissue" database is publicly available as a resource to the research community for future comparative analyses.
Human Subjects
Single whole-lung samples from 23 individuals were obtained from Tissue Transformation Technologies (Edison, NJ) (Table 1). All individuals suffered brain death and were evaluated for organ transplantation before research consent. Informed consent was obtained at the time of transplant evaluation. All specimens failed regional lung selection criteria for transplantation. Reasons listed for failure to transplant include age (41%), smoking history (5%), "quality" (14%), gas exchange (9%), size (9%), and inability to match (23%). For study inclusion, individuals had to demonstrate no evidence of active infection or chest radiographic abnormalities, mechanical ventilation < 48 h, PaO2/FiO2 ratio > 200, and no past medical history of underlying lung disease or systemic disease that involves the lungs (e.g., rheumatoid arthritis, systemic lupus erythematosus). Patients with mild asthma not requiring the regular use of inhaled β-agonists were included. Lung samples were procured within 34 h after brain death (mean, 16.2 h; range, 4.5–33.25 h). After resection, the lungs were insufflated with preservation solution and transported on ice to our laboratory. All samples were received within 28 h after procurement (mean, 16.7 h; range, 9–28 h). Upon receipt, each lung was dissected into upper and lower lobes and central (< 5 cm from mainstem bronchus) and peripheral (< 5 cm from pleura) sections. The samples were flash frozen in liquid nitrogen and stored at −80°C for further analysis. The study was approved by the National Jewish Medical and Research Center Institutional Review Board (IRB protocol #NJC HS-1539).
Tissue Processing
Frozen lung tissue (30–50 mg) was homogenized, and total RNA was extracted using the MiniElute protocol (Qiagen, Valencia, CA). Total RNA was quantified by spectrophotometer, and RNA quality was assessed by Bioanalyzer (Agilent Technologies, Palo Alto, CA). For study inclusion, RNA samples were required to meet the Tumor Analysis Best Practices Working Group quality standards (38). Only those RNA samples that showed intact 18S and 28S ribosomal RNA chromatographs and optical density 260/280 ratios of 1.8–2.1 were used for further analysis.
Tissue Histology
Microarray Analysis of Human Lung Gene Expression
Experimental Design
Regional lung variability. Microarray data were generated from total RNA isolated from upper lobe and lower lobe peripheral regions from eight individuals. The resulting 16 paired upper lobe and lower lobe arrays comprise the regional comparison dataset.
Population variability. Microarray data from 21 of the 23 individual lower lobes were analyzed to investigate relationships between age, sex, and overall gene expression variability (samples 4827 and 4878 were excluded due to histologic abnormalities). The age comparison comprised individuals with age > 60 yr (n = 6; age, 68.7 ± 4.3 yr) or < 40 yr (n = 7; age, 28.7 ± 7.1 yr) matched for sex and cumulative smoking history (pack-years). The sex comparison comprised male (n = 11; age, 46.4 ± 19.7 yr) and female (n = 10; age, 53.2 ± 14.9 yr) individuals matched for age and cumulative smoking history.
Sample Characteristics
Table 1 displays the characteristics of the lung samples that underwent microarray analysis. In total, 36 microarrays were completed on 23 individuals (age range, 21–74 yr; mean, 50.1 yr). Eleven of the patients were men. Seventy-eight percent (18/23) of individuals were Caucasian, 17% (4/23) were African American, and 4% (1/23) were Hispanic. All individuals developed brain death, with 50% (11/23) resulting from cerebrovascular accident. Thirty percent (7/23) died from head trauma, and 22% (5/23) died from other causes, including intracranial hemorrhage, malignancy, hydrocephalus, and anoxia. All lungs were received as whole, single lung specimens. Sixty percent of the lung samples (14/23) were obtained from the right side, and the remaining 40% (9/23) were from the left side. Smoking histories, including current smoking status and cumulative pack-years, were available in a majority of cases (Table 1). Thirty-five percent (8/23) of patients were current smokers, 30% (7/23) were former smokers, and 35% (8/23) were nonsmokers.
Microarray Analysis
Regional variability. Sixteen upper lobe and lower lobe peripheral microarrays from eight individuals were analyzed using a paired t test. Unsupervised hierarchical clustering of these data is shown in Figure 2A, which demonstrates that an individual's upper and lower lobe samples are more closely related to each other than to those of another individual, with the exception of samples 20008 and 20017, which did not cluster together. Additional permutations of the data confirm this discordant clustering to be a true finding and not a result of the applied clustering algorithm. Upon review of the tissue handling, RNA quality, chip quality, histologic analysis, and available clinical data, we have not identified an explanation for this observed dissimilarity. There was no observed clustering related to age, sex, or smoking status.
Figure 2B shows the results of the paired upper versus lower lobe class comparison, illustrated by an overabundance graph as described by Kaminski and colleagues (43). This graph compares the number of genes observed over a range of P value scores (observed discovery) with what would be expected under the matching null hypothesis (chance discovery). The comparison of observed discovery to chance discovery yields a global assessment of true or significant discovery between upper and lower lobes. In Figure 2B, the comparison of paired upper and lower lobe samples demonstrates that, for any given P value, all observed differences between the groups can be explained by chance. In other words, there is no significant difference in gene expression between upper and lower lobes from the same individual.
We also compared left lung (n = 6) and right lung (n = 12) by a two-sample t test (data not shown) and found no statistically significant differences in gene expression between the left and right lungs within the population.
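The observed-versus-chance comparison behind an overabundance graph can be sketched in a few lines. This is a minimal illustration, not the procedure of Kaminski and colleagues (43): it assumes that per-gene P values have already been computed, and that under the null hypothesis they are uniformly distributed, so the expected number of genes at or below a threshold is the number of genes times that threshold. The function name `overabundance` and the example P values are hypothetical.

```python
from bisect import bisect_right


def overabundance(p_values, thresholds):
    """Return (threshold, observed, expected) for each P value threshold.

    observed: number of genes with P <= threshold.
    expected: number of genes expected by chance if P values were uniform
              on [0, 1] under the null hypothesis (an assumption of this sketch).
    """
    ordered = sorted(p_values)
    n_genes = len(ordered)
    rows = []
    for threshold in thresholds:
        observed = bisect_right(ordered, threshold)  # count of P <= threshold
        expected = n_genes * threshold
        rows.append((threshold, observed, expected))
    return rows


# Hypothetical per-gene P values, for illustration only.
example_p_values = [0.0004, 0.003, 0.02, 0.04, 0.11, 0.27, 0.35, 0.52, 0.68, 0.91]
for threshold, observed, expected in overabundance(example_p_values, [0.001, 0.01, 0.05, 0.5]):
    print(f"P <= {threshold}: observed {observed}, expected by chance {expected:.2f}")
```

When the observed counts track the expected counts across thresholds, as reported for the upper versus lower lobe comparison, the differences are consistent with chance.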
Population Variability
Age. Individuals with age > 60 yr (n = 6; age, 68.7 ± 4.3 yr) or < 40 yr (n = 7; age, 28.7 ± 7.1 yr) were compared by a two-sample t test. Results are presented as an overabundance graph in Figure 4A. The lower lobe expression from an individual matched for sex and cumulative smoking history was used for comparative analysis. Results show that there are numerous genes differentially expressed between the age groups. Figure 4B illustrates that at P value < 0.001, there are 40 signature genes discriminating between the two age groups. The blue coloring in Figure 4B represents relatively low expression, and the red coloring represents relatively high expression. Although permutation analysis does not support this gene list as statistically robust (P = 0.086), the overabundance graph (Figure 4A) supports a conclusion that differential expression exists between the lungs of older and younger individuals (12, 16, 43–45).
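For a single probe set, a two-group permutation t test of the kind referred to above can be sketched as follows. This is an illustrative sketch only and is not the analysis software used in the study: it assumes a Welch-type t statistic, enumerates every relabeling of the pooled samples exactly (feasible for groups of 6 and 7, which give 1,716 splits), and uses made-up log2 expression values. The function names and data are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch t statistic for two independent samples (unequal variances)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation P value for the difference between two groups.

    Every way of relabeling the pooled values into groups of the original
    sizes is enumerated; the P value is the fraction of relabelings whose
    |t| is at least as large as the observed |t|. Assumes no relabeling
    produces two constant groups (which would give a zero denominator).
    """
    pooled = list(group_a) + list(group_b)
    indices = range(len(pooled))
    observed = abs(welch_t(group_a, group_b))
    hits = 0
    total = 0
    for chosen in combinations(indices, len(group_a)):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        if abs(welch_t(a, b)) >= observed - 1e-12:  # tolerance for float ties
            hits += 1
    return hits / total


# Hypothetical log2 expression values for one probe set, for illustration only.
older = [8.9, 9.4, 9.1, 9.8, 9.3, 9.6]          # n = 6
younger = [8.2, 8.7, 8.5, 9.0, 8.4, 8.8, 8.6]   # n = 7
print(f"permutation P value: {permutation_p_value(older, younger):.4f}")
```

Because the P value comes from the relabelings themselves, it does not rely on the expression values being normally distributed, which matters for small subgroups.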
Gene Variability Analysis
To investigate the variability in gene expression, we examined the expression value for 20,669 probe sets across 21 lower lobe microarray experiments and computed the SD of each gene's expression across individuals. To identify classes of genes with significantly higher- or lower-than-expected variability of expression, we used MAPPFinder 2.0 (42) to compare the observed distribution of variability in genes associated with each particular GO term to the overall distribution of variability. GO categories over-represented in the upper and lower deciles are listed in Tables E1 and E2 in the online supplement, and two large representative categories are contrasted with the overall variability distribution in Figure 5. Results show that gene expression variability is heterogeneously distributed among biological categories (Tables E1 and E2), with the greatest amount of expression variability related to immune function and immune-related processes. Conversely, the least amount of variability in gene expression was observed in areas of cell metabolism and cellular maintenance functions.
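The first step of this analysis, computing a per-probe-set SD across individuals and picking out the lowest and highest deciles, can be sketched as below. The GO term over-representation step performed with MAPPFinder is not reproduced here. The function name, the probe identifiers, and the expression values are hypothetical, and the sketch assumes expression values are supplied as one list per probe set.

```python
from statistics import stdev


def variability_deciles(expression):
    """Rank probe sets by SD across individuals and return the extreme deciles.

    expression: dict mapping probe set ID -> list of expression values,
                one value per individual (at least two individuals).
    Returns (lowest_decile_ids, highest_decile_ids).
    """
    sd_by_probe = {probe: stdev(values) for probe, values in expression.items()}
    ranked = sorted(sd_by_probe, key=sd_by_probe.get)  # least to most variable
    decile_size = max(1, len(ranked) // 10)
    return ranked[:decile_size], ranked[-decile_size:]


# Hypothetical data: 10 probe sets measured in 3 individuals, with spread
# increasing from probe_1 to probe_10.
example = {f"probe_{i}": [8.0 - 0.1 * i, 8.0, 8.0 + 0.1 * i] for i in range(1, 11)}
least_variable, most_variable = variability_deciles(example)
print("least variable:", least_variable)
print("most variable:", most_variable)
```

In the study itself, the probe sets in each extreme decile would then be tested for over-represented GO categories.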
Gene expression microarray profiling has become a common methodologic tool in the field of molecular biology and medical research. Once limited to tightly controlled in vitro studies, this technology has rapidly advanced into more complex biologic systems. Although the use of microarray technology is widespread in the study of human disease, there remain many unanswered questions regarding normal variation in human gene expression between individuals. In addition, the relative contributions of the array platform, anatomic regional sampling, intrinsic tissue sample composition, and sample processing to overall experimental variability may pose challenges to true pathobiologic discovery. Several studies have tried to address the issues of tissue variability and population variability in nondiseased human tissues such as blood (27), cornea (30), retina (28, 29), and muscle (33–35). Although most studies suggest that technical variability is small (35, 46, 47), there remain conflicting reports (48). Therefore, we sought to analyze these major sources of gene expression variability while describing the natural biologic variability in the nondiseased human lung, a major organ focus of interest in our laboratory.
Reproducibility is a critical factor in gene expression profile experiments. Similar to other published reports (35, 46, 47), we demonstrate that technical variability within the Affymetrix oligonucleotide microarray platform contributes minimally to overall experimental variability. Within our technical replicate experiment, the greatest source of variability was observed within the individual gene/probe set expression with small relative contributions from different regional locations and chip replicates (Figure 1). These results confirm the findings of others (35, 46, 47) and support the conclusion that the use of RNA replicates in the Affymetrix platform adds little to the precision of the overall analysis.
Therefore, in studies projected to have a large sample size, technical replicates may not be necessary.
In studies of other organs, tissue variability, in the context of sample heterogeneity and regional distance, has demonstrable effects on gene expression. Bakay and colleagues investigated intraindividual muscle biopsy variations in gene expression and found that the greatest source of variability was between different regions of the same individual's biopsy, highlighting the importance of cell-type composition on expression differences (35). Likewise, Whitney and colleagues demonstrated that variations in gene expression patterns in peripheral blood can be traced to differences in the relative proportions of specific blood cell types (27). We initially analyzed intraindividual tissue replicates and lobar replicates for variability in gene expression. Tissue replicates consisted of duplicate sections of the same anatomic region that underwent separate probe synthesis and array hybridization. Lobar replicates were "central" (within 5 cm of the mainstem bronchus) or "peripheral" (within 5 cm of the pleura) sections taken from the same lobe in the same individual and were subjected to an identical array protocol. Preliminary analysis from our small comparative groups (n = 3 and n = 6, respectively) suggested that the greatest expression variability was between individuals in the population as compared with within-individual intralobar regions. Given our concern regarding the potential for inadvertent central airway sampling and the clinical knowledge that different lung diseases preferentially involve different anatomic regions, we focused on investigating peripheral tissue variability within distinct anatomic regions (namely, upper lobe versus lower lobe). We found that within an individual, differentially expressed gene transcripts between upper and lower lobes are observed (data not shown). This finding was individually consistent within the population.
However, when we compared upper lobe with lower lobe over the entire population using a parametric paired t test, we found that there were no significant differences in gene expression between the groups. These findings support the conclusion that there are differentially expressed genes across the upper and lower lobes of an individual; however, these differentially expressed genes are not consistently observed across the population. Thus, to minimize regional tissue sample variability in the human lung, study design with accrual of an appropriate sample size is important.
Although it seems logical that case-control microarray expression experiments should match individuals for such covariates as age and sex, most reported array studies comparing diseased and nondiseased groups neither describe nor characterize the comparative control population. Several microarray studies in nondiseased human retina (28), brain (31), and muscle (34) have identified age- and sex-associated expression patterns within their study populations. In the present study, we demonstrate global differences in gene expression between older and younger subjects (Figure 4) matched for sex and cumulative smoking history that approach statistical significance. Given the significant age-related variation in gene expression demonstrated in our overabundance analysis, our t test analysis is underpowered to detect statistically significant differences among the older and younger groups. Furthermore, the use of array analysis modeling and, in particular, results of the age-related overabundance plot support our finding that age-related differences in gene expression in the human lung are worth further investigation (16, 43–45). We observed no significant global difference in gene expression when we compared age-matched and cumulative smoking history–matched male subjects with female subjects (Figure 3). However, when we focused on highly statistically significant differences, we noted that
Given the limitations in our sample size for the subgroup analyses of age and sex, it is possible that the gene expression values for many genes are not normally distributed across the population. To investigate the impact of this on our age group comparison, we repeated our analysis using a nonparametric approach. Figure 4B illustrates 40 probe sets with a P value < 0.001 by parametric analysis. Repeating the analysis with a univariate permutation t test yields 37 probe sets (data not shown).
As expected from the false discovery rate analysis, the parametric test and the univariate permutation test correspond on approximately half the genes meeting the P < 0.001 threshold (see Figures E1 and E2 and Table E3).
Looking across the population, we observed that the variability in gene expression was heterogeneously distributed among biological categories (Tables E1 and E2 and Figure 5). We found the greatest amount of gene expression variability to be within immune function and immune-related processes. The least amount of variability in gene expression was observed in areas of cell metabolism and cellular maintenance functions. These observations have important implications for future microarray study designs because the inherent variability within the gene categories of interest strongly affects the sample size required to measure a given effect size within each respective category.
Some of the variability observed in gene expression, particularly with respect to inflammatory and immune processes, may reflect the fact that all of the patients suffered brain death. As evidenced by a large body of transplantation research, brain death has neurohumoral, metabolic, and inflammatory effects on the host (49–52). It is likely that this physiologic state, in addition to inherent differences in the nondiseased lung across the study population, underlies the variability in expression observed for immune function genes. A second potential limitation is that all patients required mechanical ventilation. Although there was no evidence of documented infection, severe gas exchange abnormality, or chest radiograph abnormality in any of the samples, the use of mechanical ventilation may have demonstrable effects on gene expression (53). Lastly, sample processing, handling, postmortem interval, and ischemia have measurable effects on gene expression.
These effects have been demonstrated in several studies of various nondiseased human tissue types, including blood (27), muscle (33), intestinal mucosa (36), and brain (37). Li and colleagues suggest that in postmortem brain, tissue samples may be more vulnerable to these processing effects in vitro based on the clinical course and host capability to respond to proteolytic and metabolic stress in vivo (37). In our study, we obtained tissue samples within an average of 17 h of procurement, and all samples were expeditiously processed in the same manner by the same investigator (G.P.C.). Given that samples were obtained from different hospitals by different surgical teams, we are limited as to standardized procurement handling and processing procedures.
Our work represents the first attempt to describe the human lung transcriptome in a clinically nondiseased state. Our data provide novel insight into the natural variability in the human lung and describe the relative contributions of all major sources of gene expression variability. Our results show that population variability makes the greatest contribution to overall variability. Thus, adequate sample size is of paramount importance in study design. Regional variability within different anatomic locations in the lung may be significant when comparing only a small group of microarrays. This finding has important implications for the design of future comparisons with specific lung disease states because it suggests that different peripheral anatomic regions can be compared between populations if the comparative groups are sufficiently large. Sample size estimates for comparative analyses cannot be generated from our database because the appropriate study sample size for any microarray expression study depends on the inherent variability of the gene or genes of interest and on the size of the effect that is anticipated.
We also demonstrate that the oligonucleotide microarray platform contributes little to the observed gene expression variability, and thus the inclusion of RNA replicates within the study design can be avoided in favor of expanded sample numbers. This comprehensive human lung database is available to the research community for its ongoing use as a control dataset for future comparative analyses.
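The dependence of sample size on gene-level variability and anticipated effect size, noted above, can be made concrete with a standard textbook approximation for comparing two group means. This is a sketch under stated assumptions and is not a calculation from the study: it uses the normal-approximation formula n per group = 2 × ((z for 1 − alpha/2 plus z for power) × SD / effect)², treats the SD as known and equal in both groups, and considers a single gene with no adjustment for multiple testing. The function name, default alpha and power, and the example numbers are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(sd, effect, alpha=0.001, power=0.8):
    """Approximate per-group sample size for a two-sided, two-sample comparison.

    sd:     assumed SD of the gene's expression across individuals
    effect: difference in mean expression to be detected (same units as sd)
    alpha:  two-sided significance threshold for this single gene
    power:  desired probability of detecting the effect
    Uses the normal approximation; small-sample t corrections are ignored.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)


# Hypothetical comparison: the same 1.0 log2-unit effect in a low-variability
# gene (SD 0.3) versus a high-variability gene (SD 1.0).
for label, sd in [("low-variability gene", 0.3), ("high-variability gene", 1.0)]:
    print(f"{label}: about {samples_per_group(sd, effect=1.0)} samples per group")
```

Because the required number of samples scales with the square of the SD, a highly variable category such as immune-related genes would call for far more samples than a stable category to detect the same difference.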
This work was supported by NHLBI grant R01 HL 72340-01.
The complete set of gene expression data has been deposited in the Gene Expression Omnibus database (www.ncbi.nlm.nih.gov/geo/) accession #GSE1643.
This article has an online supplement, which is accessible from this issue's table of contents at www.atsjournals.org
Originally Published in Press as DOI: 10.1165/rcmb.2004-0261OC on February 23, 2006
Conflict of Interest Statement: None of the authors has a financial relationship with a commercial entity that has an interest in the subject of this manuscript.
Received in original form August 12, 2004
Accepted in final form February 17, 2006