Published ahead of print on August 28, 2003, doi:10.1165/rcmb.2003-0214OC
© 2004 American Thoracic Society DOI: 10.1165/rcmb.2003-0214OC Gene Expression Analysis in Response to Lung ToxicantsI. Sequencing and Microarray DevelopmentGlobal Research, Amersham Biosciences (SV) Corp., Sunnyvale; Department of Anatomy, Physiology, and Cell Biology, and Department of Molecular Biosciences, School of Veterinary Medicine, University of California Davis, Davis, California Address correspondence to: Michael A. Shultz, Dept. of Molecular Biosciences, School of Veterinary Medicine, University of California Davis, Davis, CA 95616. E-mail: mashultz{at}ucdavis.edu
A key challenge in measuring gene expression changes in the lung in response to site-selective toxicants is differentiating between target and nontarget areas. The toxicity for the cytotoxicant 1-nitronaphthalene is highly localized in the airway epithelium. Target cells comprise but a fraction of the total lung cell mass; measurements from whole lung homogenates are not likely to reflect what occurs at the target site. Additionally, the use of generic microarrays to measure expression in airway epithelium may not provide a good representation of transcripts present at the site of toxic action. cDNA libraries from airway and alveolar subcompartments of rat lung were sequenced for the development of a custom microarray representative of these lung regions. We identified 7,460 nonredundant rat lung sequences. Nearly 30% of the sequences on this array are not present on the Affymetrix Rat GeneChip 230. A 20,000-element microarray was developed that delineates differences in gene expression between subcompartments. This is the first in a series of articles employing this microarray for detecting gene expression changes during acute injury produced by 1-nitronaphthalene and subsequent repair.
Abbreviations: 1-nitronaphthalene, 1-NN ampicillin, amp Basic Local Alignment Search Tool, BLAST contiguous sequences, contigs coefficient of variation, CV expressed sequence tag, EST glyceraldehyde-3-phosphate dehydrogenase, GAPDH Luria Broth, LB polymerase chain reaction, PCR relative fluorescent unit, RFU sodium dodecyl sulfate, SDS saline-sodium citrate, SSC
The lung is a major target organ for numerous compounds that reach respiratory tissue both via the circulation and by inhalation. In some cases, selective injury to the lung is a consequence of unique exposure. For example, oxidant gases such as ozone result in selective injury to centriacinar regions of the lung (1). In other cases, the lung is selectively targeted because of its unique biochemical and physiologic functions. The selective lung injury caused by paraquat is related to the active transport of this chemical from the bloodstream resulting in high levels in lung tissue (2). Likewise, a number of compounds cause selective injury to pulmonary epithelial cells irrespective of the route of administration because they undergo metabolic activation by the cytochrome P450 monooxygenases, enzymes which are selectively localized in three of the more than 40 cell types found in the lungClara cells, type II cells, and macrophages (3). 1-Nitronaphthalene (1-NN) and various methylnitronaphthalenes are generated by gas phase atmospheric reactions which occur between N2O5 and the parent naphthalene/methylnaphthalene moiety (4). Parenteral administration of 1-NN to rats produces dose-dependent injury to epithelial cells of conducting airways (5) which appears to be related to the generation of electrophilic intermediates by cytochrome P450 monooxygenases (6). Although extensive literature detailing many of the steps involved in the initial metabolic activation and detoxication of bioactivated cytotoxicants including 1-NN exists (7), little is known about the downstream response of the lung to toxicants that produce injury to airway epithelial cells. The advent and refinement of approaches to analyze expression levels of thousands of genes has provided an important tool for identifying key pathways associated with various disease processes (8): for use as a diagnostic tool to determine tumor type, drug sensitivity, and prognosis (9); for monitoring changes in gene expression during the process of lung development (10); and for interrogating alterations associated with stressors including toxic chemicals (11). A considerable number of studies in toxicology have focused on using gene expression profiles to categorize hepatotoxicants. In general, these studies have demonstrated that necrotic agents can be differentiated from cholestatic agents as well as peroxisome proliferators based on their expression profiles (12). Far less attention, however, has focused on changes in gene expression in the lung associated with toxicant exposure, and much of this work has been conducted with in vitro models. Studies in isolated human bronchial cells have shown upregulation of genes associated with apoptosis at early times after exposure to cigarette smoke or H2O2 followed by induction of stress proteins, ubiquitin, and genes responsible for the deactivation of reactive oxygen species such as glutaredoxin and dihydrodiol dehydrogenase (13). Likewise, in vitro studies in malignant and nonmalignant human lung cell lines have shown that 2,3,7,8-tetrachlorodibenzo-p-dioxin exposure alters the expression of numerous genes associated with cell signaling, cell cycle, and xenobiotic metabolism (14). More recent work has used gene expression analysis to show that the NO-synthase inhibitor, N(G)-nitro-L-arginine methyl ester, attenuates many of the alterations in expression of cytokines and restores surfactant protein expression associated with nickel exposure in whole animals (15). The correlations of changes in gene expression with the severity of toxicity noted in this work clearly demonstrates the power of such studies in identifying key cellular changes associated with toxicity. However, none of the studies cited above provide a comprehensive view of the changes in gene expression that occur in target sites, such as conducting airways, at times which correspond to both the initial injury and repair phases associated with exposure to a lung toxicant. In the few cases where in vivo models have been used to study the response of the lung to external stimuli or to define changes occurring with development, target areas of the lung have not been separated from nontarget areas. The work reported here has focused on the development and validation of a custom microarray capable of monitoring changes in expression in both airways and parenchyma of rats exposed to 1-NN, a pulmonary cytotoxicant selective for airway epithelial cells. This array was developed by sequencing expressed genes obtained from airways and parenchyma of control and toxicant-treated rats. This approach ensured that low copy number transcripts from lower abundance cell types and from specific areas of the lung known to be susceptible to environmental toxicants were represented on the microarray. This work clearly demonstrates substantial differences in genes expressed in airways compared with parenchyma. This work also reports a method for identifying tissue-specific transcripts from less sequenced species, and demonstrates the value of building custom tissue-specific arrays for interrogating gene expression changes in the lung or other tissues. These studies are the first in a series designed to comprehensively evaluate changes in gene expression at early times after toxicant exposure where injury is developing as well as later time points where repair is occurring.
Animals Male Sprague-Dawley rats (150200 g) were purchased from Harlan (Indianapolis, IN). Animals were provided food (Purina Lab Chow) and water ad libitum and housed in HEPA-filtered cage racks in an AAALAC-accredited facility at the University of California, Davis for at least 5 d before use in an experiment.
cDNA and Plasmid Library Construction The quality of cDNA for all libraries was assessed after agarose gel electrophoresis. Lengths of synthesized cDNA ranged from 1005,000 bp. cDNA was size-selected for products > 400 bp by gel purification using the QIAEX II Gel Extraction System (Qiagen). Plasmid libraries were generated from each size-selected cDNA library by directional cloning (Eco RI/Not I) of cDNA into the multiple cloning site of pGEM-11Zf(-) (Promega, Madison, WI). Plasmid libraries were plated onto Luria Broth (LB) Agar containing ampicillin (amp), 5-bromo-4-chloro-3-indolyl-ß-D-galactoside, and isopropyl-ß-D-thiogalactopyranoside. White bacterial colonies (9,200 per library = 100 x 96-well microtiter plates with 92 colonies plus 4 blank negative controls) were picked and grown overnight at 37°C in 96-deep well culture plates containing 0.5 ml LB-amp. Glycerol stocks for 36,800 cDNA clones were prepared in duplicate for each 96-well plate grown and stored at -80°C.
cDNA Amplification and Sequencing PCR products were sequenced from the 5' end in 10 µl reactions containing 5 µl (1:10 dilution) PCR product, 1 µl T7 primer (5 pmoles/µl), and 4 µl DYEnamic ET Dye Terminator Reaction Mix (Amersham Biosciences). Mixtures were cycle sequenced 30 x (95°C for 15 s, 54°C for 15 s, 60°C for 60 s). Sequencing products were purified by ethanol precipitation and were separated by capillary electrophoresis using a MegaBACE 1000 DNA Sequencing System (Amersham Biosciences) and data analyzed using MegaBACE Sequence Analyzer software (Amersham Biosciences).
Sequence Data Analysis Sequence similarity searches for all contigs and singlets were performed against public GenBank databases (http://www.ncbi.nlm.nih.gov) using the Basic Local Alignment Search Tool (BLAST) (BLASTN 2.2.4, Aug-262002, NCBI) for identification of cDNA clones. Databases searched included those from the functionally annotated NCBI Reference Sequence project (RefSeq) databases for rat (rat.fna, Nov-192002, 4,023 sequences), mouse (mouse.fna, Nov-212002, 12,683 sequences), and human (human.fna, Nov-192002, 18,232 sequences) as well as the nt database (May-302002, 1,218,445 sequences).
cDNA Microarray Construction
PCR products (1:1 mixture with dimethyl sulfoxide) were arrayed into 384-well microarray spotting plates. PCR products and 1 control plate (see below) were spotted using a Lucidea Array Spotter (Amersham Biosciences) in normal mode (2 replicates; 200 µ row and column pitch; 0.9 1.1 mm slide thickness setting; 65% humidity) onto Type 7*reflective amino-silane coated glass slides (Amersham Biosciences). Resulting microarrays contained A control plate containing positive controls (rat ß-actin and glyceraldehyde-3-phosphate dehydrogenase [GAPDH]), negative controls (various yeast and Escherichia coli transcripts), and water blanks was incorporated into each spotted microarray. Positive controls were amplified using rat ß-actin and GAPDH control amplimer sets (Clontech, Palo Alto, CA). Spotting concentrations for ß-actin and GAPDH positive controls were 100, 50, 25, 12.5, 6.25, and 3.125 ng/µl. Spots at each concentration were represented 12 times per array. Spotting concentrations for negative controls (5 yeast and 15 E. coli transcripts) were 50 ng/µl, and each was represented 12 times per array. There were 384 water blanks spotted per array.
Probe Synthesis and Microarray Hybridization
For airway and parenchyma probes, RNA was isolated from microdissected subcompartments obtained from individual lungs of five untreated rats (n = 5). Detailed methods and validation of the quality of mRNA have been presented separately (17). Briefly, rats were exsanguinated, the trachea was exposed and cannulated, and lungs were removed from the body and inflated with RNAlater (Ambion, Austin, TX). Lungs were stored submerged in RNAlater at 4°C until airways and parenchyma were obtained by dissection. Total RNA was isolated using TRIzol (Invitrogen) followed by an RNeasy protocol with on-column DNase digestion (Qiagen). mRNA was isolated using OligoTex (Qiagen). Cy3- and Cy5-labeled cDNA probes were prepared from 1 µg mRNA using the CyScribe cDNA Post-Labeling Kit (Amersham Biosciences) following the manufacturer's protocol. Briefly, first-strand cDNA was synthesized (42°C, 2 h) incorporating amino-allyl dUTP, and mRNA template was removed by degradation under basic conditions. Single-stranded cDNA was purified using a QIAquick PCR Purification Kit (Qiagen). Equal aliquots of eluants were split ( Purified probes were matched (airway versus parenchyma from individual animals) based on total dye content (3050 pmoles of each), combined and evaporated in a vacuum centrifuge. Evaporated probes were resuspended in 30 µl hybridization solution (50% formamide, 50 ng/µl mouse Cot-1 DNA, 50 ng/µl polyA, 1x V2 Amersham hybridization buffer), placed on a microarray under a glass coverslip (pre-washed with ethanol and dried under an N2 stream), and hybridized 1618 h at 42°C in a humidified chamber. Following hybridization, coverslips were floated off the slide in 2x SSC/0.1% SDS (room temp), and slides were washed consecutively (with gentle shaking) in 2x SSC/0.1% SDS (55°C, 5 min), 1x SSC/0.1% SDS (55°C, 5 min), 0.1x SSC (3 x 20 dips, room temp), rinsed in water (20 dips), and dried under an N2 stream. Slides were scanned using a GenePix 4000B microarray scanner (Axon Instruments) (PMT 750 V for Cy3 and 775 V for Cy5), and images were processed using Lucidea Automated Spotfinder 1.0 software (Amersham Biosciences).
Microarray Data Analysis Normalization factors were calculated for each dye channel from each array by finding the median for all 50 and 100 ng/µl ß-actin positive control spots (24 spots total per dye channel). Signal intensities from either 50 or 100 ng/µl spotted ß-actin DNA did not differ significantly from each other. Normalization factors were applied to relative fluorescent unit (RFU) signals for all spots from each slide (Cy3 to Cy5 normalization). The normalization factors were also applied to RFU signals to normalize across slides for both dyes. For each animal, four normalized RFU signals (two duplicates/slide x two replicate slides) for either subcompartment were obtained per transcript, and their geometric means were calculated to yield one value per transcript per animal. Means and standard deviations for each lung subcompartment were calculated per transcript from the five individual animal geometric means of RFU signals. The method for determining significant gene expression differences between the airway and parenchyma RFU signals (n = 5) involved a two-sample unpaired t test with a two-tailed distribution assuming unequal variance. The conversion of RFU to % actin level was calculated using the mean RFU signal for all normalized ß-actin spots at 50 and 100 ng/µl for 10 hybridizations. Signal ratios (airway/parenchyma) also were calculated using the normalized RFU signals. Four ratios per animal (duplicates and inverse replicates) were obtained per transcript. Their geometric means were calculated to yield one ratio value per transcript per animal. Means and standard deviations were calculated from the geometric means of ratios obtained from five individual animals. To determine whether there was any dye bias effect introduced from using the secondary labeling procedure described in these studies, normalized hybridization data from airway samples labeled with Cy3 and Cy5 were compared with each other (Airway-Cy3 versus Airway-Cy5). Similarly, Parenchyma-Cy3 data were compared with Parenchyma-Cy5. Additionally, ratios generated from Airway-Cy3/Parenchyma-Cy5 were compared with ratios generated from Airway-Cy5/Parenchyma-Cy3. Data for such comparisons were normalized as described above and generated as follows: two normalized RFU signals (two duplicates/slide) for either subcompartment/dye were obtained per transcript; their means were calculated; and means and standard deviations for each lung subcompartment/dye were calculated per transcript from each of the five individual animal RFU signals. Signal ratios (airway/parenchyma) also were calculated using the normalized RFU signals. Two ratios per animal (duplicates) were obtained per slide, and their means calculated. Means and standard deviations were calculated from five individual animal means.
Rat Lung cDNA Library Sequencing A final set of 22,940 high quality sequences (accuracy of base call = 99.95%, average sequence length 516 bases) was available for analysis (Table 1). All high quality sequences were assembled into contigs to remove redundant data, and this resulted in a reduced rat lung sequence set of 8,746. Roughly 80% of the sequences assembled into 3,956 contigs, whereas 20% of the sequences remained as unincorporated singlets. The 80:20 ratio of contigs to singlets was consistent across all four libraries (airway and parenchyma) from control and 1-NNtreated rats.
For EST identification, alignments were performed with this reduced rat lung sequence set against currently available NCBI sequence databases including the nt database and the RefSeq databases from rat, mouse, and human. ESTs were further analyzed to identify any other unincorporated redundant sequences. BLAST results were searched for replicates by comparing GenBank accession numbers and also by manually searching for replicates in gene identities (for ESTs not sharing accession numbers as for homologous sequences from different species). This resulted in a final set of 7,460 nonredundant ESTs (Table 1). The data in Figure 1A represent the best BLAST alignments for each of the 7,460 ESTs. Results are classified by the source species supplying the best alignment as well as by its e-value score derived from the BLAST search. An e-value of 1 x 10-5 was used as the lowest confidence threshold for positive identification, and all ESTs with e-values higher than this threshold value were classified as unidentified and possibly novel. Of the sequences searched, 82% (6,130) showed some degree of homology to sequences in the public databases (Table 2). Of the nonredundant rat lung library sequences, 19% were best identified by a rat sequence, 49% by a mouse sequence, 14% by a human sequence, and 0.4% by sequences from other species. Of the 6,130 positively identified ESTs, only 3,414 had known biological function according to their best BLAST alignment (Table 2). The remaining 2,716 sequences, largely derived from EST or bacterial artificial chromosome clone sequencing projects, had no known function.
Slightly less than 18% (1,330) of the ESTs from the rat lung sequences were determined to be novel (Table 2). Of these 1,330 ESTs, 606 and 471 were singlet sequences obtained from airway and parenchyma libraries, respectively. The remaining 253 ESTs were contig sequences condensed from 563 individual sequences319 from airway libraries and 244 from parenchyma libraries. Comparisons of the length and quality scores for positively identified versus novel sequences showed that the average lengths were similar (known: 555 ± 138 bases and novel: 491 ± 170 bases), and the average Phred quality score per base was also similar (known: 36 ± 6 and novel: 34 ± 6). An alternative approach to monitoring alterations in gene expression to lung toxicants would have been to use arrays that are commercially available. The current Affymetrix Rat GeneChip (Set 230, April 2,003) consists of 31,142 sequences designed from GenBank, dbEST, RefSeq, the UniGene database (Build 99, June 2002) and refined by analysis and comparison with a draft assembly (June 2002) of the rat genome (Baylor College of Medicine Human Genome Sequencing Center, Houston, TX). Of the ESTs identified through the current studies, nearly one-third are not represented on the current Affymetrix array (Figure 1B). Of the sequences which overlapped with those on the Affymetrix array (71%), e-value scores indicated that more than 58% were either identical or excellent matches to rat sequences in the public databases. The most highly sequenced transcripts from all four libraries were determined from the nonredundant EST set, and the frequencies (as a percentage of sequences obtained from each library) for the five most abundant sequences ranged from 0.74.6% (control airway), 0.71.8% (1-NNtreated airway), 0.71.8% (control parenchyma), and 1.32.8% (1-NNtreated parenchyma). The five most highly sequenced transcripts from the control airway library (6,865 sequences) were dimethylarginine dimethylamino hydrolase (313 copies), Clara cell secretory protein (155 copies), synaptic vesicle glycoprotein 2b (61 copies), ß-carotene 15, 15-dioxygenase (60 copies), and thymosin ß-4 (45 copies). The top five transcripts sequenced from the airway after 1-NN treatment (5,152 sequences) were Clara cell secretory protein (92 copies); dimethylarginine dimethylamino hydrolase (71 copies); palate, lung, and nasal epithelium carcinomaassociated mRNA (71 copies); tyrosine aminotransferase (49 copies); and synaptic vesicle glycoprotein 2b (37 copies).
The five most highly sequenced transcripts from the control parenchyma library (4,997 sequences) were surfactant-associated protein 1 (91 copies), eukaryotic translation elongation factor 1
Rat Lung cDNA Microarray Development and Quality Control The data obtained after hybridizations with rat airway and parenchyma probes were analyzed to determine the spotting concentrations that would allow for minimal spot saturation. A standard curve generated from ß-actinpositive controls was used to determine that spotting concentrations at or above 30 ng/µl were sufficient to avoid saturation of the target on the slides for highly expressed transcripts such as ß-actin. Over 80% of the PCR products were spotted at concentrations at or above 30 ng/µl. Spot signal intensities for ß-actin decreased at a linear rate below a spotting concentration of 25 ng/µl. Approximately 2% of the spots on these microarrays were derived from concentrations of spotting material at or below 5 ng/µl. The housekeeping genes (ß-actin and GAPDH) spotted on each array as positive controls were used to determine the variability in the data collected from airway versus parenchyma hybridizations (n = 10, see MATERIALS AND METHODS). Actin transcript levels showed a coefficient of variation (CV) between 11 and 18% in the airways and between 12 and 25% in the parenchyma for DNA spotted at 50 or 100 ng/µl. GAPDH transcript levels for DNA spotted at these concentrations showed a CV between 17 and 23% in the airways and between 14 and 24% in the parenchyma. CVs from all normalized expression data averaged 26.6 ± 13.4% (airways) and 25.4 ± 12.8% (parenchyma). The signal intensities from negative controls (PCR products from E. coli and yeast genes spotted at 50 ng/µl) averaged between 0.4 and 0.9% actin level with CVs ranging from 6789% for either airway or parenchyma. For selected transcripts, PCR products were deposited in several quadrants of each slide allowing for replicate analysis within an individual microarray. Signal ratios generated from such replicate data points (319 replicates) correlated well with CVs averaging < 20% regardless of spotting concentration.
Studies to determine whether there was significant dye bias introduced by the secondary labeling procedure used in this investigation showed that, after normalization of signal intensities for all elements on slides hybridized with Airway-Cy3 versus Airway-Cy5 or Parenchyma-Cy3 versus Parenchyma-Cy5, data were virtually identical regardless of the dye used. When comparing ratio data (Airway-Cy3/Parenchyma-Cy5 or Airway-Cy5/Parenchyma-Cy3), signal ratios for over 80% of the spots were within 30% of each other (
Transcript Expression Differences Between Rat Lung Airways and Parenchyma by Microarray
Airway and parenchymal epithelial cells of the lung serve very distinct functions. Consistent with the different functions, several thousand genes (a total of 25.6% of the transcripts represented on this array) were expressed at significantly different levels (P
As would be expected, a number of marker genes show significant differential expression between these lung subcompartments. For example, airway nonciliated bronchiolar epithelial (Clara) cells produce a variety of secretory proteins including Clara cell secretory protein (Ugb) and the secretoglobins (Scgb3a1 and Scgb3a2). All of these genes are expressed at substantially higher levels in the airway compared with parenchyma (7.4, 23.6, and 8.2 fold, respectively). Likewise, the surfactant pulmonary-associated proteins (Sftpc and Sftpa1) are products responsible for stabilizing the alveolar airspace. These transcripts were expressed at much lower levels in the airway than in the parenchyma (-3.2 and -2.3 fold, respectively).
The availability of custom microarrays generated from extensive sequencing of airway and parenchymal cDNA libraries allowed a comprehensive comparison of the relative expression levels of genes in both lung compartments. Accordingly, transcript expression levels were sorted by signal intensities to determine the relative transcript abundance in airway and parenchyma (Figure 5). Three distinct populations were present in either subcompartment which were categorized as "low abundance" (< 10% actin level), "moderate abundance" (1099% actin level), and "high abundance" (
The five most highly expressed transcripts in the airway were hyaluronan-mediated motility receptor (2,084% actin level), a transcript encoding HDCMA18P protein (1,893% actin level), Clara cell secretory protein (1,635% actin level), a transcript encoded by the mitochondrial genome (1,593% actin level), and a transcript encoding hypothetical protein MGC14961 (1,584% actin level). In the parenchyma, the most highly expressed transcripts included a novel sequence (2,642% actin level), lysozyme (1,761% actin level), leucine-rich repeat LGI family, member 4 (1,725% actin level), a RIKEN cDNA 4833421E05 (1,674% actin level), and surfactant-associated protein C (1,547% actin level).
The lung is an important target for a number of cytotoxic and carcinogenic chemicals that reach the respiratory system via both inhalation and parenteral routes of exposure. The overall goal of the research program outlined in this article, and to be presented in subsequent articles, is to define the response of target lung cells to cytotoxicant exposure during phases of acute injury and repair using global gene expression analysis. Although microarrays have been used to evaluate changes in gene expression during lung development (10) and to define alterations in expression profiles associated with fibrosis (8), this approach has not been widely used to understand changes that occur in the lung in response to cytotoxic agents. One of the challenges in elucidating the response of the lung to external stress is related to the fact that the respiratory system is composed of over 40 different cell types with highly specialized functions specific to the unique physiology and biochemistry of this tissue. Accordingly, the methods used to sample the lung for delineating alterations in homeostasis associated with toxic exposures must differentiate between target and nontarget areas of this organ system. Although earlier studies have been able to demonstrate temporal relationships in gene transcript levels for extracellular proteins during development, the use of whole lung homogenates in the studies may not accurately reflect important changes occurring in specialized sites within the lung (10). Significant progress in the preservation and isolation of mRNA from subcompartments of the lung has been reported (17). Likewise, the specialized functions of the respiratory system suggest that use of commercially available arrays may not provide adequate coverage of a number of genes specifically linked to the unique physiology of this tissue. This point is fully supported by the data shown in Figure 1B indicating that nearly 30% of the elements present on the custom arrays described in this study were not represented in the most recent version of the rat gene array (Set 230) from Affymetrix. Similarly, when compared with the public databases, nearly 20% of the genes sequenced in this study did not align to sequences from rat, mouse, or human sequence databases at the time these studies were conducted (Figure 1A, Table 2). Eight of these novel genes (out of the 100 represented in Figures 3 and 4) were highly differentially expressed in airway or parenchyma, suggesting that their gene products may have very specific roles in the maintenance of homeostasis in these tissue subcompartments. Although there have been large sequencing efforts for human and mouse genetic material, data from rat is less abundant. For example, the RefSeq database for rat contains only 4,023 fully annotated sequences compared with databases for mouse and human (which contain 12,683 and 18,232 sequences, respectively). This work documents 7,460 nonredundant cDNA clones identified after the analysis of 22,940 high quality sequences from rat airway and parenchyma subcompartments. The number of airway sequences which may have been overlooked by sequencing a whole lung cDNA library is estimated to exceed 650 airway transcripts (close to 10% of this cDNA clone set) unless the whole lung library had been sequenced in far greater depth than has been achieved so far. From the sequences analyzed, a set consisting of 8,789 sequences was contributed to the rat sequencing effort (GenBank accession nos.: CF106750, CF115538). In addition to the uncharacterized sequences obtained during the course of this work, the development of spotting material for use in glass slide arrays offers the ability to generate arrays at a cost that allows studies to be conducted in several animals at each time point and at multiple doses. In a number of published studies (10), the significant cost of commercial chips resulted in the use of pooled samples. Although there is ample experimental data to show that the chips properly interrogate the levels of expression in that pool, this approach precludes any assessment of the variance between animals. Statistically valid conclusions on alterations in gene expression in response to stressors can only be derived from data sets that contain multiple animals (or multiple pools of animals) measured individually. The sequencing project and microarray studies presented here generated two unique data sets representing differential expression of transcripts from the airways and parenchyma of rat lung. Although for many of the higher abundance transcripts there was a good correlation between the two data sets, for much of the data, it was difficult to make comparisons between the expression levels on the arrays with the number of transcripts sequenced. Better comparisons would have required a much larger sequencing data set, such as those for SAGE analysis, which can involve as many as 300,000 sequence tags (18). Although all of the microarray work described in this article included the use of dye reversal hybridizations, the data show that when using the indirect labeling procedure, use of a single dye provides good quantitative measurements. Ratio data generated from airway-Cy3 versus parenchyma-Cy5 and airway-Cy5 versus parenchyma-Cy3 hybridizations were within 30% of each other for over 80% of the transcripts, and the majority were within 10%. As would be expected, more variability was noted with genes which were not highly expressed. Significant differences in signal ratios (> 2-fold difference) were observed in < 2% of the data. One of the possible disadvantages associated with the development of custom arrays that use spotted PCR products is related to cross-hybridization for genes with closely related sequences. Although in the vast majority of cases the arrays produced in this study appeared to properly interrogate differences in expression between the airway and parenchymal subcompartments, some transcripts, such as those coding for some of the cytochrome P450 monooxygenases, were expected at much higher levels in airway compared with parenchyma. For example, a transcript expected to be present at significantly higher levels in the airway compared with parenchyma was CYP2F4 (unpublished observations, Baldwin and colleagues). Significant differences in expression between the airway and parenchyma were not observed for any of the CYP genes, including CYP2F4. Although many of the CYPs are highly abundant in airway Clara cells, several have also been demonstrated in alveolar type II cells, macrophages, and endothelial cells of the parenchyma (19). Lack of differential expression between the subcompartments may be due to more equal distribution of these CYPs than previously expected. However, some measure of cross-hybridization between the different CYP homologs is likely because they belong to a highly homologous family of enzymes. Some level of cross-hybridization may have decreased the ability of the current arrays to discriminate among the various CYPs. These findings underscore the importance of confirming differentially expressed genes by secondary techniques, especially when using cDNA microarrays, to avoid any artifacts introduced by such possible cross-hybridization effects. Spot intensities were used to compare relative abundance of transcripts from the airway and parenchymal subcompartments. Although mRNA populations differ between cell types, those from typical mammalian cells are categorized into three distinct abundance classes (20). Similarly, transcripts in the lung appear to fall into three fairly distinct abundance classes, but with variations between airway and parenchyma in the "low" and "moderate" abundance classes. Roughly 7% of all transcripts measured had abundance levels at or above 100% actin level and thus would be characterized as "highly abundant." In the "typical" mammalian cell, < 1% of all transcripts fall into this highly abundant category. The data presented in Figure 5 indicate that cells in the lung produce more highly abundant transcripts than observed in other tissues. This is not surprising considering the diversity of cell types in the lung, each with highly specialized functions. It is important to note that the measurements of abundance levels presented here are meant only as observations of relative abundance. The accuracy of using fluorescence data from cDNA microarrays such as those used in these studies as a measure of abundance is limited and should be viewed with caution. Variation in spotting material concentration, variation in length of spotted PCR products, and differential incorporation of fluorescent molecules (Cy3 or Cy5) into labeled cDNA probes are a few of the factors that may potentially lead to artifacts in these measurements. There are several pieces of data that provide confidence that the relative abundance data presented contains minimal artifacts introduced as a result of low concentrations of spotting material. These include the following: (i) over 80% of microarray spots were deposited at sufficient concentrations (> 30 ng/µl) to measure highly abundant transcripts; (ii) when comparing only those data obtained from PCR products at concentrations > 30 ng/µl to the complete data set, expression abundance patterns (low:moderate:high abundance classes) are not altered significantly; (iii) control data indicate that transcripts expressed at 10% actin level should be measured accurately at spotting concentrations as low as 5 ng/µl; and (iv) over 98% of the microarray spots were deposited at or above 5 ng/µl, and this concentration should be sufficient for over 90% of transcripts (low and moderate abundance). A further issue relates to the fact that the lengths of PCR products spotted varied thus potentially affecting probe-target binding dynamics. These factors are less of a concern when using the short oligonucleotides (50- to 70-mers) available commercially for spotting glass slide arrays. Partial sets have become available for the rat within the past months. In addition, differential incorporation of fluorescent molecules during probe labeling may introduce bias that could affect the measurement of abundance. The microarray work described here included the use of dye reversal hybridizations to minimize dye-biased data. As discussed earlier, using an indirect labeling method provides relatively equal incorporation of dye to the cDNA probe regardless of the dye incorporated (Cy3 or Cy5), and therefore, we feel confident that dye-biased effects were minimized. One of the primary goals of the work described here has been to validate the use of this array for interrogating changes in gene expression in response to pulmonary toxicants. Thus, the relative expression levels for genes in the airway versus parenchyma should correspond with reports from in situ hybridization and immunocytochemistry studies on the localization of various genes and gene products in the lung. Transcript populations that are expressed primarily in either parenchyma or airway should show deviations from transcripts that are expressed at relatively equal levels in either lung subcompartment (such as housekeeping genes). The data presented in Figure 2 clearly show three distinct populations that fit this paradigm. Among the most selectively expressed transcripts in the airways were Clara cell secretory protein (Ugb) and the secretoglobins (Scgb3a1 and Scgb3a2). These transcripts are expressed primarily by the Clara cells of the airway epithelium (21). Other transcripts known to be expressed in the airways included DNCL2B which is likely from ciliated cells (22), Plunc from the trachea and bronchi (23), and Dcn which has been shown in the extracellular matrix of the airway (24). Similarly in the parenchyma there was selective expression of the pulmonary surfactant proteins (Sftpc and Sftpa1) which are known to be expressed highly by alveolar epithelial Type II cells (21). Other genes which were more highly expressed in parenchyma than in airways included Cldn5, Anxa8, Ca2, Lyz, and Kdap, and again, these have been shown to be expressed in alveolar epithelial Type II cells (21, 2528). In addition, Ca4, Ace, and Defb2 are expressed in alveolar macrophages or capillary endothelial cells (2931). Substantial differences in the expression of several hundred genes between conducting airway and alveolar lung regions supports the contention that proper understanding of toxicant-induced changes in gene expression requires the use of methods capable of selecting target and nontarget areas of the lung as well as analysis with a lung-derived microarray. Other transcripts which have been reported in whole lung but whose subcompartment localization is unknown included MLL3, S100C, Hpgd, Men1, and Egfl6 (3236). The current studies indicate that MLL3 and S100C are more highly expressed in the airways than the parenchyma while Hpgd, Men1, and Egfl6 are more highly expressed in the parenchyma (Tables 3 and 4). The genes listed over the last few paragraphs only comprise 20% of the top 100 most differentially expressed genes between these subcompartments. The remaining 80% of transcripts were either previously uncharacterized in the lung or of unknown biological function in any tissue. Although many of the sequences have been reported recently from the rat genome sequencing project, this study shows that these genes are expressed in either airway or parenchyma. Interestingly, many of the genes listed in Tables 3 and 4 involved in intermediary metabolism or in cell/organism defense and homeostasis were expressed more highly in the parenchyma compared with the airways (Table 5). This is consistent with the nature of this region of the lung: (i) it requires specific conditions conducive for efficient gas exchange, and (ii) cells are in a very oxidative environment requiring vigorous intermediary metabolism. Conversely, in the airways, a higher percentage of genes could be classified in cell structure/motility and this is consistent with the mucociliary function of the cells lining the airways. Of the top 100 mRNA transcripts showing the highest fold expression differences between the two lung subcompartments, over 50% had no known biological function. This indicates that many uncharacterized genes identified through the approach described here may contribute significantly to the specialized functions of the lung. Over 40% of the transcripts showing significant expression differences between the airways and parenchyma had no known biological function associated with them. It was also interesting to note that the parenchymal subcompartment showed over three times the number of significantly different transcripts than did the airways, implying that a greater heterogeneity of the mRNA transcripts may be required for homeostasis in the parenchyma. These findings underscore the need to establish fundamental baseline knowledge of the lung such as that provided by this gene expression data set to aid in the functional characterization of specific regions of the lung. Accordingly, the gene expression data generated from these studies will be available for other investigators to query through NCBI's Gene Expression Omnibus public gene expression and hybridization array data repository (http://www.ncbi.nlm.nih.gov/geo/) and through our web site (http://faculty.vetmed.ucdavis.edu/faculty/arbuckpitt/BuckpittLab). In summary, this sequencing effort has allowed for the construction of a focused microarray containing 7,460 nonredundant transcripts derived from pulmonary tissue and supplies valuable baseline information of transcripts represented in the respiratory system. Nearly one-third of the elements present on this custom microarray are not currently represented on Affymetrix rat arrays; thus, this microarray represents a valuable and unique resource for delineating gene expression differences between lung subcompartments and in response to toxicants. The microarray data presented here validates the capability of this array to delineate gene expression differences in lung tissues and provides prima facie gene expression data from different lung subcompartments. We have used these arrays to monitor gene expression changes in target and nontarget areas of the lung in an effort to gain insight into some of the fundamental responses of pulmonary epithelial cells during injury and repair from a metabolically activated cytotoxicant. Those studies will be presented in subsequent papers. Future studies will be directed at identifying the functions of the gene products for those genes whose expression levels are altered by toxicant exposure.
Note from Authors: The field of genomics and microarrays is rapidly and continuously evolving. This has made it difficult to present these sequencing data in a fashion that is as "up-to-date" as possible while trying to keep pace with the changes in publicly available sequence data. In the time the data for this manuscript were acquired, analyzed, and submitted for publication, the Rat Genome Project has made significant progress in its sequencing of the rat genome. As of June 30, 2003, 90% of the rat genome (v.3.1) has become available. A recent comparison of the rat lung sequences to the updated rat genome shows that roughly 90% of the "novel" sequences presented here have now been identified by that project. Many of these sequences, though, remain uncharacterized and their biological functions are still unknown. When the 7,460 lung sequences were compared with the latest annotated sequence databases (RefSeq, July 2003) for rat, mouse, and human (4,549, 16,117, and 20,365 sequences, respectively), the number of sequences in this rat lung data set with known biological function had only increased from Although the new rat genome sequences help confirm the quality of the lung sequences generated through this approach, the authors have kept the identity nomenclature from the original BLAST comparisons with the databases described in MATERIALS AND METHODS solely to emphasize the potential of this approach in identifying unique tissue-specific sequences from less-sequenced species. Although the rat genome will be fully sequenced in the near future, sequence information for many other species remains incomplete, and application of the approach described in this article would be highly useful. Regardless of the sequence nomenclature used, the information presented in this article provides prima facie evidence for the expression of 7,460 genes in respiratory tissue. Furthermore, this does not affect the validity of the microarray developed using this approach or the subsequent gene expression data generated. The comparison made to the Affymetrix Rat GeneChip (Set 230, April 2003) is as up to date as possible (no new releases have been made available), and the noted differences undoubtedly reflect the lag time required to "mine" sequence data and develop the necessary oligonucleotide sets for inclusion on such microarrays. The approach described in this manuscript has allowed for the completion of the toxicology studies initially proposed before the availability of a suitable commercially available rat array containing many of these lung-specific genes.
This study was supported by National Institutes of Environmental Health Sciences ES 09681, ES 04699, and Amersham Biosciences. UC Davis is a Center for Environmental Health Sciences and support for core facilities is acknowledged. M.A.S. was supported during the preparation of this manuscript by T32HL07013. The authors acknowledge the research team at Aeomica for their support provided for the sequencing project. They also gratefully acknowledge Dr. David Rank and Dr. Russell Thomas for informatics advice and David Jenkins for support in the construction of the microarray. Received in original form June 10, 2003 Received in final form August 13, 2003
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||