Published ahead of print on July 25, 2003, doi:10.1165/rcmb.2003-0103OC
© 2004 American Thoracic Society
DOI: 10.1165/rcmb.2003-0103OC
Genome-Wide Search and Identification of a Novel Gel-Forming Mucin MUC19/Muc19 in Glandular Tissues
Center for Comparative Respiratory Biology and Medicine and Division of Pulmonary and Critical Care Medicine, University of California, Davis, California; Center for Oral Biology, University of Rochester, Rochester, New York; Department of Oral and Maxillofacial Surgery, Drew University of Medicine and Science, Los Angeles; and Department of Molecular Pharmacology and Toxicology, University of Southern California, Los Angeles, California
Address correspondence to: Reen Wu, Ph.D., Center for Comparative Respiratory Biology and Medicine, Surge 1 Annex, Room 1121, University of California at Davis, One Shields Ave., Davis, CA 95616. E-mail: rwu{at}ucdavis.edu
Gel-forming mucins are major contributors to the viscoelastic properties of mucus secretion. Currently, four gel-forming mucin genes have been identified: MUC2, MUC5AC, MUC5B, and MUC6. All of these genes have five major cysteine-rich domains (four von Willebrand factor [vWF] C or D domains and one Cystine-knot [CT] domain) as their distinctive features, in contrast to the nongel-forming mucins. The CT domain is believed to be involved in the initial mucin dimer formation and is closely conserved among gel-forming mucins of different species. Because of gene duplication and evolutionary modification, it is very likely that other gel-forming mucin genes exist. To search for new gel-forming mucin candidate genes, a "Hidden Markov Model" (HMM) was built from the common features of the CT domains of those gel-forming mucins. By using this model to screen all protein databases as well as the six-frame translated expressed sequence tag and translated human genomic databases, we identified a locus in the pericentromeric region of human chromosome 12 and the corresponding homologous region of mouse chromosome 15. We cloned the 3' end of this gene and its mouse homolog. We found one vWF C domain, one CT domain, and various mucin-like threonine/serine-rich repeats. Phylogenetic analysis indicated a close relationship between this gene and the porcine and bovine submaxillary mucins. A polydispersed signal was observed on the Northern blot, which indicates a very large mRNA size. Further analysis of the upstream genomic sequences generated from the human and mouse genome projects revealed three additional vWF D domains and many mucin-like threonine/serine-rich repeats. The expression of this gene is restricted to the mucous cells of various glandular tissues, including sublingual gland, submandibular gland, and submucosal gland of the trachea.
Based on the chronological convention, we have given the name MUC19 to the human ortholog and Muc19 to the mouse.
Abbreviations: bovine submaxillary mucins, BSM; Cystine-knot domain, CT; glyceraldehyde-phosphate dehydrogenase, GAPDH; Hidden Markov Model, HMM; expressed sequence tag, EST; polymerase chain reaction, PCR; porcine submaxillary mucins, PSM; reverse transcriptase, RT; saline sodium citrate, SSC; von Willebrand factor C domain, VWC; von Willebrand factor D domain, VWD; von Willebrand factor, vWF.
Mucus is a viscoelastic gel-like substance that covers the mammalian epithelial surface of various tissues. The main functions of mucus include lubricating epithelia and protecting them from environmental insults. The viscous and elastic properties of mucus gel are generally attributable to the physical properties and structural features of mucin glycoproteins, specifically gel-forming mucins. MUC2, MUC5AC, MUC5B, and MUC6 define this mucin subgroup, and they are believed to have evolved from one common ancestor with von Willebrand factor (vWF) (1). Bovine and porcine submaxillary mucins (BSM, PSM) also belong to this subgroup (1). All of these gel-forming mucins are very large (15–40 kb cDNA); they also share a similar structure and substantial sequence homology in the conserved regions. The cDNA sequences of those mucins have multiple "cysteine-rich" vWF C (VWC) and vWF D (VWD) domains in the regions flanking the mucin-like threonine/serine-rich repeats, and Cystine-knot (CT) domains in their C-terminal regions (1, 2). Both the number and the positions of the cysteines are extremely conserved in those domains, which play an essential role in forming disulfide-linked dimers (3–5) and multimers (1, 6, 7). No such domains are found in the nongel-forming mucins. Their large size and their capability of forming multimers support the notion that these mucins play a pivotal role in forming the mucus gel. Indeed, those gel-forming mucins have been proven to be major components of the mucus secretion of various organs (8–11). In addition to the gel-forming mucins, fifteen other human mucin genes have been cloned and named MUC1, 3–4, and 7–18 (http://www.ncbi.nlm.nih.gov/LocusLink/). Generally speaking, individual mucins are so named because they have so-called "threonine/serine-rich mucin repeats," and they share no apparent sequence similarities as a group (12).
Among those mucins, some are membrane-tethered (MUC1, 3, 4, 11, 12) and some are very small (MUC7, 9, 10) (12). The contribution of those mucins to the biophysical and biochemical properties of mucus gel is not entirely clear. For many years, the total number of mucin genes has remained a mystery. Currently, four human gel-forming mucin genes have been identified: MUC2, MUC5AC, MUC5B, and MUC6. New gel-forming mucins may also exist as a result of gene duplication, chromosomal exchange, or other genetic alterations. Current progress in DNA sequencing has led to the creation of many sequence databases that are useful resources for the discovery of new proteins. Now that the human genome project has been completed, potential gene candidates can be predicted from their genomic sequence. Another useful database is dbEST (NCBI expressed sequence tag [EST] database, http://www.ncbi.nlm.nih.gov/dbEST/). dbEST contains cloned cDNA sequences obtained by reverse transcription of mRNA samples from various tissues, and it has been widely used for the study of gene expression. One general approach to discovering new members of a gene family is to search the nucleotide databases for sequences similar to those of the family with the BLAST program (http://www.ncbi.nlm.nih.gov/BLAST/). However, many gene families, such as the gel-forming mucins, do not have overall sequence similarity; rather, they share only some conserved "motifs" such as the CT domain. This difficulty can be overcome by searching the database using sequence profiles rather than merely the sequence per se. There are many methods for constructing a sequence profile from a multiple sequence alignment; the resulting profile is a mathematical summary of the specific features extracted from the known members of a given gene family. Searching the database with a sequence profile is like looking for the general "features" of those genes rather than just similar DNA sequences (13, 14).
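The idea of a sequence profile can be illustrated with a toy example. The following is a minimal sketch, assuming a gap-free alignment of short peptide motifs and a uniform amino acid background; it builds a simple position-specific log-odds matrix rather than a full profile HMM (there are no insert or delete states), and the function names and the toy alignment are hypothetical, not taken from this study.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Return one log-odds score table per alignment column.

    Assumes a gap-free alignment and a uniform background frequency.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile

def score(profile, segment):
    """Sum the per-column log-odds scores of a segment (unknown residues score 0)."""
    return sum(col.get(aa, 0.0) for col, aa in zip(profile, segment))

def best_window(profile, sequence):
    """Return (score, start) of the best-scoring window in a longer sequence."""
    width = len(profile)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the profile")
    return max(
        (score(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )
```

With a toy alignment such as `["CACGC", "CTCGC", "CSCGC"]`, the conserved cysteine columns dominate the score, so a window containing the motif outranks unrelated windows even when the variable position differs. This is the sense in which a profile search looks for family "features" rather than overall similarity.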
"Hidden Markov Model" (HMM) is one of the most powerful tools in this regard (15, 16). Using this HMM-based searching method, Schultz and coworkers (17) have discovered more than 1,000 new putative human small GTPase proteins. By combining it with an EST database search and a BLAST search on genomic sequence, Wittenberger and colleagues (13) have uncovered new members of the G-protein-coupled receptor superfamily. These results suggest that an HMM-based search can be more robust and specific than a BLAST search alone. In this report, we have used this approach to identify MUC19/Muc19 as a novel glandular tissue-specific gel-forming mucin gene.
Screening the Novel Gel-Forming Mucin Genes
As shown in Figure 1, we collected all the known gel-forming mucin genes, including those of human and other animal species, from the NCBI database. We chose the 3' end sequences because for some genes, such as MUC6, only the 3' end sequences are available. Moreover, most of the EST sequences were generated from the 3' end. All sequences were selected and processed with Blast2 (NCBI software program). Only the most representative sequences were preserved. These genes were then aligned by the ClustalW program (18). A gel-forming mucin gene-specific HMM was built from the alignment data by using HMMER2.2 software from Sean Eddy's Lab Home Page of Washington University at St. Louis, MO (http://hmmer.wustl.edu/). NCBI human and mouse EST databases were downloaded to an in-house Linux computer. All those sequences were six-frame translated and then screened with the "gel-forming Hidden Markov Model" by HMMSEARCH in the HMMER2.2 software package. Initially, a default cutoff value (< 1) was used. All hits were then used to search the NCBI nr database with an in-house search program to determine whether those ESTs corresponded to known genes. By visual inspection, we found that there was a large gap in the scores among all those hits. All the known nonmucin genes have a score much smaller than 0.01. Thus, a second cutoff value (< 0.01) was used to filter the results. The same method was also used to search the human and mouse genomic databases from NCBI. The only difference in this search was that all the genomic sequences were first translated by GENESCAN program (19) before the search.
3' and 5'-RACE
The RACE kit (Roche Diagnostics Corporation, Indianapolis, IN) was used to synthesize the first-strand cDNA from total RNA (3 µg) isolated from human and mouse salivary gland tissues. All the procedures followed the manufacturer's instructions. Briefly, Oligo-dT anchor primer or antisense gene-specific primer corresponding to different regions of the MUC19/Muc19 message was used to initiate first-strand cDNA synthesis. For the 3'-RACE, PCR was performed with 5' gene-specific primers and the 3' oligo d(T) anchor primer. For 5'-RACE, a 3' tailing with oligo d(A) with terminal deoxynucleotidyl transferase was performed on the first-strand cDNA, and then a PCR was performed using the nested gene-specific primer and the 5' oligo d(T) anchor primer. The PCR products were subcloned into the TA vector (Invitrogen, Carlsbad, CA) for cloning and DNA sequencing. All primer sequences used in this study are listed in Table 1.
Reverse Transcriptase-Polymerase Chain Reaction Amplification
cDNA was synthesized from total RNA (3 µg) by RT with oligo d(T) primer. The resulting single-strand cDNA was used as a template for polymerase chain reaction (PCR) amplification by MUC19/Muc19 gene-specific primers (Table 1). PCR products were TA cloned and sequenced.
Phylogenetic Analysis
Genomic Structure and Localization
RNA Isolation and Northern Blot Hybridization
Expression Analysis by Quantitative Reverse Transcriptase-PCR
In Situ Hybridization
Glass slide sections from various tissue blocks were hybridized in the hybridization solution using biotin-labeled antisense or sense probes synthesized by in vitro transcription of MUC19/Muc19 clones. In situ hybridization was performed as per the manufacturer's protocol (Roche Diagnostics Corp., Indianapolis, IN) and modified as described before (21). Briefly, slide sections were treated with 10 µg/ml Proteinase K in 50 mM Tris-Cl (pH 8.0) and 50 mM ethylenediaminetetraacetic acid for 15 min at 37°C, rinsed twice in 0.2x saline sodium citrate (SSC) thereafter, and then postfixed in 4% paraformaldehyde/phosphate-buffered saline for 20 min. Slides were treated twice for 5 min each time with 0.1 M triethanolamine (pH 8.0) and blocked by 0.25% acetic anhydride in 0.1 M triethanolamine. The sections were then dehydrated through the ethanol series. For each section, 0.5 pmol biotin-labeled oligonucleotide probe in 50 µl of hybridization buffer was applied. The hybridization buffer contained 2x SSC, 1x Denhardt's solution, 10% dextran sulfate, 50 mM phosphate buffer (pH 7.0), 50 mM dithiothreitol, 250 µg/ml yeast tRNA, 100 µg/ml poly A, and 500 µg/ml salmon sperm DNA. The section was hybridized at 45°C overnight in a humidified chamber. After hybridization, the section was washed twice for 15 min each time at 37°C with 2x SSC, twice for 15 min each time with 1x SSC, and twice for 15 min each time with 0.25x SSC. After the wash, the slide was reacted with anti-biotin primary antibody conjugated with alkaline phosphatase. After several washes, the reacted probes on the slide were color-developed with the Biotin Nucleic Acid Detection kit from Roche Diagnostics Corp.
Developing HMM for the Genome-Wide Search of New Gel-Forming Mucin Genes
To conduct a genome-wide search for new gel-forming mucin genes, a specific HMM was developed based on the sequence alignment of all known gel-forming mucins (Figure 1). To make the model more representative, sequences from species other than human and mouse were also included in the alignment. Using this finalized model, a comprehensive search of the human and mouse EST databases revealed many hits with high scores, especially those from the mouse EST databases. Most of these hits were parts of the known gel-forming mucin genes (MUC2/Muc2, MUC5AC/Muc5AC, etc.). Because of the significantly high scores of these hits in the mouse EST database, we decided to focus on the mouse gene. After processing those results with an in-house program, 24 mouse ESTs that did not match any known mouse mucin gene were obtained from the search. These ESTs were in fact generated from the same gene. The translated product of this new gene has a bona fide gel-forming mucin-like CT domain (Figure 2Ab).
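The post-processing of search hits can be sketched roughly as below. This is a hypothetical illustration, not the in-house program itself: it assumes each hit carries an E-value-like score where smaller means more significant, plus the name of the best-matching known gene (or `None` when there is no match), and it applies the 0.01 cutoff described in the methods before discarding hits that correspond to known genes. The `Hit` record and the example identifiers in the usage note are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Hit:
    est_id: str                # identifier of the translated EST (hypothetical)
    e_value: float             # assumed: smaller values are more significant
    known_gene: Optional[str]  # best match among known genes, or None

def candidate_ests(hits: List[Hit], cutoff: float = 0.01) -> List[Hit]:
    """Keep hits below the cutoff that do not match any known gene."""
    passing = [h for h in hits if h.e_value < cutoff]
    novel = [h for h in passing if h.known_gene is None]
    return sorted(novel, key=lambda h: h.e_value)
```

For example, given hits `Hit("est_a", 1e-20, "Muc2")`, `Hit("est_b", 1e-15, None)`, and `Hit("est_c", 0.5, None)`, only `est_b` would be retained: `est_a` belongs to a known mucin and `est_c` fails the cutoff.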
Molecular Cloning and Sequence Characterization of the 3' End of Novel Gel-Forming Mucin Gene, Muc19
We then performed 5'/3'-RACE using the primers deduced from the potential coding region of this new gene. Total mouse salivary gland RNA was used because all the ESTs from this new gene were obtained from mouse salivary gland library SG2. For 5'-RACE, we used mmuc19_1740 as the gene-specific primer; for 3'-RACE, we used mmuc19_1392. Sequences of the primers are listed in Table 1. By these methods, we were able to obtain two cDNA clones (1.897 kb and 2.023 kb) that were generated by different polyadenylation sites (Figure 2Aa). The longer transcript has the same ORF as the shorter one, but has a longer 3' UTR. The sequence has been deposited into GenBank under the accession number AY193891. The deduced peptide sequence has a significantly high threonine and serine content (35.9%) and several mucin-like threonine/serine-rich repeats (Figure 2Ac). It also has the signature motifs of gel-forming mucins: VWC and CT domains (Figure 2Ab). Because mucins are named numerically in chronological order, we named this new mouse mucin gene Muc19. By comparing the Muc19 sequence with the UCSC and NCBI human genome sequence database, the cloned 3' end sequence of Muc19 was found to reside at chromosome 15 (Figure 3A) and consists of 9 exons (Figure 2Aa).
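A threonine/serine content such as the one reported above can be computed from a deduced peptide sequence with a few lines of code. The following is a minimal sketch, assuming a one-letter amino acid string; the function names, window width, and threshold are illustrative assumptions rather than the parameters used in this study.

```python
def ser_thr_fraction(peptide):
    """Return the fraction of residues that are serine (S) or threonine (T)."""
    residues = [aa for aa in peptide.upper() if aa.isalpha()]
    if not residues:
        return 0.0
    return sum(aa in "ST" for aa in residues) / len(residues)

def ser_thr_rich_windows(peptide, width=20, threshold=0.5):
    """Return start positions of windows whose Ser/Thr fraction meets the threshold."""
    return [
        i
        for i in range(len(peptide) - width + 1)
        if ser_thr_fraction(peptide[i:i + width]) >= threshold
    ]
```

Multiplying `ser_thr_fraction` by 100 gives a percentage comparable to the figure quoted in the text, and the windowed variant gives a crude way to locate candidate mucin-like repeat regions along a long peptide.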
Identification of the Human MUC19 Locus by Searching the Translated Genomic Database with "Gel-Forming Mucin HMM"
In contrast to the mouse EST database search, human MUC19 was not found in the human EST library. After looking through the current human EST libraries, we realized this problem might be due to the lack of a human salivary gland library in the human EST database. To overcome this obstacle, we performed the screening using the translated human genomic databases deduced from the publicly available GenBank database. By using this approach, we were able to identify the putative human MUC19 locus on chromosome 12 (Figure 3B). We also screened the translated mouse genomic databases and found the mouse Muc19 locus on chromosome 15 (Figure 3B), which further confirmed the sensitivity and accuracy of our screening method. Interestingly, this portion of mouse chromosome 15 seems to be homologous to the corresponding region of human chromosome 12. Notably, we were unable to identify any candidates other than MUC19/Muc19 by this search of both the human and mouse genomes.
Molecular Cloning and Sequencing of the 3' End of Human MUC19
Phylogenetic and Sequence Analysis of Gel-Forming Mucin Genes
The Predicted Gene Structure Upstream of the Cloned 3' End of MUC19/Muc19
The genomic sequences from both the human MUC19 and mouse Muc19 loci allow us to deduce the genomic structure and protein motifs. Most importantly, those sequences have been shown to be very similar to PSM. We then tried to predict the gene structure upstream of the cloned MUC19/Muc19 sequences by comparing their genomic sequence with the PSM peptide sequence using the Genewise program (Ewan Birney [http://www.sanger.ac.uk/Software/Wise2]). The benefit of this prediction program is that it uses sequence homology in addition to sequence statistics to facilitate the exon prediction. Thus it is more accurate than a conventional exon prediction method like GENESCAN, which depends solely on sequence statistics (19). As shown in Figures 6A and 6B, both peptide sequences deduced from the genomic sequences share similar structural domains with other gel-forming mucin genes: 5'-VWD-VWD-VWD-mucin repeats-VWC-CT-3'. Both genes seem to have a very large central region containing most of the serine/threonine-rich repeats, which is reminiscent of the large central exon of the MUC5B gene (22). Those structural features are very similar to PSM (Figure 6C). As we expected, the predicted peptide sequences from MUC19 and Muc19 were very similar to the PSM sequence (Figure 7). Highly homologous sequences were found at both the C terminus and the putative N terminus of the peptide sequences of MUC19/Muc19, while no significant homology was seen in the central repetitive regions (Figure 7). Both MUC19 and Muc19 are very large genes. Human MUC19 has more than 180 kb of genomic sequence with a deduced peptide sequence larger than 7,000 amino acids, whereas mouse Muc19 has 80 kb of genomic sequence with 3,000 amino acids. The smaller size of mouse Muc19 might result from more gaps and much lower quality of the mouse genomic sequences available in the current database.
We expect that the genomic size of mouse Muc19 will probably prove similar to that of human MUC19 when the mouse genome project is complete.
Characterization of the Expression of MUC19/Muc19 In Vitro and In Vivo
To further examine the expression of MUC19/Muc19 in various tissues, both Northern blot and RT-PCR approaches were used to screen the mouse and human multitissue panels. As with other gel-forming mucin gene messages, the Northern blot revealed a polydispersed feature of MUC19/Muc19 messages in salivary gland and tracheal tissues (Figure 8A). In mouse, Muc19 is mainly expressed in the two major salivary glands, sublingual and submandibular, and to a much lesser extent in trachea (Figure 8A). Muc19 is expressed at a higher level in the sublingual gland than in the submandibular gland, and it is undetectable in the parotid gland (Figure 8A). This result is consistent with the distribution of the mucous cell population in those glands. Among these three major salivary glands, the sublingual gland contains mostly the mucous cell type, the submandibular gland contains a mixture of mucous and serous cell types, and the parotid gland cells are mostly of the serous cell type. In human tissues, we also detected similar polydispersed signals from trachea and submandibular gland RNA samples (Figure 8B). To increase the sensitivity and the coverage of this tissue distribution study, we further used the quantitative RT-PCR method to screen additional human and mouse tissue samples. In the screening, primers hmuc19_1333/hmuc19_1426 were used for human, and primers mmuc19_1378/mmuc19_1443 were used for mouse. As summarized in Table 2, MUC19/Muc19 expression is very restricted and cannot be detected by RT-PCR in various nonglandular tissues.
We used in situ hybridization to further examine the specific cell types that express MUC19/Muc19 messages. MUC19 transcripts were detected in the mucous cells of the human submandibular gland and the submucosal gland of the human trachea (Figure 9). A similar positive hybridization of the mouse Muc19 probe was seen in mouse tissue sections from the sublingual gland and tracheal submucosal gland (Figure 10). Notably, there is no hybridization signal in most serous cells of these glands. The strict cell type specificity of MUC19/Muc19 may explain why only low levels of these transcripts were found in the tracheal RNA sample, in which most of the RNA species are generated from the nonglandular portion.
The current explosion of sequence data from the genome and EST projects of different species makes it much easier to identify new gene family members. In addition to the simple sequence similarity search, pattern-based search methods have proven to be more robust (13, 17). In this study, we successfully used the HMM-based approach to identify a novel gel-forming mucin gene, MUC19/Muc19, which is specifically expressed in various glandular tissues. In contrast to conventional biological discovery, the bioinformatic discovery approach requires a precise mathematical definition of the specific features of the gene family of interest. Our initial attempt to define the "mucin-like threonine/serine-rich repeats" for discovering new mucin genes failed completely. This was partly due to the heterogeneous nature of the mucin genes; some of them are named quite arbitrarily. MUC7, for example, was named a mucin only because of the presence of four mucin-like serine/threonine-rich repeats (23). As a matter of fact, many immunoglobulin genes have more mucin repeats than MUC7. It seems that the conventional mucin definition is too loose to distinguish the real mucins from other mucin-like genes. Thus, we tried to define the mucin genes based on additional features of their peptide sequences. We found that a mucin subgroup called "gel-forming mucin" (1, 12) was much easier to define. All of these gel-forming mucin genes share similar conserved motifs and structures. Most notably, they have been suggested to be the determining factor for the viscoelastic properties of mucus secretion and mucus gel formation in various organs. We therefore defined the "gel-forming mucin-specific HMM" based on specific features at the 3' ends of known gel-forming mucin genes in various species. After screening the EST databases, we found this "gel-forming mucin-specific HMM" to be very specific and discriminating.
The approach identified all previously known gel-forming mucin genes of various species without missing any. No other hits had a high enough score to be considered except MUC19/Muc19. That was also true when translated human and mouse genomic databases were included in the screening. The newly identified MUC19/Muc19 gene has the gel-forming mucin features, with a structure significantly similar to the porcine and bovine submaxillary mucins. It has been suggested that all the known gel-forming mucin genes evolved from one common ancestor with vWF by gene duplication events (1). Structurally, MUC19/Muc19 is also very similar to vWF as well as to other gel-forming mucin genes. Interestingly, human MUC19 resides at chromosome 12q12, which is close to the location of vWF (12p13). In the phylogenetic tree, MUC19 is much closer to MUC2/MUC5AC/MUC5B than to MUC6, although MUC6 is also located in the 11p15 (24). We suspect that MUC19 shares a similar ancestor with the other gel-forming mucins and branched out evolutionarily later than MUC6.
The most striking feature of MUC19/Muc19 is their size. Of the known sequences, MUC5B is the largest human gel-forming mucin, consisting of Similar to their porcine/bovine counterparts, MUC19/Muc19 is expressed mainly in the major salivary glands, including both the sublingual and submandibular glands. This then raises the question: what is the major mucin component in the saliva? A previous study has suggested that MUC5B protein is the major mucin component in the high molecular weight portion of salivary mucus, based on the comparison of the known mucin species in the saliva as well as in RNA samples from salivary gland (27, 28). However, a recent paper indicates that concentrated solutions of salivary MUC5B protein alone cannot replicate the gel-forming properties of saliva (29), which suggests that additional mucin(s) take part in mediating mucus gel formation. In this study, we have demonstrated that MUC19/Muc19 transcripts are present in the major salivary glands at a high level. Because of its large size, this new mucin may be one of the major components contributing to the viscosity of salivary mucus. We have also demonstrated the expression of MUC19/Muc19 in the mucous cells of airway submucosal glands. The submucosal gland is one of the major sources of airway mucus secretion. Until now, MUC5B protein has been the only gel-forming mucin identified in the mucous cells of human airway submucosal glands (21). Both MUC5B and MUC5AC mucin proteins have been identified in human airway secretions from normal subjects and from patients with various chronic diseases (8, 9, 30–32). It is very possible that MUC19/Muc19 protein also contributes to airway mucus secretion. Because of its huge size, MUC19/Muc19 mucin may be essential in determining the viscoelasticity of the airway mucus secretion.
In chronic airway diseases such as asthma and COPD, an unusually high level of MUC19/Muc19 mucin could worsen the morbidity and mortality of these diseases by increasing the tenacity of mucus plugs in airways.
In summary, we have identified a novel gland-specific gel-forming mucin gene, MUC19/Muc19, by using an HMM-based genome-wide search approach. Molecular cloning and sequence information suggest that this mucin gene is probably the largest gel-forming mucin gene identified so far, and it has all the features of the known gel-forming mucins. Expression analyses, based on Northern blot and in situ hybridization, demonstrate that MUC19/Muc19 is mainly expressed in the mucous cells of various glands, including the major salivary glands (sublingual and submandibular glands) and the submucosal gland of large airways. Further studies of the expression and the biochemical properties of this novel mucin gene in various mucus secretions will be essential to understanding the function and the regulation of this newly found mucin in normal and diseased states.
This manuscript is supported in part by NIH grants (HL35635, ES06230, ES09701, AI50496, ES04699, and ES05707), the California Tobacco-Related Disease Research Program (10RT-0262), and Cystic Fibrosis Research, Inc. (Grant #03006j).
1. Human MUC19: GenBank Accession Number AY236870.
2. Mouse Muc19: GenBank Accession Number AY193891.
Received in original form March 21, 2003
Received in final form July 2, 2003