Crossref journal-article
American Association for the Advancement of Science (AAAS)
Science (221)
Abstract

The 97-megabase genomic sequence of the nematode Caenorhabditis elegans reveals over 19,000 genes. More than 40 percent of the predicted protein products find significant matches in other organisms. There is a variety of repeated sequences, both local and dispersed. The distinctive distribution of some repeats and highly conserved genes provides evidence for a regional organization of the chromosomes.

Bibliography

(1998). Genome Sequence of the Nematode C. elegans: A Platform for Investigating Biology. Science, 282(5396), 2012–2018.

Authors 1
  1. (first)
References 81 Referenced 3,484
  1. M. S. Chee et al. in Cytomegaloviruses vol. 154 of Current Topics in Microbiology and Immunology J. K. McDougall Ed. (Springer-Verlag Berlin 1990) pp. 125–169;
  2. 10.1126/science.7542800
  3. C. J. Bult et al. ibid. 273 1058 (1996).
  4. F. R. Blattner et al. ibid. 277 1453 (1997). (10.1126/science.277.5331.1453)
  5. S. T. Cole et al. Nature 393 537 (1998).
  6. H. W. Mewes et al. Nature 387 (suppl.) 7 (1997); (10.1038/387s007)
  7. 10.1126/science.274.5287.546
  8. 10.1073/pnas.83.20.7821
  9. 10.1002/bies.950130809
  10. 10.1038/335184a0
  11. The current status of the C. elegans physical map is accessible on the World Wide Web (20, 21).
  12. The investigations contributing to the C. elegans genome project are too numerous to cite. Two early representative publications are
  13. 10.1093/nar/15.5.2295
  14. and
  15. 10.1016/0022-2836(88)90374-9
  16. 10.1038/ng0592-114
  17. ; W. R. McCombie et al. ibid. p. 124.
  18. Y. Kohara PNE Protein Nucleic Acid Enzyme 41 715 (1996).
  19. 10.1093/genetics/130.3.471
  20. 10.1126/science.3033825
  21. 10.1038/356037a0
  22. R. Wilson et al. ibid. 368 32 (1994).
  23. 10.1093/nar/23.4.670
  24. For details of the sequencing process, see (49). The process began with the purification of DNA from selected clones of the tiling path. The DNA was sheared mechanically and, after size selection, the resulting fragments were subcloned into M13 or plasmid vectors. Random subclones were selected for sequence generation (the shotgun sequencing approach). Generally, 900 sequence reads per 40 kb of genomic DNA were generated with fluorescent dye–labeled primers or terminators. Bases were determined with PHRED (50). An assembly of these random sequences that was generated with PHRAP (51) typically resulted in two to eight contigs. Gap closure and resolution of sequence ambiguities were achieved during finishing [using the editing packages GAP (52) and CONSED (53) and the collection of additional data] through longer reads, directed sequencing reactions using custom oligonucleotide primers on chosen templates, or additional chemistries as required. High-quality finished sequence was analyzed through the use of a suite of programs (including BLAST and GENEFINDER), and the results were stored in ACEDB and submitted to GenBank. Unfinished and finished sequence data were available to investigators by file transfer protocol (ftp) from both sequencing sites (20, 21).
  25. 10.1101/gr.8.5.557
  26. 10.1093/nar/20.10.2471
  27. ; J. D. Parsons Comput. Appl. Biosci. 11 615 (1995). (10.1093/bioinformatics/11.6.615)
  28. 10.1101/gr.8.5.562
  29. 10.1093/nar/20.5.1083
  30. 10.1073/pnas.91.12.5695
  31. A clean separation of the YAC DNA from the host chromosomal DNA sometimes required the use of yeast strains in which specific yeast chromosomes are altered in size to provide a window around the YAC that is free of the native chromosomes.
  32. 10.1073/pnas.92.25.11706
  33. 10.1101/gr.7.5.551
  34. Available at www.sanger.ac.uk.
  35. Available at genome.wustl.edu/gsc/gschmpg.html.
  36. 10.1073/pnas.93.17.8983
  37. Every region must be sequenced either on each strand or with dye primer and dye terminator chemistry, which extensive comparisons have shown to be at least as reliable as double stranding in revealing and correcting compressions and other base-calling errors. All regions must be represented by reads from two or more independent subclones or from PCR products across the region. If subcloned PCR products are used for a region, three independent clones must be sequenced. Rare exceptions to the general rules of double stranding or alternative chemistry were permitted on the basis of the following. For regions of <50 bases where, despite valid efforts, a finisher is unable to achieve double stranding or double chemistry, the sequence may be submitted (provided the sequence is of high quality and both the finisher and his or her supervisor see no ambiguous bases). When editing in XGAP, all sequence data must be resolved at the 75% consensus level, either by the collection of additional data or by the editing of poorly called traces. In CONSED, any consensus base with a quality <25% must be manually reviewed to determine if the available data are sufficient to unambiguously support the derived contig sequence. If not, additional data are collected.
  38. Each finished sequence is submitted to a series of quality control tests, including verification that all of the finishing rules (23) have been followed and a careful verification that the assembly is consistent with all restriction digest information. In addition, every finished sequence undergoes an automatic process of base calling and reassembly with different algorithms than those that were used for the initial assembly, and comparison of the resultant consensus by a banded Smith-Waterman analysis [CROSSMATCH (51)] against the sequence that was obtained by the finisher. Any discrepancies in assembly or sequence, along with any regions failing to meet finishing criteria, are manually reviewed, and new data are collected as necessary. Only when all discrepancies are accounted for is the sequence passed on for annotation. In turn, if annotation flags any suspicious regions, these are again passed back to the finisher for resolution, either through additional data collection or editing.
  39. P. Green and L. Hillier unpublished software.
  40. 10.1016/0022-2836(91)90108-I
  41. 10.1093/nar/25.5.955
  42. 10.1016/S0022-2836(05)80360-2
  43. ; W. Gish WU-BLAST unpublished software.
  44. E. L. L. Sonnhammer and R. Durbin in Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology R. Altman D. Brutlag P. Karp R. Lathrop D. Searls Eds. (AAAI Press Menlo Park CA 1994) pp. 363–368.
  45. R. Mott Comput. Appl. Biosci. 13 477 (1997).
  46. 10.1093/nar/26.1.320
  47. ; S. R. Eddy Curr. Opin. Struct. Biol. 6 361 (1996). (10.1016/S0959-440X(96)80056-X)
  48. We identified local tandem and inverted repeats with the programs QUICKTANDEM, TANDEM, and INVERTED (20), which search for repeats within 1-kb intervals along the genomic sequence. An index of repeat families used by the project is available at www.sanger.ac.uk/Projects/C_elegans/repeats/.
  49. R. Durbin and J. Thierry-Mieg unpublished software. Documentation, code, and data are available from anonymous ftp servers at lirmm.lirmm.fr/pub/acedb/, ftp.sanger.ac.uk/pub/acedb/, and ncbi.nlm.nih.gov/repository/acedb/.
  50. In C. elegans, two or more genes can be transcribed from the same promoter, with one gene separated by no more than a few hundred nucleotides from another. In genes undergoing trans-splicing, the 5′ exon begins with a splice acceptor sequence, making this 5′ exon more difficult to distinguish from internal exons. This combination of factors may result in two genes being merged into one [
  51. 10.1016/S0168-9525(00)89026-5
  52. We have identified 182 genes possessing alternative splice variants, which are predominantly from EST data. Of these, 67 genes produce proteins that differ at their amino termini, 57 genes produce proteins that differ at the carboxyl end, and 59 genes produce proteins that display an internal variation. Of the internal variations, seven genes showed complete exon skipping. Thirty-one genes were found where the 5′ end of an exon had changed, 21 of which resulted in a difference of three or fewer codons. In contrast, of the 24 alternative transcripts that changed the 3′ end of an exon, only 4 resulted in a change of three or fewer codons.
  53. Available at www.sanger.ac.uk/Projects/C_elegans/Science98/.
  54. R. K. Herman in The Nematode Caenorhabditis elegans W. B. Wood Ed. (Cold Spring Harbor Laboratory Press Plainview NY 1988) pp. 17–45;
  55. 10.1073/pnas.92.24.10836
  56. These results were obtained with WU-BLAST (version 2.0a13MP) using default parameters and a threshold P value of 10^-3.
  57. 10.1126/science.8456298
  58. 10.1126/science.282.5396.2022
  59. 10.1006/geno.1997.4989
  60. GENEFINDER systematically uses statistical criteria [primarily log likelihood ratios (LLRs)] to attempt to identify likely genes within a region of genomic sequence. Candidate genes are evaluated on the basis of “scores” that reflect their splice site, translation start site, coding potential LLRs, and intron sizes. These scores are normalized by reference to the distribution of combined scores in a simulated sequence as follows: If a given combined score occurs, on average, once in every 10^s nucleotides in simulated DNA, then the corresponding normalized score is set to s. (For example, exons with a normalized score of 5.0 or greater will be found only once in every 100 kb of simulated DNA. With the current reference simulated sequence, which is 1 Mb in length, 6.0 is the maximum normalized score that can occur.) A dynamic programming algorithm is then used to find the set of nonoverlapping candidate genes (on a given strand) that has the highest total score (among all such sets). About 85% of experimentally verified “exon ORFs” (open reading frames containing true exons) in C. elegans genes in GenBank have normalized scores above 5.0 (and many of the remaining 15% are initial or terminal exons, which have a single splice site). The fraction of exons with scores >5.0 may be lower for all C. elegans genes because of the bias toward highly expressed genes (which often have very high coding segment scores) in the experimentally verified set. However, even for genes in the current verified set that are expressed at moderate to low levels, a majority of exon ORF scores exceed 5.0; this score should be an effective criterion for identifying at least part of most genes. In theory, high-scoring ORFs could arise in other ways. For example, intergenic or intronic regions having abnormal nucleotide composition might appear to have coding segments and occasionally, by chance, may have high-scoring splice sites. So far, there seem to be relatively few such regions in the C. elegans genomic sequence. These regions may account for the anomalous orphan exons that we occasionally find. In addition, there are examples where these GENEFINDER-predicted genes fall into clear gene families that are nematode-specific or have only very distant similarity outside the nematodes, for example, chemoreceptor genes (54).
  61. Pfam is a collection of protein family alignments that were constructed semiautomatically with hidden Markov models within the HMMER package. The collagen and seven transmembrane chemoreceptor data were obtained with unpublished hidden Markov models. The number of seven transmembrane chemoreceptor genes is lower than that found by Robertson (54), which could be due to pseudogenes.
  62. Putative tRNA pseudogenes are identified by the search program tRNAscan-SE as sequences that are significantly related to a tRNA sequence consensus but do not appear to be likely to adopt a tRNA's canonical secondary structure (26). Many higher eukaryotic genomes have mobile tRNA-derived short interspersed nuclear elements (SINEs). However, because they are few in number, the nematode tRNA pseudogenes seem more likely to have arisen by some rare event rather than by the extensive mobility that characterizes mobile SINEs [
  63. 10.1038/317819a0
  64. A. F. Smit Curr. Opin. Genet. Dev. 6 743 (1996).
  65. 10.1093/nar/25.20.4041
  66. 10.1146/annurev.ge.29.120195.002305
  67. 10.1038/369371a0
  68. The abundance of C. elegans ESTs does not directly reflect expression levels because they are derived from cDNAs in which more abundantly expressed genes were partially selected against (6, 7).
  69. 10.1093/genetics/141.1.159
  70. This approach is also being used for the human genome (Sanger Centre, Washington University Genome Sequencing Center, Genome Res., in press).
  71. For methodological details see (20) or (21). For biochemical procedures see R. K. Wilson and E. R. Mardis in Genome Analysis: A Laboratory Manual B. Birren E. D. Green S. Klapholz R. M. Myers J. Roskams Eds. (Cold Spring Harbor Laboratory Press Plainview NY 1997) vol. 1 pp. 397–454. For software packages see (20) or (21) and
  72. 10.1101/gr.8.3.260
  73. ; M. Wendl et al. ibid. p. 975; J. D. Parsons Comput. Appl. Biosci. 11 615 (1995); and (10.1093/bioinformatics/11.6.615)
  74. 10.1101/gr.6.11.1110
  75. 10.1101/gr.8.3.175
  76. ; B. Ewing and P. Green ibid. p. 186.
  77. P. Green personal communication.
  78. 10.1093/nar/23.24.4992
  79. 10.1101/gr.8.3.195
  80. 10.1101/gr.8.5.449
  81. This work has been supported by grants from the U.S. National Human Genome Research Institute and the UK MRC. We would also like to thank the many members of the C. elegans community who have shared data and provided encouragement in the course of this project.
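The score normalization and gene-set selection described in note 60 can be illustrated with a short sketch. GENEFINDER itself is cited as unpublished software, so nothing below reproduces its code: the function names, the (start, end, score) candidate tuples, and the reading of "a score occurs" as "a score at least this high occurs" are all assumptions made for illustration. The selection step uses the textbook weighted interval scheduling recurrence as a stand-in for the dynamic programming algorithm the note mentions.

```python
import math
from bisect import bisect_right
from typing import List, Sequence, Tuple

# Hypothetical candidate representation: (start, end, score), end exclusive.
Candidate = Tuple[int, int, float]


def normalized_score(raw_score: float,
                     simulated_scores: Sequence[float],
                     simulated_length: int) -> float:
    """Return s such that scores >= raw_score occur about once per 10**s
    nucleotides of simulated sequence (assumed interpretation of note 60)."""
    hits = sum(1 for x in simulated_scores if x >= raw_score)
    if hits == 0:
        # Never observed in the simulation: cap at the largest value the
        # simulated sequence can resolve, log10(length) (6.0 for 1 Mb).
        return math.log10(simulated_length)
    return math.log10(simulated_length / hits)


def best_nonoverlapping(candidates: Sequence[Candidate]) -> Tuple[float, List[Candidate]]:
    """Pick the nonoverlapping candidates on one strand with the highest
    total score, using weighted interval scheduling."""
    ordered = sorted(candidates, key=lambda c: c[1])  # sort by end coordinate
    ends = [c[1] for c in ordered]
    n = len(ordered)
    best = [0.0] * (n + 1)    # best[i]: best total using the first i candidates
    take = [False] * (n + 1)  # take[i]: whether candidate i is used in best[i]
    prev = [0] * (n + 1)      # prev[i]: number of candidates ending before i starts
    for i, (start, _end, score) in enumerate(ordered, start=1):
        # Count earlier candidates whose end is at or before this start.
        prev[i] = bisect_right(ends, start, 0, i - 1)
        with_current = best[prev[i]] + score
        if with_current > best[i - 1]:
            best[i], take[i] = with_current, True
        else:
            best[i] = best[i - 1]
    # Walk back through the table to recover the chosen set.
    chosen: List[Candidate] = []
    i = n
    while i > 0:
        if take[i]:
            chosen.append(ordered[i - 1])
            i = prev[i]
        else:
            i -= 1
    chosen.reverse()
    return best[n], chosen


if __name__ == "__main__":
    simulated = [1.0] * 990 + [9.0] * 10  # made-up simulated scores
    print(normalized_score(9.0, simulated, 1_000_000))
    genes = [(0, 100, 5.0), (50, 150, 8.0), (100, 200, 4.5), (160, 260, 3.0)]
    print(best_nonoverlapping(genes))
```

In this toy input, ten simulated scores reach the threshold in 1 Mb, which corresponds to once per 100 kb and matches the note's example of a normalized score of 5.0. Sorting by end coordinate and binary-searching for the last compatible candidate keeps the selection step at O(n log n).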
Dates
Type When
Created July 27, 2002, 5:50 a.m.
Deposited Jan. 13, 2024, 12:30 a.m.
Indexed Aug. 30, 2025, 12:34 p.m.
Issued Dec. 11, 1998
Published Dec. 11, 1998
Published Print Dec. 11, 1998
Funders 0

None

@article{1998, title={Genome Sequence of the Nematode C. elegans: A Platform for Investigating Biology}, volume={282}, ISSN={1095-9203}, url={http://dx.doi.org/10.1126/science.282.5396.2012}, DOI={10.1126/science.282.5396.2012}, number={5396}, journal={Science}, publisher={American Association for the Advancement of Science (AAAS)}, year={1998}, month=dec, pages={2012–2018} }