10.1126/science.287.5461.2204
Crossref journal-article
American Association for the Advancement of Science (AAAS)
Science (221)
Abstract

A comparative analysis of the genomes of Drosophila melanogaster, Caenorhabditis elegans, and Saccharomyces cerevisiae, and the proteins they are predicted to encode, was undertaken in the context of cellular, developmental, and evolutionary processes. The nonredundant protein sets of flies and worms are similar in size and are only twice that of yeast, but different gene families are expanded in each genome, and the multidomain proteins and signaling pathways of the fly and worm are far more complex than those of yeast. The fly has orthologs to 177 of the 289 human disease genes examined and provides the foundation for rapid analysis of some of the basic processes involved in human disease.

Bibliography

Rubin, G. M., Yandell, M. D., Wortman, J. R., Miklos, G. L. G., Nelson, C. R., Hariharan, I. K., Fortini, M. E., Li, P. W., Apweiler, R., Fleischmann, W., Cherry, J. M., Henikoff, S., Skupski, M. P., Misra, S., Ashburner, M., Birney, E., Boguski, M. S., Brody, T., … Lewis, S. (2000). Comparative Genomics of the Eukaryotes. Science, 287(5461), 2204–2215.

Authors 55
  1. Gerald M. Rubin (first)
  2. Mark D. Yandell (additional)
  3. Jennifer R. Wortman (additional)
  4. George L. Gabor Miklos (additional)
  5. Catherine R. Nelson (additional)
  6. Iswar K. Hariharan (additional)
  7. Mark E. Fortini (additional) University of Pennsylvania
  8. Peter W. Li (additional)
  9. Rolf Apweiler (additional)
  10. Wolfgang Fleischmann (additional)
  11. J. Michael Cherry (additional)
  12. Steven Henikoff (additional)
  13. Marian P. Skupski (additional)
  14. Sima Misra (additional)
  15. Michael Ashburner (additional)
  16. Ewan Birney (additional)
  17. Mark S. Boguski (additional)
  18. Thomas Brody (additional)
  19. Peter Brokstein (additional)
  20. Susan E. Celniker (additional)
  21. Stephen A. Chervitz (additional)
  22. David Coates (additional)
  23. Anibal Cravchik (additional)
  24. Andrei Gabrielian (additional)
  25. Richard F. Galle (additional)
  26. William M. Gelbart (additional)
  27. Reed A. George (additional)
  28. Lawrence S. B. Goldstein (additional)
  29. Fangcheng Gong (additional)
  30. Ping Guan (additional)
  31. Nomi L. Harris (additional)
  32. Bruce A. Hay (additional)
  33. Roger A. Hoskins (additional)
  34. Jiayin Li (additional)
  35. Zhenya Li (additional)
  36. Richard O. Hynes (additional)
  37. S. J. M. Jones (additional)
  38. Peter M. Kuehl (additional)
  39. Bruno Lemaitre (additional)
  40. J. Troy Littleton (additional)
  41. Deborah K. Morrison (additional)
  42. Chris Mungall (additional)
  43. Patrick H. O'Farrell (additional)
  44. Oxana K. Pickeral (additional)
  45. Chris Shue (additional)
  46. Leslie B. Vosshall (additional)
  47. Jiong Zhang (additional)
  48. Qi Zhao (additional)
  49. Xiangqun H. Zheng (additional)
  50. Fei Zhong (additional)
  51. Wenyan Zhong (additional)
  52. Richard Gibbs (additional)
  53. J. Craig Venter (additional)
  54. Mark D. Adams (additional)
  55. Suzanna Lewis (additional)
References 96 Referenced 1,319
  1. 10.1126/science.287.5461.2185
  2. C. elegans Sequencing Consortium, Science 282, 2012 (1998). (10.1126/science.282.5396.2012)
  3. 10.1126/science.274.5287.546
  4. 10.1126/science.7542800
  5. C. elegans data were taken from A C. Elegans Database (ACEDB) release WS8.
  6. Local gene duplications were determined by searching for N similar genes within 2N genes on each arm. For example, if three similar genes are found within a region containing six genes, this counts as one cluster of three genes. Genes were judged to be similar if a BLASTP High Scoring Pair (HSP) with a score of 200 or more existed between them. Histone gene clusters were not included. C. elegans data were taken from ACEDB release WS8, containing 18,424 genes.
  7. More information about GO is available at . The Gene Ontology project provides terms for categorizing gene products on the basis of their molecular function, biological role, and cellular location, using controlled vocabularies.
  8. Initial results came from an NxN BLASTP analysis performed for each fly, worm, and yeast sequence in a combined data set of these completed proteomes. The databases used are as follows: Celera–Berkeley Drosophila Genome Project (BDGP), 14,195 predicted protein sequences (1/5/2000); WormPep 18, Sanger Centre, 18,576 protein sequences; and Saccharomyces Genome Database (SGD), 6306 protein sequences (1/7/2000). A version of NCBI-BLAST2 was used with the SEG filter and with the effective search space length (Y option) set to 17,973,263. Pairs were formed between every query sequence with a significant BLASTP hit to one of the other organisms' sequences. Significance was based on E-value cutoffs and length of match. These pairs were then independently grouped using single linkage clustering (61). Finally, the number of proteins from each proteome was counted. The requirement for 80% alignment of sequences makes this method of defining orthology particularly sensitive to errors that arise from incorrect protein prediction. However, the results comparing yeast and worm are essentially identical to those previously reported (61), even though the effective database size was different, the data sets have changed (Chervitz: yeast 6217 and worm 19,099; this study: yeast 6306 and worm 18,576), and the version of BLAST used is quite different (Chervitz: WashU BLAST 2.0a19MP; this study: NCBI BLAST 2.08).
  9. 10.1093/nar/28.1.45
  10. 10.1093/nar/28.1.228
  11. InterPro (Integrated resource for protein domains and functional sites) is a collaborative effort of the SWISS-PROT, TrEMBL, PROSITE, PRINTS, Pfam, and ProDom databases to integrate the different pattern databases into a single resource. The database and a detailed description of the project can be found under . PROSITE is described in
  12. 10.1093/nar/27.1.215
  13. Pfam is described in
  14. 10.1093/nar/27.1.260
  15. PRINTS is described in
  16. 10.1093/nar/27.1.220
  17. G. D. Plowman, S. Sudarsanam, J. Bingham, D. Whyte, T. Hunter, Proc. Natl. Acad. Sci. U.S.A. 96, 13603 (1999). (10.1073/pnas.96.24.13603)
  18. J. Barrett, N. D. Rawlings, J. F. Wessner, Eds., Handbook of Proteolytic Enzymes (Academic Press, San Diego, CA, 1998).
  19. 10.1038/368548a0
  20. 10.1073/pnas.95.12.6819
  21. 10.1016/S0962-8924(98)01494-9
  22. 10.1016/S0962-8924(99)01667-0
  23. 10.1017/S0033583500005783
  24. P. Vernier, B. Cardinaud, O. Valdenaire, H. Philippe, J.-D. Vincent, Trends Pharmacol. Sci. 16, 375 (1995). (10.1016/S0165-6147(00)89078-1)
  25. 10.1016/S0925-4773(99)00141-0
  26. 10.1016/0092-8674(94)90384-0
  27. P. Mombaerts, Science 286, 707 (1999). (10.1126/science.286.5440.707)
  28. 10.1126/science.282.5396.2028
  29. 10.1016/S0896-6273(00)81093-4
  30. 10.1016/S0092-8674(00)80582-6
  31. 10.1002/(SICI)1096-9861(19990322)405:4<543::AID-CNE7>3.0.CO;2-A
  32. 10.1126/science.282.5390.943
  33. 10.1016/S0092-8674(00)81401-4
  34. 10.1038/378206a0
  35. 10.1126/science.276.5313.791
  36. 10.1016/S0092-8674(00)80657-1
  37. 10.1016/0092-8674(94)90506-1
  38. 10.1074/jbc.272.2.1002
  39. 10.1126/science.270.5233.86
  40. 10.1073/pnas.91.14.6359
  41. 10.1101/gad.10.10.1206
  42. 10.1006/bbrc.1998.9407
  43. 10.1016/S0092-8674(00)81722-5
  44. T. Kreis and R. Vale, Eds., Guidebook to the Cytoskeletal and Motor Proteins (Oxford Univ. Press, Oxford, 1999). (10.1093/oso/9780198599579.001.0001)
  45. 10.1038/71350
  46. 10.1091/mbc.9.6.1293
  47. 10.1016/S0092-8674(00)80960-5
  48. K. Weber in (29) pp. 291–293.
  49. 10.1126/science.7892610
  50. 10.1016/S0092-8674(00)80789-8
  51. 10.1016/S0092-8674(00)81552-4
  52. 10.1146/annurev.cellbio.12.1.393
  53. 10.1016/S0168-9525(96)10051-2
  54. C. M. Blaumueller, S. Artavanis-Tsakonas, Perspect. Dev. Neurobiol. 4, 325 (1997).
  55. 10.1098/rstb.1998.0228
  56. 10.1101/gad.11.24.3286
  57. 10.1016/S0959-437X(99)80065-3
  58. 10.1016/S0070-2153(08)60261-6
  59. 10.1038/sj.onc.1203125
  60. P. W. H. Holland, J. Garcia-Fernandez, N. A. Williams, A. Sidow, Development (suppl.) (1994), p. 125. (10.1242/dev.1994.Supplement.125)
  61. 10.1126/science.282.5396.2033
  62. 10.1146/annurev.biochem.68.1.383
  63. 10.1038/sj.cdd.4400596
  64. 10.1016/S0092-8674(00)80085-9
  65. 10.1038/17135
  66. 10.1016/S0092-8674(00)80434-1
  67. A. G. Park, Trends Cell Biol. 10, 394 (2000).
  68. 10.1038/43678
  69. 10.1101/gad.13.15.1899
  70. 10.1016/S0962-8924(99)01609-8
  71. 10.1016/S0962-8924(99)01646-3
  72. 10.1093/emboj/17.21.6135
  73. 10.1038/23462
  74. 10.1038/362318a0
  75. 10.1146/annurev.biochem.68.1.863
  76. 10.1016/0092-8674(95)90396-8
  77. 10.1016/S0092-8674(00)80412-2
  78. 10.1016/S0952-7915(96)80100-2
  79. 10.1016/S1074-7613(00)80410-0
  80. 10.1073/pnas.95.17.10078
  81. 10.1073/pnas.93.15.7888
  82. 10.1016/S0962-8924(97)01087-8
  83. 10.1016/S0952-7915(99)00045-X
  84. G. L. G. Miklos, J. Am. Acad. Arts Sci. 127, 197 (1998).
  85. 10.1016/S0968-0004(98)01350-4
  86. 10.1016/S0092-8674(00)81200-3
  87. 10.1016/S0896-6273(00)80573-5
  88. J. M. Warrick et al., Nature Genet. 23, 425 (1999). (10.1038/70532)
  89. 10.1016/S0014-5793(96)01351-8
  90. 10.1016/S0168-9525(99)01934-4
  91. 10.1073/pnas.94.18.9746
  92. 10.1126/science.282.5396.2022
  93. See www.sciencemag.org/feature/data/1049664.shl for complete protein domain analysis.
  94. Paralogous gene families (Table 1) were identified by running BLASTP. A version of NCBI-BLAST2 optimized for the Compaq Alpha architecture was used with the SEG filter and the effective search space length (Y option) set to 17,973,263. Each protein was used as a query against a database of all other proteins of that organism. A clustering algorithm was then used to extract protein families from these BLASTP results. Each protein sequence constitutes a vertex; each HSP between protein sequences is an arc weighted by the BLAST Expect value. The algorithm identifies protein families by first breaking all arcs with an E value greater than some user-defined value (1 × 10⁻⁶ was used for all of the analyses reported here). The resulting graph is then split into subgraphs that contain at least two-thirds of all possible arcs between vertices. The algorithm is “greedy”; that is, it arbitrarily chooses a starting sequence and adds new sequences to the subgraph as long as this criterion is met. An interesting property of this algorithm is that it inherently respects the multidomain nature of proteins: for example, two multidomain proteins may have significant similarity to one another but share only one or a few domains. In such a case, the two proteins will not be clustered if the unshared domains introduce a large number of other arcs.
  95. An NxN BLASTP analysis was performed for each fly, worm, and yeast sequence in a combined data set of these completed proteomes. The databases used are as follows: Celera-BDGP, 14,195 predicted protein sequences (1/5/2000); WormPep18, Sanger Centre, 18,424 protein sequences; and SGD, 6246 protein sequences (1/7/2000). BLASTP analysis was also performed against known mammalian proteins (2/1/2000 GenBank nonredundant amino acid; human, mouse, and rat; 75,236 protein sequences), and TBLASTN analysis was performed against a database of mammalian ESTs (2/1/00 GenBank dbEST; human, mouse, and rat). A version of NCBI-BLAST2 optimized for the Compaq Alpha architecture was used with the SEG filter and the effective search space length (Y option) set to 17,973,263.
  96. The many participants from academic institutions are grateful for their various sources of support. Participants from the Berkeley Drosophila Genome Project are supported by NIH grant P50HG00750 (G.M.R.) and grant P41HG00739 (W.M.G.).
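
The greedy family-clustering procedure described in note 94 can be illustrated with a minimal sketch. This is not the authors' code: the function name `greedy_families`, the `(protein_a, protein_b, e_value)` input format, and the use of sorted order as a stand-in for the "arbitrary" choice of starting sequence are all assumptions made for illustration. The sketch drops arcs above the E-value cutoff, then grows each subgraph only while it keeps at least two-thirds of all possible arcs between its vertices.

```python
from fractions import Fraction


def greedy_families(hits, e_cutoff=1e-6, min_density=Fraction(2, 3)):
    """Group proteins into families from pairwise BLASTP hits.

    hits: iterable of (protein_a, protein_b, e_value) tuples (assumed format).
    Returns a list of families, each a sorted list of protein names.
    """
    # Build the graph: every protein is a vertex; an arc is kept only
    # when its E value does not exceed the cutoff.
    neighbors = {}
    for a, b, e_value in hits:
        neighbors.setdefault(a, set())
        neighbors.setdefault(b, set())
        if a != b and e_value <= e_cutoff:
            neighbors[a].add(b)
            neighbors[b].add(a)

    unassigned = set(neighbors)
    families = []
    # Sorted order is a deterministic stand-in for an arbitrary seed choice.
    for seed in sorted(neighbors):
        if seed not in unassigned:
            continue
        family = {seed}
        unassigned.discard(seed)
        arcs = 0  # number of arcs currently inside the family
        grew = True
        while grew:
            grew = False
            # Candidates are unassigned proteins linked to any member.
            linked = {n for member in family for n in neighbors[member]}
            for cand in sorted(linked & unassigned):
                new_arcs = arcs + len(neighbors[cand] & family)
                size = len(family) + 1
                possible = size * (size - 1) // 2
                # Accept only if the enlarged subgraph still contains at
                # least min_density of all possible arcs (exact arithmetic).
                if new_arcs >= min_density * possible:
                    family.add(cand)
                    unassigned.discard(cand)
                    arcs = new_arcs
                    grew = True
        families.append(sorted(family))
    return families


if __name__ == "__main__":
    example_hits = [
        ("p1", "p2", 1e-40), ("p1", "p3", 1e-35), ("p2", "p3", 1e-50),
        ("q1", "q2", 1e-20),
        ("p3", "q1", 1e-3),  # weaker than the cutoff, so this arc is broken
    ]
    print(greedy_families(example_hits))
```

In this sketch the density test is applied to the whole subgraph rather than to each new member individually, which is one reading of the note; a chain of hits such as a-b-c-d is therefore split once adding another member would drop the subgraph below two-thirds of its possible arcs, whereas single linkage clustering would keep the chain together.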
Dates
Type When
Created July 27, 2002, 1:35 a.m.
Deposited Jan. 13, 2024, 12:48 a.m.
Indexed Aug. 21, 2025, 12:59 p.m.
Issued March 24, 2000
Published March 24, 2000
Published Print March 24, 2000
Funders 0

None

@article{Rubin_2000, title={Comparative Genomics of the Eukaryotes}, volume={287}, ISSN={1095-9203}, url={http://dx.doi.org/10.1126/science.287.5461.2204}, DOI={10.1126/science.287.5461.2204}, number={5461}, journal={Science}, publisher={American Association for the Advancement of Science (AAAS)}, author={Rubin, Gerald M. and Yandell, Mark D. and Wortman, Jennifer R. and Miklos, George L. Gabor and Nelson, Catherine R. and Hariharan, Iswar K. and Fortini, Mark E. and Li, Peter W. and Apweiler, Rolf and Fleischmann, Wolfgang and Cherry, J. Michael and Henikoff, Steven and Skupski, Marian P. and Misra, Sima and Ashburner, Michael and Birney, Ewan and Boguski, Mark S. and Brody, Thomas and Brokstein, Peter and Celniker, Susan E. and Chervitz, Stephen A. and Coates, David and Cravchik, Anibal and Gabrielian, Andrei and Galle, Richard F. and Gelbart, William M. and George, Reed A. and Goldstein, Lawrence S. B. and Gong, Fangcheng and Guan, Ping and Harris, Nomi L. and Hay, Bruce A. and Hoskins, Roger A. and Li, Jiayin and Li, Zhenya and Hynes, Richard O. and Jones, S. J. M. and Kuehl, Peter M. and Lemaitre, Bruno and Littleton, J. Troy and Morrison, Deborah K. and Mungall, Chris and O'Farrell, Patrick H. and Pickeral, Oxana K. and Shue, Chris and Vosshall, Leslie B. and Zhang, Jiong and Zhao, Qi and Zheng, Xiangqun H. and Zhong, Fei and Zhong, Wenyan and Gibbs, Richard and Venter, J. Craig and Adams, Mark D. and Lewis, Suzanna}, year={2000}, month=mar, pages={2204--2215} }