DOI: 10.1126/science.285.5428.751. Detecting Protein Function and Protein-Protein Interactions from Genome Sequences

Crossref journal-article

American Association for the Advancement of Science (AAAS)

Science (221)

Abstract

A computational method is proposed for inferring protein interactions from genome sequences on the basis of the observation that some pairs of interacting proteins have homologs in another organism fused into a single protein chain. Searching sequences from many genomes revealed 6809 such putative protein-protein interactions in Escherichia coli and 45,502 in yeast. Many members of these pairs were confirmed as functionally related; computational filtering further enriches for interactions. Some proteins have links to several other proteins; these coupled links appear to represent functional interactions such as complexes or pathways. Experimentally confirmed interacting pairs are documented in a Database of Interacting Proteins.
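The fusion idea in the abstract, together with the domain-based procedure spelled out in the method note of the reference list, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the protein and domain names are made-up toy data, the function names are hypothetical, and "nonhomologous" is approximated here as "shares no domain".

```python
from itertools import combinations


def linked_domain_pairs(proteins):
    """Map each linked domain pair to the proteins that fuse the two domains.

    proteins: dict of protein name -> set of domain ids (all genomes pooled).
    A pair (a, b) found together in a protein P counts as linked only if at
    least one protein carries exactly one of the two domains.
    """
    links = {}
    for name, domains in proteins.items():
        for a, b in combinations(sorted(domains), 2):
            has_split = any((a in d) != (b in d) for d in proteins.values())
            if has_split:
                links.setdefault((a, b), set()).add(name)
    return links


def predicted_pairs(genome, links):
    """Return protein pairs in one genome whose domains are linked.

    genome: dict of protein name -> set of domain ids for a single organism.
    Assumption: two proteins are treated as nonhomologous if they share no domain.
    """
    pairs = set()
    for p, q in combinations(sorted(genome), 2):
        if genome[p] & genome[q]:
            continue  # shared domain: skip as possible homologs
        for a in genome[p]:
            for b in genome[q]:
                key = (a, b) if a < b else (b, a)
                if key in links:
                    pairs.add((p, q))
    return pairs


# Toy data (hypothetical): d1 and d2 are separate in one organism, fused in another.
proteins = {
    "ecoli_A": {"d1"},
    "ecoli_B": {"d2"},
    "ecoli_C": {"d3"},
    "yeast_AB": {"d1", "d2"},
}
genome = {name: doms for name, doms in proteins.items() if name.startswith("ecoli_")}

links = linked_domain_pairs(proteins)
pairs = predicted_pairs(genome, links)
print(links)
print(pairs)
```

In this toy case the fused protein `yeast_AB` plays the Rosetta Stone role and links `ecoli_A` to `ecoli_B`, while `ecoli_C` stays unlinked.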

Bibliography

Marcotte, E. M., Pellegrini, M., Ng, H.-L., Rice, D. W., Yeates, T. O., & Eisenberg, D. (1999). Detecting Protein Function and Protein-Protein Interactions from Genome Sequences. Science, 285(5428), 751–753.

Authors 6

Edward M. Marcotte (first)
Matteo Pellegrini (additional)
Ho-Leung Ng (additional)
Danny W. Rice (additional)
Todd O. Yeates (additional)
David Eisenberg (additional)

References 30 Referenced 1,233

B. Alberts et al. Molecular Biology of the Cell (Garland New York ed. 3 1994); H. Lodish et al. Molecular Cell Biology (Scientific American Books New York ed. 3 1995).
Fields S., Song O. K., Nature 340, 243 (1989). (10.1038/340245a0) / Nature by Fields S. (1989)
Berger J. M., Gamblin S. J., Harrison S. C., Wang J. C., Nature 379, 225 (1996). (10.1038/379225a0)
10.1126/science.277.5331.1453
The triplets of proteins are found with the aid of protein domain databases such as the ProDom or Pfam databases (17). Here, a list of all ProDom domains in every one of the 64,568 SWISS-PROT proteins was prepared, as well as a list of all proteins that contain each of the 53,597 ProDom domains. Then every protein in ProDom was considered for its ability to be a linking (or Rosetta Stone) member in a triplet. All pairs of domains that are both members of a given protein P were defined as being linked by protein P if we could find at least one protein with only one of the two domains. By this method, we found 14,899 links between the 7843 ProDom domains. Then, in a single genome (such as E. coli), we found all nonhomologous pairs of proteins containing linked domains. These pairs are linked by the Rosetta Stone proteins. For E. coli, this method finds 3531 protein pairs. An alternate method for discovering protein triplets uses amino acid sequence alignment techniques to find two proteins that align to a Rosetta Stone protein such that the alignments do not overlap on the Rosetta Stone protein. For E. coli, this method finds 4487 protein pairs, 1209 of which were also found by the ProDom search method (even though different sequence databases were searched for each method). All predictions are available on the World Wide Web at www.doe-mbi.ucla.edu.
Two amino acid sequences are said to be similar when the sequences align with a statistically significant alignment score. The significance is described by the probability of obtaining a higher alignment score when comparing shuffled sequences, with the acceptable probability threshold set by considering the total number of sequence comparisons performed. That is, if n proteins in E. coli are compared with m proteins in other genomes, n × m total comparisons are performed. We set a probability of 1/(n × m) as the threshold, as this is the lowest value that could be obtained by comparing n × m random sequences. For the ProDom-based identification of homologs, definitions of sequence similarity are as in the ProDom database.
The SWISS-PROT database is available at www.expasy.ch/sprot/.
The Database of Interacting Proteins is available on the Web at .
Pellegrini M., Marcotte E. M., Thompson M. J., Eisenberg D., Yeates T. O., Proc. Natl. Acad. Sci. U.S.A. 96, 4285 (1999). (10.1073/pnas.96.8.4285) / Proc. Natl. Acad. Sci. U.S.A. by Pellegrini M. (1999)
Erickson H. P., J. Mol. Biol. 206, 465 (1989); (10.1016/0022-2836(89)90494-4) / J. Mol. Biol. by Erickson H. P. (1989)
Nagi A. D., Regan L., Folding Design 2, 67 (1997). (10.1016/S1359-0278(97)00007-2) / Folding Design by Nagi A. D. (1997)
Pederson S., Bloch P. S., Reen S., Neidhardt F. C., Cell 14, 179 (1978). (10.1016/0092-8674(78)90312-4) / Cell by Pederson S. (1978)
Robinson C. R., Sauer R. T., Proc. Natl. Acad. Sci. U.S.A. 95, 5929 (1998). (10.1073/pnas.95.11.5929) / Proc. Natl. Acad. Sci. U.S.A. by Robinson C. R. (1998)
Horton N., Lewis M., Protein Sci. 1, 169 (1992); (10.1002/pro.5560010117) / Protein Sci. by Horton N. (1992)
Janin J., Biochimie 77, 497 (1995). (10.1016/0300-9084(96)88166-1) / Biochimie by Janin J. (1995)
Tsai C. J., Nussinov R., J. Mol. Biol. 260, 604 (1996). (10.1006/jmbi.1996.0424) / J. Mol. Biol. by Tsai C. J. (1996)
10.1038/385595a0
F. Sicheri, I. Moarefi, J. Kuriyan, ibid., p. 602.
The error in predicting protein-protein interactions due to the inability to distinguish homologs was estimated as 1 − T, where T is the mean percentage of potential true positives calculated for all domain pairs in E. coli. For each domain pair linked by a Rosetta Stone protein, there are n proteins with the first domain but not the second and m proteins with the second domain but not the first. The percentage of true positives T is therefore estimated as the smaller of n or m, divided by n times m.
10.1093/nar/26.1.323
Bateman A. et al., Nucleic Acids Res. 27, 260 (1999). (10.1093/nar/27.1.260)
A. Sugino, N. P. Higgins, N. R. Cozzarelli, ibid. 8, 3865 (1980); (10.1093/nar/8.17.3865)
Yeh W. K., Ornston L. N., J. Biol. Chem. 256, 1565 (1981); (10.1016/S0021-9258(19)69841-8) / J. Biol. Chem. by Yeh W. K. (1981)
McHenry C. S., Crow W., J. Biol. Chem. 254, 1748 (1979). (10.1016/S0021-9258(17)37836-5)
See Table II of Richardson J. S., Adv. Protein Chem. 34, 167 (1981) (10.1016/S0065-3233(08)60520-3). Note also that eukaryotic genes, in contrast to prokaryotic genes, often code for multidomain proteins [10.1038/41024].
10.1073/pnas.91.8.3127
Supported by the following grants: Department of Energy (DOE) DE-FC03-87ER-60615, NIH PO1 GM 31299, and NSF MCB 94 20769. E. M. was supported by a DOE Hollaender fellowship. We thank M. K. Baron for her work with the Database of Interacting Proteins.
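
Two of the method notes in the reference list above give simple formulas: the significance threshold 1/(n × m) for sequence comparisons, and the error estimate 1 − T with T taken as the smaller of n or m divided by n times m. The sketch below only restates that arithmetic. The function names and the example counts are hypothetical and are not taken from the paper.

```python
def significance_threshold(n, m):
    """Probability threshold when n proteins are compared with m proteins."""
    return 1.0 / (n * m)


def true_positive_fraction(n, m):
    """Estimate T for one linked domain pair.

    n: proteins with the first domain but not the second.
    m: proteins with the second domain but not the first.
    """
    return min(n, m) / (n * m)


def mean_error(domain_pair_counts):
    """Estimate the error 1 - T, with T averaged over all (n, m) domain pairs."""
    fractions = [true_positive_fraction(n, m) for n, m in domain_pair_counts]
    return 1.0 - sum(fractions) / len(fractions)


# Hypothetical counts, for illustration only.
threshold = significance_threshold(4000, 100000)
fraction = true_positive_fraction(2, 3)
error = mean_error([(1, 1), (2, 2)])  # T values 1.0 and 0.5, so the error is 0.25
print(threshold, fraction, error)
```

Because min(n, m)/(n × m) equals 1/max(n, m), the estimate of T falls as either domain family grows: the more homologs carry a domain, the less certain it is which of them actually interact.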

Dates

Type	When
Created	July 27, 2002, 5:42 a.m.
Deposited	Jan. 13, 2024, 4:12 a.m.
Indexed	July 30, 2025, 10:12 a.m.
Issued	July 30, 1999
Published	July 30, 1999
Published Print	July 30, 1999

Funders 0

None

BibTeX

@article{Marcotte_1999,
  title={Detecting Protein Function and Protein-Protein Interactions from Genome Sequences},
  volume={285},
  ISSN={1095-9203},
  url={http://dx.doi.org/10.1126/science.285.5428.751},
  DOI={10.1126/science.285.5428.751},
  number={5428},
  journal={Science},
  publisher={American Association for the Advancement of Science (AAAS)},
  author={Marcotte, Edward M. and Pellegrini, Matteo and Ng, Ho-Leung and Rice, Danny W. and Yeates, Todd O. and Eisenberg, David},
  year={1999},
  month=jul,
  pages={751–753}
}

JSON

{
  "indexed": {
    "date-parts": [
      [
        2025,
        7,
        30
      ]
    ],
    "date-time": "2025-07-30T14:12:03Z",
    "timestamp": 1753884723138
  },
  "reference-count": 30,
  "publisher": "American Association for the Advancement of Science (AAAS)",
  "issue": "5428",
  "content-domain": {
    "domain": [],
    "crossmark-restriction": false
  },
  "published-print": {
    "date-parts": [
      [
        1999,
        7,
        30
      ]
    ]
  },
  "abstract": "<jats:p>\n            A computational method is proposed for inferring protein interactions from genome sequences on the basis of the observation that some pairs of interacting proteins have homologs in another organism fused into a single protein chain. Searching sequences from many genomes revealed 6809 such putative protein-protein interactions in\n            <jats:italic>Escherichia coli</jats:italic>\n            and 45,502 in yeast. Many members of these pairs were confirmed as functionally related; computational filtering further enriches for interactions. Some proteins have links to several other proteins; these coupled links appear to represent functional interactions such as complexes or pathways. Experimentally confirmed interacting pairs are documented in a Database of Interacting Proteins.\n          </jats:p>",
  "DOI": "10.1126/science.285.5428.751",
  "type": "journal-article",
  "created": {
    "date-parts": [
      [
        2002,
        7,
        27
      ]
    ],
    "date-time": "2002-07-27T09:42:20Z",
    "timestamp": 1027762940000
  },
  "page": "751-753",
  "source": "Crossref",
  "is-referenced-by-count": 1233,
  "title": "Detecting Protein Function and Protein-Protein Interactions from Genome Sequences",
  "prefix": "10.1126",
  "volume": "285",
  "author": [
    {
      "given": "Edward M.",
      "family": "Marcotte",
      "sequence": "first",
      "affiliation": [
        {
          "name": "UCLA\u2013Department of Energy Laboratory of Structural Biology and Molecular Medicine, Departments of Chemistry and Biochemistry and Biological Chemistry, Box 951570, University of California at Los Angeles, Los Angeles, CA 90095\u20131570, USA."
        }
      ]
    },
    {
      "given": "Matteo",
      "family": "Pellegrini",
      "sequence": "additional",
      "affiliation": [
        {
          "name": "UCLA\u2013Department of Energy Laboratory of Structural Biology and Molecular Medicine, Departments of Chemistry and Biochemistry and Biological Chemistry, Box 951570, University of California at Los Angeles, Los Angeles, CA 90095\u20131570, USA."
        }
      ]
    },
    {
      "given": "Ho-Leung",
      "family": "Ng",
      "sequence": "additional",
      "affiliation": [
        {
          "name": "UCLA\u2013Department of Energy Laboratory of Structural Biology and Molecular Medicine, Departments of Chemistry and Biochemistry and Biological Chemistry, Box 951570, University of California at Los Angeles, Los Angeles, CA 90095\u20131570, USA."
        }
      ]
    },
    {
      "given": "Danny W.",
      "family": "Rice",
      "sequence": "additional",
      "affiliation": [
        {
          "name": "UCLA\u2013Department of Energy Laboratory of Structural Biology and Molecular Medicine, Departments of Chemistry and Biochemistry and Biological Chemistry, Box 951570, University of California at Los Angeles, Los Angeles, CA 90095\u20131570, USA."
        }
      ]
    },
    {
      "given": "Todd O.",
      "family": "Yeates",
      "sequence": "additional",
      "affiliation": [
        {
          "name": "UCLA\u2013Department of Energy Laboratory of Structural Biology and Molecular Medicine, Departments of Chemistry and Biochemistry and Biological Chemistry, Box 951570, University of California at Los Angeles, Los Angeles, CA 90095\u20131570, USA."
        }
      ]
    },
    {
      "given": "David",
      "family": "Eisenberg",
      "sequence": "additional",
      "affiliation": [
        {
          "name": "UCLA\u2013Department of Energy Laboratory of Structural Biology and Molecular Medicine, Departments of Chemistry and Biochemistry and Biological Chemistry, Box 951570, University of California at Los Angeles, Los Angeles, CA 90095\u20131570, USA."
        }
      ]
    }
  ],
  "member": "221",
  "reference": [
    {
      "key": "e_1_3_1_2_2",
      "unstructured": "B. Alberts et al. Molecular Biology of the Cell (Garland New York ed. 3 1994); H. Lodish et al. Molecular Cell Biology (Scientific American Books New York ed. 3 1995)."
    },
    {
      "key": "e_1_3_1_3_2",
      "doi-asserted-by": "crossref",
      "first-page": "243",
      "DOI": "10.1038/340245a0",
      "volume": "340",
      "author": "Fields S.",
      "year": "1989",
      "unstructured": "Fields S., Song O. K., Nature 340, 243 (1989).",
      "journal-title": "Nature"
    },
    {
      "key": "e_1_3_1_4_2",
      "doi-asserted-by": "crossref",
      "unstructured": "Berger J. M. Gamblin S. J. Harrison S. C. Wang J. C. 379 225 (1996).",
      "DOI": "10.1038/379225a0"
    },
    {
      "key": "e_1_3_1_5_2",
      "doi-asserted-by": "publisher",
      "DOI": "10.1126/science.277.5331.1453"
    },
    {
      "key": "e_1_3_1_6_2",
      "unstructured": "The triplets of proteins are found with the aid of protein domain databases such as the ProDom or Pfam databases (17). Here a list of all ProDom domains in every one of the 64 568 SWISS-PROT proteins was prepared as well as a list of all proteins that contain each of the 53 597 ProDom domains. Then every protein in ProDom was considered for its ability to be a linking (or Rosetta Stone) member in a triplet. All pairs of domains that are both members of a given protein P were defined as being linked by protein P if we could find at least one protein with only one of the two domains. By this method we found 14 899 links between the 7843 ProDom domains. Then in a single genome (such as E. coli ) we found all nonhomologous pairs of proteins containing linked domains. These pairs are linked by the Rosetta Stone proteins. For E. coli this method finds 3531 protein pairs. An alternate method for discovering protein triplets uses amino acid sequence alignment techniques to find two proteins that align to a Rosetta Stone protein such that the alignments do not overlap on the Rosetta Stone protein. For E. coli this method finds 4487 protein pairs 1209 of which were also found by the ProDom search method (even though different sequence databases were searched for each method). All predictions are available on the World Wide Web at www.doe-mbi.ucla.edu."
    },
    {
      "key": "e_1_3_1_7_2",
      "unstructured": "Two amino acid sequences are said to be similar when the sequences align with a statistically significant alignment score. The significance is described by the probability of obtaining a higher alignment score when comparing shuffled sequences with the acceptable probability threshold set by considering the total number of sequence comparisons performed. That is if n proteins in E. coli are compared with m proteins in other genomes n \u00d7 m total comparisons are performed. We set a probability of 1/( n \u00d7 m ) as the threshold as this is the lowest value that could be obtained by comparing n \u00d7 m random sequences. For the ProDom-based identification of homologs definitions of sequence similarity are as in the ProDom database."
    },
    {
      "key": "e_1_3_1_8_2",
      "unstructured": "The SWISS-PROT database is available at www.expasy.ch/sprot/."
    },
    {
      "key": "e_1_3_1_9_2",
      "unstructured": "The Database of Interacting Proteins is available on the Web at ."
    },
    {
      "key": "e_1_3_1_10_2",
      "doi-asserted-by": "crossref",
      "first-page": "4285",
      "DOI": "10.1073/pnas.96.8.4285",
      "volume": "96",
      "author": "Pellegrini M.",
      "year": "1999",
      "unstructured": "Pellegrini M., Marcotte E. M., Thompson M. J., Eisenberg D., Yeates T. O., Proc. Natl. Acad. Sci. U.S.A. 96, 4285 (1999).",
      "journal-title": "Proc. Natl. Acad. Sci. U.S.A."
    },
    {
      "key": "e_1_3_1_11_2",
      "doi-asserted-by": "crossref",
      "first-page": "465",
      "DOI": "10.1016/0022-2836(89)90494-4",
      "volume": "206",
      "author": "Erickson H. P.",
      "year": "1989",
      "unstructured": "Erickson H. P., J. Mol. Biol. 206, 465 (1989);",
      "journal-title": "J. Mol. Biol."
    },
    {
      "key": "e_1_3_1_11_3",
      "doi-asserted-by": "crossref",
      "first-page": "67",
      "DOI": "10.1016/S1359-0278(97)00007-2",
      "volume": "2",
      "author": "Nagi A. D.",
      "year": "1997",
      "unstructured": "Nagi A. D., Regan L., Folding Design 2, 67 (1997).",
      "journal-title": "Folding Design"
    },
    {
      "key": "e_1_3_1_12_2",
      "doi-asserted-by": "crossref",
      "first-page": "179",
      "DOI": "10.1016/0092-8674(78)90312-4",
      "volume": "14",
      "author": "Pederson S.",
      "year": "1978",
      "unstructured": "Pederson S., Bloch P. S., Reen S., Neidhardt F. C., Cell 14, 179 (1978).",
      "journal-title": "Cell"
    },
    {
      "key": "e_1_3_1_13_2",
      "doi-asserted-by": "crossref",
      "first-page": "5929",
      "DOI": "10.1073/pnas.95.11.5929",
      "volume": "95",
      "author": "Robinson C. R.",
      "year": "1998",
      "unstructured": "Robinson C. R., Sauer R. T., Proc. Natl. Acad. Sci. U.S.A. 95, 5929 (1998).",
      "journal-title": "Proc. Natl. Acad. Sci. U.S.A."
    },
    {
      "key": "e_1_3_1_14_2",
      "doi-asserted-by": "crossref",
      "first-page": "169",
      "DOI": "10.1002/pro.5560010117",
      "volume": "1",
      "author": "Horton N.",
      "year": "1992",
      "unstructured": "Horton N., Lewis M., Protein Sci. 1, 169 (1992);",
      "journal-title": "Protein Sci."
    },
    {
      "key": "e_1_3_1_14_3",
      "doi-asserted-by": "crossref",
      "first-page": "497",
      "DOI": "10.1016/0300-9084(96)88166-1",
      "volume": "77",
      "author": "Janin J.",
      "year": "1995",
      "unstructured": "Janin J., Biochimie 77, 497 (1995).",
      "journal-title": "Biochimie"
    },
    {
      "key": "e_1_3_1_15_2",
      "doi-asserted-by": "crossref",
      "first-page": "604",
      "DOI": "10.1006/jmbi.1996.0424",
      "volume": "260",
      "author": "Tsai C. J.",
      "year": "1996",
      "unstructured": "Tsai C. J., Nussinov R., J. Mol. Biol. 260, 604 (1996).",
      "journal-title": "J. Mol. Biol."
    },
    {
      "key": "e_1_3_1_16_2",
      "doi-asserted-by": "publisher",
      "DOI": "10.1038/385595a0"
    },
    {
      "key": "e_1_3_1_16_3",
      "unstructured": "; F. Sicheri I. Moarefi J. Kuriyan ibid. p. 602."
    },
    {
      "key": "e_1_3_1_17_2",
      "unstructured": "The error in predicting protein-protein interactions due to the inability to distinguish homologs was estimated as 1\u2013 T where T is the mean percentage of potential true positives calculated for all domain pairs in E. coli. For each domain pair linked by a Rosetta Stone protein there are n proteins with the first domain but not the second and m proteins with the second domain but not the first. The percentage of true positives T is therefore estimated as the smaller of n or m divided by n times m."
    },
    {
      "key": "e_1_3_1_18_2",
      "doi-asserted-by": "publisher",
      "DOI": "10.1093/nar/26.1.323"
    },
    {
      "key": "e_1_3_1_18_3",
      "doi-asserted-by": "crossref",
      "unstructured": "Bateman A. et al. 27 260 (1999).",
      "DOI": "10.1093/nar/27.1.260"
    },
    {
      "key": "e_1_3_1_19_2",
      "doi-asserted-by": "crossref",
      "unstructured": "A. Sugino N. P. Higgins N. R. Cozzarelli ibid. 8 3865 (1980);",
      "DOI": "10.1093/nar/8.17.3865"
    },
    {
      "key": "e_1_3_1_19_3",
      "doi-asserted-by": "crossref",
      "first-page": "1565",
      "DOI": "10.1016/S0021-9258(19)69841-8",
      "volume": "256",
      "author": "Yeh W. K.",
      "year": "1981",
      "unstructured": "Yeh W. K., Ornston L. N., J. Biol. Chem. 256, 1565 (1981);",
      "journal-title": "J. Biol. Chem."
    },
    {
      "key": "e_1_3_1_19_4",
      "doi-asserted-by": "crossref",
      "unstructured": "McHenry C. S. Crow W. 254 1748 (1979).",
      "DOI": "10.1016/S0021-9258(17)37836-5"
    },
    {
      "key": "e_1_3_1_20_2",
      "unstructured": "See Table II of"
    },
    {
      "key": "e_1_3_1_20_3",
      "doi-asserted-by": "crossref",
      "first-page": "167",
      "DOI": "10.1016/S0065-3233(08)60520-3",
      "volume": "34",
      "author": "Richardson J. S.",
      "year": "1981",
      "unstructured": "Richardson J. S., Adv. Protein Chem. 34, 167 (1981);",
      "journal-title": "Adv. Protein Chem."
    },
    {
      "key": "e_1_3_1_20_4",
      "unstructured": ". Note also that eukaryotic genes in contrast to prokaryotic genes often code for multidomain proteins ["
    },
    {
      "key": "e_1_3_1_20_5",
      "doi-asserted-by": "publisher",
      "DOI": "10.1038/41024"
    },
    {
      "key": "e_1_3_1_21_2",
      "doi-asserted-by": "publisher",
      "DOI": "10.1073/pnas.91.8.3127"
    },
    {
      "key": "e_1_3_1_22_2",
      "unstructured": "Supported by the following grants: Department of Energy (DOE) DE-FC03-87ER-60615 NIH PO1 GM 31299 and NSF MCB 94 20769. E. M. was supported by a DOE Hollaender fellowship. We thank M. K. Baron for her work with the Database of Interacting Proteins."
    }
  ],
  "container-title": "Science",
  "original-title": [],
  "language": "en",
  "link": [
    {
      "URL": "https://www.science.org/doi/pdf/10.1126/science.285.5428.751",
      "content-type": "unspecified",
      "content-version": "vor",
      "intended-application": "similarity-checking"
    }
  ],
  "deposited": {
    "date-parts": [
      [
        2024,
        1,
        13
      ]
    ],
    "date-time": "2024-01-13T09:12:36Z",
    "timestamp": 1705137156000
  },
  "score": 1,
  "resource": {
    "primary": {
      "URL": "https://www.science.org/doi/10.1126/science.285.5428.751"
    }
  },
  "subtitle": [],
  "short-title": [],
  "issued": {
    "date-parts": [
      [
        1999,
        7,
        30
      ]
    ]
  },
  "references-count": 30,
  "journal-issue": {
    "issue": "5428",
    "published-print": {
      "date-parts": [
        [
          1999,
          7,
          30
        ]
      ]
    }
  },
  "alternative-id": [
    "10.1126/science.285.5428.751"
  ],
  "URL": "http://dx.doi.org/10.1126/science.285.5428.751",
  "relation": {},
  "ISSN": [
    "0036-8075",
    "1095-9203"
  ],
  "subject": [],
  "container-title-short": "Science",
  "published": {
    "date-parts": [
      [
        1999,
        7,
        30
      ]
    ]
  }
}