DOI: 10.1093/bioinformatics/18.1.77. Tolerating some redundancy significantly speeds up clustering of large protein databases

Crossref journal-article

Oxford University Press (OUP)

Bioinformatics (286)

Abstract

Motivation: Sequence clustering replaces groups of similar sequences in a database with single representatives. Clustering large protein databases like the NCBI Non-Redundant database (NR) using even the best currently available clustering algorithms is very time-consuming and only practical at relatively high sequence identity thresholds. Our previous program, CD-HI, clustered NR at 90% identity in ∼1 h and at 75% identity in ∼1 day on a 1 GHz Linux PC (Li et al., Bioinformatics, 17, 282, 2001); however, even faster clustering is needed because protein databases are growing rapidly and many applications require lower attainable thresholds.

Results: In our previous algorithm (CD-HI), we employed short-word filters to speed up the clustering. In this paper, we show that tolerating some redundancy makes for more efficient use of these short-word filters and increases the program’s speed 100 times. Our new program implements this technique and clusters NR at 70% identity within 2 h, and at 50% identity in ∼5 days. Although some redundancy is present after clustering, our new program’s results differ from our previous program’s by less than 0.4%.

Availability: The program and its previous version are available at http://bioinformatics.burnham-inst.org/cd-hi

Contact: liwz@burnham-inst.org; adam@burnham-inst.org
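The abstract names short-word filters but does not spell them out. The following is a minimal sketch of the general idea, not the authors' implementation: before running a costly alignment, count the short words (k-mers) two sequences share and skip the pair if the count is too low to reach the identity threshold. The function names, the default word length `k=2`, and the counting bound (each mismatch destroys at most `k` overlapping words) are assumptions made for illustration.

```python
from collections import Counter


def word_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_words(a, b, k):
    """Number of length-k words common to a and b, counted with multiplicity."""
    counts_a, counts_b = word_counts(a, k), word_counts(b, k)
    return sum(min(n, counts_b[w]) for w, n in counts_a.items())


def min_shared_words(length, identity_pct, k):
    """Assumed lower bound on shared words for a sequence of this length.

    A sequence of `length` residues has length - k + 1 words, and each
    allowed mismatch can destroy at most k of them.
    """
    max_mismatches = length * (100 - identity_pct) // 100
    return max(0, length - k + 1 - k * max_mismatches)


def passes_filter(representative, seq, identity_pct, k=2):
    """True if seq shares enough words with representative to merit alignment."""
    needed = min_shared_words(len(seq), identity_pct, k)
    return shared_words(representative, seq, k) >= needed
```

A pair that fails `passes_filter` cannot meet the threshold under the stated bound, so the alignment can be skipped; a pair that passes still needs a real identity check.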

Bibliography

Li, W., Jaroszewski, L., & Godzik, A. (2002). Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics, 18(1), 77–82.

Authors 3

Weizhong Li (first)
Lukasz Jaroszewski (additional)
Adam Godzik (additional)

References 0 Referenced 432

None

Dates

Type	When
Created	July 26, 2002, 6:37 p.m.
Deposited	Jan. 25, 2023, 2:24 a.m.
Indexed	Aug. 23, 2025, 9:55 p.m.
Issued	Jan. 1, 2002
Published	Jan. 1, 2002
Published Online	Jan. 1, 2002
Published Print	Jan. 1, 2002

Funders 0

None

BibTeX

@article{Li_2002,
  title={Tolerating some redundancy significantly speeds up clustering of large protein databases},
  volume={18},
  ISSN={1367-4803},
  url={http://dx.doi.org/10.1093/bioinformatics/18.1.77},
  DOI={10.1093/bioinformatics/18.1.77},
  number={1},
  journal={Bioinformatics},
  publisher={Oxford University Press (OUP)},
  author={Li, Weizhong and Jaroszewski, Lukasz and Godzik, Adam},
  year={2002},
  month=jan,
  pages={77--82}
}

JSON

{
  "indexed": {
    "date-parts": [
      [
        2025,
        8,
        24
      ]
    ],
    "date-time": "2025-08-24T01:55:31Z",
    "timestamp": 1756000531679
  },
  "reference-count": 0,
  "publisher": "Oxford University Press (OUP)",
  "issue": "1",
  "content-domain": {
    "domain": [],
    "crossmark-restriction": false
  },
  "published-print": {
    "date-parts": [
      [
        2002,
        1,
        1
      ]
    ]
  },
  "abstract": "<jats:title>Abstract</jats:title>\n               <jats:p>Motivation: Sequence clustering replaces groups of similar sequences in a database with single representatives. Clustering large protein databases like the NCBI Non-Redundant database (NR) using even the best currently available clustering algorithms is very time-consuming and only practical at relatively high sequence identity thresholds. Our previous program, CD-HI, clustered NR at 90% identity in \u223c1 h and at 75% identity in \u223c1 day on a 1 GHz Linux PC (Li et al. , Bioinformatics, 17, 282, 2001); however even faster clustering speed is needed because the size of protein databases are rapidly growing and many applications desire a lower attainable thresholds.</jats:p>\n               <jats:p>Results: For our previous algorithm (CD-HI), we have employed short-word filters to speed up the clustering. In this paper, we show that tolerating some redundancy makes for more efficient use of these short-word filters and increases the program\u2019s speed 100 times. Our new program implements this technique and clusters NR at 70% identity within 2 h, and at 50% identity in \u223c5 days. Although some redundancy is present after clustering, our new program\u2019s results only differ from our previous program\u2019s by less than 0.4%.</jats:p>\n               <jats:p>Availability: The program and its previous version are available at http://bioinformatics.burnham-inst.org/cd-hi</jats:p>\n               <jats:p>Contact: liwz@burnham-inst.org; adam@burnham-inst.org</jats:p>\n               <jats:p>* To whom correspondence should be addressed.</jats:p>",
  "DOI": "10.1093/bioinformatics/18.1.77",
  "type": "journal-article",
  "created": {
    "date-parts": [
      [
        2002,
        7,
        26
      ]
    ],
    "date-time": "2002-07-26T22:37:51Z",
    "timestamp": 1027723071000
  },
  "page": "77-82",
  "source": "Crossref",
  "is-referenced-by-count": 432,
  "title": "Tolerating some redundancy significantly speeds up clustering\n  of large protein databases",
  "prefix": "10.1093",
  "volume": "18",
  "author": [
    {
      "given": "Weizhong",
      "family": "Li",
      "sequence": "first",
      "affiliation": [
        {
          "name": "The Burnham Institute, 10901 N. Torrey Pines Road, La Jolla, CA 92037, USA"
        }
      ]
    },
    {
      "given": "Lukasz",
      "family": "Jaroszewski",
      "sequence": "additional",
      "affiliation": [
        {
          "name": "The Burnham Institute, 10901 N. Torrey Pines Road, La Jolla, CA 92037, USA"
        }
      ]
    },
    {
      "given": "Adam",
      "family": "Godzik",
      "sequence": "additional",
      "affiliation": [
        {
          "name": "The Burnham Institute, 10901 N. Torrey Pines Road, La Jolla, CA 92037, USA"
        }
      ]
    }
  ],
  "member": "286",
  "published-online": {
    "date-parts": [
      [
        2002,
        1,
        1
      ]
    ]
  },
  "container-title": "Bioinformatics",
  "original-title": [],
  "language": "en",
  "link": [
    {
      "URL": "https://academic.oup.com/bioinformatics/article-pdf/18/1/77/48850435/bioinformatics_18_1_77.pdf",
      "content-type": "application/pdf",
      "content-version": "vor",
      "intended-application": "syndication"
    },
    {
      "URL": "https://academic.oup.com/bioinformatics/article-pdf/18/1/77/48850435/bioinformatics_18_1_77.pdf",
      "content-type": "unspecified",
      "content-version": "vor",
      "intended-application": "similarity-checking"
    }
  ],
  "deposited": {
    "date-parts": [
      [
        2023,
        1,
        25
      ]
    ],
    "date-time": "2023-01-25T07:24:44Z",
    "timestamp": 1674631484000
  },
  "score": 1,
  "resource": {
    "primary": {
      "URL": "https://academic.oup.com/bioinformatics/article/18/1/77/243728"
    }
  },
  "subtitle": [],
  "short-title": [],
  "issued": {
    "date-parts": [
      [
        2002,
        1,
        1
      ]
    ]
  },
  "references-count": 0,
  "journal-issue": {
    "issue": "1",
    "published-print": {
      "date-parts": [
        [
          2002,
          1,
          1
        ]
      ]
    }
  },
  "URL": "http://dx.doi.org/10.1093/bioinformatics/18.1.77",
  "relation": {},
  "ISSN": [
    "1367-4811",
    "1367-4803"
  ],
  "subject": [],
  "published-other": {
    "date-parts": [
      [
        2002,
        1
      ]
    ]
  },
  "published": {
    "date-parts": [
      [
        2002,
        1,
        1
      ]
    ]
  }
}
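The record above stores each date three ways: as `date-parts`, as an ISO `date-time` string, and as a `timestamp` that appears to be in milliseconds since the Unix epoch. As a minimal sketch of how a consumer might read these fields, the snippet below parses a trimmed copy of the `created` entry and checks that the three forms agree. The helper name `timestamp_to_utc` is hypothetical, and the millisecond interpretation is an assumption inferred from the values shown.

```python
import json
from datetime import datetime, timezone

# Trimmed copy of the "created" field from the record above.
record_text = """
{
  "created": {
    "date-parts": [[2002, 7, 26]],
    "date-time": "2002-07-26T22:37:51Z",
    "timestamp": 1027723071000
  }
}
"""
record = json.loads(record_text)


def timestamp_to_utc(milliseconds):
    """Convert a millisecond epoch timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(milliseconds / 1000, tz=timezone.utc)


created = timestamp_to_utc(record["created"]["timestamp"])
created_parts = [created.year, created.month, created.day]
```

Here `created` is in UTC, while the times in the Dates table seem to be rendered in a local time zone, which would explain the differing clock times for the same event.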