Crossref journal-article
Oxford University Press (OUP)
Bioinformatics (286)
Abstract

Motivation: Sequence clustering replaces groups of similar sequences in a database with single representatives. Clustering large protein databases like the NCBI Non-Redundant database (NR) using even the best currently available clustering algorithms is very time-consuming and only practical at relatively high sequence identity thresholds. Our previous program, CD-HI, clustered NR at 90% identity in ∼1 h and at 75% identity in ∼1 day on a 1 GHz Linux PC (Li et al., Bioinformatics, 17, 282, 2001); however, even faster clustering is needed because protein databases are growing rapidly and many applications require lower identity thresholds.

Results: In our previous algorithm (CD-HI), we employed short-word filters to speed up the clustering. In this paper, we show that tolerating some redundancy makes for more efficient use of these short-word filters and increases the program’s speed 100 times. Our new program implements this technique and clusters NR at 70% identity within 2 h, and at 50% identity in ∼5 days. Although some redundancy remains after clustering, the new program’s results differ from the previous program’s by less than 0.4%.

Availability: The program and its previous version are available at http://bioinformatics.burnham-inst.org/cd-hi

Contact: liwz@burnham-inst.org; adam@burnham-inst.org
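The abstract names two ingredients, greedy replacement of similar sequences by a representative and a short-word filter that avoids expensive comparisons. The following is a minimal sketch of that general idea only, not the authors' implementation. All function names, the word length `k=2`, and the ungapped position-by-position comparison standing in for a real alignment are assumptions made for illustration. It does not attempt to reproduce the redundancy-tolerating refinement, whose details the abstract does not give.

```python
import math
from collections import Counter


def word_counts(seq, k):
    """Count the overlapping words (k-mers) of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def required_matches(length, identity):
    """Smallest number of identical positions that reaches the identity threshold."""
    return math.ceil(identity * length - 1e-9)


def min_shared_words(length, identity, k):
    """Lower bound on shared words for a sequence that meets the threshold.

    Each mismatched position can destroy at most k of the (length - k + 1) words.
    """
    mismatches = length - required_matches(length, identity)
    return max(0, (length - k + 1) - k * mismatches)


def matching_positions(a, b):
    """Stand-in for an alignment (assumption): ungapped, position-by-position matches."""
    return sum(x == y for x, y in zip(a, b))


def greedy_cluster(seqs, identity=0.9, k=2):
    """Cluster longest-first; return (clusters, number of full comparisons made)."""
    reps = []        # (representative sequence, its word-count table)
    clusters = {}    # representative -> list of member sequences
    comparisons = 0
    for s in sorted(seqs, key=len, reverse=True):
        table = word_counts(s, k)
        need_words = min_shared_words(len(s), identity, k)
        need_matches = required_matches(len(s), identity)
        for rep, rep_table in reps:
            # Short-word filter: too few common words means the pair cannot
            # reach the threshold, so the full comparison is skipped.
            if sum((table & rep_table).values()) < need_words:
                continue
            comparisons += 1
            if matching_positions(s, rep) >= need_matches:
                clusters[rep].append(s)
                break
        else:
            # No representative was similar enough: s starts a new cluster.
            reps.append((s, table))
            clusters[s] = [s]
    return clusters, comparisons


if __name__ == "__main__":
    demo = ["MKVLAAGIVG", "MKVLAAGIVA", "WWWWPPPPQQ"]
    print(greedy_cluster(demo, identity=0.9, k=2))
```

In this sketch the filter is lossless for the ungapped stand-in measure: a pair that meets the identity threshold always shares at least `min_shared_words` words, so only hopeless pairs are skipped. The counter returned alongside the clusters shows how many full comparisons the filter let through.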

Bibliography

Li, W., Jaroszewski, L., & Godzik, A. (2002). Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics, 18(1), 77–82.

Authors 3
  1. Weizhong Li (first)
  2. Lukasz Jaroszewski (additional)
  3. Adam Godzik (additional)
References 0 Referenced 432

None

Dates
Type When
Created July 26, 2002, 6:37 p.m.
Deposited Jan. 25, 2023, 2:24 a.m.
Indexed Aug. 23, 2025, 9:55 p.m.
Issued Jan. 1, 2002
Published Jan. 1, 2002
Published Online Jan. 1, 2002
Published Print Jan. 1, 2002
Funders 0

None

@article{Li_2002, title={Tolerating some redundancy significantly speeds up clustering of large protein databases}, volume={18}, ISSN={1367-4803}, url={http://dx.doi.org/10.1093/bioinformatics/18.1.77}, DOI={10.1093/bioinformatics/18.1.77}, number={1}, journal={Bioinformatics}, publisher={Oxford University Press (OUP)}, author={Li, Weizhong and Jaroszewski, Lukasz and Godzik, Adam}, year={2002}, month=jan, pages={77–82} }