Crossref journal-article
Oxford University Press (OUP)
Bioinformatics (286)
Abstract

Motivation: Attribute selection is a critical step in development of document classification systems. As a standard practice, words are stemmed and the most informative ones are used as attributes in classification. Owing to high complexity of biomedical terminology, general-purpose stemming algorithms are often conservative and could also remove informative stems. This can lead to accuracy reduction, especially when the number of labeled documents is small. To address this issue, we propose an algorithm that omits stemming and, instead, uses the most discriminative substrings as attributes.

Results: The approach was tested on five annotated sets of abstracts from iProLINK that report on the experimental evidence about five types of protein post-translational modifications. The experiments showed that Naive Bayes and support vector machine classifiers perform consistently better [with area under the ROC curve (AUC) accuracy in range 0.92–0.97] when using the proposed attribute selection than when using attributes obtained by the Porter stemmer algorithm (AUC in 0.86–0.93 range). The proposed approach is particularly useful when labeled datasets are small.

Contact: vucetic@ist.temple.edu

Supplementary Information: The supplementary data are available from

Bibliography

Han, B., Obradovic, Z., Hu, Z.-Z., Wu, C. H., & Vucetic, S. (2006). Substring selection for biomedical document classification. Bioinformatics, 22(17), 2136–2142.

Authors 5
  1. Bo Han (first)
  2. Zoran Obradovic (additional)
  3. Zhang-Zhi Hu (additional)
  4. Cathy H. Wu (additional)
  5. Slobodan Vucetic (additional)
References 22 Referenced 15
  1. 10.1093/bioinformatics/14.7.600 / Bioinformatics / Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families by Andrade (1998)
  2. 10.1197/jamia.M1641 / J. Am. Med. Inform. Assoc. / Text categorization models for retrieval of high quality articles in internal medicine by Aphinyanaphongs (2005)
  3. 10.1137/1037127 / SIAM Rev. / Using linear algebra for intelligent information retrieval by Berry (1995)
  4. 10.1090/S0025-5718-1969-0247736-4 / Math. Comp. / Rational Chebyshev approximations for the error function by Cody (1969)
  5. 10.1093/bioinformatics/btg1011 / Bioinformatics / Combining NLP and probabilistic categorization for document and term selection for Swiss-Prot medical annotation by Dobrokhotov (2003)
  6. 10.1145/772862.772876 / SIGKDD Explor. Newslett. / Automatic scientific text classification using local patterns: KDD Cup 2002 (task 1) by Ghanem (2003)
  7. 10.1016/j.compbiolchem.2004.09.010 / Comput. Biol. Chem. / iProLINK: an integrated protein resource for literature mining by Hu (2004)
  8. 10.1093/bioinformatics/bti390 / Bioinformatics / Literature mining and database annotation of protein phosphorylation using a rule-based system by Hu (2005)
  9. Text categorization with support vector machines: learning with many relevant features by Joachims (1998), p. 137
  10. Advances in Kernel Methods: Support Vector Learning / Making large-scale SVM learning practical by Joachims (1999), p. 41
  11. 10.1093/bioinformatics/17.4.359 / Bioinformatics / Mining literature for protein-protein interactions by Marcotte (2001)
  12. A comparison of event models for Naïve Bayes text classification by McCallum (1998), p. 41
  13. Selecting text features for gene name classification: from documents to terms by Nenadic (2003), p. 121
  14. 10.1108/eb046814 / Program / An algorithm for suffix stripping by Porter (1980)
  15. 10.1145/772862.772874 / SIGKDD Explor. Newslett. / Rule-based extraction of experimental evidence in the biomedical domain: the KDD Cup 2002 (task 1) by Regev (2003)
  16. 10.1186/1471-2105-6-S1-S22 / BMC Bioinformatics / Mining protein function from text using term-based support vector machines by Rice (2005)
  17. SIGKDD Explor. Newslett. / A machine learning approach for the curation of biomedical literature–KDD Cup 2002 (task 1) by Shi (2003), vol. 4, p. 93
  18. 10.1007/978-1-4757-2440-0 / The Nature of Statistical Learning Theory by Vapnik (1995)
  19. Boosting Naïve Bayesian learning on a large subset of MEDLINE by Wilbur (2000), p. 918
  20. 10.1093/nar/gkg040 / Nucleic Acids Res. / The Protein information resource by Wu (2003)
  21. 10.1093/nar/gkj161 / Nucleic Acids Res. / The Universal Protein Resource (UniProt): an expanding universe of protein information by Wu (2006)
  22. A comparative study on feature selection in text categorization by Yang (1997), p. 412
Dates
Type When
Created 19 years, 1 month ago (July 12, 2006, 8:39 p.m.)
Deposited 2 years, 6 months ago (Jan. 24, 2023, 4:53 a.m.)
Indexed 2 years ago (Aug. 20, 2023, 5:37 p.m.)
Issued 19 years, 1 month ago (June 23, 2006)
Published 19 years, 1 month ago (June 23, 2006)
Published Online 19 years, 1 month ago (June 23, 2006)
Published Print 18 years, 11 months ago (Sept. 1, 2006)
Funders 0

None

@article{Han_2006,
  title={Substring selection for biomedical document classification},
  volume={22},
  ISSN={1367-4803},
  url={http://dx.doi.org/10.1093/bioinformatics/btl350},
  DOI={10.1093/bioinformatics/btl350},
  number={17},
  journal={Bioinformatics},
  publisher={Oxford University Press (OUP)},
  author={Han, Bo and Obradovic, Zoran and Hu, Zhang-Zhi and Wu, Cathy H. and Vucetic, Slobodan},
  year={2006},
  month=jun,
  pages={2136--2142}
}