Crossref journal-article
Wiley
Medical Physics (311)
Abstract

Purpose: The small number of samples available for training and testing is often the limiting factor in finding the most effective features and designing an optimal computer‐aided diagnosis (CAD) system. Training on a limited set of samples introduces bias and variance in the performance of a CAD system relative to that trained with an infinite sample size. In this work, the authors conducted a simulation study to evaluate the performances of various combinations of classifiers and feature selection techniques and their dependence on the class distribution, dimensionality, and the training sample size. The understanding of these relationships will facilitate development of effective CAD systems under the constraint of limited available samples.

Methods: Three feature selection techniques, the stepwise feature selection (SFS), sequential floating forward search (SFFS), and principal component analysis (PCA), and two commonly used classifiers, Fisher's linear discriminant analysis (LDA) and support vector machine (SVM), were investigated. Samples were drawn from multidimensional feature spaces of multivariate Gaussian distributions with equal or unequal covariance matrices and unequal means, and with equal covariance matrices and unequal means estimated from a clinical data set. Classifier performance was quantified by the area under the receiver operating characteristic curve. The mean values obtained by resubstitution and hold‐out methods were evaluated for training sample sizes ranging from 15 to 100 per class. The number of simulated features available for selection was chosen to be 50, 100, and 200.

Results: It was found that the relative performance of the different combinations of classifier and feature selection method depends on the feature space distributions, the dimensionality, and the available training sample sizes. The LDA and SVM with radial kernel performed similarly for most of the conditions evaluated in this study, although the SVM classifier showed a slightly higher hold‐out performance than LDA for some conditions and vice versa for other conditions. PCA was comparable to or better than SFS and SFFS for LDA at small sample sizes, but inferior for SVM with polynomial kernel. For the class distributions simulated from clinical data, PCA did not show advantages over the other two feature selection methods. Under this condition, the SVM with radial kernel performed better than the LDA when few training samples were available, while LDA performed better when a large number of training samples were available.

Conclusions: None of the investigated feature selection‐classifier combinations provided consistently superior performance under the studied conditions for different sample sizes and feature space distributions. In general, the SFFS method was comparable to the SFS method, while PCA may have an advantage for Gaussian feature spaces with unequal covariance matrices. The performance of the SVM with radial kernel was better than, or comparable to, that of the SVM with polynomial kernel under most conditions studied.
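The gap between resubstitution and hold-out performance described above can be illustrated with a minimal NumPy sketch: two equal-covariance (identity) Gaussian classes are drawn, a Fisher LDA direction is fit on a small training set, and AUC is computed on both the training data (resubstitution, optimistic) and a large hold-out set. The function names, mean shift, and sample sizes here are illustrative choices, not the paper's actual simulation parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw(n, d, shift):
    """Two multivariate Gaussian classes with equal (identity) covariance."""
    x0 = rng.standard_normal((n, d))          # class 0: mean 0
    x1 = rng.standard_normal((n, d)) + shift  # class 1: mean shifted in every dimension
    return x0, x1

def fisher_lda(x0, x1):
    """Fisher discriminant direction w = S_pooled^{-1} (m1 - m0), lightly ridged."""
    m0, m1 = x0.mean(axis=0), x1.mean(axis=0)
    s = (np.cov(x0, rowvar=False) + np.cov(x1, rowvar=False)) / 2
    return np.linalg.solve(s + 1e-6 * np.eye(s.shape[0]), m1 - m0)

def auc(scores0, scores1):
    """Area under the ROC curve via the Mann-Whitney U statistic."""
    diff = scores1[:, None] - scores0[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

d, shift = 50, 0.25
tr0, tr1 = draw(30, d, shift)     # small training set: 30 samples per class
te0, te1 = draw(1000, d, shift)   # large hold-out set
w = fisher_lda(tr0, tr1)
resub = auc(tr0 @ w, tr1 @ w)     # resubstitution estimate (optimistically biased)
hold = auc(te0 @ w, te1 @ w)      # hold-out estimate (pessimistically biased)
print(f"resubstitution AUC={resub:.3f}  hold-out AUC={hold:.3f}")
```

With 30 samples per class in a 50-dimensional space, the classifier overfits and the resubstitution AUC substantially exceeds the hold-out AUC, which is the finite-sample bias the study quantifies.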

Bibliography

Way, T. W., Sahiner, B., Hadjiiski, L. M., & Chan, H.-P. (2010). Effect of finite sample size on feature selection and classification: A simulation study. Medical Physics, 37(2), 907–920. https://doi.org/10.1118/1.3284974

Authors 4
  1. Ted W. Way (first)
  2. Berkman Sahiner (additional)
  3. Lubomir M. Hadjiiski (additional)
  4. Heang‐Ping Chan (additional)
References 39 · Cited by 60
  1. 10.1118/1.598805
  2. 10.1118/1.599017
  3. 10.1118/1.2868757
  4. 10.1016/j.neunet.2007.12.012
  5. 10.1118/1.1999126
  6. 10.1118/1.2437130
  7. 10.1109/TPAMI.2003.1251149
  8. 10.1109/34.574797
  9. 10.1016/0167-8655(94)90127-9
  10. 10.1016/S0031-3203(99)00041-2
  11. 10.1093/bioinformatics/btl407
  12. 10.1016/j.patcog.2008.08.001
  13. 10.1016/j.csda.2004.03.017
  14. 10.1093/bioinformatics/bti748
  15. Dy, J. G. (2004). Feature selection for unsupervised learning. J. Mach. Learn. Res., 5, 845.
  16. 10.1016/0167-8655(96)00047-5
  17. 10.1016/S0004-3702(97)00043-X
  18. Yu, L. (2004). Efficient feature selection via analysis of relevance and redundancy. J. Mach. Learn. Res., 5, 1205.
  19. K. Fukunaga (1990). Academic, New York.
  20. R. O. Duda & P. E. Hart (1973). Wiley, New York.
  21. D. J. Hand (1981). Wiley, New York.
  22. 10.1364/JOSAA.15.001520
  23. 10.1118/1.2207129
  24. 10.1016/S0146-664X(75)80008-6
  25. 10.1016/0167-8655(91)80014-2
  26. 10.1109/TIT.1963.1057810
  27. 10.1109/T-C.1971.223410
  28. N. R. Draper (1998). Wiley, New York.
  29. M. M. Tatsuoka (1988). Macmillan, New York.
  30. M. J. Norusis (1993). SPSS, Chicago.
  31. Stearns, S. D. (1976). Third International Conference on Pattern Recognition, p. 71.
  32. 10.1109/34.824819
  33. P. A. Lachenbruch (1975). Hafner, New York.
  34. 10.1023/A:1009715923555
  35. 10.1118/1.2795672
  36. 10.1109/TMI.2006.884198
  37. 10.1016/j.acra.2004.04.024
  38. Ruping, S. (2001). Proceedings of the IEEE International Conference on Data Mining, p. 641.
  39. 10.1109/72.788646
Dates
Type When
Created Jan. 28, 2010, 6:11 p.m.
Deposited Oct. 4, 2023, 12:18 a.m.
Indexed July 20, 2025, 12:28 a.m.
Issued Jan. 28, 2010
Published Jan. 28, 2010
Published Online Jan. 28, 2010
Published Print Feb. 1, 2010
Funders 1
  1. U.S. Public Health Service 10.13039/100007197

    Region: Americas

    gov (National government)

    Labels 6
    1. United States Public Health Service
    2. Commissioned Corps of the U.S. Public Health Service
    3. USPHS Commissioned Corps
    4. U.S. Public Health Service Commissioned Corps
    5. USPHS
    6. PHS
    Awards 2
    1. CA 93517
    2. CA 95153

@article{Way_2010, title={Effect of finite sample size on feature selection and classification: A simulation study}, volume={37}, ISSN={2473-4209}, url={http://dx.doi.org/10.1118/1.3284974}, DOI={10.1118/1.3284974}, number={2}, journal={Medical Physics}, publisher={Wiley}, author={Way, Ted W. and Sahiner, Berkman and Hadjiiski, Lubomir M. and Chan, Heang‐Ping}, year={2010}, month=jan, pages={907–920} }