DOI: 10.1118/1.3284974. Effect of finite sample size on feature selection and classification: A simulation study

Crossref journal-article

Wiley

Medical Physics (311)

Abstract

Purpose: The small number of samples available for training and testing is often the limiting factor in finding the most effective features and designing an optimal computer‐aided diagnosis (CAD) system. Training on a limited set of samples introduces bias and variance in the performance of a CAD system relative to that trained with an infinite sample size. In this work, the authors conducted a simulation study to evaluate the performances of various combinations of classifiers and feature selection techniques and their dependence on the class distribution, dimensionality, and the training sample size. The understanding of these relationships will facilitate development of effective CAD systems under the constraint of limited available samples.

Methods: Three feature selection techniques, the stepwise feature selection (SFS), sequential floating forward search (SFFS), and principal component analysis (PCA), and two commonly used classifiers, Fisher's linear discriminant analysis (LDA) and support vector machine (SVM), were investigated. Samples were drawn from multidimensional feature spaces of multivariate Gaussian distributions with equal or unequal covariance matrices and unequal means, and with equal covariance matrices and unequal means estimated from a clinical data set. Classifier performance was quantified by the area under the receiver operating characteristic curve. The mean values of this area obtained by resubstitution and hold‐out methods were evaluated for training sample sizes ranging from 15 to 100 per class. The number of simulated features available for selection was chosen to be 50, 100, and 200.

Results: It was found that the relative performance of the different combinations of classifier and feature selection method depends on the feature space distributions, the dimensionality, and the available training sample sizes. The LDA and SVM with radial kernel performed similarly for most of the conditions evaluated in this study, although the SVM classifier showed a slightly higher hold‐out performance than LDA for some conditions and vice versa for other conditions. PCA was comparable to or better than SFS and SFFS for LDA at small sample sizes, but inferior for SVM with polynomial kernel. For the class distributions simulated from clinical data, PCA did not show advantages over the other two feature selection methods. Under this condition, the SVM with radial kernel performed better than the LDA when few training samples were available, while LDA performed better when a large number of training samples were available.

Conclusions: None of the investigated feature selection‐classifier combinations provided consistently superior performance under the studied conditions for different sample sizes and feature space distributions. In general, the SFFS method was comparable to the SFS method while PCA may have an advantage for Gaussian feature spaces with unequal covariance matrices. The performance of the SVM with radial kernel was better than, or comparable to, that of the SVM with polynomial kernel under most conditions studied.
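
The kind of experiment the abstract describes can be illustrated with a minimal Python sketch. It is not the authors' code or protocol. It draws two Gaussian classes with identity covariance, ranks features by absolute mean difference (a simple stand-in for stepwise selection), scores samples with a mean-difference linear discriminant (a simplification of Fisher's LDA that assumes identity covariance), and compares the resubstitution and hold-out areas under the ROC curve. All function names and parameter values (`n_informative`, `shift`, `k`, `n_test`, `trials`) are assumptions chosen for illustration.

```python
import random


def draw(n, means, rng):
    """Draw n samples from a Gaussian with the given means and identity covariance."""
    return [[rng.gauss(m, 1.0) for m in means] for _ in range(n)]


def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            wins += 1.0 if p > q else 0.5 if p == q else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


def fit(neg, pos, k):
    """Keep the k features with the largest absolute mean difference and use
    that difference as the linear weight (identity covariance assumed)."""
    d = len(neg[0])
    diff = [sum(x[j] for x in pos) / len(pos) - sum(x[j] for x in neg) / len(neg)
            for j in range(d)]
    keep = sorted(range(d), key=lambda j: abs(diff[j]), reverse=True)[:k]
    return {j: diff[j] for j in keep}


def score(weights, x):
    """Linear discriminant score using only the selected features."""
    return sum(w * x[j] for j, w in weights.items())


def mean_auc(n_train, n_features=50, n_informative=5, shift=0.5,
             k=5, n_test=200, trials=20, seed=0):
    """Return (mean resubstitution AUC, mean hold-out AUC) over repeated draws."""
    rng = random.Random(seed)
    mu_neg = [0.0] * n_features
    mu_pos = [shift] * n_informative + [0.0] * (n_features - n_informative)
    resub = hold = 0.0
    for _ in range(trials):
        tr_neg, tr_pos = draw(n_train, mu_neg, rng), draw(n_train, mu_pos, rng)
        w = fit(tr_neg, tr_pos, k)  # selection and training use the same samples
        resub += auc([score(w, x) for x in tr_pos], [score(w, x) for x in tr_neg])
        te_neg, te_pos = draw(n_test, mu_neg, rng), draw(n_test, mu_pos, rng)
        hold += auc([score(w, x) for x in te_pos], [score(w, x) for x in te_neg])
    return resub / trials, hold / trials


if __name__ == "__main__":
    for n in (15, 50, 100):
        r, h = mean_auc(n)
        print(f"n_train={n:3d} per class  resubstitution={r:.3f}  hold-out={h:.3f}")
```

Under these assumptions the resubstitution estimate would typically be optimistic relative to the hold-out estimate, with the gap narrowing as the training sample size grows. The sketch says nothing about the SVM, SFFS, or PCA comparisons reported in the study.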

Bibliography

Way, T. W., Sahiner, B., Hadjiiski, L. M., & Chan, H. (2010). Effect of finite sample size on feature selection and classification: A simulation study. Medical Physics, 37(2), 907–920. Portico.

Authors 4

Ted W. Way (first)
Berkman Sahiner (additional)
Lubomir M. Hadjiiski (additional)
Heang‐Ping Chan (additional)

References 39 Referenced 60

10.1118/1.598805
10.1118/1.599017
10.1118/1.2868757
10.1016/j.neunet.2007.12.012
10.1118/1.1999126
10.1118/1.2437130
10.1109/TPAMI.2003.1251149
10.1109/34.574797
10.1016/0167-8655(94)90127-9
10.1016/S0031-3203(99)00041-2
10.1093/bioinformatics/btl407
10.1016/j.patcog.2008.08.001
10.1016/j.csda.2004.03.017
10.1093/bioinformatics/bti748
Dy J. G. (2004). Feature selection for unsupervised learning. J. Mach. Learn. Res., 5, 845.
10.1016/0167-8655(96)00047-5
10.1016/S0004-3702(97)00043-X
Yu L. (2004). Efficient feature selection via analysis of relevance and redundancy. J. Mach. Learn. Res., 5, 1205.
K. Fukunaga 1990 Academic New York
R. O. Duda P. E. Hart 1973 Wiley New York
D. J. Hand 1981 Wiley New York
10.1364/JOSAA.15.001520
10.1118/1.2207129
10.1016/S0146-664X(75)80008-6
10.1016/0167-8655(91)80014-2
10.1109/TIT.1963.1057810
10.1109/T-C.1971.223410
N. R. Draper 1998 Wiley New York
M. M. Tatsuoka 1988 Macmillan New York
M. J. Norusis 1993 SPSS Chicago
Stearns S. D. (1976). Third International Conference on Pattern Recognition, 71.
10.1109/34.824819
P. A. Lachenbruch 1975 Hafner New York
10.1023/A:1009715923555
10.1118/1.2795672
10.1109/TMI.2006.884198
10.1016/j.acra.2004.04.024
Ruping S. (2001). Proceedings of the IEEE International Conference on Data Mining, 641.
10.1109/72.788646

Dates

Type	When
Created	15 years, 6 months ago (Jan. 28, 2010, 6:11 p.m.)
Deposited	1 year, 10 months ago (Oct. 4, 2023, 12:18 a.m.)
Indexed	1 month ago (July 20, 2025, 12:28 a.m.)
Issued	15 years, 6 months ago (Jan. 28, 2010)
Published	15 years, 6 months ago (Jan. 28, 2010)
Published Online	15 years, 6 months ago (Jan. 28, 2010)
Published Print	15 years, 6 months ago (Feb. 1, 2010)

Funders 1

U.S. Public Health Service 10.13039/100007197
Region: Americas
gov (National government)
Labels 6
1. United States Public Health Service
2. Commissioned Corps of the U.S. Public Health Service
3. USPHS Commissioned Corps
4. U.S. Public Health Service Commissioned Corps
5. USPHS
6. PHS
Awards 2
1. CA 93517
2. CA95153

BibTeX

@article{Way_2010,
  title={Effect of finite sample size on feature selection and classification: A simulation study},
  volume={37},
  ISSN={2473-4209},
  url={http://dx.doi.org/10.1118/1.3284974},
  DOI={10.1118/1.3284974},
  number={2},
  journal={Medical Physics},
  publisher={Wiley},
  author={Way, Ted W. and Sahiner, Berkman and Hadjiiski, Lubomir M. and Chan, Heang‐Ping},
  year={2010},
  month=jan,
  pages={907--920}
}

JSON

{
  "indexed": {
    "date-parts": [
      [
        2025,
        7,
        20
      ]
    ],
    "date-time": "2025-07-20T04:28:23Z",
    "timestamp": 1752985703552
  },
  "reference-count": 39,
  "publisher": "Wiley",
  "issue": "2",
  "license": [
    {
      "start": {
        "date-parts": [
          [
            2010,
            1,
            28
          ]
        ],
        "date-time": "2010-01-28T00:00:00Z",
        "timestamp": 1264636800000
      },
      "content-version": "vor",
      "delay-in-days": 0,
      "URL": "http://onlinelibrary.wiley.com/termsAndConditions#vor"
    }
  ],
  "funder": [
    {
      "DOI": "10.13039/100007197",
      "name": "U.S. Public Health Service",
      "doi-asserted-by": "publisher",
      "award": [
        "CA95153",
        "CA 93517"
      ],
      "id": [
        {
          "id": "10.13039/100007197",
          "id-type": "DOI",
          "asserted-by": "publisher"
        }
      ]
    }
  ],
  "content-domain": {
    "domain": [],
    "crossmark-restriction": false
  },
  "published-print": {
    "date-parts": [
      [
        2010,
        2
      ]
    ]
  },
  "abstract": "<jats:sec><jats:title>Purpose:</jats:title><jats:p>The small number of samples available for training and testing is often the limiting factor in finding the most effective features and designing an optimal computer\u2010aided diagnosis (CAD) system. Training on a limited set of samples introduces bias and variance in the performance of a CAD system relative to that trained with an infinite sample size. In this work, the authors conducted a simulation study to evaluate the performances of various combinations of classifiers and feature selection techniques and their dependence on the class distribution, dimensionality, and the training sample size. The understanding of these relationships will facilitate development of effective CAD systems under the constraint of limited available samples.</jats:p></jats:sec><jats:sec><jats:title>Methods:</jats:title><jats:p>Three feature selection techniques, the stepwise feature selection (SFS), sequential floating forward search (SFFS), and principal component analysis (PCA), and two commonly used classifiers, Fisher's linear discriminant analysis (LDA) and support vector machine (SVM), were investigated. Samples were drawn from multidimensional feature spaces of multivariate Gaussian distributions with equal or unequal covariance matrices and unequal means, and with equal covariance matrices and unequal means estimated from a clinical data set. Classifier performance was quantified by the area under the receiver operating characteristic curve<jats:inline-graphic xmlns:xlink=\"http://www.w3.org/1999/xlink\" xlink:href=\"graphic/mp4974-math-0001.png\" xlink:title=\"urn:x-wiley:00942405:media:mp4974:mp4974-math-0001\" />. The mean <jats:inline-graphic xmlns:xlink=\"http://www.w3.org/1999/xlink\" xlink:href=\"graphic/mp4974-math-0002.png\" xlink:title=\"urn:x-wiley:00942405:media:mp4974:mp4974-math-0002\" /> values obtained by resubstitution and hold\u2010out methods were evaluated for training sample sizes ranging from 15 to 100 per class. The number of simulated features available for selection was chosen to be 50, 100, and 200.</jats:p></jats:sec><jats:sec><jats:title>Results:</jats:title><jats:p>It was found that the relative performance of the different combinations of classifier and feature selection method depends on the feature space distributions, the dimensionality, and the available training sample sizes. The LDA and SVM with radial kernel performed similarly for most of the conditions evaluated in this study, although the SVM classifier showed a slightly higher hold\u2010out performance than LDA for some conditions and vice versa for other conditions. PCA was comparable to or better than SFS and SFFS for LDA at small samples sizes, but inferior for SVM with polynomial kernel. For the class distributions simulated from clinical data, PCA did not show advantages over the other two feature selection methods. Under this condition, the SVM with radial kernel performed better than the LDA when few training samples were available, while LDA performed better when a large number of training samples were available.</jats:p></jats:sec><jats:sec><jats:title>Conclusions:</jats:title><jats:p>None of the investigated feature selection\u2010classifier combinations provided consistently superior performance under the studied conditions for different sample sizes and feature space distributions. In general, the SFFS method was comparable to the SFS method while PCA may have an advantage for Gaussian feature spaces with unequal covariance matrices. The performance of the SVM with radial kernel was better than, or comparable to, that of the SVM with polynomial kernel under most conditions studied.</jats:p></jats:sec>",
  "DOI": "10.1118/1.3284974",
  "type": "journal-article",
  "created": {
    "date-parts": [
      [
        2010,
        1,
        28
      ]
    ],
    "date-time": "2010-01-28T23:11:59Z",
    "timestamp": 1264720319000
  },
  "page": "907-920",
  "source": "Crossref",
  "is-referenced-by-count": 60,
  "title": "Effect of finite sample size on feature selection and classification: A simulation study",
  "prefix": "10.1002",
  "volume": "37",
  "author": [
    {
      "given": "Ted W.",
      "family": "Way",
      "sequence": "first",
      "affiliation": []
    },
    {
      "given": "Berkman",
      "family": "Sahiner",
      "sequence": "additional",
      "affiliation": []
    },
    {
      "given": "Lubomir M.",
      "family": "Hadjiiski",
      "sequence": "additional",
      "affiliation": []
    },
    {
      "given": "Heang\u2010Ping",
      "family": "Chan",
      "sequence": "additional",
      "affiliation": []
    }
  ],
  "member": "311",
  "published-online": {
    "date-parts": [
      [
        2010,
        1,
        28
      ]
    ]
  },
  "reference": [
    {
      "key": "e_1_2_7_2_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1118/1.598805"
    },
    {
      "key": "e_1_2_7_3_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1118/1.599017"
    },
    {
      "key": "e_1_2_7_4_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1118/1.2868757"
    },
    {
      "key": "e_1_2_7_5_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1016/j.neunet.2007.12.012"
    },
    {
      "key": "e_1_2_7_6_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1118/1.1999126"
    },
    {
      "key": "e_1_2_7_7_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1118/1.2437130"
    },
    {
      "key": "e_1_2_7_8_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1109/TPAMI.2003.1251149"
    },
    {
      "key": "e_1_2_7_9_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1109/34.574797"
    },
    {
      "key": "e_1_2_7_10_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1016/0167-8655(94)90127-9"
    },
    {
      "key": "e_1_2_7_11_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1016/S0031-3203(99)00041-2"
    },
    {
      "key": "e_1_2_7_12_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1093/bioinformatics/btl407"
    },
    {
      "key": "e_1_2_7_13_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1016/j.patcog.2008.08.001"
    },
    {
      "key": "e_1_2_7_14_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1016/j.csda.2004.03.017"
    },
    {
      "key": "e_1_2_7_15_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1093/bioinformatics/bti748"
    },
    {
      "key": "e_1_2_7_16_1",
      "first-page": "845",
      "article-title": "Feature selection for unsupervised learning",
      "volume": "5",
      "author": "Dy J. G.",
      "year": "2004",
      "journal-title": "J. Mach. Learn. Res."
    },
    {
      "key": "e_1_2_7_17_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1016/0167-8655(96)00047-5"
    },
    {
      "key": "e_1_2_7_18_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1016/S0004-3702(97)00043-X"
    },
    {
      "key": "e_1_2_7_19_1",
      "first-page": "1205",
      "article-title": "Efficient feature selection via analysis of relevance and redundancy",
      "volume": "5",
      "author": "Yu L.",
      "year": "2004",
      "journal-title": "J. Mach. Learn. Res."
    },
    {
      "key": "e_1_2_7_20_1",
      "unstructured": "K. Fukunaga 1990 Academic New York"
    },
    {
      "key": "e_1_2_7_21_1",
      "unstructured": "R. O. Duda P. E. Hart 1973 Wiley New York"
    },
    {
      "key": "e_1_2_7_22_1",
      "unstructured": "D. J. Hand 1981 Wiley New York"
    },
    {
      "key": "e_1_2_7_23_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1364/JOSAA.15.001520"
    },
    {
      "key": "e_1_2_7_24_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1118/1.2207129"
    },
    {
      "key": "e_1_2_7_25_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1016/S0146-664X(75)80008-6"
    },
    {
      "key": "e_1_2_7_26_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1016/0167-8655(91)80014-2"
    },
    {
      "key": "e_1_2_7_27_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1109/TIT.1963.1057810"
    },
    {
      "key": "e_1_2_7_28_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1109/T-C.1971.223410"
    },
    {
      "key": "e_1_2_7_29_1",
      "unstructured": "N. R. Draper 1998 Wiley New York"
    },
    {
      "key": "e_1_2_7_30_1",
      "unstructured": "M. M. Tatsuoka 1988 Macmillan New York"
    },
    {
      "key": "e_1_2_7_31_1",
      "unstructured": "M. J. Norusis 1993 SPSS Chicago"
    },
    {
      "key": "e_1_2_7_32_1",
      "series-title": "Third International Conference on Pattern Recognition",
      "first-page": "71",
      "author": "Stearns S. D.",
      "year": "1976"
    },
    {
      "key": "e_1_2_7_33_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1109/34.824819"
    },
    {
      "key": "e_1_2_7_34_1",
      "unstructured": "P. A. Lachenbruch 1975 Hafner New York"
    },
    {
      "key": "e_1_2_7_35_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1023/A:1009715923555"
    },
    {
      "key": "e_1_2_7_36_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1118/1.2795672"
    },
    {
      "key": "e_1_2_7_37_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1109/TMI.2006.884198"
    },
    {
      "key": "e_1_2_7_38_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1016/j.acra.2004.04.024"
    },
    {
      "key": "e_1_2_7_39_1",
      "series-title": "Proceedings of the IEEE International Conference on Data Mining",
      "first-page": "641",
      "author": "Ruping S.",
      "year": "2001"
    },
    {
      "key": "e_1_2_7_40_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1109/72.788646"
    }
  ],
  "container-title": "Medical Physics",
  "original-title": [],
  "language": "en",
  "link": [
    {
      "URL": "https://api.wiley.com/onlinelibrary/tdm/v1/articles/10.1118%2F1.3284974",
      "content-type": "application/pdf",
      "content-version": "vor",
      "intended-application": "text-mining"
    },
    {
      "URL": "https://api.wiley.com/onlinelibrary/tdm/v1/articles/10.1118%2F1.3284974",
      "content-type": "unspecified",
      "content-version": "vor",
      "intended-application": "text-mining"
    },
    {
      "URL": "https://aapm.onlinelibrary.wiley.com/doi/pdf/10.1118/1.3284974",
      "content-type": "unspecified",
      "content-version": "vor",
      "intended-application": "similarity-checking"
    }
  ],
  "deposited": {
    "date-parts": [
      [
        2023,
        10,
        4
      ]
    ],
    "date-time": "2023-10-04T04:18:14Z",
    "timestamp": 1696393094000
  },
  "score": 1,
  "resource": {
    "primary": {
      "URL": "https://aapm.onlinelibrary.wiley.com/doi/10.1118/1.3284974"
    }
  },
  "subtitle": [],
  "short-title": [],
  "issued": {
    "date-parts": [
      [
        2010,
        1,
        28
      ]
    ]
  },
  "references-count": 39,
  "journal-issue": {
    "issue": "2",
    "published-print": {
      "date-parts": [
        [
          2010,
          2
        ]
      ]
    }
  },
  "alternative-id": [
    "10.1118/1.3284974"
  ],
  "URL": "http://dx.doi.org/10.1118/1.3284974",
  "relation": {},
  "ISSN": [
    "0094-2405",
    "2473-4209"
  ],
  "subject": [],
  "container-title-short": "Medical Physics",
  "published": {
    "date-parts": [
      [
        2010,
        1,
        28
      ]
    ]
  }
}