DOI: 10.1093/bioinformatics/btl407. What should be expected from feature selection in small-sample settings

Crossref journal-article

Oxford University Press (OUP)

Bioinformatics (286)

Abstract

Motivation: High-throughput technologies for rapid measurement of vast numbers of biological variables offer the potential for highly discriminatory diagnosis and prognosis; however, high dimensionality together with small samples creates the need for feature selection, while at the same time making feature-selection algorithms less reliable. Feature selection must typically be carried out from among thousands of gene-expression features and in the context of a small sample (small number of microarrays). Two basic questions arise: (1) Can one expect feature selection to yield a feature set whose error is close to that of an optimal feature set? (2) If a good feature set is not found, should it be expected that good feature sets do not exist?

Results: The two questions translate quantitatively into questions concerning conditional expectation. (1) Given the error of an optimal feature set, what is the conditionally expected error of the selected feature set? (2) Given the error of the selected feature set, what is the conditionally expected error of the optimal feature set? We address these questions using three classification rules (linear discriminant analysis, linear support vector machine and k-nearest-neighbor classification) and feature selection via sequential floating forward search and the t-test. We consider three feature-label models and patient data from a study concerning survival prognosis for breast cancer. With regard to the two focus questions, there is similarity across all experiments: (1) One cannot expect to find a feature set whose error is close to optimal, and (2) the inability to find a good feature set should not lead to the conclusion that good feature sets do not exist. In practice, the latter conclusion may be more immediately relevant, since when faced with the common occurrence that a feature set discovered from the data does not give satisfactory results, the experimenter can draw no conclusions regarding the existence or nonexistence of suitable feature sets.

Contact: edward@ece.tamu.edu
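
The first focus question can be illustrated with a small simulation. The sketch below is not the authors' experimental setup: it assumes a toy two-class Gaussian model in which only the first few features are informative, ranks features by a Welch t-statistic on a small training sample, and compares the held-out error of a 3-nearest-neighbor classifier on the selected features with its error on the known informative features, which stand in here for an optimal feature set. All function names, parameters and default values (`make_sample`, `t_scores`, `knn_error`, `run_trial`, the sample sizes and the mean shift) are hypothetical choices made for illustration.

```python
import math
import random
import statistics


def make_sample(rng, n, n_features, n_informative, shift):
    """Draw n points from a toy two-class Gaussian model (an assumption of this
    sketch): only the first n_informative features differ in mean between classes."""
    X, y = [], []
    for i in range(n):
        label = i % 2  # alternate labels so the classes are balanced
        row = [rng.gauss(shift * label if j < n_informative else 0.0, 1.0)
               for j in range(n_features)]
        X.append(row)
        y.append(label)
    return X, y


def t_scores(X, y):
    """Absolute Welch t-statistic of each feature, computed on the sample (X, y)."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, lab in zip(X, y) if lab == 0]
        b = [row[j] for row, lab in zip(X, y) if lab == 1]
        se = math.sqrt(statistics.variance(a) / len(a)
                       + statistics.variance(b) / len(b))
        scores.append(abs(statistics.mean(a) - statistics.mean(b)) / se)
    return scores


def knn_error(train_X, train_y, test_X, test_y, features, k=3):
    """Held-out error rate of a k-nearest-neighbor classifier restricted to `features`."""
    errors = 0
    for row, lab in zip(test_X, test_y):
        dists = sorted(
            (sum((row[j] - t[j]) ** 2 for j in features), t_lab)
            for t, t_lab in zip(train_X, train_y)
        )
        votes = sum(t_lab for _, t_lab in dists[:k])
        pred = 1 if 2 * votes > k else 0  # majority vote among the k neighbors
        errors += int(pred != lab)
    return errors / len(test_y)


def run_trial(seed, n_train=30, n_test=400, n_features=200,
              n_informative=5, shift=1.0):
    """Return (error with t-test-selected features,
               error with the known informative features,
               number of informative features that were selected)."""
    rng = random.Random(seed)
    train_X, train_y = make_sample(rng, n_train, n_features, n_informative, shift)
    test_X, test_y = make_sample(rng, n_test, n_features, n_informative, shift)
    scores = t_scores(train_X, train_y)
    ranked = sorted(range(n_features), key=lambda j: -scores[j])
    selected = ranked[:n_informative]
    oracle = list(range(n_informative))  # stand-in for an optimal feature set
    return (knn_error(train_X, train_y, test_X, test_y, selected),
            knn_error(train_X, train_y, test_X, test_y, oracle),
            len(set(selected) & set(oracle)))


if __name__ == "__main__":
    trials = [run_trial(seed) for seed in range(20)]
    print("mean error, selected features:", statistics.mean(t[0] for t in trials))
    print("mean error, oracle features:  ", statistics.mean(t[1] for t in trials))
    print("mean informative features hit:", statistics.mean(t[2] for t in trials))
```

Averaging the two error rates over many seeds gives a rough, model-specific picture of the gap between a selected and a reference feature set. It does not estimate the conditional expectations studied in the paper, which would require binning trials by one error and averaging the other.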

Bibliography

Sima, C., & Dougherty, E. R. (2006). What should be expected from feature selection in small-sample settings. Bioinformatics, 22(19), 2430–2436.

Authors 2

Chao Sima (first)
Edward R. Dougherty (additional)

References 14 Referenced 62

10.1038/sj.onc.1208984 / Oncogene / Colon cancer prognosis prediction by gene expression profiling by Barrier (2005)
10.1016/j.patcog.2003.08.017 / Pattern Recogn. / Bolstered error estimation by Braga-Neto (2004)
10.1093/bioinformatics/btg419 / Bioinformatics / Is cross-validation valid for small-sample microarray classification? by Braga-Neto (2004)
10.1093/bioinformatics/btl008 / Bioinformatics / Genetic test bed for feature selection by Choudhary (2006)
10.1109/TSMC.1977.4309803 / IEEE Trans. Syst. Man Cybernet. / On the possible orderings in the measurement selection problem by Cover (1977)
10.1056/NEJM200102223440801 / N. Eng. J. Med. / Gene-expression profiles in hereditary breast cancer by Hedenfalk (2001)
10.1109/34.574797 / IEEE Trans. Pattern Anal. Machine Intell. / Feature selection—evaluation, application, and small sample performance by Jain (1997)
10.1016/S0031-3203(99)00041-2 / Pattern Recogn. / Comparison of algorithms that select features for pattern classifiers by Kudo (2000)
10.1016/0167-8655(94)90127-9 / Pattern Recogn. Lett. / Floating search methods in feature selection by Pudil (1994)
10.1158/0008-5472.CAN-05-1069 / Cancer Res. / High-level coexpression of JAG1 and NOTCH1 is observed in human breast cancer and is associated with poor overall survival by Reedijk (2005)
10.1016/j.patcog.2005.03.026 / Pattern Recogn. / Impact of error estimation on feature-selection algorithms by Sima (2005)
10.1056/NEJMoa021967 / N. Eng. J. Med. / A gene-expression signature as a predictor of survival in breast cancer by van de Vijver (2002)
10.1038/415530a / Nature / Gene expression profiling predicts clinical outcome of breast cancer by van't Veer (2002)
10.1158/0008-5472.CAN-04-0695 / Cancer Res. / Prediction of clinical outcome using gene expression profiling and artificial neural networks for patients with neuroblastoma by Wei (2004)

Dates

Type	When
Created	July 26, 2006, 8:58 p.m.
Deposited	Jan. 24, 2023, 5:10 a.m.
Indexed	Aug. 7, 2025, 4:55 a.m.
Issued	July 26, 2006
Published	July 26, 2006
Published Online	July 26, 2006
Published Print	Oct. 1, 2006

Funders 0

None

BibTeX

@article{Sima_2006, title={What should be expected from feature selection in small-sample settings}, volume={22}, ISSN={1367-4803}, url={http://dx.doi.org/10.1093/bioinformatics/btl407}, DOI={10.1093/bioinformatics/btl407}, number={19}, journal={Bioinformatics}, publisher={Oxford University Press (OUP)}, author={Sima, Chao and Dougherty, Edward R.}, year={2006}, month=jul, pages={2430--2436} }

JSON

{
  "indexed": {
    "date-parts": [
      [
        2025,
        8,
        7
      ]
    ],
    "date-time": "2025-08-07T08:55:26Z",
    "timestamp": 1754556926272
  },
  "reference-count": 14,
  "publisher": "Oxford University Press (OUP)",
  "issue": "19",
  "content-domain": {
    "domain": [],
    "crossmark-restriction": false
  },
  "published-print": {
    "date-parts": [
      [
        2006,
        10,
        1
      ]
    ]
  },
  "abstract": "<jats:title>Abstract</jats:title>\n               <jats:p>Motivation: High-throughput technologies for rapid measurement of vast numbers of biological variables offer the potential for highly discriminatory diagnosis and prognosis; however, high dimensionality together with small samples creates the need for feature selection, while at the same time making feature-selection algorithms less reliable. Feature selection must typically be carried out from among thousands of gene-expression features and in the context of a small sample (small number of microarrays). Two basic questions arise: (1) Can one expect feature selection to yield a feature set whose error is close to that of an optimal feature set? (2) If a good feature set is not found, should it be expected that good feature sets do not exist?</jats:p>\n               <jats:p>Results: The two questions translate quantitatively into questions concerning conditional expectation. (1) Given the error of an optimal feature set, what is the conditionally expected error of the selected feature set? (2) Given the error of the selected feature set, what is the conditionally expected error of the optimal feature set? We address these questions using three classification rules (linear discriminant analysis, linear support vector machine and k-nearest-neighbor classification) and feature selection via sequential floating forward search and the t-test. We consider three feature-label models and patient data from a study concerning survival prognosis for breast cancer. With regard to the two focus questions, there is similarity across all experiments: (1) One cannot expect to find a feature set whose error is close to optimal, and (2) the inability to find a good feature set should not lead to the conclusion that good feature sets do not exist. In practice, the latter conclusion may be more immediately relevant, since when faced with the common occurrence that a feature set discovered from the data does not give satisfactory results, the experimenter can draw no conclusions regarding the existence or nonexistence of suitable feature sets.</jats:p>\n               <jats:p>Availability: \u00a0</jats:p>\n               <jats:p>Contact: \u00a0edward@ece.tamu.edu</jats:p>",
  "DOI": "10.1093/bioinformatics/btl407",
  "type": "journal-article",
  "created": {
    "date-parts": [
      [
        2006,
        7,
        27
      ]
    ],
    "date-time": "2006-07-27T00:58:50Z",
    "timestamp": 1153961930000
  },
  "page": "2430-2436",
  "source": "Crossref",
  "is-referenced-by-count": 62,
  "title": "What should be expected from feature selection in small-sample settings",
  "prefix": "10.1093",
  "volume": "22",
  "author": [
    {
      "given": "Chao",
      "family": "Sima",
      "sequence": "first",
      "affiliation": [
        {
          "name": "Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA"
        }
      ]
    },
    {
      "given": "Edward R.",
      "family": "Dougherty",
      "sequence": "additional",
      "affiliation": [
        {
          "name": "Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA"
        },
        {
          "name": "Computational Biology Division, Translational Genomics Research Institute, Phoenix, AZ 85004, USA"
        }
      ]
    }
  ],
  "member": "286",
  "published-online": {
    "date-parts": [
      [
        2006,
        7,
        26
      ]
    ]
  },
  "reference": [
    {
      "key": "2023012409235529500_b1",
      "doi-asserted-by": "crossref",
      "first-page": "6155",
      "DOI": "10.1038/sj.onc.1208984",
      "article-title": "Colon cancer prognosis prediction by gene expression profiling",
      "volume": "24",
      "author": "Barrier",
      "year": "2005",
      "journal-title": "Oncogene"
    },
    {
      "key": "2023012409235529500_b2",
      "doi-asserted-by": "crossref",
      "first-page": "1267",
      "DOI": "10.1016/j.patcog.2003.08.017",
      "article-title": "Bolstered error estimation",
      "volume": "37",
      "author": "Braga-Neto",
      "year": "2004",
      "journal-title": "Pattern Recogn."
    },
    {
      "key": "2023012409235529500_b3",
      "doi-asserted-by": "crossref",
      "first-page": "374",
      "DOI": "10.1093/bioinformatics/btg419",
      "article-title": "Is cross-validation valid for small-sample microarray classification?",
      "volume": "20",
      "author": "Braga-Neto",
      "year": "2004",
      "journal-title": "Bioinformatics"
    },
    {
      "key": "2023012409235529500_b4",
      "doi-asserted-by": "crossref",
      "first-page": "837",
      "DOI": "10.1093/bioinformatics/btl008",
      "article-title": "Genetic test bed for feature selection",
      "volume": "22",
      "author": "Choudhary",
      "year": "2006",
      "journal-title": "Bioinformatics"
    },
    {
      "key": "2023012409235529500_b5",
      "doi-asserted-by": "crossref",
      "first-page": "657",
      "DOI": "10.1109/TSMC.1977.4309803",
      "article-title": "On the possible orderings in the measurement selection problem",
      "volume": "7",
      "author": "Cover",
      "year": "1977",
      "journal-title": "IEEE Trans. Syst. Man Cybernet."
    },
    {
      "key": "2023012409235529500_b6",
      "doi-asserted-by": "crossref",
      "first-page": "539",
      "DOI": "10.1056/NEJM200102223440801",
      "article-title": "Gene-expression profiles in hereditary breast cancer",
      "volume": "344",
      "author": "Hedenfalk",
      "year": "2001",
      "journal-title": "N. Eng. J. Med."
    },
    {
      "key": "2023012409235529500_b7",
      "doi-asserted-by": "crossref",
      "first-page": "153",
      "DOI": "10.1109/34.574797",
      "article-title": "Feature selection\u2014evaluation, application, and small sample performance",
      "volume": "19",
      "author": "Jain",
      "year": "1997",
      "journal-title": "IEEE Trans. Pattern Anal. Machine Intell."
    },
    {
      "key": "2023012409235529500_b8",
      "doi-asserted-by": "crossref",
      "first-page": "25",
      "DOI": "10.1016/S0031-3203(99)00041-2",
      "article-title": "Comparison of algorithms that select features for pattern classifiers",
      "volume": "33",
      "author": "Kudo",
      "year": "2000",
      "journal-title": "Pattern Recogn."
    },
    {
      "key": "2023012409235529500_b9",
      "doi-asserted-by": "crossref",
      "first-page": "1119",
      "DOI": "10.1016/0167-8655(94)90127-9",
      "article-title": "Floating search methods in feature selection",
      "volume": "15",
      "author": "Pudil",
      "year": "1994",
      "journal-title": "Pattern Recogn. Lett."
    },
    {
      "key": "2023012409235529500_b10",
      "doi-asserted-by": "crossref",
      "first-page": "8530",
      "DOI": "10.1158/0008-5472.CAN-05-1069",
      "article-title": "High-level coexpression of JAG1 and NOTCH1 is observed in human breast cancer and is associated with poor overall survival",
      "volume": "65",
      "author": "Reedijk",
      "year": "2005",
      "journal-title": "Cancer Res."
    },
    {
      "key": "2023012409235529500_b11",
      "doi-asserted-by": "crossref",
      "first-page": "2472",
      "DOI": "10.1016/j.patcog.2005.03.026",
      "article-title": "Impact of error estimation on feature-selection algorithms",
      "volume": "38",
      "author": "Sima",
      "year": "2005",
      "journal-title": "Pattern Recogn."
    },
    {
      "key": "2023012409235529500_b12",
      "doi-asserted-by": "crossref",
      "first-page": "1999",
      "DOI": "10.1056/NEJMoa021967",
      "article-title": "A gene-expression signature as a predictor of survival in breast cancer",
      "volume": "347",
      "author": "van de Vijver",
      "year": "2002",
      "journal-title": "N. Eng. J. Med."
    },
    {
      "key": "2023012409235529500_b13",
      "doi-asserted-by": "crossref",
      "first-page": "530",
      "DOI": "10.1038/415530a",
      "article-title": "Gene expression profiling predicts clinical outcome of breast cancer",
      "volume": "415",
      "author": "van't Veer",
      "year": "2002",
      "journal-title": "Nature"
    },
    {
      "key": "2023012409235529500_b14",
      "doi-asserted-by": "crossref",
      "first-page": "6883",
      "DOI": "10.1158/0008-5472.CAN-04-0695",
      "article-title": "Prediction of clinical outcome using gene expression profiling and artificial neural networks for patients with neuroblastoma",
      "volume": "64",
      "author": "Wei",
      "year": "2004",
      "journal-title": "Cancer Res."
    }
  ],
  "container-title": "Bioinformatics",
  "original-title": [],
  "language": "en",
  "link": [
    {
      "URL": "https://academic.oup.com/bioinformatics/article-pdf/22/19/2430/48841422/bioinformatics_22_19_2430.pdf",
      "content-type": "application/pdf",
      "content-version": "vor",
      "intended-application": "syndication"
    },
    {
      "URL": "https://academic.oup.com/bioinformatics/article-pdf/22/19/2430/48841422/bioinformatics_22_19_2430.pdf",
      "content-type": "unspecified",
      "content-version": "vor",
      "intended-application": "similarity-checking"
    }
  ],
  "deposited": {
    "date-parts": [
      [
        2023,
        1,
        24
      ]
    ],
    "date-time": "2023-01-24T10:10:55Z",
    "timestamp": 1674555055000
  },
  "score": 1,
  "resource": {
    "primary": {
      "URL": "https://academic.oup.com/bioinformatics/article/22/19/2430/241527"
    }
  },
  "subtitle": [],
  "short-title": [],
  "issued": {
    "date-parts": [
      [
        2006,
        7,
        26
      ]
    ]
  },
  "references-count": 14,
  "journal-issue": {
    "issue": "19",
    "published-print": {
      "date-parts": [
        [
          2006,
          10,
          1
        ]
      ]
    }
  },
  "URL": "http://dx.doi.org/10.1093/bioinformatics/btl407",
  "relation": {},
  "ISSN": [
    "1367-4811",
    "1367-4803"
  ],
  "subject": [],
  "published-other": {
    "date-parts": [
      [
        2006,
        10,
        1
      ]
    ]
  },
  "published": {
    "date-parts": [
      [
        2006,
        7,
        26
      ]
    ]
  }
}