A numeric comparison of variable selection algorithms for supervised learning

Palombo, G. and Narsky, I. (2009) A numeric comparison of variable selection algorithms for supervised learning. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 612 (1). pp. 187-195. ISSN 0168-9002. https://resolver.caltech.edu/CaltechAUTHORS:20100122-131703884

Abstract

Datasets in modern High Energy Physics (HEP) experiments are often described by dozens or even hundreds of input variables. Reducing a full variable set to a subset that most completely represents the information in the data is therefore an important task in the analysis of HEP data. We compare several variable selection algorithms for supervised learning on a number of datasets, including the imaging gamma-ray Cherenkov telescope (MAGIC) data available from the UCI repository. We use classifiers and variable selection methods implemented in StatPatternRecognition (SPR), a free open-source C++ statistical package developed in the HEP community (http://sourceforge.net/projects/statpatrec/). For each dataset, we select a powerful classifier and estimate its learning accuracy on the variable subsets obtained by each selection algorithm. When possible, we also estimate the CPU time needed for the variable subset selection. The results of this analysis are compared with those published previously for these datasets using other statistical packages such as R and Weka. We show that the most accurate, yet slowest, method is a wrapper algorithm known as generalized sequential forward selection (“Add N Remove R”) implemented in SPR.
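
For readers unfamiliar with wrapper methods, the following is a minimal Python sketch of the general "Add N Remove R" idea: repeatedly add the group of N unused variables that most improves a figure of merit, then drop the R selected variables whose removal costs the least. It is not the SPR implementation, which is written in C++ and is not described in detail in this abstract. The function name `add_n_remove_r`, its parameters, the stopping rule based on `target_size`, and the toy scoring function are all assumptions made for illustration. In a real analysis `score` would typically be the validated accuracy of a classifier trained on the candidate subset.

```python
from itertools import combinations


def add_n_remove_r(variables, score, n_add, n_remove, target_size):
    """Return a subset of `variables` containing `target_size` entries.

    `score(subset)` is any figure of merit where larger is better
    (hypothetical interface; SPR's own interface differs).
    """
    if not 0 <= n_remove < n_add:
        raise ValueError("need 0 <= n_remove < n_add")
    if not 0 < target_size <= len(variables):
        raise ValueError("target_size must be between 1 and len(variables)")

    selected, remaining = [], list(variables)
    while len(selected) < target_size:
        # Add step: try every group of k unused variables and keep the best.
        k = min(n_add, len(remaining))
        best_add = max(combinations(remaining, k),
                       key=lambda group: score(selected + list(group)))
        selected += best_add
        remaining = [v for v in remaining if v not in best_add]

        # Remove step: drop r variables, always fewer than were just added
        # so the subset grows, and enough to avoid overshooting the target.
        r = max(min(n_remove, k - 1), len(selected) - target_size)
        if r > 0:
            best_drop = max(
                combinations(selected, r),
                key=lambda group: score([v for v in selected if v not in group]))
            selected = [v for v in selected if v not in best_drop]
            remaining += best_drop
    return selected


# Toy figure of merit (made-up numbers): each variable contributes some
# points, and x1 and x2 are largely redundant with each other.
USEFULNESS = {"x1": 30, "x2": 25, "x3": 5, "x4": 20}
REDUNDANCY_PENALTY = {frozenset(("x1", "x2")): 22}


def toy_score(subset):
    total = sum(USEFULNESS[v] for v in subset)
    for pair, penalty in REDUNDANCY_PENALTY.items():
        if pair <= set(subset):
            total -= penalty
    return total


best = add_n_remove_r(["x1", "x2", "x3", "x4"], toy_score,
                      n_add=2, n_remove=1, target_size=2)
print(best, toy_score(best))
```

In this toy case the search keeps x1 and x4 rather than the two individually strongest variables, because x2 adds little once x1 is present. Evaluating every group of N candidates is what makes the approach thorough but expensive, in line with the abstract's observation that the method is the most accurate yet the slowest.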


Item Type: Article
Related URLs: http://dx.doi.org/10.1016/j.nima.2009.09.059 (URL Type: DOI; Description: UNSPECIFIED)
Additional Information: © 2009 Elsevier B.V. Received 12 June 2009; revised 22 September 2009; accepted 23 September 2009. Available online 26 September 2009.
Subject Keywords: StatPatternRecognition; Machine learning; Data analysis; Variable selection; Classification
Issue or Number: 1
Record Number: CaltechAUTHORS:20100122-131703884
Persistent URL: https://resolver.caltech.edu/CaltechAUTHORS:20100122-131703884
Official Citation: G. Palombo, I. Narsky, A numeric comparison of variable selection algorithms for supervised learning, Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, Volume 612, Issue 1, 21 December 2009, Pages 187-195, ISSN 0168-9002, DOI: 10.1016/j.nima.2009.09.059. (http://www.sciencedirect.com/science/article/B6TJM-4X9TTP9-8/2/f156f73cd9c96b1e931b5bd420df4150)
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 17290
Collection: CaltechAUTHORS
Deposited By: Tony Diaz
Deposited On: 28 Jan 2010 23:53
Last Modified: 03 Oct 2019 01:25