Feature Selection and Ranking in Machine Learning

The recent emergence of datasets with massive numbers of features has made pattern recognition an ever-challenging task. In particular, such high numbers of features give rise to (1) overfitting, poor generalization, and inferior prediction performance, (2) slow and computationally expensive predictors, and (3) difficulty in comprehending the underlying process. Feature selection (FS) can be defined as selecting the subset of available features in a dataset that is associated with the response variable by excluding irrelevant and redundant features. An effective feature selection process mitigates the problems associated with large datasets in that it yields (1) better classification performance, (2) reduced storage and computational cost, and (3) more generalizable and interpretable models. An alternative to FS for dimensionality reduction is feature extraction (FE), wherein the original features are first combined and then projected into a new feature space of lower dimensionality. A major downside of FE is that the transformed features lose their physical meaning, which complicates further analysis of the model and makes it difficult to interpret. Thus, FS is superior to FE in terms of readability and interpretability.
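The FS versus FE distinction can be illustrated with a minimal NumPy sketch. The feature names, the chosen column indices, and the use of an SVD-based projection (in the spirit of principal component analysis) are illustrative assumptions, not part of the source:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))           # 100 samples, 5 features
feature_names = ["age", "income", "height", "weight", "score"]  # hypothetical names

# Feature selection: keep a subset of the ORIGINAL columns.
# Retained columns keep their physical meaning.
selected = [0, 3]                        # e.g. chosen by some FS method
X_fs = X[:, selected]
fs_names = [feature_names[j] for j in selected]   # still interpretable

# Feature extraction: project onto NEW axes (here, the top-2 right
# singular vectors, i.e. a PCA-style projection).  Each extracted
# "feature" mixes all original columns, so it no longer corresponds
# to a single physical quantity.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_fe = Xc @ Vt[:2].T                     # 2-D extracted features

print(fs_names)        # selected features keep their names
print(X_fe.shape)      # extracted features have no physical names
```

The selected-feature matrix can be inspected column by column in the units of the original measurements, whereas each extracted component requires interpreting a weighted mixture of all five inputs.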

In the FS problem, the goal is to identify the optimal subset of the available features with respect to a particular classification performance criterion. For a dataset with p features, the size of the solution space is 2^p - 1, which makes the FS problem computationally intractable. FS methods currently available in the literature fall into three broad categories: filters, wrappers, and embedded methods. Filter methods use statistical characteristics of the data to remove poorly associated features, but they ignore the effect of the chosen feature set on the performance of the intended classifier. Wrapper methods alleviate this issue by searching the space of feature subsets for the set that optimizes a desired performance criterion for a given classifier. Embedded methods use the same performance criterion as wrappers, yet they incorporate FS directly into the training process.
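As a concrete instance of the filter category, the sketch below ranks features by the absolute Pearson correlation between each feature and the response; note that no classifier is consulted at any point. The synthetic data, the coefficients, and the function name are illustrative assumptions:

```python
import numpy as np

def filter_rank(X, y):
    """Filter-style FS: rank features by |Pearson correlation| with y.

    Uses only statistical characteristics of the data; the intended
    classifier is never consulted (the defining trait of a filter).
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc**2).sum(axis=0)) * np.sqrt((yc**2).sum()))
    return np.argsort(-np.abs(corr))     # best-associated feature first

rng = np.random.default_rng(1)
n, p = 200, 6
X = rng.normal(size=(n, p))
# The response depends only on features 0 and 2; the rest are irrelevant.
y = 2.0 * X[:, 0] - 3.0 * X[:, 2] + 0.1 * rng.normal(size=n)

ranking = filter_rank(X, y)
print(ranking[:2])   # the two truly relevant features should rank on top
```

A wrapper method would instead evaluate candidate subsets by training the target classifier on each, which is more faithful to the end goal but far more expensive per candidate.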

Aksakalli and Malekipirbazari (Pattern Recognition Letters, 2016) introduce a new wrapper approach for FS based on a pseudo-gradient descent stochastic optimization algorithm called Binary Simultaneous Perturbation Stochastic Approximation (BSPSA). The algorithm starts with a random solution vector and moves toward the optimal solution vector via successive iterations in which all components of the current solution vector are perturbed simultaneously by random offsets drawn from a qualified probability distribution. Among gradient descent based optimization approaches to FS, prior work has discussed a stochastic gradient descent algorithm in which a probability distribution is first estimated over the full set of features and its mass is then concentrated on the more informative features. The closest related work applies conventional (continuous-space) SPSA to feature selection for a nearest neighbor classifier with the Minkowski distance metric on two particular datasets. Binary SPSA, in contrast, exploits the inherently binary nature of FS and eliminates the need to define and fine-tune the additional algorithm parameters required by continuous SPSA.
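The simultaneous-perturbation idea can be sketched schematically as follows. This is not the authors' exact algorithm: the fixed gains a and c, the stochastic rounding of the perturbed vectors, and the toy loss (a stand-in for a cross-validated classifier error) are all illustrative assumptions; the published BSPSA's gain sequences and rounding details differ. The key property shown is that one pseudo-gradient estimate costs only two loss evaluations, regardless of the number of features p:

```python
import numpy as np

def bspsa_fs(loss, p, iters=300, a=0.1, c=0.2, seed=0):
    """Schematic binary SPSA for feature selection (a sketch, not the
    published algorithm).

    loss(b) scores a binary inclusion vector b in {0,1}^p (lower is
    better).  w lives in [0,1]^p; at each iteration EVERY component is
    perturbed simultaneously by a random +/-c offset (a Bernoulli
    "qualified" distribution), yielding a pseudo-gradient estimate
    from just two loss evaluations.
    """
    rng = np.random.default_rng(seed)
    w = np.full(p, 0.5)                          # neutral start
    for _ in range(iters):
        delta = rng.choice([-1.0, 1.0], size=p)  # simultaneous perturbation
        # Stochastically round the two perturbed vectors to binary subsets.
        b_plus = (rng.random(p) < np.clip(w + c * delta, 0, 1)).astype(int)
        b_minus = (rng.random(p) < np.clip(w - c * delta, 0, 1)).astype(int)
        # Two-point pseudo-gradient estimate (delta_j in {-1,+1}, so
        # dividing by delta_j equals multiplying by it).
        ghat = (loss(b_plus) - loss(b_minus)) / (2 * c) * delta
        w = np.clip(w - a * ghat, 0, 1)          # descend, stay in [0,1]^p
    return (w > 0.5).astype(int)                 # final feature subset

# Toy objective: the "best" subset is features {0, 2}; the loss counts
# deviations from it, standing in for an estimated classifier error.
target = np.array([1, 0, 1, 0, 0])
loss = lambda b: int(np.abs(b - target).sum())

print(bspsa_fs(loss, p=5))
```

Because the offsets are inherently +/-1 after rounding, there is no continuous step size to tune for the solution representation itself, which is the advantage the text attributes to the binary variant over continuous SPSA.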