The Tech Platform

Feature Selection: Benefits and Methods. How to Choose a Feature Selection Method?



Feature selection, one of the main components of feature engineering, is the process of selecting the most important features to use as inputs to a machine learning algorithm. Feature selection techniques are employed to reduce the number of input variables by eliminating redundant or irrelevant features, narrowing the set down to those most relevant to the machine learning model.

The main benefits of performing feature selection in advance, rather than letting the machine learning model figure out which features are most important, include:

  • simpler models: simple models are easy to explain - a model that is too complex and unexplainable is not valuable

  • shorter training times: a more precise subset of features decreases the amount of time needed to train a model

  • variance reduction: increases the precision of the estimates that can be obtained for a given amount of data or computation

  • avoiding the curse of dimensionality: as the number of features (dimensions) increases, the volume of the feature space grows so fast that the available data become sparse - dimensionality reduction techniques such as PCA can be used to counter this (see the sketch after this list)
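
As a quick illustration, the sketch below applies scikit-learn's PCA to synthetic data; the dataset size, feature count, and variance target are arbitrary choices for demonstration.

    # Minimal sketch: countering high dimensionality with PCA (scikit-learn).
    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA

    # Synthetic data: 200 samples, 50 features, only a few of them informative
    X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=0)

    # Keep enough principal components to explain 95% of the variance
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X)
    print(X.shape, "->", X_reduced.shape)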


Three key benefits of performing feature selection on your data are:

  • Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.

  • Improves Accuracy: Less misleading data means modeling accuracy improves.

  • Reduces Training Time: Less data means that algorithms train faster.



The most common input variable data types include: numerical variables, such as integer and floating-point variables; and categorical variables, such as Boolean, ordinal, and nominal variables. Popular tooling for feature selection includes scikit-learn's feature_selection module in Python and R packages such as caret and Boruta.
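
For instance, a minimal sketch of scikit-learn's feature_selection module on synthetic data (the score function and the value of k are illustrative choices):

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = make_classification(n_samples=200, n_features=20, n_informative=4, random_state=0)

    # Keep the 4 features with the highest ANOVA F-scores
    selector = SelectKBest(score_func=f_classif, k=4)
    X_selected = selector.fit_transform(X, y)
    print(selector.get_support(indices=True))  # column indices of the selected features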

Feature Selection Methods by Label Information

Feature selection algorithms are categorized as either supervised, which can be used with labeled data, or unsupervised, which can be used with unlabeled data. These techniques are further classified as filter methods, wrapper methods, embedded methods, or hybrid methods:

  • Filter methods: Filter methods select features based on statistical measures rather than on cross-validation performance. A chosen metric is applied to identify and discard irrelevant attributes. Filter methods are either univariate, in which an ordered ranking list of features is established to inform the final selection of the feature subset, or multivariate, which evaluate the relevance of the features as a whole, identifying redundant as well as irrelevant features.

  • Wrapper methods: Wrapper methods treat the selection of a set of features as a search problem, in which different combinations of features are prepared, evaluated, and compared against one another. This approach makes it possible to detect interactions among variables. Wrapper methods search for feature subsets that improve the quality of the results of the learning or clustering algorithm used for the selection. Popular examples include Boruta and forward feature selection.

  • Embedded methods: Embedded methods integrate feature selection into the learning algorithm itself, so that model training and feature selection are performed simultaneously: the features that contribute the most at each iteration of the training process are retained. Random forest, decision tree, and LASSO feature selection are common embedded methods.


Feature Selection Techniques:

Feature selection techniques can be roughly classified into three families:

  1. Supervised methods,

  2. Semi-supervised methods,

  3. Unsupervised methods.


1. Supervised Methods

Supervised feature selection methods are classified into four types, based on their interaction with the learning model: filter, wrapper, hybrid, and embedded methods.



1. Filter Method:

In the filter method, features are selected based on statistical measures. It is independent of the learning algorithm and requires less computational time. Information gain, the chi-square test, Fisher score, correlation coefficients, and the variance threshold are some of the statistical measures used to assess the importance of features.
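
As a small sketch of this idea, the snippet below scores the Iris features with two of the measures listed above (the chi-square test and mutual information, an estimate of information gain); scikit-learn is assumed.

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import chi2, mutual_info_classif

    X, y = load_iris(return_X_y=True)

    chi2_scores, _ = chi2(X, y)                        # chi-square requires non-negative features
    mi_scores = mutual_info_classif(X, y, random_state=0)

    for i, (c, m) in enumerate(zip(chi2_scores, mi_scores)):
        print(f"feature {i}: chi2={c:.2f}  mutual information={m:.2f}")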


2. Wrapper Methods:

The wrapper methodology considers the selection of feature sets as a search problem, where different combinations are prepared, evaluated, and compared with one another. A predictive model is used to evaluate each combination of features and assign it a performance score.


The performance of the Wrapper method depends on the classifier. The best subset of features is selected based on the results of the classifier. Wrapper methods are computationally more expensive than filter methods, due to the repeated learning steps and cross-validation.


However, these methods are usually more accurate than filter methods. Examples include recursive feature elimination, sequential feature selection algorithms, and genetic algorithms.
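
The sketch below shows recursive feature elimination with a logistic regression classifier on synthetic data; the estimator and the number of features to keep are illustrative choices.

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=300, n_features=15, n_informative=5, random_state=0)

    # Repeatedly refit the model and drop the weakest feature until 5 remain
    rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
    rfe.fit(X, y)
    print(rfe.support_)   # boolean mask of the retained features
    print(rfe.ranking_)   # 1 = selected; larger numbers were eliminated earlier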


3. Hybrid Methods:

The process of creating a hybrid feature selection method depends on what you choose to combine. The most common pattern is to use a filter-based ranking method to generate a feature ranking list in a first step, then run a wrapper method on only the top k features from that list. In this way, the filter-based rankers reduce the feature space of the dataset and improve the time complexity of the wrapper methods.
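
A minimal sketch of this filter-then-wrapper pattern, assuming scikit-learn and synthetic data (the values of k are arbitrary):

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFE, SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=300, n_features=50, n_informative=5, random_state=0)

    # Filter stage: rank by ANOVA F-score and keep the top 15 features
    filter_stage = SelectKBest(score_func=f_classif, k=15)
    X_filtered = filter_stage.fit_transform(X, y)

    # Wrapper stage: search the reduced space for the best 5-feature subset
    wrapper_stage = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
    wrapper_stage.fit(X_filtered, y)
    print(wrapper_stage.support_)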


4. Embedded Methods:

In the embedded method, feature selection is built into the learning algorithm itself; ensemble and hybrid learners are typical examples. Because the selection reflects the model's collective decision, its performance is often better than that of the other two approaches. Random forest is one such example. Embedded methods are computationally less intensive than wrapper methods; however, the selected features are specific to the learning model used.
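
As a sketch, both LASSO and a random forest can act as embedded selectors through scikit-learn's SelectFromModel; the regularization strength and forest size below are illustrative choices on synthetic data.

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import Lasso

    X, y = make_regression(n_samples=300, n_features=20, n_informative=5, noise=0.1, random_state=0)

    # LASSO drives the coefficients of uninformative features to zero
    lasso_selector = SelectFromModel(Lasso(alpha=0.1)).fit(X, y)
    print("LASSO keeps:", lasso_selector.get_support(indices=True))

    # A random forest exposes impurity-based feature importances
    forest_selector = SelectFromModel(RandomForestRegressor(n_estimators=100, random_state=0)).fit(X, y)
    print("Forest keeps:", forest_selector.get_support(indices=True))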



2. Unsupervised Methods

Due to the scarcity of readily available labels, unsupervised feature selection (UFS) methods are widely adopted in the analysis of high-dimensional data. However, most existing UFS methods focus primarily on how well features preserve the data structure while ignoring redundancy among features. Moreover, determining the proper number of features is another challenge.



There are two issues involved in developing an automated feature subset selection algorithm for unlabeled data:

  • the need for finding the number of clusters in conjunction with feature selection, and

  • the need to normalize the bias of feature selection criteria with respect to the dimensionality of the feature subsets

Unsupervised feature selection methods are classified into three types, based on their interaction with the learning model: filter, wrapper, and hybrid methods.


1. Filter Methods:

Unsupervised feature selection methods based on the filter approach can be categorized as univariate and multivariate. Univariate methods, also known as ranking-based unsupervised feature selection methods, use certain criteria to evaluate each feature individually and obtain an ordered ranking list of features, from which the final feature subset is selected.


Such methods can effectively identify and remove irrelevant features, but they are unable to remove redundant ones since they do not take into account possible dependencies among features.


On the other hand, multivariate filter methods evaluate the relevance of the features jointly rather than individually. Multivariate methods can handle redundant and irrelevant features; thus, in many cases, the accuracy reached by learning algorithms using the subset of features selected by multivariate methods is better than the one achieved by using univariate methods.
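
A small sketch of the univariate (ranking-based) case, using feature variance as the criterion; no labels are involved, and the threshold and k below are arbitrary.

    import numpy as np
    from sklearn.feature_selection import VarianceThreshold

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    X[:, 3] *= 0.01   # make feature 3 nearly constant
    X[:, 7] *= 0.01   # make feature 7 nearly constant

    # Univariate ranking: order features by variance and keep the top 8
    variances = X.var(axis=0)
    top_k = np.argsort(variances)[::-1][:8]
    print("Top features by variance:", sorted(top_k.tolist()))

    # Thresholded variant using scikit-learn's VarianceThreshold
    X_reduced = VarianceThreshold(threshold=0.5).fit_transform(X)
    print(X_reduced.shape)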


2. Wrapper Methods:

Unsupervised feature selection methods based on the wrapper approach can be divided into three broad categories according to the feature search strategy: sequential, bio-inspired, and iterative. In sequential methodology, features are added or removed sequentially. Methods based on sequential search are easy to implement and fast.


On the other hand, bio-inspired methodology tries to incorporate randomness into the search process, aiming to escape from local optima. Iterative methods address the unsupervised feature selection problem by casting it as an estimation problem and thus avoiding a combinatorial search.


Wrapper methods evaluate feature subsets using the results of a specific clustering algorithm. Methods developed under this approach are characterized by finding feature subsets that contribute to improving the quality of the results of the clustering algorithm used for the selection.


However, the main disadvantage of wrapper methods is that they usually have a high computational cost and are tied to the particular clustering algorithm used during the selection.
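
As a toy sketch of the wrapper idea for unlabeled data, the snippet below evaluates every 3-feature subset of a small synthetic dataset by the silhouette score of the k-means clustering it produces; the exhaustive search is for illustration only and would not scale to real feature spaces.

    from itertools import combinations
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=300, n_features=6, centers=3, random_state=0)

    best_subset, best_score = None, -1.0
    for subset in combinations(range(X.shape[1]), 3):
        cols = list(subset)
        labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X[:, cols])
        score = silhouette_score(X[:, cols], labels)
        if score > best_score:
            best_subset, best_score = subset, score

    print("Best subset:", best_subset, "silhouette:", round(best_score, 3))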


3. Hybrid Methods:

Hybrid methods try to exploit the qualities of both the filter and the wrapper approaches, seeking a good compromise between efficiency (computational effort) and effectiveness (the quality achieved on the associated objective task when using the selected features).


To take advantage of both approaches, hybrid methods first rank or select features in a filter stage by applying a measure based on intrinsic properties of the data; then, in a wrapper stage, candidate feature subsets are evaluated with a specific clustering algorithm in order to find the best one. Two types of hybrid methods can be distinguished: those based on a ranking of features and those not based on ranking.


It is worth noting that hybrid unsupervised feature selection methods designed specifically for handling data in particular domains have also been proposed in the literature (Jashki et al. 2009; Hu et al. 2009; Yang et al. 2011a; Yu 2011).



How to Choose a Feature Selection Method

Choosing the best feature selection method depends on the types of the input and output variables under consideration (a small sketch of these pairings follows the list):

  • Numerical Input, Numerical Output: a regression predictive modeling problem with numerical input variables - use a correlation coefficient, such as Pearson’s correlation coefficient (linear) or Spearman’s rank coefficient (nonlinear).

  • Numerical Input, Categorical Output: a classification predictive modeling problem with numerical input variables - use a correlation-style statistic that accounts for the categorical target, such as the ANOVA F-statistic (linear) or Kendall’s rank coefficient (nonlinear).

  • Categorical Input, Numerical Output: a regression predictive modeling problem with categorical input variables (rare) - use the same measures as above (ANOVA, Kendall’s), but with the roles of input and output reversed.

  • Categorical Input, Categorical Output: a classification predictive modeling problem with categorical input variables - use a measure of association such as the chi-squared test (on contingency tables) or mutual information, a powerful method that is agnostic to data types.
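
A small sketch of these pairings, assuming SciPy and scikit-learn and using synthetic data purely to show which statistic applies to which case:

    import numpy as np
    from scipy.stats import chi2_contingency, kendalltau, pearsonr, spearmanr
    from sklearn.feature_selection import f_classif

    rng = np.random.default_rng(0)

    # Numerical input, numerical output: Pearson (linear) or Spearman (rank-based)
    x_num = rng.normal(size=200)
    y_num = 2 * x_num + rng.normal(size=200)
    r, _ = pearsonr(x_num, y_num)
    rho, _ = spearmanr(x_num, y_num)
    print(r, rho)

    # Numerical input, categorical output: ANOVA F-statistic or Kendall's tau
    y_cat = (x_num > 0).astype(int)
    f_stat, _ = f_classif(x_num.reshape(-1, 1), y_cat)
    tau, _ = kendalltau(x_num, y_cat)
    print(f_stat[0], tau)

    # Categorical input, categorical output: chi-squared test on a contingency table
    x_cat = rng.integers(0, 3, size=200)
    table = np.zeros((3, 2))
    for xi, yi in zip(x_cat, y_cat):
        table[xi, yi] += 1
    chi2_stat, p_value, dof, expected = chi2_contingency(table)
    print(chi2_stat, p_value)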



Resource: Medium (Danny Butvinik), Wikipedia

