In any machine learning project there are several phases before you reach model training, and one of the major ones is feature selection.
Here I want to share a few common feature selection methods available in the scikit-learn library.
- Recursive feature elimination (RFE): as the name suggests, it fits the chosen estimator repeatedly, dropping features each round, until the desired number of features is left. The core idea is to rank features by the estimator's built-in scores, i.e. coefficients (coef_) or feature importances (feature_importances_, in the case of tree-based estimators), and drop the features with the lowest values.
scikit code: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html
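To make the elimination loop concrete, here is a minimal sketch of RFE. The synthetic dataset, the choice of LogisticRegression as the estimator and the target of 3 features are all assumptions picked for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (illustrative assumption): 10 features, only 3 informative.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# RFE needs an estimator that exposes coef_ or feature_importances_ after fitting.
estimator = LogisticRegression(max_iter=1000)

# step=1 drops one feature per round until 3 features remain.
rfe = RFE(estimator=estimator, n_features_to_select=3, step=1)
rfe.fit(X, y)

print("Selected mask:", rfe.support_)
print("Ranking (1 = selected):", rfe.ranking_)

X_selected = rfe.transform(X)
print("Reduced shape:", X_selected.shape)
```

The ranking_ attribute is handy when you want to see in which order features were eliminated, not only which ones survived.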
- L1 norm method: you might remember it from linear models. An L1 penalty regularizes the model by penalizing the absolute size of the weights, which pushes the coefficients of unimportant features to exactly 0. You then keep only the features with non-zero coefficients.
scikit code: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html, or you can write custom code for this.
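A rough sketch of the L1 approach with SelectFromModel could look like this. It assumes a regression problem and uses Lasso as the L1-penalized model; the synthetic data and alpha=1.0 are illustrative choices, and in a real project alpha would need tuning.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic data (illustrative assumption): 20 features, only 4 informative.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=4, noise=1.0, random_state=0
)

# Lasso is linear regression with an L1 penalty; alpha sets the penalty strength.
lasso = Lasso(alpha=1.0)

# For L1-penalized estimators the default threshold is 1e-5, so in effect
# only features with a non-zero coefficient are kept.
selector = SelectFromModel(lasso)
selector.fit(X, y)

coef = selector.estimator_.coef_
mask = selector.get_support()
X_selected = selector.transform(X)

print("Coefficients:", coef.round(2))
print("Kept features:", int(mask.sum()), "of", X.shape[1])
```

A larger alpha zeroes out more coefficients and keeps fewer features; a smaller alpha keeps more.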
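- Tree-based feature importance selection: this is a straightforward method; it uses the feature_importances_ values of a fitted tree-based model to select features. In scikit-learn these values are impurity-based: each feature is scored by the total reduction in the split criterion that it brings across the trees (Gini impurity by default for classifiers, or entropy, i.e. information gain).
scikit code: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html. This can be done easily with SelectFromModel, or you can write custom code for it.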
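Here is a minimal sketch of the tree-based variant. The random forest, the synthetic data and the "mean" threshold (keep features whose importance is at least the average importance) are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data (illustrative assumption): 15 features, only 4 informative.
X, y = make_classification(
    n_samples=300, n_features=15, n_informative=4, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Keep the features whose importance is at least the mean importance.
selector = SelectFromModel(forest, threshold="mean")
selector.fit(X, y)

importances = selector.estimator_.feature_importances_
mask = selector.get_support()
X_selected = selector.transform(X)

print("Importances:", importances.round(3))
print("Threshold used:", selector.threshold_)
print("Kept features:", int(mask.sum()), "of", X.shape[1])
```

The threshold can also be a number or a string such as "median", depending on how aggressive you want the selection to be.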
- Forward/backward sequential feature selection: let's first understand forward selection. It starts with zero features, evaluates the cross-validation score of each candidate feature, and adds the one that scores best. It then repeats this step greedily, each time adding the feature that improves the score of the current set the most, until we reach the desired number of features. Backward selection works the other way around: it starts with all features and keeps dropping one feature at a time until we reach the desired number of features.
Other things to keep in mind while using this:
- Unlike the methods above, it does not rely on feature importances or coefficients, only on the cross-validation score.
- It can be very time-consuming on large datasets, since many models are fitted at every step.
scikit code: an in-depth explanation can be found at https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html
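A small sketch of both directions with SequentialFeatureSelector follows. It assumes the iris dataset and a k-nearest neighbors classifier, which has neither coef_ nor feature_importances_, so the selection relies purely on the cross-validation score. The target of 2 features and cv=5 are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# KNN exposes no coefficients or importances; only the CV score guides selection.
knn = KNeighborsClassifier(n_neighbors=3)

# Forward: start from zero features and greedily add one at a time.
forward = SequentialFeatureSelector(
    knn, n_features_to_select=2, direction="forward", cv=5
)
forward.fit(X, y)

# Backward: start from all features and greedily remove one at a time.
backward = SequentialFeatureSelector(
    knn, n_features_to_select=2, direction="backward", cv=5
)
backward.fit(X, y)

print("Forward selection mask: ", forward.get_support())
print("Backward selection mask:", backward.get_support())

X_forward = forward.transform(X)
print("Reduced shape:", X_forward.shape)
```

The two directions are not guaranteed to pick the same features, so it can be worth comparing both when the dataset is small enough to afford it.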
These are four methods to get started with feature selection with little or no manual code; all of them keep the selected features in their original form rather than transforming them.