Main Content

Sequential Feature Selection

This topic introduces sequential feature selection and provides an example that selects features sequentially using a custom criterion and the sequentialfs function.

Introduction to Sequential Feature Selection

A common method of Feature Selection is sequential feature selection. This method has two components:

  • An objective function, called the criterion, which the method seeks to minimize over all feasible feature subsets. Common criteria are mean squared error (for regression models) and misclassification rate (for classification models).

  • A sequential search algorithm, which adds or removes features from a candidate subset while evaluating the criterion. Since an exhaustive comparison of the criterion value at all 2n subsets of an n-feature data set is typically infeasible (depending on the size of n and the cost of objective calls), sequential searches move in only one direction, always growing or always shrinking the candidate set.

The method has two variants:

  • Sequential forward selection (SFS), in which features are sequentially added to an empty candidate set until the addition of further features does not decrease the criterion.

  • Sequential backward selection (SBS), in which features are sequentially removed from a full candidate set until the removal of further features increase the criterion.

Statistics and Machine Learning Toolbox™ offers several sequential feature selection functions:

  • Stepwise regression is a sequential feature selection technique designed specifically for least-squares fitting. The functions stepwiselm and stepwiseglm use optimizations that are possible only with least-squares criteria. Unlike other sequential feature selection algorithms, stepwise regression can remove features that have been added or add features that have been removed, based on the criterion specified by the 'Criterion' name-value pair argument.

  • sequentialfs performs sequential feature selection using a custom criterion. Input arguments include predictor data, response data, and a function handle to a file implementing the criterion function. You can define a criterion function that measures the characteristics of data or the performance of a learning algorithm. Optional inputs allow you to specify SFS or SBS, required or excluded features, and the size of the feature subset. The function calls cvpartition and crossval to evaluate the criterion at different candidate sets.

  • fscmrmr and fsrmrmr rank features using the minimum redundancy maximum relevance (MRMR) algorithm for classification and regression problems, respectively.

Select Subset of Features with Comparative Predictive Power

This example selects a subset of features using a custom criterion that measures predictive power for a generalized linear regression problem.

Consider a data set with 100 observations of 10 predictors. Generate the random data from a logistic model, with a binomial distribution of responses at each set of values for the predictors. Some coefficients are set to zero so that not all of the predictors affect the response.

rng(456) % Set the seed for reproducibility
n = 100;
m = 10;
X = rand(n,m);
b = [1 0 0 2 .5 0 0 0.1 0 1];
Xb = X*b';
p = 1./(1+exp(-Xb));
N = 50;
y = binornd(N,p);

Fit a logistic model to the data using fitglm.

Y = [y N*ones(size(y))];
model0 = fitglm(X,Y,'Distribution','binomial')
model0 = 
Generalized linear regression model:
    logit(y) ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10
    Distribution = Binomial

Estimated Coefficients:
                   Estimate       SE        tStat        pValue  
                   _________    _______    ________    __________

    (Intercept)      0.22474    0.30043     0.74806       0.45443
    x1               0.68782    0.17207      3.9973     6.408e-05
    x2                0.2003    0.18087      1.1074       0.26811
    x3             -0.055328    0.18871    -0.29319       0.76937
    x4                2.2576     0.1813      12.452    1.3566e-35
    x5               0.54603    0.16836      3.2432     0.0011821
    x6              0.069701    0.17738     0.39294       0.69437
    x7              -0.22562    0.16957     -1.3306       0.18334
    x8              -0.19712    0.17317     -1.1383       0.25498
    x9              -0.20373    0.16796      -1.213       0.22514
    x10              0.99741    0.17247      5.7832    7.3296e-09

100 observations, 89 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 222, p-value = 4.92e-42

Display the deviance of the fit.

dev0 = model0.Deviance
dev0 = 101.5648

This model is the full model, with all of the features and an initial constant term. Sequential feature selection searches for a subset of the features in the full model with comparative predictive power.

Before performing feature selection, you must specify a criterion for selecting the features. In this case, the criterion is the deviance of the fit (a generalization of the residual sum of squares). The critfun function (shown at the end of this example) calls fitglm and returns the deviance of the fit.

If you use the live script file for this example, the critfun function is already included at the end of the file. Otherwise, you need to create this function at the end of your .m file or add it as a file on the MATLAB® path.

Perform feature selection. sequentialfs calls the criterion function via a function handle.

maxdev = chi2inv(.95,1);     
opt = statset('display','iter',...

inmodel = sequentialfs(@critfun,X,Y,...
Start forward sequential feature selection:
Initial columns included:  none
Columns that can not be included:  none
Step 1, used initial columns, criterion value 323.173
Step 2, added column 4, criterion value 184.794
Step 3, added column 10, criterion value 139.176
Step 4, added column 1, criterion value 119.222
Step 5, added column 5, criterion value 107.281
Final columns included:  1 4 5 10 

The iterative display shows a decrease in the criterion value as each new feature is added to the model. The final result is a reduced model with only four of the original ten features: columns 1, 4, 5, and 10 of X, as indicated in the logical vector inmodel returned by sequentialfs.

The deviance of the reduced model is higher than the deviance of the full model. However, the addition of any other single feature would not decrease the criterion value by more than the absolute tolerance, maxdev, set in the options structure. Adding a feature with no effect reduces the deviance by an amount that has a chi-square distribution with one degree of freedom. Adding a significant feature results in a larger change in the deviance. By setting maxdev to chi2inv(.95,1), you instruct sequentialfs to continue adding features provided that the change in deviance is more than the change expected by random chance.

Create the reduced model with an initial constant term.

model = fitglm(X(:,inmodel),Y,'Distribution','binomial')
model = 
Generalized linear regression model:
    logit(y) ~ 1 + x1 + x2 + x3 + x4
    Distribution = Binomial

Estimated Coefficients:
                    Estimate       SE         tStat        pValue  
                   __________    _______    _________    __________

    (Intercept)    -0.0052025    0.16772    -0.031018       0.97525
    x1                0.73814    0.16316       4.5241    6.0666e-06
    x2                 2.2139    0.17402       12.722    4.4369e-37
    x3                0.54073     0.1568       3.4485    0.00056361
    x4                 1.0694    0.15916       6.7191    1.8288e-11

100 observations, 95 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 216, p-value = 1.44e-45

This code creates the function critfun.

function dev = critfun(X,Y)
model = fitglm(X,Y,'Distribution','binomial');
dev = model.Deviance;

See Also

| |

Related Topics