Generate randomized subset of features
[IDX, Z] = randfeatures(X, Group, ...)
randfeatures(..., 'Classifier', C)
randfeatures(..., 'ClassOptions', CO)
randfeatures(..., 'PerformanceThreshold', PT)
randfeatures(..., 'ConfidenceThreshold', CT)
randfeatures(..., 'SubsetSize', SS)
randfeatures(..., 'PoolSize', PS)
randfeatures(..., 'NumberOfIndices', N)
randfeatures(..., 'CrossNorm', CN)
randfeatures(..., 'Verbose', VerboseValue)
[IDX, Z] = randfeatures(X, Group, ...) performs
a randomized subset feature search reinforced by classification. randfeatures
generates subsets of features used to classify the samples. Every
subset is evaluated with the apparent error. Only the best subsets
are kept, and they are joined into a single final pool. The cardinality
for every feature in the pool gives the measurement of the significance.
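The search described above can be illustrated with a minimal sketch. This is not the randfeatures implementation: the synthetic data, the variable names, the trial count, and the threshold value are all assumptions chosen for illustration, and a nearest-centroid rule stands in for the 'da' and 'knn' classifiers so that the sketch needs no toolbox. It relies on implicit expansion, so it assumes a reasonably recent MATLAB release.

```matlab
% Hypothetical sketch of a randomized subset search; not the randfeatures source.
rng(0);                                        % reproducible random draws
numFeatures = 30;
numPerClass = 20;
labels = [ones(1, numPerClass), 2*ones(1, numPerClass)];
X = randn(numFeatures, 2*numPerClass);         % features in rows, samples in columns
X(1:3, labels == 2) = X(1:3, labels == 2) + 5; % only features 1-3 carry class signal

subsetSize = 5;
numTrials = 2000;
perfThreshold = 0.9;
Z = zeros(numFeatures, 1);                     % how often each feature lands in the pool

for t = 1:numTrials
    subset = randperm(numFeatures, subsetSize); % random subset of feature indices
    S = X(subset, :);
    % Stand-in classifier (assumption): assign each sample to the nearest class centroid.
    c1 = mean(S(:, labels == 1), 2);
    c2 = mean(S(:, labels == 2), 2);
    d1 = sum((S - c1).^2, 1);
    d2 = sum((S - c2).^2, 1);
    predicted = 1 + (d2 < d1);
    correctRate = mean(predicted == labels);   % apparent (resubstitution) accuracy
    if correctRate >= perfThreshold
        Z(subset) = Z(subset) + 1;             % accepted subset joins the pool
    end
end

[~, IDX] = sort(Z, 'descend');                 % most frequently pooled features first
```

In this toy setup, the three informative features should accumulate the highest counts in Z, because nearly every accepted subset contains at least one of them.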
X contains the training samples. Every column of
X is an observation.
Group contains the class labels.
Group can be a numeric vector, a cell array of character
vectors or string vector;
numel(Group) must be the same as the
number of columns in X, and
numel(unique(Group)) must be greater than or equal to 2.
Z is the classification significance
for every feature.
IDX contains the indices after sorting
Z; i.e., the first one points to the most significant feature.
randfeatures(..., 'Classifier', C) sets
the classifier. Options are
'da' (default)    Discriminant analysis
'knn'             K nearest neighbors
randfeatures(..., 'ClassOptions', CO) specifies CO,
a cell array with extra options for the selected classifier. When you specify
the discriminant analysis model ('da') as a classifier, randfeatures uses the
classify function with its default parameters. For the KNN classifier,
randfeatures uses fitcknn with the following default options.
randfeatures(..., 'PerformanceThreshold', PT) sets the correct classification threshold used to pick
the subsets included in the final pool. For the 'da' model,
the default is
0.8. For the 'knn' model,
the default is
randfeatures(..., 'ConfidenceThreshold', CT) uses the posterior probability of the discriminant
analysis to invalidate classified subvectors with low confidence.
When using the
'da' model, the default is
of classes). When using the 'knn' model,
the default is 1, meaning any classified subvector must have all k neighbors
classified to the same class in order to be kept in the pool.
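The unanimity rule for the KNN case can be pictured with a small sketch. How randfeatures applies the threshold internally is not stated here, so the vote matrix, the variable names, and the agreement measure below are assumptions made purely for illustration.

```matlab
% Hypothetical neighbor votes: one row per classified sample, one column per neighbor (k = 5).
neighborLabels = [1 1 1 1 1;
                  2 2 1 2 2;
                  2 2 2 2 2];
confidenceThreshold = 1;                        % require unanimous neighbors

winner = mode(neighborLabels, 2);               % majority class for each sample
agreement = mean(neighborLabels == winner, 2);  % fraction of neighbors voting for the winner
isConfident = agreement >= confidenceThreshold; % samples that meet the threshold
```

With a threshold of 1, the second sample is rejected because one of its five neighbors disagrees; lowering the threshold to 0.8 would accept it.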
randfeatures(..., 'SubsetSize', SS) sets
the number of features considered in every subset. Default is
randfeatures(..., 'PoolSize', PS) sets
the targeted number of accepted subsets for the final pool. Default
randfeatures(..., 'NumberOfIndices', N) sets the number of output indices in IDX.
Default is the same as the number of features.
randfeatures(..., 'CrossNorm', CN) applies
independent normalization across the observations for every feature.
Cross-normalization ensures comparability among different features,
although it is not always necessary because the selected classifier
properties might already account for this. Options are
'none' (default)    Intensities are not cross-normalized.
'meanvar'           x_new = (x - mean(x))/std(x)
'softmax'           x_new = (1+exp((mean(x)-x)/std(x)))^-1
'minmax'            x_new = (x - min(x))/(max(x)-min(x))
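The three normalization formulas can be written out directly. The sketch below is not the internal randfeatures code: the toy matrix and the variable names are assumptions, and it normalizes each feature (row) across the observations (columns), following the layout of X described earlier. It relies on implicit expansion.

```matlab
X = [1 2 3 4; 10 20 30 40];            % toy data: 2 features (rows) by 4 samples (columns)

mu = mean(X, 2);                       % per-feature mean across observations
sigma = std(X, 0, 2);                  % per-feature standard deviation
xMin = min(X, [], 2);                  % per-feature minimum
xMax = max(X, [], 2);                  % per-feature maximum

meanvarX = (X - mu) ./ sigma;                    % 'meanvar': zero mean, unit variance
softmaxX = 1 ./ (1 + exp((mu - X) ./ sigma));    % 'softmax': squashed into (0, 1)
minmaxX  = (X - xMin) ./ (xMax - xMin);          % 'minmax': rescaled to [0, 1]
```

The two toy features differ only by a factor of 10, so every normalization maps them to the same values, which is the comparability that cross-normalization is meant to provide.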
randfeatures(..., 'Verbose', VerboseValue) turns
verbosity on or off. Default is
Find a reduced set of genes that is sufficient for classification of all the cancer types in the t-matrix NCI60 data set.
Load sample data.
I = randfeatures(X,GROUP,'SubsetSize',15,'Classifier','da');
Test features with a linear discriminant classifier.
C = classify(X(I(1:25),:)',X(I(1:25),:)',GROUP);
cp = classperf(GROUP,C);
cp.CorrectRate
ans = 1
Li, L., Umbach, D.M., Terry, P., and Taylor, J.A. (2003). Application of the GA/KNN method to SELDI proteomics data. PNAS. 20, 1638-1640.
Liu, H., Motoda, H. (1998). Feature Selection for Knowledge Discovery and Data Mining, Kluwer Academic Publishers.
Ross, D.T. et al. (2000). Systematic Variation in Gene Expression Patterns in Human Cancer Cell Lines. Nature Genetics. 24 (3), 227-235.