Evaluate clustering solutions
eva = evalclusters(x,clust,criterion,Name,Value) creates a clustering evaluation object using additional options specified by one or more name-value pair arguments.
Evaluate Clustering Solution Using Calinski-Harabasz Criterion
Evaluate the optimal number of clusters using the Calinski-Harabasz clustering evaluation criterion.
Load the sample data.
load fisheriris
The data contains length and width measurements from the sepals and petals of three species of iris flowers.
Evaluate the optimal number of clusters using the Calinski-Harabasz criterion. Cluster the data using kmeans.
rng('default') % For reproducibility
eva = evalclusters(meas,'kmeans','CalinskiHarabasz','KList',1:6)
eva =
  CalinskiHarabaszEvaluation with properties:
    NumObservations: 150
         InspectedK: [1 2 3 4 5 6]
    CriterionValues: [NaN 513.9245 561.6278 530.4871 456.1279 469.5068]
           OptimalK: 3
The OptimalK value indicates that, based on the Calinski-Harabasz criterion, the optimal number of clusters is three.
Evaluate a Matrix of Clustering Solutions
Use an input matrix of proposed clustering solutions to evaluate the optimal number of clusters.
Load the sample data.
load fisheriris;
The data contains length and width measurements from the sepals and petals of three species of iris flowers.
Use kmeans to create an input matrix of proposed clustering solutions for the flower measurements in meas, using 1, 2, 3, 4, 5, and 6 clusters.
clust = zeros(size(meas,1),6);
for i=1:6
    clust(:,i) = kmeans(meas,i,'emptyaction','singleton',...
        'replicate',5);
end
Each row of clust corresponds to one observation (one flower) in meas. Each of the six columns corresponds to a clustering solution containing 1 to 6 clusters.
Evaluate the optimal number of clusters using the Calinski-Harabasz criterion.
eva = evalclusters(meas,clust,'CalinskiHarabasz')
eva =
  CalinskiHarabaszEvaluation with properties:
    NumObservations: 150
         InspectedK: [1 2 3 4 5 6]
    CriterionValues: [NaN 513.9245 561.6278 530.4871 456.1279 469.5068]
           OptimalK: 3
The OptimalK value indicates that, based on the Calinski-Harabasz criterion, the optimal number of clusters is three.
Specify Clustering Algorithm with a Function Handle
Use a function handle to specify the clustering algorithm, then evaluate the optimal number of clusters.
Load the sample data.
load fisheriris;
The data contains length and width measurements from the sepals and petals of three species of iris flowers.
Use a function handle to specify the clustering algorithm.
myfunc = @(X,K)(kmeans(X,K,Emptyaction="singleton",Replicate=5));
Evaluate the optimal number of clusters for the measurement data in meas using the Calinski-Harabasz criterion.
eva = evalclusters(meas,myfunc,'CalinskiHarabasz',KList=1:6)
eva =
  CalinskiHarabaszEvaluation with properties:
    NumObservations: 150
         InspectedK: [1 2 3 4 5 6]
    CriterionValues: [NaN 513.9245 561.6278 530.4871 456.1279 469.5068]
           OptimalK: 3
The OptimalK value indicates that, based on the Calinski-Harabasz criterion, the optimal number of clusters is three.
Input Arguments
x — Input data
Input data, specified as an N-by-P matrix. N is the number of observations, and P is the number of variables.
Data Types: single | double
clust — Clustering algorithm or solutions
'kmeans' | 'linkage' | 'gmdistribution' | matrix of clustering solutions | function handle
Clustering algorithm, specified as one of the following.
'kmeans' | Cluster the data in x using the kmeans clustering algorithm, with 'EmptyAction' set to 'singleton' and 'Replicates' set to 5. |
'linkage' | Cluster the data in x using the clusterdata agglomerative clustering algorithm, with 'Linkage' set to 'ward'. |
'gmdistribution' | Cluster the data in x using the gmdistribution Gaussian mixture distribution algorithm, with 'SharedCov' set to true and 'Replicates' set to 5. |
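As a rough illustration of the three built-in algorithm names, the following sketch runs the same criterion over each of them on the fisheriris data. The variable names algorithms and optimalK and the choice of KList are assumptions made for this example only.

```matlab
load fisheriris

% Built-in algorithm names accepted for clust
algorithms = {'kmeans','linkage','gmdistribution'};
optimalK = zeros(1,numel(algorithms));

for i = 1:numel(algorithms)
    rng('default') % For reproducibility
    eva = evalclusters(meas,algorithms{i},'CalinskiHarabasz','KList',2:4);
    optimalK(i) = eva.OptimalK;
    fprintf('%-15s OptimalK = %d\n',algorithms{i},eva.OptimalK);
end
```

The algorithms need not agree on the optimal number of clusters, because each one partitions the data differently.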
If criterion is 'CalinskiHarabasz', 'DaviesBouldin', or 'silhouette', you can specify a clustering algorithm using a function handle. The function must be of the form C = clustfun(DATA,K), where DATA is the data to be clustered, and K is the number of clusters. The output of clustfun must be one of the following:
- A vector of integers representing the cluster index for each observation in DATA. There must be K unique values in this vector.
- A numeric n-by-K matrix of scores for n observations and K classes. In this case, the cluster index for each observation is determined by taking the largest score value in each row.
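The score-matrix form of clustfun can be sketched as follows. This is a minimal example under stated assumptions: scorefun is a hypothetical helper that fits a Gaussian mixture with fitgmdist and returns the posterior probabilities as scores, and the regularization and replicate settings are arbitrary choices, not defaults used by evalclusters.

```matlab
load fisheriris
rng('default') % For reproducibility

% Hypothetical helper: fit a K-component Gaussian mixture and return the
% n-by-K matrix of posterior probabilities, used here as scores.
scorefun = @(X,K) posterior(fitgmdist(X,K,'SharedCovariance',true, ...
    'RegularizationValue',0.01,'Replicates',3),X);

% Each observation is assigned to the column with the largest score.
eva = evalclusters(meas,scorefun,'CalinskiHarabasz','KList',2:4);
disp(eva.OptimalK)
```

If a mixture component never has the largest score for any observation, the resulting solution has fewer than K clusters, so a helper of this kind may need tuning for other data sets.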
If criterion is 'CalinskiHarabasz', 'DaviesBouldin', or 'silhouette', you can also specify clust as an n-by-K matrix containing the proposed clustering solutions. n is the number of observations in the sample data, and K is the number of proposed clustering solutions. Column j contains the cluster indices for each of the n points in the jth clustering solution.
Data Types: single | double | char | string | function_handle
criterion — Clustering evaluation criterion
'CalinskiHarabasz' | 'DaviesBouldin' | 'gap' | 'silhouette'
Clustering evaluation criterion, specified as one of the following.
'CalinskiHarabasz' | Create a CalinskiHarabaszEvaluation cluster evaluation object containing Calinski-Harabasz index values. For more information, see Calinski-Harabasz Criterion. |
'DaviesBouldin' | Create a DaviesBouldinEvaluation cluster evaluation object containing Davies-Bouldin index values. For more information, see Davies-Bouldin Criterion. |
'gap' | Create a GapEvaluation cluster evaluation object containing gap criterion values. For more information, see Gap Value. |
'silhouette' | Create a SilhouetteEvaluation cluster evaluation object containing silhouette values. For more information, see Silhouette Value and Criterion. |
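To see how the choice of criterion changes the returned object, a short sketch such as the following can help. The loop, the variable name criteria, and the KList range are assumptions for illustration; the gap criterion is left out only to keep the run time short.

```matlab
load fisheriris

criteria = {'CalinskiHarabasz','DaviesBouldin','silhouette'};

for i = 1:numel(criteria)
    rng('default') % For reproducibility
    eva = evalclusters(meas,'kmeans',criteria{i},'KList',2:6);
    % class(eva) shows which evaluation object each criterion creates
    fprintf('%s: %s, OptimalK = %d\n',criteria{i},class(eva),eva.OptimalK);
end
```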
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: evalclusters(x,"kmeans","gap",KList=1:5,Distance="cityblock") specifies to test 1, 2, 3, 4, and 5 clusters using the city block distance metric.
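The two name-value syntaxes can be compared side by side. This sketch assumes R2021a or later (needed for the Name=Value form) and uses the fisheriris data; the variable names evaOld and evaNew are chosen for this example.

```matlab
load fisheriris

% Comma-separated pairs with quoted names (works in all releases)
rng('default')
evaOld = evalclusters(meas,'kmeans','silhouette', ...
    'KList',2:5,'Distance','cityblock');

% Name=Value syntax (R2021a and later)
rng('default')
evaNew = evalclusters(meas,"kmeans","silhouette", ...
    KList=2:5,Distance="cityblock");

% With the same random seed, both calls should inspect the same K values
% and select the same optimal K.
disp([evaOld.OptimalK evaNew.OptimalK])
```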
KList — List of number of clusters to evaluate
vector of positive integer values
List of number of clusters to evaluate, specified as a vector of positive integer values. You must specify KList when clust is a clustering algorithm name or a function handle. When criterion is 'gap', clust must be a character vector, a string scalar, or a function handle, and you must specify KList.
Example: KList=1:6
Data Types: single | double
Distance — Distance metric
'sqEuclidean' (default) | 'Euclidean' | 'cityblock' | vector | function | ...
Distance metric used for computing the criterion values, specified as the comma-separated pair consisting of 'Distance' and one of the following.
'sqEuclidean' | Squared Euclidean distance |
'Euclidean' | Euclidean distance. This option is not valid for the kmeans clustering algorithm. |
'cityblock' | Sum of absolute differences |
'cosine' | One minus the cosine of the included angle between points (treated as vectors) |
'correlation' | One minus the sample correlation between points (treated as sequences of values) |
'Hamming' | Percentage of coordinates that differ. This option is only valid for the Silhouette criterion. |
'Jaccard' | Percentage of nonzero coordinates that differ. This option is only valid for the Silhouette criterion. |
For detailed information about each distance metric, see pdist.
You can also specify a function for the distance metric using a function handle. The distance function must be of the form d2 = distfun(XI,XJ), where XI is a 1-by-n vector corresponding to a single row of the input matrix X, and XJ is an m2-by-n matrix corresponding to multiple rows of X. distfun must return an m2-by-1 vector of distances d2, whose kth element is the distance between XI and XJ(k,:).
Distance only accepts a function handle if the clustering algorithm clust accepts a function handle as the distance metric. For example, the kmeans clustering algorithm does not accept a function handle as the distance metric. Therefore, if you use the kmeans algorithm and then specify a function handle for Distance, the software errors.
- If criterion is 'silhouette', you can also specify Distance as the output vector created by the function pdist.
- When clust is 'kmeans' or 'gmdistribution', evalclusters uses the distance metric specified for Distance to cluster the data.
- If clust is 'linkage', and Distance is either 'sqEuclidean' or 'Euclidean', then the clustering algorithm uses the Euclidean distance and Ward linkage.
- If clust is 'linkage' and Distance is any other metric, then the clustering algorithm uses the specified distance metric and average linkage.
- In all other cases, the distance metric specified for Distance must match the distance metric used in the clustering algorithm to obtain meaningful results.
Example: 'Distance','Euclidean'
Data Types: single | double | char | string | function_handle
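The distance function form d2 = distfun(XI,XJ) and the requirement that the evaluation metric match the clustering metric can be sketched together. This example is an assumption-laden illustration: it builds a matrix of kmeans solutions using the city block distance, then evaluates them with a hand-written city block function handle. The names Ks, clust, and distfun are chosen for this example, and the subtraction relies on implicit expansion (R2016b or later).

```matlab
load fisheriris
rng('default') % For reproducibility

% Cluster with the city block metric so that it matches the metric
% used later for evaluation.
Ks = 2:5;
clust = zeros(size(meas,1),numel(Ks));
for j = 1:numel(Ks)
    clust(:,j) = kmeans(meas,Ks(j),'Distance','cityblock','Replicates',5);
end

% City block distance between the row vector XI (1-by-n) and every row
% of XJ (m2-by-n). Returns an m2-by-1 vector.
distfun = @(XI,XJ) sum(abs(XJ - XI),2);

% clust is a matrix of solutions, so no clustering algorithm has to
% accept the function handle.
eva = evalclusters(meas,clust,'silhouette','Distance',distfun);
disp(eva.OptimalK)
```

Passing the solutions as a matrix avoids the error that occurs when a function handle is given for Distance together with the 'kmeans' algorithm name.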
ClusterPriors — Prior probabilities for each cluster
'empirical' (default) | 'equal'
Prior probabilities for each cluster, specified as the comma-separated pair consisting of 'ClusterPriors' and one of the following.
'empirical' | Compute the overall silhouette value for the clustering solution by averaging the silhouette values for all points. Each cluster contributes to the overall silhouette value proportionally to its size. |
'equal' | Compute the overall silhouette value for the clustering solution by averaging the silhouette values for all points within each cluster, and then averaging those values across all clusters. Each cluster contributes equally to the overall silhouette value, regardless of its size. |
Example: 'ClusterPriors','empirical'
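A minimal sketch of the two ClusterPriors settings, assuming the fisheriris data and the silhouette criterion; the variable names evaEmp and evaEq are chosen for this example.

```matlab
load fisheriris

% Average silhouette values over all points (clusters weighted by size)
rng('default')
evaEmp = evalclusters(meas,'kmeans','silhouette','KList',2:5, ...
    'ClusterPriors','empirical');

% Average within each cluster first, then across clusters
rng('default')
evaEq = evalclusters(meas,'kmeans','silhouette','KList',2:5, ...
    'ClusterPriors','equal');

% Row 1: empirical priors, row 2: equal priors
disp([evaEmp.CriterionValues; evaEq.CriterionValues])
```

The two rows differ most when the clusters in a solution have very different sizes.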
B — Number of reference data sets
100 (default) | positive integer value
Number of reference data sets generated from the reference distribution ReferenceDistribution, specified as the comma-separated pair consisting of 'B' and a positive integer value.
Example: 'B',150
Data Types: single | double
ReferenceDistribution — Reference data generation method
'PCA' (default) | 'uniform'
Reference data generation method, specified as the comma-separated pair consisting of 'ReferenceDistribution' and one of the following.
'PCA' | Generate reference data from a uniform distribution over a box aligned with the principal components of the data matrix x. |
'uniform' | Generate reference data uniformly over the range of each feature in the data matrix x. |
Example: 'ReferenceDistribution','uniform'
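The gap-specific options B and ReferenceDistribution can be combined in one call. The following is a sketch on the fisheriris data; the value 50 for B is an arbitrary choice to keep the run short, not a recommendation.

```matlab
load fisheriris
rng('default') % For reproducibility

% Gap criterion with 50 reference data sets drawn uniformly over the
% range of each feature
eva = evalclusters(meas,'kmeans','gap','KList',1:5, ...
    'B',50,'ReferenceDistribution','uniform');

disp(eva.OptimalK)
```

A smaller B runs faster, at the cost of noisier gap estimates.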
SearchMethod — Method for selecting optimal number of clusters
'globalMaxSE' (default) | 'firstMaxSE'
Method for selecting the optimal number of clusters, specified as the comma-separated pair consisting of 'SearchMethod' and one of the following.
'globalMaxSE' | Evaluate each proposed number of clusters in KList and select the smallest number of clusters satisfying Gap(K) ≥ GAPMAX - SE(GAPMAX), where K is the number of clusters, Gap(K) is the gap value for the clustering solution with K clusters, GAPMAX is the largest gap value, and SE(GAPMAX) is the standard error corresponding to the largest gap value. |
'firstMaxSE' | Evaluate each proposed number of clusters in KList and select the smallest number of clusters satisfying Gap(K) ≥ Gap(K + 1) - SE(K + 1), where K is the number of clusters, Gap(K) is the gap value for the clustering solution with K clusters, and SE(K + 1) is the standard error of the clustering solution with K + 1 clusters. |
Example: 'SearchMethod','globalMaxSE'
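The two selection rules can be worked through by hand on a small set of numbers. The gap values and standard errors below are hypothetical, made up for illustration only, and the code is a sketch of the rules as described above rather than the internal implementation. It does not handle the case where no K satisfies a rule.

```matlab
% Hypothetical gap values and standard errors for K = 1,...,5
K   = 1:5;
gap = [0.10 0.40 0.42 0.61 0.65];
se  = [0.02 0.03 0.04 0.03 0.05];

% 'globalMaxSE': smallest K with Gap(K) >= GAPMAX - SE(GAPMAX)
[gapMax,iMax] = max(gap);
kGlobal = K(find(gap >= gapMax - se(iMax),1,'first'));

% 'firstMaxSE': smallest K with Gap(K) >= Gap(K+1) - SE(K+1)
ok = gap(1:end-1) >= gap(2:end) - se(2:end);
kFirst = K(find(ok,1,'first'));

fprintf('globalMaxSE selects K = %d\n',kGlobal);
fprintf('firstMaxSE selects K = %d\n',kFirst);
```

With these numbers, the global rule waits until the gap value comes within one standard error of the overall maximum (K = 4), while the first-maximum rule stops at the first K whose successor offers no clear improvement (K = 2).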
Output Arguments
eva — Clustering evaluation data
clustering evaluation object
Clustering evaluation data, returned as a clustering evaluation object.
Version History
Introduced in R2013b