Why are negative gap statistic values, returned by evalclusters, allowed as a solution?
Dear fellow Matlab users and developers,
While looking for the optimal number of clusters in a data set, I wondered why the evalclusters function of the Statistics Toolbox sometimes accepts a negative Gap value as the solution.
Example:
data = [[25,34,22,27,33,33,31,22,35,34,67,54,57,43,50,57,59,52,65,47,49,48,35,33,44,45,38,43,51,46];...
[79,51,53,78,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,25,20,14,12,20,5,29,27,8,7]]';
% get the optimal number of clusters
eva1 = evalclusters(data,'kmeans','Gap','KList',1:5,'SearchMethod','firstMaxSE');
figure()
plot(eva1)

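The original implementation of the evalclusters function (more precisely, of the GapEvaluation class) selects the optimal number of clusters as the smallest k that satisfies the criterion
$\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - \mathrm{SE}(k+1)$. The Gap value is defined as
$\mathrm{Gap}_n(k) = E^*_n\{\log(W_k)\} - \log(W_k)$.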
This is in accordance with the original paper:
Tibshirani, R., Walther, G. and Hastie, T., 2001. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B 63, pp. 411–423.
For the example above the criterion is satisfied at k = 1: the Gap value for two clusters minus its standard error is below the Gap value for one cluster.
However, if the Gap value is negative, the $\log(W_k)$ curve is still above $E^*_n\{\log(W_k)\}$.
The original paper further states: "Our estimate of the optimal number of clusters is then the value of k for which $\log(W_k)$ falls the farthest below this reference curve."
In my interpretation, negative Gap values should not be allowed as a solution, since the condition of falling below the reference curve is not fulfilled.
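To see which candidate k have a negative Gap value, the quantities stored in the evaluation object can be listed side by side. This is only a sketch: it assumes the GapEvaluation properties InspectedK, LogW, ExpectedLogW and SE, and that the Gap value equals ExpectedLogW minus LogW. The seed is an arbitrary choice of mine; the reference data are random, so the numbers change between runs without it.

```matlab
% Data from the example above
data = [[25,34,22,27,33,33,31,22,35,34,67,54,57,43,50,57,59,52,65,47,49,48,35,33,44,45,38,43,51,46];...
        [79,51,53,78,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,25,20,14,12,20,5,29,27,8,7]]';

rng(1);  % arbitrary seed (assumption), only so that reruns give the same numbers
eva1 = evalclusters(data,'kmeans','Gap','KList',1:5,'SearchMethod','firstMaxSE');

% Assumed relation: Gap(k) = ExpectedLogW(k) - LogW(k),
% so Gap(k) < 0 is the same as LogW(k) > ExpectedLogW(k)
gap = eva1.ExpectedLogW(:) - eva1.LogW(:);

% One row per inspected k; the last column flags negative Gap values
T = table(eva1.InspectedK(:), eva1.LogW(:), eva1.ExpectedLogW(:), gap, ...
          eva1.SE(:), gap < 0, ...
          'VariableNames', {'K','LogW','ExpectedLogW','Gap','SE','GapIsNegative'});
disp(T)
```

Rows with GapIsNegative set to true are the ones where the observed curve lies above the reference curve.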
The current implementation in GapEvaluation.m reads as follows:
if ( isempty(nextValid) ...
|| this.CriterionValues(j) >=...
(this.CriterionValues(j+nextValid)-this.SE(j+nextValid)))
this.OptimalK = NC;
this.OptimalY = IDX;
end
A further condition, this.CriterionValues(j)>0, should be added to the if statement. This would ensure that the actual $\log(W_k)$ is below $E^*_n\{\log(W_k)\}$.
if (( isempty(nextValid) ...
|| this.CriterionValues(j) >=...
(this.CriterionValues(j+nextValid)-this.SE(j+nextValid)))) && this.CriterionValues(j)>0
this.OptimalK = NC;
this.OptimalY = IDX;
end
With this change, the optimal number of clusters for the example would be three.
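The two rules can also be compared outside the toolbox, without editing GapEvaluation.m. The following is a minimal sketch, not the toolbox code: it ignores the NaN handling done via nextValid, and the gap and se vectors are made-up illustrative numbers, not output of evalclusters. With real results one would pass eva1.CriterionValues and eva1.SE instead.

```matlab
% Made-up illustrative values (NOT results from evalclusters)
K   = 1:5;
gap = [-0.10 -0.20 0.15 0.10 0.05];
se  = [ 0.05  0.05 0.05 0.05 0.05];

% ok(j) is true when Gap(j) >= Gap(j+1) - SE(j+1);
% the last K always passes, mirroring the isempty(nextValid) branch
ok = [gap(1:end-1) >= gap(2:end) - se(2:end), true];

% Rule as currently implemented: first K that satisfies the criterion
kOrig = K(find(ok, 1));

% Proposed rule: additionally require a positive Gap value
kMod = K(find(ok & (gap > 0), 1));   % empty if no K qualifies

fprintf('current rule: K = %d\n', kOrig);
fprintf('with Gap > 0: K = %d\n', kMod);
```

For these numbers the current rule picks K = 1, where the Gap value is negative, while the extra condition moves the choice to K = 3. If all Gap values are negative, kMod is empty, which is exactly the case that still needs a decision.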

Would you agree or am I missing something important?
And how should the result be treated if all Gap values are negative?
Cheers