Why does the speed of histcounts vary so much?
5 visualizaciones (últimos 30 días)
Mostrar comentarios más antiguos
I have a vector with integer or floating point values and want to count the number of occurences.
histcounts() uses the lase element of the input Edge as upper limit for the last bin, so I have to append a number larger than the greatest element to count correctly. This slows down the processing by a factor of > 2, when the values are integers:
% Integer values:
x = round((1:1e6) / 1); % Equivalent for "/ 10" also
x = x(randperm(numel(x)));
ux = unique(x);
ux2 = [ux, Inf];
ux3 = [ux, 1e8]; % Any huge number
ux4 = [ux, ux(end) + 1];
timeit(@() histcounts(x, ux)) % 0.045
timeit(@() histcounts(x, ux2)) % 0.171 % Much slower!!!
timeit(@() histcounts(x, ux3)) % 0.199 % Slower than with Inf?
timeit(@() histcounts(x, ux4)) % 0.0787 % Fast again
% BinMethod=Integers is fast, but the number of bins is limited:
timeit(@() histcounts(x, 'BinMethod', 'Integers')) % 0.0059 Fast!
numel(histcounts(x, 'BinMethod', 'Integers')) % But just 10001 instead of 1e6
% According to the documentation, this should be 65536 ?!?
% Floating point values:
y = rand(1, 1e6);
uy = unique(y);
uy2 = [uy, Inf];
uy3 = [uy, uy(end) + 1];
timeit(@() histcounts(y, uy)) % 0.11148
timeit(@() histcounts(y, uy2)) % 0.2066
timeit(@() histcounts(y, uy3)) % 0.2100
% Local R2018b: similar behavior for integers, floating point values processed faster:
% 0.073 % Integer, ux
% 0.199 % [ux, Inf]
% 0.189 % [ux, 1e8]
% 0.061 % [ux, ux(end + 1)]
% 0.162 % Floating point, uy
% 0.167 % [uy, Inf]
% 0.315 % [uy, uy(end) + 1] % half speed!!!
I'd expect nearly the same speed, because it is almost the same amount of work.
By the way, a simple counting function is fast also:
timeit(@() simpleCount(x)) % 0.0602
timeit(@() simpleCount(y)) % 0.0628
function N = simpleCount(x)
S = sort(x);
q = [true, diff(S) ~= 0];
N = diff([find(q), numel(x) + 1]);
end
histcounts() with counting the last element to the previous bin is not useful, but attaching a final edge slows down the processing massively. I've sent an enhancement request to TWM already.
5 comentarios
Adam Danz
el 22 de Ag. de 2021
Editada: Adam Danz
el 22 de Ag. de 2021
Looks like groupcounts does not have this limitation
N = groupcounts((1:70000)');
fprintf('Numel: %d Last count: %d\n', numel(N), N(end))
And its timing for counting integers is similar to the [ux, ux(end) + 1] method but slower.
x = round((1:1e6) / 1);
x = x(randperm(numel(x)));
ux = [unique(x), inf];
ux2 = [unique(x), max(x)+1];
timeit(@() histcounts(x, ux))
timeit(@() histcounts(x, ux2))
timeit(@() groupcounts(x'))
isequal(histcounts(x,ux), histcounts(x, ux2), groupcounts(x')')
Respuestas (1)
Bruno Luong
el 22 de Ag. de 2021
It's just an assumption but histcounts probably use binary search on finite edge values. The large last edge probably penalizes the speed. Not sure how Inf is processed with bonary search (possibly an extra comparison check?)
2 comentarios
Bruno Luong
el 22 de Ag. de 2021
"The BinMethod 'Integers' is useful for the case, when it can be guaranteed, that 2^16 bins are enough"
Just take your hand over and use accumarray instead.
Ver también
Categorías
Más información sobre Logical en Help Center y File Exchange.
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!