How to maximize MATLAB's GPU utility?

I've benchmarked my GPU against itself and the CPU for varying matrix sizes, and found the opposite of what most GPU literature suggests: the GPU's computing advantage diminishes with array size. Code, results, and specs are shown below. Noteworthy observations:
(1) GPU utility remains sub-10%, according to Task Manager
(2) RAM and CPU usage are about 50% and 20%, respectively, for a large (K > 9000) array
(3) A considerable drop in speed ratio is observed for around K > 8000
(4) Splitting the Xga matrix into four for a K > 8000 case (K = 9000) increases vectorized speed two-fold
(5) My GPU ranks far higher among GPUs than my CPU (#24 vs. #174); it thus seems an on-par CPU would outperform the GPU for larger arrays
(6) Last pic's GPU vs. CPU benchmark supports (5); GPU isn't as vastly superior as expected
What's the culprit: is it my code, MATLAB, or the hardware configuration that under-utilizes the GPU? How can I find out and resolve it?
m-files: testrun.zip (testrun compares performance for a single K; testrun0 for multiple)
%% CODE: centroid indexing in K-means algorithm
% size(X) = [16000, 3]
% size(c) = [K, 3]
% Xsg = single(X); csg = single(c);
% Xga = gpuArray(Xsg); cga = gpuArray(csg);
% Speed ratio = t2/t1, if t2 > t1 - else, t1/t2
%% TIMING
% timeit and gputimeit expect function handles
f1 = fasterFunction(...); % e.g. @() vectorized(Xga, cga, K, m)
f2 = slowerFunction(...); % e.g. @() forVectorized(X, c, m)
t1 = gputimeit(f1) % OR timeit(f1) for non-GPU arrays
t2 = timeit(f2) % OR gputimeit(f2) for GPU arrays
%% FUNCTIONS
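function out = vectorized(X, c, K, m)
    % X - permute(c,[3 2 1]) expands to an m-by-3-by-K array of differences;
    % summing the squares over dimension 2 gives squared distances (m-by-1-by-K).
    d2 = sum((X - permute(c, [3 2 1])).^2, 2);
    % Index of the nearest centroid for each row of X.
    [~, out] = min(reshape(d2, m, K), [], 2);
end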
function out = forVectorized(X, c, m)
    out = zeros(m,1);
    for j=1:m
        [~,out(j)] = min(sum(((X(j,:))'-c').^2));
    end
end
function out = forFor(X,c,K,m)
    out = zeros(m,1); idxtemp = zeros(K,1);
    for i=1:m
        for j=1:K
            idxtemp(j) = sum((X(i,:)-c(j,:)).^2,2);
        end
        [~, out(i)] = min(idxtemp);
    end
end
%% PLOTS
% GPU vectorized = vectorized(Xga, cga, K, m) for varying K, timed w/ gputimeit
% CPU vectorized = vectorized(Xsg, csg, K, m) for varying K, timed w/ timeit
% for-loop = forFor(Xsg, csg, K, m) for varying K, timed w/ timeit
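
A minimal sketch, not part of the original post: the vectorized function above materializes an m-by-3-by-K temporary, which in single precision at m = 16000 and K = 9000 is 16000 x 3 x 9000 x 4 bytes, about 1.7 GB. One common alternative, swapped in here for comparison, expands the squared distance as |x|^2 - 2*x*c' + |c|^2, so only m-by-K temporaries are needed and most of the work becomes a matrix multiply. The sizes below are hypothetical and scaled down. Whether the large temporary explains the slowdown near K > 8000 is an assumption, not something the post establishes.

```matlab
% Hypothetical, scaled-down sizes; the post uses m = 16000 and K up to ~9000.
rng(0);
m = 2000;
K = 300;
X = rand(m, 3, 'single');
c = rand(K, 3, 'single');

% Broadcast formulation from the post: builds an m-by-3-by-K temporary.
bytesBroadcast = m * 3 * K * 4;   % 4 bytes per single-precision element
[~, idxBroadcast] = min(reshape(sum((X - permute(c, [3 2 1])).^2, 2), m, K), [], 2);

% Expanded formulation: |x|^2 is constant along each row, so it does not
% change which centroid is nearest and is left out.
D = sum(c.^2, 2).' - 2 * (X * c.');   % m-by-K
[~, idxExpanded] = min(D, [], 2);

fprintf('Broadcast temporary: %.1f MB, expanded temporary: %.1f MB\n', ...
    bytesBroadcast / 1e6, m * K * 4 / 1e6);
```

The expanded form can lose some precision to cancellation in single precision, so the two index vectors may differ where two centroids are almost equally close.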

5 Comments

Jan on 20 Mar 2019
Edited: Jan on 20 Mar 2019
It is hard to follow your descriptions. "GPU utility remains sub-10%", "My GPU ranks far higher among GPUs than my CPU (#24 vs. #174)", "Last pic's GPU vs. CPU benchmark supports (5)" - this might be clear to you, but it requires a lot of educated guessing from the readers. "f1 = fasterFunction(...)"? Please post running code. It is not clear which code creates which diagram. Most of all, I do not understand the actual question: "maximize MATLAB's GPU utility?"
What do you do? Which problem do you want to solve? What is your question?
idxtemp(j) = sum((X(i,:)-c(j,:)).^2,2);
Row-wise processing wastes time compared to column-wise processing on the CPU. Transpose the inputs to avoid this.
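
For illustration, the transposition suggested here could look like the sketch below. MATLAB stores arrays in column-major order, so the elements of X(i,:) are spread m entries apart in memory, while a column of the transposed array is contiguous. The names Xt and ct and the sizes are hypothetical, and how much time this saves depends on the machine and array sizes.

```matlab
% Hypothetical, scaled-down sizes for a quick run.
rng(0);
m = 500;
K = 40;
X = rand(m, 3);
c = rand(K, 3);

% Transpose once so each point and each centroid is a contiguous column.
Xt = X.';   % 3-by-m
ct = c.';   % 3-by-K

out = zeros(m, 1);
idxtemp = zeros(K, 1);
for i = 1:m
    xi = Xt(:, i);                              % contiguous column read
    for j = 1:K
        idxtemp(j) = sum((xi - ct(:, j)).^2);   % squared distance to centroid j
    end
    [~, out(i)] = min(idxtemp);
end
```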
John Muradeli on 20 Mar 2019
@Jan -- Added examples to fasterFunction(...), changed function names a bit; as to the rest - the question is clear enough; those able to respond should understand.
Jan on 20 Mar 2019
@John: Thanks for clarifying the question a little bit.
"those able to respond should understand" - yes, of course, we agree here: they should. I mentioned that, at least for me, "maximize MATLAB's GPU utility" is too vague to be answered. Why not reduce the number of readers who do not understand the question?
You've spent some time to produce the nice diagrams. If you post the complete code instead of letting the readers guess it based on some rough comments, the members of the forum can run it on their machines and maybe confirm your observations.
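
A self-contained script of the kind requested here might look like the sketch below. It is not the code from testrun.zip: the sizes are hypothetical and scaled down, the anonymous functions dist2 and nearest are illustrative stand-ins for vectorized, and the GPU branch assumes Parallel Computing Toolbox is installed and is skipped otherwise.

```matlab
% Hypothetical, scaled-down sizes so the script finishes quickly;
% the post uses m = 16000 and K up to about 9000.
rng(0);
m  = 2000;
Ks = [50 100 200];
X  = rand(m, 3, 'single');

% Squared distances via implicit expansion, as in vectorized() from the post.
dist2 = @(X, c) reshape(sum((X - permute(c, [3 2 1])).^2, 2), size(X, 1), size(c, 1));
% When called with two outputs, the second one is the nearest-centroid index.
nearest = @(X, c) min(dist2(X, c), [], 2);

% Assumption: gpuDeviceCount errors without Parallel Computing Toolbox
% and returns 0 when no usable GPU is present.
try
    hasGPU = gpuDeviceCount > 0;
catch
    hasGPU = false;
end

tCPU  = zeros(size(Ks));
tGPU  = nan(size(Ks));
ratio = nan(size(Ks));
for n = 1:numel(Ks)
    c = rand(Ks(n), 3, 'single');
    tCPU(n) = timeit(@() nearest(X, c), 2);   % 2 = number of outputs requested
    if hasGPU
        Xga = gpuArray(X);
        cga = gpuArray(c);
        tGPU(n) = gputimeit(@() nearest(Xga, cga), 2);
        % Speed ratio as defined in the post: larger time over smaller time.
        ratio(n) = max(tCPU(n), tGPU(n)) / min(tCPU(n), tGPU(n));
    end
end

disp(table(Ks(:), tCPU(:), tGPU(:), ratio(:), ...
    'VariableNames', {'K', 'tCPU', 'tGPU', 'ratio'}));
```

Anyone with a different GPU could then raise m and Ks toward the values in the post and report the resulting table.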
"the GPU's computing advantage diminishes with array size" - doesn't the last diagram "Single precision matrix-matrix multiply" tell the opposite?
John Muradeli on 20 Mar 2019
@Jan -- Unsure how columns/rows affect CPU computing, but - transposed per your suggestion, and interchanged (i,:) with (:,i) (same w/ j) - results: https://puu.sh/D2Lex/ea9c4d6189.png -- not a significant difference for the range of K's tested
John Muradeli on 20 Mar 2019
Edited: John Muradeli on 20 Mar 2019
@Jan: Very well, I'll clarify below; as for the complete code - there's a tradeoff between conciseness and thoroughness - too much of the latter tends to throw off readers the fastest. This said, would an m-file suffice? The code isn't brief.
"Maximize GPU Utility" - see (1), (2); that is, it seems that the majority of GPU resources aren't being utilized - and that there may be a way to utilize them. For example, dividing the workload evenly across the entire GPU - rather than having a few units take all of it while most lie idle. I tried one method (see (4)); but strangely, for K <= 8000, computing time increases. Hence, I may be doing it wrong.
@"Doesn't the last diagram tell the opposite?": it's not so much GPU vs CPU as GPU vs GPU: performance slightly decreases after the peak (circled) - but not as much as in the plots above. I couldn't test for 1e9 because of an 'Out of Memory' error.
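
The splitting described in (4) could be sketched as follows. This is an illustration under assumptions, not the code from testrun.zip: it splits the rows of X into nChunks blocks (a hypothetical parameter), runs the same vectorized expression on each block, and writes the indices into one output, so each temporary is roughly 1/nChunks of the full m-by-3-by-K size.

```matlab
% Hypothetical, scaled-down sizes; the post uses m = 16000 and K = 9000.
rng(0);
m = 1600;
K = 200;
nChunks = 4;
X = rand(m, 3, 'single');
c = rand(K, 3, 'single');

cp = permute(c, [3 2 1]);                   % 1-by-3-by-K, shared by all blocks
edges = round(linspace(0, m, nChunks + 1)); % block boundaries over the rows of X

idx = zeros(m, 1);
for b = 1:nChunks
    rows = edges(b) + 1 : edges(b + 1);
    % Temporary is numel(rows)-by-3-by-K instead of m-by-3-by-K.
    D = reshape(sum((X(rows, :) - cp).^2, 2), numel(rows), K);
    [~, idx(rows)] = min(D, [], 2);
end
```

On a GPU, X and c would be gpuArray inputs and each block's result could be brought back with gather; that variant is untested here. Smaller blocks reduce peak memory but add per-call overhead, which might be one reason the split helps only for large K.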

Answers (0)

Release: R2018b

Asked: 20 Mar 2019

Edited: 20 Mar 2019