What is the cause of CUDA_ERROR_LAUNCH_FAILED?

Brian Lee on 31 Oct 2018
Edited: cui,xingxing on 27 Apr 2024
I am working on multi-GPU training of a neural network and occasionally receive the error "CUDA_ERROR_LAUNCH_FAILED" (full error and code below). What might be the cause of this? I ran the code to completion successfully once, then changed some hyperparameters and received this message. Reverting the hyperparameter changes did not fix the problem. Thanks in advance.
The code I ran:
%{
Test out transfer learning with pretrained model
See example 'Transfer Learning Using AlexNet'
%}
% Load the images, labelling each one by its folder name
imds = imageDatastore('PetImages', ...
    'IncludeSubfolders',true, ...
    'LabelSource','foldernames');
% Hold out 30% of each class for validation
[imdsTrain,imdsValidation] = splitEachLabel(imds,0.7,'randomized');
net = alexnet;
inputSize = net.Layers(1).InputSize;
% Keep everything except the last three layers of the pretrained network
layersTransfer = net.Layers(1:end-3);
numClasses = numel(categories(imdsTrain.Labels));
layers = [
    layersTransfer
    fullyConnectedLayer(100,'WeightLearnRateFactor',20,'BiasLearnRateFactor',20)
    fullyConnectedLayer(100,'WeightLearnRateFactor',20,'BiasLearnRateFactor',20)
    fullyConnectedLayer(numClasses,'WeightLearnRateFactor',20,'BiasLearnRateFactor',20)
    softmaxLayer
    classificationLayer];
% Random flips and translations, applied to the training images only
pixelRange = [-30 30];
imageAugmenter = imageDataAugmenter( ...
    'RandXReflection',true, ...
    'RandXTranslation',pixelRange, ...
    'RandYTranslation',pixelRange);
% Resize images to the network input size
augimdsTrain = augmentedImageDatastore(inputSize(1:2),imdsTrain, ...
    'DataAugmentation',imageAugmenter);
augimdsValidation = augmentedImageDatastore(inputSize(1:2),imdsValidation);
options = trainingOptions('sgdm', ...
    'MiniBatchSize',1000, ...
    'MaxEpochs',6, ...
    'InitialLearnRate',1e-4, ...
    'Shuffle','every-epoch', ...
    'ValidationData',augimdsValidation, ...
    'ValidationFrequency',3, ...
    'Verbose',false, ...
    'Plots','training-progress', ...
    'ExecutionEnvironment','multi-gpu');
netTransfer = trainNetwork(augimdsTrain,layers,options);
And the full error text:
Error using trainNetwork (line 150)
An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_LAUNCH_FAILED
Error in transferLearning (line 50)
netTransfer = trainNetwork(augimdsTrain,layers,options);
Caused by:
    Error using nnet.internal.cnn.DistributedDispatcher/computeInParallel (line 190)
    Error detected on worker 1.
    Error using nnet.internal.cnn.TrainerGPUStrategy/computeAccumImage (line 23)
    An unexpected error occurred during CUDA execution. The CUDA error was:
    CUDA_ERROR_LAUNCH_FAILED
1 Comment
Joss Knight on 5 Nov 2018
This doesn't look great, sorry about that. Does the problem stop recurring if you reduce the MiniBatchSize?


Answers (2)

cui,xingxing on 20 Jul 2019
Edited: cui,xingxing on 27 Apr 2024
I am seeing the same error. Have you resolved it? @Joss Knight, @Brian Lee, thanks.
1 Comment
Joss Knight on 22 Jul 2019
This error occurs in all sorts of circumstances, usually because your card does not have enough memory. Try posting a new question, provide reproduction code, and give us the output of gpuDevice.
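For anyone following the request above, here is a minimal sketch of how the per-GPU memory figures could be collected before posting. It assumes Parallel Computing Toolbox is installed. The `info` struct and its field names are only illustrative and are not part of any MathWorks API. Be aware that `gpuDevice(idx)` selects and resets the device, which clears any gpuArray data on it, so this is best run in a fresh session rather than in the middle of training.

```matlab
% Collect name and memory figures for every GPU MATLAB can see.
% Caution: gpuDevice(idx) resets device idx and clears its memory.
n = gpuDeviceCount;
info = struct('Name', {}, 'AvailableGB', {}, 'TotalGB', {});  % illustrative layout
for idx = 1:n
    d = gpuDevice(idx);
    info(idx).Name = d.Name;
    info(idx).AvailableGB = d.AvailableMemory / 1e9;
    info(idx).TotalGB = d.TotalMemory / 1e9;
    fprintf('GPU %d: %s, %.2f GB available of %.2f GB\n', ...
        idx, info(idx).Name, info(idx).AvailableGB, info(idx).TotalGB);
end
```

Calling `gpuDevice` with no arguments prints the full property list for the currently selected device, which is the output requested in the comment.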



Brian Lee on 22 Jul 2019
Sorry for the lack of follow-up, but the issue did seem to be a lack of memory. I haven't seen the issue when using much smaller mini-batch sizes.
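As a rough illustration of that fix, the sketch below reuses the training options from the question with a smaller `MiniBatchSize` and prints how much memory the raw input batch alone would occupy. The value 64 is an assumption chosen for illustration, since the thread does not say which size finally worked. The validation and plot settings are left out so the snippet runs on its own.

```matlab
% AlexNet expects 227-by-227-by-3 inputs; single precision is 4 bytes per value.
inputSize = [227 227 3];
bytesPerImage = prod(inputSize) * 4;

% Illustrative value only: the thread does not state the size that worked.
miniBatchSize = 64;

% Memory for the raw input batch alone. Activations and gradients need
% considerably more, so treat this as a lower bound.
inputBatchMB = miniBatchSize * bytesPerImage / 1e6;
fprintf('Input batch alone: %.1f MB for %d images\n', inputBatchMB, miniBatchSize);

% Options from the question with only MiniBatchSize changed
% (validation and plot settings omitted to keep this self-contained).
options = trainingOptions('sgdm', ...
    'MiniBatchSize',miniBatchSize, ...
    'MaxEpochs',6, ...
    'InitialLearnRate',1e-4, ...
    'Shuffle','every-epoch', ...
    'Verbose',false, ...
    'ExecutionEnvironment','multi-gpu');
```

If the error still appears, halving `miniBatchSize` again is a reasonable next step. Add `'ValidationData'` and `'ValidationFrequency'` back before passing `options` to `trainNetwork`.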
1 Comment
Jacques on 16 Dec 2019
Edited: Jacques on 16 Dec 2019
I had the same problem.
You have two choices. The first is to train on the CPU. The second is to stay on the GPU but work with smaller matrices (a computation-memory tradeoff).
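The first choice could look like the sketch below, which only switches `'ExecutionEnvironment'` to `'cpu'`. The other settings are copied from the question, except that the mini-batch size of 128 is an assumed, illustrative value and the validation and plot settings are omitted. The name `optionsCPU` is hypothetical.

```matlab
% Train on the CPU instead of the GPU: slower, but bounded by system RAM
% rather than by GPU memory.
optionsCPU = trainingOptions('sgdm', ...
    'MiniBatchSize',128, ...   % illustrative value, not from the thread
    'MaxEpochs',6, ...
    'InitialLearnRate',1e-4, ...
    'Shuffle','every-epoch', ...
    'Verbose',false, ...
    'ExecutionEnvironment','cpu');
```

With the datastore and layers from the question in the workspace, training would then be started with `trainNetwork(augimdsTrain,layers,optionsCPU)`.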

