Trainnet with parallel-CPU mode giving incorrect results

Collin Rich on 25 May 2024

Direct link to this question:

https://la.mathworks.com/matlabcentral/answers/2122571-trainnet-with-parallel-cpu-mode-giving-incorrect-results

Commented: Collin Rich on 25 May 2024

I'm using trainnet to train a convolutional regression network to find the X-Y centroid of a subtle gradient region in an input image. The training data consist of paired 130x326 grayscale images and ground-truth output coordinates. Both the RMSE and the loss reach very small values (e.g., 10^-3) after a few minutes of training on a small dataset. The trained network gives the expected results when trained in single-CPU mode, but when trained in parallel-CPU mode, the predictions are significantly off.

To debug, I scaled back to a very simple network, disabled normalization, and trained with only two data points, fully expecting the network to memorize the training data perfectly. With single-CPU training, the trained network yields perfect predictions on the training data, as expected. With parallel-CPU training, the trained network does not predict the training data correctly. I added a more verbose loss function and confirmed that the reported losses (i.e., the values shown for the loss function during training) are consistent with the (Y,T) pairs during training, and that the T values are being read correctly from the training data.

It seems that the final network returned in parallel-CPU mode may not correctly capture the results of the training.

I'm running R2024a on a MacBook Pro (M2 Max), using Apple Accelerate BLAS. (The default BLAS persistently crashed in parallel mode with trainnet.)

Code snippet below...

layers = [
    imageInputLayer([130 326 1],"Name","imageinput","Normalization","none")
    convolution2dLayer([10 10],8,"DilationFactor",[2 2],"Name","conv_1")
    maxPooling2dLayer([2 2],"Name","maxpool_4")
    batchNormalizationLayer
    reluLayer("Name","relu_1")
    convolution2dLayer([2 2],16,"Name","conv_2")
    fullyConnectedLayer(2,"Name","fc")];
% Assumption: net was not defined in the posted snippet; it is built from layers here.
net = dlnetwork(layers);
opts = trainingOptions('sgdm', ...
    'InitialLearnRate',1e-7, ...
    'LearnRateSchedule','piecewise', ...
    'LearnRateDropPeriod',500, ...
    'LearnRateDropFactor',.25, ...
    'MaxEpochs',1000, ...
    'Verbose',false, ...
    'ExecutionEnvironment','parallel', ...
    'Shuffle','every-epoch', ...
    'Plots','training-progress', ...
    'OutputNetwork','last-iteration');
FOVCnet = trainnet(trainingData,net,@modelLoss,opts);

function loss = modelLoss(Y,T)
    % Verbose loss for debugging: the missing semicolons are deliberate,
    % so predictions (Y), targets (T), and the loss are printed on every call.
    Y
    T
    loss = mse(Y,T) % for dlarray inputs, mse returns the half mean squared error
end
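
One way to make the memorization check described above concrete is to compare the returned network's predictions with the targets directly, once after single-CPU training and once after parallel training. The block below is only a sketch under stated assumptions, not the original poster's code: `X`, `T`, and `tinyNet` are hypothetical stand-ins (random images, made-up coordinates, and an untrained toy network) so that it runs on its own. In practice, `FOVCnet` and the real training arrays would take their place.

```matlab
% Hypothetical stand-ins so the sketch runs on its own.
X = rand(130,326,1,2,"single");      % two 130x326 grayscale images (H x W x C x N)
T = single([100 50; 200 80]);        % made-up [x y] targets, one row per image

% Untrained toy network; substitute the trained FOVCnet here.
tinyNet = dlnetwork([
    imageInputLayer([130 326 1],"Normalization","none")
    convolution2dLayer([10 10],8)
    reluLayer
    fullyConnectedLayer(2)]);

% predict runs in inference mode, so any batchNormalizationLayer would use
% its stored statistics rather than mini-batch statistics.
% The fully connected output has format "CB", i.e. size 2 x N.
Y = predict(tinyNet,dlarray(X,"SSCB"));
Y = extractdata(Y)';                 % N x 2, same layout as T

perSampleError = sqrt(sum((Y - T).^2,2));   % Euclidean error per image
disp(table(T,Y,perSampleError))
```

Running the same comparison on the networks returned by the two execution environments would show whether the discrepancy is in the returned network itself rather than in the reported training loss.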

3 comments

Matt J on 25 May 2024

We can't run the code without trainingData. Please attach your two data point test case in a .mat file (as an arrayDatastore).

Collin Rich on 25 May 2024

Here are the two test images and coordinates. (Sorry for not putting in an arrayDatastore; I'm not sure how to put both in a single arrayDatastore. Still learning the ropes...)
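
On the datastore question, one common pattern is to wrap the images and the coordinates in separate `arrayDatastore` objects and join them with `combine`, so that each read returns an image together with its target. A minimal sketch, assuming the images are stacked along the fourth dimension and the coordinates are stored one row per image. The variable names, the random placeholder data, and the file name `twoPointTestCase.mat` are all hypothetical:

```matlab
% Placeholder data; replace with the two real test images and coordinates.
X = rand(130,326,1,2,"single");   % H x W x C x N image stack
T = [100 50; 200 80];             % made-up [x y] coordinates, one row per image

dsX = arrayDatastore(X,"IterationDimension",4);  % one image per read
dsT = arrayDatastore(T);                         % one row per read (default dimension is 1)
trainingData = combine(dsX,dsT);                 % each read returns {image, target}

sample = read(trainingData);      % 1x2 cell: {130x326 image, 1x2 target}
reset(trainingData);              % rewind so training starts from the first pair

% Save the raw arrays so that others can rebuild the datastore.
save("twoPointTestCase.mat","X","T");
```

Saving the underlying arrays rather than the datastore object keeps the .mat file simple, and the three datastore lines can be rerun by anyone who loads it.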
