Question about RandStream and cross-validation in a parfor loop

Martin Randau on 2 Dec 2023
Commented: Martin Randau on 7 Dec 2023
Dear MathWorks
I use the following code to run 100 seeded repetitions of a nested cross-validation (10 outer folds, 5 inner folds) in a parfor loop.
First I start the cluster and create a RandStream:
parallel.defaultClusterProfile('dcc R2021b');
clust=parcluster('dcc R2021b');
clust.AdditionalProperties.ProcsPerNode = 24;
clust.AdditionalProperties.MemUsage = '2GB';
clust.AdditionalProperties.WallTime = '60:00';
clust.AdditionalProperties.QueueName = 'compute';
numw=100;
parpool(clust, numw);
sc = parallel.pool.Constant(RandStream('Threefry'))
Then, in each iteration, I create one CV partition for the outer loop and one in each inner loop, the latter to be used for hyperparameter optimization:
parfor seeds = 1:100
    stream = sc.Value;
    stream.Substream = seeds;
    ...
    CV = cvpartition(y, 'Kfold', 10);
    ...
    % generate y_train from y
    ...
    for k = 1:10
        cv_in = cvpartition(y_train,'Kfold',5); % used for hyperparameter optimization
        % and e.g.
        mdl_LL_hp_opts = struct('AcquisitionFunctionName','expected-improvement-plus',...
            'Optimizer','bayesopt','CVPartition',cv_in,'MaxObjectiveEvaluations',100,...
            'UseParallel',0,'ShowPlots',0,'Verbose',0,'Repartition',0);
        [mdl_all, mdl_LL_all_fitinfo, mdl_LL_all_HyperparameterOptimizationResults] = fitclinear(X_sel_train, y_train, ...
            'learner', 'logistic', 'Regularization', 'ridge', ...
            'OptimizeHyperparameters',{'Lambda'},'HyperparameterOptimizationOptions',mdl_LL_hp_opts, ...
            'CategoricalPredictors', "gender", 'PredictorNames', predVars_all);
The code runs, but the CV partitions do not seem to be very random, as the Balanced Accuracy results below suggest (I only show the first ten iterations). The columns are the outer folds and the rows are iterations (seeds). Never mind the similar values; in this case there were too few observations of the minority class. The problem is that every iteration produces a NaN in the 9th outer fold, because that fold lacks the minority class. If cvpartition used a different random seed in each iteration, I would expect the NaN column to be located randomly.
>> ERPres(1:10,:)
ans =
0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 NaN 0.5000
0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 NaN 0.5000
0.5000 0.5000 0.5000 0.3333 0.5000 0.5000 0.5000 0.5000 NaN 0.5000
0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 NaN 0.5000
0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 NaN 0.5000
0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 NaN 0.5000
0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 NaN 0.5000
0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 NaN 0.5000
0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 NaN 0.5000
0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 NaN 0.5000
So, is something wrong with the RandStream approach?
BW
Martin

Answers (1)

Drew on 5 Dec 2023
Edited: Drew on 6 Dec 2023
Is this code running in R2021b or R2023b? Your code references 'dcc R2021b', but the MATLAB Answers sidebar says R2023b.
(1) Since cvpartition uses the global stream of random numbers, if you are running in R2021b, remember to set the global stream to the Threefry RandStream that you have created. That is, in your parfor loop, after
stream = sc.Value;
stream.Substream = seeds;
add
RandStream.setGlobalStream(stream);
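Putting step (1) together, a minimal sketch of the loop might look like the following. The labels `y` are synthetic placeholders, not the poster's data, and a plain `for` loop is used so the sketch runs without a pool; inside `parfor`, `stream` would come from `sc.Value` as in the question. The variable names `testFold1` and `nDistinct` are made up for illustration.

```matlab
% Synthetic, moderately imbalanced labels (placeholder for the real y).
y = [zeros(80,1); ones(20,1)];
nSeeds = 4;
testFold1 = false(numel(y), nSeeds);

prevStream = RandStream.getGlobalStream;   % remember the current global stream
for seeds = 1:nSeeds
    % Inside parfor this line would be: stream = sc.Value;
    stream = RandStream('Threefry');
    stream.Substream = seeds;
    % cvpartition draws from the global stream, so install ours first.
    RandStream.setGlobalStream(stream);
    CV = cvpartition(y, 'KFold', 10);
    testFold1(:, seeds) = test(CV, 1);     % membership of outer fold 1
end
RandStream.setGlobalStream(prevStream);    % restore the previous global stream

% Count how many distinct fold-1 memberships the seeds produced.
nDistinct = size(unique(testFold1', 'rows'), 1);
fprintf('%d distinct fold-1 memberships out of %d seeds\n', nDistinct, nSeeds);
```

If the count printed at the end is 1, the substream is not reaching cvpartition; if it is greater than 1, the partitions do differ between seeds.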
(2) Note, as you may already know, starting in R2023a, each parallel worker has an independent random number stream by default. See https://www.mathworks.com/help/parallel-computing/control-random-number-streams-on-workers.html.
  • "By default, the MATLAB client and MATLAB workers use different random number generators, even if the workers are part of a local cluster on the same machine as the client."
  • "By default, each worker in a cluster working on the same job has an independent random number stream. If rand, randi, or randn are called in parallel, each worker produces a unique sequence of random numbers."
So, if you want independent random number streams on each worker, just accept the defaults (as of R2023a or later), with no need to call rng or RandStream functions. If you want some other behavior with repeatable random number sequences, see the instructions at https://www.mathworks.com/help/parallel-computing/control-random-number-streams-on-workers.html .
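As a small illustration of what "repeatable" means with substreams, here is a sketch that assumes a Threefry stream with seed 0. The variable names are arbitrary, and it only shows the RandStream mechanics, not the full worker setup described on the linked page.

```matlab
% One stream, addressed by substream index (e.g. the parfor iteration number).
s = RandStream('Threefry', 'Seed', 0);

s.Substream = 3;
a = rand(s, 1, 5);   % draws for "iteration 3"

s.Substream = 4;
c = rand(s, 1, 5);   % draws for "iteration 4"

s.Substream = 3;     % jump back to the start of substream 3
b = rand(s, 1, 5);   % same values as a

fprintf('iteration 3 reproduced: %d\n', isequal(a, b));
fprintf('iteration 4 differs:    %d\n', ~isequal(a, c));
```

Because the values depend only on the substream index, the result of an iteration does not depend on which worker happens to run it.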
See the "Version History" section of the page https://www.mathworks.com/help/matlab/ref/rng.html for notes about the random number generation changes in R2023b and R2023a.
If this answer helps you, please remember to accept the answer.
1 Comment
Martin Randau on 7 Dec 2023
Dear Drew,
thanks for the answer. The version is 2021b, sorry for the confusion.
I added RandStream.setGlobalStream(stream); at the location you suggested. However, the NaNs are still occurring in the same column across the seeds. Of course, it is possible that, given the distribution of cases/non-cases, every iteration ends up with the NaN in the same fold even though the starting points are different.
The parallel job is in a distributed environment, i.e., MATLAB Parallel Server (https://www.hpc.dtu.dk/?page_id=2021). So I start the job with a .sh script where I specify "# -- Number of cores requested -- #BSUB -n 1".
BW
Martin
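One way to test that last hypothesis is to count the minority-class observations in each outer test fold for several seeds. The sketch below uses synthetic labels with 5 minority cases (an assumption; the real class counts are not given in the thread), and `minorityPerFold` is a made-up name. cvpartition may issue a warning here because some test folds cannot contain the minority class.

```matlab
% Synthetic labels: 5 minority cases, fewer than the 10 outer folds.
y = [zeros(95,1); ones(5,1)];
nSeeds = 5;
K = 10;
minorityPerFold = zeros(nSeeds, K);

prevStream = RandStream.getGlobalStream;   % remember the current global stream
for seeds = 1:nSeeds
    stream = RandStream('Threefry');
    stream.Substream = seeds;
    RandStream.setGlobalStream(stream);
    CV = cvpartition(y, 'KFold', K);       % stratified by y by default
    for k = 1:K
        % Minority-class observations that land in test fold k.
        minorityPerFold(seeds, k) = sum(y(test(CV, k)) == 1);
    end
end
RandStream.setGlobalStream(prevStream);    % restore the previous global stream

disp(minorityPerFold)   % rows: seeds, columns: outer folds
```

If the zero-count columns line up across rows while the fold membership itself differs between seeds, the fixed NaN column would point to how the stratified folds are filled rather than to the random stream. This is something to check, not a documented guarantee.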

Release

R2023b
