How to decide the number of workers by seeing computer specs like CPUS or Cores or threads?

104 views (last 30 days)
I have a desktop with 24 CPUs (1 socket, 12 cores, 2 threads per core). I am trying to find out how I can use this desktop efficiently for parallel programming.
This is my code with 50 workers:
myCluster = parcluster('local');
myCluster.NumWorkers = 50; % 'Modified' property now TRUE
saveProfile(myCluster);
parpool('local',50)
tic
parfor i = 1:100
pause(10)
end
toc
It takes approximately 20 seconds.
Now if I change the number of workers from 50 to 40, it takes approximately 30 seconds. My question is how to set the maximum number of workers by looking at CPUs, cores, or threads, because the machine clearly has 24 CPUs, so 50 has nothing to do with 24, and the same applies to 40 (neither 50 nor 40 is a multiple of 24). Also, if I change the number of workers to 100, the desktop hangs, so there is definitely some limit on the number of workers. This all boils down to one question for a MATLAB user: how can I find the maximum number of workers that gets the above code to run in the minimum time?
Thanks for your help.

Answers (1)

Walter Roberson on 27 Oct 2021
By examination of your code, we can see that if any worker is scheduled for N iterations of the work, it must take approximately 10*N seconds.
Therefore, if any worker is scheduled for 2 iterations, the minimum possible time is approximately 20 seconds.
We also see that, since at least one worker must get scheduled at least once, the minimum possible finishing time is approximately 10 seconds.
If we request M workers, with M strictly less than 100, is it possible to finish all of the work in less than about 20 seconds? No: that would require distributing the 100 tasks amongst fewer than 100 workers with no worker being given 2 tasks, which the Pigeonhole Principle rules out -- to have each of the 100 tasks scheduled simultaneously, you need a minimum of 100 workers.
If your MATLAB hangs with 100 workers requested, then even if it did not hang with 99 workers, some worker would still get scheduled for two tasks and the minimum time would be about 20 seconds.
I would tend to suspect that when you request 100 workers, that you are running into memory limits.
If you are using R2021b or newer (somehow ;-) ), then you could experiment with using backgroundPool, as my understanding is that it uses less memory. You just might be able to get 100 background thread workers without running out of real memory... maybe.
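A minimal sketch of that idea, assuming R2021b or newer: backgroundPool gives thread-based workers that share memory with the MATLAB client, so each one does not need its own copy of the execution engine. (The number of background threads typically follows the core count, so the 100 pauses below still queue up rather than all running at once.)
f(1:100) = parallel.FevalFuture;                    % preallocate the array of futures
for i = 1:100
    f(i) = parfeval(backgroundPool, @pause, 0, 10); % pause(10), no output arguments
end
wait(f)                                             % block until every future has finished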
3 Comments
muhammad imran on 27 Oct 2021
Does this mean I should use 11 workers if I have a 12-core desktop? I am also still confused about whether MATLAB's parallelization for a simple parfor loop is designed around the hardware. For example, if a person gives me the specs of a desktop and asks how many workers are available, can I answer that? I am interested in the answer as a math equation, if possible. In my case I have 12 cores with 2 threads each (24 CPUs), an Intel Core(TM) i9-10920X CPU @ 3.58 GHz, and 128 GB of RAM. Why is my desktop even able to run 50 or 40 workers if it only has 12 cores? How is the desktop doing this? The code is deliberately simple so I can check how parfor behaves; I will use this to go from the desktop to a cluster, where I will also be given information about cores, CPUs, and memory. But if a 12-core machine can run 40 or 50 workers, I am confused about what is going on. How do I use parallelization, or just a parfor loop, efficiently? Thanks
Walter Roberson on 27 Oct 2021
When you ask for N workers (and are not using a backgroundpool), then MATLAB tries to create N processes, each with a copy of the MATLAB execution engine (which is roughly 2 gigabytes per process.)
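As a rough sanity check against the 128 GB machine described above: 50 worker processes at roughly 2 GB each is on the order of 100 GB, which can still fit in RAM, while 100 workers is on the order of 200 GB, which exceeds physical memory and would be consistent with the machine hanging.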
Sometimes the operating system has limits on the number of sub-processes that may be created; clusters are often running Linux, and for Linux see the nproc limit (ulimit -u); https://linuxhint.com/set_max_user_processes_linux/
When processes are successfully created, it becomes up to the operating system to schedule them.
Now, suppose that two processes exist, both of which are competing to use a single core, and suppose that the cost to the operating system of switching between them is c. Then compare t1/2 + c + t2/2 + c + t1/2 + c + t2/2 with t1 + t2, where t1 and t2 are the compute times of the two tasks -- that is, suppose in the first case that the operating system interrupts each of the two tasks once to switch to the other task, and in the second case that the operating system lets the tasks run to completion. You can see that the first total would be t1 + t2 + 3*c as compared to t1 + t2 for the case where it allows running to completion. The difference, 3*c, makes it more expensive (slower) when the operating system interrupts a compute-bound task.
It follows from this that, given compute-bound tasks, the lowest cost is to assign a process to a core and let it run uninterrupted until it finishes. It also follows that for compute-bound tasks it is inefficient to schedule more processes than you have the ability to compute simultaneously.
So, given your cores and the existence of hyperthreads, how many tasks do you have the ability to compute simultaneously? It turns out that (in most cases) hyperthreading does not increase your ability to compute simultaneously. A hyperthread is essentially a compute thread held in a hardware fast-switch standby: when a thread voluntarily gives up control of a core, the hyperthread quickly switches over to use the core.
Threads voluntarily give up control of a core when they voluntarily sleep, or when they ask for a hardware resource that is not immediately ready. For example, if a thread is reading from a file and the operating system has pulled in a 4K buffer from the file, then it can immediately read up to 4K, and that does not necessarily lose control -- but at the end of the buffer it has to ask the operating system for another buffer-full, and the operating system puts it to sleep while it fetches the buffer.
If you have done the default of requesting only one core per worker, and your computation does not do any I/O, then using hyperthreads only slows things down (because there is a hardware cost for using hyperthreads.)
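A minimal sketch of that rule of thumb, sizing the pool to physical cores rather than the 24 logical CPUs; feature('numcores') is an undocumented but commonly used call, so treat it as an assumption (maxNumCompThreads at its default setting usually reports the same count):
nCores = feature('numcores');      % physical cores (undocumented call), e.g. 12 on this machine
pool = parpool('local', nCores);   % one single-threaded worker per physical core
parfor i = 1:100
    pause(10)
end
delete(pool)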
If you asked for multiple cores per worker (which can often be an advantage), then MATLAB will automatically use multiple cores to do "expensive" computations (which can include simple but repetitive computations such as summation of a large-enough vector). In such a case, some cores might finish faster than other cores, or some cores might not be needed all the time, so potentially in such a case hyperthreading could have an advantage.
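A rough sketch of asking for multiple cores per worker, assuming a release recent enough to have the cluster NumThreads property (around R2020b or newer):
c = parcluster('local');
c.NumThreads = 2;                  % computational threads (cores) per worker
pool = parpool(c, 6);              % 6 workers x 2 threads covers the 12 physical cores
parfor i = 1:6
    disp(sum(rand(1, 1e8)))        % a reduction large enough for each worker to multithread internally
end
delete(pool)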
Your use of pause() is not compute-bound; you voluntarily give up the core as soon as you start, so in that case, hyperthreads can make a bit of a difference, but even then the second thread gives up control nearly immediately too. Your operating system would end up scheduling each process (worker) soon after a different worker gave up control to wait.
So in the case of your demonstration code, your timing limit is based upon ceil(tasks / workers) * delay_per_task... or would be if you do not exhaust the memory limits or process limits of your system by asking for that many workers.
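Plugging the numbers from the question into that formula reproduces the observed timings (and shows what 24 workers would do), ignoring pool start-up and scheduling overhead:
tasks = 100;  delay = 10;                    % the loop above: 100 iterations of pause(10)
workers = [24 40 50 100];
predicted = ceil(tasks ./ workers) * delay   % -> 50  30  20  10 seconds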
Please also remember that setting up workers has a cost, that communicating data to them so they know what to do on which data has a cost, and that communicating results back from the workers has a cost. Those costs can be fairly noticeable. It is common that, if you do not plan workers carefully, running compute-bound tasks in parallel requires more time than doing them in serial.
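One part of that overhead can be measured directly: ticBytes and tocBytes report how much data is transferred to and from the workers around a parfor loop (they do not capture pool start-up time).
pool = gcp;        % assumes a pool is already running
ticBytes(pool)
parfor i = 1:100
    pause(10)
end
tocBytes(pool)     % bytes sent to and received from each worker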
Number of simultaneous compute tasks == number of physical cores
If you have assigned multiple cores per worker, or you have any I/O, then you might get overall improvement with hyperthreads... but it depends upon the instruction mix and the exact CPU involved and upon the overclocking... hyperthreads can be slower.
Hyperthreads == primarily a fast scheduling switch when a thread gives up a core voluntarily
Workers requested == operating system processes created
Operating system is responsible for scheduling processes to cores / hyperthreads
For compute-bound processes that only request one core per worker, the optimal is typically when you only ask for as many workers as you have physical cores... possibly leaving one core free to handle the operating system (and I/O.)
(On some hardware, with some instruction mixes, a hyperthread can make use of logic units not being used by the primary thread; for example a hyperthread might potentially be able to use an integer processing unit while the primary thread is using the floating point units. You pretty much have to custom assign processes to cores in order to take meaningful advantage of this.)
