Further Notes on Communicating Jobs
Number of Tasks in a Communicating Job
Although you create only one task for a communicating job, the system copies this
task for each worker that runs the job. For example, if a communicating job runs on
four workers, the Tasks property of the job contains four task
objects. The first task in the job's Tasks property corresponds
to the task run by the worker whose spmdIndex is 1, and so on, so that the
ID property for the task object and
spmdIndex for the worker that ran that task have the same
value. Therefore, the sequence of results returned by the fetchOutputs function corresponds to
the value of spmdIndex and to the order of tasks in the job's
Tasks property.
Avoid Deadlock and Other Dependency Errors
Because code running in one worker for a communicating job can block execution
until some corresponding code executes on another worker, the potential for deadlock
exists in communicating jobs. This is most likely to occur when transferring data
between workers or when making code dependent upon the
spmdIndex in an if statement. Some
examples illustrate common pitfalls.
Suppose you have a codistributed array D, and you want to use
the gather function to assemble the
entire array in the workspace of a single
worker.
if spmdIndex == 1
assembled = gather(D);
endThe reason this fails is because the gather function requires
communication between all the workers across which the array is distributed. When
the if statement limits execution to a single worker, the other
workers required for execution of the function are not executing the statement. As
an alternative, you can use gather itself to collect the data
into the workspace of a single worker: assembled = gather(D,
1).
In another example, suppose you want to transfer data from every worker to the
next worker on the right (defined as the next higher
spmdIndex). First you define for each worker what the workers
on the left and right
are.
from_worker_left = mod(spmdIndex - 2, spmdSize) + 1; to_worker_right = mod(spmdIndex, spmdSize) + 1;
Then try to pass data around the ring.
spmdSend (outdata, to_worker_right); indata = spmdReceive(from_worker_left);
The reason this code might fail is because, depending on the size of the data
being transferred, the spmdSend function can block execution in a worker until the
corresponding receiving worker executes its spmdReceive function. In this case, all the workers are attempting
to send at the same time, and none are attempting to receive while
spmdSend has them blocked. In other words, none of the
workers get to their spmdReceive statements because they are
all blocked at the spmdSend statement. To avoid this particular
problem, you can use the spmdSendReceive function.