Suppose we have an iterative algorithm in which a buffer of data must be transmitted to all processes in the communicator world. At each step of the algorithm, each process requires access to all of the data in the buffer, but it modifies just its own piece. At the end of each step, each process must receive the updated data from all other processes so that they can perform the next step.
Using the "basic" collective messaging we have seen so far, this would be accomplished as follows:
while not doneThe messaging overhead involved with the gather at step 3 followed by the broadcast at the (next) step 1 can be optimized considerably if each process at step 3 instead directly sent its data to all other processes. This is accomplished with all-to-all gathering. The modified algorithmic structure looks like the following:
Broadcast data to the N processes from rank 0Our earlier examples implemented C = A * B, where A was an MxP matrix, B was PxK, and hence the resulting C matrix was MxK. Let's consider a common special case of this:
This can be accomplished as follows:
Scatter A to all N processes from rank 0