All-to-All: Typical Use Case

Suppose we have an iterative algorithm in which a buffer of data must be available to all processes in the world communicator. At each step of the algorithm, each process requires access to all of the data in the buffer, but it modifies just its own piece. At the end of each step, each process must receive the updated pieces from all other processes so that it can perform the next step.

Using the "basic" collective messaging we have seen so far, this would be accomplished as follows:

while not done
  1. Broadcast data to the N processes from rank 0
  2. Each process uses the data in the buffer and updates its section
  3. Each process uses some version of a gather call to send its modified data back to rank 0

The messaging overhead of the gather at step 3 followed by the broadcast at the (next) step 1 can be reduced considerably if, at step 3, each process instead sends its data directly to all other processes. This is accomplished with an all-to-all gather. The modified algorithm looks like the following:

Broadcast data to the N processes from rank 0
while not done
  1. Each process uses the data in the buffer and updates its section
  2. Each process uses some version of an all-to-all gather call to send its modified data to all ranks
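
To make the difference between the two loop structures concrete, here is a minimal single-process sketch in Python. It does not use MPI: each "rank" is just an entry in a list, and the helper names (broadcast, gather, all_gather, update_section) as well as the update rule are hypothetical, chosen only for illustration. It also assumes the buffer length divides evenly among the ranks. In a real MPI program, the all_gather step would be a call such as MPI_Allgather.

```python
# Single-process simulation of both loop structures (no MPI involved).
# All function names and the update rule are illustrative assumptions.

def broadcast(data, n_ranks):
    """Give every simulated rank its own copy of the root's buffer."""
    return [list(data) for _ in range(n_ranks)]

def gather(sections):
    """Concatenate the ranks' sections, in rank order, into one buffer."""
    return [x for section in sections for x in section]

def all_gather(sections):
    """Like gather, but every rank ends up with the full buffer."""
    full = gather(sections)
    return [list(full) for _ in sections]

def update_section(rank, buffer, chunk):
    """Hypothetical update: reads the whole buffer, rewrites only this rank's piece."""
    total = sum(buffer)
    start = rank * chunk
    return [buffer[i] + total for i in range(start, start + chunk)]

def run_gather_broadcast(data, n_ranks, n_steps):
    """First structure: broadcast, update, gather to rank 0, on every step."""
    chunk = len(data) // n_ranks  # assumes an even split
    root = list(data)
    for _ in range(n_steps):
        buffers = broadcast(root, n_ranks)
        sections = [update_section(r, buffers[r], chunk) for r in range(n_ranks)]
        root = gather(sections)
    return root

def run_all_gather(data, n_ranks, n_steps):
    """Second structure: one broadcast up front, then update + all-gather."""
    chunk = len(data) // n_ranks  # assumes an even split
    buffers = broadcast(data, n_ranks)
    for _ in range(n_steps):
        sections = [update_section(r, buffers[r], chunk) for r in range(n_ranks)]
        buffers = all_gather(sections)
    return buffers

via_root = run_gather_broadcast([1, 2, 3, 4], n_ranks=2, n_steps=2)
on_every_rank = run_all_gather([1, 2, 3, 4], n_ranks=2, n_steps=2)
```

Both versions compute the same buffer contents. The difference is that the second performs one collective operation per step instead of two, and every rank already holds the updated buffer when the next step begins.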

A Concrete Example: An Algorithm Based on Repeated Matrix Multiplication

Our earlier examples implemented C = A * B, where A was an MxP matrix, B was PxK, and hence the resulting C matrix was MxK. Let's consider a common special case of this:

  1. All three matrices are square and of the same size: M = P = K.
  2. Each multiplication creates a new B from the product. That is, we no longer have a separate C matrix, but instead perform B = A * B at each step of our iterative algorithm.

This can be accomplished as follows:

Scatter the rows of A to all N processes from rank 0
Broadcast B to all N processes from rank 0
while not done
  1. Btemp = AssignedRowsOfA * B
  2. All-to-All Gather of Btemp back to B in all processes
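
The steps above can be sketched as a single-process Python simulation. This is only an illustration under stated assumptions: no MPI is used, each "rank" is a list entry, the helper names (matmul, scatter_rows, all_gather_rows, iterate) are hypothetical, the number of rows of A is assumed to divide evenly among the ranks, and "while not done" is replaced by a fixed number of steps. In a real MPI program, the last step of the loop would be a call such as MPI_Allgather.

```python
# Single-process simulation of the row-partitioned B = A * B iteration.
# Names and the fixed step count are illustrative assumptions.

def matmul(X, Y):
    """Plain nested-list matrix product X * Y."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def scatter_rows(A, n_ranks):
    """Split A into contiguous row blocks, one block per rank."""
    rows = len(A) // n_ranks  # assumes an even split
    return [A[r * rows:(r + 1) * rows] for r in range(n_ranks)]

def all_gather_rows(blocks):
    """Stack every rank's row block in rank order; each rank gets a full copy."""
    full = [list(row) for block in blocks for row in block]
    return [[list(row) for row in full] for _ in blocks]

def iterate(A, B, n_ranks, n_steps):
    my_rows = scatter_rows(A, n_ranks)                          # scatter A
    B_local = [[list(row) for row in B] for _ in range(n_ranks)]  # broadcast B
    for _ in range(n_steps):
        # Step 1: each rank computes its rows of the product.
        B_temp = [matmul(my_rows[r], B_local[r]) for r in range(n_ranks)]
        # Step 2: all-to-all gather of Btemp back into B on every rank.
        B_local = all_gather_rows(B_temp)
    return B_local

A = [[1, 1], [0, 1]]
identity = [[1, 0], [0, 1]]
result = iterate(A, identity, n_ranks=2, n_steps=3)
```

Starting from the identity matrix, three steps leave A * A * A in B on every rank, which matches what a serial loop of B = A * B would produce.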