Work Grid Size Determination

This section summarizes what we have seen so far about establishing the size of the computational work grid. Note that this size also indirectly determines the amount of work each thread will do, since the total work to be performed can only be distributed among the threads of the work grid established here.

Many different approaches to this determination could be devised. The one outlined here is intended to be equally applicable to OpenCL and CUDA programmers.

No significant optimization techniques are shown here; that important aspect is largely application-dependent and will be considered as we progress.

  1. Decide whether the grid should be 1D, 2D, or 3D based on the nature of the algorithm and the general way in which the problem is being decomposed. (Recall, for example, we saw both 1D and 2D kernels used in our matrix multiplication examples.) Let numDims be 1, 2, or 3 based on this decision.
  2. For each dimension, d (0 ≤ d < numDims), of the computational grid:
    1. Decide the Thread Block size (CUDA) or Work Group size (OpenCL). Let's generically call this size[d].
      Guideline: The product of these sizes over all dimensions should be a multiple of the warp size (32 on NVIDIA GPUs).
      Requirement: The product cannot exceed the largest size permitted by the GPU; this limit (commonly 1024) can be queried at run time, e.g., via cudaGetDeviceProperties in CUDA or clGetDeviceInfo with CL_DEVICE_MAX_WORK_GROUP_SIZE in OpenCL.
    2. Determine the number of these thread blocks (CUDA) or work groups (OpenCL) that are needed in each dimension: k[d] = ceil(totalNumberThreadsNeeded[d]/size[d]). In integer arithmetic this is the familiar k[d] = (totalNumberThreadsNeeded[d] + size[d] - 1) / size[d].
    3. CUDA: foo<<<k, size>>>(…); here k and size may be plain integers for a 1D grid, or dim3 values for 2D and 3D grids. (A complete CUDA sketch of these steps appears after this list.)
    4. OpenCL:
      globalSize[d] = k[d] * size[d]; // for each dimension; the global size is then a multiple of the local size, as OpenCL 1.x requires
      const size_t* globalOffset = nullptr; // no offset; an array of per-dimension offsets may be supplied instead
      clEnqueueNDRangeKernel(queue, fooKernel, numDims, globalOffset, globalSize, size, …);
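
To make the procedure concrete, here is a minimal CUDA sketch of the whole decision for a 1D grid. The kernel foo, the element count n, and the block size of 256 are hypothetical choices for illustration; the step comments refer to the numbering in the list above.

    #include <cuda_runtime.h>

    // Hypothetical kernel: each thread handles one element of an n-element array.
    __global__ void foo(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                         // guard: the grid is padded past n
            data[i] = 2.0f * data[i];
    }

    int main() {
        const int n = 1000000;             // totalNumberThreadsNeeded (one per element)

        float* d_data;
        cudaMalloc(&d_data, n * sizeof(float));

        // Step 1: a 1D decomposition, so numDims == 1.
        // Step 2.1: a block size that is a multiple of the warp size (32)
        //           and within the device's per-block limit.
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        int size = 256;                                    // 256 = 8 * 32
        if (size > prop.maxThreadsPerBlock) size = prop.maxThreadsPerBlock;

        // Step 2.2: k = ceil(n / size), written in integer arithmetic.
        int k = (n + size - 1) / size;

        // Step 2.3: launch k blocks of `size` threads each.
        foo<<<k, size>>>(d_data, n);
        cudaDeviceSynchronize();

        cudaFree(d_data);
        return 0;
    }

Note the guard (i < n) in the kernel: k * size is generally somewhat larger than n, so the surplus threads must test their global index and do nothing.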

It is possible for the dimensionality of the local sizes (thread blocks or work groups) and the global grid to differ, in which case slight modifications to the formulas and processes above may be needed.
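
As a minimal sketch of one such case, the CUDA example below launches a 2D grid of 1D thread blocks: one row of blocks per row of an nRows x nCols matrix. The kernel bar and all of the sizes are hypothetical; note that only the x dimension is padded, so only it needs the ceiling division.

    #include <cuda_runtime.h>

    // Hypothetical kernel over an nRows x nCols matrix in row-major order.
    __global__ void bar(float* m, int nRows, int nCols) {
        int row = blockIdx.y;                              // from the 2D grid
        int col = blockIdx.x * blockDim.x + threadIdx.x;   // from the 1D blocks
        if (row < nRows && col < nCols)                    // guard the padded x dimension
            m[row * nCols + col] += 1.0f;
    }

    int main() {
        const int nRows = 480, nCols = 1000;
        float* d_m;
        cudaMalloc(&d_m, nRows * nCols * sizeof(float));

        dim3 block(256);                            // 1D blocks: (256, 1, 1)
        dim3 grid((nCols + block.x - 1) / block.x,  // ceiling division in x only
                  nRows);                           // exact in y: one block row per matrix row
        bar<<<grid, block>>>(d_m, nRows, nCols);
        cudaDeviceSynchronize();

        cudaFree(d_m);
        return 0;
    }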