Work Grid Size Determination
This section summarizes what we have seen so far about establishing the size of
the computational work grid. Note that the grid size also indirectly determines
the amount of work each thread will do, since the total work to be performed
can only be distributed among the threads of the work grid established here.
Many different approaches to this determination could be developed. The approach
outlined here is intended to be equally applicable to OpenCL and CUDA programmers.
No significant optimization techniques are shown here; that important
aspect is largely application-dependent and will be considered as we progress.
- Decide whether the grid should be 1D, 2D, or 3D, based on
the nature of the algorithm and the general way in which the
problem is being decomposed. (Recall, for example, that we saw both 1D and 2D kernels
used in our matrix multiplication examples.) Let numDims be 1, 2, or 3
accordingly.
- For each dimension, d (0 ≤ d < numDims), of the computational grid:
- Decide the Thread Block size (CUDA) or
Work Group size (OpenCL). Let's generically call this size[d].
Guideline: The product of these sizes over all dimensions should be a multiple
of the warp size (32 on NVIDIA GPUs; the analogous wavefront size on AMD GPUs
is typically 64).
Requirement: The product cannot exceed the largest size permitted by the GPU
(query cudaDeviceProp::maxThreadsPerBlock in CUDA or
CL_DEVICE_MAX_WORK_GROUP_SIZE in OpenCL).
- Determine the number of these thread blocks (CUDA) or
work groups (OpenCL) that are needed in each dimension:
k[d] = ceil(totalNumberThreadsNeeded[d]/size[d]).
Because k[d] is rounded up, k[d]*size[d] may exceed the number of threads
actually needed, so the kernel must guard against out-of-range indices
(see the sketch after this list).
- CUDA: foo<<<k, size>>>(…); // k and size are ints for a 1D grid, dim3 values for 2D or 3D
- OpenCL:
globalSize[d] = k[d] * size[d]; // for each dimension d
const size_t* globalOffset = nullptr; // or something else, if desired
clEnqueueNDRangeKernel(queue, fooKernel, numDims, globalOffset, globalSize, size, …);
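
To make the recipe concrete, here is a minimal 1D CUDA sketch (the kernel name
scale, the element count n, and the block size of 256 are hypothetical choices,
not taken from our earlier examples):

    #include <cuda_runtime.h>

    // Hypothetical 1D kernel: scales each element of an n-element array.
    __global__ void scale(float* data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                     // guard: k*size may exceed n
            data[i] *= factor;
    }

    void launchScale(float* d_data, float factor, int n)
    {
        int size = 256;                // block size: a multiple of the warp size (32)
        int k = (n + size - 1) / size; // k = ceil(n/size) using integer arithmetic
        scale<<<k, size>>>(d_data, factor, n);
    }

Note that (n + size - 1) / size is the usual way to compute ceil(n/size) with
integer division, avoiding floating point.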
It is possible for the local and global grid dimensionalities to differ (in CUDA,
for example, a 2D grid of 1D thread blocks), in which case slight modifications
to the formulas and processes above may be needed.
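
As a sketch of that case (the kernel name invert and the image-processing setting
are hypothetical choices for illustration), a 2D grid of 1D thread blocks can
cover a width x height image, with size[0] = 256 and size[1] = 1:

    #include <cuda_runtime.h>

    // Hypothetical CUDA kernel: inverts a width x height 8-bit image using
    // a 2D grid of 1D thread blocks.
    __global__ void invert(unsigned char* img, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x; // blockDim.x == 256
        int y = blockIdx.y;                            // blockDim.y == 1, threadIdx.y == 0
        if (x < width)                                 // guard the rounded-up dimension
            img[y * width + x] = 255 - img[y * width + x];
    }

    void launchInvert(unsigned char* d_img, int width, int height)
    {
        dim3 size(256);                      // 1D thread blocks: 256 x 1 x 1
        dim3 k((width + 255) / 256, height); // 2D grid: ceil(width/256) x height blocks
        invert<<<k, size>>>(d_img, width, height);
    }

Here only the x dimension is rounded up, so only x needs a bounds check; the y
dimension is exact by construction.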