EECS 675: Multicore and GPU Programming

In-Class Materials

Material may be added and/or edited throughout the semester. I always try to have any such additions and edits posted before the class in which they are first covered.

  1. Introductory architectural concepts
    1. EECS Servers
    2. Shared memory multiprocessor architectures for thread-level (e.g., C++ std::thread) MIMD parallelism; Single address space
      • Shared memory multiprocessors (SMPs) using Uniform Memory Access (UMA) (Figure 5.1 from Computer Architecture: A Quantitative Approach, J. L. Hennessy and D. A. Patterson, Morgan Kaufmann, fifth edition, 2012.)
      • Distributed shared memory multiprocessors (DSMs) using Nonuniform Memory Access (NUMA) (Figure 5.2 from Computer Architecture: A Quantitative Approach, J. L. Hennessy and D. A. Patterson, Morgan Kaufmann, fifth edition, 2012.)
    3. Distributed Memory MIMD ("shared nothing"); Multiple address spaces
      • A networked collection of machines programmed using, for example, OpenMPI.
    4. GPU SIMD
      • Programmed using, for example, CUDA or OpenCL.
  2. Basic shared memory multi-threaded programming (Barlas, Chapter 3)
    1. API References
    2. Some basic examples of using std::thread (a consolidated sketch follows this list)
      1. Listing 3.2 illustrating the use of std::thread
      2. InitializeArray.c++
      3. Listing 3.2 modified to create an arbitrary number of child threads, each with a unique "work assignment"
      4. Listing 3.2 further modified to avoid garbled output by allowing only one thread to produce output. (We will see a more satisfying way to prevent garbled output shortly.)
      5. Listing 3.2 modified one more time to introduce thread IDs.
      6. Threads accessing global variables
      7. Retrieving results from a thread
      8. Programmer-defined classes with threads
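      The listings above build up one program step by step. As a consolidated sketch (illustrative code, not one of the course listings): launch an arbitrary number of child threads, give each a unique ID and work assignment, and retrieve each thread's result after joining it.

        #include <cstdlib>
        #include <functional>
        #include <iostream>
        #include <thread>
        #include <vector>

        // Each child thread gets a unique ID and work assignment; it reports its
        // result by writing (only) to its own slot of a shared vector.
        void doWork(int myID, int start, int count, long& result)
        {
            long sum = 0;
            for (int i = start; i < start + count; i++)
                sum += i;
            result = sum;
        }

        int main(int argc, char* argv[])
        {
            int nThreads = (argc > 1) ? std::atoi(argv[1]) : 4;
            std::vector<std::thread> threads;
            std::vector<long> results(nThreads);
            for (int t = 0; t < nThreads; t++)
                threads.push_back(std::thread(doWork, t, t * 1000, 1000,
                                              std::ref(results[t])));
            for (int t = 0; t < nThreads; t++)
                threads[t].join();          // wait for each child to finish
            for (int t = 0; t < nThreads; t++)
                std::cout << "thread " << t << " computed " << results[t] << '\n';
            return 0;
        }
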
    3. Using binary semaphores (C++'s std::mutex) to control access to shared resources
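      A minimal sketch (illustrative, not a course listing) of using a std::mutex to serialize access to a shared resource, here std::cout, so that output lines are not interleaved:

        #include <iostream>
        #include <mutex>
        #include <thread>
        #include <vector>

        std::mutex coutMutex; // shared by all threads in this process

        void sayHello(int myID)
        {
            coutMutex.lock();   // only one thread at a time gets past this point
            std::cout << "Hello from thread " << myID << '\n';
            coutMutex.unlock();
        }

        int main()
        {
            std::vector<std::thread> threads;
            for (int t = 0; t < 4; t++)
                threads.push_back(std::thread(sayHello, t));
            for (auto& thr : threads)
                thr.join();
            return 0;
        }
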
    4. Interim Summary: Code and Memory in multi-threaded applications
    5. Unique locks: Listing_3_2_WithWorkAssignment_unique_lock.c++
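      For comparison, a sketch (illustrative names) of the same critical section protected with a std::unique_lock, which releases the mutex automatically when it goes out of scope and may also be unlocked early:

        #include <iostream>
        #include <mutex>
        #include <thread>

        std::mutex coutMutex;

        void sayHello(int myID)
        {
            std::unique_lock<std::mutex> myLock(coutMutex); // mutex acquired here
            std::cout << "Hello from thread " << myID << '\n';
            myLock.unlock(); // optional early release; the destructor would also release it
        }

        int main()
        {
            std::thread t1(sayHello, 1), t2(sayHello, 2);
            t1.join();
            t2.join();
            return 0;
        }
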
    6. The Monitor Design Pattern (two examples; a generic sketch of the pattern appears below)
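      As a generic illustration (not one of the course's two examples), a monitor is a class that encapsulates its own mutex and condition variables so that callers never synchronize explicitly; here, a bounded buffer:

        #include <condition_variable>
        #include <cstddef>
        #include <mutex>
        #include <queue>

        class BoundedBuffer // the monitor: all synchronization is internal
        {
        public:
            explicit BoundedBuffer(std::size_t cap) : capacity(cap) {}

            void put(int value)
            {
                std::unique_lock<std::mutex> myLock(mtx);
                notFull.wait(myLock, [this] { return buf.size() < capacity; });
                buf.push(value);
                notEmpty.notify_one();
            }

            int get()
            {
                std::unique_lock<std::mutex> myLock(mtx);
                notEmpty.wait(myLock, [this] { return !buf.empty(); });
                int value = buf.front();
                buf.pop();
                notFull.notify_one();
                return value;
            }

        private:
            std::mutex mtx;
            std::condition_variable notFull, notEmpty;
            std::queue<int> buf;
            std::size_t capacity;
        };

      Producer threads call put and consumer threads call get; all locking and waiting stays inside the class.
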
    7. On starting all threads at once:
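      One common way to do this (a sketch assuming a condition-variable "start gate" is acceptable; the course materials may do it differently): create all of the threads first, have each block on the gate, then release them together with notify_all:

        #include <condition_variable>
        #include <iostream>
        #include <mutex>
        #include <thread>
        #include <vector>

        std::mutex gateMutex;
        std::condition_variable gate;
        bool go = false;

        void worker(int myID)
        {
            {
                std::unique_lock<std::mutex> myLock(gateMutex);
                gate.wait(myLock, [] { return go; }); // block here until released
            }
            std::cout << "thread " << myID << " started\n";
        }

        int main()
        {
            std::vector<std::thread> threads;
            for (int t = 0; t < 4; t++)
                threads.push_back(std::thread(worker, t));
            {   // release all waiting threads "at once"
                std::lock_guard<std::mutex> myLock(gateMutex);
                go = true;
            }
            gate.notify_all();
            for (auto& thr : threads)
                thr.join();
            return 0;
        }
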
  3. Multicore programming: MIMD/Distributed Memory ("Shared Nothing") (Barlas, Chapter 5 – not Chapter 4)
    1. Processes can only communicate via explicit message passing
      • The Message Passing Interface (MPI) standard (specification for the API; not a language)
      • MPI Forum: the standardization forum for the Message Passing Interface (MPI)
      • OpenMPI: One of the most common implementations of the MPI standard
        • Documentation (Select the version you want – eecs is running 2.1.1 as of Spring 2020 – to see man pages for (i) compiling and running, and (ii) individual Open MPI function specifications.)
    2. Basic OpenMPI Examples
    3. Buffering and Blocking in Point-to-Point OpenMPI Message Passing (a short sketch follows this list)
      • Buffered; locally blocking (sender only; MPI_Bsend)
      • Synchronous; globally blocking (sender only; MPI_Ssend)
      • Ready; receiving process must have already initiated a receive (sender only; MPI_Rsend)
      • Immediate; non-blocking (can be sender and/or receiver; MPI_Isend, MPI_Irecv)
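      A hedged sketch (illustrative values; compile with mpic++ and run under mpirun with at least two processes) pairing an immediate send on rank 0 with an ordinary blocking receive on rank 1. MPI_Bsend, MPI_Ssend, and MPI_Rsend take the same arguments as MPI_Send and differ only in the completion semantics summarized above.

        #include <iostream>
        #include <mpi.h>

        int main(int argc, char* argv[])
        {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            double value = 3.14;
            if (rank == 0)
            {
                // Immediate (non-blocking) send: returns at once; the buffer must not
                // be modified until the matching MPI_Wait completes the request.
                MPI_Request req;
                MPI_Isend(&value, 1, MPI_DOUBLE, 1, /*tag=*/0, MPI_COMM_WORLD, &req);
                // ... other work could be overlapped here ...
                MPI_Wait(&req, MPI_STATUS_IGNORE);
            }
            else if (rank == 1)
            {
                double received;
                MPI_Recv(&received, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                std::cout << "rank 1 received " << received << '\n';
            }
            MPI_Finalize();
            return 0;
        }
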
    4. Summary: Basic MPI Message Passing Primitives
    5. mpiMatrixMult.c++
      • Using several message passing strategies (immediate and blocking) to perform a general matrix multiplication of two non-square matrices.
      • Be sure to note and understand (a simplified sketch follows these points):
        • Where blocking send/receive requests are used versus where immediate versions are used, and why.
        • The code does not check all the MPI_Request objects for synchronization purposes. Why?
        • The importance of the tags in the messages.
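      A greatly simplified sketch of the last two points (illustrative sizes; not the actual mpiMatrixMult.c++ code): when one rank sends several different messages to the same destination, distinct tags are what let the receiver route each message to the intended buffer, and immediate receives plus MPI_Waitall show where request objects do need to be checked.

        #include <mpi.h>
        #include <vector>

        const int TAG_A_ROWS = 1, TAG_B = 2; // illustrative tag values

        int main(int argc, char* argv[])
        {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            const int rowsPerRank = 2, aCols = 3, bCols = 4;
            std::vector<double> aRows(rowsPerRank * aCols), B(aCols * bCols);

            if (rank == 0)
            {
                // ... fill aRows and B ..., then ship both to rank 1; the tags
                // distinguish the two messages even though source and destination match.
                MPI_Send(aRows.data(), rowsPerRank * aCols, MPI_DOUBLE, 1,
                         TAG_A_ROWS, MPI_COMM_WORLD);
                MPI_Send(B.data(), aCols * bCols, MPI_DOUBLE, 1,
                         TAG_B, MPI_COMM_WORLD);
            }
            else if (rank == 1)
            {
                // Post both receives immediately, then wait on all the requests at once.
                MPI_Request reqs[2];
                MPI_Irecv(aRows.data(), rowsPerRank * aCols, MPI_DOUBLE, 0,
                          TAG_A_ROWS, MPI_COMM_WORLD, &reqs[0]);
                MPI_Irecv(B.data(), aCols * bCols, MPI_DOUBLE, 0,
                          TAG_B, MPI_COMM_WORLD, &reqs[1]);
                MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
                // ... compute the partial product aRows * B here ...
            }
            MPI_Finalize();
            return 0;
        }
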
    6. Collective Message Passing in OpenMPI
    7. All to All Gathering and Scattering
    8. Summary: Collective MPI Message Passing Primitives
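      A hedged sketch of the scatter/gather pair (illustrative sizes and names). MPI_Bcast, MPI_Allgather, and MPI_Alltoall follow the same calling pattern, with every rank in the communicator participating in the single call.

        #include <iostream>
        #include <mpi.h>
        #include <vector>

        int main(int argc, char* argv[])
        {
            MPI_Init(&argc, &argv);
            int rank, nRanks;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &nRanks);

            const int perRank = 4;
            std::vector<double> all;                 // significant only on the root
            if (rank == 0)
                all.resize(perRank * nRanks, 1.0);

            // Scatter: the root deals out equal-sized pieces; every rank gets one.
            std::vector<double> mine(perRank);
            MPI_Scatter(all.data(), perRank, MPI_DOUBLE,
                        mine.data(), perRank, MPI_DOUBLE, /*root=*/0, MPI_COMM_WORLD);

            for (double& x : mine)                   // each rank works on its piece
                x *= (rank + 1);

            // Gather: the pieces come back to the root in rank order.
            MPI_Gather(mine.data(), perRank, MPI_DOUBLE,
                       all.data(), perRank, MPI_DOUBLE, 0, MPI_COMM_WORLD);

            if (rank == 0)
                std::cout << "gathered " << all.size() << " values\n";
            MPI_Finalize();
            return 0;
        }
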
    9. Programmer-Defined Data Types
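      As a sketch of the idea (illustrative sizes; run with at least two processes): an MPI_Type_vector describing one column of a row-major matrix, which can then be sent and received exactly like a built-in type.

        #include <mpi.h>

        int main(int argc, char* argv[])
        {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            const int nRows = 4, nCols = 6;
            double A[nRows][nCols] = {};

            // One column of a row-major matrix: nRows blocks of 1 double, nCols apart.
            MPI_Datatype columnType;
            MPI_Type_vector(nRows, 1, nCols, MPI_DOUBLE, &columnType);
            MPI_Type_commit(&columnType);

            // The new type can now be used anywhere a built-in type could be:
            if (rank == 0)
                MPI_Send(&A[0][2], 1, columnType, 1, 0, MPI_COMM_WORLD); // send column 2
            else if (rank == 1)
                MPI_Recv(&A[0][2], 1, columnType, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);

            MPI_Type_free(&columnType);
            MPI_Finalize();
            return 0;
        }
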
    10. Groups and Communicators
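      A minimal sketch, assuming the common MPI_Comm_split approach: partition MPI_COMM_WORLD by a computed color so that subsequent collectives involve only the ranks in each sub-communicator.

        #include <iostream>
        #include <mpi.h>

        int main(int argc, char* argv[])
        {
            MPI_Init(&argc, &argv);
            int worldRank;
            MPI_Comm_rank(MPI_COMM_WORLD, &worldRank);

            // Split MPI_COMM_WORLD into two sub-communicators: even and odd ranks.
            int color = worldRank % 2;
            MPI_Comm subComm;
            MPI_Comm_split(MPI_COMM_WORLD, color, /*key=*/worldRank, &subComm);

            int subRank, subSize;
            MPI_Comm_rank(subComm, &subRank);
            MPI_Comm_size(subComm, &subSize);
            std::cout << "world rank " << worldRank << " is rank " << subRank
                      << " of " << subSize << " in sub-communicator " << color << '\n';

            // Collectives (e.g., MPI_Bcast) now operate within subComm only.
            MPI_Comm_free(&subComm);
            MPI_Finalize();
            return 0;
        }
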
    11. I/O
      • Console input restricted to a single rank (process)
      • File I/O subsystem for shared R/W access to binary files
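      A hedged sketch of the shared-file idea (illustrative file name and sizes): every rank opens the same binary file through the MPI I/O subsystem and writes its own block at a rank-dependent offset.

        #include <mpi.h>
        #include <vector>

        int main(int argc, char* argv[])
        {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            const int perRank = 100;
            std::vector<double> mine(perRank, static_cast<double>(rank));

            // All ranks open the same binary file; each writes its block at its own offset.
            MPI_File fh;
            MPI_File_open(MPI_COMM_WORLD, "results.bin",
                          MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

            MPI_Offset offset = static_cast<MPI_Offset>(rank) * perRank * sizeof(double);
            MPI_File_write_at(fh, offset, mine.data(), perRank, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);

            MPI_File_close(&fh);
            MPI_Finalize();
            return 0;
        }
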
    12. Combining multi-threading with OpenMPI
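      The key point is initializing MPI with the thread-support level you need. A sketch (illustrative; the course materials may structure this differently):

        #include <iostream>
        #include <mpi.h>
        #include <thread>

        int main(int argc, char* argv[])
        {
            // Ask for full thread support; check what the MPI library actually provides.
            int provided;
            MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
            if (provided < MPI_THREAD_MULTIPLE)
                std::cerr << "Warning: only thread level " << provided << " provided\n";

            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            // With MPI_THREAD_MULTIPLE, any thread may make MPI calls; with lower
            // levels (FUNNELED/SERIALIZED), MPI calls must be confined or serialized.
            std::thread worker([rank]() {
                std::cout << "rank " << rank << ": hello from a child thread\n";
            });
            worker.join();

            MPI_Finalize();
            return 0;
        }
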
  4. GPU programming (Barlas, Chapter 6)
    1. References
      • General:
        • Chapter 6 of: Multicore and GPU Programming: An Integrated Approach, Gerassimos Barlas, Morgan Kaufmann, 2015.
      • CUDA:
      • OpenCL:
    2. GPU Terminology: Comparing OpenCL and CUDA terms with those of Hennessy & Patterson
    3. Example GPU Architectures
    4. Comparing CUDA and OpenCL Thread Geometry conventions
    5. DAXPY: A quick CUDA-OpenCL comparison – (see also assignment of a similar kernel to GPU)
    6. Examples with complete CUDA and OpenCL implementations
      1. Simple OpenCL – illustrates OpenCL's platform model
      2. Hello
      3. DAXPY: Full implementation
      4. Matrix multiplication: C = A * B
      5. Useful utilities:
        • CUDA
          • helper_cuda.h: includes (see matrixMultiplyV3.cu for examples of use)
            • function checkCudaErrors(cudaError_t err): (if err != cudaSuccess, the text description of the error and the current file and line number are printed.)

              Typical usage: Call this following CUDA API calls that return a cudaError_t value.

            • function getLastCudaError(const char* msgTag) (calls cudaGetLastError(); if the result != cudaSuccess, the current file and line number are printed, followed by your supplied msgTag and the text description of the CUDA error.)

              Typical usage: Call this following CUDA calls such as kernel launches that do not provide a cudaError_t result variable.
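
            A hedged host-code sketch of the typical usage just described (illustrative function, kernel, and variable names):

              #include <cuda_runtime.h>
              #include "helper_cuda.h"

              void copyAndLaunch(const float* hostData, int n)
              {
                  float* devData = nullptr;

                  // Runtime API calls return a cudaError_t; wrap them in checkCudaErrors:
                  checkCudaErrors(cudaMalloc((void**) &devData, n * sizeof(float)));
                  checkCudaErrors(cudaMemcpy(devData, hostData, n * sizeof(float),
                                             cudaMemcpyHostToDevice));

                  // A kernel launch (e.g., someKernel<<<nBlocks, nThreadsPerBlock>>>(devData, n);)
                  // returns no cudaError_t, so follow it with:
                  // getLastCudaError("someKernel launch failed");

                  checkCudaErrors(cudaFree(devData));
              }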

          • helper_string.h (required by previous)
          • To use:
            • Save these two header files in your project directory.
            • Add:
              INCLUDES += -I.
              to your Makefile.
        • OpenCL
          • OpenCL error code listings can be found in many places, for example in the Khronos CL/cl.h header.
    7. An intermediate summary: Work Grid Size Determination
    8. Case Study: 2D Contouring
    9. Exploiting GPU Memory Hierarchy and Access
    10. Performance Optimization
      1. High-level work decomposition
      2. Kernel Design
      3. Memory Management and Usage
      4. Asynchronous Kernel Launches
    11. Events
    12. Advanced topics
      • Kernel timing and profiling
      • Launching kernels from the GPU
        • CUDA: Dynamic Parallelism (requires compute capability ≥ 3.5)
        • OpenCL: Device-side queues and Device-side enqueueing (requires OpenCL 2.0 and GPU support)
      • nvvp: The NVIDIA Visual Profiler
      • OpenCL Pipes
  5. Access