EECS 675: Multicore and GPU Programming
In-Class Materials
Material may be added and/or edited throughout the semester. I always try to have any such additions and edits posted before the class in which they are first covered.
- Introductory architectural concepts
- EECS Servers
- Shared memory multiprocessor architectures for thread-level (e.g., C++ std::thread) MIMD parallelism; Single address space
- Shared memory multiprocessors (SMPs) using Uniform Memory Access (UMA) (Figure 5.1 from Computer Architecture: A Quantitative Approach, J. L. Hennessy and D. A. Patterson, Morgan Kaufmann, fifth edition, 2012.)
- Distributed shared memory multiprocessors (DSMs) using Nonuniform Memory Access (NUMA) (Figure 5.2 from Computer Architecture: A Quantitative Approach, J. L. Hennessy and D. A. Patterson, Morgan Kaufmann, fifth edition, 2012.)
- Distributed Memory MIMD ("shared nothing"); Multiple address spaces
- A networked collection of machines programmed using, for example, OpenMPI.
- GPU SIMD
- Programmed using, for example, CUDA or OpenCL.
- Basic shared memory multi-threaded programming (Barlas, Chapter 3)
- API References
- Some basic examples of using std::thread
- Listing 3.2, illustrating the use of std::thread
- InitializeArray.c++
- Listing 3.2 modified to create an arbitrary number of child threads, each with a unique "work assignment"
- Listing 3.2 further modified to avoid garbled output by allowing only one thread to produce output. (We will see a more satisfying way to prevent garbled output shortly.)
- Listing 3.2 modified one more time to introduce thread IDs. (A standalone sketch of the overall pattern follows this list.)
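- A minimal standalone sketch (not the actual listing; function and variable names are illustrative) of the pattern these Listing 3.2 variants build toward: launch an arbitrary number of child threads, give each a unique work assignment and thread ID, then join them all:

      #include <cstdlib>
      #include <thread>
      #include <vector>

      // Each child receives a unique thread ID and "work assignment" (here, a range of indices).
      void doWork(int threadID, int firstIndex, int count)
      {
          for (int i = firstIndex; i < firstIndex + count; i++)
          {
              // ... perform this thread's portion of the work ...
          }
      }

      int main(int argc, char* argv[])
      {
          int nThreads = (argc > 1) ? std::atoi(argv[1]) : 4;   // arbitrary number of children
          std::vector<std::thread> children;
          for (int i = 0; i < nThreads; i++)
              children.push_back(std::thread(doWork, i, i * 100, 100));
          for (auto& t : children)
              t.join();                                         // wait for every child to finish
          return 0;
      }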
- Threads accessing global variables
- Retrieving results from a thread
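- One common way to retrieve results (a sketch, assuming each thread writes only a slot reserved for it; names are illustrative): pass each thread a reference to its own result slot with std::ref, then read the slots after join():

      #include <functional>
      #include <iostream>
      #include <thread>
      #include <vector>

      void partialSum(const std::vector<double>& data, int first, int count, double& result)
      {
          double s = 0.0;
          for (int i = first; i < first + count; i++)
              s += data[i];
          result = s;                       // each thread writes only its own slot; no lock needed
      }

      int main()
      {
          std::vector<double> data(1000, 1.0);
          const int nThreads = 4;
          const int chunk = static_cast<int>(data.size()) / nThreads;
          std::vector<double> results(nThreads);
          std::vector<std::thread> children;
          for (int i = 0; i < nThreads; i++)
              children.push_back(std::thread(partialSum, std::cref(data),
                                             i * chunk, chunk, std::ref(results[i])));
          for (auto& t : children)
              t.join();
          double total = 0.0;
          for (double r : results)
              total += r;
          std::cout << "total = " << total << '\n';
          return 0;
      }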
- Programmer-defined classes with threads
- Using binary semaphores (C++'s std::mutex) to control access to shared resources
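- The basic std::mutex idea in a sketch (illustrative names): only one thread at a time may hold the lock, so the guarded output lines can no longer be interleaved:

      #include <iostream>
      #include <mutex>
      #include <thread>
      #include <vector>

      std::mutex coutLock;                              // guards the shared resource (std::cout)

      void report(int threadID)
      {
          std::lock_guard<std::mutex> guard(coutLock);  // acquired here; released when guard is destroyed
          std::cout << "Hello from thread " << threadID << '\n';
      }

      int main()
      {
          std::vector<std::thread> children;
          for (int i = 0; i < 4; i++)
              children.push_back(std::thread(report, i));
          for (auto& t : children)
              t.join();
          return 0;
      }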
- Interim Summary: Code and Memory in multi-threaded applications
- Unique locks: Listing_3_2_WithWorkAssignment_unique_lock.c++
- The Monitor Design Pattern. Two examples:
- On starting all threads at once:
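- A generic sketch (not either of the two course examples; names are illustrative) combining the monitor idea with starting all threads at once: each thread waits on a condition variable through a std::unique_lock until main flips the shared "go" state and notifies everyone:

      #include <condition_variable>
      #include <mutex>
      #include <thread>
      #include <vector>

      std::mutex gateLock;                       // protects the monitor's state
      std::condition_variable gate;
      bool go = false;                           // the state the threads are waiting on

      void worker(int threadID)
      {
          std::unique_lock<std::mutex> lock(gateLock);
          gate.wait(lock, [] { return go; });    // releases the lock while waiting; reacquires on wakeup
          lock.unlock();
          // ... all threads begin their real work at (essentially) the same time ...
      }

      int main()
      {
          std::vector<std::thread> children;
          for (int i = 0; i < 4; i++)
              children.push_back(std::thread(worker, i));
          {
              std::lock_guard<std::mutex> guard(gateLock);
              go = true;                         // change the state while holding the lock ...
          }
          gate.notify_all();                     // ... then wake all waiting threads
          for (auto& t : children)
              t.join();
          return 0;
      }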
- Multicore programming: MIMD/Distributed Memory ("Shared Nothing") (Barlas, Chapter 5 – not Chapter 4)
- Processes can only communicate via explicit message passing
- The Message Passing Interface (MPI) standard (a specification for the API; not a language)
- MPI Forum: the standardization forum for the Message Passing Interface (MPI)
- OpenMPI: One of the most common implementations of the MPI standard
- Documentation (Select the version you want – eecs is running 2.1.1 as of Spring 2020 – to see man pages for (i) compiling and running, and (ii) individual Open MPI function specifications.)
- Basic OpenMPI Examples
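- The canonical first example (a sketch; the posted examples may differ): each rank reports its ID and the total number of ranks. Compile with mpic++ and run with, e.g., mpirun -np 4 ./hello:

      #include <iostream>
      #include <mpi.h>

      int main(int argc, char* argv[])
      {
          MPI_Init(&argc, &argv);

          int rank, numRanks;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);      // this process's ID within the communicator
          MPI_Comm_size(MPI_COMM_WORLD, &numRanks);  // total number of processes launched

          std::cout << "Hello from rank " << rank << " of " << numRanks << '\n';

          MPI_Finalize();
          return 0;
      }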
- Buffering and Blocking in Point-to-Point OpenMPI Message Passing
- Buffered; locally blocking (sender only; MPI_Bsend)
- Synchronous; globally blocking (sender only; MPI_Ssend)
- Ready; receiving process must have already initiated a receive (sender only; MPI_Rsend)
- Immediate; non-blocking (can be sender and/or receiver; MPI_Isend, MPI_Irecv)
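- A sketch contrasting the immediate and synchronous modes (illustrative buffer and tag; run with at least two ranks): MPI_Isend/MPI_Irecv return immediately and are completed later with MPI_Wait, while MPI_Ssend returns only after the matching receive has begun:

      #include <mpi.h>

      int main(int argc, char* argv[])
      {
          MPI_Init(&argc, &argv);
          int rank;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          const int N = 100, tag = 1;
          double buf[N] = {0.0};

          if (rank == 0)
          {
              // Immediate, non-blocking: returns at once; buf must not be reused
              // until MPI_Wait reports the transfer complete.
              MPI_Request req;
              MPI_Isend(buf, N, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD, &req);
              // ... other work could overlap the communication here ...
              MPI_Wait(&req, MPI_STATUS_IGNORE);

              // Synchronous, globally blocking: returns only after rank 1 has
              // begun the matching receive.
              MPI_Ssend(buf, N, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
          }
          else if (rank == 1)
          {
              // Matches the MPI_Isend (messages with the same source and tag
              // are received in the order they were sent).
              MPI_Recv(buf, N, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

              // Immediate receive matching the MPI_Ssend; completed by MPI_Wait.
              MPI_Request req;
              MPI_Irecv(buf, N, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, &req);
              MPI_Wait(&req, MPI_STATUS_IGNORE);
          }

          MPI_Finalize();
          return 0;
      }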
- Summary: Basic MPI Message Passing Primitives
- mpiMatrixMult.c++
- Uses several message passing strategies (immediate and blocking) to perform a general matrix multiplication of two non-square matrices.
- Be sure to note and understand:
- Where blocking send/receive requests are used versus where immediate versions are used, and why.
- The code does not check all the MPI_Request objects for synchronization purposes. Why?
- The importance of the tags in the messages.
- Collective Message Passing in OpenMPI
- All-to-All Gathering and Scattering
- Summary: Collective MPI Message Passing Primitives
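- A scatter/gather sketch (illustrative sizes; not one of the posted examples): the root distributes equal pieces of an array to every rank, each rank processes its piece, and the root gathers the pieces back in rank order:

      #include <mpi.h>
      #include <vector>

      int main(int argc, char* argv[])
      {
          MPI_Init(&argc, &argv);
          int rank, numRanks;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &numRanks);

          const int perRank = 4;
          std::vector<double> full;                 // only meaningful on the root
          if (rank == 0)
              full.assign(perRank * numRanks, 1.0);

          // Every rank (including the root) receives its own piece.
          std::vector<double> piece(perRank);
          MPI_Scatter(full.data(), perRank, MPI_DOUBLE,
                      piece.data(), perRank, MPI_DOUBLE, 0, MPI_COMM_WORLD);

          for (double& x : piece)
              x *= 2.0;                             // "work" on the local piece

          // Collect the processed pieces back on the root, in rank order.
          MPI_Gather(piece.data(), perRank, MPI_DOUBLE,
                     full.data(), perRank, MPI_DOUBLE, 0, MPI_COMM_WORLD);

          MPI_Finalize();
          return 0;
      }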
- Programmer-Defined Data Types
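- A minimal programmer-defined type sketch (illustrative names; run with at least two ranks): MPI_Type_contiguous builds a new datatype so that, e.g., a whole row of doubles travels as a single unit:

      #include <mpi.h>

      int main(int argc, char* argv[])
      {
          MPI_Init(&argc, &argv);
          int rank;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          const int rowLength = 512;
          MPI_Datatype rowType;
          MPI_Type_contiguous(rowLength, MPI_DOUBLE, &rowType);  // one "row" = rowLength doubles
          MPI_Type_commit(&rowType);                             // must commit before first use

          double row[rowLength] = {0.0};
          if (rank == 0)
              MPI_Send(row, 1, rowType, 1, 0, MPI_COMM_WORLD);   // one row sent as a single unit
          else if (rank == 1)
              MPI_Recv(row, 1, rowType, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

          MPI_Type_free(&rowType);
          MPI_Finalize();
          return 0;
      }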
- Groups and Communicators
- I/O
- Console input restricted to one rank process
- File I/O subsystem for shared R/W access to binary files
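- A sketch of the shared-file idea (file name and layout are illustrative): every rank writes its own block of doubles into one binary file at an offset computed from its rank:

      #include <mpi.h>
      #include <vector>

      int main(int argc, char* argv[])
      {
          MPI_Init(&argc, &argv);
          int rank;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          const int perRank = 100;
          std::vector<double> mine(perRank, static_cast<double>(rank));

          MPI_File fh;
          MPI_File_open(MPI_COMM_WORLD, "out.bin",
                        MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

          // Each rank writes to its own, non-overlapping region of the shared file.
          MPI_Offset offset = static_cast<MPI_Offset>(rank) * perRank * sizeof(double);
          MPI_File_write_at(fh, offset, mine.data(), perRank, MPI_DOUBLE, MPI_STATUS_IGNORE);

          MPI_File_close(&fh);
          MPI_Finalize();
          return 0;
      }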
- Combining multi-threading with OpenMPI
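- When std::thread is mixed with MPI, initialization changes: a minimal sketch using MPI_Init_thread and checking the thread-support level actually provided:

      #include <iostream>
      #include <mpi.h>
      #include <thread>

      int main(int argc, char* argv[])
      {
          // Request the highest support level: any thread may make MPI calls concurrently.
          int provided;
          MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
          if (provided < MPI_THREAD_MULTIPLE)
              std::cerr << "Full multi-threaded MPI support is not available on this build.\n";

          int rank;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          // Threads can now do local work; whether they may also call MPI depends on 'provided'.
          std::thread t([rank]() { std::cout << "thread running on rank " << rank << '\n'; });
          t.join();

          MPI_Finalize();
          return 0;
      }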
- GPU programming (Barlas, Chapter 6)
- References
- General:
- Chapter 6 of: Multicore and GPU Programming: An Integrated Approach, Gerassimos Barlas, Morgan Kaufmann, 2015.
- CUDA:
- NVIDIA CUDA site
- Programming Massively Parallel Processors, David B. Kirk & Wen-mei W. Hwu, Morgan Kaufmann.
- OpenCL
- Khronos site
- Heterogeneous Computing with OpenCL, B. R. Gaster, L. Howes, D. R. Kaeli, P. Mistry, and D. Schaa, Morgan Kaufmann.
- GPU Terminology: Comparing OpenCL and CUDA terms with those of Hennessy & Patterson
- Example GPU Architectures
- Comparing CUDA and OpenCL Thread Geometry conventions
- DAXPY: A quick CUDA-OpenCL comparison (see also the assignment of a similar kernel to the GPU)
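- For reference, the DAXPY kernel itself is only a few lines in CUDA (the standard textbook form; launch configuration and memory management are omitted here and appear in the full implementation listed below):

      // DAXPY: y = a*x + y, one GPU thread per element.
      __global__ void daxpy(int n, double a, const double* x, double* y)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's global index
          if (i < n)                                       // guard: the grid may be larger than n
              y[i] = a * x[i] + y[i];
      }

      // Typical launch (host side), assuming x and y are device pointers:
      //   int threadsPerBlock = 256;
      //   int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;   // ceiling division
      //   daxpy<<<blocks, threadsPerBlock>>>(n, a, x, y);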
- Examples with complete CUDA and OpenCL implementations
- Simple OpenCL – illustrates OpenCL's platform model
- Hello
- DAXPY: Full implementation
- Matrix multiplication: C = A * B
- Useful utilities:
- CUDA
- helper_cuda.h: includes the following (see matrixMultiplyV3.cu for examples of use; a usage sketch also follows these notes)
- function checkCudaErrors(cudaError_t err): if err != cudaSuccess, the text description of the error and the current file and line number are printed. Typical usage: call this on CUDA API calls that return a cudaError_t value.
- function getLastCudaError(const char* msgTag): calls cudaGetLastError(); if the result != cudaSuccess, the current file and line number are printed, followed by your supplied msgTag and the text description of the CUDA error. Typical usage: call this after CUDA calls, such as kernel launches, that do not return a cudaError_t value.
- helper_string.h (required by previous)
- To use:
- Save these two header files in your project directory.
- Add INCLUDES += -I. to your Makefile.
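- A usage sketch for the two helpers (the kernel and sizes are illustrative): wrap CUDA API calls that return a cudaError_t in checkCudaErrors, and follow kernel launches with getLastCudaError:

      #include <helper_cuda.h>                            // found via the INCLUDES += -I. line above

      __global__ void myKernel(double* d_data, int n)     // illustrative kernel
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) d_data[i] = 2.0 * i;
      }

      int main()
      {
          const int n = 1 << 20;
          double* d_data = nullptr;

          // These CUDA API calls return a cudaError_t, so wrap them in checkCudaErrors.
          checkCudaErrors(cudaMalloc(&d_data, n * sizeof(double)));

          // A kernel launch returns nothing; check for launch errors afterwards instead.
          myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
          getLastCudaError("myKernel launch failed");

          checkCudaErrors(cudaDeviceSynchronize());
          checkCudaErrors(cudaFree(d_data));
          return 0;
      }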
- OpenCL
- OpenCL error code listings can be found in many places online.
- An intermediate summary: Work Grid Size Determination
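- The heart of grid-size determination is a ceiling division in each dimension so the grid covers the whole problem even when the block size does not divide it evenly; a CUDA-flavored sketch (names are illustrative):

      #include <cuda_runtime.h>

      // Returns a grid that covers an nRows x nCols problem for the given block shape.
      dim3 gridFor(int nRows, int nCols, dim3 block)
      {
          return dim3((nCols + block.x - 1) / block.x,    // ceiling division in x
                      (nRows + block.y - 1) / block.y);   // ceiling division in y
      }

      // Usage:
      //   dim3 block(16, 16);                  // 256 threads per block, one common choice
      //   dim3 grid = gridFor(nRows, nCols, block);
      //   someKernel<<<grid, block>>>(...);
      // Inside the kernel, threads whose (row, col) fall outside nRows x nCols must simply return.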
- Case Study: 2D Contouring
- Exploiting GPU Memory Hierarchy and Access
- Barlas' Histogram Example (section 6.6.2)
- Converted to OpenCL
- The test image
- Performance Optimization
- High-level work decomposition
- 1D/2D/3D computational grid design
- Configuring Kernel launch parameters
- Kernel Design
- Memory Management and Usage
- Asynchronous Kernel Launches
- Events
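- A sketch of how streams and events fit together in CUDA (kernel and sizes are illustrative): launches are asynchronous with respect to the host, work issued to different streams may overlap, and events mark points in a stream for synchronization and timing:

      #include <cstdio>
      #include <cuda_runtime.h>

      __global__ void work(float* d, int n)                    // illustrative kernel
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) d[i] = d[i] * 2.0f + 1.0f;
      }

      int main()
      {
          const int n = 1 << 20;
          float* d = nullptr;
          cudaMalloc(&d, n * sizeof(float));

          cudaStream_t stream;
          cudaStreamCreate(&stream);
          cudaEvent_t start, stop;
          cudaEventCreate(&start);
          cudaEventCreate(&stop);

          cudaEventRecord(start, stream);                      // mark a point in the stream
          work<<<(n + 255) / 256, 256, 0, stream>>>(d, n);     // asynchronous: the host continues immediately
          cudaEventRecord(stop, stream);

          cudaEventSynchronize(stop);                          // host blocks until 'stop' has been reached
          float ms = 0.0f;
          cudaEventElapsedTime(&ms, start, stop);              // GPU time between the two events
          std::printf("kernel took %.3f ms\n", ms);

          cudaEventDestroy(start);
          cudaEventDestroy(stop);
          cudaStreamDestroy(stream);
          cudaFree(d);
          return 0;
      }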
- Advanced topics
- Kernel timing and profiling
- Launching kernels from the GPU
- CUDA: Dynamic Parallelism (requires compute capability ≥ 3.5)
- OpenCL: Device-side queues and Device-side enqueueing (Requires OpenCL 2.0 and GPU support)
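- A minimal CUDA dynamic-parallelism sketch (illustrative; compile with nvcc -rdc=true, and link -lcudadevrt on older toolkits): a kernel running on the GPU launches another kernel with no host involvement:

      #include <cstdio>

      __global__ void child(int parentBlock)
      {
          printf("child of block %d, thread %d\n", parentBlock, (int) threadIdx.x);
      }

      __global__ void parent()
      {
          if (threadIdx.x == 0)
              child<<<1, 4>>>(blockIdx.x);     // launched from the device, not the host
      }

      int main()
      {
          parent<<<2, 32>>>();
          cudaDeviceSynchronize();             // waits for the parents and all of their children
          return 0;
      }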
- nvvp: The NVIDIA Visual Profiler
- OpenCL Pipes
- Access