.. title: PAR Class 15, Thurs 2021-03-18
.. slug: class15
.. date: 2021-03-18
.. tags: class
.. category: 
.. link: 
.. description: 
.. type: text
.. has_math: true

.. raw:: html

   <style> .red {color:red} </style>
   <style> .blue {color:blue} </style>

.. role:: red
.. role:: blue

.. sectnum::
.. contents:: Table of contents
..


Student talks
-------------

Mon
===

#. Connor. cloud-based/docker-base parallel computing

#. Joseph.  OpenCL.

#. Blaine. Kubernetes.

#. Dan. Nsight.

#. Junshi. Nvidia debugging.

either
======

#. Ben. docker.

Did not reply
=============

#. You know who you are.

Nvidia GPU and accelerated computing, 9
---------------------------------------

This is from https://developer.nvidia.com/teaching-kits-downloads
This material accompanies **Programming Massively Parallel Processors A Hands-on Approach, Third Edition, David B. Kirk Wen-mei W. Hwu**.  I recommend it.  (The slides etc are free but the book isn't.)

My local copy of what I'm using is in /parclass/2021/files/nvidia/GPU-Teaching-Kit-Accelerated-Computing.zip

Today we'll start from
Module_20_Related_Programming_Models_OpenCL/


OpenCL
------

#. Module 20.

#. Apple's competition to CUDA.

#. is largely CUDA but they changed the names and made it look like OpenGL and clunkier.

#. not interesting so long as Nvidia is dominant.


Nvidia primary documentation
----------------------------

Generally more up-to-date and accurate, but drier.   A little disorganized because it keeps growing.  The root is here: https://docs.nvidia.com/

Two major relevant sets are:

#. https://docs.nvidia.com/hpc-sdk/index.html
   
#. https://docs.nvidia.com/cuda/index.html

Other Nvidia features
---------------------

We've seen almost everything, except:

#. Texture and surface maps.

#. ML HW like A=BC+D for 4x4 matrices.

#. Ray tracing HW, to compute a ray's intersections with boxes.

#. Cooperative groups: with Ampere, subsets of a warp can synchronize.

#. Subsets of a GPU can be defined as virtual GPUS, which are walled off from each other.

#. Memory can be compressed when stored, making a space-time tradeff.

#. The terminology *CUDA core* is obsolete.  Now, they say that an SM has, perhaps 32 single float units, 32 integer units, 32 CUDA instruction dispatchers, and 16 double float units, etc.  Each unit operates independently.


   
Nvidia conceptual hierarchy
---------------------------

As always, this is as I understand it, and could be wrong.    Nvidia uses their own terminology inconsistently.   They may use one name for two things (E.g., Tesla and GPU), and may use two names for one thing (e.g., module and accelerator).    As time progresses, they change their terminology.


#. At the bottom is the hardware **micro-architecture**.  This is an API that defines things like the available operations.  The last several Nvidia micro-architecture generations are, in order, **Tesla** (which introduced unified shaders), **Fermi**, **Kepler**, **Maxwell** (introduced in 2014), **Pascal** (2016), and **Volta** (2018).

   
#. Each micro-architecture is implemented in several different **microprocessors**.  E.g., the Kepler micro-architecture is embodied in the GK107, GK110, etc.  Pascal is GP104 etc.  The second letter describes the micro-architecture.  Different microprocessors with the same micro-architecture may have different amounts of various resources, like the number of processors and clock rate.

   
#. To be used, microprocessors are embedded in **graphics cards**, aka **modules** or **accelerators**, which are grouped into series such as GeForce, Quadro, etc.  Confusingly, there is a Tesla computing module that may use any of the Tesla, Fermi, or Kepler micro-architectures.  Two different modules using the same microprocessor may have different amounts of memory and other resources.  These are the components that you buy and insert into a computer.  A typical name is **GeForce GTX1080**.

#. There are many slightly different accelerators with the same architecture, but different clock speeds and memory, e.g. 1080, 1070, 1060, ...
   
#. The same accelerator may be manufactured by different vendors, as well as by Nvidia.  These different versions may have slightly different parameters.  Nvidia's **reference version** may be relatively low performance.

#. The term **GPU** sometimes refers to the microprocessor and sometimes to the module.

#. There are at least four families of modules: **GeForce** for gamers, **Quadro** for professionals, **Tesla** for computation, and **Tegra** for mobility.

#. Nvidia uses the term **Tesla** in two unrelated ways.  It is an obsolete architecture generation and a module family.
   
#. Geoxeon has a (Maxwell) GeForce GTX Titan and a (Kepler) Tesla K20xm.  Parallel has a (Volta) RTX 8000 and (Pascal) GeForce GTX 1080.  We also have an unused (Kepler) Quadro K5000.
  
#. Since the highest-end (Tesla) modules don't have video out, they are also called something like **compute modules**.


GPU range of speeds
-------------------

Here is an example of the wide range of Nvidia GPU speeds; all times are +-20%.

The Quadro RTX 8000 has 4608 CUDA cores @ 1.77GHz and 48GB of memory.  matrixMulCUBLAS runs at 5310 GFlops.  The specs claim 16 TFlops.  However those numbers understate its `capabilities <https://www.nvidia.com/en-us/design-visualization/quadro/rtx-8000/>`_ because it also has 576 Tensor cores and 72 ray tracing cores to cast 11G rays/sec.

The GeForce GTX 1080 has 2560 CUDA cores @ 1.73GHz and 8GB of memory.
matrixMulCUBLAS runs at 3136 GFlops.  However the reported time (0.063 msec) is so small that it may be inaccurate.   The quoted speed of the 1080 is about triple that.   I'm impressed that the measured performance is so close.

The Quadro K2100M in  my Lenovo W540 laptop has 576 CUDA cores @ 0.67 GHz and 2GB of memory.  matrixMulCUBLAS runs at 320 GFlops.   The time on the GPU was about .7 msec, and on the CPU 600 msec.

It's nice that the performance almost scaled with the number of cores and clock speed.


CUDA
----

Versions
========

#. CUDA has a **capability version**, whose major number corresponds to the micro-architecture generation.  Kepler is 3.x.  The K20xm is 3.5.  The GTX 1080 is 6.1.  The RTX 8000 is 7.5.  Here is a table of the `properties of different compute capabilities <http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities>`_.  However, that table is not completely consistent with what deviceQuery shows, e.g., the shared memory size.

#. **nvcc**, the CUDA compiler, can be told which capabilities (aka architectures) to compile for.   They can be given as a real  architecture, e.g., sm_61, or a virtual architecture. e.g., compute_61.

#. The CUDA driver and runtime also have a software version, defining things like available C++ functions.  The latest is 10.1.   This is unrelated to the capability version.


Misc
====

#. With CUDA, the dominant problem in program optimization is optimizing the data flow.  Getting the data quickly to the cores is harder than processing it.  It helps big to have regular arrays, where each core reads or writes a successive entry.

   This is analogous to the hardware fact that wires are bigger (hence, more expensive) than gates.

#. That is the opposite optimization to OpenMP, where having different threads writing to adjacent addresses will cause the false sharing problem.

   
#. `Nvidia CUDA FAQ <https://developer.nvidia.com/cuda-faq>`_

   a. has links to other Nvidia docs.
   #. can be a little old. 
   



Types of memory allocation
--------------------------

Here's a brief overview of my understanding of the various places that you can assign memory in a program.

#. **Static**.  Define a fixed-size array global array.  The variable is constructed at compile time, so accesses might perhaps be faster.  Global vars with non default initial values increase the executable file size.  If they're large enough, you need to use the compiler option **-mcmodel=medium** or **-mcmodel=large**.  They cause the compiler to generate wider addresses.  I don't know the effect on the program's size or speed, but suspect that it's small.

#. **Stack**.  Define local arrays, that are created and freed as the routine is entered and exited.  Their addresses relative to the base of this call frame may be constant.  The default stack size is 8MB.  You can increase this with the command **ulimit** or in the program as shown in **stacksize.cc**.  I believe that in OpenMP, the max stacksize may be allocated when each thread is created.  Then, a really big stackssize might have a penalty.

#. **Heap**.  You use **new** and **destroy**.  Variables are constructed whenever you want.  The more objects on the heap, the more time that each new or destroy takes.  If you have lots of objects consider using *placement* *new* or creating an array of them.

   For CUDA, some variables must be on the heap.
   
I like to use *static*, then *stack*, and *heap* only when necessary.   However, allocating few but large, blocks on the heap is also fast.

Google's `allocator <http://goog-perftools.sourceforge.net/doc/tcmalloc.html>`_ is noticeably better than the default one.  To use it, link your programs with **-ltcmalloc**.  You can often use it on an existing executable *foo* thus:

LD_PRELOAD="/usr/lib/libtcmalloc.so" foo

I found it to save 15% to 30% in time.   

Another memory concern is speed.  Parallel has a NUMA (Non Uniform Memory Architecture).  It has two 14-core Xeons.  Each core has 128GB of main memory.  Although all 256GB are in a common address space, accessing memory on same core as the thread is running on is faster.

The following is what I think based on some research, but may be wrong: A 4KB page of memory is assigned to a specific core when it is first written (not when it is reserved).  So, each page of a large array may be on a different core.  This can be used to optimize things.  This gets more fun with 8-processor systems.

All that is separate from cache issues.

You can also assign your OpenMP threads to specific cores.  This affects speed in ways I don't understand.  The issues are resource sharing vs conflicts.

	      

Nvidia GPU summary
------------------

Here's a summary of the Nvidia Pascal GP104 GPU architecture as I understand it.  It's more
compact than I've found elsewhere.  I'll add to it from time to time.  Some numbers are probably wrong.

#. The **host** is the CPU.

#. The **device** is the GPU.

#. The device contains 20 **streaming multiprocessors** (SMs).

   Different GPU generations have used the terms SMX or SMM.

#. A **thread** is a sequential program with private and shared memory, program counter, etc.

#. Threads are grouped, 32 at a time, into **warps**.

#. Warps of threads are grouped into **blocks**.  

   Often the warps are only implicit, and we consider that the threads are grouped directly into blocks.

   That abstract hides details that may be important; see below.

#. Blocks of threads are grouped into a **grid**, which is all the threads in the kernel.

#. A **kernel** is a parallel program executing on the device.

   a. The kernel runs potentially thousands of **threads**.

   #. A kernel can create other kernels and wait for their completion.

   #. There may be a limit, e.g., 5 seconds, on a kernel's run time.

#. Thread-level resources:

   a. Each thread can use up to 255 fast **registers**.  Registers are private to the thread.

      All the threads in one block have their registers allocated from a fixed pool of 65536 registers.  The more registers that each thread uses, the fewer warps in the block  can run simultaneously.

   #. Each thread has 512KB slow **local memory**, allocated from the global memory.

   #. Local memory is used when not enough registers are available, and to
      store thread-local arrays. 

#. Warp-level resources:

   a. Threads are grouped, 32 at a time, into **warps**.

   #. Each warp executes as a SIMD, with one instruction register.  At each cycle,
      every thread in a warp is either executing the same instruction, or is disabled.
      If the 32 threads want to execute 32 different instructions, then they will
      execute one after the other, sequentially.

      If you read in some NVidia doc that threads in a warp run independently, then
      continue reading the next page to get the info mentioned in the previous paragraph.

   #. If successive instructions in a warp do not depend on each other, then,
      if there are enough warp schedulers available, they may be executed in
      parallel.   This is called **Instruction Level Parallelism (ILP)**.

   #. For an array in local memory, which means that each thread will have
      its private copy, the elements for all the threads in a warp are
      **interleaved** to potentially increase the I/O rate.

      Therefore your program should try to have successive threads read successive
      words of arrays.

   #. A thread can read variables from other threads in the same warp, with the
      **shuffle** instruction.  Typical operation are to read from the K-th next
      thread, to do a butterfly permutation, or to do an indexed read.  This happens in
      parallel for the whole warp, and does not use shared memory.

   #. A **warp vote** combines a bit computed by each thread to report
      results like *all* or *any*.

#. Block-level resources:

   a. A block may contain up to 1024 threads.

   #. Each block has access to 65536 fast 32-bit **registers**,
      for the use of its threads.

   #. Each block can use up to 49152 bytes of the SM's fast **shared**
      **memory**.  The block's shared memory is shared by all the threads in
      the block, but is hidden from other blocks.

      Shared memory is basically a user-controllable cache of some global
      data.  The saving comes from reusing that shared data several times
      after you loaded it from global memory once.

      Shared memory is interleaved in banks so that some access patterns are faster than others.

   #. Warps in a block run asynchronously and run different instructions.  They
      are scheduled and executed as resources are available.

   #. However they are all running the same instruction sequence, perhaps at different points in it.

   #. That is call **SPMD**, single program multiple data.

   #. The threads in a block can be synchonized with **__syncthreads()**.

      Because of how warps are scheduled, that can be slow.

   #. The threads in a block can be arranged into a 3D array, up to
      1024x1024x64.

      That is for convenience, and does not increase performance (I think).

   #. I'll talk about **textures** later.


#. Streaming Multiprocessor (SM) - level resources:

   a. Each SM has 128 single-precision CUDA cores, 64
      double-precision units, 32 special function units, and
      32 load/store units.    

   #. In total, the GPU has 2560 CUDA cores.

   #. A **CUDA core** is akin to an ALU.  The cores, and all the units, are
      pipelined.

   #. A CUDA core is much less powerful than one core of an Intel Xeon.  My
      guess is 1/20th.

   #. Beware that, in the CUDA C Programming Guide, NVidia sometimes calls an
      SM a core.

   #. The limited number of, e.g., double precision units means that an DP
      instruction will need to be scheduled several times for all the threads
      to execute it.  That's why DP is slower.

   #. Each SM has 4 warp schedulers and 8 instruction dispatch units.

   #. 64 warps can simultaneously reside in an SM.

   #. Therefore up to 32x64=2048 threads can be executed in parallel by an
      SM.

   #. Up to 16 blocks that can simultaneously be resident in an SM.
  
      However, if each block uses too many resources, like shared memory,
      then this number is reduced.

      Each block sits on only one SM; no block is split.  However a block's
      warps are executed asynchronously (until synced).

   #. Each SM has 64KiB (?) fast memory to be divided between **shared** memory and an **L1 cache**.  Typically, 48KiB (96?) is used for the shared memory, to be divided among its resident blocks, but that can be changed.

   #. The 48KB L1 cache can cache local or global memory.

   #. Each SM has a read-only data cache of 48KB to cache the
      global constant memory.

   #. Each SM has 8 texture units, and many other graphics capabilities.

   #. Each SM has 256KB of L2 cache.

#. Grid-level resources:

   a. The blocks in a grid can be arranged into a 3D array.  
      up to :math:`(2^{31}-1, 2^{16}-1, 2^{16}-1)`.

   #. Blocks in a grid might run on different SMs.
      
   #. Blocks in a grid are queued and executed as resources are
      available, in an unpredictable parallel or serial order.
      Therefore they should be independent of each other.

   #. The number of instructions in a kernel is limited.

   #. Any thread can stop the kernel by calling **assert**.

#. Device-level resources:

   a. There is a large and slow 48GB **global memory**, which
      persists from kernel to kernel.

      Transactions to global memory are 128 bytes.

      Host memory can also be memory-mapped into global memory, although the
      I/O rate will be lower.

      Reading from global memory can take hundreds of cycles.  A warp that
      does this will be paused and another warp started.  Such context
      switching is very efficient.  Therefore device throughput stays high,
      although there is a latency.  This is called **Thread Level
      Parallelism (TLP)** and is a major reason for GPU performance.

      That assumes that an SM has enough active warps that there is always
      another warp available for execution.  That is a reason for having
      warps that do not use all the resources (registers etc) that they're
      allowed to.

   #. There is a 2MB L2 cache, for sharing data between SMs.
      
   #. There is a 64KiB Small and fast global **constant memory**, ,
      which also persists from kernel to kernel.  It is implemented as a
      piece of the global memory, made fast with caches.

      (Again, I'm still resolving this apparent contradiction).

   #. **Grid Management Unit (GMU)** schedules (pauses, executes, etc) grids on
      the device.  This is more important because grids can start other
      grids **(Dynamic Parallelism)**.

   #. **Hyper-Q**: 32 simultaneous CPU tasks can launch kernels into the
      queue; they don't block each other.  If one kernel is waiting, another runs.

   #. **CUDA Work Distributor (CWD)** dispatches 32 active grids at
      a time to the SMs.  There may be 1000s of grids queued and waiting.

   #. **GPU Direct**: Other devices can DMA the GPU memory.

   #. The base clock is 1607MHz.

   #. GFLOPS: 8873.

   #. Memory bandwidth: 320GB/s

#. GPU-level resources:

   a. Being a Geforce product, there are many graphics facilities that we're not using.

   #. There are 4 **Graphics processing clusters** (GPCs) to do graphics stuff.

   #. Several perspective projections can be computed in parallel, for systems with several displays.

   #. There's HW for texture processing.

#. Generational changes:

   a. With each new version, Nvidia tweaks the numbers.   Some get higher, others get lower.

      i. E.g., Maxwell had little HW for double precision, and so that was slow.

      #. Pascal's clock speed is much higher.      
      
#. Refs:

   a. The CUDA program deviceDrv.

   #. http://developer.download.nvidia.com/compute/cuda/compute-docs/cuda-performance-report.pdf

   #. http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_1080_Whitepaper_FINAL.pdf

   #. `Better Performance at Lower Occupancy <http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf>`_, 
      Vasily Volkov, UC Berkeley, 2010.

   #. https://www.pgroup.com/lit/articles/insider/v2n1a5.htm - well written but old.

   *(I'll keep adding to this. Suggestions are welcome.)*

More CUDA
---------

#. CUDA function qualifiers:

   a. *__global__*   device function called from host, starting a kernel.

   #. *__device__* device function called from device function.

   #. *__host__* (default)  host function called from host function.

#. CUDA variable qualifiers:

   a. *__shared__*
   #. *__device__* global
   #. *__device__ __managed__* automatically paged between host and device.
   #. *__constant__*
   #. (nothing) register if scalar, or local if array or if no more registers
      available.

#. If installing CUDA on your machine, this repository seems best:

   http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64

   That includes the Thrust headers but not example programs.


Unionfs: Linux trick of the day
-------------------------------

#. aka overlay FS, translucent FS.

#. If a, b are directories, and m is an empty directory, then

   unionfs -o cow a=RW:b m

   makes m to be a combo of a and b, with a being higher priority

#. Writing a file into m writes it in a.

#. Changing a file in b writes the new version into a

#. Deleting a file in b causes a white-out note to be stored in a.

#. Unmount it thus:

   fusermount -u m

#. None of this requires superuser.  

#. Application: making a read-only directory into a read-write directory.

#. Note: IBM had a commercial version of this idea in its CP/CMS OS in the 1960s.


Thrust
------

#. `Stanford's parallel course notes <../files/stanford/>`_.  We did up thru the lecture 6 and parts of 7 and 8.

   
