.. title: PAR Class 18, Mon 2021-03-29
.. slug: class18
.. date: 2021-03-29
.. tags: class
.. category: 
.. link: 
.. description: 
.. type: text
.. has_math: true

.. raw:: html

   <style> .red {color:red} </style>
   <style> .blue {color:blue} </style>

.. role:: red
.. role:: blue

.. sectnum::
.. contents:: Table of contents
..


Optional day off
----------------

Would the class like a day off in the next week or two?
   
Thrust
------

   
#. `Stanford's parallel course notes <../files/stanford/>`_.

   Starting at Lecture 8, slide 22.

   Lectures 9 and later will be left for you to browse if you're interested.

#. IMO the syntax for the zip iterator could have been simpler.  When I use it, I wrap it in a simpler interface.

#. Nvidia has (at least) 3 official locations for Thrust.  They don't all have the same version, and they don't all include docs and examples.

   a. As part of CUDA, /local/cuda/targets/x86_64-linux/include/thrust/

   #. As part of the HPC SDK.
  
   #. In the github repository, https://github.com/NVIDIA/thrust
   
#. The most comprehensive doc is online at https://docs.nvidia.com/cuda/thrust/index.html

   This is up to date and precise.  However, it is only a summary.

#. There is also http://thrust.github.io/doc/index.html , but it is badly written and slightly obsolete.

#. There are various easy-to-read tutorials online.  However, they are mostly obsolescent.  E.g., they don't use C++11 lambdas, which are a big help.

#. Also, none of the non-Nvidia docs mention the recent unified memory additions.
   
#. Look at some Thrust programs and documentation in 2021/files/thrust/ 

#. There are alternatives to Thrust, like Kokkos.

#. The alternatives are lower-level (= faster and harder to use) and newer (= possibly less debugged, fewer users).
   
#. For a while it looked like Nvidia had stopped developing Thrust, but they've started up again.  Good.
   
#. On parallel, 2021/files/thrust/ has many little demo programs from the Thrust distribution, with my additions.

#. **Thrust is fast because** the functions that look like they
   would need linear time really take only logarithmic time when run
   in parallel on enough threads.

#. In functions like **reduce** and **transform**, you often see an argument like **thrust::multiplies<float>()**.  The syntax is as follows:

   a. **thrust::multiplies<float>** is a class.

   #. It overloads **operator()**.

   #. However, in the call to **reduce**, **thrust::multiplies<float>()** calls the default
      constructor to construct a temporary variable of class
      **thrust::multiplies<float>**, and passes it to **reduce**.

   #. **reduce** will treat its argument as a function name and call it with arguments (two, for a binary operation like this one), triggering **operator()**.

   #. You may also create your own variable of that class, e.g., **thrust::multiplies<float> foo**.   Then you just say **foo** in the argument list, not **foo()**.

   #. The optimizing compiler will replace the **operator()** function call
      with the defining expression and then continue optimizing.  So, there
      is no overhead, unlike if you passed in a pointer to a function.

#. Sometimes, e.g., in **saxpy.cu**, you see **saxpy_functor(A)**.    

   a. The class **saxpy_functor** has a constructor taking one argument.   

   #. **saxpy_functor(A)** constructs and returns a variable of class **saxpy_functor** and stores **A** in the variable.

   #. The class also overloads **operator()**.

   #. Let's call the new variable **foo**.  Calling **foo(...)** like a function invokes **operator()** for
      **foo**; its execution uses the stored **A**.

   #. Effectively, we made a **closure** of **saxpy_functor**; that is, we
      bound a property and returned a new, more restricted, variable or
      class.
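
The functor points above can be tried on the host without CUDA.  The following is only a sketch under assumptions: it uses **std::multiplies**, **std::accumulate**, and **std::transform** as stand-ins for **thrust::multiplies**, **thrust::reduce**, and **thrust::transform**, and the **saxpy_functor** here is my guess at the shape of the one in **saxpy.cu**, not a copy of it.  The helper names (**product**, **product_named**, **saxpy**, **saxpy_lambda**) are made up for illustration.  The last function shows the same closure written as a C++11 lambda.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Assumed shape of the functor: the constructor stores A,
// and operator() uses the stored value on each call.
struct saxpy_functor {
    const float a;
    explicit saxpy_functor(float a_) : a(a_) {}
    float operator()(float x, float y) const { return a * x + y; }
};

// std::multiplies<float>() default-constructs a temporary functor object;
// accumulate then calls its operator() with two arguments per element.
float product(const std::vector<float>& v) {
    return std::accumulate(v.begin(), v.end(), 1.0f, std::multiplies<float>());
}

// Same thing with a named variable: pass foo, not foo().
float product_named(const std::vector<float>& v) {
    std::multiplies<float> foo;
    return std::accumulate(v.begin(), v.end(), 1.0f, foo);
}

// saxpy_functor(A) constructs a functor with A bound inside it.
std::vector<float> saxpy(float A, const std::vector<float>& x,
                         const std::vector<float>& y) {
    assert(x.size() == y.size());
    std::vector<float> out(x.size());
    std::transform(x.begin(), x.end(), y.begin(), out.begin(),
                   saxpy_functor(A));
    return out;
}

// The lambda captures A by value, which is the same closure idea.
std::vector<float> saxpy_lambda(float A, const std::vector<float>& x,
                                const std::vector<float>& y) {
    assert(x.size() == y.size());
    std::vector<float> out(x.size());
    std::transform(x.begin(), x.end(), y.begin(), out.begin(),
                   [A](float xi, float yi) { return A * xi + yi; });
    return out;
}
```

To try it, call these from a **main** of your own, e.g., **saxpy(2.0f, x, y)** with two equal-length vectors.  Because the functor's type is known at compile time, the compiler can inline **operator()**, which is the point made above about avoiding function-pointer overhead.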


Bug
===

I found and reported a bug in version 100904: it does not work with OpenMP.  The report was
immediately closed because they already knew about it.  They just hadn't told us users.

A while ago I reported a bug in nvcc, where it went into an infinite loop for a certain array size.   The next minor version of CUDA was released the next day.

Two observations:

#. I'm good at breaking software by expecting it to meet its specs.

#. Nvidia is responsive, which I like.

   
CUDACast videos on Thrust
=========================

I won't cover these in class; they're presented in case you're interested.

#. `CUDACast #15 - Introduction to Thrust <https://www.youtube.com/watch?v=mZJEbO9Eros>`_

#. `CUDACast #16 - Thrust Algorithms and Custom Operators <https://www.youtube.com/watch?v=xtWJCL7LMqU>`_


Alternate docs in parallel-class/2021/files/thrust/doc
======================================================

#.  An_Introduction_To_Thrust.pdf

#.  GTC_2010_Part_2_Thrust_By_Example.pdf

    We'll look at this starting at slide 27.   It shows parallel programming paradigms.