<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>PAR ECSE-4740-01 Applied Parallel Computing for Engineers, Spring 2018, Rensselaer Polytechnic Institute (Posts about class)</title><link>https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/</link><description></description><atom:link href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/categories/class.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><copyright>Contents © 2019 &lt;a href="mailto:frankwr@rpi.edu"&gt;W Randolph Franklin, RPI&lt;/a&gt; </copyright><lastBuildDate>Thu, 28 Feb 2019 20:11:24 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>PAR Class 14, Wed 2018-05-02</title><link>https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class14/</link><dc:creator>W Randolph Franklin, RPI</dc:creator><description>&lt;div&gt;&lt;style&gt; .red {color:red} &lt;/style&gt;
&lt;style&gt; .blue {color:blue} &lt;/style&gt;&lt;div class="contents topic" id="table-of-contents"&gt;
&lt;p class="topic-title first"&gt;Table of contents&lt;/p&gt;
&lt;ul class="auto-toc simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class14/#course-recap" id="id1"&gt;1   Course recap&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="course-recap"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class14/#id1"&gt;1   Course recap&lt;/a&gt;&lt;/h2&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;My teaching style is to work from particulars to the general.&lt;/li&gt;
&lt;li&gt;You've seen &lt;strong&gt;OpenMP&lt;/strong&gt;, a tool for shared memory parallelism.&lt;/li&gt;
&lt;li&gt;You've seen the architecture of NVidia's GPU, a widely used parallel system, and &lt;strong&gt;CUDA&lt;/strong&gt;, a tool for programming it.&lt;/li&gt;
&lt;li&gt;You've seen &lt;strong&gt;Thrust&lt;/strong&gt;, a tool on top of CUDA, built in the C++ STL style.&lt;/li&gt;
&lt;li&gt;You've seen how widely used numerical tools like &lt;strong&gt;BLAS&lt;/strong&gt; and &lt;strong&gt;FFT&lt;/strong&gt; have versions built on CUDA.&lt;/li&gt;
&lt;li&gt;You've had a chance to program in all of them on &lt;strong&gt;parallel.ecse&lt;/strong&gt;,  with dual 14-core Xeons, &lt;strong&gt;Pascal&lt;/strong&gt; NVidia board, and &lt;strong&gt;Xeon Phi&lt;/strong&gt; coprocessor.&lt;/li&gt;
&lt;li&gt;You've seen talks by leaders in high-performance computing, such as Jack Dongarra.&lt;/li&gt;
&lt;li&gt;You've seen quick references to parallel programming using  Matlab, Mathematica, and the cloud.&lt;/li&gt;
&lt;li&gt;Now, you can inductively reason towards general design rules for shared and non-shared parallel computers, and for the SW tools to exploit them.&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;&lt;/div&gt;</description><category>class</category><guid>https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class14/</guid><pubDate>Thu, 26 Apr 2018 04:00:00 GMT</pubDate></item><item><title>PAR Class 13, Wed 2018-04-25</title><link>https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/</link><dc:creator>W Randolph Franklin, RPI</dc:creator><description>&lt;div&gt;&lt;style&gt; .red {color:red} &lt;/style&gt;
&lt;style&gt; .blue {color:blue} &lt;/style&gt;&lt;div class="contents topic" id="table-of-contents"&gt;
&lt;p class="topic-title first"&gt;Table of contents&lt;/p&gt;
&lt;ul class="auto-toc simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/#inspiration-for-finishing-your-term-projects" id="id1"&gt;1   Inspiration for finishing your term projects&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/#parallel-computing-videos" id="id2"&gt;2   Parallel computing videos&lt;/a&gt;&lt;ul class="auto-toc"&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/#welcome-distributed-systems-in-one-lesson" id="id3"&gt;2.1   Welcome Distributed Systems in One Lesson&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/#paying-for-lunch-c-in-the-manycore-age" id="id4"&gt;2.2   Paying for Lunch: C++ in the ManyCore Age&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/#combine-lambdas-and-weak-ptrs-to-make-concurrency-easy" id="id5"&gt;2.3   Combine Lambdas and weak_ptrs to make concurrency easy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/#a-pragmatic-introduction-to-multicore-synchronization" id="id6"&gt;2.4   A Pragmatic Introduction to Multicore Synchronization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/#synchronization-blocking-non-blocking-1-2" id="id7"&gt;2.5   Synchronization - Blocking &amp;amp; Non-Blocking (1/2)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/#lock-free-programming-or-juggling-razor-blades-part-i" id="id8"&gt;2.6   Lock-Free Programming (or, Juggling Razor Blades), Part I&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/#debugging-videos" id="id9"&gt;3   Debugging videos&lt;/a&gt;&lt;ul class="auto-toc"&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/#uftrace-a-function-graph-tracer-for-c-c-userspace-programs" id="id10"&gt;3.1   uftrace: A function graph tracer for C/C++ userspace programs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/#concepts-in-5" id="id11"&gt;3.2   Concepts in 5&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/#programming-videos" id="id12"&gt;4   Programming videos&lt;/a&gt;&lt;ul class="auto-toc"&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/#modern-c-what-you-need-to-know" id="id13"&gt;4.1   Modern C++: What You Need to Know&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/#relevant-conferences" id="id14"&gt;5   Relevant conferences&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="inspiration-for-finishing-your-term-projects"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/#id1"&gt;1   Inspiration for finishing your term projects&lt;/a&gt;&lt;/h2&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="http://www.underhanded-c.org/"&gt;The Underhanded C Contest&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;"The goal of the contest is to write code that is as readable, clear, innocent and straightforward as possible, and yet it must fail to perform at its apparent function. To be more specific, it should do something subtly evil. Every year, we will propose a challenge to coders to solve a simple data processing problem, but with covert malicious behavior. Examples include miscounting votes, shaving money from financial transactions, or leaking information to an eavesdropper. The main goal, however, is to write source code that easily passes visual inspection by other programmers."&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="http://ioccc.org/"&gt;The International Obfuscated C Code Contest&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="https://www.awesomestories.com/asset/view/Space-Race-American-Rocket-Failures"&gt;https://www.awesomestories.com/asset/view/Space-Race-American-Rocket-Failures&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Moral: After early disasters, sometimes you can eventually get things to work.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="https://www.youtube.com/watch?v=fw_C_sbfyx8"&gt;The 'Wrong' Brothers Aviation's Failures (1920s)&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="https://www.youtube.com/watch?v=13qeX98tAS8"&gt;Early U.S. rocket and space launch failures and explosion&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="https://www.youtube.com/watch?v=CEFNjL86y9c"&gt;Numerous US Launch Failures&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div class="section" id="parallel-computing-videos"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/#id2"&gt;2   Parallel computing videos&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We'll see some subset of these.&lt;/p&gt;
&lt;div class="section" id="welcome-distributed-systems-in-one-lesson"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/#id3"&gt;2.1   Welcome Distributed Systems in One Lesson&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Niko Peikrishvili, 11 min&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://www.youtube.com/watch?v=T9ej3NcE2gQ"&gt;https://www.youtube.com/watch?v=T9ej3NcE2gQ&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This is the first of several short videos.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="paying-for-lunch-c-in-the-manycore-age"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/#id4"&gt;2.2   Paying for Lunch: C++ in the ManyCore Age&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;CppCon 2014: by Herb Sutter&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://www.youtube.com/watch?v=AfI_0GzLWQ8"&gt;https://www.youtube.com/watch?v=AfI_0GzLWQ8&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Published on Sep 29, 2014, 1h15m&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.cppcon.org"&gt;http://www.cppcon.org&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Presentation Slides, PDFs, Source Code and other presenter materials are available at: &lt;a class="reference external" href="https://github.com/CppCon/CppCon2014"&gt;https://github.com/CppCon/CppCon2014&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Concurrency is one of the major focuses of C++17 and one of the biggest challenges facing C++ programmers today. Hear what this panel of experts has to say about how to write concurrent C++ now and in the future.&lt;/p&gt;
&lt;p&gt;MODERATOR: Herb Sutter - Author, chair of the ISO C++ committee, software architect at Microsoft.&lt;/p&gt;
&lt;p&gt;SPEAKERS:&lt;/p&gt;
&lt;p&gt;PABLO HALPERN - Pablo Halpern has been programming in C++ since 1989 and has been a member of the C++ Standards Committee since 2007. He is currently the Parallel Programming Languages Architect at Intel Corp., where he coordinates the efforts of teams working on Cilk Plus, TBB, OpenMP, and other parallelism languages, frameworks, and tools targeted to C++, C, and Fortran users. Pablo came to Intel from Cilk Arts, Inc., which was acquired by Intel in 2009. During his time at Cilk Arts, he co-authored the paper "Reducers and other Cilk++ Hyperobjects", which won best paper at the SPAA 2009 conference. His current work is focused on creating simpler and more powerful parallel programming languages and tools for Intel's customers and promoting adoption of parallel constructs into the C++ and C standards. He lives with his family in southern New Hampshire, USA. When not working on parallel programming, he enjoys studying the viola, skiing, snowboarding, and watching opera. Twitter handle: @PabloGHalpern&lt;/p&gt;
&lt;p&gt;JARED HOBEROCK - Jared Hoberock is a research scientist at NVIDIA where he develops the Thrust parallel algorithms library and edits the Technical Specification on Extensions for Parallelism for C++.Website: &lt;a class="reference external" href="http://github.com/jaredhoberock"&gt;http://github.com/jaredhoberock&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;ARTUR LAKSBERG - Artur Laksberg leads the Visual C++ Libraries development team at Microsoft. His interests include concurrency, programming language and library design, and modern C++. Artur is one of the co-authors of the Parallel STL proposal; his team is now working on the prototype implementation of the proposal.&lt;/p&gt;
&lt;p&gt;ADE MILLER - Ade Miller writes C++ for fun. He wrote his first N-body model in BASIC on an 8-bit microcomputer 30 years ago and never really looked back. He started using C++ in the early 90s. Recently, he's written two books on parallel programming with C++; "C++ AMP: Accelerated Massive Parallelism with Microsoft Visual C++" and "Parallel Programming with Microsoft Visual C++". Ade spends the long winters in Washington contributing to the open source C++ AMP Algorithms Library and well as a few other projects. His summers are mostly spent crashing expensive bicycles into trees. Website: &lt;a class="reference external" href="http://www.ademiller.com/blogs/tech/"&gt;http://www.ademiller.com/blogs/tech/&lt;/a&gt; Twitter handle: @ademiller&lt;/p&gt;
&lt;p&gt;GOR NISHANOV - Gor Nishanov is a Principal Software Design Engineer on the Microsoft C++ team. He works on the 'await' feature. Prior to joining the C++ team, Gor worked on distributed systems in the Windows Clustering team.&lt;/p&gt;
&lt;p&gt;MICHAEL WONG - You can talk to me about anything including C++ (even C and that language that shall remain nameless but starts with F), Transactional Memory, Parallel Programming, OpenMP, astrophysics (where my degree came from), tennis (still trying to see if I can play for a living), travel, and the best food (which I am on a permanent quest to eat). Michael Wong is the CEO of OpenMP. He is the IBM and Canadian representative to the C++ Standard and OpenMP Committee. And did I forget to say he is a Director of ISOCPP.org and a VP, Vice-Chair of Programming Languages for Canada's Standard Council. He has so many titles, it's a wonder he can get anything done. Oh, and he chairs the WG21 SG5 Transactional Memory group, and is the co-author of a number of C++11/OpenMP/TM features including generalized attributes, user-defined literals, inheriting constructors, weakly ordered memory models, and explicit conversion operators. Having been the past C++ team lead for IBM's XL C++ compiler means he has been messing around with designing C++ compilers for twenty years. His current research interest, i.e. what he would like to do if he had time, is in the area of parallel programming, transactional memory, C++ benchmark performance, object model, generic programming and template metaprogramming. He holds a B.Sc. from the University of Toronto, and a Masters in Mathematics from the University of Waterloo. He has been asked to speak at ACCU, C++Now, Meeting C++, CASCON, and many universities, research centers and companies, except his own, where he has to listen. Now he and his wife love to teach their two children to be curious about everything.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="combine-lambdas-and-weak-ptrs-to-make-concurrency-easy"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/#id5"&gt;2.3   Combine Lambdas and weak_ptrs to make concurrency easy&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;a class="reference external" href="https://www.youtube.com/watch?v=fEnnmpdZllQ"&gt;https://www.youtube.com/watch?v=fEnnmpdZllQ&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;CppCon 2016: Dan Higgins,  4min&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="a-pragmatic-introduction-to-multicore-synchronization"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/#id6"&gt;2.4   A Pragmatic Introduction to Multicore Synchronization&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;by Samy Al Bahra.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://www.youtube.com/watch?v=LX4ugnzwggg"&gt;https://www.youtube.com/watch?v=LX4ugnzwggg&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Published on Jun 15, 2016, 1h 2m.&lt;/p&gt;
&lt;p&gt;This talk will introduce attendees to the challenges involved in achieving high performance multicore synchronization. The tour will begin with fundamental scalability bottlenecks in multicore systems and memory models, and then extend to advanced synchronization techniques involving scalable locking and lock-less synchronization. Expect plenty of hacks and real-world war stories in the fight for vertical scalability. Some of the topics introduced include memory coherence and consistency, memory organization, scalable locking, biased asymmetric synchronization, non-blocking synchronization and safe memory reclamation.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="synchronization-blocking-non-blocking-1-2"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/#id7"&gt;2.5   Synchronization - Blocking &amp;amp; Non-Blocking (1/2)&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;by Petr Kuznetsov, 15min&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://www.youtube.com/watch?v=k8uOOvd6Uj8"&gt;https://www.youtube.com/watch?v=k8uOOvd6Uj8&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="lock-free-programming-or-juggling-razor-blades-part-i"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/#id8"&gt;2.6   Lock-Free Programming (or, Juggling Razor Blades), Part I&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;CppCon 2014, by Herb Sutter&lt;/p&gt;
&lt;p&gt;Published on Oct 16, 2014&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.cppcon.org"&gt;http://www.cppcon.org&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Presentation Slides, PDFs, Source Code and other presenter materials are available at: &lt;a class="reference external" href="https://github.com/CppCon/CppCon2014"&gt;https://github.com/CppCon/CppCon2014&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Example-driven talk on how to design and write lock-free algorithms and data structures using C++ atomic -- something that can look deceptively simple, but contains very deep topics. (Important note: This is not the same as my "atomic Weapons" talk; that talk was about the "what they are and why" of the C++ memory model and atomics, and did not cover how to actually use atomics to implement highly concurrent algorithms and data structures.)&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="debugging-videos"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/#id9"&gt;3   Debugging videos&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="uftrace-a-function-graph-tracer-for-c-c-userspace-programs"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/#id10"&gt;3.1   uftrace: A function graph tracer for C/C++ userspace programs&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;a class="reference external" href="https://www.youtube.com/watch?v=LNav5qvyK7I"&gt;https://www.youtube.com/watch?v=LNav5qvyK7I&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;6min&lt;/p&gt;
&lt;p&gt;Published on Oct 7, 2016&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://CppCon.org"&gt;http://CppCon.org&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Presentation Slides, PDFs, Source Code and other presenter materials are available at: &lt;a class="reference external" href="https://github.com/cppcon/cppcon2016"&gt;https://github.com/cppcon/cppcon2016&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This is one of the &lt;a class="reference external" href="https://www.youtube.com/playlist?list=PLHTh1InhhwT6aWgfHhrvYY3s-lqk0Y9iP"&gt;CppCon 2016 Lightning Talks&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="concepts-in-5"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/#id11"&gt;3.2   Concepts in 5&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;a class="reference external" href="https://www.youtube.com/watch?v=H8HplZtVGT0&amp;amp;list=PLHTh1InhhwT6aWgfHhrvYY3s-lqk0Y9iP&amp;amp;index=14"&gt;https://www.youtube.com/watch?v=H8HplZtVGT0&amp;amp;list=PLHTh1InhhwT6aWgfHhrvYY3s-lqk0Y9iP&amp;amp;index=14&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;CppCon 2016: David Sankel&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="programming-videos"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/#id12"&gt;4   Programming videos&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="modern-c-what-you-need-to-know"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/#id13"&gt;4.1   Modern C++: What You Need to Know&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;a class="reference external" href="https://www.youtube.com/watch?v=TJHgp1ugKGM"&gt;https://www.youtube.com/watch?v=TJHgp1ugKGM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;by Paulo Portela&lt;/p&gt;
&lt;p&gt;Published on Apr 7, 2014, 1h&lt;/p&gt;
&lt;p&gt;This talk will give an update on recent progress and near-future directions for C++, both at Microsoft and across the industry. This is a great introduction to the current state of the language, including a glimpse into the future of general purpose, performance-intensive, power-friendly, powerful native programming.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="relevant-conferences"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/#id14"&gt;5   Relevant conferences&lt;/a&gt;&lt;/h2&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.cppcon.org"&gt;http://www.cppcon.org&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://supercomputing.org/"&gt;Supercomputing conference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://www.nvidia.com/en-us/gtc/"&gt;Nvidia GPU technology conference&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;&lt;/div&gt;</description><category>class</category><guid>https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class13/</guid><pubDate>Wed, 25 Apr 2018 04:00:00 GMT</pubDate></item><item><title>PAR Class 12, Wed 2018-04-18</title><link>https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class12/</link><dc:creator>W Randolph Franklin, RPI</dc:creator><description>&lt;div&gt;&lt;style&gt; .red {color:red} &lt;/style&gt;
&lt;style&gt; .blue {color:blue} &lt;/style&gt;&lt;div class="contents topic" id="table-of-contents"&gt;
&lt;p class="topic-title first"&gt;Table of contents&lt;/p&gt;
&lt;ul class="auto-toc simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class12/#git" id="id1"&gt;1   Git&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class12/#software-tips" id="id2"&gt;2   Software tips&lt;/a&gt;&lt;ul class="auto-toc"&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class12/#freeze-decisions-early-sw-design-paradigm" id="id3"&gt;2.1   Freeze decisions early: SW design paradigm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class12/#faster-graphical-access-to-parallel-ecse" id="id4"&gt;2.2   Faster graphical access to parallel.ecse&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class12/#jack-dongarra-videos" id="id5"&gt;3   Jack Dongarra videos&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class12/#more-parallel-tools" id="id6"&gt;4   More parallel tools&lt;/a&gt;&lt;ul class="auto-toc"&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class12/#cufft-notes" id="id7"&gt;4.1   cuFFT Notes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class12/#cublas-etc-notes" id="id8"&gt;4.2   cuBLAS etc Notes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class12/#matlab" id="id9"&gt;4.3   Matlab&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class12/#mathematica-in-parallel" id="id10"&gt;4.4   Mathematica in parallel&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class12/#nvidia-videos" id="id11"&gt;5   Nvidia videos&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class12/#cloud-computing" id="id12"&gt;6   Cloud computing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="git"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class12/#id1"&gt;1   Git&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Git is good for keeping several versions of a project simultaneously.  A quick git
intro:&lt;/p&gt;
&lt;p&gt;Create a dir for the project:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
mkdir PROJECT; cd PROJECT
&lt;/pre&gt;
&lt;p&gt;Initialize:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
git init
&lt;/pre&gt;
&lt;p&gt;Create a branch (you can do this several times, once the repo has at least one commit):&lt;/p&gt;
&lt;pre class="literal-block"&gt;
git branch MYBRANCHNAME
&lt;/pre&gt;
&lt;p&gt;Go to a branch:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
git checkout MYBRANCHNAME
&lt;/pre&gt;
&lt;p&gt;Do things:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
vi, make, ....
&lt;/pre&gt;
&lt;p&gt;Save it:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
git add .; git commit -mCOMMENT
&lt;/pre&gt;
&lt;p&gt;Repeat.&lt;/p&gt;
&lt;p&gt;I might use this to modify a program for class.&lt;/p&gt;
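&lt;p&gt;Putting the steps together, here is one possible end-to-end session (the branch and file names are made up; it assumes git is installed):&lt;/p&gt;

```shell
mkdir PROJECT; cd PROJECT
git init
# identify yourself to git (needed once per repo, or once globally)
git config user.name 'Your Name'; git config user.email you@example.com
printf 'version 1\n' > prog.txt              # do things
git add .; git commit -m 'first version'     # save it
git branch experiment                        # create a branch
git checkout experiment                      # go to the branch
printf 'version 2\n' > prog.txt
git add .; git commit -m 'try something'     # repeat
git checkout -                               # back to the original branch
cat prog.txt                                 # prints: version 1
```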
&lt;/div&gt;
&lt;div class="section" id="software-tips"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class12/#id2"&gt;2   Software tips&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="freeze-decisions-early-sw-design-paradigm"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class12/#id3"&gt;2.1   Freeze decisions early: SW design paradigm&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;One of my rules is to push design decisions to take effect as
early in the program's execution as possible.  Constructing
variables at compile time is best, at function-call time is
second best, and on the heap is worst.&lt;/p&gt;
&lt;ol class="loweralpha"&gt;
&lt;li&gt;&lt;p class="first"&gt;If I have to construct variables on the heap, I construct few and large
variables, never many small ones.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Often I compile the max dataset size into the program, which permits
constructing the arrays at compile time.  Recompiling for a larger
dataset is quick (unless you're using CUDA).&lt;/p&gt;
&lt;p&gt;Accessing this type of variable uses one less level of pointer than
accessing a variable on the heap.  I don't know whether this is faster
with a good optimizing compiler, but it's probably not slower.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;If the data will require a dataset with unpredictably sized components, such as a ragged array, then I may do the following.&lt;/p&gt;
&lt;ol class="lowerroman simple"&gt;
&lt;li&gt;Read the data once to accumulate the necessary statistics.&lt;/li&gt;
&lt;li&gt;Construct the required ragged array.&lt;/li&gt;
&lt;li&gt;Reread the data and populate the array.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
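&lt;p&gt;As a concrete illustration of step (c), here is a small sketch (in Python for brevity; the same pattern works in C++) that builds a ragged array as one flat block plus a row-offset table, using two passes over the data:&lt;/p&gt;

```python
# Two-pass construction of a ragged array: one large allocation,
# never many small ones.  'rows' stands in for a data file read twice.
rows = [[1, 2], [3], [4, 5, 6]]

# Pass i: read the data once to accumulate the necessary statistics.
start = [0]
for r in rows:
    start.append(start[-1] + len(r))

# Pass ii: construct the required ragged array as a single flat block.
data = [0] * start[-1]

# Pass iii: reread the data and populate the array.
for i, r in enumerate(rows):
    for j, v in enumerate(r):
        data[start[i] + j] = v

print(data[start[2]])   # first element of row 2; prints 4
```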
&lt;/div&gt;
&lt;div class="section" id="faster-graphical-access-to-parallel-ecse"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class12/#id4"&gt;2.2   Faster graphical access to parallel.ecse&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;X over ssh is very slow.&lt;/p&gt;
&lt;p&gt;Here are some things I've discovered that help, and that work sometimes.&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;Use &lt;strong&gt;xpra&lt;/strong&gt;; here's an example:&lt;/p&gt;
&lt;ol class="loweralpha"&gt;
&lt;li&gt;&lt;p class="first"&gt;On parallel.ecse:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
xpra start :77; DISPLAY=:77 xeyes&amp;amp;
&lt;/pre&gt;
&lt;p&gt;Don't everyone use 77; pick your own number in the range 20-99.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;On server, i.e., your machine:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
xpra attach ssh:parallel.ecse.rpi.edu:77
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;I suspect that security is weak.  When you start an xpra session, In suspect that anyone on parallel.ecse can display to it.  I suspect that anyone with ssh access to parallel.ecse can try to attach to it, and the that 1st person wins.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Use &lt;strong&gt;nx&lt;/strong&gt;, which needs a server, e.g., &lt;a class="reference external" href="https://help.ubuntu.com/community/FreeNX"&gt;FreeNX&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="jack-dongarra-videos"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class12/#id5"&gt;3   Jack Dongarra videos&lt;/a&gt;&lt;/h2&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="https://www.youtube.com/watch?v=mhfFFJu2Hj0"&gt;Sunway TaihuLight's strengths and weaknesses highlighted&lt;/a&gt;.  9 min.  8/21/2016.&lt;/p&gt;
&lt;p&gt;This is the new fastest known machine on top500.  A machine with many Intel Xeon Phi coprocessors is now 2nd, Nvidia K20 is 3rd, and some machine built by a company down the river is 4th.  These last 3 machines have been at the top for a surprisingly long time.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="https://www.youtube.com/watch?v=D1hfrtoVZDo"&gt;An Overview of High Performance Computing and Challenges for the Future&lt;/a&gt;.  57min, 11/16/2016.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div class="section" id="more-parallel-tools"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class12/#id6"&gt;4   More parallel tools&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="cufft-notes"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class12/#id7"&gt;4.1   cuFFT Notes&lt;/a&gt;&lt;/h3&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.bu.edu/pasi/files/2011/07/Lecture83.pdf"&gt;GPU Computing with CUDA&lt;/a&gt; Lecture    8 - CUDA Libraries - CUFFT, PyCUDA from Christopher Cooper, BU&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://www.youtube.com/watch?v=My8YJ3v8XFE%7CCUDACast"&gt;video&lt;/a&gt; #8 -
CUDA 5.5 cuFFT FFTW API Support.  3 min.&lt;/li&gt;
&lt;li&gt;cuFFT is inspired by FFTW (the Fastest Fourier Transform in the West),
whose authors claim it is as fast as commercial FFT packages.&lt;/li&gt;
&lt;li&gt;I.e., sometimes commercial packages may be worth the money.&lt;/li&gt;
&lt;li&gt;Although the FFT is taught for N a power of two, users often want to
process other dataset sizes.&lt;/li&gt;
&lt;li&gt;The problem is that the optimal recursion method, and the relevant
coefficients, depends on the prime factors of N.&lt;/li&gt;
&lt;li&gt;FFTW and cuFFT determine the good solution procedure for the particular N.&lt;/li&gt;
&lt;li&gt;Since this computation takes time, they store the method in a &lt;strong&gt;plan&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;You can then apply the plan to many datasets.&lt;/li&gt;
&lt;li&gt;If you're going to process very many datasets, you can tell FFTW or
cuFFT to run sample timing experiments on your system, to help it
devise the best plan.&lt;/li&gt;
&lt;li&gt;That's a nice strategy that some other numerical SW also uses.&lt;/li&gt;
&lt;li&gt;One example is &lt;a class="reference external" href="http://math-atlas.sourceforge.net/"&gt;Automatically Tuned Linear Algebra Software (ATLAS)&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div class="section" id="cublas-etc-notes"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class12/#id8"&gt;4.2   cuBLAS etc Notes&lt;/a&gt;&lt;/h3&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.netlib.org/blas/"&gt;BLAS&lt;/a&gt; is an API for a set of simple matrix
and vector functions, such as multiplying a vector by a matrix.&lt;/li&gt;
&lt;li&gt;These functions' efficiency is important since they are the basis for
widely used numerical applications.&lt;/li&gt;
&lt;li&gt;Indeed you usually don't call BLAS functions directly, but use
higher-level packages like LAPACK that use BLAS.&lt;/li&gt;
&lt;li&gt;There are many implementations, free and commercial, of BLAS.&lt;/li&gt;
&lt;li&gt;cuBLAS is one.&lt;/li&gt;
&lt;li&gt;One reason that Fortran is still used is that, in the past, it was easier
to write efficient Fortran programs than C or C++ programs for these
applications.&lt;/li&gt;
&lt;li&gt;There are other, very efficient, C++ numerical packages.  (I can list
some, if there's interest).&lt;/li&gt;
&lt;li&gt;Their efficiency often comes from aggressively using C++ templates.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://solarianprogrammer.com/2012/05/31/matrix-multiplication-cuda-cublas-curand-thrust/"&gt;Matrix mult example&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div class="section" id="matlab"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class12/#id9"&gt;4.3   Matlab&lt;/a&gt;&lt;/h3&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;Good for applications that look like matrices.&lt;/p&gt;
&lt;p&gt;Considerable contortions required for, e.g., a general graph.  You'd
represent that with a large sparse adjacency matrix.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Using explicit &lt;em&gt;for&lt;/em&gt; loops is slow.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Efficient execution when using builtin matrix functions,&lt;/p&gt;
&lt;p&gt;but can be difficult to write your algorithm that way, and&lt;/p&gt;
&lt;p&gt;difficult to read the code.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Very expensive and getting more so.&lt;/p&gt;
&lt;p&gt;Many separately priced apps.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Uses state-of-the-art numerical algorithms.&lt;/p&gt;
&lt;p&gt;E.g., to solve large sparse overdetermined linear systems.&lt;/p&gt;
&lt;p&gt;Better than Mathematica.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Most or all such algorithms also freely available as C++ libraries.&lt;/p&gt;
&lt;p&gt;However, which library to use?&lt;/p&gt;
&lt;p&gt;Complicated calling sequences.&lt;/p&gt;
&lt;p&gt;Obscure C++ template error messages.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Graphical output is mediocre.&lt;/p&gt;
&lt;p&gt;Mathematica is better.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Various ways Matlab can execute in parallel&lt;/p&gt;
&lt;ol class="loweralpha"&gt;
&lt;li&gt;&lt;p class="first"&gt;Operations on arrays can execute in parallel.&lt;/p&gt;
&lt;p&gt;E.g. B=SIN(A) where A is a matrix.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Automatic multithreading by some functions&lt;/p&gt;
&lt;p&gt;Various functions, like INV(a), automatically use perhaps 8 cores.&lt;/p&gt;
&lt;p&gt;The '8' is a license limitation.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://www.mathworks.com/matlabcentral/answers/95958"&gt;Which MATLAB functions benefit from multithreaded computation?&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;PARFOR&lt;/p&gt;
&lt;p&gt;Like FOR, but multithreaded.&lt;/p&gt;
&lt;p&gt;However, FOR is slow.&lt;/p&gt;
&lt;p&gt;Many restrictions, e.g., cannot be nested.&lt;/p&gt;
&lt;p&gt;Matlab's &lt;a class="reference external" href="http://www.mathworks.com/help/distcomp/introduction-to-parallel-solutions.html"&gt;introduction to parallel solutions&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Start pools first with: MATLABPOOL OPEN 12&lt;/p&gt;
&lt;p&gt;Limited to 12 threads.&lt;/p&gt;
&lt;p&gt;Can do reductions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Parallel Computing Server&lt;/p&gt;
&lt;p&gt;This runs on a parallel machine, including Amazon EC2.&lt;/p&gt;
&lt;p&gt;Your client sends batch or interactive jobs to it.&lt;/p&gt;
&lt;p&gt;Many Matlab toolboxes are not licensed to use it.&lt;/p&gt;
&lt;p&gt;This makes it much less useful.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;GPU computing&lt;/p&gt;
&lt;p&gt;Create an array on device with gpuArray&lt;/p&gt;
&lt;p&gt;Run builtin functions on it.&lt;/p&gt;
&lt;p&gt;Matlab's &lt;a class="reference external" href="http://www.mathworks.com/help/distcomp/run-built-in-functions-on-a-gpu.html"&gt;run built in functions on a gpu&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div class="section" id="mathematica-in-parallel"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class12/#id10"&gt;4.4   Mathematica in parallel&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;You terminate an input command with &lt;em&gt;shift-enter&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Some Mathematica commands:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
Sin[1.]
Plot[Sin[x],{x,-2,2}]
a=Import[
 "/opt/parallel/mathematica/mtn1.dat"]
Information[a]
Length[a]
b=ArrayReshape[a,{400,400}]
MatrixPlot[b]
ReliefPlot[b]
ReliefPlot[b,Method-&amp;gt;"AspectBasedShading"]
ReliefPlot[MedianFilter[b,1]]
Dimensions[b]
Eigenvalues[b]   (* when you get bored waiting, type alt-. to abort *)
Eigenvalues[b+0.0]
Table[ {x^i y^j,x^j y^i},{i,2},{j,2}]
Flatten[Table[ {x^i y^j,x^j y^i},{i,2},{j,2}],1]
StreamPlot[{x*y,x+y},{x,-3,3},{y,-3,3}]
$ProcessorCount
$ProcessorType
(* select Parallel Kernel Configuration and Status in the Evaluation menu *)
ParallelEvaluate[$ProcessID]
PrimeQ[101]
Parallelize[Table[PrimeQ[n!+1],{n,400,500}]]
merQ[n_]:=PrimeQ[2^n-1]
Select[Range[5000],merQ]
ParallelSum[Sin[x+0.],{x,0,100000000}]
Parallelize[  Select[Range[5000],merQ]]
Needs["CUDALink`"]  (* note the back quote *)
CUDAInformation[]
Manipulate[n, {n, 1.1, 20.}]
Plot[Sin[x], {x, 1., 20.}]
Manipulate[Plot[Sin[x], {x, 1., n}], {n, 1.1, 20.}]
Integrate[Sin[x]^3, x]
Manipulate[Integrate[Sin[x]^n, x], {n, 0, 20}]
Manipulate[{n, FactorInteger[n]}, {n, 1, 100, 1}]
Manipulate[Plot[Sin[a x] + Sin[b x], {x, 0, 10}],
    {a, 1, 4}, {b, 1, 4}]
&lt;/pre&gt;
&lt;p&gt;Unfortunately there's a problem with the Mathematica-CUDA interface
that I'm still debugging.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="nvidia-videos"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class12/#id11"&gt;5   Nvidia videos&lt;/a&gt;&lt;/h2&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="https://www.youtube.com/watch?v=Usl_TCUTWD8"&gt;HPC and Supercomputing at GTC 2017&lt;/a&gt; 1 min&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://www.youtube.com/watch?v=fmVWLr0X1Sk"&gt;NVIDIA Self-Driving Car Demo at CES 2017&lt;/a&gt; 2 min&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://www.youtube.com/watch?v=Xt3WDUIb1xA"&gt;How Nvidia Went From Gaming to GPUs&lt;/a&gt; 3 min&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://www.youtube.com/watch?v=2iqAZkjJCY4"&gt;Nvidia Volta To Be Released Q3 2017, Say Rumours | RX 480 Can be Flashed to RX 580&lt;/a&gt; 10 min.  4/19/2017&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://www.youtube.com/watch?v=ljSpat74w10"&gt;NVIDIA Opening Keynote Highlights at CES 2017&lt;/a&gt; 37 min&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div class="section" id="cloud-computing"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class12/#id12"&gt;6   Cloud computing&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The material is from Wikipedia, which appeared to be better than any
other source I could find.&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;Hierarchy:&lt;/p&gt;
&lt;ol class="loweralpha simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/IaaS"&gt;IaaS&lt;/a&gt; (Infrastructure as a Service)&lt;ol class="lowerroman"&gt;
&lt;li&gt;Sample functionality:  VM, storage&lt;/li&gt;
&lt;li&gt;Examples:&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Google_Compute_Engine"&gt;Google_Compute_Engine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Amazon_Web_Services"&gt;Amazon_Web_Services&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/OpenStack"&gt;OpenStack&lt;/a&gt; : compute, storage, networking, dashboard&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/PaaS"&gt;PaaS&lt;/a&gt; (Platform ...)&lt;ol class="lowerroman"&gt;
&lt;li&gt;Sample functionality: OS, Web server, database server&lt;/li&gt;
&lt;li&gt;Examples:&lt;ol class="loweralpha"&gt;
&lt;li&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/OpenShift"&gt;OpenShift&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Cloud_Foundry"&gt;Cloud_Foundry&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Hadoop"&gt;Hadoop&lt;/a&gt; :&lt;ol class="arabic"&gt;
&lt;li&gt;distributed FS, MapReduce&lt;/li&gt;
&lt;li&gt;derived from the Google File System and MapReduce&lt;/li&gt;
&lt;li&gt;used by Facebook etc.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;Now, people often run &lt;a class="reference external" href="https://spark.apache.org/"&gt;Apache Spark™ - Lightning-Fast Cluster Computing&lt;/a&gt; instead of Hadoop, because Spark is faster.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/SaaS"&gt;SaaS&lt;/a&gt; (Software ...)&lt;ol class="lowerroman"&gt;
&lt;li&gt;Sample functionality:  email, gaming, CRM, ERP&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Cloud_computing_comparison"&gt;Cloud_computing_comparison&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Virtual machine.&lt;/p&gt;
&lt;p&gt;The big question is, at what level does the virtualization occur?  Do you duplicate the whole file system and OS, even emulate the HW, or just try to isolate files and processes in the same OS?&lt;/p&gt;
&lt;ol class="loweralpha simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Virtualization"&gt;Virtualization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Hypervisor"&gt;Hypervisor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Xen"&gt;Xen&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Kernel-based_Virtual_Machine"&gt;Kernel-based_Virtual_Machine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/QEMU"&gt;QEMU&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/VMware"&gt;VMware&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Containers, e.g., Docker&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Comparison_of_platform_virtual_machines"&gt;Comparison_of_platform_virtual_machines&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Distributed storage&lt;/p&gt;
&lt;ol class="loweralpha simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Virtual_file_system"&gt;Virtual_file_system&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Lustre_(file_system)"&gt;Lustre_(file_system)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Comparison_of_distributed_file_systems"&gt;Comparison_of_distributed_file_systems&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Hadoop_distributed_file_system"&gt;Hadoop_distributed_file_system&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;See also&lt;/p&gt;
&lt;ol class="loweralpha simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/VNC"&gt;VNC&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Grid_computing"&gt;Grid_computing&lt;/a&gt;&lt;ol class="lowerroman"&gt;
&lt;li&gt;decentralized, heterogeneous&lt;/li&gt;
&lt;li&gt;used for major projects like protein folding&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;&lt;/div&gt;</description><category>class</category><guid>https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class12/</guid><pubDate>Wed, 18 Apr 2018 04:00:00 GMT</pubDate></item><item><title>PAR Class 11, Wed 2018-04-11</title><link>https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class11/</link><dc:creator>W Randolph Franklin, RPI</dc:creator><description>&lt;div&gt;&lt;style&gt; .red {color:red} &lt;/style&gt;
&lt;style&gt; .blue {color:blue} &lt;/style&gt;&lt;div class="contents topic" id="table-of-contents"&gt;
&lt;p class="topic-title first"&gt;Table of contents&lt;/p&gt;
&lt;ul class="auto-toc simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class11/#intel-xeon-phi-7120a" id="id2"&gt;1   Intel Xeon Phi 7120A&lt;/a&gt;&lt;ul class="auto-toc"&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class11/#in-general" id="id3"&gt;1.1   In general&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class11/#parallel-ecse-s-mic" id="id4"&gt;1.2   parallel.ecse's mic&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class11/#programming-the-mic" id="id5"&gt;1.3   Programming  the mic&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class11/#intel-compilers-on-parallel" id="id6"&gt;2   Intel compilers on parallel&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class11/#programming-the-mic-ctd" id="id7"&gt;3   Programming the MIC (ctd)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class11/#mic0-xeon-phi-setup" id="id8"&gt;4   Mic0 (Xeon Phi) setup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class11/#boinc" id="id9"&gt;5   Boinc&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class11/#quantum-physics-talk-at-4pm-today" id="id10"&gt;6   Quantum physics talk at 4pm today&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="intel-xeon-phi-7120a"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class11/#id2"&gt;1   Intel Xeon Phi 7120A&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="in-general"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class11/#id3"&gt;1.1   In general&lt;/a&gt;&lt;/h3&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;The Xeon Phi is Intel's brand name for their &lt;strong&gt;MIC&lt;/strong&gt; (for Many Integrated Core Architecture).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;The 7120a is Intel's Knights Landing (1st generation) MIC architecure, launched in 2014.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;It has 61 cores running about 244 threads clocked at about 1.3GHz.&lt;/p&gt;
&lt;p&gt;Having several threads per core helps to overcome latency in fetching stuff.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;It has 16GB of memory accessible at 352 GB/s, 30BM L2 cache, and peaks at 1TFlops double precision.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;It is a coprocessor on a card accessible from a host CPU on a local network.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;It is intended as a supercomputing competitor to Nvidia.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;The mic architecture is quite similar to the Xeon.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;However executables from one don't run on the other, unless the source was compiled to include both versions in the executable file.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;The mic has been tuned to emphasize floating performance at the expense of, e.g., speculative execution.&lt;/p&gt;
&lt;p&gt;This helps to make it competitive with Nvidia, even though Nvidia GPUs can have many more cores.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Its OS is busybox, an embedded version of linux.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;The SW is called MPSS (Manycore Platform Software Stack).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;The mic can be integrated with the host in various ways that I haven't (yet) implemented.&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;Processes on the host can execute subprocesses on the device, as happens with Nvidia CUDA.&lt;/li&gt;
&lt;li&gt;E.g., OpenMP on the host can run parallel threads on the mic.&lt;/li&gt;
&lt;li&gt;The mic can page virtual memory from the host.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;The fastest machine on top500.org a few years ago used Xeon Phi cards.&lt;/p&gt;
&lt;p&gt;The 2nd used Nvidia K20 cards, and the 3rd fastest was an IBM Bluegene.&lt;/p&gt;
&lt;p&gt;So, my course lets you use the 2 fastest architectures, and there's another course available at RPI for the 3rd.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Information:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Xeon_Phi"&gt;https://en.wikipedia.org/wiki/Xeon_Phi&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://ark.intel.com/products/80555/Intel-Xeon-Phi-Coprocessor-7120A-16GB-1_238-GHz-61-core"&gt;http://ark.intel.com/products/80555/Intel-Xeon-Phi-Coprocessor-7120A-16GB-1_238-GHz-61-core&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.intel.com/content/www/us/en/products/processors/xeon-phi/xeon-phi-processors.html"&gt;http://www.intel.com/content/www/us/en/products/processors/xeon-phi/xeon-phi-processors.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.intel.com/content/www/us/en/architecture-and-technology/many-integrated-core/intel-many-integrated-core-architecture.html"&gt;http://www.intel.com/content/www/us/en/architecture-and-technology/many-integrated-core/intel-many-integrated-core-architecture.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss"&gt;https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://pleiades.ucsc.edu/hyades/MIC_QuickStart_Guide"&gt;https://pleiades.ucsc.edu/hyades/MIC_QuickStart_Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div class="section" id="parallel-ecse-s-mic"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class11/#id4"&gt;1.2   parallel.ecse's mic&lt;/a&gt;&lt;/h3&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;The hostname (of this particular MIC) is &lt;strong&gt;parallel-mic0&lt;/strong&gt; or &lt;strong&gt;mic0&lt;/strong&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;The local filesystem is in RAM and is reinitialized when mic0 is rebooted.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Parallel:/home and /parallel-class are NFS exported to mic0.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;/home can be used to move files back and forth.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;All the user accounts on parallel were given accounts on mic0.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;You can ssh to mic0 from parallel.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Your current parallel ssh key pair should work.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Your parallel login password as of a few days ago should work on
mic0.&lt;/p&gt;
&lt;p&gt;However, future changes to your parallel password will not
propagate to mic0 and you cannot change your mic0 password.&lt;/p&gt;
&lt;p&gt;(The mic0 setup snapshotted parallel's accounts and created a
read-only image to boot mic0 from.  Any changes to
mic0:/etc/shadow are reverted when mic0 reboots.)&lt;/p&gt;
&lt;p&gt;So use your public key.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div class="section" id="programming-the-mic"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class11/#id5"&gt;1.3   Programming  the mic&lt;/a&gt;&lt;/h3&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;Parallel:/parallel-class/mic/bin has versions of gcc, g++,
etc, with names like k1om-mpss-linux-g++ .&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;These run on parallel and produce executable files that run (only) on mic0.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Here's an example of compiling (on parallel) a C program in /parallel-class/mic&lt;/p&gt;
&lt;pre class="literal-block"&gt;
bin/k1om-mpss-linux-gcc hello.c -o hello-mic
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Run it thus from parallel (it runs on mic0):&lt;/p&gt;
&lt;pre class="literal-block"&gt;
ssh mic0  /parallel-class/mic/hello-mic
&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="intel-compilers-on-parallel"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class11/#id6"&gt;2   Intel compilers on parallel&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Note: currently they don't work, but maybe they will soon.&lt;/em&gt;&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;Intel Parallel Studio XE Cluster 2017 is now installed on &lt;tt class="docutils literal"&gt;parallel&lt;/tt&gt;, in
&lt;tt class="docutils literal"&gt;/opt/intel/&lt;/tt&gt; .  It is a large package with compilers, debuggers, analyzers,
MPI, etc, etc.  There is is extensive doc on Intel's web site.  Have fun.&lt;/p&gt;
&lt;p&gt;Students and free SW developers can also get free licenses for their
machines.  Commercial licenses cost $thousands.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="https://software.intel.com/en-us/videos/whats-new-in-intel-parallel-studio-xe-2017"&gt;What’s Inside Intel Parallel Studio XE 2017&lt;/a&gt;.  There're &lt;a class="reference external" href="https://software.intel.com/sites/default/files/managed/2a/1f/intel-parallel-studio-xe-2017-create-faster-code-faster.pdf"&gt;PDF slides&lt;/a&gt;, a webinar, and &lt;a class="reference external" href="https://software.intel.com/en-us/intel-parallel-studio-xe-support/training"&gt;training stuff&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="https://goparallel.sourceforge.net/want-faster-code-faster-check-parallel-studio-xe/"&gt;https://goparallel.sourceforge.net/want-faster-code-faster-check-parallel-studio-xe/&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Before using the compiler, you should initialize some envars thus:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
source /opt/intel/bin/iccvars.sh -arch intel64
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Then you can compile a C or C++ program thus:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
icc -qopenmp -O3 foo.c -o foo
icpc -qopenmp -O3 foo.cc -o foo
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;On my simple tests, not using the mic, icpc and g++ produced equally good code.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Compile a C++ program with OpenMP thus:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
icpc -qopenmp -std=c++11    omp_hello.cc   -o omp_hello
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Test it thus, it is in /parallel-class/mic&lt;/p&gt;
&lt;pre class="literal-block"&gt;
OMP_NUM_THREADS=4 ./omp_hello
&lt;/pre&gt;
&lt;p&gt;Note how the output from the various threads is mixed up.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div class="section" id="programming-the-mic-ctd"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class11/#id7"&gt;3   Programming the MIC (ctd)&lt;/a&gt;&lt;/h2&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;It turns out that I (but not you) can update a login password on &lt;tt class="docutils literal"&gt;mic0&lt;/tt&gt;, but it's a little
tedious.  Use your ssh key.&lt;/p&gt;
&lt;p&gt;Details: at startup, &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;mic0:/etc&lt;/span&gt;&lt;/tt&gt; is initialized from
&lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;parallel:/var/mpss/mic0/etc&lt;/span&gt;&lt;/tt&gt; So I could edit &lt;tt class="docutils literal"&gt;shadow&lt;/tt&gt; and insert a new
encrypted password.&lt;/p&gt;
&lt;p&gt;So is &lt;tt class="docutils literal"&gt;/home&lt;/tt&gt;, but it's then replaced by the NFS mount.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;I can also change, e.g., your login shell.  Use bash on &lt;tt class="docutils literal"&gt;mic0&lt;/tt&gt; since zsh does
not exist there.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Dr.Dobb's: &lt;a class="reference external" href="http://www.drdobbs.com/parallel/programming-the-xeon-phi/240152106"&gt;Programming the Xeon Phi&lt;/a&gt; by Rob Farber.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Some &lt;a class="reference external" href="http://www.hpc.cineca.it/content/mic-tutorial"&gt;MIC demos&lt;/a&gt; .&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;To cross compile with icc and icpc, see the &lt;a class="reference external" href="http://registrationcenter-download.intel.com/akdlm/irc_nas/11194/mpss_users_guide.pdf"&gt;MPSS users guide&lt;/a&gt;,    Section 8.1.4.   Use the -mmic flag.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Intel &lt;a class="reference external" href="https://software.intel.com/en-us/node/694285"&gt;OpenMP* Support Overview&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="https://software.intel.com/en-us/videos/new-era-for-openmp-beyond-traditional-shared-memory-parallel-programming"&gt;New Era for OpenMP*: Beyond Traditional Shared Memory Parallel Programming&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Book:  &lt;a class="reference external" href="http://www.colfax-intl.com/nd/xeonphi/book.aspx?DXP#"&gt;Parallel Programming and Optimization with Intel® Xeon Phi™ Coprocessors, 2nd Edition [508 Pages]&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;See also &lt;a class="reference external" href="https://www.cilkplus.org/"&gt;https://www.cilkplus.org/&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Rolling Out the New Intel Xeon Phi Processor at ISC 2016    &lt;a class="reference external" href="https://www.youtube.com/watch?v=HDPYymREyV8"&gt;https://www.youtube.com/watch?v=HDPYymREyV8&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Supermicro Showcases Intel Xeon Phi and Nvidia P100 Solutions at ISC 2016
&lt;a class="reference external" href="https://www.youtube.com/watch?v=nVWqSjt6hX4"&gt;https://www.youtube.com/watch?v=nVWqSjt6hX4&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Insider Look: New Intel® Xeon Phi™ processor on the Cray® XC™ Supercomputer &lt;a class="reference external" href="https://www.youtube.com/watch?v=lkf3U_5QG_4"&gt;https://www.youtube.com/watch?v=lkf3U_5QG_4&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div class="section" id="mic0-xeon-phi-setup"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class11/#id8"&gt;4   Mic0 (Xeon Phi) setup&lt;/a&gt;&lt;/h2&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;mic0 is the hostname for the Xeon Phi coprocessor.&lt;ol class="loweralpha"&gt;
&lt;li&gt;It's logically a separate computer on a local net named mic0, which is established when the mpss service starts.&lt;/li&gt;
&lt;li&gt;It's accessible only from parallel.&lt;/li&gt;
&lt;li&gt;It has its own filesystem, accounts, etc.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;On parallel, /opt and /home are exported by being listed in /etc/exports&lt;/li&gt;
&lt;li&gt;mic0 has no disk;  its root partition is in memory.&lt;/li&gt;
&lt;li&gt;parallel:/var/mpss/mic0/   is copied as mic0's root partition.&lt;ol class="loweralpha"&gt;
&lt;li&gt;This is copied when the mic boots.&lt;/li&gt;
&lt;li&gt;That is 1-way; changes done on mic0 are not copied back.&lt;/li&gt;
&lt;li&gt;Changes to parallel:/var/mpss/mic0/ are not visible to the mic until it reboots.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;Accounts on parallel have home dirs on mic0.&lt;/li&gt;
&lt;li&gt;I add the accounts themselves by copying the relevant lines from parallel:/etc/{passwd,shadow,group} to /var/mpss/mic0/etc/{passwd,shadow,group}.&lt;/li&gt;
&lt;li&gt;parallel is running an old linux kernel, 4.4.55, because the required kernel module, mic.ko, wouldn't compile with a newer kernel, even 4.8.&lt;ol class="loweralpha"&gt;
&lt;li&gt;The proximate problem is that newer kernels have secure boot, where modules need to be validated.&lt;/li&gt;
&lt;li&gt;This can allegedly be disabled, but that didn't work.&lt;/li&gt;
&lt;li&gt;I also tried to create a certificate authorizing mic.ko, but that didn't work.&lt;/li&gt;
&lt;li&gt;It's possible that sufficient work might get a newer kernel to work.  However, I spent far too much time getting it to this point.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;FYI, parallel:/opt/mpss/ has some relevant files.&lt;/li&gt;
&lt;li&gt;parallel:/opt/intel has useful Intel compilers and tools, most of which I haven't gotten working yet.   If anyone wants to do this, welcome.&lt;/li&gt;
&lt;li&gt;On parallel, &lt;strong&gt;micctrl -s&lt;/strong&gt; gives the mic's state.&lt;/li&gt;
&lt;li&gt;micctrl has other rooty options.&lt;/li&gt;
&lt;li&gt;On parallel, root controls the mic with &lt;strong&gt;service mpss start/stop&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;The mpss service should start automatically when parallel reboots.&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div class="section" id="boinc"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class11/#id9"&gt;5   Boinc&lt;/a&gt;&lt;/h2&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;I've installed &lt;a class="reference external" href="https://boinc.berkeley.edu/"&gt;BOINC&lt;/a&gt; on parallel.ecse.&lt;/li&gt;
&lt;li&gt;Currently, it's running &lt;a class="reference external" href="https://milkyway.cs.rpi.edu/milkyway/"&gt;MilkyWay@Home&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The command &lt;strong&gt;boincmgr&lt;/strong&gt; gives its status.&lt;/li&gt;
&lt;li&gt;Would you like to play with it?&lt;/li&gt;
&lt;li&gt;If you want to run timing tests for other SW on parallel, tell me; I'll disable it.&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div class="section" id="quantum-physics-talk-at-4pm-today"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class11/#id10"&gt;6   Quantum physics talk at 4pm today&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;The Fascinating Quantum World of Two-dimensional Materials: Symmetry, Interaction and Topological Effects&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Symmetry, interaction and topological effects, as well as environmental screening, dominate many of the quantum properties of reduced-dimensional systems and nanostructures. These effects often lead to manifestation of counter-intuitive concepts and phenomena that may not be so prominent or have not been seen in bulk materials.  In this talk, I present some fascinating physical phenomena discovered in recent studies of atomically thin two-dimensional (2D) materials.  A number of highly interesting and unexpected behaviors have been found – e.g., strongly bound excitons (electron-hole pairs) with unusual energy level structures and new topology-dictated optical selection rules, massless excitons, tunable magnetism and plasmonic properties, electron supercollimation, novel topological phases, etc. – adding to the promise of these 2D materials for exploration of new science and valuable applications.&lt;/p&gt;
&lt;p&gt;Steven G. Louie, Physics Department, University of California at Berkeley, and Lawrence Berkeley National Lab&lt;/p&gt;
&lt;p&gt;Darrin Communications Center (DCC) 337 4:00 pm&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://science.rpi.edu/physics/events/kodosky-lecture-series"&gt;Announcement&lt;/a&gt; (link will decay soon.)&lt;/p&gt;
&lt;/div&gt;&lt;/div&gt;</description><category>class</category><guid>https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class11/</guid><pubDate>Thu, 29 Mar 2018 04:00:00 GMT</pubDate></item><item><title>PAR Class 10, Wed 2018-03-28</title><link>https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class10/</link><dc:creator>W Randolph Franklin, RPI</dc:creator><description>&lt;div&gt;&lt;style&gt; .red {color:red} &lt;/style&gt;
&lt;style&gt; .blue {color:blue} &lt;/style&gt;&lt;div class="contents topic" id="table-of-contents"&gt;
&lt;p class="topic-title first"&gt;Table of contents&lt;/p&gt;
&lt;ul class="auto-toc simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class10/#parallel-class" id="id1"&gt;1   /parallel-class&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class10/#openmp-cpu-time-vs-wall-clock-elapsed-time" id="id2"&gt;2   OpenMP: CPU time vs wall clock (elapsed) time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class10/#more-thrust-info" id="id3"&gt;3   More Thrust info&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class10/#linux-programming-tip" id="id4"&gt;4   Linux programming tip&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class10/#thrust" id="id5"&gt;5   Thrust&lt;/a&gt;&lt;ul class="auto-toc"&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class10/#examples" id="id6"&gt;5.1   Examples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class10/#backends" id="id7"&gt;5.2   Backends&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class10/#intel-compilers-on-parallel" id="id8"&gt;6   Intel compilers on parallel&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class10/#programming-the-mic-ctd" id="id9"&gt;7   Programming the MIC (ctd)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class10/#openmp-on-the-mic" id="id10"&gt;8   OpenMP on the mic&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="parallel-class"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class10/#id1"&gt;1   /parallel-class&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;This dir on parallel.ecse has some of my modified programs that are not on geoxeon.ecse, such as tiled_range2.cu (mentioned below).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="openmp-cpu-time-vs-wall-clock-elapsed-time"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class10/#id2"&gt;2   OpenMP: CPU time vs wall clock (elapsed) time&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;On parallel.ecse,  /parallel-class/openmp/rpi/cpuvs.wall.cc is a simple program that shows how wall-clock time shrinks as the number of threads rises (up to a point) but CPU time rises from the start.&lt;/p&gt;
&lt;p&gt;That is because one way that a thread waits to run is to burn cycles.   For small delays, this is easier and has less overhead than explicitly suspending.   The GNU version of OpenMP lets you choose the crossover point between burning cycles and suspending a thread.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="more-thrust-info"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class10/#id3"&gt;3   More Thrust info&lt;/a&gt;&lt;/h2&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;/parallel-class/thrust/doc/An_Introduction_To_Thrust.pdf&lt;/li&gt;
&lt;li&gt;GTC_2010_Part_2_Thrust_By_Example.pdf&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div class="section" id="linux-programming-tip"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class10/#id4"&gt;4   Linux programming tip&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;I like to use emacs on my local machine to edit files on parallel.ecse.  It's more responsive than running a remote editor, and lets me copy to and from local files.
Here are 2 ways to do that.&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;In emacs on my local machine, I do find-file thus:&lt;/p&gt;
&lt;p&gt;/parallel.ecse.rpi.edu:/parallel-class/thrust/rpi/tiled_range2.cu&lt;/p&gt;
&lt;p&gt;Emacs transparently runs scp to read and write the remote file.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;I can mount remote directories on my local machine with something like ssh-fs and then edit remote files as if they were local.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In either case, I have to compile the files on parallel.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="thrust"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class10/#id5"&gt;5   Thrust&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="examples"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class10/#id6"&gt;5.1   Examples&lt;/a&gt;&lt;/h3&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;I rewrote /parallel-class/thrust/examples-1.8/tiled_range.cu into /parallel-class/thrust/rpi/tiled_range2.cu .&lt;/p&gt;
&lt;p&gt;It is now much shorter and much clearer.  All the work is done here:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;gather(make_transform_iterator(make_counting_iterator(0),   _1%N),
make_transform_iterator(make_counting_iterator(N*C), _1%N),
data.begin(),
V.begin());&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ol class="loweralpha simple"&gt;
&lt;li&gt;make_counting_iterator(0) returns pointers to the sequence 0, 1, 2, ...&lt;/li&gt;
&lt;li&gt;_1%N is a function computing modulo N.&lt;/li&gt;
&lt;li&gt;make_transform_iterator(make_counting_iterator(0),   _1%N) returns pointers to the sequence 0%N, 1%N, ...&lt;/li&gt;
&lt;li&gt;gather  populates V.  The i-th element of V gets make_transform_iterator...+i element of data, i.e., the i%N-th element of data.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;tiled_range3.cu is even shorter.  Instead of writing an output vector, it constructs an iterator for a virtual output vector:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;auto output=make_permutation_iterator(data,
make_transform_iterator(make_counting_iterator(0), _1%N));&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ol class="loweralpha simple"&gt;
&lt;li&gt;*(output+i) is *(data+(i%N)).&lt;/li&gt;
&lt;li&gt;You can get as many tiles as you want by iterating.&lt;/li&gt;
&lt;li&gt;tiled_range3.cu also constructs an iterator for a virtual input vector (in this case a vector of squares) instead of storing the data:&lt;/li&gt;
&lt;/ol&gt;
&lt;blockquote&gt;
&lt;p&gt;auto data = make_transform_iterator(make_counting_iterator(0), _1*_1);&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;tiled_range5.cu shows how to use a lambda instead of the _1 notation:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;auto output=make_permutation_iterator(data,
make_transform_iterator(make_counting_iterator(0),
[](const int i){return i%N;}
));&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ol class="loweralpha"&gt;
&lt;li&gt;&lt;p class="first"&gt;You have to compile with  &lt;em&gt;--std c++11&lt;/em&gt; .&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;This can be rewritten thus:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;auto f = [](const int i){return i%N;};
auto output = make_permutation_iterator(data,
make_transform_iterator(make_counting_iterator(0), f));&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;The shortest lambda is this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;auto f = [](){};&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;repeated_range2.cu is my improvement on repeated_range.cu:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;auto output=make_permutation_iterator(data.begin(),
make_transform_iterator(make_counting_iterator(0), _1/3));&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ol class="loweralpha simple"&gt;
&lt;li&gt;make_transform_iterator(make_counting_iterator(0), _1/3)) returns pointers to the sequence 0,0,0,1,1,1,2,2,2, ...&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Unmodified thrust examples:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;expand.cu&lt;/strong&gt; takes a vector like V= [0, 10, 20, 30, 40] and a vector of repetition counts, like C= [2, 1, 0, 3, 1].  Expand repeats each element of V the appropriate number of times, giving [0, 0, 10, 30, 30, 30, 40].  The process is as follows.&lt;/p&gt;
&lt;ol class="loweralpha simple"&gt;
&lt;li&gt;Since the output vector will be longer than the input, the main program computes the output size, by reduce summing C, and constructs a vector to hold the output.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Exclusive_scan&lt;/strong&gt; C to obtain output offsets for each input element: C2 = [0, 2, 3, 3, 6].&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scatter_if&lt;/strong&gt; the nonzero counts into their corresponding output positions.  A counting iterator, [0, 1, 2, 3, 4] is mapped with C2, using C as the stencil, giving C3 = [0, 0, 1, 3, 0, 0, 4].&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;inclusive_scan&lt;/strong&gt; with max fills in the holes in C3, to give C4 = [0, 0, 1, 3, 3, 3, 4].&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gather&lt;/strong&gt; uses C4 to gather elements of V: [0, 0, 10, 30, 30, 30, 40].&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;set_operations.cu&lt;/strong&gt;.  This shows methods of handling an operation whose
output is of unpredictable size.  The question is, is space or time more
important?&lt;/p&gt;
&lt;ol class="loweralpha"&gt;
&lt;li&gt;&lt;p class="first"&gt;If the maximum possible output size is reasonable, then construct an
output vector of that size, use it, and then erase it down to its
actual size.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Or, run the operation twice.  The 1st time, write to a
&lt;strong&gt;discard_iterator&lt;/strong&gt;, and remember only the size of the written data.
Then, construct an output vector of exactly the right size, and run the
operation again.&lt;/p&gt;
&lt;p&gt;I use this technique a lot with ragged arrays in sequential programs.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;sparse_vector.cu&lt;/strong&gt; represents and sums sparse vectors.&lt;/p&gt;
&lt;ol class="loweralpha"&gt;
&lt;li&gt;&lt;p class="first"&gt;A sparse vector has mostly 0s.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;The representation is a vector of element indices and another vector of values.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Adding two sparse vectors goes as follows.&lt;/p&gt;
&lt;ol class="lowerroman"&gt;
&lt;li&gt;&lt;p class="first"&gt;Allocate temporary index and element vectors of the max possible size (the sum of the sizes of the two inputs).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Catenate the input vectors.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Sort by index.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Find the number of unique indices by applying &lt;strong&gt;inner_product&lt;/strong&gt; with addition and not-equal-to-next-element to the indices, then adding one.&lt;/p&gt;
&lt;p&gt;E.g., applied to these indices:  [0, 3, 3, 4, 5, 5, 5, 8], it gives 5.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Allocate exactly enough space for the output.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Apply &lt;strong&gt;reduce_by_key&lt;/strong&gt; to the indices and elements to add elements with the same keys.&lt;/p&gt;
&lt;p&gt;The size of the output is the number of unique keys.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;What's the best way to sort 16000 sets of 1000 numbers each?  E.g., sort the rows of a 16000x1000 array?  On geoxeon, &lt;strong&gt;/pc/thrust/rpi/comparesorts.cu&lt;/strong&gt;, which I copied from &lt;a class="reference external" href="http://stackoverflow.com/questions/28150098/how-to-use-thrust-to-sort-the-rows-of-a"&gt;http://stackoverflow.com/questions/28150098/how-to-use-thrust-to-sort-the-rows-of-a&lt;/a&gt;-matrix|stackoverflow, compares three methods.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Call the thrust sort 16000 times, once per set.   That took 10 secs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Sort the whole list of 16,000,000 numbers together.  Then sort it again by key, with the keys being the set number, to bring the elements of each set together.  Since the sort is stable, this maintains the order within each set.  (This is also how radix sort works.)  That took 0.04 secs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Call a thrust function (to sort a set) within another thrust function (that applies to each set).  This is new in Thrust 1.8.  That took 0.3 secs.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is a surprising and useful paradigm.   It works because&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;There's an overhead to starting each thrust function, and&lt;/li&gt;
&lt;li&gt;Radix sort, which thrust uses for ints, takes linear time.&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div class="section" id="backends"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class10/#id7"&gt;5.2   Backends&lt;/a&gt;&lt;/h3&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;The Thrust device can be CUDA, OpenMP, TBB, etc.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;You can spec it in 2 ways:&lt;/p&gt;
&lt;ol class="loweralpha"&gt;
&lt;li&gt;&lt;p class="first"&gt;by adding an extra arg at the start of a function's arg list.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;with an envar&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://github.com/thrust/thrust/wiki/Host-Backends"&gt;https://github.com/thrust/thrust/wiki/Host-Backends&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://github.com/thrust/thrust/wiki/Device-Backends"&gt;https://github.com/thrust/thrust/wiki/Device-Backends&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="intel-compilers-on-parallel"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class10/#id8"&gt;6   Intel compilers on parallel&lt;/a&gt;&lt;/h2&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;Intel Parallel Studio XE Cluster 2017 is now installed on &lt;tt class="docutils literal"&gt;parallel&lt;/tt&gt;, in
&lt;tt class="docutils literal"&gt;/opt/intel/&lt;/tt&gt; .  It is a large package with compilers, debuggers, analyzers,
MPI, etc, etc.  There is is extensive doc on Intel's web site.  Have fun.&lt;/p&gt;
&lt;p&gt;Students and free SW developers can get also free licenses for their
machines.  Commercial licenses cost $thousands.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="https://software.intel.com/en-us/videos/whats-new-in-intel-parallel-studio-xe-2017"&gt;What’s Inside Intel Parallel Studio XE 2017&lt;/a&gt;.  There're &lt;a class="reference external" href="https://software.intel.com/sites/default/files/managed/2a/1f/intel-parallel-studio-xe-2017-create-faster-code-faster.pdf"&gt;PDF slides&lt;/a&gt;, a webinar, and &lt;a class="reference external" href="https://software.intel.com/en-us/intel-parallel-studio-xe-support/training"&gt;training stuff&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="https://goparallel.sourceforge.net/want-faster-code-faster-check-parallel-studio-xe/"&gt;https://goparallel.sourceforge.net/want-faster-code-faster-check-parallel-studio-xe/&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Before using the compiler, you should initialize some envars thus:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
source /opt/intel/bin/iccvars.sh -arch intel64
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Then you can compile a C or C++ program thus:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
icc -qopenmp -O3 foo.c -o foo
icpc -qopenmp -O3 foo.cc -o foo
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;On my simple tests, not using the mic, icpc and g++ produced equally good code.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Compile a C++ program with OpenMP thus:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
icpc -qopenmp -std=c++11    omp_hello.cc   -o omp_hello
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Test it thus, it is in /parallel-class/mic&lt;/p&gt;
&lt;pre class="literal-block"&gt;
OMP_NUM_THREADS=4 ./omp_hello
&lt;/pre&gt;
&lt;p&gt;Note how the output from the various threads is mixed up.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div class="section" id="programming-the-mic-ctd"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class10/#id9"&gt;7   Programming the MIC (ctd)&lt;/a&gt;&lt;/h2&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;Dr. Dobb's: &lt;a class="reference external" href="http://www.drdobbs.com/parallel/programming-the-xeon-phi/240152106"&gt;Programming the Xeon Phi&lt;/a&gt; by Rob Farber.&lt;/li&gt;
&lt;li&gt;Some &lt;a class="reference external" href="http://www.hpc.cineca.it/content/mic-tutorial"&gt;MIC demos&lt;/a&gt; .&lt;/li&gt;
&lt;li&gt;To cross compile with icc and icpc, see the &lt;a class="reference external" href="http://registrationcenter-download.intel.com/akdlm/irc_nas/11194/mpss_users_guide.pdf"&gt;MPSS users guide&lt;/a&gt;,    Section 8.1.4.   Use the -mmic flag.&lt;/li&gt;
&lt;li&gt;Intel &lt;a class="reference external" href="https://software.intel.com/en-us/node/694285"&gt;OpenMP* Support Overview&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://software.intel.com/en-us/videos/new-era-for-openmp-beyond-traditional-shared-memory-parallel-programming"&gt;New Era for OpenMP*: Beyond Traditional Shared Memory Parallel Programming&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Book:  &lt;a class="reference external" href="http://www.colfax-intl.com/nd/xeonphi/book.aspx?DXP#"&gt;Parallel Programming and Optimization with Intel® Xeon Phi™ Coprocessors, 2nd Edition [508 Pages]&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;See also &lt;a class="reference external" href="https://www.cilkplus.org/"&gt;https://www.cilkplus.org/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Rolling Out the New Intel Xeon Phi Processor at ISC 2016    &lt;a class="reference external" href="https://www.youtube.com/watch?v=HDPYymREyV8"&gt;https://www.youtube.com/watch?v=HDPYymREyV8&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Supermicro Showcases Intel Xeon Phi and Nvidia P100 Solutions at ISC 2016
&lt;a class="reference external" href="https://www.youtube.com/watch?v=nVWqSjt6hX4"&gt;https://www.youtube.com/watch?v=nVWqSjt6hX4&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Insider Look: New Intel® Xeon Phi™ processor on the Cray® XC™ Supercomputer &lt;a class="reference external" href="https://www.youtube.com/watch?v=lkf3U_5QG_4"&gt;https://www.youtube.com/watch?v=lkf3U_5QG_4&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div class="section" id="openmp-on-the-mic"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class10/#id10"&gt;8   OpenMP on the mic&lt;/a&gt;&lt;/h2&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;OpenMP is now running on the mic (Xeon Phi).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Setup envars thus (assuming you're using bash or zsh):&lt;/p&gt;
&lt;pre class="literal-block"&gt;
source /opt/intel/bin/iccvars.sh -arch intel64
export SINK_LD_LIBRARY_PATH=/opt/intel/lib/mic
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;To compile on parallel for running on parallel:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
icpc -fopenmp sum_reduc2.cc -o sum_reduc2
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Run it on parallel thus:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
./sum_reduc2
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;To cross compile on parallel for running on mic0:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
icpc -mmic -fopenmp sum_reduc2.cc -o sum_reduc2-mic
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Run it natively on mic0 thus:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
micnativeloadex sum_reduc2-mic
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;It is also possible to have an OpenMP program on parallel.ecse
execute parallel threads on mic0:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
#pragma offload target (mic)
  {
  #pragma omp parallel
    {
    ...
    }
  }
&lt;/pre&gt;
&lt;p&gt;See /parallel-class/mic/stackoverflow/hello.c&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Compile it thus:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
icc -fopenmp hello.c -o hello
&lt;/pre&gt;
&lt;p&gt;Note that there is no -mmic.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Run it thus:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
./hello
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;/parallel-class/mic/stackoverflow/hello_omp.c shows a different syntax.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Degrees of parallelism: See slide 12 of    &lt;a class="reference external" href="https://software.intel.com/sites/default/files/managed/87/c3/using-nested-parallelism-in-openMP-r1.pdf"&gt;https://software.intel.com/sites/default/files/managed/87/c3/using-nested-parallelism-in-openMP-r1.pdf&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;&lt;/div&gt;</description><category>class</category><guid>https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class10/</guid><pubDate>Tue, 27 Mar 2018 04:00:00 GMT</pubDate></item><item><title>PAR Class 9, Wed 2018-03-21</title><link>https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class09/</link><dc:creator>W Randolph Franklin, RPI</dc:creator><description>&lt;div&gt;&lt;style&gt; .red {color:red} &lt;/style&gt;
&lt;style&gt; .blue {color:blue} &lt;/style&gt;&lt;div class="contents topic" id="table-of-contents"&gt;
&lt;p class="topic-title first"&gt;Table of contents&lt;/p&gt;
&lt;ul class="auto-toc simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class09/#parallel-ecse-hardware-details" id="id1"&gt;1   parallel.ecse hardware details&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class09/#nvidia-gpu-summary" id="id2"&gt;2   Nvidia GPU summary&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class09/#more-cuda" id="id3"&gt;3   More CUDA&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class09/#thrust" id="id4"&gt;4   Thrust&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class09/#unionfs-linux-trick-of-the-day" id="id5"&gt;5   Unionfs: Linux trick of the day&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="parallel-ecse-hardware-details"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class09/#id1"&gt;1   parallel.ecse hardware details&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;I put the invoice on parallel.ecse in /parallel-class/ .   It gives the hardware specifics.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="nvidia-gpu-summary"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class09/#id2"&gt;2   Nvidia GPU summary&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Here's a summary of the Nvidia Pascal GP104 GPU architecture as I understand it.  It's more
compact than I've found elsewhere.  I'll add to it from time to time.  Some numbers are probably wrong.&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;The &lt;strong&gt;host&lt;/strong&gt; is the CPU.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;The &lt;strong&gt;device&lt;/strong&gt; is the GPU.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;The device contains 20 &lt;strong&gt;streaming multiprocessors&lt;/strong&gt; (SMs).&lt;/p&gt;
&lt;p&gt;Different GPU generations have used the terms SMX or SMM.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;A &lt;strong&gt;thread&lt;/strong&gt; is a sequential program with private and shared memory, program counter, etc.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Threads are grouped, 32 at a time, into &lt;strong&gt;warps&lt;/strong&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Warps of threads are grouped into &lt;strong&gt;blocks&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Often the warps are only implicit, and we consider that the threads are grouped directly into blocks.&lt;/p&gt;
&lt;p&gt;That abstraction hides details that may be important; see below.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Blocks of threads are grouped into a &lt;strong&gt;grid&lt;/strong&gt;, which is all the threads in the kernel.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;A &lt;strong&gt;kernel&lt;/strong&gt; is a parallel program executing on the device.&lt;/p&gt;
&lt;ol class="loweralpha simple"&gt;
&lt;li&gt;The kernel runs potentially thousands of &lt;strong&gt;threads&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;A kernel can create other kernels and wait for their completion.&lt;/li&gt;
&lt;li&gt;There may be a limit, e.g., 5 seconds, on a kernel's run time.&lt;/li&gt;
&lt;/ol&gt;
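The grid/block/thread hierarchy shows up directly in a kernel launch. A minimal CUDA sketch (illustrative names, not from the course code):

```cuda
// Each thread handles one element; its global index combines its
// block's position in the grid with its position within the block.
__global__ void scale(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;   // threads past the end do nothing
}

// Launch a grid of ceil(n/256) blocks of 256 threads (8 warps) each:
//   scale<<<(n + 255) / 256, 256>>>(d_a, n);
```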
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Thread-level resources:&lt;/p&gt;
&lt;ol class="loweralpha"&gt;
&lt;li&gt;&lt;p class="first"&gt;Each thread can use up to 255 fast &lt;strong&gt;registers&lt;/strong&gt;.  Registers are private to the thread.&lt;/p&gt;
&lt;p&gt;All the threads in one block have their registers allocated from a fixed pool of 65536 registers.  The more registers that each thread uses, the fewer warps in the block  can run simultaneously.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Each thread has 512KB slow &lt;strong&gt;local memory&lt;/strong&gt;, allocated from the global memory.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Local memory is used when not enough registers are available, and to
store thread-local arrays.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Warp-level resources:&lt;/p&gt;
&lt;ol class="loweralpha"&gt;
&lt;li&gt;&lt;p class="first"&gt;Threads are grouped, 32 at a time, into &lt;strong&gt;warps&lt;/strong&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Each warp executes as a SIMD, with one instruction register.  At each cycle,
every thread in a warp is either executing the same instruction, or is disabled.
If the 32 threads want to execute 32 different instructions, then they will
execute one after the other, sequentially.&lt;/p&gt;
&lt;p&gt;If you read in some NVidia doc that threads in a warp run independently, then
continue reading the next page to get the info mentioned in the previous paragraph.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;If successive instructions in a warp do not depend on each other, then,
if there are enough warp schedulers available, they may be executed in
parallel.   This is called &lt;strong&gt;Instruction Level Parallelism (ILP)&lt;/strong&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;For an array in local memory, which means that each thread will have
its private copy, the elements for all the threads in a warp are
&lt;strong&gt;interleaved&lt;/strong&gt; to potentially increase the I/O rate.&lt;/p&gt;
&lt;p&gt;Therefore your program should try to have successive threads read successive
words of arrays.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;A thread can read variables from other threads in the same warp, with the
&lt;strong&gt;shuffle&lt;/strong&gt; instruction.  Typical operations are to read from the K-th next
thread, to do a butterfly permutation, or to do an indexed read.  This happens in
parallel for the whole warp, and does not use shared memory.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;A &lt;strong&gt;warp vote&lt;/strong&gt; combines a bit computed by each thread to report
results like &lt;em&gt;all&lt;/em&gt; or &lt;em&gt;any&lt;/em&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Block-level resources:&lt;/p&gt;
&lt;ol class="loweralpha"&gt;
&lt;li&gt;&lt;p class="first"&gt;A block may contain up to 1024 threads.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Each block has access to 65536 fast 32-bit &lt;strong&gt;registers&lt;/strong&gt;,
for the use of its threads.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Each block can use up to 49152 bytes of the SM's fast &lt;strong&gt;shared&lt;/strong&gt;
&lt;strong&gt;memory&lt;/strong&gt;.  The block's shared memory is shared by all the threads in
the block, but is hidden from other blocks.&lt;/p&gt;
&lt;p&gt;Shared memory is basically a user-controllable cache of some global
data.  The saving comes from reusing that shared data several times
after you loaded it from global memory once.&lt;/p&gt;
&lt;p&gt;Shared memory is interleaved in banks so that some access patterns are faster than others.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Warps in a block run asynchronously and run different instructions.  They
are scheduled and executed as resources are available.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;The threads in a block can be synchonized with &lt;strong&gt;__syncthreads()&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Because of how warps are scheduled, that can be slow.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;The threads in a block can be arranged into a 3D array, up to
1024x1024x64.&lt;/p&gt;
&lt;p&gt;That is for convenience, and does not increase performance (I think).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;I'll talk about &lt;strong&gt;textures&lt;/strong&gt; later.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Streaming Multiprocessor (SM) - level resources:&lt;/p&gt;
&lt;ol class="loweralpha"&gt;
&lt;li&gt;&lt;p class="first"&gt;Each SM has 128 single-precision CUDA cores, 64
double-precision units, 32 special function units, and
32 load/store units.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;In total, the GPU has 2560 CUDA cores.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;A &lt;strong&gt;CUDA core&lt;/strong&gt; is akin to an ALU.  The cores, and all the units, are
pipelined.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;A CUDA core is much less powerful than one core of an Intel Xeon.  My
guess is 1/20th.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Beware that, in the CUDA C Programming Guide, NVidia sometimes calls an
SM a core.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;The limited number of, e.g., double precision units means that an DP
instruction will need to be scheduled several times for all the threads
to execute it.  That's why DP is slower.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Each SM has 4 warp schedulers and 8 instruction dispatch units.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;64 warps can simultaneously reside in an SM.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Therefore up to 32x64=2048 threads can be executed in parallel by an
SM.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Up to 16 blocks that can simultaneously be resident in an SM.&lt;/p&gt;
&lt;p&gt;However, if each block uses too many resources, like shared memory,
then this number is reduced.&lt;/p&gt;
&lt;p&gt;Each block sits on only one SM; no block is split.  However a block's
warps are executed asynchronously (until synced).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Each SM has 64KiB (?) fast memory to be divided between &lt;strong&gt;shared&lt;/strong&gt; memory and an &lt;strong&gt;L1 cache&lt;/strong&gt;.  Typically, 48KiB (96?) is used for the shared memory, to be divided among its resident blocks, but that can be changed.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;The 48KB L1 cache can cache local or global memory.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Each SM has a read-only data cache of 48KB to cache the
global constant memory.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Each SM has 8 texture units, and many other graphics capabilities.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Each SM has 256KB of L2 cacha.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Grid-level resources:&lt;/p&gt;
&lt;ol class="loweralpha simple"&gt;
&lt;li&gt;The blocks in a grid can be arranged into a 3D array,
up to &lt;span class="math"&gt;\((2^{31}-1, 2^{16}-1, 2^{16}-1)\)&lt;/span&gt;.&lt;/li&gt;
&lt;li&gt;Blocks in a grid might run on different SMs.&lt;/li&gt;
&lt;li&gt;Blocks in a grid are queued and executed as resources are
available, in an unpredictable parallel or serial order.
Therefore they should be independent of each other.&lt;/li&gt;
&lt;li&gt;The number of instructions in a kernel is limited.&lt;/li&gt;
&lt;li&gt;Any thread can stop the kernel by calling &lt;strong&gt;assert&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Device-level resources:&lt;/p&gt;
&lt;ol class="loweralpha"&gt;
&lt;li&gt;&lt;p class="first"&gt;There is a large and slow 8GB &lt;strong&gt;global memory&lt;/strong&gt;, which
persists from kernel to kernel.&lt;/p&gt;
&lt;p&gt;Transactions to global memory are 128 bytes.&lt;/p&gt;
&lt;p&gt;Host memory can also be memory-mapped into global memory, although the
I/O rate will be lower.&lt;/p&gt;
&lt;p&gt;Reading from global memory can take hundreds of cycles.  A warp that
does this will be paused and another warp started.  Such context
switching is very efficient.  Therefore device throughput stays high,
although there is a latency.  This is called &lt;strong&gt;Thread Level
Parallelism (TLP)&lt;/strong&gt; and is a major reason for GPU performance.&lt;/p&gt;
&lt;p&gt;That assumes that an SM has enough active warps that there is always
another warp available for execution.  That is a reason for having
warps that do not use all the resources (registers etc) that they're
allowed to.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;There is a 2MB L2 cache, for sharing data between SMs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;There is a 64KiB Small and fast global &lt;strong&gt;constant memory&lt;/strong&gt;, ,
which also persists from kernel to kernel.  It is implemented as a
piece of the global memory, made fast with caches.&lt;/p&gt;
&lt;p&gt;(Again, I'm still resolving this apparent contradiction).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;Grid Management Unit (GMU)&lt;/strong&gt; schedules (pauses, executes, etc) grids on
the device.  This is more important because grids can start other
grids &lt;strong&gt;(Dynamic Parallelism)&lt;/strong&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;Hyper-Q&lt;/strong&gt;: 32 simultaneous CPU tasks can launch kernels into the
queue; they don't block each other.  If one kernel is waiting, another runs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;CUDA Work Distributor (CWD)&lt;/strong&gt; dispatches 32 active grids at
a time to the SMs.  There may be 1000s of grids queued and waiting.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;GPU Direct&lt;/strong&gt;: Other devices can DMA the GPU memory.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;The base clock is 1607MHz.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;GFLOPS: 8873.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Memory bandwidth: 320GB/s&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;GPU-level resources:&lt;/p&gt;
&lt;ol class="loweralpha simple"&gt;
&lt;li&gt;Being a Geforce product, there are many graphics facilities that we're not using.&lt;/li&gt;
&lt;li&gt;There are 4 &lt;strong&gt;Graphics processing clusters&lt;/strong&gt; (GPCs) to do graphics stuff.&lt;/li&gt;
&lt;li&gt;Several perspective projections can be computed in parallel, for systems with several displays.&lt;/li&gt;
&lt;li&gt;There's HW for texture processing.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Generational changes:&lt;/p&gt;
&lt;ol class="loweralpha simple"&gt;
&lt;li&gt;With each new version, Nvidia tweaks the numbers.   Some get higher, others get lower.&lt;ol class="lowerroman"&gt;
&lt;li&gt;E.g., Maxwell had little HW for double precision, and so that was slow.&lt;/li&gt;
&lt;li&gt;Pascal's clock speed is much higher.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Refs:&lt;/p&gt;
&lt;ol class="loweralpha simple"&gt;
&lt;li&gt;The CUDA program deviceDrv.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://developer.download.nvidia.com/compute/cuda/compute-docs/cuda-performance-report.pdf"&gt;http://developer.download.nvidia.com/compute/cuda/compute-docs/cuda-performance-report.pdf&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_1080_Whitepaper_FINAL.pdf"&gt;http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_1080_Whitepaper_FINAL.pdf&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf"&gt;Better Performance at Lower Occupancy&lt;/a&gt;,
Vasily Volkov, UC Berkeley, 2010.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://www.pgroup.com/lit/articles/insider/v2n1a5.htm"&gt;https://www.pgroup.com/lit/articles/insider/v2n1a5.htm&lt;/a&gt; - well written but old.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;em&gt;(I'll keep adding to this. Suggestions are welcome.)&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div class="section" id="more-cuda"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class09/#id3"&gt;3   More CUDA&lt;/a&gt;&lt;/h2&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;CUDA function qualifiers:&lt;/p&gt;
&lt;ol class="loweralpha simple"&gt;
&lt;li&gt;&lt;em&gt;__global__&lt;/em&gt;   device function called from host, starting a kernel.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;__device__&lt;/em&gt; device function called from device function.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;__host__&lt;/em&gt; (default)  host function called from host function.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;CUDA variable qualifiers:&lt;/p&gt;
&lt;ol class="loweralpha simple"&gt;
&lt;li&gt;&lt;em&gt;__shared__&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;__device__&lt;/em&gt; global&lt;/li&gt;
&lt;li&gt;&lt;em&gt;__device__ __managed__&lt;/em&gt; automatically paged between host and device.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;__constant__&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;(nothing) register if scalar, or local if array or if no more registers
available.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;If installing CUDA on your machine, this repository seems best:&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64"&gt;http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;That includes the Thrust headers but not example programs.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div class="section" id="thrust"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class09/#id4"&gt;4   Thrust&lt;/a&gt;&lt;/h2&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;Thrust is an API that looks like STL. Its backend can be CUDA,
OpenMP, or sequential host-based code.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;The online Thrust directory structure is a mess.  Three main
sites appear to be these:&lt;/p&gt;
&lt;ol class="loweralpha"&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="https://github.com/thrust"&gt;https://github.com/thrust&lt;/a&gt; -&lt;/p&gt;
&lt;ol class="lowerroman simple"&gt;
&lt;li&gt;The best way to install it is to clone from here.&lt;/li&gt;
&lt;li&gt;The latest version of the examples is also here.&lt;/li&gt;
&lt;li&gt;The wiki has a lot of doc.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="https://thrust.github.io/"&gt;https://thrust.github.io/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This points to the above site.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="https://developer.nvidia.com/thrust"&gt;https://developer.nvidia.com/thrust&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This has links to other Nvidia docs, some of which are obsolete.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="http://docs.nvidia.com/cuda/thrust/index.html"&gt;http://docs.nvidia.com/cuda/thrust/index.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Easy-to-read and thorough, but obsolete, doc.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="https://code.google.com/"&gt;https://code.google.com/&lt;/a&gt;  - no longer exists.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;The latest version is 1.8.3.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Functional-programming philosophy.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Many possible backends:  host, GPU, OpenMP, TBB...&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Easier programming, once you get used to it.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Code is efficient.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Uses some unusual C++ techniques, like overloading &lt;strong&gt;operator()&lt;/strong&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Since the Stanford slides were created, Thrust has adopted
unified addressing, so that pointers know whether they are
host or device.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;On parallel in /parallel-class/thrust/ are many little demo programs from the thrust distribution, with my additions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;CUDACast videos on Thrust:&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://www.youtube.com/watch?v=mZJEbO9Eros"&gt;CUDACast #.15 - Introduction to Thrust&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://www.youtube.com/watch?v=xtWJCL7LMqU"&gt;CUDACast #.16 - Thrust Algorithms and Custom Operators&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;Thrust is fast because&lt;/strong&gt; the functions that look like they
would need linear time really take only log time in parallel.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;In functions like &lt;strong&gt;reduce&lt;/strong&gt; and &lt;strong&gt;transform&lt;/strong&gt;, you often see an argument like &lt;strong&gt;thrust::multiplies&amp;lt;float&amp;gt;()&lt;/strong&gt;.  The syntax is as follows:&lt;/p&gt;
&lt;ol class="loweralpha simple"&gt;
&lt;li&gt;&lt;strong&gt;thrust::multiplies&amp;lt;float&amp;gt;&lt;/strong&gt; is a class.&lt;/li&gt;
&lt;li&gt;It overloads &lt;strong&gt;operator()&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;However, in the call to reduce, &lt;strong&gt;thrust::multiplies&amp;lt;float&amp;gt;()&lt;/strong&gt; is calling the default
constructor to construct a variable of class
&lt;strong&gt;thrust::multiplies&amp;lt;float&amp;gt;&lt;/strong&gt;, and passing it to &lt;strong&gt;reduce&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;reduce&lt;/strong&gt; will treat its argument as a function name and call it with an argument, triggering &lt;strong&gt;operator()&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;You may also create your own variable of that class, e.g., &lt;strong&gt;thrust::multiplies&amp;lt;float&amp;gt; foo&lt;/strong&gt;.   Then you just say &lt;strong&gt;foo&lt;/strong&gt; in the argument list, not &lt;strong&gt;foo()&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The optimizing compiler will replace the &lt;strong&gt;operator()&lt;/strong&gt; function call
with the defining expression and then continue optimizing.  So, there
is no overhead, unlike if you passed in a pointer to a function.&lt;/li&gt;
&lt;/ol&gt;
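The same mechanism exists in the STL, so it can be sketched in plain host C++ (the helper name &lt;strong&gt;product&lt;/strong&gt; is mine; &lt;strong&gt;std::multiplies&lt;/strong&gt; stands in for &lt;strong&gt;thrust::multiplies&lt;/strong&gt;):

```cpp
#include <functional>
#include <numeric>
#include <vector>

// The trailing () in std::multiplies<float>() calls the default constructor,
// producing a functor object; accumulate then invokes its operator() on each
// pair, exactly as thrust::reduce does with thrust::multiplies<float>().
float product(const std::vector<float>& v) {
    return std::accumulate(v.begin(), v.end(), 1.0f, std::multiplies<float>());
}
```

E.g., product({1, 2, 3, 4}) is 24.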
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Sometimes, e.g., in &lt;strong&gt;saxpy.cu&lt;/strong&gt;, you see &lt;strong&gt;saxpy_functor(A)&lt;/strong&gt;.&lt;/p&gt;
&lt;ol class="loweralpha simple"&gt;
&lt;li&gt;The class &lt;strong&gt;saxpy_functor&lt;/strong&gt; has a constructor taking one argument.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;saxpy_functor(A)&lt;/strong&gt; constructs and returns a variable of class &lt;strong&gt;saxpy_functor&lt;/strong&gt; and stores &lt;strong&gt;A&lt;/strong&gt; in the variable.&lt;/li&gt;
&lt;li&gt;The class also overloads &lt;strong&gt;operator()&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;(Let's call the new variable &lt;strong&gt;foo&lt;/strong&gt;).  &lt;strong&gt;foo()&lt;/strong&gt; calls &lt;strong&gt;operator()&lt;/strong&gt; for
&lt;strong&gt;foo&lt;/strong&gt;; its execution uses the stored &lt;strong&gt;A&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Effectively, we did a &lt;strong&gt;closure&lt;/strong&gt; of &lt;strong&gt;saxpy_functor&lt;/strong&gt;; that is, we
bound a property and returned a new, more restricted, variable or
class.&lt;/li&gt;
&lt;/ol&gt;
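A minimal plain-C++ analogue of this pattern (the struct mirrors the Thrust example's idea, not its exact source; the wrapper name &lt;strong&gt;saxpy&lt;/strong&gt; is mine):

```cpp
#include <algorithm>
#include <vector>

// The constructor stores A; operator() uses it later when called by
// transform.  This is the closure: A is bound at construction time.
struct saxpy_functor {
    float a;
    explicit saxpy_functor(float a_) : a(a_) {}
    float operator()(float x, float y) const { return a * x + y; }
};

std::vector<float> saxpy(float a, const std::vector<float>& x,
                         const std::vector<float>& y) {
    std::vector<float> z(x.size());
    std::transform(x.begin(), x.end(), y.begin(), z.begin(), saxpy_functor(a));
    return z;
}
```

E.g., saxpy(2, {1, 2, 3}, {10, 20, 30}) is {12, 24, 36}.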
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;The Thrust &lt;a class="reference external" href="https://github.com/thrust/thrust/tree/master/"&gt;examples&lt;/a&gt; teach several non-intuitive paradigms.  As I figure them out, I'll describe a few.  My descriptions are modified and expanded versions of the comments in the programs.  This is not a list of all the useful programs, but only of some where I am adding to their comments.&lt;/p&gt;
&lt;ol class="loweralpha"&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;arbitrary_transformation.cu&lt;/strong&gt; and &lt;strong&gt;dot_products_with_zip.cu&lt;/strong&gt;. show
the very useful zip_iterator.  Using it is a 2-step process.&lt;/p&gt;
&lt;ol class="lowerroman simple"&gt;
&lt;li&gt;Combine the separate iterators into a tuple.&lt;/li&gt;
&lt;li&gt;Construct a zip iterator from the tuple.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Note that operator() is now a template.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;boundingbox.cu&lt;/strong&gt; finds the bounding box around a set of 2D points.&lt;/p&gt;
&lt;p&gt;The main idea is to do a reduce.  However, the combining operation, instead of addition, is to combine two bounding boxes to find the box around them.&lt;/p&gt;
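The combining operation can be sketched in plain C++ (the &lt;strong&gt;Box&lt;/strong&gt; struct and &lt;strong&gt;merge&lt;/strong&gt; name are mine, not the program's):

```cpp
#include <algorithm>

struct Box { float xmin, ymin, xmax, ymax; };

// The reduce's combining operation: the "sum" of two boxes is the smallest
// box containing both.  Each point (x, y) enters the reduce as the
// degenerate box {x, y, x, y}.
Box merge(const Box& a, const Box& b) {
    return { std::min(a.xmin, b.xmin), std::min(a.ymin, b.ymin),
             std::max(a.xmax, b.xmax), std::max(a.ymax, b.ymax) };
}
```

Because merge is associative, the reduce can combine boxes in any tree order, which is what lets it run in log time in parallel.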
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;bucket_sort2d.cu&lt;/strong&gt; overlays a grid on a set of 2D points and finds
the points in each grid cell (bucket).&lt;/p&gt;
&lt;ol class="lowerroman"&gt;
&lt;li&gt;&lt;p class="first"&gt;The tuple is an efficient class for a short vector of fixed length.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Note how random numbers are generated.  You combine an engine that produces random output with a distribution.&lt;/p&gt;
&lt;p&gt;However you might need more complicated coding to make the numbers good when executing in parallel.  See &lt;strong&gt;monte_carlo_disjoint_sequences.cu&lt;/strong&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;The problem is that the number of points in each cell is unpredictable.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;The cell containing each point is computed and that and the points are sorted to bring together the points in each cell.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Then &lt;strong&gt;lower_bound&lt;/strong&gt; and &lt;strong&gt;upper_bound&lt;/strong&gt; are used to find each bucket in that sorted vector of points.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;See this &lt;a class="reference external" href="http://thrust.github.io/doc/group__vectorized__binary__search.html"&gt;lower_bound description&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
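Steps iv and v can be sketched with the STL equivalents (the helper name &lt;strong&gt;bucketRange&lt;/strong&gt; is mine; the real program uses Thrust's vectorized lower_bound/upper_bound):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// After sorting the points by cell index, the points of bucket c occupy the
// half-open range [lower_bound(c), upper_bound(c)) of the cell vector.
std::pair<int, int> bucketRange(const std::vector<int>& sortedCells, int c) {
    auto lo = std::lower_bound(sortedCells.begin(), sortedCells.end(), c);
    auto hi = std::upper_bound(sortedCells.begin(), sortedCells.end(), c);
    return { int(lo - sortedCells.begin()), int(hi - sortedCells.begin()) };
}
```

E.g., with sorted cells {0, 0, 1, 3, 3, 3, 7}, bucket 3 is the range [3, 6): three points starting at offset 3.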
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;mode.cu&lt;/strong&gt; shows:&lt;/p&gt;
&lt;ol class="loweralpha simple"&gt;
&lt;li&gt;Counting the number of unique keys in a vector.&lt;ol class="lowerroman"&gt;
&lt;li&gt;Sort the vector.&lt;/li&gt;
&lt;li&gt;Do an &lt;strong&gt;inner_product&lt;/strong&gt;.  However, instead of the operators being &lt;strong&gt;times&lt;/strong&gt; and &lt;strong&gt;plus&lt;/strong&gt;, they are &lt;strong&gt;not equal to the next element&lt;/strong&gt; and &lt;strong&gt;plus&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;Counting their multiplicity.&lt;ol class="lowerroman"&gt;
&lt;li&gt;Construct vectors, sized at the number of unique keys, to hold the unique keys and counts.&lt;/li&gt;
&lt;li&gt;Do a &lt;strong&gt;reduce_by_keys&lt;/strong&gt; on a constant_iterator using the sorted vector as the keys.  For each range of identical keys, it sums the constant_iterator.  That is, it counts the number of identical keys.&lt;/li&gt;
&lt;li&gt;Write a vector of unique keys and a vector of the counts.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;Finding the most used key (the mode).&lt;ol class="lowerroman"&gt;
&lt;li&gt;Do &lt;strong&gt;max_element&lt;/strong&gt; on the counts vector.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
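The unique-key count (step a) can be sketched with the STL's sequential inner_product (the function name &lt;strong&gt;countUnique&lt;/strong&gt; is mine):

```cpp
#include <algorithm>
#include <functional>
#include <numeric>
#include <vector>

// After sorting, equal keys form runs.  An inner_product of the vector with
// itself shifted by one, using "plus" and "not equal", counts the run
// boundaries; the number of unique keys is that plus one.
int countUnique(std::vector<int> keys) {
    if (keys.empty()) return 0;
    std::sort(keys.begin(), keys.end());
    return 1 + std::inner_product(keys.begin(), keys.end() - 1,
                                  keys.begin() + 1, 0,
                                  std::plus<int>(), std::not_equal_to<int>());
}
```

E.g., {3, 1, 2, 3, 1, 3} sorts to {1, 1, 2, 3, 3, 3}, with two run boundaries, so there are 3 unique keys.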
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;repeated_range.cu&lt;/strong&gt; repeats each element of an N-vector K times:
repeated_range([0, 1, 2, 3], 2) -&amp;gt; [0, 0, 1, 1, 2, 2, 3, 3].  It's a lite
version of &lt;strong&gt;expand.cu&lt;/strong&gt;, but uses a different technique.&lt;/p&gt;
&lt;ol class="loweralpha simple"&gt;
&lt;li&gt;Here, N=4 and K=2.&lt;/li&gt;
&lt;li&gt;The idea is to construct a new iterator, &lt;strong&gt;repeated_range&lt;/strong&gt;, that, when read and
incremented, will return the proper output elements.&lt;/li&gt;
&lt;li&gt;The construction stores the relevant info in structure components of the variable.&lt;/li&gt;
&lt;li&gt;Treating its value like a subscript in the range [0,N*K), it divides
that value by K and returns that element of its input.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;See also &lt;strong&gt;strided_range.cu&lt;/strong&gt; and &lt;strong&gt;tiled_range.cu&lt;/strong&gt;.&lt;/p&gt;
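The index mapping that the repeated_range iterator computes lazily can be written out eagerly in a few lines (the function name &lt;strong&gt;repeatedRange&lt;/strong&gt; is mine):

```cpp
#include <cstddef>
#include <vector>

// Treating the output position i as a subscript in [0, N*K), the element is
// simply in[i / K].  The Thrust iterator computes this mapping on the fly
// instead of materializing the output.
std::vector<int> repeatedRange(const std::vector<int>& in, int k) {
    std::vector<int> out;
    for (std::size_t i = 0; i < in.size() * k; ++i)
        out.push_back(in[i / k]);
    return out;
}
```

E.g., repeatedRange({0, 1, 2, 3}, 2) is {0, 0, 1, 1, 2, 2, 3, 3}.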
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div class="section" id="unionfs-linux-trick-of-the-day"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class09/#id5"&gt;5   Unionfs: Linux trick of the day&lt;/a&gt;&lt;/h2&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;aka overlay FS, translucent FS.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;If a, b are directories, and m is an empty directory, then&lt;/p&gt;
&lt;p&gt;unionfs -o cow a=RW:b m&lt;/p&gt;
&lt;p&gt;makes m a combination of a and b, with a having higher priority.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Writing a file into m writes it in a.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Changing a file in b writes the new version into a&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Deleting a file in b causes a white-out note to be stored in a.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Unmount it thus:&lt;/p&gt;
&lt;p&gt;fusermount -u m&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;None of this requires superuser.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Application: making a read-only directory into a read-write directory.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Note: IBM had a commercial version of this idea in its CP/CMS OS in the 1960s.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;&lt;/div&gt;</description><category>class</category><guid>https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class09/</guid><pubDate>Wed, 21 Mar 2018 04:00:00 GMT</pubDate></item><item><title>PAR Class 8, Wed 2018-03-07</title><link>https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class08/</link><dc:creator>W Randolph Franklin, RPI</dc:creator><description>&lt;div&gt;&lt;style&gt; .red {color:red} &lt;/style&gt;
&lt;style&gt; .blue {color:blue} &lt;/style&gt;&lt;p&gt;Class cancelled because of weather.&lt;/p&gt;&lt;/div&gt;</description><category>class</category><guid>https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class08/</guid><pubDate>Wed, 07 Mar 2018 05:00:00 GMT</pubDate></item><item><title>PAR Class 7, Wed 2018-02-28</title><link>https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class07/</link><dc:creator>W Randolph Franklin, RPI</dc:creator><description>&lt;div&gt;&lt;style&gt; .red {color:red} &lt;/style&gt;
&lt;style&gt; .blue {color:blue} &lt;/style&gt;&lt;div class="contents topic" id="table-of-contents"&gt;
&lt;p class="topic-title first"&gt;Table of contents&lt;/p&gt;
&lt;ul class="auto-toc simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class07/#term-project-progress" id="id1"&gt;1   Term project progress&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class07/#parallel-programs" id="id2"&gt;2   Parallel programs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class07/#computer-factoid" id="id3"&gt;3   Computer factoid&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class07/#cuda-doc" id="id4"&gt;4   CUDA Doc&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class07/#stanford-lectures" id="id5"&gt;5   Stanford lectures&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class07/#thrust" id="id6"&gt;6   Thrust&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class07/#ibm-s-quantum-computer" id="id7"&gt;7   IBM's quantum computer&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="term-project-progress"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class07/#id1"&gt;1   Term project progress&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;How's this going?   Send me a progress report.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="parallel-programs"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class07/#id2"&gt;2   Parallel programs&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;/parallel-class/cuda/matmul2.cu plays with matrix multiplication of two random 1000x1000 matrices, and with managed memory.&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;It shows a really quick way to use OpenMP, which has a 10x speedup.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;It shows a really quick way to use CUDA, which has a 15x speedup.   This just uses one thread block per input matrix A row, and one thread per output element.   That's 1M threads.  The data is read from managed memory.  Note how easy it is.   There is no storing tiles into fast shared memory.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;It multiplies the matrices on the host, and compares reading them from normal memory and from managed memory.   The latter is 2.5x slower.    Dunno why.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;matmul2 also shows some of my utility functions and macros.&lt;/p&gt;
&lt;ol class="loweralpha simple"&gt;
&lt;li&gt;InitCUDA prints a description of the GPU etc.&lt;/li&gt;
&lt;li&gt;cout &amp;lt;&amp;lt; PRINTC(expr) prints an expression's name and then its value and a comma.&lt;/li&gt;
&lt;li&gt;PRINTN is like PRINTC but ends with endl.&lt;/li&gt;
&lt;li&gt;TIME(expr) evals an expression then prints its name and total and delta&lt;ol class="lowerroman"&gt;
&lt;li&gt;CPU time,&lt;/li&gt;
&lt;li&gt;elapsed time,&lt;/li&gt;
&lt;li&gt;their ratio.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;CT evals and prints the elapsed time of a CUDA kernel.&lt;/li&gt;
&lt;li&gt;Later I may add new tests to this.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;/parallel-class/cuda/checksum.cc shows a significant digits problem when
you add many small numbers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;sum_reduction.cu is Stanford's program.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;sum_reduction2.cu is my modification to use managed memory.&lt;/p&gt;
&lt;p&gt;Note how both sum_reduction and sum_reduction2 give different answers
for the serial and the parallel computation.  That is bad.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;sum_reduction3.cu is a mod to try to find the problem.  One problem is
insufficient precision in the sum.  Using double works.  However there
might be other problems.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div class="section" id="computer-factoid"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class07/#id3"&gt;3   Computer factoid&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Unrelated to this course, but perhaps interesting:&lt;/p&gt;
&lt;p&gt;All compute servers have for decades had a microprocessor frontend that controls the boot process.   The current iteration is called an
&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Intelligent_Platform_Management_Interface"&gt;IPMI&lt;/a&gt;.   It has a separate ethernet port, and would allow remote bios configs and booting.   On parallel, that port is not connected (I don't trust the security).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="cuda-doc"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class07/#id4"&gt;4   CUDA Doc&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The start of Nvidia's &lt;a class="reference external" href="https://docs.nvidia.com/cuda/parallel-thread-execution/index.html"&gt;Parallel thread execution&lt;/a&gt; has useful info.&lt;/p&gt;
&lt;p&gt;This is one of a batch of &lt;a class="reference external" href="https://docs.nvidia.com/cuda/"&gt;CUDA docs&lt;/a&gt;.  Browse as you wish.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="stanford-lectures"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class07/#id5"&gt;5   Stanford lectures&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;All the lectures and associated files are on geoxeon and parallel.ecse at /parallel-class/stanford/ .&lt;/p&gt;
&lt;p&gt;They are also online &lt;a class="reference external" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/files/stanford/"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/files/stanford/lectures/lecture_6/parallel_patterns_1.pdf"&gt;Lecture 6 parallel patterns 1&lt;/a&gt; presents some paradigms of parallel programming.   These are generally useful building blocks for parallel algorithms.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="thrust"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class07/#id6"&gt;6   Thrust&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Stanford lecture 8: Thrust.  Thrust is a functional-style frontend to various backends, such as CUDA.&lt;/p&gt;
&lt;p&gt;Programs are in /parallel-class/thrust/ .&lt;/p&gt;
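A minimal Thrust sketch in that functional style, assuming a CUDA toolchain (compile with nvcc); the vector size and values are illustrative only:

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <cstdio>

int main() {
    // 1000 elements, all 2.0, living in device memory.
    thrust::device_vector<float> v(1000, 2.0f);

    // Functional style: the reduction runs on the CUDA backend.
    float sum = thrust::reduce(v.begin(), v.end(), 0.0f, thrust::plus<float>());

    std::printf("sum = %g\n", sum);  // 2000
}
```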
&lt;/div&gt;
&lt;div class="section" id="ibm-s-quantum-computer"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class07/#id7"&gt;7   IBM's quantum computer&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;As mentioned by Narayanaswami Chandrasekhar last week.&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/IBM_Quantum_Experience"&gt;https://en.wikipedia.org/wiki/IBM_Quantum_Experience&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://techcrunch.com/2017/11/10/ibm-passes-major-milestone-with-20-and-50-qubit-quantum-computers-as-a-service/"&gt;https://techcrunch.com/2017/11/10/ibm-passes-major-milestone-with-20-and-50-qubit-quantum-computers-as-a-service/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://www.technologyreview.com/s/609451/ibm-raises-the-bar-with-a-50-qubit-quantum-computer/"&gt;https://www.technologyreview.com/s/609451/ibm-raises-the-bar-with-a-50-qubit-quantum-computer/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://www.research.ibm.com/ibm-q/"&gt;https://www.research.ibm.com/ibm-q/&lt;/a&gt; - points to lots of info, e.g., QISKit on github,
beginners guide, etc.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Your task: learn this and present it to me in class.&lt;/p&gt;
&lt;/div&gt;&lt;/div&gt;</description><category>class</category><guid>https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class07/</guid><pubDate>Tue, 27 Feb 2018 05:00:00 GMT</pubDate></item><item><title>PAR Class 6, Wed 2018-02-21</title><link>https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class06/</link><dc:creator>W Randolph Franklin, RPI</dc:creator><description>&lt;div&gt;&lt;style&gt; .red {color:red} &lt;/style&gt;
&lt;style&gt; .blue {color:blue} &lt;/style&gt;&lt;div class="contents topic" id="table-of-contents"&gt;
&lt;p class="topic-title first"&gt;Table of contents&lt;/p&gt;
&lt;ul class="auto-toc simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class06/#narayanaswami-chandrasekhar-talk-on-blockchains" id="id1"&gt;1   Narayanaswami Chandrasekhar talk on Blockchains&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class06/#optional-homework-bring-your-answers-and-discuss-next-week" id="id2"&gt;2   Optional Homework - bring your answers and discuss next week&lt;/a&gt;&lt;ul class="auto-toc"&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class06/#paper-questions" id="id3"&gt;2.1   Paper questions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class06/#programming-questions" id="id4"&gt;2.2   Programming questions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class06/#stanford-lectures" id="id5"&gt;3   Stanford lectures&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class06/#misc-cuda" id="id6"&gt;4   Misc CUDA&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class06/#managed-variables" id="id7"&gt;5   Managed Variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class06/#doc" id="id8"&gt;6   Doc&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class06/#managed-memory-issues" id="id9"&gt;7   Managed memory issues&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class06/#misc-hints" id="id10"&gt;8   Misc hints&lt;/a&gt;&lt;ul class="auto-toc"&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class06/#vim" id="id11"&gt;8.1   Vim&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="narayanaswami-chandrasekhar-talk-on-blockchains"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class06/#id1"&gt;1   Narayanaswami Chandrasekhar talk on Blockchains&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Class will be short today because of this.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="optional-homework-bring-your-answers-and-discuss-next-week"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class06/#id2"&gt;2   Optional Homework - bring your answers and discuss next week&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="paper-questions"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class06/#id3"&gt;2.1   Paper questions&lt;/a&gt;&lt;/h3&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;Research and then describe the main changes from NVidia Maxwell to Pascal.&lt;/li&gt;
&lt;li&gt;Although a thread can use 255 registers, that might be bad for performance.  Why?&lt;/li&gt;
&lt;li&gt;Give a common way that the various threads in a block can share data with each other.&lt;/li&gt;
&lt;li&gt;Reading a word from global memory might take 400 cycles.  Does that mean that a thread that reads many words from global memory will always take hundreds of times longer to complete?&lt;/li&gt;
&lt;li&gt;Since the threads in a warp are executed in a SIMD fashion, how can an if-then-else block be executed?&lt;/li&gt;
&lt;li&gt;What is unified virtual addressing and how does it make CUDA programming easier?&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div class="section" id="programming-questions"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class06/#id4"&gt;2.2   Programming questions&lt;/a&gt;&lt;/h3&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;Repeat homework 2's matrix multiplication problem, this time in CUDA.  Report how much parallel speedup you get.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Look at the dataset &lt;strong&gt;/parallel-class/data/bunny&lt;/strong&gt;.  It contains 35947 points for the Stanford bunny.&lt;/p&gt;
&lt;p&gt;Assuming that each point has a mass of 1, and is gravitationally attracted to the others, compute the potential energy of the system.  The formula is this:&lt;/p&gt;
&lt;p&gt;&lt;span class="math"&gt;\(U = - \sum_{i=1}^{N-1} \sum_{j=i+1}^N \frac{1}{r_{ij}}\)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;where &lt;span class="math"&gt;\(r_{ij}\)&lt;/span&gt; is the distance between points &lt;span class="math"&gt;\(i\)&lt;/span&gt; and    &lt;span class="math"&gt;\(j\)&lt;/span&gt; .  (This assumes that G=1).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Now look at the dataset &lt;strong&gt;/parallel-class/data/blade&lt;/strong&gt;, which contains 882954 points for a turbine blade.  Can you process it?&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="stanford-lectures"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class06/#id5"&gt;3   Stanford lectures&lt;/a&gt;&lt;/h2&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/stanford/lectures/lecture_5/performance_considerations.pdf"&gt;Lecture 5 performance considerations&lt;/a&gt; shows how to fine tune your program once it's already working, if you need the extra speed.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/stanford/lectures/lecture_6/parallel_patterns_1.pdf"&gt;Lecture 6 parallel patterns 1&lt;/a&gt; presents some paradigms of parallel programming.   These are generally useful building blocks for parallel algorithms.&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div class="section" id="misc-cuda"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class06/#id6"&gt;4   Misc CUDA&lt;/a&gt;&lt;/h2&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;The demo programs are in &lt;strong&gt;/local/cuda/samples/&lt;/strong&gt; .  Their coding style is suboptimal.  However, in &lt;strong&gt;/local/cuda/samples/1_Utilities/&lt;/strong&gt; , &lt;strong&gt;bandwidthTest&lt;/strong&gt; and &lt;strong&gt;deviceQuery&lt;/strong&gt; are interesting.&lt;/p&gt;
&lt;p&gt;For your convenience, &lt;strong&gt;/parallel-class/deviceQuery&lt;/strong&gt; is a link.  Run it to see the GPU's capabilities.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;The program &lt;strong&gt;nvidia-smi&lt;/strong&gt; shows the current load on the GPU.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;My &lt;a class="reference external" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/stanford/tutorials"&gt;web copy&lt;/a&gt; of the tutorial programs from Stanford's parallel course notes is also on parallel at &lt;strong&gt;/parallel-class/stanford/tutorials/&lt;/strong&gt; .&lt;/p&gt;
&lt;ol class="loweralpha"&gt;
&lt;li&gt;&lt;p class="first"&gt;I've edited some of them, and put the originals in &lt;strong&gt;orig/&lt;/strong&gt; , and created new ones.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;To compile them, you need &lt;strong&gt;/local/cuda/bin&lt;/strong&gt; in your &lt;strong&gt;PATH&lt;/strong&gt; and
&lt;strong&gt;/local/cuda/lib64&lt;/strong&gt; in your &lt;strong&gt;LD_LIBRARY_PATH&lt;/strong&gt; .&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Name your source program &lt;strong&gt;foo.cu&lt;/strong&gt;  for some &lt;strong&gt;foo&lt;/strong&gt; .&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Compile it thus:  &lt;strong&gt;nvcc foo.cu -o foo&lt;/strong&gt; .&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;hello_world.cu&lt;/strong&gt; shows a simple CUDA program and uses a hack to print from a device function.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;hello_world2.cu&lt;/strong&gt; shows printing from several threads.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;global_functions.cu&lt;/strong&gt; shows some basic CUDA stuff.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;device_functions.cu&lt;/strong&gt; extends it.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;vector_addition.cu&lt;/strong&gt; does (you figure it out).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;vector_addition2.cu&lt;/strong&gt; is my modification to use unified memory, per &lt;a class="reference external" href="http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html"&gt;http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html&lt;/a&gt; .   I also cleaned up the code and shrank the number of lines for better display.&lt;/p&gt;
&lt;p&gt;IMO, unified memory makes programming a lot easier.&lt;/p&gt;
&lt;p&gt;Notes:&lt;/p&gt;
&lt;ol class="lowerroman"&gt;
&lt;li&gt;&lt;p class="first"&gt;In linux, what's the easiest way to find the smallest prime larger than a given number?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;To find the number of blocks needed for N threads, you can do it the Stanford way:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;grid_size = num_elements / block_size;
if(num_elements % block_size) ++grid_size;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;or you can do it the RPI (i.e., my) way:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;grid_size = (num_elements + block_size - 1) / block_size;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div class="section" id="managed-variables"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class06/#id7"&gt;5   Managed Variables&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Last time we saw 2 ways to create managed variables.  They can be accessed by either the host or the device and are paged automatically.  This makes programming much easier.&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;Create static variables with &lt;strong&gt;__device__ __managed__&lt;/strong&gt;.  See &lt;strong&gt;/parallel-class/stanford/tutorials/vector_addition2.cu&lt;/strong&gt; on parallel.&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;cudaMallocManaged&lt;/strong&gt;.  See &lt;strong&gt;/parallel-class/stanford/tutorials/vector_addition3.cu&lt;/strong&gt; on parallel.&lt;/li&gt;
&lt;li&gt;In either case, you need to call &lt;strong&gt;cudaDeviceSynchronize();&lt;/strong&gt; on the host after starting a parallel kernel and before reading the data on the host.  The reason is that the kernel is started asynchronously, and control returns to the host while the kernel is still executing.&lt;/li&gt;
&lt;li&gt;When the linux kernel gets HMM (heterogeneous memory management), all data on the heap will automatically be managed.&lt;/li&gt;
&lt;li&gt;This works because virtual addresses are long enough to contain a tag saying which device the data is on.  The VM page mapper will read and write pages to various devices, not just to swap files.&lt;/li&gt;
&lt;li&gt;Any CUDA example using &lt;strong&gt;cudaMemcpy&lt;/strong&gt; is now obsolete (on Pascal GPUs).&lt;/li&gt;
&lt;/ol&gt;
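A minimal sketch of style 2 (cudaMallocManaged) plus the required synchronization, assuming a Pascal-class GPU and nvcc; the kernel and sizes are illustrative:

```cuda
#include <cstdio>

// Kernel doubles each element in place.
__global__ void doubleAll(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *a;
    cudaMallocManaged(&a, n * sizeof(float));   // visible to host and device
    for (int i = 0; i < n; ++i) a[i] = 1.0f;    // host writes directly

    doubleAll<<<(n + 255) / 256, 256>>>(a, n);  // launch is asynchronous
    cudaDeviceSynchronize();                    // wait before the host reads

    std::printf("a[0] = %g\n", a[0]);           // 2
    cudaFree(a);
}
```

Note there is no cudaMemcpy anywhere; paging is automatic.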
&lt;/div&gt;
&lt;div class="section" id="doc"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class06/#id8"&gt;6   Doc&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Nvidia's &lt;a class="reference external" href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/"&gt;CUDA programming guide&lt;/a&gt; is excellent, albeit obsolescent in places.  The Pascal info looks like it's been tacked onto an older document.&lt;/p&gt;
&lt;p&gt;The whitepaper &lt;a class="reference external" href="http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_1080_Whitepaper_FINAL.pdf"&gt;NVIDIA GeForce GTX 1080&lt;/a&gt; describes, from a gaming point of view, the P104 GPU, which is in the GTX 1080, the card in parallel.ecse.&lt;/p&gt;
&lt;p&gt;NVIDIA now has a higher level GPU, the P100, described in the &lt;a class="reference external" href="https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf"&gt;P100 whitepaper&lt;/a&gt;
and &lt;a class="reference external" href="http://images.nvidia.com/content/tesla/pdf/nvidia-teslap100-techoverview.pdf"&gt;P100 technical overview&lt;/a&gt;.   Note that the P100 is a Tesla (scientific computing) not a GeForce (gaming).    This description is much more technical.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="managed-memory-issues"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class06/#id9"&gt;7   Managed memory issues&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;I'm sometimes seeing a 2.5x slowdown when using managed memory on the host, compared to using unmanaged memory.  I don't know why.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="misc-hints"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class06/#id10"&gt;8   Misc hints&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="vim"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class06/#id11"&gt;8.1   Vim&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;To get &lt;strong&gt;vim&lt;/strong&gt; to show line numbers, create a file &lt;strong&gt;~/.exrc&lt;/strong&gt; containing this line:&lt;/p&gt;
&lt;blockquote&gt;
:se nu&lt;/blockquote&gt;
&lt;p&gt;It will be read every time vim starts, and will turn on line numbering.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/div&gt;</description><category>class</category><guid>https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class06/</guid><pubDate>Thu, 15 Feb 2018 05:00:00 GMT</pubDate></item><item><title>PAR Class 5, Wed 2018-02-14</title><link>https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class05/</link><dc:creator>W Randolph Franklin, RPI</dc:creator><description>&lt;div&gt;&lt;style&gt; .red {color:red} &lt;/style&gt;
&lt;style&gt; .blue {color:blue} &lt;/style&gt;&lt;div class="contents topic" id="table-of-contents"&gt;
&lt;p class="topic-title first"&gt;Table of contents&lt;/p&gt;
&lt;ul class="auto-toc simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class05/#optional-homework-bring-your-answers-and-discuss-next-week" id="id1"&gt;1   Optional Homework - bring your answers and discuss next week&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class05/#nvidia-conceptual-hierarchy" id="id2"&gt;2   Nvidia conceptual hierarchy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class05/#gpu-range-of-speeds" id="id3"&gt;3   GPU range of speeds&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class05/#cuda" id="id4"&gt;4   CUDA&lt;/a&gt;&lt;ul class="auto-toc"&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class05/#versions" id="id5"&gt;4.1   Versions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class05/#stanford-lectures" id="id6"&gt;4.2   Stanford lectures&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class05/#misc" id="id7"&gt;4.3   Misc&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="section" id="optional-homework-bring-your-answers-and-discuss-next-week"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class05/#id1"&gt;1   Optional Homework - bring your answers and discuss next week&lt;/a&gt;&lt;/h2&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;Write a program to multiply two 100x100 matrices. Do it the conventional way,
not using anything fancy like Schonhage-Strassen. Now, see how much
improvement you can get with OpenMP. Measure only the elapsed time for the
multiplication, not for the matrix initialization.&lt;/p&gt;
&lt;p&gt;Report these execution times.&lt;/p&gt;
&lt;ol class="loweralpha simple"&gt;
&lt;li&gt;Without OpenMP (don't use -fopenmp; comment out the pragmas).&lt;/li&gt;
&lt;li&gt;With OpenMP, using only 1 thread.&lt;/li&gt;
&lt;li&gt;Using 2, 4, 8, 16, 32, 64 threads.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Write programs to test the effect of the reduction pragma:&lt;/p&gt;
&lt;ol class="loweralpha simple"&gt;
&lt;li&gt;Create an array of 1,000,000,000 floats and fill it with pseudorandom numbers from 0 to 1.&lt;/li&gt;
&lt;li&gt;Do the following tests with 1, 2, 4, 8, 16, and 32 threads.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Programs to write and test:&lt;/p&gt;
&lt;ol class="loweralpha simple"&gt;
&lt;li&gt;Sum it with a simple for loop.  This will give a wrong answer with more than 1 thread, but is fast.&lt;/li&gt;
&lt;li&gt;Sum it with the subtotal variable protected with an &lt;strong&gt;atomic&lt;/strong&gt; pragma.&lt;/li&gt;
&lt;li&gt;Sum it with the subtotal variable protected with a &lt;strong&gt;critical&lt;/strong&gt; pragma.&lt;/li&gt;
&lt;li&gt;Sum it with a reduction loop.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Devise a test program to estimate the time to execute a task pragma. You might start with use taskfib.cc.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Sometime parallelizing a program can increase its elapsed time. Try to create such an example, with 2 threads being slower than 1.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div class="section" id="nvidia-conceptual-hierarchy"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class05/#id2"&gt;2   Nvidia conceptual hierarchy&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;As always, this is as I understand it, and could be wrong.  Nvidia uses their own terminology inconsistently.  They may use one name for two things (e.g., Tesla and GPU), and may use two names for one thing (e.g., module and accelerator).  As time progresses, they change their terminology.&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;At the bottom is the hardware &lt;strong&gt;micro-architecture&lt;/strong&gt;.  This is an API that defines things like the available operations.  The last several Nvidia micro-architecture generations are, in order, &lt;strong&gt;Tesla&lt;/strong&gt; (which introduced unified shaders), &lt;strong&gt;Fermi&lt;/strong&gt;, &lt;strong&gt;Kepler&lt;/strong&gt;, &lt;strong&gt;Maxwell&lt;/strong&gt; (introduced in 2014), &lt;strong&gt;Pascal&lt;/strong&gt; (2016), and &lt;strong&gt;Volta&lt;/strong&gt; (2018).&lt;/li&gt;
&lt;li&gt;Each micro-architecture is implemented in several different &lt;strong&gt;microprocessors&lt;/strong&gt;.  E.g., the Kepler micro-architecture is embodied in the GK107, GK110, etc.  Pascal is GP104 etc.  The second letter describes the micro-architecture.  Different microprocessors with the same micro-architecture may have different amounts of various resources, like the number of processors and clock rate.&lt;/li&gt;
&lt;li&gt;To be used, microprocessors are embedded in &lt;strong&gt;graphics cards&lt;/strong&gt;, aka &lt;strong&gt;modules&lt;/strong&gt; or &lt;strong&gt;accelerators&lt;/strong&gt;, which are grouped into series such as GeForce, Quadro, etc.  Confusingly, there is a Tesla computing module that may use any of the Tesla, Fermi, or Kepler micro-architectures.  Two different modules using the same microprocessor may have different amounts of memory and other resources.  These are the components that you buy and insert into a computer.  A typical name is &lt;strong&gt;GeForce GTX1080&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;There are many slightly different accelerators with the same architecture, but different clock speeds and memory, e.g. 1080, 1070, 1060, ...&lt;/li&gt;
&lt;li&gt;The same accelerator may be manufactured by different vendors, as well as by Nvidia.  These different versions may have slightly different parameters.  Nvidia's &lt;strong&gt;reference version&lt;/strong&gt; may be relatively low performance.&lt;/li&gt;
&lt;li&gt;The term &lt;strong&gt;GPU&lt;/strong&gt; sometimes refers to the microprocessor and sometimes to the module.&lt;/li&gt;
&lt;li&gt;There are four families of modules: &lt;strong&gt;GeForce&lt;/strong&gt; for gamers, &lt;strong&gt;Quadro&lt;/strong&gt; for professionals, &lt;strong&gt;Tesla&lt;/strong&gt; for computation, and &lt;strong&gt;Tegra&lt;/strong&gt; for mobility.&lt;/li&gt;
&lt;li&gt;Nvidia uses the term &lt;strong&gt;Tesla&lt;/strong&gt; in two unrelated ways.  It is an obsolete architecture generation and a module family.&lt;/li&gt;
&lt;li&gt;Geoxeon has a (Maxwell) GeForce GTX Titan and a (Kepler) Tesla K20xm.  Parallel has a (Pascal) GeForce GTX 1080.  We also have an unused (Kepler) Quadro K5000.&lt;/li&gt;
&lt;li&gt;Since the highest-end (Tesla) modules don't have video out, they are also called something like &lt;strong&gt;compute modules&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div class="section" id="gpu-range-of-speeds"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class05/#id3"&gt;3   GPU range of speeds&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Here is an example of the wide range of Nvidia GPU speeds; all times are ±20%.&lt;/p&gt;
&lt;p&gt;The GTX 1080 has 2560 CUDA cores @ 1.73GHz and 8GB of memory.
matrixMulCUBLAS runs at 3136 GFlops.  However, the reported time (0.063 msec) is so small that it may be inaccurate.  The quoted speed of the 1080 is about triple that.  I'm impressed that the measured performance is so close.&lt;/p&gt;
&lt;p&gt;The Quadro K2100M in  my Lenovo W540 laptop has 576 CUDA cores @ 0.67 GHz and 2GB of memory.  matrixMulCUBLAS runs at 320 GFlops.   The time on the GPU was about .7 msec, and on the CPU 600 msec.&lt;/p&gt;
&lt;p&gt;It's nice that the performance almost scaled with the number of cores and clock speed.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="cuda"&gt;
&lt;h2&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class05/#id4"&gt;4   CUDA&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="versions"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class05/#id5"&gt;4.1   Versions&lt;/a&gt;&lt;/h3&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;CUDA has a &lt;strong&gt;capability version&lt;/strong&gt;, whose major number corresponds to the micro-architecture generation.  Kepler is 3.x.  The K20xm is 3.5.  The GTX 1080 is 6.1.  Here is a table of the &lt;a class="reference external" href="http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities"&gt;properties of different compute capabilities&lt;/a&gt;.  However, that table is not completely consistent with what deviceQuery shows, e.g., the shared memory size.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;strong&gt;nvcc&lt;/strong&gt;, the CUDA compiler, can be told which capabilities (aka architectures) to compile for.   They can be given as a real  architecture, e.g., sm_61, or a virtual architecture. e.g., compute_61.&lt;/p&gt;
&lt;p&gt;Just use the option &lt;strong&gt;-arch=compute_61&lt;/strong&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;The CUDA driver and runtime also have a software version, defining things like available C++ functions.  The latest is 9.1.   This is unrelated to the capability version.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
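Putting the pieces together, a typical build for the GTX 1080 might look like this (the /local/cuda paths are the class machines' locations, and foo.cu is a placeholder name; adjust elsewhere):

```shell
# Put the CUDA toolchain on the search paths (class-machine locations).
export PATH=/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/local/cuda/lib64:$LD_LIBRARY_PATH

# Compile for capability 6.1 (Pascal, GTX 1080).
nvcc -arch=compute_61 foo.cu -o foo
```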
&lt;/div&gt;
&lt;div class="section" id="stanford-lectures"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class05/#id6"&gt;4.2   Stanford lectures&lt;/a&gt;&lt;/h3&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/stanford/lectures/"&gt;On the web server&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;On geoxeon:  /parallel-class/stanford/lectures/&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/stanford/lectures/lecture_4/cuda_memories.pdf"&gt;Lecture 4&lt;/a&gt;: how to&lt;ol class="arabic"&gt;
&lt;li&gt;cache data into shared memory for speed, and&lt;/li&gt;
&lt;li&gt;use hierarchical sync.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div class="section" id="misc"&gt;
&lt;h3&gt;&lt;a class="toc-backref" href="https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class05/#id7"&gt;4.3   Misc&lt;/a&gt;&lt;/h3&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;With CUDA, the dominant problem in program optimization is optimizing the data flow.  Getting the data quickly to the cores is harder than processing it.  It helps big to have regular arrays, where each core reads or writes a successive entry.&lt;/p&gt;
&lt;p&gt;This is analogous to the hardware fact that wires are bigger (hence, more expensive) than gates.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;That is the opposite optimization to OpenMP, where having different threads writing to adjacent addresses will cause the false sharing problem.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;&lt;a class="reference external" href="https://developer.nvidia.com/cuda-faq"&gt;Nvidia CUDA FAQ&lt;/a&gt;&lt;/p&gt;
&lt;ol class="loweralpha simple"&gt;
&lt;li&gt;has links to other Nvidia docs.&lt;/li&gt;
&lt;li&gt;is a little old.  Kepler and Fermi are 2 and 3 generations old.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
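A sketch of what "each core reads or writes a successive entry" looks like in a kernel (the kernels, sizes, and stride are illustrative assumptions, not measured code):

```cuda
#include <cstdio>

// Coalesced: thread i touches element i, so adjacent threads in a warp
// access adjacent addresses and the hardware merges them into few transactions.
__global__ void scaleCoalesced(float *a, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= s;
}

// Strided: adjacent threads touch addresses far apart, which can turn
// one merged transaction into many -- the slow pattern to avoid.
__global__ void scaleStrided(float *a, int n, int stride, float s) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) a[i] *= s;
}

int main() {
    const int n = 1 << 20;
    float *a;
    cudaMallocManaged(&a, n * sizeof(float));
    for (int i = 0; i < n; ++i) a[i] = 1.0f;

    scaleCoalesced<<<(n + 255) / 256, 256>>>(a, n, 2.0f);    // fast access pattern
    scaleStrided<<<(n + 255) / 256, 256>>>(a, n, 32, 2.0f);  // slower access pattern
    cudaDeviceSynchronize();

    std::printf("a[0] = %g\n", a[0]);  // 4
    cudaFree(a);
}
```

Both kernels compute correct results; only their memory-access patterns, and hence their speeds, differ.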
&lt;/div&gt;
&lt;/div&gt;&lt;/div&gt;</description><category>class</category><guid>https://wrf.ecse.rpi.edu/Teaching/parallel-s2018/posts/class05/</guid><pubDate>Wed, 14 Feb 2018 05:00:00 GMT</pubDate></item></channel></rss>