parallel programiming

Posted on at



 New standard of C++ has been ratified◦ “C++0x” ==> “C++11” Lots of new features We’ll focus on concurrency features
3 Async programming: Better responsiveness… GUIs (desktop, web, mobile) Cloud Windows 8  Parallel programming: Better performance… Financials Pharma Engineering Big dataGoing Parallel with C++11 Mandelbrot Set…Going Parallel with C++11 45MainthreadMain<<start Work>>if…while…WorkStmt1;Stmt2;Stmt3;Main<<start Work1>><<start Work2>>if…while…Work1Stmt1;Stmt2;Stmt3;Work2Stmt4;Stmt5;Stmt6;WorkerthreadWorkerthreadWorkerthreadMainthread Single core:  Multicore:Going Parallel with C++11 Numerous threading models are available:◦ POSIX (aka Pthreads)◦ Win32 (aka Windows)◦ Boost◦ Java◦ .NET◦ …Going Parallel with C++11 6 C++11 threads are the new kid on the block ◦ std::thread class now part of standard C++ library◦ std::thread is an abstraction ─ maps to local platform threads(POSIX, Windows, etc.)Going Parallel with C++11 78#include <thread>#include <iostream>void func(){std::cout << "**Inside thread "<< std::this_thread::get_id() << "!" << std::endl;}int main(){std::thread t( func );t.join();return 0;}A simple function for thread to do…Create & schedule thread to execute func…Wait for thread to finish… Hello world…Going Parallel with C++11 910#include <thread>#include <iostream>void func(){std::cout << "**Hello world...\n";}int main(){std::thread t( func );t.join();return 0;}(1) Thread function must do exception handling;unhandled exceptions ==> error termination…void func(){try{// computation:}catch(...){// do something:}}(2) Must join with thread *before* handle goes outof scope, otherwise error termination…NOTE: avoid use of detach( ) in C++11, difficult to usesafely with general resource cleanup. But it’s an option.Going Parallel with C++11 std::thread written to *prevent* copying of threads11int main(){std::thread t( func );std::thread t2(t);std::thread t3 = t; X std // NOTE: t is no longer valid! assert( ::thread t.joinable t2( std () == false ::move(t) ); );std::thread& t3 = t2;. . .t2.join(); // or t3.join();compilation errors…But you can move, and reference…std::move tells compiler to invoke move constructor: thread( thread&& other ) Constructors:Going Parallel with C++11 12class thread{thread(); // creates new thread object that does *not* represent a thread (i.e. not joinable)thread( std::Function&& f, Args&&... args ); // creates new thread to execute fthread( thread&& other); // *move* constructorthread( thread& other); // *copy* constructor --- not availabletemplate<class std::Function, class... Args>explicit thread(std::Function&& f, Args&&... args); Old school:◦ thread functions (what we just saw) Middle school:◦ function objects New school:◦ C++11 now offers lambda expressions aka anonymous functionsGoing Parallel with C++11 13class FuncObject{public:void operator() (void){ cout << this_thread::get_id() << endl; }};int main(){FuncObject f;std::thread t( f ); Type inference Lambda expressionsGoing Parallel with C++11 14auto lambda = [&] () -> int{int sum = 0;for (int i=0; i<N; ++i)sum += A[i];return sum;};lambda expression = code + dataClosure semantics:[ ]: none, [&]: by ref, [=]: by val, …infer variable type lambda arguments == parametersreturn type… Saxpy == Scalar Alpha X Plus Y◦ Scalar multiplication and vector addition15for (int i=0; i<n; i++)z[i] = a * x[i] + y[i];auto code = [&](int start, int end) -> void{for (int i = start; i < end; i++)z[i] = a * x[i] + y[i];};thread t1(code, 0 /*start*/, N/2 /*end*/);thread t2(code, N/2 /*start*/, N /*end*/);Parallel Lambdas:◦ Easier and more readable -- code remains inline◦ Potentially more dangerous ([&] captures everything by ref) Functions:◦ More efficient -- lambdas involve class, function objects◦ Potentially safer -- requires explicit variable scoping◦ More cumbersome and less readableGoing Parallel with C++11 16 Matrix multiply…Going Parallel with C++11 17Going Parallel with C++11 18//// Naïve, triply-nested sequential solution://for (int i = 0; i < N; i++){for (int j = 0; j < N; j++){C[i][j] = 0.0;for (int k = 0; k < N; k++)C[i][j] += (A[i][k] * B[k][j]);}} A common pattern when creating multiple threads19forkjoin#include <vector>std::vector<std::thread> threads;int cores = std::thread::hardware_concurrency();for (int i=0; i<cores; ++i) // 1 per core:{auto code = []() { DoSomeWork(); };threads.push_back( thread(code) );}for (std::thread& t : threads) // new range-based for:t.join();Going Parallel with C++1120int rows = N / numthreads;int extra = N % numthreads;int start = 0; // each thread does [start..end)int end = rows;auto code = [N, &C, &A, &B](int start, int end) -> void{for (int i = start; i < end; i++)for (int j = 0; j < N; j++){C[i][j] = 0.0;for (int k = 0; k < N; k++)C[i][j] += (A[i][k] * B[k][j]);}};vector<thread> workers;for (int t = 1; t <= numthreads; t++){if (t == numthreads) // last thread does extra rows:end += extra;workers.push_back( thread(code, start, end) );start = end;end = start + rows;}for (thread& t : workers)t.join();// 1 thread per core:numthreads = thread::hardware_concurrency(); Parallelism alone is not enough…21HPC == Parallelism + Memory Hierarchy ─ ContentionExpose parallelismMaximize data locality:• network• disk• RAM• cache• coreMinimize interaction:• false sharing• locking• synchronizationGoing Parallel with C++11 Cache-friendly MM…22XGoing Parallel with C++11 Significantly-better caching, and performance…Going Parallel with C++11 23workers.push_back( thread([start, end, N, &C, &A, &B](){for (int i = start; i < end; i++)for (int j = 0; j < N; j++)C[i][j] = 0;for (int i = start; i < end; i++)for (int k = 0; k < N; k++)for (int j = 0; j < N; j++)C[i][j] += (A[i][k] * B[k][j]);})); Most common types:◦ Data◦ Task◦ Embarrassingly parallel◦ Dataflow24 Def: same operation executed in parallel on different data.25Customersfor(i=0; i<N; i++)for(j=0; j<N; j++)A[i,j] = sqrt(c * A[i,j]);foreach(Customer c)UpdatePortfolio(c); Def: different operations executed in parallel.26MarketDataUpdatePortfolios(); // task1:PredictMarkets(); // task2:AssessRisks(); // task3: Def: a problem is embarrassingly parallel if thecomputations are independent of one another.27for(i=0; i<N; i++)for(j=0; j<N; j++)A[i,j] = sqrt(c * A[i,j]);Not embarrassing at all, butin fact yields the best results."Delightfully parallel" Def: when operations depend on one another.◦ data "flows" from one operation to another…28parallel execution requirescommunication / coordinationDepending on nature ofdataflow, may notparallelize well… Image processing…29for(r=1; r<Rows-1; r++)for(c=1; c<Cols-1; c++)image[r,c] = Avg(image[ r-1, c ], // N:image[ r+1, c ], // S:image[ r, c+1 ], // E:image[ r, c-1 ]); // W:Going Parallel with C++11 30 Most compilers fully implement C++11 gcc 4.8.1 has complete support◦ gcc.gnu.org/projects/cxx0x.html clang 3.3 has complete support◦ clang.llvm.org/cxx_status.html Visual C++ 2012 has reasonable support◦ Near complete support for concurrency◦ Most of the major features of C++11 are there as well…◦ Will be nearly complete in VC++ 2013, but not 100%Going Parallel with C++11 3132# makefile# threading library: one of these should work# tlib=threadtlib=pthread# gcc 4.6:# ver=c++0x# gcc 4.7 and 4.8:ver=c++11build:g++ -std=$(ver) -Wall main.cpp -l$(tlib)Going Parallel with C++1133Concept Header SummaryThreads <thread> Standard, low-level, type-safe; good basis forbuilding HL systems (futures, tasks, …)Futures <future> Via async function; hides threading, betterharvesting of return value & exception handlingLocking <mutex> Standard, low-level locking primitivesCondition Vars <condition_variable> Low-level synchronization primitivesAtomics <atomic> Predictable, concurrent access without data raceMemory Model “Catch Fire” semantics; if program contains a datarace, behavior of memory is undefinedThread Local Thread-local variables [ problematic => avoid ]Going Parallel with C++11 34 Futures provide a higher level of abstraction◦ you start an asynchronous / parallel operation◦ you are returned a handle to wait for the result◦ thread creation, join, and exceptions are handled for youGoing Parallel with C++11 35 Use async to start asynchronous operation Use returned future to wait upon result / exception36#include <future>std::future<int> f = std::async( []() -> int{int result = PerformLongRunningOperation();return result;});..try{int x = f.get(); // wait if necessary, harvest result:cout << x << endl;}catch(exception &e){cout << "**Exception: " << e.what() << endl;}STARTWAITlambda return type…Going Parallel with C++11 Run on current thread *or* a new thread By default, system decides…◦ based on current load, available cores, etc.37// runs on current thread when you “get” value (i.e. lazy execution):future<T> f1 = std::async( std::launch::deferred, []() -> T {...} );// runs now on a new, dedicated thread:future<T> f2 = std::async( std::launch::async, []() -> T {...} );// let system decide (e.g. maybe you created enough work to keep system busy?):future<T> f3 = std::async( []() -> T {...} );optional argument missingGoing Parallel with C++11 Netflix data-mining…38NetflixMovieReviews(.txt)Netflix DataMining AppAverage rating…Going Parallel with C++11Going Parallel with C++11 39cin >> movieID;vector<string> ratings = readFile("ratings.txt");tuple<int,int> results = dataMine(ratings, movieID);int numRatings = std::get<0>(results);int sumRatings = std::get<1>(results);double avgRating = double(numRatings) / double(sumRatings);cout << numRatings << endl;cout << avgRating << endl; dataMine(vector<string> &ratings, int id){foreach ratingif ids match num++, sum += rating;return tuple<int,int>(num, sum);}40int chunksize = ratings.size() / numthreads;int leftover = ratings.size() % numthreads;int begin = 0; // each thread does [start..end)int end = chunksize;vector<future<tuple<int,int>>> futures;for (int t = 1; t <= numthreads; t++){if (t == numthreads) // last thread does extra rows:end += leftover;futures.push_back(async([&ratings, movieID, begin, end]() -> tuple<int,int>{return dataMine(ratings, movieID, begin, end);}));begin = end;end = begin + chunksize;}for (future<tuple<int,int>> &f: futures){tuple<int, int> t = f.get();numRatings += std::get<0>(t);sumRatings += std::get<1>(t);}Going Parallel with C++11dataMine(..., int begin, int end){foreach rating in begin..endif ids match num++, sum += rating;return tuple<int,int>(num, sum);} Futures provide a way to check if result is available◦ this way we don't"wait" unless thereis data to process…Going Parallel with C++11 41for (...){.. // create futures, add to vector:.}// WaitAll: wait and process futures in order they complete, versus// the order they appear in vector. This is O(N), N = vector size.size_t cur = 0;size_t running = futures.size();while (running > 1) { // poll vector of futures for one that is ready:std::future<std::tuple<int,int>> &f = futures[cur];auto status = f.wait_for(std::chrono::milliseconds(10));if (status == std::future_status::ready) {std::tuple<int, int> t = f.get();numRatings += std::get<0>(t);sumRatings += std::get<1>(t);running--;futures.erase(futures.begin() + cur);}cur++; if (cur >= running) cur = 0;}std::tuple<int, int> t = futures[0].get(); // last one, just wait:numRatings += std::get<0>(t);sumRatings += std::get<1>(t);Going Parallel with C++11 42 Beware the many dangers of concurrency: Most common pitfall for application developers?43race conditions Livelock DeadlockStarvationOptimizing Compilersoptimizing hardwareRace conditions…Going Parallel with C++11 Consider 2 threads accessing a shared variable…44thread t1([&](){int r = compute();sum = sum + r;});int sum = 0;thread t2([&](){int s = compute();sum = sum + s;});Error! Racecondition…Going Parallel with C++11 C++ committee thought long and hard on memorymodel semantics…◦ “You Don’t Know Jack About Shared Variables or Memory Models”,Boehm and Adve, CACM, Feb 2012 Conclusion:◦ No suitable definition in presence of race conditions Result in C++11:◦ Predictable memory model *only* in data-race-free codes◦ Computer may “catch fire” in presence of data racesGoing Parallel with C++11 45 A program is data-race-free (DRF) if no sequentiallyconsistent execution results in a data race. Avoidanything else.46Def: two memory accesses conflict if they1. access the same scalar object or contiguous sequence of bit fields, and2. at least one access is a store.Def: two memory accesses participate in a data race if they1. conflict, and2. can occur simultaneously.Going Parallel with C++11 Various solutions… redesign to eliminate (e.g. reduction) use thread-safe entities (e.g. parallel collections) use synchronization (e.g. locking)47most preferredleast preferredGoing Parallel with C++11 Redesign to eliminate shared resource…48auto f1 = async([&]() -> int{int r = compute();return r;});int sum = 0;auto f2 = async([&]() -> int{int s = compute();return s;});sum = f1.get() + f2.get();Going Parallel with C++11 Use std::mutex (aka "lock") to control access to critical section…49thread t1([&](){int r = compute();m.lock();sum = sum + r;m.unlock();});#include <mutex>std::mutex m;int sum = 0;thread t2([&](){int s = compute();m.lock();sum = sum + s;m.unlock();});Def: a critical section is thesmallest region of codeinvolved in a race condition.criticalsection Prime numbers…Going Parallel with C++11 50 “Resource Acquisition Is Initialization”◦ Advocated by B. Stroustrup for resource management◦ Uses constructor & destructor to properly manage resources (files, threads,locks, …) in presence of exceptions, etc.51thread t([&](){int r = compute();m.lock();sum += r;m.unlock();});thread t([&](){int r = compute();{lock_guard<mutex> lg(m);sum += r;}});should be written as…Locks m in constructorUnlocks m in destructorGoing Parallel with C++11 Can also use std::atomic to prevent data races…◦ Lighter-weight than locking, but much more limited in applicability52thread t1([&](){count++;});#include <atomic>std::atomic<int> count(0);thread t2([&](){count++;});thread t3([&](){count = count + 1;}); Xnot safe…Going Parallel with C++11Going Parallel with C++11 53vector<long> primes;for (long p = 2; p <= N; p++){if (isPrime(p))primes.push_back(p);}vector<long> primes;vector<thread> workers;mutex m;atomic<long> candidate = 2;for (int t = 1; t <= numthreads; t++){workers.push_back( thread([&]() -> void{while (true){int p = candidate++;if (p > N) break;if (isPrime(p)) {lock_guard<mutex> _(m);primes.push_back(p);}}}));}for (thread& t : workers)t.join();sort(primes.begin(), primes.end());Going Parallel with C++11 54 Tasks are a higher-level abstraction◦ Idea: developers identify work run-time system deals with load-balancing, execution details, etc.Going Parallel with C++11 55Task: a unit of work; an object denoting anongoing operation or computation. Matrix multiply using Microsoft PPL…56for (int i = 0; i < N; i++)for (int j = 0; j < N; j++)C[i, j] = 0.0;Concurrency::parallel_for(0, N, [&](int i)for (int i = 0; i < N; i++){for (int k = 0; k < N; k++)for (int j = 0; j < N; j++)C[i, j] += (A[i, k] * B[k, j]);});57Windows ProcessThread Poolworkerthreadworkerthreadworkerthreadworkerthreadparallel_for( ... );task task task taskglobal work queueParallel Patterns LibraryResource ManagerTask SchedulerWindows PPL based on Microsoft’s ConcRT (Concurrent Run-Time) C++11 implemented on top of ConcRT58WindowsVC++ Parallel DeveloperMicrosoft ConcRT(Concurrency Runtime)C++11 STLParallel Patterns LibraryAsync Agents Librarystd::future == ConcRT taskstd::thread == Windows threadGoing Parallel with C++11Going Parallel with C++11 59 C++11 provides basic concurrency support◦ threads◦ futures◦ locking◦ condition variables◦ a foundation for platform-neutral parallel libraries C++11 provides lots of additional features◦ lambda expressions, type inference, range-based for, … Beware of data races◦ most common error in parallel programming◦ program behavior is undefined◦ whenever possible, redesign to eliminate…



About the author

160