Get-started with MPI ======================= .. include:: /global.rst Applications ----------------- .. _tutor-app-distributed-sum: Distributed Summation """""""""""""""""""""""""" To demonstrate the usage of MPI collective calls, we implement the parallel summation of: .. math:: S = \sum_{i=0}^{N-1} i. This is an over-simplied example, but its computation pattern (i.e., task decomposition and result reduction) is frequently used in parallel computation. The code can be downloaded at :download:`app-distributed-sum.cpp `. We start with defining the "tasks" - each process is responsible for the summation in a subrange of ``[0, N)``. We compute the subrange boundaries at the root process, and ``scatter`` them to all processes:: typedef std::array range_t; range_t range; vector ranges; if( rank == 0 ) { int N = 10000, N_each = N / n_procs; for(int i=0; i`. The process ranked ``rank`` is responsible for the intervals indexed ``rank, rank+n_procs, rank+2*n_procs, ...``. The local integration value is simply evaluated by:: int N = 10000; double h = 1.0 / N, sum = 0.; for(int i=rank; i`. The matrix-vector multiplication is defined as, for each row :math:`i`, .. math:: c_i = \sum_{j=0}^{n_{\rm cols}} A_{ij} b_j. The simplest task-decomposition scheme is to distribute rows of :math:`A` among processes and let each process compute on a subset of rows. In real applications, sub-tasks may consum various amount of times. To balance the works on processes, self-scheduling algorithm is necessary. In this example, we use a single master process (also called root process, ranked ``root=n_procs-1``) to control and assign tasks to other worker processes (ranked in ``[0, n_procs-1)``). The steps for this algorithm are - The master broadcasts all shared data to workers. - The master assigns each worker a initial task. - The worker waits for a job, finishes it, and replies the answer to the master. - The master waits for reply. Once he gets a reply, stores the answer and sends another task to the worker. - If all tasks finished, the master sends a stop signal to each worker. Here, we use HIPP's :class:`~HIPP::NUMERICAL::DArray` to describe these numerical arrays:: #include using HIPP::NUMERICAL::DArray; typedef DArray arr_t; // matrix typedef DArray vec_t; // vector To implement the algorithm, we begin with defining the algorithm of a worker:: // if rank != root vec_t a({n_cols}), b({n_cols}); comm.bcast(b, root); while(1) { int tag = comm.recv(root, ANY_TAG, a).tag(); if( tag == n_rows ) break; double res = (a*b).sum(); comm.send(root, tag, res); } Here, the worker first gets the shared data, i.e., :math:`b`, then waits the assignment repeatedly. Each assignment is described by a row :math:`a` of :math:`A`. The row number is described by the communication tag for clarity. The stop signal comes simply when the row number approaches the actual number of rows of :math:`A`. Once get the tasks specification, the worker finishes the job (i.e., by a vector-vector inner product ``(a*b).sum()``) and sends back the result. The master process has to first initialize the shared data, :math:`b`, broadcasts them to workers. The matrix :math:`A` is also initialized, but gets sent later. Here we simply set them to constant ``1.0``:: // if rank == root arr_t A({n_rows, n_cols}, 1.0); vec_t b({n_cols}, 1.0), c({n_rows}); comm.bcast(b, root); The master assigns initial tasks to workers by sending the first ``n_workers`` rows to them, respectively:: int row = 0; while(row < n_workers && row < n_rows) { comm.send(row, row, &A(row, 0), n_cols); // send initial rows to workers ++ row; } Then, the master waits for reply, stores the replied answer, and sends the next available row to worker. Finally, empty massages with ``tag = n_rows`` are sent as stop signal:: double res; for(int i=0; i