One-Sided Communications

MPI-2 introduced the remote memory access (RMA) subroutines, and MPI-3 enhanced them. RMA does not fall under the message-passing communication model, but it has advantages in some cases:

  • RMA does not always require the remote process to participate explicitly in the communication, which can make it easier to use and more expressive. Because of this feature, RMA is also called one-sided communication.

  • RMA can perform better when it is supported by specific hardware. Traditional message-passing communication requires message matching, which may introduce unnecessary time-ordering.

Because of these advantages, MPI supports RMA operations. The key design strategy is to make the RMA subroutines as similar as possible to the message-passing subroutines: MPI uses communicators and datatypes in RMA operations, just as in message-passing operations. The MPI standard requires the RMA subroutines to be available on any hardware, no matter whether RMA is directly supported by the hardware, and no matter whether the system uses coherent memory. So, if you already have an MPI-3 compliant library, just try RMA.

Active-target RMA with Fence

RMA operations are managed by a “window” object of type HIPP::MPI::Win. An RMA window is created from an existing communicator. As usual, we may get the global “world” communicator from the MPI environment object:

HIPP::MPI::Env env;
auto comm = env.world();
int rank = comm.rank(), n_procs = comm.size();

If not all of the processes participate in RMA, just create a sub-communicator from the global one.

To create an RMA window, each process in the communicator needs to attach a memory buffer to it. RMA operations may then be posted on these buffers. The method win_create of the communicator accepts the base address of the buffer, its size in bytes, and a displacement unit which is used in the RMA operations. It returns the window object created for RMA. In the following, we create an RMA window with each process attaching a buffer of two integers:

vector<int> buff(2);
void *base = buff.data();
int disp_unit = sizeof(int), buff_size = disp_unit*buff.size();
auto win = comm.win_create(base, buff_size, disp_unit);
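The role of the displacement unit can be sketched in plain C++. In MPI, an RMA call addresses target memory at the window base plus the target displacement multiplied by the displacement unit. The helper names below are hypothetical and are not part of HIPP or MPI; the sketch also assumes that HIPP passes displacements to MPI unchanged.

```cpp
#include <cassert>
#include <cstddef>

// Byte offset, counted from the start of the target's attached buffer, that a
// displacement expressed in units of disp_unit bytes refers to.
// (Hypothetical helper for illustration; not a HIPP or MPI function.)
std::size_t target_byte_offset(std::size_t disp_unit, std::size_t target_disp) {
    return disp_unit * target_disp;
}

// Hypothetical check: do `count` ints starting at displacement `target_disp`
// lie entirely inside a window of `win_bytes` bytes?
bool fits_in_window(std::size_t win_bytes, std::size_t disp_unit,
                    std::size_t target_disp, std::size_t count) {
    assert(disp_unit > 0);
    return target_byte_offset(disp_unit, target_disp) + count * sizeof(int)
           <= win_bytes;
}
```

With disp_unit = sizeof(int), as in the window created above, displacement 1 refers to the second integer of the two-integer buffer, and displacement 2 would already fall outside it.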

Once we have a window object, the methods get, put, accumulate, and their variants can be called. These RMA operations get data from, put data to, or accumulate data into a remote buffer. In MPI, RMA operations are not completed automatically, so a pair of fence calls is necessary to signal the beginning and the end of the RMA operations between them. In the following, each process puts its rank into the buffers of its two neighbors in a cyclic way:

int prev = (rank != 0) ? (rank-1) : (n_procs-1),
    next = (rank != n_procs-1) ? (rank+1) : 0;
win.fence();
win.put(next, rank, 0);
win.put(prev, rank, 1);
win.fence();

{
    HIPP::MPI::SeqBlock seq(comm);
    HIPP::pout << "rank=", rank, ", buff=(", buff, ")", endl;
}

Note that we print the content of the buffer in a sequential block guarded by a HIPP::MPI::SeqBlock instance. This serializes the printing from different processes. The use of fence is the easiest way to synchronize RMA operations. We will explore more features of RMA in the following section.

The output of the above code, run with 4 processes, is:

rank=0, buff=(3,1)
rank=1, buff=(0,2)
rank=2, buff=(1,3)
rank=3, buff=(2,0)

We see that every process gets the ranks of its two neighboring processes. The full code sample can be found at mpi/rma-win-creation.cpp.
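To predict the buffers for another number of processes, the ring exchange can be replayed serially without MPI. The following is a minimal sketch: the function simulate_ring_put is a hypothetical name, and it only models where each put lands, not the actual communication or synchronization.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Serial stand-in for the fence/put/fence exchange; not HIPP or MPI code.
// Returns, for each rank, the two-integer buffer it would hold after the
// second fence. Slots start at -1 so that an unwritten slot would be visible.
std::vector<std::array<int, 2>> simulate_ring_put(int n_procs) {
    assert(n_procs > 0);
    std::vector<std::array<int, 2>> buff(n_procs, std::array<int, 2>{-1, -1});
    for (int rank = 0; rank < n_procs; ++rank) {
        int prev = (rank != 0) ? (rank - 1) : (n_procs - 1);
        int next = (rank != n_procs - 1) ? (rank + 1) : 0;
        buff[next][0] = rank;  // models win.put(next, rank, 0)
        buff[prev][1] = rank;  // models win.put(prev, rank, 1)
    }
    return buff;
}
```

Calling simulate_ring_put(4) gives the same four buffers as the output listed above. Every slot is written by exactly one process, so the puts within one fence epoch do not conflict.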

Passive-target RMA with Lock/Unlock

Passive-target RMA operations are truly “one-sided”, because they do not require the participation of the remote process. Passive-target RMA uses the same data access subroutines as active-target RMA, such as get, put and accumulate. The difference is in the synchronization: passive-target RMA uses lock/unlock operations to synchronize the data access.

In the following example, we use passive-target RMA to transfer data multiple times from one process to another. The code can be found at mpi/rma-passive-target.cpp.

We start by initializing the MPI environment and getting the rank of the current process. We will pass 5 pieces of data from process 0 to process 1 (n_RMAs=5):

HIPP::MPI::Env env;
auto comm = env.world();
int rank = comm.rank(), n_RMAs = 5;

Process 0 and process 1 have different things to do, so we separate the code into two parts. In process 0, we create the window with the win_create method of the communicator, attaching the memory of one integer. Then we repeatedly assign a value to the local window memory, each assignment protected by a pair of lock and unlock calls on the window object. The lock/unlock operations ensure that the contents of the private and public memories are synchronized. After the first barrier, process 1 gets the data, and process 0 does not need to do anything. After the second barrier, the current round of RMA ends and the next begins. The code for process 0 is:

int val, buff_size = sizeof(int), disp_unit = 1;
auto win = comm.win_create(&val, buff_size, disp_unit);

for(int i=0; i<n_RMAs; ++i){
    win.lock(win.LOCK_EXCLUSIVE, 0);
    val = i;
    win.unlock(0);
    comm.barrier();
    comm.barrier();
}

In process 1, we create the window but attach no data, because process 1 only accesses the memory of process 0 and does not share any of its own data. Then it repeatedly gets an integer from process 0 (i.e., from the process with rank src_rank=0, at memory offset=0). Each get must be called after the first barrier to ensure that the new value has been set by process 0. The get call must also be protected by a pair of lock and unlock calls, which start and end the access epoch to the remote window, respectively. The value obtained by process 1 is printed in each iteration. The code for process 1 is:

auto win = comm.win_create(NULL, 0, 1);
int val, src_rank = 0, offset = 0;
for(int i=0; i<n_RMAs; ++i){
    comm.barrier();
    win.lock(win.LOCK_EXCLUSIVE, src_rank);
    win.get(src_rank, val, offset);
    win.unlock(src_rank);
    comm.barrier();
    HIPP::pout << "Get ", val, endl;
}

The output of the code is:

Get 0
Get 1
Get 2
Get 3
Get 4
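The ordering that the two barriers enforce in each iteration can be replayed serially. The sketch below is only a model under that assumption: replay_handshake is a hypothetical name, it contains no MPI or HIPP calls, and the lock/unlock pairs appear only as comments.

```cpp
#include <cassert>
#include <vector>

// Serial model of the two-process handshake; not HIPP or MPI code.
// The steps of process 0 (writer) and process 1 (reader) are replayed in the
// only order that the two barriers allow within each iteration.
std::vector<int> replay_handshake(int n_rmas) {
    assert(n_rmas >= 0);
    int window_val = -1;   // stands in for the integer exposed by process 0
    std::vector<int> got;  // values that process 1 would print
    for (int i = 0; i < n_rmas; ++i) {
        // Process 0: lock, write the new value, unlock.
        window_val = i;
        // First barrier: the write above completes before the read below.
        // Process 1: lock, get the value, unlock.
        got.push_back(window_val);
        // Second barrier: the read above completes before the next write.
    }
    return got;
}
```

For n_rmas = 5 the model returns the values 0 through 4, in agreement with the output above. If the second barrier were dropped, process 0 could already write the next value while process 1 is still waiting to read the current one, so this strict alternation would no longer be guaranteed.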