MPI

Message Passing Interface

 

Lectures by: Ana Gainaru

Parallel Programming

 

Architecture

Types of Parallel Computing Models

Implicit Parallelism: Parallelizing Compilers

Explicit Parallelism: Data Parallel | Task parallel | Message Passing (MPI)

  • Data Parallel - the same instructions are carried out simultaneously on multiple data items (SIMD)
  • Task Parallel - different instructions on different data (MIMD)

  • SPMD (single program, multiple data) not synchronized at individual operation level

    • SPMD is equivalent to MIMD, since any MIMD program can be expressed as a collection of SPMD programs

Message passing (MPI) is for MIMD/SPMD parallelism.

The Message Passing Model

  • Consist of separate processes, each with its own address space

    • Programmer manages memory by placing data in a particular process
  • Data is sent explicitly between processes

    • Programmer manages memory motion
  • Multi-process communication (Collective operations)

    • Algorithm (performance) defined by the programming language
    • Programmer selects subset of processes

The Message Passing Model

  • A process is

    • a program counter
    • address space
  • Processes may have multiple threads (program counters and associated stacks) sharing a single address space.

MPI is for communication among processes, which have separate address spaces

  • Interprocess communication consists of
    • Moving data from one process’s address space to another’s
    • Synchronization

In These Lectures



1. Introduction to MPI 
    - Library and implementation characteristics
    - MPI for python
    - Error handling

2. Blocking point-to-point communication
    - Sending and receiving with MPI_Send and MPI_Recv
    - Dynamic receiving with MPI_Probe and MPI_Status
    - Groups and communicators

3. Non-blocking communication

4. Collective communication
    - Synchronization (Barrier)
    - Data movement (Broadcast, Gather, Scatter, Allgather, Alltoall)
    - Collective computation (Reduce, AllReduce, Scan, Exscan, Reduce_scatter)
    - Define your own collective routine
    - Non-Blocking Collective Operations

5. Topology mapping and neighborhood collectives

6. Profiling MPI applications



Programming With MPI

MPICH / OpenMPI

Programming With MPI

  • MPI is a library
    • All operations are performed with routine calls
    • Basic definitions in
      • mpi.h for C
      • MPI or MPI_F08 module for Fortran
    • Implementation
      • MPICH: high-quality reference implementation of the latest MPI standard
      • OpenMPI: common case, both in terms of usage and network conduits
  • MPI for Python
    • mpi4py module using pickle under the hood
      • implements binary protocols for serializing and de-serializing a Python object structure

Launching MPI jobs

mpiexec -np ${RANKS} python ${PATH_TO_SCRIPT}
In [ ]:
from mpi4py import MPI

print("Hello world")

mpiexec -n 4 python script.py

Hello world
Hello world
Hello world
Hello world

MPI Environment

  • Two important issues
    • Initializing all MPI’s global and internal variables
      • e.g. Global communicator that allows all processes to communicate
    • Giving processes enough information to communicate with each other
      • Number of total processes participating in this computation
      • Which process am I among all the processes
  • MPI provides functions for all issues:
    • MPI_Init (or MPI_Init_thread) / MPI_Finalize
    • MPI_Comm_size reports the number of processes
    • MPI_Comm_rank reports the rank, a number between 0 and size-1, identifying the calling process
#include "mpi.h"

int main(int argc, char *argv[])
{
 int rank, size, i, provided;
 float A[10];

 MPI_Init_thread(&argc, &argv, MPI_THREAD_SINGLE, &provided);

 MPI_Comm_size(MPI_COMM_WORLD, &size);
 MPI_Comm_rank(MPI_COMM_WORLD, &rank);

 for (i=0; i<10; i++)
     A[i] = i * rank;

 printf("My rank %d of %d\n", rank, size );

 printf("Here are my values for A\n");
 for (i=0; i<10; i++) printf("%f ", A[i]);
 printf("\n");

 MPI_Finalize();
}
In [ ]:
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size() 

A = [i * rank for i in range(5)]

print("My rank %d of %d" %(rank, size));
print("My values for A: %s" %(A));
My rank 0 of 4
My rank 1 of 4
My rank 2 of 4
My values for A: [0, 0, 0, 0, 0]
My values for A: [0, 2, 4, 6, 8]
My values for A: [0, 1, 2, 3, 4]
My rank 3 of 4
My values for A: [0, 3, 6, 9, 12]

Main MPI characteristics

All MPI programs begin with MPI_Init_thread and end with MPI_Finalize

  • MPI_COMM_WORLD is defined by mpi.h (in C) or the MPI module
    • includes all processes in the MPI “job”
  • Each statement executes independently in each process
    • including the print/printf statements

 

  • I/O to standard output is not part of MPI
    • output order undefined (may be interleaved by character, line, or blocks of characters)

MPI Initialization

  • Programmer can request the level of thread safety required for the program

    • E.g. Needed to use OpenMP with MPI  
  • MPI_THREAD_SINGLE gives the same behavior as MPI_Init

    • one thread will execute
  • MPI_THREAD_FUNNELED The process may be multi-threaded
    • only the main thread will make MPI calls
    • all MPI calls are funneled to the main thread
  • MPI_THREAD_SERIALIZED The process may be multi-threaded, and multiple threads may make MPI calls, but only one at a time
    • MPI calls are not made concurrently from two distinct threads
    • all MPI calls are serialized
  • MPI_THREAD_MULTIPLE Multiple threads may call MPI, with no restrictions.

MPI_Init_thread is used since MPI-2
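
As a rough sketch of how a thread level is requested from mpi4py (the rc options must be set before the MPI submodule is imported; the print is only illustrative):

import mpi4py
mpi4py.rc.thread_level = "funneled"   # ask for MPI_THREAD_FUNNELED

from mpi4py import MPI

provided = MPI.Query_thread()         # thread level actually granted
if provided >= MPI.THREAD_FUNNELED:
    print("main thread may make MPI calls on behalf of the other threads")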

MPI for Python

  • At import time (import mpi4py)

    • the module initializes the MPI execution environment calling MPI_Init_thread()
    • installs an exit hook to automatically call MPI_Finalize() just before the Python process terminates
  • You can set runtime configuration options (e.g. automatic initialization, thread_level) via mpi4py.rc()

  • By default, importing the MPI submodule calls MPI_Init_thread()
    • calling Init() or Init_thread() more than once violates the MPI standard
    • This will lead to a Python exception or an abort in C/C++
    • use Is_initialized() to test for initialization
      • similarly for Finalize: Is_finalized()

Error handling

By default, an error causes all processes to abort.

  • The user can cause routines to return (with an error code) instead.
  • A user can also write and install custom error handlers.

Error handling

  • MPI has error codes and classes

    • MPI routines return error codes
    • Each code belongs to an error class
    • The returned error code is implementation specific
    • The only error code that MPI standard itself defines is MPI_SUCCESS, i.e., no error
    • The meaning of an error code can be extracted by calling function MPI_Error_string.
  • An MPI implementation can use error codes to return instance-specific information on the error

    • MPICH does this, providing more detailed and specific messages
    • There are routines to convert an error code to text and to find the class for a code.

Error handling

 

MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

error_code = MPI_Send(send_buffer, strlen(send_buffer) + 1,
                      MPI_CHAR, addressee, tag, MPI_COMM_WORLD);

if (error_code != MPI_SUCCESS){
    char error_string[BUFSIZ];
    int length_of_error_string;

    MPI_Error_string(error_code, error_string, &length_of_error_string);

    fprintf(stderr, "%3d: %s\n", my_rank, error_string);
    send_error = TRUE;
}

MPI for python

  • mpi4py overrides the default MPI.ERRORS_ARE_FATAL error handler in favor of MPI.ERRORS_RETURN

    • which allows translating MPI errors into Python exceptions.
  • MPI errors, surfaced as Python exceptions, can be handled with the common try…except…finally clauses

    • Unhandled MPI exceptions will print a traceback, which helps in locating problems in the source code.
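
A minimal sketch of catching an MPI error as a Python exception (the deliberately invalid destination rank is only there to trigger the error):

from mpi4py import MPI

comm = MPI.COMM_WORLD
try:
    # sending to an out-of-range rank triggers an MPI error
    comm.send("data", dest=comm.Get_size(), tag=0)
except MPI.Exception as e:
    print("MPI error class %d: %s" % (e.Get_error_class(), e.Get_error_string()))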

 

Automatic MPI finalization combined with unhandled exceptions can lead to deadlocks.

MPI for python

Automatic MPI finalization + unhandled exceptions can lead to deadlocks.

Example

In [ ]:
from mpi4py import MPI

assert MPI.COMM_WORLD.Get_size() > 1
rank = MPI.COMM_WORLD.Get_rank()

if rank == 0:
    1/0
    MPI.COMM_WORLD.send(None, dest=1, tag=42)
elif rank == 1:
    MPI.COMM_WORLD.recv(source=0, tag=42)

MPI for python error handling

  • Handle exceptions in your code manually

OR

  • Run Python code passing -m mpi4py in the command line
    • In case of unhandled exceptions, the finalizer hook will call MPI_Abort() on the MPI_COMM_WORLD communicator
      • This will abort the MPI execution environment.

 

mpiexec -n ${NUM_PROCS} python -m mpi4py ${PYTHON_FILE} [${ARG}]

MPI Basic Send/Receive

Send Receive

  1. Identify the process you want to communicate with
  2. Describe the data
  3. Recognize messages sent to you by other processes
  4. Know when the send/receive operations are complete

1. Identify the process to communicate with

  • Processes can be collected into groups.
  • Each message is sent in a context, and must be received in the same context.

A group and context together form a communicator (*)

(*) the context (or ID) that differentiates one communicator from another and the group of processes contained by the communicator

  • A process is identified by its rank in the group associated with a communicator.
    • MPI_COMM_WORLD is enough for most applications

 

MPI_Comm_split(
    MPI_Comm comm,
    int color,             // for split
    int key,               // for ordering
    MPI_Comm* newcomm)
In [ ]:
# Get the rank and size in the original communicator
from mpi4py import MPI

world_rank = MPI.COMM_WORLD.Get_rank()
world_size = MPI.COMM_WORLD.Get_size()

# Determine color based on even/odd ranks
color = world_rank % 2; 

# Split the communicator based on the color and use the
# original rank for ordering
new_comm = MPI.COMM_WORLD.Split(color, world_rank)

new_rank = new_comm.Get_rank()
new_size = new_comm.Get_size()

print("World rank/size: %d/%d -- New rank/size: %d/%d" %(
            world_rank, world_size, new_rank, new_size));

new_comm.Free()
World rank/size: 0/4 -- New rank/size: 0/2
World rank/size: 1/4 -- New rank/size: 0/2
World rank/size: 2/4 -- New rank/size: 1/2
World rank/size: 3/4 -- New rank/size: 1/2

2. Describe the data

The message buffer is described by (start, count, datatype)

  • Datatype can be:

    • Predefined corresponding to a data type from the language (MPI_CHAR, MPI_INT, MPI_DOUBLE, etc)
    • a contiguous array of MPI datatypes
    • a strided block of datatypes
    • an indexed array of blocks of datatypes
    • an arbitrary structure of datatypes
  • There are MPI functions to construct custom datatypes

    • e.g. an array of (int, float) pairs, or a row of a matrix stored columnwise

mpi4py uses protocols for packing and unpacking a Python object structure

Why datatypes?

  • Since all data is labeled by type, an MPI implementation can support communication between processes on machines with very different memory representations and lengths of elementary datatypes (heterogeneous communication).

 

  • Specifying application-oriented layout of data in memory
    • can reduce memory-to-memory copies in the implementation
    • allows the use of special hardware (scatter/gather) when available
  • Specifying application-oriented layout of data on a file
    • can reduce system calls and physical disk I/O

How to create a new datatype

  • Create an MPI datatype from a general set of datatypes, displacements, and block sizes
int MPI_Type_create_struct(
  int count,
  int array_of_blocklengths[],
  MPI_Aint array_of_displacements[],
  MPI_Datatype array_of_types[],
  MPI_Datatype *newtype
);
In [ ]:
def definetype():
    dtypes = [MPI.INT, MPI.DOUBLE]
    dt_size = MPI.INT.size + MPI.DOUBLE.size
    displ = [0, MPI.INT.size]

    new_dtype = MPI.Datatype.Create_struct([1, 1], displ, dtypes)
    new_dtype = new_dtype.Create_resized(0, dt_size)
    new_dtype.Commit()
    return new_dtype

3. Recognize messages sent to you by other processes

  • Messages are sent with an accompanying user-defined integer tag

    • Used to assist the receiving process in identifying the message.
    • Messages can be screened at the receiving end by specifying a specific tag, or
    • not screened by specifying MPI_ANY_TAG as the tag in a receive.
  • Some non-MPI message-passing systems have called tags “message types”.

    • MPI calls them tags to avoid confusion with datatypes.
In [ ]:
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    sdata = {'a': 7, 'b': 3.14}
    comm.send(sdata, dest=1, tag=11)
elif rank == 1:
    rdata = comm.recv(source=0, tag=11)
    print(rdata)
{'a': 7, 'b': 3.14}

4. Know when the send/receive operations are complete

 

Blocking / Non-blocking

  • The message-passing approach makes the exchange of data cooperative.
  • Data is explicitly sent by one process and received by another.
  • An advantage is that any change in the receiving process’s memory is made with the receiver’s explicit participation.
  • Communication and synchronization are combined.

One sided communication

  • One-sided operations between processes include remote memory reads and writes
  • Only one process needs to explicitly participate.
  • An advantage is that communication and synchronization are decoupled

MPI Basic (Blocking) Send

 

MPI_SEND (start, count, datatype, dest, tag, comm)

  • The message buffer is described by (start, count, datatype).
  • The target process is specified by dest, which is the rank of the target process in the communicator specified by comm.
  • When this function returns, the data has been delivered to the system and the buffer can be reused
    • The message may not have been received by the target process.

MPI Basic (Blocking) Receive

 

MPI_RECV(start, count, datatype, source, tag, comm, status)

  • Waits until a matching (on source and tag) message is received from the system, and the buffer can be used.
  • source is rank in communicator specified by comm, or MPI_ANY_SOURCE.
  • status contains further information
  • Receiving fewer than count occurrences of datatype is OK, but receiving more is an error.
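
A minimal sketch of a receive that uses wildcards and then reads the extra information from the status object (assumes at least two ranks; the payload and tag are illustrative):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    status = MPI.Status()
    data = comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
    print("got %s from rank %d with tag %d"
          % (data, status.Get_source(), status.Get_tag()))
elif rank == 1:
    comm.send([1, 2, 3], dest=0, tag=7)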

Blocking Send/Receive Summary

Send Receive

Blocking Send/Receive Summary

Send Receive

...

if(rank == 0){
    // generate data
    MPI_Send(&sdata, 1, MPI_INT, 1, 100,
             MPI_COMM_WORLD);
} 
if(rank == 1){
    MPI_Recv(&rdata, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, 
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    // do something with data
}

Send array of (int32, float64) tuples

In [ ]:
def definetype():
    dtypes = [MPI.INT, MPI.DOUBLE]
    dt_size = MPI.INT.size + MPI.DOUBLE.size
    displ = [0, MPI.INT.size]

    new_dtype = MPI.Datatype.Create_struct([1, 1], displ, dtypes)
    new_dtype = new_dtype.Create_resized(0, dt_size)
    new_dtype.Commit()
    return new_dtype

def fill_data(data, num):
    for i in range(num):
        data[i][0] = i + 1
        data[i][1] = 3.14 * (i + 1)

mytype = definetype()
data = np.zeros(2, dtype="int32, double")
if rank == 0:
    comm.Recv([data, mytype], source=1, tag=0)
    print(data)
elif rank == 1:
    fill_data(data, 2)
    comm.Send([data, mytype], dest=0, tag=0)
[(1, 3.14) (2, 6.28)]

Non-blocking communication

Blocking vs nonblocking

Why Non-blocking?

  1. Overlapping communication and computation

    • It can increase performance of some applications (not always)
  2. Avoiding deadlocks

    • Related to message ordering
    • Due to completion depending on size of message and amount of system buffering

Send Receive

  • Non blocking communication
    • Send/Recv routines return before transfer is complete
    • and we wait later.

Deadlock example

  • Send a large message from process 0 to process 1

    • If there is insufficient storage at the destination, the send must wait for the user to provide the memory space (through a receive)
  • What happens with this code?

Deadlock

  • This is called unsafe because it depends on the availability of system buffers

Deadlock solution

  • Order the operations more carefully
  • Supply receive buffer at same time as send (MPI_Sendrecv)
  • Supply own buffer space (MPI_Bsend)
  • Use non-blocking operations
    • Safe, but
    • not guaranteed to be asynchronous or concurrent
    • not always faster

MPI Non-blocking operations

  • Return immediately
    • give request handles that can be tested and waited on for completion.


MPI_Request request;

MPI_Isend(start, count, datatype, dest, tag, comm, &request);

MPI_Irecv(start, count, datatype, dest, tag, comm, &request);

MPI_Wait(&request, &status);
  • One can also test without waiting:

MPI_Test(&request, &flag, &status); 

MPI for python ping pong example

  • Using numpy arrays
In [ ]:
from mpi4py import MPI
import numpy

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
assert comm.size == 2

sdata = numpy.array([(rank+1) * i for i in range(10)], dtype='i')  # dtype matches MPI.INT
rdata = numpy.zeros(10, dtype='i')

if comm.rank == 0:
    target = 1
else:
    target = 0

request = comm.Isend([sdata, MPI.INT], dest=target, tag=10)
comm.Recv([rdata, MPI.INT], source=target, tag=MPI.ANY_TAG)
request.Wait()

print("rank: %d received data %s" %(rank, rdata))
rank 1: received data [0 1 2 3 4 5 6 7 8 9]
rank 0: received data [0 2 4 6 8 10 12 14 16 18]

Multiple Completions

  • Waiting on multiple requests:

MPI_Waitall(count, array_of_requests, array_of_statuses);

MPI_Waitany(count, array_of_requests, &index, &status);

MPI_Waitsome(incount, array_of_requests,  &outcount, array_of_indices,
             array_of_statuses);
  • There are corresponding versions of test for each of these.
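
A sketch of completing several requests at once with mpi4py (a ring exchange; buffer sizes and the tag are illustrative):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

sdata = np.arange(10, dtype='i') * rank
rdata = np.zeros(10, dtype='i')

# send to the next rank, receive from the previous one
requests = [
    comm.Isend([sdata, MPI.INT], dest=(rank + 1) % size, tag=0),
    comm.Irecv([rdata, MPI.INT], source=(rank - 1) % size, tag=0),
]

MPI.Request.Waitall(requests)            # block until both complete
# index = MPI.Request.Waitany(requests)  # or wait for any single request
# flag  = MPI.Request.Testall(requests)  # or test without blocking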

Modes for sending messages:

  • Synchronous mode (MPI_Ssend)

    • The send does not complete until a matching receive has begun.
    • Unsafe programs deadlock.
  • Buffered mode (MPI_Bsend)

    • The user supplies a buffer to the system for its use.
    • User can allocate enough memory to make an unsafe program safe.
  • Ready mode (MPI_Rsend)

    • User guarantees that a matching receive has been posted.
    • Allows access to fast protocols
    • Undefined behavior if matching receive not posted
  • Non-blocking versions for all operations (MPI_Issend, etc.)

MPI_Recv receives messages sent in any mode.
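
For illustration, a sketch of the synchronous mode in mpi4py (assumes at least two ranks): ssend only returns once the matching receive has started, so it exposes unsafe send/receive orderings.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.ssend({'payload': 42}, dest=1, tag=0)  # blocks until rank 1 posts the receive
elif rank == 1:
    print(comm.recv(source=0, tag=0))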

MPI_Sendrecv

  • Simultaneous send and receive

    • Send and receive datatypes (even type signatures) may be different.
    • Can use Sendrecv with plain Send or Recv (or Irecv or Ssend_init, ...)
  • Intended for sending data in a "chain"

    • For instance, 0 sends to 1 while 1 sends to 2 while 2 sends to 0.
int MPI_Sendrecv(
  void *sendbuf,
  int sendcount,
  MPI_Datatype sendtype,
  int dest,
  int sendtag,
  void *recvbuf,
  int recvcount,
  MPI_Datatype recvtype,
  int source,
  int recvtag,
  MPI_Comm comm,
  MPI_Status *status
);

Both send and receive must be successful for the operation to succeed.

MPI_Sendrecv example

In [ ]:
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

to = (rank + 1) % size
fr = (rank - 1) % size

sdata = rank
rdata = comm.sendrecv(sdata, dest=to, source=fr)

print("rank %d got data from %d" %(rank, rdata))
rank 0 got data from 3
rank 1 got data from 0
rank 2 got data from 1
rank 3 got data from 2

Collective communication

  • All communication in MPI is within a group of processes
    • Collective communication is over all of the processes in that group

Collective patterns

  • Most algorithms are log(P)
  • They classify in 3 major communication patterns
    • Scatter, Gather, Reduce
    • Barrier, AllReduce, Allgather, Alltoall
    • Scan, Exscan

Collective communication

  • From the point of view of functionalities, there are three classes of operations
    • Synchronization: Barrier
    • Data movement: Broadcast, Gather, Scatter, Allgather, Alltoall
    • Collective computation: Reduce, AllReduce, Scan, Exscan, Reduce_scatter

Synchronization

MPI_Barrier( comm )

  • Blocks until all processes in the group of the communicator comm call it.
  • Programmers prefer not to use it in a parallel program
    • Occasionally useful in measuring performance and load balancing
    • It could increase performance by reducing network contention
In [ ]:
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
print("rank %d: before barrier" %(rank))

MPI.COMM_WORLD.barrier()

print("rank %d: after barrier" %(rank))
rank 0: before barrier
rank 1: before barrier
rank 2: before barrier
rank 3: before barrier
rank 3: after barrier
rank 1: after barrier
rank 2: after barrier
rank 0: after barrier

Collective Data Movement

  • One to all
    • Broadcast
    • Scatter (personalized)
  • All to one
    • Gather
  • All to all
    • Allgather
    • Alltoall (personalized)

All collective operations must be called by all processes in the communicator

Each process can get different data with the {function}v variant of each method

Collective communication semantics

  • Collective routines on the same communicator must be called in the same order on all participating processes

  • For multi-threaded processes (MPI_THREAD_MULTIPLE)

    • it is the user's responsibility to ensure the ordering
  • Message tags are not used

Use different communicators to separate collective operations on the same process

Collective Data Movement

Collectives

MPI.COMM_WORLD.bcast(data, root=proc_id)
MPI.COMM_WORLD.gather(data, root=proc_id)
MPI.COMM_WORLD.scatter(data, root=proc_id)

# for numpy arrays
MPI.COMM_WORLD.Scatter(sendbuf, recvbuf, root=0)
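
A minimal runnable sketch of the buffer-based Scatter/Gather pair with numpy arrays (array shapes and dtypes are illustrative):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

sendbuf = None
if rank == 0:
    sendbuf = np.arange(size * 3, dtype='i').reshape(size, 3)  # one row per rank
recvbuf = np.zeros(3, dtype='i')

comm.Scatter(sendbuf, recvbuf, root=0)        # each rank gets one row
local_sum = np.array([recvbuf.sum()], dtype='i')

sums = np.zeros(size, dtype='i') if rank == 0 else None
comm.Gather(local_sum, sums, root=0)          # collect one value per rank
if rank == 0:
    print("per-rank sums: %s" % sums)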

Collective Data Movement

Collectives

# sendobj is iterable
MPI.COMM_WORLD.allgather(sendobj)
MPI.COMM_WORLD.alltoall(sendobj)

# for numpy arrays
MPI.COMM_WORLD.Allgather([x,  MPI.DOUBLE], [xg, MPI.DOUBLE])

Collective Data Movement - Notes

  • MPI_Allgather is equivalent to

    • MPI_Gather followed by MPI_Bcast
    • Algorithms for MPI_Allgather are usually faster (try on ACCRE)
  • MPI_Alltoall

    • Performs a “transpose” of the data
    • Tricky to implement efficiently
      • Implementations have a hard time scaling to the size of today's supercomputers
      • For large messages point to point communication is preferable

Variants

  • The basic routines send the same amount of data from each process

MPI_Scatter(sbuf, 1, MPI_INT, rbuf, 1, MPI_INT, root, comm);

(*) sends one integer to each process
  • Sending a different number of items to each process
    • Example: MPI_Scatterv
    • Routines with “v” (for vector) at the end
      • allow to specify a different number of elements for each destination/source

 

Efficient algorithms exist for these cases (except alltoall), but they are not as fast as the simpler, basic routines

 

Variant alltoallw:

  • Allows different datatypes, counts, and displacements for each partner
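
As an illustration of the "v" variants, a minimal Scatterv sketch with numpy buffers in which rank i receives i+1 elements (the counts and displacements are illustrative):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

counts = [i + 1 for i in range(size)]            # elements per destination
displs = [sum(counts[:i]) for i in range(size)]  # offsets into the send buffer

sendbuf = None
if rank == 0:
    sendbuf = np.arange(sum(counts), dtype='i')
recvbuf = np.zeros(counts[rank], dtype='i')

comm.Scatterv([sendbuf, counts, displs, MPI.INT], recvbuf, root=0)
print("rank %d got %s" % (rank, recvbuf))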

Collective Computation

  • Combines communication with computation
    • Reduce
      • All to one, with an operation to combine
    • Scan, Exscan
      • All prior ranks to all, with combination
    • Reduce_scatter
      • All to all, with combination

Combination operations

  • Can be either
    • Predefined operations
    • User defined operations

Reduction

Reduce

 
 

MPI reduction operations include:
MPI.MAX, MPI.MIN
MPI.SUM, MPI.PROD
MPI.LAND, MPI.BAND
MPI.LOR, MPI.BOR
MPI.LXOR, MPI.BXOR
MPI.MAXLOC, MPI.MINLOC
MPI.COMM_WORLD.reduce(sendobj, op=MPI.SUM, root=0)
MPI.COMM_WORLD.allreduce(sendobj, op=MPI.SUM)
  • reduce is similar to gather, but the gathered values are combined with the given operation
  • allreduce is similar to allgather: every process receives the combined result
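
A small sketch contrasting the two (the object-based reduce returns the result only at the root, while the buffer-based Allreduce fills a result buffer on every rank):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

total = comm.reduce(rank, op=MPI.SUM, root=0)    # None on non-root ranks

local = np.array([rank, 2 * rank], dtype='i')
glob = np.zeros_like(local)
comm.Allreduce(local, glob, op=MPI.SUM)          # every rank gets the sums

print("rank %d: reduce=%s allreduce=%s" % (rank, total, glob))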

Collective computation

Collectives

Allreduce, Exscan, Reduce, Reduce_scatter, and Scan take both built-in and user-defined combiner functions.

All collectives

  • All versions deliver results to all participating processes
  • Most routines accept both intra- and intercommunicators
    • Intercommunicator versions are collective between two groups of processes

Define your own collective routine

MPI_Op_create(user_fcn, commutes, &op);
MPI_Op_free(&op);

user_fcn(invec, inoutvec, len, datatype);
  • Depending on the value of commute

    • True: the operation should be both commutative and associative
    • False: the order of operands is fixed
      • defined to be in ascending, process rank order, beginning with process zero
  • The user function should perform:

    for(i=0; i<len; i++)
      inoutvec[i] = invec[i] op inoutvec[i];
    

The operation op is always assumed to be associative. The user function can be non-commutative. Example?

Define your own collective example

Product of complex number

user_fcn(invec, inoutvec, len, datatype);

python:

user_fcn(inelem1, inelem2, outelem)
In [ ]:
from Complex import Complex
# Complex class defined in a separate file

def myprod(a, b, c):
    c = Complex()
    c.real = a.real * b.real - a.img * b.img
    c.img = a.img * b.real + a.real * b.img
    return c

Define your own collective example

Product of complex number

In [ ]:
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

a = Complex(rank + 1, rank)

print("Rank %d: Data %s" %(rank, a))

myop = MPI.Op.Create(myprod, True) # commute is True
b = comm.reduce(a, op=myop, root=0)

if rank==0:
    print("Produc: %s" %(b))
Rank 0: Data (1.0  0.0)
Rank 1: Data (2.0  1.0)
Rank 3: Data (4.0  3.0)
Rank 2: Data (3.0  2.0)
Product: (-5.0 40.0)

For C version https://www.mpi-forum.org/docs/mpi-1.1/mpi-11-html/node80.html

Define your own collective example

  • Numpy arrays are trickier
    • Programmer needs to manually pack/unpack data
In [ ]:
def unpack_array(array_mem, dt):
    a = array_mem.tobytes()
    itemsize = dt.Get_size()
    array_len = int(array_mem.shape[0])
    unpack_array = []
    for i in range(int(array_len/itemsize)):
        unpack_array.append(int(np.frombuffer(a[itemsize*i:itemsize*(i+1)],
                            dtype=np.int64)))
    return np.array(unpack_array)

def myprod(xmem, ymem, dt):
    a = unpack_array(xmem, dt)
    b = unpack_array(ymem, dt)

    c = np.array([0] * len(a))
    for i in range(len(a)):
        c[i] = a[i] % 2 | b[i] % 2

    ymem[:] = memoryview(bytearray(c))[:]

Define your own collective example

  • Numpy arrays are trickier
    • Programmer needs to manually pack/unpack data
In [ ]:
a = np.array([rank * (i + 1) for i in range(3)])
b = np.array([0] * 3)
print("Rank %d: Data %s" %(rank, a))

myop = MPI.Op.Create(myprod, True) # commute is True
comm.Reduce([a, MPI.DOUBLE], [b, MPI.DOUBLE], op=myop, root=0)

if rank==0:
    print("Contains odd numbers: %s" %(b))
Rank 0: Data [0 0 0]
Rank 2: Data [2 4 6]
Rank 1: Data [1 2 3]
Rank 3: Data [3 6 9]
Contains odd numbers: [1 0 1]
Another example in the unit test code for mpi4py: https://github.com/mpi4py/mpi4py/blob/master/test/test_op.py

Non-Blocking Collective Operations

  • MPI-3 introduced nonblocking versions of collective operations

    • All return an MPI_Request, use the usual MPI_Wait, MPI_Test, etc. to complete.
    • May be mixed with point-to-point and other MPI_Requests
    • Not necessarily faster
      • Follow same ordering rules as blocking operations
  • MPI_Ibarrier

    • Useful for distributed termination detection
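
A sketch of a non-blocking Allreduce with mpi4py (requires an MPI-3 implementation; the overlap region is only indicated by a comment):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = np.array([rank], dtype='i')
total = np.zeros(1, dtype='i')

req = comm.Iallreduce(local, total, op=MPI.SUM)  # returns immediately

# ... independent local computation can overlap with the collective here ...

req.Wait()                                       # complete the collective
print("rank %d: sum of ranks = %d" % (rank, total[0]))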

Topologies and Neighborhood

Topology mapping

(*) Image from: Advanced MPI 2.2 and 3.0 Tutorial by Torsten Hoefler

Topologies

  • A virtual topology represents the way that MPI processes communicate

    • Nearest neighbor exchange in a mesh
    • Recursive doubling in an all-to-all exchange
  • A physical topology represents the connections between the cores, chips, and nodes in the hardware

Mapping of the virtual topology onto the physical topology.
Think of nodes containing chips containing cores - not trivial

Topology mapping basics

  • Allocation mapping
  • Rank reordering

Topology Mapping Basics

Allocation mapping

  • Up-front specification of communication pattern
  • Batch system picks good set of nodes for given topology
  • Properties:
    • Not supported by current batch systems
    • Either predefined allocation, random allocation, or “global bandwidth maximization”
    • Not always possible to specify communication pattern upfront

Topology Mapping Basics

Rank reordering

  • Change numbering in a given allocation to reduce congestion
  • Sometimes automatic
  • Properties:
    • Always possible, but effect may be limited (e.g., in a bad allocation)
    • Allows manual shuffling through MPI process topologies
      • Network topology is not exposed
    • Data has to be manually shuffled after remapping

MPI’s Topology Routines

  • MPI provides routines to create new communicators that order the process ranks

  • Two types of virtual topology supported:

    • Cartesian (regular mesh)
    • Graph (several ways to define in MPI)
      • The routines evolved with each version of MPI
  • Additional routines provide access to the defined virtual topology

  • (Virtual) topologies are properties of a communicator
    • Topology routines all create a new communicator with properties of the specified virtual topology

MPI_Cart_create

  • Creates a new communicator newcomm from oldcomm
    • representing an ndim dimensional mesh with sizes dims

 

MPI_Cart_create(MPI_Comm oldcomm, int ndim, int dims[],
                int qperiodic[], int qreorder, MPI_Comm *newcomm)
  • The mesh is periodic in coordinate direction i if qperiodic[i] is true (torus).

  • The ranks in the new communicator are reordered

    • to better match the physical topology
    • if qreorder is true

 

Some processes may return MPI_COMM_NULL.
The product of the dims entries must be <= P

MPI_Dims_create

  • Fill in the dims array such that the product of dims[i] for i=0 to ndim-1 equals nnodes.

 

MPI_Dims_create(int nnodes, int ndim, int dims[])
  • Any value of dims[i] that is 0 on input will be replaced

    • values that are > 0 will not be changed
  • Makes life easier

    • Some problems might work better with non-square layouts

Example

3D torus

int p;
int dims[3] = {0, 0, 0};

MPI_Comm_size(MPI_COMM_WORLD, &p);
MPI_Dims_create(p, 3, dims);

int periods[3] = {1, 1, 1};

MPI_Comm cart_comm;
MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart_comm);
  • Reorder argument allows for topology mapping
    • Each calling process may have a new rank in the created communicator
    • Data has to be remapped manually

Example python

In [ ]:
from mpi4py import MPI

comm = MPI.COMM_WORLD
world_rank = comm.Get_rank()
size = comm.Get_size()

dims = MPI.Compute_dims(size, [0]*2)
cart_comm = comm.Create_cart(
        dims, periods=[True,True], reorder=True)

new_rank = cart_comm.Get_rank()

print("World rank: %d -- Cart rank: %d" %(
    world_rank, new_rank));

cart_comm.Free()
World rank: 2 -- Cart rank: 2
World rank: 1 -- Cart rank: 1
World rank: 3 -- Cart rank: 3
World rank: 0 -- Cart rank: 0

Cartesian Query Functions

  • Determine neighbor ranks
    • Can be computed from rank (in the cart_comm), dims, and periods
      • Ordering defined in MPI
      • See Section 7.5 in MPI-3 Standard
  • MPI_Cartdim_get()
    • Gets dimensions of a Cartesian communicator
  • MPI_Cart_get()
    • Gets size of dimensions
  • MPI_Cart_rank()
    • Translate coordinates to rank
  • MPI_Cart_coords()
    • Translate rank to coordinates
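
A sketch of these queries through mpi4py on a freshly created Cartesian communicator (the 2D mesh is illustrative):

from mpi4py import MPI

comm = MPI.COMM_WORLD
dims = MPI.Compute_dims(comm.Get_size(), [0, 0])
cart_comm = comm.Create_cart(dims, periods=[True, True], reorder=True)

ndims = cart_comm.Get_dim()                          # MPI_Cartdim_get
dims, periods, my_coords = cart_comm.Get_topo()      # MPI_Cart_get
coords = cart_comm.Get_coords(cart_comm.Get_rank())  # MPI_Cart_coords
same_rank = cart_comm.Get_cart_rank(coords)          # MPI_Cart_rank

print("rank %d sits at coordinates %s of a %s mesh" % (same_rank, coords, dims))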

MPI_Cart_shift

  • Returns the ranks of the processes that are shifted by disp steps in the given coordinate direction

 

MPI_Cart_shift(MPI_Comm comm, int direction, int disp,
               int *rank_source, int *rank_dest)
  • Useful for nearest neighbor communication in the coordinate directions
    • Dimensions are numbered from 0 to ndims-1
    • Displacement indicates neighbor distance (-1, 1, 2)
    • May return MPI_PROC_NULL
      • The source or destination for the shift is out of range

Example (8 processes)

In [ ]:
ndim = 2

dims = MPI.Compute_dims(size, [0]*ndim)
cart_comm = comm.Create_cart(dims, periods=[True,True], reorder=True)

new_rank = cart_comm.Get_rank()
if new_rank==0:
    print("Cartesian dim: %s" %(dims))

for i in range(ndim):
    for d in (-1, +1):
        source, dest = cart_comm.Shift(i, d)
        if new_rank == 0:
            print("Dir %d, disp %d - Src %d - Dest %d" %(i, d, source, dest));

MPI_shift

Cartesian dim: [4, 2]
Dir 0, disp -1 - Src 2 - Dest 6
Dir 0, disp 1 - Src 6 - Dest 2
Dir 1, disp -1 - Src 1 - Dest 1
Dir 1, disp 1 - Src 1 - Dest 1

Advanced topology concepts

Graph Topology

  • MPI provides routines to specify a general graph virtual topology
    • Graph vertices represent MPI processes (usually one per process)
    • Graph edges indicate important connections

Collectives on topologies

  • Since MPI-3
    • Collective communications only cover some patterns
      • E.g., no stencil pattern
    • Example: MPI_Neighbor_allgather
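
A sketch of MPI_Neighbor_allgather via mpi4py on a 2D periodic Cartesian communicator (requires MPI-3; each process simply exchanges its rank with its Cartesian neighbors):

from mpi4py import MPI

comm = MPI.COMM_WORLD
dims = MPI.Compute_dims(comm.Get_size(), [0, 0])
cart_comm = comm.Create_cart(dims, periods=[True, True], reorder=False)

# one value is received from each Cartesian neighbor
neighbor_ranks = cart_comm.neighbor_allgather(cart_comm.Get_rank())
print("rank %d neighbors: %s" % (cart_comm.Get_rank(), neighbor_ranks))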

Reading

Generic Topology Mapping Strategies for Large-scale Parallel Architectures
Hoefler and Snir
http://dx.doi.org/10.1145/1995896.1995909

Profiling

Profiling interface: PMPI routines

Determinism in MPI

  • In exact arithmetic, you always get the same results

    • but roundoff and truncation errors can occur with floating-point arithmetic
  • Message-passing programming models are by default nondeterministic

    • The arrival order of messages sent from two processes, A and B, to a third process, C, is not defined.
    • MPI does guarantee that two messages sent from one process, A, to another process, B, will arrive in the order sent.
    • It is the programmer's responsibility to ensure that a computation is deterministic when this is required.
  • How can we "ensure" determinism?

  1. Use source and tag (not MPI_ANY_SOURCE or MPI_ANY_TAG)
  2. Use synchronization

How Deterministic are Collective Computations?

  • Depending on the process-to-hardware mapping

    • Messages can arrive in different order to processes
    • Round-off error may cause slight differences
  • Reduction results might be different for non-associative operations

    • Even if they are commutative
  • Allreduce does guarantee that the same value is received by all processes for each call

  • Why didn’t MPI mandate determinism?

    • Not all applications need it
    • Implementations of collective algorithms can use “deferred synchronization” ideas to provide better performance

Timing MPI Programs

  • The elapsed (wall-clock) time between two points in an MPI program can be computed using MPI_Wtime:
double t1, t2;
t1 = MPI_Wtime();
...
t2 = MPI_Wtime();
printf("time is %f\n", t2 - t1);
  • The value returned by a single call to MPI_Wtime has little value.
  • The resolution of the timer is returned by MPI_Wtick
  • Times in general are local, but an implementation might offer synchronized times.
    • For advanced users: see the MPI attribute MPI_WTIME_IS_GLOBAL.

Example Bcast time

In [ ]:
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

data = np.array([rank * i for i in range(10)])

local_wt = MPI.Wtime();
comm.Bcast(data, root=0);
local_wt = MPI.Wtime() - local_wt;

print("Bcast time %f" %(wt))
  • Why is this not accurate?
  • What is the output?
Bcast time 0.003616
Bcast tiBcast time 0.005131
me 0.0020Bcast time 0.004574
81

Example Bcast time

In [ ]:
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

data = np.array([rank * i for i in range(10)])

comm.barrier();

local_wt = MPI.Wtime();
comm.Bcast(data, root=0);
local_wt = MPI.Wtime() - local_wt;

wt = comm.reduce(local_wt, op=MPI.MAX, root=0);

if rank == 0:
    print("Bcast time %f" %(wt))
Bcast time 0.004041

How accurate is this measurement?

PMPI routines

  • PMPI allows selective interception of MPI routines at link time
    • No need to recompile (for C and Fortran)
    • Interposition library contains wrapper functions

PMPI

  • Program is linked to interposition library
    • All functions are available under two names: MPI_xxx and PMPI_xxx
      • MPI_xxx symbols are weak and can be overridden
    • Measurement code in the interposition library runs around each intercepted call before forwarding it to PMPI_xxx

Not all MPI functions need to be instrumented

PMPI example (C/C++)

  • Count number of calls to the Send routine for each rank
In [ ]:
#include <stdio.h>
#include "mpi.h"

static int numsend = 0;

int MPI_Send(void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm) {
    numsend++;
    return PMPI_Send(buf, count, type, dest, tag, comm);
}

int MPI_Finalize() {
    int me;
    PMPI_Comm_rank(MPI_COMM_WORLD, &me);
    printf("%d sent %d messages.\n", me, numsend);
    return PMPI_Finalize();
}

Writing PMPI wrappers

  • Use rank in MPI_COMM_WORLD as process identifier
    • Data volume: #elements * PMPI_Type_size(type)
  • MPI does not provide a message ID
    • Custom functions to record complete traffic
      • if you need to match send and recv events
  • Wildcard message source and tag
    • Record the actual values from the MPI_Status returned by the receive

Profiling tools

Tracing

  • Recording of individual time-stamped program events, as opposed to aggregated information
    • Entering and leaving a function
    • Sending and receiving a message

Vampir

Intel Trace Analyzer: https://researchcomputing.princeton.edu/faq/using-intel-trace-analyze

VAMPIR: https://vampir.eu/

Profiling

ParaProf: http://www.cs.uoregon.edu/research/tau/vihps14/ParaProf.pdf

cProfile module in Python

Code

All the code from these slides can be downloaded from https://github.com/anagainaru/MPI_Lectures

  • Scalability examples
    • Computing pi
    • Merge sort
    • Vector matrix multiplication
    • You are encouraged to run the examples on ACCRE

 

For questions: ana.gainaru@vanderbilt.edu

Useful links

mpi4py

MPI

Advanced MPI

For questions: ana.gainaru@vanderbilt.edu

Examples

Example 1: Calculating Pi

calculating pi

Monte Carlo simulation

pi

calculating pi

Image downloaded from: https://commons.wikimedia.org/wiki/File:Pi_30K.gif

Example 1: Calculating Pi

pi

Calculating pi via numerical integration

  • Divide interval up into subintervals
  • Assign subintervals to processes
  • Each process calculates partial sum
  • Add all the partial sums

Code

For dividing the interval in n slices

In [ ]:
# setup assumed in compute_pi.py: n (the number of slices) is read from the command line
from mpi4py import MPI
import numpy as np
import sys

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
n = int(sys.argv[1])

w = 1.0 / n
mypi = 0.0

for i in range(rank + 1, n + 1, size):
    mypi += w * np.sqrt(1 - np.power(i / n, 2))

pi = comm.reduce(mypi, op=MPI.SUM, root=0)

if rank==0:
    print("PI: %f" %(4 * pi))
ana@webserver:~/mpi_lecture$ mpiexec -n 2 python3 compute_pi.py 10
PI: 2.904518
ana@webserver:~/mpi_lecture$ mpiexec -n 2 python3 compute_pi.py 100
PI: 3.120417
ana@webserver:~/mpi_lecture$ mpiexec -n 2 python3 compute_pi.py 1000
PI: 3.139555
ana@webserver:~/mpi_lecture$ mpiexec -n 2 python3 compute_pi.py 10000
PI: 3.141391

Performance

  • Strong scaling
    • Keep the problem size the same
    • Increase the number of processes

Example: 6.4 mil slices (2 to 64 processes)

  • Weak scaling
    • Increase the problem size proportional with the number of processes

Example: keep 100k slices per processes

Performance

pi performance

  • Running on Knight's Landing - Intel(R) Xeon Phi(TM) CPU 7230 @ 1.30GHz
    • 64 cores, 4 threads per core
    • 6.4 million slices

Example 2: Merge Sort

 

merge sort

Divide an unsorted list into sublists

  • Until each sublist contains only one element
  • One element sublists are merged together
  • Stop when only one sublist remains

Parallel Merge Sort

  • Hybrid strategy (not necessarily the best)
    • The sublists are sorted by a sequential algorithm
    • The merging of sublists is done in parallel between processes

 

The number of processes is a power of two so that all processes are doing roughly the same amount of work

 

TODO at home: change the code to work with non-power of two processes

Part 1: Divide list into unsorted sublists

 

merge sort

  • The list is scattered to all of the processes
    • Each process has an equal chunk of the list

Figure shows 4 processes and a list containing 8 integers

Part 2: Sort local sublists

merge sort

  • Sequential sort

merge sort

Part 3: Merge sublists

merge sort

  • Send data
  • Merge locally

Part 1: Divide list into unsorted sublists

In [ ]:
    global_array_size = get_input_data_size()
    # calculate total height of tree
    max_height = int(np.log2(size))

    if rank==0:
        # create input array
        global_data = get_input_data()
        print("Global data %s" %(global_data))
    else:
        global_data = None

    local_array_size = global_array_size // size
    local_data = np.array([0] * local_array_size)
    comm.Scatter([global_data, MPI.INT], [local_data, MPI.INT], root=0)

    print("Rank %d: data %s" %(rank, local_data))

merge sort

Global data [1 3 15 12 14 5 7 0 8 9 4 6 13 11 2 10]
Rank 0: data [ 1  3 15 12]
Rank 1: data [14  5  7  0]
Rank 3: data [13 11  2 10]
Rank 2: data [ 8  9  4  6]

Part 2: Sort local sublists

In [ ]:
    local_data.sort()
    print("Rank %d: local data sorted %s" %(rank, local_data))

merge sort

Global data [1 3 15 12 14 5 7 0 8 9 4 6 13 11 2 10]
Rank 0: data [1  3 12 15]
Rank 1: data [0  5  7 14]
Rank 3: data [2 10 11 13]
Rank 2: data [4  6  8  9]

Part 3: What is the communication pattern

In [ ]:
for height in range(max_height):
    parent = rank & (~(1 << height))
    if rank == parent:
        child = rank | (1 << height)
        print("At height %d, rank %d receive from %d" %(height, rank, child))
    else:
        print("At height %d, rank %d sends to %d" %(height, rank, parent))
        break

merge sort

At height 0, rank 0 receive from 1
At height 0, rank 1 sends to 0
At height 0, rank 2 receive from 3
At height 0, rank 3 sends to 2
At height 0, rank 4 receive from 5
At height 0, rank 5 sends to 4
At height 0, rank 6 receive from 7
At height 0, rank 7 sends to 6

At height 1, rank 0 receive from 2
At height 1, rank 2 sends to 0
At height 1, rank 4 receive from 6
At height 1, rank 6 sends to 4

At height 2, rank 0 receive from 4
At height 2, rank 4 sends to 0

Part 3: Merge sublists

In [ ]:
    # merge() is assumed to be heapq.merge, which merges two sorted sequences
    for height in range(max_height):
        step = 1 << height
        parent = rank & (~(1 << height))
        
        if rank == parent:
            child = rank | (1 << height)
            # receive data from the right child
            data = np.array([0] * len(local_array))
            comm.Recv([data, MPI.INT], source=child, tag=height)
            
            # merge the array with your own data
            local_array = np.array(list(merge(local_array, data)))
        else:
            # send your data to the parent
            comm.Send([local_array, MPI.INT], dest=parent, tag=height)
            break

Correctness

mpiexec -n 4 python3 merge_sort.py

Global data [ 1 7 12 2 13 15 14 11 6 5 10 8 3 9 0 4]
Sorted array [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]

mpiexec -n 8 python3 merge_sort.py

Global data [ 5 0 17 2 10 15 13 30 20 16 19 29 28 23 31 3 25 11 4 9 24 12 14 7 26 6 18 8 27 1 21 22]
Sorted array [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31]

More formally, use asserts to check correctness

all(res[i] <= res[i+1] for i in range(len(res)-1))

Performance ?

Example 3: Matrix vector multiplication

Matrix vector multiplication

Dot product between rows and vector

  • Easiest way to split the multiplication?

Split by row

mvm

  • Each process gets one or multiple rows of A
    • Is responsible for one or multiple elements in the res vector
  • All processes store the vect vector

Distribute data from rank 0

In [ ]:
    my_row_count = number_rows // size
    local_A = np.zeros(shape=(my_row_count, number_rows), dtype='i')
    if rank==0:
        A = read_matrix()
        vect = read_vector()
        print("Multiply matrix %s with vector %s" %(A, vect))
    else:
        A = None
        vect = np.empty(number_rows, dtype='i')

    comm.Scatter([A, MPI.INT], [local_A, MPI.INT], root=0)
    comm.Bcast([vect, MPI.INT], root=0)

    print("rank %d: matrix %s" %(rank, local_A))
Multiply matrix 
[[0 1 2 3]
 [2 3 4 5]
 [4 5 6 7]
 [6 7 8 9]] with vector [0 1 0 1]
rank 0: matrix 
    [[0 1 2 3]
     [2 3 4 5]]
rank 1: matrix 
    [[4 5 6 7]
     [6 7 8 9]]

Local computation

In [ ]:
    # local computation
    local_res = np.zeros(my_row_count, dtype='i')
    for i in range(my_row_count):
        local_res[i] = np.dot(local_A[i], vect)

    print("rank %d: local result %s" %(rank, local_res))
Multiply matrix 
[[0 1 2 3]
 [2 3 4 5]
 [4 5 6 7]
 [6 7 8 9]] with vector [0 1 0 1]
rank 0: matrix 
    [[0 1 2 3]
     [2 3 4 5]]
rank 1: matrix 
    [[4 5 6 7]
     [6 7 8 9]]

rank 0: local result [4 8]
rank 1: local result [12 16]

Gather results on rank 0

In [ ]:
    res = None
    if rank == 0:
        res = np.empty(size * my_row_count, dtype='i')

    comm.Gather([local_res, MPI.INT], [res, MPI.INT], root=0)

    if rank==0:
        print("Multiplication result %s" %(res))
Multiply matrix 
[[0 1 2 3]
 [2 3 4 5]
 [4 5 6 7]
 [6 7 8 9]] with vector [0 1 0 1]
Multiplication result [ 4  8 12 16]

What are the limitations?

Split in blocks

mvm

  • Each process gets a block from one row of A
    • Is responsible for a fraction of one element in the res vector
      • Need to use ADD reduce op on all processes responsible for the element
  • Processes store blocks of the vect vector

Distribute data to the corresponding processes

In [ ]:
if rank==0:
    A, vect = read_input_data()
    
    for i in range(size):
        offset = (i % chunk_per_row) * chunk_size
        row_offset = i // chunk_per_row
        comm.Send([vect[offset : offset + chunk_size], MPI.INT], dest=i, tag=1)
        comm.Send([A[row_offset, offset : offset + chunk_size], MPI.INT], dest=i, tag=2)

comm.Recv([local_vect, MPI.INT], source=0, tag=1)
comm.Recv([local_A, MPI.INT], source=0, tag=2)
Multiply matrix 
[[0 1 2 3]
 [2 3 4 5]
 [4 5 6 7]
 [6 7 8 9]] with vector [0 1 0 1]
rank 0: matrix [0 1], vect [0 1]
rank 2: matrix [2 3], vect [0 1]
rank 4: matrix [4 5], vect [0 1]
rank 3: matrix [4 5], vect [0 1]
rank 5: matrix [6 7], vect [0 1]
rank 7: matrix [8 9], vect [0 1]
rank 1: matrix [2 3], vect [0 1]
rank 6: matrix [6 7], vect [0 1]

Local computation

In [ ]:
local_res = np.dot(local_A, local_vect)

print("rank %d: local result %s" %(rank, local_res))
Multiply matrix 
[[0 1 2 3]
 [2 3 4 5]
 [4 5 6 7]
 [6 7 8 9]] with vector [0 1 0 1]
rank 0: matrix [0 1], vect [0 1]
rank 2: matrix [2 3], vect [0 1]
rank 4: matrix [4 5], vect [0 1]
rank 3: matrix [4 5], vect [0 1]
rank 5: matrix [6 7], vect [0 1]
rank 7: matrix [8 9], vect [0 1]
rank 1: matrix [2 3], vect [0 1]
rank 6: matrix [6 7], vect [0 1]
rank 6: local result 7
rank 4: local result 5
rank 2: local result 3
rank 3: local result 5
rank 5: local result 7
rank 0: local result 1
rank 7: local result 9
rank 1: local result 3

Reduce values for every row

In [ ]:
    color = rank // chunk_per_row
    new_comm = MPI.COMM_WORLD.Split(color, rank)
    local_res = new_comm.reduce(local_res, root=0)

    if new_comm.Get_rank() == 0:
        print("rank %d: row %d result %d" %(rank, color, local_res))
Multiply matrix 
[[0 1 2 3]
 [2 3 4 5]
 [4 5 6 7]
 [6 7 8 9]] with vector [0 1 0 1]
rank 6: local result 7
rank 4: local result 5
rank 2: local result 3
rank 3: local result 5
rank 5: local result 7
rank 0: local result 1
rank 7: local result 9
rank 1: local result 3

rank 4: row 2 result 12
rank 2: row 1 result 8
rank 0: row 0 result 4
rank 6: row 3 result 16

Gather all rows at process 0

In [ ]:
    if new_comm.Get_rank() == 0:
        comm.send(local_res, dest=0, tag=3)

    res = np.zeros(number_rows, dtype='i')
    if rank == 0:
        for i in range(number_rows):
            status = MPI.Status()
            ret = comm.recv(source=MPI.ANY_SOURCE, tag=3, status=status)
            rnk = status.Get_source()
            res[rnk // chunk_per_row] = ret

        print("Multiplication result %s" %(res))
mpiexec -n 8 python3 matrix_vector_mult_block.py 4
Multiply matrix 
[[0 1 2 3]
 [2 3 4 5]
 [4 5 6 7]
 [6 7 8 9]] with vector [0 1 0 1]
Multiplication result [ 4  8 12 16]

Performance

pi performance

  • Running on Knight's Landing - Intel(R) Xeon Phi(TM) CPU 7230 @ 1.30GHz
    • 64 cores, 4 threads per core
    • 640 x 640 matrix
    • 5120 x 5120 matrix

Weak scaling

mvm performance

  • Problem size
    • 320 x 320 for one process
    • 640 x 640 for 2 processes
    • ...
    • 20480 x 20480 for 64 processes

How can we do it better?