MPI

Message Passing Interface

 

Lectures by: Ana Gainaru

Parallel Programming

 

Architecture

Types of Parallel Computing Models

Implicit Parallelism: Parallelizing Compilers

Explicit Parallelism: Data Parallel | Task parallel | Message Passing (MPI)

  • Data Parallel - the same instructions are carried out simultaneously on multiple data items (SIMD)
  • Task Parallel - different instructions on different data (MIMD)

  • SPMD (single program, multiple data) not synchronized at individual operation level

    • SPMD is equivalent to MIMD, since any MIMD program can be expressed as a collection of SPMD programs

Message passing (MPI) is for MIMD/SPMD parallelism.

The Message Passing Model

  • Consist of separate processes, each with its own address space

    • Programmer manages memory by placing data in a particular process
  • Data is sent explicitly between processes

    • Programmer manages memory motion
  • Multi-process communication (Collective operations)

    • Algorithm (performance) defined by the programming language
    • Programmer selects subset of processes

The Message Passing Model

  • A process is

    • a program counter
    • address space
  • Processes may have multiple threads (program counters and associated stacks) sharing a single address space.

MPI is for communication among processes, which have separate address spaces

  • Interprocess communication consists of
    • Moving data from one process’s address space to another’s
    • Synchronization

In These Lectures



1. Introduction to MPI 
    - Library and implementation characteristics
    - MPI for python
    - Error handling

2. Blocking point-to-point communication
    - Sending and receiving with MPI_Send and MPI_Recv
    - Dynamic receiving with MPI_Probe and MPI_Status
    - Groups and communicators

3. Non-blocking communication

4. Collective communication
    - Synchronization (Barrier)
    - Data movement (Broadcast, Gather, Scatter, Allgather, Alltoall)
    - Collective computation (Reduce, AllReduce, Scan, Exscan, Reduce_scatter)
    - Define your own collective routine
    - Non-Blocking Collective Operations

5. Topology mapping and neighborhood collectives

6. Profiling MPI applications



Programming With MPI

MPICH / OpenMPI

Programming With MPI

  • MPI is a library
    • All operations are performed with routine calls
    • Basic definitions in
      • mpi.h for C
      • MPI or MPI_F08 module for Fortran
    • Implementation
      • MPICH: high-quality reference implementation of the latest MPI standard
      • OpenMPI: common case, both in terms of usage and network conduits
  • MPI for Python
    • mpi4py module using pickle under the hood
      • implements binary protocols for serializing and de-serializing a Python object structure

Launching MPI jobs

mpiexec -np ${RANKS} python ${PATH_TO_SCRIPT}
In [ ]:
from mpi4py import MPI

print("Hello world")

mpiexec -n 4 python script.py

Hello world
Hello world
Hello world
Hello world

MPI Environment

  • Two important issues
    • Initializing all MPI’s global and internal variables
      • e.g. Global communicator that allows all processes to communicate
    • Giving processes enough information to communicate with each other
      • Number of total processes participating in this computation
      • Which process am I among all the processes
  • MPI provides functions for all issues:
    • MPI_Init (or MPI_Init_thread) / MPI_Finalize
    • MPI_Comm_size reports the number of processes
    • MPI_Comm_rank reports the rank, a number between 0 and size-1, identifying the calling process
#include "mpi.h"

int main(int argc, char *argv[])
{
 int rank, size, i, provided;
 float A[10];

 MPI_Init_thread(&argc, &argv, MPI_THREAD_SINGLE, &provided);

 MPI_Comm_size(MPI_COMM_WORLD, &size);
 MPI_Comm_rank(MPI_COMM_WORLD, &rank);

 for (i=0; i<10; i++)
     A[i] = i * rank;

 printf("My rank %d of %d\n", rank, size );

 printf("Here are my values for A\n");
 for (i=0; i<10; i++) printf("%f ", A[i]);
 printf("\n");

 MPI_Finalize();
}
In [ ]:
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size() 

A = [i * rank for i in range(5)]

print("My rank %d of %d" %(rank, size));
print("My values for A: %s" %(A));
My rank 0 of 4
My rank 1 of 4
My rank 2 of 4
My values for A: [0, 0, 0, 0, 0]
My values for A: [0, 2, 4, 6, 8]
My values for A: [0, 1, 2, 3, 4]
My rank 3 of 4
My values for A: [0, 3, 6, 9, 12]

Main MPI characteristics

All MPI programs begin with MPI_Init_thread and end with MPI_Finalize

  • MPI_COMM_WORLD is defined by mpi.h (in C) or the MPI module
    • includes all processes in the MPI “job”
  • Each statement executes independently in each process
    • including the print/printf statements

 

  • I/O to standard output is not part of MPI
    • output order undefined (may be interleaved by character, line, or blocks of characters)

MPI Initialization

  • Programmer can request the level of thread safety required for the program

    • E.g. Needed to use OpenMP with MPI  
  • MPI_THREAD_SINGLE gives the same behavior as MPI_Init

    • one thread will execute
  • MPI_THREAD_FUNNELED The process may be multi-threaded
    • only the main thread will make MPI calls
    • all MPI calls are funneled to the main thread
  • MPI_THREAD_SERIALIZED The process may be multi-threaded, and multiple threads may make MPI calls, but only one at a time
    • MPI calls are not made concurrently from two distinct threads
    • all MPI calls are serialized
  • MPI_THREAD_MULTIPLE Multiple threads may call MPI, with no restrictions.

MPI_Init_thread is used since MPI-2
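
As a rough sketch of how a thread level is requested from mpi4py (the rc options must be set before the MPI submodule is imported; the print is only illustrative):

import mpi4py
mpi4py.rc.thread_level = "funneled"   # ask for MPI_THREAD_FUNNELED

from mpi4py import MPI

provided = MPI.Query_thread()         # thread level actually granted
if provided >= MPI.THREAD_FUNNELED:
    print("main thread may make MPI calls on behalf of the other threads")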

MPI for Python

  • At import time (import mpi4py)

    • the module initializes the MPI execution environment calling MPI_Init_thread()
    • installs an exit hook to automatically call MPI_Finalize() just before the Python process terminates
  • You can set runtime configuration options (e.g. automatic initialization, thread_level) via mpi4py.rc()

  • By default, importing the MPI submodule calls MPI_Init_thread()
    • calling Init() or Init_thread() more than once violates the MPI standard
    • This will lead to a Python exception or an abort in C/C++
    • use Is_initialized() to test for initialization
      • similarly for Finalize: Is_finalized()

Error handling

By default, an error causes all processes to abort.

  • The user can cause routines to return (with an error code) instead.
  • A user can also write and install custom error handlers.

Error handling

  • MPI has error codes and classes

    • MPI routines return error codes
    • Each code belongs to an error class
    • The returned error code is implementation specific
    • The only error code that MPI standard itself defines is MPI_SUCCESS, i.e., no error
    • The meaning of an error code can be extracted by calling function MPI_Error_string.
  • An MPI implementation can use error codes to return instance-specific information on the error

    • MPICH does this, providing more detailed and specific messages
    • There are routines to convert an error code to text and to find the class for a code.

Error handling

 

MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

error_code = MPI_Send(send_buffer, strlen(send_buffer) + 1,
                      MPI_CHAR, addressee, tag, MPI_COMM_WORLD);

if (error_code != MPI_SUCCESS){
    char error_string[BUFSIZ];
    int length_of_error_string;

    MPI_Error_string(error_code, error_string, &length_of_error_string);

    fprintf(stderr, "%3d: %s\n", my_rank, error_string);
    send_error = TRUE;
}

MPI for python

  • mpi4py overrides the default MPI.ERRORS_ARE_FATAL error handler in favor of MPI.ERRORS_RETURN

    • which allows translating MPI errors into Python exceptions.
  • MPI errors, surfaced as Python exceptions, can be handled with the common try…except…finally clauses

    • Unhandled MPI exceptions will print a traceback, which helps in locating problems in the source code.
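
A minimal sketch of catching an MPI error as a Python exception (the deliberately invalid destination rank is only there to trigger the error):

from mpi4py import MPI

comm = MPI.COMM_WORLD
try:
    # sending to an out-of-range rank triggers an MPI error
    comm.send("data", dest=comm.Get_size(), tag=0)
except MPI.Exception as e:
    print("MPI error class %d: %s" % (e.Get_error_class(), e.Get_error_string()))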

 

Automatic MPI finalization combined with unhandled exceptions can lead to deadlocks.

MPI for python

Automatic MPI finalization + unhandled exceptions can lead to deadlocks.

Example

In [ ]:
from mpi4py import MPI

assert MPI.COMM_WORLD.Get_size() > 1
rank = MPI.COMM_WORLD.Get_rank()

if rank == 0:
    1/0
    MPI.COMM_WORLD.send(None, dest=1, tag=42)
elif rank == 1:
    MPI.COMM_WORLD.recv(source=0, tag=42)

MPI for python error handling

  • Handle exceptions in your code manually

OR

  • Run Python code passing -m mpi4py in the command line
    • In case of unhandled exceptions, the finalizer hook will call MPI_Abort() on the MPI_COMM_WORLD communicator
      • This will abort the MPI execution environment.

 

mpiexec -n ${NUM_PROCS} python -m mpi4py ${PYTHON_FILE} [${ARG}]

MPI Basic Send/Receive

Send Receive

  1. Identify the process you want to communicate with
  2. Describe the data
  3. Recognize messages sent to you by other processes
  4. Know when the send/receive operations are complete

1. Identify the process to communicate with

  • Processes can be collected into groups.
  • Each message is sent in a context, and must be received in the same context.

A group and context together form a communicator (*)

(*) the context (or ID) that differentiates one communicator from another and the group of processes contained by the communicator

  • A process is identified by its rank in the group associated with a communicator.
    • MPI_COMM_WORLD is enough for most applications

 

MPI_Comm_split(
    MPI_Comm comm,
    int color,             // for split
    int key,               // for ordering
    MPI_Comm* newcomm)
In [ ]:
# Get the rank and size in the original communicator
from mpi4py import MPI

world_rank = MPI.COMM_WORLD.Get_rank()
world_size = MPI.COMM_WORLD.Get_size()

# Determine color based on even/odd ranks
color = world_rank % 2; 

# Split the communicator based on the color and use the
# original rank for ordering
new_comm = MPI.COMM_WORLD.Split(color, world_rank)

new_rank = new_comm.Get_rank()
new_size = new_comm.Get_size()

print("World rank/size: %d/%d -- New rank/size: %d/%d" %(
            world_rank, world_size, new_rank, new_size));

new_comm.Free()
World rank/size: 0/4 -- New rank/size: 0/2
World rank/size: 1/4 -- New rank/size: 0/2
World rank/size: 2/4 -- New rank/size: 1/2
World rank/size: 3/4 -- New rank/size: 1/2

2. Describe the data

The message buffer is described by (start, count, datatype)

  • Datatype can be:

    • Predefined corresponding to a data type from the language (MPI_CHAR, MPI_INT, MPI_DOUBLE, etc)
    • a contiguous array of MPI datatypes
    • a strided block of datatypes
    • an indexed array of blocks of datatypes
    • an arbitrary structure of datatypes
  • There are MPI functions to construct custom datatypes

    • e.g. an array of (int, float) pairs, or a row of a matrix stored columnwise

mpi4py uses protocols for packing and unpacking a Python object structure

Why datatypes?

  • Since all data is labeled by type, an MPI implementation can support communication between processes on machines with very different memory representations and lengths of elementary datatypes (heterogeneous communication).

 

  • Specifying application-oriented layout of data in memory
    • can reduce memory-to-memory copies in the implementation
    • allows the use of special hardware (scatter/gather) when available
  • Specifying application-oriented layout of data on a file
    • can reduce system calls and physical disk I/O

How to create a new datatype

  • Create an MPI datatype from a general set of datatypes, displacements, and block sizes
int MPI_Type_create_struct(
  int count,
  int array_of_blocklengths[],
  MPI_Aint array_of_displacements[],
  MPI_Datatype array_of_types[],
  MPI_Datatype *newtype
);
In [ ]:
def definetype():
    dtypes = [MPI.INT, MPI.DOUBLE]
    dt_size = MPI.INT.size + MPI.DOUBLE.size
    displ = [0, MPI.INT.size]

    new_dtype = MPI.Datatype.Create_struct([1, 1], displ, dtypes)
    new_dtype = new_dtype.Create_resized(0, dt_size)
    new_dtype.Commit()
    return new_dtype

3. Recognize messages sent to you by other processes

  • Messages are sent with an accompanying user-defined integer tag

    • Used to assist the receiving process in identifying the message.
    • Messages can be screened at the receiving end by specifying a specific tag, or
    • not screened by specifying MPI_ANY_TAG as the tag in a receive.
  • Some non-MPI message-passing systems have called tags “message types”.

    • MPI calls them tags to avoid confusion with datatypes.
In [ ]:
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    sdata = {'a': 7, 'b': 3.14}
    comm.send(sdata, dest=1, tag=11)
elif rank == 1:
    rdata = comm.recv(source=0, tag=11)
    print(rdata)
{'a': 7, 'b': 3.14}

4. Know when the send/receive operations are complete

 

Blocking / Non-blocking

  • The message-passing approach makes the exchange of data cooperative.
  • Data is explicitly sent by one process and received by another.
  • An advantage is that any change in the receiving process’s memory is made with the receiver’s explicit participation.
  • Communication and synchronization are combined.

One sided communication

  • One-sided operations between processes include remote memory reads and writes
  • Only one process needs to explicitly participate.
  • An advantage is that communication and synchronization are decoupled

MPI Basic (Blocking) Send

 

MPI_SEND (start, count, datatype, dest, tag, comm)

  • The message buffer is described by (start, count, datatype).
  • The target process is specified by dest, which is the rank of the target process in the communicator specified by comm.
  • When this function returns, the data has been delivered to the system and the buffer can be reused
    • The message may not have been received by the target process.

MPI Basic (Blocking) Receive

 

MPI_RECV(start, count, datatype, source, tag, comm, status)

  • Waits until a matching (on source and tag) message is received from the system, and the buffer can be used.
  • source is rank in communicator specified by comm, or MPI_ANY_SOURCE.
  • status contains further information
  • Receiving fewer than count occurrences of datatype is OK, but receiving more is an error.
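
A minimal sketch of a receive that uses wildcards and then reads the extra information from the status object (assumes at least two ranks; the payload and tag are illustrative):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    status = MPI.Status()
    data = comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
    print("got %s from rank %d with tag %d"
          % (data, status.Get_source(), status.Get_tag()))
elif rank == 1:
    comm.send([1, 2, 3], dest=0, tag=7)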

Blocking Send/Receive Summary

Send Receive

Blocking Send/Receive Summary

Send Receive

...

if(rank == 0){
    // generate data
    MPI_Send(&sdata, 1, MPI_INT, 1, 100,
             MPI_COMM_WORLD);
} 
if(rank == 1){
    MPI_Recv(&rdata, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, 
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    // do something with data
}

Send array of (int32, float64) tuples

In [ ]:
def definetype():
    dtypes = [MPI.INT, MPI.DOUBLE]
    dt_size = MPI.INT.size + MPI.DOUBLE.size
    displ = [0, MPI.INT.size]

    new_dtype = MPI.Datatype.Create_struct([1, 1], displ, dtypes)
    new_dtype = new_dtype.Create_resized(0, dt_size)
    new_dtype.Commit()
    return new_dtype

def fill_data(data, num):
    for i in range(num):
        data[i][0] = i + 1
        data[i][1] = 3.14 * (i + 1)

mytype = definetype()
data = np.zeros(2, dtype="int32, double")
if rank == 0:
    comm.Recv([data, mytype], source=1, tag=0)
    print(data)
elif rank == 1:
    fill_data(data, 2)
    comm.Send([data, mytype], dest=0, tag=0)
[(1, 3.14) (2, 6.28)]

Non-blocking communication

Blocking vs nonblocking

Why Non-blocking?

  1. Overlapping communication and computation

    • It can increase performance of some applications (not always)
  2. Avoiding deadlocks

    • Related to message ordering
    • Due to completion depending on size of message and amount of system buffering

Send Receive

  • Non blocking communication
    • Send/Recv routines return before transfer is complete
    • and we wait later.

Deadlock example

  • Send a large message from process 0 to process 1

    • If there is insufficient storage at the destination, the send must wait for the user to provide the memory space (through a receive)
  • What happens with this code?

Deadlock

  • This is called unsafe because it depends on the availability of system buffers

Deadlock solution

  • Order the operations more carefully
  • Supply receive buffer at same time as send (MPI_Sendrecv)
  • Supply own buffer space (MPI_Bsend)
  • Use non-blocking operations
    • Safe, but
    • not guaranteed to be asynchronous or concurrent
    • not always faster

MPI Non-blocking operations

  • Return immediately
    • give request handles that can be tested and waited on for completion.


MPI_Request request;

MPI_Isend(start, count, datatype, dest, tag, comm, &request);

MPI_Irecv(start, count, datatype, dest, tag, comm, &request);

MPI_Wait(&request, &status);
  • One can also test without waiting:

MPI_Test(&request, &flag, &status); 

MPI for python ping pong example

  • Using numpy arrays
In [ ]:
from mpi4py import MPI
import numpy

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
assert comm.size == 2

sdata = numpy.array([(rank+1) * i for i in range(10)], dtype='i')  # dtype matches MPI.INT
rdata = numpy.zeros(10, dtype='i')

if comm.rank == 0:
    target = 1
else:
    target = 0

request = comm.Isend([sdata, MPI.INT], dest=target, tag=10)
comm.Recv([rdata, MPI.INT], source=target, tag=MPI.ANY_TAG)
request.Wait()

print("rank: %d received data %s" %(rank, rdata))
rank 1: received data [0 1 2 3 4 5 6 7 8 9]
rank 0: received data [0 2 4 6 8 10 12 14 16 18]

Multiple Completions

  • Waiting on multiple requests:

MPI_Waitall(count, array_of_requests, array_of_statuses);

MPI_Waitany(count, array_of_requests, &index, &status);

MPI_Waitsome(incount, array_of_requests,  &outcount, array_of_indices,
             array_of_statuses);
  • There are corresponding versions of test for each of these.
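
A sketch of completing several requests at once with mpi4py (a ring exchange; buffer sizes and the tag are illustrative):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

sdata = np.arange(10, dtype='i') * rank
rdata = np.zeros(10, dtype='i')

# send to the next rank, receive from the previous one
requests = [
    comm.Isend([sdata, MPI.INT], dest=(rank + 1) % size, tag=0),
    comm.Irecv([rdata, MPI.INT], source=(rank - 1) % size, tag=0),
]

MPI.Request.Waitall(requests)            # block until both complete
# index = MPI.Request.Waitany(requests)  # or wait for any single request
# flag  = MPI.Request.Testall(requests)  # or test without blocking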

Modes for sending messages:

  • Synchronous mode (MPI_Ssend)

    • The send does not complete until a matching receive has begun.
    • Unsafe programs deadlock.
  • Buffered mode (MPI_Bsend)

    • The user supplies a buffer to the system for its use.
    • User can allocate enough memory to make an unsafe program safe.
  • Ready mode (MPI_Rsend)

    • User guarantees that a matching receive has been posted.
    • Allows access to fast protocols
    • Undefined behavior if matching receive not posted
  • Non-blocking versions for all operations (MPI_Issend, etc.)

MPI_Recv receives messages sent in any mode.
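
For illustration, a sketch of the synchronous mode in mpi4py (assumes at least two ranks): ssend only returns once the matching receive has started, so it exposes unsafe send/receive orderings.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.ssend({'payload': 42}, dest=1, tag=0)  # blocks until rank 1 posts the receive
elif rank == 1:
    print(comm.recv(source=0, tag=0))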

MPI_Sendrecv

  • Simultaneous send and receive

    • Send and receive datatypes (even type signatures) may be different.
    • Can use Sendrecv with plain Send or Recv (or Irecv or Ssend_init, ...)
  • Intended for sending data in a "chain"

    • For instance, 0 sends to 1 while 1 sends to 2 while 2 sends to 0.
int MPI_Sendrecv(
  void *sendbuf,
  int sendcount,
  MPI_Datatype sendtype,
  int dest,
  int sendtag,
  void *recvbuf,
  int recvcount,
  MPI_Datatype recvtype,
  int source,
  int recvtag,
  MPI_Comm comm,
  MPI_Status *status
);

Both send and receive must be successful for the operation to succeed.

MPI_Sendrecv example

In [ ]:
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

to = (rank + 1) % size
fr = (rank - 1) % size

sdata = rank
rdata = comm.sendrecv(sdata, dest=to, source=fr)

print("rank %d got data from %d" %(rank, rdata))
rank 0 got data from 3
rank 1 got data from 0
rank 2 got data from 1
rank 3 got data from 2

Collective communication

  • All communication in MPI is within a group of processes
    • Collective communication is over all of the processes in that group

Collective patterns

  • Most algorithms are log(P)
  • They classify in 3 major communication patterns
    • Scatter, Gather, Reduce
    • Barrier, AllReduce, Allgather, Alltoall
    • Scan, Exscan

Collective communication

  • From the point of view of functionalities, there are three classes of operations
    • Synchronization: Barrier
    • Data movement: Broadcast, Gather, Scatter, Allgather, Alltoall
    • Collective computation: Reduce, AllReduce, Scan, Exscan, Reduce_scatter

Synchronization

MPI_Barrier( comm )

  • Blocks until all processes in the group of the communicator comm call it.
  • Programmers prefer not to use it in a parallel program
    • Occasionally useful in measuring performance and load balancing
    • It could increase performance by reducing network contention
In [ ]:
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
print("rank %d: before barrier" %(rank))

MPI.COMM_WORLD.barrier()

print("rank %d: after barrier" %(rank))
rank 0: before barrier
rank 1: before barrier
rank 2: before barrier
rank 3: before barrier
rank 3: after barrier
rank 1: after barrier
rank 2: after barrier
rank 0: after barrier

Collective Data Movement

  • One to all
    • Broadcast
    • Scatter (personalized)
  • All to one
    • Gather
  • All to all
    • Allgather
    • Alltoall (personalized)

All collective operations must be called by all processes in the communicator

Each process can get different data with the {function}v variant of each method

Collective communication semantics

  • Collective routines on the same communicator must be called in the same order on all participating processes

  • For multi-threaded processes (MPI_THREAD_MULTIPLE)

    • it is the user's responsibility to ensure the ordering
  • Message tags are not used

Use different communicators to separate collective operations on the same process

Collective Data Movement

Collectives

MPI.COMM_WORLD.bcast(data, root=proc_id)
MPI.COMM_WORLD.gather(data, root=proc_id)
MPI.COMM_WORLD.scatter(data, root=proc_id)

# for numpy arrays
MPI.COMM_WORLD.Scatter(sendbuf, recvbuf, root=0)
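
A minimal runnable sketch of the buffer-based Scatter/Gather pair with numpy arrays (array shapes and dtypes are illustrative):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

sendbuf = None
if rank == 0:
    sendbuf = np.arange(size * 3, dtype='i').reshape(size, 3)  # one row per rank
recvbuf = np.zeros(3, dtype='i')

comm.Scatter(sendbuf, recvbuf, root=0)        # each rank gets one row
local_sum = np.array([recvbuf.sum()], dtype='i')

sums = np.zeros(size, dtype='i') if rank == 0 else None
comm.Gather(local_sum, sums, root=0)          # collect one value per rank
if rank == 0:
    print("per-rank sums: %s" % sums)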

Collective Data Movement

Collectives

# sendobj is iterable
MPI.COMM_WORLD.allgather(sendobj)
MPI.COMM_WORLD.alltoall(sendobj)

# for numpy arrays
MPI.COMM_WORLD.Allgather([x,  MPI.DOUBLE], [xg, MPI.DOUBLE])

Collective Data Movement - Notes

  • MPI_Allgather is equivalent to

    • MPI_Gather followed by MPI_Bcast
    • Algorithms for MPI_Allgather are usually faster (try on ACCRE)
  • MPI_Alltoall

    • Performs a “transpose” of the data
    • Tricky to implement efficiently
      • Implementations have a hard time scaling to the size of today's supercomputers
      • For large messages point to point communication is preferable

Variants

  • The basic routines send the same amount of data from each process

MPI_Scatter(sbuf, 1, MPI_INT, rbuf, 1, MPI_INT, root, comm);

(*) sends one integer to each process
  • Sending a different number of items to each process
    • Example: MPI_Scatterv
    • Routines with “v” (for vector) at the end
      • allow to specify a different number of elements for each destination/source

 

Efficient algorithms exist for these cases (except alltoall), but they are not as fast as the simpler, basic routines

 

Variant alltoallw:

  • Allows different datatypes, counts, and displacements for each partner
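
As an illustration of the "v" variants, a minimal Scatterv sketch with numpy buffers in which rank i receives i+1 elements (the counts and displacements are illustrative):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

counts = [i + 1 for i in range(size)]            # elements per destination
displs = [sum(counts[:i]) for i in range(size)]  # offsets into the send buffer

sendbuf = None
if rank == 0:
    sendbuf = np.arange(sum(counts), dtype='i')
recvbuf = np.zeros(counts[rank], dtype='i')

comm.Scatterv([sendbuf, counts, displs, MPI.INT], recvbuf, root=0)
print("rank %d got %s" % (rank, recvbuf))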

Collective Computation

  • Combines communication with computation
    • Reduce
      • All to one, with an operation to combine
    • Scan, Exscan
      • All prior ranks to all, with combination
    • Reduce_scatter
      • All to all, with combination

Combination operations

  • Can be either
    • Predefined operations
    • User defined operations

Reduction

Reduce

 
 

MPI reduction operations include:
MPI.MAX, MPI.MIN
MPI.SUM, MPI.PROD
MPI.LAND, MPI.BAND
MPI.LOR, MPI.BOR
MPI.LXOR, MPI.BXOR
MPI.MAXLOC, MPI.MINLOC
MPI.COMM_WORLD.reduce(sendobj, op=MPI.SUM, root=0)
MPI.COMM_WORLD.allreduce(sendobj, op=MPI.SUM)
  • reduce is similar to gather, but the gathered values are combined with the given operation
  • allreduce is similar to allgather: every process receives the combined result
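
A small sketch contrasting the two (the object-based reduce returns the result only at the root, while the buffer-based Allreduce fills a result buffer on every rank):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

total = comm.reduce(rank, op=MPI.SUM, root=0)    # None on non-root ranks

local = np.array([rank, 2 * rank], dtype='i')
glob = np.zeros_like(local)
comm.Allreduce(local, glob, op=MPI.SUM)          # every rank gets the sums

print("rank %d: reduce=%s allreduce=%s" % (rank, total, glob))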

Collective computation

Collectives

Allreduce, Exscan, Reduce, Reduce_scatter, and Scan take both built-in and user-defined combiner functions.

All collectives

  • All versions deliver results to all participating processes
  • Most routines accept both intra- and intercommunicators
    • Intercommunicator versions are collective between two groups of processes

Define your own collective routine

MPI_Op_create(user_fcn, commutes, &op);
MPI_Op_free(&op);

user_fcn(invec, inoutvec, len, datatype);
  • Depending on the value of commute

    • True: the operation should be both commutative and associative
    • False: the order of operands is fixed
      • defined to be in ascending, process rank order, beginning with process zero
  • The user function should perform:

    for(i=0; i<len; i++)
      inoutvec[i] = invec[i] op inoutvec[i];
    

The operation op is always assumed to be associative. The user function can be non-commutative. Example?

Define your own collective example

Product of complex number

user_fcn(invec, inoutvec, len, datatype);

python:

user_fcn(inelem1, inelem2, outelem)
In [ ]:
from Complex import Complex
# Complex class defined in a separate file

def myprod(a, b, c):
    c = Complex()
    c.real = a.real * b.real - a.img * b.img
    c.img = a.img * b.real + a.real * b.img
    return c

Define your own collective example

Product of complex number

In [ ]:
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

a = Complex(rank + 1, rank)

print("Rank %d: Data %s" %(rank, a))

myop = MPI.Op.Create(myprod, True) # commute is True
b = comm.reduce(a, op=myop, root=0)

if rank==0:
    print("Produc: %s" %(b))
Rank 0: Data (1.0  0.0)
Rank 1: Data (2.0  1.0)
Rank 3: Data (4.0  3.0)
Rank 2: Data (3.0  2.0)
Product: (-5.0 40.0)

For C version https://www.mpi-forum.org/docs/mpi-1.1/mpi-11-html/node80.html

Define your own collective example

  • Numpy arrays are trickier
    • Programmer needs to manually pack/unpack data
In [ ]:
def unpack_array(array_mem, dt):
    a = array_mem.tobytes()
    itemsize = dt.Get_size()
    array_len = int(array_mem.shape[0])
    unpack_array = []
    for i in range(int(array_len/itemsize)):
        unpack_array.append(int(np.frombuffer(a[itemsize*i:itemsize*(i+1)],
                            dtype=np.int64)))
    return np.array(unpack_array)

def myprod(xmem, ymem, dt):
    a = unpack_array(xmem, dt)
    b = unpack_array(ymem, dt)

    c = np.array([0] * len(a))
    for i in range(len(a)):
        c[i] = a[i] % 2 | b[i] % 2

    ymem[:] = memoryview(bytearray(c))[:]

Define your own collective example

  • Numpy arrays are trickier
    • Programmer needs to manually pack/unpack data
In [ ]:
a = np.array([rank * (i + 1) for i in range(3)])
b = np.array([0] * 3)
print("Rank %d: Data %s" %(rank, a))

myop = MPI.Op.Create(myprod, True) # commute is True
comm.Reduce([a, MPI.DOUBLE], [b, MPI.DOUBLE], op=myop, root=0)

if rank==0:
    print("Contains odd numbers: %s" %(b))
Rank 0: Data [0 0 0]
Rank 2: Data [2 4 6]
Rank 1: Data [1 2 3]
Rank 3: Data [3 6 9]
Contains odd numbers: [1 0 1]
Another example in the unit test code for mpi4py: https://github.com/mpi4py/mpi4py/blob/master/test/test_op.py

Non-Blocking Collective Operations

  • MPI-3 introduced nonblocking versions of collective operations

    • All return an MPI_Request, use the usual MPI_Wait, MPI_Test, etc. to complete.
    • May be mixed with point-to-point and other MPI_Requests
    • Not necessarily faster
      • Follow same ordering rules as blocking operations
  • MPI_Ibarrier

    • Useful for distributed termination detection
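
A sketch of a non-blocking Allreduce with mpi4py (requires an MPI-3 implementation; the overlap region is only indicated by a comment):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = np.array([rank], dtype='i')
total = np.zeros(1, dtype='i')

req = comm.Iallreduce(local, total, op=MPI.SUM)  # returns immediately

# ... independent local computation can overlap with the collective here ...

req.Wait()                                       # complete the collective
print("rank %d: sum of ranks = %d" % (rank, total[0]))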

Topologies and Neighborhood

Topology mapping

(*) Image from: Advanced MPI 2.2 and 3.0 Tutorial by Torsten Hoefler

Topologies

  • A virtual topology represents the way that MPI processes communicate

    • Nearest neighbor exchange in a mesh
    • Recursive doubling in an all-to-all exchange
  • A physical topology represents the connections between the cores, chips, and nodes in the hardware

Mapping of the virtual topology onto the physical topology.
Think of nodes containing chips containing cores - not trivial

Topology mapping basics

  • Allocation mapping
  • Rank reordering

Topology Mapping Basics

Allocation mapping

  • Up-front specification of communication pattern
  • Batch system picks good set of nodes for given topology
  • Properties:
    • Not supported by current batch systems
    • Either predefined allocation, random allocation, or “global bandwidth maximization”
    • Not always possible to specify communication pattern upfront

Topology Mapping Basics

Rank reordering

  • Change numbering in a given allocation to reduce congestion
  • Sometimes automatic
  • Properties:
    • Always possible, but effect may be limited (e.g., in a bad allocation)
    • Allows manual shuffling through MPI process topologies
      • Network topology is not exposed
    • Data has to be manually shuffled after remapping

MPI’s Topology Routines

  • MPI provides routines to create new communicators that order the process ranks

  • Two types of virtual topology supported:

    • Cartesian (regular mesh)
    • Graph (several ways to define in MPI)
      • The routines evolved with each version of MPI
  • Additional routines provide access to the defined virtual topology

  • (Virtual) topologies are properties of a communicator
    • Topology routines all create a new communicator with properties of the specified virtual topology

MPI_Cart_create

  • Creates a new communicator newcomm from oldcomm
    • representing an ndim dimensional mesh with sizes dims

 

MPI_Cart_create(MPI_Comm oldcomm, int ndim, int dims[],
                int qperiodic[], int qreorder, MPI_Comm *newcomm)
  • The mesh is periodic in coordinate direction i if qperiodic[i] is true (torus).

  • The ranks in the new communicator are reordered

    • to better match the physical topology
    • if qreorder is true

 

Some processes may return MPI_COMM_NULL.
The product of the dims entries must be <= P

MPI_Dims_create

  • Fill in the dims array such that the product of dims[i] for i=0 to ndim-1 equals nnodes.

 

MPI_Dims_create(int nnodes, int ndim, int dims[])
  • Any value of dims[i] that is 0 on input will be replaced

    • values that are > 0 will not be changed
  • Makes life easier

    • Some problems might work better with non-square layouts

Example

3D torus

int p;
int dims[3] = {0, 0, 0};

MPI_Comm_size(MPI_COMM_WORLD, &p);
MPI_Dims_create(p, 3, dims);

int periods[3] = {1, 1, 1};

MPI_Comm cart_comm;
MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart_comm);
  • Reorder argument allows for topology mapping
    • Each calling process may have a new rank in the created communicator
    • Data has to be remapped manually

Example python

In [ ]:
from mpi4py import MPI

comm = MPI.COMM_WORLD
world_rank = comm.Get_rank()
size = comm.Get_size()

dims = MPI.Compute_dims(size, [0]*2)
cart_comm = comm.Create_cart(
        dims, periods=[True,True], reorder=True)

new_rank = cart_comm.Get_rank()

print("World rank: %d -- Cart rank: %d" %(
    world_rank, new_rank));

cart_comm.Free()
World rank: 2 -- Cart rank: 2
World rank: 1 -- Cart rank: 1
World rank: 3 -- Cart rank: 3
World rank: 0 -- Cart rank: 0

Cartesian Query Functions

  • Determine neighbor ranks
    • Can be computed from rank (in the cart_comm), dims, and periods
      • Ordering defined in MPI
      • See Section 7.5 in MPI-3 Standard
  • MPI_Cartdim_get()
    • Gets dimensions of a Cartesian communicator
  • MPI_Cart_get()
    • Gets size of dimensions
  • MPI_Cart_rank()
    • Translate coordinates to rank
  • MPI_Cart_coords()
    • Translate rank to coordinates
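
A sketch of these queries through mpi4py on a freshly created Cartesian communicator (the 2D mesh is illustrative):

from mpi4py import MPI

comm = MPI.COMM_WORLD
dims = MPI.Compute_dims(comm.Get_size(), [0, 0])
cart_comm = comm.Create_cart(dims, periods=[True, True], reorder=True)

ndims = cart_comm.Get_dim()                          # MPI_Cartdim_get
dims, periods, my_coords = cart_comm.Get_topo()      # MPI_Cart_get
coords = cart_comm.Get_coords(cart_comm.Get_rank())  # MPI_Cart_coords
same_rank = cart_comm.Get_cart_rank(coords)          # MPI_Cart_rank

print("rank %d sits at coordinates %s of a %s mesh" % (same_rank, coords, dims))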

MPI_Cart_shift

  • Returns the ranks of the processes that are shifted by disp steps in the given coordinate direction

 

MPI_Cart_shift(MPI_Comm comm, int direction, int disp,
               int *rank_source, int *rank_dest)
  • Useful for nearest neighbor communication in the coordinate directions
    • Dimensions are numbered from 0 to ndims-1
    • Displacement indicates neighbor distance (-1, 1, 2)
    • May return MPI_PROC_NULL
      • The source or destination for the shift is out of range

Example (8 processes)

In [ ]:
ndim = 2

dims = MPI.Compute_dims(size, [0]*ndim)
cart_comm = comm.Create_cart(dims, periods=[True,True], reorder=True)

new_rank = cart_comm.Get_rank()
if new_rank==0:
    print("Cartesian dim: %s" %(dims))

for i in range(ndim):
    for d in (-1, +1):
        source, dest = cart_comm.Shift(i, d)
        if new_rank == 0:
            print("Dir %d, disp %d - Src %d - Dest %d" %(i, d, source, dest));

MPI_shift

Cartesian dim: [4, 2]
Dir 0, disp -1 - Src 2 - Dest 6
Dir 0, disp 1 - Src 6 - Dest 2
Dir 1, disp -1 - Src 1 - Dest 1
Dir 1, disp 1 - Src 1 - Dest 1

Advanced topology concepts

Graph Topology

  • MPI provides routines to specify a general graph virtual topology
    • Graph vertices represent MPI processes (usually one per process)
    • Graph edges indicate important connections

Collectives on topologies

  • Since MPI-3
    • Collective communications only cover some patterns
      • E.g., no stencil pattern
    • Example: MPI_Neighbor_allgather
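
A sketch of MPI_Neighbor_allgather via mpi4py on a 2D periodic Cartesian communicator (requires MPI-3; each process simply exchanges its rank with its Cartesian neighbors):

from mpi4py import MPI

comm = MPI.COMM_WORLD
dims = MPI.Compute_dims(comm.Get_size(), [0, 0])
cart_comm = comm.Create_cart(dims, periods=[True, True], reorder=False)

# one value is received from each Cartesian neighbor
neighbor_ranks = cart_comm.neighbor_allgather(cart_comm.Get_rank())
print("rank %d neighbors: %s" % (cart_comm.Get_rank(), neighbor_ranks))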

Reading

Generic Topology Mapping Strategies for Large-scale Parallel Architectures
Hoefler and Snir
http://dx.doi.org/10.1145/1995896.1995909

Profiling

Profiling interface: PMPI routines

Determinism in MPI

  • In exact arithmetic, you always get the same results

    • but roundoff and truncation errors can occur with floating-point arithmetic
  • Message-passing programming models are by default nondeterministic

    • The arrival order of messages sent from two processes, A and B, to a third process, C, is not defined.
    • MPI does guarantee that two messages sent from one process, A, to another process, B, will arrive in the order sent.
    • It is the programmer's responsibility to ensure that a computation is deterministic when this is required.
  • How can we "ensure" determinism?

  1. Use source and tag (not MPI_ANY_SOURCE or MPI_ANY_TAG)
  2. Use synchronization

How Deterministic are Collective Computations?

  • Depending on the process-to-hardware mapping

    • Messages can arrive in different order to processes
    • Round-off error may cause slight differences
  • Reduction results might be different for non-associative operations

    • Even if they are commutative
  • Allreduce does guarantee that the same value is received by all processes for each call

  • Why didn’t MPI mandate determinism?

    • Not all applications need it
    • Implementations of collective algorithms can use “deferred synchronization” ideas to provide better performance

Timing MPI Programs

  • The elapsed (wall-clock) time between two points in an MPI program can be computed using MPI_Wtime:
double t1, t2;
t1 = MPI_Wtime();
...
t2 = MPI_Wtime();
printf("time is %f\n", t2 - t1);
  • The value returned by a single call to MPI_Wtime has little value.
  • The resolution of the timer is returned by MPI_Wtick
  • Times in general are local, but an implementation might offer synchronized times.
    • For advanced users: see the MPI attribute MPI_WTIME_IS_GLOBAL.

Example Bcast time

In [ ]:
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

data = np.array([rank * i for i in range(10)])

local_wt = MPI.Wtime();
comm.Bcast(data, root=0);
local_wt = MPI.Wtime() - local_wt;

print("Bcast time %f" %(wt))
  • Why is this not accurate?
  • What is the output?
Bcast time 0.003616
Bcast tiBcast time 0.005131
me 0.0020Bcast time 0.004574
81

Example Bcast time

In [ ]:
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

data = np.array([rank * i for i in range(10)])

comm.barrier();

local_wt = MPI.Wtime();
comm.Bcast(data, root=0);
local_wt = MPI.Wtime() - local_wt;

wt = comm.reduce(local_wt, op=MPI.MAX, root=0);

if rank == 0:
    print("Bcast time %f" %(wt))
Bcast time 0.004041

How accurate is this measurement?

PMPI routines

  • PMPI allows selective interception of MPI routines at link time
    • No need to recompile (for C and Fortran)
    • Interposition library contains wrapper functions

PMPI

  • Program is linked to interposition library
    • All functions are available under two names: MPI_xxx and PMPI_xxx
      • MPI_xxx symbols are weak and can be overridden
    • Measurement code in the interposition library runs around each intercepted call before forwarding it to PMPI_xxx

Not all MPI functions need to be instrumented

PMPI example (C/C++)

  • Count number of calls to the Send routine for each rank
In [ ]:
#include <stdio.h>
#include "mpi.h"

static int numsend = 0;

int MPI_Send(void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm) {
    numsend++;
    return PMPI_Send(buf, count, type, dest, tag, comm);
}

int MPI_Finalize() {
    int me;
    PMPI_Comm_rank(MPI_COMM_WORLD, &me);
    printf("%d sent %d messages.\n", me, numsend);
    return PMPI_Finalize();
}

Writing PMPI wrappers

  • Use rank in MPI_COMM_WORLD as process identifier
    • Data volume: #elements * PMPI_Type_size(type)
  • MPI does not provide a message ID
    • Custom functions to record complete traffic
      • if you need to match send and recv events
  • Wildcard message source and tag
    • Record the actual values from the MPI_Status returned by the receive

Profiling tools

Tracing

  • Recording of individual time-stamped program events, as opposed to aggregated information
    • Entering and leaving a function
    • Sending and receiving a message

Vampir

Intel Trace Analyzer: https://researchcomputing.princeton.edu/faq/using-intel-trace-analyze

VAMPIR: https://vampir.eu/

Profiling

ParaProf: http://www.cs.uoregon.edu/research/tau/vihps14/ParaProf.pdf

cProfile module in Python

Code

All the code from these slides can be downloaded from https://github.com/anagainaru/MPI_Lectures

  • Scalability examples
    • Computing pi
    • Merge sort
    • Vector matrix multiplication
    • You are encouraged to run the examples on ACCRE

 

For questions: ana.gainaru@vanderbilt.edu

Useful links

mpi4py

MPI

Advanced MPI

For questions: ana.gainaru@vanderbilt.edu

Examples

Example 1: Calculating Pi

calculating pi

Monte Carlo simulation

pi

calculating pi

Image downloaded from: https://commons.wikimedia.org/wiki/File:Pi_30K.gif

Example 1: Calculating Pi

pi

Calculating pi via numerical integration

  • Divide interval up into subintervals
  • Assign subintervals to processes
  • Each process calculates partial sum
  • Add all the partial sums

Code

For dividing the interval in n slices

In [ ]:
# setup assumed in compute_pi.py: n (the number of slices) is read from the command line
from mpi4py import MPI
import numpy as np
import sys

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
n = int(sys.argv[1])

w = 1.0 / n
mypi = 0.0

for i in range(rank + 1, n + 1, size):
    mypi += w * np.sqrt(1 - np.power(i / n, 2))

pi = comm.reduce(mypi, op=MPI.SUM, root=0)

if rank==0:
    print("PI: %f" %(4 * pi))
ana@webserver:~/mpi_lecture$ mpiexec -n 2 python3 compute_pi.py 10
PI: 2.904518
ana@webserver:~/mpi_lecture$ mpiexec -n 2 python3 compute_pi.py 100
PI: 3.120417
ana@webserver:~/mpi_lecture$ mpiexec -n 2 python3 compute_pi.py 1000
PI: 3.139555
ana@webserver:~/mpi_lecture$ mpiexec -n 2 python3 compute_pi.py 10000
PI: 3.141391

Performance

  • Strong scaling
    • Keep the problem size the same
    • Increase the number of processes

Example: 6.4 mil slices (2 to 64 processes)

  • Weak scaling
    • Increase the problem size proportional with the number of processes

Example: keep 100k slices per processes

Performance

pi performance

  • Running on Knight's Landing - Intel(R) Xeon Phi(TM) CPU 7230 @ 1.30GHz
    • 64 cores, 4 threads per core
    • 6.4 million slices

Example 2: Merge Sort

 

merge sort

Divide an unsorted list into sublists

  • Until each sublist contains only one element
  • One element sublists are merged together
  • Stop when only one sublist remains

Parallel Merge Sort

  • Hybrid strategy (not necessarily the best)
    • The sublists are sorted by a sequential algorithm
    • The merging of sublists is done in parallel between processes

 

The number of processes is a power of two so that all processes are doing roughly the same amount of work

 

TODO at home: change the code to work with non-power of two processes

Part 1: Divide list into unsorted sublists

 

merge sort

  • The list is scattered to all of the processes
    • Each process has an equal chunk of the list

Figure shows 4 processes and a list containing 8 integers

Part 2: Sort local sublists

merge sort

  • Sequential sort

merge sort

Part 3: Merge sublists

merge sort

  • Send data
  • Merge locally

Part 1: Divide list into unsorted sublists

In [ ]:
    global_array_size = get_input_data_size()
    # calculate total height of tree
    max_height = int(np.log2(size))

    if rank==0:
        # create input array
        global_data = get_input_data()
        print("Global data %s" %(global_data))
    else:
        global_data = None

    local_array_size = global_array_size // size
    local_data = np.array([0] * local_array_size)
    comm.Scatter([global_data, MPI.INT], [local_data, MPI.INT], root=0)

    print("Rank %d: data %s" %(rank, local_data))

merge sort

Global data [1 3 15 12 14 5 7 0 8 9 4 6 13 11 2 10]
Rank 0: data [ 1  3 15 12]
Rank 1: data [14  5  7  0]
Rank 3: data [13 11  2 10]
Rank 2: data [ 8  9  4  6]

Part 2: Sort local sublists

In [ ]:
    local_data.sort()
    print("Rank %d: local data sorted %s" %(rank, local_data))

merge sort

Global data [1 3 15 12 14 5 7 0 8 9 4 6 13 11 2 10]
Rank 0: data [1  3 12 15]
Rank 1: data [0  5  7 14]
Rank 3: data [2 10 11 13]
Rank 2: data [4  6  8  9]

Part 3: What is the communication pattern

In [ ]:
for height in range(max_height):
    parent = rank & (~(1 << height))
    if rank == parent:
        child = rank | (1 << height)
        print("At height %d, rank %d receive from %d" %(height, rank, child))
    else:
        print("At height %d, rank %d sends to %d" %(height, rank, parent))
        break

merge sort

At height 0, rank 0 receive from 1
At height 0, rank 1 sends to 0
At height 0, rank 2 receive from 3
At height 0, rank 3 sends to 2
At height 0, rank 4 receive from 5
At height 0, rank 5 sends to 4
At height 0, rank 6 receive from 7
At height 0, rank 7 sends to 6

At height 1, rank 0 receive from 2
At height 1, rank 2 sends to 0
At height 1, rank 4 receive from 6
At height 1, rank 6 sends to 4

At height 2, rank 0 receive from 4
At height 2, rank 4 sends to 0

Part 3: Merge sublists

In [ ]:
    # merge() is assumed to be heapq.merge, which merges two sorted sequences
    for height in range(max_height):
        step = 1 << height
        parent = rank & (~(1 << height))
        
        if rank == parent:
            child = rank | (1 << height)
            # receive data from the right child
            data = np.array([0] * len(local_array))
            comm.Recv([data, MPI.INT], source=child, tag=height)
            
            # merge the array with your own data
            local_array = np.array(list(merge(local_array, data)))
        else:
            # send your data to the parent
            comm.Send([local_array, MPI.INT], dest=parent, tag=height)
            break

Correctness

mpiexec -n 4 python3 merge_sort.py

Global data [ 1 7 12 2 13 15 14 11 6 5 10 8 3 9 0 4]
Sorted array [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]

mpiexec -n 8 python3 merge_sort.py

Global data [ 5 0 17 2 10 15 13 30 20 16 19 29 28 23 31 3 25 11 4 9 24 12 14 7 26 6 18 8 27 1 21 22]
Sorted array [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31]

More formally, use asserts to check correctness

all(res[i] <= res[i+1] for i in range(len(res)-1))

Performance ?

Example 3: Matrix vector multiplication

Matrix vector multiplication

Dot product between rows and vector

  • Easiest way to split the multiplication?

Split by row

mvm

  • Each process gets one or multiple rows of A
    • Is responsible for one or multiple elements in the res vector
  • All processes store the vect vector

Distribute data from rank 0

In [ ]:
    my_row_count = number_rows // size
    local_A = np.zeros(shape=(my_row_count, number_rows), dtype='i')
    if rank==0:
        A = read_matrix()
        vect = read_vector()
        print("Multiply matrix %s with vector %s" %(A, vect))
    else:
        A = None
        vect = np.empty(number_rows, dtype='i')

    comm.Scatter([A, MPI.INT], [local_A, MPI.INT], root=0)
    comm.Bcast([vect, MPI.INT], root=0)

    print("rank %d: matrix %s" %(rank, local_A))
Multiply matrix 
[[0 1 2 3]
 [2 3 4 5]
 [4 5 6 7]
 [6 7 8 9]] with vector [0 1 0 1]
rank 0: matrix 
    [[0 1 2 3]
     [2 3 4 5]]
rank 1: matrix 
    [[4 5 6 7]
     [6 7 8 9]]

Local computation

In [ ]:
    # local computation
    local_res = np.zeros(my_row_count, dtype='i')
    for i in range(my_row_count):
        local_res[i] = np.dot(local_A[i], vect)

    print("rank %d: local result %s" %(rank, local_res))
Multiply matrix 
[[0 1 2 3]
 [2 3 4 5]
 [4 5 6 7]
 [6 7 8 9]] with vector [0 1 0 1]
rank 0: matrix 
    [[0 1 2 3]
     [2 3 4 5]]
rank 1: matrix 
    [[4 5 6 7]
     [6 7 8 9]]

rank 0: local result [4 8]
rank 1: local result [12 16]

Gather results on rank 0

In [ ]:
    res = None
    if rank == 0:
        res = np.empty(size * my_row_count, dtype='i')

    comm.Gather([local_res, MPI.INT], [res, MPI.INT], root=0)

    if rank==0:
        print("Multiplication result %s" %(res))
Multiply matrix 
[[0 1 2 3]
 [2 3 4 5]
 [4 5 6 7]
 [6 7 8 9]] with vector [0 1 0 1]
Multiplication result [ 4  8 12 16]

What are the limitations?

Split in blocks

mvm

  • Each process gets a block from one row of A
    • Is responsible for a fraction of one element in the res vector
      • Need to use ADD reduce op on all processes responsible for the element
  • Processes store blocks of the vect vector

Distribute data to the corresponding processes

In [ ]:
if rank==0:
    A, vect = read_input_data()
    
    for i in range(size):
        offset = (i % chunk_per_row) * chunk_size
        row_offset = i // chunk_per_row
        comm.Send([vect[offset : offset + chunk_size], MPI.INT], dest=i, tag=1)
        comm.Send([A[row_offset, offset : offset + chunk_size], MPI.INT], dest=i, tag=2)

comm.Recv([local_vect, MPI.INT], source=0, tag=1)
comm.Recv([local_A, MPI.INT], source=0, tag=2)
Multiply matrix 
[[0 1 2 3]
 [2 3 4 5]
 [4 5 6 7]
 [6 7 8 9]] with vector [0 1 0 1]
rank 0: matrix [0 1], vect [0 1]
rank 2: matrix [2 3], vect [0 1]
rank 4: matrix [4 5], vect [0 1]
rank 3: matrix [4 5], vect [0 1]
rank 5: matrix [6 7], vect [0 1]
rank 7: matrix [8 9], vect [0 1]
rank 1: matrix [2 3], vect [0 1]
rank 6: matrix [6 7], vect [0 1]

Local computation

In [ ]:
local_res = np.dot(local_A, local_vect)

print("rank %d: local result %s" %(rank, local_res))
Multiply matrix 
[[0 1 2 3]
 [2 3 4 5]
 [4 5 6 7]
 [6 7 8 9]] with vector [0 1 0 1]
rank 0: matrix [0 1], vect [0 1]
rank 2: matrix [2 3], vect [0 1]
rank 4: matrix [4 5], vect [0 1]
rank 3: matrix [4 5], vect [0 1]
rank 5: matrix [6 7], vect [0 1]
rank 7: matrix [8 9], vect [0 1]
rank 1: matrix [2 3], vect [0 1]
rank 6: matrix [6 7], vect [0 1]
rank 6: local result 7
rank 4: local result 5
rank 2: local result 3
rank 3: local result 5
rank 5: local result 7
rank 0: local result 1
rank 7: local result 9
rank 1: local result 3

Reduce values for every row

In [ ]:
    color = rank // chunk_per_row
    new_comm = MPI.COMM_WORLD.Split(color, rank)
    local_res = new_comm.reduce(local_res, root=0)

    if new_comm.Get_rank() == 0:
        print("rank %d: row %d result %d" %(rank, color, local_res))
Multiply matrix 
[[0 1 2 3]
 [2 3 4 5]
 [4 5 6 7]
 [6 7 8 9]] with vector [0 1 0 1]
rank 6: local result 7
rank 4: local result 5
rank 2: local result 3
rank 3: local result 5
rank 5: local result 7
rank 0: local result 1
rank 7: local result 9
rank 1: local result 3

rank 4: row 2 result 12
rank 2: row 1 result 8
rank 0: row 0 result 4
rank 6: row 3 result 16

Gather all rows at process 0

In [ ]:
    if new_comm.Get_rank() == 0:
        comm.send(local_res, dest=0, tag=3)

    res = np.zeros(number_rows, dtype='i')
    if rank == 0:
        for i in range(number_rows):
            status = MPI.Status()
            ret = comm.recv(source=MPI.ANY_SOURCE, tag=3, status=status)
            rnk = status.Get_source()
            res[rnk // chunk_per_row] = ret

        print("Multiplication result %s" %(res))
mpiexec -n 8 python3 matrix_vector_mult_block.py 4
Multiply matrix 
[[0 1 2 3]
 [2 3 4 5]
 [4 5 6 7]
 [6 7 8 9]] with vector [0 1 0 1]
Multiplication result [ 4  8 12 16]

Performance

pi performance

  • Running on Knight's Landing - Intel(R) Xeon Phi(TM) CPU 7230 @ 1.30GHz
    • 64 cores, 4 threads per core
    • 640 x 640 matrix
    • 5120 x 5120 matrix

Weak scaling

mvm performance

  • Problem size
    • 320 x 320 for one process
    • 640 x 640 for 2 processes
    • ...
    • 20480 x 20480 for 64 processes

How can we do it better?