Lab 12: Parallel dense matrix-matrix multiplication with MPI

Oleg Batrashev version 0.1

November 18, 2009

Introduction

There are many MPI implementations available: OpenMPI, MPICH, LAM-MPI. Most of them have API interfaces for C and Fortran. In this course we are going to use mpi4py, which adds a Python interface on top of an existing MPI implementation.

Running `which mpirun` should give the location of the MPI implementation files. In the computer class and on the kuu cluster (kuu.mt.ut.ee) this is OpenMPI 1.3.2. You can get some information about it and its installed components by running `ompi_info`. You can get the location of the mpi4py installation by running `python -c 'import mpi4py; print mpi4py.__file__'`.

Test general MPI with `mpirun -np 4 hostname`, which should print the same hostname for all 4 processes in the computer class, because only the local machine is used. On the kuu cluster, however, you should see 2 different hostnames, each appearing twice, because the cluster consists of 8 dual-core processors.

mpi4py interfaces

There is not much documentation on mpi4py, but the implementation seems to be mature. You can get and try out some example programs at .

There are two kinds of functions in mpi4py:

1. MPI interface with arrays (memory buffers), which closely resembles the original MPI interface. Function names of this interface start with a capital letter, e.g. Send().

2. MPI interface with general Python objects, which (de)serializes Python objects and exchanges them as MPI messages. Function names of this interface start with a lowercase letter, e.g. send().

Each interface has its advantages and drawbacks, which you will discover while working on the assignments. The first interface requires special care with the memory ordering and alignment of arrays, while the second uses extra CPU time and memory to serialize and deserialize Python objects. The second interface also limits you to blocking communication.

For more information read Design and Interface Overview.


Examples of both interfaces

As examples, let's look at the Send() and Scatter() functions. You can get the documentation for mpi4py methods with `pydoc mpi4py.MPI`. The C MPI interface is documented in the MPI 2.2 report (docs/mpi22-report/mpi22-report.htm).

mpi4py methods:

    comm.Send(buf, int dest=0, int tag=0)
    comm.send(obj=None, int dest=0, int tag=0)
    comm.Scatter(sendbuf, recvbuf, int root=0)
    comm.scatter(sendobj=None, recvobj=None, int root=0)

C MPI interface:

    int MPI_Send(void* buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)

    int MPI_Scatter(void* sendbuf, int sendcount, MPI_Datatype sendtype,
                    void* recvbuf, int recvcount, MPI_Datatype recvtype,
                    int root, MPI_Comm comm)

The basic differences are:

1. mpi4py uses a communicator object comm, and communication is done by calling its methods

2. the buf argument of comm.Send is a 2- or 3-element tuple that corresponds exactly to the C MPI interface arguments buf, count, and datatype:

(a) buf=(array, count, datatype), where array is usually a NumPy array, or

(b) buf=(array, datatype), in which case count is taken from the array size

3. the same applies to comm.Scatter, which packs the 6 buffer-related arguments of MPI_Scatter into 2: sendbuf and recvbuf

4. for the methods that work with general Python objects, the crucial difference is that sendobj in comm.scatter is not a NumPy array of arbitrary size, but an array (or list, or tuple) of length N, where N is the number of processes! One object is sent to each process.

You will get used to these differences as you work through the examples and assignments.
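To make the calling conventions concrete, below is a minimal sketch of both interfaces. It is only an illustration under my own assumptions (the file name, tags and array sizes are arbitrary choices), and it needs at least 2 processes, e.g. `mpirun -np 2 python interfaces_demo.py`.

# interfaces_demo.py -- illustrative sketch of both mpi4py interfaces
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Array (buffer) interface: capitalized methods, buf given as (array, datatype).
if rank == 0:
    data = np.arange(10, dtype='d')
    comm.Send((data, MPI.DOUBLE), dest=1, tag=7)
elif rank == 1:
    data = np.empty(10, dtype='d')
    comm.Recv((data, MPI.DOUBLE), source=0, tag=7)

# Object interface: lowercase methods, any picklable Python object.
if rank == 0:
    comm.send({'msg': 'hello', 'n': 42}, dest=1, tag=8)
elif rank == 1:
    obj = comm.recv(source=0, tag=8)

# scatter(): the object passed on the root must contain exactly `size` items,
# one per process; every process gets back a single item.
chunks = [i * np.ones(3) for i in range(size)] if rank == 0 else None
local = comm.scatter(chunks, root=0)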

Parallel dense matrix-matrix multiplication using ring communication [1]

The explanation of the algorithm will be given in the lab session and added here later.

1 Assignment: parallel AB with Python objects

1.1 Introduction

Here I give a brief overview of object serialization in Python and of MPI blocking communication. You do not need this material in your solutions; it is only intended to give a better understanding of the problems that may arise.

[1] Taken from the lab of the year 2007 course, see Lab08 in 2007 assignment 3 (in Estonian)


1.1.1 Object serialization in Python

MPI methods that accept Python objects use serialization under the hood. Let's see what it looks like using a simple array as an example:

>>> import numpy as np
>>> A = np.arange(6)
>>> A
array([0, 1, 2, 3, 4, 5])

Now we serialize the array, which is done with the Python pickle module:

>>> import pickle
>>> ser_A1 = pickle.dumps(A[0:3])
>>> ser_A2 = pickle.dumps(A[3:])
>>> len(ser_A1)
283

The strings in ser_A1 and ser_A2 now contain the object information and the corresponding data from array A. The strings are then sent over the network using standard MPI functions and deserialized on the remote nodes:

>>> A1 = pickle.loads(ser_A1)
>>> A2 = pickle.loads(ser_A2)
>>> A1
array([0, 1, 2])
>>> A2
array([3, 4, 5])

1.1.2 MPI blocking communication

Even if you have read about blocking MPI communication (I bet you have :), you may still not fully understand what it implies. The send() and Send() functions block until it is safe to reuse the send buffer, i.e. until all the data has been copied somewhere else. The data may still be unsent, for example written to network packet buffers in the operating system or to MPI library internal buffers. So blocking only happens if the data is larger than those buffers. This can result in a program that works with small data but deadlocks with larger data.

The semantics of the blocking recv() and Recv() are similar: they block until it is safe to use the receive buffer, i.e. until all data has been received and written. Notice that deserialization can only happen once the data has been fully received; at least, non-blocking behaviour is not trivial to implement in Python. That is why there are no isend() and irecv() methods available.
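To illustrate, here is a small sketch of two orderings of send() and recv() between a pair of processes. It is my own example (run with exactly 2 processes), and the size at which send() would actually block depends on the MPI implementation and its buffers.

# blocking_demo.py -- sketch of send()/recv() ordering between 2 processes
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
other = 1 - rank              # assumes exactly 2 processes

A = np.arange(4)              # small message, likely fits into internal buffers

# Naive exchange: both processes send first, then receive.  This only works
# while the serialized object fits into the MPI/OS buffers; with a large
# enough array both send() calls block and the program deadlocks.
comm.send(A, dest=other, tag=0)
B = comm.recv(source=other, tag=0)

# A deadlock-free ordering: one side sends first, the other receives first.
if rank == 0:
    comm.send(A, dest=other, tag=1)
    B = comm.recv(source=other, tag=1)
else:
    B = comm.recv(source=other, tag=1)
    comm.send(A, dest=other, tag=1)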

1.2 Task description

Implement parallel dense matrix-matrix multiplication using the blocking send() and recv() methods with NumPy arrays sent as Python objects. This means that you have to use the MPI methods that start with a lowercase letter.

Test: approximately how large must the arrays be for the send() function to block?
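One possible way to probe this size is sketched below; it is my own crude test (2 processes, arbitrary sizes), not a prescribed part of the assignment. Both ranks send first, so the first message that no longer fits into the internal buffers makes the program hang.

# probe_blocking.py -- crude probe for the array size at which send() blocks
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
other = 1 - rank                            # run with exactly 2 processes

for n in [2 ** k for k in range(8, 24)]:    # 256 ... ~8 million elements
    if rank == 0:
        print('trying n = %d' % n)
    data = np.zeros(n)
    comm.send(data, dest=other, tag=0)      # both ranks send first ...
    comm.recv(source=other, tag=0)          # ... then receive
    comm.Barrier()
# the last printed n is roughly the size at which send() started to block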


1.3 Sketch of the algorithm

1. generate matrices A and B on process 0

2. use scatter() twice to distribute the rows of A and the columns of B among the processes into Alocal and Blocal

- see numpy.vsplit() and numpy.hsplit()

3. create matrix Clocal of required size

4. do N iterations, where N is the number of processes

(a) multiply Alocal by Blocal and save it to the corresponding part of Clocal

(b) if not the last iteration

i. send the Blocal object to the neighbor in the communication ring

ii. receive the Blocal object from the neighbor in the communication ring

5. use gather() to collect Clocal objects

6. combine Clocal objects into matrix C

- see numpy.vstack()

7. optionally, test that the algorithm is correct by comparing the result with C = AB computed locally (a code sketch of these steps is given below)
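Below is a minimal sketch of the steps above using the object (lowercase) interface. It reflects my own reading of the algorithm: the matrix size, the ring direction and the assumption that the matrix dimension is divisible by the number of processes are illustrative choices, not requirements.

# ring_multiply_objects.py -- sketch of steps 1-7 with send()/recv()/scatter()/gather()
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
n = 8                                    # illustrative size, divisible by `size`

# 1. generate A and B on process 0
if rank == 0:
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)
    Arows = np.vsplit(A, size)           # row blocks of A
    Bcols = np.hsplit(B, size)           # column blocks of B
else:
    Arows = Bcols = None

# 2. distribute the blocks
Alocal = comm.scatter(Arows, root=0)     # shape (n/size, n)
Blocal = comm.scatter(Bcols, root=0)     # shape (n, n/size)

# 3. the rows of C owned by this process
blk = n // size
Clocal = np.empty((blk, n))

# 4. ring iterations: after i shifts this process holds the column block
#    that originally belonged to rank (rank + i) % size
for i in range(size):
    j = (rank + i) % size
    Clocal[:, j * blk:(j + 1) * blk] = np.dot(Alocal, Blocal)
    if i != size - 1:
        # send-then-receive relies on buffering, see section 1.1.2
        comm.send(Blocal, dest=(rank - 1) % size, tag=i)
        Blocal = comm.recv(source=(rank + 1) % size, tag=i)

# 5.-6. collect and combine the row blocks of C
Cparts = comm.gather(Clocal, root=0)
if rank == 0:
    C = np.vstack(Cparts)
    # 7. optional check against the serial product
    print('max error: %g' % abs(C - np.dot(A, B)).max())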

2 Assignment: parallel AB with arrays

There are several caveats with memory buffers in NumPy arrays. MPI functions expect the elements of an array in row-major order (C contiguous), but NumPy arrays usually satisfy this condition only on creation. The flags attribute of a NumPy array contains the required information:

>>> A = np.arange(6)
>>> A.flags

  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

Because the array is 1D, it is both C and Fortran contiguous. The actual data buffer is owned by this object. Let's now reshape it into a 2D array:

>>> B = A.reshape(3,2)
>>> B
array([[0, 1],
       [2, 3],
       [4, 5]])


>>> B.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

As we see, B does not own the array data but uses the data of array A. A reference to the owner is stored in the B.base attribute, which is None if the object owns its data. Let's now transpose the array:

>>> C = B.transpose()
>>> C
array([[0, 2, 4],
       [1, 3, 5]])
>>> C.flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

Now the data in the array is Fortran contiguous, i.e. in column-major order. As a last example, let's take a subarray using slices:

>>> D = B[:,0]
>>> D
array([0, 2, 4])
>>> D.flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

It is not contiguous at all, and the data buffer still belongs to array A. If you pass this array to Send, Recv or another MPI function that uses memory buffers, you end up with a wrong result. This is because the buffer of A will actually be sent or filled with values, while D merely views part of it, and mpi4py is not smart enough to recognize this.

You can use the numpy.ascontiguousarray() function, which copies the values into an array with a new buffer if the existing array is not C contiguous.
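For example, continuing the session above (the check itself is my own illustration, not a required step):

>>> D = B[:,0]
>>> D.flags['C_CONTIGUOUS']
False
>>> D = np.ascontiguousarray(D)   # copies into a new, C-ordered buffer
>>> D.flags['C_CONTIGUOUS']
True

After this copy the array owns a contiguous buffer and can safely be passed to Send(), Recv() and the other buffer-based MPI methods.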

2.1 Task description

Implement parallel dense matrix-matrix multiplication using non-blocking Irecv() and blocking Send() methods with NumPy arrays.

Run tests on the kuu cluster. How large is the speedup from using the parallel implementation of matrix-matrix multiplication?


2.2 Sketch of algorithm changes

The only change in the algorithm I expect is in step 4b (see the sketch below):

4. do N iterations, where N is the number of processes

(b) if not the last iteration

i. initiate the receive of the Blocal array from the neighbor in the communication ring using Irecv()

ii. send the Blocal array to the neighbor in the communication ring

iii. wait on the request object returned by Irecv()
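A sketch of this exchange step with the buffer interface is given below; the block shape, tags and ring direction are my own illustrative assumptions, not requirements.

# ring_step_arrays.py -- sketch of the modified step 4b with Irecv()/Send()
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

left = (rank - 1) % size                 # destination in the ring
right = (rank + 1) % size                # source in the ring

# Blocal stands for the current column block; it must be C contiguous.
Blocal = np.ascontiguousarray(np.random.rand(4, 4))

# i.   post the non-blocking receive into a separate buffer
Bnext = np.empty_like(Blocal)
req = comm.Irecv(Bnext, source=right, tag=0)
# ii.  blocking send of the current block
comm.Send(Blocal, dest=left, tag=0)
# iii. wait until the receive has completed, then adopt the new block
req.Wait()
Blocal = Bnext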

