Beyond Serial Processing
If your code is taking too long to run you may like to think about
making parts of it run in parallel with each other. This works where
parts of the code are independent of each other: for example, where
one loop iteration does not have to wait for the previous one to
complete before it can start.
The method used for 'parallelising' code on the University of Nottingham's
HPC cluster uses the Message Passing Interface, more commonly known
as MPI. To make your code run in parallel you make CALLs to the
MPI interface, just as you would to a subroutine or an external
numerical library. Some examples are in order (the Fortran source
code is listed here but equivalent codes in C are also available):
Example 1: Hello World from Processor N
Say you want 'Hello World' to be printed out from 4 processors, with each one telling you which processor is saying hello:
1 include 'mpif.h'
2 integer myproc, numprocs, ierr
c Initiate MPI
3 call mpi_init(ierr)
4 call mpi_comm_size(MPI_COMM_WORLD, numprocs, ierr)
5 call mpi_comm_rank(MPI_COMM_WORLD, myproc, ierr)
6 write(6,*)'Processor number ',myproc,' says Hello World!'
7 call mpi_finalize(ierr)
8 stop
9 end
A few things to note:
- numprocs is the number of processors the job runs on, e.g. 4; it is not fixed at compile time but is set when you launch the job, and is returned by the call to mpi_comm_size on line 4
- you have to specify include 'mpif.h' at the top of your code so as to enable MPI CALLs
- MPI is initialised with call mpi_init on line 3; lines 4 and 5 then ask how many processors there are and which one (the 'rank') is executing this copy of the code
- close your MPI calls with call mpi_finalize on line 7
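Exactly how you compile and launch the job depends on the MPI installation on the cluster, but typically something like mpif77 hello.f -o hello to compile, followed by mpirun -np 4 ./hello to run, will start 4 copies of the program (the file and executable names here are just examples). Because the processors run independently, the four 'Hello World' lines can appear in any order.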
Example 2: Calculating Pi
To calculate pi we evaluate the integral of 4/(1+x**2) between 0 and 1, which on a computer means doing a sum of rectangles. The clever thing is that you can calculate the area of each of those rectangles independently of the others; so if you have 20 rectangles you could send the job to 20 processors, or to 10 processors with each processor calculating 2 areas, and so on.
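To spell the arithmetic out: with n rectangles of width h = 1/n, the midpoint of the i-th rectangle is x = h*(i - 0.5) and its area is h * 4/(1 + x**2); adding the n areas together approximates pi. In the code below the do loop hands processor myid the rectangles i = myid+1, myid+1+numprocs, myid+1+2*numprocs, and so on, so the rectangles are dealt out evenly and the partial sums are combined into the final answer with mpi_reduce.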
1 c**********************************************************************
2 c pi.f - compute pi by integrating f(x) = 4/(1 + x**2)
3 c
4 c Each node:
5 c 1) receives the number of rectangles used in the approximation.
6 c 2) calculates the areas of its rectangles.
7 c 3) Synchronizes for a global summation.
8 c Node 0 prints the result.
9 c
10 c Variables:
11 c
12 c pi the calculated result
13 c n number of points of integration.
14 c x midpoint of each rectangle's interval
15 c f function to integrate
16 c sum,pi area of rectangles
17 c tmp temporary scratch space for global summation
18 c i do loop index
19 c****************************************************************************
20 program main
21 include 'mpif.h'
22 double precision PI25DT
23 parameter (PI25DT = 3.141592653589793238462643d0)
24 double precision mypi, pi, h, sum, x, f, a
25 integer n, myid, numprocs, i, rc
26 c function to integrate
27 f(a) = 4.d0 / (1.d0 + a*a)
28 call MPI_INIT( ierr )
29 call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
30 call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
31 print *, 'Process ', myid, ' of ', numprocs, ' is alive'
32 sizetype = 1
33 sumtype = 2
34
35 if ( myid .eq. 0 ) then
36 write(6,37)
37 format('Enter the number of intervals: (0 quits)')
38 read(5,39) n
39 format(i10)
40 endif
41
42 call MPI_BCAST(n,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr)
43 c check for quit signal
44 if ( n .le. 0 ) goto 63
45 c calculate the interval size
46 h = 1.0d0/n
47 sum = 0.0d0
48 do 51 i = myid+1, n, numprocs
49 x = h * (dble(i) - 0.5d0)
50 sum = sum + f(x)
51 continue
52 mypi = h * sum
53 c collect all the partial sums
54 call MPI_REDUCE(mypi,pi,1,MPI_DOUBLE_PRECISION,MPI_SUM,0,
55 $ MPI_COMM_WORLD,ierr)
56 c node 0 prints the answer.
57 if (myid .eq. 0) then
58 write(6, 59) pi, abs(pi - PI25DT)
59 format(' pi is approximately: ', F18.16,
60 + ' Error is: ', F18.16)
61 endif
62 goto 35
63 call MPI_FINALIZE(rc)
64 stop
65 end
Again we have our initialisation calls at the top and we include mpif.h. The new things here are:
- we copy a piece of data to all the processors in one go with call mpi_bcast, i.e. we broadcast it to the pool of processors
- we perform a parallel sum in one go with call mpi_reduce
- mpi_send and mpi_recv: you can send and receive data to and from specific processors within your processor pool (see the sketch after this list)
- mpi_barrier: this pauses the code at the point where it is called; once all processors get to the barrier, execution continues
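Neither mpi_send/mpi_recv nor mpi_barrier appears in the two examples above, so here is a minimal sketch (not taken from the course examples) of how they might be used. It assumes the job runs on at least two processors: processor 1 sends a single integer to processor 0, and every processor then waits at a barrier before MPI is shut down:
      program sendrecv
      include 'mpif.h'
      integer myproc, numprocs, ierr, msg
      integer status(MPI_STATUS_SIZE)
c Initiate MPI as before
      call mpi_init(ierr)
      call mpi_comm_size(MPI_COMM_WORLD, numprocs, ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, myproc, ierr)
      if (myproc .eq. 1) then
c send one integer to processor 0, using message tag 99
         msg = 42
         call mpi_send(msg, 1, MPI_INTEGER, 0, 99,
     $                 MPI_COMM_WORLD, ierr)
      else if (myproc .eq. 0) then
c receive one integer from processor 1 with the same tag
         call mpi_recv(msg, 1, MPI_INTEGER, 1, 99,
     $                 MPI_COMM_WORLD, status, ierr)
         write(6,*)'Processor 0 received ',msg
      endif
c no processor continues past this line until all of them reach it
      call mpi_barrier(MPI_COMM_WORLD, ierr)
      call mpi_finalize(ierr)
      stop
      end
The tag (99 here) is just an arbitrary label chosen for this sketch; the send and the matching receive must use the same tag, and mpi_recv waits until a message with that tag arrives from the named processor.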