Beyond Serial Processing
If your code is taking too long to run you may like to think about
making parts of it run in parallel with each other. This works where
parts of the code are independent of each other: for example, where
one loop iteration does not have to wait for the previous one to
complete before it can start.
The method used for 'parallelising' code on the University of Nottingham's
HPC cluster uses the Message Passing Interface, more commonly known
as MPI. To make your code run in parallel you make CALLs to the
MPI interface, just as you would to a subroutine or an external
numerical library. Some examples are in order (the Fortran source
code is listed here but equivalent codes in C are also available):
Example 1: Hello World from Processor N
Say you want 'Hello World' to be printed out from 4 processors, with each one telling you which processor is saying hello:
1 include 'mpif.h'
2 integer myproc, numprocs, ierr
c Initiate MPI
3 call mpi_init(ierr)
4 call mpi_comm_size(MPI_COMM_WORLD, numprocs, ierr)
5 call mpi_comm_rank(MPI_COMM_WORLD, myproc, ierr)
6 write(6,*)'Processor number ',myproc,' says Hello World!'
7 call mpi_finalize(ierr)
8 stop
9 end
A few things to note:
- numprocs is the number of processors the job runs on, e.g. 4; it is not fixed at compile time but is set when you launch the job, and is returned by the call to mpi_comm_size on line 4
- you have to specify include 'mpif.h' at the top of your code so as to enable MPI CALLs
- MPI is initialised with call mpi_init on line 3; lines 4 and 5 then ask how many processors there are and which one (the 'rank') is executing this copy of the code
- close your MPI calls with call mpi_finalize on line 7
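Exactly how you compile and launch the job depends on the MPI installation on the cluster, but typically something like mpif77 hello.f -o hello to compile, followed by mpirun -np 4 ./hello to run, will start 4 copies of the program (the file and executable names here are just examples). Because the processors run independently, the four 'Hello World' lines can appear in any order.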
Example 2: Calculating Pi
To calculate pi we evaluate the integral of 4/(1+x**2) between 0 and 1, which on a computer means doing a sum of rectangles. The clever thing is that you can calculate the area of each of those rectangles independently of the others; so if you have 20 rectangles you could send the job to 20 processors, or to 10 processors with each processor calculating 2 areas, and so on.
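To spell the arithmetic out: with n rectangles of width h = 1/n, the midpoint of the i-th rectangle is x = h*(i - 0.5) and its area is h * 4/(1 + x**2); adding the n areas together approximates pi. In the code below the do loop hands processor myid the rectangles i = myid+1, myid+1+numprocs, myid+1+2*numprocs, and so on, so the rectangles are dealt out evenly and the partial sums are combined into the final answer with mpi_reduce.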
1 c**********************************************************************
2 c pi.f - compute pi by integrating f(x) = 4/(1 + x**2)
3 c
4 c Each node:
5 c 1) receives the number of rectangles used in the approximation.
6 c 2) calculates the areas of its rectangles.
7 c 3) Synchronizes for a global summation.
8 c Node 0 prints the result.
9 c
10 c Variables:
11 c
12 c pi the calculated result
13 c n number of points of integration.
14 c x midpoint of each rectangle's interval
15 c f function to integrate
16 c sum,pi area of rectangles
17 c tmp temporary scratch space for global summation
18 c i do loop index
19 c****************************************************************************
20 program main
21 include 'mpif.h'
22 double precision PI25DT
23 parameter (PI25DT = 3.141592653589793238462643d0)
24 double precision mypi, pi, h, sum, x, f, a
25 integer n, myid, numprocs, i, rc
26 c function to integrate
27 f(a) = 4.d0 / (1.d0 + a*a)
28 call MPI_INIT( ierr )
29 call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
30 call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
31 print *, 'Process ', myid, ' of ', numprocs, ' is alive'
32 sizetype = 1
33 sumtype = 2
34
35 if ( myid .eq. 0 ) then
36 write(6,37)
37 format('Enter the number of intervals: (0 quits)')
38 read(5,39) n
39 format(i10)
40 endif
41
42 call MPI_BCAST(n,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr)
43 c check for quit signal
44 if ( n .le. 0 ) goto 63
45 c calculate the interval size
46 h = 1.0d0/n
47 sum = 0.0d0
48 do 51 i = myid+1, n, numprocs
49 x = h * (dble(i) - 0.5d0)
50 sum = sum + f(x)
51 continue
52 mypi = h * sum
53 c collect all the partial sums
54 call MPI_REDUCE(mypi,pi,1,MPI_DOUBLE_PRECISION,MPI_SUM,0,
55 $ MPI_COMM_WORLD,ierr)
56 c node 0 prints the answer.
57 if (myid .eq. 0) then
58 write(6, 59) pi, abs(pi - PI25DT)
59 format(' pi is approximately: ', F18.16,
60 + ' Error is: ', F18.16)
61 endif
62 goto 35
63 call MPI_FINALIZE(rc)
64 stop
65 end
Again we have our initialisation calls at the top and we include mpif.h. The new things here are:
- we copy a piece of data to all the processors in one go with call mpi_bcast, i.e. we broadcast it to the pool of processors
- we perform a parallel sum in one go with call mpi_reduce
- mpi_send and mpi_recv: you can send and receive data to and from specific processors within your processor pool (see the sketch after this list)
- mpi_barrier: this pauses the code at the point where it is called; once all processors get to the barrier, execution continues
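Neither mpi_send/mpi_recv nor mpi_barrier appears in the two examples above, so here is a minimal sketch (not taken from the course examples) of how they might be used. It assumes the job runs on at least two processors: processor 1 sends a single integer to processor 0, and every processor then waits at a barrier before MPI is shut down:
      program sendrecv
      include 'mpif.h'
      integer myproc, numprocs, ierr, msg
      integer status(MPI_STATUS_SIZE)
c Initiate MPI as before
      call mpi_init(ierr)
      call mpi_comm_size(MPI_COMM_WORLD, numprocs, ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, myproc, ierr)
      if (myproc .eq. 1) then
c send one integer to processor 0, using message tag 99
         msg = 42
         call mpi_send(msg, 1, MPI_INTEGER, 0, 99,
     $                 MPI_COMM_WORLD, ierr)
      else if (myproc .eq. 0) then
c receive one integer from processor 1 with the same tag
         call mpi_recv(msg, 1, MPI_INTEGER, 1, 99,
     $                 MPI_COMM_WORLD, status, ierr)
         write(6,*)'Processor 0 received ',msg
      endif
c no processor continues past this line until all of them reach it
      call mpi_barrier(MPI_COMM_WORLD, ierr)
      call mpi_finalize(ierr)
      stop
      end
The tag (99 here) is just an arbitrary label chosen for this sketch; the send and the matching receive must use the same tag, and mpi_recv waits until a message with that tag arrives from the named processor.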