Final project for the COMPUTE course: “Parallel programming of HPC systems”
Parallelising a 2-dimensional boundary problem
March 12, 2020

1 Exercise: 2D boundary problem

Please also consult the slides introducing this exercise in addition to the instructions given here.

1.1 Outline

1.1.1 Jacobi algorithm

The aim of the exercise is to write a parallel program in either C or Fortran using a Jacobi algorithm to determine an array u(x, y) which obeys

\[
(1 + \tfrac{1}{2}c_x)\,u(x+1,y) + (1 - \tfrac{1}{2}c_x)\,u(x-1,y)
+ (1 + \tfrac{1}{2}c_y)\,u(x,y+1) + (1 - \tfrac{1}{2}c_y)\,u(x,y-1)
- 4\,u(x,y) = 0 \tag{1}
\]

for a fixed boundary. The boundary is described in detail in the course handouts. The Jacobi algorithm starts with an initial guess for u(x, y); we suggest u(x, y) = 0 for all interior points. The problem is then solved by iterating

\[
u_{\mathrm{new}}(x,y) = \tfrac{1}{4}\Bigl[
(1 + \tfrac{1}{2}c_x)\,u_{\mathrm{old}}(x+1,y) + (1 - \tfrac{1}{2}c_x)\,u_{\mathrm{old}}(x-1,y)
+ (1 + \tfrac{1}{2}c_y)\,u_{\mathrm{old}}(x,y+1) + (1 - \tfrac{1}{2}c_y)\,u_{\mathrm{old}}(x,y-1)
\Bigr]. \tag{2}
\]

Before the next iteration, u_new(x, y) must be copied into u_old(x, y).

1.1.2 Residual

The norm of the residual r(x, y) provides a measure of the quality of the iterative solution:

\[
r(x,y) := (1 + \tfrac{1}{2}c_x)\,u_{\mathrm{new}}(x+1,y) + (1 - \tfrac{1}{2}c_x)\,u_{\mathrm{new}}(x-1,y)
+ (1 + \tfrac{1}{2}c_y)\,u_{\mathrm{new}}(x,y+1) + (1 - \tfrac{1}{2}c_y)\,u_{\mathrm{new}}(x,y-1)
- 4\,u_{\mathrm{new}}(x,y). \tag{3}
\]

For the norm of the residual we use the L2-norm ||r(x, y)||, which is calculated as

\[
\|r(x,y)\| = \sqrt{\sum_x \sum_y r(x,y)\,r(x,y)}. \tag{4}
\]

This could in principle be used as a stopping criterion; however, for this exercise we will use a fixed number of steps and print the value of the L2-norm at the end.

1.2 Serial code

You are provided with serial codes in Fortran 90 and C99 that implement most of the features. You can ignore them, use them as inspiration for your own code, or take them as a starting point for your parallelisation. We describe a couple of the key features of the serial codes provided for this exercise. To be able to parallelise the code easily with MPI it helps if you choose a suitable data layout. The following instructions are meant as a suggestion to lead you through the design process.

1.2.1 Basic parameters

The serial F90 code uses the variables nx and ny, and the serial C99 code uses the variables xdim and ydim, to declare the dimensions of the problem, e.g. when dynamically allocating the data structures required for the code. In Fortran it is easy to dynamically allocate multidimensional arrays: just use the keyword “allocatable”. In C matters are not so easy; C99 has no good built-in support for dynamically allocated multidimensional data structures. The provided sources use the construction shown in appendix A for all two-dimensional data.

1.2.2 Modularisation

It is crucial to write modularised code: use functions and subroutines. Spaghetti code will make your life harder. You will notice that in the end many routines will not need to know about the MPI parallelisation; you just hand them different input parameters.

1.2.3 Data structures

The code requires two basic data structures, ufield and ufield_new. Both of them are dynamically allocated.

1.2.4 Subroutines and functions

The code has methods (subroutines or functions) to implement:

• Reading the input file (various simulation parameters)
• Setting the boundaries and initialising the interior points of ufield
• Determining ufield_new from ufield
• Copying the interior points from ufield_new to ufield. Together with the previous item this constitutes the update step (see the C sketch after this list). Take care not to destroy the boundaries in ufield when copying.
• Calculating the norm of the residual
• Writing the result as an image. For this you need a dynamically allocated integer array of dimension nx by ny. Note that these dimensions differ from those of ufield.
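As an illustration of how the routines in the list above could look, the following sketch implements the Jacobi step of Eq. (2), the interior copy and the residual norm of Eqs. (3)-(4) in C99. The allocation helper and the function names (alloc_field, jacobi_step, copy_interior, residual_norm) are illustrative and not taken from the provided template; the index order follows the ufield[y][x] convention mentioned in section 1.3.4, and the exact appendix-A construction for two-dimensional data may well differ.

#include <math.h>
#include <stdlib.h>

/* Allocate a (ydim+2) x (xdim+2) array of doubles as one contiguous,
   zero-initialised block plus a vector of row pointers (one possible
   variant of the appendix-A idea, not necessarily the template's). */
static double **alloc_field(int xdim, int ydim)
{
    double **f = malloc((ydim + 2) * sizeof *f);
    f[0] = calloc((size_t)(ydim + 2) * (xdim + 2), sizeof **f);
    for (int y = 1; y < ydim + 2; y++)
        f[y] = f[0] + (size_t)y * (xdim + 2);
    return f;
}

/* One Jacobi sweep: compute ufield_new on the interior points from ufield, Eq. (2). */
static void jacobi_step(int xdim, int ydim, double cx, double cy,
                        double **ufield, double **ufield_new)
{
    for (int y = 1; y <= ydim; y++)
        for (int x = 1; x <= xdim; x++)
            ufield_new[y][x] = 0.25 *
                ((1.0 + 0.5 * cx) * ufield[y][x + 1] +
                 (1.0 - 0.5 * cx) * ufield[y][x - 1] +
                 (1.0 + 0.5 * cy) * ufield[y + 1][x] +
                 (1.0 - 0.5 * cy) * ufield[y - 1][x]);
}

/* Copy the interior of ufield_new back into ufield; the boundary rows and
   columns of ufield are left untouched. */
static void copy_interior(int xdim, int ydim, double **ufield, double **ufield_new)
{
    for (int y = 1; y <= ydim; y++)
        for (int x = 1; x <= xdim; x++)
            ufield[y][x] = ufield_new[y][x];
}

/* L2 norm of the residual r(x,y) of Eq. (3), summed over the interior points. */
static double residual_norm(int xdim, int ydim, double cx, double cy, double **u)
{
    double sum = 0.0;
    for (int y = 1; y <= ydim; y++)
        for (int x = 1; x <= xdim; x++) {
            double r = (1.0 + 0.5 * cx) * u[y][x + 1] +
                       (1.0 - 0.5 * cx) * u[y][x - 1] +
                       (1.0 + 0.5 * cy) * u[y + 1][x] +
                       (1.0 - 0.5 * cy) * u[y - 1][x] -
                       4.0 * u[y][x];
            sum += r * r;
        }
    return sqrt(sum);
}

Note that each routine receives the field dimensions and the fields themselves as arguments, in line with the requirement stated in the next paragraph.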
Important: All the methods should take nx and ny (respectively xdim and ydim) as dummy arguments. This is important since some of these routines need to work on differently sized arrays after the parallelisation. The fields ufield and ufield_new should also be passed as arguments. Avoid using global data or declaring data structures in a Fortran module.

1.2.5 Main program

Your main program does the following:

• Read the input file
• Declare the data structures and allocate the arrays
• Initialise the fields
• Loop over the update, that is
  – determination of ufield_new,
  – followed by the copying
• Terminate the loop after a fixed number of iterations
• Calculate the norm of the residual
• Print the residual
• Write the image

Make sure your program works before proceeding to the parallelisation. The provided codes have been tested with the foss/2019b and intel/2019b toolchains on Aurora. A problem size of nx=80 and ny=64 is a good starting point. The structure ufield will then range from 0 to 81 in height and 0 to 65 in width.

1.3 Parallelisation of the code

For a simple parallelisation continue to use a serial set-up and the serial writing of the image.

1.3.1 Set-up routine for the parallelisation

Find out how many processors are supposed to work on the problem. Set a variable nproc which stores the number of tasks working on the problem. Use non-periodic boundary conditions. Declare an up- and a down-neighbour in C, or a left- and a right-neighbour in Fortran.

1.3.2 Guarding

The reading of the input parameter file, the serial initialisation of the field ufield and the serial writing of the image should only be executed on task 0. The field ufield should only be allocated and initialised on rank 0. Introduce an if-statement around the relevant code to prevent it from being executed on the other tasks.

1.3.3 Broadcasting of parameters

The input file should only be read on rank 0. Use one or two broadcast operations to distribute the parameters to all MPI tasks. If you like to set up derived data types, a single broadcast is all that is needed. If you prefer to keep matters simple, use one broadcast to distribute the integer parameters as a single array and another broadcast for all double precision parameters. Do not use more than two broadcast operations.

1.3.4 Local data structures and parameters

In addition to the “global” field ufield, you also need “local” fields ufield_local and ufield_new_local. The global field is used during the serial set-up and the serial writing of the image; it holds all the data of the problem. The local fields hold the data the individual task needs to access, including the halo and boundary points. Once the parallelisation is done, the field ufield_new is no longer needed and can be removed from the program.

We suggest you declare nx_local and ny_local to control the dimensions of the local fields. The C template uses the convention ufield[y][x], and we recommend

nx_local = xdim;
ny_local = ydim/nproc;

The Fortran template uses the convention ufield(x,y), and we recommend

nx_local = nx
ny_local = ny/nproc

You might have noticed that this code is unlikely to work if the split dimension does not divide evenly by the task count nproc. Introduce a test for whether the problem divides evenly and call MPI_Abort if the test fails (i.e. there is a non-zero remainder); a sketch of the parameter broadcast and this check is given below.
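The following C sketch shows one way sections 1.3.3 and 1.3.4 could fit together: a single broadcast for the integer parameters, a second one for the double precision parameters, and the divisibility test followed by MPI_Abort. The parameter packing, the hard-coded example values and the variable names (iparams, dparams, n_iter) are illustrative assumptions; only the MPI calls and the overall pattern follow the exercise text.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nproc;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    /* Rank 0 would normally read these from the input file (guarded as in
       section 1.3.2); they are hard-coded here to keep the sketch short. */
    int xdim = 0, ydim = 0, n_iter = 0;
    double cx = 0.0, cy = 0.0;
    if (rank == 0) {
        xdim = 80; ydim = 64; n_iter = 1000;   /* illustrative values */
        cx = 0.1;  cy = 0.1;
    }

    /* One broadcast for all integer parameters, one for all double precision
       parameters: no more than the two broadcasts allowed by the exercise. */
    int    iparams[3] = { xdim, ydim, n_iter };
    double dparams[2] = { cx, cy };
    MPI_Bcast(iparams, 3, MPI_INT,    0, MPI_COMM_WORLD);
    MPI_Bcast(dparams, 2, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    xdim = iparams[0]; ydim = iparams[1]; n_iter = iparams[2];
    cx   = dparams[0]; cy   = dparams[1];

    /* The C layout splits the y-direction; abort if it does not divide evenly. */
    if (ydim % nproc != 0) {
        if (rank == 0)
            fprintf(stderr, "ydim = %d is not divisible by nproc = %d\n",
                    ydim, nproc);
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }
    int nx_local = xdim;
    int ny_local = ydim / nproc;
    if (rank == 0)
        printf("%d iterations on %d tasks, %d x %d interior points per task\n",
               n_iter, nproc, nx_local, ny_local);

    /* ... allocation of the local fields, distribution, update loop, etc. ... */

    MPI_Finalize();
    return 0;
}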
The local fields should now be dynamically allocated. In C the first index ranges from 0 to ny_local+1 and the second from 0 to nx_local+1. In Fortran the first index ranges from 0 to nx_local+1 and the second from 0 to ny_local+1. Notice that the extents are enlarged by one point at either end in both directions; this allows holding the halo and the boundary data.

Remark: When dealing with the local fields, references to nx and ny should be avoided, to allow future upgrades of the code, e.g. to a 2D decomposition. Always use nx_local and ny_local for the local fields.

1.3.5 Distributing the data

After initialisation of ufield, its data needs to be distributed to ufield_local. Write a subroutine or function performing the distribution onto the tasks.

In C the problem is split along the y-direction: each task gets ny_local rows of the problem. This can be accomplished by using MPI_Scatter, starting at row number 1, that is the second row of ufield. The rows need to be received into the second row of ufield_local on each task. Special treatment is needed for rank 0 and the last rank. Rank 0 requires the boundary values in row 0, and the last rank requires the boundary values in its row ny_local+1. The former is just a local copy on rank 0, while for the latter you explicitly need to send that row using point-to-point communication.

In Fortran the problem is likewise split along the y-direction: each task gets ny_local columns of the problem. This can be accomplished by using MPI_Scatter, starting at column number 1, that is the second column of ufield. The columns need to be received into the second column of ufield_local on each task. Notice that rank 0 requires the boundary values in column 0 and the last rank requires the boundary values in its column ny_local+1. The former is just a local copy on rank 0, while for the latter you explicitly need to send that column.

1.3.6 First modification of the update step

As a first step towards a parallel update step (the Jacobi step) you need to modify the input and output parameters of the two routines of the update step. These should now act on ufield_local and ufield_new_local. In addition, replacing nx and ny with nx_local and ny_local in a few places should be all that is required in this first step.

1.3.7 Gathering the data

After the update, prior to writing the results as an image, the data needs to be assembled back onto rank 0 for writing. You can use MPI_Gather, with ufield_local being the send buffer and ufield on rank 0 being the receive buffer. Make sure the data is inserted in the right place into ufield, that is in row/column 1. You do not need to deal with the boundaries as when distributing the data; they did not change.

1.3.8 First testing

You should now test your code. The way the exercise is set up, use 4 tasks for this first test. The code should now almost work, though you should see the processor boundaries in your images. Make sure to confirm that the run time gets shorter when using multiple processors. This might need a larger problem than 80 × 64.

1.4 Exchanging the halos

Once the first testing is successful, continue the exercise by implementing the halo exchange.

1.4.1 Halo exchange routine

In C, each task needs to send row 1 and row ny_local to the neighbouring tasks (your up- and down-neighbours). The received data needs to be stored in the halo rows, that is in row 0 and row ny_local+1 of ufield_local. A possible shape of such a routine is sketched below.
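A minimal sketch of such a halo exchange routine in C follows, assuming the up-neighbour is the task holding the rows with smaller global row index and that the set-up routine has stored MPI_PROC_NULL for the missing neighbours at the non-periodic ends, so those transfers degenerate to no-ops. The function name exchange_halos and its argument list are illustrative, not taken from the template; the point is the MPI_Sendrecv pattern.

#include <mpi.h>

/* Exchange the halo rows of the local field, indexed [0..ny_local+1][0..nx_local+1].
   "up" is the rank holding the rows above (smaller global row index),
   "down" the rank holding the rows below; at the non-periodic ends these
   are MPI_PROC_NULL and the corresponding send/receive does nothing. */
void exchange_halos(int nx_local, int ny_local, double **ufield_local,
                    int up, int down, MPI_Comm comm)
{
    /* Send the first interior row to the up-neighbour and, in the same call,
       receive the down-neighbour's first interior row into our lower halo row. */
    MPI_Sendrecv(&ufield_local[1][0],            nx_local + 2, MPI_DOUBLE, up,   0,
                 &ufield_local[ny_local + 1][0], nx_local + 2, MPI_DOUBLE, down, 0,
                 comm, MPI_STATUS_IGNORE);

    /* Send the last interior row to the down-neighbour and receive the
       up-neighbour's last interior row into our upper halo row. */
    MPI_Sendrecv(&ufield_local[ny_local][0],     nx_local + 2, MPI_DOUBLE, down, 1,
                 &ufield_local[0][0],            nx_local + 2, MPI_DOUBLE, up,   1,
                 comm, MPI_STATUS_IGNORE);
}

Using MPI_Sendrecv (or non-blocking MPI_Isend/MPI_Irecv) avoids the deadlock that a naive pair of blocking MPI_Send/MPI_Recv calls in the same order on every task could cause.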