1) Convert the following pseudo code to C/C++ (vector triad benchmark from chapter 1 of the book)...

Question

1) Convert the following pseudo code to C/C++ (vector triad benchmark from chapter 1 of the book) Run this code on the ACI-ICS cluster. Note* this should be run on a single core, you should run it on both the aci-i and aci-b nodes.

double precision, dimension(N) :: A,B,C,D

double precision :: S,E,MFLOPS

do i=1,N // fill arrays with data

A(i) = 0.d0;B(i) = 1.d0

C(i) = 2.d0; D(i) = 3.d0

enddo

call get_walltime(S) // get start time

do j=1,R

do i=1,N

A(i) = B(i) + C(i) * D(i) // 3 loads, 1 store

enddo

if(A(2).lt.0) call dummy(A,B,C,D) // prevent loop interchange

enddo

call get_walltime(E) // get end time stamp

MFLOPS = R*N*2.d0/((E-S)*1.d6) // calculate MFLOPS

Use the following routine for get_wall_time()

#include

void get_walltime(double* wcTime) {

struct timeval tp;

gettimeofday(&tp, NULL);

*wcTime = (double)(tp.tv_sec + tp.tv_usec/1000000.0);

}

2) Explore appropriate values for R and N. Observe the results. Fix R to some value and generate a plot for N that best summarizes your observations. (MFPLOS vs. N).

*** R does not have to be fixed!!!! **** MFLOPS calculation takes care of this!

3) Your homework submission should be a zip file containing two files: a pdf write-up and a zip of your code. The short write-up should be 1-2 pages(including a graph) discussing the findings of your benchmarks. Your discussion should tie the benchmark results with the hardware under test. Your source code should include a Makefile.

Expectations:

Your code should use dynamic allocation for the arrays under test. No credit will be given if the code does not compile. Points will be lost if the code does not accurately represent the equivalent pseudo code above (actually fortran). This is a high performance computing course, the range of values for R and N should reflect that (Big Data!). None of the tests have to run for more than a few minutes.

Your write-up will be evaluated for quality of discussion and the accuracy of the graph.

Meenakshi · Accepted Answer

Low level benchmark: Vector Triad
In this assignment we are discussing about the Memory Hierarchies and their performances. First we  understand the memory concept and  check the memory performance  .  We study and analysis the concept about memory hierarchies. For memory performance we  work on algorithm and  convert  this algorithm into C language .
For this lab we have two header file  and one c code   
That header file having a function that is for walltime or delay of process 
void get_walltime(double* wcTime) {
struct timeval tp;
gettimeofday(&tp, NULL);
*wcTime = (double)(tp.tv_sec + tp.tv_usec/1000000.0);
}
In this lab we develop a program  another header file that is vector.h where we define the vector method that we will use for our array we use the array value in vector 
For this program we define four vector array double A,B,C,D and precision :: S,E,MFLOPS with the help of loop we fill the  vector using loop that run N times
Where A(i) = 0.d0;
B(i) = 1
C(i) = 2. 
D(i) = 3
call get_walltime(S) // get start time
we set  R for loop and calculate the  r and n times in nested loop
A(i) = B(i) + C(i) * D(i) // 3 loads, 1 store
We call the  dummy function that is for prevent loop interchange
call get_walltime(E) 
MFLOPS = R*N*2.d0/((E-S)*1.d6) //  
In this  assignment  we change N  memory hierarchy analysis
We change the R and N  for checking the memory performance and analysis the graph
We  run the code  and change the N and R  them evaluate the memory performance
	jam
do i=1,N,2
do j=1,N
c(i)=c(i)+a(i,j)*b(j)
enddo
do j=1,N
c(i+1)=c(i+1)+a(i+1,j)*b(j) enddo
enddo
do i=1,N,2
do j=1,N
c(i)=c(i)+a(i,j)
* b(j)
c(i+1)=c(i+1)+a(i+1,j)* b(j)
enddo
enddo
b(j) can be re-used once
from register → save 1 LD operation
Lowers Bc from 1 to ¾ W/F
May 26, 2011
PTfS 2011                          28
Data access – general guidelines
O(N2)/O(N2) algorithms cont’d
Data access pattern for 2-way unrolled dense MVM:
Vector b[] now only loaded
N/2 times!
=   +  •
Remainder loop handled
 separately
Data transfers can further be reduced by more aggressive unrolling (i.e., m-way instead of 2-way)
Significant code bloat (try to use compiler directives if possible)
Main memory limit: b[] only be loaded once from memory (Bc ≈ ½ W/F) (can be achieved by high unrolling OR large outer level caches)
Outer loop unrolling can also be beneficial to reduce traffic within caches!
Beware: CPU registers are a limited resource
Excessive unrolling can cause register spills to memory
May 26, 2011
PTfS 2011                          29
Data access – general guidelines
O(N2)/O(N2) algorithms cont’d: Spatial Blocking
Split up inner loop in small chunks (pref. multiple cache lines)
Add outer loop pointing to the start of the innermost loop
bs: blocking size
bs=m*CLS, nb=N/bs
do k=1,nb
do i=1,N
do j=(k-1)*bs+1,k*bs y(i)=y(i)+a(j,i)*x(j)
end do
end do
end do
What is the problem here?
Each chunk of x is loaded only once to inner cache level and (re-)used N times.
CLS: Cache Line Size
Blocking size (bs) should be CLS or multiple of it. Upper limit is imposed by inner cache size
May 26, 2011
PTfS 2011                          30
Data access – general guidelines
O(N2)/O(N2) algorithms cont’d: Spatial Blocking
Spatially blocked MVM Matrix-Vector Multiply
y
=
A
x
bs:
blocking size
Save loads (x) from memory at the cost of additional store operations (y):
Vector x is loaded only once instead of N times
Vector y is loaded nb times instead of only once
May 26, 2011
PTfS 2011                          31
Data access – general guidelines
Case 3: O(N3)/O(N2) algorithms
Most favorable case – computation outweighs data traffic by factor of N
Examples: Dense matrix diagonalization, dense matrix-matrix multiplication
Huge optimization potential: proper optimization can render the problem cache-bound if N is large enough
Example: dense matrix-matrix multiplication
	do i=1,N
	
	Core task: dense MVM
	
	
	
	
	
	do j=1,N
	
	
	
	
	
	(O(N2)/O(N2))
	
	do k=1,N
	
	
	
	
	
	→ memory bound
	
	c(j,i)=c(j,i)+a(k,i)*b(k,j)
	
	
	
	
	
	
	
	enddo
	
	
	
	enddo
	
	
	
	
	
	
	
	enddo
	
	
	
	
	
	
3 N2 data elements involved BUT main memory traffic ~ N3 for large N Reason: b(:,:) needs to be reloaded N times!
Implement spatial blocking!
May 26, 2011
PTfS 2011                          32
Data access – case study: Dense matrix vector multiplication
Stride one access is implemented by counting inner loops over first index (FORTRAN) or last index (C/C++) either by
choosing the right data layout or
arranging nested loops in the right order (loop interchange):
	do i=1,N
	do i=1,N
	do j=1,N
	do j=1,N
	c(i)=c(i)+A(i,j)*b(j)
	c(i)=c(i)+A(j,i)*b(j)
	enddo
	enddo
	enddo
	enddo
	Stride N access to A
	Stride 1 access to A
	„non-consecutive“
	„transpose“
This can be done automatically by some compilers!
May 26, 2011
PTfS 2011                          33
Data access – case study: Dense MVM
Consider data transfer to registers for “transpose” version
(consider double precision)
	do i=1,

1) Convert the following pseudo code to C/C++ (vector triad benchmark from chapter 1 of the book) Run this code on the ACI-ICS cluster. Note* this should be run on a single core, you should run it on...

Answer To: 1) Convert the following pseudo code to C/C++ (vector triad benchmark from chapter 1 of the book)...

Answer To This Question Is Available To Download

Related Questions & Answers

Submit New Assignment