Similar to the previous exercise, transform the program fragment in Fig. 8.6,
which uses a total cyclic data distribution, into a complete MPI program. Compare
the resulting execution times for different matrix sizes and different numbers of
processors. In which scenarios does a significant difference occur? Try to explain
the observed behavior.
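
As a starting point, the index arithmetic behind a total cyclic distribution can be sketched in plain C. This is a minimal sketch and not the code of Fig. 8.6. It assumes that the processors form a p1 x p2 grid, that matrix element (i, j) with 0-based indices is stored by the processor at grid position (i mod p1, j mod p2), and that ranks are numbered row by row. The names grid_pos, owner_of, rank_of and local_count are hypothetical helpers introduced only for illustration.

```c
#include <assert.h>

/* Hypothetical helper type: position of a processor in the
   assumed p1 x p2 processor grid. */
typedef struct {
    int row;
    int col;
} grid_pos;

/* Grid position of the processor that stores element (i, j),
   assuming a cyclic assignment with block size 1 in both dimensions. */
grid_pos owner_of(int i, int j, int p1, int p2) {
    assert(i >= 0 && j >= 0 && p1 > 0 && p2 > 0);
    grid_pos g = { i % p1, j % p2 };
    return g;
}

/* Rank of a grid position, assuming row-major rank numbering. */
int rank_of(grid_pos g, int p2) {
    return g.row * p2 + g.col;
}

/* Number of indices in [0, n) that are congruent to q modulo p,
   i.e. how many rows (or columns) a processor at grid coordinate q
   stores locally. Requires 0 <= q < p. */
int local_count(int n, int q, int p) {
    assert(n >= 0 && p > 0 && q >= 0 && q < p);
    return (n - q + p - 1) / p;
}
```

In an MPI version, each process could compare rank_of(owner_of(i, j, p1, p2), p2) with its own rank to decide whether it holds a given element, and use local_count to size its local arrays. For the timing comparison, MPI_Wtime can be called before and after the computation on each process.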