
Performance via parallelism




Software Partitioning Method Trading off Load Balance and Communication Optimization
Kai Huang, Siwen Xiu, Min Yu, Xiaomeng Zhang, Rongjie Yan, Xiaolang Yan, and Zhili Liu

To achieve high performance on MPSoC via parallelism, a key issue is how to partition a given application into components and map them onto multiple processors. In this paper, we propose a software pipeline-based partitioning method with cyclic dependent task management and communication optimization. During task partitioning, considering computation load balance and communication overhead optimization at the same time can cause interference, which leads to performance loss. To address this issue, we formulate their constraints and apply the Integer Linear Programming (ILP) approach to find an optimal partitioning result that trades off the two factors. Experimental results on a reconfigurable MPSoC platform demonstrate the effectiveness of the proposed method, with 20% to 40% performance improvement over the traditional software pipeline partitioning method.

Keywords: software pipeline, partition, cyclic dependent task management, communication optimization

Manuscript received Apr. 22, 2014; revised Dec. 24, 2014; accepted Jan. 2, 2015. Kai Huang ([email protected]), Siwen Xiu (corresponding author, [email protected]), Min Yu ([email protected]), Xiaomeng Zhang ([email protected]), and Xiaolang Yan ([email protected]) are with the Department of Information Science & Electronic Engineering, Zhejiang University, Zhejiang, China. Rongjie Yan ([email protected]) is also with the Department of Information Science & Electronic Engineering, Zhejiang University, Zhejiang, China. Zhili Liu ([email protected]) is with the marketing team of Hangzhou C-Sky Microsystem Co., Ltd., Zhejiang, China.

I. Introduction

The increasing demand for high performance in embedded systems promotes the extensive use of the Multiprocessor System-on-Chip (MPSoC). Given an application, one key issue in generating efficient parallel code for a target MPSoC platform is how to partition it into components and map them onto different processors with the best performance. As a prevalent parallelization method, software pipelining is an effective solution to this problem. For software programs, pipelining introduces a higher degree of parallelism to increase program throughput. For hardware processors, the pipelined stages make it easy to partition and map the decomposed program onto different components to achieve better hardware utilization [1]. However, the increasing complexity of applications and hardware architectures challenges the efficiency of software pipelining. To exploit the parallelism of a software application on a hardware architecture, the software pipeline technique has to face the following two issues:

- How to keep workloads balanced while maintaining task dependencies. High parallelism calls for a balanced pipeline in which each stage has almost the same execution time as well as linear stage dependency. However, most existing applications involve complicated cyclic task dependencies that may constrain the distribution of tasks among processors, which makes it harder to keep workloads balanced among pipelined stages without destroying task dependencies.

- How to minimize communication overheads. With the increasing complexity of MPSoC, inter-stage communication is becoming a non-negligible factor for software pipelining.
Decomposing a task into finer-grained subtasks results in higher overhead in synchronizing the subtasks, which further lowers system performance and scalability [2]. Thus, how to reduce the communication overhead between software pipeline stages should also be emphasized.

In some cases, the two issues may interfere with each other, making software pipeline construction harder. The communication pipeline [3] is a communication optimization technique that can significantly hide communication transfer time between processors, but its additional latency may affect the handling of cyclic dependent tasks and cause workload imbalance that cannot be adjusted. Therefore, we have to maintain a trade-off between workload balance and communication optimization techniques for better parallelism.

In this paper, we propose a software pipeline-based partitioning method with cyclic dependent task management and communication cost minimization. The interference between communication optimization and workload balance is explicitly addressed for better performance. We first analyze how to partition general pipeline stages in a cyclic dependency topology. Next, we quantify the effect of inter-stage communication pipelining on software pipeline partitioning, and then formulate these constraints in our Integer Linear Programming (ILP) models to trade off the two factors for a better partitioning result. Finally, each pipeline stage is mapped to one processor.

The main contributions of this paper are summarized as follows. First, the proposed method combines both software pipelining and communication pipelining to balance the computation load and reduce communication overhead; for the first time, the cyclic constraint of the general software pipeline technique is investigated, and the two kinds of pipelines are combined effectively. Second, the software pipeline-based partitioning and mapping method is integrated into a Simulink-based MPSoC multithreaded code generation flow, which enables the automatic generation of efficient parallel code from sequential applications.

The rest of the paper is organized as follows. Section II reviews related work. Section III describes the background of the Simulink model, software pipelining, and communication pipelining. Section IV introduces the proposed mapping method. Section V shows the feasibility of implementing our method. Section VI presents the experiments and discusses the results. Section VII concludes the paper and highlights directions for future work.
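To make the communication pipeline idea above concrete, the following is a minimal sketch (not the implementation from [3]) of overlapping the transfer of one iteration's output with the computation of the next iteration. The compute_stage and dma_send functions are hypothetical stand-ins for a pipeline stage's work and an asynchronous inter-processor transfer.

```python
# Double-buffering sketch: overlap the transfer of iteration i-1's output
# with the computation of iteration i (hypothetical helpers).
from concurrent.futures import ThreadPoolExecutor

def compute_stage(i):
    # Stand-in for one software pipeline stage's work on iteration i.
    return [x * i for x in range(1024)]

def dma_send(buf):
    # Stand-in for an asynchronous inter-processor transfer (e.g., DMA).
    return len(buf)

def run_pipelined(iterations):
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = None                      # transfer of the previous iteration's output
        for i in range(iterations):
            out = compute_stage(i)          # compute while the previous send is in flight
            if pending is not None:
                pending.result()            # wait only for the older transfer
            pending = dma.submit(dma_send, out)
        if pending is not None:
            pending.result()                # drain the last transfer

run_pipelined(8)
```

The transfer kept in flight adds one iteration of latency, which is the additional latency referred to above and the reason communication pipelining interacts with cyclic task dependencies.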
II. Related Work

Current literature offers plenty of methods for code generation from high-level models. Most methods are based on functional modeling such as the Kahn Process Network (KPN) [4], dataflow [5], UML [6], and Simulink [7]. As a prevalent environment for modeling and simulating complex systems at an algorithmic level of abstraction, Simulink has been widely used, for example in Real-Time Workshop (RTW) [8], dSpace [9], and many other code generators [10]-[11]. LESCEA (Light and Efficient Simulink Compiler for Embedded Application) [12] is an automatic code generation tool with memory-oriented optimization techniques. Nevertheless, the partitioning and mapping of an application in LESCEA is conducted manually, which requires expertise and significantly affects the performance of the generated code. The high performance requirements of embedded applications call for efficient partitioning and mapping methods, and much literature tackles this problem. For example, search-based approaches are extensively used, such as Simulated Annealing (SA) in [13] and ILP in [14], which can achieve optimal or near-optimal solutions. Furthermore, performance metrics such as communication latency, memory, and energy consumption are optimized along with the mapping methods (please refer to [15] for more details).

As one of the parallelization methods, software pipelining is widely studied. Cyclic task dependency is an important factor that limits the performance of a software pipeline. The three approaches in [16]-[18] all exploit the retiming technique to transform intra-iteration task dependencies into inter-iteration task dependencies to implement a task-level coarse-grained software pipeline; however, communication is not fully considered in these works. In [19], the authors construct a software pipeline for streaming applications where communication is optimized by placing buffers in the communication channels. As a result, sending and receiving can be performed independently to avoid synchronization overhead, which is similar to our work. In [20], the partitioned streaming application is assigned to pipeline stages in such a way that all communication (DMA) is maximally overlapped with computation on the cores. Nevertheless, the assumption that the whole streaming application model has no feedback loops limits the use of this software pipeline in real-life applications.

ILP is well known for its ability to compute optimal results for partitioning problems, and it has also been applied to generate software pipelines. ILP is exploited in [20] to determine the assignment of synchronous dataflow actors to pipeline stages, corresponding to processors, so as to minimize the maximal load of any processor. In [21], an ILP formulation is used to search a smaller design space and find an appropriate configuration for ASIPs, minimizing system area while satisfying the system runtime constraint in pipelined processors. An ILP-based mapping approach is presented in [22] to minimize the most expensive path in a pipeline under the constraints of program dependency and the maximal number of concurrently executed components. These methods, however, give little consideration to one or both of the two factors discussed above.

Previous works have implemented software pipelines in various ways and integrated certain optimizations for cyclic task dependencies or communication, respectively. In this paper, we consider both cyclic task dependency and communication overhead when trading off computation and communication, and we integrate the techniques handling the two problems into our software pipeline partitioning. We utilize ILP formulations to quantify and combine the above two factors in order to obtain higher performance.
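As a rough illustration of this kind of formulation, the following is a minimal sketch (not the paper's actual model) of an ILP that assigns tasks to pipeline stages while trading off the maximum stage load against inter-stage communication cost. It uses the open-source PuLP library; the task costs, edge volumes, and the weight alpha are made-up example values.

```python
# Toy ILP sketch (PuLP): assign tasks to pipeline stages, trading off the
# maximum stage load against inter-stage communication cost.
from pulp import LpProblem, LpMinimize, LpVariable, LpBinary, lpSum, value

comp = {"A": 4, "B": 6, "C": 3, "D": 5}                  # task -> computation cost
edges = [("A", "B", 2), ("B", "C", 4), ("C", "D", 1)]    # (src, dst, data volume)
stages = [0, 1]                                           # two pipeline stages
alpha = 0.5                                               # weight of the communication term

prob = LpProblem("pipeline_partition", LpMinimize)
x = LpVariable.dicts("x", (list(comp), stages), cat=LpBinary)   # x[t][s]: task t on stage s
cut = {(u, v): LpVariable(f"cut_{u}_{v}", cat=LpBinary) for (u, v, _) in edges}
max_load = LpVariable("max_load", lowBound=0)

for t in comp:                                            # each task runs on exactly one stage
    prob += lpSum(x[t][s] for s in stages) == 1
for s in stages:                                          # max_load bounds every stage's load
    prob += lpSum(comp[t] * x[t][s] for t in comp) <= max_load
for (u, v, _) in edges:                                   # edge is cut if its endpoints differ
    for s in stages:
        prob += cut[(u, v)] >= x[u][s] - x[v][s]

comm = lpSum(w * cut[(u, v)] for (u, v, w) in edges)
prob += max_load + alpha * comm                           # objective: balance vs. communication
prob.solve()

for t in comp:
    print(t, "-> stage", [s for s in stages if value(x[t][s]) > 0.5][0])
print("max stage load =", value(max_load), ", comm cost =", value(comm))
```

The paper's models additionally encode cyclic dependency constraints and the latency introduced by communication pipelining; this sketch only shows the general shape of a load-versus-communication objective.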
III. Background

1. Simulink model

This work is based on the concepts of Simulink models, which have been introduced in previous works [12], [23]-[24]. A Simulink model represents the functionality of the target system with software functions and a hardware architecture. It has the following three types of basic components.

- Simulink Block represents a function that takes n inputs and produces certain outputs. Examples include user-defined blocks (S-functions), discrete delays, and pre-defined blocks such as mathematical operations. For the ease of discussion, we mainly

Deepti answered on Nov 30 2021
Comparative Analysis of Performance via Parallelism
Research Paper on Computer Architecture
Abstract
This paper provides a comparative analysis of three articles on the topic of performance via parallelism. The overall purpose is to establish a common platform for comparison where parallelism is applied to big data. It compares the methods proposed in each paper for improving performance through parallelism and elaborates on the effectiveness of each method for performance improvement through parallel computing. The similarities and differences are listed, and the relationship among the three approaches is indicated in a precise manner. The paper concludes by identifying which of the three approaches is more effective than the other two.
Keywords: optimization, pipelining, task management, performance, ILP (Integer Linear Programming), data transfer, parallel computing
Table of Contents
Abstract
Introduction
Comparative Analysis
Platform of Comparison
Effectiveness of Methods
Conclusion
References
1. Introduction
In the high-performance computing community, the in-memory computing framework is witnessing constant development and enrichment.
(Changtian Ying, 2018) states that, for the in-memory framework, the degree of parallelism is difficult to adapt, and ignoring it can affect the execution efficiency of a job or the resource utilization rate; if resource allocation is better managed, better memory allocation and job execution can be achieved. (Kai Huang) proposes a partitioning method based on software pipelining for improving performance. This method involves communication optimization and cyclic dependent task management, and it obtains the best partitioning results using the Integer Linear Programming (ILP) approach. (Eun-Sung) evaluates data transfer over WAN on the basis of parallel and cross-layer optimization techniques. It exploits data, task, and pipeline parallelism over the three layers in the data transfer path (the network, application, and storage layers) and then proposes cross-layer optimization for better performance.
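As an illustration of the pipeline parallelism exploited in the third approach, the following is a minimal sketch (not code from any of the cited papers) in which a storage-reader stage and a network-sender stage of a transfer run concurrently, connected by a bounded queue; read_block and send_block are hypothetical stand-ins for the storage and network layers.

```python
# Pipeline parallelism in a data transfer path: the reader and sender stages
# run concurrently, decoupled by a small bounded queue.
import queue
import threading

BLOCK_COUNT = 16
STOP = object()                     # sentinel marking the end of the stream

def read_block(i):
    return bytes(64)                # pretend to read a 64-byte block from storage

def send_block(block):
    return len(block)               # pretend to push the block over the network

def reader(q):
    for i in range(BLOCK_COUNT):
        q.put(read_block(i))        # storage layer: produce blocks
    q.put(STOP)

def sender(q):
    sent = 0
    while True:
        block = q.get()
        if block is STOP:
            break
        sent += send_block(block)   # network layer: consume blocks concurrently
    print("bytes sent:", sent)

q = queue.Queue(maxsize=4)          # small buffer decouples the two stages
t1 = threading.Thread(target=reader, args=(q,))
t2 = threading.Thread(target=sender, args=(q,))
t1.start(); t2.start()
t1.join(); t2.join()
```

Because the two stages overlap, the total transfer time approaches that of the slower layer rather than the sum of both, which is the kind of overlap these optimization techniques aim for.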
2....