
Performance via parallelism




Software Partitioning Method Trading off Load Balance and Communication Optimization
Kai Huang, Siwen Xiu, Min Yu, Xiaomeng Zhang, Rongjie Yan, Xiaolang Yan, and Zhili Liu

To achieve high performance on MPSoC via parallelism, a key issue is how to partition a given application into components and map them onto multiple processors. In this paper, we propose a software pipeline-based partitioning method with cyclic dependent task management and communication optimization. During task partitioning, considering computation load balance and communication overhead optimization at the same time can cause interference, which leads to performance loss. To address this issue, we formulate their constraints and apply the Integer Linear Programming (ILP) approach to find an optimal partitioning result that trades off the two factors. Experimental results on a reconfigurable MPSoC platform demonstrate the effectiveness of the proposed method, with 20% to 40% performance improvement over the traditional software pipeline partitioning method.

Keywords: software pipeline, partition, cyclic dependent task management, communication optimization

Manuscript received Apr. 22, 2014; revised Dec. 24, 2014; accepted Jan. 2, 2015. Kai Huang ([email protected]), Siwen Xiu (corresponding author, [email protected]), Min Yu ([email protected]), Xiaomeng Zhang ([email protected]), and Xiaolang Yan ([email protected]) are with the Department of Information Science & Electronic Engineering, Zhejiang University, Zhejiang, China. Rongjie Yan ([email protected]) is also with the Department of Information Science & Electronic Engineering, Zhejiang University, Zhejiang, China. Zhili Liu ([email protected]) is with the marketing team of Hangzhou C-Sky Microsystem Co., Ltd., Zhejiang, China.

I. Introduction

The increasing demand for high performance in embedded systems promotes the extensive use of the Multiprocessor System-on-Chip (MPSoC). Given an application, one key issue in generating efficient parallel code for a target MPSoC platform is how to partition it into components and map them onto different processors with the best performance. As a prevalent parallelization method, software pipelining is an effective solution to this problem. For software programs, pipelining introduces a higher degree of parallelism to increase program throughput. For hardware processors, the pipelined stages make it easy to partition and map the decomposed program onto different components to achieve better hardware utilization [1]. However, the increasing complexity of applications and hardware architectures challenges the efficiency of software pipelining. To exploit the parallelism of a software application on a hardware architecture, the software pipeline technique has to face the following two issues:

- How to keep workloads balanced while maintaining task dependencies. High parallelism calls for a balanced pipeline in which each stage has almost the same execution time as well as linear stage dependency. However, most existing applications involve complicated cyclic task dependencies that may constrain the distribution of tasks among processors, which makes it harder to keep workloads balanced among pipelined stages without destroying task dependencies.

- How to minimize communication overheads. With the increasing complexity of MPSoC, inter-stage communication is becoming a non-negligible factor for software pipelining.
Decomposing a task into finer-grained subtasks results in higher overhead in synchronizing the subtasks, which further lowers system performance and scalability [2]. Thus, how to reduce the communication overhead between software pipeline stages should also be emphasized.

In some cases, the two issues may interfere with each other, making software pipeline construction harder. The communication pipeline [3] is a communication optimization technique that can significantly hide communication transfer time between processors, but its additional latency may affect the handling of cyclic dependent tasks and cause workload imbalance that cannot be adjusted. Therefore, we have to maintain a trade-off between workload balance and communication optimization techniques for better parallelism.

In this paper, we propose a software pipeline-based partitioning method with cyclic dependent task management and communication cost minimization. The interference between communication optimization and workload balance is explicitly addressed for better performance. We first analyze how to partition general pipeline stages in a cyclic dependency topology. Next, we quantify the effect of inter-stage communication pipelining on software pipeline partitioning, and then formulate these constraints in our Integer Linear Programming (ILP) models to trade off the two factors for a better partitioning result. Finally, each pipeline stage is mapped to one processor.

The main contributions of this paper are summarized as follows. First, the proposed method combines both software pipelining and communication pipelining to balance the computation load and reduce communication overhead; for the first time, the cyclic constraint of the general software pipeline technique is investigated, and the two kinds of pipelines are combined effectively. Second, the software pipeline-based partitioning and mapping method is integrated into a Simulink-based MPSoC multithreaded code generation flow, which enables the automatic generation of efficient parallel code from sequential applications.

The rest of the paper is organized as follows. Section II reviews related work. Section III describes the background of the Simulink model, software pipelining, and communication pipelining. Section IV introduces the proposed mapping method. Section V shows the feasibility of implementing our method. Section VI presents the experiments and discusses the results. Section VII concludes the paper and highlights directions for future work.
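To make the communication pipeline idea above concrete, the following is a minimal sketch (not the implementation from [3]) of overlapping the transfer of one iteration's output with the computation of the next iteration. The compute_stage and dma_send functions are hypothetical stand-ins for a pipeline stage's work and an asynchronous inter-processor transfer.

```python
# Double-buffering sketch: overlap the transfer of iteration i-1's output
# with the computation of iteration i (hypothetical helpers).
from concurrent.futures import ThreadPoolExecutor

def compute_stage(i):
    # Stand-in for one software pipeline stage's work on iteration i.
    return [x * i for x in range(1024)]

def dma_send(buf):
    # Stand-in for an asynchronous inter-processor transfer (e.g., DMA).
    return len(buf)

def run_pipelined(iterations):
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = None                      # transfer of the previous iteration's output
        for i in range(iterations):
            out = compute_stage(i)          # compute while the previous send is in flight
            if pending is not None:
                pending.result()            # wait only for the older transfer
            pending = dma.submit(dma_send, out)
        if pending is not None:
            pending.result()                # drain the last transfer

run_pipelined(8)
```

The transfer kept in flight adds one iteration of latency, which is the additional latency referred to above and the reason communication pipelining interacts with cyclic task dependencies.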
II. Related Work

Current literature offers plenty of methods for code generation from high-level models. Most methods are based on functional modeling such as the Kahn Process Network (KPN) [4], dataflow [5], UML [6], and Simulink [7]. As a prevalent environment for modeling and simulating complex systems at an algorithmic level of abstraction, Simulink has been widely used, for example in Real-Time Workshop (RTW) [8], dSpace [9], and many other code generators [10]-[11]. LESCEA (Light and Efficient Simulink Compiler for Embedded Application) [12] is an automatic code generation tool with memory-oriented optimization techniques. Nevertheless, the partitioning and mapping of an application in LESCEA is conducted manually, which requires expertise and significantly affects the performance of the generated code. The high performance requirements of embedded applications call for efficient partitioning and mapping methods, and much literature tackles this problem. For example, search-based approaches are extensively used, such as Simulated Annealing (SA) in [13] and ILP in [14], which can achieve optimal or near-optimal solutions. Furthermore, performance metrics such as communication latency, memory, and energy consumption are optimized along with the mapping methods (please refer to [15] for more details).

As one of the parallelization methods, software pipelining is widely studied. Cyclic task dependency is an important factor that limits the performance of a software pipeline. The three approaches in [16]-[18] all exploit the retiming technique to transform intra-iteration task dependencies into inter-iteration task dependencies to implement a task-level coarse-grained software pipeline; however, communication is not fully considered in these works. In [19], the authors construct a software pipeline for streaming applications where communication is optimized by placing buffers in the communication channels. As a result, sending and receiving can be performed independently to avoid synchronization overhead, which is similar to our work. In [20], the partitioned streaming application is assigned to pipeline stages in such a way that all communication (DMA) is maximally overlapped with computation on the cores. Nevertheless, the assumption that the whole streaming application model has no feedback loops limits the use of this software pipeline in real-life applications.

ILP is well known for its ability to compute optimal results for partitioning problems, and it has also been applied to generate software pipelines. ILP is exploited in [20] to determine the assignment of synchronous dataflow actors to pipeline stages, corresponding to processors, so as to minimize the maximal load of any processor. In [21], an ILP formulation is used to search a smaller design space and find an appropriate configuration for ASIPs, minimizing system area while satisfying the system runtime constraint in pipelined processors. An ILP-based mapping approach is presented in [22] to minimize the most expensive path in a pipeline under the constraints of program dependency and the maximal number of concurrently executed components. These methods, however, give little consideration to one or both of the two factors discussed above.

Previous works have implemented software pipelines in various ways and integrated certain optimizations for cyclic task dependencies or communication, respectively. In this paper, we consider both cyclic task dependency and communication overhead when trading off computation and communication, and we integrate the techniques handling the two problems into our software pipeline partitioning. We utilize ILP formulations to quantify and combine the above two factors in order to obtain higher performance.
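As a rough illustration of this kind of formulation, the following is a minimal sketch (not the paper's actual model) of an ILP that assigns tasks to pipeline stages while trading off the maximum stage load against inter-stage communication cost. It uses the open-source PuLP library; the task costs, edge volumes, and the weight alpha are made-up example values.

```python
# Toy ILP sketch (PuLP): assign tasks to pipeline stages, trading off the
# maximum stage load against inter-stage communication cost.
from pulp import LpProblem, LpMinimize, LpVariable, LpBinary, lpSum, value

comp = {"A": 4, "B": 6, "C": 3, "D": 5}                  # task -> computation cost
edges = [("A", "B", 2), ("B", "C", 4), ("C", "D", 1)]    # (src, dst, data volume)
stages = [0, 1]                                           # two pipeline stages
alpha = 0.5                                               # weight of the communication term

prob = LpProblem("pipeline_partition", LpMinimize)
x = LpVariable.dicts("x", (list(comp), stages), cat=LpBinary)   # x[t][s]: task t on stage s
cut = {(u, v): LpVariable(f"cut_{u}_{v}", cat=LpBinary) for (u, v, _) in edges}
max_load = LpVariable("max_load", lowBound=0)

for t in comp:                                            # each task runs on exactly one stage
    prob += lpSum(x[t][s] for s in stages) == 1
for s in stages:                                          # max_load bounds every stage's load
    prob += lpSum(comp[t] * x[t][s] for t in comp) <= max_load
for (u, v, _) in edges:                                   # edge is cut if its endpoints differ
    for s in stages:
        prob += cut[(u, v)] >= x[u][s] - x[v][s]

comm = lpSum(w * cut[(u, v)] for (u, v, w) in edges)
prob += max_load + alpha * comm                           # objective: balance vs. communication
prob.solve()

for t in comp:
    print(t, "-> stage", [s for s in stages if value(x[t][s]) > 0.5][0])
print("max stage load =", value(max_load), ", comm cost =", value(comm))
```

The paper's models additionally encode cyclic dependency constraints and the latency introduced by communication pipelining; this sketch only shows the general shape of a load-versus-communication objective.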
III. Background

1. Simulink model

This work is based on the concepts of Simulink models, which have been introduced in previous works [12], [23]-[24]. A Simulink model represents the functionality of the target system with software functions and a hardware architecture. It has the following three types of basic components.

- Simulink Block represents a function that takes n inputs and produces certain outputs. Examples include user-defined blocks (S-functions), discrete delays, and pre-defined blocks such as mathematical operations. For the ease of discussion, we mainly

Deepti answered on Nov 30 2021
Comparative Analysis of Performance via Parallelism
Research Paper on Computer Architecture
Abstract
This paper provides a comparative analysis of three articles on the topic of performance via parallelism. The overall purpose is to establish a common platform for comparison where parallelism is applied to big data. It compares the methods proposed in each paper for improving performance through parallelism and elaborates on the effectiveness of each method for performance improvement through parallel computing. The similarities and differences are listed, and the relationship among the three approaches is indicated in a precise manner. The paper concludes by identifying which of the three approaches is more effective than the other two.
Keywords: optimization, pipelining, task management, performance, ILP (Integer Linear Programming), data transfer, parallel computing
Table of Contents
Abstract
Introduction
Comparative Analysis
Platform of Comparison
Effectiveness of Methods
Conclusion
References
1. Introduction
In the high-performance computing community, the in-memory computing framework is witnessing constant development and enrichment.
(Changtian Ying, 2018) states that, for the in-memory framework, the degree of parallelism is difficult to adapt, and ignoring it can affect the execution efficiency of a job or the resource utilization rate; if resource allocation is better managed, better memory allocation and job execution can be achieved. (Kai Huang) proposes a partitioning method based on software pipelining for improving performance. This method involves communication optimization and cyclic dependent task management, and it obtains the best partitioning results using the Integer Linear Programming (ILP) approach. (Eun-Sung) evaluates data transfer over WAN on the basis of parallel and cross-layer optimization techniques. It exploits data, task, and pipeline parallelism over the three layers in the data transfer path (the network, application, and storage layers) and then proposes cross-layer optimization for better performance.
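As an illustration of the pipeline parallelism exploited in the third approach, the following is a minimal sketch (not code from any of the cited papers) in which a storage-reader stage and a network-sender stage of a transfer run concurrently, connected by a bounded queue; read_block and send_block are hypothetical stand-ins for the storage and network layers.

```python
# Pipeline parallelism in a data transfer path: the reader and sender stages
# run concurrently, decoupled by a small bounded queue.
import queue
import threading

BLOCK_COUNT = 16
STOP = object()                     # sentinel marking the end of the stream

def read_block(i):
    return bytes(64)                # pretend to read a 64-byte block from storage

def send_block(block):
    return len(block)               # pretend to push the block over the network

def reader(q):
    for i in range(BLOCK_COUNT):
        q.put(read_block(i))        # storage layer: produce blocks
    q.put(STOP)

def sender(q):
    sent = 0
    while True:
        block = q.get()
        if block is STOP:
            break
        sent += send_block(block)   # network layer: consume blocks concurrently
    print("bytes sent:", sent)

q = queue.Queue(maxsize=4)          # small buffer decouples the two stages
t1 = threading.Thread(target=reader, args=(q,))
t2 = threading.Thread(target=sender, args=(q,))
t1.start(); t2.start()
t1.join(); t2.join()
```

Because the two stages overlap, the total transfer time approaches that of the slower layer rather than the sum of both, which is the kind of overlap these optimization techniques aim for.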
2....