
Read the attached paper, form a question about it, and then answer that question.


serverless_shuffle_paper (9).pdf

This paper is included in the Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI '19), February 26-28, 2019, Boston, MA, USA. ISBN 978-1-931971-49-2. Open access to the proceedings is sponsored by USENIX. https://www.usenix.org/conference/nsdi19/presentation/pu

Shuffling, Fast and Slow: Scalable Analytics on Serverless Infrastructure
Qifan Pu (UC Berkeley), Shivaram Venkataraman (UW Madison), Ion Stoica (UC Berkeley)

Abstract
Serverless computing is poised to fulfill the long-held promise of transparent elasticity and millisecond-level pricing. To achieve this goal, service providers impose a fine-grained computational model where every function has a maximum duration, a fixed amount of memory, and no persistent local storage. We observe that the fine-grained elasticity of serverless is key to achieving high utilization for general computations such as analytics workloads, but that resource limits make it challenging to implement such applications, as they need to move large amounts of data between functions that don't overlap in time. In this paper, we present Locus, a serverless analytics system that judiciously combines (1) cheap but slow storage with (2) fast but expensive storage to achieve good performance while remaining cost-efficient. Locus applies a performance model to guide users in selecting the type and the amount of storage to achieve the desired cost-performance trade-off. We evaluate Locus on a number of analytics applications including TPC-DS, CloudSort, and the Big Data Benchmark, and show that Locus can navigate the cost-performance trade-off, leading to 4×-500× performance improvements over a slow-storage-only baseline and reducing resource usage by up to 59% while achieving performance comparable to running Apache Spark on a cluster of virtual machines, and within 2× of Redshift.

1 Introduction
The past decade has seen the widespread adoption of cloud computing infrastructure where users launch virtual machines on demand to deploy services on a provisioned cluster. As cloud computing continues to evolve towards more elasticity, there is a shift to using serverless computing, where storage and compute are separated for both resource provisioning and billing. This trend was started by services like Google BigQuery [9] and AWS Glue [22] that provide cluster-free data warehouse analytics, followed by services like Amazon Athena [5] that allow users to perform interactive queries against a remote object store without provisioning a compute cluster. While the aforementioned services mostly focus on providing SQL-like analytics, to meet the growing demand, all major cloud providers now offer "general" serverless computing platforms, such as AWS Lambda, Google Cloud Functions, Azure Functions, and IBM OpenWhisk. In these platforms, short-lived user-defined functions are scheduled and executed in the cloud. Compared to virtual machines, this model provides more fine-grained elasticity with sub-second start-up times, so that workload requirements can be dynamically matched with continuous scaling.
Fine-grained elasticity in serverless platforms is naturally useful for on-demand applications like creating image thumbnails [18] or processing streaming events [26]. However, we observe such elasticity also plays an important role for data analytics workloads. Consider for example an ad-hoc data analysis job exemplified by, say, TPC-DS query 95 [34] (see Section 5 for more details). This query consists of eight stages and the amount of input data at each stage varies from 0.8MB to 66GB. With a cluster of virtual machines, users would need to size the cluster to handle the largest stage, leaving resources idle during other stages. Using a serverless platform can improve resource utilization as resources can be immediately released after use.

However, directly using a serverless platform for data analytics workloads could lead to extremely inefficient execution. For example, we find that running the CloudSort benchmark [40] with 100TB of data on AWS Lambda can be up to 500× slower (Section 2.3) than running on a cluster of VMs. By breaking down the overheads, we find that the main reason for the slowdown is slow data shuffle between asynchronous function invocations. As the ephemeral, stateless compute units lack any local storage, and as direct transfers between functions are not always feasible (cloud providers typically provide no guarantees on concurrent execution of workers), intermediate data between stages needs to be persisted on shared storage systems like Amazon S3. The characteristics of the storage medium can have a significant impact on performance and cost. For example, a shuffle from 1000 map tasks to 1000 reduce tasks leads to 1M data blocks being created on the storage system. Therefore, throughput limits of object stores like Amazon S3 can lead to significant slowdowns (Section 2.3).

Our key observation is that in addition to using elastic compute and object storage systems, we can also provision fast memory-based resources in the cloud, such as in-memory Redis or Memcached clusters. While naively putting all data in fast storage is cost prohibitive, we can appropriately combine fast but expensive storage with slower but cheaper storage, similar to the memory and disk hierarchy on a local machine, to achieve the best of both worlds: approach the performance of a pure in-memory execution at a significantly lower cost. However, achieving such a sweet spot is not trivial, as it depends on a variety of configuration parameters, including storage type and size, degree of task parallelism, and the memory size of each serverless function. This is further exacerbated by the various performance limits imposed in a serverless environment (Section 2.4).

In this paper we propose Locus, a serverless analytics system that combines multiple storage types to achieve better performance and resource efficiency. In Locus, we build a performance model to aid users in selecting the appropriate storage mechanism, as well as the amount of fast storage and parallelism to use for map-reduce-like jobs in serverless environments. Our model captures the performance and cost metrics of various cloud storage systems, and we show how we can combine different storage systems to construct hybrid shuffle methods.
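To make the shuffle arithmetic quoted above concrete, the following is a minimal back-of-envelope sketch (our illustration, not from the paper) that estimates how many intermediate blocks an all-to-all shuffle creates and a lower bound on the time an object store needs to absorb them. The sustained request rate is a hypothetical assumption, not a measured S3 figure.

```python
# Back-of-envelope estimate of shuffle pressure on an object store.
# The request-rate limit below is an illustrative assumption.

def shuffle_blocks(num_map_tasks: int, num_reduce_tasks: int) -> int:
    """Each map task writes one block per reduce task (all-to-all shuffle)."""
    return num_map_tasks * num_reduce_tasks

def min_shuffle_seconds(blocks: int, requests_per_second: float) -> float:
    """Lower bound on shuffle time if the store sustains a fixed request rate.
    Each block costs one write (by a mapper) and one read (by a reducer)."""
    return 2 * blocks / requests_per_second

if __name__ == "__main__":
    blocks = shuffle_blocks(1000, 1000)  # 1M blocks, as in the paper's example
    # Assume the store sustains ~10,000 requests/s (hypothetical figure).
    print(f"{blocks:,} blocks -> at least {min_shuffle_seconds(blocks, 10_000):.0f} s")
```

Even under this generous assumed request rate, the million-block shuffle is bottlenecked by request throughput rather than bandwidth, which is the slowdown the paper attributes to object stores.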
Using simple micro-benchmarks, we model the performance variations of storage systems as other variables, like serverless function memory and parallelism, change.

We evaluate Locus on a number of analytics applications including TPC-DS, Daytona CloudSort, and the Big Data Benchmark. We show that using fine-grained elasticity, Locus can reduce cluster time in terms of total core-seconds by up to 59% while being close to or beating Spark's query completion time by up to 2×. We also show that with a small amount of fast storage, for example, fast storage just large enough to hold 5% of total shuffle data, Locus matches Apache Spark in running time on the CloudSort benchmark and is within 13% of the cost of the winning entry in 2016. While we find Locus to be 2× slower when compared to Amazon Redshift, Locus is still a preferable choice since it requires no provisioning time (vs. minutes to set up a Redshift cluster) and no knowledge of an optimal cluster size beforehand. Finally, we also show that our model is able to accurately predict shuffle performance and cost with an average error of 15.9% and 14.8%, respectively, which allows Locus to choose the most appropriate shuffle implementation and other configuration variables.

In summary, the main contributions of this paper are:
• We study the problem of executing general purpose data analytics on serverless platforms to exploit fine-grained elasticity and identify the need for efficient shuffles.
• We show how using a small amount of memory-based fast storage can lead to significant benefits in performance while remaining cost effective.
• To aid users in selecting the appropriate storage mechanism, we propose Locus, a performance model that captures the performance and cost metrics of shuffle operations.
• Using extensive evaluation on TPC-DS, CloudSort, and the Big Data Benchmark, we show that our performance model is accurate and can lead to 4×-500× performance improvements over baseline and up to 59% cost reduction compared to traditional VM deployments, while remaining within 2× of Redshift.

2 Background
We first present a brief overview of serverless computing and compare it with traditional VM-based instances. Next we discuss how analytics queries are implemented on serverless infrastructure and present some of the challenges in executing large-scale shuffles.

2.1 Serverless Computing: What fits?
Recently, cloud providers and open source projects [25, 32] have proposed services that execute functions in the cloud, i.e., Functions-as-a-Service. As of now, these functions are subject to stringent resource limits. For example, AWS Lambda currently imposes a 5-minute limit on function duration and a 3GB memory limit. Functions are also assumed to be stateless and are only allocated 512MB of ephemeral storage. Similar limits are applied by other providers such as Google Cloud Functions and Azure Functions. Regardless of such limitations, these offerings are popular among users for two main reasons: ease of deployment and flexible resource allocation. When deploying a cluster of virtual machines, users need to choose the instance type and the number of instances, and make sure these instances are shut down when the computation finishes. In contrast, serverless offerings have a much simpler deployment model where the functions are automatically triggered based on events, e.g., the arrival of new data.
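As a concrete illustration of working within the per-function limits quoted above, here is a minimal sizing sketch (ours, not the paper's) that computes how many function invocations are needed so each worker's partition fits in function memory. The constants mirror the limits stated in the text but change over time, so treat them as assumptions.

```python
import math

# Per-function caps quoted in the text above; treat as assumptions since
# providers revise these limits over time.
MAX_DURATION_S = 5 * 60        # 5-minute function duration limit
MAX_MEMORY_BYTES = 3 * 2**30   # 3GB memory limit
GB = 2**30

def workers_needed(total_input_bytes: int,
                   per_worker_bytes: int = MAX_MEMORY_BYTES) -> int:
    """Minimum number of invocations so each worker's partition fits in
    function memory. Real deployments leave headroom for the runtime, so
    per_worker_bytes should be set below the hard cap."""
    return math.ceil(total_input_bytes / per_worker_bytes)

if __name__ == "__main__":
    # CloudSort-scale input from the text: 100TB.
    print(workers_needed(100 * 10**12))           # partitions at the 3GB cap
    print(workers_needed(100 * 10**12, 1 * GB))   # with headroom: 1GB/worker
```

The point of the sketch is that tens of thousands of short-lived workers are required at this scale, which is exactly the regime where shuffle between non-overlapping functions becomes the dominant cost.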
Furthermore, due to their lightweight nature, containers used for serverless deployment can often be launched within seconds and thus are easier to scale up or scale down when compared to VMs. The benefits of elasticity are especially pronounced for workloads where the number of cores required varies across time. While this naturally happens for event-driven workloads, for example where a user uploads a photo to a service that compresses and stores it, we find that elasticity is also important for data analytics workloads. In particular, user-facing ad-hoc queries or exploratory analytics workloads are often unpredictable yet have more stringent responsiveness requirements, making it more difficult to provision a traditional cluster for them than for recurring production workloads.

We present two common scenarios that highlight the importance of elasticity; a sketch of the first scenario's idle-core arithmetic follows below. First, consider a stage of tasks being run as part of an analytics workload. As most frameworks use a BSP model [15, 44], the stage completes only when the last task completes. As the same VMs are used across stages, the cores where tasks have finished sit idle while the slowest tasks, or stragglers, complete [3]. In comparison, with a serverless model, the cores are immediately relinquished when a task completes. This shows the importance of elasticity within a stage. Second, elasticity is also important across stages: if we consider TPC-DS query 95 (details in Section 5), the query consists of 8 stages with…
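To make the within-stage scenario concrete, the following is a minimal sketch (our illustration, not the paper's) comparing the core-seconds billed by a fixed VM cluster against a serverless model for one BSP stage with a straggler. The task durations are made-up numbers chosen only to show the effect.

```python
# Core-seconds consumed by one BSP stage: a fixed cluster bills every core
# until the last (straggler) task finishes; a serverless model bills each
# task only for its own duration. Durations below are made-up examples.

def vm_core_seconds(task_durations_s: list[float]) -> float:
    """One core per task, all held until the slowest task completes."""
    return len(task_durations_s) * max(task_durations_s)

def serverless_core_seconds(task_durations_s: list[float]) -> float:
    """Each core is relinquished the moment its task completes."""
    return sum(task_durations_s)

if __name__ == "__main__":
    tasks = [10.0] * 99 + [60.0]  # 99 fast tasks plus one 60-second straggler
    vm, sls = vm_core_seconds(tasks), serverless_core_seconds(tasks)
    print(f"VM cluster: {vm:.0f} core-s, serverless: {sls:.0f} core-s "
          f"({1 - sls / vm:.0%} saved)")
```

With even one straggler, the fixed cluster pays for every idle core until the stage ends, which is the resource-efficiency gap the excerpt describes.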
Nithin answered on Oct 12, 2021
Question: How do we calculate accuracy for a supervised model?

Answer: First, we have to determine the type of supervised model. There are two types of supervised models: the first is regression and the other is classification. When it comes to regression models, there are lots of evaluation...