Building Next-Generation Serverless Datacenters
Datacenters have used a “monolithic” server model for decades, where each server hosts a set of hardware devices like CPU and DRAM and runs an OS or hypervisor on top to manage the hardware resources. In recent years, cloud providers have begun offering a type of service called “serverless computing” in response to cloud users’ desire to avoid managing clusters and to auto-scale their applications to the right size. Serverless computing quickly gained popularity, but today’s serverless computing still runs on servers. The monolithic server model is not the best fit for serverless computing, and it fundamentally restricts datacenters from achieving efficient resource packing, hardware rightsizing, and great heterogeneity.
We propose to fully or partially “disaggregate” monolithic servers into network-attached hardware components that host different hardware resources and offer different functionalities (e.g., a computation pool for running application logic, a memory pool for enlarged and consolidated memory spaces, a persistent-memory pool for fast access to key-value data). With such a “serverless” datacenter, hardware resources can be allocated and scaled to the exact amount that applications use and can be individually managed and customized for different application needs. WukLab is pioneering end-to-end, multi-layer solutions for the next-generation serverless datacenter, including new hardware platforms, a new operating system, new distributed systems, a new virtualization platform, new networking systems, and new security solutions.
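The resource-packing argument above can be illustrated with a small back-of-the-envelope sketch. Everything in it is an assumption made for illustration: the `Demand` type, the 16-core/64 GB server shape, the first-fit placement policy, and the four made-up workloads. It compares how much hardware must be provisioned when each application has to fit on one monolithic server against what is needed when CPU and memory come from separate pools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Demand:
    cpu: int     # cores requested by one application (illustrative)
    mem_gb: int  # GB of memory requested by one application (illustrative)


def servers_needed(demands, server_cpu, server_mem_gb):
    """First-fit placement of whole applications onto identical monolithic servers."""
    servers = []  # each entry is [free_cpu, free_mem_gb]
    for d in demands:
        if d.cpu > server_cpu or d.mem_gb > server_mem_gb:
            raise ValueError("demand does not fit on a single server")
        for s in servers:
            if s[0] >= d.cpu and s[1] >= d.mem_gb:
                s[0] -= d.cpu
                s[1] -= d.mem_gb
                break
        else:
            # No existing server has room in BOTH dimensions: open a new one.
            servers.append([server_cpu - d.cpu, server_mem_gb - d.mem_gb])
    return len(servers)


def pooled_totals(demands):
    """With separate CPU and memory pools, each pool only needs the summed demand."""
    return sum(d.cpu for d in demands), sum(d.mem_gb for d in demands)


# Two CPU-heavy and two memory-heavy applications (made-up numbers).
demands = [Demand(12, 16), Demand(4, 60), Demand(12, 16), Demand(4, 60)]

n_servers = servers_needed(demands, server_cpu=16, server_mem_gb=64)
pool_cpu, pool_mem = pooled_totals(demands)

print(f"monolithic: {n_servers} servers = {n_servers * 16} cores, {n_servers * 64} GB")
print(f"disaggregated pools: {pool_cpu} cores, {pool_mem} GB")
```

In this toy input, every application strands either cores or memory on its server, so the monolithic layout provisions roughly twice the hardware that the pooled layout needs. Real placement policies and workloads will behave differently.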
Disaggregating Network Tasks: A Consolidation Approach with NetPool
Servers in today's data centers host software and hardware resources for processing network packets. Managing and executing network tasks at each end host can be costly in both capital expense and engineering effort. We propose to disaggregate network resources from end hosts into a separate network resource pool. We propose NetPool, a distributed platform that pools SmartNICs together at rack scale. Each SmartNIC in NetPool consolidates network functionalities from multiple endpoints by fairly sharing limited hardware resources, and it achieves its performance goals with an auto-scaled, highly parallel data plane and a scalable control plane.
Disaggregation and Program Behavior: A Static-Runtime-Codesign Approach with Mira
Far memory, where memory accesses go to memory on remote servers, has become more popular in recent years as a way to expand memory size and avoid memory stranding. Prior far-memory systems have taken one of two approaches: transparently swapping memory pages between local and far memory, or using new programming models to move fine-grained data between local and far memory. The former requires no program changes but incurs a performance penalty, while the latter offers better performance but requires significant program changes.
We propose Mira, a far-memory system that co-designs static program analysis and compilation with the run-time system. Mira uses program analysis results, profiled execution information, and system environments together to guide code compilation and system configurations for far memory. Our evaluation shows that Mira outperforms prior swap-based and programming-model-based systems by up to 18 times.
Disaggregating Serverless Computing: A Resource-Centric Approach with Scad
Today’s serverless computing has several key limitations including per-function resource limits, fixed CPU-to-memory ratio, and constant resource allocation throughout a function execution and across different invocations of it. The root cause of these limitations is the “function-centric” model: a function is a fixed-size box that is allocated, executed, and terminated as an inseparable unit. This unit is pre-defined by the cloud provider and cannot properly capture user needs.
We propose a “resource-centric” model for serverless computing that captures fine-grained resource needs throughout an application’s execution using components of distinct resource type, amount, and time span. We built Scad, a new resource-based serverless execution platform that executes components in a disaggregated, on-demand, and auto-scaled manner. Our results show that Scad reduces resource consumption by 40% to 84% compared to OpenWhisk while adding only 1.3% performance overhead. Compared to PyWren, Scad achieves a 15% to 28% speedup while reducing resource consumption by 16% to 31%.
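The difference between the two models can be made concrete with a toy accounting sketch. It is only an illustration under stated assumptions: the `Component` type, its field names, and the example numbers are hypothetical and are not taken from the platform described above. A resource-centric bill charges each component for its own amount over its own time span, while a function-centric bill charges a fixed box sized for the peak of every resource and held for the whole execution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    kind: str       # resource type, e.g. "cpu" (cores) or "mem" (GB); hypothetical labels
    amount: float   # how much of that resource the component needs
    start_s: float  # when the component starts, in seconds
    end_s: float    # when the component ends, in seconds


def resource_centric_usage(components):
    """Charge each component for its own amount over its own time span."""
    usage = {}
    for c in components:
        usage[c.kind] = usage.get(c.kind, 0.0) + c.amount * (c.end_s - c.start_s)
    return usage


def peak(components, kind):
    """Largest concurrent amount of one resource type (checked at every start time)."""
    same = [c for c in components if c.kind == kind]
    return max(
        (sum(o.amount for o in same if o.start_s <= c.start_s < o.end_s) for c in same),
        default=0.0,
    )


def function_centric_usage(components):
    """Charge a fixed box sized for the peak of each resource for the whole run."""
    span = max(c.end_s for c in components) - min(c.start_s for c in components)
    kinds = sorted({c.kind for c in components})
    return {k: peak(components, k) * span for k in kinds}


# A made-up application: a short CPU-heavy phase, then a long light phase,
# with a memory spike only in the last ten seconds.
app = [
    Component("cpu", 4.0, 0.0, 10.0),
    Component("cpu", 1.0, 10.0, 40.0),
    Component("mem", 2.0, 0.0, 40.0),
    Component("mem", 14.0, 30.0, 40.0),
]

fine = resource_centric_usage(app)     # core-seconds and GB-seconds actually needed
boxed = function_centric_usage(app)    # what a peak-sized fixed box would hold
print("resource-centric:", fine)
print("function-centric:", boxed)
```

The gap between the two dictionaries is the over-allocation that a fixed-size function box forces on this particular input. The numbers are illustrative only and say nothing about measured savings.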
Disaggregating Memory: A Hardware Approach with Clio
Memory disaggregation has attracted great attention recently because of its benefits in efficient memory utilization and ease of management. So far, memory disaggregation research has taken one of two approaches: building or emulating memory nodes with either regular servers or raw memory devices with no processing power. The former incurs higher monetary cost and faces tail-latency and scalability limitations, while the latter introduces performance, security, and management problems.
We seek a sweet spot in the middle of these two extremes by proposing, for the first time, a hardware-based memory disaggregation solution that has the right amount of processing power at memory nodes. We built a hardware-based disaggregated memory system called Clio, which virtualizes and manages disaggregated memory at the memory node. Clio includes a new hardware-based virtual memory system, a customized network system, and a framework for computation offloading. In building Clio, we not only co-design OS functionalities, hardware architecture, and the network system, but also co-design the compute node and memory node. We prototyped Clio’s memory node with FPGA and implemented its client-node functionalities in a user-space library. Clio achieves 100 Gbps throughput and an end-to-end latency of 2.5 µs at the median and 3.2 µs at the 99th percentile. Clio scales much better and has orders of magnitude lower tail latency than RDMA. It achieves 1.1× to 3.4× energy savings compared to CPU-based and SmartNIC-based disaggregated memory systems and is 2.7× faster than software-based SmartNIC solutions.
Disaggregating Persistent Memory: A Passive Approach with Clover
Existing disaggregated storage systems use hard disks or SSDs as storage media. Recently, persistent memory (PM) technology has matured and seen initial adoption in several datacenters. Disaggregating PM could offer the same benefits as traditional disaggregated storage systems, but it requires new designs because of its memory-like performance and byte addressability.
We explore the design of disaggregating PM and managing it remotely from compute servers, a model we call passive disaggregated persistent memory, or pDPM. Compared to the alternative of managing PM at storage servers, pDPM significantly lowers monetary and energy costs and avoids scalability bottlenecks at storage servers. We built three key-value store systems using the pDPM model. The first lets all compute nodes directly access and manage storage nodes. The second uses a central coordinator to orchestrate the communication between compute and storage nodes. These two systems have various performance and scalability limitations. To solve these problems, we built Clover, a pDPM system that separates the location, communication mechanism, and management strategy of the data plane and the metadata/control plane. Compute nodes access storage nodes directly for data operations, while one or a few global metadata servers handle all metadata/control operations. In our extensive evaluation of the three pDPM systems, Clover performed best. Its performance under common datacenter workloads is similar to that of non-pDPM remote in-memory key-value stores, while reducing CapEx and OpEx by 1.4× and 3.9×, respectively.
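The split between a direct data path and a separate metadata/control path can be sketched in a few lines. This is a deliberately simplified, single-process toy and not Clover's actual protocol: the class names, the bump allocator, the out-of-place overwrite, and the `publish`/`lookup` calls are all assumptions made for illustration, and it ignores concurrency, crash consistency, caching, and space reclamation.

```python
class StorageNode:
    """Passive persistent memory: holds bytes and runs no management logic."""

    def __init__(self, size):
        self.pm = bytearray(size)

    def read(self, offset, length):
        return bytes(self.pm[offset:offset + length])

    def write(self, offset, data):
        self.pm[offset:offset + len(data)] = data


class MetadataServer:
    """Handles only control operations: space allocation and the key index."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.next_free = 0
        self.index = {}  # key -> (offset, length)

    def allocate(self, length):
        if self.next_free + length > self.capacity:
            raise MemoryError("storage node is full")
        offset = self.next_free
        self.next_free += length
        return offset

    def publish(self, key, offset, length):
        self.index[key] = (offset, length)

    def lookup(self, key):
        return self.index[key]


class ComputeNode:
    """Client side: metadata goes to the metadata server, data goes straight to storage."""

    def __init__(self, meta, storage):
        self.meta = meta
        self.storage = storage

    def put(self, key, value):
        offset = self.meta.allocate(len(value))      # control path
        self.storage.write(offset, value)            # data path, no server-side logic
        self.meta.publish(key, offset, len(value))   # control path

    def get(self, key):
        offset, length = self.meta.lookup(key)       # control path
        return self.storage.read(offset, length)     # data path


meta = MetadataServer(capacity=64)
storage = StorageNode(size=64)
writer = ComputeNode(meta, storage)
reader = ComputeNode(meta, storage)

writer.put("user:1", b"alice")
first = reader.get("user:1")
writer.put("user:1", b"bob")  # written out of place; the index now points at the new copy
second = reader.get("user:1")
print(first, second)
```

The point of the sketch is only the division of labor: the storage node never interprets keys or manages space, so it needs no processing power, while the metadata server never touches value bytes.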
Disaggregating Operating System: A SplitKernel Approach with LegoOS
The monolithic server model, where a server is the unit of deployment, operation, and failure, is meeting its limits in the face of several recent hardware and application trends. To improve resource utilization, elasticity, heterogeneity, and failure handling in datacenters, we believe that datacenters should break monolithic servers into disaggregated, network-attached hardware components. Despite the promising benefits of hardware resource disaggregation, no existing OSes or software systems can properly manage it.
We propose a new OS model called the splitkernel to manage disaggregated systems. A splitkernel disseminates traditional OS functionalities into loosely coupled monitors, each of which runs on and manages a hardware component. A splitkernel also performs resource allocation and failure handling for a distributed set of hardware components. Using the splitkernel model, we built LegoOS, a new OS designed for hardware resource disaggregation. LegoOS appears to users as a set of distributed servers. Internally, a user application can span multiple processor, memory, and storage hardware components. We implemented LegoOS on x86-64 and evaluated it by emulating hardware components using commodity servers. Our evaluation results show that LegoOS’s performance is comparable to monolithic Linux servers, while largely improving resource packing and reducing failure rate over monolithic clusters.
Find out more about LegoOS and get it here.
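The idea of per-component monitors plus a global allocator can be sketched as follows. This is a minimal toy under stated assumptions, not LegoOS code: the `Monitor` and `GlobalManager` classes, their methods, and the greedy spreading policy are hypothetical names and choices made for illustration. Each monitor manages exactly one hardware component, and a single request can be spread across several components of the same kind.

```python
class Monitor:
    """Runs on and manages exactly one hardware component (hypothetical interface)."""

    def __init__(self, name, kind, capacity):
        self.name = name
        self.kind = kind          # e.g. "cpu", "mem", or "storage"
        self.capacity = capacity
        self.used = 0
        self.alive = True

    def free(self):
        # A failed component contributes no capacity.
        return self.capacity - self.used if self.alive else 0

    def reserve(self, amount):
        if amount > self.free():
            raise RuntimeError(f"{self.name} cannot reserve {amount}")
        self.used += amount


class GlobalManager:
    """Coarse-grained allocation across the monitors of a disaggregated cluster."""

    def __init__(self, monitors):
        self.monitors = list(monitors)

    def allocate(self, kind, amount):
        """Greedily spread one request over components; returns {monitor name: amount}."""
        candidates = [m for m in self.monitors if m.kind == kind]
        if sum(m.free() for m in candidates) < amount:
            raise RuntimeError(f"not enough free {kind}")
        placement, remaining = {}, amount
        for m in candidates:
            take = min(m.free(), remaining)
            if take > 0:
                m.reserve(take)
                placement[m.name] = take
                remaining -= take
            if remaining == 0:
                break
        return placement


mem0 = Monitor("mem0", "mem", capacity=32)
mem1 = Monitor("mem1", "mem", capacity=32)
cpu0 = Monitor("cpu0", "cpu", capacity=8)
manager = GlobalManager([cpu0, mem0, mem1])

# One application: 4 cores on one processor component, 48 GB across two memory components.
cpu_placement = manager.allocate("cpu", 4)
mem_placement = manager.allocate("mem", 48)
print(cpu_placement, mem_placement)
```

Because each monitor tracks only its own component, marking one as failed (`alive = False`) removes just that component's capacity from future allocations while the others keep serving requests.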
Related Publications
Conferences and Journals
Mira: A Program-Behavior-Guided Far Memory System
Zhiyuan Guo, Zijian He, Yiying Zhang
Proceedings of the 29th ACM Symposium on Operating Systems Principles
(SOSP '23)
Make It Real: An End-to-End Implementation of A Physically Disaggregated Data Center
Yiying Zhang
ACM SIGOPS Operating Systems Review 57(1) 1-9 (2023)
(OSR '23)
Hermit: Low-Latency, High-Throughput, and Transparent Remote Memory via Feedback-Directed Asynchrony
Yifan Qiao, Chenxi Wang, Zhenyuan Ruan, Adam Belay, Qingda Lu, Yiying Zhang, Miryung Kim, Harry Xu
Proceedings of the 20th USENIX Symposium on Networked Systems Design and Implementation
(NSDI '23)
Canvas: Isolated and Adaptive Swapping for Multi-Applications on Remote Memory
Chenxi Wang, Yifan Qiao, Haoran Ma, Shi Liu, Yiying Zhang, Wenguang Chen, Ravi Netravali, Miryung Kim, Harry Xu
Proceedings of the 20th USENIX Symposium on Networked Systems Design and Implementation
(NSDI '23)
Clio: A Hardware-Software Co-Designed Disaggregated Memory System
Zhiyuan Guo*, Yizhou Shan*, Xuhao Luo, Yutong Huang, Yiying Zhang (* equal contribution)
Proceedings of the 27th International Conference on Architectural Support for Programming Languages and Operating Systems
(ASPLOS '22)
Disaggregating Persistent Memory and Controlling Them Remotely: An Exploration of Passive Disaggregated Key-Value Stores
Shin-Yeh Tsai, Yizhou Shan, Yiying Zhang
2020 USENIX Annual Technical Conference
(USENIX ATC '20)
Storm: A Fast Transactional Dataplane for Remote Data Structures
Stanko Novakovic, Yizhou Shan, Aasheesh Kolli, Michael Cui, Yiying Zhang, Haggai Eran, Liran Liss, Michael Wei, Dan Tsafrir, Marcos Aguilera
Proceedings of the 12th ACM International Systems and Storage Conference
(SYSTOR '19)
(Best Paper Award)
LegoOS: A Disaggregated, Distributed OS for Hardware Resource Disaggregation
Yizhou Shan, Yutong Huang, Yilun Chen, Yiying Zhang
Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation
(OSDI '18)
(Best Paper Award)
Workshop
Towards a Fully Disaggregated and Programmable Data Center
Yizhou Shan, Will Lin, Zhiyuan Guo, Yiying Zhang
to appear at the 13th ACM Asia-Pacific Workshop on Systems
(APSys '22)
Challenges in Building and Deploying Disaggregated Persistent Memory
Yizhou Shan, Yutong Huang, Yiying Zhang
the 10th Annual Non-Volatile Memories Workshop
(NVMW '19)
Building Atomic, Crash-Consistent Data Stores with Disaggregated Persistent Memory
Shin-Yeh Tsai, Yiying Zhang
the 10th Annual Non-Volatile Memories Workshop
(NVMW '19)
Disaggregating Memory with Software-Managed Virtual Cache
Yizhou Shan, Yiying Zhang
the 2018 Workshop on Warehouse-scale Memory Systems
(WAMS '18)
(co-located with ASPLOS '18)
MemAlbum: an Object-Based Remote Software Transactional Memory System
Shin-Yeh Tsai, Yiying Zhang
the 2018 Workshop on Warehouse-scale Memory Systems
(WAMS '18)
(co-located with ASPLOS '18)
Split Container: Running Containers beyond Physical Machine Boundaries
Yilun Chen, Yiying Zhang
the 2018 Workshop on Warehouse-scale Memory Systems
(WAMS '18)
(co-located with ASPLOS '18)
Disaggregated Operating System
Yiying Zhang, Yizhou Shan, Sumukh Hallymysore
the 17th International Workshop on High Performance Transaction Systems
(HPTS '17)