RESEARCH ARTICLE

UCat: heterogeneous memory management for unikernels

  • Chong TIAN ,
  • Haikun LIU ,
  • Xiaofei LIAO ,
  • Hai JIN
  • National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab/Cluster and Grid Computing Lab, School of Computing Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China

Received date: 26 Apr 2021

Accepted date: 25 Jan 2022

Published date: 15 Feb 2023

Copyright

2023 Higher Education Press

Abstract

Unikernels provide an efficient and lightweight way to deploy cloud computing services in application-specialized and single-address-space virtual machines (VMs). Hundreds of unikernel-based VMs can be deployed efficiently on a single physical server. In such a cloud computing platform, main memory is the primary bottleneck resource for high-density application deployment. Recently, non-volatile memory (NVM) technologies have become increasingly popular in cloud data centers because they can offer extremely large memory capacity at low cost. However, many challenges remain in utilizing NVMs for unikernel-based VMs, such as the difficulty of heterogeneous memory allocation and the high performance overhead of address translation.

In this paper, we present UCat, a heterogeneous memory management mechanism that supports multi-grained memory allocation for unikernels. We propose front-end/back-end cooperative address space mapping to expose the host memory heterogeneity to unikernels. UCat exploits large pages to reduce the cost of two-layer address translation in virtualization environments, and leverages slab allocation to reduce memory waste due to internal memory fragmentation. We implement UCat based on a popular unikernel, OSv, and conduct extensive experiments to evaluate its efficiency. Experimental results show that UCat reduces the memory consumption of unikernels by 50% and the TLB miss rate by 41%, and improves the throughput of real-world benchmarks such as memslap and YCSB by up to 18.5% and 14.8%, respectively.

Cite this article

Chong TIAN, Haikun LIU, Xiaofei LIAO, Hai JIN. UCat: heterogeneous memory management for unikernels. Frontiers of Computer Science, 2023, 17(1): 171204. DOI: 10.1007/s11704-022-1201-y

1 Introduction

Hardware virtualization is the de facto enabling technology for cloud computing, which allows customers to rent computing resources via VMs. Applications are deployed in specialized operating system (OS) images called virtual appliances that contain a standard OS kernel and other necessary software components [1]. Each VM is a self-contained virtual computer that provides strong isolation from other VMs in multi-tenant cloud platforms. VMs are often specialized to run a single task, such as a Web server or a database, and scale out by cloning VMs from image templates.
Although traditional VMs provide strong isolation by running general-purpose guest OSes, they often suffer from high performance/storage overhead in terms of large memory consumption, long booting latency, and virtualization cost [2-4]. OS virtualization technologies such as containers [5] allow multiple instances to share the same OS kernel as the host. Thus, containers can be more efficient than VMs, but do not provide as strong isolation as VMs. Recently, unikernels have emerged as a lightweight way to deploy cloud services. Unikernels are application-specialized, single-address-space machine images constructed by using library operating systems [2]. The entire software stack of a unikernel includes the OS kernel, necessary system libraries, the language runtime, and the application source code. They are compiled into a lightweight, single-purpose application that can run directly on a standard hypervisor. Unikernels can offer near bare-metal application performance, strong isolation, minimal memory footprint, and extremely low startup latency. Fig.1 shows the architectures of VMs, containers, and unikernels. In general, unikernels offer VM-like strong isolation, and are even more lightweight and efficient than containers.
Fig.1 Comparison between (a) virtual machines, (b) containers, and (c) unikernels


Due to these promising characteristics, unikernels have been increasingly studied for network function virtualization [6-8], micro-services [9], IoT applications [7, 10], and serverless computing [11, 12]. In these scenarios, cloud computing platforms can efficiently deploy hundreds of unikernel-based VMs on a single physical server, and thus main memory becomes the primary bottleneck resource for high-density application deployment. Recently, NVM technologies such as PCM [13] and 3D XPoint [14] have become increasingly popular in cloud data centers because they can offer extremely large memory capacity at low cost. However, since NVMs have relatively lower performance and limited write endurance compared to DRAM [15], it is more practical to use NVM in combination with DRAM to fully exploit the advantages of both [16]. Although there have been many studies on heterogeneous memory management in the OS layer, many challenges remain in utilizing NVMs for unikernel-based VMs [17].
First, the existing memory management systems for VMs were not designed for single-address-space unikernels, and require a redesign to efficiently utilize heterogeneous DRAM/NVM memories. Currently, heterogeneous memories are not visible to guest OSes since hypervisors and VMs were initially designed to use a single type of memory. Modern OSes already support heterogeneous memories through the non-uniform memory access (NUMA) architecture or different namespaces, and VMs have an opportunity to use DRAM and NVM by mapping them to different NUMA nodes or namespaces. Unfortunately, most unikernels support neither the NUMA architecture nor namespaces, making it difficult to expose heterogeneous memories to unikernels.
Second, unikernels usually suffer a high performance penalty due to two-layer address translation in the virtualization environment. A translation lookaside buffer (TLB) miss in a guest OS leads to 24 memory references due to two-layer page table walking, and the performance overhead may even exceed 50% of the total execution time of applications [18-21]. Although previous studies have shown that using large pages (or hugepages) can significantly reduce the address translation overhead, they often preclude lightweight memory management such as fine-grained memory allocation, page migration, and memory de-duplication. It is challenging to bridge this fundamental conflict in unikernels.
In this paper, we present UCat, the first work to support heterogeneous memory management for unikernels. We enable heterogeneous memory allocation in unikernels by mapping two memory regions to the host-managed DRAM and NVM namespaces, respectively. UCat exploits large pages to reduce the cost of two-layer address translations in virtualization environments, and leverages slab memory allocation to reduce memory waste at the application layer. Overall, we make the following contributions:
● We manage heterogeneous memories in unikernels by mapping different memory regions of a unikernel to different host-managed namespaces. This simple-yet-efficient approach exposes the memory heterogeneity to unikernels, and thus supports heterogeneous memory allocation in unikernels.
● We design a multi-grained memory allocation mechanism in unikernels by taking advantage of both large pages and slab memory allocation. This mechanism not only reduces the cost of two-layer address translation in a virtualization environment, but also reduces memory waste due to internal memory fragmentation.
● We implement UCat based on a popular unikernel, OSv. Experimental results show that our mechanisms reduce unikernels’ memory consumption by 50% and the TLB miss rate by 41%, and improve the throughput of real-world benchmarks such as memslap and YCSB by up to 18.5% and 14.8%, respectively.
The rest of this paper is organized as follows. We present the background and motivation in Sections 2 and 3, respectively. Section 4 describes our address mapping mechanism for supporting heterogeneous memories in unikernels, and our multi-grained memory management mechanism that integrates large pages and slab memory allocation. Section 5 presents the performance evaluation of UCat. We discuss related work in Section 6 and conclude in Section 7.

2 Background

In this section, we present the necessary background about unikernels and address translations in virtualization environments.

2.1 Unikernel applications and cloud

A unikernel is an application-specialized VM image. It is constructed by linking an application with the parts of a LibOS required by the application at compile time. A unikernel can be executed as a VM on top of a hypervisor such as Xen, VMware ESXi, or KVM. Unikernels are characterized by specialization and a single address space: a unikernel is constructed for only one application, and the application and the kernel share a single memory address space.
This design principle brings many benefits. First, unikernels are very lightweight in terms of image size and runtime memory footprint because they only contain the application and the necessary parts of the LibOS. Second, unikernels are highly isolated because hypervisors such as Xen and VMware ESXi provide strong isolation between VMs. Third, unikernels offer better security because they expose a small attack surface; the absence of system calls and shells makes it much harder for attackers to subvert unikernels. Moreover, some unikernels are written in statically-typed languages that provide memory safety guarantees. For example, MirageOS [2] is written in the OCaml language. Fourth, unikernels execute efficiently because they are single-address-space and there are no context switches between the user application and the kernel. Overall, these promising features make unikernels an ideal option for deploying micro-services and serverless applications in cloud platforms. They not only improve VM density for cloud providers [8], but also benefit cloud users by reducing cost.
However, it may be quite difficult to port legacy applications from a traditional software stack to unikernels [8, 22]. First, unikernels are often constructed by linking application source code with a LibOS at compile time, but some applications’ source code may not be available due to proprietary restrictions. Second, because different unikernels are written in different programming languages, it may be difficult to refactor legacy applications written in other languages. Third, unikernels usually only support limited OS features and libraries to achieve lightweight deployment [3, 23]. If a legacy application requires system features or libraries that are not supported by the unikernel, the application must be modified or refactored.
Although unikernels have many promising features, the difficulty of porting legacy applications impedes their wide adoption in industry. One solution is to provide binary compatibility. For example, OSv [3], HermiTux [4], and X-Containers [24] provide Linux binary compatibility and allow Linux binaries to run on unikernels. Binary compatibility can significantly reduce the burden on unikernel application developers. For this reason, we implement UCat based on a popular unikernel, OSv.

2.2 Native address translation

Virtual memory is an important concept and abstraction in modern computer systems. It brings many benefits: sharing memory between processes, enhanced security due to memory isolation, and providing more memory than the available physical memory space (memory overcommitment). Most virtual memory management systems use page tables to translate a virtual address (VA) to a physical address (PA). The address translation process is usually called a page table walk. The cost of address translation is mainly determined by the number of page table levels. For x86-64 architectures, the page table has four levels. Fig.2 illustrates the native page table walk in x86-64 architectures.
Fig.2 The x86-64 native page table walk


Modern CPUs usually speed up address translation by caching the most recently accessed page table entries in the page table walk (PTW) cache. In addition, to avoid frequent page table walks, the TLB is widely used to store recent VA-to-PA translations for future use; it has become a standard component of the memory management unit (MMU). When 4 KB pages are used, a TLB miss leads to four memory references for a VA-to-PA translation. Thus, page table walking usually has a significant impact on application performance if the TLB miss rate is high.

2.3 Address translations in virtualization environments

In a virtualization environment, the address translation from VA to PA is divided into two stages. In the first stage, the guest OS performs a page table walk that translates a guest virtual address (GVA) into a guest physical address (GPA). In the second stage, the hypervisor further translates the GPA into a host physical address (HPA). There are mainly two approaches to performing this two-layer address translation: the shadow page table (SPT) and hardware-assisted address translation.
Fig.3 illustrates how the SPT works. When the SPT is used, the hypervisor maintains a separate shadow page table for each process in a VM. The SPT can directly translate a GVA into an HPA efficiently. However, when the guest page table is updated, the SPT must be modified at the same time, so modifications of the guest page table need to be tracked by the hypervisor. To monitor these modifications, the guest page table is set as read-only. As a result, any modification made by the guest OS triggers a page fault and traps into the hypervisor. This procedure is called a VM-exit, which is rather costly for VM-resident applications. The hypervisor then modifies the guest page table and updates the GVA-to-HPA mapping in the SPT. Although the SPT provides a direct mapping from GVA to HPA, it also has some limitations. First, the hypervisor needs to maintain a separate SPT for each process in a VM, resulting in non-trivial memory consumption. Second, frequent VM-exits incur significant performance overhead. Third, synchronizing guest page tables and SPTs also wastes a considerable amount of CPU resources.
Fig.3 The shadow page table walk


To improve the performance of address translation, Intel and AMD proposed the Extended Page Table (EPT) [25] and Nested Page Tables (NPT) [26], respectively. As these two technologies are similar, in the following we take Intel's EPT as an example to illustrate the address translation process. The EPT eliminates the dependency between the GVA-to-GPA translation and the GPA-to-HPA translation. If a page fault occurs in the guest OS, it can be handled directly by the guest OS without a VM-exit. After the guest OS translates the GVA into a GPA, the EPT translates the GPA into an HPA. This process is similar to native x86-64 address translation. Since only one EPT is needed per VM, the memory consumption due to address translation is greatly reduced.
Fig.4 depicts the two-dimensional EPT-supported page table walk in virtualization environments. When both the guest OS and the host OS use 4 KB pages, each level of the guest page table walk requires a full EPT walk of four memory references. In general, a TLB miss in a VM leads to (n_g + 1) × (n_e + 1) − 1 memory references, where n_g is the number of guest page table levels and n_e is the number of EPT levels. Thus, (4+1)×(4+1)−1 = 24 memory references are required to access a 4 KB page upon a TLB miss. Large pages (such as 2 MB pages) can be exploited to speed up address translation in virtualization environments. As shown in Fig.2, using 2 MB pages in a bare-metal machine removes one memory reference from the walk. When 2 MB pages are used in both the guest OS and the host OS, only (3+1)×(3+1)−1 = 15 memory references are required. Thus, 24−15 = 9 memory references are saved in the two-dimensional page table walk when the page size changes from 4 KB to 2 MB. As a result, using 2 MB large pages can significantly reduce the number of memory references for a GVA-to-HPA translation [25, 27].
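The counts above can be restated compactly as a general formula. This is only a rewriting of the numbers already given, where n_g and n_e denote the number of guest page table levels and EPT levels, respectively:

```latex
% Memory references needed to resolve a TLB miss in a two-dimensional (guest + EPT) walk.
% n_g: guest page table levels, n_e: EPT levels.
\begin{align*}
  R(n_g, n_e) &= (n_g + 1)(n_e + 1) - 1,\\
  R(4, 4) &= 5 \times 5 - 1 = 24 \quad \text{(4 KB pages in both layers)},\\
  R(3, 3) &= 4 \times 4 - 1 = 15 \quad \text{(2 MB pages in both layers)}.
\end{align*}
```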
Fig.4 The two-dimensional page table walk in virtualization environments


3 Motivation

Unikernels are a lightweight virtualization technology with smaller image size, better security, shorter startup time, and higher performance. Because of these advantages, unikernels are more promising for deploying microservices in cloud platforms than traditional VMs. First, because unikernels only contain the necessary libraries for a single application, their image size and runtime memory footprint are very small. As shown in Tab.1, the image size of a Web server built as a unikernel is only 3.5 MB, while the image built as a container is almost 3 times larger. As for VMs, their images are generally hundreds of MB or even larger. Second, the startup time of unikernels is very short. This advantage is particularly useful for fast service deployment and failure recovery. Fig.5 shows the startup time of a unikernel compared with a Docker container and a native Linux-based VM. The boot time of a native Linux VM is hundreds of times longer than that of the unikernel, and the boot time of Docker containers is also almost 10 times as long as that of unikernels. Third, because unikernels consume fewer resources, more unikernels can be deployed on a single physical server, significantly improving the cost-efficiency of cloud applications.
Tab.1 The image size of dockers and unikernels
Docker Unikernel
DNS Server 9.82 MB 3.3 MB
Web Server 10.1 MB 3.5 MB
Fig.5 The startup time of a Linux VM, a docker, and a unikernel


OSv is a typical unikernel designed for cloud platforms, enabling easy deployment and management of micro-services and serverless applications. However, challenges remain in using OSv on heterogeneous memory systems. First, since OSv is not aware of the heterogeneity of the underlying memory resources, it does not support heterogeneous memory allocation. To the best of our knowledge, no existing unikernel supports heterogeneous memory allocation. This is a major obstacle to deploying unikernel-based VMs on heterogeneous memory systems.
Second, OSv also suffers from the high performance overhead of two-layer address translation in the virtualization environment, because OSv also uses four-level page tables to translate GVAs into GPAs. Although large pages can be used in OSv to speed up address translation, they often impede lightweight memory management such as fine-grained memory allocation/migration/sharing, resulting in high latency of large page allocations, costly large page migration in heterogeneous memory systems, and internal fragmentation. When memory is overcommitted, using large pages can lead to costly page swapping and significant system performance degradation. This is the main reason why most commercial databases such as MongoDB [28] recommend disabling transparent huge pages (THP) for better performance. Therefore, it remains challenging to use large pages to reduce the cost of address translation for OSv-based VMs.
Third, OSv’s memory allocation mechanism cannot adapt to the diverse memory requirements of cloud applications, especially memory requests of different sizes. OSv’s memory allocation works in two stages. In the first stage (before SMP is enabled), OSv allocates memory directly from the free_page_ranges. In the second stage, memory is allocated according to the request size. If the request is less than or equal to 1024 bytes, memory is allocated from the malloc_pools. If the request is larger than 1024 bytes and less than 4 KB, an entire 4 KB page is allocated. If the request is larger than 4 KB, memory is allocated from the free_page_ranges. Memory requests of different sizes often result in a high cost of address translation and a waste of memory resources. For example, an entire 4 KB page is allocated for a request of 1030 bytes, wasting nearly 3 KB of memory. This implies that there is still vast room to improve memory utilization in OSv through fine-grained memory allocation.
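To make the dispatch above concrete, the following is a minimal sketch of the described size-based policy. The helper names (alloc_from_malloc_pools, alloc_page, alloc_from_free_page_ranges) are illustrative placeholders rather than OSv's actual symbols, and they are stubbed with plain malloc so that the sketch compiles and runs.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>

static void* alloc_from_malloc_pools(std::size_t size) { return std::malloc(size); }
static void* alloc_page()                               { return std::malloc(4096); }
static void* alloc_from_free_page_ranges(std::size_t s) { return std::malloc(s); }

void* osv_malloc(std::size_t size) {
    if (size <= 1024)          // small objects come from the malloc pools
        return alloc_from_malloc_pools(size);
    else if (size < 4096)      // 1 KB < size < 4 KB: an entire 4 KB page is handed out
        return alloc_page();
    else                       // larger requests come from the free page ranges
        return alloc_from_free_page_ranges(size);
}

int main() {
    void* p = osv_malloc(1030);   // a 1030-byte request occupies a full 4 KB page (~3 KB wasted)
    std::printf("allocated %p for a 1030-byte request\n", p);
    std::free(p);
    return 0;
}
```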
To address the above problems, we present UCat, a heterogeneous memory management mechanism for a typical unikernel–OSv. First, we develop an address mapping mechanism to support heterogeneous memory allocation in OSv, so that OSv can perceive the underlying heterogeneous memories. Second, we implement a heterogeneous memory management mechanism combining large pages and slab allocation. We use large pages to reduce the cost of address translation, and leverage slabs to support fine-grained memory allocation to reduce the memory consumption of OSv.

4 Design and implementation

In this section, we first give an overview of UCat, and then present the technical details of the address mapping mechanism and the multi-grained memory allocation mechanism. Finally, we integrate these two mechanisms into the open-source OSv.

4.1 Architecture overview

Fig.6 depicts the architecture of UCat. Intel Optane DC Persistent Memory Modules (PMMs) can be used in two operation modes: Memory Mode and App Direct Mode. We use App Direct Mode so that both the Optane DC PMMs and the DRAM are visible to the host OS.
Fig.6 Architecture of UCat


Although DRAM and Intel Optane Persistent Memory can be placed in different NUMA nodes in a physical host, NUMA accesses may degrade application performance [14], especially for unikernels, which usually run only a single program in the VM. Since Linux can manage NVM as a contiguous physical memory space through a pmem namespace, it is feasible to manage NVM and DRAM separately through different namespaces, as shown in Fig.6. In this way, the unikernel can simply distinguish NVM from DRAM by mapping two contiguous memory regions to the DRAM and NVM namespaces managed by the host server.
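As an illustration of how a host-side process (such as QEMU) can gain direct access to an App Direct namespace exposed as a devdax character device, the following sketch maps the device with mmap. The device path /dev/dax0.0 and the mapping size are assumptions for illustration only, and error handling is minimal.

```cpp
#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
    const char* dax_path = "/dev/dax0.0";       // assumed devdax device backing the NVM namespace
    const std::size_t region_size = 1UL << 30;  // 1 GB, for illustration only

    int fd = open(dax_path, O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    // A MAP_SHARED mapping of the device gives the process direct access to the NVM region.
    void* base = mmap(nullptr, region_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    std::printf("NVM namespace mapped at %p\n", base);
    munmap(base, region_size);
    close(fd);
    return 0;
}
```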
To use heterogeneous memories, OSv should be aware of the capacity and region of DRAM and NVM. In UCat, each OSv instance divides its main memory space into a DRAM region and an NVM region, and then passes the memory boundary (called heterogeneous_memory_boundary) to the KVM hypervisor via the quick emulator (QEMU). Once KVM obtains the memory boundary, it can map OSv’s DRAM and NVM address spaces to the corresponding namespaces. As shown in Fig.6, the DRAM region in OSv is mapped to namespace 0, and the NVM region is mapped to namespace 1. Upon a page fault, the page fault address (fault_addr) is stored in the CR2 register. By comparing the fault_addr with the heterogeneous_memory_boundary, KVM can determine the type of memory required by OSv and allocate memory from namespace 0 or namespace 1 accordingly. In addition, to reduce the two-layer address translation overhead and internal memory fragmentation, we combine large pages and different slab classes to manage the heterogeneous memories in a multi-grained manner.
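The boundary check performed by the back-end can be sketched as follows. The names (VmMemInfo, pick_namespace, heterogeneous_memory_boundary) are hypothetical stand-ins for the KVM/QEMU-side state; this is not the actual KVM code.

```cpp
#include <cstdint>
#include <cstdio>

enum class MemNamespace { Dram = 0, Nvm = 1 };

struct VmMemInfo {
    std::uint64_t heterogeneous_memory_boundary;  // guest address where the NVM region begins
};

// Decide which host namespace should back the faulting guest page.
MemNamespace pick_namespace(const VmMemInfo& vm, std::uint64_t fault_addr) {
    return (fault_addr < vm.heterogeneous_memory_boundary) ? MemNamespace::Dram
                                                           : MemNamespace::Nvm;
}

int main() {
    VmMemInfo vm{4UL << 30};  // e.g., the first 4 GB of the guest address space is DRAM
    std::printf("fault at 1 GB -> %s\n",
                pick_namespace(vm, 1UL << 30) == MemNamespace::Dram ? "namespace 0 (DRAM)"
                                                                    : "namespace 1 (NVM)");
    return 0;
}
```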

4.2 Cooperative address space mapping

To allocate DRAM and NVM inside OSv, OSv needs to perceive the heterogeneity of the underlying physical memory. A simple way is to manage the physical DRAM and NVM in two separate namespaces, and then let OSv map two memory regions to the corresponding namespaces in the host.
To support heterogeneous memory allocation in OSv, we propose a simple-yet-efficient mechanism called front-end/back-end cooperative address space mapping, as shown in Fig.7. The front-end and back-end refer to the OSv-based VM and the underlying hypervisor (QEMU/KVM), respectively. We map the DRAM region in OSv to namespace 0, and the NVM region to namespace 1. To achieve this goal, we modify the OSv startup module and the OS initialization module to support heterogeneous memory allocation during OSv booting. Specifically, the OSv startup module must be aware of the DRAM and NVM sizes. These two parameters are stored in a special location of the OSv image before the VM is started.
Fig.7 The front-end/back-end cooperative address space mapping


First, we construct a QEMU command that includes the DRAM and NVM sizes to start a VM. QEMU parses the two memory regions assigned to the VM and creates a memory boundary parameter, which is stored in the VM-related meta-data structure. At this point, the KVM hypervisor knows the memory boundary between the DRAM region and the NVM region in OSv, and maps the two memory regions to different namespaces.
Second, when the VM starts, the OSv initialization module parses the DRAM and NVM sizes from the OSv image, and stores the memory boundary as a global variable in the VM at runtime. Therefore, the memory management subsystem of OSv can recognize the type of memory requested by the application. At this point, both the front-end (OSv) and the back-end (QEMU/KVM) are aware of the memory boundary between the DRAM and NVM regions. For each memory request, KVM can allocate different physical memories to upper-layer VMs according to the memory boundary exposed by each OSv instance. As shown in Fig.7, if the virtual memory requested by OSv lies in the DRAM region, KVM allocates DRAM from namespace 0; otherwise, KVM allocates NVM from namespace 1.
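A minimal sketch of the front-end side is shown below: at boot, the initialization module reads the DRAM and NVM sizes embedded in the image and records the boundary as a global variable. The structure and names are illustrative assumptions, not OSv's actual code.

```cpp
#include <cstdint>
#include <cstdio>

struct ImageMemParams {
    std::uint64_t dram_size;  // bytes of DRAM assigned to this unikernel
    std::uint64_t nvm_size;   // bytes of NVM assigned to this unikernel
};

// End of the DRAM region; addresses at or above this boundary are treated as NVM.
static std::uint64_t g_heterogeneous_memory_boundary = 0;

void init_heterogeneous_memory(const ImageMemParams& params, std::uint64_t mem_base) {
    // DRAM occupies [mem_base, mem_base + dram_size); the NVM region follows it.
    g_heterogeneous_memory_boundary = mem_base + params.dram_size;
    (void)params.nvm_size;  // the NVM size only defines the end of the second region
}

bool request_is_dram(std::uint64_t guest_addr) {
    return guest_addr < g_heterogeneous_memory_boundary;
}

int main() {
    init_heterogeneous_memory({2UL << 30, 8UL << 30}, 0);  // 2 GB DRAM followed by 8 GB NVM
    std::printf("address 1 GB is %s\n", request_is_dram(1UL << 30) ? "DRAM" : "NVM");
    return 0;
}
```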
With the collaboration between front-end and back-end, we establish the memory address mapping between OSv and KVM in a simple yet efficient way. Our mechanism is also applicable to other unikernels to support heterogeneous memories.

4.3 Multi-grained memory allocation

Since OSv still uses four-level page tables for address translation, it suffers the high cost of two-layer address translation. On the other hand, OSv only supports a limited number of page sizes, and thus may waste a considerable amount of memory. To mitigate internal memory fragmentation, a simple approach is to split a large page into 4 KB pages when the utilization of the large page is low. Conversely, when the number of contiguous 4 KB pages exceeds a given threshold, they can be merged into a large page to reduce the cost of address translation for these small pages. Although large page splitting and promotion [18, 27, 29, 30] have been explored for decades, there is unfortunately no practical implementation in modern systems because of the high complexity in software/hardware.
Our multi-grained memory allocation mechanism takes advantage of both large pages and slab allocation to speed up the GVA-to-HPA address translation while retaining a low degree of internal memory fragmentation. Since there are two layers of address translation in a virtualization environment, a TLB miss can result in 24 memory references during the two-layer page table walk when both VMs and the host OS use 4 KB pages. In UCat, we use 2 MB large pages to reduce the cost of two-layer address translation. On the one hand, large pages significantly improve TLB coverage, and thus reduce the TLB miss rate. On the other hand, using large pages further reduces the number of memory references upon a TLB miss in OSv from 24 to 15 (a 37.5% reduction).
Moreover, to reduce the waste of memory resources, we exploit slab allocation to support fine-grained memory allocation. Specifically, we logically split 2 MB large pages into different slab classes (2^0, 2^1, ..., 2^20 bytes) to satisfy multi-grained memory requests. As shown in Fig.8, for slab class t, a 2 MB large page is split into multiple slabs of 2^t bytes. At the beginning of each slab class, a slab header records the address offset of each slab. In general, a 2 MB large page can satisfy (2 MB − h)/2^t memory requests, where 2^t denotes the size of a slab and h denotes the space consumed by the slab header. We note that the address translation of these slabs still relies on the TLB entries and page tables of large pages. After the large page address of a slab is translated, the slab allocator first consults the slab header to get the address offset of the requested slab, and then accesses the data. In this way, UCat can manage memory in fine-grained slab classes while still benefiting from the performance advantages of large pages for address translation.
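The capacity computation above can be illustrated with a short sketch; the header size and slab class chosen here are assumptions for illustration only.

```cpp
#include <cstdint>
#include <cstdio>

constexpr std::uint64_t LARGE_PAGE = 2UL * 1024 * 1024;  // 2 MB

// Number of 2^t-byte slabs that fit in one large page, given a header of header_bytes bytes.
std::uint64_t slabs_per_large_page(unsigned t, std::uint64_t header_bytes) {
    return (LARGE_PAGE - header_bytes) / (1UL << t);
}

int main() {
    // Example: 2 KB slabs (t = 11) with an assumed 4 KB header leave room for 1022 slabs.
    unsigned t = 11;
    std::uint64_t h = 4096;
    std::printf("slab class 2^%u: %llu slabs per 2 MB page\n",
                t, (unsigned long long)slabs_per_large_page(t, h));
    return 0;
}
```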
Fig.8 Multi-grained memory allocation using large pages and slabs


We modify the memory allocation strategy in OSv to support fine-grained memory allocation. When a memory request is greater than or equal to 2 MB, UCat directly allocates memory from the free_page_ranges. For memory requests smaller than 2 MB, we use the slab allocator to find the best-fit slab class, and then return the address offset of an available slab to the object. Since UCat initializes the different slab classes when an OSv-based VM is created, the slab allocator can always find the slab class that best fits the memory request with the least memory waste. If a free slab is available in the chosen slab class, the memory allocation completes quickly. Otherwise, UCat first requests a large page from the free_page_ranges, and then appends it to the existing slab class. On the other hand, if a slab class contains a large number of free slabs for a long time, its memory can be reclaimed and reassigned to another slab class with few free slabs.
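The following sketch illustrates the allocation path described above under the assumption of a simple free-list-per-class design; the helper names are placeholders rather than UCat's actual symbols, and the slab-header bookkeeping is omitted.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <vector>

constexpr std::size_t LARGE_PAGE = 2UL * 1024 * 1024;  // 2 MB

// Placeholder for OSv's free_page_ranges: hand out large-page-aligned memory.
static void* alloc_from_free_page_ranges(std::size_t size) {
    std::size_t rounded = (size + LARGE_PAGE - 1) / LARGE_PAGE * LARGE_PAGE;
    return std::aligned_alloc(LARGE_PAGE, rounded);
}

struct SlabClass {
    std::size_t slab_size;          // 2^t bytes
    std::vector<void*> free_slabs;  // free list of slabs carved from large pages

    void refill_from_large_page() { // slow path: carve a new 2 MB page into slabs
        char* page = static_cast<char*>(alloc_from_free_page_ranges(LARGE_PAGE));
        for (std::size_t off = 0; off + slab_size <= LARGE_PAGE; off += slab_size)
            free_slabs.push_back(page + off);  // slab-header bookkeeping omitted
    }

    void* alloc() {
        if (free_slabs.empty()) refill_from_large_page();
        void* slab = free_slabs.back();  // fast path: reuse a free slab
        free_slabs.pop_back();
        return slab;
    }
};

// Requests of at least 2 MB bypass the slab allocator; smaller ones use the best-fit class.
void* ucat_malloc(std::vector<SlabClass>& classes, std::size_t size) {
    if (size >= LARGE_PAGE) return alloc_from_free_page_ranges(size);
    for (auto& c : classes)     // classes are assumed sorted by slab_size
        if (c.slab_size >= size) return c.alloc();
    return nullptr;             // unreachable if the classes cover sizes up to 2 MB
}

int main() {
    std::vector<SlabClass> classes = {{1024, {}}, {2048, {}}};
    void* p = ucat_malloc(classes, 1300);  // a 1300-byte request gets a 2 KB slab
    std::printf("1300-byte request served by a 2 KB slab at %p\n", p);
    return 0;
}
```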

4.4 Work flow of memory allocation

Since there is a significant performance gap between DRAM and NVM, we bind the memory used by the OSv kernel to DRAM for high performance. Because applications may have a very large memory footprint, we allocate DRAM and NVM alternately to applications according to a pre-defined ratio. Since a unikernel is built by compiling the application source code together with the LibOS, the programmer can specify this ratio before compilation. However, the value of the ratio is usually hard to determine in practice because the memory usage of an application may change dynamically at runtime. Thus, we simply allocate DRAM and NVM to the application alternately according to the DRAM-to-NVM ratio configured in the unikernel. The guest kernel still uses the vanilla memory allocation mechanism of OSv, while the application uses our multi-grained memory allocation mechanism for higher performance and less memory waste. Fig.9 depicts the execution flow of our memory allocation mechanism in UCat. For the application, if the requested memory can be served by a slab class, a free slab is returned immediately, so the latency of slab allocation is very low. However, if the requested slab class is not available or has no free space, the slab allocator first determines which kind of memory should be allocated according to the memory boundary, then requests a 2 MB large page, and finally constructs the slab-related data structures for allocation. This slow path can lead to a rather long tail latency of memory allocation.
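A minimal sketch of the ratio-based placement policy is given below; the round-robin counter and the 1:4 ratio are illustrative assumptions rather than UCat's actual implementation.

```cpp
#include <cstdio>

enum class MemType { Dram, Nvm };

// Kernel allocations are pinned to DRAM; application allocations alternate between DRAM
// and NVM in a round-robin fashion following a pre-defined DRAM-to-NVM ratio.
struct RatioPolicy {
    unsigned dram_share;   // e.g., 1
    unsigned nvm_share;    // e.g., 4  -> a DRAM:NVM ratio of 1:4
    unsigned counter = 0;

    MemType next(bool is_kernel_request) {
        if (is_kernel_request) return MemType::Dram;           // kernel memory stays in DRAM
        unsigned slot = counter++ % (dram_share + nvm_share);  // round-robin over the ratio
        return (slot < dram_share) ? MemType::Dram : MemType::Nvm;
    }
};

int main() {
    RatioPolicy policy{1, 4};  // 1:4 DRAM-to-NVM ratio, chosen only for illustration
    for (int i = 0; i < 5; ++i)
        std::printf("app allocation %d -> %s\n", i,
                    policy.next(false) == MemType::Dram ? "DRAM" : "NVM");
    return 0;
}
```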
Fig.9 The flow chart of memory allocation mechanism in UCat


5 Evaluation

In this section, we first verify the effectiveness of our address mapping mechanism, and then evaluate the performance of the multi-grained memory allocation mechanism using real-world workloads.

5.1 Methodology

Experiment setup In our experiments, the host server is equipped with dual-socket Intel Xeon Gold 6230 2.10 GHz 20-core processors, 128 GB of DRAM, and 1 TB of Intel Optane DC Persistent Memory Modules (DCPMMs). We organize DRAM and DCPMM into two namespaces using the ndctl and daxctl tools [31, 32], and use the Optane DCPMMs in App Direct Mode. The host server runs Ubuntu 19.10, the KVM hypervisor with Linux kernel version 5.1.1, and QEMU 2.11.50. The guest VM is a unikernel running a modified OSv 0.54.
We compare our multi-grained memory allocation mechanism with OSv's memory management mechanism. Unless specified otherwise, both are deployed on the host with the same settings and run Memcached [33], a high-performance, distributed in-memory key-value object caching system for accelerating database queries.
Real-world workloads We use the memslap [34] and YCSB [35] benchmarks to evaluate the performance of Memcached deployed in OSv. Memslap is a load generation and benchmarking tool for Memcached. It can generate different workloads with configurable parameters such as the number of threads, the concurrency, and the execution time. Unless specified otherwise, we use the default parameter settings; for example, the ratio of gets to sets is 9:1, and the task window size of each concurrency is 10 K. YCSB is a general-purpose performance benchmark that can evaluate the performance of various non-relational databases. YCSB includes six types of workloads representing different application scenarios, as shown in Tab.2.
Tab.2 Description of YCSB workloads
Workload Description Ratio of operations
A Update heavy 50% reads and 50% writes
B Read mostly 95% reads and 5% writes
C Read only 100% reads
D Read latest workload 95% reads and 5% inserts
E Range-query intensive 95% scans and 5% inserts
F Read-modify-write 50% reads and 50% read-modify-writes
Micro-benchmarks We develop a micro-benchmark to evaluate the performance of OSv and UCat in detail. The workload of this application is to allocate memory first, update the memory, and finally release the memory. We can configure the number of memory allocations and the size of each memory request to generate different memory access intensities.
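For reference, the following is a minimal host-side version of the allocate/update/release workload shape described above; it is a sketch of the benchmark's structure, not the exact program used in the evaluation, and its default parameters are illustrative.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>

int main(int argc, char** argv) {
    long n           = (argc > 1) ? std::atol(argv[1]) : 100000;  // number of allocations
    std::size_t size = (argc > 2) ? static_cast<std::size_t>(std::atol(argv[2])) : 1300;

    std::vector<char*> blocks;
    blocks.reserve(static_cast<std::size_t>(n));

    for (long i = 0; i < n; ++i)            // phase 1: allocate
        blocks.push_back(static_cast<char*>(std::malloc(size)));
    for (char* b : blocks)                  // phase 2: update (touch every block)
        if (b) std::memset(b, 0xAB, size);
    for (char* b : blocks)                  // phase 3: release
        std::free(b);

    std::printf("done: %ld allocations of %zu bytes\n", n, size);
    return 0;
}
```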

5.2 Validation of the address mapping mechanism

To validate the effectiveness of our front-end/back-end address mapping mechanism, we place the data of Memcached in DRAM and NVM with different ratios, and run the memslap benchmark to evaluate the performance. We also evaluate DRAM-only and NVM-only configurations for reference. We use the UDP protocol for network transmission, and set the number of threads to 8, the concurrency to 128, and the total execution time to 30 seconds. Fig.10 shows the transactions per second (TPS) under different DRAM-to-NVM allocation ratios, all normalized to the NVM-only configuration. When the proportion of data placed in DRAM increases (from 1:16 to 1:1), the throughput of OSv becomes higher accordingly because the performance of DRAM is much higher than that of NVM. These results demonstrate that the heterogeneous memory resources are visible to the application, and that our address mapping mechanism effectively maps the virtual DRAM/NVM regions to different host namespaces.
Fig.10 The throughput of OSv relative to the NVM-only configuration


5.3 Performance of the multi-grained memory allocation

In the following, we conduct a set of experiments to evaluate the performance of our multi-grained memory allocation mechanism.

5.3.1 Execution time reduction

We compare UCat to OSv with the address mapping mechanism enabled (called OSv-plus) under different DRAM-to-NVM allocation ratios. For each experiment, we run the micro-benchmark to allocate/update/free memory in different locations 1 million times. First, we compare UCat with OSv-plus using 4 KB pages. Fig.11 shows that our multi-grained memory management mechanism reduces the execution time by 38.8% on average compared with OSv-plus under the same DRAM-to-NVM allocation ratio. The reason is that using large pages significantly reduces the TLB miss rate and the number of memory accesses due to page table walking. Moreover, compared with OSv's memory allocation mechanism, our multi-grained memory allocation mechanism also reduces the latency of memory allocation because more free slabs are available in each slab class, so memory is not allocated from the free_page_ranges frequently. We also compare UCat with OSv-plus using 2 MB large pages. We find that UCat shows similar performance to OSv-plus in all cases because both memory allocation schemes fully exploit the benefit of large pages for accelerating address translation. However, OSv-plus using only 2 MB large pages wastes a large amount of memory, as shown in the following section. These results demonstrate the efficiency of our multi-grained memory management mechanism.
Fig.11 The execution time reduced by our multi-grained memory management mechanism, all relative to OSv-plus


5.3.2 Memory saving

In this section, we construct a micro-benchmark to illustrate the effectiveness of our memory management mechanism in saving memory. The micro-benchmark allocates 1300 B memory blocks 5.12 million times. For memory requests larger than 1 KB but smaller than or equal to 4 KB, OSv's memory allocator directly allocates an entire 4 KB page. Therefore, each such request consumes 4 KB of memory, and 5.12 million memory requests need 19.5 GB of memory. In our experiments, OSv actually consumes 19.7 GB of memory in total, including 0.2 GB used for the kernel and application code. In contrast, UCat only needs 9.8 GB of memory to run the application. Ideally, if internal memory fragmentation is not considered, only 6.2 GB of memory is required for the application. Tab.3 shows the memory consumed and wasted by OSv and UCat compared with the ideal case. Our mechanism reduces memory consumption by 50% compared with OSv, and reduces the memory waste by 2.75× (13.5 GB vs. 3.6 GB). The reason is that UCat splits large pages into finer-grained slabs and returns the slab that best fits the memory request, thus significantly reducing memory consumption.
Tab.3 Memory consumed and wasted compared with the ideal condition
Mechanism Memory consumed Memory wasted
OSv 19.7 GB 13.5 GB
UCat 9.8 GB 3.6 GB
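The figures in Tab.3 follow from simple arithmetic (0.2 GB of the measured 19.7 GB is kernel and application code):

```latex
% Memory demand of 5.12 million 1300-byte requests.
\begin{align*}
  \text{OSv (one 4 KB page per request):}\quad & 5.12\times 10^{6} \times 4\,\text{KB} \approx 19.5\,\text{GB},\\
  \text{Ideal (exact request size):}\quad & 5.12\times 10^{6} \times 1300\,\text{B} \approx 6.2\,\text{GB},\\
  \text{Waste:}\quad & 19.7 - 6.2 = 13.5\,\text{GB (OSv)}, \qquad 9.8 - 6.2 = 3.6\,\text{GB (UCat)}.
\end{align*}
```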

5.3.3 TLB miss rate

We run memslap and YCSB to evaluate the TLB miss rate of Memcached in OSv. We use the performance profiling tool perf in the host Linux to count the TLB misses of OSv. For memslap, we set the number of threads to 4, the concurrency to 256, and the total execution time to 30 seconds. For YCSB, we use workload A and set the field length to 128. Fig.12 shows that UCat reduces the TLB miss rate of memslap and YCSB by 41% and 28%, respectively, compared with OSv. This is because one TLB entry covers a much wider address space when large pages are used, which significantly reduces the TLB miss rate of applications.
Fig.12 The TLB miss rate reduced by UCat, all relative to OSv


5.3.4 Memslap evaluation

To compare the memory management mechanism of UCat with that of OSv, we run the memslap benchmark using both the TCP and UDP communication protocols. For fairness, we run both UCat and the vanilla OSv using DRAM only. We set different combinations of parameters in our experiments. Fig.13 shows the TPS for different values of threads and concurrency, all normalized to OSv with the same settings. For example, the label “4+64” on the X-axis means that the number of threads is 4 and the concurrency is 64. For UDP communication, UCat improves TPS by up to 18.5%, and by 15.4% on average. The throughput for TCP communication is slightly lower than that for UDP communication when the number of threads becomes larger. Overall, the workload throughput of UCat is always higher than that of OSv for every configuration. The performance improvement is mainly attributed to using large pages for address translation: they not only reduce the TLB miss rate, but also reduce the total number of memory references due to page table walking in virtualization environments.
Fig.13 TPS improvement of UCat compared with OSv


5.3.5 YCSB evaluation

We also evaluate the performance of UCat and OSv by running the six YCSB workloads with different field-length parameters. The data requests of workloads A, B, C, E, and F all follow a Zipfian distribution, while workload D follows the latest distribution. Similarly, we run both UCat and the vanilla OSv using DRAM only for fairness. Fig.14 shows the throughput improvement of UCat, all relative to OSv's vanilla memory management mechanism with the same field-length setting. UCat improves the workload throughput by up to 14.9%. For different KV sizes, reflected by the field-length parameter, the performance of different workloads is similar. This implies that our multi-grained memory management mechanism is able to fully take advantage of large pages for accelerating address translation, regardless of the size of memory allocations.
Fig.14 The throughput of YCSB workloads improved by UCat, all relative to the vanilla OSv


6 Related work

In recent years, a number of unikernel projects have emerged. Rumprun [36], OSv [3], and HermiTux [4] mainly focus on compatibility with legacy applications. Rumprun is based on the NetBSD rump kernel and can run existing unmodified POSIX-compliant software as a unikernel. OSv is designed for cloud applications; it is able to run unmodified Linux applications, but the application source code must be recompiled as a relocatable shared object. HermiTux is also able to run unmodified Linux binaries as unikernels. ClickOS [6] is a high-performance software middlebox particularly designed for network function virtualization. However, most of these unikernels require the application source code to be available; otherwise, the application cannot be recompiled and re-linked against them.
Some unikernels only support specific programming languages. MirageOS [2] is the first unikernel written in OCaml; it is designed for secure, high-performance network applications. HaLVM [37] is a port of the Glasgow Haskell compiler tool suite that enables Haskell programmers to develop unikernel-based virtual machines. Clive [38] is a unikernel implemented in the Go language, and applications in Clive should be compiled with a special compiler. These unikernel projects do not support the most popular programming languages such as C/C++/Java, which limits their applicability for cloud application deployment.
There have been a few recent studies on heterogeneous memory management for virtualization environments. HMvisor [39] exploits a NUMA abstraction to construct two virtual NUMA nodes in VMs, and dynamically maps the virtual NUMA nodes to the physical NUMA nodes managed by the KVM hypervisor. In this way, KVM exposes the underlying heterogeneous memory resources to upper-layer VMs in a simple-yet-efficient way. Unfortunately, it is very difficult to support the NUMA mechanism in OSv. HeteroOS [40] exploits VM-level memory access behaviors to guide data placement on DRAM/NVM heterogeneous memories in virtualization environments. RAMinate [41] and HeteroVisor [42] support dynamic heterogeneous memory management for VMs. They achieve page migration in the hypervisor layer without exposing the memory heterogeneity to VMs. However, they need to suspend VMs during page migrations because VMs are not aware of the in-flight pages. The above studies all focus on supporting heterogeneous memory management for VMs. To the best of our knowledge, there has been little work on heterogeneous memory management for unikernels. UCat is the first to support heterogeneous memories and multi-grained memory allocation in unikernels.

7 Conclusion and future work

In this paper, we present UCat, a heterogeneous memory management system for unikernels. We implement UCat based on a popular unikernel, OSv. First, we propose front-end/back-end cooperative address mapping to expose the host memory heterogeneity to unikernels. Second, we implement a multi-grained memory allocation mechanism that takes advantage of both large pages and slab allocation. It not only reduces the cost of two-layer address translation in virtualization environments, but also reduces memory waste due to internal memory fragmentation. Our experimental results demonstrate the effectiveness and efficiency of our mechanisms for heterogeneous memory management in unikernels. The proposed mechanisms are also applicable to other virtualized appliances such as VMs and containers.
In the future, we will optimize our prototype to support more flexible slab memory allocation, allowing the size of slabs and the number of slab classes to adapt to an application's memory access behavior. Moreover, we will enable fast (zero-copy) in-memory checkpointing for unikernels by exploiting the non-volatility of persistent memory. This functionality may open up new research opportunities for unikernels, such as Function-as-a-Service (FaaS) in the future cloud computing paradigm.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. 62072198, 61732010, 61825202, and 62032008).
1
Yuan P F, Guo Y, Zhang L, Chen X Q, Mei H . Building application-specific operating systems: a profile-guided approach. Science China Information Sciences, 2018, 61( 9): 092102

2
Madhavapeddy A, Mortier R, Rotsos C, Scott D, Singh B, Gazagnaire T, Smith S, Hand S, Crowcroft J . Unikernels: library operating systems for the cloud. ACM SIGARCH Computer Architecture News, 2013, 41( 1): 461– 472

3
Kivity A, Laor D, Costa G, Enberg P, Har’El N, Marti D, Zolotarov V. Osv—optimizing the operating system for virtual machines. In: Proceedings of USENIX ATC ’14: 2014 USENIX Annual Technical Conference. 2014, 61– 72

4
Olivier P, Chiba D, Lankes S, Min C, Ravindran B. A binary-compatible unikernel. In: Proceedings of the 15th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments. 2019, 59– 73

5
Merkel D . Docker: lightweight Linux containers for consistent development and deployment. Linux Journal, 2014, 2014( 239): 2

6
Martins J, Ahmed M, Raiciu C, Olteanu V, Honda M, Bifulco R, Huici F. Clickos and the art of network function virtualization. In: Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation. 2014, 459– 473

7
Cozzolino V, Ding A Y, Ott J. FADES: fine-grained edge offloading with unikernels. In: Proceedings of the Workshop on Hot Topics in Container Networking and Networked Systems. 2017, 36– 41

8
Manco F, Lupu C, Schmidt F, Mendes J, Kuenzer S, Sati S, Yasukata K, Raiciu C, Huici F. My VM is lighter (and safer) than your container. In: Proceedings of the 26th Symposium on Operating Systems Principles. 2017, 218– 233

9
Dragoni N, Giallorenzo S, Lafuente A L, Mazzara M, Montesi F, Mustafin R, Safina L. Microservices: yesterday, today, and tomorrow. In: Mazzara M, Meyer B, eds. Present and Ulterior Software Engineering. Cham: Springer, 2017, 195− 216

10
Duncan B, Happe A, Bratterud A. Enterprise IoT security and scalability: how unikernels can improve the status Quo. In: Proceedings of the 9th International Conference on Utility and Cloud Computing. 2016, 292– 297

11
Tan B, Liu H K, Rao J, Liao X F, Jin H, Zhang Y. Towards lightweight serverless computing via unikernel as a function. In: Proceedings of the 28th IEEE/ACM International Symposium on Quality of Service. 2020, 1– 10

12
Fingler H, Akshintala A, Rossbach C J. USETL: unikernels for serverless extract transform and load why should you settle for less? In: Proceedings of the 10th ACM SIGOPS Asia-Pacific Workshop on Systems. 2019, 23– 30

13
Zilberberg O, Weiss S, Toledo S . Phase-change memory: an architectural perspective. ACM Computing Surveys, 2013, 45( 3): 29

14
Yang J, Kim J, Hoseinzadeh M, Izraelevitz J, Swanson S. An empirical guide to the behavior and use of scalable persistent memory. In: Proceedings of the 18th USENIX Conference on File and Storage Technologies. 2020, 169– 182

15
Wu S, Zhou F, Gao X, Jin H, Ren J L . Dual-page checkpointing: An architectural approach to efficient data persistence for in-memory applications. ACM Transactions on Architecture and Code Optimization, 2018, 15( 4): 57

16
Chen T T, Liu H K, Liao X F, Jin H . Resource abstraction and data placement for distributed hybrid memory pool. Frontiers of Computer Science, 2021, 15( 3): 153103

17
Cai M, Huang H . A survey of operating system support for persistent memory. Frontiers of Computer Science, 2021, 15( 4): 154207

18
Wang X Y, Liu H K, Liao X F, Chen J, Jin H, Zhang Y, Zheng L, He B S, Jiang S . Supporting superpages and lightweight page migration in hybrid memory systems. ACM Transactions on Architecture and Code Optimization, 2019, 16( 2): 11

19
Barr T W, Cox A L, Rixner S . Translation caching: skip, don’t walk (the page table). ACM SIGARCH Computer Architecture News, 2010, 38( 3): 48– 59

20
Basu A, Gandhi J, Chang J C, Hill M D, Swift M M. Efficient virtual memory for big memory servers. In: Proceedings of the 40th Annual International Symposium on Computer Architecture. 2013, 237– 248

21
Du Y, Zhou M, Childers B R, Mossé D, Melhem R. Supporting superpages in non-contiguous physical memory. In: Proceedings of the 21st IEEE International Symposium on High Performance Computer Architecture. 2015, 223– 234

22
Schmidt F. Uniprof: a unikernel stack profiler. In: Proceedings of the SIGCOMM Posters and Demos. 2017, 31– 33

23
Bratterud A, Walla A A, Haugerud H, Engelstad P E, Begnum K. Includeos: a minimal, resource efficient unikernel for cloud services. In: Proceedings of the 7th IEEE International Conference on Cloud Computing Technology and Science. 2015, 250– 257

24
Shen Z M, Sun Z, Sela G E, Bagdasaryan E, Delimitrou C, Van Renesse R, Weatherspoon H. X-containers: breaking down barriers to improve performance and isolation of cloud-native containers. In: Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems. 2019, 121– 135

25
Merrifield T, Taheri H R. Performance implications of extended page tables on virtualized x86 processors. In: Proceedings of the 12th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments. 2016, 25– 35

26
Bhargava R, Serebrin B, Spadini F, Manne S. Accelerating two-dimensional page walks for virtualized systems. In: Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems. 2008, 26– 35

27
Guo F, Kim S, Baskakov Y, Banerjee I. Proactively breaking large pages to improve memory overcommitment performance in VMware ESXi. In: Proceedings of the 11th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments. 2015, 39– 51

28
Chodorow K. MongoDB: the Definitive Guide: Powerful and Scalable data Storage. 2nd ed. Sebastopol, California, USA, O’Reilly Media, Inc., 2013

29
Pham B, Veselý J, Loh G H, Bhattacharjee A. Large pages and lightweight memory management in virtualized environments: Can you have it both ways? In: Proceedings of the 48th International Symposium on Microarchitecture. 2015, 1– 12

30
Kwon Y, Yu H C, Peter S, Rossbach C J, Witchel E. Coordinated and efficient huge page management with ingens. In: Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation. 2016, 705– 721

31
Intel. ndctl. See Github.com/pmem/ndctl, 2020

32
Intel. ipmctl. See Github.com/intel/ipmctl, Dec. 03, 2020

33
Fitzpatrick B. Distributed caching with memcached. Linux Journal, 2004, 2004(124): 1– 5

34
Aker B. Memslap–load testing and benchmarking a server. See Docs.libmemcached.org/bin/memslap, 2013

35
Cooper B F, Silberstein A, Tam E, Ramakrishnan R, Sears R. Benchmarking cloud serving systems with YCSB. In: Proceedings of the 1st ACM Symposium on Cloud Computing. 2010, 143– 154

36
Kantee A. Rump file systems: Kernel code reborn. In: Proceedings of the USENIX Annual Technical Conference. 2009, 1– 14

37
Stengel K, Schmaus F, Kapitza R. EsseOS: haskell-based tailored services for the cloud. In: Proceedings of the 12th International Workshop on Adaptive and Reflective Middleware. 2013, 4: 1– 6

38
Ballesteros F J . Structured I/O streams in Clive: a toolbox approach for wide area network computing. Journal of Internet Services and Applications, 2017, 8( 1): 1–16

39
Yang D, Liu H K, Jin H, Zhang Y . HMvisor: dynamic hybrid memory management for virtual machines. Science China Information Sciences, 2021, 64( 9): 1−16

40
Kannan S, Gavrilovska A, Gupta V, Schwan K. HeteroOS: OS design for heterogeneous memory management in datacenter. In: Proceedings of the 44th Annual International Symposium on Computer Architecture. 2017, 521– 534

41
Hirofuchi T, Takano R. RAMinate: hypervisor-based virtualization for hybrid main memory systems. In: Proceedings of the 7th ACM Symposium on Cloud Computing. 2016, 112– 125

42
Gupta V, Lee M, Schwan K. HeteroVisor: Exploiting resource heterogeneity to enhance the elasticity of cloud platforms. In: Proceedings of the 11th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments. 2015, 79– 92
