Compared with traditional solid-state drives (SSDs), open-channel SSDs (OCSSDs) expose their internal physical layout and rely on a host-based flash translation layer (FTL), allowing host-side software to control internal operations such as garbage collection (GC) and input/output (I/O) scheduling. In this paper, we comprehensively survey recent research built on OCSSDs. We show how these works leverage the features of OCSSDs to achieve high throughput, low latency, long lifetime, strong performance isolation, and high resource utilization. We categorize these efforts into five groups based on their optimization methods: adaptive interface customizing, rich FTL co-designing, internal parallelism exploiting, rational I/O scheduling, and efficient GC processing. We discuss the strengths and weaknesses of these efforts and find that almost all of them face a trade-off between performance effectiveness and management complexity. We hope that this survey provides fundamental knowledge for researchers entering this field and inspires new ideas for the development of OCSSDs.
Recently, solid-state drives (SSDs) have been adopted in a wide range of emerging data processing systems. Essentially, an SSD is a complex embedded system that involves both hardware and software design. For the latter, firmware modules such as the flash translation layer (FTL) orchestrate internal operations and flash management, and are crucial to the overall input/output performance of an SSD. Despite the rapid development of new SSD features in the market, research on flash firmware has mostly relied on simulation owing to the lack of a realistic and extensible SSD development platform. In this paper, we propose SoftSSD, a software-oriented SSD development platform for rapid flash firmware prototyping. The core of SoftSSD is a novel framework with an event-driven programming model. With this programming model, new FTL algorithms can be implemented and integrated into full-featured flash firmware in a straightforward way. The resulting flash firmware can be deployed and evaluated on a hardware development board, which can be connected to a host system via peripheral component interconnect express (PCIe) and serve as a normal non-volatile memory express (NVMe) SSD. Unlike existing hardware-oriented development platforms, SoftSSD implements the majority of SSD components (e.g., the host interface controller) in software, so that data flows and internal states once confined to hardware can now be examined with a software debugger, providing the observability and extensibility that are critical to the rapid prototyping and research of flash firmware. We describe the programming model and hardware design of SoftSSD. We also perform experiments with real application workloads on a prototype board to demonstrate the performance and usefulness of SoftSSD, and release the open-source code of SoftSSD for public access.
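To make the event-driven programming model more concrete, the following is a minimal, self-contained Python sketch of an event loop driving a page-mapping FTL. It only illustrates the general style described above; the names (Event, PageMappingFTL, on_event) are hypothetical and do not reflect SoftSSD's actual API, and garbage collection is omitted.

```python
# Minimal sketch of an event-driven page-mapping FTL. All names are hypothetical
# and not SoftSSD's actual API; GC and error handling are omitted.
from collections import deque
from dataclasses import dataclass

@dataclass
class Event:
    kind: str          # "host_write" or "host_read"
    lpn: int           # logical page number
    data: bytes = b""

class PageMappingFTL:
    def __init__(self, pages_per_block: int, num_blocks: int):
        self.pages_per_block = pages_per_block
        self.free_blocks = deque(range(num_blocks))
        self.l2p = {}                  # logical page -> (block, page)
        self.active = self.free_blocks.popleft()
        self.next_page = 0
        self.flash = {}                # (block, page) -> data, stands in for NAND

    def on_event(self, ev: Event):
        if ev.kind == "host_write":
            if self.next_page == self.pages_per_block:    # active block is full
                self.active = self.free_blocks.popleft()  # (GC omitted in this sketch)
                self.next_page = 0
            ppa = (self.active, self.next_page)
            self.next_page += 1
            self.flash[ppa] = ev.data
            self.l2p[ev.lpn] = ppa                        # out-of-place update
        elif ev.kind == "host_read":
            return self.flash.get(self.l2p.get(ev.lpn))

# Event loop: the host interface enqueues events, the FTL consumes them one by one.
ftl = PageMappingFTL(pages_per_block=4, num_blocks=8)
queue = deque([Event("host_write", lpn=1, data=b"A"), Event("host_read", lpn=1)])
while queue:
    print(ftl.on_event(queue.popleft()))
```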
The emergence of new hardware, including persistent memory and the smart network interface card (SmartNIC), has brought new opportunities to file system design. In this paper, we design and implement a new file system named NICFS based on persistent memory and SmartNIC. We divide the file system into two parts: the front end and the back end. In the front end, data writes are appended to persistent memory in a log-structured manner, leveraging the fast persistence of persistent memory. In the back end, the logged data are fetched, processed, and patched into files in the background, leveraging the processing capacity of the SmartNIC. Evaluation results show that NICFS outperforms Ext4 by about 21%/10% and about 19%/50% on large and small reads/writes, respectively.
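The split between the fast log-append front end and the background patching back end can be sketched as below. This is a rough illustration under simplifying assumptions: an in-memory queue stands in for the persistent-memory log, a Python thread stands in for the SmartNIC, and all names are made up rather than taken from NICFS.

```python
# Sketch of a split write path: the front end appends to a log and returns quickly,
# while a background worker drains the log and patches the target files.
import threading, queue

pm_log = queue.Queue()           # stands in for the persistent-memory log
files = {}                       # path -> bytearray, stands in for on-disk files

def front_end_write(path: str, offset: int, data: bytes):
    # Fast path: persist the write record by appending it to the log and return.
    pm_log.put((path, offset, data))

def back_end_worker():
    # Slow path: fetch log records and patch them into files in the background.
    while True:
        path, offset, data = pm_log.get()
        buf = files.setdefault(path, bytearray())
        if len(buf) < offset + len(data):
            buf.extend(b"\0" * (offset + len(data) - len(buf)))
        buf[offset:offset + len(data)] = data
        pm_log.task_done()

threading.Thread(target=back_end_worker, daemon=True).start()
front_end_write("/a.txt", 0, b"hello")
pm_log.join()                    # wait until the back end has applied the write
print(files["/a.txt"])
```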
Persistent memory (PM) file systems have been developed to achieve high performance by exploiting the advanced features of PMs, including nonvolatility, byte addressability, and dynamic random access memory (DRAM)-like performance. Unfortunately, these PMs suffer from limited write endurance, and the existing space management strategies of PM file systems can induce a severely unbalanced wear problem, which can damage the underlying PMs quickly. In this paper, we propose a Wear-leveling-aware Multi-grained Allocator, called WMAlloc, to achieve wear leveling of PMs while improving the performance of file systems. WMAlloc adopts multiple min-heaps to manage the unused space of PMs, with each heap representing an allocation granularity. WMAlloc then serves allocation requests with less-worn blocks from the corresponding min-heap. Moreover, to avoid recursive splitting and inefficient heap location in WMAlloc, we further propose a bitmap-based multi-heap tree (BMT) to enhance WMAlloc, namely, WMAlloc-BMT. We implement WMAlloc and WMAlloc-BMT in the Linux kernel based on NOVA, a typical PM file system. Experimental results show that, compared with the original NOVA and dynamic wear-aware range management (DWARM), the state-of-the-art wear-leveling-aware allocator for PM file systems, WMAlloc achieves 4.11× and 1.81× reductions in the maximum write count and 1.02× and 1.64× performance, respectively, on average over four workloads. Furthermore, WMAlloc-BMT outperforms WMAlloc, achieving 1.08× performance and a 1.17× reduction in the maximum write count on average over the four workloads.
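A minimal sketch of the multi-min-heap idea follows: one min-heap per allocation granularity, ordered by wear (write count), so that each request is served by the least-worn free block of the matching granularity. The granularities and wear counters are illustrative assumptions, and the BMT enhancement is not modeled.

```python
# Sketch of wear-aware allocation with one min-heap per granularity.
# Granularities and wear values are made up for illustration.
import heapq

GRANULARITIES = [1, 4, 16]                   # block sizes in 4 KiB units (illustrative)
heaps = {g: [] for g in GRANULARITIES}       # granularity -> min-heap of (wear, block_id)

def free(block_id: int, granularity: int, wear: int):
    # Return a block to the free pool of its granularity, keyed by its wear count.
    heapq.heappush(heaps[granularity], (wear, block_id))

def allocate(granularity: int):
    # Pop the least-worn free block of the requested granularity.
    wear, block_id = heapq.heappop(heaps[granularity])
    return block_id, wear + 1                # caller records the incremented wear

# Populate and allocate: the least-worn block (id 7, wear 2) is chosen first.
for bid, wear in [(3, 9), (7, 2), (11, 5)]:
    free(bid, 4, wear)
print(allocate(4))
```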
Extendible hashing is an effective way to manage increasingly large file system metadata, but it suffers from low concurrency and a lack of optimization for non-volatile memory (NVM). In this paper, a multilevel hash directory based on lazy expansion is designed to improve the concurrency and efficiency of extendible hashing, and a group-based hash bucket management algorithm is presented that improves the efficiency of hash key management by reducing the bucket size, thereby improving the performance of extendible hashing. Meanwhile, a hierarchical storage strategy for extendible hashing on NVM is presented to take advantage of both dynamic random access memory (DRAM) and NVM. Furthermore, on the basis of the device driver for Intel Optane DC Persistent Memory, a high-concurrency extendible hashing prototype named NEHASH is implemented. The Yahoo cloud serving benchmark (YCSB) is used to compare NEHASH with CCEH, level hashing, and cuckoo hashing. The results show that NEHASH can improve read throughput by up to 16.5% and write throughput by 19.3%.
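For readers unfamiliar with the underlying structure, the sketch below shows plain extendible hashing: a directory of 2^global_depth pointers over fixed-capacity buckets that split on overflow. This is the textbook baseline that NEHASH's multilevel directory, group-based buckets, and DRAM/NVM placement improve upon; it is not NEHASH's implementation.

```python
# Textbook-style extendible hashing: directory doubling and lazy bucket splitting.
class Bucket:
    def __init__(self, local_depth, capacity=2):
        self.local_depth = local_depth
        self.capacity = capacity
        self.items = {}

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.dir = [Bucket(1), Bucket(1)]

    def _bucket(self, key):
        return self.dir[hash(key) & ((1 << self.global_depth) - 1)]

    def put(self, key, value):
        b = self._bucket(key)
        if key in b.items or len(b.items) < b.capacity:
            b.items[key] = value
            return
        if b.local_depth == self.global_depth:        # directory must double
            self.dir += self.dir
            self.global_depth += 1
        # Split the overflowing bucket and redistribute its entries by the new bit.
        b.local_depth += 1
        new_b = Bucket(b.local_depth, b.capacity)
        high_bit = 1 << (b.local_depth - 1)
        for i, slot in enumerate(self.dir):
            if slot is b and (i & high_bit):
                self.dir[i] = new_b
        for k in list(b.items):
            if hash(k) & high_bit:
                new_b.items[k] = b.items.pop(k)
        self.put(key, value)                          # retry the insert after the split

    def get(self, key):
        return self._bucket(key).items.get(key)

h = ExtendibleHash()
for i in range(10):
    h.put(f"k{i}", i)
print(h.get("k7"), h.global_depth)
```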
In distributed storage systems, replication and erasure coding (EC) are common methods of providing data redundancy. Compared with replication, EC offers better storage efficiency but suffers from higher update overhead. Moreover, the consistency and reliability problems caused by concurrent updates bring new challenges to applications of EC. Many works focus on optimizing EC solutions, including algorithm optimization and novel data update methods, but lack solutions to the consistency and reliability problems. In this paper, we introduce a storage system that decouples data updating and EC encoding, namely, decoupled data updating and coding (DDUC), and propose a data placement policy that combines replication and parity blocks. For an (N, M) EC system, the data are placed as N groups of M+1 replicas, and redundant data blocks of the same stripe are placed on the parity nodes so that the parity nodes can autonomously perform local EC encoding. Based on this policy, a two-phase data update method is implemented in which data are updated in replica mode in phase 1, and EC encoding is performed independently by the parity nodes in phase 2. This solves the problem of data reliability degradation caused by concurrent updates while ensuring high concurrency performance. The system also exploits the byte addressability and eight-byte atomic writes of persistent memory (PMem) to implement a lightweight logging mechanism that improves performance while ensuring data consistency. Experimental results show that the concurrent access performance of the proposed storage system is 1.70–3.73 times that of the state-of-the-art storage system Ceph, while its latency is only 3.4%–5.9% of that of Ceph.
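The two-phase update can be sketched as follows for a toy (N, M) layout in which each data block has a primary copy plus a replica on every parity node. Phase 1 updates the replicas; phase 2 lets each parity node re-encode locally from the replicas it already holds. The plain-XOR encoding and fixed block size are simplifying assumptions (a real EC scheme would use distinct coding coefficients per parity node), and the PMem logging is not modeled.

```python
# Toy sketch of decoupled updating (phase 1) and local encoding (phase 2).
from functools import reduce

N, M = 4, 2
data_nodes = [bytes(8) for _ in range(N)]            # primary copies of data blocks
parity_nodes = [{"replicas": [bytes(8)] * N,         # each parity node mirrors all N blocks
                 "parity": bytes(8)} for _ in range(M)]

def phase1_update(block_id: int, new_data: bytes):
    # Phase 1: write the primary copy and its replica on every parity node.
    data_nodes[block_id] = new_data
    for p in parity_nodes:
        p["replicas"][block_id] = new_data

def phase2_encode():
    # Phase 2: each parity node re-encodes independently from its local replicas.
    for p in parity_nodes:
        p["parity"] = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
                             p["replicas"])

phase1_update(2, b"newblock")
phase2_encode()
print(parity_nodes[0]["parity"])
```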
The combinatorial optimization problem (COP), which aims to find the optimal solution in a discrete space, is fundamental in various fields. Unfortunately, many COPs are NP-complete, and the time required to solve them grows rapidly with the problem scale. Consequently, researchers often prefer fast methods even if they are not exact, and approximation algorithms, heuristic algorithms, and machine learning approaches have been proposed. Some works have proposed chaotic simulated annealing (CSA) based on the Hopfield neural network and achieved promising results. However, CSA is difficult for current general-purpose processors to execute efficiently, and no specialized hardware exists for it. To perform CSA efficiently, we propose a software-hardware co-design. In software, we quantize the weights and outputs with appropriate bit widths and then modify the calculations that are unsuitable for hardware implementation. In hardware, we design a specialized memristor-based processing-in-memory architecture named COPPER. COPPER can efficiently run the modified quantized CSA algorithm and supports pipelining for further acceleration. The results show that COPPER performs CSA remarkably well in terms of both speed and energy.
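As a reference point for the algorithm being accelerated, the sketch below runs transiently chaotic (Chen-Aihara-style) simulated annealing on a tiny Hopfield instance in floating point. The weights, hyperparameters, and toy problem are illustrative assumptions, and the quantization and memristor crossbar mapping that COPPER targets are not reproduced.

```python
# Sketch of chaotic simulated annealing on a small Hopfield network.
import numpy as np

rng = np.random.default_rng(0)
# Symmetric weights of a toy Hopfield energy E = -1/2 x^T W x - b^T x.
W = np.array([[0, 1, 1, -1],
              [1, 0, -1, 1],
              [1, -1, 0, 1],
              [-1, 1, 1, 0]], dtype=float)
b = np.zeros(4)

k, alpha, eps = 0.9, 0.015, 0.004   # damping, input scaling, output gain
z, beta, I0 = 0.08, 0.01, 0.65      # self-feedback strength, its decay, bias

y = rng.uniform(-0.02, 0.02, 4)                              # internal neuron states
x = 1 / (1 + np.exp(-np.clip(y / eps, -60, 60)))             # neuron outputs in (0, 1)

for _ in range(2000):
    y = k * y + alpha * (W @ x + b) - z * (x - I0)           # chaotic neuron update
    x = 1 / (1 + np.exp(-np.clip(y / eps, -60, 60)))
    z *= (1 - beta)                                          # anneal the chaotic term away

solution = (x > 0.5).astype(int)
energy = -0.5 * solution @ W @ solution - b @ solution
print(solution, energy)
```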
To improve the accuracy of modulated signal recognition in variable environments and to reduce the impact of factors such as a lack of prior knowledge on recognition results, researchers have gradually adopted deep learning techniques in place of traditional modulated signal processing techniques. To address the low recognition accuracy of modulated signals at low signal-to-noise ratios, we design a novel modulation recognition network that combines multi-scale analysis with deep threshold noise elimination to recognize actually collected modulated signals, trained with a label-smoothed symmetric cross-entropy loss. The network consists of a denoising encoder with deep adaptive threshold learning and a decoder with multi-scale feature fusion. The two modules are skip-connected and work together to improve the robustness of the overall network. Experimental results show that this method achieves better recognition accuracy at low signal-to-noise ratios than previous methods. The network demonstrates flexible self-learning of different noise thresholds, and the designed feature fusion module is effective in acquiring multi-scale features for various modulation types.
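The loss mentioned above can be illustrated with the commonly used formulation of symmetric cross-entropy, SCE = alpha*CE(q, p) + beta*RCE(q, p), applied to label-smoothed targets q. The weights, smoothing factor, and clipping constant below are illustrative assumptions rather than the paper's settings.

```python
# Sketch of a label-smoothed symmetric cross-entropy loss (forward + reverse CE).
import numpy as np

def smoothed_sce(logits, labels, num_classes, eps=0.1, alpha=1.0, beta=1.0, clip=1e-4):
    # Softmax predictions p and label-smoothed targets q.
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    q = np.full((len(labels), num_classes), eps / num_classes)
    q[np.arange(len(labels)), labels] += 1.0 - eps
    ce = -(q * np.log(np.clip(p, 1e-12, 1.0))).sum(axis=1)    # forward cross-entropy
    rce = -(p * np.log(np.clip(q, clip, 1.0))).sum(axis=1)    # reverse cross-entropy
    return (alpha * ce + beta * rce).mean()

logits = np.array([[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]])
print(smoothed_sce(logits, labels=np.array([0, 1]), num_classes=3))
```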