REVIEW ARTICLE

Optical storage: an emerging option in long-term digital preservation

  • Shenggang WAN 1,2
  • Qiang CAO 1
  • Changsheng XIE 1,2
  • 1. Wuhan National Laboratory for Optoelectronics, Wuhan 430074, China
  • 2. School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China

Received date: 11 Jun 2014

Accepted date: 24 Sep 2014

Published date: 12 Dec 2014

Copyright

2014 Higher Education Press and Springer-Verlag Berlin Heidelberg

Abstract

Long-term digital preservation is an important issue in the data storage area. For years, magnetic media based solutions, such as tape and hard disk drive (HDD) based archive systems, have monopolized the data archiving market due to their high capacity and low cost. However, in the era of big data, the rapidly increasing volume, velocity, and variety of data sets bring numerous challenges to archive systems in various aspects, such as capacity, cost, performance, reliability, and power consumption. In recent years, high-capacity optical media, such as Blu-ray discs (BDs) and holographic discs, have emerged with the revival of optical storage. Owing to the naturally simple construction of optical media, archive systems based on such media, e.g., BD libraries, demonstrate attractive properties in cost per bit, reliability, and power consumption, and thus become feasible options for long-term digital preservation. In this paper, we review and compare both magnetic and optical media based solutions for long-term digital preservation, followed by a summary of techniques to improve optical media based archive systems.

Cite this article

Shenggang WAN, Qiang CAO, Changsheng XIE. Optical storage: an emerging option in long-term digital preservation[J]. Frontiers of Optoelectronics, 2014, 7(4): 486–492. DOI: 10.1007/s12200-014-0442-2

Introduction

Long-term preservation of digital information is an important issue in the modern world. From the whole of society down to a single person, there are always data with long-term value to be preserved, and part of this retention is even mandated by law: e.g., most accounting information must be preserved for at least 7 years [1], design documents must be kept for more than 15 years [2], and important medical records must be retained for 30 years or longer [3]. Notably, most of these data are rarely accessed; they are also known as cold data.
For years, tape and hard disk drive (HDD) based systems have dominated the data archiving market due to the high capacity and low cost of those media [2]. However, the situation has changed recently. According to a report released by the International Data Corporation (IDC), the total amount of data in the world in 2011 was around 1.8 ZB (1 ZB = 10^6 PB = 10^21 bytes) and would grow at over 50% per year [4]. In contrast, the growth rate of the storage density of magnetic tapes and disks has slowed down, falling far short of the prediction of the famous Kryder's law [5]; e.g., the areal density of HDDs has increased at only 20% per year since 2010, while that of linear tape open (LTO) tapes has grown at around 40% per year over the same period [6]. The gap between the growth rates of data volume and storage density will have a huge impact on archive systems and deeply influence the storage industry. Furthermore, as data sets grow, archive systems must also face the challenge of power consumption.
In recent years, high-capacity optical media, such as Blu-ray discs (BDs) and holographic discs, have emerged with the revival of optical storage; e.g., the capacity of a BD ranges from 100 to 128 GB in the current generation (BDXL, high-capacity recordable and rewritable discs), and that of the next generation is up to 500 GB. In addition, optical media have much longer lifetimes (e.g., 50-100 years for BDXL) than tape (15-30 years) and HDD (3-5 years). As a result, an optical media based archive system achieves not only high reliability but also low power consumption, owing to fewer data migrations. For all of the above reasons, including capacity, reliability, and power consumption, the optical media based archive system has become an attractive option for long-term digital preservation. In the last one or two years, prototypes of BD based archive systems have been designed, developed, and exhibited [2].
In this paper, we first review the background of long-term digital preservation in the big data era; we then compare existing magnetic and optical media based archive systems; next, we propose techniques to address the drawbacks of optical media based archive systems; finally, we draw our conclusions.

Long-term digital preservation meets big data

In this section, we will review the background of long-term digital preservation in the big data era.
As mentioned in Section 1, the report released by the IDC claims that the total amount of data in the world in 2011 was around 1.8 ZB and would grow at over 50% per year, i.e., it will be around 53 ZB by 2020, as shown in Fig. 1. However, even this prediction of the data growth rate is considered conservative. For example, over 700 TB of data will be generated per day by the square kilometre array (SKA) project, which will be deployed in Africa and Australia [7]. Utah Governor Gary R. Herbert claimed that the NSA Utah Data Center, whose construction started in 2011, would be the "first facility in the world expected to gather and house a yottabyte" (1 yottabyte = 1 YB = 10^24 bytes) [8]. It is also reported that the total amount of data to be generated by the BRAIN project is around 1 YB [9].
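The compounding behind such projections can be sketched as follows; the 50% annual rate and the time horizon are illustrative assumptions, and the exact IDC figure depends on both:

```python
def projected_zettabytes(initial_zb: float, annual_growth: float,
                         years: int) -> float:
    """Project the total data volume under fixed compound annual growth."""
    return initial_zb * (1.0 + annual_growth) ** years

# Starting from the 1.8 ZB estimated for 2011, growth of 50% per year
# compounds into tens of zettabytes within a decade:
volume_2020 = projected_zettabytes(1.8, 0.50, 2020 - 2011)
```

Because the growth is exponential, small differences in the assumed annual rate shift the projected volume by tens of zettabytes over a decade.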
Fig.1 Prediction on total amount of data in the world by IDC

Besides such scientific data, huge amounts of commercial and personal data also have long-term value. In 2006, the Storage Networking Industry Association (SNIA) conducted a survey on long-term digital preservation among 276 organizations [10]. Figure 2 presents some results of the survey. About 90% of those organizations have digital data that must be preserved for more than 10 years, and 81% of the respondents even have data preservation requirements exceeding 50 years. In addition, 18% of the respondents said that the amount of data they must preserve is larger than 100 TB.
The huge data sets generated by large-scale web applications, e.g., social networking, also need to be preserved. It is reported that most of the pictures uploaded to Facebook are rarely or never re-accessed; e.g., 97% of the content receives only 29% of the requests [11]. However, due to the huge potential value of those pictures, the company prefers preserving this cold data to deleting it. Facebook has now started building data centers to maintain such cold data, and the BD based archive system is one of its choices.
According to the latest definitions given by SNIA [12], "long-term preservation" is "The act of maintaining information, in a correct and independently understandable form, over a period of decades or longer," and "digital preservation" is "Ensuring continued access to, and usability of, digital information and records, especially over long periods of time." By these definitions, long-term digital preservation is not limited to capacity and cost (or cost per bit). To maintain data at the ZB or even YB level, many other factors must also be considered, such as reliability and power consumption. Although the latter factors can be roughly converted into monetary cost, we do not perform this conversion in this paper, since it is hard to estimate the loss of a piece of precious data or the emission of a ton of CO2. Consequently, we analyze and compare feasible solutions for long-term digital preservation from the following five aspects: storage density, storage cost, performance, reliability, and power consumption.
Besides the above factors, some other issues are also important to archive systems, such as security and semantics. For example, Front Porch Digital proposes the Archive eXchange Format (AXF), an open format for universal content transport, storage, and long-term preservation [13]. However, we do not discuss these issues in this paper, since the five metrics chosen for comparison are more relevant to the storage media.
Fig.2 Results of survey on data retention by SNIA. (a) Requirements on retention time; (b) amount of data to preserve

Comparison of feasible solutions

In this section, we first review feasible storage media for long-term digital storage in the big data era, and then compare archive systems based on those media from the viewpoints of the storage medium (including the drive) and the whole system, respectively.

Non-volatile storage media

Kryder and Kim reviewed 13 types of novel nonvolatile memory technologies that are candidates to replace HDD [5]. The famous Kryder's law, "The 30-year history of the cost of digital storage media dropping exponentially," was used in their work to predict the future features of those memory technologies. Rosenthal et al. analyzed the economics of long-term digital storage in detail [14]. Three storage technologies, including disk, tape, and NAND flash, were compared in their work. They concluded that, among these three media, tape is most likely to be the best candidate for long-term digital preservation in this decade, due to the remaining headroom in storage density and its relatively low cost. Fontana et al. also compared the technology roadmaps for tape, HDD, and NAND flash. According to their conclusions, only tape demonstrates a potentially high (over 40% per year) growth rate of storage density [6].
In this paper, we consider tape and HDD based archive systems, as well as BD based archive systems, due to their relatively long lifetimes and low cost per bit. We do not consider NAND-flash based archive systems for the following two reasons: 1) the cost per bit of NAND flash is much higher than that of the other media, e.g., about 10 times that of HDD; 2) NAND flash records digital information by injecting electrons into field effect transistors, and this recording mechanism is not stable. The electrons dissipate slowly, leading to loss of stored information, particularly in multi-level cell (MLC, a cell recording several bits) based NAND-flash memory. This would incur many extra migration/refresh operations.

Perspective from storage media

We conducted a quantitative comparison among HDD, tape, and BD. The results are listed in Table 1. We chose 2 TB 3.5-inch SATA HDDs for the comparison since they have a lower cost per bit than 2.5-inch SAS HDDs. Due to the domination of LTO tapes in the magnetic tape market, the LTO6 tape, the latest generation of LTO tapes, was chosen. The 25 GB BD-R disc and the 100 GB BDXL disc were chosen because the former is the mainstream product and the latter has already been used in the first generation of BD based archive systems [2].
Tab.1 Quantitative comparisons among HDD, tape, and BD
type                                   | 3.5″ SATA HDD | LTO6 tape   | BD-R (25 GB) | BDXL (100 GB)
capacity/TB                            | 2             | 2.5         | 0.025        | 0.1
volumetric storage density/(GB·cm^-3)  | 5.1           | 10.8        | 0.85         | 3.4
cost per bit/($·GB^-1)                 | 0.05–0.10     | 0.026–0.036 | 0.024–0.035  | 0.45–0.68
latency/ms                             | ~10           | ~45000      | ~100         | ~100
recording bandwidth/(MB·s^-1)          | ~100          | ~160        | ~45          | ~18
lifetime/year                          | 3–5           | 15–30       | 50–100       | 50–100
recording energy efficiency/(MB·J^-1)  | 8.3–12.5      | 6.7–13.4    | 1.5–3.0      | 0.6–1.2
• Storage density: Although the BD has the lowest storage density among the above storage media, its storage density is expected to increase dramatically in the near future. For example, the capacity of the next generation of BD is up to 500 GB.
• Cost per bit: The BDXL has the highest cost per bit, while the BD-R (25 GB) and the tape have the lowest. The cost per bit of HDD is 2 to 3 times that of the tape and the BD-R (25 GB). The cost per bit of BDXL is high mainly because it has not yet been mass-produced; in other words, if the optical storage medium gains market share, the price may drop further as development costs are diluted.
• Recording bandwidth: The recording bandwidth of existing optical storage technologies is much lower than that of the other storage technologies. For example, the recording speed of the 16× BD-R is up to 72 MB/s (1× = 4.5 MB/s), and around 45 MB/s on average; that of BDXL is even lower, at only around 18 MB/s. In addition, the storage density of BD is improved by increasing the number of recording layers, i.e., the recording bandwidth does not improve with the growth of storage density.
• Reliability: Essentially, the lifetime is an important metric for evaluating system reliability. Although the magnetic medium itself is very stable, the lifetime of HDD (also known as the MTTF, or mean time to failure, of HDD) is much shorter than expected [15,16]. Similar to the HDD, the lifetime of tape is also not determined by that of the magnetic medium. Fujifilm and HP claim that the lifetime of their latest products is around 30 years. However, since tape is rolled up, it easily sticks together. As a result, the content on a tape must be refreshed periodically by copying it from one tape to another, and this refresh operation brings significant extra management overhead. The lifetime of an optical medium is usually determined by its recording material. According to a report by HITACHI, the lifetime of BDXL typically ranges from 50 to 100 years. In addition, the lifetime of the M-DISC, a product of RITEK, is around 1000 years [17].
• Power consumption: The power consumption is normalized as the ratio of recording bandwidth to power, i.e., recording energy efficiency in MB/J. Although the BD drive suffers from the worst recording energy efficiency due to its low recording bandwidth, the total power consumption of a BD based archive system is much lower than that of HDD and tape based archive systems for long-term digital preservation. Essentially, since access operations are rare in long-term digital preservation, the total power consumption of the archive system is determined by the power consumption of migrations. As mentioned in Section 1, the BD based archive system requires far fewer migrations due to the long lifetime of BDs.
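The effect of media lifetime on long-term energy cost can be sketched with a simple model: the archived data are rewritten once per media generation, and each rewrite costs energy according to the MB/J efficiency in Table 1. The function below is a rough illustration under these assumptions, not a full cost model:

```python
import math

def migration_energy_joules(data_mb: float, retention_years: float,
                            media_lifetime_years: float,
                            efficiency_mb_per_joule: float) -> float:
    """Estimate total recording energy over the retention period.

    The data are written once per media generation, i.e., once initially
    and again each time the media reach end of life.
    """
    rewrites = math.ceil(retention_years / media_lifetime_years)
    return rewrites * data_mb / efficiency_mb_per_joule

# Keeping 100 TB for 60 years, with midpoint efficiencies from Table 1:
data_mb = 100 * 1024 * 1024
hdd_energy = migration_energy_joules(data_mb, 60, 4, 10.0)   # 15 rewrites
bd_energy = migration_energy_joules(data_mb, 60, 75, 2.0)    # 1 rewrite
```

Even though the BD drive is roughly five times less efficient per megabyte, the single write over the whole retention period leaves it well ahead of the HDD's fifteen rewrites.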

Perspective from archive systems

Typically, archive systems containing robot arms can be classified into two types according to their organization: the rack-server style and the rack style. Figure 3 shows BD based archive systems organized in these two styles. In the former, the storage media (BDs) and the access components (BD drives) are sealed together in one server; multiple servers can be deployed in racks to increase capacity and recording bandwidth. In the latter, the storage media and BD drives are deployed in one huge rack with a single delivery component (robot). As a result, the storage density (spatial density) and capacity can be further improved.
Fig.3 Blu-ray disc (BD) based archive systems organized in the rack-server and rack styles. (a) A BD based archive system organized in the rack-server style; (b) a BD based archive system organized in the rack style

Compared to the rack style organization, the rack-server style organization has good performance, reliability, and scalability, but incurs a much higher cost to maintain many extra components per server, such as CPUs, memory, and mainboards. In addition, all archive systems, whether HDD, tape, or BD based, can adopt the rack-server style. However, to our knowledge, only tape and BD based archive systems can adopt the rack style, because their storage media are naturally separated from the access components (drives).
We did not make a quantitative comparison at the system level, since the cost of the extra components varies dramatically with the organization. However, comparisons under certain configurations show that, to archive 100 TB of data for 20 years, the costs of the HDD, tape, and BD based solutions are 7.7, 2.9, and 1.6 million US dollars, respectively [2]. Although HDD has the best power efficiency during recording, the power consumption of an HDD based archive system is thought to be the worst among the above archive systems [2]. It is also reported that Facebook's prototype BD based archive system further reduces cost by 50% and power consumption by 80% compared to a shingled-recording HDD based archive system. Due to the short lifetime of HDDs, data migration brings huge extra power consumption, and the situation is similar for tape based archive systems. Benefiting from the relatively long lifetime of its media, the BD based archive system achieves the lowest power consumption.

Improving the BD based archive system

Although the BD based archive system is attractive, it still faces various challenges. In this section, we discuss and propose some techniques to address those challenges.

Recording bandwidth

Without doubt, the poor recording bandwidth is the most significant weakness of the BD based archive system. To satisfy the demand for velocity, parallel techniques can be used to improve the bandwidth: the data to be preserved are first divided into separate chunks, and all chunks are then recorded via multiple drives onto different discs. Furthermore, to absorb small write requests and avoid recording data on multiple tracks of a single disc, a dedicated write buffer should be configured.
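A minimal sketch of this chunk-and-stripe approach, with the drive interfaces and the chunk size as placeholder assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(data: bytes, chunk_size: int):
    """Divide the payload into fixed-size chunks for parallel recording."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def record_parallel(data: bytes, drives, chunk_size: int = 64 * 1024 * 1024):
    """Record chunks round-robin across the available drives in parallel.

    `drives` is a list of callables, each standing in for one BD drive
    that writes a single chunk onto a disc.
    """
    chunks = split_into_chunks(data, chunk_size)
    with ThreadPoolExecutor(max_workers=len(drives)) as pool:
        futures = [
            pool.submit(drives[i % len(drives)], chunk)
            for i, chunk in enumerate(chunks)
        ]
        return [f.result() for f in futures]
```

In this scheme the aggregate recording bandwidth scales with the number of drives, at the cost of spreading one data set across multiple discs; a write buffer in front of `record_parallel` would batch small requests into full chunks first.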

Access latency

To achieve high recording bandwidth and petabyte- or exabyte-level capacity, most discs must be grouped and stored outside the drives, whether the archive system is organized in the rack-server style or the rack style. Clearly, this brings significant latency when accessing data on discs stored outside the drives. For example, this latency in the rack-server style BD based archive system is about 65 s, since the discs must be inserted into drives one by one. Therefore, hiding or shortening this latency will extend the range of applications of the BD based archive system.
The following techniques might be used to mitigate this latency.
• Buffer cache: A buffer cache is widely deployed in storage systems to improve the average access latency. By capturing re-accesses to hot spots in low-latency memory, the high latencies of accessing slow back-end devices can be hidden, thus improving overall performance. As a result, to reduce the average access latency in such scenarios, a buffer cache may be deployed as the front end of the archive system.
• Priority based scheduler: To reduce access latency, priority based schedulers are also widely used in computer architectures, particularly in latency-sensitive scenarios. By accurately locating and transmitting the requested data within a data block (the basic I/O unit), the access latency can be shortened compared to fetching the whole data block sequentially. As a result, in an archive system that must access its storage media one by one, a priority based scheduler may dramatically reduce the latency of accessing data on back-end devices.
• Other approaches: Besides the above methods, new mechanical components that can load multiple discs concurrently may effectively reduce the latency.
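The benefit of a front-end buffer cache can be estimated with the standard expected-latency formula; the hit ratio below is a hypothetical value, and the 65 s figure is the disc-loading latency quoted above:

```python
def average_latency(hit_ratio: float, cache_latency_s: float,
                    miss_latency_s: float) -> float:
    """Expected access latency with a front-end buffer cache."""
    return hit_ratio * cache_latency_s + (1.0 - hit_ratio) * miss_latency_s

# With the ~65 s disc-loading latency and a hypothetical 90% hit ratio
# over millisecond-level memory:
avg = average_latency(0.90, 0.001, 65.0)  # roughly 6.5 s on average
```

The formula makes the design pressure explicit: since the miss penalty dominates, even modest improvements in hit ratio (or in disc-loading time) translate almost directly into the average latency.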

Reliability

Although BDs are thought to be much more reliable than disks and tapes, the BD based archive system still faces the challenge of data reliability, particularly in a large-scale system maintaining millions of discs. The following two techniques may be used to address this problem.
• Data redundancy: Data redundancy is a common technique to keep a storage system reliable. By generating redundant information from the original data, the system can tolerate media failures, using the redundant information for recovery. Typically, data redundancy schemes can be divided into replication and erasure coding. The biggest differences between these two schemes are spatial utility (calculated as amount of data / amount of (data + redundancy)) and recovery speed (which determines an important reliability metric, the mean time to repair (MTTR)). For example, RAID-6 is a storage scheme using a two-failure-tolerant erasure code.
• Sampling: Sampling is usually used in optical disc based storage systems to detect and predict failures. By periodically checking sample discs rather than monitoring all discs, high efficiency and accuracy can be achieved.
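The spatial-utility trade-off between replication and erasure coding mentioned above can be illustrated directly from its definition; the 8+2 erasure-coding layout is an illustrative choice:

```python
def spatial_utility(data_units: int, redundancy_units: int) -> float:
    """Spatial utility = amount of data / amount of (data + redundancy)."""
    return data_units / (data_units + redundancy_units)

# Three-way replication: 1 data unit plus 2 redundant copies,
# tolerating two failures at 1/3 utility.
replication = spatial_utility(1, 2)   # about 0.33
# RAID-6-style erasure coding over 8 data discs plus 2 parity discs,
# also tolerating two failures but at much higher utility.
erasure = spatial_utility(8, 2)       # 0.8
```

Both layouts survive two failed discs, but the erasure-coded group stores 2.4 times more user data per disc; the price is a slower rebuild, since recovering one lost disc requires reading from many surviving discs, which lengthens the MTTR.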

Other issues

Beyond performance and reliability, there are many other issues to be addressed in the BD based archive system, such as manageability, fast lookup, spatial efficiency, scalability, semantics, and security.
• Manageability: The traditional file systems for optical discs, such as the universal disc format (UDF), are designed for a single disc and are not suited to volumes spanning multiple discs. As a result, new file systems should be designed for this case.
• Fast lookup: It is challenging to look up data among petabytes or exabytes of discs stored outside the drives. Two techniques can be utilized to address this problem. One is digesting, in which text is identified and extracted from the graphic information and used as the lookup index. The other is using multi-level metadata to reduce the number of I/Os needed to access a single data chunk.
• Spatial efficiency: Deduplication can dramatically reduce the amount of data to preserve and thus improve spatial efficiency. However, it is difficult to implement deduplication in an optical disc library (and also in a tape based archive system). As a result, we propose conducting localized deduplication instead of the traditional global deduplication. By executing deduplication between two consecutive versions of the data, it can reduce not only the total amount of fingerprints but also the total amount of data.
• Scalability, semantics, and security: Numerous challenges remain to be discussed and addressed, such as scalability, semantics, and security, e.g., scaling up the archive system to satisfy the dramatically increasing demand for capacity, recognizing archived information after 100 or more years, and maintaining the security and privacy of the archived information. Here, we only list these problems and leave them open for discussion.
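The localized deduplication proposed above can be sketched as comparing chunk fingerprints of two consecutive versions only; the fixed-size chunking and SHA-256 fingerprints are our assumptions:

```python
import hashlib

def fingerprints(version: bytes, chunk_size: int = 4096):
    """Map chunk fingerprint -> chunk for one version of the data."""
    return {
        hashlib.sha256(version[i:i + chunk_size]).hexdigest():
            version[i:i + chunk_size]
        for i in range(0, len(version), chunk_size)
    }

def localized_dedup(prev_version: bytes, new_version: bytes,
                    chunk_size: int = 4096):
    """Return only the chunks of new_version absent from prev_version.

    Comparing two consecutive versions keeps the fingerprint index small,
    unlike global deduplication over the whole archive.
    """
    old = fingerprints(prev_version, chunk_size)
    new = fingerprints(new_version, chunk_size)
    return {fp: chunk for fp, chunk in new.items() if fp not in old}
```

Only the chunks returned by `localized_dedup` need to be recorded onto new discs; the fingerprint index covers just two versions rather than the whole library, which is what makes the scheme tractable for write-once media.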

Conclusions

Long-term digital preservation, particularly the preservation of cold data, is an important issue in the storage area. As the foundation of digital preservation, storage technologies have to face challenges in cost, capacity, reliability, performance, and power consumption. In this paper, we reviewed and compared three feasible technologies in the archive market from the viewpoint of long-term digital preservation, followed by a discussion of techniques to improve optical media based (particularly BD based) archive systems. Based on our comparison and discussion, we draw the following conclusions.
1) The optical disc based archive system has become an attractive choice for long-term digital preservation, due to the low cost per bit and long lifetime of the storage media, and the low power consumption of the whole system. Currently, the cost per bit of the BDXL seems too high to afford, due to its very low market share. However, this also implies huge room for cost reduction in the future, as can be inferred from the cost per bit of the mainstream BD-R (25 GB).
2) To catch up with the HDD and tape based archive systems, the performance of the BD based archive system should be further improved, particularly the recording performance. In addition, increasing the recording performance per drive will simultaneously reduce the power consumption of the BD based archive system.
3) The long lifetime of BDs is the most attractive feature of the BD based archive system. However, this long lifetime is only a theoretical value obtained in the laboratory; it still needs to be verified in real-world environments.
4) High-capacity optical discs, e.g., 1 to 10 TB per disc, need to be developed to keep the optical based archive system competitive. Fortunately, besides the BD, other optical storage technologies with higher storage density, such as holographic and multi-dimensional optical recording [18], are under research.
Finally, we conclude that the optical media based archive system may be one of the best choices in long-term digital preservation.

Acknowledgements

This work was supported in part by the National Basic Research Program of China (No. 2011CB302303), the National High Technology Research and Development Program of China (No. 2013AA013203), the National Natural Science Foundation of China (Grant No. 60933002), and the Fundamental Research Funds for Central Universities, Huazhong University of Science and Technology, (No. 2013KXYQ003). This work was also supported by Key Laboratory of Data Storage System, Ministry of Education.
1
Sarbanes P, Oxley M G. Sarbanes-Oxley Act of 2002. https://www.sec.gov/about/laws/soa2002.pdf

2
Watanabe A. Optical library system for long-term preservation with extended error correction coding. In: Proceedings of IEEE Symposium on Massive Storage Systems and Technologies, 2013, Keynote Talk

3
Health Insurance Portability and Accountability Act of 1996. http://www.hhs.gov/ocr/privacy/

4
IDC Digital Universe. http://www.emc.com/leadership/programs/digitaluniverse.htm

5
Kryder M H, Kim C S. After hard drives—what comes next? IEEE Transactions on Magnetics, 2009, 45(10): 3406–3413


6
Fontana R E, Hetzler S R, Decad G. Technology roadmap comparisons for TAPE, HDD, and NAND flash: implications for data storage applications. In: Proceedings of IEEE Symposium on Massive Storage Systems and Technologies, 2013

7
SKA project. https://www.skatelescope.org

8
NSA Utah Data Center. http://nsa.gov1.info/utah-datacenter/

9
Obama Brain mapping project tests big data limits. https://twitter.com/attivio/status/319825221374861312

10
Peterson M, Zasman G, Mojica P, Porter J. 100 year archive requirements survey. http://www.snia.org, 2007

11
Huang Q, Birman K, van Renesse R, Lloyd W, Kumar S, Li H C. An analysis of Facebook photo caching. In: Proceedings of Symposium on Operating Systems Principles, 2013, 167–181


12
Storage Networking Industry Association. SNIA dictionary, 2013. http://www.snia.org

13
Campanotti B. Archive eXchange Format. http://www.snia.org

14
Rosenthal D S, Rosenthal D C, Miller E L, Adams I F, Storer M W, Zadok E. The economics of long-term digital storage. The Memory of the Word in the Digital Age: Digitization and Preservation, 2012

15
Schroeder B, Gibson G A. Disk failures in the real world: what does an MTTF of 1000000 hours mean to you? In: Proceedings of USENIX Conference on File and Storage Technologies, 2007, 1–16

16
Pinheiro E, Weber W D, Barroso L A. Failure trends in a large disk drive population. In: Proceedings of USENIX Conference on File and Storage Technologies, 2007, 17–28

17
Wood K. Optical media technical roadmap: the revival of optical storage. In: Proceedings of IEEE Symposium on Massive Storage Systems and Technologies, 2013, Keynote Talk

18
Zijlstra P, Chon J W M, Gu M. Five-dimensional optical recording mediated by surface plasmons in gold nanorods. Nature, 2009, 459(7245): 410–413

