TextGen: a realistic text data content generation method for modern storage system benchmarks
Long-xiang WANG, Xiao-she DONG, Xing-jun ZHANG, Yin-feng WANG, Tao JU, Guo-fu FENG
Modern storage systems incorporate data compressors to improve their performance and capacity. As a result, data content can significantly influence the result of a storage system benchmark. Because real-world proprietary datasets are too large to be copied onto a test storage system, and most data cannot be shared owing to privacy concerns, a benchmark needs to generate data synthetically. To ensure accurate results, the generated content must be based on a characterization of the real-world data properties that influence storage system performance during benchmark execution. The existing approach, SDGen, cannot guarantee accurate benchmark results on storage systems with built-in word-based compressors, because it characterizes the properties that influence compression performance only at the byte level; no properties are characterized at the word level. To address this problem, we present TextGen, a realistic text data content generation method for modern storage system benchmarks. TextGen builds a word corpus by segmenting real-world text datasets, and derives a word-frequency distribution by counting each word in the corpus. To improve data generation performance, the word-frequency distribution is fitted to a lognormal distribution by maximum likelihood estimation, and a Monte Carlo approach is used to generate the synthetic data. The running time of TextGen depends only on the size of the data to be generated; that is, its time complexity is O(n). To evaluate TextGen, we performed experiments on four real-world datasets. The results show that, compared with SDGen, the datasets generated by TextGen deviate less from the real-world datasets in both compression performance and compression ratio when evaluated with end-tagged dense code, a representative word-based compressor.
Key words: Benchmark; Storage system; Word-based compression
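The abstract outlines a four-step pipeline: segment a real-world text sample into words, count word frequencies, fit the frequency distribution to a lognormal model by maximum likelihood, and generate synthetic text by Monte Carlo sampling in time linear in the output size. The Python sketch below is a minimal illustration of these steps, not the authors' implementation: the function names (build_word_frequency, fit_lognormal, generate) are hypothetical, whitespace splitting stands in for the paper's segmentation, and the closed-form lognormal MLE (sample mean and standard deviation of the log-frequencies) is a standard textbook estimator assumed here.

import math
import random
from collections import Counter
from itertools import accumulate

def build_word_frequency(text):
    # Segment the real-world sample into words (whitespace split is an
    # assumption; the paper's segmentation may differ) and count them.
    return Counter(text.split())

def fit_lognormal(counts):
    # Maximum likelihood fit of a lognormal model to the observed word
    # frequencies: for lognormal data, the MLEs of (mu, sigma) are the
    # sample mean and standard deviation of the log-frequencies.
    logs = [math.log(c) for c in counts.values()]
    mu = sum(logs) / len(logs)
    var = sum((x - mu) ** 2 for x in logs) / len(logs)
    return mu, math.sqrt(var)

def generate(vocab, mu, sigma, target_size):
    # Monte Carlo generation: draw one synthetic frequency per word from
    # the fitted lognormal, then repeatedly sample words with probability
    # proportional to that frequency until target_size bytes are produced.
    # With precomputed cumulative weights each draw is O(log |vocab|), so
    # the running time grows linearly with target_size, matching the
    # O(n) claim in the abstract.
    weights = [random.lognormvariate(mu, sigma) for _ in vocab]
    cum = list(accumulate(weights))
    out, size = [], 0
    while size < target_size:
        w = random.choices(vocab, cum_weights=cum, k=1)[0]
        out.append(w)
        size += len(w) + 1  # +1 for the separating space
    return " ".join(out)

if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog the fox"
    freq = build_word_frequency(sample)
    mu, sigma = fit_lognormal(freq)
    print("fitted lognormal: mu=%.3f sigma=%.3f" % (mu, sigma))
    print(generate(list(freq), mu, sigma, 80))

Note that sampling from the fitted model rather than from the raw empirical counts is what makes generation cheap: the cost per drawn word is independent of the size of the original dataset, which is the property the abstract's O(n) claim rests on.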