Optimizing partitioning strategies for faster inverted index compression

Xingshen SONG; Yuexiang YANG; Yu JIANG; Kun JIANG

doi:10.1007/s11704-016-6252-5

Front. Comput. Sci., 2019, Vol. 13, Issue 2: 343-356. DOI: 10.1007/s11704-016-6252-5

RESEARCH ARTICLE

Abstract

The inverted index is a key component for search engines to manage billions of documents and quickly respond to users' queries. Whereas substantial effort has been devoted to reducing space occupancy and improving decoding speed, the encoding speed when constructing the index has been overlooked. Partitioning the index in alignment with its clustered distribution can effectively minimize the compressed size while accelerating its construction procedure. In this study, we introduce compression speed as one criterion to evaluate compression techniques, and thoroughly analyze the performance of different partitioning strategies. Optimizations are also proposed to enhance state-of-the-art methods with faster compression speed and more flexibility to partition an index. Experiments show that our methods offer a much better compression speed, while retaining an excellent space occupancy and decompression speed.
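As a rough illustration of why partitioning along clusters can shrink an index, the sketch below compares encoding a posting list as one block against an exhaustively optimal partition. It is not the method of this article: the cost model (fixed-width binary codes per block plus a per-block header of HEADER_BITS), the header size, and all function names are assumptions chosen for illustration, and the quadratic-time dynamic program is exactly the kind of slow construction that faster partitioning strategies aim to avoid.

```python
HEADER_BITS = 16  # assumed per-block overhead in bits (hypothetical value)


def gaps(doc_ids):
    """Convert a strictly increasing docID list into d-gaps."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


def block_cost(block):
    """Bits needed for one block: header plus fixed-width codes for every gap."""
    width = max(1, max(block).bit_length())
    return HEADER_BITS + width * len(block)


def optimal_partition(values):
    """Exhaustive O(n^2) dynamic program over all block boundaries.

    best[j] holds the minimum cost of encoding values[:j];
    cut[j] holds the start index of the last block in that solution.
    """
    n = len(values)
    best = [0] + [float("inf")] * n
    cut = [0] * (n + 1)
    for j in range(1, n + 1):
        block_max = 0
        # Grow the candidate last block values[i:j] leftwards.
        for i in range(j - 1, -1, -1):
            block_max = max(block_max, values[i])
            width = max(1, block_max.bit_length())
            cost = best[i] + HEADER_BITS + width * (j - i)
            if cost < best[j]:
                best[j], cut[j] = cost, i
    # Walk the cut points backwards to recover the block boundaries.
    bounds = []
    j = n
    while j > 0:
        bounds.append((cut[j], j))
        j = cut[j]
    bounds.reverse()
    return best[n], bounds


# A dense cluster of docIDs followed by a sparse tail.
doc_ids = list(range(1, 65)) + [1000 * k for k in range(1, 9)]
g = gaps(doc_ids)
single_block_bits = block_cost(g)
partitioned_bits, boundaries = optimal_partition(g)
```

Under this assumed cost model, the single block pays the 10-bit width of the largest gap for all 72 entries, while the partitioned layout stores the dense cluster at 1 bit per gap and only the sparse tail at 10 bits per gap.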

Keywords

inverted index / index compression / optimal partition / approximation algorithm

Cite this article

Xingshen SONG, Yuexiang YANG, Yu JIANG, Kun JIANG. Optimizing partitioning strategies for faster inverted index compression. Front. Comput. Sci., 2019, 13(2): 343-356. DOI: 10.1007/s11704-016-6252-5
