Efficient dynamic pruning on largest scores first (LSF) retrieval
Kun JIANG, Yue-xiang YANG
Inverted index traversal techniques have been studied to address the query processing performance challenges of web search engines, but they still leave much room for improvement. In this paper, we focus on inverted index traversal over document-sorted indexes and on the optimization technique called dynamic pruning, which can efficiently reduce the hardware computational resources required. We propose a novel exhaustive index traversal scheme called largest scores first (LSF) retrieval, in which the candidates are first selected from the posting lists of important query terms with the largest upper bound scores and then fully scored with the contributions of the remaining query terms. The scheme can effectively reduce the memory consumption of existing term-at-a-time (TAAT) retrieval and the candidate selection cost of existing document-at-a-time (DAAT) retrieval, at the expense of revisiting the posting lists of the remaining query terms. Preliminary analysis and implementation show comparable performance between LSF and these two well-known baselines. To further reduce the number of postings that need to be revisited, we present efficient rank-safe dynamic pruning techniques based on LSF, including two important optimizations called list omitting (LSF_LO) and partial scoring (LSF_PS) that make full use of query term importance. Finally, experimental results on the TREC GOV2 collection show that our new index traversal approaches reduce query latency by almost 27% over the WAND baseline and perform slightly better than the MaxScore baseline, while returning the same results as exhaustive evaluation.
Inverted index / Index traversal / Query latency / Largest scores first (LSF) retrieval / Dynamic pruning
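The exhaustive LSF traversal described in the abstract can be illustrated with a minimal Python sketch. It is one reading of the abstract, not the authors' implementation. The function names (`lookup`, `lsf_top_k`), the in-memory index layout (term mapped to a docid-sorted list of `(doc_id, score)` pairs), the use of the maximum posting score as the upper bound, and the binary-search revisits are all assumptions made for illustration. The pruning variants LSF_LO and LSF_PS are not covered.

```python
import heapq
from bisect import bisect_left


def lookup(postings, doc_id):
    """Return the score of doc_id in a docid-sorted posting list, or None."""
    # (doc_id,) sorts just before any (doc_id, score) pair, so bisect_left
    # lands on the first posting whose docid is >= doc_id.
    i = bisect_left(postings, (doc_id,))
    if i < len(postings) and postings[i][0] == doc_id:
        return postings[i][1]
    return None


def lsf_top_k(index, query_terms, k):
    """Hypothetical LSF-style exhaustive top-k over a toy in-memory index."""
    lists = [index[t] for t in query_terms if index.get(t)]
    # Assumption: a term's upper bound is the maximum score in its list.
    lists.sort(key=lambda pl: max(s for _, s in pl), reverse=True)

    heap = []  # min-heap of (score, doc_id) holding the current top-k
    for i, postings in enumerate(lists):
        for doc_id, score in postings:
            # A document found in a more important list was already fully
            # scored when that list supplied the candidates, so skip it.
            if any(lookup(prev, doc_id) is not None for prev in lists[:i]):
                continue
            # Fully score the candidate by revisiting the remaining lists.
            total = score
            for rest in lists[i + 1:]:
                s = lookup(rest, doc_id)
                if s is not None:
                    total += s
            if len(heap) < k:
                heapq.heappush(heap, (total, doc_id))
            elif total > heap[0][0]:
                heapq.heapreplace(heap, (total, doc_id))
    # Highest score first; ties broken by ascending docid.
    return [(d, s) for s, d in sorted(heap, key=lambda e: (-e[0], e[1]))]


# Toy index with made-up scores, for illustration only.
index = {
    "rare": [(2, 3.0), (7, 2.5)],
    "mid": [(1, 1.0), (2, 1.5), (9, 2.0)],
    "common": [(1, 0.5), (2, 0.5), (7, 0.5), (8, 0.25), (9, 0.5)],
}
top3 = lsf_top_k(index, ["common", "mid", "rare"], k=3)
```

In this sketch, no per-document accumulator table is kept (the TAAT memory cost), and candidates come from one list at a time rather than from a merge over all lists (the DAAT selection cost). The price is the repeated lookups into the other posting lists, which is the revisiting cost that the abstract's pruning techniques aim to reduce.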