
ETI: an efficient index for set similarity queries
Lianyin JIA, Jianqing XI, Mengjuan LI, Yong LIU, Decheng MIAO
ETI: an efficient index for set similarity queries
Set queries are an important topic and have attracted a lot of attention. Earlier research mainly concentrated on set containment queries. In this paper we focus on the T-Overlap query which is the foundation of the set similarity query. To address this issue, unlike traditional algorithms that are based on an inverted index, we design a new paradigm based on the prefix tree (trie) called the expanded trie index (ETI) which expands the trie node structure by adding some new properties. Based on ETI, we convert the TOverlap problem to finding query nodes with specific query depth equaling to T and propose a new algorithm called TSimilarity to solve T-Overlap efficiently. Then we carry out a three-step framework to extend T-Overlap to other similarity predicates. Extensive experiments are carried out to compare T-Similarity with other inverted index based algorithms from cardinality of query, overlap threshold, dataset size, the number of distinct elements and so on. Results show that T-Similarity outperforms the state-of-the-art algorithms in many aspects.
expanded trie index (ETI) / set similarity query / T-Overlap / T-Similarity algorithm / T-SimilarityExact algorithm
[1] |
Helmer S, Moerkotte G. A performance study of four index structures for set-valued attributes of low cardinality. The VLDB Journal, 2003, 12(3): 244-261
CrossRef
Google scholar
|
[2] |
Helmer S. Index structures for databases containing data items with set-valued attributes. Technical Report 2/97, University at Mannheim, http://pi3.informatik.uni-mannheim.de
|
[3] |
Helmer S, Aly R, Neumann T, Moerkotte G. Indexing set-valued attributes with a multi-level extendible hashing scheme. In: Proceedings of DEXA. 2007, 98-108
|
[4] |
Agrawal P, Arasu A, Kaushik R. On indexing error-tolerant set containment. In: Proceedings of SIGMOD Conference. 2010, 927-938
|
[5] |
Morzy M, Morzy T, Nanopoulos A, Manolopoulos Y. Hierarchical bitmap index: an efficient and scalable indexing technique for setvalued attributes. In: Proceedings of ADBIS. 2003, 236-252
|
[6] |
Terrovitis M, Passas S, Vassiliadis P, Sellis T K. A combination of trietrees and inverted files for the indexing of set-valued attributes. In: Proceedings of CIKM. 2006, 728-737
|
[7] |
Hossain S, Jamil H M. A hybrid index structure for set-valued attributes using item set tree and inverted list. In: Proceedings of DEXA. 2010, 349-357
|
[8] |
Arasu A, Ganti V, Kaushik R. Efficient exact set-similarity joins. In: Proceedings of VLDB. 2006, 918-929
|
[9] |
Chaudhuri S, Ganti V, Kaushik R. A primitive operator for similarity joins in data cleaning. In: Proceedings of ICDE. 2006
|
[10] |
Xiao C, Wang W, Lin X, Yu J X, Wang G. Efficient similarity joins for near duplicate detection. ACM Transactions on Database Systems (TODS) 2008, 36(3): 15
|
[11] |
Hoad T C, Zobel J. Methods for identifying versioned and plagiarized documents. Journal of the American Society for Information Science and Technology, 2003, 54(3): 203-215
CrossRef
Google scholar
|
[12] |
Bayardo R J, Ma Y, Srikant R. Scaling up all pairs similarity search. In: Proceedings of the 16th International Conference on World Wide Web. 2007, 131-140
CrossRef
Google scholar
|
[13] |
Ribeiro L, Härder T. Efficient set similarity joins using min-prefixes. In: Proceedings of ADBIS. 2009, 88-102
|
[14] |
Li C, Lu J, Lu Y. Efficient merging and filtering algorithms for approximate string searches. In: Proceedings of ICDE. 2008, 257-266
|
[15] |
Han J, Pei J, Yin Y, Mao R. Mining frequent patterns without candidate generation: a frequent-pattern tree approach. Data Mining and Knowledge Discovery, 2004, 8(1): 53-87
CrossRef
Google scholar
|
[16] |
Agrawal R, Imielinski T, Swami A N. Mining association rules between sets of items in large databases. In: Proceedings of SIGMOD Conference. 1993, 207-216
|
[17] |
Wang J N, Feng J H, Li G L. Trie-join: efficient trie-based string similarity joins with edit distance constraints. Proceedings of the VLDB Endowment, 2010, 3(1-2): 1219-1230
|
[18] |
Sarawagi S, Kirpal A. Efficient set joins on similarity predicates. In: Proceedings of SIGMOD Conference. 2004, 743-754
|
[19] |
http://en.wikipedia.org/wiki/Trie
|
[20] |
Bay S D, Kibler D, Pazzani M J, Smyth P. The UCI KDD archive of large data sets for data mining research and experimentation. ACM SIGKDD Explorations Newsletter, 2000, 2(2): 81-85
CrossRef
Google scholar
|
/
〈 |
|
〉 |