%A Yun PENG, Yitong XU, Huawei ZHAO, Zhizheng ZHOU, Huimin HAN %T Most similar maximal clique query on large graphs %0 Journal Article %D 2020 %J Front. Comput. Sci. %J Frontiers of Computer Science %@ 2095-2228 %R 10.1007/s11704-019-7235-0 %P 143601-${article.jieShuYe} %V 14 %N 3 %U {https://journal.hep.com.cn/fcs/EN/10.1007/s11704-019-7235-0 %8 %X

This paper studies the most similar maximal clique query (MSMCQ). Given a graphG and a set of nodes Q,MSMCQ is to find the maximal clique of G having the largest similarity with Q. MSMCQ has many real applications including advertising industry, public security, task crowdsourcing and social network, etc. MSMCQ can be studied as a special case of the general set similarity query (SSQ). However, the MCs of G has several specialties from the general sets. Based on the specialties of MCs, we propose a novel index, namely MCIndex. MCIndex outperforms the state-of-the-art SSQ method significantly in terms of the number of candidates and the query time. Specifically, we first construct an inverted index I for all the MCs of G. Since the MCs in a posting list often have a lot of overlaps, MCIndex selects some pivots to cluster the MCs with a small radius. Given a query Q, we compute the distance from the pivots to Q. The clusters of the pivots assured not answer can be pruned by our distance based pruning rule. Since it is NP-hard to construct a minimum MCIndex, we propose to construct a minimal MCIndex on I(v) with an approximation ratio 1+ ln |I(v)|. Since the MCs have properties that are inherent of graph structure, we further propose a SIndex within each cluster of a MCIndex and a structure based pruning rule. SIndex can significantly reduce the number of candidates. Since the sizes of intersections between Q and many MCs need to be computed during the query evaluation, we also propose a binary representation of MCs to improve the efficiency of the intersection size computation. Our extensive experiments confirm the effectiveness and efficiency of our proposed techniques on several real-world datasets.