A machine learning approach to query generation in plagiarism source retrieval
Lei-lei KONG, Zhi-mao LU, Hao-liang QI, Zhong-yuan HAN
Plagiarism source retrieval is the core task of plagiarism detection. Using queries extracted from suspicious documents to retrieve the plagiarism sources has become standard practice, so generating queries from a suspicious document is one of the most important steps in plagiarism source retrieval. Heuristic-based query generation methods are widely used in current research. Each heuristic-based method has its own advantages, and none statistically outperforms the others on all suspicious document segments when generating queries for source retrieval. Further improvements to heuristic methods for source retrieval rely mainly on the experience of experts, which makes it difficult to put forward new heuristic methods that overcome the shortcomings of the existing ones. This paper proposes a statistical machine learning approach that selects the best queries from a set of candidates. The approach formulates query generation for source retrieval as a ranking problem: it aims to achieve the optimal source retrieval performance for each suspicious document segment, and it uses learning to rank to generate queries from the candidates. To our knowledge, this is the first work to apply machine learning methods to the problem of query generation for source retrieval. Because no training data exist for learning to rank in this setting, we also build training samples for source retrieval. We rigorously evaluate various aspects of the proposed method on the publicly available PAN source retrieval corpus. The experimental results show that the proposed machine learning based query generation method yields statistically significant improvements over the established baselines in source retrieval effectiveness.
Plagiarism detection / Source retrieval / Query generation / Machine learning / Learning to rank
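As a rough illustration of the ranking formulation described in the abstract, the sketch below ranks candidate queries for a suspicious document segment with a pairwise linear model. It is not the paper's implementation: the abstract does not state which learning-to-rank algorithm or which features are used, so a generic hinge-loss pairwise ranker is swapped in as a stand-in, and every function name, feature value, and quality label here is a hypothetical placeholder. The quality label stands for the source retrieval performance a candidate query achieved on a training segment.

```python
from itertools import combinations


def pairwise_examples(segments):
    """Turn per-segment (features, quality) lists into signed difference vectors."""
    pairs = []
    for candidates in segments:
        # Only candidates from the same segment are compared with each other.
        for (xa, ya), (xb, yb) in combinations(candidates, 2):
            if ya == yb:
                continue  # ties carry no preference information
            diff = [a - b for a, b in zip(xa, xb)]
            pairs.append((diff, 1 if ya > yb else -1))
    return pairs


def train_linear_ranker(pairs, dim, epochs=50, lr=0.1, reg=0.01):
    """Sub-gradient descent on a hinge loss over the difference vectors."""
    w = [0.0] * dim
    for _ in range(epochs):
        for diff, sign in pairs:
            margin = sign * sum(wi * di for wi, di in zip(w, diff))
            # L2 shrinkage, plus a hinge step when the pair violates the margin.
            w = [wi * (1 - lr * reg) for wi in w]
            if margin < 1:
                w = [wi + lr * sign * di for wi, di in zip(w, diff)]
    return w


def rank_candidates(w, candidates):
    """Sort (name, features) candidate queries by descending model score."""
    def score(item):
        return sum(wi * xi for wi, xi in zip(w, item[1]))
    return sorted(candidates, key=score, reverse=True)


# Hypothetical training data: two segments, each with candidate queries
# described by two made-up features and a made-up retrieval quality label.
training_segments = [
    [([0.9, 0.1], 0.8), ([0.4, 0.5], 0.3), ([0.1, 0.9], 0.0)],
    [([0.8, 0.3], 0.6), ([0.2, 0.7], 0.1)],
]

pairs = pairwise_examples(training_segments)
w = train_linear_ranker(pairs, dim=2)

# Candidate queries for an unseen segment; the top-ranked one would be submitted.
new_candidates = [("q_a", [0.2, 0.8]), ("q_b", [0.7, 0.2])]
ranked = rank_candidates(w, new_candidates)
print([name for name, _ in ranked])
```

Candidates are paired only within a segment, which mirrors the goal of choosing the best query per suspicious document segment rather than comparing queries across segments.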