RESEARCH ARTICLE

Can prior knowledge help graph-based methods for keyword extraction?

  • Zhiyuan LIU,
  • Maosong SUN
  • Department of Computer Science and Technology, State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China

Received date: 28 Apr 2011

Accepted date: 30 Aug 2011

Published date: 05 Jun 2012

Copyright

2014 Higher Education Press and Springer-Verlag Berlin Heidelberg

Abstract

Graph-based methods are among the most widely used unsupervised approaches to keyword extraction. In these methods, words are linked according to their co-occurrences within the document. Graph-based ranking algorithms then rank the words, and those with the highest scores are selected as keywords. Although graph-based methods are effective for keyword extraction, they rank words based solely on the topology of the word graph. In practice, various kinds of prior knowledge indicate how likely a word is to be a keyword. This knowledge may be frequency-based, position-based, or semantic-based. In this paper, we propose incorporating prior knowledge into graph-based methods for keyword extraction and investigate the contributions of this prior knowledge. Experiments reveal that prior knowledge can significantly improve the performance of graph-based keyword extraction. Moreover, by combining prior knowledge with neighborhood knowledge, we achieve the best results in our experiments compared with previous graph-based methods.
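
The pipeline summarized in the abstract can be illustrated with a minimal sketch. The abstract does not state how prior knowledge enters the ranking, so this example assumes one common choice: a biased (personalized) PageRank in which the prior replaces the uniform random-jump distribution. The function names, the co-occurrence window, the damping factor, and the iteration count are all illustrative assumptions, not details taken from the paper.

```python
from collections import defaultdict


def build_cooccurrence_graph(words, window=2):
    """Link words that co-occur within `window` consecutive tokens.

    Returns an undirected weighted graph as a dict of dicts.
    The window size is an assumption made for illustration.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            u = words[j]
            if u != w:  # skip self-loops
                graph[w][u] += 1.0
                graph[u][w] += 1.0
    return graph


def rank_words(graph, prior=None, damping=0.85, iterations=50):
    """Biased PageRank: the random jump follows `prior` instead of a
    uniform distribution. With prior=None this reduces to ordinary
    PageRank on the word graph (topology only).
    """
    nodes = sorted(graph)
    if prior is None:
        prior = {w: 1.0 for w in nodes}
    total = sum(prior.get(w, 0.0) for w in nodes)
    # Normalize the prior into a jump distribution over graph nodes.
    jump = {w: prior.get(w, 0.0) / total for w in nodes}
    out_weight = {w: sum(graph[w].values()) for w in nodes}
    scores = dict(jump)
    for _ in range(iterations):
        new_scores = {}
        for w in nodes:
            # Score flowing into w from its neighbours, split by edge weight.
            incoming = sum(
                scores[u] * graph[u][w] / out_weight[u] for u in graph[w]
            )
            new_scores[w] = (1.0 - damping) * jump[w] + damping * incoming
        scores = new_scores
    return scores
```

A frequency-based prior could be passed as `prior=collections.Counter(tokens)`, and a position-based prior could give larger weights to words that first appear early in the document. Both are hypothetical instantiations of the prior types the abstract names.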

Cite this article

Zhiyuan LIU, Maosong SUN. Can prior knowledge help graph-based methods for keyword extraction? [J]. Frontiers of Electrical and Electronic Engineering, 2012, 7(2): 242-253. DOI: 10.1007/s11460-011-0174-7

