Vari-gram language model based on word clustering

Li-chi Yuan

doi:10.1007/s11771-012-1109-z

Journal of Central South University, 2012, Vol. 19, Issue 4: 1057-1062. DOI: 10.1007/s11771-012-1109-z

Article


Abstract

Category-based statistical language models are an important way to address the data sparsity problem, but they face two bottlenecks: 1) word clustering, since it is hard to find a clustering method that performs well at low computational cost; and 2) class-based methods tend to lose predictive ability when adapting to text from different domains. To address these problems, a definition of word similarity based on mutual information is presented, and a definition of word-set similarity is derived from it. Experiments show that the similarity-based word clustering algorithm outperforms the conventional greedy clustering method in both speed and performance, reducing perplexity from 283 to 218. In addition, an absolute weighted difference method is presented and used to construct a vari-gram language model with good predictive ability. Compared with the category-based model, the vari-gram model reduces perplexity from 234.65 to 219.14 on Chinese corpora and from 195.56 to 184.25 on English corpora.
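To put the reported figures in proportion, the following minimal sketch computes the relative perplexity reduction for each comparison in the abstract. The helper names `perplexity` and `relative_reduction` are illustrative assumptions, not taken from the paper, and `perplexity` uses the standard textbook definition (the exponential of the average negative log-probability per token), which may differ in detail from the paper's evaluation setup.

```python
import math


def perplexity(log_probs):
    """Standard perplexity from per-token natural-log probabilities.

    Assumption: this is the usual textbook definition, not necessarily
    the exact evaluation procedure used in the paper.
    """
    return math.exp(-sum(log_probs) / len(log_probs))


def relative_reduction(before, after):
    """Fractional reduction in perplexity (0.10 means 10% lower)."""
    return (before - after) / before


# Perplexity pairs (baseline, proposed) quoted in the abstract.
reported = {
    "clustering: greedy vs. similarity-based": (283.0, 218.0),
    "Chinese: category-based vs. vari-gram": (234.65, 219.14),
    "English: category-based vs. vari-gram": (195.56, 184.25),
}

for name, (before, after) in reported.items():
    print(f"{name}: {relative_reduction(before, after):.1%} lower")
```

Under these definitions, the reductions come to roughly 23% for the clustering comparison, and about 6.6% and 5.8% for the vari-gram model on the Chinese and English corpora respectively.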


Cite this article


Li-chi Yuan. Vari-gram language model based on word clustering. Journal of Central South University, 2012, 19(4): 1057-1062. DOI: 10.1007/s11771-012-1109-z

