Vari-gram language model based on word clustering
Li-chi Yuan
Journal of Central South University, 2012, Vol. 19, Issue 4: 1057-1062.
Category-based statistical language models are an important approach to the data sparsity problem, but they face two bottlenecks: 1) word clustering, since it is hard to find a clustering method that performs well at low computational cost; and 2) loss of predictive ability, since class-based methods adapt poorly to text from different domains. To address these problems, a definition of word similarity based on mutual information is presented, and a definition of word set similarity is derived from it. Experiments show that the similarity-based word clustering algorithm outperforms the conventional greedy clustering method in both speed and performance, reducing perplexity from 283 to 218. In addition, an absolute weighted difference method is presented and used to construct a vari-gram language model with good predictive ability. Compared with the category-based model, the vari-gram model reduces perplexity from 234.65 to 219.14 on Chinese corpora and from 195.56 to 184.25 on English corpora.
word similarity / word clustering / statistical language model / vari-gram language model
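For readers unfamiliar with the category-based baseline and the perplexity figures quoted in the abstract, the following is a minimal sketch of the standard class-based bigram factorization, P(w_i | w_{i-1}) ≈ P(c_i | c_{i-1}) · P(w_i | c_i), together with a perplexity computation. It is not the paper's method: the similarity measure and the absolute weighted difference method are not specified in the abstract and are not reproduced here. The toy corpus, the hand-made `word2class` mapping, the function names, and the add-alpha smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter

def train_class_bigram(tokens, word2class):
    """Collect the counts needed by a class-based bigram model."""
    classes = [word2class[w] for w in tokens]
    word_count = Counter(tokens)                       # count(w)
    class_count = Counter(classes)                     # count(c)
    class_bigram = Counter(zip(classes, classes[1:]))  # count(c_prev, c)
    history_count = Counter(classes[:-1])              # count(c_prev) as a history
    return word_count, class_count, class_bigram, history_count

def class_bigram_prob(prev, word, model, word2class, n_classes, alpha=1.0):
    """P(word | prev) = P(c | c_prev) * P(word | c).

    Assumes both words were seen in training (hypothetical simplification).
    """
    word_count, class_count, class_bigram, history_count = model
    c_prev, c = word2class[prev], word2class[word]
    # Class transition with add-alpha smoothing (an assumption, not from the paper).
    p_class = (class_bigram[(c_prev, c)] + alpha) / (
        history_count[c_prev] + alpha * n_classes
    )
    # Maximum-likelihood probability of the word within its class.
    p_word = word_count[word] / class_count[c]
    return p_class * p_word

def perplexity(tokens, model, word2class, n_classes):
    """Perplexity = 2 ** (average negative log2 probability per predicted word)."""
    log_sum = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        log_sum += math.log2(
            class_bigram_prob(prev, word, model, word2class, n_classes)
        )
    return 2 ** (-log_sum / (len(tokens) - 1))

# Toy data: the corpus and the class assignment are invented for illustration.
tokens = "the cat sat on the mat the dog sat on the rug".split()
word2class = {
    "the": "DET", "cat": "N", "dog": "N", "mat": "N", "rug": "N",
    "sat": "V", "on": "P",
}
n_classes = len(set(word2class.values()))

model = train_class_bigram(tokens, word2class)
print(perplexity(tokens, model, word2class, n_classes))
```

Because parameters are shared across all words in a class, the model needs far fewer estimates than a word bigram model, which is the sparsity benefit the abstract refers to. The cost is that words in the same class become indistinguishable as histories, which is one way to read the loss of predictive ability the abstract mentions.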