Controllable data synthesis method for grammatical error correction
Liner YANG, Chengcheng WANG, Yun CHEN, Yongping DU, Erhong YANG
Because parallel data for the grammatical error correction (GEC) task are scarce, models based on the sequence-to-sequence framework cannot be trained adequately enough to reach higher performance. We propose two data synthesis methods that can control the error rate and the ratio of error types in synthetic data. The first approach corrupts each word in a monolingual corpus with a fixed probability, using replacement, insertion, and deletion. The second approach trains error generation models and then filters their decoding results. Experiments on different synthetic data show that data with an error rate of 40% and an equal ratio of error types improves model performance the most. Finally, we synthesize about 100 million training examples and achieve performance comparable to the state of the art, which uses twice as much data as we do.
grammatical error correction / sequence to sequence / data synthesis
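The first approach, word-level corruption with a controllable error rate and error-type ratio, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `corrupt`, the `vocab` list of candidate words, the uniform sampling of substitute and inserted words, and the placement of an inserted word after the current word are all hypothetical choices that the abstract does not specify.

```python
import random


def corrupt(tokens, vocab, error_rate=0.4, type_ratio=(1, 1, 1), rng=None):
    """Return a noisy copy of `tokens` for use as a synthetic GEC source sentence.

    Each token is corrupted independently with probability `error_rate`.
    The corruption type is drawn from (replace, insert, delete) with the
    relative weights given in `type_ratio`.
    """
    rng = rng or random.Random()
    ops = ("replace", "insert", "delete")
    noisy = []
    for tok in tokens:
        # Keep the token unchanged with probability 1 - error_rate.
        if rng.random() >= error_rate:
            noisy.append(tok)
            continue
        op = rng.choices(ops, weights=type_ratio, k=1)[0]
        if op == "replace":
            # Assumption: substitutes are sampled uniformly from `vocab`
            # (the sample may coincide with the original word).
            noisy.append(rng.choice(vocab))
        elif op == "insert":
            # Assumption: the extra word is placed after the current word.
            noisy.append(tok)
            noisy.append(rng.choice(vocab))
        # op == "delete": the token is dropped.
    return noisy


# Build one (noisy source, clean target) training pair.
clean = "she goes to school every day".split()
toy_vocab = ["the", "a", "go", "went", "in", "on"]  # hypothetical vocabulary
noisy = corrupt(clean, toy_vocab, error_rate=0.4, rng=random.Random(0))
pair = (" ".join(noisy), " ".join(clean))
```

Setting `error_rate=0.4` and `type_ratio=(1, 1, 1)` corresponds to the configuration the abstract reports as most effective: a 40% error rate with the three error types in equal proportion.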