Zero-shot reinforcement learning for multi-domain task-oriented dialogue policy
Yuan REN, Si CHEN, Ri-Chong ZHANG, Xu-Dong LIU, Ming-Tian PENG
Front. Comput. Sci., 2026, Vol. 20, Issue 9: 2009353
Dialogue policy, a critical component of multi-domain task-oriented dialogue systems, selects dialogue acts based on the received dialogue state. We introduce zero-shot reinforcement learning for dialogue policy learning, which aims to learn dialogue policies that generalize to unseen domains without further training. This setup raises two challenges: 1) representing unseen actions and states, and 2) zero-shot generalization to unseen domains. To address the first, we propose Unified Representation (UR), an ontology-agnostic representation that infers representations in unseen domains by capturing the underlying semantic relations between unseen actions and states and their seen counterparts. To address the second, we propose Q-Values Perturbation (QVP), a family of exploration strategies that can be applied during either training or testing. Experiments on MultiWOZ suggest that UR, QVP, and an integrated framework combining the two are all effective.
dialogue system / reinforcement learning / zero-shot / unified representation / generalization
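
The abstract describes QVP only as a family of exploration strategies that perturb Q-values and can be used at training or test time; it does not give the perturbation itself. As a rough illustration of the general idea, and not the authors' method, the sketch below assumes zero-mean Gaussian noise is added to each Q-value before the greedy choice. The function name `perturbed_greedy_action`, the `noise_scale` parameter, and the example Q-values are all hypothetical.

```python
import random


def perturbed_greedy_action(q_values, noise_scale=0.1, rng=None):
    """Return the index of the largest Q-value after adding noise.

    Assumption: the perturbation is independent zero-mean Gaussian noise
    with standard deviation `noise_scale`. The source does not specify
    the perturbation used by QVP.
    """
    rng = rng or random.Random()
    perturbed = [q + rng.gauss(0.0, noise_scale) for q in q_values]
    # Greedy choice over the perturbed values (first index wins on ties).
    return max(range(len(perturbed)), key=perturbed.__getitem__)


# Hypothetical Q-values for three dialogue acts: two close, one clearly worse.
q = [1.0, 0.95, -2.0]
rng = random.Random(0)
picks = [perturbed_greedy_action(q, noise_scale=0.1, rng=rng) for _ in range(1000)]
counts = {a: picks.count(a) for a in range(len(q))}
print(counts)
```

With a small `noise_scale`, the two near-tied actions are both selected from time to time while the clearly worse one is effectively never chosen, which is the kind of targeted exploration such a perturbation can provide. Setting `noise_scale=0.0` recovers the plain greedy policy.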
Higher Education Press