Zero-shot reinforcement learning for multi-domain task-oriented dialogue policy

Yuan REN, Si CHEN, Ri-Chong ZHANG, Xu-Dong LIU, Ming-Tian PENG

Front. Comput. Sci., 2026, 20(9): 2009353. DOI: 10.1007/s11704-025-41285-5

Artificial Intelligence
RESEARCH ARTICLE

Abstract

Dialogue policy, a critical component of multi-domain task-oriented dialogue systems, decides the dialogue acts according to the received dialogue state. We introduce zero-shot reinforcement learning for dialogue policy learning, which aims to learn dialogue policies capable of generalizing to unseen domains without further training. This setup raises two challenges: 1) representing unseen actions and states, and 2) generalizing zero-shot to unseen domains. For the first issue, we propose Unified Representation (UR), an ontology-agnostic representation that effectively infers representations in unseen domains by capturing the underlying semantic relations between unseen and seen actions and states. To tackle the second issue, we propose Q-Values Perturbation (QVP), a family of exploration strategies that can be applied either during training or testing. Experiments on MultiWOZ suggest that UR, QVP, and an integrated framework combining the two are all effective.
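To make the exploration idea concrete, the sketch below shows one plausible member of a Q-value perturbation family: Gaussian noise is added to the Q-values before greedy action selection, which can be switched on during training or testing. The function name, the Gaussian form of the noise, and the sigma hyperparameter are illustrative assumptions, not the paper's actual QVP formulation.

```python
import numpy as np

def qvp_select_action(q_values, sigma=0.1, rng=None):
    """Greedy action selection over noise-perturbed Q-values.

    q_values -- Q(s, a) estimates for every candidate dialogue act
    sigma    -- scale of the Gaussian perturbation (assumed hyperparameter)
    """
    rng = rng if rng is not None else np.random.default_rng()
    noise = rng.normal(loc=0.0, scale=sigma, size=np.shape(q_values))
    return int(np.argmax(q_values + noise))

# Example usage with a toy Q-vector over five candidate dialogue acts:
q = np.array([0.2, 1.3, 0.9, 1.25, -0.4])
action = qvp_select_action(q, sigma=0.1)
```

Setting sigma to zero recovers plain greedy selection, so the perturbation scale directly controls how much the policy deviates from its current Q-value estimates.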

Keywords

dialogue system / reinforcement learning / zero-shot / unified representation / generalization

Cite this article

Yuan REN, Si CHEN, Ri-Chong ZHANG, Xu-Dong LIU, Ming-Tian PENG. Zero-shot reinforcement learning for multi-domain task-oriented dialogue policy. Front. Comput. Sci., 2026, 20(9): 2009353. DOI: 10.1007/s11704-025-41285-5



RIGHTS & PERMISSIONS

Higher Education Press
