The new energy vehicle (NEV) supply chain faces significant challenges from highly uncertain end-user demand and sharp fluctuations in key raw material prices. These factors make procurement costs and inventory levels difficult to control, directly affecting supply chain stability and profitability. Traditional methods such as stochastic dynamic programming (DP) and standard reinforcement learning (RL), which react only to historical and current state information, often prove insufficient for these complexities. To overcome these limitations, this paper proposes a Proactive Reinforcement Learning (Pro-RL) framework for joint procurement and inventory decision-making. By integrating a predictive information module into the sequential decision process of a Soft Actor-Critic (SAC) agent, the framework constructs an enhanced state space that incorporates predicted future information. This moves the agent beyond passive response patterns, enabling it to exploit market information proactively and, through iterative learning, to balance immediate costs against long-term risks. To validate the framework, this study develops a supply chain simulation platform that reflects NEV industry characteristics and conducts comparisons against multiple benchmarks. Experimental results demonstrate that this end-to-end policy, which couples predictive information with deep reinforcement learning, responds better to market volatility and achieves coordinated optimization of cost and service levels, providing NEV enterprises with both a theoretical model and a practical approach for building flexible, efficient smart supply chains.
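The enhanced-state construction described above lends itself to a brief illustration. The sketch below is a minimal, hypothetical rendering of that idea in Python: the current operational observation is concatenated with short-horizon forecasts of demand and raw material prices before being passed to the SAC actor and critic. All names here (ProactiveStateBuilder, forecast_demand, forecast_price, the chosen state fields) and the placeholder forecasters are illustrative assumptions, not the paper's actual predictive module.

```python
# Minimal sketch of a forecast-augmented ("enhanced") state for an SAC agent.
# The predictive module is stubbed out with trivial forecasters; a real system
# would plug in a trained forecasting model. All names are hypothetical.
import numpy as np


class ProactiveStateBuilder:
    def __init__(self, horizon: int = 4):
        self.horizon = horizon  # number of future periods appended to the state

    def forecast_demand(self, demand_history: np.ndarray) -> np.ndarray:
        # Placeholder predictor: repeat the trailing moving average.
        avg = demand_history[-4:].mean()
        return np.full(self.horizon, avg)

    def forecast_price(self, price_history: np.ndarray) -> np.ndarray:
        # Placeholder predictor: naive last-value (random-walk) forecast.
        return np.full(self.horizon, price_history[-1])

    def build_state(self, inventory: float, backlog: float,
                    demand_history: np.ndarray,
                    price_history: np.ndarray) -> np.ndarray:
        # Enhanced state = current operational state + predicted future information.
        current = np.array([inventory, backlog,
                            demand_history[-1], price_history[-1]])
        future = np.concatenate([self.forecast_demand(demand_history),
                                 self.forecast_price(price_history)])
        return np.concatenate([current, future])


# Usage: the resulting vector is what the SAC actor/critic networks would consume.
builder = ProactiveStateBuilder(horizon=4)
state = builder.build_state(
    inventory=120.0, backlog=15.0,
    demand_history=np.array([80.0, 95.0, 110.0, 100.0]),
    price_history=np.array([9.2, 9.8, 10.5, 10.1]),
)
print(state.shape)  # (4 current features + 2 * horizon forecasts,) = (12,)
```

The forecasters here are deliberately trivial; the point is only that the policy's input vector grows to include predicted future information, which is what allows the agent to act proactively rather than merely react to realized demand and prices.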
Funding
University-Level “Integrated Curriculum Development” Special Fund (SG-JW-2024)