Item-level Forecasting for E-commerce Demand with High-dimensional Data Using a Two-stage Feature Selection Algorithm

Hongyan Dai, Qin Xiao, Nina Yan, Xun Xu, Tingting Tong

Journal of Systems Science and Systems Engineering, 2022, Vol. 31, Issue 2: 247-264.

DOI: 10.1007/s11518-022-5520-1

Abstract

With the rapid development of information technology and the fast growth in the number of Internet users, e-commerce now faces a complex business environment and is accumulating large-volume, high-dimensional data. This brings two challenges for demand forecasting. First, e-merchants need appropriate approaches to leverage the large amount of data and extract forecast features that capture the various factors affecting demand. Second, they need to efficiently identify the most important features to improve forecast accuracy and better understand the key drivers of demand changes. To address these challenges, this study conducts multi-dimensional feature engineering by constructing five feature categories (historical demand, price, page views, reviews, and competition) for item-level e-commerce demand forecasting. We then propose a two-stage random forest-based feature selection algorithm to effectively identify the important features from the high-dimensional feature set and avoid overfitting. We test the proposed algorithm on a large-scale dataset from the largest e-commerce platform in China. The numerical results from 21,111 items and 109 million sales observations show that the proposed random forest-based forecasting framework with the two-stage feature selection algorithm improves forecast accuracy by 11.58%, 5.81%, and 3.68% compared with the Autoregressive Integrated Moving Average (ARIMA) model, Random Forest, and Random Forest with one-stage feature selection, respectively, all of which are widely used in the literature and in industry. This study provides a useful tool for practitioners to forecast demand and sheds light on B2C e-commerce operations management.
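The abstract does not spell out what the two stages are, so the following is only an illustrative sketch of one possible two-stage, random forest-based selection, not the authors' algorithm. It assumes that stage 1 ranks features by impurity-based importance and keeps a top share, and that stage 2 refits on nested prefixes of that ranking and keeps the prefix with the lowest validation error. The function name `two_stage_select`, the `keep_frac` parameter, the MAE criterion, and the synthetic data are all hypothetical; the sketch relies on NumPy and scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def two_stage_select(X, y, keep_frac=0.5, seed=0):
    """Illustrative two-stage filter; returns (selected column indices, validation MAE)."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )

    # Stage 1 (assumed): rank every feature by random forest importance
    # and keep only the top `keep_frac` share as candidates.
    rf = RandomForestRegressor(n_estimators=50, random_state=seed).fit(X_tr, y_tr)
    order = np.argsort(rf.feature_importances_)[::-1]
    n_keep = max(1, int(np.ceil(keep_frac * X.shape[1])))
    candidates = order[:n_keep]

    # Stage 2 (assumed): refit on nested prefixes of the ranked candidates
    # and keep the prefix with the lowest mean absolute error on held-out data.
    best_mae, best_subset = np.inf, candidates[:1]
    for k in range(1, len(candidates) + 1):
        cols = candidates[:k]
        model = RandomForestRegressor(n_estimators=50, random_state=seed)
        model.fit(X_tr[:, cols], y_tr)
        mae = np.mean(np.abs(model.predict(X_val[:, cols]) - y_val))
        if mae < best_mae:
            best_mae, best_subset = mae, cols
    return sorted(int(c) for c in best_subset), float(best_mae)


# Synthetic example: only columns 0 and 1 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)
selected, mae = two_stage_select(X, y)
print(selected, round(mae, 3))
```

Validating the second stage on held-out data is one simple way to guard against the overfitting the abstract mentions. For real demand series, a time-ordered split would likely be more appropriate than the random split used here.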

Keywords

Forecasting / e-commerce / high-dimensional feature / feature selection

Cite this article

Hongyan Dai, Qin Xiao, Nina Yan, Xun Xu, Tingting Tong. Item-level Forecasting for E-commerce Demand with High-dimensional Data Using a Two-stage Feature Selection Algorithm. Journal of Systems Science and Systems Engineering, 2022, 31(2): 247-264. DOI: 10.1007/s11518-022-5520-1
