Optimizing LSTM Neural Networks for Resource-Constrained Retail Sales Forecasting: A Model Compression Study

Reading time: 5 minutes

📝 Original Info

  • Title: Optimizing LSTM Neural Networks for Resource-Constrained Retail Sales Forecasting: A Model Compression Study
  • ArXiv ID: 2601.00525
  • Date: 2026-01-02
  • Authors: Ravi Teja Pagidoju

📝 Abstract

Standard LSTM (Long Short-Term Memory) neural networks provide accurate sales forecasts in the retail industry but require substantial computing power, which can be a barrier especially for small and mid-sized retailers. This paper examines LSTM model compression by gradually reducing the number of hidden units from 128 to 16. Using the Kaggle Store Item Demand Forecasting dataset, which contains 913,000 daily sales records from 10 stores and 50 items, we analyze the trade-off between model size and forecast accuracy. Experiments show that reducing the hidden layer to 64 units not only preserves accuracy but improves it: mean absolute percentage error (MAPE) drops from 23.6% for the full 128-unit model to 12.4% for the 64-unit model. The optimized model is 73% smaller (from 280KB to 76KB) and 47% more accurate. These results show that larger models do not always achieve better results.
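
For context on the reported sizes, LSTM weight counts grow quadratically in the hidden dimension, so halving the width cuts the dominant recurrent block by roughly 4x. Here is a back-of-envelope sketch, assuming a single LSTM layer, univariate input, and float32 weights (the abstract does not spell out the exact layer stack):

```python
# Back-of-envelope size check for one LSTM layer. Assumes univariate
# input (input_dim=1) and float32 weights; the paper's full model also
# has an output head, so these numbers land slightly below the
# reported 280KB/76KB figures.

def lstm_params(hidden, input_dim=1):
    # An LSTM layer has 4 gates, each with input weights, recurrent
    # weights, and a bias: 4 * (input_dim*h + h*h + h) parameters.
    return 4 * (input_dim * hidden + hidden * hidden + hidden)

for h in (128, 64, 32, 16):
    kb = lstm_params(h) * 4 / 1024  # 4 bytes per float32 parameter
    print(f"{h:>3} units: {lstm_params(h):>6} params ~ {kb:6.1f} KB")
# 128 units:  66560 params ~  260.0 KB
#  64 units:  16896 params ~   66.0 KB
#  32 units:   4352 params ~   17.0 KB
#  16 units:   1152 params ~    4.5 KB
```

On these assumptions, going from 128 to 64 units removes about 75% of the weights, in line with the paper's 73% size reduction.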

💡 Deep Analysis

Figure 1: LSTM compression trade-off (figure1_lstm_compression_tradeoff.png)

📄 Full Content

Optimizing LSTM Neural Networks for Resource-Constrained Retail Sales Forecasting: A Model Compression Study

Ravi Teja Pagidoju
Software and AI Developer in Retail, USA; Professional MBA student, Campbellsville University
Rpagi719@students.campbellsville.edu

Index Terms—LSTM compression, neural network optimization, retail forecasting, edge computing, model efficiency

I. INTRODUCTION

Forecasting retail sales is essential for planning day-to-day operations and managing inventory. Retailers lose approximately 1.75% of their annual sales to stock shortages and excess inventory, typically caused by poor forecasting [1]. Deep learning models, especially Long Short-Term Memory (LSTM) networks, have outperformed traditional methods, reducing errors by 20-30% [2].

Deploying an LSTM network is challenging, however. According to [3], a standard LSTM with 128 hidden units needs 4 to 8 GB of memory and dedicated hardware. Small and medium-sized stores often cannot afford the computing power needed to produce accurate forecasts: medium-sized stores make up 65% of the global retail market, but their IT budgets typically range from $50,000 to $100,000 annually [4].

Model compression could address this problem by making neural networks smaller while maintaining the same or higher accuracy. Previous compression research has focused on computer vision tasks [5]; retail forecasting, however, introduces distinct challenges in the form of temporal dependencies and seasonal patterns. No previous study has assessed the relationship between LSTM architecture size and forecast accuracy in the context of retail applications. (Code available at: https://github.com/RaviTeja444/sales-forecast-LSTM)

This paper examines LSTM compression for retail sales forecasting. We address the following research question: what is the minimal LSTM architecture that preserves or improves forecast accuracy? Our contributions are as follows.

  • Systematic evaluation of LSTM network sizes from 16 to 128 hidden units on real retail data (see the sketch after this list)
  • Discovery that moderate compression (64 units) actually improves accuracy
  • Practical guidelines for model selection based on the accuracy-efficiency trade-off
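
As a concrete illustration of the first contribution, here is a minimal sketch of such a hidden-unit sweep. It assumes a Keras setup, a 30-day input window, a single store-item series from the Kaggle dataset's train.csv (date, store, item, sales columns), and an 80/20 chronological split; none of these choices are taken from the paper, so treat them as placeholders.

```python
# Minimal sketch of the 16-128 hidden-unit sweep. Window length,
# epochs, split, and the single-series setup are all assumptions.
import numpy as np
import pandas as pd
import tensorflow as tf

WINDOW = 30  # days of history per training sample (assumed)

def make_windows(series, window=WINDOW):
    # Turn a 1-D sales series into (samples, window, 1) inputs
    # paired with next-day targets.
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X[..., None].astype("float32"), y.astype("float32")

df = pd.read_csv("train.csv", parse_dates=["date"])  # Kaggle Store Item Demand data
sales = df.query("store == 1 and item == 1")["sales"].to_numpy()
X, y = make_windows(sales)
split = int(0.8 * len(X))  # chronological 80/20 split

for units in (128, 64, 32, 16):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(WINDOW, 1)),
        tf.keras.layers.LSTM(units),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mae")
    model.fit(X[:split], y[:split], epochs=10, batch_size=64, verbose=0)
    pred = model.predict(X[split:], verbose=0).ravel()
    mape = 100 * np.mean(np.abs(y[split:] - pred) / np.maximum(np.abs(y[split:]), 1e-8))
    print(f"{units:>3} units -> MAPE {mape:.1f}%  ({model.count_params()} params)")
```
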
II. RELATED WORK

A. LSTM in Retail Forecasting

LSTM networks excel at capturing long-term dependencies in sequential data [6]. Bandara et al. [2] showed that LSTM models reduced forecast errors by 25% compared to ARIMA models in the retail industry. They built their architecture with 128 hidden units per layer, and it required GPU acceleration to run in production.

Recent research explores attention mechanisms to improve LSTM performance. Lim et al. [7] achieved the best results with Temporal Fusion Transformers, which combine LSTM with multi-head attention, but these additions raised the computational requirements to 8GB of memory and 50ms of inference time per prediction, putting them even further out of reach for stores with limited resources. Deep learning approaches for retail forecasting are further validated by recent surveys of RNN methods for forecasting [8] and by results from the M5 competition [9].

B. Neural Network Compression

There are several ways to reduce neural network size through model compression:

Pruning: According to Han et al. [5], removing unnecessary connections can cut model size by 60 to 80% with little loss of precision, but pruning usually requires special hardware to perform sparse matrix operations quickly.

Quantization: Jacob et al. [10] showed that converting 32-bit floating-point weights to 8-bit integers cuts memory use by 75% while keeping accuracy within 1-2%. This method works especially well for edge deployment.

Architecture Reduction: Frankle and Carbin [11] proposed the lottery ticket hypothesis, showing that smaller networks can perform similarly to larger networks when properly initialized. Thi
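
To ground the quantization figures above, here is a self-contained sketch of 8-bit affine quantization in the spirit of Jacob et al. [10], applied to a random stand-in weight matrix rather than the paper's trained model:

```python
# Sketch of 8-bit affine (uniform) quantization: float32 weights are
# mapped to uint8 with a per-tensor scale and zero point, cutting
# memory 4x. Illustrative only; not the paper's pipeline.
import numpy as np

def quantize(w):
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / 255.0                    # one step of the uint8 grid
    zero_point = np.round(-lo / scale).astype(np.uint8)
    q = np.round(w / scale + zero_point).clip(0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

w = np.random.randn(128, 128).astype(np.float32)  # stand-in LSTM weight block
q, s, z = quantize(w)
err = np.abs(w - dequantize(q, s, z)).max()
print(f"memory: {w.nbytes} -> {q.nbytes} bytes (75% smaller), max error {err:.4f}")
```

Storing uint8 values plus one scale and zero point per tensor is what yields the 75% memory cut; the reconstruction error is bounded by half a quantization step.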

📸 Image Gallery

  • figure1_lstm_compression_tradeoff.png
  • figure2_prediction_quality.png
  • figure3_comprehensive_analysis.png

Reference

This content is AI-processed based on open access ArXiv data.
