Tuning the burn-in phase in training recurrent neural networks improves their performance
Training recurrent neural networks (RNNs) with standard backpropagation through time (BPTT) can be challenging, especially in the presence of long input sequences. A practical alternative to reduce computational and memory overhead is to perform BPTT repeatedly over shorter segments of the training data set, corresponding to truncated BPTT. In this paper, we examine the training of RNNs when using such a truncated learning approach for time series tasks. Specifically, we establish theoretical bounds on the accuracy and performance loss when optimizing over subsequences instead of the full data sequence. This reveals that the burn-in phase of the RNN is an important tuning knob in its training, with significant impact on the performance guarantees. We validate our theoretical results through experiments on standard benchmarks from the fields of system identification and time series forecasting. In all experiments, we observe a strong influence of the burn-in phase on the training process, and proper tuning can lead to a reduction of the prediction error on the training and test data of more than 60% in some cases.
💡 Research Summary
This paper investigates the practical and theoretical aspects of training recurrent neural networks (RNNs) on long time‑series data using Truncated Back‑Propagation Through Time (TBPTT). Standard full‑sequence BPTT becomes prohibitive for very long sequences because it requires a forward and backward pass over the entire sequence, leading to high memory consumption, increased computational cost, and vanishing or exploding gradients. TBPTT mitigates these issues by splitting the training data into short, overlapping segments of length N and performing BPTT independently on each segment. However, the common practice of initializing the hidden state to zero at the start of every segment introduces a transient “wash‑out” period that can corrupt early predictions and degrade learning.
The authors formalize this transient effect by introducing a burn‑in phase m as an explicit hyper‑parameter. For a given segment Di, the loss is defined as
\[
\mathcal{L}_i(\theta) \;=\; \frac{1}{N-m}\sum_{k=m+1}^{N} \ell\big(\hat{y}_k(\theta),\, y_k\big),
\]
i.e., the first m predictions of each segment are discarded, so that errors caused by the arbitrary (zero) initial hidden state do not contribute to the gradient.
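A minimal sketch of such a burn-in-masked segment loss, assuming mean squared error as the per-step loss (the function name `burn_in_loss` and the toy numbers are illustrative, not taken from the paper):

```python
def burn_in_loss(predictions, targets, m):
    """Mean squared error over one segment, discarding the first m
    'burn-in' steps, whose predictions are corrupted by the
    zero-initialized hidden state. Illustrative sketch only."""
    assert 0 <= m < len(predictions)
    errors = [(p - t) ** 2 for p, t in zip(predictions[m:], targets[m:])]
    return sum(errors) / len(errors)

# Toy segment: early predictions are off because of the wash-out transient
preds = [0.0, 0.5, 0.9, 1.0, 1.0]
targs = [1.0, 1.0, 1.0, 1.0, 1.0]
print(burn_in_loss(preds, targs, m=0))  # transient included in the loss
print(burn_in_loss(preds, targs, m=2))  # transient discarded
```

With m = 0 the large early errors dominate the segment loss; with m = 2 only the settled predictions contribute, which is exactly the effect the burn-in phase is meant to achieve.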