Catching Up Faster by Switching Sooner: A Prequential Solution to the AIC-BIC Dilemma
Bayesian model averaging, model selection and its approximations such as BIC are generally statistically consistent, but sometimes achieve slower rates of convergence than other methods such as AIC and leave-one-out cross-validation. On the other hand, these other methods can be inconsistent. We identify the “catch-up phenomenon” as a novel explanation for the slow convergence of Bayesian methods. Based on this analysis we define the switch distribution, a modification of the Bayesian marginal distribution. We show that, under broad conditions, model selection and prediction based on the switch distribution are both consistent and achieve optimal convergence rates, thereby resolving the AIC-BIC dilemma. The method is practical; we give an efficient implementation. The switch distribution has a data compression interpretation, and can thus be viewed as a “prequential” or MDL method; yet it is different from the MDL methods that are usually considered in the literature. We compare the switch distribution to Bayes factor model selection and leave-one-out cross-validation.
💡 Research Summary
The paper addresses a long‑standing paradox in statistical model selection: Bayesian model averaging (BMA) and criteria derived from it such as BIC are provably consistent, yet in many settings they converge to the optimal predictor more slowly than methods like AIC or leave‑one‑out cross‑validation (LOO), which can be inconsistent. The authors identify the “catch‑up phenomenon” as the root cause. When several nested models are available, a simpler model often predicts better early on because the more complex model must first learn additional parameters. BMA, however, continues to use the simpler model until the marginal likelihood of the complex model overtakes that of the simpler one. During the interval where the complex model already provides superior predictions, BMA’s cumulative log‑loss (or code length) is unnecessarily large, leading to an extra O(log n) term in the convergence rate.
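The catch-up phenomenon is easy to reproduce numerically. The toy experiment below (an illustrative setup, not an example from the paper) compares the cumulative log-loss of a fixed "simple" model, Bernoulli(1/2), against a "complex" model that must first learn its parameter via Laplace's rule of succession, on data drawn from Bernoulli(0.6). Early on the simple model's cumulative loss is lower; only after enough observations does the complex model catch up.

```python
import math
import random

random.seed(0)
n = 1000
xs = [1 if random.random() < 0.6 else 0 for _ in range(n)]

# Model 0: fixed Bernoulli(1/2) -- the "simple" model, no parameters to learn.
# Model 1: Bernoulli(theta) with Laplace's rule of succession -- must learn theta.
loss_simple = loss_complex = 0.0
ones = 0          # count of 1s seen so far (past data only)
crossover = None  # first sample size at which the complex model is ahead
for t, x in enumerate(xs):
    loss_simple += -math.log(0.5)
    p1 = (ones + 1) / (t + 2)  # Laplace estimate of P(x=1) from past data
    loss_complex += -math.log(p1 if x == 1 else 1 - p1)
    ones += x
    if crossover is None and loss_complex < loss_simple:
        crossover = t + 1

print(crossover)  # sample size at which the complex model has caught up
```

Bayesian model averaging keeps favouring model 0 until well past this crossover point, which is exactly the extra O(log n) loss the paper identifies.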
To overcome this, the authors introduce the switch distribution, a modification of the Bayesian marginal that places a prior not only on individual models but on sequences of models together with explicit switch‑points. Formally, a switch‑sequence s = ((t₁,k₁), …, (t_m,k_m)) specifies that model k₁ is used up to observation t₂‑1, then model k₂ up to t₃‑1, and so on. The prior over S (the set of all such sequences) is chosen so that, after observing data, the posterior probability of each model at the current sample size reflects which model is currently most predictive. Consequently the predictive distribution p_sw switches to the better model almost as soon as it becomes better, eliminating the catch‑up delay.
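The switch-sequence definition above can be made concrete with a small helper (a sketch of the paper's notation, with names of my own choosing): given s = ((t₁,k₁), …, (t_m,k_m)), the model in force at time t is the k_i of the last switch-point t_i ≤ t.

```python
def model_at(switch_sequence, t):
    """Return the index of the model a switch-sequence uses at time t.

    switch_sequence: list of (t_i, k_i) pairs with t_1 = 1 and the t_i
    strictly increasing, so model k_i is used on observations
    t_i, ..., t_{i+1} - 1.
    """
    k = switch_sequence[0][1]
    for t_i, k_i in switch_sequence:
        if t_i <= t:
            k = k_i
        else:
            break
    return k

s = [(1, 0), (100, 1)]  # model 0 up to observation 99, model 1 from 100 on
print(model_at(s, 50), model_at(s, 100))  # -> 0 1
```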
The paper proves several key theoretical results:
- Consistency – Under mild regularity conditions, model selection based on the switch distribution is consistent: if the true data‑generating distribution belongs to one of the candidate models, the probability that the switch procedure selects that model tends to one (Theorem 1).
- Optimal convergence rates – The cumulative log‑loss of the switch distribution is never worse than that of BMA and, in many non‑parametric settings (e.g., histogram density estimation), matches the minimax optimal rate without the extra logarithmic penalty. This shows that the switch distribution attains the same fast rates as AIC/LOO while retaining consistency.
- General applicability – Using a collection of tools developed in Section 5, the authors extend the optimal‑rate result to exponential families and certain non‑linear regression problems, demonstrating that the switch distribution achieves minimax rates in a broad class of parametric and semi‑parametric problems.
- Computational tractability – An efficient dynamic‑programming algorithm computes the switch posterior in Θ(n·k) time for k candidate models and n observations (Theorem 14). This makes the method practical for real data analysis.
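The dynamic-programming idea can be sketched with a forward algorithm in the style of "fixed share": maintain a weight per model, do a Bayes update after each observation, then shift a small fraction of the mass back to a uniform prior so that a switch to another model remains possible. This is a simplified sketch under my own assumptions (a constant switching rate alpha; the paper's actual prior over switch-points differs), but it shows why the cost is O(n·k) weight updates.

```python
import math

def switch_predict(predictors, xs, alpha=0.05):
    """Fixed-share-style forward algorithm for switch-style prediction.

    predictors: list of k functions; predictors[k](past) returns a
    probability that the next symbol is 1 (toy binary setting).
    alpha: constant switching rate (an assumption of this sketch; the
    paper's switch prior uses a particular, decreasing rate).
    Performs O(k) weight updates per observation, O(n*k) in total.
    """
    k = len(predictors)
    w = [1.0 / k] * k  # current posterior weights over the k models
    total_loss = 0.0
    for t, x in enumerate(xs):
        past = xs[:t]
        ps = [f(past) for f in predictors]
        p_mix = sum(wi * (pi if x == 1 else 1 - pi) for wi, pi in zip(w, ps))
        total_loss += -math.log(p_mix)
        # Bayes update of the weights given the new observation
        w = [wi * (pi if x == 1 else 1 - pi) / p_mix for wi, pi in zip(w, ps)]
        # switching step: move a fraction alpha of mass back to uniform
        w = [(1 - alpha) * wi + alpha / k for wi in w]
    return total_loss, w

# toy use: a fixed coin vs. a learning coin, on an all-ones sequence
fixed = lambda past: 0.5
laplace = lambda past: (sum(past) + 1) / (len(past) + 2)
loss, weights = switch_predict([fixed, laplace], [1] * 200)
```

On the all-ones sequence the learning model quickly predicts near 1, so the posterior weight shifts to it almost as soon as it becomes the better predictor.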
The authors also provide a data‑compression interpretation: the switch distribution corresponds to a prequential (sequential) coding scheme where the code length for each block of data is given by the best model for that block. Unlike traditional MDL, which selects a single model to minimize total code length, the switch‑MDL dynamically changes the model, achieving shorter overall code length when the best model varies with sample size.
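A back-of-the-envelope calculation (an illustrative setup of my own, not taken from the paper) shows why switching can shorten the code: on data whose statistics change partway through, encoding each block with the model that fits it, plus a few nats to encode the switch-point, beats any single fixed code.

```python
import math

# A binary string whose statistics change halfway through.
xs = [0] * 100 + [1] * 100

def nll(bits, p):
    """Code length in nats of `bits` under a fixed Bernoulli(p) code."""
    return sum(-math.log(p if b else 1 - p) for b in bits)

single = nll(xs, 0.5)  # one Bernoulli(1/2) code for the whole string
switched = nll(xs[:100], 0.1) + nll(xs[100:], 0.9)  # switch at t = 100
switch_cost = math.log(len(xs))  # rough cost of encoding the switch-point

print(single, switched + switch_cost)
```

The two-block code is shorter even after paying for the switch-point, which is the compression-side view of why the switch distribution outperforms a single fixed model when the best model varies with sample size.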
In the discussion, the paper reconciles its findings with earlier work (e.g., Yang 2005) and highlights a “strange implication” for Bayes factor model selection: the catch‑up phenomenon can cause Bayes factors to be overly conservative, preferring simpler models far longer than warranted. The switch distribution avoids this conservatism by allowing early adoption of more complex models when they become predictive.
Overall, the paper delivers a unified solution to the AIC–BIC dilemma: it proposes a principled, theoretically sound, and computationally feasible method that is both consistent (like BIC) and rate‑optimal (like AIC/LOO). The switch distribution bridges Bayesian, information‑theoretic, and frequentist perspectives, offering a powerful tool for statisticians and machine‑learning practitioners concerned with both model selection and predictive performance.