Retraining as Approximate Bayesian Inference

Model retraining is usually treated as an ongoing maintenance task. But as Harrison Katz now argues, retraining can be better understood as approximate Bayesian inference under computational constraints. The gap between a continuously updated belief state and your frozen deployed model is "learning debt," and the retraining decision is a cost-minimization problem with a threshold that falls out of your loss function. In this article Katz provides a decision-theoretic framework for retraining policies. The result is evidence-based triggers that replace calendar schedules and make governance auditable. For readers less familiar with the Bayesian and decision-theoretic language, key terms are defined in a glossary at the end of the article.

Authors: Harrison Katz

THE BAYESIAN FRAMEWORK VS. RETRAINING

In Bayesian inference, learning is continuous. A model maintains a distribution over parameters, not a point estimate. As data arrive, the distribution updates continuously: prior beliefs combine with evidence to form a posterior. If conditions are stable, updates are small. If conditions change, the posterior shifts as evidence accumulates. There is no moment at which the model becomes stale – the model is never "old" or "new." It is simply the current posterior, conditioned on all observed data.

This is the key insight that makes "When should we retrain?" a strange question within the Bayesian framework: the concept of retraining simply doesn't occur. There is no calendar event at which the system becomes wise again.

However, the Bayesian ideal is computationally expensive. Exact updating is often intractable for large models. Many production systems ship point estimates rather than full posteriors. Organizations face deployment pipelines, validation requirements, and governance constraints that make continuous updating impractical.

So in practice, we are forced to batch. We accumulate data, retrain periodically, validate, and deploy. But this discrete cadence is a property of compute and process, not of learning itself (Gama et al., 2014).

Once we batch, we create a decision problem: When is it worth paying the cost of a new training run? The traditional framing treats this as maintenance: refresh the model every week, every month, or whenever a drift alarm fires. An example of such an alarm, using control charts to detect drift, appeared in Foresight Issue 56 (Katz, 2020).

The control chart approach triggers retraining when residual patterns are statistically unusual under the assumption of a stable process. The Bayesian approach triggers when the expected cost of a stale model exceeds the cost of intervention. The first asks "Is this surprising?" The second asks "Is acting worth it?" Both approaches replace rigid schedules with evidence-based triggers, but the basis for setting the trigger differs.

VIEWING TRAINING AS DEBT REDUCTION

In this article I propose a different framing. Consider what we are trying to approximate. With infinite computational resources and zero operational friction, we would maintain a continuously updated belief state.
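To fix ideas, here is a minimal sketch of that ideal in the simplest conjugate setting, a Normal model with known observation noise. The model, data, and numbers are illustrative assumptions, not from the article:

```python
# A minimal sketch of the Bayesian ideal: a Normal-Normal model with known
# observation noise, updated one observation at a time. Everything here
# (prior, noise level, the regime change at t = 50) is a toy illustration.
import numpy as np

rng = np.random.default_rng(0)

mu, var = 100.0, 25.0   # prior belief about an unknown level: N(mu, var)
obs_var = 16.0          # known observation noise variance

# The data-generating process shifts at t = 50 (a regime change).
data = np.concatenate([rng.normal(100, 4, 50), rng.normal(110, 4, 50)])

for t, y in enumerate(data, start=1):
    # Conjugate update: precisions add, the mean is precision-weighted.
    post_prec = 1.0 / var + 1.0 / obs_var
    mu = (mu / var + y / obs_var) / post_prec
    var = 1.0 / post_prec
    if t % 25 == 0:
        print(f"t={t:3d}  posterior mean={mu:6.2f}  sd={np.sqrt(var):.3f}")

# There is no retraining event: the posterior is simply the current belief,
# conditioned on all observed data, and it drifts toward the new regime.
```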
Call this the continuously updated posterior. In practice, the deployed model reflects the posterior from the last training event. Between training events, data arrive but the belief state stays frozen. The gap between the belief we would have under continuous updating and the belief we actually deploy is accumulated learning debt, analogous to technical debt in machine learning systems (Sculley et al., 2015).

In information-theoretic terms, learning debt is a divergence between two belief states. KL divergence (a measure of how one probability distribution differs from a reference probability distribution) is a natural choice (Kullback and Leibler, 1951). We do not need to compute this divergence exactly. What matters is that we can build proxies that track it and then use those proxies to decide when an intervention is justified. We refer to this approach as the learning debt framework.

Key Points

■ Retraining is not maintenance. It is approximate Bayesian inference under computational and operational constraints.

■ The gap between your deployed model and a hypothetical continuously updated model is "learning debt." Monitoring metrics should estimate this debt, not just track error.

■ The retraining decision has asymmetric costs: churn (retraining when stable) and bias (not retraining when shifted). The optimal threshold follows from those costs.

■ Thresholds should be derived from your loss function, not picked because they seem reasonable or match historical cadence.

THE DECISION-THEORETIC LAYER

The problem is not only that the model may be stale. We are also uncertain about why. Is the world changing, or did we observe noise? Is the shift temporary or permanent?

A useful formalization is to maintain a belief about whether the data-generating process has changed. Call this P(shift). It can be explicit (a Bayesian model over regimes) or implicit (a calibrated drift score).

This shift probability need not be constant. Regime changes are more likely around known interventions (product launches, pricing changes, policy shifts) and macro shocks. Bayesian online changepoint detection formalizes this with a hazard rate, a prior on regime change at each step (Adams and MacKay, 2007). A practical approximation is to raise the shift prior during turbulent windows and demand stronger evidence during quiet periods.

Retraining then becomes a decision problem with asymmetric costs:

• Retrain when stable: pay churn cost (compute, deployment risk, potential regressions).
• Do not retrain when shifted: pay bias cost (accumulating forecast error until you catch up).
• Retrain when shifted: pay retrain cost, avoid most bias cost.

Standard decision-theoretic reasoning (Berger, 1985) yields a simple inequality:

Retrain when P(shift) > (churn cost) / (bias cost)

This is where practitioner guidance often fails. Teams pick thresholds because they seem reasonable or match historical cadence. The decision-theoretic framing forces a useful question: What are the actual costs of acting too often versus too late? Once we write those costs down, the threshold is no longer arbitrary. It's a design parameter grounded in the loss function.
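As a minimal sketch, the rule is a one-line comparison once the costs are written down. The dollar figures below are the stylized numbers used later in the Choosing Costs section; the example shift probabilities and the expected duration of staleness are hypothetical inputs a real system would estimate:

```python
# A minimal sketch of the retrain-or-wait rule and a quick sensitivity check.
churn_cost = 1_000.0        # dollars per retrain: compute, hours, deploy risk
bias_cost_per_day = 500.0   # dollars per day of degraded forecasts
expected_stale_days = 4.0   # hypothetical expected staleness if we do nothing

bias_cost = bias_cost_per_day * expected_stale_days
threshold = churn_cost / bias_cost          # retrain when P(shift) > this

def should_retrain(p_shift: float) -> bool:
    """Retrain when expected bias cost, weighted by P(shift), beats churn cost."""
    return p_shift > threshold

print(f"threshold = {threshold:.2f}")              # 0.50 with these numbers
print(should_retrain(0.3), should_retrain(0.7))    # False, True

# Sensitivity: vary the churn cost by 2x in each direction and see whether
# the decision flips for a given shift belief.
for scale in (0.5, 1.0, 2.0):
    t = (churn_cost * scale) / bias_cost
    print(f"churn x{scale}: threshold={t:.2f}, retrain at P=0.7 -> {0.7 > t}")
```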
PROXIES FOR POSTERIOR DIVERGENCE

We cannot compute the continuously updated posterior in most production systems. If we could, we would not be batching. But we can build approximations that aim at the right target. The key change in mindset is to treat monitoring metrics as measurements of belief staleness, not just performance degradation.

Predictive evidence on fresh data

The most direct signal is how surprised the deployed model is by recent outcomes. Compute proper scoring rules (log loss, CRPS) on a rolling window of fresh data (Gneiting and Raftery, 2007; Hersbach, 2000). Systematic surprise indicates a stale belief state. In Bayesian terms, repeated surprise is evidence that the frozen posterior no longer aligns with observed data.

Calibration and distributional mismatches

Many models maintain acceptable average error while becoming miscalibrated in ways that matter downstream. Track calibration curves, prediction interval coverage, and group-level residual structure. Posterior predictive checks provide a useful mindset: compare what the model implies to what the data show, and quantify the gap (Gelman, Meng, and Stern, 1996).

These checks are interpretable. They indicate not just that the model is worse, but how. For example, calibration curves reveal whether the model overestimates or underestimates in specific regimes. Group-level residuals show which customer segments or product categories are drifting. These diagnostics guide not just the decision to retrain, but what to fix.

Shadow fits and parameter stability

Fit a lightweight model on a recent window and compare it to the deployed model. This shadow model need not match the full production architecture. If the shadow wants very different parameters, that is evidence of drift (Gama et al., 2014; Bifet and Gavalda, 2007). The disagreement between deployed and shadow models estimates learning debt.

This approach separates two failure modes. If both models perform poorly, the issue may be features, labels, or pipeline rather than drift. If the shadow improves materially over the deployed model, the world has moved and we are behind.
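As a sketch of this signal, suppose for illustration that both the deployed and shadow models are cheap linear fits. The coefficients, window sizes, and data below are synthetic assumptions, not a production recipe:

```python
# A minimal sketch of shadow-fit monitoring: the deployed coefficients are
# frozen at the last training event; the shadow is refit on a recent window,
# and their disagreement proxies learning debt. All data are synthetic.
import numpy as np

rng = np.random.default_rng(1)

X = rng.normal(size=(500, 2))
beta_old = np.array([1.0, -0.5])
beta_new = np.array([1.0, 0.8])             # the world moved

y_old = X[:400] @ beta_old + rng.normal(0, 0.3, 400)
y_new = X[400:] @ beta_new + rng.normal(0, 0.3, 100)

# Deployed model: least-squares fit frozen at the last training event.
beta_deployed, *_ = np.linalg.lstsq(X[:400], y_old, rcond=None)

# Shadow model: the same cheap fit on the most recent window only.
beta_shadow, *_ = np.linalg.lstsq(X[400:], y_new, rcond=None)

# Disagreement in parameters and in predictions on fresh data.
param_gap = np.linalg.norm(beta_shadow - beta_deployed)
mae_deployed = np.abs(X[400:] @ beta_deployed - y_new).mean()
mae_shadow = np.abs(X[400:] @ beta_shadow - y_new).mean()

print(f"parameter gap: {param_gap:.2f}")
print(f"fresh-window MAE, deployed vs shadow: {mae_deployed:.2f} vs {mae_shadow:.2f}")
# If the shadow improves materially, the world has moved and we are behind;
# if both are poor, suspect features, labels, or pipeline instead.
```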
Domain-specific distributional signals

Some forecasting systems fail not because demand levels changed, but because timing or composition changed. In travel, a pickup forecast may break because lead-time distributions compressed. In retail, product mix may shift while aggregate demand stays flat. In ad bidding, auction competition distributions may change faster than conversion rates.

In each case, tracking distributional divergence (L1, KL, Wasserstein) on the relevant domain object serves as a sensitive early indicator of drift. In the framing of this article, such divergence is a domain-specific proxy for learning debt.

IMPLEMENTATION

Real models are not one-parameter rates. We usually cannot compute exact beliefs. But we can implement the logic as a policy layer around the training pipeline.

Architecture

• Deployed model: production model.
• Fresh-data evaluator: computes proper scoring rules on a rolling window.
• Shadow learner: lightweight model retrained frequently, or a calibrator/last-layer update.
• Evidence aggregator: converts performance differences into an evidence score for stable versus shifted.
• Policy threshold: compares evidence-adjusted shift belief to the cost ratio, triggers retraining when justified.

Policy Template

1. Define what a shift means in the domain: outcomes, inputs, or timing structure.
2. Choose two or three evidence signals that approximate belief staleness.
3. Write down churn cost and bias cost in common units. Run sensitivity analysis.
4. Set the threshold implied by those costs. Implement as code, not lore.
5. Backtest on known disruptions and quiet periods to validate.

Notice what is absent: a universal drift threshold. The threshold is a policy parameter derived from operational reality. If deployments are risky, set the bar higher. If prediction error is expensive and fast moving, set it lower.

This makes governance easier. When asked why we retrained, we answer in auditable terms: the evidence for a real shift crossed the decision threshold implied by our costs.

A full implementation guide is beyond the scope of this article. The architecture above provides a starting point. Details will vary by tech stack, governance requirements, and organizational context.

Choosing Costs

Cost models need not be perfect to be useful. Rough numbers are better than implicit numbers in people's heads. One approach is to define costs in the same units as the business objective.

• Churn cost: engineering hours, compute, deployment risk, translated into dollars or utility.
• Bias cost: expected downstream loss per day of stale predictions times expected duration until intervention.
• Retrain cost: marginal cost of training, evaluation, and deployment.

For instance, churn cost might include four engineer-hours per retrain at $150/hour, plus a deployment risk premium based on historical regression rates. Bias cost might be $500 per day of degraded forecasts, derived from downstream inventory holding costs or revenue impact from mispriced promotions. Exact numbers matter less than order of magnitude. If churn cost is around $1,000 and bias cost is $500/day, the threshold implies retraining is worthwhile if you expect the model to remain stale for more than two days.

Then run sensitivity analysis. If the policy changes only under extreme assumptions, the decision is robust. If it flips under small changes, that identifies where better measurement pays off. In practice, sensitivity analysis is a few hours of spreadsheet work: vary the cost assumptions by 2x in each direction and see whether the policy flips. This is not a major undertaking.

Stylized Examples

Travel demand

Consider a travel demand forecasting system. Each morning we observe on-the-books reservations for future check-in dates and add an expected pickup curve to predict final totals. At the last training event, the model implied that by 30 days before arrival, 60% of eventual demand is already booked.

Now booking windows shift. Travelers book later. At 30 days out, only 40% of eventual demand is booked. The feature pipeline is correct. The data-generating process for timing has changed.

Waiting for final outcomes reveals this too late. But tracking distributional divergence between the current lead-time histogram and a baseline from the training era detects the change immediately.
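A minimal sketch of that check, assuming lead times are summarized as histograms over days before arrival. The gamma-distributed booking data and the alarm logic in the final comment are illustrative assumptions:

```python
# Divergence between a training-era lead-time distribution and a current
# window that books later (shorter lead times). Data are synthetic.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(2)
bins = np.arange(0, 91)                     # lead time in days, 0..90

baseline = rng.gamma(shape=4.0, scale=10.0, size=20_000)  # books earlier
current = rng.gamma(shape=4.0, scale=6.5, size=2_000)     # books later

p, _ = np.histogram(baseline, bins=bins, density=True)
q, _ = np.histogram(current, bins=bins, density=True)

# Smoothed KL divergence between the binned distributions.
eps = 1e-9
p_, q_ = p + eps, q + eps
p_, q_ = p_ / p_.sum(), q_ / q_.sum()
kl = np.sum(q_ * np.log(q_ / p_))

w1 = wasserstein_distance(baseline, current)   # in days of lead time

print(f"KL(current || baseline) = {kl:.3f}")
print(f"Wasserstein distance    = {w1:.1f} days")
# In practice, alarm when a divergence stays above its quiet-period baseline
# for several consecutive days, not on any single noisy reading.
```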
Distribution-level monitoring catches changes that are missed by averages and medians. In the language of this article, sustained lead-time divergence is evidence that the deployed belief state is stale. A shadow timing model fit on recent weeks serves as the continuously updated alternative we approximate. The action rule is unchanged: retrain when the expected cost of stale forecasts, weighted by P(shift), exceeds churn cost.

Retail promotion response

Consider a retail demand forecasting system for a consumer product. The model was trained on two years of history including seasonal patterns and promotional lifts. At the last training event, the model learned that a 20% price promotion typically generates a 2.5x demand multiplier.

Now the competitive environment shifts. A new entrant runs frequent promotions, eroding the lift from your own discounts. The same 20% promotion now generates only a 1.6x multiplier. Aggregate demand may look stable, but the promotional response has changed.

Tracking point forecasts alone may miss this. But monitoring the distribution of forecast errors conditional on promotion status reveals the gap immediately. The model systematically overpredicts during promotions and underpredicts during regular periods. In the language of this article, the promotional response model has accumulated learning debt. A shadow model fit on recent promotions estimates what the updated belief would be. The action rule is unchanged: retrain when the expected cost of biased promotional forecasts exceeds churn cost.
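A sketch of the conditional-error check, using the article's stylized multipliers (2.5x learned, 1.6x realized) with otherwise synthetic data and hypothetical names:

```python
# Monitoring forecast errors conditional on promotion status. The lift
# multipliers are the article's stylized numbers; the rest is illustrative.
import numpy as np

rng = np.random.default_rng(3)
n = 200
promo = rng.random(n) < 0.25               # promotion flag per period
base = rng.normal(1_000, 50, n)            # baseline demand

learned_lift, true_lift = 2.5, 1.6
actual = base * np.where(promo, true_lift, 1.0) + rng.normal(0, 30, n)
forecast = base * np.where(promo, learned_lift, 1.0)

resid = forecast - actual
print(f"mean error, all periods : {resid.mean():8.1f}")
print(f"mean error, promo only  : {resid[promo].mean():8.1f}")
print(f"mean error, regular only: {resid[~promo].mean():8.1f}")
# Conditioning on promotion status isolates the systematic overprediction
# that aggregate error dilutes: the accumulated learning debt lives in the
# promo-conditional slice.
```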
A note on temporary vs. permanent shifts

The learning debt framework does not require knowing in advance whether a shift is temporary or permanent. It responds to evidence as it accumulates. If the shift reverses quickly, the divergence signal fades and no intervention is triggered. If it persists, evidence accumulates until the cost threshold is crossed. The model that was "correct" before the shift is not assumed to be perfect – only that it represented the best available belief at the time. What matters is whether the current deployed belief has drifted far enough from what continuous updating would yield to justify the cost of intervention.

LIMITATIONS

The learning debt framework assumes retraining is a meaningful discrete event with nontrivial cost. It is less useful in two boundary cases.

First, when updates are nearly continuous. High-frequency systems (ad bidding, recommendations, trading) sometimes update on every batch. The question "When to retrain?" dissolves into "How much to learn from each observation?", better handled by learning rate schedules and online learning theory.

Second, when the cost ratio is unknowable. Some organizations cannot estimate bias cost because downstream effects are diffuse or contested. The framework still helps by making disagreement explicit, but it will not resolve it. We can identify what must be true for different policies to be justified and let stakeholders argue assumptions rather than thresholds.

This framework also assumes we can build proxies cheaper than full retraining. If the only way to know whether the model is stale is to retrain and compare, the monitoring layer adds overhead without decision value. In practice this is rare. Most systems have cheap signals (scoring rule degradation, shadow disagreement, distributional shift) that inform before full retraining cost is incurred. The monitoring layer is lightweight: scoring rules and distributional comparisons scale linearly with the number of series. For systems with thousands or millions of time series, the marginal cost of monitoring is negligible. The expense is in the retraining itself, which this framework aims to trigger less often, not more.

This framework requires enough history to establish stable baselines for the monitoring signals and to estimate costs with reasonable confidence. In practice, this is typically whatever was sufficient to train the original model. Systems with very short histories may need to rely on domain priors for cost estimates until enough data accumulates.

CONCLUSION

The learning debt framework offers several advantages:

• It replaces calendar schedules with evidence and costs.
• It turns drift detection from a grab bag of metrics into a coherent attempt to estimate belief staleness.
• It handles predictable turbulence by design. The shift prior can be higher around launches, policy changes, and calendar events (Adams and MacKay, 2007).
• It makes thresholds defensible, auditable, and tunable to risk tolerance.
• It clarifies what retraining buys: not novelty, but reduced learning debt (Wald, 1950; Berger, 1985).

Production ML is not only a modeling problem. It is a sequential decision problem under uncertainty. The Bayesian ideal supplies the conceptual target. Decision theory supplies the action rule. Everything in between is engineering: finding the cheapest, safest approximations that move toward the ideal.

Glossary of Bayesian and Decision-Theoretic Terms

Belief state: The current probability distribution representing what the model "knows." In production systems, this is typically frozen between training events.

Calibration: A probabilistic forecast is well calibrated if events predicted with X% probability occur X% of the time.

CRPS (Continuous Ranked Probability Score): A proper scoring rule for probabilistic forecasts that generalizes mean absolute error to full predictive distributions.

Data-generating process (DGP): The underlying mechanism that produces observed data. A regime shift means this mechanism has changed.

Hazard rate: In changepoint detection, the prior probability that a regime shift occurs at any given time step.

KL divergence: A measure of how different two probability distributions are. Used here to quantify learning debt.

Learning debt: Accumulated divergence between a continuously updated belief and the deployed (frozen) belief. Analogous to technical debt in software systems.

Posterior: Updated belief about model parameters after observing data. Combines the prior with observed evidence.

Posterior predictive: Distribution of future outcomes implied by the model and its uncertainty.

Prior: Belief about model parameters before observing data.

Proper scoring rule: Metric that rewards accurate probabilistic predictions. Examples include log loss and CRPS. Unlike point-forecast metrics, proper scoring rules incentivize honest uncertainty estimates.

Regime shift: Change in the data-generating process that makes older data less informative.
Shadow model: A lightweight model trained on recent data, used to estimate what parameters a continuously updated system would have learned.

REFERENCES

Adams, R.P. & MacKay, D.J.C. (2007). Bayesian online changepoint detection. arXiv preprint arXiv:0710.3742.

Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd ed. Springer.

Bifet, A. & Gavalda, R. (2007). Learning from time-changing data with adaptive windowing. Proceedings of the SIAM International Conference on Data Mining.

Gama, J., et al. (2014). A survey on concept drift adaptation. ACM Computing Surveys, 46(4), Article 44.

Gelman, A., et al. (2013). Bayesian Data Analysis, 3rd ed. Chapman and Hall/CRC.

Gelman, A., Meng, X.L., & Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6(4), 733-807.

Gneiting, T. & Raftery, A.E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477), 359-378.

Hersbach, H. (2000). Decomposition of the continuous ranked probability score for ensemble prediction systems. Weather and Forecasting, 15(5), 559-570.

Katz, J.H. (2020). Monitoring forecast models using control charts. Foresight, 56, 20-25.

Kullback, S. & Leibler, R.A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22(1), 79-86.

Quinonero-Candela, J., Sugiyama, M., Schwaighofer, A., & Lawrence, N.D. (Eds.) (2009). Dataset Shift in Machine Learning. MIT Press.

Sculley, D., et al. (2015). Hidden technical debt in machine learning systems. Advances in Neural Information Processing Systems.

Wald, A. (1950). Statistical Decision Functions. Wiley.

Harrison Katz holds a PhD in statistics from UCLA. He is Head of Finance Data Science & Strategy at Airbnb, leading the forecasting team that supports earnings, treasury operations, and strategic planning and advising on enterprise data strategy. His research focuses on Bayesian methods for compositional and hierarchical time series, including the B-DARMA and B-DARCH frameworks for lead-time and volatility forecasting. Prior to Airbnb, he held research positions at the Federal Reserve Board of Governors.

harrison.katz@airbnb.com
