Compound decisions and empirical Bayes via Bayesian nonparametrics

Nikolaos Ignatiadis (ignat@uchicago.edu)
Sid Kankanala (sid.kankanala@chicagobooth.edu)

Draft manuscript: February 2026

Abstract

We study the Gaussian sequence compound decision problem and analyze a Bayesian nonparametric estimator from an empirical Bayes, regret-based perspective. Motivated by sharp results for the classical nonparametric maximum likelihood estimator (NPMLE), we ask whether an analogous guarantee can be obtained using a standard Bayesian nonparametric prior. We show that a Dirichlet-process-based Bayesian procedure achieves near-optimal regret bounds. Our main results are stated in the compound decision framework, where the mean vector is treated as fixed, while we also provide parallel guarantees under a hierarchical model in which the means are drawn from a true unknown prior distribution. The posterior mean Bayes rule is, a fortiori, admissible, whereas we show that the NPMLE plug-in rule is inadmissible.

1 Introduction

Empirical Bayes (EB) methods [Robbins, 1956, Efron, 2019] provide a principled framework for borrowing strength across a large number of related estimation problems. One of the most important results in the EB literature is due to Jiang and Zhang [2009], who established precise risk guarantees for denoising in the Gaussian sequence model using the nonparametric maximum likelihood estimator (NPMLE) of Robbins [1950] and Kiefer and Wolfowitz [1956]. The risk bounds of Jiang and Zhang [2009] are purely frequentist in nature and so provide frequentist credence to EB methods. It is natural to ask whether similar guarantees hold for a fully Bayesian approach. Perhaps surprisingly, this question has received little attention since the work of Datta [1991a].
In this paper, we demonstrate that strong frequentist risk guarantees can be obtained using a standard Bayesian nonparametric (BNP) approach based on the Dirichlet process (DP) prior of Ferguson [1973]. Beyond these risk guarantees, we inherit the usual benefits of a fully nonparametric Bayesian approach. These benefits include, among other things, admissibility, explicit regularization through a user-specified prior (as compared to the implicit regularization of the NPMLE [Polyanskiy and Wu, 2020]), and Bayesian uncertainty quantification.

1.1 Statistical setting and preview of main result

We consider the Gaussian sequence model with unknown mean vector $\mu = (\mu_1, \dots, \mu_n)$:

$$Z_i \mid \mu_i \overset{\text{ind}}{\sim} \mathcal{N}(\mu_i, 1), \quad i = 1, \dots, n. \tag{CD}$$

We also write $Z = (Z_1, \dots, Z_n)$, so that $Z \mid \mu \sim \mathcal{N}(\mu, I_n)$. Our main analysis will be fully frequentist, i.e., with $\mu = (\mu_1, \dots, \mu_n)$ fixed. To make this absolutely clear, we will write $\mathbb{P}_\mu[\cdot]$ and $\mathbb{E}_\mu[\cdot]$ for probabilities and expectations under (CD) with $\mu$ fixed. Our compound decision (CD) problem is to estimate $\mu$ based on the observations $Z$ as $\hat{\mu} = t(Z)$ for some decision rule $t : \mathbb{R}^n \to \mathbb{R}^n$, with risk measured by the root mean squared error (RMSE):

$$R(\hat{\mu}, \mu) := \sqrt{\frac{1}{n}\,\mathbb{E}_\mu\big[\|\hat{\mu} - \mu\|_2^2\big]}. \tag{1}$$

The EB approach studied by Jiang and Zhang [2009] proceeds as follows. First, one posits the working model that all parameters are iid draws from an unknown distribution $G$:

$$\mu_i \mid G \overset{\text{iid}}{\sim} G, \quad i = 1, \dots, n. \tag{B}$$

By working model we mean that (B) may not hold. If the above model were true, then the Bayes estimator of $\mu$ under squared error loss is the posterior mean $\hat{\mu}^B = \mathbb{E}_G[\mu \mid Z]$ with $\hat{\mu}^B_i = \mathbb{E}_G[\mu_i \mid Z_i]$. The subscript $G$ indicates that the expectation is taken under the working model (CD) and (B).
In the G-modeling approach to EB [Efron, 2014], one instead estimates $G$ from the data $Z$, say using the NPMLE $\hat{G}$, and finally one forms the plug-in rule $\hat{\mu}^{EB} := \mathbb{E}_{\hat{G}}[\mu \mid Z]$. When $G$ in (B) exists as a physical object, namely, the frequency distribution of $(\mu_1, \mu_2, \dots)$, then $\hat{\mu}^{EB}$ is naturally expected to enjoy strong risk guarantees under (CD) and (B), provided $\hat{G} \approx G$ in a suitable sense. This "physical $G$" perspective in EB goes back to Robbins [1956]; Efron [2019] refers to it as "finite Bayes". Moreover, Jiang and Zhang [2009] prove a stronger statement: the same NPMLE plug-in procedure enjoys strong frequentist risk guarantees under (CD) alone. This is the classical compound decision setting of Robbins [1951], and is also referred to as "oracle Bayes" in Efron's terminology.

Under (B), the true prior $G$ is a parameter of interest. Thus, in a fully Bayesian analysis, we would also endow $G$ with a prior $\Pi$,

$$G \sim \Pi. \tag{BB}$$

For example, $\Pi = \mathrm{DP}(\alpha, H)$ could be a Dirichlet process prior with concentration parameter $\alpha > 0$ and base distribution $H$ (see Supplement A for the definition). The three layers (CD), (B), and (BB) induce the posterior distribution of $G$. A BNP analysis is typically interested in the frequentist properties of this posterior when data-generation is governed by (CD) and (B) only, e.g., does the posterior contract around the true $G$ in a certain metric and at what rate?

The three-layer BNP model immediately suggests an estimator for $\mu$ [Kuo, 1986, Escobar, 1994, MacEachern, 1994], namely the posterior mean

$$\hat{\mu}^{BB} \equiv \hat{\mu}^{BB}(\Pi) := \mathbb{E}_\Pi[\mu \mid Z], \tag{2}$$

where $\mathbb{E}_\Pi[\cdot \mid Z]$ denotes expectation with respect to all levels of the hierarchy (CD), (B), and (BB). It is natural to ask whether $\hat{\mu}^{BB}$ has similar regret guarantees to $\hat{\mu}^{EB}$ when data-generation is governed by (CD) only [Jiang and Zhang, 2009].
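To make the prior (BB) concrete, a draw $G \sim \mathrm{DP}(\alpha, H)$ can be simulated with the standard stick-breaking construction. A minimal sketch in pure Python (the truncation level $k$ and the uniform base measure standing in for $H$ are illustrative choices, not specified in the paper):

```python
import random

def sample_dp_stick_breaking(alpha, base_sampler, k=200, rng=random):
    """Approximate draw G ~ DP(alpha, H) via truncated stick-breaking:
    w_j = v_j * prod_{l<j} (1 - v_l), with v_j ~ Beta(1, alpha) and atoms iid from H."""
    atoms, weights, remaining = [], [], 1.0
    for _ in range(k):
        v = rng.betavariate(1.0, alpha)
        weights.append(remaining * v)
        atoms.append(base_sampler())
        remaining *= 1.0 - v
    # lump the leftover stick mass onto one extra atom so the weights sum to 1
    weights.append(remaining)
    atoms.append(base_sampler())
    return atoms, weights

random.seed(0)
# base measure H = Unif[-2, 2], i.e., M = 2 (illustrative choice)
atoms, weights = sample_dp_stick_breaking(alpha=1.0, base_sampler=lambda: random.uniform(-2, 2))
```

Each draw is a discrete distribution supported on atoms from $H$; smaller $\alpha$ concentrates the mass on fewer atoms.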
The next theorem provides such a frequentist regret guarantee for $\hat{\mu}^{BB}$. Before stating the theorem, we first specify the choice of $\Pi$.

Specification 1.1 (DP). Suppose that $\Pi = \mathrm{DP}(\alpha, H)$, where $\alpha > 0$ and the base distribution $H$ is supported on $[-M, M]$ for some $M > 0$ and has density bounded below by $\eta$ for some $\eta > 0$.

Theorem 1.2. Suppose that Specification 1.1 holds for $\Pi$. Then there exists a constant $C > 0$ depending only on $(M, \alpha, \eta)$ such that for all $n \geq 2$,

$$\sup_{\mu \in [-M, M]^n} \left\{ R\big(\hat{\mu}^{BB}(\Pi), \mu\big) - \inf_{\tilde{\Pi}} R\big(\hat{\mu}^{BB}(\tilde{\Pi}), \mu\big) \right\} \leq C\,\frac{\log^{5/2} n}{\sqrt{n}},$$

where the infimum is over all possible priors $\tilde{\Pi}$ and $R(\cdot, \mu)$ is defined as the frequentist risk in (1) when data is generated according to (CD) with $\mu$.

In particular, Theorem 1.2 implies that, for estimating $\mu \in [-M, M]^n$, one cannot substantially improve on the class of posterior means induced by Dirichlet process priors by instead adopting any alternative prior $\tilde{\Pi}$. Moreover, as we explain below, this theorem can be strengthened to show that we cannot improve substantially over any permutation-equivariant estimator $t(Z)$, even if it has oracle knowledge of the full vector $\mu$.

1.2 Structure of this paper

The paper is organized as follows. Section 2 discusses related work. In Section 3, we analyze the BNP estimator under the empirical Bayes model where both (CD) and (B) govern data generation; this serves as a warm-up for our main results. Section 4 contains our main theoretical contribution: risk bounds for the BNP estimator under the purely frequentist compound decision model (CD). In both sections, we also establish that the BNP estimator is admissible while the NPMLE is not. Section 5 situates our work within the broader hierarchy of Bayesian and empirical Bayesian approaches using the framework of Good [1992]. Section 6 presents numerical results. Proofs are collected in the Supplement.
2 Related work

The idea of using a fully Bayes approach to compound decision problems dates back to the early days of EB. To wit, the idea already appears in the paper that introduced compound decision theory [Robbins, 1951]. Robbins studies model (CD) with $\mu \in \{\pm 1\}^n$. He proposes to estimate $\mu$ by first using $Z$ to estimate $p_n := \#\{i : \mu_i = 1\}/n$ via $\hat{p}_n = (\bar{Z} + 1)/2$, where $\bar{Z} = n^{-1}\sum_{i=1}^n Z_i$, and then applying the Bayes rule for the model in which $(\mu_i + 1)/2 \overset{\text{iid}}{\sim} \mathrm{Bernoulli}(\hat{p}_n)$. Moreover, Robbins shows that this estimator asymptotically outperforms the naive estimator $Z$ in terms of risk as long as $p_n$ is not too close to $1/2$ (with a purely frequentist evaluation of risk). Robbins however acknowledges that his proposed estimator is not admissible, and moreover, he suggests a "possible candidate for a rule superior to [his]" by pursuing a fully Bayes approach in which one places a prior on $\mu \in \{\pm 1\}^n$. Robbins' conjecture is confirmed by Gilliland et al. [1976]. In discussing Efron [2019], van der Vaart [2019] mentions that "Preferably theory [for BNP methods, e.g., inference] should cover the frequentist setup [of (CD)]." Datta [1991a] already established a version of our Theorem 1.2, but without an explicit rate, replacing $\log^{5/2} n/\sqrt{n}$ by an unspecified $o(1)$ term. Our main result thus sharpens Datta's guarantee and, we believe, brings overdue attention to this line of work.

Sharp frequentist risk guarantees for Bayesian hierarchical methods in the context of (CD) have also been established in the sparse setting, where most $\mu_i$ are zero [Castillo and van der Vaart, 2012, Ročková, 2018]. Our relationship to these works is analogous to how Jiang and Zhang [2009] relates to Johnstone and Silverman [2004], yielding mean estimation guarantees without endowing a special role to $\mu_i = 0$ in the estimation strategy.
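Robbins' two-point procedure recalled above is short enough to state in code. A minimal sketch (the clipping of $\hat{p}_n$ to $[0, 1]$ is our numerical-safety addition, not part of Robbins' description):

```python
import math

def phi(x):
    # standard normal density
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def robbins_two_point(z):
    """Robbins' (1951) compound rule for mu in {-1, +1}^n: estimate
    p_n = #{i : mu_i = 1}/n by (zbar + 1)/2, then apply the Bayes (posterior
    mean) rule for the model (mu_i + 1)/2 ~ Bernoulli(p_hat)."""
    zbar = sum(z) / len(z)
    p_hat = min(max((zbar + 1.0) / 2.0, 0.0), 1.0)  # clipping is our addition
    def posterior_mean(zi):
        num = p_hat * phi(zi - 1.0) - (1.0 - p_hat) * phi(zi + 1.0)
        den = p_hat * phi(zi - 1.0) + (1.0 - p_hat) * phi(zi + 1.0)
        return num / den
    return [posterior_mean(zi) for zi in z]
```

Each coordinate estimate lies in $(-1, 1)$ and is monotone in $Z_i$, in contrast to the naive estimator $Z$.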
Going beyond (CD) to thinking about both (CD) and (B) brings us to the "Bayes empirical Bayes" (BEB) dictum of Deely and Lindley [1981]. Classical implementations of this dictum using the Dirichlet process include Antoniak [1974] and Berry and Christensen [1979]. The dictum has recently seen renewed interest across several directions. One line of work extends the reach of EB beyond the sequence model, notably to high-dimensional generalized linear models [Weinstein et al., 2025]^1 and to general probabilistic symmetries [Wu et al., 2025]; our results complement these by strengthening the theoretical foundations within the sequence model. The Bayes EB perspective has also seen other recent developments. Cannella et al. [2026], in parallel and independent work, develop theory closely related to ours under (CD) and (B); in particular, their Theorem 3.3 is analogous to our Theorem 3.1 but does not cover the compound setting of Theorem 1.2. They use this to explain the empirical observation of Teh et al. [2025] that transformers pretrained on synthetic data achieve low regret in EB problems. Favaro and Fortini [2025] develop a sequential approach to EB using Newton's algorithm in its interpretation as a Bayesian predictive learning rule [Fortini and Petrone, 2020]. Regarding inference, Ghosal [2022] suggests using BNP to construct confidence intervals for empirical Bayes estimands such as the posterior mean $\hat{\mu}^B_i$ as an alternative to the confidence intervals of Ignatiadis and Wager [2022], and Ignatiadis and Ma [2025] extend the empirical partially Bayes testing approach of Ignatiadis and Sen [2025] via BNP to the case wherein both the prior and the likelihood are unknown.
Lee and Sui [2025] consider a hierarchical Bayes version of the log-spline G-modeling approach of Efron [2016]. For the binomial EB problem, Gu and Koenker [2017] compare the NPMLE to a BNP approach and find comparable performance. In settings where (CD) and (B) hold, recovery of the true mixing measure has received some attention in the nonparametric Bayes literature. For Bayesian estimation of a latent density in deconvolution with a known error distribution, see Donnet et al. [2018a] and Rousseau and Scricciolo [2024]. Kankanala [2025b] studies contraction rates for quasi-Bayes posteriors obtained by updating a prior with a moment-based quasi-likelihood in a variety of latent variable models.

3 Empirical Bayes via Bayesian nonparametrics

3.1 Empirical Bayes regret

We first discuss the more straightforward risk guarantees that arise when both (CD) and (B) govern data generation. Our goal is to estimate the mean vector $\mu$, and so we consider decision rules $t : \mathbb{R}^n \to \mathbb{R}^n$ and estimators of the form $\hat{\mu} = t(Z)$, with performance measured by the RMSE

$$R(\hat{\mu}, G) := \sqrt{\frac{1}{n}\,\mathbb{E}_G\big[\|\hat{\mu} - \mu\|_2^2\big]}. \tag{3}$$

The subscript $G$ indicates that the expectation is taken under both levels of the hierarchy (CD) and (B).^2

---
^1 An early version of this paper was circulated under the title "Hierarchical Bayes modeling for large-scale inference." Its abstract reads as follows: "[...] As an alternative to empirical Bayes methods, in this paper we propose hierarchical Bayes modeling for large-scale problems, and address two separate points that, in our opinion, deserve more attention. The first is nonparametric 'deconvolution' methods that are applicable also outside the sequence model. The second point is the adequacy of Bayesian modeling for situations where the parameters are by assumption deterministic. [...]"
Under this notion of risk, the optimal rule is the Bayes rule:

$$\hat{\mu}^B \equiv \hat{\mu}^B(G) := \big(\delta_G(Z_1), \dots, \delta_G(Z_n)\big), \quad \text{with } \delta_G(z) := \frac{\int \mu\,\varphi(z - \mu)\,\mathrm{d}G(\mu)}{\int \varphi(z - \mu)\,\mathrm{d}G(\mu)}, \tag{4}$$

where $\varphi$ is the standard normal density. The Bayes rule has risk $R(\hat{\mu}^B, G) = \sqrt{\mathbb{E}_G[\mathrm{Var}_G[\mu_i \mid Z_i]]}$ and is an oracle estimator because it depends on the unknown $G$ in (B). We next study the risk properties of the BNP estimator $\hat{\mu}^{BB}$ defined in (2). We have the following result, analogous to Theorem 1.2 previewed earlier.

Theorem 3.1. Suppose that Specification 1.1 holds for $\Pi$. Then there exists a constant $C > 0$ depending only on $(M, \alpha, \eta)$ such that for all $n \geq 2$,

$$\sup_{G \in \mathcal{P}([-M, M])} \left\{ R\big(\hat{\mu}^{BB}(\Pi), G\big) - \sqrt{\mathbb{E}_G[\mathrm{Var}_G[\mu_i \mid Z_i]]} \right\} \leq C\,\frac{\log^{5/2} n}{\sqrt{n}},$$

where $\mathcal{P}([-M, M])$ is the set of all probability measures on $[-M, M]$. We note that in independent concurrent work, Cannella et al. [2026] prove an analogous result for a different choice of $\Pi$.^3

3.2 Bayes properties and admissibility

In this section, we discuss additional properties of the BNP estimator $\hat{\mu}^{BB}$. In particular, we establish that this estimator is admissible with respect to the risk $R(\cdot)$ defined in (3). We begin with a result from Datta [1991a, Equation (3.4)], which provides a useful leave-one-out (LOO) interpretation of $\hat{\mu}^{BB}_i$: it can be viewed as an empirical Bayes estimator in which the prior $G$ is learned from $Z_{-i}$, while $Z_i$ enters only through the final decision rule.

Proposition 3.2 (LOO representation). We have that $\hat{\mu}^{BB}_i = \delta_{G(Z_{-i})}(Z_i)$, where $G(Z_{-i}) := \mathbb{E}_\Pi[G \mid Z_{-i}]$ is the posterior mean of $G$ given $Z_{-i} = (Z_1, \dots, Z_{i-1}, Z_{i+1}, \dots, Z_n)$.

This LOO representation contrasts with the classical NPMLE empirical Bayes estimator $\hat{\mu}^{EB}_i$, which estimates $G$ using all observations, including $Z_i$.
We find the separation of roles between $Z_i$ and $Z_{-i}$ in the LOO representation appealing: it highlights the roles of direct information (from $Z_i$) and indirect information (from $Z_{-i}$) in the estimation of $\mu_i$. We also note that, for the NPMLE, there is some empirical evidence that leave-one-out estimation can perform better [Ho, 2025]; however, it requires solving the optimization problem $n$ times, which can be computationally prohibitive, whereas the Bayes rule delivers this LOO structure for free.

We now turn to admissibility of the NPMLE and BNP estimators. We start by formalizing admissibility in our EB setting under (CD) and (B); see also Boyer and Gilliland [1980] and Balder et al. [1983] for background on admissibility of empirical Bayes decisions.

Definition 3.3 (Admissibility). An estimator $\hat{\mu}$ is inadmissible in the EB setting if there exists another estimator $\tilde{\mu}$ such that:

1. $R(\tilde{\mu}, G) \leq R(\hat{\mu}, G)$ for all $G \in \mathcal{P}([-M, M])$, and
2. $R(\tilde{\mu}, G_0) < R(\hat{\mu}, G_0)$ for some $G_0 \in \mathcal{P}([-M, M])$.

Otherwise, we say that $\hat{\mu}$ is admissible.

Our first result establishes that the BNP estimator $\hat{\mu}^{BB}$ is admissible in the sense of Definition 3.3.

Proposition 3.4. The estimator $\hat{\mu}^{BB}$ is admissible in the EB setting with data generating process given by both (CD) and (B).

We next show that the NPMLE-based EB estimator $\hat{\mu}^{EB}$ is inadmissible.

---
^2 We distinguish between the two types of risk in (1) and (3) based on the second argument of $R(\cdot, \cdot)$, which can be either a fixed vector $\mu$ or a fixed distribution $G$. We have that $R^2(\hat{\mu}, G) = \mathbb{E}_G[R^2(\hat{\mu}, \mu)]$.
^3 Specifically, Cannella et al. [2026] choose $\Pi = \Pi_n$ as a function of $n$ that is specified as follows. To sample $G \sim \Pi_n$, let $k = \lceil c_0 \log n / \log\log n \rceil$ for large $c_0 > 0$, and then draw $\lambda_1, \dots, \lambda_k \overset{\text{iid}}{\sim} \mathrm{Unif}[-M, M]$ and $(w_1, \dots, w_k) \sim \mathrm{Dir}(1, \dots, 1)$. Finally, set $G = \sum_{j=1}^k w_j \delta_{\lambda_j}$.
To state this result precisely, we first define the NPMLE of the mixing distribution $G$. Let

$$\hat{G} \equiv \hat{G}(Z) \in \operatorname*{argmax}_{G \in \mathcal{P}(\Theta)} \left\{ \prod_{i=1}^n f_G(Z_i) \right\}, \quad f_G(z) := \int \varphi(z - \mu)\,\mathrm{d}G(\mu), \tag{5}$$

where $\mathcal{P}(\Theta)$ denotes the set of all probability distributions supported on $\Theta$, with $\Theta = \mathbb{R}$ in the unconstrained case and $\Theta = [-M, M]$ in the constrained case. Here, $f_G(\cdot)$ is the marginal density of $Z_i$ induced by the hierarchical model (CD)–(B). The EB estimator for the $i$-th coordinate is the plug-in Bayes rule $\hat{\mu}^{EB}_i(Z) := \delta_{\hat{G}}(Z_i)$.

Proposition 3.5. Let $\hat{\mu}^{EB}$ denote the NPMLE EB estimator defined above, where $\hat{G}$ in (5) is either constrained to $\mathcal{P}([-M, M])$ or taken over $\mathcal{P}(\mathbb{R})$. Then $\hat{\mu}^{EB}$ is inadmissible.

We note that Datta and Polson [2025] also consider admissibility questions in a closely related EB setting, but from a different perspective. They focus on F-modeling strategies [Efron, 2014], in which one directly estimates the marginal density $f_G$ by $\hat{f}$ and then plugs $\hat{f}$ into the Eddington/Tweedie formula [Dyson, 1926, Efron, 2011],

$$\delta_G(z) = z + \frac{f'_G(z)}{f_G(z)}, \tag{6}$$

yielding an estimator of the form $\hat{\mu}_i = Z_i + \hat{f}'(Z_i)/\hat{f}(Z_i)$. For several popular F-modeling procedures, such as the polynomial log-marginal proposal of Efron [2011], they ask whether there exists an implicit (data-driven) prior $\tilde{G}$ such that $\hat{\mu}_i = \delta_{\tilde{G}}(Z_i)$. However, even when such an implicit prior $\tilde{G}$ exists, this does not by itself imply that the corresponding estimator is Bayes, or admissible, for the full decision problem.

3.3 Posterior contraction and proof of Theorem 3.1

We now describe the main results used in the proof of Theorem 3.1. Our starting point is a posterior contraction result for the Dirichlet process mixture model under the data generating process (CD)–(B). Under (CD)–(B), there exists a true mixing distribution $G_\star$ (and hence a true marginal density $f_{G_\star}$).
This object will be important for the posterior contraction results that we state below. For two densities $f$ and $h$, define their Hellinger distance as

$$H^2(f, h) := \frac{1}{2} \int \big(\sqrt{f(z)} - \sqrt{h(z)}\big)^2\,\mathrm{d}z. \tag{7}$$

The following result is a minor strengthening of Ghosal and van der Vaart [2001, Theorem 5.1]: in addition to the contraction rate itself, it tracks the corresponding high-probability events.

Theorem 3.6 (Posterior contraction). Suppose that Specification 1.1 holds and that the true prior $G_\star$ is supported on $[-M, M]$. Then there exist constants $C, c > 0$ depending only on $(M, \alpha, \eta)$ such that for all sufficiently large $n$,

$$\mathbb{P}_{G_\star}\left[ \Pi\left( G : H(f_{G_\star}, f_G) \geq C\,\frac{\log n}{\sqrt{n}} \,\Big|\, Z \right) \leq \exp\big(-c \log^2 n\big) \right] \geq 1 - \frac{1}{n}.$$

Moreover, the posterior mean marginal density $\bar{f} = \int f_G\,\mathrm{d}\Pi(G \mid Z)$ satisfies

$$\mathbb{P}_{G_\star}\left[ H(f_{G_\star}, \bar{f}) \geq C'\,\frac{\log n}{\sqrt{n}} \right] \leq \frac{1}{n}.$$

To leverage Theorem 3.6 in the proof of Theorem 3.1, we next connect Hellinger contraction of the marginal densities to the discrepancy between the BNP estimator $\hat{\mu}^{BB}$ and the Bayes oracle $\hat{\mu}^B$. To that end, note that for any priors $G$ and $Q$ we have

$$\int \{\delta_G(z) - \delta_Q(z)\}^2 f_G(z)\,\mathrm{d}z = \int \left( \frac{f'_G(z)}{f_G(z)} - \frac{f'_Q(z)}{f_Q(z)} \right)^2 f_G(z)\,\mathrm{d}z =: \mathrm{F}(f_G \,\|\, f_Q), \tag{8}$$

where $\mathrm{F}(f_G \,\|\, f_Q)$ is the Fisher divergence; see Ghosh et al. [2025] for a discussion of its relationship to empirical Bayes estimation. Next, by combining Lemma 6.1 of Zhang [2005] with Theorem E.1 of Saha and Guntuboyina [2020] (which builds on Theorem 3 of Jiang and Zhang [2009]), we can relate the Fisher divergence to the Hellinger distance as follows.

Lemma 3.7. There exists a universal constant $C > 0$ such that for any $\rho \in (0, (2\pi e)^{-1/2})$, $M > 0$, and any two distributions $G$ and $Q$ supported on $[-M, M]$, we have that

$$\mathrm{F}(f_G \,\|\, f_Q) \leq C \left\{ H^2(f_G, f_Q) \cdot \max\big\{ |\log \rho|^3,\, |\log H(f_G, f_Q)| \big\} + (M + 1)\,\rho\,|\log \rho| \right\}.$$
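Both Tweedie's formula (6) and the Fisher-divergence identity (8) rest on writing $\delta_G(z) = z + f'_G(z)/f_G(z)$. For a discrete mixing distribution this can be checked numerically; a minimal sketch (the three-atom prior is an arbitrary illustration):

```python
import math

def phi(x):
    # standard normal density
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

# an illustrative three-atom mixing distribution G (arbitrary choice)
ATOMS, WEIGHTS = [-1.0, 0.0, 1.5], [0.2, 0.5, 0.3]

def f_G(z):
    # marginal density f_G(z) = sum_j w_j * phi(z - mu_j)
    return sum(w * phi(z - m) for m, w in zip(ATOMS, WEIGHTS))

def f_G_prime(z):
    # phi'(x) = -x * phi(x), hence d/dz phi(z - m) = (m - z) * phi(z - m)
    return sum(w * (m - z) * phi(z - m) for m, w in zip(ATOMS, WEIGHTS))

def delta_direct(z):
    # posterior mean computed from the defining ratio in (4)
    return sum(w * m * phi(z - m) for m, w in zip(ATOMS, WEIGHTS)) / f_G(z)

def delta_tweedie(z):
    # Tweedie's formula (6)
    return z + f_G_prime(z) / f_G(z)
```

The two functions agree to machine precision; this cancellation of the leading $z$ terms is the algebra behind the middle expression in (8).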
We now have all the ingredients to prove Theorem 3.1. Since it is short, we give the proof here.

Proof of Theorem 3.1. Fix $G_\star$ with $\mathrm{supp}(G_\star) \subseteq [-M, M]$ and write $\hat{\mu}^B_i := \delta_{G_\star}(Z_i)$, so that $R(\hat{\mu}^B, G_\star) = \sqrt{\mathbb{E}_{G_\star}[\mathrm{Var}_{G_\star}[\mu \mid Z]]}$. By the triangle inequality,

$$R(\hat{\mu}^{BB}, G_\star) - \sqrt{\mathbb{E}_{G_\star}[\mathrm{Var}_{G_\star}[\mu \mid Z]]} \leq \left\{ \frac{1}{n} \sum_{i=1}^n \mathbb{E}_{G_\star}\big[\big(\hat{\mu}^{BB}_i - \hat{\mu}^B_i\big)^2\big] \right\}^{1/2}. \tag{9}$$

By Proposition 3.2, $\hat{\mu}^{BB}_i = \delta_{G(Z_{-i})}(Z_i)$. Let $\bar{f}_{-i} := f_{G(Z_{-i})} = \int f_G\,\mathrm{d}\Pi(G \mid Z_{-i})$. Conditioning on $Z_{-i}$ and using (6) and (8), we have that

$$\mathbb{E}_{G_\star}\big[(\hat{\mu}^{BB}_i - \hat{\mu}^B_i)^2\big] = \mathbb{E}_{G_\star}\big[\mathbb{E}_{G_\star}[(\hat{\mu}^{BB}_i - \hat{\mu}^B_i)^2 \mid Z_{-i}]\big] = \mathbb{E}_{G_\star}\big[\mathrm{F}(f_{G_\star} \,\|\, \bar{f}_{-i})\big].$$

Consider the event $A_i := \{H(f_{G_\star}, \bar{f}_{-i}) < C' \log n/\sqrt{n}\}$, where $C'$ is the constant from Theorem 3.6, and note that $\mathbb{P}_{G_\star}[A_i^c] < 1/n$. On the event $A_i$, Lemma 3.7 with $\rho = n^{-1}$ yields $\mathrm{F}(f_{G_\star} \,\|\, \bar{f}_{-i}) \lesssim (\log^5 n)/n$, while always $\mathrm{F}(f_{G_\star} \,\|\, \bar{f}_{-i}) \leq 4M^2$ since $\hat{\mu}^{BB}_i$ and $\hat{\mu}^B_i$ are both in $[-M, M]$. Hence $\mathbb{E}_{G_\star}[(\hat{\mu}^{BB}_i - \hat{\mu}^B_i)^2] \lesssim (\log^5 n)/n$ uniformly in $i \in \{1, \dots, n\}$. Plugging into (9) and then taking the supremum over $G_\star \in \mathcal{P}([-M, M])$ completes the proof. ∎

4 Compound decisions via Bayesian nonparametrics

4.1 Compound decision regret

In this section, we suppose that data-generation is given by (CD) only and measure risk by the RMSE $R(\hat{\mu}, \mu)$ defined in (1). In Section 1.1 we previewed our main regret result, namely Theorem 1.2. Below we state a slightly stronger result by benchmarking $\hat{\mu}^{BB}$ against an even broader class of estimators, that of all permutation equivariant decision rules [Weinstein, 2021],

$$\mathcal{T}_{\mathrm{PE}} := \big\{ t : \mathbb{R}^n \to \mathbb{R}^n \mid t(z_{\pi(1)}, \dots, z_{\pi(n)}) = \big(t_{\pi(1)}(z), \dots, t_{\pi(n)}(z)\big) \text{ for all } z,\ \pi \in S_n \big\}, \tag{10}$$

where $S_n$ is the set of permutations of $\{1, \dots, n\}$.
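Membership in $\mathcal{T}_{\mathrm{PE}}$ can be sanity-checked by brute force for small $n$. A minimal sketch (the shrinkage rule and the counterexample rule are illustrative choices):

```python
import itertools

def separable_rule(z):
    # any coordinate-wise map t(z) = (t(z_1), ..., t(z_n)) is permutation equivariant
    return [0.5 * zi for zi in z]

def first_coordinate_rule(z):
    # broadcasts z_1 to every coordinate; not permutation equivariant
    return [z[0]] * len(z)

def is_permutation_equivariant(t, z):
    """Brute-force check of (10): t applied to the permuted input must equal
    the correspondingly permuted coordinates of t(z), for every permutation."""
    out = t(z)
    for perm in itertools.permutations(range(len(z))):
        permuted_input = [z[p] for p in perm]
        if t(permuted_input) != [out[p] for p in perm]:
            return False
    return True
```

Separable rules form a strict subset of $\mathcal{T}_{\mathrm{PE}}$; the check above also accepts genuinely joint rules such as shrinkage toward the sample mean.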
In the absence of further information on $\mu$, permutation equivariant rules are natural in the setting of (CD).

Theorem 4.1. Suppose that Specification 1.1 holds for $\Pi$. Then there exists a constant $C > 0$ depending only on $(M, \alpha, \eta)$ such that for all $n \geq 2$,

$$\sup_{\mu \in [-M, M]^n} \left\{ R\big(\hat{\mu}^{BB}(\Pi), \mu\big) - \inf_{t \in \mathcal{T}_{\mathrm{PE}}} R\big(t(Z), \mu\big) \right\} \leq C\,\frac{\log^{5/2} n}{\sqrt{n}},$$

where the infimum is over all permutation equivariant decision rules defined in (10) and $R(\cdot, \mu)$ is defined as the frequentist risk in (1) when data is generated according to (CD) with $\mu$.

The reason that Theorem 4.1 is a strengthening of Theorem 1.2 is that for any prior $\tilde{\Pi}$, the estimator $\hat{\mu}^{BB}(\tilde{\Pi})$ can be written as $t(Z)$ for a certain $t \in \mathcal{T}_{\mathrm{PE}}$. The regret bound of Theorem 4.1 matches the regret bound for the NPMLE shown by Jiang and Zhang [2009].

4.2 Bayes properties and admissibility

We now turn to the question of admissibility in the compound setting, complementing the results of Section 3. Here admissibility is understood in the usual sense for multivariate normal mean estimation under squared error loss. We consider two parameter spaces: the compact set $[-M, M]^n$ (which aligns with our regret bounds) and the full space $\mathbb{R}^n$.

Definition 4.2 (Admissibility). Let $\Theta \subseteq \mathbb{R}^n$ be the parameter space (either $\Theta = [-M, M]^n$ or $\Theta = \mathbb{R}^n$). An estimator $\hat{\mu}$ is inadmissible over $\Theta$ if there exists another estimator $\tilde{\mu}$ such that:

1. $R(\tilde{\mu}, \mu) \leq R(\hat{\mu}, \mu)$ for all $\mu \in \Theta$, and
2. $R(\tilde{\mu}, \mu_0) < R(\hat{\mu}, \mu_0)$ for some $\mu_0 \in \Theta$.

Otherwise, we say that $\hat{\mu}$ is admissible over $\Theta$.

Our first result is that the BNP estimator $\hat{\mu}^{BB}$ is admissible in the above sense.

Proposition 4.3. The estimator $\hat{\mu}^{BB}$ is admissible over $\mathbb{R}^n$ and over $[-M, M]^n$ in the compound decision setting with data generating process given by (CD).

By contrast, the classical NPMLE-based empirical Bayes rule is not admissible.

Proposition 4.4.
Let $\hat{\mu}^{EB}$ be the NPMLE-based EB estimator defined above with the NPMLE $\hat{G}$ in (5) either constrained to distributions supported on $[-M, M]$ or unconstrained. Then:

(i) For the compact parameter space $[-M, M]^n$: $\hat{\mu}^{EB}$ is inadmissible for all $n \geq 1$, regardless of whether $\hat{G}$ is constrained or unconstrained.

(ii) For the full parameter space $\mathbb{R}^n$: $\hat{\mu}^{EB}$ is inadmissible for $n \geq 2$ when $\hat{G}$ is unconstrained, and for all $n \geq 1$ when $\hat{G}$ is constrained to $[-M, M]$.

Remark 4.5. For $n = 1$, the unconstrained NPMLE is admissible over $\mathbb{R}$. The reason is that the NPMLE $\hat{G}$ is a point mass at $Z_1$, and so $\hat{\mu}^{EB}_1 \equiv Z_1$, which is admissible for $n = 1$.

Proof sketch. There are four inadmissibility claims above. Let us sketch the proof of the inadmissibility of the unconstrained NPMLE-based EB estimator over $\mathbb{R}^n$ for $n \geq 2$. The key idea is to observe the following. Fix $n \geq 2$. For $z \in \mathbb{R}^n$, let $\bar{z} := n^{-1}\sum_{i=1}^n z_i$ and define the set

$$U := \left\{ z \in \mathbb{R}^n : \max_{1 \leq i \leq n} |z_i - \bar{z}| \leq 1 \right\}. \tag{11}$$

By the first order optimality condition of the NPMLE in (5) over $\mathcal{P}(\mathbb{R})$, we have that $\hat{G}(z) = \delta_{\bar{z}}$ for all $z \in U$. Thus, $\hat{\mu}^{EB}(z) = \bar{z} \cdot 1_n = (\bar{z}, \dots, \bar{z})$ for all $z \in U$. Now suppose that $\hat{\mu}^{EB}$ is admissible. Then by Theorem 3.1.1 of Brown [1971], $\hat{\mu}^{EB}$ is a generalized Bayes estimator and so it is analytic as a function of $z$. Hence, $\hat{\mu}^{EB}(z) = \bar{z} \cdot 1_n$ for almost all $z \in \mathbb{R}^n$ by the identity theorem. However, this can easily be shown to be a contradiction. Thus, $\hat{\mu}^{EB}$ is inadmissible. ∎

Remark 4.6. The proof technique above has been used before to show inadmissibility of estimators in the multivariate normal mean estimation problem. Recall the James-Stein [1961] estimator and its positive-part version,

$$\hat{\mu}^{JS} := \left( 1 - \frac{n - 2}{\|Z\|_2^2} \right) Z, \qquad \hat{\mu}^{JS+} := \left( 1 - \frac{n - 2}{\|Z\|_2^2} \right)_+ Z,$$

where $(x)_+ = \max\{x, 0\}$. For $n \geq 3$, the James-Stein estimator dominates the MLE $Z$, and thus the latter is inadmissible.
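The two estimators just displayed are one-liners, and the formula makes visible the flat region $\{z : \|z\|_2^2 \leq n - 2\}$ on which the positive-part rule is identically zero. A minimal sketch (pure Python, illustrative):

```python
def james_stein(z):
    # (1 - (n-2)/||z||^2) * z
    n = len(z)
    c = 1.0 - (n - 2) / sum(x * x for x in z)
    return [c * x for x in z]

def james_stein_plus(z):
    # positive-part version: the shrinkage factor is truncated at zero
    n = len(z)
    c = max(1.0 - (n - 2) / sum(x * x for x in z), 0.0)
    return [c * x for x in z]

# On {z : ||z||^2 <= n - 2} the positive-part estimator is identically zero,
# whereas plain James-Stein overshoots past the origin (negative factor).
```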
However, $\hat{\mu}^{JS}$ is itself inadmissible since it is dominated by $\hat{\mu}^{JS+}$ [Efron and Morris, 1973], and $\hat{\mu}^{JS+}$ is inadmissible as well. To show inadmissibility of the latter, Brown [1986, Example 2.9 and Theorem 4.16] observes that $\hat{\mu}^{JS+}(z) = 0$ for $\|z\|_2^2 \leq n - 2$. By the same continuation argument as above, if $\hat{\mu}^{JS+}$ were admissible, then $\hat{\mu}^{JS+}(z) = 0$ for almost all $z \in \mathbb{R}^n$, which is a contradiction. Thus, $\hat{\mu}^{JS+}$ is inadmissible.

4.3 Posterior contraction and proof of Theorem 4.1

The overall proof strategy has four main elements, each of which we outline in turn:

1. reduction to separable decision rules;
2. the fundamental theorem of compound decisions;
3. posterior contraction under (CD);
4. control of posterior means following Jiang and Zhang [2009].

To begin, we define the class of simple separable decision rules:

$$\mathcal{T}_S := \big\{ t : \mathbb{R}^n \to \mathbb{R}^n \text{ s.t. } t(Z) = (t(Z_1), \dots, t(Z_n)) \text{ for } t : \mathbb{R} \to \mathbb{R} \big\}. \tag{12}$$

As a preliminary step, we note the following quantitative bound, due to Greenshtein and Ritov [2009, Theorem 5.1], which sharpens earlier results of Hannan and Robbins [1955]; see also Han et al. [2025, Theorem 4.1] and Liang and Han [2025].

Theorem 4.7. There exists a constant $C_M > 0$ that depends only on $M$ such that

$$\sup_{\mu \in [-M, M]^n} \left\{ \inf_{t \in \mathcal{T}_S} R\big(t(Z), \mu\big) - \inf_{t \in \mathcal{T}_{\mathrm{PE}}} R\big(t(Z), \mu\big) \right\} \leq \frac{C_M}{\sqrt{n}}.$$

The upshot of Theorem 4.7 is that it suffices to compete with the best simple separable estimator, rather than with the best permutation equivariant estimator. Moreover, the optimal simple separable rule admits an explicit characterization via the fundamental theorem of compound decision theory, which will be useful in later steps of the proof.
A key idea in compound decision theory is that the compound model with fixed $\mu$ in (CD) behaves, in certain respects, similarly to the Bayes model in (CD)–(B) with $\mu_i \overset{\text{iid}}{\sim} G_n$, where $G_n$ is the empirical distribution of $\mu$,

$$G_n \equiv G_n(\mu) := \frac{1}{n} \sum_{i=1}^n \delta_{\mu_i}, \tag{13}$$

and where $\delta_u$ denotes the Dirac mass at $u$. This similarity also underlies the result of Theorem 4.7 above and can be formalized by comparing the marginal distribution of a randomly permuted version of $Z$ under the two models; see, e.g., Han and Niles-Weed [2026]. As we show below, many useful consequences of this connection follow already from the linearity of expectation. We illustrate this by giving a self-contained proof of the fundamental theorem of compound decisions [Robbins, 1951, Zhang, 2003].

Theorem 4.8 (Fundamental theorem of compound decisions). Let $t : \mathbb{R} \to \mathbb{R}$ be measurable and consider the separable estimator with $\hat{\mu}_i = t(Z_i)$ for all $i$. Then under (CD),

$$R(\hat{\mu}, \mu) = R\big(\hat{\mu}, G_n(\mu)\big), \tag{14}$$

where the left-hand side risk is defined as in (1) and the right-hand side risk is defined in (3). Consequently, $t^\star_\mu(z) = \delta_{G_n(\mu)}(z)$ yields the separable estimator that minimizes $R(t(Z), \mu)$ over all $t \in \mathcal{T}_S$.

Proof. Let $\hat{\mu}$ be a separable estimator as above with coordinate-wise function $t(\cdot)$. Writing $G_n = G_n(\mu)$ as in (13), we can write $R^2(\hat{\mu}, \mu)$ as follows:

$$\frac{1}{n} \sum_{i=1}^n \mathbb{E}_\mu\big[(t(Z_i) - \mu_i)^2\big] = \frac{1}{n} \sum_{i=1}^n \int (t(z) - \mu_i)^2 \varphi(z - \mu_i)\,\mathrm{d}z = \int\!\!\int (t(z) - \nu)^2 \varphi(z - \nu)\,\mathrm{d}z\,\mathrm{d}G_n(\nu).$$

Analogously, we can write $R^2(\hat{\mu}, G_n)$ as

$$\frac{1}{n} \sum_{i=1}^n \mathbb{E}_{G_n}\big[(t(Z_i) - \mu_i)^2\big] = \mathbb{E}_{G_n}\big[(t(Z_i) - \mu_i)^2\big] = \int\!\!\int (t(z) - \nu)^2 \varphi(z - \nu)\,\mathrm{d}z\,\mathrm{d}G_n(\nu).$$

Thus indeed $R(\hat{\mu}, \mu) = R(\hat{\mu}, G_n)$.
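The identity (14) can also be checked by simulation: for a separable rule, Monte Carlo estimates of the squared compound risk under fixed $\mu$ and of the squared EB risk under $G_n(\mu)$ agree up to simulation error. A minimal sketch (the rule $t(z) = z/2$, the mean vector, and the replication counts are arbitrary illustrative choices):

```python
import random

def t(z):
    # an arbitrary separable shrinkage rule, applied coordinate-wise
    return 0.5 * z

def compound_risk_sq(mu, reps, rng):
    # squared risk under (CD): mu fixed, Z_i = mu_i + N(0, 1)
    n = len(mu)
    total = 0.0
    for _ in range(reps):
        total += sum((t(m + rng.gauss(0.0, 1.0)) - m) ** 2 for m in mu) / n
    return total / reps

def eb_risk_sq(mu, reps, rng):
    # squared risk under (B) with G = G_n(mu): draw nu ~ G_n, then Z = nu + N(0, 1)
    total = 0.0
    for _ in range(reps):
        nu = rng.choice(mu)
        total += (t(nu + rng.gauss(0.0, 1.0)) - nu) ** 2
    return total / reps

rng = random.Random(7)
mu = [-1.0, 0.0, 1.0, 2.0]
r_compound = compound_risk_sq(mu, 20000, rng)
r_eb = eb_risk_sq(mu, 80000, rng)
```

For $t(z) = z/2$, both sides equal $\frac{1}{n}\sum_{i=1}^n \frac{1}{4}(1 + \mu_i^2) = 0.625$ here, by a direct computation.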
As explained before (4), $R(\hat{\mu}, G_n)$ is minimized over all estimators $\hat{\mu}$ by $\hat{\mu}^B(G_n)$, which is in fact a separable estimator with coordinate-wise function $\delta_{G_n}(\cdot)$ (also defined in (4)). Thus the best separable estimator is the same, and it also minimizes the equivalent objective $R(t(Z), \mu)$ over all separable estimators. ∎

Given Theorem 4.8 and the results of Section 3, it is natural to expect that the BNP posterior $\Pi(\cdot \mid Z)$ may concentrate around the empirical mixing distribution $G_n$ in (13). The next result makes this intuition precise. To that end, define the marginal density induced by $G_n$ (equivalently, by (CD) with $\mu$ fixed) as

$$f_\mu(z) := \frac{1}{n} \sum_{i=1}^n \varphi(z - \mu_i). \tag{15}$$

A result similar to the next one (but without a rate) was shown by Datta [1991b].

Theorem 4.9 (Posterior contraction). Assume $\max_{1 \leq i \leq n} |\mu_i| \leq M$ and suppose Specification 1.1 holds. Then there exist constants $C, c > 0$ depending only on $(M, \alpha, \eta)$ such that for all sufficiently large $n$,

$$\mathbb{P}_\mu\left[ \Pi\left( G : H(f_\mu, f_G) \geq C\,\frac{\log n}{\sqrt{n}} \,\Big|\, Z \right) \leq \exp\big(-c \log^2 n\big) \right] \geq 1 - \frac{1}{n}.$$

Proof sketch. Fix $C > 0$ (we will choose it later), let $\varepsilon_n := \log n/\sqrt{n}$, and consider the set $U := \{G : H(f_\mu, f_G) \geq C \varepsilon_n\}$. By the simplest Schwartz posterior contraction argument,

$$\Pi(U \mid Z) = \frac{\int_U \prod_{i=1}^n f_G(Z_i)\,\mathrm{d}\Pi(G)}{\int \prod_{i=1}^n f_G(Z_i)\,\mathrm{d}\Pi(G)} = \frac{\int_U \prod_{i=1}^n \frac{f_G(Z_i)}{f_\mu(Z_i)}\,\mathrm{d}\Pi(G)}{\int \prod_{i=1}^n \frac{f_G(Z_i)}{f_\mu(Z_i)}\,\mathrm{d}\Pi(G)}. \tag{16}$$

We have to show two facts: with high probability under $\mathbb{P}_\mu[\cdot]$, the numerator is "small" and the denominator is "large." Note that $\prod_{i=1}^n f_\mu(Z_i) \neq \prod_{i=1}^n \varphi(Z_i - \mu_i)$, where the latter is the likelihood under the data generating process in (CD), and so the ratio introduced in (16) is not a likelihood ratio under $\mathbb{P}_\mu[\cdot]$. We can show that the posterior contracts despite this misspecification.^4 To see why, consider the following argument.
Fix $f_G$ for $G \in \mathcal{P}([-M, M])$. To understand both the numerator and denominator in (16), we must understand the behaviour of the log-likelihood ratio $\sum_{i=1}^n \log\{ f_G(Z_i)/f_\mu(Z_i) \}$. Taking expectations under $P_\mu[\cdot]$, we get
\[
\mathbb{E}_\mu\left[ \sum_{i=1}^n \log\left\{ \frac{f_G(Z_i)}{f_\mu(Z_i)} \right\} \right] = \sum_{i=1}^n \int \log\left\{ \frac{f_G(z)}{f_\mu(z)} \right\} \phi(z - \mu_i) \, \mathrm{d}z = n \int \log\left\{ \frac{f_G(z)}{f_\mu(z)} \right\} \underbrace{\frac{1}{n} \sum_{i=1}^n \phi(z - \mu_i)}_{= f_\mu(z)} \, \mathrm{d}z = -n\,\mathrm{KL}(f_\mu \,\|\, f_G),
\]
where $\mathrm{KL}(\cdot \,\|\, \cdot)$ is the Kullback-Leibler divergence, defined as follows for two densities $f, h$:
\[
\mathrm{KL}(f \,\|\, h) := \int f(z) \log \frac{f(z)}{h(z)} \, \mathrm{d}z. \tag{17}
\]
By comparison, if data generation were given by both (CD) and (B) with true prior $G_\star$ (as in the setting of Theorem 3.6), then we would have $\mathbb{E}_{G_\star}[\sum_{i=1}^n \log\{f_G(Z_i)/f_{G_\star}(Z_i)\}] = -n\,\mathrm{KL}(f_{G_\star} \,\|\, f_G)$; that is, $f_\mu$ effectively plays the role of $f_{G_\star}$. Given suitable concentration uniformly over $f_G$, we can now surmise the following. In the numerator of (16), fix $G \in U$ (within the integral). Note that $-n\,\mathrm{KL}(f_\mu \,\|\, f_G) \le -2n H^2(f_\mu, f_G) \le -2C^2 n \varepsilon_n^2$, and so the numerator should be small. Lemma S4 in the supplement makes this argument precise; it closely tracks the arguments showing Hellinger rates for the NPMLE by Zhang [2009] under (CD). Similarly, for the denominator, as long as the prior $\Pi$ puts enough mass on KL neighborhoods around $f_\mu$, the denominator will be sufficiently large with high probability. Lemma S3 in the supplement makes this argument precise and closely tracks the arguments of Ghosal and van der Vaart [2001] under the standard frequentist BNP setup where both (CD) and (B) determine the data generating process.

⁴ Similar arguments also appear in the analysis of nonparametric Bayesian procedures that utilize quasi-likelihoods [Kato, 2013, Kankanala, 2025a].
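As a numerical illustration (not from the paper), the averaging identity above, $\mathbb{E}_\mu[\sum_i \log\{f_G(Z_i)/f_\mu(Z_i)\}] = -n\,\mathrm{KL}(f_\mu \| f_G)$, can be checked with simple quadrature; the mean vector, candidate $G = \delta_0$, and integration grid below are arbitrary choices. Because the same trapezoidal rule is applied to both sides and $\sum_i \phi(z - \mu_i) = n f_\mu(z)$ pointwise, the two discretized quantities agree essentially to machine precision.

```python
import math

def phi(x):
    # standard normal density
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def trapezoid(g, lo=-14.0, hi=14.0, m=5601):
    # composite trapezoidal rule on [lo, hi]
    h = (hi - lo) / (m - 1)
    vals = [g(lo + i * h) for i in range(m)]
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

mu = [2.0, -1.0, 0.0, 3.0]           # arbitrary fixed mean vector
n = len(mu)
f_mu = lambda z: sum(phi(z - m_i) for m_i in mu) / n   # marginal density (15)
f_G = lambda z: phi(z)               # candidate G = delta_0, so f_G = phi

# Left side: sum_i E_{mu_i}[ log(f_G(Z_i)/f_mu(Z_i)) ]
lhs = sum(trapezoid(lambda z, m_i=m_i: math.log(f_G(z) / f_mu(z)) * phi(z - m_i))
          for m_i in mu)
# Right side: -n KL(f_mu || f_G), with KL as defined in (17)
kl = trapezoid(lambda z: f_mu(z) * math.log(f_mu(z) / f_G(z)))
print(lhs, -n * kl)
```

This is only a sanity check of the algebraic identity; the proof of course does not rely on numerics.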
The key difference here is that the "true" density $f_\mu$ depends on $\mu$ in a non-iid way, but this does not impact the prior mass calculations.

The fourth (and last) step of the proof amounts to controlling $\mathbb{E}_\mu[\| \hat\mu^B(G_n(\mu)) - \hat\mu^{BB} \|^2]$. This term can be controlled using Theorem 4.9 and Lemma 3.7, as well as results of Jiang and Zhang [2009] on regularized Bayes rules. Supplement E.2 explains this step and puts together the overall proof of Theorem 4.1.

5 I. J. Good's staircase

The hierarchy set forth through (CD), (B), and (BB) can be continued further, say as
\[
\Pi \sim \Gamma, \tag{BBB}
\]
where $\Gamma$ is a hyperhyperprior on $\Pi$, e.g., on the parameters $\alpha$ and/or $H$ of the Dirichlet process, and so forth. At each stage of the hierarchy, say at (B), we have three options: (i) fix that level, treating the distribution at that level as known; (ii) estimate the distribution at that level by empirical Bayes (EB); or (iii) proceed to the next level of the hierarchy. Good [1992] uses the terminology "B", "EB", "BB", "EBB", "BBB", etc., to denote these various options. Thus, standard EB as introduced by Robbins [1956] is also called EB in Good's staircase notation. Meanwhile, EB that estimates $\hat\Pi$ (e.g., $\hat\alpha$ and/or $\hat H$ when $\Pi = \mathrm{DP}(\alpha, H)$) in (BB), as in e.g. Liu [1996], McAuliffe et al. [2006], Donnet et al. [2018b], is called EBB. Fully Bayesian nonparametric approaches that put hyperpriors on $\alpha$ and/or $H$, as in e.g. Escobar and West [1995], are called BBB. Although imperfect and not always applicable,⁵ we find this terminology useful in communicating the various levels of hierarchy and estimation. In our setting, we could analyze any of these schemes under (CD) (as in Section 4), or under both (CD) and (B) (as in Section 3). For instance, an EBB scheme could be analyzed using techniques in e.g. Petrone et al. [2014], Rousseau and Szabó [2017], Donnet et al. [2018b].
We also refer to van der Vaart [2023] and Ignatiadis and Ma [2025] for further discussion of the various levels of the hierarchy and how frequentist guarantees can be obtained by treating randomness up to a certain level of the hierarchy in a frequentist manner.

6 Numerical results

6.1 Remarks on implementation

Throughout, to compute the BNP estimator of $\mu$, we make the following choices for the hyperparameters of the Dirichlet process in (BB):
\[
H = \mathrm{Unif}[-10, 10], \qquad \alpha \sim \Gamma(0.01, 100),
\]
where $\Gamma(a, b)$ is the Gamma distribution with shape $a$ and scale $b$.

⁵ The hierarchy is not always clear-cut. For instance, one could treat (BB) and (BBB) as a single level given by the composition of the two. Or, if one is not interested in the $\mu_i$ in (CD) but only in the induced marginal densities $f_G(\cdot)$, then it makes sense to merge (CD) with (B) and call (BBB) BB instead of BBB.

Table 1: Unnormalized mean squared error $\sum_{i=1}^n (\hat\mu_i - \mu_i)^2$ averaged over 100 Monte Carlo replicates, with $n = 1{,}000$. Each column corresponds to a sparse-normal configuration with the indicated number of nonzero means and common signal strength $\mu$. Three estimators are compared: the proposed BNP estimator, the NPMLE, and the separable oracle.

# nonzero          5                    50                   500
µ             3    4    5    7     3    4    5    7     3    4    5    7
BNP          41   33   15    3   152  103   47    6   449  278  121   12
NPMLE        36   28   18    7   156  107   52   11   455  287  125   22
Oracle       26   21   11    1   147   99   44    4   445  274  118    9

Table 2: Mean wall time in seconds per Monte Carlo replicate for each estimator and simulation configuration. Timings were recorded using Julia 1.10.10 on a single thread of an Intel Xeon Gold 6248R node.

# nonzero           5                            50                             500
µ             3      4      5      7      3      4      5      7       3       4      5      7
BNP       5.013  5.378  5.539  5.853  5.778  6.532  8.558 11.541   8.320  12.623  9.473  8.951
NPMLE     1.029  1.008  0.925  0.864  1.111  0.936  0.889  0.771   1.137   1.001  0.927  0.911
Oracle    0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000   0.000   0.000  0.000  0.000
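As an illustration of the prior in (BB) (not the sampler used in the paper), a draw $G \sim \mathrm{DP}(\alpha, H)$ with $H = \mathrm{Unif}[-10, 10]$ can be sketched via the truncated stick-breaking representation. The fixed value of $\alpha$ and the truncation level of 500 sticks below are arbitrary illustrative choices; the paper instead places the hyperprior $\alpha \sim \Gamma(0.01, 100)$.

```python
import random

def dp_stick_breaking(alpha, base_sampler, n_sticks, rng):
    # Truncated stick-breaking draw of G ~ DP(alpha, H):
    # v_k ~ Beta(1, alpha), w_k = v_k * prod_{j<k} (1 - v_j), atoms theta_k ~ H.
    weights, remaining = [], 1.0
    for _ in range(n_sticks):
        v = rng.betavariate(1.0, alpha)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    atoms = [base_sampler() for _ in range(n_sticks)]
    return atoms, weights

rng = random.Random(0)
alpha = 1.0                                  # a representative fixed value for this sketch
H = lambda: rng.uniform(-10.0, 10.0)         # base distribution H = Unif[-10, 10]
atoms, weights = dp_stick_breaking(alpha, H, 500, rng)
print(sum(weights))                          # nearly all mass captured by the truncation
```

The draw is (almost surely) discrete, matching the description of the Dirichlet process in Appendix A; smaller $\alpha$ concentrates the weights on fewer atoms.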
We use the Gibbs sampler of Algorithm 2 in Neal [2000] along with updates for $\alpha$ described in Escobar and West [1995].⁶ The reason we can directly use the Gibbs sampler of Neal [2000] is that we can explicitly conduct the required marginalization and posterior updates in closed form; see Supplement B. We use 2,000 burn-in iterations and 10,000 post-burn-in iterations (with iterations defined as full sweeps through the data) to compute the posterior mean of $\mu$. For the NPMLE, we use the by-now-standard approach of Koenker and Mizera [2014] based on discretization and the MOSEK [MOSEK ApS, 2025] interior point convex optimization solver.

6.2 Simulation study

We consider a standard sparse simulation setup [Johnstone and Silverman, 2004, Jiang and Zhang, 2009, Koenker and Mizera, 2014] under (CD). We take $n = 1{,}000$ throughout. In each simulation setting, we have
\[
\mu_i = \begin{cases} \mu, & \text{for } i = 1, \ldots, n_1, \\ 0, & \text{for } i = n_1 + 1, \ldots, n, \end{cases}
\]
where $n_1 \in \{5, 50, 500\}$ and $\mu \in \{3, 4, 5, 7\}$ are simulation parameters. We consider three methods: the proposed BNP estimator, the NPMLE (as described in Section 6.1), and the separable oracle from Theorem 4.8. Following standard practice, we report results in terms of the unnormalized MSE, $\mathbb{E}_\mu[\|\hat\mu - \mu\|^2]$, which we estimate here by averaging over 100 Monte Carlo replicates. Results are shown in Table 1. Moreover, since the BNP approach to EB has been criticized for being computationally slow, we include timings in Table 2.

⁶ In writing our initial prototype, we followed the code in the Particles.jl package [Kleinschmidt, 2019].

We observe that the BNP estimator performs competitively with the NPMLE across all configurations, and both methods track the oracle well. For well-separated signals ($\mu = 7$), the BNP estimator is closer to the oracle than the NPMLE in every sparsity regime.
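The separable oracle reported in Table 1 is inexpensive to evaluate directly: it is the posterior mean $\delta_{G_n}(z)$ under the empirical mixing distribution $G_n = \frac{1}{n}\sum_i \delta_{\mu_i}$. A minimal sketch under one configuration of the Section 6.2 design (unit noise variance as in (CD); the seed is an arbitrary choice):

```python
import math
import random

def phi(x):
    # standard normal density
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def oracle_rule(z, mu):
    # Separable oracle: posterior mean delta_{G_n}(z) under the empirical
    # mixing distribution G_n = (1/n) sum_i delta_{mu_i}
    num = sum(m * phi(z - m) for m in mu)
    den = sum(phi(z - m) for m in mu)
    return num / den

rng = random.Random(1)
n, n1, signal = 1000, 50, 5.0
mu = [signal] * n1 + [0.0] * (n - n1)           # sparse configuration from Section 6.2
Z = [m + rng.gauss(0.0, 1.0) for m in mu]       # Z_i ~ N(mu_i, 1) as in (CD)
sse_oracle = sum((oracle_rule(z, mu) - m) ** 2 for z, m in zip(Z, mu))
sse_mle = sum((z - m) ** 2 for z, m in zip(Z, mu))
print(sse_oracle, sse_mle)
```

On a single replicate the oracle's unnormalized squared error is far below that of the naive estimate $\hat\mu = Z$, in line with the magnitudes in Table 1.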
The computational cost is higher, roughly a five- to tenfold increase in wall time, but remains moderate in absolute terms for $n = 1{,}000$, suggesting that the BNP approach is a practical alternative to the NPMLE.

Acknowledgements. We thank Asaf Weinstein for helpful discussions on Bayes empirical Bayes. This work was completed in part with resources provided by the University of Chicago's Research Computing Center. N.I. gratefully acknowledges support from the U.S. National Science Foundation (DMS-2443410).

References

C. E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152–1174, 1974.

E. Balder, D. Gilliland, and J. V. Houwelingen. On the essential completeness of Bayes empirical Bayes decision rules. Statistics & Risk Modeling, 1(4-5):503–509, 1983.

D. A. Berry and R. Christensen. Empirical Bayes estimation of a binomial parameter via mixtures of Dirichlet processes. The Annals of Statistics, 7(3):558–568, 1979.

J. E. Boyer and D. C. Gilliland. Admissibility considerations in the finite state compound and empirical Bayes decision problems. Statistica Neerlandica, 34(3):151–159, 1980.

L. D. Brown. Admissible estimators, recurrent diffusions, and insoluble boundary value problems. The Annals of Mathematical Statistics, 42(3):855–903, 1971.

L. D. Brown. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory. Number 9 in Lecture Notes-Monograph Series. Institute of Mathematical Statistics, Hayward, CA, 1986.

N. Cannella, A. Teh, Y. Han, and Y. Polyanskiy. Universal priors: Solving empirical Bayes via Bayesian inference and pretraining. arXiv preprint, arXiv:2602.15136, 2026.

I. Castillo and A. van der Vaart. Needles and straw in a haystack: Posterior concentration for possibly sparse sequences. The Annals of Statistics, 40(4):2069–2101, 2012.

J. Datta and N. G. Polson.
Polynomial log-marginals and Tweedie's formula: When is Bayes possible? arXiv preprint, arXiv:2509.05823, 2025.

S. Datta. Asymptotic optimality of Bayes compound estimators in compact exponential families. The Annals of Statistics, 19(1):354–365, 1991a.

S. Datta. On the consistency of posterior mixtures and its applications. The Annals of Statistics, 19(1):338–353, 1991b.

J. J. Deely and D. V. Lindley. Bayes empirical Bayes. Journal of the American Statistical Association, 76(376):833–841, 1981.

S. Donnet, V. Rivoirard, J. Rousseau, and C. Scricciolo. Posterior concentration rates for empirical Bayes procedures with applications to Dirichlet process mixtures. Bernoulli, 24(1):231–256, 2018a.

S. Donnet, V. Rivoirard, J. Rousseau, and C. Scricciolo. Posterior concentration rates for empirical Bayes procedures with applications to Dirichlet process mixtures. Bernoulli, 24(1):231–256, 2018b.

F. Dyson. A method for correcting series of parallax observations. Monthly Notices of the Royal Astronomical Society, 86:686, 1926.

B. Efron. Tweedie's formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614, 2011.

B. Efron. Two modeling strategies for empirical Bayes estimation. Statistical Science, 29(2):285–301, 2014.

B. Efron. Empirical Bayes deconvolution estimates. Biometrika, 103(1):1–20, 2016.

B. Efron. Bayes, oracle Bayes and empirical Bayes. Statistical Science, 34(2):177–201, 2019.

B. Efron and C. Morris. Stein's estimation rule and its competitors—an empirical Bayes approach. Journal of the American Statistical Association, 68(341):117–130, 1973.

M. D. Escobar. Estimating normal means with a Dirichlet process prior. Journal of the American Statistical Association, 89(425):268–277, 1994.

M. D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430):577–588, 1995.

S.
Favaro and S. Fortini. Quasi-Bayes empirical Bayes: A sequential approach to the Poisson compound decision problem. arXiv preprint, arXiv:2411.07651, 2025.

T. S. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209–230, 1973.

S. Fortini and S. Petrone. Quasi-Bayes properties of a procedure for sequential learning in mixture models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82(4):1087–1114, 2020.

S. Ghosal. Discussion of "Confidence intervals for nonparametric empirical Bayes analysis" by Ignatiadis and Wager. Journal of the American Statistical Association, 117(539):1171–1174, 2022.

S. Ghosal and A. W. van der Vaart. Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. The Annals of Statistics, 29(5):1233–1263, 2001.

S. Ghosal and A. W. van der Vaart. Fundamentals of Nonparametric Bayesian Inference. Number 44 in Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge; New York, 2017.

S. Ghosh, N. Ignatiadis, F. Koehler, and A. Lee. Stein's unbiased risk estimate and Hyvärinen's score matching. arXiv preprint, arXiv:2502.20123, 2025.

D. C. Gilliland, J. Hannan, and J. S. Huang. Asymptotic solutions to the two state component compound decision problem, Bayes versus diffuse priors on proportions. The Annals of Statistics, 4(6):1101–1112, 1976.

E. Giné and R. Nickl. Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge University Press, 2021.

I. J. Good. The Bayes/Non-Bayes compromise: A brief review. Journal of the American Statistical Association, 87(419):597–606, 1992.

E. Greenshtein and Y. Ritov. Asymptotic efficiency of simple decisions for the compound decision problem. In Institute of Mathematical Statistics Lecture Notes - Monograph Series, pages 266–275.
Institute of Mathematical Statistics, Beachwood, Ohio, USA, 2009.

J. Gu and R. Koenker. Empirical Bayesball remixed: Empirical Bayes methods for longitudinal data. Journal of Applied Econometrics, 32(3):575–599, 2017.

Y. Han and J. Niles-Weed. Approximate independence of permutation mixtures. Annals of Statistics (forthcoming), 2026.

Y. Han, J. Niles-Weed, Y. Shen, and Y. Wu. Besting Good–Turing: Optimality of non-parametric maximum likelihood for distribution estimation. arXiv preprint, arXiv:2509.07355, 2025.

J. F. Hannan and H. Robbins. Asymptotic solutions of the compound decision problem for two completely specified distributions. The Annals of Mathematical Statistics, 26(1):37–51, 1955.

S. C. Ho. Large-scale estimation under unknown heteroskedasticity. arXiv preprint, arXiv:2507.02293, 2025.

N. Ignatiadis and L. Ma. Partially Bayes p-values for large-scale inference. arXiv preprint, arXiv:2512.08847, 2025.

N. Ignatiadis and B. Sen. Empirical partially Bayes multiple testing and compound $\chi^2$ decisions. The Annals of Statistics, 53(1):1–36, 2025.

N. Ignatiadis and S. Wager. Confidence intervals for nonparametric empirical Bayes analysis (with discussion and a rejoinder by the authors). Journal of the American Statistical Association, 117(539):1149–1166, 2022.

W. James and C. Stein. Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 361–379, 1961.

W. Jiang and C.-H. Zhang. General maximum likelihood empirical Bayes estimation of normal means. The Annals of Statistics, 37(4):1647–1684, 2009.

I. M. Johnstone and B. W. Silverman. Needles and straw in haystacks: Empirical Bayes estimates of possibly sparse sequences. The Annals of Statistics, 32(4):1594–1649, 2004.

S. Kankanala. Generalized Bayes in conditional moment restriction models. arXiv preprint, arXiv:2510.01036, 2025a.

S. Kankanala.
Quasi-Bayes in latent variable models. arXiv preprint, arXiv:2311.06831, 2025b.

K. Kato. Quasi-Bayesian analysis of nonparametric instrumental variables models. The Annals of Statistics, 41(5):2359, 2013.

J. Kiefer and J. Wolfowitz. Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. The Annals of Mathematical Statistics, 27(4):887–906, 1956.

D. F. Kleinschmidt. Particles.jl: nonparametric clustering with Sequential Monte Carlo. https://github.com/kleinschmidt/Particles.jl, 2019.

R. Koenker and I. Mizera. Convex optimization, shape constraints, compound decisions, and empirical Bayes rules. Journal of the American Statistical Association, 109(506):674–685, 2014.

L. Kuo. A note on Bayes empirical Bayes estimation by means of Dirichlet processes. Statistics & Probability Letters, 4(3):145–150, 1986.

J. Lee and D. Sui. Fully Bayesian inference for meta-analytic deconvolution using Efron's log-spline prior. Mathematics, 13(16):2639, 2025.

Y. Liang and Y. Han. Sharp mean-field analysis of permutation mixtures and permutation-invariant decisions. arXiv preprint, arXiv:2509.12584, 2025.

B. G. Lindsay. The geometry of mixture likelihoods: A general theory. The Annals of Statistics, 11(1):86–94, 1983.

B. G. Lindsay and K. Roeder. Uniqueness of estimation and identifiability in mixture models. Canadian Journal of Statistics, 21(2):139–147, 1993.

J. S. Liu. Nonparametric hierarchical Bayes via sequential imputations. The Annals of Statistics, 24(3):911–930, 1996.

S. N. MacEachern. Estimating normal means with a conjugate style Dirichlet process prior. Communications in Statistics - Simulation and Computation, 23(3):727–741, 1994.

J. D. McAuliffe, D. M. Blei, and M. I. Jordan. Nonparametric empirical Bayes for the Dirichlet process mixture model. Statistics and Computing, 16(1):5–14, 2006.

MOSEK ApS. The MOSEK Optimization Suite. Version 11.0, 2025.
URL https://docs.mosek.com/11.0/intro.pdf.

R. M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.

S. Petrone, S. Rizzelli, J. Rousseau, and C. Scricciolo. Empirical Bayes methods in classical and Bayesian inference. METRON, 72(2):201–215, 2014.

Y. Polyanskiy and Y. Wu. Self-regularizing property of nonparametric maximum likelihood estimator in mixture models. arXiv preprint, arXiv:2008.08244, 2020.

H. Robbins. A generalization of the method of maximum likelihood: Estimating a mixing distribution (abstract). The Annals of Mathematical Statistics, 21:314–315, 1950.

H. Robbins. Asymptotically subminimax solutions of compound statistical decision problems. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, volume 2, pages 131–149. University of California Press, 1951.

H. Robbins. An empirical Bayes approach to statistics. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, pages 157–163. The Regents of the University of California, 1956.

V. Ročková. Bayesian estimation of sparse signals with a continuous spike-and-slab prior. The Annals of Statistics, 46(1):401–437, 2018.

J. Rousseau and C. Scricciolo. Wasserstein convergence in Bayesian and frequentist deconvolution models. The Annals of Statistics, 52(4):1691–1715, 2024.

J. Rousseau and B. Szabó. Asymptotic behaviour of the empirical Bayes posteriors associated to maximum marginal likelihood estimator. The Annals of Statistics, 45(2):833–865, 2017.

S. Saha and A. Guntuboyina. On the nonparametric maximum likelihood estimator for Gaussian location mixture densities with application to Gaussian denoising. The Annals of Statistics, 48(2):738–762, 2020.

A. Teh, M. Jabbour, and Y. Polyanskiy.
Solving empirical Bayes via Transformers. arXiv preprint, arXiv:2502.09844, 2025.

A. van der Vaart. Comment: Bayes, Oracle Bayes and Empirical Bayes. Statistical Science, 34(2):214–218, 2019.

A. van der Vaart. Frequentism. The New England Journal of Statistics in Data Science, pages 138–141, 2023.

A. Weinstein. On permutation invariant problems in large-scale inference. arXiv preprint, arXiv:2110.06250, 2021.

A. Weinstein, J. Wallin, D. Yekutieli, and M. Bogdan. Nonparametric shrinkage estimation in high dimensional generalized linear models via Polya trees. arXiv preprint, 2025.

B. Wu, E. N. Weinstein, and D. M. Blei. Bayesian empirical Bayes: Simultaneous inference from probabilistic symmetries. arXiv preprint, arXiv:2512.16239, 2025.

C.-H. Zhang. Compound decision theory and empirical Bayes methods: Invited paper. The Annals of Statistics, 31(2):379–390, 2003.

C.-H. Zhang. General empirical Bayes wavelet methods and exactly adaptive minimax estimation. The Annals of Statistics, 33(1):54–100, 2005.

C.-H. Zhang. Generalized maximum likelihood estimation of normal mixture densities. Statistica Sinica, 19:1297–1318, 2009.

A Definition of Dirichlet Process

The Dirichlet Process $\mathrm{DP}(\alpha, H)$ with concentration parameter $\alpha > 0$ and base distribution $H$ is a probability distribution over probability distributions. A random measure $G \sim \mathrm{DP}(\alpha, H)$ satisfies the property that for any finite measurable partition $(A_1, \ldots, A_k)$ of the support of $H$,
\[
(G(A_1), \ldots, G(A_k)) \sim \mathrm{Dir}(\alpha H(A_1), \ldots, \alpha H(A_k)),
\]
where $\mathrm{Dir}(\cdot)$ denotes the Dirichlet distribution. Almost surely, $G$ is a discrete probability measure. The parameter $\alpha$ controls how concentrated $G$ is around $H$: larger $\alpha$ yields draws closer to $H$, while smaller $\alpha$ yields more variable draws with fewer distinct atoms.
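The finite-partition property above can be checked by Monte Carlo: a Dirichlet vector can be sampled as normalized independent Gamma draws, and its coordinate means should match the base-measure masses $H(A_k)$. The partition of $[-10, 10]$, the value of $\alpha$, and the seed below are arbitrary illustrative choices.

```python
import random

def dirichlet(params, rng):
    # Sample from Dir(params) by normalizing independent Gamma(a_k, 1) draws
    g = [rng.gammavariate(a, 1.0) for a in params]
    s = sum(g)
    return [x / s for x in g]

rng = random.Random(0)
alpha = 2.0
# H = Unif[-10, 10] masses of the partition A1=[-10,-5], A2=[-5,0], A3=[0,10]
H_masses = [0.25, 0.25, 0.5]
draws = [dirichlet([alpha * h for h in H_masses], rng) for _ in range(20000)]
means = [sum(d[k] for d in draws) / len(draws) for k in range(3)]
print(means)
```

The empirical means approximate $(H(A_1), H(A_2), H(A_3))$, consistent with $\mathbb{E}[G(A_k)] = H(A_k)$ under $G \sim \mathrm{DP}(\alpha, H)$.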
B Computational details

Consider the following hierarchical model with $a < b$ and $\sigma^2 > 0$:
\[
\mu \sim G = \mathrm{Unif}[a, b], \qquad Z \mid \mu \sim \mathrm{N}(\mu, \sigma^2).
\]
Then the marginal density of $Z$ is equal to
\[
f_G(z) = \int_a^b \phi((z - \mu)/\sigma) \cdot \frac{1}{\sigma(b - a)} \, \mathrm{d}\mu = \frac{\Phi((b - z)/\sigma) - \Phi((a - z)/\sigma)}{b - a}.
\]
Here $\Phi(\cdot)$ denotes the standard normal CDF and $\phi(\cdot)$ the standard normal PDF. We denote by $\mathrm{TN}_{[a,b]}(u, \tau^2)$ the truncated normal distribution on $[a, b]$ with location $u$ and scale $\tau^2 > 0$, having density
\[
p(x; u, \tau^2, a, b) = \frac{\phi((x - u)/\tau)}{\tau\left(\Phi((b - u)/\tau) - \Phi((a - u)/\tau)\right)} \cdot 1_{[a,b]}(x).
\]
Using this notation, the posterior distribution of $\mu$ given $Z = z$ is simply $\mu \mid Z = z \sim \mathrm{TN}_{[a,b]}(z, \sigma^2)$.

C Proofs for (in)admissibility

C.1 Proof of Proposition 3.4

Proof. Without loss of generality we may assume that any estimator $\tilde\mu \in [-M, M]^n$ almost surely: otherwise replace it by its coordinatewise projection onto $[-M, M]^n$, which cannot increase squared error loss because $\mu_i \in [-M, M]$ almost surely under every $G \in \mathcal{P}([-M, M])$. Now suppose that $\hat\mu^{BB}$ is inadmissible. Then there exists another estimator $\tilde\mu$ such that $R(\tilde\mu, G) \le R(\hat\mu^{BB}, G)$ for all $G \in \mathcal{P}([-M, M])$, and strict inequality holds at some $G_0$. Equivalently, there exists $G_0$ such that
\[
\eta := R_2(\hat\mu^{BB}, G_0) - R_2(\tilde\mu, G_0) > 0.
\]
By the argument above we may also assume $\tilde\mu(Z) \in [-M, M]^n$ a.s. By Lemma S1 below, the map $G \mapsto R_2(\hat\mu^{BB}, G) - R_2(\tilde\mu, G)$ is continuous w.r.t. weak convergence of $G$. Therefore there exists a weak neighborhood $U$ of $G_0$ such that
\[
R_2(\hat\mu^{BB}, G) - R_2(\tilde\mu, G) \ge \eta/2 \quad \text{for all } G \in U. \tag{S1}
\]
By Ghosal and van der Vaart [2017, Theorem 4.15], $\Pi(U) > 0$. Hence,
\[
\int R_2(\hat\mu^{BB}, G)\,\mathrm{d}\Pi(G) - \int R_2(\tilde\mu, G)\,\mathrm{d}\Pi(G) = \int \left\{ R_2(\hat\mu^{BB}, G) - R_2(\tilde\mu, G) \right\} \mathrm{d}\Pi(G) \ge \int_U \left\{ R_2(\hat\mu^{BB}, G) - R_2(\tilde\mu, G) \right\} \mathrm{d}\Pi(G) \ge (\eta/2)\,\Pi(U) > 0.
\]
On the other hand, $\hat\mu^{BB}$ is Bayes for $\mu$ under the (proper) prior $G_\Pi$ on $[-M, M]^n$ given by
\[
G_\Pi(A_1, \ldots, A_n) = \int \prod_{i=1}^n G(A_i) \, \mathrm{d}\Pi(G), \qquad A_1, \ldots, A_n \subseteq [-M, M].
\]
Moreover, for any estimator $t$,
\[
\frac{1}{n} \mathbb{E}_{G_\Pi}\left[ \|\mu - t(Z)\|_2^2 \right] = \int R_2(t, G) \, \mathrm{d}\Pi(G),
\]
where under $\mathbb{E}_{G_\Pi}[\cdot]$ we have $\mu \sim G_\Pi$ and $Z \mid \mu \sim \mathrm{N}(\mu, I_n)$. Therefore the strict inequality above implies
\[
\mathbb{E}_{G_\Pi}\left[ \|\mu - \hat\mu^{BB}(Z)\|_2^2 \right] > \mathbb{E}_{G_\Pi}\left[ \|\mu - \tilde\mu(Z)\|_2^2 \right],
\]
which contradicts the fact that $\hat\mu^{BB}$ is the Bayes estimator under $G_\Pi$ for squared error loss. Thus $\hat\mu^{BB}$ must be admissible.

Lemma S1 (Continuity of squared risk in $G$). Fix $M > 0$ and let $t : \mathbb{R}^n \to [-M, M]^n$ be measurable. Then the map $\mathcal{P}([-M, M]) \to \mathbb{R}_+ : G \mapsto R_2(t, G)$ (with $R$ defined in (3)) is continuous w.r.t. weak convergence of $G$. In particular, if $G_k \Rightarrow G$ weakly for $G_k, G \in \mathcal{P}([-M, M])$, then $R_2(t, G_k) \to R_2(t, G)$.

Proof. For $\mu \in [-M, M]^n$ define
\[
\psi_t(\mu) := \frac{1}{n} \mathbb{E}_\mu\left[ \|t(Z) - \mu\|_2^2 \right], \qquad Z \mid \mu \sim \mathrm{N}(\mu, I_n).
\]
Then $0 \le \psi_t(\mu) \le 4M^2$ for all $\mu$. Moreover,
\[
\psi_t(\mu) = \frac{1}{n} \int_{\mathbb{R}^n} \|t(z) - \mu\|^2 \varphi_n(z - \mu) \, \mathrm{d}z,
\]
where $\varphi_n(z) := (2\pi)^{-n/2} \exp(-\|z\|_2^2/2)$. From this representation and dominated convergence, we immediately see that $\psi_t(\cdot)$ is continuous. Finally, notice that
\[
R_2(t, G) = \int_{[-M, M]^n} \psi_t(\mu) \, \mathrm{d}G^{\otimes n}(\mu).
\]
Now if $G_k \Rightarrow G$, then $G_k^{\otimes n} \Rightarrow G^{\otimes n}$, and we conclude by the definition of weak convergence.

C.2 Proof of Proposition 3.5

Proof. We use the inadmissibility result for the compound setting, i.e., Proposition 4.4. According to that result, there exists an estimator $\tilde\mu$ such that for all $\mu \in [-M, M]^n$,
\[
R(\tilde\mu, \mu) \le R(\hat\mu^{EB}, \mu), \tag{S2}
\]
and
\[
R(\tilde\mu, \mu_0) < R(\hat\mu^{EB}, \mu_0) \tag{S3}
\]
for some $\mu_0 \in [-M, M]^n$. We claim that $\tilde\mu$ also dominates $\hat\mu^{EB}$ in the EB setting.
To see this, first note that for any $G \in \mathcal{P}([-M, M])$, we have
\[
R_2(\tilde\mu, G) = \int R_2(\tilde\mu, \mu) \, \mathrm{d}G^{\otimes n}(\mu) \le \int R_2(\hat\mu^{EB}, \mu) \, \mathrm{d}G^{\otimes n}(\mu) = R_2(\hat\mu^{EB}, G),
\]
where in the middle inequality we used (S2). Next, take (cf. (13))
\[
G_0 := G_n(\mu_0) = \frac{1}{n} \sum_{i=1}^n \delta_{\mu_{0,i}}.
\]
Observe that
\[
P_{G_0}[\mu = \mu_0] = \int 1\{\mu = \mu_0\} \, \mathrm{d}G_0^{\otimes n}(\mu) > 0.
\]
Thus,
\[
R_2(\hat\mu^{EB}, G_0) = \int R_2(\hat\mu^{EB}, \mu) \, \mathrm{d}G_0^{\otimes n}(\mu) = P_{G_0}[\mu = \mu_0]\, R_2(\hat\mu^{EB}, \mu_0) + \int_{\mu \neq \mu_0} R_2(\hat\mu^{EB}, \mu) \, \mathrm{d}G_0^{\otimes n}(\mu) > P_{G_0}[\mu = \mu_0]\, R_2(\tilde\mu, \mu_0) + \int_{\mu \neq \mu_0} R_2(\tilde\mu, \mu) \, \mathrm{d}G_0^{\otimes n}(\mu) = R_2(\tilde\mu, G_0).
\]

C.3 Proof of Proposition 4.3

Proof. For the compact parameter space $\Theta = [-M, M]^n$, we may assume without loss of generality that any competing estimator $\tilde\mu$ takes values in $[-M, M]^n$ almost surely. Suppose for contradiction that $\hat\mu^{BB}$ is inadmissible over $\Theta = [-M, M]^n$. Then there exists an estimator $\tilde\mu$ such that
\[
R(\tilde\mu, \mu) \le R(\hat\mu^{BB}, \mu) \quad \forall \mu \in [-M, M]^n,
\]
and strict inequality holds at some $\mu_0 \in [-M, M]^n$. Define
\[
\eta := R_2(\hat\mu^{BB}, \mu_0) - R_2(\tilde\mu, \mu_0) > 0.
\]
For any decision rule $t : \mathbb{R}^n \to [-M, M]^n$, define
\[
\psi_t(\mu) := \frac{1}{n} \mathbb{E}_\mu\left[ \|t(Z) - \mu\|_2^2 \right], \qquad Z \mid \mu \sim \mathrm{N}(\mu, I_n).
\]
By dominated convergence, it is straightforward to verify that $\psi_t(\mu)$ is continuous in $\mu \in [-M, M]^n$. Applying this to $t = \hat\mu^{BB}$ and $t = \tilde\mu$ shows that the map $\mu \mapsto R_2(\hat\mu^{BB}, \mu) - R_2(\tilde\mu, \mu)$ is continuous on $[-M, M]^n$. Since this difference equals $\eta > 0$ at $\mu_0$, there exists an open neighborhood $U$ of $\mu_0$ in $[-M, M]^n$ such that
\[
R_2(\hat\mu^{BB}, \mu) - R_2(\tilde\mu, \mu) \ge \eta/2 \quad \text{for all } \mu \in U. \tag{S4}
\]
Define the induced prior $G_\Pi$ on $[-M, M]^n$ by
\[
G_\Pi(A_1, \ldots, A_n) := \int \prod_{i=1}^n G(A_i) \, \mathrm{d}\Pi(G), \qquad A_1, \ldots, A_n \subset [-M, M].
\]
Since $H$ has a density bounded below on $[-M, M]$, the Dirichlet process prior $\Pi = \mathrm{DP}(\alpha, H)$ has full weak support on $\mathcal{P}([-M, M])$; hence the induced prior $G_\Pi(\cdot)$ has full support on $[-M, M]^n$, and in particular $G_\Pi(U) > 0$ for any nonempty open $U \subset [-M, M]^n$. Integrating (S4) with respect to $G_\Pi$ yields
\[
\int \left\{ R_2(\hat\mu^{BB}, \mu) - R_2(\tilde\mu, \mu) \right\} \mathrm{d}G_\Pi(\mu) \ge (\eta/2)\,G_\Pi(U) > 0. \tag{S5}
\]
On the other hand, under the joint law $\mu \sim G_\Pi$ and $Z \mid \mu \sim \mathrm{N}(\mu, I_n)$, we have for any estimator $t(Z)$ that
\[
\frac{1}{n} \mathbb{E}_{G_\Pi}\left[ \|t(Z) - \mu\|_2^2 \right] = \int R_2(t(Z), \mu) \, \mathrm{d}G_\Pi(\mu).
\]
Therefore, (S5) is equivalent to
\[
\mathbb{E}_{G_\Pi}\left[ \|\hat\mu^{BB}(Z) - \mu\|_2^2 \right] > \mathbb{E}_{G_\Pi}\left[ \|\tilde\mu(Z) - \mu\|_2^2 \right].
\]
But $\hat\mu^{BB}(Z) = \mathbb{E}_{G_\Pi}[\mu \mid Z]$ is the Bayes estimator under $G_\Pi$ for squared error loss, so it minimizes $\mathbb{E}_{G_\Pi}[\|t(Z) - \mu\|_2^2]$ over all measurable decision rules $t$. This contradicts the preceding inequality. Hence $\hat\mu^{BB}$ is admissible over $[-M, M]^n$.

Under the induced prior $G_\Pi$ on $\mathbb{R}^n$ (concentrated on $[-M, M]^n$), $\hat\mu^{BB}(Z) = \mathbb{E}_{G_\Pi}[\mu \mid Z]$ is a proper Bayes rule for squared error loss in the normal means model $Z \mid \mu \sim \mathrm{N}(\mu, I_n)$. For the result on $\mathbb{R}^n$ we can argue as follows. Suppose as above that $\hat\mu^{BB}(Z)$ is not admissible. Then there exists another estimator $\tilde\mu = \tilde\mu(Z)$ such that $R(\tilde\mu, \mu) \le R(\hat\mu^{BB}, \mu)$ for all $\mu \in \mathbb{R}^n$. By integrating over $G_\Pi$ (which is supported on $[-M, M]^n$), this implies that
\[
\mathbb{E}_{G_\Pi}\left[ \|\tilde\mu - \hat\mu^{BB}\|^2 \right] = \mathbb{E}_{G_\Pi}\left[ \|\tilde\mu - \mu\|^2 \right] - \mathbb{E}_{G_\Pi}\left[ \|\hat\mu^{BB} - \mu\|^2 \right] \le 0.
\]
Thus $\tilde\mu(z) = \hat\mu^{BB}(z)$ for almost all $z \in \mathbb{R}^n$ (with respect to Lebesgue measure). But this implies that $R(\tilde\mu, \mu) = R(\hat\mu^{BB}, \mu)$ for all $\mu \in \mathbb{R}^n$, and so $\tilde\mu$ does not dominate $\hat\mu^{BB}$.

C.4 Proof of Proposition 4.4

Proof. As mentioned in the proof sketch of the proposition in the main text, the statement of the proposition entails four inadmissibility results.
We only provide a proof for the inadmissibility of the unconstrained NPMLE-based EB estimator over $\mathbb{R}^n$ for $n \ge 2$; the other cases are similar. We only have to fill in two arguments that were not justified in the proof sketch: (i) verifying the explicit solution of the NPMLE over the open set (11), and (ii) showing that there exists another open set on which the NPMLE solution takes a different form.

(i). Throughout, let
\[
\ell(G; z) := \sum_{i=1}^n \log f_G(z_i).
\]
Since $G \mapsto f_G(z_i)$ is linear and $\log$ is concave, $G \mapsto \ell(G; z)$ is concave over the convex set of probability measures on $\mathbb{R}$. Let $\bar z := n^{-1} \sum_{i=1}^n z_i$ and recall the open set defined in (11):
\[
U := \left\{ z \in \mathbb{R}^n : \max_{1 \le i \le n} |z_i - \bar z| < 1 \right\}.
\]
Fix $z \in U$ and consider the point mass $G_0 := \delta_{\bar z}$. By a standard first-order optimality condition for the NPMLE [Lindsay, 1983, Theorem 4.1], $G_0$ maximizes $\ell(\cdot; z)$ if and only if for every $\mu \in \mathbb{R}$,
\[
\sum_{i=1}^n \frac{\phi(z_i - \mu)}{f_{G_0}(z_i)} \le n. \tag{S6}
\]
(To see this, consider the directional derivative along $(1 - \lambda) G_0 + \lambda \delta_\mu$ at $\lambda = 0^+$.) Now $f_{G_0}(z_i) = \phi(z_i - \bar z)$. Let $s_i := z_i - \bar z$, so that $\sum_{i=1}^n s_i = 0$ and $|s_i| < 1$ for all $i$. Take any $\mu \in \mathbb{R}$ and write $\mu = \bar z + u$; then
\[
\frac{\phi(z_i - \mu)}{\phi(z_i - \bar z)} = \exp\left( -\frac{(s_i - u)^2 - s_i^2}{2} \right) = \exp\left( -\frac{u^2}{2} + u s_i \right).
\]
Thus (S6) is equivalent to
\[
e^{-u^2/2} \sum_{i=1}^n e^{u s_i} \le n \iff \frac{1}{n} \sum_{i=1}^n e^{u s_i} \le e^{u^2/2} \quad \text{for all } u \in \mathbb{R}.
\]
But $(s_1, \ldots, s_n)$ has empirical mean $0$ and is supported in $[-1, 1]$, so Hoeffding's lemma implies
\[
\frac{1}{n} \sum_{i=1}^n e^{u s_i} \le \exp\left( \frac{4 u^2}{8} \right) = \exp\left( \frac{u^2}{2} \right) \quad \text{for all } u \in \mathbb{R},
\]
verifying (S6). Hence $G_0 = \delta_{\bar z}$ is an NPMLE for every $z \in U$, with uniqueness due to Lindsay and Roeder [1993], and consequently,
\[
\hat\mu^{EB}(z) = \bar z \, 1_n \quad \text{for all } z \in U. \tag{S7}
\]

(ii). We now exhibit an open set on which the NPMLE is not a point mass. Fix $\varepsilon \in (0, 1)$ and let $a > 0$ be large.
Consider the three-point mixing distribution
\[
G_\varepsilon := (1 - \varepsilon)\,\delta_0 + \frac{\varepsilon}{2}\,\delta_{-a} + \frac{\varepsilon}{2}\,\delta_a,
\]
and the data vector $z^\star := (-a, a, 0, \ldots, 0) \in \mathbb{R}^n$, for which $\bar z^\star = 0$. Let $L(G; z) := \prod_{i=1}^n f_G(z_i)$ denote the marginal likelihood. We bound:
\[
f_{G_\varepsilon}(0) = (1 - \varepsilon)\phi(0) + \varepsilon \phi(a) \ge (1 - \varepsilon)\phi(0),
\]
and
\[
f_{G_\varepsilon}(\pm a) = (1 - \varepsilon)\phi(a) + \frac{\varepsilon}{2}\phi(0) + \frac{\varepsilon}{2}\phi(2a) \ge \frac{\varepsilon}{2}\phi(0).
\]
Therefore,
\[
L(G_\varepsilon; z^\star) \ge \left( \frac{\varepsilon}{2}\phi(0) \right)^2 \left( (1 - \varepsilon)\phi(0) \right)^{n-2} = \phi(0)^n \left( \frac{\varepsilon}{2} \right)^2 (1 - \varepsilon)^{n-2}.
\]
On the other hand, among point masses $G = \delta_\mu$, the likelihood is maximized at $\mu = \bar z^\star = 0$, so
\[
\sup_{\mu \in \mathbb{R}} L(\delta_\mu; z^\star) = L(\delta_0; z^\star) = \phi(a)^2 \phi(0)^{n-2} = \phi(0)^n e^{-a^2}.
\]
Hence
\[
\frac{L(G_\varepsilon; z^\star)}{\sup_\mu L(\delta_\mu; z^\star)} \ge \left( \frac{\varepsilon}{2} \right)^2 (1 - \varepsilon)^{n-2} e^{a^2},
\]
which is $> 1$ for $a$ sufficiently large. Thus for such a choice of $a$,
\[
L(G_\varepsilon; z^\star) > \sup_\mu L(\delta_\mu; z^\star),
\]
so no point mass can be an NPMLE at $z^\star$. Define the continuous function
\[
H(z) := \log L(G_\varepsilon; z) - \log\left( \sup_\mu L(\delta_\mu; z) \right).
\]
Since $H(z^\star) > 0$ and $H$ is continuous, there exists an open neighborhood $V$ of $z^\star$ such that $H(z) > 0$ for all $z \in V$. In particular, for every $z \in V$, the NPMLE $\hat G(z)$ cannot be a point mass. For any (nondegenerate) mixing distribution $G$, the map $z \mapsto \delta_G(z)$ is strictly increasing: writing $m_k(z) := \int \mu^k \phi(z - \mu) \, \mathrm{d}G(\mu)$ so that $\delta_G = m_1/m_0$, one computes
\[
\delta_G'(z) = \frac{m_2(z)\,m_0(z) - m_1(z)^2}{m_0(z)^2} = \mathrm{Var}_G[\mu \mid Z = z] > 0.
\]
Therefore, for every $z \in V$ (where $\hat G(z)$ is nondegenerate) the coordinates $\hat\mu^{EB}_i(z) = \delta_{\hat G(z)}(z_i)$ cannot all be equal unless the $z_i$ are all equal, which does not occur on the neighborhood $V$. In particular,
\[
\hat\mu^{EB}(z) \neq \bar z \, 1_n \quad \text{for all } z \in V. \tag{S8}
\]

D Posterior contraction proofs (Theorems 3.6 and 4.9)

D.1 Preliminary lemmata

Before stating the first lemma, we recall the relevant notions of metric entropy.
An $\varepsilon$-bracket $[l, u]$ with respect to $H$ is a pair of functions $l \le u$ such that $H(l, u) \le \varepsilon$. A function $f$ belongs to the bracket if $l \le f \le u$ pointwise. Given a class $\mathcal{F}$, the $\varepsilon$-bracketing number $N_{[\,]}(\varepsilon, \mathcal{F}, H)$ is the minimum number of $\varepsilon$-brackets needed to cover $\mathcal{F}$, and the bracketing entropy is its logarithm. Note that in (7), the Hellinger distance was defined for probability densities. In the bracketing context the bounding functions $l$ and $u$ need not integrate to one; however, (7) also makes sense for any nonnegative, square-root-integrable functions and coincides with the usual Hellinger distance when both arguments are densities.

Lemma S1 (Entropy of bounded normal mixtures). Let
\[
\mathcal{F}_M := \Big\{ f_G(\cdot) = \int \phi(\cdot - \mu)\, \mathrm{d}G(\mu) : G \in \mathcal{P}([-M, M]) \Big\}.
\]
There exists a constant $C = C(M)$ such that for all $0 < \varepsilon < 1/2$,
\[
\log N_{[\,]}(\varepsilon, \mathcal{F}_M, H) \le C \big( \log(1/\varepsilon) \big)^2. \tag{S9}
\]
Consequently also $\log N(\varepsilon, \mathcal{F}_M, H) \le C (\log(1/\varepsilon))^2$.

Proof. This is an immediate specialization of Theorem 3.1 of Ghosal and van der Vaart [2001] to (i) the fixed scale $\sigma \equiv 1$ and (ii) a fixed support radius $a = M$. For fixed $a$ one may take $\gamma = 1/2$ in their statement, yielding bracketing Hellinger entropy of order $(\log(1/\varepsilon))^2$. Since covering numbers are bounded by bracketing numbers, the covering entropy obeys the same bound.

For the following lemma, define:
\[
V(f \,\|\, h) := \int f(z) \Big( \log \frac{f(z)}{h(z)} - \mathrm{KL}(f \,\|\, h) \Big)^2 \mathrm{d}z, \tag{S10}
\]
\[
B(f_\mu, \varepsilon) := \big\{ G : \mathrm{KL}(f_\mu \,\|\, f_G) < \varepsilon^2,\; V(f_\mu \,\|\, f_G) < \varepsilon^2 \big\}. \tag{S11}
\]
The following lemma is shown in Ghosal and van der Vaart [2001, Equation (5.17)].

Lemma S2 (Prior mass of KL neighborhoods around $f_\mu$). Assume Specification 1.1 holds. Then there exist constants $c_1, c_2 > 0$ depending only on $(M, \alpha, \eta)$ such that for all sufficiently small $\varepsilon > 0$,
\[
\Pi\big( B(f_\mu, \varepsilon) \big) \ge c_1 \exp\big\{ -c_2 (\log(1/\varepsilon))^2 \big\}. \tag{S12}
\]

Lemma S3.
Assume $\max_{1 \le i \le n} |\mu_i| \le M$ and that the Dirichlet base measure $H$ is supported on $[-M, M]$ (so that $\Pi$-a.s. the mixing distribution $G$ is supported on $[-M, M]$). Fix any $D > 0$. Then for all sufficiently large $n$, the following holds with probability at least $1 - 3/n^2$:
\[
\int \prod_{i=1}^n \frac{f_G(Z_i)}{f_\mu(Z_i)}\, \mathrm{d}\Pi(G) \ge \exp\big\{ -(1+D) n \varepsilon_n^2 \big\} \cdot \Pi\big( B(f_\mu, \varepsilon_n) \big), \tag{S13}
\]
where $\varepsilon_n := \log n/\sqrt{n}$.

Proof. Fix $D > 0$ as in the lemma statement. Let $\Pi_0$ denote the restriction of $\Pi$ to $B(f_\mu, \varepsilon_n)$, renormalized to be a probability measure. Define
\[
\psi(z) := \int \log\Big( \frac{f_G(z)}{f_\mu(z)} \Big)\, \mathrm{d}\Pi_0(G), \qquad U := \sum_{i=1}^n \psi(Z_i).
\]
We have:
\[
\log\Big( \int \prod_{i=1}^n \frac{f_G(Z_i)}{f_\mu(Z_i)}\, \mathrm{d}\Pi(G) \Big) \ge \log\Big( \int_{B(f_\mu, \varepsilon_n)} \prod_{i=1}^n \frac{f_G(Z_i)}{f_\mu(Z_i)}\, \mathrm{d}\Pi(G) \Big) = \log\Big( \int \prod_{i=1}^n \frac{f_G(Z_i)}{f_\mu(Z_i)}\, \mathrm{d}\Pi_0(G) \Big) + \log\big( \Pi(B(f_\mu, \varepsilon_n)) \big).
\]
Meanwhile, by Jensen's inequality,
\[
\log\Big( \int \prod_{i=1}^n \frac{f_G(Z_i)}{f_\mu(Z_i)}\, \mathrm{d}\Pi_0(G) \Big) \ge \int \sum_{i=1}^n \log\Big( \frac{f_G(Z_i)}{f_\mu(Z_i)} \Big)\, \mathrm{d}\Pi_0(G) = U.
\]
Now recall the argument presented in the main text:
\[
\mathbb{E}_\mu\Big[ \sum_{i=1}^n \log\Big( \frac{f_G(Z_i)}{f_\mu(Z_i)} \Big) \Big] = \sum_{i=1}^n \int \log\Big( \frac{f_G(z)}{f_\mu(z)} \Big) \phi(z - \mu_i)\, \mathrm{d}z = n \int \log\Big( \frac{f_G(z)}{f_\mu(z)} \Big) \underbrace{\frac{1}{n} \sum_{i=1}^n \phi(z - \mu_i)}_{= f_\mu(z)}\, \mathrm{d}z = -n\, \mathrm{KL}(f_\mu \,\|\, f_G).
\]
By the above, Fubini's theorem, and because $\Pi_0$ is supported on $B(f_\mu, \varepsilon_n)$, we then have that:
\[
\mathbb{E}_\mu[U] = -n \int \mathrm{KL}(f_\mu \,\|\, f_G)\, \mathrm{d}\Pi_0(G) \ge -n \varepsilon_n^2.
\]
Moreover,
\begin{align*}
\mathrm{Var}_\mu[U] &= \sum_{i=1}^n \mathrm{Var}_{\mu_i}\Big[ \int \log\Big( \frac{f_\mu(Z_i)}{f_G(Z_i)} \Big)\, \mathrm{d}\Pi_0(G) \Big] \\
&\le \sum_{i=1}^n \mathbb{E}_{\mu_i}\Big[ \Big( \int \Big( \log\Big( \frac{f_\mu(Z_i)}{f_G(Z_i)} \Big) - \mathrm{KL}(f_\mu \,\|\, f_G) \Big)\, \mathrm{d}\Pi_0(G) \Big)^2 \Big] \\
&\le \int \sum_{i=1}^n \mathbb{E}_{\mu_i}\Big[ \Big( \log\Big( \frac{f_\mu(Z_i)}{f_G(Z_i)} \Big) - \mathrm{KL}(f_\mu \,\|\, f_G) \Big)^2 \Big]\, \mathrm{d}\Pi_0(G) \\
&= \int \sum_{i=1}^n \int \Big( \log\Big( \frac{f_\mu(z)}{f_G(z)} \Big) - \mathrm{KL}(f_\mu \,\|\, f_G) \Big)^2 \phi(z - \mu_i)\, \mathrm{d}z\, \mathrm{d}\Pi_0(G) \\
&= n \int \int \Big( \log\Big( \frac{f_\mu(z)}{f_G(z)} \Big) - \mathrm{KL}(f_\mu \,\|\, f_G) \Big)^2 f_\mu(z)\, \mathrm{d}z\, \mathrm{d}\Pi_0(G) = n \int V(f_\mu \,\|\, f_G)\, \mathrm{d}\Pi_0(G) \le n \varepsilon_n^2,
\end{align*}
where the first inequality uses $\mathrm{Var}[X] \le \mathbb{E}[(X - c)^2]$ for any constant $c$, and the second follows from Jensen's inequality and Fubini.
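The key identity above, $\mathbb{E}_\mu\big[\sum_{i=1}^n \log(f_G(Z_i)/f_\mu(Z_i))\big] = -n\,\mathrm{KL}(f_\mu \,\|\, f_G)$, can be spot-checked by quadrature; the means $\mu_i$ and the mixing distribution $G$ below are arbitrary illustrative choices.

```python
import math

def phi(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

mus = [-1.0, 0.2, 0.7]                 # fixed means mu_1, ..., mu_n (illustrative)
n = len(mus)
atoms = [(-0.5, 0.4), (0.8, 0.6)]      # an arbitrary discrete mixing distribution G

def f_mu(z):
    # empirical mixture (1/n) * sum_i phi(z - mu_i)
    return sum(phi(z - m) for m in mus) / n

def f_G(z):
    return sum(w * phi(z - u) for u, w in atoms)

# Riemann-sum quadrature on a wide grid (tails are negligible)
grid = [-12 + 24 * k / 24000 for k in range(24001)]
dz = grid[1] - grid[0]

def integrate(g):
    return sum(g(z) for z in grid) * dz

# left-hand side: sum_i E_{mu_i}[ log(f_G(Z_i) / f_mu(Z_i)) ]
lhs = sum(
    integrate(lambda z, m=m: math.log(f_G(z) / f_mu(z)) * phi(z - m))
    for m in mus
)
# right-hand side: -n * KL(f_mu || f_G)
kl = integrate(lambda z: f_mu(z) * math.log(f_mu(z) / f_G(z)))
assert abs(lhs + n * kl) < 1e-6
```

The agreement reflects exactly the regrouping $\sum_i \phi(z - \mu_i) = n f_\mu(z)$ used in the display.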
Because every $f_G$ for $G$ in the support of $\Pi$ is a normal location mixture with mixing distribution supported on $[-M, M]$, we have for all $z$:
\[
\phi(|z| + M) = \inf_{u \in [-M, M]} \phi(z - u) \le f_G(z) \le \sup_{u \in [-M, M]} \phi(z - u) = \phi\big( \max\{|z| - M, 0\} \big).
\]
The same inequalities also hold for $f_\mu$, and so
\[
\Big| \log \frac{f_G(z)}{f_\mu(z)} \Big| \le \log \frac{\phi(\max\{|z| - M, 0\})}{\phi(|z| + M)} = \frac{(|z| + M)^2 - \max\{|z| - M, 0\}^2}{2} \le 2M|z| + 2M^2.
\]
Averaging over $\Pi_0$ yields the envelope
\[
|\psi(z)| \le 2M|z| + 2M^2, \qquad z \in \mathbb{R}. \tag{S14}
\]
Let $T_n := M + \sqrt{6 \log n}$ and set $A_n := \{ \max_{1 \le i \le n} |Z_i| \le T_n \}$. Since $Z_i \sim \mathrm{N}(\mu_i, 1)$ with $|\mu_i| \le M$,
\[
P_\mu[A_n^c] \le \sum_{i=1}^n P_{\mu_i}[|Z_i| > T_n] \le 2n \exp\Big( -\frac{(T_n - M)^2}{2} \Big) = 2n^{-2}.
\]
Next define the truncated, centered summands
\[
Y_{i,n} := \big( \psi(Z_i) - \mathbb{E}_{\mu_i}[\psi(Z_i)] \big) \mathbf{1}\{|Z_i| \le T_n\}, \qquad i = 1, \dots, n.
\]
On $\{|Z_i| \le T_n\}$, (S14) and $\mathbb{E}_{\mu_i}[|Z_i|] \le |\mu_i| + \sqrt{2/\pi} \le M + 1$ imply that
\[
\big| \mathbb{E}_{\mu_i}[\psi(Z_i)] \big| \le \mathbb{E}_{\mu_i}[|\psi(Z_i)|] \le 2M(M+1) + 2M^2,
\]
hence for $n$ large enough
\[
|Y_{i,n}| \le 2M T_n + 4M^2 + 2M(M+1) \le 10M \sqrt{\log n} =: B_n.
\]
Moreover,
\[
\mathrm{Var}_{\mu_i}[Y_{i,n}] \le \mathbb{E}_{\mu_i}\big[ ( \psi(Z_i) - \mathbb{E}_{\mu_i}[\psi(Z_i)] )^2 \big] = \mathrm{Var}_{\mu_i}[\psi(Z_i)].
\]
Summing over $i$ gives
\[
\sum_{i=1}^n \mathrm{Var}_{\mu_i}[Y_{i,n}] \le \mathrm{Var}_\mu[U] \le n \varepsilon_n^2, \tag{S15}
\]
using the variance bound established above. Now observe that on $A_n$ we have $U - \mathbb{E}_\mu[U] = \sum_{i=1}^n Y_{i,n}$, so
\[
\big\{ U - \mathbb{E}_\mu[U] \le -D n \varepsilon_n^2 \big\} \cap A_n \subseteq \Big\{ \sum_{i=1}^n Y_{i,n} \le -D n \varepsilon_n^2 \Big\}.
\]
Consequently,
\[
P_\mu\big[ U - \mathbb{E}_\mu[U] \le -D n \varepsilon_n^2 \big] \le P_\mu[A_n^c] + P_\mu\Big[ \sum_{i=1}^n Y_{i,n} \le -D n \varepsilon_n^2 \Big]. \tag{S16}
\]
We next control $\sum_{i=1}^n |\mathbb{E}_\mu[Y_{i,n}]|$.
Notice that
\begin{align*}
\big| \mathbb{E}_{\mu_i}[Y_{i,n}] \big| &= \big| \mathbb{E}_{\mu_i}\big[ ( \psi(Z_i) - \mathbb{E}_{\mu_i}[\psi(Z_i)] ) \mathbf{1}\{|Z_i| \le T_n\} \big] \big| \\
&= \big| \mathbb{E}_{\mu_i}[\psi(Z_i) \mathbf{1}\{|Z_i| \le T_n\}] - \mathbb{E}_{\mu_i}[\psi(Z_i)]\, P_{\mu_i}[|Z_i| \le T_n] \big| \\
&= \big| -\mathbb{E}_{\mu_i}[\psi(Z_i) \mathbf{1}\{|Z_i| > T_n\}] + \mathbb{E}_{\mu_i}[\psi(Z_i)] \big( 1 - P_{\mu_i}[|Z_i| \le T_n] \big) \big| \\
&\le \mathbb{E}_{\mu_i}\big[ |\psi(Z_i)| \mathbf{1}\{|Z_i| > T_n\} \big] + \mathbb{E}_{\mu_i}[|\psi(Z_i)|]\, P_{\mu_i}[|Z_i| > T_n].
\end{align*}
For the first term, we have
\[
\mathbb{E}_{\mu_i}\big[ (2M|Z_i| + 2M^2) \mathbf{1}\{|Z_i| > T_n\} \big] \lesssim_M \frac{1}{n^3}.
\]
Meanwhile, by the previous calculations,
\[
\mathbb{E}_{\mu_i}[|\psi(Z_i)|]\, P_{\mu_i}[|Z_i| > T_n] \lesssim_M \frac{1}{n^3}.
\]
Putting these results together we have $\sum_{i=1}^n |\mathbb{E}_\mu[Y_{i,n}]| \lesssim_M n^{-2} = o(n \varepsilon_n^2)$. Write $\widetilde{Y}_{i,n} := Y_{i,n} - \mathbb{E}_\mu[Y_{i,n}]$. Then, for all sufficiently large $n$ (so that $\sum_{i=1}^n |\mathbb{E}_\mu[Y_{i,n}]| \le (D/2) n \varepsilon_n^2$),
\[
\Big\{ \sum_{i=1}^n Y_{i,n} \le -D n \varepsilon_n^2 \Big\} \subseteq \Big\{ \sum_{i=1}^n \widetilde{Y}_{i,n} \le -\frac{D}{2} n \varepsilon_n^2 \Big\}.
\]
The summands $\widetilde{Y}_{i,n}$ are independent, mean-zero, and satisfy $|\widetilde{Y}_{i,n}| \le 2B_n$, where $B_n = 10M\sqrt{\log n}$. Applying Bernstein's inequality (e.g., [Giné and Nickl, 2021, Theorem 3.1.7]) and using (S15) yields
\[
P_\mu\Big[ \sum_{i=1}^n \widetilde{Y}_{i,n} \le -\frac{D}{2} n \varepsilon_n^2 \Big] \le \exp\Big( -\frac{c_0 D^2 n^2 \varepsilon_n^4}{n \varepsilon_n^2 + B_n D n \varepsilon_n^2} \Big) \le \exp\Big( -\frac{c_1 D n \varepsilon_n^2}{B_n} \Big),
\]
for constants $c_0, c_1 > 0$ depending only on $M$. Combining this with (S16) and $P_\mu[A_n^c] \le 2n^{-2}$ gives
\[
P_\mu\big[ U - \mathbb{E}_\mu[U] \le -D n \varepsilon_n^2 \big] \le 2n^{-2} + \exp\Big( -\frac{c_1 D n \varepsilon_n^2}{B_n} \Big).
\]
For $\varepsilon_n = (\log n)/\sqrt{n}$ we have $n \varepsilon_n^2 = \log^2 n$, hence the right-hand side is bounded by $2n^{-2} + \exp\{ -cD \log^{3/2} n \}$ for some $c > 0$. Finally, since $\mathbb{E}_\mu[U] \ge -n \varepsilon_n^2$, the event $\{ U \le -(1+D) n \varepsilon_n^2 \}$ implies $\{ U - \mathbb{E}_\mu[U] \le -D n \varepsilon_n^2 \}$, and so,
\[
P_\mu\big[ U \le -(1+D) n \varepsilon_n^2 \big] \le 2n^{-2} + \exp\big( -cD \log^{3/2} n \big) \le \frac{3}{n^2}. \tag{S17}
\]
The bound (S13) is then immediate from the lower bound
\[
\int \prod_{i=1}^n \frac{f_G(Z_i)}{f_\mu(Z_i)}\, \mathrm{d}\Pi(G) \ge e^U\, \Pi\big( B(f_\mu, \varepsilon_n) \big).
\]
We define, for any nonnegative integrable function $h$,
\[
L_n(h) := \prod_{i=1}^n \frac{h(Z_i)}{f_\mu(Z_i)}.
\]

Lemma S4.
For any $D > 0$, there exists $C > 0$ such that if we define $\mathcal{U} := \{ G : H(f_\mu, f_G) \ge C \varepsilon_n \}$, then for sufficiently large $n$, with probability at least $1 - 1/n^2$,
\[
\int_{\mathcal{U}} L_n(f_G)\, \mathrm{d}\Pi(G) \le \exp(-D n \varepsilon_n^2).
\]
Proof. First, for notational convenience, define $r_n := C \varepsilon_n$. Choose $\kappa \in (0, 1/(2\sqrt{2}))$ and define a bracketing radius $\delta_n^{\mathrm{br}} := \kappa r_n^2 = \kappa C^2 \varepsilon_n^2$. Let $\{[l_k, u_k]\}_{k=1}^N$ be a $\delta_n^{\mathrm{br}}$–Hellinger bracketing cover of $\mathcal{F}_M$ (with respect to $H$), where $l_k, u_k$ are merely nonnegative functions (not necessarily densities), and where the bracketing condition is understood in the standard way: for each $f \in \mathcal{F}_M$ there exists $k$ with $l_k \le f \le u_k$ pointwise and $H(l_k, u_k) \le \delta_n^{\mathrm{br}}$. Retain only those brackets that intersect $\{ f_G : G \in \mathcal{U} \}$; the number of retained brackets is still $\le N$. By Lemma S1,
\[
\log N \le \log N_{[\,]}\big( \delta_n^{\mathrm{br}}, \mathcal{F}_M, H \big) \le C_0 \big( \log(1/\delta_n^{\mathrm{br}}) \big)^2 \lesssim_M \log^2 n = n \varepsilon_n^2, \tag{S18}
\]
where $C_0 = C_0(M)$ and the implicit constant depends only on $M$. For each retained bracket $[l_k, u_k]$, pick a density $f_k = f_{G_k}$ with $G_k \in \mathcal{U}$ such that $f_k \in [l_k, u_k]$. Then $f_k \le u_k$ pointwise, and
\[
H(f_k, u_k) \le H(l_k, u_k) \le \delta_n^{\mathrm{br}}, \qquad H(f_\mu, f_k) \ge r_n.
\]
We next bound the $(1/2)$-moment of $L_n(u_k)$ under $P_\mu$. By independence and the AM–GM inequality, for any nonnegative integrable $h$,
\[
\mathbb{E}_\mu\Big[ \sqrt{L_n(h)} \Big] = \prod_{i=1}^n \int \sqrt{\frac{h(z)}{f_\mu(z)}}\, \phi(z - \mu_i)\, \mathrm{d}z \le \Big( \int \sqrt{h(z) f_\mu(z)}\, \mathrm{d}z \Big)^n. \tag{S19}
\]
Now we control the affinity $\int \sqrt{u_k(z) f_\mu(z)}\, \mathrm{d}z$. By Cauchy–Schwarz and the definition of the Hellinger distance,
\begin{align*}
\int \sqrt{u_k(z) f_\mu(z)}\, \mathrm{d}z &= \int \sqrt{f_k(z) f_\mu(z)}\, \mathrm{d}z + \int \big( \sqrt{u_k(z)} - \sqrt{f_k(z)} \big) \sqrt{f_\mu(z)}\, \mathrm{d}z \\
&\le \int \sqrt{f_k(z) f_\mu(z)}\, \mathrm{d}z + \Big( \int \big( \sqrt{u_k(z)} - \sqrt{f_k(z)} \big)^2 \mathrm{d}z \Big)^{1/2} \Big( \int f_\mu(z)\, \mathrm{d}z \Big)^{1/2} \\
&= 1 - H^2(f_\mu, f_k) + \sqrt{2}\, H(f_k, u_k) \le 1 - r_n^2 + \sqrt{2}\, \delta_n^{\mathrm{br}}.
\end{align*}
Using $\delta_n^{\mathrm{br}} = \kappa r_n^2$ and $\kappa < 1/(2\sqrt{2})$, we obtain
\[
\int \sqrt{u_k(z) f_\mu(z)}\, \mathrm{d}z \le 1 - \big( 1 - \sqrt{2}\kappa \big) r_n^2 =: 1 - b r_n^2,
\]
for some $b = b(\kappa) \in (0, 1)$. Combining with (S19) yields
\[
\mathbb{E}_\mu\Big[ \sqrt{L_n(u_k)} \Big] \le (1 - b r_n^2)^n \le \exp\big( -b n r_n^2 \big). \tag{S20}
\]
Define the event
\[
\mathcal{E}_n := \Big\{ \max_{1 \le k \le N} L_n(u_k) \le \exp\big( -(b/2) n r_n^2 \big) \Big\}.
\]
By Markov's inequality applied to $\sqrt{L_n(u_k)}$ and (S20), for each fixed $k$,
\[
P_\mu\Big[ L_n(u_k) > \exp\big( -(b/2) n r_n^2 \big) \Big] = P_\mu\Big[ \sqrt{L_n(u_k)} > \exp\big( -(b/4) n r_n^2 \big) \Big] \le \exp\big( (b/4) n r_n^2 \big)\, \mathbb{E}_\mu\Big[ \sqrt{L_n(u_k)} \Big] \le \exp\big( -(b/4) n r_n^2 \big).
\]
The union bound over $k \le N$ gives
\[
P_\mu[\mathcal{E}_n^c] \le N \exp\big( -(b/4) n r_n^2 \big) \le \exp\big( \log N - (b/4) n r_n^2 \big). \tag{S21}
\]
Since $r_n^2 = C^2 \varepsilon_n^2$ and $\log N \lesssim n \varepsilon_n^2$ by (S18), choosing $C$ sufficiently large implies that the above term can be made $\le 1/n^2$. Finally, on $\mathcal{E}_n$, for every $G \in \mathcal{U}$ choose a retained bracket $[l_k, u_k]$ containing $f_G$. Since $f_G \le u_k$ pointwise, $L_n(f_G) \le L_n(u_k)$, and hence
\[
\int_{\mathcal{U}} L_n(f_G)\, \mathrm{d}\Pi(G) \le \sum_{k=1}^N L_n(u_k) \le N \exp\big( -(b/2) n r_n^2 \big) = \exp\big( \log N - (b/2) n r_n^2 \big), \tag{S22}
\]
which likewise can be made $\le \exp(-D n \varepsilon_n^2)$ by a large enough choice of $C$.

D.2 Proof of Theorem 4.9

Proof. Write $\varepsilon_n := (\log n)/\sqrt{n}$ as before. For any measurable $A$ we may write
\[
\Pi(A \mid Z) = \frac{\int_A L_n(f_G)\, \mathrm{d}\Pi(G)}{\int L_n(f_G)\, \mathrm{d}\Pi(G)}.
\]
Step 1: We first bound the denominator. Fix $D > 0$ (e.g., $D = 1$) and define the event
\[
\mathcal{D}_n := \Big\{ \int L_n(f_G)\, \mathrm{d}\Pi(G) \ge \exp\big( -(1+D) n \varepsilon_n^2 \big) \cdot \Pi\big( B(f_\mu, \varepsilon_n) \big) \Big\}.
\]
By Lemma S3, $P_\mu[\mathcal{D}_n^c] \le 3/n^2$ for all sufficiently large $n$. Moreover, by Lemma S2 there exist constants $c_1, c_2 > 0$ depending only on $(M, \alpha, \eta)$ such that
\[
\Pi\big( B(f_\mu, \varepsilon_n) \big) \ge c_1 \exp\big( -c_2 (\log(1/\varepsilon_n))^2 \big).
\]
Since $\log(1/\varepsilon_n) \asymp \log n$, the right-hand side is bounded below by $\exp( -C_{\mathrm{KL}}\, n \varepsilon_n^2 )$ for a constant $C_{\mathrm{KL}} = C_{\mathrm{KL}}(M, \alpha, \eta) > 0$ and all large $n$.
Hence, on $\mathcal{D}_n$,
\[
\int L_n(f_G)\, \mathrm{d}\Pi(G) \ge \exp\big( -C_{\mathrm{den}}\, n \varepsilon_n^2 \big), \tag{S23}
\]
for a constant $C_{\mathrm{den}} = C_{\mathrm{den}}(M, \alpha, \eta) > 0$.

Step 2: Let $C_{\mathrm{den}}$ be as in the conclusion of Step 1. Fix any $D > C_{\mathrm{den}}$. By Lemma S4, there exists a $C > 0$ such that if we define $\mathcal{U} := \{ G : H(f_\mu, f_G) \ge C \varepsilon_n \}$, then for sufficiently large $n$ there exists an event $\mathcal{E}_n$ with probability at least $1 - 1/n^2$ such that
\[
\int_{\mathcal{U}} L_n(f_G)\, \mathrm{d}\Pi(G) \le \exp(-D n \varepsilon_n^2).
\]
Step 3: On $\mathcal{D}_n \cap \mathcal{E}_n$, combining Steps 1 and 2, we have
\[
\Pi(\mathcal{U} \mid Z) = \frac{\int_{\mathcal{U}} L_n(f_G)\, \mathrm{d}\Pi(G)}{\int L_n(f_G)\, \mathrm{d}\Pi(G)} \le \exp\big( (C_{\mathrm{den}} - D) n \varepsilon_n^2 \big).
\]
Note that $c := D - C_{\mathrm{den}} > 0$ and $n \varepsilon_n^2 = \log^2 n$. It remains to verify that $\mathcal{D}_n \cap \mathcal{E}_n$ holds with high $P_\mu$-probability. By construction of $\mathcal{D}_n$ and $\mathcal{E}_n$, we have
\[
P_\mu\Big[ \Pi(\mathcal{U} \mid Z) > \exp\big( -c \log^2 n \big) \Big] \le P_\mu[\mathcal{D}_n^c] + P_\mu[\mathcal{E}_n^c] \le \frac{3}{n^2} + \frac{1}{n^2} \le \frac{1}{n}.
\]
This proves the first display of Theorem 4.9.

D.3 Proof of Theorem 3.6

Proof. We omit the proof of the first statement of the theorem because it is very similar to (but easier than) that of Theorem 4.9. We now show the second part, about the posterior mean $\bar{f} = \int f_G\, \mathrm{d}\Pi(G \mid Z)$. Consider the event
\[
\Big\{ \Pi\Big( G : H(f_{G^\star}, f_G) \ge C \frac{\log n}{\sqrt{n}} \,\Big|\, Z \Big) \le \exp\big( -c \log^2 n \big) \Big\},
\]
which has probability at least $1 - 1/n$. On that event, since $H^2(f_{G^\star}, \cdot)$ is convex and $H^2 \le 1$,
\[
H^2(f_{G^\star}, \bar{f}) \le \int H^2(f_{G^\star}, f_G)\, \mathrm{d}\Pi(G \mid Z) \le C^2 \frac{\log^2 n}{n} + \exp\big( -c \log^2 n \big) \le 2C^2 \frac{\log^2 n}{n},
\]
where the last inequality holds for large enough $n$. This yields the posterior-mean statement with $C' = \sqrt{2} C$.

E Regret proofs

E.1 Proof of Lemma 3.7

Lemma 3.7 follows immediately from the following two lemmata.

Lemma S1 ($L^2$ distance between regularized score and actual score, Lemma 6.1 of Zhang [2005]).
For $0 < \rho < 1/\sqrt{2\pi}$, it holds that (where $G$ is supported on $[-M, M]$):
\[
\int \Big| \frac{f_G'(z)}{f_G(z)} - \frac{f_G'(z)}{f_G(z) \vee \rho} \Big|^2 f_G(z)\, \mathrm{d}z \le 2M\rho \max\big\{ -\log(2\pi\rho^2),\, 2 \big\} + 2\rho \sqrt{-\log(2\pi\rho^2) + 2}.
\]
For this, we also need the following result.

Lemma S2 (Theorem E.1 in Saha and Guntuboyina [2020]). Suppose that $0 < \rho < (2\pi e)^{-1/2}$. Let $G, H$ be two distributions on $\mathbb{R}$. Then:
\[
\int \Big| \frac{f_G'(z)}{f_G(z) \vee \rho} - \frac{f_H'(z)}{f_H(z) \vee \rho} \Big|^2 f_G(z)\, \mathrm{d}z \lesssim H^2(f_G, f_H) \cdot \max\big\{ (-\log(2\pi\rho^2))^3,\, |\log H(f_G, f_H)| \big\}.
\]

E.2 Proof of Theorem 4.1

Proof. Step 1: Fix $\mu \in [-M, M]^n$ and write $G_n := G_n(\mu)$ as in (13) and $f_\mu = f_{G_n}$. Let
\[
\hat{\mu}^\star := \hat{\mu}^{\mathrm{B}}(G_n) = \big( \delta_{G_n}(Z_1), \dots, \delta_{G_n}(Z_n) \big)
\]
be the oracle separable rule, so that, by Theorem 4.8,
\[
\inf_{t \in \mathcal{T}_S} R\big( t(Z), \mu \big) = R\big( \hat{\mu}^\star, \mu \big). \tag{S24}
\]
By Theorem 4.7, for a constant $C_M$ depending only on $M$,
\[
\inf_{t \in \mathcal{T}_S} R\big( t(Z), \mu \big) \le \inf_{t \in \mathcal{T}_{PE}} R\big( t(Z), \mu \big) + \frac{C_M}{\sqrt{n}}. \tag{S25}
\]
Consequently, it suffices to prove that for some $C > 0$ depending only on $(M, \alpha, \eta)$,
\[
\sup_{\mu \in [-M, M]^n} \Big\{ R\big( \hat{\mu}^{\mathrm{BB}}, \mu \big) - R\big( \hat{\mu}^\star, \mu \big) \Big\} \le C \frac{\log^{5/2} n}{\sqrt{n}}, \tag{S26}
\]
since combining (S26) with (S25) yields the stated theorem (after absorbing $C_M/\sqrt{n}$ into $C \log^{5/2} n/\sqrt{n}$ for $n \ge 2$).

Step 2: By Minkowski's inequality for the $L^2$-norm,
\[
R\big( \hat{\mu}^{\mathrm{BB}}, \mu \big) - R\big( \hat{\mu}^\star, \mu \big) \le \Big\{ \frac{1}{n} \sum_{i=1}^n \mathbb{E}_\mu\Big[ \big( \hat{\mu}^{\mathrm{BB}}_i - \delta_{G_n}(Z_i) \big)^2 \Big] \Big\}^{1/2}. \tag{S27}
\]
Thus it remains to show that the average inside the braces is $\lesssim (\log^5 n)/n$ uniformly in $\mu \in [-M, M]^n$.

Step 3: For every $i$, by iterated expectation and conditional independence given $G$,
\[
\hat{\mu}^{\mathrm{BB}}_i = \mathbb{E}_\Pi[\mu_i \mid Z] = \mathbb{E}_\Pi\big[ \mathbb{E}_\Pi[\mu_i \mid Z, G] \,\big|\, Z \big] = \mathbb{E}_\Pi\big[ \mathbb{E}_G[\mu_i \mid Z_i] \,\big|\, Z \big] = \int \delta_G(Z_i)\, \mathrm{d}\Pi(G \mid Z).
\]
Note that this representation of $\hat{\mu}^{\mathrm{BB}}_i$ is different from the LOO representation of Proposition 3.2.
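The inner identity $\delta_G(z) = \mathbb{E}_G[\mu \mid Z_i = z]$ used in Step 3 is Tweedie's formula, $\delta_G(z) = z + f_G'(z)/f_G(z)$; a quick numerical check for a discrete $G$ (the atoms and weights below are illustrative):

```python
import math

def phi(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

atoms = [(-1.0, 0.3), (0.5, 0.7)]   # illustrative discrete mixing distribution G

def f_G(z):
    return sum(w * phi(z - u) for u, w in atoms)

def f_G_prime(z):
    # d/dz phi(z - u) = -(z - u) * phi(z - u)
    return sum(-w * (z - u) * phi(z - u) for u, w in atoms)

def tweedie(z):
    # score-based form: z + f'_G(z) / f_G(z)
    return z + f_G_prime(z) / f_G(z)

def posterior_mean(z):
    # direct form: E_G[mu | Z = z] = (sum_u u w phi(z-u)) / f_G(z)
    return sum(u * w * phi(z - u) for u, w in atoms) / f_G(z)

for z in [-2.0, -0.3, 0.0, 1.2, 2.5]:
    assert abs(tweedie(z) - posterior_mean(z)) < 1e-10
```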
Since $x \mapsto (x - a)^2$ is convex, Jensen's inequality gives, almost surely,
\[
\big( \hat{\mu}^{\mathrm{BB}}_i - \delta_{G_n}(Z_i) \big)^2 \le \int \big( \delta_G(Z_i) - \delta_{G_n}(Z_i) \big)^2 \mathrm{d}\Pi(G \mid Z). \tag{S28}
\]
Averaging (S28) over $i$ yields the pointwise bound
\[
\frac{1}{n} \sum_{i=1}^n \big( \hat{\mu}^{\mathrm{BB}}_i - \delta_{G_n}(Z_i) \big)^2 \le \int \widehat{\Delta}_n(G)\, \mathrm{d}\Pi(G \mid Z), \qquad \widehat{\Delta}_n(G) := \frac{1}{n} \sum_{i=1}^n \big( \delta_G(Z_i) - \delta_{G_n}(Z_i) \big)^2. \tag{S29}
\]
Note that $\delta_G(\cdot) \in [-M, M]$ whenever $\mathrm{supp}(G) \subseteq [-M, M]$, so $\widehat{\Delta}_n(G) \le 4M^2$ always.

Step 4: Let $r_n := C_0 \log n/\sqrt{n}$, where $C_0$ is the constant from Theorem 4.9, and define the Hellinger ball
\[
\mathcal{H}_n := \big\{ G : H(f_\mu, f_G) \le r_n \big\}.
\]
By Theorem 4.9, there exist a constant $c > 0$ and (for all $n \ge 2$) an event $\mathcal{C}_n$ with $P_\mu[\mathcal{C}_n^c] \le 1/n$ such that on $\mathcal{C}_n$,
\[
\Pi(\mathcal{H}_n^c \mid Z) \le \exp\big( -c \log^2 n \big). \tag{S30}
\]
On $\mathcal{C}_n$, using $\widehat{\Delta}_n(G) \le 4M^2$ and (S29) gives
\[
\int \widehat{\Delta}_n(G)\, \mathrm{d}\Pi(G \mid Z) \le \sup_{G \in \mathcal{H}_n} \widehat{\Delta}_n(G) + 4M^2\, \Pi(\mathcal{H}_n^c \mid Z) \le \sup_{G \in \mathcal{H}_n} \widehat{\Delta}_n(G) + 4M^2 e^{-c \log^2 n}. \tag{S31}
\]
Therefore, it suffices to show that
\[
\mathbb{E}_\mu\Big[ \sup_{G \in \mathcal{H}_n} \widehat{\Delta}_n(G) \Big] \lesssim \frac{\log^5 n}{n}, \quad \text{uniformly in } \mu \in [-M, M]^n. \tag{S32}
\]
Step 5: Let $T_n := M + \sqrt{6 \log n}$ and define the truncation event $A_n := \{ \max_{1 \le i \le n} |Z_i| \le T_n \}$. Since $Z_i \sim \mathrm{N}(\mu_i, 1)$ and $|\mu_i| \le M$, a union bound yields $P_\mu[A_n^c] \le 2n \exp\{ -(T_n - M)^2/2 \} \le 2/n^2$. On $A_n$, all $Z_i \in [-T_n, T_n]$. For any $G$ supported on $[-M, M]$ and any $|z| \le T_n$,
\[
f_G(z) = \int \phi(z - u)\, \mathrm{d}G(u) \ge \inf_{u \in [-M, M]} \phi(z - u) = \phi(|z| + M) \ge \phi(T_n + M) =: \rho_n.
\]
Following Jiang and Zhang [2009], for any $\rho > 0$, define the regularized Bayes rule
\[
\delta_G^\rho(z) := z + \frac{f_G'(z)}{f_G(z) \vee \rho}.
\]
The regularized Bayes rule at $\rho_n$ and any $|z| \le T_n$ satisfies $\delta_G^{\rho_n}(z) = \delta_G(z)$. In particular, on $A_n$,
\[
\widehat{\Delta}_n(G) = \frac{1}{n} \sum_{i=1}^n \big( \delta_G^{\rho_n}(Z_i) - \delta_{G_n}^{\rho_n}(Z_i) \big)^2.
\]
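The claim that the regularization is inactive for $|z| \le T_n$ follows from the pointwise lower bound $f_G(z) \ge \phi(|z| + M) \ge \rho_n$; the sketch below checks both facts for an illustrative $G$, $M$, and truncation level (all chosen arbitrarily, not from the paper):

```python
import math

def phi(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

M = 1.0
atoms = [(-M, 0.5), (M, 0.5)]        # illustrative G supported on [-M, M]

def f_G(z):
    return sum(w * phi(z - u) for u, w in atoms)

def f_G_prime(z):
    return sum(-w * (z - u) * phi(z - u) for u, w in atoms)

def delta(z):
    return z + f_G_prime(z) / f_G(z)

def delta_reg(z, rho):
    return z + f_G_prime(z) / max(f_G(z), rho)

T_n = 3.0                            # illustrative truncation level
rho_n = phi(T_n + M)                 # lower bound f_G(z) >= phi(|z| + M) >= phi(T_n + M)

for k in range(-30, 31):
    z = k / 10.0                     # sweep |z| <= T_n
    assert f_G(z) >= rho_n           # the pointwise lower bound holds
    assert delta_reg(z, rho_n) == delta(z)  # so the regularization is inactive
```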
By Proposition 3 of Jiang and Zhang [2009] (applied with $M$ there replaced by $T_n$ and $\rho = \rho_n$), there exists a finite set of mixing distributions $\{ H_1, \dots, H_{N_n} \} \subseteq \mathcal{H}_n$ with $\log N_n \lesssim \log^2 n$ such that for all $G \in \mathcal{H}_n$,
\[
\min_{1 \le j \le N_n} \sup_{|z| \le T_n} \big| \delta_G^{\rho_n}(z) - \delta_{H_j}^{\rho_n}(z) \big| \le n^{-2}, \tag{S33}
\]
where the implicit constant depends only on $M$ (and hence only on $(M, \alpha, \eta)$ through $M$). For $G \in \mathcal{H}_n$, pick $j(G) \in \mathrm{argmin}_j \sup_{|z| \le T_n} | \delta_G^{\rho_n}(z) - \delta_{H_j}^{\rho_n}(z) |$. On $A_n$, using $\delta_G^{\rho_n}(Z_i) = \delta_G(Z_i) \in [-M, M]$ and $\delta_{H_{j(G)}}^{\rho_n}(Z_i) \in [-M, M]$, we have for each $i$,
\[
\big( \delta_G(Z_i) - \delta_{G_n}(Z_i) \big)^2 \le \big( \delta_{H_{j(G)}}(Z_i) - \delta_{G_n}(Z_i) \big)^2 + 8M \cdot n^{-2},
\]
and hence
\[
\sup_{G \in \mathcal{H}_n} \widehat{\Delta}_n(G) \le \max_{1 \le j \le N_n} \widehat{\Delta}_n(H_j) + 8M n^{-2} \quad \text{on } A_n. \tag{S34}
\]
Now fix $j \le N_n$ and define $X_{i,j} := ( \delta_{H_j}(Z_i) - \delta_{G_n}(Z_i) )^2 \in [0, 4M^2]$, so that $\widehat{\Delta}_n(H_j) = n^{-1} \sum_{i=1}^n X_{i,j}$. Since the $Z_i$ are independent under $P_\mu$, the $X_{i,j}$ are independent. Moreover,
\[
\mathbb{E}_\mu\big[ \widehat{\Delta}_n(H_j) \big] = \int \big( \delta_{H_j}(z) - \delta_{G_n}(z) \big)^2 f_\mu(z)\, \mathrm{d}z = F\big( f_\mu \,\|\, f_{H_j} \big),
\]
where the last equality follows from $\delta_G(z) = z + f_G'(z)/f_G(z)$ and the definition of the Fisher divergence. Since $H_j \in \mathcal{H}_n$, we have $H(f_\mu, f_{H_j}) \le r_n$ and thus, by Lemma 3.7 (take $\rho = n^{-1}$, say),
\[
\mathbb{E}_\mu\big[ \widehat{\Delta}_n(H_j) \big] = F\big( f_\mu \,\|\, f_{H_j} \big) \lesssim \frac{\log^5 n}{n}. \tag{S35}
\]
Moreover, note that $0 \le X_{i,j} \le 4M^2$ and $\mathrm{Var}_{\mu_i}[X_{i,j}] \le \mathbb{E}_{\mu_i}[X_{i,j}^2] \le 4M^2\, \mathbb{E}_{\mu_i}[X_{i,j}]$. Thus,
\[
\sum_{i=1}^n \mathrm{Var}_{\mu_i}[X_{i,j}] \le 4M^2 n\, \mathbb{E}_\mu\big[ \widehat{\Delta}_n(H_j) \big] \lesssim \log^5 n.
\]
Applying Bernstein's inequality to $\widehat{\Delta}_n(H_j)$ yields, for a sufficiently large constant $K > 0$ (depending only on $M$),
\[
P_\mu\Big[ \widehat{\Delta}_n(H_j) > K \frac{\log^5 n}{n} \Big] \le \exp\big( -c_1 \log^5 n \big),
\]
for some $c_1 = c_1(M) > 0$ and all large $n$.
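Two ingredients of this step can be spot-checked numerically: the identity $\mathbb{E}_\mu[\widehat{\Delta}_n(H_j)] = F(f_\mu \,\|\, f_{H_j})$ (the $z$'s cancel in $\delta_{H_j} - \delta_{G_n}$, leaving a difference of scores) and the variance bound $\mathrm{Var}[X_{i,j}] \le 4M^2\, \mathbb{E}[X_{i,j}]$ for a variable in $[0, 4M^2]$. All mixtures below are illustrative stand-ins for $G_n$ and $H_j$:

```python
import math

def phi(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def make(atoms):
    f = lambda z: sum(w * phi(z - u) for u, w in atoms)
    fp = lambda z: sum(-w * (z - u) * phi(z - u) for u, w in atoms)
    delta = lambda z: z + fp(z) / f(z)
    return f, fp, delta

fG, fGp, dG = make([(-0.6, 0.5), (0.6, 0.5)])   # stand-in for f_mu = f_{G_n}
fH, fHp, dH = make([(0.1, 1.0)])                # stand-in for f_{H_j}
M = 0.6                                          # both supported on [-M, M]

grid = [-10 + 20 * k / 20000 for k in range(20001)]
dz = grid[1] - grid[0]

# E[(delta_H - delta_G)^2] under f_G equals the Fisher divergence F(f_G || f_H)
mean_sq = sum((dH(z) - dG(z)) ** 2 * fG(z) for z in grid) * dz
fisher = sum((fGp(z) / fG(z) - fHp(z) / fH(z)) ** 2 * fG(z) for z in grid) * dz
assert abs(mean_sq - fisher) < 1e-9

# for X := (delta_H(Z) - delta_G(Z))^2 in [0, 4M^2], Var[X] <= 4M^2 * E[X]
second = sum((dH(z) - dG(z)) ** 4 * fG(z) for z in grid) * dz
assert second - mean_sq ** 2 <= 4 * M * M * mean_sq + 1e-9
```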
By the union bound over $j \le N_n$ and (S33) (so $\log N_n \lesssim \log^2 n$), we conclude that for all large $n$,
\[
P_\mu\Big[ \max_{1 \le j \le N_n} \widehat{\Delta}_n(H_j) > K \frac{\log^5 n}{n} \Big] \le N_n \exp\big( -c_1 \log^5 n \big) \le \exp\big( C \log^2 n - c_1 \log^5 n \big) \le \frac{1}{n^2}.
\]
Combining the above with (S34), and using $P_\mu[A_n^c] \le 2/n^2$, we obtain for all $n \ge 2$ (after adjusting constants),
\[
P_\mu\Big[ \sup_{G \in \mathcal{H}_n} \widehat{\Delta}_n(G) \gtrsim \frac{\log^5 n}{n} \Big] \le P_\mu[A_n^c] + P_\mu\Big[ \max_{j \le N_n} \widehat{\Delta}_n(H_j) \gtrsim \frac{\log^5 n}{n} \Big] \lesssim \frac{1}{n^2}.
\]
Since $\sup_{G \in \mathcal{H}_n} \widehat{\Delta}_n(G) \le 4M^2$ always, this implies (S32).

Step 6: Taking expectations in (S29)–(S31) and using (S32) and $P_\mu[\mathcal{C}_n^c] \le 1/n$ yields
\[
\frac{1}{n} \sum_{i=1}^n \mathbb{E}_\mu\Big[ \big( \hat{\mu}^{\mathrm{BB}}_i - \delta_{G_n}(Z_i) \big)^2 \Big] \lesssim \frac{\log^5 n}{n}, \quad \text{uniformly in } \mu \in [-M, M]^n.
\]
Plugging this into (S27) gives (S26). Combining with (S25) completes the proof.
