Amortized Inference for Correlated Discrete Choice Models via Equivariant Neural Networks


Authors: Easton Huch, Michael Keane

March 2026

Abstract

Discrete choice models are fundamental tools in management science, economics, and marketing for understanding and predicting decision-making. Logit-based models are dominant in applied work, largely due to their convenient closed-form expressions for choice probabilities. However, these models entail restrictive assumptions on the stochastic utility component, constraining our ability to capture realistic and theoretically grounded choice behavior—most notably, substitution patterns. In this work, we propose an amortized inference approach using a neural network emulator to approximate choice probabilities for general error distributions, including those with correlated errors. Our proposal includes a specialized neural network architecture and accompanying training procedures designed to respect the invariance properties of discrete choice models. We provide group-theoretic foundations for the architecture, including a proof of universal approximation given a minimal set of invariant features. Once trained, the emulator enables rapid likelihood evaluation and gradient computation. We use Sobolev training, augmenting the likelihood loss with a gradient-matching penalty so that the emulator learns both choice probabilities and their derivatives. We show that emulator-based maximum likelihood estimators are consistent and asymptotically normal under mild approximation conditions, and we provide sandwich standard errors that remain valid even with imperfect likelihood approximation. Simulations show significant gains over the GHK simulator in accuracy and speed.

Keywords: amortized inference; DeepSet; discrete choice; invariant theory; multinomial probit; neural network emulator; permutation equivariance; Sobolev training

∗ Postdoctoral Fellow, Johns Hopkins Carey Business School, ehuch@jhu.edu.
† Carey Distinguished Professor, Johns Hopkins Carey Business School, mkeane14@jhu.edu.

1 Introduction

Discrete choice models are a widely used tool in management science, economics, marketing, and other fields for understanding how individuals and organizations make decisions among finite sets of alternatives (McFadden, 1974). These models provide a structural framework that delivers interpretable parameters—such as willingness-to-pay and demand elasticities—and that also enables predictions about the effects of hypothetical interventions, including pricing changes, product introductions, and policy modifications.

The dominant discrete choice model in applied work is the multinomial logit (MNL), which assumes that the utility consumer $i$ obtains from alternative $j$ takes the form $U_j = v_j + \epsilon_j$, where $v_j$ is the deterministic component of utility, typically written as a linear function $x_j^\top \beta$ of alternative $j$'s attributes $x_j$, and $\epsilon_j$ is a Type I extreme value (Gumbel) error that is independent and identically distributed (iid) across alternatives $j = 1, \ldots, K$. This setup yields simple closed-form choice probabilities: the probability of choosing alternative $j$ is given by the "softmax" function $\exp(v_j) / \sum_{k=1}^{K} \exp(v_k)$. The resulting computational convenience has made MNL the default choice in countless applications. However, this convenience comes at a well-known cost.
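To make the closed form concrete, here is a minimal numerical sketch (ours, not the paper's) that evaluates MNL choice probabilities over hypothetical utilities; the max-shift used for numerical stability is harmless precisely because of the location invariance discussed later in the paper:

```python
# Minimal sketch (assumptions ours): MNL choice probabilities via a
# numerically stable softmax over hypothetical deterministic utilities.
import numpy as np

def mnl_probs(v):
    """exp(v_j) / sum_k exp(v_k), computed stably."""
    v = v - v.max()           # location shift; probabilities are unchanged
    e = np.exp(v)
    return e / e.sum()

print(mnl_probs(np.array([1.0, 0.5, -0.2])))  # approx [0.52, 0.32, 0.16]
```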
The iid Gumbel error assumption implies the restrictive independence of irrelevant alternatives (IIA) property: the odds ratio between any two alternatives is unaffected by the presence or attributes of other alternatives, which can lead to unrealistic substitution patterns. A popular generalization of MNL is the mixed logit model (mixed-MNL), in which the utility weight $\beta$ is allowed to be heterogeneous across consumers. Estimation is straightforward via simulation methods (Train, 2009). The mixed-MNL relaxes IIA at the aggregate level, but IIA still holds at the level of consumer $i$'s individual choices given the corresponding random utility weight, $\beta_i$.

The multinomial probit (MNP) is a well-known alternative to MNL that relaxes IIA and allows for flexible substitution patterns by assuming $\epsilon_1, \ldots, \epsilon_K$ come from a multivariate normal distribution with cross-alternative correlations. Despite this advantage, MNP has seen relatively limited adoption, largely due to its computational demands. MNP choice probabilities have no closed form and require evaluating multivariate normal rectangle probabilities. Estimation relies on more sophisticated simulation methods than needed for mixed-MNL: the GHK simulator in a classical setting (Geweke, 1989; Hajivassiliou and McFadden, 1998; Keane, 1994) or MCMC in a Bayesian setting (McCulloch and Rossi, 1994).

We propose a fundamentally different approach: rather than simulating choice probabilities anew for each likelihood evaluation, we train a neural network emulator to directly approximate the choice probability function. This strategy—known as amortized inference—shifts the computational burden from inference time to a one-time training phase. Once trained, the emulator provides rapid deterministic approximations to choice probabilities via simple function calls. The amortized inference framework has proven highly successful in multiple scientific domains for approximating computationally intensive simulation models (Lueckmann et al., 2019; Cranmer et al., 2020). Here, we adapt it to discrete choice models, extending beyond MNL and MNP to general correlated error distributions.

In the econometrics literature, Norets (2012) employs a related strategy, approximating the expected value function in dynamic discrete choice models via a neural network. He applies this strategy in the context of a specific parametric model, so changes in the dynamic model require modifying and retraining the neural network. In contrast, our emulator operates on the generic choice problem given only the deterministic utilities and the parameters of the error distribution, enabling modifications to the parametric model (e.g., the functional form of the deterministic utilities) without altering or retraining the emulator.

Our methodology relies on several new contributions. First, we develop a neural network architecture specifically designed for choice models. It respects their fundamental invariance properties, which we formalize as symmetries (group actions) of the choice-probability map.[1] Choice probabilities are invariant to location shifts (adding a constant to all utilities) and scale transformations (multiplying utilities by a positive constant), and equivariant with respect to permutations of choice alternatives.

[1] If these properties are not embedded in the architecture, the NN must learn them during training. As a continuum of models generate equivalent choice probabilities, this would slow learning considerably.
Our architecture incorporates a preprocessing transformation that reduces the size of the feature space by enforcing location and scale invariance. The processed features are then passed to a per-alternative encoder module based on the DeepSet architecture (Zaheer et al., 2017). The output is then concatenated across alternatives and processed through equivariant layers that impose a sum-to-one constraint on the output probabilities while respecting the permutation equivariance property.

Second, we establish the theoretical foundations of our architecture. We prove that it can universally approximate choice probabilities on compact subsets of the parameter space, outside a measure-zero exceptional set. This result connects our architecture to the theory of orbit separation under group actions, extending the universal approximation results of Blum-Smith et al. (2025) on symmetric matrices to the joint space of utility vectors and centered covariance matrices. Blum-Smith et al. (2025) employ Galois theory to prove generic orbit separation, while we provide a direct proof based on an invariant reconstruction argument.

Third, we establish the statistical properties of maximum likelihood estimators (MLEs) formed using an emulator approximation of the true discrete choice probabilities. We show that if the emulator approximates the true log likelihood sufficiently well—specifically, if the average approximation error is $o_p(n^{-1})$—then the emulator-based estimator inherits the consistency and asymptotic normality of the exact MLE. When this condition is not met, we show that valid inference can still be obtained via sandwich standard errors, treating the emulator as a working model in the quasi-maximum likelihood framework.

By careful design, our preprocessing transformations and architecture enforce smoothness of choice probabilities with respect to the inputs: the deterministic utilities and corresponding scale (covariance) matrix. We use Sobolev training (Czarnecki et al., 2017) to match emulator gradients to those of the log choice probabilities. Together, these design choices enable reliable use of automatic differentiation for downstream model-fitting and inference tasks.
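The Sobolev penalty can be sketched in a few lines. The following is a hedged PyTorch illustration, not the paper's implementation: `emulator` is a hypothetical network returning log choice probabilities from normalized inputs, the targets `logp_true` and `grad_true` (the gradient of $\log P_j$ with respect to $\mathbf{v}$) are assumed to come from the offline training-data generation stage, and we use a squared-error value term, match only the utility gradient, and introduce a weight `lam` for brevity:

```python
# Hedged sketch of a Sobolev training objective (Czarnecki et al., 2017).
# `emulator`, `logp_true`, `grad_true`, and `lam` are our assumptions,
# not the paper's specification.
import torch

def sobolev_loss(emulator, v, Sigma, j, logp_true, grad_true, lam=1.0):
    v = v.clone().requires_grad_(True)
    logp = emulator(v, Sigma)[..., j]          # emulator's log P_j(v, Sigma)
    value_loss = (logp - logp_true).pow(2).mean()
    # Gradient of the emulator output w.r.t. the utilities, via autodiff;
    # create_graph=True lets the penalty itself be backpropagated.
    grad = torch.autograd.grad(logp.sum(), v, create_graph=True)[0]
    grad_loss = (grad - grad_true).pow(2).sum(-1).mean()
    return value_loss + lam * grad_loss        # value + gradient matching
```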
With a pretrained emulator, generalizing from logit to probit (or any other error distribution) requires only the replacement of closed-form softmax probabilities with emulator evaluations of choice probabilities when forming the likelihood. In the probit case, given a fixed computational budget, we show via simulations that an ML estimator using our amortized inference procedure matches or exceeds the performance of one using the GHK algorithm in terms of estimation error and coverage rates. Furthermore, our approach can easily handle models other than MNP where an efficient simulation algorithm like GHK is unavailable.

For concreteness, much of our exposition focuses on MNP models. However, the emulator approach and supporting theory are largely agnostic to the assumed parametric form of the errors. Generalizing to other error distributions (e.g., correlated Gumbel) is straightforward, requiring only modest changes to the training data generation and emulator inputs.

The remainder of this paper is organized as follows. Section 2 summarizes related literature on choice modeling. Section 3 describes the MNP model and our inferential goals. Section 4 presents the emulator architecture and training procedure. Section 5 establishes the theoretical properties of the architecture and emulator-based estimators. Section 6 presents simulation results, and Section 7 concludes.

2 Related Literature

The literature on applying machine learning methods to discrete choice can be divided into two broad streams. The first stream maintains the classic random utility model (RUM) of choice, in which choice probabilities are generated by a population of rational consumer types with different (but stable) preference orderings over the universe of choice objects, as explained by Block and Marschak (1960) and McFadden and Richter (1990).[2] MNL, MNP, and mixed-MNL are all members of the RUM class provided one maintains the RUM assumption that the utility of an option depends only on its own characteristics. Key papers in this strand start from this basic MNL structure and use neural networks to generalize the functional form of utility: Bentz and Merunka (2000), Sifringer et al. (2020), Wang et al. (2020), Han et al. (2022), and Singh et al. (2023). Two important recent papers extend this work: Aouad and Desir (2025) develop an architecture that implements the mixed logit model with a flexible distribution of taste heterogeneity (RUMnet), and Bagheri et al. (2025) develop another architecture that generalizes the Gumbel error assumption (RUM-NN). Both papers use softmax-smoothed sample averages to approximate choice probabilities within the loss function, resulting in increased computation time relative to pure logit-based models.

Aouad and Desir (2025) build on the generic approximation property of mixed-MNL models shown in McFadden and Train (2000). This result relies on a flexible basis expansion of the utility function and a potentially nonparametric mixing distribution. In practice, these model attributes are unknown to analysts. As emphasized in both Aouad and Desir (2025) and McFadden and Train (2000), this approximation property is not unique to Gumbel errors. Rather, it holds more generally for a larger class of hierarchical discrete choice models, including mixed-MNP models. McFadden and Train (2000) use this approximation property to justify adoption of mixed-MNL as a computationally convenient alternative to other models lacking closed-form choice probabilities, such as MNP. But this justification of mixed-MNL becomes less compelling given a practical, general-purpose alternative like the emulators we propose here.

The second stream dispenses with the RUM structure, often motivated by a desire to relax restrictive MNL assumptions like IIA and to allow more flexible substitution patterns. Some papers in this stream view discrete choice as a general classification problem that is amenable to machine learning methods. This is exemplified by Wang and Ross (2018), Lhéritier et al. (2019), Rosenfeld et al. (2020), Chen and Mišić (2022), and Chen et al. (2025).

[2] Suppose there are $\bar{K}$ objects in the universe and a consumer is presented with a choice set (or "assortment") that contains $K \leq \bar{K}$ elements. A key implication of RUM is that the utility a consumer derives from product $k$ is invariant to the choice set, ruling out context or assortment effects.
Others maintain an MNL structure at the top level—that is, choice probabilities are determined by a vector of alternative-specific utilities that enter a softmax function—but a neural net is used to construct the alternative-specific utility functions in flexible ways that deviate from RUM assumptions. For instance, the utility of alternative $j$ is allowed to depend on attributes of other alternatives to generate context effects. This is exemplified by Wang et al. (2021), Wong and Farooq (2021), Cai et al. (2022), Pfannschmidt et al. (2022), and Berbeglia and Venkataraman (2025).

Between these two streams, a fundamental tension emerges: while RUM models are favored for their interpretability and grounding in economic theory, the dominant RUM models in empirical work (linear MNL and mixed-MNL) enforce constraints on deterministic utilities and substitution patterns that may be too restrictive in some applications; in particular, they assume IIA at the level of individual choices. The first stream aims to address this challenge by adding additional flexibility to RUM models. However, many of these proposals maintain the assumption of independent logit errors, constraining substitution patterns and potentially biasing utility estimates. The exceptions—namely, RUMnet and RUM-NN—address substitution, but they do so at the expense of interpretability and/or computational efficiency due to their (a) (potentially) nonlinear utility functions and (b) expensive sample-average approximations of the likelihood function. The second stream abandons the RUM structure altogether, making it difficult or impossible to obtain reliable inferences for many economically meaningful quantities, such as consumer welfare, willingness-to-pay measures, demand elasticities, and substitution effects. In summary, this tension results in a tradeoff between modeling flexibility and interpretation, ranging from rigid interpretable models like MNL on one hand to flexible atheoretic classification methods on the other.

MNP models constitute a notable exception to the above tradeoff as they allow for flexible substitution patterns via interpretable covariance relationships. Moreover, if additional flexibility is desired, they can be extended to allow the deterministic utilities to follow the functional form of a neural network (Hruschka, 2007). Conversely, if researchers desire a more parsimonious or interpretable model, analysts can constrain the covariance structure via penalization methods as in Jiang et al. (2025) or via factor structures as we illustrate in Section 6. Despite these virtues, empirical applications of MNP are relatively sparse in the literature due, in large part, to the difficulty of evaluating MNP choice probabilities.

More broadly, MNP models are just one member of a larger class of RUM models featuring correlated error terms. Analogous to MNP, other members can be generated by assuming errors of the form $\epsilon = \Sigma^{1/2} \epsilon^*$, where $\Sigma^{1/2}$ is a square-root scale matrix and $\epsilon^*$ is a vector of exchangeable errors. For example, $\epsilon^*_j \overset{\text{iid}}{\sim} \text{Gumbel}(0, 1)$ results in a correlated Gumbel distribution with scale matrix $\Sigma$. As another example, we could assume $\epsilon^* \sim \text{Multivariate-}t(\mathbf{0}, I, \nu)$ to capture heavy-tail behavior via the degrees-of-freedom parameter $\nu > 0$ ($I$ denotes an identity matrix).
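Generating draws from this class is straightforward. The sketch below (function names and the Gumbel centering choice are ours) samples correlated Gumbel and multivariate-$t$ errors via $\epsilon = \Sigma^{1/2} \epsilon^*$, using a symmetric matrix square root:

```python
# Illustrative sketch (assumptions ours): sampling correlated errors of
# the form eps = Sigma^{1/2} eps* with exchangeable building blocks.
import numpy as np

rng = np.random.default_rng(0)

def sqrt_psd(Sigma):
    """Symmetric square root of a PSD matrix via eigendecomposition."""
    w, Q = np.linalg.eigh(Sigma)
    return Q @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ Q.T

def correlated_gumbel(Sigma, n):
    K = Sigma.shape[0]
    eps_star = rng.gumbel(0.0, 1.0, size=(n, K))  # exchangeable iid Gumbel
    eps_star -= np.euler_gamma                    # center: E[Gumbel(0,1)] = gamma
    return eps_star @ sqrt_psd(Sigma).T           # eps = Sigma^{1/2} eps*

def correlated_t(Sigma, nu, n):
    z = rng.standard_normal((n, Sigma.shape[0]))
    g = rng.chisquare(nu, size=(n, 1)) / nu       # chi^2_nu / nu mixing
    return (z / np.sqrt(g)) @ sqrt_psd(Sigma).T   # eps* ~ Multivariate-t(0, I, nu)
```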
Bagheri et al. (2025) consider correlated error distributions formed in this manner, resorting to expensive simulation methods to approximate the choice probabilities because an efficient algorithm (GHK) is available only in the Gaussian case.

The primary contribution of this work is an amortized inference framework that produces accurate, reusable, and computationally efficient estimates of choice probabilities for RUM models with general error distributions, including those featuring nontrivial correlation structures. The framework presents a potential resolution to the flexibility–interpretability tradeoff highlighted above, enabling practical estimation of flexible choice models without sacrificing the economically meaningful insights and parsimony of RUM models. The framework is supported by strong theoretical justification, including a universal approximation guarantee and asymptotic inference results under mild approximation conditions (see Section 5). Moreover, the framework is largely complementary to recent advances in machine learning methods, including the two streams discussed above. In many cases, these methods could be enhanced with emulator-based likelihood evaluations, resulting in flexible models with interpretable substitution patterns and manageable computational demands.

3 Problem Setup

This section summarizes the problem setup, including the invariance properties of discrete choice models and our inferential goals.

3.1 Discrete Choice Models

We consider a decision-maker choosing among $K$ mutually exclusive alternatives. The decision-maker assigns a latent utility $U_j$ to each alternative $j \in \{1, \ldots, K\}$ and selects the alternative with the highest utility:

$$Y = \arg\max_{j \in \{1, \ldots, K\}} U_j. \tag{1}$$

The latent utilities decompose into deterministic and stochastic components:

$$U_j = v_j + \epsilon_j, \quad j = 1, \ldots, K, \tag{2}$$

where $v_j$ is the deterministic (or systematic) utility that depends on observable characteristics, and $\epsilon_j$ is a centered random error capturing unobserved factors. As explained in Section 1, the deterministic utility typically takes a linear form $v_j = x_j^\top \beta$, where $x_j$ is a vector of alternative-specific attributes and $\beta$ is a parameter vector to be estimated.

Different distributional assumptions on the error vector $\epsilon = (\epsilon_1, \ldots, \epsilon_K)^\top$ yield different choice models. The MNL model assumes that the $\epsilon_j$ are independent and identically distributed with the Gumbel(0, 1) distribution. In contrast, the MNP model assumes that $\epsilon$ follows a multivariate normal distribution with mean zero and covariance matrix $\Sigma$. Below, we consider correlated models in which $\epsilon = \Sigma^{1/2} \epsilon^*$, where $\Sigma^{1/2}$ is a square-root scale matrix as described in Section 2. We impose the following assumption on $\epsilon^*$.

Assumption 1. The error vector $\epsilon^*$ admits a density, $f_{\epsilon^*}: \mathbb{R}^K \to [0, \infty)$, with respect to Lebesgue measure. Moreover, the elements of $\epsilon^*$ are exchangeable so that $f_{\epsilon^*}(P_\pi \epsilon') = f_{\epsilon^*}(\epsilon')$ for all $\epsilon' \in \mathbb{R}^K$ and any permutation matrix, $P_\pi$.

This setup and assumption include the MNP as a leading example with $f_{\epsilon^*}$ equal to the PDF of the $N(\mathbf{0}_K, I_K)$ distribution. However, the setup is general enough to include many other possibilities, such as the correlated Gumbel and multivariate-$t$ distributions described in Section 2. We defer treatment of additional regularity conditions until Section 5, when we develop the theoretical properties of our framework in the context of an assumed parametric model with parameter vector $\theta \in \Theta \subset \mathbb{R}^p$.

We denote the choice probability for alternative $j$ as

$$P_j(\mathbf{v}, \Sigma) = \Pr(U_j \geq U_k \text{ for all } k \neq j) = \Pr(\epsilon_k - \epsilon_j \leq v_j - v_k \text{ for all } k \neq j), \tag{3}$$

where $\mathbf{v} = (v_1, \ldots, v_K)^\top$. In practice, we may also choose to include additional parameters governing the distribution of $\epsilon^*$, such as the degrees-of-freedom parameter, $\nu$, described in Section 2, but these parameters are suppressed in the notation for simplicity. The choice probability can be expressed as the following integral:

$$P_j(\mathbf{v}, \Sigma) = \int_{R_j} f_{\epsilon^*}(\epsilon^*) \, d\epsilon^*, \tag{4}$$

$$R_j = \left\{ \epsilon^* \in \mathbb{R}^K : v_j + \left(\Sigma^{1/2}\right)_{j\bullet} \epsilon^* \geq v_k + \left(\Sigma^{1/2}\right)_{k\bullet} \epsilon^* \text{ for all } k \neq j \right\}, \tag{5}$$

where $R_j$ is the region of the error space where alternative $j$ is chosen and $(\Sigma^{1/2})_{j\bullet}$ denotes the $j$th row of $\Sigma^{1/2}$, so that $(\Sigma^{1/2})_{j\bullet} \epsilon^* = \epsilon_j$. In general, this integral is not analytically tractable. In MNP models, in particular, it does not have a closed-form solution for $K \geq 3$, necessitating numerical methods, such as the GHK algorithm, which simulates these probabilities via recursive conditioning. For each evaluation, GHK requires $R$ simulation draws, with the approximation error decreasing at rate $O(R^{-1/2})$.
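For reference, a bare-bones GHK simulator for the MNP case can be written as follows. This is a hedged sketch under our own naming and implementation choices (plain pseudo-random draws, a small jitter before the Cholesky factorization), not the paper's code: it differences out alternative $j$, factors the resulting covariance, and recursively draws truncated standard normals while accumulating the conditional probabilities:

```python
# Hedged reference sketch of the GHK simulator for an MNP choice
# probability P_j(v, Sigma) (Geweke, 1989; Keane, 1994; Hajivassiliou and
# McFadden, 1998). A production version would use antithetic or
# quasi-random draws.
import numpy as np
from scipy.stats import norm

def ghk_prob(j, v, Sigma, R=1000, rng=None):
    """GHK estimate of P_j(v, Sigma) for eps ~ N(0, Sigma)."""
    rng = np.random.default_rng(0) if rng is None else rng
    v = np.asarray(v, dtype=float)
    K = len(v)
    others = [k for k in range(K) if k != j]
    D = np.eye(K)[others]
    D[:, j] -= 1.0                      # rows e_k - e_j, k != j
    b = v[j] - v[others]                # upper limits v_j - v_k, as in eq. (3)
    Omega = D @ Sigma @ D.T             # Cov(eps_k - eps_j)
    L = np.linalg.cholesky(Omega + 1e-10 * np.eye(K - 1))
    z = np.zeros((R, K - 1))
    w = np.ones(R)                      # running product of conditional probs
    for t in range(K - 1):
        u_t = (b[t] - z[:, :t] @ L[t, :t]) / L[t, t]
        p_t = norm.cdf(u_t)
        w *= p_t                        # Pr(z_t <= u_t | earlier draws)
        # Inverse-CDF draw from the standard normal truncated to (-inf, u_t].
        z[:, t] = norm.ppf(np.clip(rng.uniform(size=R) * p_t, 1e-12, None))
    return w.mean()
```

The returned average of the probability weights is the GHK estimate of $P_j(\mathbf{v}, \Sigma)$, with error decreasing at the $O(R^{-1/2})$ rate noted above.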
3.2 Invariance Properties

Choice probabilities satisfy three fundamental invariance properties.

Location invariance. Adding a constant to all utilities does not change choice probabilities:

$$P_j(\mathbf{v} + C \mathbf{1}_K, \Sigma) = P_j(\mathbf{v}, \Sigma), \tag{6}$$

where $\mathbf{1}_K$ is the $K$-vector of ones and $C \in \mathbb{R}$ may be random. This follows because choices depend only on utility differences.

Scale invariance. Scaling all utilities and the scale matrix preserves choice probabilities:

$$P_j(\alpha \mathbf{v}, \alpha^2 \Sigma) = P_j(\mathbf{v}, \Sigma) \quad \text{for all } \alpha > 0. \tag{7}$$

This follows because the scaled model is equivalent to the original model with $\mathbf{v}$ and $\epsilon$ replaced by $\alpha \mathbf{v}$ and $\alpha \epsilon$, respectively.

Permutation equivariance. Relabeling alternatives permutes the choice probabilities correspondingly. For any permutation $\pi$ of $\{1, \ldots, K\}$ with associated permutation matrix $P_\pi$:

$$P_j(\mathbf{v}, \Sigma) = P_{\pi^{-1}(j)}(P_\pi \mathbf{v}, P_\pi \Sigma P_\pi^\top). \tag{8}$$

A naive attempt at constructing an emulator may not respect these properties. Translating, rescaling, or permuting the alternatives could produce different emulator predictions, resulting in slow learning and poor generalization because the emulator would need to learn these properties manually. In Section 4, we propose a solution that automatically satisfies these properties based on two design components: a preprocessing step that fixes the location and scale, and an equivariant architecture that ensures permutation equivariance without sacrificing expressivity.
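These three properties are easy to verify numerically. The check below (ours) reuses the `ghk_prob` sketch from Section 3.1; because that sketch reuses a fixed default seed, the first two checks hold exactly under common random numbers, while the permutation check holds only up to simulation noise:

```python
import numpy as np

v = np.array([0.8, 0.0, -0.3])
Sigma = np.array([[1.0, 0.4, 0.1],
                  [0.4, 1.0, 0.2],
                  [0.1, 0.2, 1.0]])
p = ghk_prob(0, v, Sigma, R=20000)

# Location invariance, eq. (6): exact here under common random numbers.
assert np.isclose(p, ghk_prob(0, v + 5.0, Sigma, R=20000))
# Scale invariance, eq. (7): also exact under common random numbers.
assert np.isclose(p, ghk_prob(0, 2.0 * v, 4.0 * Sigma, R=20000))
# Permutation equivariance, eq. (8): up to simulation noise.
perm = np.array([2, 0, 1])
P = np.eye(3)[perm]                       # P_pi
j_new = int(np.where(perm == 0)[0][0])    # pi^{-1}(0)
assert abs(p - ghk_prob(j_new, P @ v, P @ Sigma @ P.T, R=20000)) < 0.02
```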
The above invariance properties imply that the choice model is not identified without normalization. We assume throughout that the model has been appropriately normalized so that the parameter vector $\theta$ is identified. This is typically achieved in practice by setting the utility of one alternative to zero deterministically and fixing one of the nonzero diagonal elements in $\Sigma$ (see Keane (1992)). Similarly, we assume that the data exhibit sufficient variation to identify the remaining parameters, which typically requires variation in alternative-specific covariates, such as prices.

3.3 Inferential Goals

Given observations $\{(y_i, X_i)\}_{i=1}^n$, where $y_i \in \{1, \ldots, K\}$ is the observed choice and $X_i$ represents the covariates, we seek to estimate the parameter vector, $\theta$. This parameter vector includes both the coefficients affecting deterministic utilities and additional parameters governing the scale matrix. The (averaged) log-likelihood function is

$$\ell_n(\theta) = \frac{1}{n} \sum_{i=1}^n \log P_{y_i}\{\mathbf{v}_i(X_i, \theta), \Sigma_i(X_i, \theta)\}, \tag{9}$$

where $\mathbf{v}_i(X_i, \theta)$ and $\Sigma_i(X_i, \theta)$ denote the utility vector and scale matrix for observation $i$ as functions of the covariates and parameters. Maximum likelihood estimation requires maximizing $\ell_n(\theta)$ over $\Theta$, which in turn requires repeated evaluation of $\ell_n(\theta)$ and its gradient $\nabla_\theta \log P_j(\mathbf{v}, \Sigma)$. Similarly, Bayesian methods such as Hamiltonian Monte Carlo require repeated evaluations of both $\ell_n(\theta)$ and $\nabla_\theta \log P_j(\mathbf{v}, \Sigma)$, often thousands of times throughout the estimation process.

Our goal is to construct a neural network emulator $\hat{P}_j(\mathbf{v}, \Sigma; \phi)$ that provides fast, accurate approximations of the true choice probabilities, along with analytic gradients via automatic differentiation. The emulator is trained once on simulated data spanning the relevant input space, after which it can be used for rapid inference on many datasets and models.

4 Emulator Design

This section outlines the emulator preprocessing transformations, neural network architecture, training procedure, and inference process.

4.1 Design Goals

The emulator must satisfy four desiderata:

1. Respect invariance properties. The emulator should respect the invariance properties described in Section 3.2: invariance with respect to location shifts and scale transformations, and equivariance with respect to permutations.

2. Provide smooth approximations. The emulator should be differentiable with respect to its inputs, enabling gradient-based optimization and automatic differentiation.

3. Generalize across specifications. A single trained emulator should work for diverse utility and covariance structures.

4. Be computationally efficient. Emulator evaluation should be fast enough to enable routine use in estimation.

We achieve these goals through a combination of preprocessing transformations and a carefully designed neural network architecture.

4.2 Preprocessing: Centering and Scaling

We address location and scale invariance through preprocessing transformations that project the inputs onto a canonical subspace.

Centering. Define the centering matrix

$$M = I_K - \frac{1}{K} \mathbf{1}_K \mathbf{1}_K^\top, \tag{10}$$

which we employ to project the vector of utilities, $U \in \mathbb{R}^K$, onto the subspace orthogonal to $\mathbf{1}_K$, resulting in transformed utilities $\tilde{U} = M(\mathbf{v} + \epsilon) = \mathbf{v} + \epsilon - \mathbf{1}_K(\bar{v} + \bar{\epsilon})$, where $\bar{v} = \frac{1}{K} \sum_{j=1}^K v_j$ and $\bar{\epsilon} = \frac{1}{K} \sum_{j=1}^K \epsilon_j$. These transformed utilities sum to zero within each realization: $\sum_{j=1}^K \tilde{U}_j = 0$. In effect, this transformation results in modified emulator inputs:

$$\tilde{\mathbf{v}} = M \mathbf{v} = \mathbf{v} - \mathbf{1}_K \bar{v}, \qquad \tilde{\Sigma} = M \Sigma M^\top. \tag{11}$$

Scaling. We normalize by the trace of the centered scale matrix, producing transformed utilities $U^* = \sqrt{K / \operatorname{tr}(\tilde{\Sigma})}\, \tilde{U}$. The final transformed parameters are then

$$\mathbf{v}^* = \sqrt{\frac{K}{\operatorname{tr}(\tilde{\Sigma})}}\, \tilde{\mathbf{v}}, \qquad \Sigma^* = \frac{K}{\operatorname{tr}(\tilde{\Sigma})}\, \tilde{\Sigma}. \tag{12}$$

We can omit the case $\operatorname{tr}(\tilde{\Sigma}) = 0$ as this results in known, deterministic choices. The complete transformation produces normalized inputs $(\mathbf{v}^*, \Sigma^*)$ satisfying:

1. $\sum_{j=1}^K v^*_j = 0$ (centered deterministic utilities),
2. $\Sigma^* \mathbf{1}_K = \mathbf{0}$ (centered scale matrix),
3. $\operatorname{tr}(\Sigma^*) = K$ (scale-normalized),
4. $\Sigma^*$ is positive semidefinite with rank at most $K - 1$.

This normalization reduces the size of the emulator's input space, accelerating learning and improving generalization. Importantly, this transformation commutes with permutation matrices: for any permutation matrix $P_\pi$, applying the transformation to $(P_\pi \mathbf{v}, P_\pi \Sigma P_\pi^\top)$ yields $(P_\pi \mathbf{v}^*, P_\pi \Sigma^* P_\pi^\top)$. This ensures that the preprocessing preserves permutation equivariance.
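The full transformation is only a few lines of linear algebra. Here is a sketch under our own naming that directly implements eqs. (10)-(12) and checks properties 1-3 above:

```python
# Sketch of the location/scale normalization of Section 4.2 (names ours).
import numpy as np

def normalize_inputs(v, Sigma):
    """Map (v, Sigma) to the canonical representative (v*, Sigma*)."""
    K = len(v)
    M = np.eye(K) - np.ones((K, K)) / K      # centering matrix, eq. (10)
    v_t = M @ v                              # centered utilities, eq. (11)
    S_t = M @ Sigma @ M.T                    # centered scale matrix
    c = K / np.trace(S_t)                    # assumes tr > 0 (nondegenerate)
    return np.sqrt(c) * v_t, c * S_t         # eq. (12)

v_star, S_star = normalize_inputs(np.array([1.0, 2.0, 4.0]), np.eye(3))
assert np.isclose(v_star.sum(), 0.0)          # property 1
assert np.allclose(S_star @ np.ones(3), 0.0)  # property 2
assert np.isclose(np.trace(S_star), 3.0)      # property 3
```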
4.3 Neural Network Architecture

After preprocessing, we construct a neural network that maps $(\mathbf{v}^*, \Sigma^*)$ to choice probabilities while maintaining permutation equivariance. The architecture consists of three components: a per-alternative encoder, permutation-equivariant layers, and an output layer. Below, we interpret $\Sigma$ as a covariance matrix to simplify the exposition, referring to its diagonal and off-diagonal elements as variances and covariances, respectively. This interpretation is correct for MNP models. For other models, it is only illustrative as these parameters may not exactly coincide with the variances and covariances.

Per-alternative encoder. For each alternative $j \in \{1, \ldots, K\}$, we construct a representation $z_j \in \mathbb{R}^{d_z}$ that captures how alternative $j$ relates to all other alternatives. This representation is built from two complementary DeepSet networks—a diagonal DeepSet and an off-diagonal DeepSet—whose outputs are combined with alternative $j$'s own features. The diagonal DeepSet processes pairwise relationships between $j$ and each other alternative, while the off-diagonal DeepSet summarizes the covariance structure among alternatives other than $j$. In the descriptions below, the features labeled as "base inputs" are sufficient for universal approximation of choice probabilities (see Section 5.1). We include additional features to improve the expressivity of the emulator, as we find this improves its predictions.

Diagonal DeepSet. For each pair $(j, k)$ with $k \neq j$, we construct a feature vector $d_{jk}$ containing differentiable functions of the utilities and covariances associated with alternatives $j$ and $k$. The base inputs are the utilities $v^*_j$ and $v^*_k$, the variances $\Sigma^*_{jj}$ and $\Sigma^*_{kk}$, and the covariance $\Sigma^*_{jk}$. From these, we derive additional feature transformations that facilitate learning, including standard deviations $\sigma_j = \sqrt{\Sigma^*_{jj}}$ and $\sigma_k = \sqrt{\Sigma^*_{kk}}$, the correlation $\rho_{jk} = \Sigma^*_{jk} / (\sigma_j \sigma_k)$, and the standardized utility difference (z-score):

$$z_{jk} = \frac{v^*_j - v^*_k}{\sqrt{\Sigma^*_{jj} + \Sigma^*_{kk} - 2 \Sigma^*_{jk}}}. \tag{13}$$

The diagonal DeepSet processes these features using the architecture of Zaheer et al. (2017):

$$h_j^{\text{diag}} = \rho_{\text{diag}}\left( \sum_{k \neq j} \phi_{\text{diag}}(d_{jk}) \right), \tag{14}$$

where $\phi_{\text{diag}}$ is a multi-layer perceptron (MLP) applied to each pair, the sum aggregates over all $k \neq j$, and $\rho_{\text{diag}}$ is another MLP that produces a learned nonlinear representation of the sum.
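A hedged PyTorch sketch of the diagonal DeepSet follows, building the pair features of eq. (13) and the sum-then-map structure of eq. (14). Layer widths, depths, and names are our assumptions rather than the paper's specification, and the explicit loops would be vectorized in practice:

```python
# Hedged sketch of the diagonal DeepSet encoder, eqs. (13)-(14).
import torch
import torch.nn as nn

def mlp(d_in, d_out, d_hidden=64):
    return nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                         nn.Linear(d_hidden, d_out))

class DiagonalDeepSet(nn.Module):
    def __init__(self, d_z=32):
        super().__init__()
        self.phi = mlp(9, d_z)   # per-pair map phi_diag over 9 pair features
        self.rho = mlp(d_z, d_z) # post-aggregation map rho_diag

    def forward(self, v, Sigma):
        """v: (K,), Sigma: (K, K) normalized inputs -> (K, d_z) encodings."""
        K = v.shape[0]
        sd = torch.sqrt(torch.diagonal(Sigma).clamp_min(1e-12))
        h = []
        for j in range(K):
            pair_feats = []
            for k in range(K):
                if k == j:
                    continue
                var_d = (Sigma[j, j] + Sigma[k, k]
                         - 2 * Sigma[j, k]).clamp_min(1e-12)
                d_jk = torch.stack([
                    v[j], v[k], Sigma[j, j], Sigma[k, k], Sigma[j, k],  # base
                    sd[j], sd[k],
                    Sigma[j, k] / (sd[j] * sd[k]),        # correlation rho_jk
                    (v[j] - v[k]) / torch.sqrt(var_d),    # z-score, eq. (13)
                ])
                pair_feats.append(d_jk)
            summed = self.phi(torch.stack(pair_feats)).sum(dim=0)  # sum, k != j
            h.append(self.rho(summed))                             # eq. (14)
        return torch.stack(h)
```

Because the sum over $k \neq j$ is order-independent, relabeling the other alternatives leaves each $h_j^{\text{diag}}$ unchanged, which is what makes this encoder compatible with permutation equivariance.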
We use the notation $\phi_{\text{diag}}$ and $\rho_{\text{diag}}$ following Zaheer et al. (2017); the MLPs $\rho_{\text{diag}}$ and $\rho_{\text{off}}$ (defined below) can be distinguished from the correlation $\rho_{jk}$ by their subscripts. Singh et al. (2023) also employ a variant of the DeepSet architecture. However, their method assumes iid errors and operates on problem-specific features, which requires their neural network to be modified and retrained to accommodate new data structures. In contrast, our architecture relies on generic RUM model inputs, such as deterministic utilities.

Off-diagonal DeepSet. For each pair $(k, l)$ with $k < l$ and $k, l \neq j$, we construct a feature vector $o_{kl}$ containing differentiable, symmetric functions comparing alternatives $k$ and $l$; we require symmetry because the pairs do not possess a natural ordering. The base input is $\Sigma^*_{kl}$, and we include other derived features, such as the correlation $\rho_{kl}$, the squared utility difference $(v^*_k - v^*_l)^2$, the squared z-score $z^2_{kl}$, and the variance sum $\Sigma^*_{kk} + \Sigma^*_{ll}$. The off-diagonal DeepSet has an analogous structure to that of the diagonal DeepSet:

$$h_j^{\text{off}} = \rho_{\text{off}}\left( \sum_{\substack{k < l \\ k, l \neq j}} \phi_{\text{off}}(o_{kl}) \right).$$
