Differentiable Zero-One Loss via Hypersimplex Projections

Camilo Gomez¹ ⋆, Pengyang Wang², and Liansheng Tang¹

¹ School of Data, Mathematical, and Statistical Sciences, University of Central Florida, Orlando, USA
camilo.gomez@ucf.edu, liansheng.tang@ucf.edu
² Department of CIS, University of Macau, Macao, China
pywang@um.edu.mo

Abstract. Recent advances in machine learning have emphasized the integration of structured optimization components into end-to-end differentiable models, enabling richer inductive biases and tighter alignment with task-specific objectives. In this work, we introduce a novel differentiable approximation to the zero-one loss, long considered the gold standard for classification performance yet incompatible with gradient-based optimization due to its non-differentiability. Our method constructs a smooth, order-preserving projection onto the $(n, k)$-dimensional hypersimplex through a constrained optimization framework, leading to a new operator we term Soft-Binary-Argmax. After deriving its mathematical properties, we show how its Jacobian can be efficiently computed and integrated into binary and multiclass learning systems. Empirically, our approach achieves significant improvements in generalization under large-batch training by imposing geometric consistency constraints on the output logits, thereby narrowing the performance gap traditionally observed in large-batch training. Our code is available at https://github.com/camilog04/Differentiable-Zero-One-Loss-via-Hypersimplex-Projections.

Keywords: Differentiable Optimization in Deep Learning · Differentiable programming · Large-batch generalization.

1 Introduction

Recent developments in machine learning have demonstrated that optimization procedures can be used as fundamental components within end-to-end differentiable systems [1, 6, 10, 11].
Rather than relying solely on traditional neural network layers, these approaches incorporate more structured, often nontrivial computations (such as constrained optimization) directly into the learning pipeline. These components usually have structural or computational properties that have been proven useful in downstream tasks. For instance, Sparsemax, a differentiable projection onto the simplex, produces sparse posterior distributions that are effective as attention mechanisms [17]. Similarly, Csoftmax, a projection onto the budget polytope, has demonstrated utility in sequence tagging [18]. This reflects a growing shift toward viewing learning systems as differentiable computational frameworks that blend elements of traditional statistical modeling with algorithmic computation [5]. In this paper, we focus on crafting a differentiable, order-preserving projection into the $(n, k)$-dimensional hypersimplex [7], a well-studied combinatorial polytope, in composition with a squared loss to generate a close approximation to the zero-one misclassification loss compatible with modern large-scale differentiable systems.

⋆ Corresponding author

From a theoretical perspective in machine learning, the goal is to minimize the expected value of a task-specific loss function over a data distribution. For classification, the most natural choice is the zero-one loss, which directly measures misclassification error. However, the zero-one loss is non-differentiable and discontinuous, as it depends on a hard threshold decision, yielding gradients of zero almost everywhere and rendering it incompatible with gradient-based optimization. To enable tractable training, modern approaches rely on surrogate losses (e.g., cross-entropy, hinge loss) that are smooth and differentiable [3]. These surrogates serve as proxies that approximate the zero-one loss while facilitating efficient optimization.
Despite their practicality, such surrogates often exhibit a mismatch with the true evaluation metric, especially under large-batch regimes. This degradation in performance with large batch sizes is the so-called generalization gap [19], leading to growing interest in tighter, more faithful approximations to the zero-one loss, under the hypothesis that closer surrogates yield better generalization.

This work introduces a fully differentiable approximation to the zero-one loss, featuring an efficient forward pass with complexity $O(n \log n)$ and a backward pass with $O(n)$ complexity. This is achieved through our novel differentiable projection layer, Soft-Binary-Argmax@k. Rather than treating the output scores as independent, our layer explicitly enforces that the largest $k$ logits correspond to the predicted positive classes. This design ensures that small perturbations in the input produce coherent, structurally consistent adjustments in the output, making the Jacobian of the transformation inherently positionally aware with respect to the most confident predictions. The benefits of this approach are twofold: it allows binary classifiers to express multiple positive outcomes within a single forward pass, and it extends naturally to the multiclass setting by applying the same projection principle across one-hot encoded class dimensions. Importantly, the geometric constraints imposed on the output logits act as a form of regularization that mitigates the generalization degradation typically observed under large-batch training, enabling stable optimization and improved predictive performance.

More precisely, our contributions and novelty can be summarized as follows:

1. We introduce a differentiable projection layer, a smooth thresholding operator realized via projection onto the interior of the $(n, k)$-dimensional hypersimplex, termed Soft-Binary-Argmax@k. It provides a differentiable relaxation of the binary argmax, reduces to isotonic regression, and enables efficient forward and backward computation on both CPU and GPU.
2. We propose a smooth, almost-everywhere differentiable loss function for binary classification, the HyperSimplex Loss, and derive its mathematical properties. The loss couples the mean squared error with our projection layer and extends naturally to multiclass.
3. Through rigorous experimentation, we provide empirical evidence that the proposed loss mitigates the generalization gap and improves performance across multiple classification benchmark datasets.

2 Related work

2.1 Differentiable optimization-based ML

Optimization-based modeling integrates structure and constraints into machine learning architectures by embedding parameterized argmin/argmax operations as differentiable layers. Such layers are often formulated as convex, constrained programs, with differentiability achieved via the implicit function theorem applied to the KKT conditions [11]. Parallel work has explored differentiable optimization for order-constrained or monotonic outputs, including differentiable isotonic regression operators for smooth, order-aware learning [6]. To the best of our knowledge, no prior work has introduced a differentiable Euclidean projection onto the $(n, k)$-dimensional hypersimplex, computed via the Pool Adjacent Violators (PAV) algorithm, as a learnable layer. Our approach fills this gap, providing an efficient and theoretically grounded formulation for integrating hypersimplex projections into modern differentiable systems.

2.2 Generalization gap

A well-known challenge in modern deep learning is the generalization gap, where models trained with large batch sizes achieve low training loss but exhibit degraded test performance.
This phenomenon has been widely observed in neural networks, as large batches tend to converge to sharp minima that generalize poorly compared to the flatter solutions found by small-batch training [15]. Subsequent work has explored remedies such as adaptive learning rate schedules and warmup strategies, noise injection and regularization [12], and stochastic weight averaging [14] to mitigate this effect. However, to the best of our knowledge, our work is the first to address the generalization gap through loss function design, introducing a principled framework that directly links the geometry of the loss landscape to generalization behavior.

3 Preliminaries

In supervised multiclass classification, we are given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$, where each input $x_i \in \mathcal{X} \subset \mathbb{R}^d$ is associated with a categorical label $y_i \in \{1, \dots, C\}$ among $C$ possible classes. Let $f : \mathcal{X} \to \mathbb{R}^C$ denote a prediction function producing a score vector $f(x_i) = (f_1(x_i), \dots, f_C(x_i))^\top$, where each component $f_c(x_i)$ reflects the model's confidence for class $c$. The learning objective is to minimize the multiclass zero-one loss, which measures the fraction of misclassified samples:

\[
L_{0/1}(f) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\left[\hat{y}_i \neq y_i\right], \qquad \hat{y}_i = \arg\max_{c \in \{1, \dots, C\}} f_c(x_i). \tag{1}
\]

While $L_{0/1}$ directly quantifies classification accuracy, it is discontinuous and non-differentiable, making it unsuitable for gradient-based optimization.

To obtain a differentiable surrogate, convex losses are commonly employed. Common loss functions in machine learning, such as squared loss, hinge loss, and logistic loss, are convex approximations of the true 0-1 misclassification loss [3]. Among these, the squared loss provides the closest approximation to the 0-1 loss on the interval $(0, 1)$, making it a natural foundation for our formulation.
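
To make Eq. (1) concrete, the following minimal sketch computes the zero-one loss for a small batch and shows why it gives no gradient signal. The score values, the 0-indexed labels, and the function name `zero_one_loss` are illustrative assumptions, not part of the paper's code.

```python
def zero_one_loss(scores, labels):
    """Fraction of samples whose arg-max class differs from the label (Eq. 1)."""
    errors = 0
    for row, label in zip(scores, labels):
        # arg max over the class scores of one sample
        predicted = max(range(len(row)), key=lambda c: row[c])
        errors += int(predicted != label)
    return errors / len(labels)


# Hypothetical scores for n = 4 samples and C = 3 classes (labels are 0-indexed here).
scores = [
    [2.0, 0.5, 0.1],
    [0.2, 0.3, 1.5],
    [0.9, 1.1, 0.4],
    [0.1, 0.2, 0.3],
]
labels = [0, 2, 0, 1]

loss = zero_one_loss(scores, labels)

# A small perturbation of the scores leaves every arg max unchanged, so the
# loss does not move: it is piecewise constant in the scores.
nudged = [[s + 1e-3 * (c + 1) for c, s in enumerate(row)] for row in scores]
nudged_loss = zero_one_loss(nudged, labels)
```

Here two of the four samples are misclassified, and the perturbed scores give the same loss value, which is the behavior that rules out direct gradient-based minimization.
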

The multiclass squared loss penalizes deviations between predicted scores and the corresponding one-hot target encodings, summing over all classes:

\[
L_{\mathrm{sq}}(f) = \frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} \big( f_c(x_i) - \mathbb{I}[y_i = c] \big)^2. \tag{2}
\]

Equivalently, in matrix form, $L_{\mathrm{sq}}(f) = \frac{1}{n} \lVert F(X) - Y \rVert_F^2$, where $F(X) \in \mathbb{R}^{n \times C}$ collects the model outputs and $Y$ is the one-hot label matrix. While this smooth, convex loss provides analytic gradients and serves as a tractable approximation to $L_{0/1}$, it also suffers from a major drawback: it imposes a quadratic penalty on extreme predicted values, leading to sensitivity to outliers [8, 13]. This limitation motivates our projection-based formulation introduced next, which preserves smoothness while constraining outputs within a geometrically consistent region.

4 Methodology

4.1 Overview

This work begins by formulating the problem in the binary classification setting, where the objective is to distinguish between positive and negative outcomes. The same geometric principles, however, extend naturally to the multiclass setting, as shown in later sections. Common loss functions in machine learning, such as squared loss, hinge loss, and logistic loss, are convex surrogates of the true 0-1 misclassification loss [3]. However, the true 0-1 loss minimizer lies at one of the vertices of the $(n, k)$-dimensional hypersimplex, as it satisfies two key properties:

1. Its entries are binary, i.e., each component of $f(X)$ takes a value in $\{0, 1\}$.
2. For any given sample $y$ drawn from the distribution of $Y$, a perfect prediction vector should contain exactly $k$ positive entries, matching the number of positives in $y$; that is, $\lVert f(X) \rVert_1 = \lVert y \rVert_1 = k$.

Motivated by the geometry of the optimal solution, we now introduce a series of relaxations that make the learning problem tractable.
First, we relax the binary constraint and allow $f(X)$ to take real values in $\mathbb{R}^n$, while encouraging sparsity, pushing predictions as close as possible to 0 or 1. To balance smoothness with structural fidelity, we compose our differentiable projection operator, the soft-binary-argmax@k, which produces sparse and nearly binary outputs, with the squared loss, which ensures smooth optimization and stability. This composition yields a surrogate objective that remains differentiable while closely aligning with the discrete geometry of the hypersimplex.

The remainder of this section is organized as follows. We begin by establishing the connection between binary-argmax@k and thresholding. Next, we formulate binary-argmax@k as a projection onto the $(n, k)$ hypersimplex, from which we derive its continuous relaxation, soft-binary-argmax@k, and analyze its key properties. Finally, we combine the soft-binary-argmax@k with a squared loss to define the HyperSimplex loss, and demonstrate its effectiveness in generalization for large batch sizes.

4.2 Thresholding and the Binary-Argmax@k

This section establishes an intuitive connection between a real-valued vector $\mathbf{x} \in \mathbb{R}^n$ and its binary counterpart in $\{0, 1\}^n$. A common discretization method is thresholding, where entries exceeding a fixed boundary (typically 0.5) are set to 1, and the rest to 0. Although simple, this approach ignores relative ordering and offers no control over the number of positive components.

A more structured alternative is the binary-argmax@k operator, denoted $r_k$, which assigns 1 to the $k$ largest entries of $\mathbf{x}$ and 0 to the remaining $n - k$. Here, the threshold is adaptively defined by the $k$-th largest value of $\mathbf{x}$, reducing to standard thresholding when $k = \lceil n/2 \rceil$, where the threshold equals the empirical median of the logits.

Formally, let $\mathbf{x} \in \mathbb{R}^n$ be a vector of scores and $k \in \{1, \dots, n\}$. We define the binary-argmax@k operator as

\[
r_k(\mathbf{x})_i = \mathbb{I}\big( x_i \ge T_k(\mathbf{x}) \big), \qquad T_k(\mathbf{x}) = k\text{-th largest value of } \mathbf{x}. \tag{3}
\]

This rule ensures exactly $k$ components of $\mathbf{x}$ are set to 1, enforcing the constraint

\[
\lVert r_k(\mathbf{x}) \rVert_1 = k, \qquad r_k(\mathbf{x}) \in \{0, 1\}^n. \tag{4}
\]

The binary-argmax@k mapping does not provide useful derivatives, hindering gradient-based optimization. To address this, we formulate it as a linear optimization problem over the $(n, k)$-dimensional hypersimplex $\Delta_k^n$ and introduce a Euclidean regularization term with a temperature parameter, yielding a smooth relaxation: the soft-binary-argmax@k. This differentiable formulation preserves the hypersimplex geometry while providing informative gradients for end-to-end learning.

4.3 Binary-Argmax@k: Euclidean Projections onto the Hypersimplex

The Euclidean projection onto the $(n, k)$-dimensional hypersimplex can be expressed as the solution of a simple regularized linear program:

\[
\operatorname*{argmax}_{\mathbf{y} \in \mathbb{R}^n} \; \langle \mathbf{x}, \mathbf{y} \rangle - \lVert \mathbf{y} \rVert_2^2 \quad \text{s.t.} \quad \mathbf{1}^\top \mathbf{y} = k, \; \mathbf{0} \le \mathbf{y} \le \mathbf{1}. \tag{5}
\]

The first term encourages alignment with the input vector $\mathbf{x}$, while the quadratic regularization term enforces proximity to the origin, thereby inducing a balance between sparsity and fidelity. The affine constraint $\mathbf{1}^\top \mathbf{y} = k$ fixes the $\ell_1$ mass of $\mathbf{y}$, ensuring exactly $k$ active components, while the box constraint $\mathbf{0} \le \mathbf{y} \le \mathbf{1}$ confines the solution to the hypercube $[0, 1]^n$.

The feasible region defined by these two constraints is precisely the $(n, k)$-dimensional hypersimplex:

\[
\Delta_k^n = \Big\{ \mathbf{y} \in [0, 1]^n \;\Big|\; \sum_{i=1}^{n} y_i = k \Big\}. \tag{6}
\]

Hence, the optimization problem in (5) is equivalent to the Euclidean projection over the hypersimplex:

\[
\Pi_{\Delta_k^n}(\mathbf{x}) = \operatorname*{argmin}_{\mathbf{y} \in \Delta_k^n} \lVert \mathbf{x} - \mathbf{y} \rVert_2^2. \tag{7}
\]

Since $\Delta_k^n$ is convex and compact, this problem admits a unique solution.
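
The contrast between fixed thresholding and the binary-argmax@k rule of Eqs. (3)-(4) can be sketched in a few lines. The function names and the index-based tie-breaking are assumptions made for illustration; the source does not specify how ties are resolved.

```python
def threshold(x, boundary=0.5):
    """Fixed-boundary thresholding: no control over how many entries become 1."""
    return [1 if v > boundary else 0 for v in x]


def binary_argmax_at_k(x, k):
    """Set the k largest entries of x to 1 and the rest to 0 (Eqs. 3-4).

    Ties are broken by index so that exactly k entries are selected; this
    tie-breaking rule is an assumption of the sketch.
    """
    if not 1 <= k <= len(x):
        raise ValueError("k must satisfy 1 <= k <= len(x)")
    # Indices by decreasing score; the stable sort keeps the lower index first on ties.
    order = sorted(range(len(x)), key=lambda i: -x[i])
    top = set(order[:k])
    return [1 if i in top else 0 for i in range(len(x))]


x = [0.1, 1.6, 1.0]             # example score vector
r1 = binary_argmax_at_k(x, 1)   # exactly one positive
r2 = binary_argmax_at_k(x, 2)   # exactly two positives
t = threshold(x)                # positives decided by the 0.5 boundary alone
```

With `binary_argmax_at_k` the number of positives is set by `k`, whereas `threshold` returns however many entries happen to exceed the boundary.
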
Geometrically, $\Pi_{\Delta_k^n}(\mathbf{x})$ corresponds to the point within $\Delta_k^n$ that lies closest to $\mathbf{x}$ in Euclidean distance, and algebraically, it coincides with the binary vector that activates the $k$ largest components of $\mathbf{x}$, the binary-argmax@k.

Fig. 1. Binary-argmax@k of a point $\mathbf{x} = (0.1, 1.6, 1)$ into the exterior of the hypersimplex (left). At $k = 1$, the solution is $(0, 1, 0)$. Introducing temperature to the program yields an interior solution (right), i.e., the soft-binary-argmax@1. In $\mathbb{R}^3$, different $k$ values yield points on a standard simplex, but in higher dimensions they yield a point on the hypersimplex. In $\mathbb{R}^4$ with $k = 2$, the solution lies on an octahedron [2].

4.4 Soft-Binary-Argmax@k: A Differentiable Approximation

The projection $\Pi_{\Delta_k^n}(\mathbf{x})$ provides a geometric mapping from a continuous vector $\mathbf{x}$ to its structured binary counterpart, but it remains piecewise constant and thus non-differentiable. Small perturbations in $\mathbf{x}$ can abruptly change the identity of the top-$k$ elements, yielding discontinuities and zero gradients almost everywhere. Consequently, the hard binary-argmax@k operator is incompatible with gradient-based optimization.

Temperature-Scaled Relaxation. To obtain a smooth approximation, we introduce a temperature parameter $\tau > 0$ that scales the regularization strength in the projection objective:

\[
\operatorname*{argmin}_{\mathbf{y} \in \Delta_k^n} \Big( \tau \lVert \mathbf{y} \rVert_2^2 - 2 \langle \mathbf{x}, \mathbf{y} \rangle \Big) = \operatorname*{argmin}_{\mathbf{y} \in \Delta_k^n} \Big( \lVert \mathbf{y} \rVert_2^2 - 2 \Big\langle \frac{\mathbf{x}}{\tau}, \mathbf{y} \Big\rangle \Big), \tag{8}
\]

leading to the compact expression

\[
\Pi_\tau(\mathbf{x}) = \operatorname*{argmin}_{\mathbf{y} \in \Delta_k^n} \Big\lVert \mathbf{y} - \frac{\mathbf{x}}{\tau} \Big\rVert_2^2 = \Pi_{\Delta_k^n}\Big( \frac{\mathbf{x}}{\tau} \Big). \tag{9}
\]

As $\tau \to 0$, the operator recovers the discontinuous hard projection, while larger values of $\tau$ yield smoother outputs that move toward the interior of the hypersimplex; this defines the soft-binary-argmax@k operator.

Proposition 1 (Differentiability a.e.). Fix $k \in \{1, \dots, n\}$ and $\tau > 0$. The mapping

\[
F_\tau : \mathbb{R}^n \to \Delta_k^n, \qquad F_\tau(\mathbf{x}) := \Pi_{\Delta_k^n}\Big( \frac{\mathbf{x}}{\tau} \Big)
\]

is $(1/\tau)$-Lipschitz, hence differentiable almost everywhere (a.e.) in $\mathbb{R}^n$.

Proof. The Euclidean projection onto a closed convex set in a Hilbert space is nonexpansive: $\lVert \Pi_C(\mathbf{u}) - \Pi_C(\mathbf{v}) \rVert_2 \le \lVert \mathbf{u} - \mathbf{v} \rVert_2$ for all $\mathbf{u}, \mathbf{v}$. With $C = \Delta_k^n$ and $\mathbf{u} = \mathbf{x}/\tau$, $\mathbf{v} = \mathbf{z}/\tau$,

\[
\lVert F_\tau(\mathbf{x}) - F_\tau(\mathbf{z}) \rVert = \Big\lVert \Pi_{\Delta_k^n}\Big( \frac{\mathbf{x}}{\tau} \Big) - \Pi_{\Delta_k^n}\Big( \frac{\mathbf{z}}{\tau} \Big) \Big\rVert \le \Big\lVert \frac{\mathbf{x}}{\tau} - \frac{\mathbf{z}}{\tau} \Big\rVert = \frac{1}{\tau} \lVert \mathbf{x} - \mathbf{z} \rVert.
\]

Thus $F_\tau$ is $(1/\tau)$-Lipschitz. By Rademacher's theorem, every Lipschitz map on $\mathbb{R}^n$ is differentiable a.e., proving the claim.

Proposition 2 (Order preservation). The projection solution $y_i = \big[ \Pi_{\Delta_k^n}(\mathbf{x}/\tau) \big]_i$ is order preserving; that is, if $x_1/\tau \ge x_2/\tau \ge \dots \ge x_n/\tau$, then the projected coordinates satisfy $y_1 \ge y_2 \ge \dots \ge y_n$.

Proof. From the KKT conditions of the Lagrangian associated with (9), stationarity and complementarity yield, for each $i$,

\[
y_i = \operatorname{clip}\Big( \frac{x_i}{\tau} - \frac{\lambda}{2}, \, 0, \, 1 \Big),
\]

where the multiplier $\lambda$ is determined by the equality constraint $\sum_i y_i = k$. Since the mapping $t \mapsto \operatorname{clip}\big( t - \frac{\lambda}{2}, 0, 1 \big)$ is monotone nondecreasing in $t$, it follows that $x_1/\tau \ge \dots \ge x_n/\tau \Rightarrow y_1 \ge \dots \ge y_n$. See [9] for details.

Corollary 1 (Computation). Since the projection is order preserving (Proposition 2), adding a monotonicity constraint does not change the solution. Hence, for any sorted input $\mathbf{x}/\tau$, the projection can be computed via a reduction to isotonic regression:

\[
\Pi(\mathbf{x}/\tau) = \operatorname*{argmin}_{\mathbf{y} \in [0, 1]^n, \; \mathbf{1}^\top \mathbf{y} = k, \; y_1 \ge \dots \ge y_n} \Big\lVert \frac{\mathbf{x}}{\tau} - \mathbf{y} \Big\rVert^2.
\]

The feasible set is closed and convex, ensuring a unique and differentiable solution. This reduces to a standard isotonic projection problem, solvable efficiently via the pool-adjacent-violators (PAV) algorithm [4] in $O(n \log n)$ time.
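
The clipped form of the solution in the proof of Proposition 2 suggests a simple way to experiment with the operator. The sketch below finds the shift $\lambda/2$ by bisection on the constraint $\sum_i y_i = k$. This is a swapped-in technique chosen for brevity, not the PAV-based reduction of Corollary 1, and the function names and iteration count are assumptions.

```python
def project_hypersimplex(z, k, iters=100):
    """Euclidean projection of z onto {y in [0,1]^n : sum(y) = k}.

    Uses the KKT form y_i = clip(z_i - nu, 0, 1) with nu = lambda / 2 and
    locates nu by bisection (an assumption of this sketch, not PAV).
    """
    def clipped_sum(nu):
        return sum(min(1.0, max(0.0, v - nu)) for v in z)

    # clipped_sum(lo) = n >= k and clipped_sum(hi) = 0 <= k bracket the root.
    lo, hi = min(z) - 1.0, max(z)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if clipped_sum(mid) > k:
            lo = mid   # shift too small: the clipped entries sum to more than k
        else:
            hi = mid
    nu = 0.5 * (lo + hi)
    return [min(1.0, max(0.0, v - nu)) for v in z]


def soft_binary_argmax_at_k(x, k, tau=1.0):
    """Temperature-scaled projection of Eq. (9): project x / tau."""
    return project_hypersimplex([v / tau for v in x], k)


x = [0.1, 1.6, 1.0]
soft = soft_binary_argmax_at_k(x, k=1, tau=1.0)    # interior point of the hypersimplex
hard = soft_binary_argmax_at_k(x, k=1, tau=0.05)   # small tau: essentially a vertex
```

For this input, solving $1.6 - \nu + 1.0 - \nu = 1$ by hand gives $\nu = 0.8$, so `soft` is approximately $(0, 0.8, 0.2)$, while the small-temperature call is numerically indistinguishable from the vertex $(0, 1, 0)$. Both outputs keep the ordering of the inputs, as Proposition 2 states.
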

4.5 The HyperSimplex loss

We now compose our projection operator with the squared loss to define a smooth surrogate for binary classification. The squared loss provides high fidelity to the zero-one objective within $(0, 1)$ but can be dominated by large-magnitude predictions. By composing it with the projection operator $\Pi_{\Delta_k^n}$, we constrain predictions to the hypersimplex, preventing any single coordinate from dominating the loss while preserving the discrete geometry of the solution.

Formally, for $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$, define

\[
\hat{\mathbf{y}} = \Pi_{\Delta_k^n}\Big( \frac{\mathbf{x}}{\tau} \Big), \qquad L(\mathbf{x}, \mathbf{y}) = \frac{1}{2} \lVert \hat{\mathbf{y}} - \mathbf{y} \rVert_2^2,
\]

where $\tau > 0$ controls the smoothness of the relaxation. The gradient with respect to $\mathbf{x}$ follows from the chain rule:

\[
\nabla_{\mathbf{x}} L(\mathbf{x}, \mathbf{y}) = \frac{1}{\tau} J_\Pi\Big( \frac{\mathbf{x}}{\tau} \Big) (\hat{\mathbf{y}} - \mathbf{y}),
\]

where $J_\Pi$ denotes the Jacobian of the projection operator $\Pi_{\Delta_k^n}$. Let $A = \{ i : 0 < \hat{y}_i < 1 \}$ denote the active coordinates. On this set, the Jacobian acts as

\[
J_\Pi = I_{|A|} - \frac{1}{|A|} \mathbf{1} \mathbf{1}^\top,
\]

yielding the component-wise gradient

\[
(\nabla_{\mathbf{x}} L)_i =
\begin{cases}
\dfrac{1}{\tau} \Big( (\hat{y}_i - y_i) - \dfrac{1}{|A|} \sum_{j \in A} (\hat{y}_j - y_j) \Big), & i \in A, \\[2ex]
0, & i \notin A.
\end{cases}
\]

At boundary points where some $\hat{y}_i \in \{0, 1\}$, the mapping is only directionally differentiable, and any subgradient consistent with this Jacobian form is valid.

4.6 Extension to Multiclass Classification

The formulation extends naturally to the multiclass setting. For each class $c \in \{1, \dots, C\}$ with logits $\mathbf{x}^{(c)} \in \mathbb{R}^n$, one-hot target $\mathbf{y}^{(c)}$, and temperature $\tau_c > 0$, we project onto the $(n, k_c)$-hypersimplex:

\[
\mathbf{p}^{(c)} = \Pi_{\Delta_{k_c}^n}\Big( \frac{\mathbf{x}^{(c)}}{\tau_c} \Big).
\]

The total loss is

\[
L(X, Y) = \frac{1}{2} \sum_{c=1}^{C} \big\lVert \mathbf{p}^{(c)} - \mathbf{y}^{(c)} \big\rVert_2^2, \qquad \nabla_{\mathbf{x}^{(c)}} L = \frac{1}{\tau_c} J_\Pi\Big( \frac{\mathbf{x}^{(c)}}{\tau_c} \Big) \big( \mathbf{p}^{(c)} - \mathbf{y}^{(c)} \big).
\]

This provides a smooth, per-class projection framework that preserves hypersimplex structure while remaining fully differentiable.
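
The binary loss of Section 4.5 and its closed-form gradient can be sketched as follows. The projection is again computed by bisection on the KKT shift, which is an assumption of this sketch rather than the PAV routine, and the names `hypersimplex_loss_and_grad` and `eps` (the tolerance used to decide which coordinates are strictly inside $(0, 1)$) are hypothetical.

```python
def project_hypersimplex(z, k, iters=100):
    """Projection onto {y in [0,1]^n : sum(y) = k} by bisection on the KKT shift."""
    def clipped_sum(nu):
        return sum(min(1.0, max(0.0, v - nu)) for v in z)

    lo, hi = min(z) - 1.0, max(z)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if clipped_sum(mid) > k:
            lo = mid
        else:
            hi = mid
    nu = 0.5 * (lo + hi)
    return [min(1.0, max(0.0, v - nu)) for v in z]


def hypersimplex_loss_and_grad(x, y, k, tau=1.0, eps=1e-9):
    """Return L = 0.5 * ||y_hat - y||^2 and its gradient with respect to x."""
    y_hat = project_hypersimplex([v / tau for v in x], k)
    residual = [p - t for p, t in zip(y_hat, y)]
    loss = 0.5 * sum(r * r for r in residual)

    # Active set A: coordinates strictly inside (0, 1), up to the tolerance eps.
    active = [i for i, p in enumerate(y_hat) if eps < p < 1.0 - eps]
    grad = [0.0] * len(x)
    if active:
        mean_res = sum(residual[i] for i in active) / len(active)
        for i in active:
            # (1 / tau) * (I - 11^T / |A|) applied to the residual on A
            grad[i] = (residual[i] - mean_res) / tau
    return loss, grad


x = [0.1, 1.6, 1.0]   # scores
y = [0.0, 1.0, 0.0]   # binary targets with k = 1 positive
loss, grad = hypersimplex_loss_and_grad(x, y, k=1, tau=1.0)
```

In this example the projection is approximately $(0, 0.8, 0.2)$, so the first coordinate sits on the boundary and receives zero gradient, while the two active coordinates receive equal and opposite gradients. A central finite-difference check on the loss is a quick way to validate the formula at interior points.
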
During the learning process, each $k_c$ is set to match the expected number of positive responses for class $c$.

5 Experiments

For our experiments, we evaluate the effectiveness of the proposed HyperSimplex loss in reducing the generalization gap compared to standard classification losses, including Cross-Entropy, Hinge, and Mean Squared Error (MSE, without projection). This setup also serves as an ablation study to isolate the contribution of our projection layer, verifying that incorporating geometric constraints on the output logits yields more consistent performance across batch sizes than the MSE objective alone.

5.1 Datasets

We conduct experiments on two standard image classification benchmarks: CIFAR-10 [16] and Fashion-MNIST [21]. CIFAR-10 consists of 60,000 color images of size 32 × 32 pixels, split into 50,000 training and 10,000 test samples across 10 object categories. Fashion-MNIST contains 70,000 grayscale images of size 28 × 28 pixels, divided into 60,000 training and 10,000 test images from 10 clothing categories, serving as a more challenging replacement for the original MNIST dataset.

5.2 Experimental Setup

We employed a standard convolutional neural network (CNN) for multiclass image classification, consisting of four convolutional layers, each followed by batch normalization, max pooling, and ReLU activation. The final feature map is flattened and passed through two fully connected layers, with the last layer producing class logits. The datasets were preprocessed using random cropping, horizontal flipping, and per-channel normalization, and randomly split into training and test sets. All experiments were implemented in PyTorch [20] and executed on a 32-core AMD Ryzen Threadripper PRO 5975WX CPU with 503 GB of RAM and three NVIDIA RTX 6000 Ada Generation GPUs, each with 48 GB of VRAM.

To ensure statistical robustness, each configuration was trained using five independent random seeds, varying both model initialization and data splits. We evaluated four loss functions (our proposed HyperSimplex loss and three widely used baselines: Cross-Entropy, Hinge, and Mean Squared Error (MSE)) across seven batch sizes (128, 256, 512, 1024, 2048, 4096, and 8192) on both CIFAR-10 and Fashion-MNIST. In total, this resulted in 280 training runs. For each configuration, we recorded the maximum test accuracy achieved per loss function and batch size, and assessed differences against the Cross-Entropy baseline using paired t-tests at the 10% significance level. This experimental design provides a rigorous and statistically grounded comparison, isolating the contribution of the HyperSimplex formulation to generalization stability under varying batch regimes.

5.3 Results

[Figure: two panels, "CIFAR-10: Mean Max Accuracy vs. Batch Size (Across Seeds)" and "Fashion-MNIST: Mean Max Accuracy vs. Batch Size (Across Seeds)". x-axis: batch size; y-axis: max accuracy. Curves: CrossEntropy, HyperSimplex (l2=1.5), Hinge, MSE.]

For CIFAR-10, all seven configurations report positive mean accuracy differences, and all (100%) show statistically significant improvements at the 10% level ($p < 0.1$). For Fashion-MNIST, six of seven configurations (≈ 86%) also achieve significance, with only the smallest batch size (128) falling above the 10% threshold. Across both datasets, therefore, 13 of 14 total comparisons (≈ 93%) demonstrate statistically significant gains, indicating that the proposed loss systematically outperforms cross-entropy across a wide range of training conditions.
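
The paired comparison described in the experimental setup can be illustrated with a short sketch. The per-seed accuracies below are hypothetical placeholders, not values from the paper, and the sketch assumes a two-sided test with five seeds (four degrees of freedom), for which the tabulated critical value at the 10% level is 2.132.

```python
import math


def paired_t_statistic(baseline, candidate):
    """t statistic of the per-seed differences candidate - baseline."""
    diffs = [c - b for b, c in zip(baseline, candidate)]
    m = len(diffs)
    mean = sum(diffs) / m
    var = sum((d - mean) ** 2 for d in diffs) / (m - 1)   # unbiased sample variance
    return mean / math.sqrt(var / m)


# Hypothetical per-seed max test accuracies (five seeds); NOT values from the paper.
ce = [0.880, 0.882, 0.879, 0.881, 0.883]
hs = [0.884, 0.887, 0.882, 0.887, 0.887]

t_stat = paired_t_statistic(ce, hs)

# Two-sided critical value of Student's t with 4 degrees of freedom at the 10% level.
T_CRIT_DF4_10PCT = 2.132
significant = abs(t_stat) > T_CRIT_DF4_10PCT
```

Pairing by seed means that only the per-seed differences enter the statistic, so variation shared by both losses within a seed (for example, from the data split) cancels out.
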

These findings confirm that the HyperSimplex loss maintains accuracy stability at smaller batch sizes while mitigating the degradation observed in cross-entropy as batch size increases. This supports its effectiveness as a smooth, geometry-consistent surrogate that enhances generalization and robustness in large-batch training regimes.

Table 1. Batch-wise comparison of Cross-Entropy (CE) vs. HyperSimplex (HS) losses on CIFAR-10 and Fashion-MNIST. HS scores marked with * are statistically significant at the 10% level of significance.

        CIFAR-10                                      Fashion-MNIST
Batch   CE      HS (ours)  Δ       t-stat  p-val      CE      HS (ours)  Δ       t-stat  p-val
128     0.8885  0.8917*    0.0032   2.29   0.084      0.9456  0.9467     0.0011   1.31   0.262
256     0.8843  0.8874*    0.0030   3.47   0.026      0.9434  0.9473*    0.0040   4.34   0.012
512     0.8807  0.8857*    0.0050   6.08   <0.01      0.9436  0.9469*    0.0033   5.51   <0.01
1024    0.8776  0.8821*    0.0045   9.78   <0.01      0.9413  0.9453*    0.0040   6.67   <0.01
2048    0.8725  0.8791*    0.0066   7.81   <0.01      0.9388  0.9446*    0.0058  12.97   <0.01
4096    0.8659  0.8750*    0.0090  13.14   <0.01      0.9371  0.9439*    0.0069   9.68   <0.01
8192    0.8541  0.8648*    0.0108   8.52   <0.01      0.9338  0.9415*    0.0077  14.99   <0.01

5.4 Other Experiments: Cross-domain Validation

Additional GBRT results on tabular datasets for classification are reported in Appendix A, showing that the HyperSimplex loss also improves out-of-sample generalization beyond the neural settings studied in the main text.

6 Conclusion

We introduced the soft-binary-argmax@k, a differentiable projection onto the interior of the $(n, k)$-dimensional hypersimplex, and established its key properties: differentiability, order preservation, and efficient GPU computation. We showed how this operator integrates naturally into end-to-end learning systems, and used it to construct a surrogate to the zero-one loss for binary and multiclass settings, with statistically significant reductions in the generalization gap.

Owing to its close alignment with the true zero-one objective, the proposed HyperSimplex loss improves generalization under large-batch training. Additional cross-domain evaluations on tabular data further suggest that the benefits of the projection extend beyond neural models. Future work will explore applications to contrastive learning objectives and structured prediction.

A Appendix: Cross-Domain Tabular Results

Dataset        Higgs  Flight  KDD10  KDD12  Criteo  Avazu  KKBox  MovieLens
Cross Entropy  0.823  0.773   0.826  0.724  0.774   0.738  0.777  0.827
HyperSimplex   0.846  0.778   0.849  0.729  0.796   0.741  0.797  0.828

References

1. Agrawal, A., Amos, B., Barratt, S.T., Boyd, S.P., Diamond, S., Kolter, J.Z.: Differentiable convex optimization layers. CoRR abs/1910.12430 (2019), http:
2. Amos, B., Koltun, V., Kolter, J.Z.: The limited multi-label projection layer (2019), h
3. Bartlett, P.L., Jordan, M.I., McAuliffe, J.D.: Convexity, classification, and risk bounds. Journal of the American Statistical Association 101(473), 138–156 (2006)
4. Best, M.J., Chakravarti, N.: Active set algorithms for isotonic regression: a unifying framework. Mathematical Programming 47(1-3), 425–439 (1990). https://doi.org/10.1007/BF01580880
5. Blondel, M., Roulet, V.: The elements of differentiable programming (2024), https:
6. Blondel, M., Teboul, O., Berthet, Q., Djolonga, J.: Fast differentiable sorting and ranking. In: International Conference on Machine Learning. pp. 950–959. PMLR (2020)
7. De Loera, J.A., Sturmfels, B., Thomas, R.R.: Gröbner bases and triangulations of the second hypersimplex. Combinatorica 15(3), 409–424 (1995). https://doi.org/10.1007/BF01299745
8. Gneiting, T., Raftery, A.E.: Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102(477), 359–378 (2007)
9. Gomez, C.: Fast Differentiable Projection Layers onto High-Dimensional Polytopes for Large-Scale Predictive Modeling. Ph.D. thesis, University of Central Florida (2025), https://stars.library.ucf.edu/etd2024/450
10. Gomez, C., Wang, P., Fu, Y.: Metric-agnostic learning-to-rank via boosting and rank approximation. In: 2023 IEEE International Conference on Data Mining (ICDM). pp. 1043–1048 (2023). https://doi.org/10.1109/ICDM58522.2023.00121
11. Gould, S., Fernando, B., Cherian, A., Anderson, P., Cruz, R.S., Guo, E.: On differentiating parameterized argmin and argmax problems with application to bi-level optimization. CoRR abs/1607.05447 (2016),
12. Hoffer, E., Hubara, I., Soudry, D.: Train longer, generalize better: Closing the generalization gap in large batch training of neural networks. In: Proceedings of the 31st Conference on Neural Information Processing Systems (NeurIPS). pp. 1731–1741 (2017)
13. Huber, P.J.: Robust estimation of a location parameter. The Annals of Mathematical Statistics 35(1), 73–101 (1964)
14. Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., Wilson, A.G.: Averaging weights leads to wider optima and better generalization. In: Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence (UAI). pp. 876–885 (2018), h
15. Keskar, N.S., Mudigere, D., Nocedal, J., Smelyanskiy, M., Tang, P.T.P.: On large-batch training for deep learning: Generalization gap and sharp minima. In: Proceedings of the 5th International Conference on Learning Representations (ICLR) (2017), https://openreview.net/forum?id=H1oyRlYgg
16. Krizhevsky, A.: Learning multiple layers of features from tiny images. Tech. Rep. TR-2009, University of Toronto (2009), https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf
17. Martins, A.F.T., Astudillo, R.F.: From softmax to sparsemax: A sparse model of attention and multi-label classification. CoRR abs/1602.02068 (2016), http:
18. Martins, A.F.T., Kreutzer, J.: Learning what's easy: Fully differentiable neural easy-first taggers. In: Palmer, M., Hwa, R., Riedel, S. (eds.) Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pp. 349–362. Association for Computational Linguistics, Copenhagen, Denmark (Sep 2017). https://doi.org/10.18653/v1/D17-1036, https://aclanthology.org/D17-1036/
19. Oyedotun, O.K., Papadopoulos, K., Aouada, D.: A new perspective for understanding generalization gap of deep neural networks trained with large batch sizes (2022),
20. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems. pp. 8024–8035 (2019)
21. Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017), h
