Differentiable Knapsack and Top-k Operators via Dynamic Programming
Knapsack and Top-k operators are useful for selecting discrete subsets of variables. However, their integration into neural networks is challenging as they are piecewise constant, yielding gradients that are zero almost everywhere. In this paper, we propose a unified framework casting these operators as dynamic programs, and derive differentiable relaxations by smoothing the underlying recursions. On the algorithmic side, we develop efficient parallel algorithms supporting both deterministic and stochastic forward passes, and vector-Jacobian products for the backward pass. On the theoretical side, we prove that Shannon entropy is the unique regularization choice yielding permutation-equivariant operators, and characterize regularizers inducing sparse selections. Finally, on the experimental side, we demonstrate our framework on a decision-focused learning benchmark, a constrained dynamic assortment RL problem, and an extension of discrete VAEs.
💡 Research Summary
The paper tackles the long‑standing problem of integrating discrete combinatorial operators—specifically the 0/1 knapsack and top‑k selection—into end‑to‑end differentiable models. Both problems exhibit optimal substructure and can therefore be expressed as a dynamic programming (DP) recursion. The authors observe that the classic DP uses a (max, +) semiring, which yields piecewise‑constant solutions and thus zero or undefined gradients when back‑propagated. To obtain meaningful gradients they replace the max operator with a smoothed version max_Ω, defined as the convex conjugate of a strictly convex regularizer Ω over the probability simplex.
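As a concrete illustration of this smoothing, here is a minimal sketch of the knapsack DP with max replaced by the Shannon-entropy max_Ω, which reduces to a temperature-scaled log-sum-exp. The function names and the integer-weight restriction are assumptions for this sketch, not the paper's implementation; as γ → 0 the smoothed value approaches the hard knapsack optimum.

```python
import numpy as np

def smoothed_max(a, b, gamma=1.0):
    """Shannon-entropy max_Omega: gamma * log(exp(a/gamma) + exp(b/gamma)).

    Computed with a max-shift for numerical stability; recovers
    max(a, b) as gamma -> 0.
    """
    m = max(a, b)
    return m + gamma * np.log(np.exp((a - m) / gamma)
                              + np.exp((b - m) / gamma))

def smoothed_knapsack_value(theta, weights, capacity, gamma=1.0):
    """Smoothed 0/1 knapsack value via the classic DP recursion
    V[i, c] = max_Omega(V[i-1, c], V[i-1, c - w_i] + theta_i),
    assuming non-negative integer weights. Only one DP row is kept,
    so memory is O(capacity) rather than O(n * capacity)."""
    V = np.zeros(capacity + 1)  # value with zero items, any capacity
    for i in range(len(theta)):
        new_V = V.copy()
        for c in range(weights[i], capacity + 1):
            # branch 1: skip item i; branch 2: take it (uses w_i capacity)
            new_V[c] = smoothed_max(V[c], V[c - weights[i]] + theta[i], gamma)
        V = new_V
    return V[capacity]
```

Because log-sum-exp upper-bounds the max, the smoothed value always dominates the hard optimum and converges to it from above as the temperature shrinks.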
The key theoretical contributions are twofold. First, they prove (Proposition 1) that Shannon entropy (Ω = −γ Hₛ) is the unique separable regularizer that makes the relaxed operators permutation‑equivariant. This property guarantees that the output does not depend on the arbitrary ordering of items, a desirable symmetry for many learning tasks. Second, they characterize (Proposition 2) the class of regularizers that can produce sparse selections: any separable, strictly convex Ω with bounded derivative at the simplex boundaries (e.g., Gini or Tsallis entropy) yields operators whose gradients concentrate on vertices of the convex hull of feasible selections, effectively mimicking hard 0/1 decisions while remaining differentiable.
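The dense-versus-sparse distinction can be seen already in the gradient of a single max_Ω step: with Shannon entropy the gradient is a softmax, which puts strictly positive mass on every branch, whereas with a Gini/Tsallis-type quadratic regularizer it becomes a Euclidean projection onto the simplex (sparsemax), which can return exact zeros. This is a generic illustration of the two regularizer families, not code from the paper.

```python
import numpy as np

def softmax(theta, gamma=1.0):
    # Gradient of gamma * logsumexp(theta / gamma): always strictly dense.
    z = np.exp((theta - theta.max()) / gamma)
    return z / z.sum()

def sparsemax(theta, gamma=1.0):
    # Euclidean projection of theta/gamma onto the probability simplex
    # (quadratic / Gini regularization): can assign exact zeros.
    z = np.sort(theta / gamma)[::-1]
    cssv = np.cumsum(z) - 1.0
    k = np.arange(1, len(z) + 1)
    rho = k[z - cssv / k > 0][-1]       # support size
    tau = cssv[rho - 1] / rho           # threshold
    return np.maximum(theta / gamma - tau, 0.0)
```

For theta = [3, 0] the softmax output is dense (both entries positive), while sparsemax returns exactly [1, 0], mimicking a hard 0/1 decision while remaining differentiable almost everywhere.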
Algorithmically, the smoothed DP recursion retains the same subproblem dependencies as the original DP, allowing a wave‑front parallelization: each DP row depends only on the previous row, so the computation can be performed in O(n) parallel steps, each handling all capacity (or k) values simultaneously. The authors implement this in pure Python using Numba’s just‑in‑time compilation, achieving batch‑wise parallelism without custom CUDA kernels and reducing memory from O(nC) to O(C). For the backward pass they derive closed‑form expressions for the partial derivatives Q = ∂V/∂θ and propagate them through an auxiliary matrix E that encodes the contribution of each decision, resulting in an O(n) vector‑Jacobian product. This is substantially more efficient than black‑box perturbation or multi‑solver approaches that require many forward solves per gradient estimate.
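The forward/backward structure described above can be sketched for the top‑k special case (where every item has weight 1, so the DP has k + 1 columns instead of C + 1). The forward pass stores the local branch probabilities q; the backward pass propagates an auxiliary matrix E of contributions from the final state back to each decision, yielding the gradient in one sweep. This is a simplified dense-matrix sketch under hypothetical names, not the paper's wave-front implementation, which keeps only one row at a time.

```python
import numpy as np

def soft_topk_grad(theta, k, gamma=1.0):
    """Smoothed top-k DP value and its gradient w.r.t. theta.

    Forward: V[i, j] = max_Omega(V[i-1, j], V[i-1, j-1] + theta[i-1])
    with Shannon smoothing; q[i, j] is the probability of the
    "take item i" branch. Backward: E[i, j] = dV[n, k] / dV[i, j].
    """
    n = len(theta)
    NEG = -1e30                      # stand-in for -infinity (infeasible)
    V = np.full((n + 1, k + 1), NEG)
    V[:, 0] = 0.0                    # selecting nothing has value 0
    q = np.zeros((n + 1, k + 1))
    for i in range(1, n + 1):
        for j in range(1, k + 1):
            a = V[i - 1, j]                       # skip item i
            b = V[i - 1, j - 1] + theta[i - 1]    # take item i
            m = max(a, b)
            V[i, j] = m + gamma * np.log(np.exp((a - m) / gamma)
                                         + np.exp((b - m) / gamma))
            q[i, j] = np.exp((b - V[i, j]) / gamma)
    # Backward sweep: accumulate contributions into E and the gradient.
    E = np.zeros((n + 1, k + 1))
    E[n, k] = 1.0
    grad = np.zeros(n)
    for i in range(n, 0, -1):
        for j in range(k, 0, -1):
            grad[i - 1] += E[i, j] * q[i, j]
            E[i - 1, j] += E[i, j] * (1 - q[i, j])
            E[i - 1, j - 1] += E[i, j] * q[i, j]
    return V[n, k], grad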
The framework also supports stochastic forward passes. Because max_Ω induces a distribution over selections, the authors provide an ancestral sampling algorithm that draws a concrete subset according to the smoothed DP probabilities. Moreover, they show how to embed the relaxed operators as output layers in supervised learning by employing Fenchel‑Young losses, which treat the smoothed value max_Ω(θ) as a convex surrogate for the original combinatorial objective.
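The ancestral sampling idea can be sketched as follows for the top‑k case: a forward DP pass computes the smoothed branch probabilities, then a backward walk from state (n, k) flips one coin per item to draw a concrete subset. Function and variable names are hypothetical; this is an illustration of the sampling scheme, not the paper's code.

```python
import numpy as np

def sample_soft_topk(theta, k, gamma=1.0, rng=None):
    """Draw one size-k subset by ancestral sampling from the smoothed
    top-k DP: q[i, j] is the probability of taking item i when j
    selection slots remain."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(theta)
    NEG = -1e30
    V = np.full((n + 1, k + 1), NEG)
    V[:, 0] = 0.0
    q = np.zeros((n + 1, k + 1))
    # Forward pass: smoothed values and branch probabilities.
    for i in range(1, n + 1):
        for j in range(1, k + 1):
            a = V[i - 1, j]
            b = V[i - 1, j - 1] + theta[i - 1]
            m = max(a, b)
            V[i, j] = m + gamma * np.log(np.exp((a - m) / gamma)
                                         + np.exp((b - m) / gamma))
            q[i, j] = np.exp((b - V[i, j]) / gamma)
    # Backward walk: one Bernoulli draw per item, conditioned on slots left.
    subset, j = [], k
    for i in range(n, 0, -1):
        if j > 0 and rng.random() < q[i, j]:
            subset.append(i - 1)
            j -= 1
    return sorted(subset)
```

At the temperature limit γ → 0 the walk becomes deterministic and recovers the hard top‑k subset; at higher temperatures it explores near-optimal subsets, which is what makes the stochastic forward pass useful for RL-style training.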
Empirically, three domains are explored. (1) Decision‑focused learning on a benchmark where the loss is defined on the downstream optimization outcome: the Shannon‑regularized operator outperforms existing LP‑based differentiable solvers, achieving higher task‑specific utility. (2) A constrained dynamic assortment problem in reinforcement learning, where non‑uniform item weights are essential: the Gini‑regularized operator yields sparse, interpretable selections and stabilizes training compared to dense alternatives. (3) A discrete variational auto‑encoder augmented with a Fenchel‑Young loss: both Shannon and Tsallis regularizations improve reconstruction quality and latent space disentanglement relative to straight‑through estimators or REINFORCE.
Overall, the paper presents a unified, theoretically grounded, and computationally efficient method for differentiable knapsack and top‑k operators. By framing these combinatorial primitives as smoothed DP recursions, it offers a flexible regularization knob that trades off sparsity, symmetry, and smoothness, while delivering practical algorithms that scale to realistic problem sizes. This work opens the door to broader use of exact combinatorial constraints inside deep learning pipelines without resorting to heuristic relaxations or expensive black‑box gradient estimators.