Optimization on Product Submanifolds of Convolution Kernels
Recent advances in optimization methods for training convolutional neural networks (CNNs) whose kernels are normalized according to particular constraints have shown remarkable success. This work introduces an approach for training CNNs using ensembles of joint spaces of kernels constructed using different constraints. For this purpose, we address the problem of optimization on ensembles of products of embedded submanifolds (PEMs) of convolution kernels. To this end, we first propose three strategies to construct ensembles of PEMs in CNNs. Next, we analyze their geometric (metric and curvature) properties in CNNs. We make use of our theoretical results by developing a geometry-aware SGD algorithm (G-SGD) for optimization on ensembles of PEMs to train CNNs. Moreover, we analyze convergence properties of G-SGD in light of the geometric properties of PEMs. In the experimental analyses, we employ G-SGD to train CNNs on the CIFAR-10, CIFAR-100, and ImageNet datasets. The results show that the geometric adaptive step-size computation methods of G-SGD can improve training loss and convergence properties of CNNs. Moreover, we observe that the classification performance of baseline CNNs can be boosted using G-SGD on ensembles of PEMs identified by multiple constraints.
💡 Research Summary
The paper introduces a novel framework for training convolutional neural networks (CNNs) that leverages ensembles of products of embedded submanifolds (PEMs) to incorporate multiple normalization constraints on convolution kernels simultaneously. Traditional approaches have applied a single geometric constraint, such as orthonormality (Stiefel manifold) or unit norm (sphere), to individual kernels, but naively extending these methods to multiple constraints leads to early divergence and vanishing or exploding gradients. To overcome this, the authors define PEMs as Cartesian products of embedded kernel submanifolds, with each submanifold representing a distinct constraint. Three construction strategies are proposed: (1) PEMs over input channels (PI), (2) PEMs over output channels (PO), and (3) PEMs over both input and output channels (PIO). These strategies allow both non‑overlapping and overlapping groupings of kernels, enabling rich combinations of geometric structures within a single layer.
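The grouping idea can be illustrated with a small sketch. The function below partitions a convolution weight tensor into kernel groups, one group per component submanifold; the strategy names follow the paper, but this particular flattening scheme is an illustrative assumption, not the authors' implementation.

```python
import numpy as np

def group_kernels(W, strategy):
    """Partition a conv weight tensor W of shape (C_out, C_in, kh, kw)
    into flattened kernel groups, each to be constrained on its own
    component submanifold of the product manifold (PEM).

    Grouping layout is an illustrative assumption following the paper's
    PI / PO / PIO naming, not the authors' exact implementation.
    """
    C_out, C_in, kh, kw = W.shape
    if strategy == "PO":   # one group per output channel
        return [W[o].reshape(-1) for o in range(C_out)]
    if strategy == "PI":   # one group per input channel
        return [W[:, i].reshape(-1) for i in range(C_in)]
    if strategy == "PIO":  # one group per (output, input) channel pair
        return [W[o, i].reshape(-1) for o in range(C_out) for i in range(C_in)]
    raise ValueError(f"unknown strategy: {strategy}")
```

For a layer with 4 output and 3 input channels, PO yields 4 groups, PI yields 3, and PIO yields 12, each of which would then carry its own constraint (e.g., unit norm).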
Theoretical analysis (Lemma 3.2) shows that the metric on a PEM is the sum of the metrics of its component manifolds, and its curvature tensor is the sum of the component curvature tensors. Consequently, when the component manifolds have non‑negative sectional curvature (as spheres and Stiefel manifolds do), the PEM never exhibits negative sectional curvature, but it does contain flat directions (zero curvature) along mixed planes spanning different component manifolds. This property motivates an adaptive step‑size scheme that depends on the local sectional curvature. Theorem 3.3 and Corollary 3.4 derive explicit learning‑rate formulas for common cases, such as products of spheres and Stiefel manifolds.
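The additive structure is the standard Riemannian product construction: for a product manifold $\mathcal{M} = \mathcal{M}_1 \times \mathcal{M}_2$ with tangent vectors written as pairs, the metric splits as

```latex
g_{\mathcal{M}}\big((X_1, X_2),\,(Y_1, Y_2)\big)
  = g_{\mathcal{M}_1}(X_1, Y_1) + g_{\mathcal{M}_2}(X_2, Y_2),
```

and the curvature tensor decomposes componentwise, $R_{\mathcal{M}} = R_{\mathcal{M}_1} \oplus R_{\mathcal{M}_2}$. In particular, for a "mixed" plane spanned by $X_1 \in T\mathcal{M}_1$ and $X_2 \in T\mathcal{M}_2$ the sectional curvature vanishes,

```latex
K\big((X_1, 0),\,(0, X_2)\big) = 0,
```

which is the source of the flat directions mentioned above. (This restates a textbook property of product manifolds; the paper's Lemma 3.2 specializes it to PEMs of kernels.)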
Building on these insights, the authors design Geometry‑aware Stochastic Gradient Descent (G‑SGD). Each iteration projects the Euclidean gradient onto the tangent spaces of the component manifolds, moves along the tangent direction, and then retracts the updated point back onto the PEM using the appropriate retraction for each submanifold. The step size is computed adaptively per layer, epoch, and kernel based on the curvature information, ensuring stable updates even in highly curved regions.
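The project–move–retract iteration can be made concrete with a minimal sketch. The example below assumes each component kernel is constrained to the unit sphere and uses a fixed step size; the paper's G‑SGD instead computes the step size adaptively from curvature, and supports other component manifolds (e.g., Stiefel) with their own projections and retractions.

```python
import numpy as np

def sphere_sgd_step(w, euclidean_grad, lr):
    """One manifold-SGD step for a kernel constrained to the unit sphere.

    w: flattened kernel with ||w|| = 1.
    euclidean_grad: Euclidean gradient of the loss w.r.t. w.
    lr: step size (fixed here; curvature-adaptive in the paper).
    """
    # Project the Euclidean gradient onto the tangent space at w
    # by removing its component along w.
    tangent_grad = euclidean_grad - np.dot(w, euclidean_grad) * w
    # Move along the negative tangent direction.
    v = w - lr * tangent_grad
    # Retract back onto the sphere by renormalization.
    return v / np.linalg.norm(v)

def pem_sgd_step(kernels, grads, lr):
    """Update every component kernel of the product manifold.

    Because the product metric is the sum of component metrics,
    the updates decouple across components.
    """
    return [sphere_sgd_step(w, g, lr) for w, g in zip(kernels, grads)]
```

Note how the product structure shows up in `pem_sgd_step`: each component is projected and retracted independently, which is what makes combining many constraints in one layer tractable.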
Empirical evaluations on CIFAR‑10, CIFAR‑100, and ImageNet using standard architectures (ResNet, VGG) demonstrate that G‑SGD on PEM ensembles converges faster and achieves higher classification accuracy than conventional Euclidean SGD or prior manifold‑aware methods. For example, on ImageNet, a ResNet‑50 trained with the PIO strategy improves top‑1 accuracy by roughly 1.2 % and reduces training loss by over 15 % compared to the baseline. Moreover, the curvature‑driven adaptive learning rate reduces the need for extensive hyper‑parameter tuning.
In summary, the work provides a rigorous geometric foundation for multi‑constraint kernel optimization, proposes practical construction strategies for PEM ensembles, and validates a curvature‑aware SGD algorithm that yields both theoretical convergence guarantees and tangible performance gains on large‑scale vision benchmarks.