The loss surface of deep and wide neural networks
While the optimization problem behind deep neural networks is highly non-convex, it is frequently observed in practice that training deep networks seems possible without getting stuck in suboptimal points. It has been argued that this is because all local minima are close to being globally optimal. We show that this is (almost) true: in fact, almost all local minima are globally optimal for a fully connected network with squared loss and an analytic activation function, given that the number of hidden units of one layer of the network is larger than the number of training points and the network structure from this layer on is pyramidal.
💡 Research Summary
The paper investigates the geometry of the loss surface of fully connected feed-forward neural networks that are both deep and wide. The authors focus on networks trained with a squared-error loss (or any twice-differentiable loss whose gradient vanishes only at the global minimum) and analytic, strictly monotone activation functions such as sigmoid, tanh, or softplus. Their main structural assumption is that there exists at least one hidden layer whose width (the number of neurons) exceeds the number of training examples N, and that all subsequent layers form a pyramidal architecture, i.e., the widths are non-increasing from that layer on, so that each subsequent weight matrix can have full column rank.
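The structural assumption above is purely a condition on layer widths, so it is easy to check mechanically. The following helper is a hypothetical sketch (the function name and interface are mine, not from the paper): given the list of layer widths and the number of training points, it tests whether some layer is wider than N with non-increasing widths from that layer onward.

```python
# Hypothetical helper (names and interface are mine, not from the paper):
# checks the structural assumption of the summarized result for a list of
# layer widths [n_1, ..., n_L] (hidden layers through the output layer).
def satisfies_wide_pyramidal(widths, n_train):
    """Return True if some layer k has width n_k > n_train and the
    widths from layer k onward are non-increasing (pyramidal)."""
    for k, w in enumerate(widths):
        if w > n_train and all(
            widths[i] >= widths[i + 1] for i in range(k, len(widths) - 1)
        ):
            return True
    return False

# Example: one hidden layer wider than N = 150 training points,
# with non-increasing widths after it.
print(satisfies_wide_pyramidal([512, 200, 100, 10], 150))  # True
print(satisfies_wide_pyramidal([100, 120, 80, 10], 150))   # False: no layer wider than N
```

Note that the wide layer need not be the first one; the condition only constrains the architecture from the wide layer to the output.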
Under these conditions, the authors prove that almost every critical point of the empirical risk is a global minimum. "Almost every" is meant in the measure-theoretic sense: the set of non-optimal critical points has Lebesgue measure zero in the parameter space. The proof proceeds as follows. First, a standard back-propagation identity (Lemma 2.1) expresses the error matrix Δₖ at each layer in terms of Δₖ₊₁, the weight matrix Wₖ₊₁, and the derivative σ′ of the activation. At a critical point the gradients with respect to the first-layer weights vanish, which yields XᵀΔ₁ = 0. If the augmented data matrix
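The back-propagation identity referenced above can be written out explicitly. The exact notational conventions below (shapes of Δₖ and Wₖ, the symbol Φ for the empirical risk) are my assumption, since the summary does not fix them; this is a sketch, not the paper's verbatim statement:

```latex
% Assumed conventions: Z_k = F_{k-1} W_k are the pre-activations of layer k
% (an N x n_k matrix over the N training points), F_k = \sigma(Z_k) the layer
% outputs, F_0 = X the (augmented) data matrix, and \Delta_L the output error.
\[
  \Delta_k \;=\; \sigma'(Z_k) \odot \bigl(\Delta_{k+1} W_{k+1}^{\top}\bigr),
  \qquad k = L-1, \dots, 1,
\]
\[
  \nabla_{W_1} \Phi \;=\; X^{\top} \Delta_1,
\]
% so at a critical point \nabla_{W_1}\Phi = 0 gives X^{\top}\Delta_1 = 0,
% which is the condition used in the argument above.
```

Here ⊙ denotes the entrywise (Hadamard) product, and σ′ is applied entrywise to the pre-activations.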