The empirical successes of deep learning still lack a rigorous explanation, and the effectiveness of gradient descent in particular remains largely a mystery. We focus on training neural networks with a single hidden layer of unbiased activation units, i.e., units with no additive bias terms, and give a surprisingly general polynomial-time analysis of gradient descent, exploiting new connections with tools from spherical harmonics.
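Concretely (as a schematic illustration, not the precise hypotheses of the result), an unbiased one-hidden-layer network computes a function of the form $f(x) = \sum_{i=1}^{k} a_i \, \sigma(\langle w_i, x \rangle)$, with no bias term inside the activation $\sigma$; when the inputs are normalized to the unit sphere, expanding each ridge function $\sigma(\langle w_i, \cdot \rangle)$ in spherical harmonics is the natural way such tools enter the analysis.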
On the other hand, when the target function is generated by a network with arbitrary biases, the problem becomes intractable. We construct a family of simple neural networks, each with a single hidden layer and a smooth activation function, paired with benign input distributions, that is hard to learn in the sense that any statistical query algorithm (including all known variants of stochastic gradient descent, with any loss function and any model architecture) requires an exponential number of queries. We will also discuss a few open problems.
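For concreteness (again only schematically), the hard targets have the same one-hidden-layer form but with arbitrary biases, $f(x) = \sum_{i=1}^{k} a_i \, \sigma(\langle w_i, x \rangle + b_i)$. Recall that a statistical query algorithm accesses the data only through approximate expectations $\mathbb{E}_{x}[q(x, f(x))]$ of query functions $q$, each answered to within an additive tolerance; this is, roughly, why the lower bound also covers gradient-based methods, which estimate such expectations from batches.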
Joint work with John Wilmes (both parts), and with Le Song and Bo Xie (the lower bound).