It has been experimentally observed in recent years
that multi-layer artificial neural networks have a surprising ability to
generalize, even when trained with far more parameters than
observations. Is there a theoretical basis for this? The best available
bounds on their metric entropy and associated complexity measures are
essentially linear in the number of parameters, which is inadequate to
explain this phenomenon. Here we examine the statistical risk (mean
squared predictive error) of multi-layer networks with $\ell^1$-type
controls on their parameters and with ramp activation functions (also
called lower-rectified linear units, or ReLU). In this setting, the risk is shown
to be upper bounded by $[(L^3 \log d)/n]^{1/2}$, where $d$ is the input
dimension to each layer, $L$ is the number of layers, and $n$ is the
sample size. In this way, the input dimension can be much larger than
the sample size, and the estimator can still be accurate, provided the
target function has such $\ell^1$ controls and the sample size is at
least moderately large compared to $L^3 \log d$. The heart of the
analysis is the development of a sampling strategy that demonstrates the
accuracy of a sparse covering of deep ramp networks. Lower bounds show
that the identified risk bound is close to optimal. This is joint work
with Andrew R. Barron.
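As a rough numerical illustration of this rate (the values of $L$, $d$, and $n$ below are hypothetical, chosen only for concreteness): for a network with $L = 10$ layers, input dimension $d = 10^6$, and sample size $n = 10^6$, the bound evaluates to
$$\Big[\frac{L^3 \log d}{n}\Big]^{1/2} = \Big[\frac{10^3 \cdot \log(10^6)}{10^6}\Big]^{1/2} \approx 0.12,$$
using the natural logarithm, $\log(10^6) \approx 13.8$. The risk can thus be small even when the number of network parameters far exceeds $n$.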
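For intuition about the sampling idea, one may recall the classical single-hidden-layer analogue (a Maurey-type sampling bound, stated here only as background rather than as the deep-network argument of the talk): if $f(x) = \sum_k c_k \, \phi(a_k \cdot x)$ with $V = \sum_k |c_k|$ and the units satisfy $|\phi(a_k \cdot x)| \le b$ on the domain of interest, then drawing indices $k_1, \dots, k_m$ i.i.d. with probabilities $|c_k|/V$ and forming $f_m(x) = \frac{V}{m} \sum_{i=1}^m \mathrm{sign}(c_{k_i}) \, \phi(a_{k_i} \cdot x)$ gives an $m$-term network with
$$\mathbb{E}\, \| f - f_m \|_{L^2(P)}^2 \;\le\; \frac{V^2 b^2}{m}.$$
The analysis described above extends such sampling-based sparse coverings to networks of depth $L$, yielding the $L^3 \log d$ dependence in the risk bound.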