Description
The information bottleneck theory of neural networks has received considerable attention in both machine learning and information theory. At the heart of this theory is the assumption that a good classifier creates representations that are minimal sufficient statistics, i.e., representations that share only as much mutual information with the input features as is necessary to correctly identify the class label. Indeed, it has been claimed that information-theoretic compression is a possible cause of generalization performance and a consequence of learning the weights using stochastic gradient descent. On the one hand, the claims set forth by this theory have been heavily disputed based on conflicting empirical evidence: there exist classes of invertible neural networks with state-of-the-art generalization performance; the compression phase also appears in full-batch learning; and information-theoretic compression is an artifact of using a saturating activation function. On the other hand, several authors report that training neural networks using a cost function derived from the information bottleneck principle leads to representations that have desirable properties and yields improved operational capabilities, such as generalization performance and adversarial robustness.
In this work, we provide yet another perspective on the information bottleneck theory of neural networks. With a focus on training deterministic (i.e., non-Bayesian) neural networks, we show that the information bottleneck framework suffers from two important shortcomings. First, for continuously distributed input features, the information-theoretic compression term is infinite for almost every choice of network weights, making this term problematic during optimization. Second, and more importantly, the information bottleneck functional is invariant under bijective transforms of the representation. Optimizing a neural network w.r.t. this functional thus yields representations that are informative about the class label, but that may still fail to satisfy desirable properties, such as admitting simple decision functions or being robust against small perturbations of the input features. We show that there exist remedies for these shortcomings: including a decision rule or softmax layer, making the network stochastic by adding noise, or replacing the terms in the information bottleneck functional with better-behaved quantities. We conclude by showing that the successes reported about training neural networks using the information bottleneck framework can be attributed to exactly these remedies.
(This is joint work with Rana Ali Amjad from the Technical University of Munich.)
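The invariance under bijective transforms mentioned in the abstract can be checked numerically on a small discrete toy problem. The sketch below is an illustration only, not the construction used in the talk. It assumes one common form of the functional, I(X;T) - beta * I(Y;T), a Markov chain Y - X - T, and hypothetical probability tables; the function names `mutual_information` and `ib_functional` are likewise made up for this example. In the discrete setting a bijection of the representation T is just a relabeling of its values.

```python
import math


def mutual_information(joint):
    """Mutual information in nats of a discrete joint pmf given as a list of rows."""
    row = [sum(r) for r in joint]
    col = [sum(c) for c in zip(*joint)]
    return sum(
        p * math.log(p / (row[i] * col[j]))
        for i, r in enumerate(joint)
        for j, p in enumerate(r)
        if p > 0
    )


def ib_functional(p_x, p_y_given_x, p_t_given_x, beta):
    """Assumed form I(X;T) - beta * I(Y;T), with T depending on Y only through X."""
    n_x, n_y, n_t = len(p_x), len(p_y_given_x[0]), len(p_t_given_x[0])
    # Joint pmf of input X and representation T.
    p_xt = [[p_x[x] * p_t_given_x[x][t] for t in range(n_t)] for x in range(n_x)]
    # Joint pmf of label Y and representation T, marginalizing over X.
    p_yt = [
        [sum(p_y_given_x[x][y] * p_xt[x][t] for x in range(n_x)) for t in range(n_t)]
        for y in range(n_y)
    ]
    return mutual_information(p_xt) - beta * mutual_information(p_yt)


# Hypothetical toy distributions: 4 input values, 2 labels, 3 representation values.
p_x = [0.4, 0.3, 0.2, 0.1]
p_y_given_x = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
p_t_given_x = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7], [0.1, 0.1, 0.8]]

# A bijection on the representation alphabet: relabel the values of T.
perm = [2, 0, 1]
p_t_relabeled = [[row[k] for k in perm] for row in p_t_given_x]

before = ib_functional(p_x, p_y_given_x, p_t_given_x, beta=2.0)
after = ib_functional(p_x, p_y_given_x, p_t_relabeled, beta=2.0)
print(before, after)
```

The two printed values agree up to floating-point rounding, so the functional cannot distinguish the original representation from its relabeled version, even though a downstream decision function might find one of them much easier to work with.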
Period: 30 May 2019
Event title: Fifth London Symposium on Information Theory
Location: London, United Kingdom