Authors: Nathalie Japkowicz and Stephen J. Hanson
Abstract:
Purpose:
Possible paradigms for concept learning include discrimination and recognition.
In feedforward neural networks, discrimination can be implemented by teaching
a discrimination network to return a "1" or a "0" at the output layer according
to the class of the training instance. Recognition, on the other hand, can be
implemented using an autoassociator for learning how to reconstruct training
instances of the conceptual class at the output layer. Novel data are then
classified according to the reconstruction error they yield when processed by
that autoassociator: a small reconstruction error indicates that the
example is an instance of the conceptual class while a large one indicates that
it is a counter-example. An interesting aspect of discrimination- versus
recognition-based classification within the feedforward scheme is that the
recognition-based system learns the X-OR problem much more efficiently (about
thirty-five times faster) than the discrimination-based system.
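To make the two paradigms concrete, the following is a minimal NumPy sketch, not the authors' implementation: a one-hidden-layer discrimination network maps each X-OR pattern to its class label, while an autoassociator of the same size is trained to reconstruct only the positive instances and classifies novel points by thresholding their reconstruction error. The architecture sizes, learning rate, epoch count, and error threshold are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)  # X-OR class labels

    def train(inputs, targets, n_hidden=2, lr=0.5, epochs=5000):
        """One-hidden-layer sigmoid net trained with vanilla backpropagation."""
        n_in, n_out = inputs.shape[1], targets.shape[1]
        W1 = rng.uniform(-1, 1, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
        W2 = rng.uniform(-1, 1, (n_hidden, n_out)); b2 = np.zeros(n_out)
        for _ in range(epochs):
            h = sigmoid(inputs @ W1 + b1)            # hidden activations
            o = sigmoid(h @ W2 + b2)                 # output activations
            d_o = (o - targets) * o * (1 - o)        # output delta (squared error)
            d_h = (d_o @ W2.T) * h * (1 - h)         # backpropagated hidden delta
            W2 -= lr * h.T @ d_o; b2 -= lr * d_o.sum(0)
            W1 -= lr * inputs.T @ d_h; b1 -= lr * d_h.sum(0)
        return W1, b1, W2, b2

    # Discrimination: one output unit returns the class label of each instance.
    disc_net = train(X, y)

    # Recognition: an autoassociator is trained to reconstruct only the
    # positive instances; novel data are accepted as instances of the concept
    # when their reconstruction error falls below a threshold.
    positives = X[y.ravel() == 1]
    auto_net = train(positives, positives)

    def reconstruction_error(x, net):
        W1, b1, W2, b2 = net
        out = sigmoid(sigmoid(x @ W1 + b1) @ W2 + b2)
        return np.sum((out - x) ** 2, axis=1)

    threshold = 0.1                                  # illustrative assumption
    print(reconstruction_error(X, auto_net) < threshold)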
While this difference in efficiency may be caused by an increase or decrease in
problem complexity (due to the change of representation in effect when going
from one paradigm to the other), it may also be caused by particular phenomena
associated with the backpropagation procedure. More specifically, it is
conceivable that backpropagation adopts different and unrelated modes of
generalization, each with its own degree of efficiency, when faced with
distinct situations. The purpose of this paper is to determine whether
backpropagation indeed resorts to different generalization strategies when
applied to the X-OR problem under the discrimination and recognition
paradigms, or whether the observed difference in efficiency is caused solely
by a change of representation.
Method:
The particular method used to study this problem is the Hidden Unit
Activation Plane (HUAP) method of Munro (1991). This method consists of
plotting the phase space of the hidden units of the feedforward network.
Although this strategy is restricted to feedforward neural networks with two
hidden units, the X-OR problem can be solved by a two-hidden-unit
discrimination network as well as by a two-hidden-unit autoassociator. In
more detail, the methodology consists of plotting the activation of the first
hidden unit as a function of the activation of the second as training
progresses. These plots yield trajectories leading to fixed points, also
known as attractors. The speed at which an attractor is reached, and whether
this attractor is stable or unstable, reveal the strategy used by the
backpropagation procedure.
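As a hedged illustration of this procedure (Munro's original setup may differ in detail), the sketch below trains a two-hidden-unit network on X-OR, snapshots the two hidden-unit activations after every epoch, and plots one against the other to trace the trajectories toward the attractors. The learning rate and epoch count are arbitrary choices.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    W1 = rng.uniform(-1, 1, (2, 2)); b1 = np.zeros(2)   # two hidden units
    W2 = rng.uniform(-1, 1, (2, 1)); b2 = np.zeros(1)
    lr, epochs = 0.5, 4000
    trace = np.empty((epochs, 4, 2))    # epoch x pattern x hidden unit

    for t in range(epochs):
        h = sigmoid(X @ W1 + b1)
        o = sigmoid(h @ W2 + b2)
        trace[t] = h                    # snapshot of the hidden activation plane
        d_o = (o - y) * o * (1 - o)
        d_h = (d_o @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ d_o; b2 -= lr * d_o.sum(0)
        W1 -= lr * X.T @ d_h; b1 -= lr * d_h.sum(0)

    # One trajectory per training pattern; fixed points appear where
    # trajectories stop moving.
    for p, label in enumerate(["00", "01", "10", "11"]):
        plt.plot(trace[:, p, 0], trace[:, p, 1], label=f"pattern {label}")
    plt.xlabel("hidden unit 1 activation")
    plt.ylabel("hidden unit 2 activation")
    plt.legend(); plt.title("Hidden Unit Activation Plane trajectories")
    plt.show()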
Results:
The results obtained indicate that the difference in efficiency between the
discrimination network and the autoassociator is indeed caused by a
difference in the generalization strategy adopted by the backpropagation
procedure.
In particular, the HUAP method reveals that the two hidden units of the
discrimination network do not differentiate and tend towards an unstable
attractor for the first 3500 epochs of their training phase. Past Epoch 3500,
they do differentiate and each hidden unit attempts to reach a stable
attractor. In contrast, the HUAP method reveals that the two hidden units of
the autoassociator differentiate immediately and tend towards two distinct
stable attractors. These stable attractors are reached by Epoch 100. In
addition to exposing two different spontaneous modes of action of the
backpropagation procedure, this result explains Munro's (1991) observation
that the speed of convergence of a discrimination network can be increased by
adding extra constraints to the task in the form of extra output units.
Indeed, like Munro's (1991) "constrained" discrimination network, the
autoassociator has more output units than the "regular" discrimination
network. This situation
seems to create strong stable attractors that overwhelm the attractive power
of any unstable attractor and cause the network to converge quickly. In
the less constrained "regular" discrimination network, the stable attractors
are weaker and the backpropagation procedure is sidetracked towards a strong
unstable attractor, thus wasting considerable computation time.
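One plausible reading of this constraint effect, sketched below under stated assumptions, augments the discrimination target with a copy of the inputs (class label plus reconstruction) and compares the number of epochs needed to reach an error criterion. The augmentation scheme, criterion, and learning rate are illustrative choices, not Munro's exact setup.

    import numpy as np

    rng = np.random.default_rng(2)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    def epochs_to_criterion(targets, lr=0.5, tol=0.05, max_epochs=20000):
        """Train on X-OR until mean squared error drops below tol."""
        W1 = rng.uniform(-1, 1, (2, 2)); b1 = np.zeros(2)
        W2 = rng.uniform(-1, 1, (2, targets.shape[1]))
        b2 = np.zeros(targets.shape[1])
        for t in range(max_epochs):
            h = sigmoid(X @ W1 + b1)
            o = sigmoid(h @ W2 + b2)
            if np.mean((o - targets) ** 2) < tol:
                return t
            d_o = (o - targets) * o * (1 - o)
            d_h = (d_o @ W2.T) * h * (1 - h)
            W2 -= lr * h.T @ d_o; b2 -= lr * d_o.sum(0)
            W1 -= lr * X.T @ d_h; b1 -= lr * d_h.sum(0)
        return max_epochs

    # Regular network: class label only. Constrained network: class label
    # plus a copy of the inputs as extra output targets.
    print("regular:    ", epochs_to_criterion(y))
    print("constrained:", epochs_to_criterion(np.hstack([y, X])))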
New aspect of this work:
The result just described is important because it sheds light on the inner
workings of the backpropagation procedure by clearly revealing two of its
possible modes of action. Practically, this result is also important
because it suggests a deterministic way to optimize the efficiency of a
backpropagation network.
Conclusions:
This work has sought to explain the observation that a two-hidden
unit autoassociator learns the X-OR problem much more efficiently than its
discrimination counterpart. In particular, it has studied the Hidden Unit
Activation Planes of the two systems over time. These experiments have
revealed that although the two systems use the same optimization procedure,
backpropagation, their generalization strategies differ. This
illustrates the flexibility of the backpropagation procedure, which
spontaneously switches generalization strategies according to the degree of
constraint present in the network. This research has important theoretical as
well as practical implications.