Authors: Nathalie Japkowicz and Stephen J. Hanson
Abstract:
Purpose:
Possible paradigms for concept learning include discrimination and recognition.
In feedforward neural networks, discrimination can be implemented by teaching
a discrimination network to return a "1" or a "0" at the output layer according
to the class of the training instance. Recognition, on the other hand, can be
implemented using an autoassociator for learning how to reconstruct training
instances of the conceptual class at the output layer. Novel data are then
classified according to the reconstruction error they yield when processed by
that autoassociator: a small reconstruction error indicates that the
example is an instance of the conceptual class while a large one indicates that
it is a counter-example. An interesting aspect of discrimination- versus
recognition-based classification within the feedforward scheme is that the
recognition-based system learns the X-OR problem much more efficiently (about
thirty-five times faster) than the discrimination-based system.
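To make the two paradigms concrete, the following is a minimal NumPy sketch, not the authors' implementation: a one-hidden-layer discrimination network maps each X-OR pattern to its class label, while an autoassociator of the same size is trained to reconstruct only the positive instances and classifies novel points by thresholding their reconstruction error. The architecture sizes, learning rate, epoch count, and error threshold are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)  # X-OR class labels

    def train(inputs, targets, n_hidden=2, lr=0.5, epochs=5000):
        """One-hidden-layer sigmoid net trained with vanilla backpropagation."""
        n_in, n_out = inputs.shape[1], targets.shape[1]
        W1 = rng.uniform(-1, 1, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
        W2 = rng.uniform(-1, 1, (n_hidden, n_out)); b2 = np.zeros(n_out)
        for _ in range(epochs):
            h = sigmoid(inputs @ W1 + b1)            # hidden activations
            o = sigmoid(h @ W2 + b2)                 # output activations
            d_o = (o - targets) * o * (1 - o)        # output delta (squared error)
            d_h = (d_o @ W2.T) * h * (1 - h)         # backpropagated hidden delta
            W2 -= lr * h.T @ d_o; b2 -= lr * d_o.sum(0)
            W1 -= lr * inputs.T @ d_h; b1 -= lr * d_h.sum(0)
        return W1, b1, W2, b2

    # Discrimination: one output unit returns the class label of each instance.
    disc_net = train(X, y)

    # Recognition: an autoassociator is trained to reconstruct only the
    # positive instances; novel data are accepted as instances of the concept
    # when their reconstruction error falls below a threshold.
    positives = X[y.ravel() == 1]
    auto_net = train(positives, positives)

    def reconstruction_error(x, net):
        W1, b1, W2, b2 = net
        out = sigmoid(sigmoid(x @ W1 + b1) @ W2 + b2)
        return np.sum((out - x) ** 2, axis=1)

    threshold = 0.1                                  # illustrative assumption
    print(reconstruction_error(X, auto_net) < threshold)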
While this difference in efficiency may be caused by an increase or decrease in
problem complexity (due to the change of representation in effect when going
from one paradigm to the other), it may also be caused by particular phenomena
associated with the backpropagation procedure. More specifically, it is
conceivable that backpropagation adopts different and unrelated modes of
generalization, each with its own degree of efficiency, when faced with
distinct situations. The purpose of this paper is to determine whether
backpropagation indeed resorts to different generalization strategies when
applied to the X-OR problem under the discrimination and recognition
paradigms, or whether the observed difference in efficiency is caused solely
by a change of representation.
Method:
The particular method used to study this problem is the Hidden Unit
Activation Plane (HUAP) method of Munro (1991). This method consists of
plotting the phase space of the hidden units of the feedforward network.
Although this strategy is restricted to feedforward neural networks with two
hidden units, the X-OR problem can be solved by a two-hidden-unit
discrimination network as well as by a two-hidden-unit autoassociator. In
more detail, the methodology consists of plotting the activation of the first
hidden unit as a function of the activation of the second as training
progresses. These plots yield trajectories leading to fixed points, also
known as attractors. The speed at which an attractor is reached, and whether
this attractor is stable or unstable, reveal the strategy used by the
backpropagation procedure.
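As a hedged illustration of this procedure (Munro's original setup may differ in detail), the sketch below trains a two-hidden-unit network on X-OR, snapshots the two hidden-unit activations after every epoch, and plots one against the other to trace the trajectories toward the attractors. The learning rate and epoch count are arbitrary choices.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    W1 = rng.uniform(-1, 1, (2, 2)); b1 = np.zeros(2)   # two hidden units
    W2 = rng.uniform(-1, 1, (2, 1)); b2 = np.zeros(1)
    lr, epochs = 0.5, 4000
    trace = np.empty((epochs, 4, 2))    # epoch x pattern x hidden unit

    for t in range(epochs):
        h = sigmoid(X @ W1 + b1)
        o = sigmoid(h @ W2 + b2)
        trace[t] = h                    # snapshot of the hidden activation plane
        d_o = (o - y) * o * (1 - o)
        d_h = (d_o @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ d_o; b2 -= lr * d_o.sum(0)
        W1 -= lr * X.T @ d_h; b1 -= lr * d_h.sum(0)

    # One trajectory per training pattern; fixed points appear where
    # trajectories stop moving.
    for p, label in enumerate(["00", "01", "10", "11"]):
        plt.plot(trace[:, p, 0], trace[:, p, 1], label=f"pattern {label}")
    plt.xlabel("hidden unit 1 activation")
    plt.ylabel("hidden unit 2 activation")
    plt.legend(); plt.title("Hidden Unit Activation Plane trajectories")
    plt.show()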
Results:
The results obtained indicate that the difference in efficiency between the
discrimination network and the autoassociator is indeed caused by a
difference in the generalization strategy adopted by the backpropagation
procedure.
In particular, the HUAP method reveals that the two hidden units of the
discrimination network do not differentiate and tend towards an unstable
attractor for the first 3500 epochs of their training phase. Past Epoch 3500,
they do differentiate and each hidden unit attempts to reach a stable
attractor. In contrast, the HUAP method reveals that the two hidden units of
the autoassociator differentiate immediately and tend towards two distinct
stable attractors. These stable attractors are reached by Epoch 100. In
addition to exposing two different spontaneous modes of action of the
backpropagation procedure, this result explains Munro's (1991) observation
that the speed of convergence of a discrimination network can be increased by
adding extra constraints to the task in the form of extra output units.
Indeed, like Munro's (1991) "constrained" discrimination network, the
autoassociator has more output units than the "regular" discrimination
network. This situation
seems to create strong stable attractors that overwhelm the attractive power
of any unstable attractor and cause the network to converge quickly. In
the less constrained "regular" discrimination network, the stable attractors
are weaker and the backpropagation procedure is sidetracked towards a strong
unstable attractor, thus wasting considerable computation time.
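One plausible reading of this constraint effect, sketched below under stated assumptions, augments the discrimination target with a copy of the inputs (class label plus reconstruction) and compares the number of epochs needed to reach an error criterion. The augmentation scheme, criterion, and learning rate are illustrative choices, not Munro's exact setup.

    import numpy as np

    rng = np.random.default_rng(2)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    def epochs_to_criterion(targets, lr=0.5, tol=0.05, max_epochs=20000):
        """Train on X-OR until mean squared error drops below tol."""
        W1 = rng.uniform(-1, 1, (2, 2)); b1 = np.zeros(2)
        W2 = rng.uniform(-1, 1, (2, targets.shape[1]))
        b2 = np.zeros(targets.shape[1])
        for t in range(max_epochs):
            h = sigmoid(X @ W1 + b1)
            o = sigmoid(h @ W2 + b2)
            if np.mean((o - targets) ** 2) < tol:
                return t
            d_o = (o - targets) * o * (1 - o)
            d_h = (d_o @ W2.T) * h * (1 - h)
            W2 -= lr * h.T @ d_o; b2 -= lr * d_o.sum(0)
            W1 -= lr * X.T @ d_h; b1 -= lr * d_h.sum(0)
        return max_epochs

    # Regular network: class label only. Constrained network: class label
    # plus a copy of the inputs as extra output targets.
    print("regular:    ", epochs_to_criterion(y))
    print("constrained:", epochs_to_criterion(np.hstack([y, X])))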
New aspect of this work:
The result just described is important because it sheds light on the inner
workings of the backpropagation procedure by clearly revealing two of its
possible modes of action. Practically, this result is also important
because it suggests a deterministic way to optimize the efficiency of a
backpropagation network.
Conclusions:
This work has sought to explain the observation that a two-hidden
unit autoassociator learns the X-OR problem much more efficiently than its
discrimination counterpart. In particular, it has studied the Hidden Unit
Activation Planes of the two systems over time. These experiments have
revealed that although the two systems use the same optimization procedure,
backpropagation, their generalization strategies differ. This
illustrates the flexibility of the backpropagation procedure, which
spontaneously switches generalization strategies according to the degree of
constraint present in the network. This research has important theoretical as
well as practical implications.