"brain metaphor"
evolution
Reminders: (levels (types) of learning)
Other names for artificial neural networks: (not synonyms, but related)
Computers:
New neural network architectures are characterized by:
How it works in theory: Associative memory
How it works physically: Neurons
"Let us assume as the basis of all our subsequent reasoning this law: When two elementary brain-processes have been active together or in immediate succession, one of them, on reoccurring, tends to propagate its excitement into the other."
"A Logical Calculus of the Ideas Immanent in Nervous Activity," Bulletin of Mathematical Biophysics (1943), 5:115-133
(Oversimplified by today's standards, but used by John von Neumann in teaching the theory of computing machines)
They also modeled a simple neural network with electrical circuits.
$390 million.
Original funding cut to $33 million over 17 months as a seed program.
Physical - Basics:
A Simple Neuron: fig.4.2
Nerve Structures and Synapses: fig. 4.3
Behaviorally:
The ability to get from one internal representation to another, or to
infer a complex representation from a portion of it, forms the basis of
associative memory.
Associative memory hinges on the concept of the distributed representation of
information.
Human memory operates in an associative manner; a portion of a recollection can
produce a larger related memory.
Consider image retrieval: in a neural net employing the associative memory
concept, the image is its own address and, in fact, presentation of an
approximation of this address will result in the recovery of the actual image
as output.
"Because a neural net is an ensemble of a great number of collectively
interacting elements, it seems more natural to abandon altogether those
physical models of memory in which particular concepts correspond to particular
spatial locations (nodes) in the hardware. Instead, we assume that
representations of concepts and other pieces of information are stored as
collective states of a neural network....they are realized only through
collective effects and are reflected in recall processes...This kind of
collective representation might be called holographic or hologic
[Khanna, 1990]."
These features are achieved as seen in the following example.
(material from ["Artificial Intelligence" 2nd ed., Rich & Knight, McGraw Hill, 1991])
In figure 18.1 consider unit in lower left ...
This network has only four distinct stable states:
(fig. 18.2)
Given any initial state, the network will necessarily settle into one of these
four configurations.
The network is "storing" these patterns.
Hopfield's major contribution was to show that, given any set of weights and any
initial state, his parallel relaxation algorithm would eventually steer the
network into a stable state.
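A minimal sketch of parallel relaxation, assuming binary (0/1) units, symmetric weights with no self-connections, and one randomly ordered sweep of the units at a time. The weight matrix `W` and the names `relax` and `is_stable` are made up for illustration; they are not the network of fig. 18.1.

```python
import random

def relax(weights, state, rng=None, max_sweeps=100):
    """Update units one at a time until no unit wants to change.

    weights[i][j] is the (symmetric) connection strength between units
    i and j; state is a list of 0/1 activations.
    """
    rng = rng or random.Random(0)
    state = list(state)
    n = len(state)
    for _ in range(max_sweeps):
        changed = False
        order = list(range(n))
        rng.shuffle(order)  # visit the units in random order
        for i in order:
            # Sum the weights on connections to currently active neighbors.
            total = sum(weights[i][j] * state[j] for j in range(n) if j != i)
            new_value = 1 if total > 0 else 0
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:  # a full sweep with no change: stable state
            return state
    return state

def is_stable(weights, state):
    """True if no single unit would change its activation."""
    n = len(state)
    for i in range(n):
        total = sum(weights[i][j] * state[j] for j in range(n) if j != i)
        if (1 if total > 0 else 0) != state[i]:
            return False
    return True

# Hypothetical 4-unit network: units 0/1 support each other, as do 2/3,
# while the two pairs inhibit one another.
W = [
    [0, 2, -1, -1],
    [2, 0, -1, -1],
    [-1, -1, 0, 3],
    [-1, -1, 3, 0],
]
print(relax(W, [1, 0, 0, 1]))
```

Whatever the starting state, the loop ends in a configuration where every unit agrees with its inputs, which is the sense in which the network "stores" its stable states.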
Consider:
E.g., "gray, large, fish, eats plankton"
In large systems, units or connections can disappear completely without
adversely affecting the overall behavior.
What is the relationship between the weights on the network's connections and
the state it settles into?
[material from "A Practical Guide to Neural Nets", Nelson & Illingworth, Addison Wesley, 1991]
Linear functions are limited.
Inputs are connected to many nodes with different weights, resulting in a
series of outputs, one per node. (fig. 4.12)
(The connections correspond to the axons and synapses in a biological system,
and they provide a signal transmission pathway between the nodes.)
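A small sketch of one layer of nodes, assuming each node applies a hard-limiter (step) transfer function to its weighted sum. The weights and the names `step` and `layer_outputs` are illustrative only.

```python
def step(value, threshold=0.0):
    """Hard limiter: binary output."""
    return 1 if value > threshold else 0

def layer_outputs(inputs, weight_rows, threshold=0.0):
    """Return one output per node; weight_rows[k] holds node k's weights."""
    outputs = []
    for row in weight_rows:
        total = sum(w * x for w, x in zip(row, inputs))  # summation function
        outputs.append(step(total, threshold))           # transfer function
    return outputs

# The same three inputs feed three nodes, each with its own weights.
print(layer_outputs([1, 0, 1], [[1, 1, 1], [-1, 2, -1], [0.5, 0, 0]]))  # [1, 0, 1]
```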
The network outputs are generated from the output layer.
Any other layers are called hidden layers because they are
internal to the network and have no direct contact with the external
environment. (fig. 4.13)
*Note that there are more connections than nodes.
The example is a mapping, or feedforward network. Such mappings, or
associations of objects in the input set with objects in the output set, are
also called transformations.
Note: we could input any letter's pixel pattern and have the network output the
correct ASCII code.
The internal layers form an intermediate representation of the input
data.
A perceptron models a neuron by taking the weighted
In the perceptron, unlike the Hopfield network, connections are unidirectional.
Fig. 18.5
Whatever a perceptron can compute,
it can learn to compute.
The problem is to determine linearly separable, fig. 18.9 -
How can the perceptron learn to identify x2 with white dots and x1 with black ones?
Weighted sum: g(x) =
Output function: T(x) = 1 if g(x) > 0
$g(x) = w_0 + w_1 x_1 + w_2 x_2$
If the input lies on one side of the line, the perceptron outputs 1. A line that correctly separates the training instances corresponds to a
perfectly functioning perceptron.
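The two-input case can be sketched as follows. The weights (-1, 1, 1) are an arbitrary choice for illustration, and treating g(x) = 0 as "do not fire" is an assumption of this sketch (the notes leave that case undecided).

```python
def g(weights, x1, x2):
    """Weighted sum w0 + w1*x1 + w2*x2."""
    w0, w1, w2 = weights
    return w0 + w1 * x1 + w2 * x2

def output(weights, x1, x2):
    """1 on one side of the line, 0 on the other (and on the line itself)."""
    return 1 if g(weights, x1, x2) > 0 else 0

def boundary_x2(weights, x1):
    """Solve g(x) = 0 for x2: x2 = -(w1/w2) * x1 - w0/w2."""
    w0, w1, w2 = weights
    return -(w1 / w2) * x1 - w0 / w2

weights = (-1.0, 1.0, 1.0)           # decision line: x2 = -x1 + 1
print(output(weights, 1.0, 1.0))      # above the line -> 1
print(output(weights, 0.0, 0.0))      # below the line -> 0
print(boundary_x2(weights, 0.0))      # line crosses the x2 axis at 1.0
```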
Figure 18.11 shows a perceptron learning to classify the instances in 18.9. K
is the number of passes.
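A hedged sketch of perceptron learning with K counted as passes over the training set. It assumes a batch update (on each pass, sum the corrections for all misclassified instances, then adjust the weights once), a learning rate of 1, and the AND function as a stand-in for the instances of fig. 18.9, which are not reproduced in these notes.

```python
def predict(weights, inputs):
    """Fire (1) if w0 + sum(wi * xi) > 0, else 0; x0 = 1 carries w0."""
    extended = (1,) + tuple(inputs)
    return 1 if sum(w * x for w, x in zip(weights, extended)) > 0 else 0

def train_perceptron(examples, eta=1, max_passes=100):
    """Return (weights, K), where K is the pass on which no mistakes remain."""
    n = len(examples[0][0])
    weights = [0] * (n + 1)
    for k in range(1, max_passes + 1):
        correction = [0] * (n + 1)
        mistakes = 0
        for inputs, target in examples:
            error = target - predict(weights, inputs)  # +1, 0, or -1
            if error != 0:
                mistakes += 1
                for i, x in enumerate((1,) + tuple(inputs)):
                    correction[i] += error * x
        if mistakes == 0:
            return weights, k
        weights = [w + eta * c for w, c in zip(weights, correction)]
    return weights, max_passes

# Linearly separable training set (logical AND), chosen for illustration.
examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, passes = train_perceptron(examples)
print(weights, passes)
```

Because the instances are linearly separable, the loop stops with a weight vector whose decision line separates the two classes.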
Note: (consider)
(1) parallel relaxation is a problem-solving strategy
(2) gradient descent is a learning strategy
BUT: The perceptron learning algorithm can correctly adjust weights
between inputs and outputs, but it cannot adjust weights between
perceptrons.
Other problems:
(Some of these true in other network paradigms as well.)
The major problem is learning.
Note: The knowledge representation employed by neural nets is quite opaque;
the nets must learn their own representations because programming them by hand
is impossible.
First, consider a multilayered, fully connected, feedforward network.
In contrast to Hopfield nets, backpropagation networks perform a simpler
computation.
Because activations flow in only one direction,
there is no need for an iterative relaxation process.
The existence of hidden layers allows the network to develop complex feature
detectors, or internal representations.
Fig. 18.15, identification of the digit "7"
hidden units could indicate: Figure 18.16
Its output: $\text{output} = 1/(1 + e^{-\text{sum}})$
The network adjusts its weights each time it sees an input-output pair.
Each pair requires two stages:
Unlike perceptrons, the backpropagation algorithm usually updates its weights
incrementally, after seeing each input-output pair.
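The two stages for a single input-output pair might look like the sketch below. It assumes a 2-input, 2-hidden-unit, 1-output network of sigmoid units, squared error E = (T - A)^2 / 2, and a learning rate `eta`; the function names and starting weights are illustrative, not taken from the cited texts.

```python
import math

def sigmoid(total):
    return 1.0 / (1.0 + math.exp(-total))

def forward(w_hidden, w_out, x):
    """Stage 1: propagate activations from the inputs to the output."""
    xb = [1.0] + list(x)  # x0 = 1 carries each unit's bias weight
    hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hidden]
    hb = [1.0] + hidden
    out = sigmoid(sum(w * v for w, v in zip(w_out, hb)))
    return xb, hb, out

def backward(w_hidden, w_out, x, target, eta=0.5):
    """Stage 2: propagate the error backward; return the adjusted weights."""
    xb, hb, out = forward(w_hidden, w_out, x)
    # Error term at the output unit (sigmoid derivative is out * (1 - out)).
    delta_out = (target - out) * out * (1.0 - out)
    # Each hidden unit's share of the error, passed back through w_out.
    delta_hidden = [
        hb[j + 1] * (1.0 - hb[j + 1]) * w_out[j + 1] * delta_out
        for j in range(len(w_hidden))
    ]
    new_out = [w + eta * delta_out * h for w, h in zip(w_out, hb)]
    new_hidden = [
        [w + eta * delta_hidden[j] * v for w, v in zip(row, xb)]
        for j, row in enumerate(w_hidden)
    ]
    return new_hidden, new_out

w_hidden = [[0.1, 0.4, -0.3], [-0.2, 0.25, 0.6]]  # arbitrary starting weights
w_out = [0.05, -0.5, 0.7]
w_hidden, w_out = backward(w_hidden, w_out, [1.0, 0.0], target=1.0)
```

Calling `backward` once per pair, over all pairs, makes up one epoch.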
Back Propagation Learning
The most important generalization of the perceptron training algorithm is
called the Delta Rule.
The back propagation of errors technique is the most commonly used
generalization of the Delta Rule.
The Delta Rule:(notice the term)
The algorithm can be restated and generalized by introducing the term $\delta$ (delta), which
is the difference between the desired or target output T and the actual output
A.
The delta rule modifies weights appropriately for target and actual outputs of
either polarity and for both continuous and binary outputs.
Most models consider learning to be an adjustment on the
strengths of connections.
Virtually all learning rules of this type can be considered a variant of the
Hebbian learning rule.
Simplest version:
Note: What do negative and positive activations do?
It is very important to note that the information needed to use the Hebb rule
to determine the value each connection should have is locally available
at the connection.
All a given connection needs to consider is the activation of the units on both
sides of it.
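A sketch of the Hebb rule in its product form, assuming activations of +1 and -1, a learning rate of 1, and no self-connections. Note that each weight change uses only the activations of the two units the connection joins. The pattern and the helper names are illustrative.

```python
def hebb_update(weights, activations, eta=1):
    """Return new weights after one Hebbian step: w_ij += eta * a_i * a_j."""
    n = len(activations)
    updated = [row[:] for row in weights]
    for i in range(n):
        for j in range(n):
            if i != j:  # no self-connections (an assumption of this sketch)
                updated[i][j] += eta * activations[i] * activations[j]
    return updated

def recall(weights, probe):
    """One synchronous pass: each unit takes the sign of its weighted input."""
    n = len(probe)
    result = []
    for i in range(n):
        total = sum(weights[i][j] * probe[j] for j in range(n))
        result.append(1 if total >= 0 else -1)
    return result

pattern = [1, -1, 1, -1]
weights = hebb_update([[0] * 4 for _ in range(4)], pattern)
print(recall(weights, [1, -1, 1, 1]))  # last unit flipped; prints [1, -1, 1, -1]
```

Units that were active together end up with a positive weight between them, units with opposite activations end up with a negative one, and a corrupted probe is pulled back to the stored pattern.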
Recall (in Hopfield):
As the net settles, it assigns truth or falsity while violating the fewest
constraints.
Problem:
Hopfield networks settle into local minima.
We need the globally optimal state (satisfy as many constraints as
possible).
If a Hopfield network reaches a stable state, then no single unit is willing to
change its state in order to move uphill.
Hinton and Sejnowski [1986] combined Hopfield nets and simulated annealing to
produce networks called Boltzmann Machines.
In the annealing process, the distribution of energy states is determined by a
mathematical relationship.
In Boltzmann machines, this relationship is used to update the units' individual
states.
The previous training methods we have seen have been deterministic.
(adjusting weights based on current information)
This method is a statistical training method. It makes pseudorandom
changes in the weight values, retaining those changes that result in
improvements.
Annealing is the process of gradually moving from very high temperatures down
to very low ones. The randomness added by the temperature helps the network
escape local minima.
(Here the temperature is "artificial" and is defined ahead.)
The probability that any given unit will be active is given by p:
$p = 1/(1 + e^{-\Delta E / T})$
where $\Delta E$ is the sum of the unit's active input lines and T is the "temperature"
of the network.
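A short sketch of the stochastic unit update, assuming the usual form of the rule with a negative exponent, p = 1/(1 + e^(-ΔE/T)). The function names are illustrative.

```python
import math
import random

def activation_probability(delta_e, temperature):
    """Probability that a unit becomes active, given its summed input."""
    return 1.0 / (1.0 + math.exp(-delta_e / temperature))

def update_unit(delta_e, temperature, rng):
    """Set the unit to 1 with probability p, otherwise to 0."""
    return 1 if rng.random() < activation_probability(delta_e, temperature) else 0

rng = random.Random(0)
for temperature in (100.0, 1.0, 0.1):
    print(temperature, activation_probability(2.0, temperature))
print(update_unit(2.0, 1.0, rng))
```

At high temperature the probability stays near 0.5 whatever the input, so units flip almost at random; as T falls, the rule approaches the deterministic Hopfield update.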
For more information about the learning algorithm see the Hinton and Sejnowski
paper.
If the annealing is carried out properly, Boltzmann machines can avoid local
minima and learn to compute any computable function of fixed-sized inputs and
outputs.
-Distributed vs. Localist Representations
advantages:
Distributed representations are:
Localist representations can:
pattern recognition, vision, speech
Hopfield, J. (1982). Neural networks and physical systems with emergent collective
computational abilities. Proceedings of the National Academy of Sciences USA, 79, pp
2554-2558.
Minsky, M. and Papert, S. (1969). Perceptrons. Cambridge MA: MIT Press.
Nelson, M. and Illingworth, W. (1991). A Practical Guide to Neural Nets. Reading MA:
Addison-Wesley.
Quillian, R. (1968). Semantic memory. In Semantic Information Processing, Minsky,
M. (Ed.), Cambridge MA: MIT Press.
Rich, E. and Knight, K. (1991). Artificial Intelligence, Second Edition. New York:
McGraw-Hill.
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage
and organization in the brain. Psychological Review, 65, 386-407.
Rosenblatt, F. (1962). Principles of Neurodynamics: Perceptrons and the Theory of
Brain Mechanisms. Washington, D.C.: Spartan Books.
Wasserman, P. (1989). Neural Computing: Theory and Practice. New York: Van
Nostrand Reinhold.
Nice tutorial from Mat Buckland (wish I had seen before I wrote these notes!)
AI depot link to Neural Net tutorials and Applications; scroll down for uses in Games (Creatures, Black & White, Colin McRae Rally 2.0)
Generation 5
Another Intro to NN
The Brain
(a nerve cell with all its processes)
Synapse Activity: fig. 4.4
(some signals excite (+); some inhibit (-))
The Brain:
-Associative Memory
Examples:
The brain metaphor (behavioral implementation):
Related concepts:
In traditional implementations, access by content involves expensive searching
and matching procedures.
Content-addressable memory is hard to achieve in von Neumann machines because
they access items in memory by using their addresses (an address is applied and
the data occupying that address is returned). Therefore it is hard to infer
the address of a particular item in memory if only a partial description of it
is given.
**Massively parallel, distributed networks suggest a
more efficient method for addressing by content.**
Hopfield Networks
A Hopfield network has the above interesting features:
(which provide us with desired properties of the brain)
Example of a simple Hopfield Net:
Example:
There can be no divergence or oscillation.
Note: parallel relaxation is really search...
(state B is depicted as being lower than state A because fewer constraints are
violated. A constraint is violated, for example, when two active units are
connected by a negatively weighted connection.)
Unit failure: a unit becomes active (or not) when it should not.
Units surrounding will "set it straight"
The brain metaphor (physical implementation):
Neuron functions:
summation function: all the products are summed
and compared to some threshold to determine output. Fig. 4.9c
Activation functions:
The result of the summation function could be input
to an activation function before being passed to the
transfer function. The purpose would be to allow
outputs to vary over time. (This is a research area; most networks use the
identity function, which is equivalent to no activation function.)
Transfer functions:
The threshold is generally nonlinear.
(Perceptrons - only a few problems can
be separated neatly into two categories with a
straight line. Table - e.g., XOR, Fig. 4.10 ... more later)
is positive or negative. The network could output -1 and 1, or
1 and 0, etc. The transfer function would be a "hard limiter" or "step". It
has only binary output. fig 4.11
Combining elements:
Processing elements can be combined to make layers of these nodes.
Combining layers:
The layer that receives the input is the input layer.
A network is fully connected if every output from one
layer is passed along to every node in the next layer.
Connectivity Options:
Connectivity is how the outputs are channeled to become inputs. The
output signal from a node may pass on as input to other processing elements.
Filters:
Layers in a neural net can act as filters. (fig. 4.15).
Example - the input signals are a pixel pattern for the letter A.
The network generates an output pattern of the ASCII code.
We do not have to have a separate network for each element in the set.
Hidden layers hold the key to more complex computations.
Automatic Learning
Learning
Perceptrons [Rosenblatt, 1962]
of its inputs and sending
the output 1 if the sum is greater than some adjustable threshold and 0
otherwise.
(linear, binary transfer function)
(feedforward)
If the presence of some feature $x_i$ tends to cause the perceptron to fire, the
weight $w_i$ will be positive; if the feature $x_i$ inhibits the perceptron, the
weight $w_i$ will be negative.
(Essentially a single-layer network of linear threshold units without
feedback)
fig. 18.6
A group of perceptrons can be trained on sample input-output pairs until it
learns to compute the correct function. (A perceptron with many inputs and outputs fig 18.7)
(This weight can be thought of as the propensity
of the perceptron to fire irrespective of its inputs.)
When perceptrons were first conceived, many thought that intelligent systems could be constructed out of perceptrons fig 18.8
E.g. Pattern classification
(like concept learning)
we can draw a line that separates one class from another
A training example:
$\sum_{i=0}^{n} w_i x_i$, for i = 0 to n (and $x_i$ is the activation
value)
0 if g(x) < 0
If g(x) = 0, the perceptron does not know if it should fire.
A slight change in inputs could cause the device to go either way. If we solve g(x) = 0, we get
The location of the line is completely determined by the weights $w_0$, $w_1$, and
$w_2$.
the equation of the line: $x_2 = -(w_1/w_2)\,x_1 - w_0/w_2$
If the input lies on the other side, the perceptron outputs 0.
Such a line is called the decision surface.
(compare to state space search)
(compare to techniques like ID3 for generating decision trees)
Solution: Multilayers XOR
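One way a hidden layer solves XOR, sketched with hand-picked weights and thresholds (these particular values are an illustration, not taken from the cited texts): one hidden unit computes OR, another computes AND, and the output unit fires for "OR but not AND".

```python
def step(total, threshold):
    """Hard-limiter transfer function."""
    return 1 if total > threshold else 0

def xor_net(x1, x2):
    h_or = step(x1 + x2, 0.5)           # hidden unit 1: fires if either input is on
    h_and = step(x1 + x2, 1.5)          # hidden unit 2: fires only if both are on
    return step(h_or - 2 * h_and, 0.5)  # output: OR, vetoed by AND

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))
```

No single line separates the XOR instances, but each hidden unit draws its own line and the output unit combines the two.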
Back-Propagation Networks
What can Multi-layer networks compute? Anything.
levels of input, hidden, and output units
horizontal lines
Note also that a backpropagation unit produces a real value between 0
and 1 as output. The transfer (activation) function
is continuous and differentiable.
vertical lines
diagonal lines , etc.
Back-propagation
A back-propagation network also typically starts with a random set of
weights.
After it has seen all the pairs (and adjusted its weights accordingly) we say
one epoch has been completed.
Training a backpropagation net usually requires many epochs.
Recall the learning rule for perceptrons:
$w_{t+1} = w_t + \eta \, \nabla J$
$\delta = (T - A)$
Now the learning rule becomes:
$w_i(n+1) = w_i(n) + \eta \, \delta \, x_i$
$w_i(n+1)$ = the value of the weight i after adjustment
$w_i(n)$ = the value of the weight i before adjustment
The delta rule implements a gradient descent and thus minimizes the error
function.
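A minimal sketch of the delta rule, assuming a single linear unit (actual output A is the plain weighted sum) and a learning rate of 0.1. The training pairs are invented and consistent with the target weights (2, -1).

```python
def delta_rule_step(weights, inputs, target, eta=0.1):
    """Apply w_i(n+1) = w_i(n) + eta * delta * x_i with delta = T - A."""
    actual = sum(w * x for w, x in zip(weights, inputs))  # A
    delta = target - actual                                # T - A
    return [w + eta * delta * x for w, x in zip(weights, inputs)]

# Hypothetical training pairs generated by T = 2*x1 - x2.
examples = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
weights = [0.0, 0.0]
for _ in range(200):  # 200 epochs
    for inputs, target in examples:
        weights = delta_rule_step(weights, inputs, target)
print(weights)  # approaches [2.0, -1.0]
```

Each step moves the weights a little way down the error surface for the pair just seen, which is why the weights drift toward the values that make every delta zero.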
Hebb's Rule
(The best known and oldest learning law.)
When unit A and unit B are simultaneously excited, increase the strength of
the connection between them.
A natural extension of this rule to cover the positive and negative activation
values is:
Adjust the strength of the connection between units A
and B in proportion to the product of their
simultaneous activation.
Thus we see:
$w_{ij}(n+1) = w_{ij}(n) + \eta \, x_i x_j$, where $w_{ij}$ is the weight between $x_i$ and $x_j$
-Boltzmann Machines
Simulated annealing
A Boltzmann machine is a variation of a Hopfield network.
Hopfield nets are good as content-addressable memories
and also for solving constraint satisfaction problems
or mutually supporting hypotheses
incompatible hypotheses
Simulated annealing:
The procedure has a strong resemblance to the annealing of metals (hence its
name).
In a metal raised to a temperature above its melting
point, the atoms are in violent random motion.
As with all physical systems, the atoms tend toward a minimum energy state (a
single crystal in this case),
but at high temperatures the vigor of the atomic motion prevents this. As
the metal is gradually cooled, lower
and lower energy states are assumed until finally the
lowest of all possible states, a global minimum, is achieved.
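The cooling idea can be sketched on a toy problem. This assumes the Metropolis acceptance rule (always accept a move that lowers the energy, and accept an uphill move with probability e^(-ΔE/T)) and a geometric cooling schedule; the energy landscape, the parameters, and the function names are invented for illustration.

```python
import math
import random

def anneal(energy, start, neighbor, t_start=10.0, t_end=0.01,
           cooling=0.95, steps_per_t=50, seed=0):
    """Return the lowest-energy state seen while cooling from t_start to t_end."""
    rng = random.Random(seed)
    current = best = start
    t = t_start
    while t > t_end:
        for _ in range(steps_per_t):
            candidate = neighbor(current, rng)
            delta = energy(candidate) - energy(current)
            # Downhill moves are always taken; uphill moves only sometimes,
            # and less often as the temperature drops.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current = candidate
                if energy(current) < energy(best):
                    best = current
        t *= cooling  # gradual cooling
    return best

# Toy landscape: position 1 is a local minimum, position 6 the global one.
landscape = [5, 3, 4, 6, 2, 1, 0, 3]
energy = lambda s: landscape[s]
neighbor = lambda s, rng: max(0, min(len(landscape) - 1, s + rng.choice((-1, 1))))
best = anneal(energy, 1, neighbor)
print(best, energy(best))
```

A pure downhill search started at position 1 would stay there; the random uphill moves at high temperature are what give the search a chance to leave the local minimum.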
disadvantages:
Potential Uses for Either ...
Problems to solve
Bibliography
Hinton, G. and Sejnowski, T. (1986). Learning and relearning in Boltzmann Machines.
In Parallel Distributed Processing, Rumelhart, D., McClelland, J. and the PDP Research
Group (Eds.), 282-317. Cambridge, MA: MIT Press.
Additional Sources
Laird's notes (pg. 101-118)
Lectures on back-propagation (good fonts so can see Delta rule, etc.)