Machine Learning

Neural Nets

  • see also Laird's notes (around pg. 101)

    Associative Memory

    The previous efforts discussed have been symbolically based. In the next sessions, we will consider:

    Reminders: (levels (types) of learning)


    Background:

    Other names for artificial neural networks: (not synonyms, but related)

    Insights: Connectionists conjecture that thinking about computation in terms of the "brain metaphor" rather than the "digital computer metaphor" will lead to insights about the nature of human intelligence.

    Computers:

    Humans:

    What constraints ( + or - ) does brain offer ?

    1. individual neurons are slow (milliseconds)
      • Yet humans perform interpretation of visual scenes in less than a second...
        (We can do in a hundred steps what a computer cannot do in 10 million steps.) ...how?
      • The human brain contains a huge number of processing elements that act in parallel.
        Suggestion: consider massively parallel algorithms
    2. neurons are failure-prone
      • They are constantly dying and their firing patterns are irregular.
      • Digital computers must operate perfectly; the failure of one component means a loss of information.
      • For graceful degradation, distributed representations.
    3. humans handle "fuzzy" situations
      • Brain models are good at approximate matching.
      • The idea behind connectionism:
        • We may see significant advances in AI if we approach problems from the point of view of brain-style computation.

    New neural network architectures are characterized by:


    The brain metaphor:

    How it works in theory: Associative memory

    How it works physically: Neurons

    History

    (
    figure 3.1)
    Jasper Lupo, deputy director of the Tactical Technology Office of DARPA, called the neural network technology "more important than the atom bomb." DARPA embarked on an 8 year neural network program

    $390 million. Original funding cut to $33 million over 17 months as a seed program.

    The Brain

    Physical - Basics:

    A Simple Neuron: fig.4.2
    (a nerve cell with all its processes)

    Nerve Structures and Synapses: fig. 4.3
    Synapse Activity: fig. 4.4
    (some signals excite (+) ; some inhibit (-) )

    The Brain:

    -Associative Memory-

    Behaviorally:

    The ability to get from one internal representation to another, or to infer a complex representation from a portion of it, forms the basis of associative memory.

    Associative memory hinges on the concept of the distributed representation of information.

    Human memory operates in an associative manner; a portion of a recollection can produce a larger related memory.

    Examples:

    The brain metaphor (behavioral implementation):
    Content-addressable memory is hard to achieve in von Neumann machines because they access items in memory by using their addresses (an address is applied and the data occupying that address is returned). Therefore it is hard to infer the address of a particular item in memory if only a partial description of it is given.

    Consider image retrieval: in a neural net employing the associative memory concept, the image is its own address and, in fact, presentation of an approximation of this address will result in the recovery of the actual image as output.

    "Because a neural net is an ensemble of a great number of collectively interacting elements, it seems more natural to abandon altogether those physical models of memory in which particular concepts correspond to particular spatial locations (nodes) in the hardware. Instead, we assume that representations of concepts and other pieces of information are stored as collective states of a neural network....they are realized only through collective effects and are reflected in recall processes...This kind of collective representation might be called holographic or hologic [Khanna, 1990]."

    Hopfield Networks

    A Hopfield network has the above interesting features:
    (which provide us with desired properties of the brain)

    These features are achieved as seen in the following example.

    Example of a simple Hopfield Net:

    (material from ["Artificial Itelligence" 2nd ed., Rich & Knight, McGraw Hill, 1991])

    Example:

    In figure 18.1 consider unit in lower left ... This network has only four distinct stable states: (fig. 18.2)

    Given any initial state, the network will necessarily settle into one of these four configurations.

    The network is "storing" these patterns.

    Hopfield's major contribution - to show that given any set of weights and any initial state, his parallel relaxation algorithm would eventually steer the network into a stable state.
    There can be no divergence or oscillation.

    Note: parallel relaxation is really search...

    Consider:

    (state B is depicted as being lower than state A because fewer constraints are violated. A constraint is violated, for example, when two active units are connected by a negatively weighted connection.)

    E.g., "gray, large, fish, eats plankton"

    Unit failure: a unit becomes active (or not) when it should not.
    Units surrounding will "set it straight"

    In large systems, units or connections can disappear completely without adversely affecting the overall behavior.

    What is the relationship between the weights on the network's connections and the state it settles into?

    The brain metaphor (physical implementation):

    [material from "A Practical Guide to Neural Nets", Nelson & Illingworth, Addison Wesley, 1991]

    1. many inputs (stimulation) ; one output Fig. 4.9a,
    2. each input given a relative weighting Fig. 4.9b - affects the impact of the input

    Neuron functions:

    Activation functions:

    The result of the summation function could be input to an activation function before being passed to the transfer function. The purpose would be to allow outputs to vary over time. (Research area, most use Identity ( = to none))

    Transfer functions:

    1. The transfer function could be 0. Then the output depends upon whether the result of the is positive or negative. The network could output -1 and 1, or 1 and 0, etc. The transfer function would be a "hard limiter" or "step". It has only binary output. fig 4.11

    2. The threshold or ramping function. Mirrors the input within a certain range (say 0 to 1), but functions as a hard limiter outside that range. (Linear, but clipped to min and max values, which makes it nonlinear.)

    3. Sigmoid or S-shaped curve. The curve approaches a min and max value at the asymptotes. (Mathematically nice since both function and derivative are continuous.)       (often the choice)

    Combining elements:

    Processing elements can be combined to make layers of these nodes.

    Inputs are connected to many nodes with different weights, resulting in a series of outputs, one per node. (fig. 4.12)

    (The connections correspond to the axons and synapses in a biological system, and they provide a signal transmission pathway between the nodes.)

    Combining layers:

    The layer that receives the input is the input layer.

    The network outputs are generated from the output layer.

    Any other layers are called hidden layers because they are internal to the network and have no direct contact with the external environment. (fig. 4.13)

    *Note that there are more connections than nodes.

    Connectivity Options:

    Connectivity is how the outputs are channeled to become inputs. The output signal from a node may pass on as input to other processing elements.

    Filters:

    Layers in a neural net can act as filters. (fig. 4.15). Example - the input signals are a pixel pattern for the letter A.
    The network generates an output pattern of the ASCII code.

    The example is a mapping, or feedforward network. Such mappings, or associations of objects in the input set with objects in the output set, are also called transformations.

    Note: we could input any letter's pixel pattern and have the network output the correct ASCII code.
    We do not have to have a separate network for each element in the set.

    The internal layers form an intermediate representation of the input data.

    Hidden layers hold the key to more complex computations.

    Automatic Learning

    Learning

    Perceptrons [Rosenblatt, 1962]

    A perceptron models a neuron by taking the weighted of its inputs and sending the output 1 if the sum is greater than some adjustable threshold and 0 otherwise.
    (linear, binary transfer function)

    In the perceptron, unlike the Hopfield network, connections are unidirectional.
    (feedforward)

    Fig. 18.5
    If the presence of some feature xi tends to cause the perceptron to fire, the weight wi will be positive, if the feature xi inhibits, the weight wi will be negative.

  • The perceptron consists of:
  • Learning is the process of modifying the values of the weights and the threshold.

  • Implement threshold as another weight wo (convenience) A group of perceptrons can be trained on sample input-output pairs until it learns to compute the correct function. (A perceptron with many inputs and outputs fig 18.7)

    Whatever a perceptron can compute, it can learn to compute.
    When perceptrons were first conceived, many thought that intelligent systems could be constructed out of perceptrons fig 18.8

    E.g. Pattern classification

    (like concept learning)

    The problem is to determine linearly separable, fig. 18.9 -
           we can draw a line that separates one class from another

    How learn to identify x2 with white dots and x1 with black ones?

    A training example:

  • Train perceptron to output 1 if in class of white dots, else 0.

    Weighted sum: g(x) = wixi      i= 0 to n (and xi is the activation value)

    Output function: T(x) = 1 if g(x) > 0
                                        0 if g(x) < 0

  • Consider the case with only two inputs:

    g(x) = w0 + w1x1 + w2x2
    If g(x) = 0, the perceptron does not know if it should fire.

    The location of the line is completely determined by the weights w0, w1 , and w2

    If the input lies on one side of the line, the perceptron outputs 1.
    If the input lies on the other side, the perceptron outputs 0.

    A line that correctly separates the training instances corresponds to a perfectly functioning perceptron.

    For specifics on the actual learning process, see the parts I took out since it seemed distracting

    Figure 18.11 shows a perceptron learning to classify the instances in 18.9. K is the number of passes.


    Note: (consider)

    (1) parallel relaxation is a problem-solving strategy
                 (compare to state space search)

    (2) gradient descent is a learning strategy
                 (compare to techniques like ID3 for generating decision trees)

    Back-Propogation Networks

    What can Multi-layer networks compute? Anything.

    The major problem is learning.

    Note: The knowledge representation employed by neural nets is quite opaque; the nets must learn their own representations because programming them by hand is impossible.

    First, consider a multilayered, fully connected, feedforward network.

    Fig. 18.14

  • In the figure xi, hi, and oi, represent unit activation
                 levels of input, hidden, and output units
  • three layers (possible to have more)
  • connected in forward direction

    In contrast to Hopfield nets, backpropogation networks perform a simpler computation.

    Because activations flow in only one direction, there is no need for an iterative relaxation process.

    The existance of hidden layers allows the network to develop complex feature detectors, or internal representations.

    Fig. 18.15, identification of the digit "7" hidden units could indicate:

    Note also that a backpropogation unit produces a real valued between 0 and 1 as output. The transfer (activation) function is continuous and differentiable.

    Figure 18.16 It's output: output = 1/(1 + e-sum)

    Back-propogation

    A back-propogation network also typically starts with a random set of weights.

    The network adjusts its weights each time it sees an input-output pair. Each pair requires two stages:

  • The forward pass involves presenting a sample input to the network and letting activations flow until they reach the output layer.
  • In the backward pass, the network's actual output (from the forward pass) is compared with the target output and error estimates are computed for the output units.
  • The weights connected to the output units can be adjusted in order reduce these errors.
  • Then use the error estimates of the output units to derive error estimates for the hidden layers.
  • Finally, errors are propogated back to the connections stemming from the input units.

    Unlike perceptrons, the backprogation algorithm usually updates its weights incrementally, after seeing each input-output pair.
    After it has seen all the pairs (and adjusted its weights accordingly) we say one epoch has been completed.
    Training a backpropogation net usually requires many epochs.

    Back Propogation Learning

    The most important generalization of the perceptron training algorithm is called the Delta Rule.

    The back propogation of errors technique is the most commonly used generalization of the Delta Rule.

    The Delta Rule:(notice the term)
    Recall the learning rule for perceptrons: wt+1 = wt + h * Gradient(J)

    The algorithm can be restated and generalized by introducing the term d, which is the difference between the desired or target output T and the actual output A.

    Now the learning rule becomes: wi(n+1) = the value of the weight i after adjustment
    wi(n) = the value of the weight i before adjustment

    The delta rule modifies weights appropriately for target and actual outputs of either polarity and for both continuous and binary outputs.
    The delta rule implements a gradient descent and thus minimizes the error function.

    Most models consider learning to be an adjustment on the

    strengths of connections.

    Virtually all learning rules of this type can be considered a variant of the Hebbian learning rule.

    Hebb's Rule

    (The best known and oldest learning law.)

    Simplest version:

    A natural extension of this rule to cover the positive and negative activation values is: Thus we see:
    wi(n+1) = wi(n) + h xixi where wi is the weight between xi and xj

    Note: What do negative and positive activations do?

    It is very important to note that the information needed to use the Hebb rule to determine the value each connection should have is locally available at the connection.

    All a given connection needs to consider is the activation of the units on both sides of it.

    -Boltzman Machines

    Simulated annealing


    A Boltzman machine is a variation of a Hopfield network.

    Recall (in Hopfield):

    Hopfield nets are good as content-addressable memories and also for solving constraint satisfaction problems or mutually supporting hypotheses incompatible hypotheses

    As the net settles, it assigns truth or falsity while violating the fewest constraints.

    Problem: Hopfield networks settle into local minima.

    We need the globally optimal state (satisfy as many constraints as possible).

    If a Hopfield network reaches a stable state, then no single unit is willing to change its state in order to move uphill.

    Hinton and Sejnowski [1986] combined Hopfiled nets and simulated annealing to produce networks called Boltzmann Machines.

    Simulated annealing

    : The procedure has a strong resemblence to the annealing of metals (hence its name).

    In the annealing process, the distribution of energy states is determined by a mathematical relationship.

    In Boltzmann machines, this relationship is used to update the units individual states.

    The previous training methods we have seen have been deterministic. (adjusting weights based on current information)

    This method is a statistical training method . It makes pseudorandom changes in the weight values, retaining those changes that result in improvements.

    Annealing is the process of gradually moving from very high temperatures down to very low ones.. The randomness added by the temperature helps the network escape the local minima.

    (Here the temperature is "artificial" and is defined ahead.)

    The probability that any given unit will be active is given by p:

    p = 1/(1 + eDE/T) where DE is the sum of the unit's active input lines and T is the "temperature" of the network.

    For more information about the learning algorithm see the Hinton and Sejnowski paper.

    If the annealing is carried out properly, Boltzmann machines can avoid local minima and learn to compute any computable function of fixed-sized inputs and outputs.

    -Distributed vs. Localist Representations

    advantages:

    Distributed representations are:

    disadvantages:

    Localist representations can:

    Potential Uses for Either ...
    Problems to solve

    pattern recognition, vision, speech


    Bibliography

    Hinton, G. and Sejnowski, T. (1986). Learning and relearning in Boltzmann Machines. In Parallel Distributed Processing, Rumelhart, J., McClelland, J. and the PDP Research Group (Eds.), 282-317. Cambridge, MA: MIT Press.

    Hopfield, J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences USA, 79, pp 2554-2558.

    Minsky, M. and Papert, S. (1969). Perceptrons. Cambridge MA: MIT Press.

    Nelson, M. and Illingworth, W. (1991). A Practical Guide to Neural Nets. Reading MA: Addison-Wesley.

    Quillian, R. (1968). Semantic memory. In Semantic Information Processing, Minsky, M. (Ed.), Cambridge MA: MIT Press.

    Rich, E. and Knight, K. (1991). Artificial Intelligence, Second Edition. New York: McGraw-Hill.

    Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386-407.

    Rosenblatt, F. (1962). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Washington, D.C.: Spartan Books.

    Wasserman, P. (1989). Neural Computing: Theory and Practice. NewYork: Van Nostrand Reinhold.


    Additional Sources

    Laird's notes (pg. 101-118)

    Nice tutorial from Mat Buckland (wish I had seen before I wrote these notes!)

    AI depot link to Neural Net tutorials and Applications scroll down for uses in Games (Creatures, Black & White, Collin McRay Rally 2.0)

    Generation 5

    Another Intro to NN