Machine Learning

II. Learning From Examples:

Inductive Learning

Inductive Learning:

Inductive inference attempts to derive a complete and correct description of a given phenomena from specific observations of the phenomenon or of parts of it.

Two major kinds: learning from examples (concept acquisition) and learning from observation (descriptive generalizations).

(Interpretation and examples taken from [Michalski, 1983, pg. 90; Jackson, 1990, pg. 432])

In concept acquisition the observational statements are characterizations of some objects (situations, processes, etc.) preclassified by a teacher into one or more classes (concepts). The induced hypothesis can be viewed as a concept recognition rule, such that if an object satisfies this rule, then it represents the given concept.

For example, a recognition rule for the concept "philosopher" might be:

"A person who pursues wisdom and gains the knowledge of underlying reality by intellectual means and moral self-discipline is a philosopher."
In descriptive generalization the goal is to determine a general description (a law, a theory) characterizing a collection of observations. For example, observing that the philosophers Aristotle, Plato, and Socrates were Greek, but that Spencer was British, one might conclude:

"Most philosophers were Greek."
Thus, in contrast to concept acquisition that produces descriptions for classifying objects into classes on the basis of the object's properties, descriptive generalization produces descriptions specifying properties of objects belonging to a certain class.

Jackson's examples

Some examples of research in each from [Michalski, 1983]

We will focus on concept acquisition in this session.

Winston's Structural Concept Learning Program, [Winston,1975]

Domain: Blocks world

Goal: To construct representations of the definitions of concepts in the blocks world

Knowledge Structure: Semantic Nets

A main point: To learn how it is possible to learn by analyzing the differences that appear ina sequence of observations

Technique: Use procedures that learn class descriptions from positive and negative examples.

Heuristics: require-link, forbid-link and felicity examples (implied covenants between teachers and students that make learning possible). (E.g., when there is doubt about what to do, do nothing.)

Example: it learned the concepts House, Tent, and Arch

(see figure)

Rich's

The figure also shows examples of near misses for each concept. A near miss is an object that is not an instance of the concept in question but that is very similiar to such instances.

Induction heuristics

Induction occurs when you use particular examples to reach conclusions.

Figure 16.1:

First example derives a general idea of what an arch is

Each subsequent example provides a new point

Last example says brick on top is not important

Basic approach:

  1. Begin with a structural description of one known instance of the concept. Call that description the concept definition (or initial description). During learning the initial description is augmented by information indicating which links are important. The augmented description is called the evolving model.
  2. Examine descriptions of other known instances of the concept. Generalize the definition to include them.
  3. Examine descriptions of near misses (negative examples) of the concept. Restrict (Specialize) the definition to exclude these. (one difference at a time, one determined most important...unless very similar (like two supports))
require-link and forbid-link : Figures 16.2, 16.3

Restrict and relax ... figure 16.4 for relax (enlarge-set routs to a new class, the union of the object's classes)

Points for learning:

Martin's Law: You cannot learn anything unless you almost know it already [William Martin, Dubrovnik, Yugoslovia, 1979]

Problems:

The following learning technique is insensitive to the order in which examples are presented

Version Spaces [Mitchell, 1977, 1978]

Goal (the same): To construct representations of the definitions of concepts ... to produce a description that is consistent with all positive examples but no negative examples.

Technique (a difference): Winston's system accomplished this by evolving a single concept description.

Version spaces work by maintaining a set of possible descriptions and evolving that set as new examples and near misses are presented.

One prevents mistaken modifications by following each interpretation as long as it remains viable.

Constraints: There is a fixed number of attributes and a class-characterizing model can be expressed as a combination of values for those attributes.

*The choice of attributes and values is called the bias of the learning system

(Note: a clear statement of the bias of a learning system is very important to its evaluation.)

A main point (an advantage): learning by managing multiple models. Concepts can be determined instance by instance , without looking back to examine either past training instances or previously rejected concept descriptions.

Heuristic:

(Constraints are exact and guaranteed to cause the algorithm to converge on a solution, thus heuristic search is not employed.)

The search space of possible concept descriptions is not without structure - it contains redundancy. Thus a partial order can be defined over the patterns generated. (general <-> specific)

  • Each positive example makes the S set more general
  • Each negative example makes the G set more specialized
  • If S and G converge, our range of hypotheses narrrows to a single concept description Slide (fig. 20.1)- general idea of version space convergence

    (from [Winston,1992])

    pruning, specializing, generalizing

    The G set must be specialized in such a way that the negative example is no longer in the version space. Specialization involves replacing variables with constants.

    (fig 20.2)

    Note: the G set must be specialized only to descriptions that are within the current version space, not outside it

    (algorithm: Candidate Elimination Algorithm)

    (fig)

    Examples: Concept descriptions and training examples are stated in terms of slots and values.

    Problem 1: (from [Winston, 1992]): Suppose you are a doctor treating a patient who occasionally suffers an allergic reaction. Your intuition tell you it is a direct result of a certain combination of the place where your patient eats, the time of day, the day of the week, and the amount that your patient spends on food.

    Concept: What is the cause of the patient's allergic reaction

    A specific model (representation):

    [place, meal, day, cost]

    An example of the model: [Sam's, Dinner, Thursday, Cheap]

    Most general model: [ ?, ?, ?, ? ]

    Trace: figures 20.3 -20.7

    Problem 2: (from [Rich & Knight,1991]):

    Concept: Japanese economy car

    Figure 17.7 : a frame for an individual car

    Figure 17.8 : discrete values possible for attributes

    Figure 17.9 : possible concept

    Problem: Given a representational language such as in Fig. 17.8, and given positive and negative training examples such as those in 17.7, how can one produce a concept description such as that in Figure 17.9 that is consistent with all the training examples?

    Trace : given examples in 17.12

    Property summary about version spaces:

    compact - one is not explicitly storing every concept description in the space

    easy to update - defining a new space corresponds to moving one or both of the boundaries

    Problems:

    Knowledge needed "on top":

    Example relation: "more specific than or equal to"

    Pattern P1 is more specific than or equal to pattern 2

    ( P1 > P2 ) if and only if P1 matches a subset of all the instances that P2 matches (Jackson,pg 436)

    Mitchell: The chief man-power cost involved in applying version spaces to real problems is the construction of the set of training instances.

    Decision Trees [Quinlan, 1979, 1986]

    A third approach to concept learning is the induction of decision trees, as exemplified in the ID3 program of Quinlan.

    Goal: ID3 is a program that builds decision trees automatically, given positive and negative instances of a concept.

    Structure of the tree:

    Sample tree: (figure 26.3, from [Jackson, 1990])

    Technique:

    ID3 uses an iterative method to build up decision trees

    (Empirical evidence indicates that the iterative strategy is more efficient than considering the whole training set at once.)

    The heart of the algorithm is the procedure for building a decision tree for an arbitrary collection of objects, C, where C is not empty and contains objects of more than one class.

    Let T be any test on an object x, with A1, A2, ... An representing the possible outcomes of T(x). T will therefore produce a partition {C1, C2, ...Cn} of C such that

    Ci = {x | T(x) = Ai}.

    e.g. (fig. 15-4)

    If we continue recursively to replace each Ci with a decision tree, we would have a decision tree for C.

    The crucial point in the problem reduction strategy is the choice of test.

    (Want tree to provide answers (prune) quickly)

    Which attributes to test? (Some attributes will yield more information than others.) For each subtree, one needs to find a good attribute for partitioning the objects.

    Quinlan uses the notion of expected information from communication theory. (also called entropy and uncertainity (not to be confused with certainity factors))

    In communication theory, expected information is a number describing a set of messages, M = {m1,m2, ..., mn}.

    Each message mi in the set has probability p(mi) and contains an amount of information, I(mi), defined as

    I(mi) = -log2 p(mi)

    Thus information is an inverse monotonic function of probability.

    The expected information or entropy of a message set U(M), is the sum of the information in the several possible messages multiplied by their probabilities:

    U(M) = - Si p(mi) log2 p(mi) for i = 1 to n

    Note: Such a formula gives you a high number when a test produces highly inhomogeneous sets and a low number when a test produces completely homogeneous sets.

    Quinlan's assumptions:

    Examples:

    Suppose you had a set that contains members of just two classes, class A and class B. If the number of members from class A and the number of members of class B are perfectly balanced, the measured disorder is 1, the maximum possible value:

    -1/2log21/2 - 1/2log21/2 = 1/2 + 1/2 = 1

    On the other hand, if there were only all A's or all B's, the measured disorder is 0, the minimum possible value:

    -1 log21 -0 log20 = -0 - 0 = 0

    As you move from perfect balance and perfect homogeneity, disorder varies smoothly between 0 and 1. The disorder is 0 when the set is perfectly homogeneous, and the disorder is 1 when the set is perfectly inhomogeneous.

    We want to minimize average disorder.

    It is convenient to write this 'two-message' case of entropy as a function of the two arguments:

    EI(p,n) = -p/(p + n)*log2 p/(p+n) - n/(p + n)*log2 n/(p+n)

    Given a test T with outcomes A1, A2, ... An producing a partition {C1, C2, ... Cn} as before, suppose Ci that contains pi objects of class P and ni objects of class N . The entropy of the subtree is given by EI(pi,ni). The entropy of the larger tree with T at its root is then given by the weighted average:

    E(T) = Si F((pi + ni),(p + n)) * EI(pi ,ni) for i=1 to n

    The ratio (pi + ni)/ (p + n) corresponds to the weight for the ith branch in a tree. It represents the proportion of the objects in C that belong to Ci.

    Heuristic: In deciding what attribute to branch on next, ID3 uses select the test that gains the most information.

    The information gained by carrying out a test is given by

    EI(p,n) - E(T)

    Such heuristics are sometimes described as 'minimum entropy' because maximizing information gain corresponds to minimizing uncertainity or disorder.

    Example: [Quinlan,1983]

    Consider collection C [height,hair,eyes] below :

    Selecting the second attribute to form the root yields 15-1.

    Subcollections of 'dark' and 'red' contain objects of only a single class, and so require no more work.

    Choosing the third attribute forms 15-2.

    Now all subcollections contain objects of one class so we replace subcollections with class name (figure 15-3)

    Another example:

    Pluses:

    -- Handling Conflicts

    -- Attributes with numeric values

    Student Task:

    PAIL allows one to use repertory grids as input for the decision tree. It also takes decision trees and can create rules for a rule-based system.

    A. Take your results from the previous section's Mushroom example and create the decision tree and rules for this problem.

    B. ? (rule elimination?)