Inductive inference attempts to derive a complete and correct description of a given phenomena from specific observations of the phenomenon or of parts of it.
Two major kinds: learning from examples (concept acquisition) and learning from observation (descriptive generalizations).
(Interpretation and examples taken from [Michalski, 1983, pg. 90; Jackson, 1990, pg. 432])
In concept acquisition the observational statements are characterizations of some objects (situations, processes, etc.) preclassified by a teacher into one or more classes (concepts). The induced hypothesis can be viewed as a concept recognition rule, such that if an object satisfies this rule, then it represents the given concept.
For example, a recognition rule for the concept "philosopher" might be:
"A person who pursues wisdom and gains the knowledge of underlying reality by intellectual means and moral self-discipline is a philosopher."In descriptive generalization the goal is to determine a general description (a law, a theory) characterizing a collection of observations. For example, observing that the philosophers Aristotle, Plato, and Socrates were Greek, but that Spencer was British, one might conclude:
"Most philosophers were Greek."Thus, in contrast to concept acquisition that produces descriptions for classifying objects into classes on the basis of the object's properties, descriptive generalization produces descriptions specifying properties of objects belonging to a certain class.
Jackson's examples
Some examples of research in each from [Michalski, 1983]
We will focus on concept acquisition in this session.
Winston's Structural Concept Learning Program, [Winston,1975]
Domain: Blocks world
Goal: To construct representations of the definitions of concepts in the blocks world
Knowledge Structure: Semantic Nets
A main point: To learn how it is possible to learn by analyzing the differences that appear ina sequence of observations
Technique: Use procedures that learn class descriptions from positive and negative examples.
Heuristics: require-link, forbid-link and felicity examples (implied covenants between teachers and students that make learning possible). (E.g., when there is doubt about what to do, do nothing.)
Example: it learned the concepts House, Tent, and Arch
(see figure)
Rich's
The figure also shows examples of near misses for each concept. A near miss is an object that is not an instance of the concept in question but that is very similiar to such instances.
Induction heuristics
Induction occurs when you use particular examples to reach conclusions.
Figure 16.1:
First example derives a general idea of what an arch is
Each subsequent example provides a new point
Basic approach:
Restrict and relax ... figure 16.4 for relax (enlarge-set routs to a new class, the union of the object's classes)
Points for learning:
Problems:
Version Spaces [Mitchell, 1977, 1978]
Goal (the same): To construct representations of the definitions of concepts ... to produce a description that is consistent with all positive examples but no negative examples.
Technique (a difference): Winston's system accomplished this by evolving a single concept description.
Version spaces work by maintaining a set of possible descriptions and evolving that set as new examples and near misses are presented.
One prevents mistaken modifications by following each interpretation as long as it remains viable.
Constraints: There is a fixed number of attributes and a class-characterizing model can be expressed as a combination of values for those attributes.
*The choice of attributes and values is called the bias of the learning system
(Note: a clear statement of the bias of a learning system is very important to its evaluation.)
A main point (an advantage): learning by managing multiple models. Concepts can be determined instance by instance , without looking back to examine either past training instances or previously rejected concept descriptions.
Heuristic:
(Constraints are exact and guaranteed to cause the algorithm to converge on a solution, thus heuristic search is not employed.)
The search space of possible concept descriptions is not without structure - it contains redundancy. Thus a partial order can be defined over the patterns generated. (general <-> specific)
(Figure 17.11)
Example relation: "more specific than or equal to"
Pattern P1 is more specific than or equal to pattern 2
( P1 > P2 ) if and only if P1 matches a subset of all the instances that P2 matches
use two subsets of the concept space
most specific
(from [Winston,1992])
pruning, specializing, generalizing
The G set must be specialized in such a way that the negative example is no longer in the version space. Specialization involves replacing variables with constants.
(fig 20.2)
Note: the G set must be specialized only to descriptions that are within the current version space, not outside it
(algorithm: Candidate Elimination Algorithm)
(fig)
Examples: Concept descriptions and training examples are stated in terms of slots and values.
Problem 1: (from [Winston, 1992]): Suppose you are a doctor treating a patient who occasionally suffers an allergic reaction. Your intuition tell you it is a direct result of a certain combination of the place where your patient eats, the time of day, the day of the week, and the amount that your patient spends on food.
Concept: What is the cause of the patient's allergic reaction
A specific model (representation):
[place, meal, day, cost]
An example of the model: [Sam's, Dinner, Thursday, Cheap]
Most general model: [ ?, ?, ?, ? ]
Trace: figures 20.3 -20.7
Problem 2: (from [Rich & Knight,1991]):
Concept: Japanese economy car
Figure 17.7 : a frame for an individual car
Figure 17.8 : discrete values possible for attributes
Figure 17.9 : possible concept
Problem: Given a representational language such as in Fig. 17.8, and given positive and negative training examples such as those in 17.7, how can one produce a concept description such as that in Figure 17.9 that is consistent with all the training examples?
Trace : given examples in 17.12
Property summary about version spaces:
easy to update - defining a new space corresponds to moving one or both of the boundaries
Problems:
large space requirements
pre-maturely
(e.g., car example (3) change to + to -, would never reach)
(e.g., "European car" is a German, British, ...., or Italian car) A specialization could become one large disjunction of all positive instances.
Knowledge needed "on top":
Example relation: "more specific than or equal to"
Pattern P1 is more specific than or equal to pattern 2
( P1 > P2 ) if and only if P1 matches a subset of all the instances that P2 matches (Jackson,pg 436)
Mitchell: The chief man-power cost involved in applying version spaces to real problems is the construction of the set of training instances.
Decision Trees [Quinlan, 1979, 1986]
A third approach to concept learning is the induction of decision trees, as exemplified in the ID3 program of Quinlan.
Goal: ID3 is a program that builds decision trees automatically, given positive and negative instances of a concept.
Structure of the tree:
Sample tree: (figure 26.3, from [Jackson, 1990])
Technique:
ID3 uses an iterative method to build up decision trees
The heart of the algorithm is the procedure for building a decision tree for an arbitrary collection of objects, C, where C is not empty and contains objects of more than one class.
Let T be any test on an object x, with A1, A2, ... An representing the possible outcomes of T(x). T will therefore produce a partition {C1, C2, ...Cn} of C such that
Ci = {x | T(x) = Ai}.
e.g. (fig. 15-4)
If we continue recursively to replace each Ci with a decision tree, we would have a decision tree for C.
The crucial point in the problem reduction strategy is the choice of test.
(Want tree to provide answers (prune) quickly)
Which attributes to test? (Some attributes will yield more information than others.) For each subtree, one needs to find a good attribute for partitioning the objects.
Quinlan uses the notion of expected information from communication theory. (also called entropy and uncertainity (not to be confused with certainity factors))
In communication theory, expected information is a number describing a set of messages, M = {m1,m2, ..., mn}.
Each message mi in the set has probability p(mi) and contains an amount of information, I(mi), defined as
I(mi) = -log2 p(mi)
Thus information is an inverse monotonic function of probability.
The expected information or entropy of a message set U(M), is the sum of the information in the several possible messages multiplied by their probabilities:
U(M) = - Si p(mi) log2 p(mi) for i = 1 to n
Note: Such a formula gives you a high number when a test produces highly inhomogeneous sets and a low number when a test produces completely homogeneous sets.
Quinlan's assumptions:
the same proportion as their representation in C. Suppose that p denotes the number of objects of
class P in C and n denotes the number of objects of class N in C. Thus an unseen object will belong to class P with probability p/(p +n) and to class N with probability n/(p +n).
U(M) = -p/(p + n)*log2 p/(p+n) - n/(p + n)*log2 n/(p+n)
Suppose you had a set that contains members of just two classes, class A and class B. If the number of members from class A and the number of members of class B are perfectly balanced, the measured disorder is 1, the maximum possible value:
-1/2log21/2 - 1/2log21/2 = 1/2 + 1/2 = 1
On the other hand, if there were only all A's or all B's, the measured disorder is 0, the minimum possible value:
-1 log21 -0 log20 = -0 - 0 = 0
As you move from perfect balance and perfect homogeneity, disorder varies smoothly between 0 and 1. The disorder is 0 when the set is perfectly homogeneous, and the disorder is 1 when the set is perfectly inhomogeneous.
We want to minimize average disorder.
It is convenient to write this 'two-message' case of entropy as a function of the two arguments:
EI(p,n) = -p/(p + n)*log2 p/(p+n) - n/(p + n)*log2 n/(p+n)
Given a test T with outcomes A1, A2, ... An producing a partition {C1, C2, ... Cn} as before, suppose Ci that contains pi objects of class P and ni objects of class N . The entropy of the subtree is given by EI(pi,ni). The entropy of the larger tree with T at its root is then given by the weighted average:
E(T) = Si F((pi + ni),(p + n)) * EI(pi ,ni) for i=1 to n
The ratio (pi + ni)/ (p + n) corresponds to the weight for the ith branch in a tree. It represents the proportion of the objects in C that belong to Ci.
Heuristic: In deciding what attribute to branch on next, ID3 uses select the test that gains the most information.
The information gained by carrying out a test is given by
EI(p,n) - E(T)
Such heuristics are sometimes described as 'minimum entropy' because maximizing information gain corresponds to minimizing uncertainity or disorder.
Example: [Quinlan,1983]
Consider collection C [height,hair,eyes] below :
Selecting the second attribute to form the root yields 15-1.
Subcollections of 'dark' and 'red' contain objects of only a single class, and so require no more work.
Choosing the third attribute forms 15-2.
Now all subcollections contain objects of one class so we replace subcollections with class name (figure 15-3)
Another example:
Pluses:
(Its computational complexity hinges on the cost of choosing the next test to branch on, which is itself a linear function of the product of the number of objects in the training set and the number of attributes used to describe them.)
Drawbacks:
(This is common for numerical approaches; they are efficient and resistant to noise, but they yield rules or concepts which are in general incomprehensible to humans.)
(satisfice vs. optimize)
It is computationally impractical to find the
smallest possible identification tree when many
tests are required, so one must be content with a procedure that tends to build small trees, albeit
trees that are not guaranteed to be the smallest.
General Points:
-- Attributes with numeric values
Student Task:
PAIL allows one to use repertory grids as input for the decision tree. It also takes decision trees and can create rules for a rule-based system.
A. Take your results from the previous section's Mushroom example and create the decision tree and rules for this problem.
B. ? (rule elimination?)