Decision Trees [Quinlan, 1979, 1986]
An approach to concept learning is the induction of decision trees, as exemplified in the ID3 program of Quinlan.
Goal: ID3 is a program that builds decision trees automatically, given positive and negative instances of a concept.
Structure of the tree:
Suppose you are given the training set below [Quinlan, 1985]
A decision tree that correctly classifies each object could be
Sample tree:
another
Technique:
ID3 uses an iterative method to build up decision trees
The heart of the algorithm is the procedure for building a decision tree for an arbitrary collection of objects, C, where C is not empty and contains objects of more than one class.
Let T be any test on an object x, with A1, A2, ... An representing the possible outcomes of T(x). T will therefore produce a partition {C1, C2, ...Cn} of C such that
Ci = {x | T(x) = Ai}.
If we continue recursively to replace each Ci with a decision tree, we would have a decision tree for C.
The crucial point in the problem reduction strategy is the choice of test.
(Want tree to provide answers (prune) quickly)
Which attributes to test? (Some attributes will yield more information than others.) For each subtree, one needs to find a good attribute for partitioning the objects.
Quinlan uses the notion of expected information from communication theory. (also called entropy and uncertainty (not to be confused with certainty factors))
In communication theory, expected information is a number describing a set of messages, M = {m1,m2, ..., mn}.
Each message mi in the set has probability p(mi) and contains an amount of information, I(mi), defined as
I(mi) = -log2 p(mi)
Thus information is an inverse monotonic function of probability.
The expected information or entropy of a message set U(M), is the sum of the information in the several possible messages multiplied by their probabilities:
U(M) = -
i p(mi) log2 p(mi) for i = 1 to n
Note: Such a formula gives you a high number when a test produces highly inhomogeneous sets and a low number when a test produces completely homogeneous sets.
Quinlan's assumptions:
U(M) = -p/(p + n)*log2 p/(p+n) - n/(p + n)*log2 n/(p+n)
Suppose you had a set that contains members of just two classes, class A and class B. If the number of members from class A and the number of members of class B are perfectly balanced, the measured disorder is 1, the maximum possible value:
-1/2log21/2 - 1/2log21/2 = 1/2 + 1/2 = 1
On the other hand, if there were only all A's or all B's, the measured disorder is 0, the minimum possible value:
-1 log21 -0 log20 = -0 - 0 = 0
As you move from perfect balance and perfect homogeneity, disorder varies smoothly between 0 and 1. The disorder is 0 when the set is perfectly homogeneous, and the disorder is 1 when the set is perfectly inhomogeneous.
We want to minimize average disorder.
It is convenient to write this 'two-message' (P (+) and N (-)) case of entropy as a function of the two arguments:
EI(p,n) = -p/(p + n)*log2 p/(p+n) - n/(p + n)*log2 n/(p+n)
Given a test T with outcomes A1, A2, ... An producing a partition {C1, C2, ... Cn} as before, suppose Ci that contains pi objects of class P and ni objects of class N . The entropy of the subtree is given by EI(pi,ni). The entropy of the larger tree with T at its root is then given by the weighted average:
E(T) =
i [(pi + ni) / (p + n)] * EI(pi ,ni) for i=1 to n
The ratio (pi + ni)/ (p + n) corresponds to the weight for the ith branch in a tree. It represents the proportion of the objects in C that belong to Ci.
Heuristic: In deciding what attribute to branch on next, ID3 uses select the test that gains the most information.
The information gained by carrying out a test is given by
EI(p,n) - E(T)
Such heuristics are sometimes described as 'minimum entropy' because maximizing information gain corresponds to minimizing uncertainity or disorder.
Example: [Quinlan,1983]
Consider collection C [height,hair,eyes] below :
Selecting the second attribute to form the root yields .
Subcollections of 'dark' and 'red' contain objects of only a single class, and so require no more work.
Choosing the third attribute forms .
With class names
Consider a "bad" choice of the first attribute at the start:
Another example (of the calculation):
Consider the choice of the first attribute of the tree:
The collection C of objects contains 3 in class '+' and 5 in '-' so for the entropy of the message set
Testing the first attribute gives the results (consider yourself...)
The entropy of the subtrees (or the information still needed for a rule for the "tall" and "short" branches)
Tall subtree
= -2/5 log2 2/5 - 3/5 log2 3/5
= 0.971
Short subtree
= -1/3 log2 1/3 - 2/3 log2 2/3
= 0.918
Thus the expected information content for this subtree
i [(pi + ni) / (p + n)] * EI(pi ,ni) for i=1 to n
E(C, "height") = 5/8 * 0.971 + 3/8 * 0.918 = 0.951
The information gained by carrying out a test is given by
0.954 - 0.951 = 0.003 (which is negligible)
The tree arising from testing the second attribute:
notice the first two subtrees are homogenous - disorder EI(p,n) for each is 0. The third subtree is evenly distributed, EI(p,n) = 1
E(C, "hair") = 3/8 * 0 + 1/8 * 0 + 4/8 * 1 = 0.5
The information gained by carrying out a test is 0.954-0.5 = 0.454
In a similar way the information gained by 'eyes' comes to 0.347
So we have:
The most information is gained using hair.
We have maximized information gain by minimizing uncertainity or disorder.
Other example (data about suntan lotion needs):
The average disorder produced by the various tests when the complete sample set is used as the window:
Test Disorder
Hair 0.5
Height 0.69
Weight 0.94
Lotion 0.61
Which test should be used first? Why?
Test Disorder
Height 0.5
Weight 1
Lotion 0
Which test?
Note how this techniques also shows that some attributes are irrelevant.
Pluses:
(Its computational complexity hinges on the cost of choosing the next test to branch on, which is itself a linear function of the product of the number of objects in the training set and the number of attributes used to describe them.)
Drawbacks:
(This is common for numerical approaches; they are efficient and resistant to noise, but they yield rules or concepts which are in general incomprehensible to humans.)
(satisfice vs. optimize)
It is computationally impractical to find the smallest possible identification tree when many tests are required, so one must be content with a procedure that tends to build small trees, albeit trees that are not guaranteed to be the smallest.
General Points:
-- Attributes with numeric values