Learning From Examples:

Inductive Learning

Decision Trees [Quinlan, 1979, 1986]

An approach to concept learning is the induction of decision trees, as exemplified in the ID3 program of Quinlan.

Goal: ID3 is a program that builds decision trees automatically, given positive and negative instances of a concept.

Structure of the tree:

Suppose you are given the training set below [Quinlan, 1985]

A decision tree that correctly classifies each object could be
Sample tree:

another

Technique:

ID3 uses an iterative method to build up decision trees

(Empirical evidence indicates that the iterative strategy is more efficient than considering the whole training set at once.)

The heart of the algorithm is the procedure for building a decision tree for an arbitrary collection of objects, C, where C is not empty and contains objects of more than one class.

Let T be any test on an object x, with A1, A2, ... An representing the possible outcomes of T(x). T will therefore produce a partition {C1, C2, ...Cn} of C such that

Ci = {x | T(x) = Ai}.

If we continue recursively to replace each Ci with a decision tree, we would have a decision tree for C.

The crucial point in the problem reduction strategy is the choice of test.

(Want tree to provide answers (prune) quickly)

Which attributes to test? (Some attributes will yield more information than others.) For each subtree, one needs to find a good attribute for partitioning the objects.

Quinlan uses the notion of expected information from communication theory. (also called entropy and uncertainty (not to be confused with certainty factors))

In communication theory, expected information is a number describing a set of messages, M = {m1,m2, ..., mn}.

Each message mi in the set has probability p(mi) and contains an amount of information, I(mi), defined as

I(mi) = -log2 p(mi)

Thus information is an inverse monotonic function of probability.

The expected information or entropy of a message set U(M), is the sum of the information in the several possible messages multiplied by their probabilities:

U(M) = - i p(mi) log2 p(mi)        for i = 1 to n

Note: Such a formula gives you a high number when a test produces highly inhomogeneous sets and a low number when a test produces completely homogeneous sets.

Quinlan's assumptions:

Examples:

Suppose you had a set that contains members of just two classes, class A and class B. If the number of members from class A and the number of members of class B are perfectly balanced, the measured disorder is 1, the maximum possible value:

-1/2log21/2 - 1/2log21/2 = 1/2 + 1/2 = 1

On the other hand, if there were only all A's or all B's, the measured disorder is 0, the minimum possible value:

-1 log21 -0 log20 = -0 - 0 = 0

As you move from perfect balance and perfect homogeneity, disorder varies smoothly between 0 and 1. The disorder is 0 when the set is perfectly homogeneous, and the disorder is 1 when the set is perfectly inhomogeneous.

We want to minimize average disorder.

It is convenient to write this 'two-message' (P (+) and N (-)) case of entropy as a function of the two arguments:

EI(p,n) = -p/(p + n)*log2 p/(p+n) - n/(p + n)*log2 n/(p+n)

Given a test T with outcomes A1, A2, ... An producing a partition {C1, C2, ... Cn} as before, suppose Ci that contains pi objects of class P and ni objects of class N . The entropy of the subtree is given by EI(pi,ni). The entropy of the larger tree with T at its root is then given by the weighted average:

E(T) = i [(pi + ni) / (p + n)] * EI(pi ,ni)     for i=1 to n

The ratio (pi + ni)/ (p + n) corresponds to the weight for the ith branch in a tree. It represents the proportion of the objects in C that belong to Ci.

Heuristic: In deciding what attribute to branch on next, ID3 uses select the test that gains the most information.

The information gained by carrying out a test is given by

EI(p,n) - E(T)

Such heuristics are sometimes described as 'minimum entropy' because maximizing information gain corresponds to minimizing uncertainity or disorder.

Example: [Quinlan,1983]

Consider collection C [height,hair,eyes] below :

Selecting the second attribute to form the root yields .

Subcollections of 'dark' and 'red' contain objects of only a single class, and so require no more work.

Choosing the third attribute forms
.

With class names

Consider a "bad" choice of the first attribute at the start:

Another example (of the calculation):

Consider the choice of the first attribute of the tree:

The collection C of objects contains 3 in class '+' and 5 in '-' so for the entropy of the message set

Testing the first attribute gives the results (consider yourself...)

The entropy of the subtrees (or the information still needed for a rule for the "tall" and "short" branches)

Tall subtree
= -2/5 log2 2/5 - 3/5 log2 3/5 = 0.971

Short subtree
= -1/3 log2 1/3 - 2/3 log2 2/3 = 0.918

Thus the expected information content for this subtree

E(C, "height") = 5/8 * 0.971 + 3/8 * 0.918 = 0.951

The information gained by carrying out a test is given by

0.954 - 0.951 = 0.003 (which is negligible)

The tree arising from testing the second attribute:
         notice the first two subtrees are homogenous - disorder EI(p,n) for each is 0. The third subtree is evenly distributed, EI(p,n) = 1

E(C, "hair") = 3/8 * 0 + 1/8 * 0 + 4/8 * 1 = 0.5

The information gained by carrying out a test is 0.954-0.5 = 0.454

In a similar way the information gained by 'eyes' comes to 0.347

So we have:

The most information is gained using hair.
We have maximized information gain by minimizing uncertainity or disorder.

Other example (data about suntan lotion needs):
The average disorder produced by the various tests when the complete sample set is used as the window:

Test          Disorder

Hair          0.5
Height       0.69
Weight      0.94
Lotion       0.61

Which test should be used first? Why?

Test            Disorder

Height          0.5
Weight         1
Lotion          0

Which test?

Note how this techniques also shows that some attributes are irrelevant.

Pluses:

-- Handling Conflicts

-- Attributes with numeric values

  • see a game example from Laird's notes