Induction of decision trees: Entropy, ID3, C4, and C4.5

Resources

Video walkthrough

Video commentary in Polish (8 parts).

Hidden answers

Behind the gray rectangles you will find hidden answers. Click to reveal an answer only after you have written down your own answer or solution. Revealing an answer should let you verify your knowledge or your thinking; it should not be used for initial learning. Remember: what has been seen cannot be unseen. If you use the answers for initial learning, you will not be able to verify your knowledge.

Entropy

  1. Write down the formula for entropy of n events (i = 1..n) with probabilities p_i. H(X)=...
  2. What is the interpretation of entropy?
  3. What is the range of entropy values?
  4. What are the units of entropy?
  5. What is the interpretation and effect of the logarithm base?
  6. Consider the examples of a perfect ten-sided die, a perfect six-sided die, a perfect coin, an imperfect six-sided die, and an imperfect coin. A perfect e-sided die?
  7. Calculate entropy for n=2, p1=p2=0.5
  8. Calculate entropy for any n, pi=1/n
  9. What is 0·log 0?
  10. What rule allows us to calculate the limit of x·log x as x→0? L'Hôpital's rule. Calculate this limit properly.
  11. Calculate entropy for n=2, p1=1, p2=0.
  12. Provide a simple, intuitive interpretation for full conditional entropy. Disorder in subsets weighted by their size.
  13. How to calculate log_2 x based on ln x?
  14. Open a spreadsheet (or some programming console) and prepare a formula (or a function) that computes log_2 x (see the sketch after this list).
  15. Now that we know everything about entropy, meet its competitor, Gini impurity (explanation).
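
A minimal Python sketch of points 8, 13 and 14 (the function names are my own): log_2 computed from ln, and the entropy of a discrete distribution, with 0·log 0 taken as its limit value 0.

import math

def log2(x):
    # log_2 x = ln x / ln 2
    return math.log(x) / math.log(2.0)

def entropy(probs):
    # Shannon entropy in bits; 0*log(0) is treated as 0, its limit value.
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 (a perfect coin)
print(entropy([1/6] * 6))    # ~2.585 (a perfect six-sided die)
print(entropy([1.0, 0.0]))   # 0.0 (no uncertainty)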

A sample data set

  Height Hair  Eyes  Attractiveness
1 short  blond blue  +
2 tall   blond brown −
3 tall   red   blue  +
4 short  dark  blue  −
5 tall   dark  blue  −
6 tall   blond blue  +
7 tall   dark  brown −
8 short  blond brown −

ID3 ("Iterative Dichotomiser 3") algorithm and information gain

  1. Write down a general formula for information gain of some attribute a that is used to split cases classified as decision d. Information gain(a)=H(d)-H(d|a)
  2. Calculate H(Attractiveness)=-3/8 log 3/8 - 5/8 log 5/8 = 0.955
  3. Draw a tree with these 8 cases split into two subsets of different Height.
  4. Calculate H(Attractiveness|Height)=+3/8 (-1/3 log 1/3 - 2/3 log 2/3) + 5/8 (-2/5 log 2/5 - 3/5 log 3/5) = 0.951
  5. Information gain(Height)=0.955 - 0.951 = 0.004
  6. Now, draw a tree with these 8 cases split into subsets of different Hair color. blond: 1+,2-,6+,8-; red: 3+; dark: 4-,5-,7-
  7. Calculate H(Attractiveness|Hair)=+4/8 (-1/2 log 1/2 - 1/2 log 1/2) + 1/8 * 0 + 3/8 * 0 = 0.5
  8. Information gain(Hair)=0.955 - 0.5 = 0.455
  9. Now, draw a tree with these 8 cases split into subsets of different Eyes color. blue: 1+,3+, 4-, 5-, 6+; brown: 2-, 7-, 8-
  10. Calculate H(Attractiveness|Eyes)=+5/8 (-3/5 log 3/5 - 2/5 log 2/5) + 3/8 * 0 = 0.607
  11. Information gain(Eyes)=0.955 - 0.607 = 0.348
  12. So the best top-level attribute is Hair (a programmatic check of these information gains follows this list).
  13. We have to sort out the left leaf, but which attribute will surely not help us in discriminating cases 1,2,6,8? Hair.
    1. Consider the following four cases and try to split them according to Height and Eyes.
      Height Eyes  Attractiveness
      1 short blue  +
      2 tall  brown −
      6 tall  blue  +
      8 short brown −
    2. Calculate H(Attractiveness)=1
    3. Calculate H(Attractiveness|Height)=1/2 * 1 + 1/2 * 1 = 1
    4. Calculate Information gain(Height)=0
    5. Calculate H(Attractiveness|Eyes)=0
    6. Calculate Information gain(Eyes)=1
  14. Draw the complete final tree. Which attribute was not used at all? Height.
  15. Which attributes will always be favored by information gain? Which attributes will always be best at discriminating cases? attributes with many values, such as IDs or fine-grained numeric attributes (e.g. height in cm), etc.
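
A small Python check of the information gains computed above, on the 8-case data set from the table (the helper names are my own; tiny differences from the values above are only due to rounding).

from collections import Counter
from math import log2

# The sample data set: (Height, Hair, Eyes, Attractiveness).
cases = [
    ("short", "blond", "blue",  "+"),
    ("tall",  "blond", "brown", "-"),
    ("tall",  "red",   "blue",  "+"),
    ("short", "dark",  "blue",  "-"),
    ("tall",  "dark",  "blue",  "-"),
    ("tall",  "blond", "blue",  "+"),
    ("tall",  "dark",  "brown", "-"),
    ("short", "blond", "brown", "-"),
]

def entropy(values):
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def conditional_entropy(cases, attr_index):
    # H(d|a): disorder in the subsets induced by attribute a, weighted by subset size.
    groups = {}
    for case in cases:
        groups.setdefault(case[attr_index], []).append(case[-1])
    return sum(len(g) / len(cases) * entropy(g) for g in groups.values())

h_d = entropy([c[-1] for c in cases])                 # ~0.954
for name, i in [("Height", 0), ("Hair", 1), ("Eyes", 2)]:
    print(name, round(h_d - conditional_entropy(cases, i), 3))
# Height 0.003, Hair 0.454, Eyes 0.348 (0.004 / 0.455 / 0.348 above, with rounded intermediates)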

Rules from the tree

  1. Would it be possible to generate decision rules from the decision tree? How?
  2. Could these rules be simplified, given that every part of the tree was grown because it was needed, necessary and non-redundant?
  3. Generate (write down) all rules based on the last tree.
  4. Look carefully and tell whether they can be simplified. yes. Why?
  5. Can any tree be transformed into decision rules? If not, which trees cannot?
  6. Can any set of rules be transformed into a decision tree? If not, which sets cannot?
  7. Propose at least two algorithms for simplifying rules (dropping conditions) so that the rules remain perfectly correct. Provide the complexity of these algorithms. linear (from left to right, drop and check), O(n); exponential (checking all subsets of conditions), O(2^n). A sketch of the linear variant follows this list.
  8. Try to simplify your set of rules using both algorithms.
  9. Which is more general: trees or rules? Why? Draw a 2D diagram (the axes are attribute values) and compare what kinds of areas trees and rules are able to cover.
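
A sketch of the linear "drop and check" simplification from point 7, run against the same 8 cases; the rule representation and function names are my own illustration, not the exact C4.5 procedure.

# Each case is a dict of attribute values plus the decision "d".
cases = [
    {"Height": "short", "Hair": "blond", "Eyes": "blue",  "d": "+"},
    {"Height": "tall",  "Hair": "blond", "Eyes": "brown", "d": "-"},
    {"Height": "tall",  "Hair": "red",   "Eyes": "blue",  "d": "+"},
    {"Height": "short", "Hair": "dark",  "Eyes": "blue",  "d": "-"},
    {"Height": "tall",  "Hair": "dark",  "Eyes": "blue",  "d": "-"},
    {"Height": "tall",  "Hair": "blond", "Eyes": "blue",  "d": "+"},
    {"Height": "tall",  "Hair": "dark",  "Eyes": "brown", "d": "-"},
    {"Height": "short", "Hair": "blond", "Eyes": "brown", "d": "-"},
]

def is_correct(conditions, decision, cases):
    # A rule is perfectly correct if every case it covers has the rule's decision.
    covered = [c for c in cases
               if all(c[attr] == val for attr, val in conditions.items())]
    return all(c["d"] == decision for c in covered)

def simplify_linear(conditions, decision, cases):
    # Try dropping each condition once, left to right; keep a drop only if the
    # rule stays perfectly correct. O(n) correctness checks for n conditions.
    kept = dict(conditions)
    for attr in list(conditions):
        trial = {a: v for a, v in kept.items() if a != attr}
        if is_correct(trial, decision, cases):
            kept = trial
    return kept

# "Hair=dark and Eyes=brown -> -" simplifies to "Eyes=brown -> -" (Hair is tried
# and dropped first); trying Eyes first would instead leave "Hair=dark -> -".
print(simplify_linear({"Hair": "dark", "Eyes": "brown"}, "-", cases))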

ID3 algorithm and information gain ratio

  1. Information gain causes a bias towards attributes with many values. Information gain ratio(a)=Information gain(a) / H(a)
  2. How is the entropy of an attribute different from simply "the number of distinct values of an attribute"? With entropy, the distribution of cases over the attribute's values is captured (as a continuous value), not just the number of distinct values.
  3. Calculate H(Height)=-3/8 log 3/8 - 5/8 log 5/8 = 0.955
  4. Calculate H(Hair)=-4/8 log 4/8 - 1/8 log 1/8 - 3/8 log 3/8 = 1.406
  5. Calculate H(Eyes)=-3/8 log 3/8 - 5/8 log 5/8 = 0.955
  6. Calculate Information gain ratio(Height)=0.0042
  7. Calculate Information gain ratio(Hair)=0.324
  8. Calculate Information gain ratio(Eyes)=0.364
  9. This time the winning attribute is Eyes (a gain-ratio check in code follows this list).
  10. Draw a tree with Eyes at the root. blue: 1+, 3+, 4-, 5-, 6+ and brown: 2-, 7-, 8-.
    1. Sort out the left leaf.
      Height Hair  Attractiveness
      1 short blond +
      3 tall  red   +
      4 short dark  −
      5 tall  dark  −
      6 tall  blond +
    2. Calculate H(Attractiveness)=-2/5 log 2/5 - 3/5 log 3/5 = 0.971
    3. Calculate H(Height)=-2/5 log 2/5 - 3/5 log 3/5 = 0.971
    4. Calculate H(Hair)=-2/5 log 2/5 - 1/5 log 1/5 - 2/5 log 2/5= 1.522
    5. Calculate H(Attractiveness|Height)=2/5 * 1 + 3/5 * (-1/3 log 1/3 - 2/3 log 2/3) = 0.951
    6. Calculate H(Attractiveness|Hair)=0
    7. We could calculate the information gain ratios for Height and Hair, but Hair will clearly be the better one.
  11. Draw a complete final tree. Which attribute was not used at all? Height.
  12. Generate (write down) all rules based on the tree. How do they compare to the rules generated from information gain-based tree?
  13. Try to simplify your set of rules. Is it possible? How do they compare to the rules generated from information gain-based tree and simplified?
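
The same kind of check for the gain ratios in points 6-8, again with data and helper names of my own (self-contained, so the 8 cases are repeated).

from collections import Counter
from math import log2

cases = [
    ("short", "blond", "blue",  "+"), ("tall",  "blond", "brown", "-"),
    ("tall",  "red",   "blue",  "+"), ("short", "dark",  "blue",  "-"),
    ("tall",  "dark",  "blue",  "-"), ("tall",  "blond", "blue",  "+"),
    ("tall",  "dark",  "brown", "-"), ("short", "blond", "brown", "-"),
]

def entropy(values):
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def gain_ratio(cases, attr_index):
    groups = {}
    for c in cases:
        groups.setdefault(c[attr_index], []).append(c[-1])
    cond = sum(len(g) / len(cases) * entropy(g) for g in groups.values())
    gain = entropy([c[-1] for c in cases]) - cond
    return gain / entropy([c[attr_index] for c in cases])   # divide by H(attribute)

for name, i in [("Height", 0), ("Hair", 1), ("Eyes", 2)]:
    print(name, round(gain_ratio(cases, i), 3))
# Height 0.003, Hair 0.323, Eyes 0.364 (0.0042 / 0.324 / 0.364 above, with rounded intermediates)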

ID3 algorithm and inconsistency in data

  1. Imagine we have case #9, which is identical to case #7 except that it is Attractive (+).
  2. Run ID3 with the "information gain" measure and see what tree you get.
  3. Can you use any information from your data to distinguish between cases 7 and 9? no; the only attribute left in that branch, Height, does not distinguish these cases, and by the definition of inconsistency nothing can discriminate between them.
  4. Redraw the tree above without any IDs of cases. Instead of IDs, in every node and in every leaf put two values: a decision class that will minimize classification errors made by the node/leaf, and the number of errors caused in the node/leaf by assuming such decision class.

C4 algorithm and pruning

  1. C4 = ID3 + cost-complexity pruning.
  2. In reality, data is imperfect (inconsistent) and we do not want to perfectly reflect all individual cases in our tree (knowledge). We want to generalize.
  3. We want to avoid overly complex trees. How to measure complexity of a tree? number of nodes, number of leaves, number of levels, ...
  4. An "ideal" tree should be both error-free and simple (small).
  5. Suggest your own approach to balance complexity and inaccuracy of a decision tree.
  6. Cost-complexity: minimize E/N + αL, where E is the number of classification errors, N is the number of cases, and L is the number of leaves.
  7. Instead of aggregating these two values, one could perform a multi-criteria analysis based on dominance.
  8. What is the interpretation of α? a trade-off: how many new errors removing one leaf is allowed to introduce, or how many pruned leaves one extra error is worth
  9. Once we have induced a tree, we discard the cases and keep only the classification. When a node contains multiple decision classes, we assign the majority class.
  10. C4 considers all possible cuts. Assume α=0.01 and do it. Which pruning possibility wins? (The costs below are recomputed in the sketch after this list.)
    • Original unpruned tree: N=9, E=1, L=5, cost=1/9 + 0.01*5 = 0.161
    • Another pruning possibility: cost=1/9 + 0.01*4 = 0.151
    • Another pruning possibility: cost=3/9 + 0.01*4 = 0.373
    • Another pruning possibility: cost=3/9 + 0.01*3 = 0.363
    • Prune everything: cost=4/9 + 0.01*1 = 0.454
    • So, the best was the tree with pruned Eyes=blue/brown split within Hair=dark.
    • After pruning, all people with dark hair (based on cases 4,5,7,9) are considered unattractive.
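
A quick recomputation of the costs in point 10 (α = 0.01 and N = 9, i.e. the 8 cases plus case #9). The E and L values are taken from the list above; the candidate labels are my own reading of which cut each line corresponds to.

# Cost-complexity: cost = E/N + alpha * L for each candidate pruning.
alpha, n = 0.01, 9

candidates = [
    ("unpruned tree",                      1, 5),
    ("prune Eyes split under Hair=dark",   1, 4),
    ("prune Eyes split under Hair=blond",  3, 4),
    ("prune both Eyes splits",             3, 3),
    ("prune everything (root only)",       4, 1),
]

for name, errors, leaves in candidates:
    print(f"{name}: cost = {errors / n + alpha * leaves:.3f}")
# The minimum (~0.151) is the tree with the Eyes split under Hair=dark pruned,
# matching the conclusion above.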

C4.5 algorithm and pruning

  1. C4.5 = ID3 + pruning based on U_CF(E,N).
  2. "N" is the number of cases reaching the node, leaf, or rule under consideration.
  3. "E" is the number of classification errors made there.
  4. Assuming confidence level = 25%, the predicted error rate for the future (an upper limit on the probability of an error) is π=U25%(E,N). A numeric sketch of computing this limit follows this list.
  5. Binomial distribution:
    p(i,N) = (N choose i) π^i (1-π)^(N-i)
    Sum(i=0..E) p(i,N) = 0.25
  6. Calculate U25%(0,1)=0.75
  7. "If you have one case and you learn to classify it correctly, in the future you will make 75% of errors."
  8. Calculate U25%(0,2)=0.5
  9. Calculate U25%(1,2)=sqrt(0.75) ≈ 0.866
  10. U25%(1,4) ≈ 0.55
  11. U25%(2,4) ≈ 0.757
  12. Pruning in C4.5 proceeds from the lower levels of the tree upward. If pruning succeeds at a lower level, it continues to the higher levels; if pruning would increase the predicted number of errors, it stops.
  13. Look at the right sub-tree: we have cases 4,5,7,9 split by Eyes into blue (4,5)− or brown (7,9)+. Should we prune?
    1. Calculate the number of errors before pruning: #E = 2*U25%(0,2)+2*U25%(1,2) ≈ 2.73
    2. Calculate the number of errors after pruning: #E = 4*U25%(1,4) ≈ 2.2
    3. Should we prune? yes
  14. Look at the left sub-tree: we have cases 1,2,6,8 split by Eyes into blue (1,6)+ or brown (2,8)−. Should we prune?
    1. Calculate the number of errors before pruning: #E = 2*U25%(0,2)+2*U25%(0,2) = 2
    2. Calculate the number of errors after pruning: #E = 4*U25%(2,4) ≈ 3.028
    3. Should we prune? no (and don't try to prune upper levels)
  15. Dropping conditions in rules works the same way.
    1. Hair=dark and Eyes=brown → Attractiveness=−       N=2, E=1, U25%(E,N)=0.87
    2. Eyes=brown → Attractiveness=−       N=4, E=1, U25%(E,N)=0.55
    3. Hair=dark → Attractiveness=−       N=4, E=1, U25%(E,N)=0.55
    4. Compare predicted error rates. Should we drop? yes. Should we try to drop more? we just succeeded in dropping, so yes.
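
A numeric sketch of U_CF(E,N) as defined in point 5, solved by bisection (the function name is my own); it reproduces the values used in points 6-14.

from math import comb

def upper_error_limit(errors, n, cf=0.25):
    # U_CF(E,N): the error probability pi at which observing at most E errors
    # in N cases has probability exactly CF (the binomial upper confidence limit).
    def prob_at_most_e(p):
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(errors + 1))
    lo, hi = 0.0, 1.0
    for _ in range(60):                 # bisection on the monotone tail probability
        mid = (lo + hi) / 2
        if prob_at_most_e(mid) > cf:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(upper_error_limit(0, 1), 3))   # 0.75
print(round(upper_error_limit(0, 2), 3))   # 0.5
print(round(upper_error_limit(1, 2), 3))   # 0.866 (= sqrt(0.75))
print(round(upper_error_limit(1, 4), 3))   # 0.544 (≈ 0.55 above)
print(round(upper_error_limit(2, 4), 3))   # 0.757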

Missing attribute values

  1. Come up with various ways of avoiding this situation. Discuss advantages and disadvantages of each strategy.
    • ignore rows (cases) with missing values
    • ignore columns (attributes) with missing values
    • interpret missing as a special value "?"
    • replace missing with the most frequent value of the attribute
    • replace missing with the most frequent value of the attribute for a given decision
    • replace missing with all possible values of that attribute
    • replace missing with all possible values of that attribute for a given decision
    • try to find a similar/identical case that has no missing values, and use this knowledge to fill in the missing value
    • ...
  2. C4.5 does not avoid this situation; it handles it!
  3. Let's assume we have 8 cases (as in the table above), but case #5 has unknown Eyes.
  4. When choosing an attribute, C4.5 penalizes attributes with missing values in the following ways:
    1. (7 examples) H(Attractiveness)=-3/7 log 3/7 - 4/7 log 4/7 ≈ 0.985
    2. (7 examples) H(Attractiveness|Eyes)=4/7 (-1/4 log 1/4 - 3/4 log 3/4) + 3/7 * 0 ≈ 0.463
    3. Information gain(Eyes) = 7/8 (0.985 - 0.463) ≈ 0.457
      If we wanted to use the information gain ratio:
    4. (8 examples – bigger entropy in denominator, so gain ratio will be lower) H(Eyes) = -4/8 log 4/8 - 3/8 log 3/8 - 1/8 log 1/8 ≈ 1.406
    5. Information gain ratio(Eyes) = ...
    6. In our case, despite this penalty, Eyes is still the best attribute. So case #5 is split between the branches in proportion 4/7 to 3/7 (a numeric check of this gain follows this list).
  5. Oh no! Now we have fractions of cases in the tree! What changes are needed in the formulas introduced so far (entropy, information gain) to handle fractional numbers? No changes are needed. As we learned, entropy as a continuous function can be computed for fractions with no problem, and it handles such situations reasonably.
  6. Sort out cases in the left leaf.
    • Height:
      H(Attractiveness|Height) ≈ 2/4.6 * 1 + 2.6/4.6 (- 0.6/2.6 log 0.6/2.6 - 2/2.6 log 2/2.6)
    • Hair:
      (draw the tree)
      H(Attractiveness|Hair) = ... compute it, and the tree will tell whether Hair is the better choice.
  7. So we choose Hair. Imagine that the Hair color of case #1 is also missing. The tree would then have two cases split multiple times between branches, but all mathematical formulas would work without the need for any changes.
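
A small check of points 4.1-4.3 (helper names are my own): Gain(Eyes) is computed from the 7 cases with a known Eyes value and then scaled by 7/8, the fraction of cases where Eyes is known.

from collections import Counter
from math import log2

# (Eyes, Attractiveness) for the 7 cases with a known Eyes value (case #5 left out).
known = [
    ("blue", "+"), ("brown", "-"), ("blue", "+"), ("blue", "-"),
    ("blue", "+"), ("brown", "-"), ("brown", "-"),
]

def entropy(values):
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

h_d = entropy([d for _, d in known])         # ~0.985
groups = {}
for eyes, d in known:
    groups.setdefault(eyes, []).append(d)
h_cond = sum(len(g) / len(known) * entropy(g) for g in groups.values())   # ~0.464
print(round(7 / 8 * (h_d - h_cond), 3))      # 0.456 (≈ 0.457 above, with rounded intermediates)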

What's new in C5.0?

C5.0 is improved in terms of performance and the quality of the induced trees, but the foundations discussed above remain the same.