First decision, at the root of the tree: which attribute to split on?
Choose the one that gives you the most information:
If an attribute yields the same class distribution in every branch, it tells you nothing, so don't choose it
Can use misclassification error to decide which split gives more information; however, it is not a great heuristic
Two splits can have the same misclassification error even though one gives a definite answer on some branch and the other does not
Can use entropy to measure uncertainty:
$-\sum_{k=1}^K P(X=a_k)\log P(X=a_k)$
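The entropy formula above can be sketched directly; this is a minimal illustration (function name and use of base-2 logs, giving entropy in bits, are my choices, not from the notes):

```python
import math

def entropy(probs):
    """Entropy in bits: -sum_k p_k * log2(p_k) over a discrete distribution."""
    # Terms with p = 0 contribute nothing (p * log p -> 0), so skip them.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))          # fair coin: 1 bit
print(entropy([1.0]))               # certain outcome: 0 bits
print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 outcomes: 2 bits
```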
Example:
$X\in \{0,1\}$
$P(X=0) = p$
$P(X=1)= 1- p$
$H_2 (X) = p\cdot\log \dfrac{1}{p} + (1-p)\log \dfrac{1}{1-p}$
(Concave in $p$, not quadratic: the closer $p$ is to $0.5$, the more uncertainty.)
Max entropy: uniform distribution
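The binary-entropy example $H_2(X)$ can be checked numerically; a short sketch (the function name `h2` is my own) showing that entropy peaks at the uniform distribution $p = 0.5$ and is symmetric about it:

```python
import math

def h2(p):
    """Binary entropy H2(p) = p*log2(1/p) + (1-p)*log2(1/(1-p)), in bits."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome has zero entropy
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

# Entropy rises toward p = 0.5 and falls off symmetrically toward 0 and 1.
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"p = {p:.1f}  H2 = {h2(p):.3f}")
```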
Choosing one feature over another: