Features: $(x, y)$ where $x \in \{0,\,1\}$ or $x\in\mathbb{R}$.
A split can test a binary feature directly, or apply a threshold to a real-valued feature.
We want to build a tree whose splits reduce the uncertainty in the label.
Mutual information measures how much knowing a random variable $X$ tells us about the value of another random variable $Y$:
$$ I(X;Y)=H(Y)-H(Y|X) $$
At each split, choose the feature that maximizes mutual information with the label.
Entropy: if $X$ takes one value with certainty, its entropy is 0. For a binary $X$ with $P(X=1) = 0.5$, the entropy is 1 bit, its maximum.
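Concretely, the entropy of a discrete random variable $X$ (in bits) is

$$ H(X) = -\sum_{x} P(X=x)\,\log_2 P(X=x), $$

so $H(X)=0$ when one outcome has probability 1, and for a fair coin $H(X) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$.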
$\mathcal{X}=\{0,1\}^3$ and $\mathcal{Y} = \{0,1\}$
$((x_1,x_2,x_3),y)$
$((1,1,1),1)$
$((1,0,0),1)$
$((1,1,0),0)$
$((0,0,1),0)$
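As a sketch of the split criterion on this dataset, the empirical entropy and mutual information can be computed in plain Python (the helper names `entropy` and `mutual_information` are my own, not from the notes):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) = -sum_y p(y) log2 p(y) over the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(Y) - H(Y|X), with H(Y|X) a weighted average of
    label entropies over each value of the feature."""
    n = len(ys)
    cond = 0.0
    for v in set(xs):
        subset = [y for x, y in zip(xs, ys) if x == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(ys) - cond

# The four training examples ((x1, x2, x3), y) listed above
data = [((1, 1, 1), 1), ((1, 0, 0), 1), ((1, 1, 0), 0), ((0, 0, 1), 0)]
ys = [y for _, y in data]
for i in range(3):
    xs = [x[i] for x, _ in data]
    print(f"I(x{i+1};Y) = {mutual_information(xs, ys):.3f}")
# → I(x1;Y) = 0.311, I(x2;Y) = 0.000, I(x3;Y) = 0.000
```

Only $x_1$ carries information about $y$ here ($I(x_1;Y) \approx 0.311$ bits, while $x_2$ and $x_3$ give 0), so $x_1$ would be chosen as the root split.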