Idea: Close points have the same or similar labels
| . . X X
| . X X
| . X
| . * X
| . .
|
| .
---------------------------
What label should * take?
Questions: What distance function to use? What features?
Select several candidate features, plot them against each other, and see which combinations cluster the classes most tightly.
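To make the distance question concrete, here is a minimal sketch of two common choices, Euclidean (L2) and Manhattan (L1) distance. The notes do not say which distance is used, so both functions and the sample points are illustrative assumptions.

```python
import math

def euclidean(a, b):
    """L2 distance: straight-line distance between two feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def manhattan(a, b):
    """L1 distance: sum of absolute per-feature differences."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

# Hypothetical 2-D points, chosen only for illustration.
p = (0.0, 0.0)
q = (3.0, 4.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```

The two distances can rank neighbors differently, so the choice interacts with the choice of features and their scales.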
Inductive bias: the label of a point (instance) is similar to the labels of nearby points.
Instance feature vectors: $x\in \mathbb{R}^D$
Label: $y\in [C] = \{1,2,\dots,C\}$
Function: $y=h(x)$
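One way to realize $h$ under the inductive bias above is a majority vote among the nearest stored points. The sketch below assumes Euclidean distance and a neighbor count `k`, neither of which is fixed by these notes. The toy data is hypothetical, with label 1 standing in for `.` and label 2 for `X`.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def h(x, train_X, train_y, k=3, dist=euclidean):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Sort training indices by distance to x and keep the k closest.
    nearest = sorted(range(len(train_X)), key=lambda i: dist(x, train_X[i]))[:k]
    votes = Counter(train_y[i] for i in nearest)
    # Return the most frequent label among the k neighbors.
    return votes.most_common(1)[0][0]

# Hypothetical training set: label 1 plays the role of '.', label 2 of 'X'.
train_X = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (4.0, 4.0), (4.0, 5.0), (5.0, 4.0)]
train_y = [1, 1, 1, 2, 2, 2]

print(h((1.5, 1.5), train_X, train_y))  # 1
print(h((4.5, 4.5), train_X, train_y))  # 2
```

With `k=1` this reduces to copying the label of the single closest point. An odd `k` avoids tied votes when there are only two classes.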
Training data: $N$ samples used for learning.
Validation data: $M$ samples used for assessing how well the function will do on unseen $x$.
Training and validation data should not overlap.
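A disjoint split and a validation-accuracy check might look like the following sketch. The helper names (`split`, `predict_1nn`, `accuracy`), the fixed seed, and the toy data are all assumptions made for illustration. The predictor is a plain 1-nearest-neighbor rule with Euclidean distance.

```python
import math
import random

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def predict_1nn(x, train_X, train_y):
    """Return the label of the single closest training point."""
    i = min(range(len(train_X)), key=lambda j: euclidean(x, train_X[j]))
    return train_y[i]

def split(X, y, n_train, seed=0):
    """Shuffle indices once, then cut them into disjoint training and validation parts."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    train_idx, val_idx = idx[:n_train], idx[n_train:]
    return ([X[i] for i in train_idx], [y[i] for i in train_idx],
            [X[i] for i in val_idx], [y[i] for i in val_idx])

def accuracy(train_X, train_y, val_X, val_y):
    """Fraction of validation points whose predicted label matches the true label."""
    correct = sum(predict_1nn(x, train_X, train_y) == t for x, t in zip(val_X, val_y))
    return correct / len(val_y)

# Hypothetical data: two well-separated clusters with N = 6 and M = 2 after the split.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
     (9.0, 9.0), (9.0, 10.0), (10.0, 9.0), (10.0, 10.0)]
y = [1, 1, 1, 1, 2, 2, 2, 2]

train_X, train_y, val_X, val_y = split(X, y, n_train=6)
print(len(train_X), len(val_X))  # 6 2
print(accuracy(train_X, train_y, val_X, val_y))
```

Because every index lands in exactly one of the two parts, no sample is used both for learning and for assessment. If a validation point were also a training point, its nearest neighbor would be itself and the accuracy estimate would be inflated.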
Nearest-neighbor classification stores the entire training set; no explicit model is learned.