Overview
- Nearest neighbor classification
- Use label of nearest neighbor (NN) in dataset
- Based on choice of distance measure
- Can order points based on distance to x
- 1-NN Voronoi tessellation: each training point's cell contains every query point for which it is the nearest neighbor
- 1-NN predicts the label of the single nearest neighbor
- k-NN predicts the majority label among the k nearest neighbors (see the sketch after this list)
- Implication: 1-NN can be unstable and susceptible to outliers; larger k smooths this
- Hyperparameters: choice of k, choice of distance measure
- For categorical data, use one-hot encoding
- Note that no "training" is required; all work happens at query time
- Asymptotically, the 1-NN error rate is within a factor of 2 of the optimal (Bayes) classifier's error
- More dimensions increase error (curse of dimensionality), so more data points are needed
- Kernel methods
- Use a feature map to lift the features into a different space; the mapped space may have a much higher dimension
- This can increase computation and the risk of overfitting
- Can we get the advantages of operating in the higher-dimensional space while keeping the computation the same?
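A minimal k-NN classifier sketch tying the nearest-neighbor bullets together; the Euclidean distance and the toy arrays are assumptions for illustration, not from the notes:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest neighbors."""
    # Choice of distance measure matters; Euclidean distance is assumed here.
    dists = np.linalg.norm(X_train - x, axis=1)
    # Order points by distance to x and keep the k closest (k=1 gives 1-NN).
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# No "training" step: all work happens at query time. Toy data (assumed).
X_train = np.array([[0.0, 0.0], [1.0, 0.1], [5.0, 5.0], [5.1, 4.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.2]), k=3))  # -> 0
```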
Kernel methods
$\phi(x) = [1, \sqrt{2}x_1, \dots, \sqrt{2}x_D, x_1^2, x_1x_2, \dots, x_1x_D, \dots, x_D^2]$, with $M = O(D^2)$ entries (all pairs of coordinates of $x$).
Would want to compute $\phi(u)^T\phi(x)$
Naively this would need $O(M) = O(D^2)$ operations.
How can we do it in $O(D)$, the cost of computing $u^Tx$?
$\phi(x) = [1, \sqrt{2}x_1, \dots, \sqrt{2}x_D, x_1^2, x_1x_2, \dots, x_D^2]$
$\phi(u) = [1, \sqrt{2}u_1, \dots, \sqrt{2}u_D, u_1^2, u_1u_2, \dots, u_D^2]$
$\phi(u)^T\phi(x) = 1 + 2u_1x_1 + \dots + 2u_Dx_D + u_1^2x_1^2 + u_1u_2x_1x_2 + \dots + u_D^2x_D^2$
$=1 + 2\sum u_ix_i + \sum_i \sum_j x_ix_ju_iu_j$
$\sum_i \sum_j x_ix_ju_iu_j = \sum_i x_iu_i \sum_j x_ju_j = \left(\sum_l x_lu_l\right)^2$
$=1 + 2\sum u_i x_i + (\sum u_i x_i)^2$
$= (1+ \sum u_i x_i) ^2$
$= (1+u^Tx)^2$
We constructed a particular feature map whose inner product $\phi(u)^T\phi(x) = (1+u^Tx)^2$ can be computed in $O(D)$ instead of $O(M) = O(D^2)$.
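A quick numerical check of this identity; the explicit layout of the feature map below (constant term, $\sqrt{2}x_i$ terms, and all pairwise products) is one concrete choice and is meant only as a sketch:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map: [1, sqrt(2)*x_i ..., all products x_i*x_j]."""
    pairs = np.outer(x, x).ravel()                             # x_i * x_j for all i, j
    return np.concatenate(([1.0], np.sqrt(2.0) * x, pairs))    # length 1 + D + D^2 = O(D^2)

def kernel(u, x):
    """Same inner product computed in O(D): (1 + u^T x)^2."""
    return (1.0 + u @ x) ** 2

rng = np.random.default_rng(0)
u, x = rng.normal(size=5), rng.normal(size=5)
print(np.isclose(phi(u) @ phi(x), kernel(u, x)))  # True
```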
Kernelized Ridge Regression
$\tilde J(\theta) = ||y-\Phi\theta||^2 + \lambda ||\theta||^2$
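The notes stop at this objective; as a sketch of the kernelized solution, the standard dual form (not derived here) takes $K = \Phi\Phi^T$, solves $\alpha = (K + \lambda I)^{-1}y$, and predicts $f(x) = \sum_i \alpha_i k(x_i, x)$. The degree-2 kernel and toy data below are assumptions:

```python
import numpy as np

def poly2_kernel(A, B):
    """Pairwise degree-2 polynomial kernel: k(u, x) = (1 + u^T x)^2."""
    return (1.0 + A @ B.T) ** 2

def kernel_ridge_fit(X, y, lam):
    """Dual ridge solution alpha = (K + lam*I)^{-1} y (standard result, assumed here)."""
    K = poly2_kernel(X, X)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, alpha, X_new):
    """f(x) = sum_i alpha_i * k(x_i, x)."""
    return poly2_kernel(X_new, X_train) @ alpha

# Toy usage with a quadratic target (assumed data).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X[:, 0] * X[:, 1] + 0.01 * rng.normal(size=50)
alpha = kernel_ridge_fit(X, y, lam=1e-2)
print(kernel_ridge_predict(X, alpha, X[:5]))
```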