Last lecture
Linear classification
- Perceptron (Rosenblatt) with binary labels $\{-1, 1\}$ and $x \in \mathbb{R}^D$
- Data is linearly separable if there exist $w, b$ such that $y_n(w^Tx_n + b) > 0$ for all $n$
- Predict $y = +1$ where $w^Tx + b > 0$ and $y = -1$ where $w^Tx + b < 0$
- If example $(x_n, y_n)$ is misclassified, update $\theta_{new} = \theta_{old} + y_nx_n$ (see the sketch after this list)
- Algorithm converges in at most $\dfrac{R^2}{\gamma^2}$ updates, where $R = \max_n \|x_n\|$.
- The margin $\gamma$ is the minimum distance from any training point to the decision boundary; maximizing it gives the most robust separator.
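A minimal NumPy sketch of the perceptron update described above. The function name `perceptron_train`, the bias-absorbing trick, and the `max_epochs` cap are my own additions, not from the lecture.

```python
import numpy as np

def perceptron_train(X, y, max_epochs=100):
    """Perceptron on labels y in {-1, +1}; X has shape (N, D).

    A bias term is absorbed by appending a constant-1 feature,
    so theta plays the role of (w, b) together.
    """
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # append bias feature
    theta = np.zeros(X_aug.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_n, y_n in zip(X_aug, y):
            if y_n * (theta @ x_n) <= 0:        # misclassified (or on the boundary)
                theta = theta + y_n * x_n       # update: theta_new = theta_old + y_n * x_n
                errors += 1
        if errors == 0:                          # converged: every point classified correctly
            break
    return theta
```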
Logistic regression
- Same linear decision boundary as before, but the output is a confidence (probability) rather than a hard label
- Soft decisions → learn $h_\theta(x) \in [0,1]$, surrogate for $p(y|x)$ posterior distribution
- Use the sigmoid function: $\sigma(a) = \dfrac{1}{1+e^{-a}}$
- How do we compute $\theta$? $\hat\theta = \mathrm{arg\:min}_\theta\, J(\theta)$ (one common choice of $J$ is sketched below)
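The notes do not spell out $J(\theta)$; a common choice is the negative log-likelihood. The sketch below assumes labels $y_n \in \{-1, +1\}$ (matching the perceptron section) and models $p(y_n \mid x_n) = \sigma(y_n\,\theta^Tx_n)$, so $J(\theta) = \sum_n \log\!\big(1 + e^{-y_n \theta^T x_n}\big)$.

```python
import numpy as np

def sigmoid(a):
    """sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def nll(theta, X, y):
    """Negative log-likelihood J(theta) for labels y in {-1, +1}.

    J(theta) = sum_n log(1 + exp(-y_n * theta^T x_n)).
    """
    margins = y * (X @ theta)
    # logaddexp(0, -m) = log(1 + exp(-m)), computed in a numerically stable way
    return np.sum(np.logaddexp(0.0, -margins))
```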
Optimization
Given a function $f(w)$, find its minimum (or maximum), where $f$ is called the objective function. Maximizing $f$ is the same as minimizing $-f$.
Gradient Descent
Stationary points: where $\nabla J(\theta) = \vec 0$. When the function is “well-behaved” (e.g. convex), a stationary point is the global minimum. Use gradient descent to find it.
Procedure
- Start at a random point $w_0$ and move toward the minimizer $w^*$ (a sketch of the full loop follows below).
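The notes cut off here, so the step size, stopping rule, and names below (`grad_J`, `lr`, `tol`) are assumptions; this is only a sketch of the generic procedure, not the lecture's exact algorithm.

```python
import numpy as np

def gradient_descent(grad_J, w0, lr=0.1, tol=1e-6, max_iters=1000):
    """Minimize J by repeatedly stepping in the direction of -grad J.

    grad_J : function returning the gradient of J at w
    w0     : random starting point (the notes' w_0)
    lr     : step size (assumed; not specified in the notes)
    """
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iters):
        g = grad_J(w)
        if np.linalg.norm(g) < tol:   # near a stationary point: grad J(w) ~ 0
            break
        w = w - lr * g                # step downhill
    return w

# Example: minimize J(w) = ||w||^2, whose gradient is 2w; the minimizer is w* = 0.
w_star = gradient_descent(lambda w: 2 * w, w0=np.array([3.0, -4.0]))
```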