Last lecture
Linear classification
- Perceptron (Rosenblatt) with binary labels $\{-1, 1\}$ and $x \in \mathbb{R}^D$
- Data is linearly separable if there exist $w, b$ such that $y_n(w^Tx_n + b) > 0$ for all $n$
- Predict $y = +1$ where $w^Tx + b > 0$ and $y = -1$ where $w^Tx + b < 0$
- If example $(x_n, y_n)$ is misclassified, update $\theta_{new} = \theta_{old} + y_nx_n$ (see the sketch after this list)
- Algorithm converges in at most $\dfrac{R^2}{\gamma^2}$ updates, where $R = \max_n \|x_n\|$.
- The margin $\gamma$ is the minimum distance from any training point to the decision boundary; maximizing it gives the most robust separator.
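A minimal NumPy sketch of the perceptron update described above. The function name `perceptron_train`, the bias-absorbing trick, and the `max_epochs` cap are my own additions, not from the lecture.

```python
import numpy as np

def perceptron_train(X, y, max_epochs=100):
    """Perceptron on labels y in {-1, +1}; X has shape (N, D).

    A bias term is absorbed by appending a constant-1 feature,
    so theta plays the role of (w, b) together.
    """
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # append bias feature
    theta = np.zeros(X_aug.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_n, y_n in zip(X_aug, y):
            if y_n * (theta @ x_n) <= 0:        # misclassified (or on the boundary)
                theta = theta + y_n * x_n       # update: theta_new = theta_old + y_n * x_n
                errors += 1
        if errors == 0:                          # converged: every point classified correctly
            break
    return theta
```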
Logistic regression
- Same linear decision boundary as before, but the output is a confidence (probability) rather than a hard label
- Soft decisions → learn $h_\theta(x) \in [0,1]$, surrogate for $p(y|x)$ posterior distribution
- Use the sigmoid function: $\sigma(a) = \dfrac{1}{1+e^{-a}}$
- How do we compute $\theta$? $\hat\theta = \mathrm{arg\:min}_\theta\, J(\theta)$ (one common choice of $J$ is sketched below)
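The notes do not spell out $J(\theta)$; a common choice is the negative log-likelihood. The sketch below assumes labels $y_n \in \{-1, +1\}$ (matching the perceptron section) and models $p(y_n \mid x_n) = \sigma(y_n\,\theta^Tx_n)$, so $J(\theta) = \sum_n \log\!\big(1 + e^{-y_n \theta^T x_n}\big)$.

```python
import numpy as np

def sigmoid(a):
    """sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def nll(theta, X, y):
    """Negative log-likelihood J(theta) for labels y in {-1, +1}.

    J(theta) = sum_n log(1 + exp(-y_n * theta^T x_n)).
    """
    margins = y * (X @ theta)
    # logaddexp(0, -m) = log(1 + exp(-m)), computed in a numerically stable way
    return np.sum(np.logaddexp(0.0, -margins))
```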
Optimization
Given a function $f(w)$, find its minimum (or maximum), where $f$ is called the objective function. Maximizing $f$ is the same as minimizing $-f$.
Gradient Descent
Stationary points: where $\nabla J(\theta) = \vec 0$. When the function is “well-behaved” (e.g. convex), a stationary point is the global minimum. Use gradient descent to find it.
Procedure
- Start at a random point $w_0$ and move toward the minimizer $w^*$ (a sketch of the full loop follows below).
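The notes cut off here, so the step size, stopping rule, and names below (`grad_J`, `lr`, `tol`) are assumptions; this is only a sketch of the generic procedure, not the lecture's exact algorithm.

```python
import numpy as np

def gradient_descent(grad_J, w0, lr=0.1, tol=1e-6, max_iters=1000):
    """Minimize J by repeatedly stepping in the direction of -grad J.

    grad_J : function returning the gradient of J at w
    w0     : random starting point (the notes' w_0)
    lr     : step size (assumed; not specified in the notes)
    """
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iters):
        g = grad_J(w)
        if np.linalg.norm(g) < tol:   # near a stationary point: grad J(w) ~ 0
            break
        w = w - lr * g                # step downhill
    return w

# Example: minimize J(w) = ||w||^2, whose gradient is 2w; the minimizer is w* = 0.
w_star = gradient_descent(lambda w: 2 * w, w0=np.array([3.0, -4.0]))
```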