$J(\theta) = -\sum_n \big\{ y_n\log h_\theta (x_n) + (1-y_n)\log[1-h_\theta(x_n)] \big\}$ — empirical risk minimization (binary cross-entropy)
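A minimal numpy sketch of this loss; the function and variable names are illustrative, not from the notes:

```python
import numpy as np

def cross_entropy(h, y, eps=1e-12):
    """Binary cross-entropy: -sum_n [y_n log h_n + (1 - y_n) log(1 - h_n)].

    h : predicted probabilities h_theta(x_n) in (0, 1); y : labels in {0, 1}.
    """
    h = np.clip(h, eps, 1 - eps)  # guard against log(0)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

print(cross_entropy(np.array([0.9, 0.2, 0.7]), np.array([1, 0, 1])))
```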
Numerical optimization: Stationary point → $\nabla J(\theta) = 0$
Gradient descent: $\theta_{t+1} = \theta_t - \eta\nabla J(\theta_t)$ ($\eta$: step size, $t$: iteration index)
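A toy sketch of the update rule on a hypothetical objective $f(\theta) = (\theta - 3)^2$ (not from the notes), whose gradient is $2(\theta - 3)$:

```python
def gradient_descent(grad, theta0, eta=0.1, n_iters=100):
    """Iterate theta_{t+1} = theta_t - eta * grad(theta_t)."""
    theta = theta0
    for _ in range(n_iters):
        theta = theta - eta * grad(theta)
    return theta

theta_hat = gradient_descent(lambda th: 2 * (th - 3), theta0=0.0)
print(theta_hat)  # approaches the minimizer theta* = 3
```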
Convex functions: $f(\lambda u + (1-\lambda)v) \leq \lambda f(u) + (1-\lambda) f(v), \, \lambda \in [0,1]$ (the function lies on or below every chord)
Sufficient condition for convexity: the Hessian $\nabla^2 f \succeq 0$ (positive semi-definite), i.e. $z^T \nabla^2 f \, z \geq 0 \ \forall z$
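A small numpy check of positive semi-definiteness via eigenvalues; the matrix here is a hypothetical outer product $xx^T$, which is always PSD:

```python
import numpy as np

x = np.array([1.0, -2.0, 0.5])
A = np.outer(x, x)               # x x^T is symmetric and PSD
eigvals = np.linalg.eigvalsh(A)  # a symmetric matrix is PSD iff all eigenvalues >= 0
print(eigvals, np.all(eigvals >= -1e-12))
```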
Question: is $J(\theta)$ convex? Yes, its Hessian is positive semi-definite.
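A sketch of why, assuming the logistic model $h_\theta(x) = \sigma(\theta^Tx)$ with $\sigma(z) = 1/(1+e^{-z})$ (consistent with the gradient in the next line):

$\nabla J(\theta) = \sum_n [h_\theta(x_n) - y_n]\,x_n$

$\nabla^2 J(\theta) = \sum_n h_\theta(x_n)[1-h_\theta(x_n)]\,x_n x_n^T$

Each term is a nonnegative scalar times the PSD outer product $x_n x_n^T$, so $z^T\nabla^2 J(\theta)\,z = \sum_n h_\theta(x_n)[1-h_\theta(x_n)](z^Tx_n)^2 \geq 0$ for all $z$.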
Gradient descent for this loss: $\theta_{t+1} = \theta_t - \eta\sum_n [h_{\theta_t}(x_n)-y_n]\,x_n$ (the gradient $\nabla J$ written out)
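A minimal numpy sketch of this loop; the synthetic data, step size, and iteration count are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd(X, y, eta=0.01, n_iters=1000):
    """Batch gradient descent on the cross-entropy loss.

    X : (N, D) inputs; y : (N,) labels in {0, 1}.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ theta)  # predictions h_theta(x_n)
        grad = X.T @ (h - y)    # sum_n [h_theta(x_n) - y_n] x_n
        theta -= eta * grad     # theta_{t+1} = theta_t - eta * grad
    return theta

# tiny synthetic check: two well-separated clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])
theta = logistic_gd(X, y)
print("train accuracy:", np.mean((sigmoid(X @ theta) > 0.5) == y))
```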
Linear regression: the labels $y_n$ are now real numbers; the cost measures how close the predictions are to $y_n$ in Euclidean distance.
Predictor: $h_\theta (x) = \theta^Tx$
$J(\theta) = \dfrac{1}{N}\sum_n (y_n - \theta^Tx_n)^2$ — quadratic cost criterion
Optimization: $\hat \theta = \mathrm{arg\,min}_\theta J(\theta)$, $J(\theta) = \|y-X\theta\|_2^2$ (dropping the constant $1/N$ does not change the minimizer; see the least-squares sketch at the end of the section)
Input: $x \in \mathbb{R}^D$
Output: $y\in\mathbb{R}$
Training data: $\mathcal{D} = \{(x_n, y_n),n = 1, 2,…,N\}$
Error: $(\hat y - y)^2$, $\hat y = \theta^Tx$
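A minimal numpy sketch of the least-squares problem $\hat\theta = \mathrm{arg\,min}_\theta \|y-X\theta\|_2^2$ above, where the rows of $X$ are the $x_n$; the normal equations and `np.linalg.lstsq` are standard, but the synthetic data and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))                    # design matrix, rows are the x_n
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=N)  # noisy targets

# Closed form via the normal equations: (X^T X) theta = X^T y
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, more numerically robust least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_hat)    # both are close to theta_true
print(theta_lstsq)
```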