Using standard empirical risk minimization with the squared loss: $J(\theta) = \dfrac{1}{N}\sum_i (y_i - \theta^Tx_i)^2$
Gradient: $\nabla J(\theta) = \dfrac{2}{N}\sum_n (\theta^Tx_n - y_n)\,x_n$
Gradient descent: $\theta_{t+1} = \theta_t - \eta\,\dfrac{2}{N}\sum_n (\theta_t^Tx_n - y_n)\,x_n$
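A minimal NumPy sketch of this batch update (the step size `eta`, iteration count, and the names `X`, `y` for the design matrix and targets are illustrative assumptions):

```python
import numpy as np

def gd_least_squares(X, y, eta=0.01, n_iters=1000):
    """Batch gradient descent for J(theta) = (1/N) * sum_i (y_i - theta^T x_i)^2."""
    N, D = X.shape
    theta = np.zeros(D)
    for _ in range(n_iters):
        grad = (2.0 / N) * X.T @ (X @ theta - y)  # (2/N) * sum_i (theta^T x_i - y_i) x_i
        theta -= eta * grad
    return theta
```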
Closed form: $\hat \theta = (X^TX)^{-1} X^Ty$
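A sketch of the closed form in NumPy (function names are illustrative); solving the normal equations, or using a least-squares routine, avoids forming an explicit inverse:

```python
import numpy as np

def least_squares_closed_form(X, y):
    """Solve (X^T X) theta = X^T y directly instead of computing (X^T X)^{-1}."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def least_squares_svd(X, y):
    """Equivalent SVD-based route; also handles ill-conditioned X^T X."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta
```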
Computational issue: the closed form costs $O(ND^2)$ to form $X^TX$ plus $O(D^3)$ to solve, so use (stochastic) gradient descent when $N$ or $D$ is large.
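A sketch of SGD for the same objective, updating on one example at a time (step size, epoch count, and seed are illustrative assumptions):

```python
import numpy as np

def sgd_least_squares(X, y, eta=0.01, n_epochs=10, seed=0):
    """Stochastic gradient descent: update on a single example instead of the full sum."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    theta = np.zeros(D)
    for _ in range(n_epochs):
        for i in rng.permutation(N):
            grad_i = 2.0 * (X[i] @ theta - y[i]) * X[i]  # per-example gradient
            theta -= eta * grad_i
    return theta
```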
Numerical issue: $(X^TX)$ may not be invertible (e.g. linearly dependent features, or $D > N$)
Fixes: get more data, remove linearly dependent features, or regularize (below)
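A tiny illustration of the failure mode, with a made-up design matrix whose last two columns are identical:

```python
import numpy as np

# Two identical feature columns -> X^T X is rank-deficient (singular).
X_dep = np.array([[1.0, 2.0, 2.0],
                  [3.0, 1.0, 1.0],
                  [0.0, 4.0, 4.0]])
print(np.linalg.matrix_rank(X_dep.T @ X_dep))  # prints 2, but D = 3, so no inverse exists
```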
Nonlinear features: use a feature map $\phi: \mathbb{R}^D \to \mathbb{R}^M$ and apply linear regression to $\phi(x_i)$ instead of $x_i$. A rich feature map can overfit.
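One concrete choice of $\phi$ is a polynomial feature map; a sketch for 1-D inputs (the degree and function name are illustrative):

```python
import numpy as np

def poly_features(x, degree=3):
    """phi: R -> R^{degree+1}, mapping x to (1, x, x^2, ..., x^degree)."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)            # shape (N, 1)
    return np.hstack([x ** d for d in range(degree + 1)])    # shape (N, degree + 1)

Phi = poly_features(np.array([0.5, 1.0, 2.0]))  # rows are phi(x_i)^T
```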
$J(\theta) = ||y-\Phi\theta||^2 + \lambda ||\theta||^2 \;\to\; \hat\theta = [\Phi^T\Phi + \lambda I]^{-1}\Phi^Ty$, where $\Phi = [\phi(x_1)^T;\,\dots;\,\phi(x_N)^T]$ has one row per example
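A sketch of the ridge solution (the value of $\lambda$ is an illustrative assumption); note that $\Phi^T\Phi + \lambda I$ is invertible for any $\lambda > 0$, which also fixes the singularity issue above:

```python
import numpy as np

def ridge_closed_form(Phi, y, lam=0.1):
    """Ridge regression: solve (Phi^T Phi + lam * I) theta = Phi^T y."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)
```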
Apply to gradient descent: each step multiplies the current weights by a shrinkage factor, $\theta \leftarrow (1-2\eta\lambda)\,\theta - \eta\,\nabla_\theta||y-\Phi\theta||^2$ (weight decay; the factor is $(1-\eta\lambda)$ if the penalty is written $\tfrac{\lambda}{2}||\theta||^2$)
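As a sketch, one gradient step on the ridge objective above (step size and $\lambda$ are illustrative assumptions):

```python
import numpy as np

def ridge_gd_step(theta, Phi, y, eta=0.01, lam=0.1):
    """One gradient step on ||y - Phi theta||^2 + lam * ||theta||^2.
    The penalty's contribution multiplies theta by (1 - 2*eta*lam): shrinkage / weight decay."""
    data_grad = 2.0 * Phi.T @ (Phi @ theta - y)   # gradient of the squared-error term
    return (1.0 - 2.0 * eta * lam) * theta - eta * data_grad
```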
Can also apply to logistic regression: $-\sum_n \big[\, y_n \log h_{w,b}(x_n) + (1-y_n)\log[1-h_{w,b}(x_n)] \,\big] + \lambda ||w||^2_2$
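A sketch of the L2-regularized logistic loss and its gradients, assuming the sigmoid hypothesis $h_{w,b}(x)=\sigma(w^Tx+b)$ (the function names and $\lambda$ are illustrative; leaving $b$ unpenalized follows the formula above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss_l2(w, b, X, y, lam=0.1):
    """Cross-entropy loss with an L2 penalty on w (b is not penalized)."""
    p = sigmoid(X @ w + b)
    eps = 1e-12  # guard against log(0)
    ce = -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return ce + lam * np.dot(w, w)

def logistic_grads_l2(w, b, X, y, lam=0.1):
    """Gradients of the regularized loss w.r.t. w and b."""
    p = sigmoid(X @ w + b)
    grad_w = X.T @ (p - y) + 2.0 * lam * w
    grad_b = np.sum(p - y)
    return grad_w, grad_b
```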
So far, we have been using empirical risk minimization (ERM). Instead, minimize the regularized risk:
$\mathrm{arg\,min}_{w,b}\; R^{\mathrm{emp}}[h_{w,b}] + \lambda R(w,b)$, where $R(w,b)$ is a regularizer
Example regularizer: the squared 2-norm $R(w,b) = ||w||^2_2$ (as in ridge regression above)
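As a tiny generic sketch (names are illustrative), with the squared 2-norm as the regularizer $R(w,b)$:

```python
import numpy as np

def regularized_risk(w, b, emp_risk, lam):
    """Regularized ERM objective: R_emp[h_{w,b}] + lam * R(w, b), with R(w, b) = ||w||_2^2."""
    return emp_risk(w, b) + lam * np.dot(w, w)
```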