Using standard empirical risk minimization with the squared loss: $J(\theta) = \dfrac{1}{N}\sum_i (y_i - \theta^Tx_i)^2$
Gradient: $\nabla J(\theta) = \dfrac{2}{N}\sum_n (\theta^Tx_n - y_n)\,x_n$
Gradient descent: $\theta_{t+1} = \theta_t - \eta\,\dfrac{2}{N}\sum_n (\theta_t^Tx_n - y_n)\,x_n$
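A minimal NumPy sketch of this batch update (the step size `eta`, iteration count, and the names `X`, `y` for the design matrix and targets are illustrative assumptions):

```python
import numpy as np

def gd_least_squares(X, y, eta=0.01, n_iters=1000):
    """Batch gradient descent for J(theta) = (1/N) * sum_i (y_i - theta^T x_i)^2."""
    N, D = X.shape
    theta = np.zeros(D)
    for _ in range(n_iters):
        grad = (2.0 / N) * X.T @ (X @ theta - y)  # (2/N) * sum_i (theta^T x_i - y_i) x_i
        theta -= eta * grad
    return theta
```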
Closed form: $\hat \theta = (X^TX)^{-1} X^Ty$
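A sketch of the closed form in NumPy (function names are illustrative); solving the normal equations, or using a least-squares routine, avoids forming an explicit inverse:

```python
import numpy as np

def least_squares_closed_form(X, y):
    """Solve (X^T X) theta = X^T y directly instead of computing (X^T X)^{-1}."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def least_squares_svd(X, y):
    """Equivalent SVD-based route; also handles ill-conditioned X^T X."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta
```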
Computational issue: the closed form costs $O(ND^2)$ to form $X^TX$ plus $O(D^3)$ to solve, so use (stochastic) gradient descent when $N$ or $D$ is large.
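A sketch of SGD for the same objective, updating on one example at a time (step size, epoch count, and seed are illustrative assumptions):

```python
import numpy as np

def sgd_least_squares(X, y, eta=0.01, n_epochs=10, seed=0):
    """Stochastic gradient descent: update on a single example instead of the full sum."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    theta = np.zeros(D)
    for _ in range(n_epochs):
        for i in rng.permutation(N):
            grad_i = 2.0 * (X[i] @ theta - y[i]) * X[i]  # per-example gradient
            theta -= eta * grad_i
    return theta
```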
Numerical issue: $(X^TX)$ may not be invertible (e.g. linearly dependent features, or $D > N$)
Fixes: get more data, remove linearly dependent features, or regularize (below)
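A tiny illustration of the failure mode, with a made-up design matrix whose last two columns are identical:

```python
import numpy as np

# Two identical feature columns -> X^T X is rank-deficient (singular).
X_dep = np.array([[1.0, 2.0, 2.0],
                  [3.0, 1.0, 1.0],
                  [0.0, 4.0, 4.0]])
print(np.linalg.matrix_rank(X_dep.T @ X_dep))  # prints 2, but D = 3, so no inverse exists
```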
Nonlinear features: use a feature map $\phi: \mathbb{R}^D \to \mathbb{R}^M$ and apply linear regression to $\phi(x_i)$ instead of $x_i$. A rich feature map can overfit.
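One concrete choice of $\phi$ is a polynomial feature map; a sketch for 1-D inputs (the degree and function name are illustrative):

```python
import numpy as np

def poly_features(x, degree=3):
    """phi: R -> R^{degree+1}, mapping x to (1, x, x^2, ..., x^degree)."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)            # shape (N, 1)
    return np.hstack([x ** d for d in range(degree + 1)])    # shape (N, degree + 1)

Phi = poly_features(np.array([0.5, 1.0, 2.0]))  # rows are phi(x_i)^T
```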
$J(\theta) = ||y-\Phi\theta||^2 + \lambda ||\theta||^2 \;\to\; \hat\theta = [\Phi^T\Phi + \lambda I]^{-1}\Phi^Ty$, where $\Phi = [\phi(x_1)^T;\,\dots;\,\phi(x_N)^T]$ has one row per example
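A sketch of the ridge solution (the value of $\lambda$ is an illustrative assumption); note that $\Phi^T\Phi + \lambda I$ is invertible for any $\lambda > 0$, which also fixes the singularity issue above:

```python
import numpy as np

def ridge_closed_form(Phi, y, lam=0.1):
    """Ridge regression: solve (Phi^T Phi + lam * I) theta = Phi^T y."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)
```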
Apply to gradient descent: each step multiplies the current weights by a shrinkage factor, $\theta \leftarrow (1-2\eta\lambda)\,\theta - \eta\,\nabla_\theta||y-\Phi\theta||^2$ (weight decay; the factor is $(1-\eta\lambda)$ if the penalty is written $\tfrac{\lambda}{2}||\theta||^2$)
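As a sketch, one gradient step on the ridge objective above (step size and $\lambda$ are illustrative assumptions):

```python
import numpy as np

def ridge_gd_step(theta, Phi, y, eta=0.01, lam=0.1):
    """One gradient step on ||y - Phi theta||^2 + lam * ||theta||^2.
    The penalty's contribution multiplies theta by (1 - 2*eta*lam): shrinkage / weight decay."""
    data_grad = 2.0 * Phi.T @ (Phi @ theta - y)   # gradient of the squared-error term
    return (1.0 - 2.0 * eta * lam) * theta - eta * data_grad
```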
Can also apply to logistic regression: $-\sum_n \big[\, y_n \log h_{w,b}(x_n) + (1-y_n)\log[1-h_{w,b}(x_n)] \,\big] + \lambda ||w||^2_2$
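A sketch of the L2-regularized logistic loss and its gradients, assuming the sigmoid hypothesis $h_{w,b}(x)=\sigma(w^Tx+b)$ (the function names and $\lambda$ are illustrative; leaving $b$ unpenalized follows the formula above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss_l2(w, b, X, y, lam=0.1):
    """Cross-entropy loss with an L2 penalty on w (b is not penalized)."""
    p = sigmoid(X @ w + b)
    eps = 1e-12  # guard against log(0)
    ce = -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return ce + lam * np.dot(w, w)

def logistic_grads_l2(w, b, X, y, lam=0.1):
    """Gradients of the regularized loss w.r.t. w and b."""
    p = sigmoid(X @ w + b)
    grad_w = X.T @ (p - y) + 2.0 * lam * w
    grad_b = np.sum(p - y)
    return grad_w, grad_b
```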
So far, we have been using empirical risk minimization (ERM). Instead, minimize the regularized risk:
$\mathrm{arg\,min}_{w,b}\; R^{\mathrm{emp}}[h_{w,b}] + \lambda R(w,b)$, where $R(w,b)$ is a regularizer
Example regularizer: the squared 2-norm $R(w,b) = ||w||^2_2$ (as in ridge regression above)
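As a tiny generic sketch (names are illustrative), with the squared 2-norm as the regularizer $R(w,b)$:

```python
import numpy as np

def regularized_risk(w, b, emp_risk, lam):
    """Regularized ERM objective: R_emp[h_{w,b}] + lam * R(w, b), with R(w, b) = ||w||_2^2."""
    return emp_risk(w, b) + lam * np.dot(w, w)
```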