Focus on supervised learning: classification and some regression.
Cost function for linear regression, with prediction $\hat y^{(i)} = \theta^\top \mathbf{x}^{(i)}$:
$\mathcal{L}(\theta) = \dfrac{1}{2N}\sum_{i=1}^N (y^{(i)}-\hat y^{(i)})^2$
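A minimal NumPy sketch of this cost, assuming the linear model above (the name `mse_cost` is illustrative):

```python
import numpy as np

def mse_cost(theta, X, y):
    """L(theta) = 1/(2N) * sum_i (y_i - yhat_i)^2 for a linear model yhat = X @ theta."""
    residuals = y - X @ theta          # y: shape (N,), X: shape (N, d), theta: shape (d,)
    return (residuals @ residuals) / (2 * len(y))
```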
Optimize: differentiate wrt $\theta$ and set the derivative to zero
$\dfrac{\partial \mathcal{L}}{\partial \theta} = 0$
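For the linear model, this zero-gradient condition has a closed-form solution via the normal equations $\mathbf{X}^\top\mathbf{X}\,\theta = \mathbf{X}^\top \mathbf{y}$; a small NumPy sketch on synthetic data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # N = 100 examples, d = 3 features
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.01 * rng.normal(size=100)    # noisy targets

# Zero-gradient condition of the MSE cost: X^T X theta = X^T y.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)                                    # close to theta_true
```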
$f(\mathbf{x}) = \mathbf{x}^T\mathbf{Ax}$
$\nabla_x f = \mathbf{A}^T\mathbf{x} + \mathbf{Ax}$, or if $\mathbf{A}$ is symmetric, $2\mathbf{Ax}$.
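A quick numerical sanity check of this identity against central finite differences (values and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))          # not symmetric in general
x = rng.normal(size=4)

f = lambda v: v @ A @ v              # f(x) = x^T A x
analytic = A.T @ x + A @ x           # the claimed gradient

eps = 1e-6
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(4)])
print(np.allclose(analytic, numeric, atol=1e-5))    # True
```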
To see why, write $f$ element-wise: $f(\mathbf{x}) = \sum_{i=1}^N \sum_{j=1}^N x_i\, a_{ij}\, x_j$
$\dfrac{\partial f(\mathbf{x})}{\partial x_1} = 2a_{11}x_1 + \sum_{j=2}^N a_{1j} x_j + \sum_{i=2}^N a_{i1}x_i$
$= \sum^N_{j=1} a_{1j}x_j + \sum^N_{i=1}a_{i1}x_i$
The first sum is the dot product of the first row of $\mathbf{A}$ with $\mathbf{x}$; the second is the dot product of the first column of $\mathbf{A}$ with $\mathbf{x}$.
Hence,
$=(\mathbf{Ax})_1 + (\mathbf{A}^T\mathbf{x})_1$. The same holds for every component, so $\nabla_{\mathbf{x}} f = \mathbf{Ax} + \mathbf{A}^T\mathbf{x}$, as claimed above.
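A small sketch of the row/column reading above, with the explicit sums written as loops and 0-based indices (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
x = rng.normal(size=4)

row_term = sum(A[0, j] * x[j] for j in range(4))   # sum_j a_{1j} x_j : first row of A dotted with x
col_term = sum(A[i, 0] * x[i] for i in range(4))   # sum_i a_{i1} x_i : first column of A dotted with x

print(np.isclose(row_term, (A @ x)[0]))            # True
print(np.isclose(col_term, (A.T @ x)[0]))          # True
```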
The convention used above is known as “denominator layout”: the derivative of a scalar with respect to a vector or matrix has the same dimensions as that vector or matrix (the denominator).
“Numerator layout” is the transpose.
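For example, for scalar $\mathcal{L}$ and $\theta \in \mathbb{R}^{d}$, denominator layout gives $\dfrac{\partial \mathcal{L}}{\partial \theta}$ as a $d \times 1$ column vector (the shape of $\theta$), while numerator layout gives a $1 \times d$ row vector.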