Focus on supervised learning: classification and some regression.
Cost function for linear regression, with prediction $\hat y^{(i)} = \theta^\top \mathbf{x}^{(i)}$:
$\mathcal{L}(\theta) = \dfrac{1}{2N}\sum_{i=1}^N (y^{(i)}-\hat y^{(i)})^2$
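A minimal NumPy sketch of this cost, assuming the linear model above (the name `mse_cost` is illustrative):

```python
import numpy as np

def mse_cost(theta, X, y):
    """L(theta) = 1/(2N) * sum_i (y_i - yhat_i)^2 for a linear model yhat = X @ theta."""
    residuals = y - X @ theta          # y: shape (N,), X: shape (N, d), theta: shape (d,)
    return (residuals @ residuals) / (2 * len(y))
```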
Optimize: differentiate wrt $\theta$ and set the derivative to zero
$\dfrac{\partial \mathcal{L}}{\partial \theta} = 0$
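For the linear model, this zero-gradient condition has a closed-form solution via the normal equations $\mathbf{X}^\top\mathbf{X}\,\theta = \mathbf{X}^\top \mathbf{y}$; a small NumPy sketch on synthetic data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # N = 100 examples, d = 3 features
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.01 * rng.normal(size=100)    # noisy targets

# Zero-gradient condition of the MSE cost: X^T X theta = X^T y.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)                                    # close to theta_true
```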
$f(\mathbf{x}) = \mathbf{x}^T\mathbf{Ax}$
$\nabla_x f = \mathbf{A}^T\mathbf{x} + \mathbf{Ax}$, or if $\mathbf{A}$ is symmetric, $2\mathbf{Ax}$.
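A quick numerical sanity check of this identity against central finite differences (values and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))          # not symmetric in general
x = rng.normal(size=4)

f = lambda v: v @ A @ v              # f(x) = x^T A x
analytic = A.T @ x + A @ x           # the claimed gradient

eps = 1e-6
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(4)])
print(np.allclose(analytic, numeric, atol=1e-5))    # True
```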
To see why, write $f$ element-wise: $f(\mathbf{x}) = \sum_{i=1}^N \sum_{j=1}^N x_i\, a_{ij}\, x_j$
$\dfrac{\partial f(\mathbf{x})}{\partial x_1} = 2a_{11}x_1 + \sum_{j=2}^N a_{1j} x_j + \sum_{i=2}^N a_{i1}x_i$
$= \sum^N_{j=1} a_{1j}x_j + \sum^N_{i=1}a_{i1}x_i$
The first sum is the dot product of the first row of $\mathbf{A}$ with $\mathbf{x}$; the second is the dot product of the first column of $\mathbf{A}$ with $\mathbf{x}$.
Hence,
$=(\mathbf{Ax})_1 + (\mathbf{A}^T\mathbf{x})_1$. The same holds for every component, so $\nabla_{\mathbf{x}} f = \mathbf{Ax} + \mathbf{A}^T\mathbf{x}$, as claimed above.
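A small sketch of the row/column reading above, with the explicit sums written as loops and 0-based indices (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
x = rng.normal(size=4)

row_term = sum(A[0, j] * x[j] for j in range(4))   # sum_j a_{1j} x_j : first row of A dotted with x
col_term = sum(A[i, 0] * x[i] for i in range(4))   # sum_i a_{i1} x_i : first column of A dotted with x

print(np.isclose(row_term, (A @ x)[0]))            # True
print(np.isclose(col_term, (A.T @ x)[0]))          # True
```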
The convention used above is known as “denominator layout”: the derivative of a scalar with respect to a vector or matrix has the same dimensions as that vector or matrix (the denominator).
“Numerator layout” is the transpose.
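For example, for scalar $\mathcal{L}$ and $\theta \in \mathbb{R}^{d}$, denominator layout gives $\dfrac{\partial \mathcal{L}}{\partial \theta}$ as a $d \times 1$ column vector (the shape of $\theta$), while numerator layout gives a $1 \times d$ row vector.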