Lecture 6: Neural Networks & Backpropagation
“The gradient of the cost function must be large and predictable enough to serve as a good guide for the learning algorithm” — Goodfellow et al., *Deep Learning*
Nonlinear Units
- Sigmoid: $\sigma(x) = \dfrac{1}{1+e^{-x}}$, $\sigma'(x) = \sigma(x)(1-\sigma(x))$
- Outputs are never negative (not zero-centered), so the gradients on a layer's incoming weights all share the same sign and SGD updates “zigzag”
- Tanh: $\tanh(x) = 2\sigma(2x) - 1$ (zero-centered, but still saturates)
- ReLU: $\mathrm{ReLU}(x) = \max(0, x)$; cheap and non-saturating for $x > 0$, but units can “die” since the gradient is zero for $x \le 0$
- Leaky ReLU: $f(x) = \max(\alpha x, x)$ with a small fixed $\alpha$; PReLU learns $\alpha$ instead (all units and their derivatives are sketched in the code below)
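As a concrete reference for the list above, here is a minimal NumPy sketch of each unit and its derivative (the function names are my own, not from the lecture). It also checks the $\tanh(x) = 2\sigma(2x) - 1$ identity and shows how $\sigma'(x)$ reuses the forward value.

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^{-x}); output in (0, 1), never negative
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)); reuses the forward pass value
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh(x):
    # tanh(x) = 2 * sigma(2x) - 1; a zero-centered rescaling of the sigmoid
    return 2.0 * sigmoid(2.0 * x) - 1.0

def d_tanh(x):
    # tanh'(x) = 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

def relu(x):
    # ReLU(x) = max(0, x)
    return np.maximum(0.0, x)

def d_relu(x):
    # gradient is 0 wherever x <= 0 -- the "dying ReLU" region
    return (x > 0).astype(x.dtype)

def leaky_relu(x, alpha=0.01):
    # alpha is a small fixed slope here; PReLU instead learns alpha
    return np.where(x > 0, x, alpha * x)

x = np.linspace(-3.0, 3.0, 7)
assert np.allclose(tanh(x), np.tanh(x))  # verifies tanh(x) = 2*sigma(2x) - 1
print(d_sigmoid(np.array(0.0)))          # 0.25, the sigmoid's maximum slope
```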