Deep Learning Notes, Chapter 4: Numerical Computation


1. Overflow and Underflow

  • Underflow occurs when numbers near zero are rounded to zero
  • Overflow occurs when numbers with large magnitude are approximated as \(-\infty\) or \(\infty\)
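
As a quick illustration (a NumPy sketch, not part of the original notes), single-precision floats make both failure modes easy to reproduce:

```python
import numpy as np

# Underflow: a positive number below float32's smallest subnormal becomes exactly 0.
tiny = np.float32(1e-46)
print(tiny)   # 0.0

# Overflow: exp of a moderately large argument exceeds float32's maximum (~3.4e38)
# and saturates to inf (NumPy may also emit an overflow RuntimeWarning).
big = np.exp(np.float32(100.0))
print(big)    # inf
```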


\(\Large\text{Note:}\) One example of a function that must be stabilized against underflow and overflow is the \(\textbf{softmax}\) function. The \(\textbf{softmax}\) function is often used to predict the probabilities associated with a multinoulli distribution:

\[\text{softmax}(\mathbf{x})_i = \frac{\exp(x_i)}{\sum_{j=1}^n \exp(x_j)} \]
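
A minimal sketch of the standard stabilization trick (in NumPy; the helper name `softmax` is just illustrative): subtracting \(\max_i x_i\) from every component leaves the result unchanged, rules out overflow in \(\exp\), and guarantees at least one term \(\exp(0)=1\) in the denominator so it cannot underflow to zero.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax: shift by max(x) before exponentiating."""
    z = x - np.max(x)          # softmax is invariant to adding a constant to x
    e = np.exp(z)              # largest exponent is exp(0) = 1, so no overflow
    return e / np.sum(e)       # denominator >= 1, so no division by zero

# The naive formula would give nan here; the shifted version returns [1/3, 1/3, 1/3].
print(softmax(np.array([1000.0, 1000.0, 1000.0])))
```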

2. Poor Conditioning

Consider the function \(f(x) = A^{-1}x\). When \(A\in\mathbb{R}^{n\times n}\) has an eigenvalue decomposition, its \(\textbf{condition number}\) is:

\[\max_{i,j}\left |\frac{\lambda_i}{\lambda_j} \right| \]

This is the ratio of the magnitudes of the largest and smallest eigenvalues. When this number is large, matrix inversion is particularly sensitive to error in the input.
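
As a quick NumPy illustration (the matrix `A` is an arbitrary example; for a symmetric matrix this eigenvalue ratio coincides with NumPy's 2-norm condition number):

```python
import numpy as np

# Condition number as the ratio of the extreme eigenvalue magnitudes.
A = np.array([[1.0, 0.0],
              [0.0, 1e-6]])
eigvals = np.linalg.eigvals(A)
cond = np.max(np.abs(eigvals)) / np.min(np.abs(eigvals))
print(cond)                 # ~1e6: inverting A strongly amplifies input error
print(np.linalg.cond(A))    # NumPy's built-in 2-norm condition number agrees here
```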

3. Jacobian and Hessian Matrices

Sometimes we need to find all the partial derivatives of a function whose input and output are both vectors. The matrix containing all such partial derivatives is known as a \(\textbf{Jacobian matrix}\). Specifically, if we have a function \(f:\mathbb{R}^m\rightarrow \mathbb{R}^n\), then the Jacobian matrix \(\mathbf{J}\in\mathbb{R}^{n\times m}\) of \(f\) is defined as:

\[J_{i,j} = \frac{\partial f(x)_i}{\partial x_j} \]
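
For a concrete check (a NumPy sketch; `jacobian_fd` is a hypothetical helper, not from the notes), a central-difference approximation recovers \(\mathbf{J}\) for a small vector-valued function:

```python
import numpy as np

def jacobian_fd(f, x, eps=1e-6):
    """Approximate J[i, j] = d f(x)_i / d x_j by central differences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(f(x), dtype=float)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)   # column j: derivative w.r.t. x_j
    return J

# f: R^2 -> R^2, f(x) = (x0 * x1, x0 + x1); exact Jacobian is [[x1, x0], [1, 1]].
f = lambda x: np.array([x[0] * x[1], x[0] + x[1]])
print(jacobian_fd(f, np.array([2.0, 3.0])))   # approx [[3, 2], [1, 1]]
```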

The \(\textbf{Hessian}\) matrix \(\mathbf{H}(f)(x)\) is defined as:

\[\mathbf{H}(f)(x)_{i,j} = \frac{\partial^2 f(x)}{\partial x_i\partial x_j} \]

Equivalently, the \(\textbf{Hessian}\) is the Jacobian of the gradient.

\(\Large\textbf{Note: the differential operators commute wherever the second partial derivatives are continuous}\)

\[\frac{\partial^2 f(x)}{\partial x_i\partial x_j} = \frac{\partial^2 f(x)}{\partial x_j\partial x_i} \]

This implies that \(H_{i,j} = H_{j,i}\) \(\Rightarrow \textbf{Symmetric}\).

Because the Hessian matrix is real and symmetric, we can decompose it into a set of real \(\textbf{eigenvalues}\) and an \(\textbf{orthogonal basis}\) of eigenvectors.
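A small NumPy check of this fact (the matrix `H` below is just an illustrative symmetric Hessian, e.g. of a quadratic \(f(x)=\frac{1}{2}x^\top Hx\)): `np.linalg.eigh` exploits symmetry and returns real eigenvalues with an orthonormal basis of eigenvectors.

```python
import numpy as np

H = np.array([[2.0, 1.0],
              [1.0, 3.0]])
w, Q = np.linalg.eigh(H)                       # eigh is meant for symmetric matrices
print(w)                                       # real eigenvalues
print(np.allclose(Q @ np.diag(w) @ Q.T, H))    # True: H = Q diag(w) Q^T
print(np.allclose(Q.T @ Q, np.eye(2)))         # True: eigenvectors are orthonormal
```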

4. Constrained Optimization

The Karush–Kuhn–Tucker (KKT) approach provides a very general solution to constrained optimization. In detail, it introduces the generalized Lagrangian

\[L(x,\lambda,\alpha) = f(x)+\sum_i\lambda_ig^{(i)}(x)+\sum_j\alpha_jh^{(j)}(x) \]

where \(f(x)\) is the objective function, \(g^{(i)}(x)=0, \forall i\) are the \(\textbf{equality constraints}\), and \(h^{(j)}(x)\leq 0, \forall j\) are the \(\textbf{inequality constraints}\).

As long as at least one feasible point exists and \(f(x)\) is not permitted to have a value \(\infty\), then

\[\min_x\max_{\lambda}\max_{\alpha\ge 0}L(x,\lambda,\alpha) \]

has the same optimal objective function value and set of optimal points \(x\) as

\[\min_x f(x) \]
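
The notes do not fix a particular solver; as one hedged illustration, SciPy's generic constrained optimizer (SLSQP) handles a toy problem with both constraint types. Note that SciPy's `'ineq'` convention is `fun(x) >= 0`, the opposite sign of the \(h^{(j)}(x)\leq 0\) convention above.

```python
import numpy as np
from scipy.optimize import minimize

# Toy problem: minimize f(x) = x0^2 + x1^2
# subject to g(x) = x0 + x1 - 1 = 0 and h(x) = -x0 <= 0 (i.e. x0 >= 0).
f = lambda x: x[0] ** 2 + x[1] ** 2
constraints = [
    {"type": "eq",   "fun": lambda x: x[0] + x[1] - 1.0},   # g(x) = 0
    {"type": "ineq", "fun": lambda x: x[0]},                # -h(x) >= 0
]
res = minimize(f, x0=np.array([2.0, 2.0]), method="SLSQP", constraints=constraints)
print(res.x)   # approx [0.5, 0.5], the constrained minimizer
```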
