Feedforward Networks Training Speed Enhancement by Optimal Initialization of the Synaptic Coefficients


Contents
  • Main Content
  • Code

Yam J. Y. F. and Chow T. W. S. Feedforward networks training speed enhancement by optimal initialization of the synaptic coefficients.

A work very similar to another one: both aim to make the node outputs fall within the non-saturated region of the activation function.

Main Content

First, consider a single layer:

\[h(x) = \sum_{i=1}^N w_i x_i + w_0. \]

Suppose the activation function used by this node is

\[\sigma(z) = \frac{1}{1 + \exp(-z)} \in (0, 1), \]

and define its non-saturated region (the region where the derivative with respect to \(z\) exceeds \(1/20\) of the maximum derivative) as:

\[z \in [-4.36, 4.36]. \]
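As a quick numerical check of the \(\pm 4.36\) threshold (a minimal sketch; the sigmoid derivative \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\) attains its maximum \(1/4\) at \(z = 0\)):

import math

def dsigmoid(z: float) -> float:
    # derivative of the sigmoid: sigma(z) * (1 - sigma(z))
    s = 1 / (1 + math.exp(-z))
    return s * (1 - s)

print(dsigmoid(0.))    # 0.25, the maximum derivative
print(dsigmoid(4.36))  # ~0.0125, i.e. 0.25 / 20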

In the input space, define

\[P(a) = \sum_{i=1}^N w_i x_i + w_0 - a; \]

the non-saturated region is then bounded by the hyperplanes \(P(-4.36) = 0\) and \(P(4.36) = 0\). The width of this region is

\[d = \frac{8.72}{\sqrt{\sum_{i=1}^N w_i^2}}. \]
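This is the standard distance between the two parallel hyperplanes \(\sum_i w_i x_i + w_0 = -4.36\) and \(\sum_i w_i x_i + w_0 = 4.36\), namely \(|c_1 - c_2| / \lVert w \rVert_2\) with \(|c_1 - c_2| = 8.72\).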

Moreover, the domain of the node's inputs is:

\[x_i \in [x_{i}^{min}, x_{i}^{max}], \]

from which the 'width' of the domain (in fact, its diagonal) is:

\[D = \sqrt{\sum_{i=1}^N [x_i^{max} - x_i^{min}]^2}. \]
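For example, if every input is normalized to \([0, 1]\), then \(D = \sqrt{N}\).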

Clearly, we would like

\[d \ge D \]

to hold; the paper sets \(d = D\) exactly, i.e.

\[\sqrt{\sum_{i=1}^N w_i^2} = \frac{8.72}{D}. \]

We want to sample each \(w_i\) from

\[w_i \sim \mathcal{U}[-w_{max}, w_{max}], \]

\[\mathbb{E} \sum_{i=1}^N w_i^2 = N \frac{w_{max}^2}{3}. \]
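Here we used the second moment of the uniform distribution,

\[\mathbb{E}[w_i^2] = \frac{1}{2 w_{max}} \int_{-w_{max}}^{w_{max}} w^2 \, dw = \frac{w_{max}^2}{3}. \]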

Hence we may set

\[\left(\frac{8.72}{D}\right)^2 = N \frac{w_{max}^2}{3}, \]

which gives

\[w_{max} = \frac{8.72}{D} \sqrt{\frac{3}{N}}. \]

Finally, for the bias \(w_0\), we want the centers of the two regions to coincide. The center of the domain is:

\[C = (\frac{x_1^{min} + x_1^{max}}{2}, \frac{x_2^{min} + x_2^{max}}{2}, \cdots, \frac{x_N^{min} + x_N^{max}}{2})^T. \]

so we require:

\[w_0 + \sum_{i=1}^N w_i C_i = 0 \Rightarrow w_0 = - \sum_{i=1}^N w_i C_i. \]
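For inputs normalized to \([0, 1]^N\), for instance, every \(C_i = 0.5\), so \(w_0 = -0.5 \sum_{i=1}^N w_i\).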

Going one layer further: since the range of the activation function is \((0, 1)\), the weights \(v_i\) of the next layer (whose inputs are the \(H\) hidden units) are initialized with:

\[D = \sqrt{H}, \\ v_{max} = \frac{15.1}{H}, \\ v_0 = - \sum_{i=1}^H 0.5 v_i. \]
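These values follow by substitution: each of the \(H\) hidden activations lies in \((0, 1)\), so \(D = \sqrt{H \cdot 1^2} = \sqrt{H}\), and

\[v_{max} = \frac{8.72}{D} \sqrt{\frac{3}{H}} = \frac{8.72 \sqrt{3}}{H} \approx \frac{15.1}{H}, \]

while the per-coordinate center \(0.5\) gives \(v_0\) as above.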

Code


import torch
import torch.nn as nn

import math


def active_init(weight, bias, low: float = 0., high: float = 1.):
    # Initialize a sigmoid layer so that, for inputs in [low, high]^N,
    # the pre-activations fall inside the non-saturated region [-4.36, 4.36].
    assert high > low, "high should be greater than low"
    out_channels, in_channels = weight.size()
    # D: diagonal of the input domain [low, high]^N
    D = math.sqrt(in_channels) * (high - low)
    # w_max = (8.72 / D) * sqrt(3 / N)
    w_max = 8.72 * math.sqrt(3 / in_channels) / D
    nn.init.uniform_(weight, -w_max, w_max)
    # C: center of the input domain; w_0 = -w @ C, one bias per output node
    C = torch.full((in_channels,), (low + high) / 2,
                   dtype=weight.dtype, device=weight.device)
    with torch.no_grad():
        bias.copy_(-weight @ C)
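
A quick sanity check (hypothetical usage of the sketch above): initialize a layer for inputs in \([0, 1]^{100}\) and confirm that the pre-activations stay inside the non-saturated region.

layer = nn.Linear(100, 30)
active_init(layer.weight, layer.bias, low=0., high=1.)

x = torch.rand(512, 100)  # a batch of inputs in [0, 1]
z = layer(x)              # pre-activations, i.e. h(x) before the sigmoid
print(z.abs().max())      # should be well below 4.36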